The 2016_11_15_chrome_requests_bodies table has incorrect URLs #6

Open
foolip opened this issue Dec 19, 2016 · 4 comments

@foolip

foolip commented Dec 19, 2016

Example query:
https://bigquery.cloud.google.com:443/savedquery/762219082167:af96186e5c904f698b123b74869fd98f

For example, https://wiggio.com/images/facebook_home.png (from page http://www.wiggio.com/) shows up among the results, with a body containing "OpenTok.js 2.9.3 41dae66" close to the beginning. This appears to be a mix-up between URLs and bodies, and it is far from the only one.

I don't know if the error is in the original HARs.

@igrigorik
Collaborator

Err... @pmeenan tracked down and fixed a related issue back in ~August. I wonder if we had a regression?

@pmeenan
Member

pmeenan commented Dec 20, 2016

It is a problem in the uploaded HAR and in the HAR from the original test.

Looking into what caused the misalignment now. I did fix a similar issue earlier (as well as an issue with invalid UTF-8 strings), but the UI is showing the correct bodies (which is the earlier fix), so it might be something HAR-specific.

@pmeenan
Member

pmeenan commented Dec 20, 2016

Made a few changes that will hopefully help, but I won't know for sure until I can look at a newer data set. One issue is that the HARs were always for the first run instead of the median run. That shouldn't have affected the association of bodies with requests; it just makes this harder to investigate, because we only archive the bodies for the median run, so I don't have the source data for some of the HARs.

I also switched the HAR export to use the newer ID-based association but I'll need to verify it worked as expected.

Keeping this query here to reuse later:

SELECT pages.wptid,bodies.page,bodies.url
FROM [httparchive:har.2016_11_15_chrome_requests_bodies] as bodies
JOIN EACH [httparchive:runs.2016_11_15_pages] as pages
ON bodies.page=pages.url
WHERE bodies.url LIKE '%.png'
AND bodies.body CONTAINS 'function'
AND NOT bodies.body CONTAINS 'DOCTYPE'
AND NOT bodies.body CONTAINS 'doctype'
AND NOT bodies.body CONTAINS 'html';

Filtering out the HTML eliminates a lot of cases where "friendly" not-found HTML responses were being sent for image requests. Requiring the URL to end with .png helps filter out things like .pngfix.js while still catching a good number of non-PNG responses. I may join with the requests table to check the actual MIME type, but until then this works well enough.
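
For spot-checking a handful of exported rows outside BigQuery, the same heuristic can be sketched in Python. This is only an illustration under stated assumptions and not part of the HTTP Archive pipeline: the function name `looks_misattributed` and the sample rows are hypothetical, and the substring checks are case-sensitive to mirror the separate 'DOCTYPE' and 'doctype' filters in the query.

```python
def looks_misattributed(url: str, body: str) -> bool:
    """Mirror the WHERE clause of the query above for one (url, body) row."""
    # Equivalent of: bodies.url LIKE '%.png'
    if not url.endswith(".png"):
        return False
    # Equivalent of: bodies.body CONTAINS 'function'
    if "function" not in body:
        return False
    # Equivalent of the three NOT ... CONTAINS filters, which drop
    # "friendly" HTML not-found pages served for image requests.
    return not any(marker in body for marker in ("DOCTYPE", "doctype", "html"))


# Hypothetical sample rows, made up for illustration only.
rows = [
    ("https://example.com/a.png", "/* lib */ (function(){ return 1; })();"),
    ("https://example.com/b.png", "<!DOCTYPE html><script>function f(){}</script>"),
    ("https://example.com/jquery.pngfix.js", "function fix(){}"),
    ("https://example.com/c.png", "binary-looking data without the keyword"),
]

flagged = [url for url, body in rows if looks_misattributed(url, body)]
print(flagged)
```

Only the first sample row is flagged: the second is an HTML page, the third does not end in .png, and the fourth has no 'function' in its body.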

@foolip
Author

foolip commented Dec 21, 2016

Thanks for looking into this so quickly, @pmeenan!
