The 2016_11_15_chrome_requests_bodies table has incorrect URLs #6

foolip · 2016-12-19T20:35:16Z

Example query:
https://bigquery.cloud.google.com:443/savedquery/762219082167:af96186e5c904f698b123b74869fd98f

For example, https://wiggio.com/images/facebook_home.png (from page http://www.wiggio.com/) shows up among the results, with a body containing "OpenTok.js 2.9.3 41dae66" close to the beginning. This appears to be some kind of mixup, and it is far from the only one.

I don't know if the error is in the original HARs.

@igrigorik

igrigorik · 2016-12-20T01:56:24Z

Err.. @pmeenan tracked down and fixed a related issue back in ~August, wonder if we had a regression?

pmeenan · 2016-12-20T14:00:34Z

It is a problem in the uploaded HAR and in the HAR from the original test.

Looking into what caused the misalignment now. I did fix a similar issue (as well as an issue with invalid UTF-8 strings), but the UI is showing the correct bodies (which is the earlier fix), so it might be something HAR-specific.

pmeenan · 2016-12-20T16:35:45Z

Made a few changes that will hopefully help, but I won't know for sure until I can look at a newer data set. One issue is that the HARs were always for the first run instead of the median run. That shouldn't have affected the body association; it just makes this harder to investigate, because we only archive the bodies for the median run, so I don't have the source data for some of the HARs.

I also switched the HAR export to use the newer ID-based association but I'll need to verify it worked as expected.

Keeping this query here to re-use for later:

SELECT pages.wptid, bodies.page, bodies.url
FROM [httparchive:har.2016_11_15_chrome_requests_bodies] AS bodies
JOIN EACH [httparchive:runs.2016_11_15_pages] AS pages
  ON bodies.page = pages.url
WHERE bodies.url LIKE '%.png'
  AND bodies.body CONTAINS 'function'
  AND NOT bodies.body CONTAINS 'DOCTYPE'
  AND NOT bodies.body CONTAINS 'doctype'
  AND NOT bodies.body CONTAINS 'html';

Filtering out the HTML eliminates a lot of cases where "friendly" not-found HTML responses were being sent for image requests. Requiring the URL to end with .png helps filter out things like .pngfix.js but still catches a good number of non-PNG requests. I may join with the requests table to check the actual MIME type, but until then this works well enough.
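For anyone who wants to spot-check rows outside BigQuery, here is a minimal Python sketch of the same heuristic. It assumes rows have already been exported as dicts with `url` and `body` keys (a hypothetical shape, not the table schema), and that substring matching is case-sensitive, as the separate DOCTYPE/doctype checks in the query suggest. The function name and the example.com URLs are made up for illustration.

```python
def looks_misattributed(url, body):
    """Return True when a .png URL carries a body that looks like script."""
    # Same as LIKE '%.png': the URL must end with .png, so .pngfix.js is skipped.
    if not url.endswith(".png"):
        return False
    # Same as CONTAINS 'function': a rough signal that the body is JavaScript.
    if "function" not in body:
        return False
    # Skip "friendly" not-found pages served as HTML for image requests.
    return not any(marker in body for marker in ("DOCTYPE", "doctype", "html"))


# Hypothetical sample rows, not real table data.
rows = [
    {"url": "https://example.com/a.png", "body": "/* lib */ function f() {}"},
    {"url": "https://example.com/b.png", "body": "<!DOCTYPE html><script>function g(){}</script>"},
    {"url": "https://example.com/c.pngfix.js", "body": "function h() {}"},
]

suspects = [row["url"] for row in rows if looks_misattributed(row["url"], row["body"])]
print(suspects)
```

Only the first sample row is flagged: the second is an HTML error page and the third does not end in .png. Like the query, this is a coarse filter and does not look at the actual MIME type.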

foolip · 2016-12-21T07:35:07Z

Thanks for looking into this so quickly, @pmeenan!
