
Create and maintain a 10k-row subset table #27

Open
rviscomi opened this issue Mar 21, 2018 · 10 comments

@rviscomi
Member

Suggested in the HTTP Archive Slack channel:

Was wondering if it makes sense to add a "sample" dataset that contains data for the first ~1000 pages. This way you can easily test out a query on httparchive.latest.response_bodies_desktop using something smaller like httparchive.sample.response_bodies_desktop. I manually create sample datasets for the same reason when working with the larger tables.

having an official 10K subset would make this process cheaper for non-Google folks, and would make it feasible to create an occasional query without hitting the free plan limits

Just need to figure out which tables to subset, how to organize them, and how to keep them updated with the latest release.
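
One possible shape for the maintenance step, as a rough sketch (the sample dataset is hypothetical, and the query would need to be re-run after each crawl, e.g. via a scheduled job):

-- Materialize a 10k-row subset of the latest crawl so downstream queries
-- scan megabytes instead of terabytes.
CREATE OR REPLACE TABLE `httparchive.sample.response_bodies_desktop` AS
SELECT *
FROM `httparchive.latest.response_bodies_desktop`
LIMIT 10000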

@igrigorik
Collaborator

How do we define ~first? Also, what's the benefit of this vs "LIMIT 10000"?

@sleevi

sleevi commented Mar 21, 2018 via email

@paulcalvano

I tested out a VIEW with a LIMIT 1000 clause in it. It still caused a full scan of all rows in the table that the VIEW was based on.
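
i.e. something along these lines (the table name is just an example). Because a view is only a saved query, the LIMIT trims the rows returned, but the scan, and therefore the billing, still covers the whole underlying table:

-- A view is a saved query, not materialized data:
CREATE VIEW `httparchive_sampleset.pages_mobile_limited` AS
SELECT *
FROM `httparchive.runs.2018_02_15_pages_mobile`
LIMIT 1000
-- Selecting from this view still bills for a full scan of the source table.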

@paulcalvano

As for how to define "first", I've been using the top N based on the Alexa rank: WHERE rank > 0 LIMIT 1000.

For example my pages_mobile sample query reduced the 342MB table to 765KB:

SELECT *
FROM `httparchive.runs.2018_02_15_pages_mobile` 
WHERE rank > 0
ORDER BY rank ASC
LIMIT 1000 

Then create other tables based on the pageid column in this dataset, the url for the HAR datasets, or a NET.HOST(url) match for the CrUX datasets.
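
For instance, a CrUX subset could be derived with something along these lines (a sketch; the exact matching logic would need verification):

SELECT crux.*
FROM `chrome-ux-report.all.201802` crux
WHERE NET.HOST(crux.origin) IN (
  SELECT NET.HOST(url)
  FROM `httparchive_sampleset.pages_mobile`
)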

This requests sample table uses the following query, which processes 43.3GB and results in a 132MB table that I can use to test JOINs with the pages table later on:

SELECT * 
FROM `httparchive.runs.2018_02_15_requests_mobile`  
WHERE pageid IN (
  SELECT pageid 
  FROM `httparchive_sampleset.pages_mobile`
)

And this sample table is a Lighthouse version of it, created with the following query, which reduces the 115GB table down to 280MB:

SELECT * 
FROM `httparchive.lighthouse.2018_02_15_mobile`   
WHERE url IN (
  SELECT url 
  FROM `httparchive_sampleset.pages_mobile`
)
AND report IS NOT NULL 

Creating sample tables for the HAR and Lighthouse datasets will reduce the processing overhead of developing queries more than samples of the pages/requests tables will, but we should still include the latter so that we can test JOINs across these tables.
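
For example, a JOIN developed against the sample tables carries over to the full tables unchanged (a sketch; the requests sample table name here is hypothetical):

-- Count requests per page using only the small sample tables
SELECT p.url, COUNT(0) AS requests
FROM `httparchive_sampleset.pages_mobile` p
JOIN `httparchive_sampleset.requests_mobile` r
ON p.pageid = r.pageid
GROUP BY p.url
ORDER BY requests DESC
LIMIT 10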

@rviscomi
Member Author

WDYT of just having a random sample? It would be future-proof if we ever move away from Alexa ranks, and possibly more representative of the full dataset than just the head URLs.
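
For example, hashing the URL instead of using RAND() would keep the sample deterministic, so the same pages land in every subset table (a rough sketch; the modulus would need tuning to hit ~10k rows):

-- Deterministic sample keyed on url; applying the same predicate to the
-- requests/HAR/Lighthouse tables selects the same set of pages.
SELECT *
FROM `httparchive.runs.2018_02_15_pages_mobile`
WHERE MOD(ABS(FARM_FINGERPRINT(url)), 50) = 0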

@dougsillars

I think this is a great idea. I have had a lot of hesitation about digging into the Lighthouse data, knowing each query would eat 13% of my free 1TB. Letting me refine my queries on a 280MB (or even a couple GB) table would lower the barrier to entry.

@paulcalvano

I created a few sample tables here. These use the queries I mentioned above for the Alexa top 1000.

A random sample makes sense too, but we would need to make sure that the URLs are consistent across sample tables so that JOINs can be tested properly.

I didn't create them for every table yet, but this is a start. I'm happy to share this data set more broadly, but I think it's better off living under the HTTP Archive. Let me know if there is anything I can do to help with this!

@andydavies

andydavies commented Nov 14, 2018

@paulcalvano In the short term, is there any chance you can recreate these with the new table structure?

Longer term if we want people to query the raw data then I think maintaining these in the public set is a must.

As it stands it's just too expensive to query the main tables, and it's only going to get more expensive as the HTTP Archive grows.

For example, this seemingly innocuous query processes 708GB!

select page, extractLinkHeaders(JSON_EXTRACT(payload, '$.response.headers'))
from `httparchive.requests.2018_10_15_mobile`
where url = "https://starpowertalent.com/";

(Using SQL as the query language makes BigQuery accessible, but it also 'tricked' me into thinking about processing in a relational-DB way, which of course isn't how it works; for a while I was confused as to why the WHERE clause didn't limit the bytes processed.)
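
For comparison, the bytes billed track the columns referenced rather than the rows matched, so a query that never touches the payload column scans only a small fraction of the table:

-- Only `page` and `url` are referenced, so the huge `payload` column is never
-- read and the bytes processed drop accordingly.
select page
from `httparchive.requests.2018_10_15_mobile`
where url = "https://starpowertalent.com/"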

@bkardell

Anything that queries the response bodies with a regexp would benefit a lot from this: regexps are frequently a little tricky, and you'd like to be able to test and make mistakes against a much smaller dataset than even the oldest/smallest tables I can find.
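
For example, being able to iterate on something like this against a 10k-row table before paying for a scan of the full response bodies would make those mistakes cheap (the sample table name is hypothetical, and the pattern is just an arbitrary example):

SELECT page, url
FROM `httparchive_sampleset.response_bodies_mobile`
WHERE REGEXP_CONTAINS(body, r'(?i)<dialog[\s>]')
LIMIT 100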

@paulcalvano

1k and 10k tables have been created for the Web Almanac:

HTTPArchive/almanac.httparchive.org#34
