
Add monthly CWV Tech Report materialization query #237

Open
rviscomi opened this issue Jan 25, 2024 · 6 comments · May be fixed by #254

@rviscomi (Member)

The query to materialize the technologies table with the latest CWV Tech Report data should be run before the new pipeline, as those queries read from the table.

@giancarloaf (Collaborator) commented Mar 10, 2024

I've added a scheduled query with the hard-coded default date replaced with @run_date.

Note: there is a bug in scheduled queries that use @run_date with partitioned tables and required partition filters (issue tracker), so the scheduled query includes a hacky workaround that is not present in the original query.

This scheduled query is currently set to run on demand so that it can be triggered manually from some other source (e.g. an API call) rather than on the first of the month, since the execution date may vary from the run date.
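Because the run date passed to the query is decoupled from the wall-clock execution date, whatever triggers the query manually has to compute the intended run date itself. A minimal sketch of that logic (plain Python; the `previous_month_start` helper name is my own, not part of the pipeline), deriving the first of the previous month from the execution date:

```python
from datetime import date

def previous_month_start(today: date) -> date:
    """Return the first day of the month before `today`.

    Sketch of how a manual trigger could derive the @run_date
    parameter, since the crawl for month M is processed during
    month M+1.
    """
    first_of_this_month = today.replace(day=1)
    if first_of_this_month.month == 1:
        return first_of_this_month.replace(year=first_of_this_month.year - 1, month=12)
    return first_of_this_month.replace(month=first_of_this_month.month - 1)

# Executing on 2024-03-10 targets the 2024-02-01 run date.
print(previous_month_start(date(2024, 3, 10)))  # 2024-02-01
```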

@giancarloaf (Collaborator)

Testing as of 2024-03-10, after the most recent crawl completed for 2024-03-01, and using a run date of 2024-02-01. This produced no results, since the new pipeline has not finished yet and the upstream tables have not been updated in the all and summary_* datasets. This query may need adjustment if it needs to be run before the new pipeline.

@tunetheweb (Member)

> Testing as of 2024-03-10, after the most recent crawl completed for 2024-03-01, and using a run date of 2024-02-01. This produced no results, since the new pipeline has not finished yet and the upstream tables have not been updated in the all and summary_* datasets. This query may need adjustment if it needs to be run before the new pipeline.

The crawl starts after the CrUX data is published on the second Tuesday of the month. So the all data is available for February, but March's will only start this week. We also don't need the summary_* datasets for this report, as it uses the all datasets (which is good, as we're having trouble with that dataset, as discussed!). The CrUX data covers the full month, so it won't show until April 9th (the second Tuesday of the next month), and it's joined to the crawl data for this report.
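The "second Tuesday" publication date can be computed directly; a quick sketch (plain Python, helper name mine) that reproduces the April 9th date mentioned above:

```python
from datetime import date, timedelta

def second_tuesday(year: int, month: int) -> date:
    """Return the second Tuesday of the given month, the day the
    monthly CrUX data is published."""
    first = date(year, month, 1)
    # Days from the 1st to the first Tuesday (weekday() == 1 is Tuesday).
    offset = (1 - first.weekday()) % 7
    return first + timedelta(days=offset + 7)

print(second_tuesday(2024, 4))  # 2024-04-09
```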

I also went to run this manually this month (for February), but it says it'll process 337 TB, whereas the manual query says 18 TB, so I went with that for this month, as that's quite a difference! It could be overestimating how much it'll take, but 337 TB is a LOT, so I didn't want to take the chance. Thoughts?

@giancarloaf (Collaborator)

> The crawl starts after the CrUX data is published on the second Tuesday of the month.

Ok, perfect! The data isn't "delayed"; instead, I ran the query with dates too far into the future at that point in time.

> We also don't need the summary_* datasets for this report, as it uses the all datasets (which is good, as we're having trouble with that dataset, as discussed!)

That's good. The query mentioned in the issue description is written to use tables in the all dataset only. I mentioned the summary_* datasets just in case there was a reason to use them as an alternative.

> I also went to run this manually this month (for February), but it says it'll process 337 TB, whereas the manual query says 18 TB, so I went with that for this month, as that's quite a difference! It could be overestimating how much it'll take, but 337 TB is a LOT, so I didn't want to take the chance. Thoughts?

In my experience, the Google Cloud console's query validator is not very accurate, especially when working with scripts, partitioned tables, and filters. I ran this scheduled query during testing and it processed 17.27 TB (rather than the estimated 337 TB).

```
giancarlo_faranda@cloudshell:~ (httparchive)$ bq show --location=US --job httparchive:scheduled_query_65fd9c6e-0000-255a-8802-582429c39d0c
Job httparchive:scheduled_query_65fd9c6e-0000-255a-8802-582429c39d0c

  Job Type    State      Start Time         Duration              User Email            Bytes Processed    Bytes Billed    Billing Tier                       Labels
 ---------- --------- ----------------- ---------------- ----------------------------- ----------------- ---------------- -------------- -------------------------------------------------
  query      SUCCESS   10 Mar 22:41:23   0:07:59.888000   Giancarlo.Faranda@gmail.com   18990487562745    18990488354816                  dts_run_id:65fd9c6e-0000-255a-8802-582429c39d0c
                                                                                                                                          data_source_id:scheduled_query
```
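For reference, the ~17.27 figure comes from interpreting the Bytes Billed column in binary units (TiB, 2^40 bytes) rather than decimal TB; a quick check in plain Python:

```python
# "Bytes Billed" value from the bq show output above.
BYTES_BILLED = 18_990_488_354_816

# BigQuery reports billed bytes rounded up; 1 TiB = 2**40 bytes.
tib = BYTES_BILLED / 2**40
print(round(tib, 2))  # 17.27
```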


@tunetheweb (Member)

OK cool, will give it a go next month then. I presume it is NOT yet automated to kick off? Especially for this month, since I've already run it manually.

@giancarloaf (Collaborator)

@tunetheweb Correct, it is not automated yet, but I think we can trigger this from the same Pub/Sub subscription that initiates the crawl.

Is this the topic which signals the start of the crawl? projects/httparchive/topics/crux-updated (CC @pmeenan @rviscomi )
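If that topic does signal the start, a push-subscribed handler could decode the Pub/Sub envelope and then start a manual run of the scheduled query. A hedged sketch, assuming a push-style JSON envelope and a hypothetical `crux_date` payload field (neither is confirmed by this thread); only the message decoding is shown concretely, with the Data Transfer Service call indicated in a comment:

```python
import base64
import json

def decode_pubsub_message(envelope: dict) -> dict:
    """Decode the base64 JSON body of a Pub/Sub push envelope,
    e.g. one delivered from projects/httparchive/topics/crux-updated."""
    data = envelope["message"]["data"]  # base64-encoded payload
    return json.loads(base64.b64decode(data))

# Hypothetical payload announcing which CrUX month was published.
payload = {"crux_date": "2024-02-01"}
envelope = {
    "message": {"data": base64.b64encode(json.dumps(payload).encode()).decode()}
}

msg = decode_pubsub_message(envelope)
print(msg["crux_date"])  # 2024-02-01

# A handler would then kick off the scheduled query for that month,
# e.g. via google.cloud.bigquery_datatransfer's
# DataTransferServiceClient.start_manual_transfer_runs(), using
# msg["crux_date"] as the requested run time.
```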

@giancarloaf giancarloaf linked a pull request Mar 18, 2024 that will close this issue