-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add monthly CWV Tech Report materialization query #237
Comments
I've added a scheduled query with the hard-coded default date replaced with
Note: there is a bug with scheduled queries using This scheduled query is currently set to run on-demand in order to be triggered manually from some other source (e.g. API call) rather than on the first of the month, since the execution date may vary from the run date. |
Testing as of 2024-03-10 after the most recent crawl completed for 2024-03-01 and using a run date of 2024-02-01. This produced no results since the new pipeline has not finished yet and upstream tables have not been updated in the |
The crawl starts after the CrUX data is published on the second Tuesday of the month. So the I also went to run this manually this month (for February) but it says it'll process 337 TB, whereas the manual one say 18 TB so went with that for this month as quite a difference! It could be over estimating how much it'll take but 337 TB is a LOT so didn't want to take the chance. Thoughts? |
Ok, perfect! The data isn't "delayed"; instead, I ran the query with dates too far into the future at that point in time.
That's good. The query mentioned in the issue description is written to use tables in the
In my experience, the Google Cloud console's query validator is not very accurate, especially when working with scripts, partitioned tables, and filters. I ran this scheduled query during testing and it processed 17.27TB (rather than the estimated 337TB). giancarlo_faranda@cloudshell:~ (httparchive)$ bq show --location=US --job httparchive:scheduled_query_65fd9c6e-0000-255a-8802-582429c39d0c
Job httparchive:scheduled_query_65fd9c6e-0000-255a-8802-582429c39d0c
Job Type State Start Time Duration User Email Bytes Processed Bytes Billed Billing Tier Labels
---------- --------- ----------------- ---------------- ----------------------------- ----------------- ---------------- -------------- -------------------------------------------------
query SUCCESS 10 Mar 22:41:23 0:07:59.888000 Giancarlo.Faranda@gmail.com 18990487562745 18990488354816 dts_run_id:65fd9c6e-0000-255a-8802-582429c39d0c
data_source_id:scheduled_query |
OK cool. Will give it a go next month then. I presume it is NOT automated yet to kick off? Especially for this month since I've manually run it for this month. |
@tunetheweb Correct, it is not automated yet but I think we trigger this from the same Pub/Sub subscription that initiates the crawl. Is this the topic which signals the start of the crawl? projects/httparchive/topics/crux-updated (CC @pmeenan @rviscomi ) |
The query to materialize the
technologies
table with the latest CWV Tech Report data should be run before the new pipeline, as those queries read from the table.The text was updated successfully, but these errors were encountered: