v3: Refactor attempt creation to be worker requested #1077

Merged: 68 commits, May 30, 2024
5ed700d
WIP worker TaskRunAttempt creation
ericallam Apr 30, 2024
d1bdd0c
Handling failing task runs that cannot create an attempt for whatever…
ericallam Apr 30, 2024
e24631e
Move the visibility queue stuff into a graphile job
ericallam Apr 30, 2024
f124d94
Fixed task runs with unsanitized queue names
ericallam Apr 30, 2024
3b3b07a
“Borrow” the code from alerts PR to get self hosted deployments working
ericallam Apr 30, 2024
5ca9e56
Add an admin API endpoint to get info about the shared marqs queue
ericallam May 1, 2024
1bba5d5
Allow admins to view any project metrics
ericallam May 1, 2024
14992f4
start adding lazy attempts to prod
nicktrn May 1, 2024
c75bbfd
lazy attempt creation for prod workers
nicktrn May 2, 2024
f53004d
resurrect prod stack traces
nicktrn May 2, 2024
1919b6f
add exception event to failed run spans
nicktrn May 2, 2024
a86b2e6
simplify dependency resumes
nicktrn May 2, 2024
dcc9745
fix typecheck
nicktrn May 2, 2024
4e18a42
Merge branch 'main' into v3/worker-attempt-creation
nicktrn May 2, 2024
90153c3
fix merge
nicktrn May 2, 2024
0552a8e
fresh process for all attempts
nicktrn May 3, 2024
c7fee76
Merge branch 'main' into v3/worker-attempt-creation
nicktrn May 3, 2024
1286147
always try sigterm first
nicktrn May 3, 2024
30b6c2c
stop heartbeat timeout on non-inplace replace message
nicktrn May 3, 2024
4ace3a4
add missing ack on checkpoint creation service failure
nicktrn May 3, 2024
78a1e57
bypass dequeue for retries with running worker
nicktrn May 3, 2024
1f11944
respect retry delays
nicktrn May 3, 2024
ba72219
crash runs with invalid run status for execution
nicktrn May 7, 2024
d60cf55
Merge branch 'main' into v3/worker-attempt-creation
nicktrn May 7, 2024
5dfaf99
remove debug logs
nicktrn May 7, 2024
93dca36
fix nack message
nicktrn May 7, 2024
bf79e6b
fix version locking
nicktrn May 8, 2024
6ad28b6
fresh attempt processes in dev and prod
nicktrn May 8, 2024
c6b1a29
Merge branch 'main' into v3/worker-attempt-creation
nicktrn May 8, 2024
0c0eb02
improve handling of ipc timeouts
nicktrn May 8, 2024
091f1d8
consider checkpoint failures on cancellation
nicktrn May 8, 2024
2a43a21
add basic chaos monkey to checkpointer
nicktrn May 8, 2024
e6cea79
changeset
nicktrn May 8, 2024
55cd522
Merge branch 'main' into v3/worker-attempt-creation
nicktrn May 8, 2024
2d30f86
Merge branch 'v3/fix-checkpoint-failures' into v3/worker-attempt-crea…
nicktrn May 8, 2024
7a9cd8d
control forced checkpoint simulation via env var
nicktrn Apr 26, 2024
e0423cb
Merge branch 'main' into v3/worker-attempt-creation
nicktrn May 20, 2024
334cf0e
Merge branch 'main' into v3/worker-attempt-creation
nicktrn May 21, 2024
19f6568
fix merge
nicktrn May 21, 2024
181641d
kill old attempt processes before checkpointing
nicktrn May 21, 2024
0837825
detailed perf logging for checkpointing
nicktrn May 21, 2024
59a6476
add coordinator otlp endpoint example
nicktrn May 21, 2024
833a0f1
improve prod run cancellation
nicktrn May 24, 2024
92e257f
rename supports lazy attempts migration
nicktrn May 24, 2024
1e8743d
fix graceful exit
nicktrn May 24, 2024
e913c41
fix retry mechanics
nicktrn May 24, 2024
6aba347
clear paused state before retry
nicktrn May 9, 2024
40a99f8
remove checkpoint image after push
nicktrn May 9, 2024
5e4b4a3
crash worker on unrecoverable errors
nicktrn May 9, 2024
bc71e2c
refactor unrecoverable error emit
nicktrn May 24, 2024
48aadea
switch to do hosted busybox image
nicktrn May 24, 2024
127d1aa
increase wait for duration ipc timeout
nicktrn May 24, 2024
02ae3f8
add changeset for misc fixes
nicktrn May 24, 2024
ef100ad
Merge branch 'main' into v3/worker-attempt-creation
nicktrn May 24, 2024
0ad7b83
fix merge
nicktrn May 24, 2024
8e5b71d
fix retry delay span runId
nicktrn May 27, 2024
ee660a3
fix dev retries
nicktrn May 28, 2024
b6e105a
Merge branch 'main' into v3/worker-attempt-creation
nicktrn May 28, 2024
fed79e7
Merge branch 'main' into v3/worker-attempt-creation
nicktrn May 28, 2024
8f378b2
improve prod worker logging
nicktrn May 28, 2024
839b349
log checkpoint sizes
nicktrn May 28, 2024
16a365f
add lazy attempts catalog entries
nicktrn May 28, 2024
d137e4e
Fixed merge issue: use zodFetch, not wrapZodFetch
matt-aitken May 28, 2024
79b47fd
Revert "Fixed merge issue: use zodFetch, not wrapZodFetch"
matt-aitken May 28, 2024
23eb918
importEnvVars uses wrapZodFetch now
matt-aitken May 28, 2024
0e7e0df
add backwards compat for retries without checkpoints
nicktrn May 29, 2024
66c9186
handle more cases of unrecoverable runs
nicktrn May 29, 2024
2099d91
don't kill the child process if it shouldn't be killed
nicktrn May 29, 2024
14 changes: 14 additions & 0 deletions .changeset/tricky-keys-attack.md
@@ -0,0 +1,14 @@
---
"trigger.dev": patch
"@trigger.dev/core": patch
---

- Clear paused states before retry
- Detect and handle unrecoverable worker errors
- Remove checkpoints after successful push
- Permanently switch to DO hosted busybox image
- Fix IPC timeout issue, or at least handle it more gracefully
- Handle checkpoint failures
- Basic chaos monkey for checkpoint testing
- Stack traces are back in the dashboard
- Display final errors on root span
5 changes: 5 additions & 0 deletions .changeset/warm-olives-provide.md
@@ -0,0 +1,5 @@
---
"@trigger.dev/core": patch
---

Improve handling of IPC timeouts and fix checkpoint cancellation after failures