-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(core): add RSS memory limit to prevent OOM kills from the OS #4480
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
core/src/main/java/io/questdb/cairo/wal/seq/TableSequencerAPI.java
Outdated
Show resolved
Hide resolved
core/src/main/java/io/questdb/cairo/wal/seq/TableSequencerImpl.java
Outdated
Show resolved
Hide resolved
bluestreak01
previously approved these changes
May 20, 2024
bluestreak01
approved these changes
May 20, 2024
[PR Coverage check]😍 pass : 2181 / 2539 (85.90%) file detail
|
1 task
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Adds the setting
cairo.rss.memory.limit
and enforces it inUnsafe.checkAllocLimit()
. Throws a non-criticalCairoException
with a new flag,isOutOfMemory
, raised.Also fixes #4510.
Future work
We need to show the proper error message to the user. It must say which query ran out of memory.
We need to implement a heuristic to set the default RSS limit automatically.
Discussion
The majority of the PR deals with the new failure point this introduces, and making the code robust to it.
There are two main categories of problems: initializing objects, and reopening objects.
1. Initializating objects
Object initialization (constructor) is a critical moment in its lifecycle: many constructors create several
Closeable
child objects, but the constructor isn't typically wrapped in atry-catch
that closes any objects already constructed. If the constructor throws an exception, these child objects become unreachable, and leak all the resources they allocated.Another case is higher up the call stack, complex business logic that prepares the object graph to serve a query. Prime example is
SqlCodeGenerator
. This code creates some objects, keeping them in local variables, and then eventually passes them to aRecordCursorFactory
. It is highly complex and branchy, and it's difficult to keep track of each individual closeable object along many different code paths.2. Reopening objects
Reopenable
objects are typically stored in object pools, and reused through theirof()
method. Theclose()
method in many of these objects has anif
guard that does cleanup only if the object is open. Most typically it'sif (isOpen) { cleanup }
. The critical problem arises inreopen()
, where this flag may be raised not as the first thing before any native allocation, but in the end -- which makes intuitive sense, but introduces a leak. An OOM exception thrown from one of the initialization steps will trigger callingclose()
, but it will be a no-op becauseisOpen
is still false.This is a very subtle problem and it's likely to keep getting re-introduced to the codebase.
Testing
In order to find failure points in the codebase, I made some temporary changes in
assertMemoryLeak()
:Unsafe.setRssMemLimit(8_750_000)
just abovecode.run()
, then reset it to zero right after.Unsafe.checkAllocLimit()
such as printing the stack trace of the OOM exception. Exceptions sometimes get translated and the original stack trace is lost.malloc()
call.These are the commits that make the temporary changes:
7e5d4a4
4d88403
63edda9
a3b453d
These commits are reverted later on in the PR.
We should use a similar approach to create an automated test run that will keep monitoring our codebase for leaks.