Auto-compaction and compaction options. #1
So I'm looking at adding an option to Medea like so:

```javascript
var db = new Medea({
  compaction: 'auto' // value: 'manual' or 'auto'
});
```

Compaction would spawn a new process to get the work done. Question: should this default to 'auto' or 'manual'?
Automatically, by default. Keep it well documented that processes are going to be spawned for the cleanup process. I'd rather have a disposable process spawned to clean this up than have it grow out of control. -Matt
Aside from making a messy file structure and eventually losing all your disk space, what are the other downsides to never running compaction?
Thanks, fellas. I'm on the same page. Just wanted to make sure the page wasn't torn out of some twisted version of Alice in Wonderland.

@sandfox Another downside is a longer startup time. Medea iterates through all the hint files on startup to build the keydir. If it has to parse through lots of dead entries, that takes a while. Out of the box, Medea has a maximum file size limit and will open new data/hint files as needed. You can set this as an option on the Medea constructor. For some use cases, this probably won't matter much. For a lot, it will.

Scheduling the merge process is another feature of Bitcask (always, never, or within a certain time range). I'm wondering how system-intensive this process will be. I guess I won't know until it's built. :)
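The startup cost described above comes from replaying every hint entry, dead or alive, into the keydir. A minimal sketch of that rebuild, with illustrative field names that are not Medea's actual on-disk format:

```javascript
// Rebuild the in-memory keydir from hint entries. Entries arrive in write
// order, so a later entry for the same key supersedes an earlier one --
// but the dead (superseded) entries still cost a full pass, which is why
// skipping compaction makes startup slower.
function buildKeyDir(hintEntries) {
  var keydir = {};
  hintEntries.forEach(function (entry) {
    keydir[entry.key] = {
      fileId: entry.fileId,       // which data file holds the value
      offset: entry.offset,       // byte offset of the value in that file
      valueSize: entry.valueSize,
      timestamp: entry.timestamp
    };
  });
  return keydir;
}
```

Compaction shrinks this pass by rewriting the files with only live entries, so the loop touches far fewer records.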
@kevinswiber Does compaction ever block? (In my head, I think it must at least some of the time.)
@sandfox Yep. Trying to figure out how to make this work elegantly in Node. In Bitcask, the keydir is shared between multiple Erlang processes. Without using a memory-mapping trick, I can't do that with legit OS processes. I'm afraid we might have to block for the entire duration of the merge process. I still need to map out what this process will do in Node-land. Because of this, merge strategies might be:

Thoughts?
I think the default should be on startup, with a little log message saying what's happening. That way it's very non-surprising when getting started. (Nothing is more annoying than things stopping midway for "unknown" reasons.) Otherwise, let people specify attempting compaction every [x timeframe], in combination with some configurable threshold for when compaction should happen, so it doesn't do it needlessly.
There are a few thresholds in Bitcask.
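For reference, the thresholds in question can be expressed as an options object. The names and values below mirror Bitcask's merge settings as I understand them; they are an assumption about how Medea might expose them, not an existing Medea API.

```javascript
// Hypothetical compaction thresholds, modeled on Bitcask's merge triggers.
// A "trigger" starts a merge; a "threshold" decides whether a given file
// is included in one.
var compactionDefaults = {
  fragMergeTrigger: 60,                      // % dead keys in a file that triggers a merge
  deadBytesMergeTrigger: 512 * 1024 * 1024,  // dead bytes in a file that trigger a merge
  fragThreshold: 40,                         // % dead keys that makes a file merge-eligible
  deadBytesThreshold: 128 * 1024 * 1024      // dead bytes that make a file merge-eligible
};
```

The eligibility thresholds sit below the triggers on purpose: once one badly fragmented file kicks off a merge, moderately fragmented files get cleaned up in the same pass.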
I think I'm going to try something in-process. We might be able to do it with minimal or no blocking.
This might make everything slower during a merge, as more system resources will be utilized, but we may be able to get away with not blocking (for very long). That would be effin' sweet.
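The "minimal or no blocking" idea above can be sketched by walking the merge one entry at a time and yielding to the event loop between steps, so reads and writes interleave with the merge. Function names are illustrative, not Medea's.

```javascript
// Incremental in-process merge: process one entry per event-loop turn.
// Dead entries are dropped; live ones are carried into the merged set.
// setImmediate yields between steps so the merge never blocks for long.
function mergeEntries(entries, isLive, onDone) {
  var merged = [];
  var i = 0;

  function step() {
    if (i >= entries.length) return onDone(null, merged);
    var entry = entries[i++];
    if (isLive(entry)) merged.push(entry);
    setImmediate(step);
  }

  step();
}
```

The trade-off is exactly the one described: total merge time grows, because each entry costs an event-loop turn, in exchange for keeping the process responsive.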
I've got the compaction code working now. You can find it in master at commit 26681ef. This doesn't do any automatic scheduling yet. Below is some output. I had 47MB worth of key-value pairs in file 81 that I duplicated in a subsequent call to
FYI: You may notice the file number jumps from 82 to 84. That's because Medea holds file 83 open while compacting. File 83 is supposed to record ongoing file updates while compaction is running, and file 84 becomes the newly merged data file. There were no updates in this case, so when Medea was properly closed, it deleted the empty file. (Note: These numbers don't mean anything special. Medea just increments the file ID for data and hint files, starting at 1.)

Medea has a "wrap" concept that closes an active file and starts a new one when it reaches a configurable size limit. That should also apply in merge cases, though I haven't tested it yet. I expect ongoing reads/writes to work while compaction is running, but both will be impacted as resource usage spikes on the machine.
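The "wrap" decision described above reduces to a small size check. A minimal sketch, with a hypothetical helper name, using the file numbers from the example:

```javascript
// Decide whether the next write should "wrap" to a new data file:
// if appending would push the active file past the size limit, close it
// and move to the next file id; otherwise keep writing to the current one.
function nextFileId(currentId, currentSize, writeSize, maxFileSize) {
  return (currentSize + writeSize > maxFileSize) ? currentId + 1 : currentId;
}
```

Applying the same check during a merge is what keeps a merged output file from growing past the configured limit.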
Hey, let's get this discussion going again! I'd like to opt for automatic compaction, running in the same process, and running on startup. Perhaps we could count how many keys have been changed/added/removed and then run compaction when that number is above a certain threshold? Since the compaction runs mostly asynchronously, but in series (only one asynchronous operation running at a time), the performance impact shouldn't be too large. But that should definitely be benchmarked, so we can find a good balance.
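The change-counting suggestion above can be sketched as a small counter that fires a compaction callback once a threshold is crossed. The names are hypothetical, and the guard against overlapping runs reflects the "only one asynchronous operation at a time" constraint.

```javascript
// Count mutations (put/remove) and trigger compaction at a threshold.
// `compact` is a user-supplied function taking a completion callback;
// `running` ensures only one compaction is in flight at a time.
function CompactionCounter(threshold, compact) {
  this.threshold = threshold;
  this.compact = compact;
  this.changes = 0;
  this.running = false;
}

CompactionCounter.prototype.record = function () {
  this.changes++;
  if (this.changes >= this.threshold && !this.running) {
    this.running = true;
    var self = this;
    this.compact(function () {
      self.changes = 0;     // mutations since the last merge
      self.running = false;
    });
  }
};
```

The database would call `record()` on every write or delete; tuning `threshold` is the benchmarking question raised above.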
We need smart defaults for auto-compaction that apply to multiple use cases. Just not sure what those are at this time. |
The log-structured hash table creates data on each operation. A separate merge process is needed to clean up the .data and .hint files.