Skip to content

dnr/styx

Repository files navigation

Styx

Introduction

Styx is an alternate binary substitution mechanism for Nix. It downloads data on-demand, stores common chunks of data only once to reduce local disk usage, and uses differential compression to reduce bandwidth usage.

It's currently quite experimental, though the basic features are working. If you want to try it, skip down to the bottom.

Motivation

Before explaining exactly what it does and how it works, let me motivate things a little:

We all know that Nix uses a ton of storage and bandwidth. This is because of how it works: each package contains absolute references to its dependencies, so if package with dependents changes all the packages that depend on it have to change also. If a package near the root of the dependency tree (e.g. glibc or systemd) changes, pretty much everything changes.

When a package changes, you have to download a nar file containing that package. The nar file is compressed with xz, which is as good as it gets for general purpose compression, but it can only eliminate redundancy within the package, not across packages or versions of a package.

When a package's dependency changes without any other change, the new version is usually very similar to the old one. Small version bumps produce slightly bigger changes but often still quite small. Nix's binary fetching and storage mechanism can't take advantage of any of that, though, it always downloads a full xz-compressed nar. If we could take advantage of the similarities of data across versions of a package, we could save both bandwidth and storage space.

My first try at seeing if I could improve this situation was nix-sandwich, which does differential compression for nar downloads. This works surprisingly well! But it has a bunch of limitations: it struggles with large packages, and it does nothing to improve the situation with local disk storage. I wanted to use some of those ideas but also try to address the storage problem at the same time.

Storage

Nix's optimize-store works by hard-linking files. This provides some benefit, but only if the whole file is identical. Files that are very similar, even the same except for a few changed hashes, get no benefit at all.

To do better we have to take advantage of common data within files. Some modern filesystems like btrfs provide a way to share extents (block-aligned ranges of files). If we combined that with differential-compressed downloads, that would address both parts of the problem.

On-demand

But I was also thinking about another obvious way to reduce bandwidth and storage: not downloading or storing files that aren't used. For various reasons, it's common for packages to end up in NixOS system closure that aren't actually used. It seems silly to download new versions of them at every system update.

I also have a bunch of packages in my configuration that I use rarely. For example, I might fire up darktable only a few times a year, but I end up downloading it with every update. Sure, I could leave it out of my configuration and just use nix-shell -p, but that's extra hassle I shouldn't have to do.

What if most packages on my system were "installed", but the data wasn't even fetched until they were used? Of course, after it was fetched then it should be cached locally for fast access.

What about FUSE?

If you're familiar with the Linux filesystem space, you might be thinking FUSE: run a daemon that serves a Nix store in a smart way, doing fancy [de]compression and fetching data on-demand.

Well, Styx is similar to that, but better:

The problem with FUSE is the performance overhead. Since each stat and read (no writes, packages in the store are immutable) has to go to the kernel, then to userspace, then back to the kernel, there's unavoidable latency.

What we really want for this on-demand thing is a way for the kernel to ask us for files or parts of files, and then store that data and serve it from kernel space in the future.

EROFS

It turns out there's some new experimental stuff in Linux that works exactly like this: EROFS + on-demand fscache was created for the container ecosystem, but there's a lot of similarities with what we want for Nix.

EROFS is a read-only filesystem format and driver, similar to SquashFS and others, but much more flexible in where the "data" actually comes from. It can do things like mount tar files without moving or copying the data. Most importantly for us, it can mount filesystems where the data, and even the metadata, isn't present on the system in any form yet. Instead, the data can be requested from a userspace daemon on-demand using the fscache mechanism.

Additionally, it can share data between different "filesystems" and present them while only storing one backing copy.

Styx hooks up this EROFS + fscache stuff to a Nix binary cache and local store, plus an external service for on-demand differential compression, to get all the properties we're looking for.

There's still some overhead compared to a plain filesystem, since we essentially have a filesystem on a filesystem. However, once cached, stats and reads are served with one trip into the kernel. (Benchmarks TBD)

Overall design

Let's get the big complex diagram out there for reference:

flowchart BT
    subgraph localmachine[Local machine]
        subgraph kernel[Kernel]
            erofs
            fscache
            cachefiles
            filesystem[(filesystem)]
        end
        subgraph sd[Styx daemon]
            styxdaemon[Styx daemon]
            boltdb[BoltDB]
        end
        userspace([userspace programs])
        subgraph Nix daemon
            nixdaemon[Nix local store]
        end
    end

    subgraph cloud[Cloud]
        nixcache((  Nix binary cache  ))
        subgraph ssc[Styx server components]
            subgraph lambda[AWS lambda]
                manifester[Manifester]
                chunkdiffer[Chunk differ]
            end
            subgraph s3[AWS S3]
                chunkstore[(Chunk store)]
                manifestcache[(Manifest cache)]
            end
        end
    end

    manifester-- gets nar from -->nixcache
    manifester-- writes cached manifest -->manifestcache
    manifester-- writes chunks -->chunkstore
    chunkdiffer-- reads data -->chunkstore
    erofs-- requests data -->fscache
    cachefiles-- requests data -->styxdaemon
    userspace-- reads+stats files -->erofs
    userspace-- realizes store path -->nixdaemon
    nixdaemon-- requests mount of store path -->styxdaemon
    styxdaemon-- requests new manifest -->manifester
    styxdaemon-- requests diff -->chunkdiffer
    styxdaemon-- reads single chunks ----->chunkstore
    styxdaemon-- gets manifest ----->manifestcache
    nixdaemon-- gets narinfo -->nixcache
    fscache-- requests data -->cachefiles
    cachefiles-- stores data on -->filesystem
    styxdaemon-- stores metadata in -->boltdb
    boltdb-- stores data in -->filesystem
    styxdaemon-- requests mount -->erofs

    style nixcache stroke-width:6px

Then a few definitions:

Block: EROFS filesystems have a block size, like all filesystems. Styx uses 4KiB blocks. (I think currently the block size must match the system page size.)

Chunk: To reduce metadata overhead, the unit of data sharing and data movement is larger than a single block. Styx currently uses 64KiB, 16 blocks.

Chunk digest: Chunks are identified by a cryptographic digest. Currently we use a 192-bit prefix of SHA-256.

Manifest: The goal is that we don't want to download a full nar file (which has metadata and file data), but we do need all the metadata to create an erofs image. So we need a thing that has just the metadata from the nar file. But, it also needs digests of the data, or actually the chunks of the data, so that we can figure out the sharing structure. We call this a "manifest". It's similar to a nar listing (supported by some binary caches), but with two addtions: For tiny files and symlinks, we do include the data inline. For larger files, we break them into chunks and include the digest of each chunk.

EROFS image: Styx constructs one EROFS image per store path. The image has only metadata and small inline file data. The actual data is pointers to slab devices.

Slab: A slab is a virtual device that contains all the chunks that we know about, concatenated together (padded to the block size). A slab can hold one TB and we can have many slabs. Even though it looks big, don't worry, disk space is allocated only on-demand.

Slab image: Due to the design of cachefiles, we can only write data through its interface, we can't read it back out. We need to read chunks that we've written to construct the base for binary diffing. To do this, we mount a special image that exposes the entire slab device as a single large file. This also looks big but doesn't take any actual disk space.

Manifest slab: Large manifests themselves are broken into chunks. These chunks aren't stored in a slab, though, they're stored in a manifest slab. This one is just a plain old file, no EROFS or cachefiles stuff going on.

Styx daemon: This runs in your local machine, next to the Nix daemon.

Chunk store: An object store with one object per chunk.

Manifest cache: An object store with one object per constructed manifest.

Manifester: Server that can convert nar files into manifests, storing chunks of data in a chunk store.

Chunk differ: Server that can compute binary diffs between arbitrary sequences of chunks on demand.

Now we can describe the pieces and flow:

For requesting a new store path, it goes something like this:

  1. A user asks the Nix daemon to realize a store path.
  2. The Nix daemon queries a binary cache to see if it can substitute (requests narinfo).
  3. If if finds it, and if configured, it asks Styx to substitute it instead of downloading the nar file.
  4. Styx checks if it has an EROFS image already built. If not:
  5. It checks the manifest cache to see if it can just download one. If not:
  6. It asks the manifester to create a manifest for the nar.
    1. The manifester gets the narinfo and nar file.
    2. It breaks the data into chunks and stores them in the chunk store.
    3. It stores the completed manifest in the manifest cache for future requests.
    4. It returns the manifest to the Styx daemon.
  7. Now the Styx daemon has a manifest. It creates an EROFS image from the manifest and holds it in memory.
  8. It asks the kernel to mount an EROFS on the destination store path.
  9. The kernel calls into EROFS, which calls into fscache, which calls into cachefiles, which makes a request to the userspace cachefiles daemon.
  10. Styx acts as the userspace cachefiles daemon and gets the request for the contents of the EROFS image. It supplies the image that it kept in memory.
  11. Cachefiles writes the image to disk in its internal cache and supplies the data to EROFS.
  12. The kernel complets the mount operation. The store path is now mounted in the filesystem and can be stat'ed and readdir'ed with no additional interaction from Styx or Nix.
  13. Styx reports success to Nix.
  14. Nix records that the store path has been successfully substituted.

That's the simple case. There are a few complications that might happen:

  1. The manifest might be "chunked" itself. We do this when it's too big and could benefit from differential compression. In this case, we have to get the chunks of the manifest to reassemble it before we can build the image. This basically follows the read path below, so read that first.
  2. If this is the first image we're mounting, EROFS will also request to open the slab device(s). The Styx daemon responds to that by setting up some internal data structures. But then it also mounts the slab image (this is so we can read from it). This is another round-trip through the kernel and back up to Styx, where it returns a single 4KiB image. It then keeps an open file descriptor to the slab file within the slab image.

Now suppose a userspace program does a read from a file. That looks like this:

  1. Userspace calls read on a file in a Styx EROFS mount.
  2. The EROFS filesystem that we constructed tells EROFS that the chunk at that offset is actually found at a particular offset in a different "device", which is actually our slab file.
  3. If the data at that offset of the slab file is cached, it can return that. Note that the data may have been cached by another image, the slab file is shared.
  4. If not, it requests the data from Styx.
  5. Styx figures out what chunk is at that offset.
  6. Now there are a few options for how it can get the data for that chunk:
    1. It can request the single chunk from the chunk store.
    2. It can guess that some nearby chunks are also going to be requested, and ask the chunk differ to recompress and send a batch of chunks at once. This is higher latency (the server side is doing more work), but will generally give better compression than individual chunks.
    3. It can guess that the requested chunk and some nearby chunks are likely to be similar to a run of chunks that it already has. E.g., a bunch of chunks for a newer version of a package are similar to the corresponding chunks from the previous version of the same package. It can ask the chunk differ for a binary diff, which can provide much better compression.
  7. Once it has the data, it supplies it to cachefiles, and the read completes.

Discussion and notes

Security

Nix's security for binary substitution is based on the nar hash: a cryptographic digest of the whole nar file. Since our goal is to avoid downloading nar files, we can't verify a nar hash at substitution time. (We can verify it if all data for a package is completely fetched, but that might never happen for some packages.)

Instead, the manifester verifies the nar hash, and then signs the manifest with its own key. The Styx daemon verifies the manifest signature. This obviously means that the end user needs to trust the manifester, in addition to the nar signer as usual. This is sort of unfortunate but it's the best we can do without changing Nix itself. With tighter intergration to Nix and a different binary cache substitution protocol, we could of course do better here.

Note that we don't have to trust the chunk store, manifest cache, or chunk differ: chunks are content-addressed so can be verified independently.

Chunking schemes

EROFS basically imposes a chunking scheme on us: aligned chunks of some fixed power-of-two multiple of the block size, sequentially from the start of each file.

(Note that EROFS supports more flexible chunking when using its compression features. As far as I can tell, you can't use both compression and cross-image sharing at once. Maybe that'll be possible in the future.)

How does this compare to content-defined chunking (CDC)? CDC can theoretically share more data, especially if you use a smaller chunk size. But chunks at non-aligned offsets add more overhead to reconstructing files. Of course you can transfer data with CDC and reconstruct it into a filesystem. But then you lose the benefits of CDC in the local filesystem.

The idea of Styx is to partially decouple local storage sharing from network transfer sharing. Local storage uses aligned fixed size chunks for performance (this part isn't really changeable), but as long as we can reconstruct the data, we can inject anything into Styx: whole nar files, CDC-reconstituted nars, differentially-compressed nars, or (the current choice) differentially-compressed/individually-retrieved chunks.

There are some nice properties of using the same chunks for transfer that we do for storage, notably it reduces the metadata overhead of keeping track of what we have and what we're missing, and makes building manifests simpler. But it's not strictly required.

Expanding compressed files

If you've poked around in your Nix store, you might have noticed some packages contain compressed files. Prominent examples are man pages (compressed with gzip), Linux kernel module files (compressed with xz), and Linux firmware (compressed with xz). This makes some sense for normal substitution, since those files aren't often used and can be smaller on disk.

For Styx, though, this is actually conterproductive: Files that are not used are simply never downloaded, so it doesn't matter whether they're compressed or not. Per-file also interferes with Styx's differential compression (and nar compression for that matter): A small change in a file will produce a large change in the compressed version of that file, at least after the first point where they differ.

So it would be nice to undo this compression so we can get better binary diffs. Actually we can: the manifester can simply un-gzip or un-xz the files in the nar as it reads it, and present to Styx a package made of uncompressed files. There's only two issues:

  • We should probably rename the files by removing the .gz/.xz extension so programs don't get confused. We'll also have to be careful to rewrite symlinks.
  • If we modify the contents of the package, it will no longer match the nar and Nix will fail to verify it. This is fine as long as we understand what's happening. We could re-generate the nar signature with our own key if we really wanted.

(This isn't fully implemented yet.)

GC

So far we've only written data. How do we clean things up? There's enough metadata in the db to trace all chunk references, so we can find unused chunks. The next part is uncertain: based on a few experiments, I think that we can use fallocate with FALLOC_FL_PUNCH_HOLE to free space in the slab file.

(This isn't implemented yet.)

Cachefiles culling

Cachefiles has a culling mechanism that kicks in when the disk gets full. It's not clear what we should do since we can't just throw away the whole slab (the bulk of the data). Maybe we could keep some LRU info and throw out some old data.

For now, don't let the disk get full or things will probably go badly.

(This isn't implemented yet.)

Cost of server components

TBD

How to use it

Note that Styx requires two "experimental" kernel options to be enabled, so all of the following commands will build a custom kernel. I'll set up a binary cache with kernel builds soon.

At least kernel 6.8 is recommended since some bugs were fixed. These commands will use linuxPackages_latest.

This will also build a patched Nix.

Run the test suite in a VM

testvm

Run the test suite on this machine (requires custom kernel build)

testlocal

Start a VM with Styx running and available

This will use services in my AWS account. I may turn it off or break it at any time.

runvm, log in with test/test

The VM will be set up with <nixpkgs> shared on /tmp/nixpkgs and set on NIX_PATH. Styx will be configured to substitute all packages ≥ 32KiB. So start with nix-shell -p ... and see what happens.

Use it in your system configuration

Don't do this yet, this is still pretty experimental

   imports = [
      ./path/to/this/repo/module
   ];
   services.styx.enable = true;
   nix.settings.styx-include = [ "list" "of" "package" "name" "regexp-.*" ];

Roadmap and future work

Realistic

  • Improve chunk diff selection algorithm. The current algorithm is a very crude heuristic. There's a lot of potential work here.
    • Improving base selection
      • Getting a 'system" for each package and only using same system as base
      • Using multiple bases
    • Better selection of base chunks (currently using offset in nar)
    • Better readahead heuristics (whole file at a time)
    • Exploring other approaches like simhash
    • Consider how to make diffs more cacheable
  • Prefetch {file, package} command
  • Expanding compressed files
  • Support for bare files
  • Combine multiple store paths into images to reduce overhead
  • GC
  • Respond to cachefiles culling requests
  • CI to pre-create manifests for core packages after channel bumps
  • Run a system with everything not needed by stage1+stage2 on Styx
  • Adaptive chunk sizes for less overhead on very large packages
  • Closer integration into other binary caches and maybe Nix itself

Slightly crazy

  • Boot a system with almost everything in the store in Styx (kernel + initrd copied to /boot, Styx started in stage1)

License

GPL-2.0-only

Styx incorporates bits of code from Linux so it could be considered a derived work. Most of the code incorporated is either dual-licenced with Apache 2.0 or with the Linux syscall exception, so theoretically Styx would not have to use GPL-2.0, but that's the simplest option for now. If you're interested in using this code under different terms, let me know.