VFSForGit Linux Client Design

Scope

This document describes the architecture of the Linux port of the VFSForGit client and projected filesystem, and some of the expected development roadmap for these components.

While we intend to keep this document reasonably up-to-date, it is always advisable to check the current state of the code in the features/linuxprototype branch of Microsoft's upstream VFSForGit repository, in GitHub's copy of the features/linuxprototype branch, and in this libprojfs repository.

Overview

VFSForGit on Windows

The client-side Windows architecture of a virtualized Git working tree includes three components:

  • Windows Projected File System, including the PrjFlt kernel-mode NTFS filter driver, and the ProjectedFSLib user-space library to interact with the driver.
  • VFSForGit, including the GVFS provider which provides callbacks via the ProjFS library to the PrjFlt filter driver, as well as scripts which are installed as Git hooks.
  • Git for Windows with GVFS patches, which functions as a normal Git client but with adjustments for a virtualized working tree.

VFSForGit on Windows architecture diagram

The Windows design projects the external Git repository into the virtualized working tree by moving files through five states:

  • Virtual: file exists as metadata only in the GVFS provider, not on disk at all.
  • Placeholder: file exists on disk but has no contents.
  • Hydrated: file exists on disk with actual contents, but is still tracked by provider so it can be excluded from certain Git actions which only require modified files.
  • Full: file exists on disk with modified contents, and is no longer tracked by provider.
  • Tombstone: file has been deleted by user, but a "whiteout" placeholder is needed because it is still in the provider's metadata and would otherwise be treated as virtual.

Directories move through a smaller set of three states: virtual, placeholder (empty), and full.

VFSForGit on macOS

The client-side macOS architecture of a virtualized Git working tree includes three components:

  • PrjFSKext, a kernel extension (kext) providing "kauth" authorization hooks which run prior to key filesystem operations, and a PrjFSLib user-space library to interact with it.
  • VFSForGit, including the GVFS provider which provides callbacks via the PrjFSLib library to the PrjFSKext kext, as well as scripts which are installed as Git hooks.
  • Git for Windows with GVFS patches, compiled for macOS, which functions as a normal Git client but with adjustments for a virtualized working tree.

VFSForGit on macOS architecture diagram

The macOS design projects the external Git repository into the virtualized working tree by moving files through three states:

  • Placeholder: file exists on disk but has no contents (actual size but filled with zeros).
  • Hydrated: file exists on disk with actual contents, but is still tracked by provider so it can be excluded from certain Git actions which only require modified files.
  • Full: file exists on disk with modified contents, and is no longer tracked by provider.

Directories move through a smaller set of two states: placeholder (empty) and full.

The smaller set of states for files and directories in the virtualized working tree is due in part to the limitations of the kauth hook approach. Directory enumeration (directory listings) cannot be generated by an authorization hook. However, the kauth hook which runs prior to directory enumeration can instead pre-populate the directory with placeholders before the hook completes, which ensures the directory appears "full" to the subsequent in-kernel HFS+ or APFS directory listing code. This approach has proven more efficient than the original Windows one, and is anticipated to be used by the VFSForGit Windows implementation in the future.

VFSForGit on Linux

The current long-term plan for the client-side Linux architecture of a virtualized Git working tree will ultimately include three components:

  • projfs, a stackable kernel filesystem module providing a passthrough set of operations to a lower-level "real" filesystem, but with additional logic on key operations, and a libprojfs user-space library to interact with the projfs kernel module by means of a socket channel.
  • VFSForGit, including the GVFS provider which provides callbacks via the libprojfs library (accessed through a Linux PrjFSLib C# wrapper) to the projfs kernel module, as well as scripts which are installed as Git hooks.
  • The upstream Git client, modified with Linux-specific variants of the Git for Windows GVFS patches, which will function as a normal Git client but with adjustments for a virtualized working tree.

The Linux design projects the external Git repository into the virtualized working tree by moving files through the same three states used for the macOS implementation:

  • Placeholder: file exists on disk but has no contents (actual size but filled with zeros).
  • Hydrated: file exists on disk with actual contents, but is still tracked by provider so it can be excluded from certain Git actions which only require modified files.
  • Full: file exists on disk with modified contents, and is no longer tracked by provider.

Directories also move through the same set of two states as for macOS: placeholder (empty) and full.
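
The three file states and the transitions between them can be modelled as a small state machine. The following is only an illustrative sketch: the event names "read" and "write" are assumptions made for this example, and a write to a placeholder is shown as ending in the full state on the assumption that the file is hydrated first and then modified.

```python
from enum import Enum


class FileState(Enum):
    PLACEHOLDER = "placeholder"  # on disk, actual size, zero-filled
    HYDRATED = "hydrated"        # actual contents, still tracked by provider
    FULL = "full"                # modified contents, no longer tracked


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (FileState.PLACEHOLDER, "read"): FileState.HYDRATED,
    # Assumed: the placeholder is hydrated first, then modified.
    (FileState.PLACEHOLDER, "write"): FileState.FULL,
    (FileState.HYDRATED, "read"): FileState.HYDRATED,
    (FileState.HYDRATED, "write"): FileState.FULL,
    (FileState.FULL, "read"): FileState.FULL,
    (FileState.FULL, "write"): FileState.FULL,
}


def next_state(state, event):
    """Return the state a file is in after the given event."""
    return TRANSITIONS[(state, event)]
```

Note that no transition leads back out of the full state: once a file has been modified, it stays outside the provider's tracking in this model.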

While the set of file and directory states will be the same as for the macOS implementation, the architecture will more closely resemble the Windows one, because Linux fully supports the concept of a "filesystem filter" through the use of a stackable filesystem which implements a passthrough to a lower-level filesystem on most or all operations.

VFSForGit on Linux architecture diagram

projfs Kernel Module

Stackable File Systems

Unlike a true filter such as the Windows PrjFlt NTFS filter driver, a Linux "passthrough filesystem" has to be responsible for all filesystem operations, even if it defers them to a second lower filesystem for completion.

Production-quality passthrough stackable Linux filesystems include eCryptfs, which is in the mainline Linux kernel. The eCryptfs filesystem does not, itself, perform any storage. It merely intercepts all file operations, encrypts or decrypts file names and content as necessary, and relies on a second "true" filesystem to perform the actual storage. Another stackable filesystem in the Linux mainline kernel is OverlayFS, which provides union-mount functionality, including making a read-write overlay on a read-only filesystem, or a read-only snapshot of a read-write filesystem.

Diagram illustrating how stackable filesystems work

In fact, stackable file systems can be stacked on top of each other (just like Windows NTFS filters), so an OverlayFS mount could use eCryptfs mounts which use ext4 mounts for storage.

It is worth noting that the stackable filesystem model in Linux does not, in itself, prevent clients from reading or writing to the underlying "real" storage filesystem. Since every filesystem is another mount point, a user or client with sufficient privileges can always simply read the lower-level filesystem directly at its mount point (i.e., full directory path) instead of reading through the higher-level filesystem at its own, necessarily different, mount point.

For instance, if the lower-level filesystem was mounted at /mnt/ext4, and an eCryptfs filesystem was mounted at /mnt/crypt using the /mnt/ext4 directory as its underlying storage, then when a file named foo.txt is written to /mnt/crypt/foo.txt, a corresponding new file will be readable at /mnt/ext4/<encrypted filename> which contains foo.txt's contents (in encrypted form).

While the Linux stackable filesystem approach differs from the macOS and Windows ones in that the non-virtualized Git working tree will be visible through a different directory than the virtualized one, that is not expected to cause surprise or issues to Linux users, as the stackable filesystem paradigm is common and well-understood. Users simply avoid writing to the underlying filesystem at its mount point.

Mount Points

Unlike the global application of the kauth hooks in the macOS implementation, a Linux filesystem can be mounted at any point in the directory hierarchy, so our kernel module will only be active for the specific directories used in each virtualized Git working tree.

This eliminates the need for some of the per-file flags (such as FileFlags_IsInVirtualizationRoot) used in the macOS kext, which allow the KauthHandler.cpp code to quickly decide whether a given authorization request is applicable to a virtualized directory or not.

The Linux implementation can instead function much like the fuse kernel module in the Linux mainline kernel, which supports multiple directory mounts, each communicating with a different user-space FUSE daemon process over a different file socket.

Each Git working tree for which virtualization is enabled will be a distinct projfs mount point. The VFSForGit command-line program should create these as follows, given an initial path of /path/to/vfs4git:

  • /path/to/vfs4git/.gvfs/lower
    • normal directory
    • created empty by VFSForGit CLI program
    • then populated with .git directory, etc.
    • then used as lower mount point
  • /path/to/vfs4git/src
    • projfs mount point
    • mount parameter: lower=/path/to/vfs4git/.gvfs/lower

This implies that when a placeholder, hydrated, or modified file is written to disk (either by the .NET provider process via libprojfs, or directly by the Git client or other user process), it will be actually stored at /path/to/vfs4git/.gvfs/lower/path/to/file.txt but would be visible through the projected filesystem at /path/to/vfs4git/src/path/to/file.txt.
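
The mapping from a projected path to its storage location can be sketched as follows, assuming the src and .gvfs/lower layout listed above. The function name lower_path is hypothetical and is not part of libprojfs.

```python
from pathlib import PurePosixPath


def lower_path(root, projected_path):
    """Map a path under <root>/src to its location under <root>/.gvfs/lower.

    Raises ValueError if projected_path is not inside the projfs mount.
    """
    root = PurePosixPath(root)
    mount = root / "src"
    lower = root / ".gvfs" / "lower"
    relative = PurePosixPath(projected_path).relative_to(mount)
    return str(lower / relative)
```

For example, lower_path("/path/to/vfs4git", "/path/to/vfs4git/src/path/to/file.txt") returns "/path/to/vfs4git/.gvfs/lower/path/to/file.txt".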

Inode Mapping

In Unix filesystems, files and directories, as well as special nodes such as symlinks, are all represented by an inode (or "index node", although the original derivation is obscure). Inodes contain the majority of file metadata (size, modification time, etc.) as well as references to the storage blocks containing the actual file contents. Filesystems are free to store inodes in whatever on-disk layout they choose, normally one designed for fast indexing and retrieval.

Directories have as their content a list of file names and inodes, mapping the names of child files and directories to their inodes. To resolve a file path, the operating system repeatedly performs filename lookups on directories, starting from the root.

Crucially, there is no way to look up an inode's path given only the inode. In fact, there may be many paths to an inode, such as when multiple hardlinks are created for a single file, or even no paths at all, such as when a file has been removed from the filesystem but some file descriptors for it are still open.
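
Both situations can be observed from user space. The sketch below creates two hard links to one file, confirms they share an inode, and then unlinks both names while keeping a descriptor open. The function name inode_demo is just for illustration.

```python
import os
import tempfile


def inode_demo():
    """Return (same_inode, link_count, links_after_unlink, data)."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        with open(a, "w") as f:
            f.write("hello")
        os.link(a, b)  # second path to the same inode

        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        link_count = os.stat(a).st_nlink

        fd = os.open(a, os.O_RDONLY)
        os.unlink(a)
        os.unlink(b)  # the inode now has no path at all
        links_after_unlink = os.fstat(fd).st_nlink
        data = os.read(fd, 5)  # contents are still readable via the fd
        os.close(fd)
        return same_inode, link_count, links_after_unlink, data
```

On most local Linux filesystems the link count after both unlinks is 0, yet the data remains readable until the descriptor is closed.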

In order to speed up path lookups and file accesses, the Linux VFS (Virtual File System) maintains both an inode cache and a dentry (directory entry) cache. The VFS is an internal part of the kernel and a common, abstract layer above all filesystem modules in Linux.

The inode cache is populated when an inode is first looked up from its parent directory, and the inode will be held in the cache until it is ejected due to lack of use or cache pressure. However, since all file operations begin by looking up a path, if the inode is needed again after being removed from the cache it will simply be added again during the next lookup.

Stackable filesystems utilize the inode cache and lookup mechanism to maintain their own internal inode-to-inode mappings, which are held in memory. Each time a lookup operation is requested of the higher-level stacking filesystem, it performs a lookup on the lower-level filesystem, which returns the lower-level inode, and then the stacking filesystem creates its own inode and stores a reference to the lower-level one in the private-use data field of the inode that is returned to the VFS.
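
A toy user-space model of this lookup path is shown below. It is not kernel code: the classes LowerFS, UpperInode, and StackedFS are hypothetical, and a plain dictionary stands in for the VFS inode cache.

```python
class LowerFS:
    """Stand-in for the real storage filesystem."""

    def __init__(self, entries):
        self.entries = entries  # {(parent_ino, name): ino}

    def lookup(self, parent_ino, name):
        return self.entries[(parent_ino, name)]


class UpperInode:
    """Inode of the stacking filesystem; lower_ino plays the role of
    the private-use data field that references the lower inode."""

    def __init__(self, ino, lower_ino):
        self.ino = ino
        self.lower_ino = lower_ino


class StackedFS:
    def __init__(self, lower, lower_root_ino):
        self.lower = lower
        self.cache = {}  # lower_ino -> UpperInode (models the inode cache)
        self.next_ino = 1
        self.root = self._get(lower_root_ino)

    def _get(self, lower_ino):
        inode = self.cache.get(lower_ino)
        if inode is None:
            inode = UpperInode(self.next_ino, lower_ino)
            self.next_ino += 1
            self.cache[lower_ino] = inode
        return inode

    def lookup(self, parent, name):
        # Delegate to the lower filesystem, then wrap the result.
        return self._get(self.lower.lookup(parent.lower_ino, name))

    def evict(self, inode):
        # Models ejection under cache pressure.
        self.cache.pop(inode.lower_ino, None)
```

After an eviction, the next lookup of the same name simply builds a fresh upper inode pointing at the same lower inode, which mirrors the re-population behaviour described above.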

While stacking filesystems implemented as FUSE user-space modules have to do this inode mapping and caching themselves, the projfs kernel module can leverage the VFS's cache, just like the eCryptfs module and other in-kernel stacking filesystems.

Relative Paths

One challenge posed by the Linux kernel's VFS framework will be how to return relative paths (i.e., relative to the projfs mount point) in the callbacks from projfs to user-space. Relative paths are required for the proper functioning of the GVFS provider and its interactions with Git's own indexes. As noted above, inodes do not correspond to a single specific path; they may have more than one (via hardlinks) or none (for a deleted file which still has open file handles).

Moreover, individual Linux filesystems do not have responsibility for walking file paths down from the root of the file hierarchy. The path walk is handled by the VFS, which resolves each path to a specific mount point (and therefore specific filesystem module) and inode, using internal structures known as dentries (directory entries).

The dentry_path_raw() function may provide a reasonable path name in the normal case but experimentation will be required to determine its behaviour with deleted files, files being actively renamed by another kernel thread, and especially with inodes having multiple hard links. Ideally, the dentry structures passed to the inode operations implemented by the filesystem module will be sufficiently populated to allow dentry_path_raw() to return a valid path. For example, the mkdir() syscall is implemented at the individual filesystem level by a call which takes the inode of the directory in which to create the sub-directory, and a dentry which defines the name of the new sub-directory:

int (*mkdir) (struct inode *,struct dentry *,umode_t);

The dentry has not yet been associated with an inode; that's the job of the filesystem once it creates the new inode in its storage system. We will need to determine if, in all relevant cases for VFSForGit, the various inode and file operations which need to trigger callbacks to userspace have sufficient data passed to them to provide the relative path of the given file or directory.

The libfuse user-space library implements a high-level API which maintains an inode-to-path mapping as well as an optional Least-Recently Used list. Microsoft's VFSForGit engineering team is also looking at a potentially similar vnode cache design for its macOS kext, and has encountered some issues which are indicative of the kinds of edge cases we may face implementing similar inode-to-path mappings in the projfs kernel module.
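
The general idea of an inode-to-path map with least-recently-used eviction can be sketched as below. This is a simplified model for discussion only and does not reproduce libfuse's actual data structures; the class name PathCache and its methods are hypothetical. It also records only one path per inode, so it exhibits exactly the hard link limitation discussed above.

```python
from collections import OrderedDict


class PathCache:
    """Bounded inode-number -> relative-path map with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._paths = OrderedDict()

    def remember(self, ino, path):
        self._paths[ino] = path
        self._paths.move_to_end(ino)  # mark as most recently used
        if len(self._paths) > self.capacity:
            self._paths.popitem(last=False)  # drop least recently used

    def path_for(self, ino):
        path = self._paths.get(ino)
        if path is not None:
            self._paths.move_to_end(ino)
        return path  # None means the path must be re-resolved

    def forget(self, ino):
        self._paths.pop(ino, None)
```

A caller that receives None from path_for() has to fall back to some other way of resolving the path, which is where the edge cases (renames in flight, deleted-but-open files) would need to be handled.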

For these reasons, we may find it advantageous to continue development of our FUSE-based user-space libprojfs library as far as possible, relying on libfuse's path mapping logic, and then try to resolve any performance issues which arise by improving the FUSE/libfuse implementations rather than attempting to develop a separate projfs Linux kernel module.

Communication Channel

The communication between the projfs kernel module and the user-space libprojfs library may replicate that implemented for the FUSE kernel module and its user-space library libfuse. Like FUSE/libfuse, projfs and libprojfs will utilize a device file, /dev/projfs, for which each instance of libprojfs will create a unique file descriptor corresponding to a specific mount point of projfs.

The initial handshake between the kernel module and the user library establishes the file descriptor associated with the mount point after the user daemon process starts and opens the /dev/projfs device file, then passes the file descriptor number to the kernel in its first write to the channel. The kernel module records that fd in its per-mount superblock and can then route file operation requests (i.e., VFSForGit callbacks and notifications) for any inode within that mount point to the appropriate descriptor. In this way, the projfs filesystem can be mounted at multiple points in the overall file hierarchy, and each mount point has its requests to user-space handled by a different daemon process.
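
The per-mount routing can be modelled in user space with one socket pair per mount. Everything in this sketch is an assumption made for illustration: the message layout, the operation codes, and the MountRouter class are hypothetical and do not describe the real projfs protocol.

```python
import socket
import struct

# Hypothetical request header: 1-byte opcode, 8-byte inode number,
# 2-byte path length, followed by the UTF-8 relative path.
HEADER = struct.Struct("<BQH")

OP_ENUMERATE_DIR = 1  # illustrative opcodes only
OP_HYDRATE_FILE = 2


def encode_request(op, ino, rel_path):
    raw = rel_path.encode("utf-8")
    return HEADER.pack(op, ino, len(raw)) + raw


def decode_request(buf):
    op, ino, length = HEADER.unpack_from(buf)
    path = buf[HEADER.size:HEADER.size + length].decode("utf-8")
    return op, ino, path


class MountRouter:
    """Models the kernel side: one channel recorded per mount point."""

    def __init__(self):
        self._channels = {}

    def register(self, mount_id, channel):
        # Corresponds to recording the descriptor at handshake time.
        self._channels[mount_id] = channel

    def send(self, mount_id, op, ino, rel_path):
        self._channels[mount_id].sendall(encode_request(op, ino, rel_path))
```

With two registered mounts, a request sent for one mount is delivered only to that mount's daemon end of the channel, so each working tree is served by its own provider process.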

As with the Windows and macOS implementations, timeouts should largely be handled by the provider, so as to ensure that network issues do not cause long-running local tasks such as build processes to fail unnecessarily. However, the kernel module must recognize when the provider has become entirely unresponsive, and cease communication with the provider, but still permit users to recover (i.e., read) any full files they have modified within the virtualized working directory. Modified/full files will have no projfs extended attributes, and will therefore be passed by the projfs module directly to the underlying filesystem for all operations, so this requirement should be easily attainable.

Callbacks and Notifications

When an opendir() or similar system call is made to the projfs filesystem, the kernel module checks if the directory is an empty placeholder, and if so makes a callback to the .NET provider. The provider can then use the VFSForGit API to create the necessary child placeholder files and directories. Upon return from the callback, projfs then proceeds to invoke opendir() on the lower-level filesystem, using the inode cached by the lookup file operation which preceded the call to opendir().
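
The enumerate-on-first-open flow can be sketched as a user-space model. The ProjectedDir class and the shape of the provider callback are assumptions for this example; an in-memory set stands in for the per-directory placeholder marker.

```python
import os


class ProjectedDir:
    def __init__(self, lower_root, provider):
        self.lower_root = lower_root
        # provider(rel_dir) -> names of child files to project
        # (hypothetical callback signature).
        self.provider = provider
        self.empty = {"."}  # directories still in the placeholder state

    def opendir(self, rel_dir="."):
        lower = os.path.normpath(os.path.join(self.lower_root, rel_dir))
        if rel_dir in self.empty:
            # Blocking callback: populate children before listing.
            for name in self.provider(rel_dir):
                open(os.path.join(lower, name), "w").close()
            self.empty.discard(rel_dir)
        # Pass through to the lower-level directory listing.
        return sorted(os.listdir(lower))
```

The provider is consulted only on the first opendir() of a placeholder directory; later calls go straight to the lower-level listing.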

When an open() or similar system call is made to the projfs filesystem, the kernel module checks if the file is an empty placeholder by examining its metadata (specifically the extended attributes created by the provider during directory enumeration), and if so makes a callback to the .NET provider. The provider uses the VFSForGit API to hydrate the file with its actual file content, but leaves the extended attributes so that the modified Git client can skip over the file if it is not further modified.

Empty files will be created as sparse files, which implies the lower-level filesystem must support sparse files; if not, an error will be reported when an attempt is made to mount the projfs filesystem. Hydrated files will obviously contain their full contents. During the callback to hydrate a file, once all data has been streamed into the file, an fsync() or kernel-level equivalent should be issued, before removing the empty-placeholder attribute on the file. This should ensure that incomplete files will never appear to be fully hydrated.
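
The ordering described here (stream the data, flush it to stable storage, and only then clear the placeholder marker) can be sketched as follows. A Python set named markers stands in for the empty-placeholder extended attribute, and both function names are hypothetical.

```python
import os


def create_placeholder(path, size, markers):
    """Create a zero-filled file of the final size and mark it empty."""
    with open(path, "wb") as f:
        # Extending with truncate() leaves a hole on filesystems that
        # support sparse files, so no data blocks need be allocated.
        f.truncate(size)
    markers.add(path)


def hydrate(path, chunks, markers):
    """Stream contents into a placeholder, then clear its marker."""
    with open(path, "r+b") as f:
        for chunk in chunks:
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # data is durable before the state changes
    # Cleared last: a crash before this point leaves the file marked as
    # a placeholder, so it is never mistaken for a hydrated file.
    markers.discard(path)
```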

Directory population and file hydration are both destructive changes; that is, they may conflict with user-initiated changes and overwrite them if we do not take care to avoid such situations. Care also needs to be taken to ensure that incomplete updates are not seen in the virtualized directory tree, such as when the provider fails part of the way through file hydration. In general, we intend to adhere to the design principle used throughout the VFSForGit project that user data, especially user-initiated changes, additions, and edits, should never be lost. For instance, if a user overwrites an empty placeholder file with new contents, these should not be lost if there is a competing file hydration operation occurring at the same time.

To this end we will implement several strategies for ensuring changes to files are made atomically. First, empty placeholder files and hydrated files can be written into an "offline" temporary location and then moved into place, following the process outlined in the Atomic placeholder operations in VFSForGit on macOS design. File operations in the kernel module, which start by looking for projfs "flags" (extended attributes or in-memory inode flags) on the given file, should therefore only encounter either placeholder files (with a flag) or hydrated files (without one). Because we populate a directory on first access, encountering non-extant "virtual" files—as is possible with the current Windows PrjFlt filter—should not occur.
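
The general write-then-rename technique looks like the sketch below. It illustrates only the common POSIX pattern, not the specific process used in the macOS design, and the function name and temporary file prefix are assumptions.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace path with data so readers see the old or new file only."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem as the target
    # for the final rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".projfs-tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename over the destination
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

Because the rename either happens or does not, a failure part of the way through writing leaves the original file untouched.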

All filesystem operations which depend on file or directory contents, including read and write operations as well as deletions and renames, should, when executed against an empty placeholder, trigger a blocking file or directory hydration callback from the kernel module to the provider. We can follow the design of the HandleVnodeOperation() function in the macOS kext in this regard, but will need to adapt it for the many discrete inode and file operations of the Linux VFS as compared to the common authorization hooks of the macOS kauth framework.

It is then the responsibility of the provider (or more accurately, the user-space software stack) to ensure that per-file and per-directory callbacks are handled in a safely concurrent or sequential manner. In the macOS implementation, the C++ callback handlers in PrjFSLib do this by first acquiring a global mutex lock on a map of per-vnode mutexes, then retrieving or creating the per-vnode mutex, incrementing its reference count, releasing the global lock, and finally acquiring the per-vnode mutex.
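
The locking sequence described for PrjFSLib can be expressed compactly as a sketch. This is a Python rendering of the same steps under a hypothetical class name, not a translation of the actual C++ code.

```python
import threading
from contextlib import contextmanager


class InodeLocks:
    """Per-inode mutexes held in a map guarded by one global lock."""

    def __init__(self):
        self._global = threading.Lock()
        self._locks = {}  # ino -> [mutex, reference count]

    @contextmanager
    def hold(self, ino):
        with self._global:  # 1. take the global lock on the map
            entry = self._locks.get(ino)
            if entry is None:  # 2. retrieve or create the mutex
                entry = [threading.Lock(), 0]
                self._locks[ino] = entry
            entry[1] += 1  # 3. increment its reference count
        # 4. the global lock is released before blocking on the mutex
        entry[0].acquire()  # 5. take the per-inode mutex
        try:
            yield
        finally:
            entry[0].release()
            with self._global:
                entry[1] -= 1
                if entry[1] == 0:
                    del self._locks[ino]  # last user removes the entry

    def active(self):
        with self._global:
            return len(self._locks)
```

Releasing the global lock before acquiring the per-inode mutex means a slow hydration of one file does not block callbacks for other inodes.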

A similar design in the libprojfs C user-space library will accomplish the same goals, namely, serializing operations on empty files and directories such that, for a given inode, hydration always precedes any user-initiated change, so that those changes cannot be lost due to the hydration.

Note that some file operations on hydrated files (as opposed to empty files), such as deletions and modifications, should also trigger blocking file event notifications from the kernel module to the provider. However, in these notification-only cases the GVFS provider will return immediately after placing the notification onto its internal queue of background tasks, as implemented in the BackgroundFileSystemTaskRunner and FileSystemCallbacks C# classes, and therefore the kernel module should not be blocked for any significant length of time.
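
The enqueue-and-return pattern can be sketched as follows. This is a generic model of a background task queue under a hypothetical class name; it does not reproduce the C# classes named above.

```python
import queue
import threading


class BackgroundNotifier:
    def __init__(self, handler):
        self._queue = queue.Queue()
        self._handler = handler
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def notify(self, event):
        # Returns as soon as the event is queued, so the caller (the
        # kernel module, in the real design) is not held up by handling.
        self._queue.put(event)

    def _run(self):
        while True:
            event = self._queue.get()
            if event is None:  # sentinel: stop the worker
                return
            self._handler(event)

    def close(self):
        self._queue.put(None)
        self._worker.join()
```

Events are handled in the order they were queued, on a separate thread from the one that delivered the notification.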

Extended Attributes

Following the model of OverlayFS, we would use trusted extended attributes named trusted.projection.*, such as trusted.projection.contentid and trusted.projection.version. Only processes with the CAP_SYS_ADMIN capability would be able to access these extended attributes.

So long as we are developing a user-space library only, we will use extended attributes named user.projection.* instead.

We would avoid the io.gvfs.* naming scheme to prevent confusion with the GNOME GVfs open-source project.

The primary two extended attributes used by the projfs module would be as follows:

  • trusted.projection.empty
    • true for placeholder files and directories, otherwise absent
  • trusted.projection.contentid
    • Content ID as set by provider

The underlying storage filesystem obviously needs to support extended attributes; attempting to mount projfs using a lower-level filesystem without this capability would result in a mount-time error.
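
A user-space check along these lines might look like the sketch below, which uses the user.projection.* names from the user-space phase. The helper names, the probe attribute name, and the value b"1" are assumptions; os.setxattr and os.getxattr are available in Python on Linux only.

```python
import errno
import os
import tempfile

EMPTY_ATTR = "user.projection.empty"


def supports_user_xattrs(directory):
    """Probe whether files in directory accept user.* attributes."""
    if not hasattr(os, "setxattr"):
        return False  # platform without Linux xattr calls
    with tempfile.NamedTemporaryFile(dir=directory) as probe:
        try:
            os.setxattr(probe.name, "user.projection.probe", b"1")
        except OSError:
            # Any failure means the directory is unusable as storage.
            return False
        return True


def mark_placeholder(path):
    os.setxattr(path, EMPTY_ATTR, b"1")


def is_placeholder(path):
    """True if the marker attribute is present, False if absent."""
    try:
        os.getxattr(path, EMPTY_ATTR)
        return True
    except OSError as e:
        if e.errno == errno.ENODATA:  # attribute not set on this file
            return False
        raise
```

Running the probe once at mount time lets the failure surface early, rather than on the first attempt to create a placeholder.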

The underlying filesystem would persist these extended attributes in its on-disk storage layout. However, the projfs module can also take advantage of the private data available to it within the in-memory inode cache of the Linux VFS, in order to speed up the majority of accesses for files and directories with projfs extended attributes.

The data structure which projfs defines as its private in-memory inode type (akin to ovl_inode in OverlayFS) may contain arbitrary bit flags and data fields. By defining these such that the values from the lower-level inode's extended attributes may be copied into them when the inode structures are initialized, we can retain them in the in-memory kernel VFS inode cache for fast retrieval, at least for all recently-used and frequently-accessed inodes within the mounted directory tree. In many cases this should allow projfs to check whether an inode represents an empty placeholder file or directory without even referencing the lower-level filesystem or its extended attributes.

Development Process

We anticipate following a staged, iterative development process, with the following principal goals:

  • Iterative and incremental process
  • Functional from an early stage so test suite can be developed in parallel
  • Conducive to CI/CD best practices
  • Supports multiple environments, including Linux distros and containers
  • Compatibility with .NET Core components where possible (i.e., the VFSForGit providers)
  • Optimized for performance in stages, based on test suite results

In particular, we will develop the Linux VFSForGit client in three phases.

Phase 1 – FUSE User Mode

First, using the in-kernel FUSE module and accompanying libfuse library, we build the libprojfs library which invokes the .NET provider's callbacks for key filesystem operations, including open() and opendir(), and provides other filesystem event notifications and a suite of library functions to the PrjFSLib interoperability layer in the .NET provider.

Note that we will already be developing a stackable filesystem, with the underlying storage in a separate filesystem mount, just doing so in user-space instead of in kernel mode.

Diagram of phase 1 of the Linux implementation

One caveat with the use of a user-space filesystem is the requirement of user read and write file permissions in order to check and update the extended attributes which maintain the projection state of a given file or directory.

Whereas an in-kernel implementation may read and set attributes in the trusted.* namespace, and do so at will, a user-space filesystem is restricted to the use of the user.* extended attribute namespace, and, further, can only read and change attributes as allowed by the file permission modes of a given inode. Thus in order to test whether a given file or directory is a placeholder, the user must have read permission, so a write-only file mode like 0222 can not be permitted. And user write permissions must be assigned to any read-only files or directories, at least temporarily, in order to convert them from the placeholder state to another (i.e., hydrated or full).
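
Temporarily granting the owner read and write permission, and restoring the original mode afterwards, can be sketched as a context manager. The name temporarily_writable is hypothetical.

```python
import os
import stat
from contextlib import contextmanager


@contextmanager
def temporarily_writable(path):
    """Ensure the owner can read and write path within the block."""
    original = stat.S_IMODE(os.stat(path).st_mode)
    needed = original | stat.S_IRUSR | stat.S_IWUSR
    if needed != original:
        os.chmod(path, needed)
    try:
        yield
    finally:
        if needed != original:
            os.chmod(path, original)  # restore, e.g. back to 0444
```

There is a window during which other processes running as the same user could also write to the file, which is one of the costs of the user-space approach.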

Phase 2 – Hybrid

The second development phase adds an in-kernel projfs module which, at first, only handles a single filesystem operation, such as a single event notification to user-space.

All other filesystem operations are passed through the kernel module (using pre-existing code from wrapfs or eCryptfs) to the FUSE kernel module, which then invokes libprojfs as in phase 1. This allows us to incrementally add functionality to the kernel module, one filesystem operation at a time.

The projfs kernel module will communicate with libprojfs over a separate file descriptor and socket channel, distinct from FUSE's communication channel. This implies that libprojfs will have a second event loop thread running in this phase of development.

Diagram of phase 2 of the Linux implementation

Phase 3 – Kernel Module

The final phase of development sees the removal of any dependency on FUSE and full in-kernel filesystem passthrough to the lower layer, with the full set of needed callbacks, notifications, and functions provided to libprojfs. The libprojfs library will only run the single event loop thread needed to listen to the projfs communication channel.

Diagram of phase 3 of the Linux implementation

However, note that if we encounter significant issues in converting VFS inodes to relative paths (as required by the VFSForGit interface and Git client), or if we approach our performance goals using only our Phase 1 user-space library, we may choose to focus on continued user-space development in preference to re-developing for an in-kernel implementation.

Distribution

Naming

In keeping with standard Linux/Unix practice, most component names will be lowercase only. The kernel module will be named projfs, to conform to the naming scheme for Linux kernel filesystem modules (e.g., ecryptfs.ko, squashfs.ko). The name "projfs" does not appear to reference any existing projects or Web sites.

Again, to conform to Unix practice, the user-space library which communicates with the kernel module will be installed as libprojfs.so, and its global symbol names will start with projfs_.

The command-line process used to create the virtualized Git working tree should almost certainly not be called gvfs, as it is now in the Windows and macOS clients. This is because many Linux distributions, by default, install and configure the GNOME GVfs filesystem (gvfs).

Further, the hidden .gvfs file used by the Windows and macOS implementations will conflict with the GNOME GVfs filesystem's use of a .gvfs file on Linux systems, and so will need to be given a different name in the Linux VFSForGit client.