Design of a ZSTD compression library in otherlibs/ #12812
Replies: 8 comments 3 replies
-
Transforms
The transform API looks nice (it's basically a push-based stream, I think?). I'm just sad to see a new, custom made Buffer
val flush: t -> string
val close: t -> string
This copies the buffer's content into a string, which presumably is going to be consumed immediately (written somewhere: a socket, a file, something like that). Could there be a way to skip this intermediate copy, like this (pardon the terrible names)?
val flush_without_clearing : t -> unit
val content_slice : t -> bytes * int (** only after flush_without_clearing *)
-
I agree that your modular I/O proposal would work fine for streaming compression / decompression to/from channels. I don't mind removing the
There's something like that in my Cryptokit library:
Again, we don't need to export
-
I'm not sure I understand the idea behind providing this API – not that I'm not interested in
But it seems that during the past decade upstream rather had the idea to remove stuff from
What is the principle here?
Also I'm not sure having it as a depopt is going to provide a very good user experience, e.g. in the case where
My impression is that OCaml would be better off making its use of
Just follow the eco-system conventions. Install the library in its own
-
Yes, I'm the one who pushed hard to move these libraries outside of the core OCaml distribution, because they had no compelling reasons to be there. But other libraries in otherlibs/ such as unix and systhreads will stay there forever because they have close ties with the internals of the runtime system, making maintenance difficult if they were separate projects instead.
I'm not sure what is a depopt of what. Either the new library is always installed (but may fail at run-time if ZSTD wasn't available at compile-time), or it is installed only if ZSTD was available at compile-time (just like systhreads used to not be installed if POSIX threads weren't available at compile-time). I don't feel strongly either way; I'd just like to understand which provides the better "user experience".
Of course, I chose ZSTD at random. That's how stupid I am. You should really watch your choices of words.
I'm certainly ignorant of the eco-system conventions, but I don't understand your advice. If I do
-
(Given the tone of your response, I don't feel like discussing this anymore; here's a final answer from me to clarify a few points.)
Personally I don't see the argument here. Compressed marshaling is its own unstable format. You may want to change it in the future, perhaps not using zstd. At which point you will have to carry on with this library.
Having
I'm not exactly sure where you read that I suggested you chose zstd at random. What I wrote is that having a Zstd module in
Yes. It has been suggested for ages that upstream moves to namespace its libraries under the
-
I tend to agree with @dbuenzli that the choice of
It seems pretty clear to me that in 20 years, we will probably still want to provide a compressed marshalling library for reading
Thus, it might make sense to focus the library's initial design on exposing the compiler support for compressed marshalling without coupling this library too strongly to the compression algorithm. Another way to word it is that I would rather avoid people starting to depend on the marshalled compression library when all they use is the
In other words, I would rather see the library organized as
to make sure that we only expose the features that we intend to maintain in the future.
-
I just installed a bunch of OPAM packages -- those with
-
This discussion went on a tangent about what "everyone" is doing with library names -- or not. At any rate, there's not enough interest for this proposal for a compression library in the core system distribution, so let me close it. I'll probably implement all of this as one of my personal projects. I duly noted that I should give my library the name
-
Context
Since release 5.1.0, the OCaml compilers use the ZSTD data compression library to reduce the size of compilation artifacts, but the runtime machinery developed to support this is not accessible from OCaml programs. We floated around the idea of adding an "other library" (similar to Unix or Str) to the core distribution that would provide user access to ZSTD compression. Below is a proposal for the API of such a library. I hope an API design can be agreed upon quickly enough for possible inclusion in OCaml 5.2.
Meta-context
I find pull requests to be the wrong place for discussing new libraries, as they mix API design with actual implementation, and the discussions are neither productive nor pleasant. To try something else, I'm using a GitHub discussion and I'm focusing on the API and its design rationale. I have code but will not update it or show it until an API is agreed upon. I'm expecting the discussions to be as abusive as for a regular PR, but at least there will be less back and forth on the implementation.
Package name
The library will be part of the OCaml core distribution, so there will be no OPAM package for it. However, it still needs a name for ocamlfind use and for the subdirectory in ocamlc -where. Two proposals:
zstd (more specific; conflicts with https://github.com/ygrek/ocaml-zstd)
compression (more generic, perhaps too generic; no known conflicts)
Main module name
Here, I think ZSTD works fine and gives clear qualified names: ZSTD.compress, ZSTD.decompress, ZSTD.Marshal.to_channel, etc. Compression leads to less nice names: Compression.compression, Compression.decompression.
Simple interface
Just two functions operating over whole strings:
I don't think we need to provide variants for bytes, sub-strings, sub-bytes, etc. Copying these to/from strings is quite cheap compared with the actual compression or decompression. If memory usage is to be minimized, the advanced streaming API below should be used.
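To make the shape of a whole-string interface concrete, here is a minimal sketch. The module type SIMPLE, its two function types, and the Rle module are assumptions made for illustration; they are not the signatures from the proposal, and the toy run-length codec merely stands in for ZSTD so that the example runs with the standard library alone.

```ocaml
(* Hypothetical shape of a whole-string interface. The names and types
   are assumptions for this sketch, not the proposed signatures. *)
module type SIMPLE = sig
  val compress : string -> string
  val decompress : string -> string
end

(* Toy stand-in codec (run-length encoding), NOT ZSTD: each run of up to
   255 equal characters is stored as a (count, character) pair. *)
module Rle : SIMPLE = struct
  let compress s =
    let n = String.length s in
    let b = Buffer.create n in
    let i = ref 0 in
    while !i < n do
      let c = s.[!i] in
      let j = ref !i in
      (* Extend the run while the character repeats, capped at 255. *)
      while !j < n && s.[!j] = c && !j - !i < 255 do incr j done;
      Buffer.add_char b (Char.chr (!j - !i));
      Buffer.add_char b c;
      i := !j
    done;
    Buffer.contents b

  let decompress s =
    let n = String.length s in
    if n mod 2 <> 0 then invalid_arg "Rle.decompress: truncated input";
    let b = Buffer.create n in
    let i = ref 0 in
    while !i < n do
      (* Expand each (count, character) pair back into a run. *)
      Buffer.add_string b (String.make (Char.code s.[!i]) s.[!i + 1]);
      i := !i + 2
    done;
    Buffer.contents b
end
```

The property a client relies on is the round trip: decompress (compress s) = s for every string s. Callers holding bytes or a sub-string would copy to a string first, as argued above.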
Marshaling
The initial motivation for this library is to support compressed data marshaling in the style of the Marshal standard library module. So, we'll implement the same interface as this Marshal module, just with on-the-fly compression/decompression:
Low-level streaming interface
In streaming mode, the input data for compression or decompression can be submitted one piece at a time, with successive function calls, and the compressed/decompressed output can be extracted and e.g. written to a file one piece at a time, without building the whole output in memory. This is useful when dealing with very large data.
The ZSTD streaming APIs for compression and decompression are similar, so I propose to present both operations as an abstract "data transformer":
(Internally, a transformer will probably be a record of functions, but I'd rather not expose the implementation.)
The main operation on a transformer is to give it N bytes of input and space for M bytes of output, and let it consume some input and produce some output. This returns a pair (N', M') where N' ∈ [0,N] is the number of input bytes consumed, and M' ∈ [0,M] the number of output bytes produced.
When we're done sending input to a transformer, we need to ask it to flush its internal buffers, possibly producing more output:
The result is the number of output bytes produced, and a Boolean indicating whether all internal data was flushed (true) or whether finish must be called again with more output space (false).
(Note that ZSTD streaming compression has a "flush" operation that terminates the current block and empties internal buffers, but does not write the end-of-compressed-text block. I'm not sure how to use this, and it has no equivalent for decompression, so I'd rather not expose this "flush" operation.)
A transformer remains usable after a finish. It can be more efficient memory-wise to reuse a transformer instead of creating a new one. Symmetrically, it can be more efficient memory-wise to tell ZSTD that we no longer need a transformer and that its internal buffers can be freed, rather than waiting for the OCaml GC to do it via finalization:
If applied to an already-freed transformer, transform and finish operations raise an exception, and free does nothing.
The low-level streaming API is hard to use, especially to manage output buffering. I propose two higher-level streaming APIs for common use cases: outputting to a file, and outputting to an in-memory string buffer. These may be omitted from the first implementation of this library, if there's not enough time or too much bickering.
Streaming to an output channel
An Out_channel.t is just a regular output channel, plus a transformer (compressor or decompressor), plus an internal bytes array to buffer the output.
Flushing finishes the transformer, making sure all output has reached the underlying out_channel. The Out_channel.t remains usable and can accept more input. Closing finishes the transformer but also frees it and frees the internal byte buffer, so the Out_channel.t becomes unusable. (There's no reason to close the underlying output channel, so the name close is not great.)
Some or all of the output functions from Stdlib.Out_channel can be provided:
Streaming to an in-memory string buffer
A Buffer.t is a transformer plus an extensible byte array in the style of the stdlib Buffer. The ?size is the initial size of the byte array.
flush performs a finish on the underlying transformer, then returns all the data that has accumulated in the buffer as a string, then empties the buffer. This is unlike Buffer.contents from the stdlib, which keeps the buffer intact and returns it as part of the next call to Buffer.contents. I think the "don't return the same output data twice" policy makes more sense for compression, where headers and trailers are added around the output data.
close (find a better name?) is like flush, but in addition it frees the transformer and destroys the internal buffer, making the Buffer.t unusable. In contrast, after a Buffer.flush, the Buffer.t remains usable for further compression / decompression.
Some of the "add" functions from Stdlib.Buffer can be provided:
Optional support?
What if the ZSTD C library is not available? We can choose to skip the build of the OCaml library and to not install it, like we used to do with Graphics when it was part of the core distribution and Xlib was not available. Or we can choose to build and install the library with reduced functionality, like we do with the Unix library under Windows. The "reduced functionality" could probably be:
Stdlib.Marshal module
val supported: bool.
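As a rough illustration of how client code might use such a flag, here is a sketch under stated assumptions: only the name supported comes from the proposal. The COMPRESSION signature, the placeholder compress operation, the Unavailable module, and the store function are hypothetical, and whether operations raise or degrade silently in a reduced-functionality build is not settled here.

```ocaml
(* Hypothetical interface: only [supported] appears in the proposal;
   [compress] is a placeholder operation for this sketch. *)
module type COMPRESSION = sig
  val supported : bool
  val compress : string -> string
end

(* Stand-in for a build where the ZSTD C library was not found.
   Raising here is an assumption, not a decided behavior. *)
module Unavailable : COMPRESSION = struct
  let supported = false
  let compress _ = failwith "compression support not available"
end

(* Client code checks the flag and falls back to storing the data
   uncompressed. The Boolean records whether compression was applied. *)
let store (module C : COMPRESSION) (data : string) : bool * string =
  if C.supported then (true, C.compress data) else (false, data)
```

With this shape, a program that prefers compression but can live without it never calls an unsupported operation, whichever build of the library is installed.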
.Related work
I'm aware of two bindings to the ZSTD library:
zstandard (https://github.com/janestreet/zstandard): comprehensive, exposes a large part of the C library; multiple kinds of data sources and data sinks.
zstd (https://github.com/ygrek/ocaml-zstd): minimalistic (just whole-string compression and decompression); length of decompressed data must be provided by the user to the decompression function, which is impractical.
The API proposed here falls somewhere between these two bindings in terms of exposed functionality.
None of these bindings provides support for compressed marshaling, beyond the memory-inefficient "marshal to a string and compress it".