
Performance improvement ideas #12

Open
andrews05 opened this issue Oct 24, 2022 · 12 comments
Labels
enhancement New feature or request
Milestone

Comments

@andrews05

This SIMD accelerated crc library may be able to improve performance: https://github.com/srijs/rust-crc32fast
I'm not sure if this is related to the performance difference between this crate and zopfli-rs?

@AlexTMjugador
Member

AlexTMjugador commented Oct 24, 2022

zopfli-rs provides bindings for the original Zopfli C code, and that code does not use a particularly fast CRC32 implementation, at least for the Gzip format.

Besides, as admitted in the zopfli-rs benchmarks, the performance of both crates is non-linear with respect to the input data size, which makes it harder to reason about. Experimentally, I've also found substantial differences in runtime when trying to compress random data vs. data with a predictable distribution, even if they have the same size. It may happen that this Rust reimplementation is slower for that particular benchmark, but faster in others. And that benchmark tests input sizes of less than 1 KiB, which are so tiny that they don't even fill the DEFLATE window.

Anyway, I think that the performance of this crate can be improved! Using crc32fast is certainly an option, and so is multi-threaded compression.
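For reference, the baseline that crc32fast improves on is the classic bytewise, table-driven CRC32 that the Gzip format requires. A minimal, stdlib-only sketch of that baseline is below; this is an illustration, not crc32fast's actual implementation, which processes many bytes per step and uses SIMD instructions where available:

```rust
/// Bytewise, table-driven CRC32 (IEEE polynomial, reflected form),
/// as used by the Gzip trailer. One table lookup per input byte.
fn crc32(data: &[u8]) -> u32 {
    // Build the 256-entry lookup table for the reflected polynomial.
    let mut table = [0u32; 256];
    for i in 0..256u32 {
        let mut c = i;
        for _ in 0..8 {
            c = if c & 1 != 0 { 0xEDB88320 ^ (c >> 1) } else { c >> 1 };
        }
        table[i as usize] = c;
    }

    // Process the input one byte at a time.
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc = table[((crc ^ u32::from(b)) & 0xFF) as usize] ^ (crc >> 8);
    }
    crc ^ 0xFFFF_FFFF
}
```

The standard CRC32 check value for the input `"123456789"` is `0xCBF43926`, which this sketch reproduces.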

@andrews05
Author

andrews05 commented Oct 24, 2022

Yeah, I'm not too worried about performance really, zopfli is always going to be horrendously slow 🤣. I just came across crc32fast and thought it might be useful here.

Although, ECT has demonstrated that fast zopfli is possible. I guess you could try to figure out how it was done. zopfli#119 and zopfli#110 (including the discussions) may be good references.

@AlexTMjugador
Member

When dealing with near-optimal algorithms like Zopfli, one must keep the no free lunch theorem in mind. A patch may perform better on a given benchmark, to the detriment of other possible benchmarks 😉

However, thank you for sharing the links! ECT certainly made some changes that are worth looking into, and multi-threading sounds like relatively low-hanging fruit that could bring some significant performance improvements to the table.
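Multi-threaded compression typically means splitting the input into independent blocks and compressing each block on its own thread (the approach pigz takes, for example). A stdlib-only sketch of that shape, where `compress_block` is a hypothetical stand-in (an identity function here) for actual per-block DEFLATE work:

```rust
use std::thread;

/// Stand-in for per-block compression; a real implementation would
/// run DEFLATE on the chunk. Identity here, for illustration only.
fn compress_block(chunk: &[u8]) -> Vec<u8> {
    chunk.to_vec()
}

/// Compress fixed-size blocks in parallel, preserving block order.
fn compress_parallel(data: &[u8], block_size: usize) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        // Spawn one scoped thread per block...
        let handles: Vec<_> = data
            .chunks(block_size)
            .map(|chunk| s.spawn(move || compress_block(chunk)))
            .collect();
        // ...then join them in order, so output order matches input order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

`thread::scope` lets the threads borrow the input slice directly without `Arc` cloning; the trade-off of per-block parallelism is that each block loses back-references into its neighbours, which can cost a little compression ratio.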

@kornelski
Collaborator

kornelski commented Oct 25, 2022

The current implementation looks very inefficient, handling io::Error for every byte:

zopfli/src/gzip.rs

Lines 31 to 50 in 41d8712

```rust
let in_data = IterRead::new(
    in_data
        .bytes()
        .filter_map(|byte_result| {
            read_error_kind = byte_result.as_ref().map_or_else(
                |error| Some(error.kind()),
                |byte| match insize.checked_add(1) {
                    Some(new_insize) => {
                        insize = new_insize;
                        crc_digest.update(&[*byte]);
                        None
                    }
                    None => Some(io::ErrorKind::Other),
                },
            );
            byte_result.ok()
        })
        .fuse(),
);
```
so I think there's plenty of room for improvement.

@AlexTMjugador
Member

AlexTMjugador commented Oct 25, 2022

I was focusing my attention on the deflate function implementation, but that indeed looks slow! I use Zopfli to generate raw DEFLATE streams, so any IterRead overhead here doesn't show up in my profiling graphs.

Edit: I've realized I had a much more efficient-looking hashing read wrapper type lying around from another Rust project of mine, so let's transplant it to Zopfli and see how it performs!
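A per-block hashing read wrapper along those lines could look like the following. This is a hypothetical sketch, not the actual type transplanted into Zopfli; the `HashingReader` name and the callback-based design are assumptions for illustration:

```rust
use std::io::{self, Read};

/// Wraps a reader and feeds each *block* of bytes read to a callback
/// (e.g. a CRC32 hasher update), instead of branching on `io::Error`
/// once per byte as the `filter_map` approach above does.
struct HashingReader<R, F: FnMut(&[u8])> {
    inner: R,
    on_block: F,
    bytes_read: u64,
}

impl<R: Read, F: FnMut(&[u8])> Read for HashingReader<R, F> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Propagate I/O errors once per read call, not once per byte.
        let n = self.inner.read(buf)?;
        (self.on_block)(&buf[..n]);
        self.bytes_read += n as u64;
        Ok(n)
    }
}
```

With this shape, a hasher like `crc32fast::Hasher` can consume whole buffers at a time via `on_block`, which is exactly where its SIMD fast paths pay off.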

AlexTMjugador added a commit that referenced this issue Oct 25, 2022
As pointed out by @kornelski, using IterRead for the task of updating a
rolling hasher as bytes are pulled from a reader is likely to be very
inefficient. Let's use a more specific wrapper type that does such a
thing per block of data read, not per byte.

While at it, let's replace the `crc32` dependency by `crc32fast`, and
`adler32` by `simd-adler32`, which are faster equivalents. `iter-read`
is no longer necessary, so it has been removed.

The performance impact of these changes has not been properly analyzed
yet, but some preliminary runs of `make zopfli && make test` show no
noticeable performance regressions at least, which is good!

Addresses #12.
@AlexTMjugador
Member

AlexTMjugador commented Oct 25, 2022

I've switched to crc32fast and got rid of IterRead in the latest commit. I'm not sure how much it helps performance yet, but at least the code is more elegant now 😄

@kornelski
Collaborator

That looks great!

@andrews05
Author

andrews05 commented Oct 25, 2022

Nice! I measured this around 3-4% faster on one file I tested.

> When dealing with near-optimal algorithms like Zopfli, one must keep the no free lunch theorem in mind. A patch may perform better on a given benchmark, to the detriment of other possible benchmarks 😉

Ha, true! Though in my experience ECT can compress better than zopfli's default settings while still being about 100x faster (without multithreading).

On the topic of near-optimal algorithms, libdeflate has its own implementation which is also very fast, though it falls just short of the ratios that ECT/zopfli can reach.

@AlexTMjugador
Member

AlexTMjugador commented Oct 26, 2022

Indeed, by the looks of it, ECT and the other projects you've linked to made quite a few smart optimizations to the hot paths that work better in the practical scenarios I can imagine. Admittedly, this crate is more focused on correctness and being a faithful Rust port of the original Zopfli code than on performance. Anyway, I'm glad that cleanup worked well for you, and I'd like this crate to become faster too, by implementing some of those ideas!

Truth be told, however, I didn't dive that deep into the matter, and getting these optimizations done right requires a lot of effort. I feel confident about implementing multi-threading at some point, but gaining further substantial speedups by accumulating known micro-optimizations is a task that flies over my head right now 😇

For the time being, I'd like to keep this issue open to track performance improvement ideas.

@AlexTMjugador AlexTMjugador changed the title Switch to crc32fast Performance improvement ideas Oct 26, 2022
@andrews05
Author

Yeah, I totally understand. Wish I knew more about compression myself 🙂

AlexTMjugador added a commit that referenced this issue Apr 21, 2023
The modified code was part of the get_cost_stat function, which is
pretty hot for bigger files (takes ~17% of the total CPU cycles for a
sample file). Let's do some obvious improvements to use better numeric
types here.

Related issue: #12
@AlexTMjugador
Member

I got some time and motivation to run zopfli under perf. I noticed that squeeze::get_cost_stat was taking up a significant chunk of CPU cycles, so I decided to start by avoiding some of the unsigned-signed type mish-mash that goes on there. This simple change brought down the compression time for a test file from ~14.91 s to ~13.82 s (a 7.9% speedup) on an Intel Core i7-12700 CPU.

This is by no means a scientific benchmark or the end of the road when it comes to performance optimization, but I'm amazed that this relatively simple change had such an outsized impact 😄
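The kind of cast cleanup described above can be illustrated like this. This is a hypothetical before/after pair, not the actual `get_cost_stat` change; both functions compute the same result, but the second avoids the per-iteration signed/unsigned conversions that can get in the optimizer's way:

```rust
// Before (hypothetical): indexing through an i32 counter forces a
// signed/unsigned round trip on every iteration of a hot loop.
fn cost_mixed(lengths: &[i32]) -> f64 {
    let mut cost = 0.0f64;
    for i in 0..lengths.len() as i32 {
        cost += lengths[i as usize] as f64;
    }
    cost
}

// After (hypothetical): iterate directly and use the lossless
// f64::from(i32) conversion; no index casts, easier to vectorize.
fn cost_clean(lengths: &[i32]) -> f64 {
    lengths.iter().map(|&l| f64::from(l)).sum()
}
```

Both versions are semantically identical here; the win in a function like `get_cost_stat`, which reportedly accounted for ~17% of total CPU cycles, comes from keeping one consistent numeric type through the hot loop.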

@mqudsi
Collaborator

mqudsi commented Apr 22, 2023

Nice find!

@AlexTMjugador AlexTMjugador added the enhancement New feature or request label May 22, 2023
@AlexTMjugador AlexTMjugador added this to the v0.8.0 milestone May 27, 2023