Unresolved issues from performance profiling #59
Comments
Not sure where to add this (or whether to post a new issue?), but Meow seems to be slower than many other hashes at small (<200 byte) data sizes. It blows past everything else on larger data, but for smaller data XXH3, City, Farm, and Mum are all several times faster. I only tested with Clang on a Mac (Core i9 laptop), though.
Just a quick note: while the small size performance may increase, it may
not end up being competitive below 256 bytes in general. This is because
Meow is now designed to be a Level 3 hash (i.e., actually secure for an
unknown seed and hardened against seed recovery). The extra protection
does come at a cost for small inputs, because without it, an attacker can
determine the seed from structured inputs. Sufficient scrambling is
effectively free for large inputs, because the extra overhead is constant,
but for small inputs it could be substantial.
As you are probably already aware, the other hashes you're looking at here
are already broken (except perhaps City and Farm?). Meow is trying to
actually not be broken.
@NoHatCoder may have some comments to add here as well.
Anyway, I will be posting the ASM implementation soon, which gets higher
performance for L1-cache hashes than the compiled versions (CLANG in
particular really farts on our code, even though looking at the ASM it
generates it's not immediately clear why... performance subtleties of the
Skylake architecture, I guess). The ASM version may also help with the
small-hash performance, but I just wanted to warn that we're not going for
fastest anymore; we're going for fastest-while-secure for Level 3, so being
slower below 256 bytes may be unavoidable.
- Casey
City and Farm are piles of random code gathered together by someone who had no idea what they were doing. Spooky seems decent; I can't rule it out as a possible Level 2. I guess we could do some lower-Level variants of Meow.
Are you perhaps thinking of Spooky V1? Aras's profile shows Spooky V2,
which I thought was busted (see
https://github.com/hmakholm/smhasher/blob/master/src/LongNeighborTest.md).
Am I wrong about that?
- Casey
Okay, so the code looked decent everywhere I looked. I guess I just didn't look in the right place. The Spooky website has this gem:
Guess what was actually very much needed.
Fair enough. This bit on the website might need an update:
Not actually aware! How are they "broken"?
The whole website is being replaced, but it's a spare-time project, so things are slow-going :) We will be posting performance numbers, security analysis, etc., when everything is finished. We basically decided the world doesn't need another "fast but garbage" hash, so we've made it a requirement that Meow not be trivially broken.
As for the other hashes, unfortunately no one has collated all the breakage (which is something I hope to have on the Meow website), but if you go spelunking, you can find people locating trivial collisions. For example, here's a thread showing how terrible xxHash3 is. - Casey
Pretty much every non-cryptographic hash that takes a seed is broken in the sense that the seed doesn't do anything of value. Forge a collision, change the seed, and it is very often still a collision, even if you didn't do anything in particular to make a seed-resistant collision. Obligatory reading: http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/
Apart from that, there are usually some ways that very simple changes cause collisions, some skew in the internal state that makes certain collisions more likely, or some other problem that flies just below SMHasher. I believe that a lot of them are basically designed using SMHasher, by making changes until the function passes. That is why adding a new test fails so many functions.
There is a freaking lot of code in City and Farm (even if you don't count the test code in Farm), so enumerating all the weaknesses is a pretty big job. But in particular, I found WeakHashLen32WithSeeds flawed: the w and z inputs are fed exclusively through this function, so if you collide it by changing only those two values, the whole thing collides. It turns out this often collides if w+z is kept constant and z+rot(z, 44) is also constant.
@aras-p I took a quick look at the small-hash performance today and I'm having trouble reproducing your results. At least in our benchmarks, at very small sizes (24 bytes, etc.) we are about 50% the speed of the fastest "hash" in the benchmark (which for very small sizes is usually CLHash). Is your benchmarking code published anywhere, by chance?
I plan on publishing better benchmarking in the future (meow_bench is only so-so), which can hopefully give more definitive results with careful timings... although it is kind of difficult, since most modern systems make it impossible to do good profiling automatically (you cannot get performance counters from user space, it is hard/impossible to turn off turbo boost automatically, etc.).
(And I should note, especially with buffers less than 1k, you really have to be careful with timing, because if you don't properly bracket the timing, it's hard to say what you're really measuring.) - Casey
Mine is at https://github.com/aras-p/HashFunctionsTest -- but yeah, it's not a rigorous test by any means; I was just playing around. And yes, I'm not careful there, etc. (OTOH, I'm not careful in the same way for all the hash algos, so...). I was misled by an earlier version of the hash description, which read like "fastest 128-bit hash for any data size", and did not realize that in 0.5 it changed to also be a "more robust hash", possibly losing some perf at smaller data sizes. That sounds fair enough; just ignore me :)
@aras-p We still try to be at least one of the fastest, though, so I do want as much perf analysis as possible. I think we still have plenty to squeeze out at the small sizes, so the hope anyway is that Meow will still generally be a go-to hash for any buffer size; it just won't be the absolute fastest for tiny sizes, because hey, some of these hashes are so bad that there's no way to be faster than them, because they don't really hash :/ - Casey
Just looking at this really quick - are you sure you're actually testing properly generated code here? I don't program on Mac, but just opening up the "XCode" whatever thingee, it doesn't look like you're enabling the right compiler switches, but maybe that's because I don't know what's going on in XCode. Generally speaking, for modern perf testing (i7/i9 etc.) you need to make sure you're using the compiler switches for the hash in question. xxHash3, for example, really needs -mpclmul (like really), and Meow certainly prefers -mavx2. If you're just trying to test what happens on legacy builds, that's cool, but if you're trying to get high-end perf results, you have to make sure you're generating modern x64... I don't know that this will have much of an effect on small buffer sizes either way, but just thought I'd mention it as a general issue. - Casey |
I'm building with [...]. Meow is still not the "very fastest" at small data sizes, but as you said, this is probably expected right now. Sorry for wasting everyone's time!
EDIT: Actually, looking more closely, that still looks a little suspicious... but, either way, we'll have more answers once I do the final profiling passes and so on. In general, I would caution also that it's probably not particularly useful to profile the general-call version of the functions on key sizes this small. If you actually care about the performance of sub-128 byte keys, you really want a macro that welds into your actual code. The function call overhead and checking alone on any hash will eat most of your time, and you can do much better by not doing it. We could certainly provide a MEOW_HASH() macro for this purpose, and maybe we should... but I would definitely say that in actual use, you really don't want to be calling a function for hashes that small if you do a lot of them, pretty much period. - Casey |
Why would we need AVX2? The unaligned VPADD and VPXOR were introduced in AVX. I can't think of anything else we would need. |
I don't think it actually needs -mavx2, it just needs -mavx. But I have not actually done any testing on -mavx codegen, so I couldn't say for sure. Although really it's mostly academic, because we still have the unfortunate situation of CLANG generating much slower code for some reason, so, there is that :/ - Casey |
- We do not know why CLANG generates subtract-from-pointer loads instead of add-to-pointer loads. The code it generates measures faster inside meow_bench, but when we extract the ASM and run it ourselves under perf, VTune, or meow_bench itself, it is slower than add-to-pointer! This needs to be resolved, and what exactly CLANG is doing with meow_bench to get these results needs to be investigated.
- We do not know why MSVC generates slower code on the loop than the hand-written ASM version, but it does appear to do so. I need to verify this by using hand-written ASM that exactly replicates the MEOW_SHUFFLE part of the hash, which I have not done yet. But in general, we would like to be able to use perf counters to see what about MSVC's register allocation is causing the snafu.
- We do not know why perf stat seems to be reporting lower numbers than we would expect. This may be due to excessive startup code or something else bad happening. We should try to stabilize the numbers so we can get the 2.8 IPC we seem to observe on Windows, although there could be reasons why that would not be the case...
- Casey