Unresolved issues from performance profiling #59

Open
cmuratori opened this issue May 26, 2019 · 16 comments

@cmuratori (Owner)

  1. We do not know why Clang generates subtract-from-pointer loads instead of add-to-pointer loads. meow_bench reports that version as faster when Clang generates the code inline, but when we extract the ASM and run it ourselves under perf, VTune, or meow_bench itself, it is slower than add-to-pointer! This needs to be resolved, and what exactly Clang is doing with meow_bench to get these results needs to be investigated. (A rough sketch of the two addressing shapes follows this list.)

  2. We do not know why MSVC generates slower code for the loop than the hand-written ASM version, but it does appear to do so. I need to verify this by using the hand-written ASM that exactly replicates the MEOW_SHUFFLE part of the hash, which I have not done yet. But in general, we would like to be able to use perf counters to see what about MSVC's register allocation is causing the snafu.

  3. We do not know why perf stat seems to be reporting lower numbers than we would expect. This may be due to excessive startup code or something else bad happening. We should try to stabilize the numbers so we can confirm the ~2.8 IPC we seem to observe on Windows, although there could be reasons why that would not be the case...
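For reference, here is a minimal sketch of the two load-addressing shapes item 1 refers to. This is illustrative only, not the actual meow_hash inner loop (no AES step, hypothetical function names):

```c
#include <immintrin.h>
#include <stddef.h>

/* "Add-to-pointer": the pointer itself advances, so every load is [ptr]. */
static __m128i sum_add_to_pointer(const unsigned char *data, size_t blocks)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < blocks; ++i)
    {
        acc = _mm_add_epi64(acc, _mm_loadu_si128((const __m128i *)data));
        data += 16;
    }
    return acc;
}

/* "Subtract-from-pointer": a fixed pointer past the end, with each load at
   [end - offset]; roughly the shape described above for Clang's codegen. */
static __m128i sum_subtract_from_pointer(const unsigned char *data, size_t blocks)
{
    __m128i acc = _mm_setzero_si128();
    const unsigned char *end = data + 16 * blocks;
    for (size_t i = blocks; i != 0; --i)
    {
        acc = _mm_add_epi64(acc, _mm_loadu_si128((const __m128i *)(end - 16 * i)));
    }
    return acc;
}
```

Both functions compute the same result; only the address formation differs, which is what the perf comparison is trying to isolate.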

- Casey

@aras-p (Contributor) commented Jun 6, 2019

Not sure where to add this (or should I post a new issue?), but Meow seems to be slower than many other hashes at small (<200 byte) data sizes. It blows past everything else on larger data, but for smaller data XXH3, City, Farm, and Mum are all several times faster. I only tested with Clang on a Mac, though, on a Core i9 laptop. (Higher numbers are better.)

DataSize | Meow-64 | XXH3-64 | xxHash64 | City64 | Mum | Farm64 | SpookyV2-64 | xxHash32
-- | -- | -- | -- | -- | -- | -- | -- | --
2 | 117 | 781 | 480 | 612 | 473 | 589 | 294 | 664
4 | 281 | 1644 | 961 | 1302 | 1024 | 1201 | 606 | 1488
7 | 536 | 2878 | 1333 | 2278 | 1498 | 2063 | 994 | 1764
11 | 842 | 3906 | 1953 | 3242 | 2261 | 2864 | 1576 | 2490
17 | 1321 | 4918 | 3162 | 4660 | 3495 | 4150 | 1535 | 2251
26 | 2021 | 7523 | 3761 | 7004 | 4723 | 6250 | 2269 | 2801
39 | 2915 | 7431 | 3125 | 7254 | 5345 | 6414 | 3311 | 3311
58 | 4274 | 11051 | 4157 | 10918 | 6813 | 9539 | 3331 | 4274
86 | 6398 | 12920 | 5013 | 7147 | 7998 | 10179 | 3463 | 4698
128 | 10202 | 15871 | 8332 | 10868 | 9899 | 8402 | 4385 | 5101
191 | 12434 | 12130 | 8087 | 12434 | 10254 | 9848 | 4605 | 5435
285 | 15621 | 13491 | 10234 | 12366 | 11130 | 9574 | 7011 | 5744
426 | 20226 | 16112 | 13022 | 13580 | 11593 | 11184 | 8339 | 5868
637 | 21171 | 17769 | 13446 | 14215 | 12134 | 12284 | 10153 | 5994
953 | 28356 | 19849 | 14812 | 14595 | 11676 | 13232 | 11676 | 6014
1427 | 31256 | 19379 | 15380 | 14681 | 12111 | 14042 | 12919 | 6055
2138 | 36323 | 20015 | 16077 | 15088 | 12573 | 14859 | 14213 | 6129
3204 | 39233 | 20017 | 16624 | 15089 | 12738 | 15325 | 14861 | 6130
4803 | 39392 | 20953 | 16691 | 16144 | 12789 | 15387 | 15631 | 6155

(Screenshot: chart of the benchmark results above.)

@cmuratori (Owner) commented Jun 6, 2019 via email

@NoHatCoder

City and Farm are piles of random code gathered together by someone who had no idea what they were doing.

Spooky seems decent; I can't rule it out as a possible Level 2.

I guess we could do some lower level variants of Meow.

@cmuratori (Owner) commented Jun 6, 2019 via email

@NoHatCoder

Okay, so the code looked decent everywhere I looked. I guess I just didn't look in the right place.

Spooky website has this gem:

> V2 corrects two oversights in V1:
>
> - In the short hash, there was a d = length that should have been d += length, which means some entropy got dropped on the floor. It passed the tests anyhow, but fixing this probably means more distinct info from the message makes it into the result.
> - The long hash always ended in mix()+end(), but only end() was needed. Removing the extra call to mix() makes all long hashes faster by a small constant amount.

Guess what was actually very much needed.

@aras-p (Contributor) commented Jun 7, 2019

> Just a quick note: while the small size performance may increase, it may not end up being competitive below 256 bytes in general. This is because Meow is now designed to be a Level 3 hash

Fair enough. This bit on the website might need an update:

> It has also now been tuned to be the fastest hash on small inputs, too. Despite the fact that it is a full 128-bit hash, it still outperforms “fast” 64-bit hashes across all input sizes. So whether you have data buffers of a few bytes or a few gigabytes, Meow is the fastest hash available.

> As you are probably already aware, the other hashes you're looking at here are already broken

Not actually aware! How are they "broken"?

@cmuratori (Owner)

The whole website is being replaced, but it's a spare-time project so things are slow-going :) We will be posting performance numbers, security analysis, etc., when everything is finished. We basically decided the world doesn't need another "fast but garbage" hash, so we've made it a requirement that it not be trivially broken.

As for the other hashes, unfortunately no one has collated all the breakage (which is something I hope to have on the Meow website), but if you go spelunking, you can find people locating trivial collisions. For example, here's a thread showing how terrible xxHash3 is.

- Casey

@NoHatCoder

Pretty much every non-cryptographic hash that takes a seed is broken in that the seed doesn't do anything of value. Forge a collision, change the seed, and it is very often still a collision, even if you didn't do anything in particular to make a seed-resistant collision. Obligatory reading: http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/
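A toy illustration of that point (deliberately simplified, and not any of the hashes named here): when the seed only offsets state that the message is mixed into linearly, any colliding pair stays a collision under every seed.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy hash, deliberately weak: the seed only offsets an accumulator, and the
   message contribution is linear, so any two messages with equal weighted
   sums collide for EVERY seed -- the failure mode described above. */
static uint64_t toy_hash(const uint64_t *words, int count, uint64_t seed)
{
    uint64_t h = seed;
    for (int i = 0; i < count; ++i)
        h += words[i] * 0x9E3779B97F4A7C15ull;
    return h ^ (h >> 31);   /* a final mix cannot separate equal states */
}

int main(void)
{
    uint64_t a[2] = { 1, 4 };   /* 1 + 4 == 2 + 3, so the sums match */
    uint64_t b[2] = { 2, 3 };
    for (uint64_t seed = 0; seed < 4; ++seed)
        printf("seed %llu: %s\n", (unsigned long long)seed,
               toy_hash(a, 2, seed) == toy_hash(b, 2, seed) ? "collision"
                                                            : "no collision");
    return 0;
}
```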

Apart from that, there are usually some ways in which very simple changes cause collisions, some skew in the internal state that makes certain collisions more likely, or some other problem that flies just below SMHasher's radar.

I believe that a lot of them are basically designed using SMHasher, by making changes until the function passes. That is why adding a new test fails so many functions.

There is a freaking lot of code in City and Farm (even if you don't count the test code in Farm), so enumerating all the weaknesses is a pretty big job. But in particular I found WeakHashLen32WithSeeds flawed: the w and z inputs are fed exclusively through this function, so if you collide it by changing only those two values, the whole hash collides. It turns out this often collides if w + z is kept constant and z + rot(z, 44) is also kept constant.

@cmuratori (Owner) commented Jun 7, 2019

@aras-p I took a quick look at the small-hash performance today and I'm having trouble reproducing your results. At least in our benchmarks, at very small sizes (24 bytes, etc.) we are about 50% the speed of the fastest "hash" in the benchmark (which for very small sizes is usually CLHash).

Is your benchmarking code published anywhere, by chance?

I plan on publishing better benchmarking in the future (meow_bench is only so-so), which can hopefully give more definitive results with careful timings... although it is kind of difficult since most modern systems make it impossible to actually do good profiling automatically (you cannot get performance counters from user space, it is hard/impossible to turn off turbo boost automatically, etc.).

(And I should note, especially with buffers less than 1k, you really have to be careful with timing because if you don't properly bracket the timing, it's hard to say what you're really timing, etc...)
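For anyone trying to reproduce this, here is a minimal sketch of the kind of bracketing meant here (assumes an x64 compiler with __rdtscp available; the hash_fn typedef is hypothetical): only the hash call sits between the two timestamp reads, and the minimum over many repetitions is kept to reduce noise.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

typedef void hash_fn(const void *data, size_t len);   /* hypothetical signature */

/* Bracket ONLY the work being measured between two timestamp reads and keep
   the best (minimum) time; __rdtscp waits for prior instructions to finish
   before reading the counter. */
static uint64_t min_cycles(hash_fn *hash, const void *data, size_t len, int reps)
{
    uint64_t best = UINT64_MAX;
    unsigned int aux;
    for (int i = 0; i < reps; ++i)
    {
        uint64_t start = __rdtscp(&aux);
        hash(data, len);
        uint64_t end = __rdtscp(&aux);
        if (end - start < best)
            best = end - start;
    }
    return best;
}
```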

- Casey

@aras-p (Contributor) commented Jun 7, 2019

> I took a quick look at the small-hash performance today and I'm having trouble reproducing your results

Mine is at https://github.com/aras-p/HashFunctionsTest -- but yeah it's not a rigorous test by any means, I was just playing around. And yes I'm not careful there etc., etc., (OTOH I'm not careful in the same way for all the hash algos, so...)

I was misled by an earlier version of the hash description that said something like "fastest 128-bit hash for any data size", and did not realize that in 0.5 it changed to also be a "more robust hash", possibly losing some perf at smaller data sizes. That sounds fair enough, just ignore me :)

@cmuratori (Owner)

@aras-p We still try to be at least one of the fastest, though, so I do want as much perf analysis as possible. I think we still have plenty to squeeze out at the small sizes, so the hope is that Meow will still generally be a go-to hash for any buffer size; it just won't be the absolute fastest for tiny sizes because, hey, some of these hashes are so bad that there's no way to be faster than them, because they don't really hash :/

- Casey

@cmuratori (Owner)

> Mine is at https://github.com/aras-p/HashFunctionsTest -- but yeah it's not a rigorous test by any means, I was just playing around. And yes I'm not careful there etc., etc., (OTOH I'm not careful in the same way for all the hash algos, so...)

Just looking at this really quick - are you sure you're actually testing properly generated code here? I don't program on Mac, but just opening up the "XCode" whatever thingee, it doesn't look like you're enabling the right compiler switches, but maybe that's because I don't know what's going on in XCode.

Generally speaking, for modern perf testing (i7/i9 etc.) you need to make sure you're using the compiler switches for the hash in question. xxHash3, for example, really needs -mpclmul (like really), and Meow certainly prefers -mavx2. If you're just trying to test what happens on legacy builds, that's cool, but if you're trying to get high-end perf results, you have to make sure you're generating modern x64...

I don't know that this will have much of an effect on small buffer sizes either way, but just thought I'd mention it as a general issue.
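One cheap sanity check (a sketch using GCC/Clang predefined macros; MSVC's names differ) is to have the benchmark itself print which ISA extensions the build actually enabled:

```c
#include <stdio.h>

/* Prints which instruction-set extensions this translation unit was compiled
   with, based on GCC/Clang predefined macros (-maes, -mavx, -mavx2, -mpclmul). */
void report_isa_flags(void)
{
#if defined(__AES__)
    puts("AES-NI enabled (Meow's aesdec loop can be used)");
#else
    puts("WARNING: AES-NI not enabled");
#endif
#if defined(__AVX2__)
    puts("AVX2 enabled");
#elif defined(__AVX__)
    puts("AVX enabled");
#else
    puts("No AVX: legacy SSE code paths");
#endif
#if defined(__PCLMUL__)
    puts("PCLMUL enabled (the flag mentioned above for xxHash3)");
#else
    puts("PCLMUL not enabled");
#endif
}
```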

- Casey

@aras-p (Contributor) commented Jun 8, 2019

I'm building with -maes but not -mavx2, since the particular use case I had in mind was "I'm fine with requiring AES, but not fine with requiring AVX2 just yet" -- sorry for not mentioning that. However, and this is where it gets embarrassing, I did not notice that Xcode defaults to -Os. Switching to -O2 makes it a bit better, and I've also removed the various non-128-bit hashes from the comparison since that wasn't fair to begin with.

Meow is still not the "very fastest" at small data sizes, but as you said this is probably expected right now. Sorry for wasting everyone's time!

(Screenshot: updated benchmark chart.)

and zoomed in on small sizes:
(Screenshot: the same chart zoomed in on small data sizes.)

@cmuratori (Owner) commented Jun 8, 2019

EDIT: Actually, looking more closely, that still looks a little suspicious... but, either way, we'll have more answers once I do the final profiling passes and so on.

In general, I would caution also that it's probably not particularly useful to profile the general-call version of the functions on key sizes this small. If you actually care about the performance of sub-128 byte keys, you really want a macro that welds into your actual code. The function call overhead and checking alone on any hash will eat most of your time, and you can do much better by not doing it.

We could certainly provide a MEOW_HASH() macro for this purpose, and maybe we should... but I would definitely say that in actual use, you really don't want to be calling a function for hashes that small if you do a lot of them, pretty much period.
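To make that concrete, here is a sketch of the idea; nothing below is part of the published meow_hash API, and the hash body is only a placeholder:

```c
#include <string.h>

#if defined(_MSC_VER)
#define WELD_INLINE static __forceinline
#else
#define WELD_INLINE static inline __attribute__((always_inline))
#endif

/* Stand-in for the real hash core: for a small, compile-time-known key size,
   a force-inlined (or macro) form lets the compiler weld the hash into the
   call site, so there is no call/ret, no argument spilling, and no runtime
   branching on the length. */
WELD_INLINE unsigned long long hash_16_bytes_inline(const unsigned char key[16])
{
    unsigned long long lo, hi;
    memcpy(&lo, key, 8);
    memcpy(&hi, key + 8, 8);
    lo *= 0x9E3779B97F4A7C15ull;        /* placeholder mixing only */
    return lo ^ hi ^ (lo >> 29);
}

/* Hypothetical macro form along the lines of the MEOW_HASH() idea above. */
#define HASH_KEY16(key) hash_16_bytes_inline((key))
```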

- Casey

@NoHatCoder

Why would we need AVX2? The unaligned VPADD and VPXOR were introduced in AVX. I can't think of anything else we would need.

@cmuratori (Owner)

I don't think it actually needs -mavx2; it just needs -mavx. But I have not actually done any testing of -mavx codegen, so I couldn't say for sure.

Although really it's mostly academic, because we still have the unfortunate situation of CLANG generating much slower code for some reason, so, there is that :/

- Casey
