0.6 candidate patterns #81

NoHatCoder · 2021-07-03T11:50:20Z

I have searched for a new pattern for input digestion, the the goal for the input pattern is that there should be no simple way for a pattern of differences to cancel out. I have a tool that tries to estimate how simple the simplest pattern is, it delivers results in number of lanes where 4 bytes cancel out.

The following pattern scores 5 cancellations. Each round digests 32 bytes of input, denoted as IN0 and IN1:

XMM6+=IN1
XMM4^=IN0
XMM4=AES(XMM4,XMM1)
XMM2+=IN0
XMM1^=IN1
XMM1=AES(XMM1,XMM5)
//Next round, shift pattern over by 1
XMM7+=IN1
XMM5^=IN0
...

Another slightly more expensive pattern scores 11 cancellations:

XMM6+=IN0
XMM3^=IN1
XMM3=AES(XMM3,XMM6)
XMM6^=IN1
XMM6=AES(XMM6,XMM1)
XMM1^=IN0
XMM1=AES(XMM1,XMM7)

Both patterns are designed so that an ARM version can use the pre-aes xor to replace an xor input instruction, and then add in an xor after the aes.

The first pattern could be used for a level 1 function similar to 0.5, the second, while as best i can tell not quite reaching 128 bits of adversarial resistance, might still be practically good enough for hardening hash tables.

To get higher hash levels I'm also tinkering with looping over the input in large blocks (~1KiB).

The text was updated successfully, but these errors were encountered:

cmuratori · 2021-07-03T20:35:19Z

I'm not sure I understand the score part. If the goal is for there to be "no simple way for a pattern of differences to cancel out", why would scoring more cancellations be better? (I'm just asking about the wording here, not the code).

As for larger block sizes, the main issue there is usually what to do about residuals.

- Casey

NoHatCoder · 2021-07-04T08:07:47Z

It is internal cancellations needed to cancel the entire change, so basically what it says it that something unlikely has to happen 11 times for a full internal collision to occur.

There is going to be some waste with larger blocks, that is unavoidable, but there is definitely also an advantage for larger blocks that will more than offset this when the input gets large enough.

NoHatCoder · 2021-07-06T14:34:14Z

I found a cool trick while playing around with a new function, with 3-reg instructions you can switch what registers get used for what lanes during computation, without using any additional instructions, so for instance this naive implementation of 4 rounds with lane shifts can be rewritten unrolled swapping the registers around as it goes:

    //naive
    for(a=0;a<4;a++){
        xmm6=meow_paddq(xmm6,meow_load(rax+0x10+a*0x20));
        xmm4=meow_aesenc(xmm4,meow_load(rax+0x00+a*0x20),xmm1);
        xmm2=meow_paddq(xmm2,meow_load(rax+0x00+a*0x20));
        xmm1=meow_aesenc(xmm1,meow_load(rax+0x10+a*0x20),xmm5);
        ttt0=xmm0;
        xmm0=xmm1;
        xmm1=xmm2;
        xmm2=xmm3;
        xmm3=xmm4;
        xmm4=xmm5;
        xmm5=xmm6;
        xmm6=xmm7;
        xmm7=ttt0;
    }


    //unrolled
    ttt0=meow_paddq(xmm6,meow_load(rax+0x10));
    ttt1=meow_aesenc(xmm4,meow_load(rax+0x00),xmm1);
    xmm6=meow_paddq(xmm2,meow_load(rax+0x00));
    ttt2=meow_aesenc(xmm1,meow_load(rax+0x10),xmm5);
    
    ttt3=meow_paddq(xmm7,meow_load(rax+0x30));
    xmm1=meow_aesenc(xmm5,meow_load(rax+0x20),xmm6);
    xmm7=meow_paddq(xmm3,meow_load(rax+0x20));
    xmm6=meow_aesenc(xmm6,meow_load(rax+0x30),ttt0);

    xmm4=meow_paddq(xmm0,meow_load(rax+0x50));
    xmm2=meow_aesenc(ttt0,meow_load(rax+0x40),xmm7);
    xmm0=meow_paddq(ttt1,meow_load(rax+0x40));
    xmm7=meow_aesenc(xmm7,meow_load(rax+0x50),ttt3);

    xmm5=meow_paddq(ttt2,meow_load(rax+0x70));
    xmm3=meow_aesenc(ttt3,meow_load(rax+0x60),xmm0);
    xmm1=meow_paddq(xmm1,meow_load(rax+0x60));
    xmm0=meow_aesenc(xmm0,meow_load(rax+0x70),xmm4);

This way we can operate on 128 byte blocks instead of 256 byte. This could also be used for making functions with a non-power-of-2 state that still operate on power-of-2 blocks.

cmuratori · 2021-07-06T16:35:47Z

That does seem good - I have never profiled register renaming in the front end, btw, but it's not an ALU op, so there's also the fact that at least some number of registers can be renamed every cycle "for free" anyway... meaning that even without ternary ops, you can still do this, because

a = op b c

is basically the same cost as

b = op b c
a = b

because register renaming is a front-end op only.

- Casey

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

0.6 candidate patterns #81

0.6 candidate patterns #81

NoHatCoder commented Jul 3, 2021

cmuratori commented Jul 3, 2021

NoHatCoder commented Jul 4, 2021

NoHatCoder commented Jul 6, 2021

cmuratori commented Jul 6, 2021

0.6 candidate patterns #81

0.6 candidate patterns #81

Comments

NoHatCoder commented Jul 3, 2021

cmuratori commented Jul 3, 2021

NoHatCoder commented Jul 4, 2021

NoHatCoder commented Jul 6, 2021

cmuratori commented Jul 6, 2021