Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.6 candidate patterns #81

Open
NoHatCoder opened this issue Jul 3, 2021 · 4 comments
Open

0.6 candidate patterns #81

NoHatCoder opened this issue Jul 3, 2021 · 4 comments

Comments

@NoHatCoder
Copy link

I have searched for a new pattern for input digestion, the the goal for the input pattern is that there should be no simple way for a pattern of differences to cancel out. I have a tool that tries to estimate how simple the simplest pattern is, it delivers results in number of lanes where 4 bytes cancel out.

The following pattern scores 5 cancellations. Each round digests 32 bytes of input, denoted as IN0 and IN1:

XMM6+=IN1
XMM4^=IN0
XMM4=AES(XMM4,XMM1)
XMM2+=IN0
XMM1^=IN1
XMM1=AES(XMM1,XMM5)
//Next round, shift pattern over by 1
XMM7+=IN1
XMM5^=IN0
...

Another slightly more expensive pattern scores 11 cancellations:

XMM6+=IN0
XMM3^=IN1
XMM3=AES(XMM3,XMM6)
XMM6^=IN1
XMM6=AES(XMM6,XMM1)
XMM1^=IN0
XMM1=AES(XMM1,XMM7)

Both patterns are designed so that an ARM version can use the pre-aes xor to replace an xor input instruction, and then add in an xor after the aes.

The first pattern could be used for a level 1 function similar to 0.5, the second, while as best i can tell not quite reaching 128 bits of adversarial resistance, might still be practically good enough for hardening hash tables.

To get higher hash levels I'm also tinkering with looping over the input in large blocks (~1KiB).

@cmuratori
Copy link
Owner

I'm not sure I understand the score part. If the goal is for there to be "no simple way for a pattern of differences to cancel out", why would scoring more cancellations be better? (I'm just asking about the wording here, not the code).

As for larger block sizes, the main issue there is usually what to do about residuals.

- Casey

@NoHatCoder
Copy link
Author

It is internal cancellations needed to cancel the entire change, so basically what it says it that something unlikely has to happen 11 times for a full internal collision to occur.

There is going to be some waste with larger blocks, that is unavoidable, but there is definitely also an advantage for larger blocks that will more than offset this when the input gets large enough.

@NoHatCoder
Copy link
Author

I found a cool trick while playing around with a new function, with 3-reg instructions you can switch what registers get used for what lanes during computation, without using any additional instructions, so for instance this naive implementation of 4 rounds with lane shifts can be rewritten unrolled swapping the registers around as it goes:

    //naive
    for(a=0;a<4;a++){
        xmm6=meow_paddq(xmm6,meow_load(rax+0x10+a*0x20));
        xmm4=meow_aesenc(xmm4,meow_load(rax+0x00+a*0x20),xmm1);
        xmm2=meow_paddq(xmm2,meow_load(rax+0x00+a*0x20));
        xmm1=meow_aesenc(xmm1,meow_load(rax+0x10+a*0x20),xmm5);
        ttt0=xmm0;
        xmm0=xmm1;
        xmm1=xmm2;
        xmm2=xmm3;
        xmm3=xmm4;
        xmm4=xmm5;
        xmm5=xmm6;
        xmm6=xmm7;
        xmm7=ttt0;
    }


    //unrolled
    ttt0=meow_paddq(xmm6,meow_load(rax+0x10));
    ttt1=meow_aesenc(xmm4,meow_load(rax+0x00),xmm1);
    xmm6=meow_paddq(xmm2,meow_load(rax+0x00));
    ttt2=meow_aesenc(xmm1,meow_load(rax+0x10),xmm5);
    
    ttt3=meow_paddq(xmm7,meow_load(rax+0x30));
    xmm1=meow_aesenc(xmm5,meow_load(rax+0x20),xmm6);
    xmm7=meow_paddq(xmm3,meow_load(rax+0x20));
    xmm6=meow_aesenc(xmm6,meow_load(rax+0x30),ttt0);

    xmm4=meow_paddq(xmm0,meow_load(rax+0x50));
    xmm2=meow_aesenc(ttt0,meow_load(rax+0x40),xmm7);
    xmm0=meow_paddq(ttt1,meow_load(rax+0x40));
    xmm7=meow_aesenc(xmm7,meow_load(rax+0x50),ttt3);

    xmm5=meow_paddq(ttt2,meow_load(rax+0x70));
    xmm3=meow_aesenc(ttt3,meow_load(rax+0x60),xmm0);
    xmm1=meow_paddq(xmm1,meow_load(rax+0x60));
    xmm0=meow_aesenc(xmm0,meow_load(rax+0x70),xmm4);

This way we can operate on 128 byte blocks instead of 256 byte. This could also be used for making functions with a non-power-of-2 state that still operate on power-of-2 blocks.

@cmuratori
Copy link
Owner

That does seem good - I have never profiled register renaming in the front end, btw, but it's not an ALU op, so there's also the fact that at least some number of registers can be renamed every cycle "for free" anyway... meaning that even without ternary ops, you can still do this, because

a = op b c

is basically the same cost as

b = op b c
a = b

because register renaming is a front-end op only.

- Casey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants