Extend coverage of aarch64 lifter, including SIMD #1546

Open
wants to merge 175 commits into
base: master

DukMastaaa
Contributor

File structure

SIMD instructions are implemented in files under plugins/arm/semantics with the aarch64-simd- prefix,
in the aarch64 package. The prefix is used because bap only looks at the top level of the semantics folder,
so files placed in a simd subdirectory would not be recognised.
The FP instructions that are yet to be implemented will follow the same approach, with the prefix aarch64-fp.

nth-reg-in-group primitive

For instructions in the CASP and LDn families, LLVM gives BAP register groups like X0_X1 or Q0_Q1_Q2. Extracting the actual register from such a group used to require a large switch statement like the following:

(defun first-reg-in-group (r-pair)
  (case (symbol r-pair)
    'X0_X1 X0
    'X1_X2 X1
    ;; ...
    'X30_X31 X30))

A new Primus Lisp primitive, (nth-reg-in-group sym n), returns the nth register (counting from zero) in the register group named by the symbol sym.
For example, (nth-reg-in-group 'D0_D1_D2_D3 2) returns D2.
(⚠️ There is a slight problem with the implementation; please see the notes on CASP below.)
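The naming convention the primitive relies on can be sketched in a few lines of Python. This is only an illustration, not the BAP implementation: it assumes a group name is simply the member register names joined by underscores, and `nth_reg_in_group` here is a hypothetical helper that returns a name rather than a register value.

```python
def nth_reg_in_group(group: str, n: int) -> str:
    """Return the name of the n-th (zero-indexed) register in a group name.

    Assumption: group names look like 'X0_X1' or 'D0_D1_D2_D3', i.e. the
    member register names joined by underscores.
    """
    regs = group.split("_")
    if not 0 <= n < len(regs):
        raise IndexError(f"{group} has no register at index {n}")
    return regs[n]


first = nth_reg_in_group("X0_X1", 0)         # name of the first register
third = nth_reg_in_group("D0_D1_D2_D3", 2)   # name of the third register
```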

Non-SIMD Instructions

This PR implements many instructions; together they are sufficient to fully lift the cntlm binary (cross-compiled for aarch64), with the exception of two FMOV variants. This was tested using the --print-missing option to bap disassemble (#1410).

The instructions added are listed here; for some of them, the lifted BIL is included.

Arithmetic

ADDS*ri, ADDS*rs, ADD*rx, ADDXrx64
Instruction: adds x0, x1, x2
Opcode: 20 00 02 ab
{
  #3 := R2 + R1
  NF := 63:63[#3]
  VF := 63:63[R1] & 63:63[R2] & ~63:63[#3] | ~63:63[R1] & ~63:63[R2] &
    63:63[#3]
  ZF := #3 = 0
  CF := 63:63[R1] & 63:63[R2] | 63:63[R2] & ~63:63[#3] | 63:63[R1] &
    ~63:63[#3]
  R0 := #3
}
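As a sanity check on the flag expressions above, the following Python sketch compares the sign-bit formulas for VF and CF with the arithmetic definitions of signed overflow and unsigned carry. The function names are hypothetical, and the model is independent of BAP; it only mirrors the shape of the BIL.

```python
def msb(x: int, width: int) -> int:
    """Most significant bit of a width-bit value."""
    return (x >> (width - 1)) & 1


def to_signed(x: int, width: int) -> int:
    """Interpret a width-bit value as two's complement."""
    return x - (1 << width) if msb(x, width) else x


def adds_arith(a: int, b: int, width: int = 64):
    """Result and (N, Z, C, V) from the arithmetic definitions."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    total = a + b
    res = total & mask
    c = int(total > mask)  # unsigned carry out
    v = int(to_signed(a, width) + to_signed(b, width) != to_signed(res, width))
    return res, msb(res, width), int(res == 0), c, v


def adds_bitwise(a: int, b: int, width: int = 64):
    """Result and (N, Z, C, V) using the sign-bit formulas from the BIL."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    res = (a + b) & mask
    ma, mb, mr = msb(a, width), msb(b, width), msb(res, width)
    v = (ma & mb & (mr ^ 1)) | ((ma ^ 1) & (mb ^ 1) & mr)
    c = (ma & mb) | (mb & (mr ^ 1)) | (ma & (mr ^ 1))
    return res, mr, int(res == 0), c, v
```

Checking both functions exhaustively at a small width (for example 4 bits) is a cheap way to gain confidence that the two formulations agree.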
SUBXr*, SUBSXrx, SUBSXrx64

Similar to ADDS.

UMADDLrr, SMADDLrr, UMSUBLrr, SMSUBLrr
Instruction: umaddl x0, w1, w2, x3
Opcode: 20 0c a2 9b
{
  R0 := R3 + extend:64[31:0[R1] * 31:0[R2]]
}

The rest are similar.
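For cross-checking, here is a small Python reference model of the four widening multiply-accumulate instructions, written from the Arm ISA description in which the 32-bit operands are widened to 64 bits before the multiplication. The BIL above multiplies two 32-bit slices and extends afterwards, so it may be worth testing the lifted code against a model like this one. The function names are hypothetical.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def sext32(x: int) -> int:
    """Sign-extend the low 32 bits of x to a Python int."""
    x &= MASK32
    return x - (1 << 32) if x >> 31 else x


def umaddl(xa: int, wn: int, wm: int) -> int:
    """Xd = Xa + UInt(Wn) * UInt(Wm), truncated to 64 bits."""
    return (xa + (wn & MASK32) * (wm & MASK32)) & MASK64


def smaddl(xa: int, wn: int, wm: int) -> int:
    """Xd = Xa + SInt(Wn) * SInt(Wm), truncated to 64 bits."""
    return (xa + sext32(wn) * sext32(wm)) & MASK64


def umsubl(xa: int, wn: int, wm: int) -> int:
    """Xd = Xa - UInt(Wn) * UInt(Wm), truncated to 64 bits."""
    return (xa - (wn & MASK32) * (wm & MASK32)) & MASK64


def smsubl(xa: int, wn: int, wm: int) -> int:
    """Xd = Xa - SInt(Wn) * SInt(Wm), truncated to 64 bits."""
    return (xa - sext32(wn) * sext32(wm)) & MASK64
```

Operands such as 0xFFFFFFFF are useful test inputs because the unsigned product no longer fits in 32 bits and the signed and unsigned variants give different results.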

UMULHrr
Instruction: umulh x0, x1, x2
Opcode: 20 7c c2 9b
{
  R0 := high:64[pad:128[R1] * pad:128[R2]]
}
ADR
Instruction: adr x0, 0xABCD
Opcode: 60 5e 05 30
{
  mem := mem with [R0, el]:u64 <- 0xABCD
}
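Architecturally, ADR adds a signed 21-bit immediate to the PC and writes the sum to Xd, with no memory access, whereas the BIL above shows a store through R0; that may be worth double-checking in the lifter. The Python sketch below decodes the opcode bytes from the example and computes the expected register value. It is a standalone model with hypothetical function names, not BAP code.

```python
MASK64 = (1 << 64) - 1


def decode_adr(opcode: bytes):
    """Decode a little-endian A64 ADR instruction into (Rd, signed imm)."""
    insn = int.from_bytes(opcode, "little")
    # ADR has bit 31 clear and bits 28..24 equal to 0b10000.
    if (insn >> 31) & 1 != 0 or (insn >> 24) & 0x1F != 0b10000:
        raise ValueError("not an ADR encoding")
    immlo = (insn >> 29) & 0x3
    immhi = (insn >> 5) & 0x7FFFF
    imm = (immhi << 2) | immlo
    if imm & (1 << 20):  # sign-extend the 21-bit immediate
        imm -= 1 << 21
    return insn & 0x1F, imm


def adr(pc: int, opcode: bytes):
    """Return (Rd, value written to Xd) for an ADR executed at address pc."""
    rd, imm = decode_adr(opcode)
    return rd, (pc + imm) & MASK64
```

With the opcode from the example and the instruction placed at address 0, the model yields register 0 and the value 0xABCD.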

Atomic

CASP family ⚠️

This uses the load-acquire and store-release intrinsics as described in #1458.

Instruction: caspal x0, x1, x2, x3, [x4]
Opcode: 82 fc 60 48
{
  #0 := mem[R4, el]:u128
  #1 := low:64[#0]
  #2 := high:64[#0]
  call(intrinsic:load-acquire)
  #4 := #0 = (R1.R0)
  if (#4) {
    call(intrinsic:store-release)
    mem := mem with [R4, el]:u128 <- R3.R2
  }
  R0 := #1
  R1 := #2
}
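The data flow of the BIL above can be mirrored in a short Python model. This is a sketch under stated assumptions: memory is a toy dictionary of 64-bit words keyed by address, the acquire and release ordering (the two intrinsic calls) is ignored, and the lower-numbered register of each pair maps to the lower address, as in the little-endian comparison against R1.R0.

```python
def casp(mem: dict, addr: int, regs: dict, s: int, t: int) -> None:
    """Compare-and-swap pair.

    regs[s], regs[s + 1] hold the expected values and receive the old
    memory contents; regs[t], regs[t + 1] hold the replacement values.
    """
    old_lo, old_hi = mem[addr], mem[addr + 8]
    if (old_lo, old_hi) == (regs[s], regs[s + 1]):
        mem[addr], mem[addr + 8] = regs[t], regs[t + 1]
    # The comparison registers always receive what was read from memory.
    regs[s], regs[s + 1] = old_lo, old_hi


# Usage mirroring caspal x0, x1, x2, x3, [x4]
regs = {0: 1, 1: 2, 2: 3, 3: 4, 4: 0x1000}
mem = {0x1000: 1, 0x1008: 2}
casp(mem, regs[4], regs, 0, 2)
```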

⚠️
The nth-reg-in-group primitive is also used to extract the registers in the Xa_Xb pairs.
However, its current implementation prevents the following expression from reifying correctly:

(concat
  (nth-reg-in-group 'X0_X1 0)
  (nth-reg-in-group 'X0_X1 1))

The expected result is X0.X1, but printing the result with msg gives 0x30000000000000004.
As a temporary workaround, a helper function (register-pair-concat r-pair) has been defined; it contains a large switch statement with a case for each 'Xa_Xb pair, which is not ideal.
Advice on how to resolve this would be much appreciated.

Data movement

BIL code is not provided for most instructions in this category because there are many of them and they differ only in minor details.

Loads:

  • LDR*ro*, LDR*pre, LDR*post, LDR*ui
  • LDRBBro*, LDRBBpre, LDRBBpost
  • LDRHHro*, LDRHHpre, LDRHHpost, LDRHHui
  • rest of LDP*pre and LDP*post, LDP*i
  • LDRSWui, LDRSWro*
  • LDURBBi, LDURHHi
  • LDURSB*i, LDURSH*i, LDURSWi
  • LDUR*i

Stores:

  • STR*ro*, STR*pre, STR*post
  • STRHHui
  • STRBBro*, STRBBpre, STRBBpost
  • STP*pre, STP*post, STP*i
  • STURHHi, STURBBi

Other:

EXTR*rri
Instruction: extr x0, x1, x2, #5
Opcode: 20 14 c2 93
{
  R0 := 68:5[R1.R2]
}
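The extract-after-concat shape of the BIL above corresponds to the following Python sketch (`extr` is a hypothetical name used only for this illustration).

```python
def extr(xn: int, xm: int, lsb: int, width: int = 64) -> int:
    """Extract width bits starting at bit lsb from the concatenation xn:xm."""
    mask = (1 << width) - 1
    concat = ((xn & mask) << width) | (xm & mask)
    return (concat >> lsb) & mask
```

Passing the same register as both sources gives a rotate right by lsb, which is how the ROR (immediate) alias is defined in terms of EXTR.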

Logical

ANDS*ri, ANDS*rs
Instruction: ands x0, x1, x2, LSL #2
Opcode: 20 08 02 ea
{
  #3 := R2 << 2
  #4 := R1 & #3
  NF := 63:63[#4]
  ZF := #4 = 0
  CF := 0
  VF := 0
  R0 := #4
}
BIC*r, BICS*rs
Instruction: bic x0, x1, x2, asr #4
Opcode: 20 10 a2 8a
{
  #3 := R2 ~>> 4
  #4 := ~#3
  R0 := R1 & #4
}
REV*r, REV16*r, REV32Xr

Note that REV16*r and REV32Xr reverse the bytes within each 16-bit and 32-bit container respectively.

Instruction: rev x0, x1
Opcode: 20 0c c0 da
{
  R0 :=
    7:0[R1].15:8[R1].23:16[R1].31:24[R1].39:32[R1].47:40[R1].55:48[R1].63:56[R1]
}
Instruction: rev16 w0, w1
Opcode: 20 04 c0 5a
{
  R0 := high:32[R0].23:16[R1].31:24[R1].7:0[R1].15:8[R1]
}
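The container-wise byte reversal can be modelled generically in Python. The helper below is a hypothetical sketch: `width` is the operand size in bits and `container` is the container size in bits (64 or 32 for REV, 16 for REV16, 32 for REV32).

```python
def rev_containers(value: int, width: int, container: int) -> int:
    """Reverse the byte order inside each container-bit chunk of value."""
    out = 0
    chunk_mask = (1 << container) - 1
    for base in range(0, width, container):
        chunk = (value >> base) & chunk_mask
        # Writing little-endian and reading big-endian swaps the bytes.
        swapped = int.from_bytes(chunk.to_bytes(container // 8, "little"), "big")
        out |= swapped << base
    return out
```

Only the low `width` bits are modelled here; how the upper half of the destination X register is treated for the w-variants is a separate question.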
ASRV*r, LSRV*r, LSLV*r, RORV*r

Nothing special about these.

RBIT*r
Instruction: rbit w0, w1
Opcode: 20 00 c0 5a
{
  R0 :=
    0:0[R1].1:1[R1].2:2[R1].3:3[R1].4:4[R1].5:5[R1].6:6[R1].7:7[R1].8:8[R1].9:9[R1].10:10[R1].11:11[R1].12:12[R1].13:13[R1].14:14[R1].15:15[R1].16:16[R1].17:17[R1].18:18[R1].19:19[R1].20:20[R1].21:21[R1].22:22[R1].23:23[R1].24:24[R1].25:25[R1].26:26[R1].27:27[R1].28:28[R1].29:29[R1].30:30[R1].31:31[R1]
}
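The long concatenation above is a bit-by-bit reversal. A minimal Python equivalent, with a hypothetical function name, is:

```python
def rbit(value: int, width: int = 32) -> int:
    """Reverse the order of the low width bits of value."""
    out = 0
    for i in range(width):
        # Bit i of the input becomes bit (width - 1 - i) of the output.
        out = (out << 1) | ((value >> i) & 1)
    return out
```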

Special

BRK

This passes the immediate operand to a software-breakpoint intrinsic.

Instruction: brk 0xABCD
Opcode: a0 79 35 d4
{
  intrinsic:x0 := 0xABCD
  call(intrinsic:software-breakpoint)
}

SIMD Instructions

In SIMD macro names, we use . rather than * to stand for one of B, H, S, D, Q, which avoids name conflicts with existing non-SIMD macros.

Arithmetic

Here, * is also reused to stand for a number giving the element count or element size.

ADDv*i*, SUBv*i*, MULv*i*

Note: + binds more tightly than . in BIL's textual representation, so the output below is correct even though its spacing suggests otherwise.

Instruction: add v0.8h, v1.8h, v2.8h
Opcode: 20 84 62 4e
{
  V0 := 127:112[V1] + 127:112[V2].111:96[V1] + 111:96[V2].95:80[V1] +
    95:80[V2].79:64[V1] + 79:64[V2].63:48[V1] + 63:48[V2].47:32[V1] +
    47:32[V2].31:16[V1] + 31:16[V2].15:0[V1] + 15:0[V2]
}

The rest are similar and only differ in the binary operation.
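A lane-wise binary operation of this kind can be modelled in Python as below. The sketch assumes lane 0 sits in the least-significant bits of the vector value; `vec_binop` is a hypothetical helper, and each lane result is truncated to the element size.

```python
import operator


def vec_binop(a: int, b: int, esize: int, count: int, op) -> int:
    """Apply op to each pair of esize-bit lanes of a and b."""
    mask = (1 << esize) - 1
    out = 0
    for lane in range(count):
        shift = lane * esize
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        out |= (op(x, y) & mask) << shift  # wrap within the lane
    return out


# add v0.8h, v1.8h, v2.8h on two sample vectors
v0 = vec_binop(0x0001FFFF, 0x00010001, 16, 8, operator.add)
```

Swapping `operator.add` for `operator.sub` or `operator.mul` gives the corresponding SUB and MUL models.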

Loads

This PR implements all of the SIMD load instructions; see the PR diff for a full list.
Instructions with interesting BIL output are listed below.

LDNP.i

As an instruction with non-temporal properties, LDNP relaxes the order of its memory accesses. This is represented as a call to a 'non-temporal-hint' intrinsic where the address is passed as a parameter.

Instruction: ldnp s0, s1, [x3, 4]
Opcode: 60 84 40 2c
{
  intrinsic:x0 := R3 + 4
  call(intrinsic:non-temporal-hint)
  V0 := high:96[V0].mem[R3 + 4, el]:u32
  intrinsic:x0 := R3 + 8
  call(intrinsic:non-temporal-hint)
  V1 := high:96[V1].mem[R3 + 8, el]:u32
}
LD..v._POST (e.g. ld2 {v0.4s, v1.4s}, [x2], x3)

Like CASP, this instruction family receives register groups from LLVM.
The BIL code performs each memory access separately to accurately model the interleaving done by the processor.
This may not be ideal for generated code size; advice on making this level of detail toggleable would be appreciated.

Instruction: ld2 {v0.4s, v1.4s}, [x2], x3
Opcode: 40 88 c3 4c
{
  #1 := mem[R2, el]:u32
  #3 := mem[R2 + 4, el]:u32
  #5 := #1.mem[R2 + 8, el]:u32
  #7 := #3.mem[R2 + 0xC, el]:u32
  #9 := #5.mem[R2 + 0x10, el]:u32
  #11 := #7.mem[R2 + 0x14, el]:u32
  #13 := #9.mem[R2 + 0x18, el]:u32
  #15 := #11.mem[R2 + 0x1C, el]:u32
  V0 := #13
  V1 := #15
  R2 := R2 + R3
}

Similar expansions apply to the rest of the LDn family.
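For comparison, here is a Python sketch of the de-interleaving that LD2 performs according to the Arm ISA description: consecutive elements in memory alternate between the two destination registers. It assumes little-endian memory and that lane 0 occupies the least-significant bits of each vector, so the concatenation order in the BIL above can be checked against it. The `ld2` helper and its parameters are hypothetical, and the post-index writeback is left out.

```python
def ld2(memory: bytes, addr: int, esize_bytes: int, lanes: int):
    """Load 2 * lanes elements and de-interleave them into two vectors."""
    v = [0, 0]
    for i in range(2 * lanes):
        off = addr + i * esize_bytes
        elem = int.from_bytes(memory[off:off + esize_bytes], "little")
        # Element i goes to register i % 2, lane i // 2.
        v[i % 2] |= elem << ((i // 2) * esize_bytes * 8)
    return tuple(v)


# ld2 {v0.4s, v1.4s}, [x2] over memory holding the words 0..7
memory = b"".join(i.to_bytes(4, "little") for i in range(8))
v0, v1 = ld2(memory, 0, 4, 4)
```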

Logical

ANDv*i*, EORv*i*, NOTv*i*, ORRv*i*, ORNv*i*

These are done on the whole register Vn.

Instruction: not v0.16b, v1.16b
Opcode: 20 58 20 6e
{
  V0 := ~V1
}

Misc. movement

INSvi32gpr, INSvi32lane

The implementation uses bitmasks and bit shifts to insert the vector elements, but could equivalently use extract and concat.
Please advise if this is preferred.

Instruction: ins v0.s[1], v1.s[1]
Opcode: 20 24 0c 6e
{
  #1 := 63:32[V1]
  #5 := V0 & 0xFFFFFFFFFFFFFFFF00000000FFFFFFFF
  #6 := #5 | 0xFFFFFFFF00000000 & pad:128[#1] << 0x20
  V0 := #6
}
Instruction: ins v0.s[1], w1
Opcode: 20 1c 0c 4e
{
  #3 := V0 & 0xFFFFFFFFFFFFFFFF00000000FFFFFFFF
  #4 := #3 | 0xFFFFFFFF00000000 & pad:128[31:0[R1]] << 0x20
  V0 := #4
}
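The two formulations mentioned above, mask-and-shift versus extract-and-concat, can be compared with a small Python sketch. Both helpers are hypothetical and assume a 128-bit vector with lane 0 in the least-significant bits.

```python
def ins_lane_mask(vd: int, elem: int, index: int, esize: int = 32) -> int:
    """Insert elem into lane index of vd using a bitmask and a shift."""
    shift = index * esize
    lane_mask = ((1 << esize) - 1) << shift
    return (vd & ~lane_mask) | ((elem << shift) & lane_mask)


def ins_lane_concat(vd: int, elem: int, index: int, esize: int = 32) -> int:
    """Insert elem into lane index of vd as high . elem . low."""
    shift = index * esize
    high = vd >> (shift + esize)       # bits above the lane
    low = vd & ((1 << shift) - 1)      # bits below the lane
    elem &= (1 << esize) - 1
    return (high << (shift + esize)) | (elem << shift) | low
```

For index 1 and a 32-bit element, clearing the lane of an all-ones vector reproduces the mask 0xFFFFFFFFFFFFFFFF00000000FFFFFFFF seen in the BIL above.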
MOVIv*i*, MOVIv*b_ns
Instruction: movi v0.4h, 0xAB
Opcode: 60 85 05 0f
{
  V0 := 0xAB00AB00AB00AB
}
EXTv*i*

This follows the ISA description literally: the two source vectors are concatenated and the result is then extracted from the concatenation.

Instruction: ext v0.16b, v1.16b, v2.16b, 3
Opcode: 20 18 02 6e
{
  V0 := 151:24[V2.V1]
}

Store

Most of these have nearly identical implementations to the non-SIMD STP variants.

STR.ro*, STR.pre, STR.post, STR.ui

For STR.ro*:

Instruction: str q0, [x1, x2]
Opcode: 20 68 a2 3c
{
  mem := mem with [R1 + R2, el]:u128 <- V0
}

For STR.post (pre is similar):

Instruction: str q0, [x1], 123
Opcode: 20 b4 87 3c
{
  mem := mem with [R1, el]:u128 <- V0
  R1 := R1 + 0x7B
}

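For STR.ui (note that 0xAB is not a multiple of 16, so this example assembles to the unscaled STUR encoding, as the identical opcode under STUR.i shows):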

Instruction: str q0, [x1, 0xAB]
Opcode: 20 b0 8a 3c
{
  mem := mem with [R1 + 0xAB, el]:u128 <- V0
}
STP.pre, STP.post, STP.i
Instruction: stp q0, q1, [x2], #16
Opcode: 40 84 80 ac
{
  #3 := R2
  mem := mem with [#3, el]:u128 <- V0
  mem := mem with [#3 + 0x10, el]:u128 <- V1
  R2 := #3 + 0x10
}
STUR.i
Instruction: stur q0, [x1, 0xAB]
Opcode: 20 b0 8a 3c
{
  mem := mem with [R1 + 0xAB, el]:u128 <- V0
}

DukMastaaa and others added 30 commits February 4, 2022 01:53
separated into category files
LLVM can't seem to disassemble ARMv8.4 instructions like RMIF, SETF8
and SETF16. Also, CFINV gets turned into MSR (register) but LLVM
returns ill-formed asm...?
I've commented this in aarch64-pstate.lisp.
i typed is_zero with underscore instead of primitive is-zero
documentation added for macros and helper functions.
llvm mnemonics most likely incorrect, will investigate
why bap's llvm doesn't disassemble these insns
i've used ` bap mc --cpu=cortex-a55 --triple=aarch64`
to get the llvm mnemonic, but will need to talk to ivan
about lisp context and
specifying generic armv8.x instead of a specific cpu
ailrst and others added 29 commits July 20, 2022 13:44
Miscellaneous fixes and adding instructions

    fix: replace lognot with lnot
    LDURHH, LDURSB, LDURSH, LDURSW
    RBIT (and reverse-bits helper)
    UMSUBL,SMSUBL,UMADDL,SMADDL
Implemented all LD (multiple structres), LD (single structures), LD.R…
packages form a flat namespace (and seem to need
a flat file hierarchy as well)
we'll just use the aarch64-simd- prefix as a replacement
for folders
will need to find out why primitive doesn't work
for concat
function overloads are not nice sometimes
@andrewj-brown

Added a few more arithmetic instructions.

ADC*r, ADCS*r
Instruction: adc x0, x1, x2
Opcode: 20 00 02 9a
{
  R0 := extend:64[CF] + R2 + R1
}

Instruction: adcs x0, x1, x2
Opcode: 20 00 02 ba
{
  #0 := R1 + R2 + extend:64[CF]
  NF := 63:63[#0]
  VF := CF & 63:63[R2] & ~63:63[#0] | ~CF & ~63:63[R2] & 63:63[#0]
  ZF := #0 = 0
  CF := CF & 63:63[R2] | 63:63[R2] & ~63:63[#0] | CF & ~63:63[#0]
  R0 := #0
}
SBC*r, SBCS*r
Instruction: sbc x0, x1, x2
Opcode: 20 00 02 da
{
  R0 := extend:64[CF] + ~R2 + R1
}

Instruction: sbcs x0, x1, x2
Opcode: 20 00 02 fa
{
  #0 := R1 + ~R2 + extend:64[CF]
  NF := 63:63[#0]
  VF := CF & 63:63[~R2] & ~63:63[#0] | ~CF & ~63:63[~R2] & 63:63[#0]
  ZF := #0 = 0
  CF := CF & 63:63[~R2] | 63:63[~R2] & ~63:63[#0] | CF & ~63:63[#0]
  R0 := #0
}
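For cross-checking the flag expressions above, here is a Python sketch of the AddWithCarry operation as described in the Arm architecture pseudocode, with ADCS and SBCS expressed on top of it. In that definition the carry-in is a single bit added as 0 or 1, and the overflow and carry flags depend on both operands; the BIL above uses extend on CF and has CF where the sign bit of the first operand would be expected, so comparing against a model like this may be worthwhile. Function names are hypothetical.

```python
def add_with_carry(x: int, y: int, carry_in: int, width: int = 64):
    """Return (result, (N, Z, C, V)) for x + y + carry_in at the given width."""
    mask = (1 << width) - 1
    x &= mask
    y &= mask

    def signed(v: int) -> int:
        return v - (1 << width) if v >> (width - 1) else v

    unsigned_sum = x + y + carry_in
    signed_sum = signed(x) + signed(y) + carry_in
    result = unsigned_sum & mask
    n = result >> (width - 1)
    z = int(result == 0)
    c = int(result != unsigned_sum)          # unsigned overflow
    v = int(signed(result) != signed_sum)    # signed overflow
    return result, (n, z, c, v)


def adcs(xn: int, xm: int, cf: int, width: int = 64):
    """ADCS: Xn + Xm + C."""
    return add_with_carry(xn, xm, cf, width)


def sbcs(xn: int, xm: int, cf: int, width: int = 64):
    """SBCS: Xn + NOT(Xm) + C."""
    return add_with_carry(xn, ~xm, cf, width)
```

The non-flag-setting ADC and SBC forms correspond to taking only the first element of the returned pair.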
CLZ*r
Instruction: clz x0, x1
Opcode: 20 10 c0 da
{
  #1 := R1
  #1 := #1 | #1 >> 1
  #1 := #1 | #1 >> 2
  #1 := #1 | #1 >> 4
  #1 := #1 | #1 >> 8
  #1 := #1 | #1 >> 0x10
  #1 := #1 | #1 >> 0x20
  #1 := ~#1
  #2 := #1
  #2 := #2 - (#2 >> 1 & 0x5555555555555555)
  #2 := (#2 & 0x3333333333333333) + (#2 >> 2 & 0x3333333333333333)
  #2 := #2 + (#2 >> 4) & 0xF0F0F0F0F0F0F0F
  R0 := #2 * 0x101010101010101 >> 0x38
}
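The BIL above smears the highest set bit to the right, inverts the result, and counts the remaining ones with a SWAR population count. The same steps in Python, for the 64-bit case, look as follows (function names are hypothetical; parentheses make the operator grouping explicit).

```python
M64 = (1 << 64) - 1


def popcount64(x: int) -> int:
    """SWAR population count of a 64-bit value."""
    x = x - ((x >> 1) & 0x5555555555555555)
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    # The multiply sums the per-byte counts into the top byte.
    return ((x * 0x0101010101010101) & M64) >> 56


def clz64(x: int) -> int:
    """Count leading zeros of a 64-bit value."""
    x &= M64
    for shift in (1, 2, 4, 8, 16, 32):
        x |= x >> shift          # smear the highest set bit downwards
    return popcount64(~x & M64)  # the remaining ones are the leading zeros
```

A convenient oracle for testing is `64 - x.bit_length()`, which gives the same count for any 64-bit input.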
SUBSWrx
Instruction: subs w0, w1, w2, SXTB#0
Opcode: 20 80 22 6b
{
  #3 := 1 + extend:64[~(31:0[R2] << 0x20)] + extend:64[31:0[R1]]
  NF := 63:63[#3]
  VF := 31:31[R1] & 31:31[~(31:0[R2] << 0x20)] & ~63:63[#3] | ~31:31[R1] &
    ~31:31[~(31:0[R2] << 0x20)] & 63:63[#3]
  ZF := #3 = 0
  CF := 31:31[R1] & 31:31[~(31:0[R2] << 0x20)] | 31:31[~(31:0[R2] << 0x20)] &
    ~63:63[#3] | 31:31[R1] & ~63:63[#3]
  R0 := high:32[R0].#3
}

Many of these are extensions or alternative versions of pre-existing macros. The differences between the x- and w-versions are not shown.


6 participants