Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: add "simd" package to standard library #67520

Open
Clement-Jean opened this issue May 20, 2024 · 17 comments
Open

proposal: add "simd" package to standard library #67520

Clement-Jean opened this issue May 20, 2024 · 17 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Proposal
Milestone

Comments

@Clement-Jean
Copy link

Clement-Jean commented May 20, 2024

Background

After doing a little bit of research on previous issues mentioning adding SIMD to Go, I found the following proposals:

#58610: This proposal relies on the ispmd keyword which does not follow the compatibility guarantees for Go 1.

#53171: This proposal is closer to what I am proposing, however, there are multiple packages per CPU architecture, and I'm providing a more complete explanation on how to achieve adding SIMD to the Go standard library.

#64634: This proposal is more specific to the crypto package but in essence this is close to what I'm proposing here.

Goals

The main goal of this proposal is to provide an alternative approach to designing a simd package for the Go standard library. As of right now, there is no consensus on what the API should look like and this proposal intends to drive the discussion further.

This proposal mainly includes two things:

  • Adding a new kind of build tag that lets the user specify which SIMD ISA to use at compile time.
  • Using compiler intrinsics to generate inline SIMD instructions in the code.

Build Tag

For the first point, I was thinking about optional build tags like the following:

//go:simd sse2
//go:simd neon
//go:simd avx512

etc.

As mentioned, these build tags are optional. If not specified, we would resolve to the appropriate SIMD ISA available on the current OS and ARCH. However, I still think that these build tags are needed for deeper optimization and platform-specific operations. If we know that some instruction is more performant or only available on a certain architecture, we should be able to enforce using it manually.

Finally, having the optional build tag would let the developer choose, at compile time, which SIMD ISA to target and thus cross compile. We could write something similar to:

$ go build -simd neon ...

With this, we could take advantage of platform-specific features and know at compile time the size of the vector registers (e.g., 128, 256, or 512 bits). This would help us make better decisions for optimizations on the compiler side.

Compiler Intrinsics

The next crucial step would be to create a portable SIMD package that would rely on the compiler to generate SIMD instructions through compiler intrinsics. I demonstrated that this is feasible with a POC. As of right now, it looks like the following (you can see more examples here, including UTF8 string validation):

package main

import (
    "fmt"
    "simd"
)

func main() {
    a := simd.Uint8x16{...}
    b := simd.Uint8x16{...}
    c := simd.AddU8x16(a, b)

    fmt.Printf("%v\n", c)
}

And here, the AddU8x16 gets lowered down to a vector add instruction after SSA lowering.

Notes

  • We can provide functions like AddU8x16, AddU8x32, etc. without changing the generics implementation. Other implementations like Highway, Rust std::simd, and Zig @Vector rely on generics for the API. In Go, we do not have non-type parameters in generics; thus we cannot have something like Simd[uint8, 16].

  • Also, we do not have a compile time Sizeof which could help us have:

    type Simd[T SupportedSimdTypes] = [VectorRegisterSize / SizeofInBits(T)]T

Philosophy

It is important to understand that this proposal is not describing an abstraction of SIMD features. Such an abstraction could create noticeable performance difference between ISAs. That's why we are trying to avoid it. This means that if an operation is not available on an ISA, we simply don't provide it. Each intrinsic should only have 1 underlying instruction, not a sequence of instructions.

Challenges to Overcome

  • Under the hood, the current POC, works with pointers on arrays. The main reason is that fixed-size arrays are not currently SSAable. But because we work with these pointers on arrays which are stored in general purpose registers, the performance is not great (allocations required and load of non-contiguous memory) and it requires us to do the LD/ST dance for each function in the simd package.

I believe we would need some kind of type aliases like the following:

type Int8x16 [16]int8
type Uint8x16 [16]uint8
//...

These types should not be indexable and only instantiable through init functions like Splat8x16, Load8x16, etc. The compiler would then promote these special types to vector registers. This would remove all the LD/ST dance and memory allocation that I have in my POC and thus make everything a lot faster.

In the end the previous code snippet could look like this:

package main

import (
    "fmt"
    "simd"
)

func main() {
    a := simd.SplatU8x16(1)
    b := simd.LoadU8x16([16]uint8{...})
    c := simd.AddU8x16(a, b)

    fmt.Printf("%v\n", c)
}
  • A lot of instructions on arm (and I suppose on other ISAs) are missing. The current POC is using constants (WORD $0x...) to encode these.

I believe we should avoid having to use constants when defining intrinsics. We could implement the missing instructions along the way with the implementation of intrinsics.

  • Naming these intrinsics is not always easy. For example, NEON has instructions called VMIN and UMINV. The former returns a vector of min elements, and the latter reduce to the minimum in a given vector. As we don't have function overloads, we will need to find a way to name them appropriately.

I believe we should try to make the horizontal operations more verbose (e.g. ReduceMin8x16) and promote the vertical ones (e.g. Min8x16). For the example of VMIN and UMINV, the latter does not even seem to exist in SSE2 whereas the first one does.

There are other operations that have "conflicting" names. For example shift right and shift left have both logical and arithmetic shifts. For these cases, I believe we could just call them LogicalShiftRight, ArithmeticShiftRight, ... I agree that this is verbose but it makes it clear what is happening.

  • The current POC did not implement the concept of Masks. This is an important concept but also a tricky one to implement without proper compiler support. After discussion with Jan Wassenberg (author of Highway), I realized that some platforms do not treat masks in the same way. Here is a summary:

    • on NEON, SSE4, and AVX2: 1 bit per bit of the vector.
    • on SVE: 1 bit per byte of vector (variable vector size).
    • on AVX-512, and RVV: 1 bit per lane.
    • AVX-512 has a separate register file for masks.

We could have other types aliases, like the one made for vectors. These types would have different shapes and be loaded differently depending on the platform.

  • Some operations need parameters that are known at compile time and within a certain range. For example SSHR (shift right) on NEON takes an immediate n that need to be restricted between 1 and 8 (see vshr_n_s8). I ran into some problems where during the build of the compiler, the n would resolve to 0 (default value of int passed as param) and it would crash the program.

I believe we could have some way to check at compile time these values are within a certain range. Like a static_assert in C++ or checks on the AST.

Why is it Important?

Without SIMD, we are missing on a lot of potential optimizations. Here is a non-exhaustive list of concrete things that could improve performance in daily life scenarios:

Furthermore, it would make these currently existing packages more portable and maintainable:

There are obviously way more applications of SIMD but I am just trying to say that this is useful in practical scenarios.

@gopherbot gopherbot added this to the Proposal milestone May 20, 2024
@jimwei

This comment was marked as spam.

@jimwei

This comment was marked as spam.

@ianlancetaylor ianlancetaylor added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 20, 2024
@ianlancetaylor ianlancetaylor changed the title proposal: Add simd package to standard library proposal: add "simd" package to standard library May 20, 2024
@earthboundkid
Copy link
Contributor

//go:simd sse2

ISTM, this could be a regular build tag //go:build sse2. A simd platform is sort of like GOARCH except that one GOARCH might implement only some SIMD architectures. So maybe there should be a concept of a GOSUBARCH and those could be available as build tags and as runtime lookups at init that PGO could optimize.

This might also help with WASI, where the solution was to make to make wasip1 the GOOS, but as new features get added to WASM itself, we might run into the same issue of needing to differentiate classic WASM from WASM++ or whatever.

@Clement-Jean
Copy link
Author

In the POC this is implemented with a simple build tag. However, I was presenting the idea for backward compatibility reasons. If anyone already defined such build tags, we wouldn't want to interfere with their code.

Furthermore, these new build tags might lead to different build logic or state than the original ones. I haven't look in details so I cannot describe a different use case though...

@earthboundkid
Copy link
Contributor

I guess that's a risk with any new build tag, but yes, it seems more likely to affect real code here. Maybe gosub_amd64_sse2 would be less likely to collide?

@Clement-Jean
Copy link
Author

This seems a little bit terse to me and I'm still not comfortable with impacting others code. That's also why I didn't propose file names having a certain ending like _sse2_amd64.go or _sse2_amd64_test.go.

But overall, I'm open to the idea.

@jan-wassenberg
Copy link

we do not have non-type parameters in generics; thus we cannot have something like Simd[uint8, 16].

Couldn't you have a T32 type that provides a function N that returns 32?
It seems to me that a vector-length agnostic API is important for running on heterogeneous devices, and supporting SVE/RISC-V.

if an operation is not available on an ISA, we simply don't provide it

This seems limiting. For example, dot products are quite important, but platforms differ a bit in what they provide. Also, Scatter isn't universally supported, but important in some fields. Would you simply skip it?

Each intrinsic should only have 1 underlying instruction, not a sequence of instructions.

This cannot be guaranteed with LLVM, which can and does transform intrinsics to something else.

@Clement-Jean
Copy link
Author

Clement-Jean commented May 21, 2024

Couldn't you have a T32 type that provides a function N that returns 32?
It seems to me that a vector-length agnostic API is important for running on heterogeneous devices, and supporting SVE/RISC-V.

If I understand correctly, instead of passing Uint8x16 you are suggesting having a Vector16 type which has a Lanes function that would return 16. And then, depending on which ISA we can resolve we choose the vector type to be used accros all intrinsics. Did I get that right? If yes, I would have to experiment. I'm not entirely sure how this would work.

This seems limiting. For example, dot products are quite important, but platforms differ a bit in what they provide. Also, Scatter isn't universally supported, but important in some fields. Would you simply skip it?

I believe you are seeing that from the Highway point of view. This proposal tries to provide direct intrinsics. Functions wrapping multiple intrinsics could be then created on top of them. These functions would include dot products, scatters, ...

This cannot be guaranteed with LLVM, which can and does transform intrinsics to something else.

Fortunately, we are not dealing with LLVM. Go has its own assembly.

@jan-wassenberg
Copy link

Vector16 type which has a Lanes function that would return 16. And then, depending on which ISA we can resolve we choose the vector type to be used accros all intrinsics.

Yes, that's right. Though I'd say VectorN instead of Vector16. 128-bit vectors have their place for historical reasons, but users asking for Vector1024 is a recipe for poor codegen on most ISAs.

This proposal tries to provide direct intrinsics. Functions wrapping multiple intrinsics could be then created on top of them.

Note that some platforms have dot product instructions e.g. vdpbf16ps, which are too good to pass up on. Of course a dot product kernel would call them in a loop, but you still want these instructions reflected in the abstraction layer.

@earthboundkid
Copy link
Contributor

A sticking point for past proposals has been whether the package should be high level or low level. I think it probably makes sense to make a syscall vs. os style split and have a high level package that calls into a low level package. In terms of implementation, I think starting with the low level package is clearly better. So maybe start with golang.org/x/exp/simd being experimentally special cased by the compiler, if it's successful, move it to runtime/simd and then if everyone is happy with runtime/simd, a separate proposal could add math/vector as a high level library.

@Clement-Jean
Copy link
Author

Clement-Jean commented May 21, 2024

Thank you for clarification @jan-wassenberg

@earthboundkid this is the idea behind this proposal. Let's do the low level intrinsics that are not length agnostic and building on that we will be able to do the fancy length agnostic ones.

@jan-wassenberg
Copy link

It's not clear to me that VL agnostic can be built on top of known-length types, unless for each vector you set aside up to 256 bytes (SVE) or even 64 KiB (RISC-V V)?

@Clement-Jean
Copy link
Author

I'm not even sure the compiler can generate code for SVE and RISC-V V to be honest. So I don't think this is the priority.

But yes for variable length vectors we will need either to set aside a larger amount of bytes, or come up with higher-level vector types.

@klauspost
Copy link
Contributor

What you are proposing is already covered with build tags. With that said build tags are fundamentally a wrong approach for intrinsics.

You seem to imply that intrinsics should be decided at compile time. I disagree. There are maybe 30+ CPU features in x86-64. Add combinations and you have an infinite build matrix already.

If intrinsics should make sense, they can already be platform guarded and cpuid detection should make the compiled binary dynamically either use intrinsics or not. We already have GOAMD64 env var that would allow to bypass certain cpuid checks.

Your proposal for a generic "simd" package is too simplistic. While abstracting certain operations would be possible and fine, I stand by my opinion that this can be done either internally or externally if Go provides the intrinsics to begin with. TBH I don't quite understand how this package would work with the build tags. Would it fail with simd.SplatU8x16(1) is not supported by current simd platform?

type Int8x16 [16]int8
type Uint8x16 [16]uint8

Distinct types is an abstraction over "register with x bits". Take "xor". It doesn't operate on any specific type, but only on a fixed register size. Do you want an xor for all types you define?

I see all of that as a secondary layer to the main complexity of adding vector registers to the compiler. That seems like the primary problem to solve. To me the approach would be to:

A) Find a reasonable way to represent vector registers in Go code, which can be used across platforms.

It could be a type vec128 [8]byte - or it could be an "abstract" type , like int that doesn't attempt to expose the underlying data. I am not the compiler expert to decide on that.

It seems reasonable to provide LoadUint32x4(*[4]uint32) vec128 and StoreUint32x4(*[4]uint32, vec128) as well as splatters, etc. With the slice to array pointer conversion this would be reasonably easy to use. I feel like these should be platform guarded, since these themselves often will have cpuid limitations - for example avx512.

B) Provide intrinsics per platform.

Intrinsics are "close" to the emitted instructions. There is no scalar fallback, and cpuid checks will be up to the user. I still think dividing intrinsics into (primary) cpuid makes sense. Importing a package doesn't mean you use it, merely that you may use it if the cpu supports it.

C) Find a reasonable way for imported packages to "inject" intrinsics into the compiler.

That means importing simd/amd64/gfni (or whatever it gets called) would inject the intrinsics defined in the package into the compiler. That seems to me like a reasonable way to "separate concerns" and not have the compiler explode speedwise with 1000s of new intrinsics to look out for.

D) Constants

A lot of instructions take immediates which must be constant at compile time. For example VGF2P8AFFINEQB(imm8 uint8, src2, src1, vec128) vec128 the imm8 must be resolvable at compile time. Go doesn't have a way to specify that in the type definition.

While it could just result in a compile-time error it would be nice if the function definition by itself could give that information. Not a show-stopper, though - and could be decided later.

There are a lot of "minor" decisions that would need to be made. Should instruction with shared source/destination registers abstract this? How should functions be named? Close to the instruction or slightly abstracted (like C, rust) or fully renamed? While these things would need a resolution at some point, they don't really seem like the primary things to focus on too early.

If the above works out, third parties can implement what they like. The Go stdlib can then look at whether a generic vector package would make sense if that is deemed important.

TLDR; I think the approach of compile-time cpu feature detection is a dead end. I think intrinsics should be low level to allow for the biggest amount of features and leave higher level abstractions to competing third parties. Go will provide the compiler that can make this great.

PS. I am genuinely very sorry I never found the time to reply to your emails.

@Clement-Jean
Copy link
Author

Clement-Jean commented May 22, 2024

So, I believe there are misunderstandings here:

If intrinsics should make sense, they can already be platform guarded and cpuid detection should make the compiled binary dynamically either use intrinsics or not. We already have GOAMD64 env var that would allow to bypass certain cpuid checks.

The build tag is optional and when not provided, the compiler, through cpuid, would detect the all the intrinsics available for a given platform. For example, if you had a machine with SSE and AVX available, the compiler will detect all these features. The build tags let the user restrict to a set of features if needed.

Would it fail with simd.SplatU8x16(1) is not supported by current simd platform?

For that, we could drop to SWAR-like operations or scalar code if a feature is not detected. I believe the currently available intrinsics already have this.

Take "xor". It doesn't operate on any specific type, but only on a fixed register size. Do you want an xor for all types you define?

No, for xor under the hood, there is only one intrinsic covering multiple types.

A) Find a reasonable way to represent vector registers in Go code, which can be used across platforms.

D) Constants

A lot of instructions take immediates which must be constant at compile time. For example VGF2P8AFFINEQB(imm8 uint8, src2, src1, vec128) vec128 the imm8 must be resolvable at compile time. Go doesn't have a way to specify that in the type definition.

While it could just result in a compile-time error it would be nice if the function definition by itself could give that information. Not a show-stopper, though - and could be decided later.

These two points are covered in my proposal. I talk about having to handle constants and making sure that the compiler promote the types to vector registers.

Rest

I still think dividing intrinsics into (primary) cpuid makes sense. Importing a package doesn't mean you use it, merely that you may use it if the cpu supports it.

For importing through different import paths, it just seem to be another way of doing what I proposed with the optional build tag but in reverse. If you don't provide a tag, you get all possible intrinsics. Otherwise, you get the intrinsics only for a set of ISAs.

not have the compiler explode speedwise with 1000s of new intrinsics to look out for.

Once again, a lot of intrinsics are shared across multiple types. This means we will have lot of "functions" calling to the same underlying intrinsic.

There are a lot of "minor" decisions that would need to be made. Should instruction with shared source/destination registers abstract this? How should functions be named? Close to the instruction or slightly abstracted (like C, rust) or fully renamed? While these things would need a resolution at some point, they don't really seem like the primary things to focus on too early.

Totally agree with this. The minor points that I made are purely based on the experience with the POC. They are not the main focus of the proposal.

@Jorropo
Copy link
Member

Jorropo commented May 22, 2024

Under the hood, the current POC, works with pointers on arrays. The main reason is that fixed-size arrays are not currently SSAable. But because we work with these pointers on arrays which are stored in general purpose registers, the performance is not great (allocations required and load of non-contiguous memory) and it requires us to do the LD/ST dance for each function in the simd package.

I think the better solution is to steal the types from #64634:

type ISimd128 ISimd128

Each SIMD vector become it's own native type which are unbreakable.
As unbreakable values it is trivial to lift them up to SSA.

This also discourage programmers to treat them as arrays (which is often slow).
Your type suggest this is ok:

// random example
for i, v := range someSIMDVector {
 if v % 2 == 0 && i % 2 == 1 { // do some logic
  continue
 }
 someSIMDVector[i] = 42 // create inter-loop dependency if lifting to registers
}

however performance for code like this is gonna be way worst than it look it should be on amd64 (and maybe other architectures) due to the high cost to move data between the SIMD and GP complexes.
I guess in your case this is fine since they are pointers so it's always taking the slow path, but we should want the compiler to lift things into registers due to the vast speed increases this gives.

@Clement-Jean
Copy link
Author

Totally agree with you on that. This is why I said:

These types should not be indexable and only instantiable through init functions like Splat8x16, Load8x16, etc. The compiler would then promote these special types to vector registers.

And having a native type is definitely the way to go. I just didn't know how to properly formulate that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Proposal
Projects
Status: No status
Status: Incoming
Development

No branches or pull requests

8 participants