
Implement a few missing things using SSE or NEON intrinsics. #1338

Open
ibogosavljevic opened this issue Apr 25, 2023 · 14 comments

@ibogosavljevic

ibogosavljevic commented Apr 25, 2023

I have the following code:

int32_t MulInt(int32_t a, int32_t b) {
    return static_cast<int32_t>((static_cast<int64_t>(a) * static_cast<int64_t>(b)) >> 16);
}
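
For example, interpreting the operands as Q16.16 fixed-point values, this is exactly a fixed-point multiply, and the intermediate product needs all 64 bits:

// 3.0 * 2.0 in Q16.16: (3 << 16) * (2 << 16) = 6 * 2^32 (overflows int32_t),
// and after >> 16 the result is 6 << 16, i.e. 6.0.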

I tried to implement it through Highway alone, but it doesn't work: Highway currently doesn't support efficiently multiplying two int32_t values to produce an int64_t. Many architectures can multiply two int32_t to produce an int64_t without explicit casting, and I want to take advantage of that.

I already have implementations for SSE and NEON, but I want to make them available through Highway. I want a Highway function
V MulInt(V a, V b) for Vec128<int32_t> on SSE and NEON. Is there any official guide on how to supplement Highway with custom functions?
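
For illustration, a native NEON kernel of the kind being described might look roughly like this (a sketch, not the actual implementations mentioned above):

#include <arm_neon.h>

// Sketch: widening multiply, narrowing shift right, recombine.
int32x4_t MulIntNeon(int32x4_t a, int32x4_t b) {
  const int64x2_t lo = vmull_s32(vget_low_s32(a), vget_low_s32(b));   // i32 x i32 -> i64
  const int64x2_t hi = vmull_s32(vget_high_s32(a), vget_high_s32(b));
  return vcombine_s32(vshrn_n_s64(lo, 16), vshrn_n_s64(hi, 16));     // >> 16, narrow to i32
}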

@johnplatts
Contributor

Here is how the vector version of the MulInt operation above can be implemented in Highway for an int32_t vector:

template<class V, HWY_IF_UI32_D(DFromV<V>),
         HWY_IF_V_SIZE_LE_V(V, HWY_MAX_BYTES / 2)>
HWY_INLINE V MulInt(V a, V b) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  const Rebind<uint64_t, decltype(d)> du64;

#if HWY_TARGET == HWY_SCALAR
  const auto a2 = a;
  const auto b2 = b;
#else
  const Twice<decltype(d)> dt;
  // Zero-extend each 32-bit lane to 64 bits so that each element lands in
  // the 32-bit position that MulEven reads (the shift below handles
  // big-endian lane order).
  const auto a_u64 = PromoteTo(du64, BitCast(du, a));
  const auto b_u64 = PromoteTo(du64, BitCast(du, b));
#if HWY_IS_BIG_ENDIAN
  const auto a2 = BitCast(dt, ShiftLeft<32>(a_u64));
  const auto b2 = BitCast(dt, ShiftLeft<32>(b_u64));
#else
  const auto a2 = BitCast(dt, a_u64);
  const auto b2 = BitCast(dt, b_u64);
#endif  // HWY_IS_BIG_ENDIAN
#endif  // HWY_TARGET != HWY_SCALAR

  // MulEven produces the full 64-bit products; the logical >> 16 on u64 is
  // fine because TruncateTo keeps only the low 32 bits of each lane.
  return BitCast(d, TruncateTo(du,
    ShiftRight<16>(BitCast(du64, MulEven(a2, b2)))));
}

#if HWY_TARGET != HWY_SCALAR
template<class V, HWY_IF_UI32_D(DFromV<V>),
         HWY_IF_V_SIZE_V(V, HWY_MAX_BYTES)>
HWY_INLINE V MulInt(V a, V b) {
  // Full-width vectors: PromoteTo only accepts a half-width input, so widen
  // each half separately and re-pack the results in order at the end.
  const DFromV<decltype(a)> d;
  const Half<decltype(d)> dh;
  const RebindToUnsigned<decltype(d)> du;
  const RebindToUnsigned<decltype(dh)> dh_u;
  const Repartition<uint64_t, decltype(d)> du64;

  const auto a1_u64 = PromoteTo(du64, BitCast(dh_u, LowerHalf(dh, a)));
  const auto b1_u64 = PromoteTo(du64, BitCast(dh_u, LowerHalf(dh, b)));
  const auto a2_u64 = PromoteTo(du64, BitCast(dh_u, UpperHalf(dh, a)));
  const auto b2_u64 = PromoteTo(du64, BitCast(dh_u, UpperHalf(dh, b)));

#if HWY_IS_BIG_ENDIAN
  const auto a1 = BitCast(d, ShiftLeft<32>(a1_u64));
  const auto b1 = BitCast(d, ShiftLeft<32>(b1_u64));
  const auto a2 = BitCast(d, ShiftLeft<32>(a2_u64));
  const auto b2 = BitCast(d, ShiftLeft<32>(b2_u64));
#else
  const auto a1 = BitCast(d, a1_u64);
  const auto b1 = BitCast(d, b1_u64);
  const auto a2 = BitCast(d, a2_u64);
  const auto b2 = BitCast(d, b2_u64);
#endif  // HWY_IS_BIG_ENDIAN

  const auto p1 = ShiftRight<16>(BitCast(du64, MulEven(a1, b1)));
  const auto p2 = ShiftRight<16>(BitCast(du64, MulEven(a2, b2)));
  return BitCast(d, OrderedTruncate2To(du, p1, p2));
}
#endif  // HWY_TARGET != HWY_SCALAR
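
A minimal usage sketch (assuming the usual foreach_target/HWY_NAMESPACE boilerplate is in place and hn aliases the target namespace):

const hn::ScalableTag<int32_t> d;
for (size_t i = 0; i < n; i += hn::Lanes(d)) {
  const auto r = hn::MulInt(hn::Load(d, a + i), hn::Load(d, b + i));
  hn::Store(r, d, out + i);  // lanewise (int64_t{a[i]} * b[i]) >> 16
}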

@ibogosavljevic
Author

Wow, thanks a lot! I am simply not fluent enough in Highway to write such a thing. The documentation is good, but I was still missing a few key pieces.

@ibogosavljevic
Author

I see the following compilation issue:

In file included from main.cpp:9:
./main-highway.cpp:55:21: error: use of undeclared identifier 'OrderedTruncate2To'
  return BitCast(d, OrderedTruncate2To(du, p1, p2));
                    ^
./main-highway.cpp:71:24: note: in instantiation of function template specialization 'hwy::N_SSSE3::MulInt<hwy::N_SSSE3::Vec128<int, 4>, nullptr, nullptr>' requested here
        auto res = hn::MulInt(a_val, b_val);
                       ^

Do you know what might be the problem?

@jan-wassenberg
Member

Hi @ibogosavljevic , it's always interesting to hear how the documentation works for new users and what can be improved. I'm curious what the missing pieces were?

@johnplatts nice implementation!

For the compile issue, it's likely that you are using an older release of Highway; we added that intrinsic two weeks after the last release. You could use Compiler Explorer to test this by copying your code into that environment. We'll do another release soon, perhaps in a week or two. Until then, you could consider using the latest Git version.

@johnplatts
Contributor

> Hi @ibogosavljevic , it's always interesting to hear how the documentation works for new users and what can be improved. I'm curious what the missing pieces were?
>
> @johnplatts nice implementation!
>
> For the compile issue, it's likely that you are using an older release of Highway; we added that intrinsic two weeks after the last release. You could use Compiler Explorer to test this by copying your code into that environment. We'll do another release soon, perhaps in a week or two. Until then, you could consider using the latest Git version.

Here is a link to a Compiler Explorer snippet that includes the above MulInt implementation; the snippet compiles successfully for the HWY_SSE4/HWY_NEON_WITHOUT_AES/HWY_RVV/HWY_SVE/HWY_PPC8/HWY_SCALAR targets:
https://godbolt.org/z/sK1zEv4xb

@ibogosavljevic
Author

ibogosavljevic commented Apr 26, 2023 via email

@jan-wassenberg
Member

Glad to hear it works.

> I can share the insight about Highway with you, in order to improve it.
> This means filing issues when documentation is not clear, or reporting bugs
> if native implementations are significantly faster than Highway. Let me
> know if this is what you want.

This sounds wonderful. Please don't hesitate to raise issues.
If it would be helpful, I am also happy to advise on a draft of your workshop.
(We have internal slides that I'd like to open-source, but that will take some time, possibly too long for your timeframe.)

It would be surprising if a native implementation were much faster, but if that happens we are open to adding new ops to bridge any gaps.

@ibogosavljevic
Author

I want to add more information about the performance of MulInt, as suggested by @johnplatts.

I compared the performance of the Highway version against the compiler's autovectorized code:

void convert_for_vector(int32_t* out, int32_t* a, int32_t* b, int n) {
    const hn::ScalableTag<int32_t> d;
    #pragma clang loop unroll(disable)
    for (int i = 0; i < n; i += hn::Lanes(d)) {
        auto a_val = hn::Load(d, a + i);
        auto b_val = hn::Load(d, b + i);

        auto res = hn::MulInt<16>(a_val, b_val);
        hn::Store(res, d, out + i);
    }
}

vs.

void convert_for(int32_t* out, int32_t* a, int32_t* b, int n) {
    #pragma clang loop unroll(disable)
    for (int i = 0; i < n; ++i) {
        out[i] = static_cast<int32_t>((static_cast<int64_t>(a[i]) * static_cast<int64_t>(b[i])) >> 16);
    }
}

The compiler's version is faster than Highway's: 0.56 s vs. 0.85 s.

Everything was compiled with -mavx2 -O3 on the newest Clang.

Compiler's version assembly:

[screenshot of the compiler-generated assembly]

@jan-wassenberg
Member

Thanks for sharing! Sometimes the compiler has interesting tricks, so let's have a look.

First, I notice that the codegen is very similar: https://gcc.godbolt.org/z/8cvrcqbbW
But the difference may be because we're missing -maes, without which Highway doesn't use the AVX2 target.
(There is a documented set of flags for x86 targets here: https://gcc.godbolt.org/z/rGnjMevKG)

Was -maes indeed the difference? That should bring us up to parity. And we can probably do better yet:
vpmuldq (= MulEven) already performs the int32->int64 widening, so both the Highway code and the compiler's autovectorization could skip the vpmovsxdq. But it's unclear how to achieve that with the compiler, because we actually do require the cast at the language level; otherwise we'd get a 32-bit multiply.

Here's a quick sketch of Highway code without the cast:

const Repartition<uint64_t, decltype(d)> du64;
const auto evens = ShiftRight<16>(MulEven(a, b));
const auto a_64 = BitCast(du64, a);
const auto b_64 = BitCast(du64, b);
// Move the odd 32-bit lanes into even positions (and back to 32-bit lanes)
// so that MulEven sees them:
const auto odds = ShiftRight<16>(MulEven(BitCast(d, ShiftRight<32>(a_64)),
                                         BitCast(d, ShiftRight<32>(b_64))));

If you don't care about order, that's likely going to be quicker (just store evens, then odds). But if you do: we could InterleaveLower(evens, odds), but that would require further fixup because interleaving is per 128-bit block. Maybe we should add an op for that: either an InterleaveLower variant that does not stop at 128-bit boundaries, or a fixup op that transforms the results of InterleaveLower afterwards. @johnplatts, what do you think?
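
For illustration, one possible in-order fix-up using only existing ops, continuing the sketch above (little-endian lane order assumed; OddEven is a plain per-lane blend, so no 128-bit block fixup is needed):

// evens' results already occupy the even int32 lanes (the low 32 bits of
// each 64-bit lane); move odds' results into the odd lanes and blend.
const auto odds_hi = BitCast(d, ShiftLeft<32>(BitCast(du64, odds)));
const auto in_order = OddEven(odds_hi, BitCast(d, evens));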

@johnplatts
Contributor

johnplatts commented May 14, 2023

A more efficient implementation of the set-before-first operation for masks is possible on SSE4/AVX2/AVX3/RVV/PPC10.

Here is how the SetBeforeFirst operation could be implemented for masks for 128-bit or smaller vectors on SSE4/AVX2:

template<class T, size_t N>
HWY_API Mask128<T, N> SetBeforeFirst(Mask128<T, N> m) {
  const Simd<T, N, 0> d;
  const RebindToUnsigned<decltype(d)> du;
  using TU = TFromD<decltype(du)>;
  return RebindMask(d, Mask128<TU, N>{
    _mm_cmpistrm(RebindMask(du, Not(m)).raw, _mm_setzero_si128(), 0x58)});
}

Here is how the SetBeforeFirst operation could be implemented for masks on AVX3:

template<class T, size_t N>
HWY_API Mask128<T, N> SetBeforeFirst(Mask128<T, N> m) {
  const Simd<T, N, 0> d;
  using RawMask = decltype(MaskFromVec(VFromD<decltype(d)>()).raw);
  return Mask128<T, N>{static_cast<RawMask>(~(m.raw | (-m.raw)))};
}
template<class T>
HWY_API Mask256<T> SetBeforeFirst(Mask256<T> m) {
  const Full256<T> d;
  using RawMask = decltype(MaskFromVec(VFromD<decltype(d)>()).raw);
  return Mask256<T>{static_cast<RawMask>(~(m.raw | (-m.raw)))};
}
template<class T>
HWY_API Mask512<T> SetBeforeFirst(Mask512<T> m) {
  const Full512<T> d;
  using RawMask = decltype(MaskFromVec(VFromD<decltype(d)>()).raw);
  return Mask512<T>{static_cast<RawMask>(~(m.raw | (-m.raw)))};
}
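
(The identity used above: for an integer bitmask m, m | -m sets every bit from the lowest set bit upward, so ~(m | -m) is exactly the set of bits below the first set bit, and all-ones when m == 0. For example:)

// m         = 0b00101000
// -m        = 0b11011000   (two's complement)
// m | -m    = 0b11111000   (lowest set bit and everything above it)
// ~(m | -m) = 0b00000111   (all positions before the first set bit)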

Here is how the SetBeforeFirst operation could be implemented for masks on PPC10:

namespace detail {

template<class D, HWY_IF_T_SIZE_D(D, 1)>
HWY_INLINE MFromD<D> VsxGenMaskFromMaskBits(D d, uint64_t bits) {
  const RebindToUnsigned<decltype(d)> du;
  using VU = VFromD<decltype(du)>;
  return MaskFromVec(BitCast(d, VU{vec_genbm(bits)}));
}

template<class D, HWY_IF_T_SIZE_D(D, 2)>
HWY_INLINE MFromD<D> VsxGenMaskFromMaskBits(D d, uint64_t bits) {
  const RebindToUnsigned<decltype(d)> du;
  using VU = VFromD<decltype(du)>;
  return MaskFromVec(BitCast(d, VU{vec_genhm(bits)}));
}

template<class D, HWY_IF_T_SIZE_D(D, 4)>
HWY_INLINE MFromD<D> VsxGenMaskFromMaskBits(D d, uint64_t bits) {
  const RebindToUnsigned<decltype(d)> du;
  using VU = VFromD<decltype(du)>;
  return MaskFromVec(BitCast(d, VU{vec_genwm(bits)}));
}

template<class D, HWY_IF_T_SIZE_D(D, 8)>
HWY_INLINE MFromD<D> VsxGenMaskFromMaskBits(D d, uint64_t bits) {
  const RebindToUnsigned<decltype(d)> du;
  using VU = VFromD<decltype(du)>;
  return MaskFromVec(BitCast(d, VU{vec_gendm(bits)}));
}

}  // namespace detail

template<class T, size_t N>
HWY_API Mask128<T, N> SetBeforeFirst(Mask128<T, N> m) {
  const Simd<T, N, 0> d;
  const RebindToUnsigned<decltype(d)> du;
  using VU = VFromD<decltype(du)>;

  const auto v_mask = BitCast(du, VecFromMask(d, m));
  const auto mask_bits = static_cast<uint64_t>(vec_extractm(v_mask.raw));
#if HWY_IS_LITTLE_ENDIAN
  const auto first_set_mask_bit = mask_bits & (-mask_bits);
  const auto result_mask_bits = first_set_mask_bit - 1;
#else
  const auto bit_after_last_set_mask_bit =
    (uint64_t{1} << (63 - Num0BitsAboveMS1Bit_Nonzero64((mask_bits << 1) | 1)));
  const auto result_mask_bits = static_cast<uint64_t>(-bit_after_last_set_mask_bit);
#endif

  return detail::VsxGenMaskFromMaskBits(d, result_mask_bits);
}

Here is how the SetBeforeFirst operation could be implemented for masks on RVV (simply a wrapper around the __riscv_vmsbf_m_b1, __riscv_vmsbf_m_b2, __riscv_vmsbf_m_b4, __riscv_vmsbf_m_b8, __riscv_vmsbf_m_b16, __riscv_vmsbf_m_b32, and __riscv_vmsbf_m_b64 intrinsics):

HWY_RVV_FOREACH_B(HWY_RVV_RETM_ARGM, SetBeforeFirst, sbf)
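
Roughly speaking, each expansion of that macro amounts to a thin wrapper over one vmsbf intrinsic; a simplified sketch of a single overload (not the exact generated code):

#include <riscv_vector.h>

vbool8_t SetBeforeFirstB8(vbool8_t m) {
  const size_t vl = __riscv_vsetvlmax_e8m1();
  return __riscv_vmsbf_m_b8(m, vl);  // set lanes strictly before the first true lane
}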

@jan-wassenberg
Member

@johnplatts sounds like you are proposing a new SetBeforeFirst op, unrelated to this particular MulInt code?
That sounds potentially useful; a pull request with your code would be welcome.
But perhaps we should replace cmpistrm, because its latency is 10 on SKX? I suppose a lookup table indexed by BitsFromMask would also be an option.

@johnplatts
Contributor

> @johnplatts sounds like you are proposing a new SetBeforeFirst op, unrelated to this particular MulInt code? That sounds potentially useful; a pull request with your code would be welcome. But perhaps we should replace cmpistrm, because its latency is 10 on SKX? I suppose a lookup table indexed by BitsFromMask would also be an option.

There are some string operations that can be implemented more efficiently on SSE4/AVX2/AVX3 (using the _mm_cmpistri, _mm_cmpistrm, _mm_cmpestri, and _mm_cmpestrm intrinsics), RVV (using the __riscv_vmsbf_m_b intrinsics), SVE (using svbrkb_b_z), and PPC10 (using the vec_stril intrinsic).

Here is how the ZeroPastNullTerminator operation (which zeroes out all lanes past the null terminator) can be implemented for U8 vectors using native SIMD intrinsics on SSE4.2/AVX3/RVV/SVE/PPC10:

// SSE4.2
__m128i ZeroPastNullTerminator(__m128i v) {
  __m128i mask = _mm_cmpistrm(v, _mm_setzero_si128(), 0x58);
  return _mm_and_si128(v, mask);
}

// AVX3
__m128i ZeroPastNullTerminator(__m128i v) {
  const __mmask16 mask = _mm_cmpeq_epi8_mask(v, _mm_setzero_si128());
  // Keep only the lanes before the first zero byte: ~(mask | -mask).
  return _mm_maskz_mov_epi8(static_cast<__mmask16>(~(mask | (-mask))), v);
}

// RVV
vuint8m1_t ZeroPastNullTerminator(vuint8m1_t v) {
  const size_t vl = __riscv_vsetvlmax_e8m1();
  const vbool8_t eq_to_zero_mask =
    __riscv_vmseq_vx_u8m1_b8(v, uint8_t{0}, vl);
  const vbool8_t valid_chars_mask =
    __riscv_vmsbf_m_b8(eq_to_zero_mask, vl);
  return __riscv_vmerge_vvm_u8m1(__riscv_vmv_v_x_u8m1(0, vl), v,
                                 valid_chars_mask, vl);
}

// SVE
svuint8_t ZeroPastNullTerminator(svuint8_t v) {
  const svbool_t pg = svptrue_b8();
  const svuint8_t v_zero = svdup_n_u8(uint8_t{0});
  const svbool_t eq_to_zero_mask =
    svcmpeq_u8(pg, v, v_zero);
  const svbool_t valid_chars_mask =
    svbrkb_b_z(pg, eq_to_zero_mask);
  return svsel_u8(valid_chars_mask, v, v_zero);
}

// PPC10
__vector unsigned char ZeroPastNullTerminator(__vector unsigned char v) {
  return vec_stril(v);
}
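
For targets without such instructions, a portable fallback built from ops Highway already provides might look like this (a sketch; FindFirstTrue returns -1 when no lane is true):

template <class V>
V ZeroPastNullTerminatorFallback(V v) {
  const DFromV<V> d;
  const intptr_t idx = FindFirstTrue(d, Eq(v, Zero(d)));
  const size_t num_valid = (idx < 0) ? Lanes(d) : static_cast<size_t>(idx);
  // Keep lanes before the first zero byte; everything else becomes zero.
  return IfThenElseZero(FirstN(d, num_valid), v);
}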

@johnplatts
Contributor

The SSE4.2 PCMPISTRI, PCMPISTRM, PCMPESTRI, and PCMPESTRM instructions can each perform one of the following operations in a single instruction:

  • Equal any (equivalent to the following):
template<class V>
MFromD<DFromV<V>> EqualAny(V a, V b, size_t b_len) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  using TU = TFromD<decltype(du)>;
  auto m = MaskFromVec(Zero(d));
  for(size_t i = 0; i < b_len; i++) {
    const auto idx = IndicesFromVec(d, Set(du, static_cast<TU>(i)));
    m = Or(m, Eq(a, TableLookupLanes(b, idx)));
  }
  return m;
}
  • WithinRanges (equivalent to the following):
template<class V>
MFromD<DFromV<V>> WithinRanges(V a, V b, size_t b_len) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  using TU = TFromD<decltype(du)>;
  auto m = MaskFromVec(Zero(d));
  const auto b_pair_len = b_len & static_cast<size_t>(-2);
  for(size_t i = 0; i < b_pair_len; i += 2) {
    const auto idx1 = IndicesFromVec(d, Set(du, static_cast<TU>(i)));
    const auto idx2 = IndicesFromVec(d, Set(du, static_cast<TU>(i + 1)));
    const auto within_range =
      And(Ge(a, TableLookupLanes(b, idx1)), Le(a, TableLookupLanes(b, idx2)));
    m = Or(m, within_range);
  }
  return m;
}
  • EqualOrdered (equivalent to the following):
template<class V>
MFromD<DFromV<V>> EqualOrdered(V a, V b, size_t b_len) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  const RebindToSigned<decltype(d)> di;
  using TU = TFromD<decltype(du)>;

  const auto all_ones = BitCast(d, Set(di, -1));
  if(b_len == 0)
    return MaskFromVec(all_ones);

  const auto iota0 = Iota(du, TU{0});
  auto m = Eq(a, Broadcast<0>(b));
  for(size_t i = 1; i < b_len; i++) {
    const auto b_shuf_idx = IndicesFromVec(d, Set(du, static_cast<TU>(i)));
    const auto mask_shuf_idx =
      IndicesFromVec(d, Add(iota0, Set(du, static_cast<TU>(i))));
    // Compare a against b[i], then shift the comparison down by i lanes,
    // filling out-of-range lanes with all-ones (treated as a match).
    const auto m_i = VecFromMask(d, Eq(a, TableLookupLanes(b, b_shuf_idx)));
    m = And(m, MaskFromVec(TwoTablesLookupLanes(m_i, all_ones, mask_shuf_idx)));
  }
  return m;
}
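
For context, a hypothetical use of EqualOrdered for substring search within one vector (chunk and padded_needle are illustrative names; matches that run off the end of a chunk would still need a cross-chunk check):

const hn::ScalableTag<uint8_t> d;
const auto haystack = hn::LoadU(d, chunk);              // one chunk of text
const auto needle = hn::LoadU(d, padded_needle);        // "abc" plus zero padding
const auto starts = EqualOrdered(haystack, needle, 3);  // lane j: does "abc" start at j?
const intptr_t pos = hn::FindFirstTrue(d, starts);      // -1 if no match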

@jan-wassenberg
Member

I like your ZeroPastNullTerminator idea. It's not clear how this could look on AVX2, though: 32 lanes is too many for one lookup table. And NEON would also struggle; its BitsFromMask is quite expensive.

That's a wider concern about the SSE4 string instructions: they are not very performance-portable. Even Intel hasn't carried them forward to the >128-bit instruction sets. But your emulations look reasonable; it seems better to use those than not to vectorize. Would you like to create a pull request with those operations? Perhaps we can put them in hwy/contrib/algo/string-inl.h or similar?
