FastNoise SIMD AVX512 Support

I worked on AVX512 support a few months ago but had no hardware to test it on, and I didn't want to release it untested. Now, thanks to Google Cloud Platform offering Skylake instances and inviting me to try them, I've tested it fully, got everything working and released it.

When testing I found that even the newest version of MSVC 2017 failed to compile the AVX512 intrinsics, so all testing was done using the Intel compiler 17.0. For this reason I have also left AVX512 compilation disabled by default, since it is unusable for 99.99% of people at this time anyway.

Intrinsic Masks

One of the big new features in AVX512 is the ability to use a bit mask with almost any of the SIMD intrinsics: for each element in the vector, the operation is only carried out if its corresponding mask bit is set to 1. Because of this, all of the comparison intrinsics (less than, greater than, equal) now return a 16-bit mask (__mmask16) instead of a full vector. This was the biggest change to the code in FastNoiseSIMD, since I wanted code that works across instruction sets (ISs) rather than a separate code path for AVX512 masks.

To make masks multi-IS compatible and easy to use, I simply emulate them on the older ISs: on AVX512 the MASK type is defined as __mmask16, and for everything else it is a normal integer vector.

#if SIMD_LEVEL == FN_AVX512
typedef __mmask16 MASK;
#else
typedef SIMDi MASK;
#endif

The masking intrinsics are also emulated on the older ISs by bitwise ANDing the second operand with the mask, then performing the operation on the result.

#define SIMDi_MASK_ADD(m,a,b) SIMDi_ADD(a,SIMDi_AND(m,b))
#define SIMDi_MASK_SUB(m,a,b) SIMDi_SUB(a,SIMDi_AND(m,b))

Improved Gradient Function

Apart from the obvious performance increase of processing 16 values simultaneously in AVX512 versus only 8 in AVX2, another large speed-up comes from being able to store all 16 possible gradient directions in a single vector for use with the permute intrinsic. Permute lets a vector act as a kind of lookup array: it takes an int vector as indices and returns the corresponding values from the lookup vector. I was already using permute in AVX2, but it required 2 permutes per axis plus a blend to cover all 16 gradient possibilities. Below is a comparison of the 3 different gradient functions; they all produce identical output.

#if SIMD_LEVEL == FN_AVX512
static SIMDf VECTORCALL FUNC(GradCoord)(const SIMDi& seed, const SIMDi& xi, const SIMDi& yi, const SIMDi& zi, const SIMDf& x, const SIMDf& y, const SIMDf& z)
{
 SIMDi hash = FUNC(Hash)(seed, xi, yi, zi);

 SIMDf xGrad = SIMDf_PERMUTE(SIMDf_NUM(X_GRAD), hash);
 SIMDf yGrad = SIMDf_PERMUTE(SIMDf_NUM(Y_GRAD), hash);
 SIMDf zGrad = SIMDf_PERMUTE(SIMDf_NUM(Z_GRAD), hash);

 return SIMDf_MUL_ADD(x, xGrad, SIMDf_MUL_ADD(y, yGrad, SIMDf_MUL(z, zGrad)));
}

#elif SIMD_LEVEL == FN_AVX2
static SIMDf VECTORCALL FUNC(GradCoord)(const SIMDi& seed, const SIMDi& xi, const SIMDi& yi, const SIMDi& zi, const SIMDf& x, const SIMDf& y, const SIMDf& z)
{
 SIMDi hash = FUNC(Hash)(seed, xi, yi, zi);
 MASK hashBit8 = SIMDi_SHIFT_L(hash, 28);

 // x0 and y8 share the same values and are easy to calculate
 SIMDf x0y8Perm = SIMDf_XOR(SIMDf_CAST_TO_FLOAT(SIMDi_SHIFT_L(hash, 31)), SIMDf_NUM(1));

 SIMDf xGrad = SIMDf_BLENDV(x0y8Perm, SIMDf_PERMUTE(SIMDf_NUM(X_GRAD_8), hash), hashBit8);
 SIMDf yGrad = SIMDf_BLENDV(SIMDf_PERMUTE(SIMDf_NUM(Y_GRAD_0), hash), x0y8Perm, hashBit8);
 SIMDf zGrad = SIMDf_BLENDV(SIMDf_PERMUTE(SIMDf_NUM(Z_GRAD_0), hash), SIMDf_PERMUTE(SIMDf_NUM(Z_GRAD_8), hash), hashBit8);

 return SIMDf_MUL_ADD(x, xGrad, SIMDf_MUL_ADD(y, yGrad, SIMDf_MUL(z, zGrad)));
}

#else
static SIMDf VECTORCALL FUNC(GradCoord)(const SIMDi& seed, const SIMDi& xi, const SIMDi& yi, const SIMDi& zi, const SIMDf& x, const SIMDf& y, const SIMDf& z)
{
 SIMDi hash = FUNC(Hash)(seed, xi, yi, zi);
 SIMDi hasha13 = SIMDi_AND(hash, SIMDi_NUM(13));

 //if h < 8 then x, else y
 MASK l8 = SIMDi_LESS_THAN(hasha13, SIMDi_NUM(8));
 SIMDf u = SIMDf_BLENDV(y, x, l8);

 //if h < 4 then y, else if h is 12 or 14 then x, else z
 //(the AND 13 cleared bit 1, so (h & 13) < 2 is equivalent to h < 4)
 SIMDi l4 = SIMDi_LESS_THAN(hasha13, SIMDi_NUM(2));
 SIMDi h12o14 = SIMDi_EQUAL(SIMDi_NUM(12), hasha13);
 SIMDf v = SIMDf_BLENDV(SIMDf_BLENDV(z, x, h12o14), y, l4);

 //if h1 then -u else u
 //if h2 then -v else v
 SIMDf h1 = SIMDf_CAST_TO_FLOAT(SIMDi_SHIFT_L(hash, 31));
 SIMDf h2 = SIMDf_CAST_TO_FLOAT(SIMDi_SHIFT_L(SIMDi_AND(hash, SIMDi_NUM(2)), 30));
 //then add them
 return SIMDf_ADD(SIMDf_XOR(u, h1), SIMDf_XOR(v, h2));
}
#endif

AVX512 is the simplest, then AVX2, closely followed by the default; this also closely reflects their relative performance. The performance of the gradient function affects Perlin noise the most, since it makes 8 calls to it, followed by Simplex with 4.

Inline Scalar Value Broadcasting

Another lesser-known feature of AVX512 is the ability to use scalar values in SIMD intrinsics and have them broadcast and used in the operation without any overhead. No compiler currently provides an explicit way to use this; it is down to the compiler to recognise where it can be used and emit the relevant opcode. Because of this, and the minimal gain for FastNoise, I haven't implemented it yet, although it is a future consideration.

Performance

The most interesting part of using AVX512!

See the updated Performance Comparisons page.
