Attempting to use SIMD for single noise values
I was testing explicit SIMD for generating single noise values. FastNoise SIMD works by calculating multiple noise positions in parallel, but this can be hard to use in some situations. The idea was that Perlin noise needs 8 gradient lookups and simplex noise needs 4; these make up a large part of the noise functions and are well suited to SIMD. The initial x/y/z axis calculations can also be moved into a single vector and done in parallel.
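As a rough illustration of moving the x/y/z axis calculations into one vector, here is a minimal SSE2 sketch. It assumes the "initial calculations" are the usual floor-to-cell and fractional-offset steps of gradient noise; the names CellCoords and split_position are made up for this example and are not part of FastNoise.

```cpp
#include <emmintrin.h> // SSE2

// Hypothetical helper type: integer cell coordinates and fractional offsets,
// stored as lanes x, y, z (the fourth lane is unused).
struct CellCoords
{
    __m128i cell;
    __m128  frac;
};

static inline CellCoords split_position(float x, float y, float z)
{
    // _mm_set_ps takes its arguments from the highest lane to the lowest
    __m128 pos = _mm_set_ps(0.0f, z, y, x);

    // SSE2 has no floor instruction, so truncate towards zero first...
    __m128i truncated = _mm_cvttps_epi32(pos);
    __m128  truncatedF = _mm_cvtepi32_ps(truncated);

    // ...then subtract 1 in lanes where truncation rounded up (negative inputs).
    // The comparison mask is all ones in those lanes, which is -1 as an integer.
    __m128i adjust = _mm_castps_si128(_mm_cmpgt_ps(truncatedF, pos));
    __m128i cell = _mm_add_epi32(truncated, adjust);

    // Fractional offset inside the cell, one subtraction for all three axes
    __m128 frac = _mm_sub_ps(pos, _mm_cvtepi32_ps(cell));

    return { cell, frac };
}
```

Note that the very first line of the function is the _mm_set_ps() call, so the cost of getting scalars into a vector is paid before any parallel work starts.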
I tried to stick to SSE2 intrinsics to make it easily compatible with all 64-bit x86 systems. I did use 2 shortcuts in this test though: _mm_mullo_epi32() and _mm_blendv_ps(). These are both SSE4.1, but I used them to get quick initial performance results, since they could be replaced with SSE2 intrinsics if needed, although this would be slower.
Below is the code I substituted for the SinglePerlin() function in FastNoise for testing:
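For reference, the two SSE4.1 intrinsics can be emulated with SSE2 along these lines. This is a sketch of the commonly used substitutions, not code taken from FastNoise; the function names are made up, and the blend version assumes the mask comes from a comparison (every lane all ones or all zeros), whereas _mm_blendv_ps() itself only looks at the sign bit.

```cpp
#include <emmintrin.h> // SSE2

// SSE2 stand-in for _mm_mullo_epi32: low 32 bits of each 32-bit lane product.
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    // _mm_mul_epu32 only multiplies lanes 0 and 2, giving two 64-bit products
    __m128i even = _mm_mul_epu32(a, b);
    // Shift lanes 1 and 3 down into positions 0 and 2 and multiply those too
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));

    // Move the low 32 bits of each product into lanes 0 and 1
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));

    // Interleave back into the original lane order: p0, p1, p2, p3
    return _mm_unpacklo_epi32(even, odd);
}

// SSE2 stand-in for _mm_blendv_ps: picks b where the mask is set, else a.
// Assumes each mask lane is entirely ones or entirely zeros.
static inline __m128 blendv_ps_sse2(__m128 a, __m128 b, __m128 mask)
{
    return _mm_or_ps(_mm_andnot_ps(mask, a), _mm_and_ps(mask, b));
}
```

The multiply goes from one instruction to six and the blend from one to three, which is where the extra cost of staying on SSE2 would come from.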
I tested the above code and compared the results to the normal Perlin noise function in FastNoise. The results were disappointing. Unfortunately I didn’t save the benchmark scores, but the SIMD version averaged about 5% slower than the normal version. Since this was already using SSE4.1 intrinsics, I didn’t progress any further with the testing. I didn’t want to make FastNoise CPU reliant or add the extra SIMD intrinsic support checks that FastNoise SIMD has, since it is meant to be the easier library to use.
I think several things contribute to the SIMD version being slower:
- Using _mm_set_ps() to get the initial positions into a vector is slow
- Shuffling from the position vector into the gradient position vectors is slow
- The SIMD gradient function is slower than the normal gradient function, since it can’t use lookup arrays
- Extracting the gradient results from the vectors for lerping is slow
Unfortunately utilising SIMD for single noise values didn’t have the performance benefits I had hoped for.
This is not the only avenue for utilising SIMD in a single noise function. When calculating noise fractals, multiple octaves could be calculated at once using SIMD. This would need far less inserting into and extracting from vectors, and could directly use the SIMD based function from FastNoise SIMD. The downside is that a fractal noise function with only 2 octaves wouldn’t be any faster than one using 4 octaves, though hopefully even at 2 octaves the performance would match or improve on the non-SIMD version. I would have to figure out how to structure this feature and whether it would go into FastNoise or FastNoise SIMD.
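A minimal sketch of what the octave idea might look like, with one octave per vector lane. Everything here is hypothetical: fractal4 is a made-up name, noise4 stands for any 4-wide noise function (it is not the actual FastNoise SIMD interface), and the per-octave seed changes and amplitude normalisation a real fractal function would want are left out.

```cpp
#include <emmintrin.h> // SSE2

// Hypothetical sketch: sums 4 fractal octaves, one per lane.
// noise4 is any callable taking x, y, z vectors and returning one
// noise value per lane.
template <typename Noise4>
float fractal4(Noise4 noise4, float x, float y, float z, float lacunarity, float gain)
{
    float lac2 = lacunarity * lacunarity;
    float gain2 = gain * gain;

    // Per-octave frequency and amplitude, octave 0 in lane 0
    __m128 freq = _mm_setr_ps(1.0f, lacunarity, lac2, lac2 * lacunarity);
    __m128 amp  = _mm_setr_ps(1.0f, gain, gain2, gain2 * gain);

    // One noise call evaluates all 4 octaves
    __m128 n = noise4(_mm_mul_ps(_mm_set1_ps(x), freq),
                      _mm_mul_ps(_mm_set1_ps(y), freq),
                      _mm_mul_ps(_mm_set1_ps(z), freq));
    __m128 sum = _mm_mul_ps(n, amp);

    // Horizontal add: the only extraction from a vector in the whole function
    sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum));     // lane0 = s0+s2, lane1 = s1+s3
    sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 1)); // lane0 += lane1
    return _mm_cvtss_f32(sum);
}
```

With fewer than 4 octaves the unused lanes would still be computed (and would need their amplitude set to 0), which is why 2 octaves would cost the same as 4 in this layout.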