Writing Maintainable SIMD Intrinsics Using C++ Templates

思考的轨迹 2012-03-27

With the release of CPUs supporting AVX instructions, there are now two major versions of SIMD instructions available to x86 software developers. This presents a challenge to developers who want to write code that performs well on a variety of processors: it is expensive and error-prone to maintain a scalar, an SSE, and now an AVX implementation of the same piece of code. In this article, I present a technique that I have found useful in solving this problem.

Introduction to SSE/AVX Intrinsics
A very brief introduction to SSE and AVX intrinsics is in order. SSE and AVX are SIMD (Single Instruction, Multiple Data) instruction set extensions for x86. These instructions execute the same operation on many operands at once, for a cost similar to that of a scalar operation, and can give certain algorithms a great boost in performance. Intrinsic functions are functions built into the compiler that, in the case of SSE and AVX, map very closely to SSE and AVX assembly instructions. However, they are not the same as assembly: intrinsics have advantages and disadvantages, but unless you have a good reason why you can't use them, use intrinsics. The following example illustrates how to add two arrays of floats elementwise with scalar code, SSE intrinsics, and AVX intrinsics:

// Header for AVX intrinsics (includes SSE intrinsics.)
#include <immintrin.h>
 
// Compute out[i] = a[i] + b[i] with scalar instructions.
void AddScalar(float *out, const float *a, const float *b, int N)
{
	for(int i = 0; i < N; ++i)
		out[i] = a[i] + b[i];
}
 
// Compute out[i] = a[i] + b[i] with SSE instructions.
void AddSSE(float *out, const float *a, const float *b, int N)
{
	for(int i = 0; i < N; i += 4)
	{
		// Read 4 floats from a to a_i, and b to b_i.
		__m128 a_i = _mm_load_ps(&a[i]);
		__m128 b_i = _mm_load_ps(&b[i]);
 
		// Compute out_i = a_i + b_i.
		__m128 out_i = _mm_add_ps(a_i, b_i);
 
		// Write 4 floats from out_i to out.
		_mm_store_ps(&out[i], out_i);
	}
}
 
// Compute out[i] = a[i] + b[i] with AVX instructions.
void AddAVX(float *out, const float *a, const float *b, int N)
{
	for(int i = 0; i < N; i += 8)
	{
		// Read 8 floats from a to a_i, and b to b_i.
		__m256 a_i = _mm256_load_ps(&a[i]);
		__m256 b_i = _mm256_load_ps(&b[i]);
 
		// Compute out_i = a_i + b_i.
		__m256 out_i = _mm256_add_ps(a_i, b_i);
 
		// Write 8 floats from out_i to out.
		_mm256_store_ps(&out[i], out_i);
	}
}

As you can see, the SIMD versions step through the arrays by 4 and 8 elements for SSE and AVX, respectively. In theory, the SSE and AVX versions of this code are 4x and 8x faster than the scalar version! In practice, it is very difficult to achieve these theoretical speedups. It is also important to note that the SSE and AVX versions of this function have some additional requirements that are not necessarily obvious: the pointers passed to the SSE version must be aligned to 16-byte boundaries, and N must be a multiple of 4; likewise, the pointers passed to the AVX version must be aligned to 32-byte boundaries, and N must be a multiple of 8. There are many other issues beyond the scope of this post, although I touch on instruction latency later, as it is a problem this technique can help solve.
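As a minimal sketch of how a caller can satisfy those requirements (AddSSESafe is a hypothetical name, not part of the example above), the variant below keeps the 16-byte alignment requirement on the pointers, for instance by allocating them with _mm_malloc(bytes, 16) and releasing them with _mm_free, but allows any N by finishing with a scalar tail:

// Like AddSSE, but N may be any non-negative count. The pointers must
// still be 16-byte aligned (e.g. allocated with _mm_malloc).
void AddSSESafe(float *out, const float *a, const float *b, int N)
{
	// Largest multiple of 4 not exceeding N.
	int N4 = N & ~3;
	for(int i = 0; i < N4; i += 4)
		_mm_store_ps(&out[i], _mm_add_ps(_mm_load_ps(&a[i]), _mm_load_ps(&b[i])));

	// Handle the remaining 0-3 elements with scalar code.
	for(int i = N4; i < N; ++i)
		out[i] = a[i] + b[i];
}

With that out of the way, let's see how we can use C++ templates to avoid writing three versions of all our algorithms.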

An Example Complex Number Class
If you are like me, you already write a lot of your helper classes, such as vectors and complex numbers, using templates, usually because you need both single and double precision versions and templates are a great way to avoid rewriting boilerplate operators and functions. This practice will prove itself very useful once again for the subject of this post. Consider this example complex number template class:

template < typename T >
class Complex
{
public:
	T x, y;
 
	// Default constructor.
	Complex() : x(0.0), y(0.0) { }
	// Construct real number.
	Complex(T x) : x(x), y(0.0) { }
	// Construct complex number.
	Complex(T x, T y) : x(x), y(y) { }
};
 
// Addition operator.
template < typename T, typename U >
Complex < T > operator + (const Complex < T > &l, const Complex < U > &r)
{
	return Complex < T > (
		l.x + r.x,
		l.y + r.y);
}
 
// Subtraction operator.
template < typename T, typename U >
Complex < T > operator - (const Complex < T > &l, const Complex < U > &r)
{
	return Complex < T > (
		l.x - r.x,
		l.y - r.y);
}
 
// Multiplication operator.
template < typename T, typename U >
Complex < T > operator * (const Complex < T > &l, const Complex < U > &r)
{
	return Complex < T > (
		l.x * r.x - l.y * r.y,
		l.x * r.y + l.y * r.x);
}

Complex numbers are used very frequently in audio and image processing, where SIMD instructions can be extremely important. SSE implementations of basic video processing tasks can mean the difference between smooth real-time playback and dropped frames.

4-point DFT: An Example Algorithm
Suppose we're trying to write an FFT algorithm that supports radix 4, which basically means that we need to evaluate many 4-point DFTs. If you are like me, you probably didn't think far enough ahead to write this as a template, so let's do that now:

template < class T >
void Swap(Complex < T > &x0, Complex < T > &x1)
{
	Complex < T > t = x0;
	x0 = x1;
	x1 = t;
}
 
template < class T >
void Dft2(Complex < T > &x0, Complex < T > &x1)
{
	Complex < T > t = x0;
	x0 = t + x1;
	x1 = t - x1;
}
 
template < class T >
void Dft4(Complex < T > &x0, Complex < T > &x1, Complex < T > &x2, Complex < T > &x3)
{
	Dft2(x0, x2);
	Dft2(x1, x3);
 
	// x3 = x3 * -i
	x3 = x3 * Complex < T > (0.0, -1.0);
 
	Dft2(x0, x1);
	Dft2(x2, x3);
 
	Swap(x1, x2);
}

This 4-point DFT implementation is perfectly usable as a scalar DFT; just instantiate the template with T = float. And if we also want an SSE version of this function, the framework we have so far means there is not much work left to do.
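For instance, evaluating a single scalar 4-point DFT is as simple as:

// Scalar instantiation of Dft4: T is deduced as float.
Complex < float > x0(1.0f), x1(0.0f), x2(-1.0f), x3(0.0f);
Dft4(x0, x1, x2, x3);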

The Keystone: The m128f Class
The last thing we need to instantiate an SSE implementation of Dft4 is a vector class accelerated with SSE:

// SSE vector.
__declspec(align(16)) class m128f
{
public:
	// These identify information about this vector type.
	typedef float T;
	enum { N = 4 };
 
	__m128 v;
 
	// Purposefully do not initialize v.
	m128f() { }
	// Replicate scalar x across v.
	m128f(float x) : v(_mm_set1_ps(x)) { }
	// Load 4 floats from px. px must be aligned to 16 bytes.
	m128f(const float *px) : v(_mm_load_ps(px)) { }
	// Copy vector v.
	m128f(__m128 v) : v(v) { }
};
 
m128f operator + (const m128f &l, const m128f &r)
{
	return m128f(_mm_add_ps(l.v, r.v));
}
m128f operator - (const m128f &l, const m128f &r)
{
	return m128f(_mm_sub_ps(l.v, r.v));
}
m128f operator * (const m128f &l, const m128f &r)
{
	return m128f(_mm_mul_ps(l.v, r.v));
}

Now, we can use T = m128f as our complex number class template parameter. This means that we get the SSE implementations of all the complex number operations basically for free, including the Dft4 function! Consider the following example:

#include <cmath> // for std::sin

// The DFT examples assume a PI constant, e.g.:
static const float PI = 3.14159265f;

void DoDftScalar()
{
	// Input signal.
	Complex < float > x[4];
	// Construct a sine wave (one period across the 4 samples).
	for(int i = 0; i < 4; ++i)
		x[i] = Complex < float > (std::sin((float)i * PI / 2.0f));
 
	Dft4(x[0], x[1], x[2], x[3]);
}
 
void DoDftSSE()
{
	// Input signal.
	Complex < m128f > x[4];
	// Construct a sine wave (one period across the 4 samples).
	for(int i = 0; i < 4; ++i)
		x[i] = Complex < m128f > (std::sin((float)i * PI / 2.0f));
 
	Dft4(x[0], x[1], x[2], x[3]);
}

The DoDftSSE function evaluates four 4-point DFTs simultaneously using SSE, theoretically for the same cost as the single scalar DFT in DoDftScalar!

A very similar class to m128f can also be created for AVX instructions using the __m256 type. This will enable all your templated algorithms to be instantiated using AVX instructions in addition to SSE instructions!
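For reference, a minimal sketch of such a class, mirroring m128f (and, like m128f, only defining what the examples in this post need):

// AVX vector.
__declspec(align(32)) class m256f
{
public:
	// These identify information about this vector type.
	typedef float T;
	enum { N = 8 };

	__m256 v;

	// Purposefully do not initialize v.
	m256f() { }
	// Replicate scalar x across v.
	m256f(float x) : v(_mm256_set1_ps(x)) { }
	// Load 8 floats from px. px must be aligned to 32 bytes.
	m256f(const float *px) : v(_mm256_load_ps(px)) { }
	// Copy vector v.
	m256f(__m256 v) : v(v) { }
};

m256f operator + (const m256f &l, const m256f &r)
{
	return m256f(_mm256_add_ps(l.v, r.v));
}
m256f operator - (const m256f &l, const m256f &r)
{
	return m256f(_mm256_sub_ps(l.v, r.v));
}
m256f operator * (const m256f &l, const m256f &r)
{
	return m256f(_mm256_mul_ps(l.v, r.v));
}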

Limitations
If this seems too good to be true, that's because it is. This technique works, and for simple algorithms it might be sufficient, but it has a number of problems.

The first problem you will run into with this technique is data layout. Using the example of complex numbers: if your data is laid out in the typical AoS (Array of Structures) format of a scalar implementation, you will not be able to simply load it into Complex < m128f > instances from memory. A typical pattern is to implement the core of an algorithm as a template function, and write wrapper functions that translate the data from AoS to SoA (Structure of Arrays) form before calling a specific instantiation of the template. This way you can customize the data layout for each implementation with separate code paths, while maintaining a single code path for the core of the algorithm. This only makes sense as long as the cost of the data transpose is negligible compared to the cost of executing the algorithm itself.
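As a concrete sketch of such a transpose (LoadComplexAoS is a hypothetical helper name, assuming the m128f class above), the SSE shuffles below deinterleave four (re, im) pairs into one Complex < m128f >:

// Deinterleave 4 complex numbers stored as (re, im) pairs into SoA form.
// src must be 16-byte aligned and hold re0 im0 re1 im1 re2 im2 re3 im3.
Complex < m128f > LoadComplexAoS(const float *src)
{
	__m128 lo = _mm_load_ps(src);     // re0 im0 re1 im1
	__m128 hi = _mm_load_ps(src + 4); // re2 im2 re3 im3

	return Complex < m128f > (
		m128f(_mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0))),  // re0 re1 re2 re3
		m128f(_mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1)))); // im0 im1 im2 im3
}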

This technique also creates much more register pressure than carefully hand-optimized SIMD code. Staying with complex numbers, it is possible to implement complex arithmetic with two complete complex numbers packed into a single SSE register, albeit with less than the theoretical 4x/8x SSE/AVX throughput for some operations. All else being equal, the packed layout needs half the registers. Depending on how many registers an algorithm requires, that approach may be faster than the sledgehammer technique presented here.
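To make that concrete, one well-known pattern multiplies two complex numbers packed per register, with each register holding (re0, im0, re1, im1). ComplexMulPacked is a hypothetical helper name, and _mm_addsub_ps and the dup shuffles require SSE3:

// Multiply two packed complex numbers per register: (re0, im0, re1, im1).
__m128 ComplexMulPacked(__m128 a, __m128 b)
{
	__m128 b_re = _mm_moveldup_ps(b);   // b.re b.re, for both pairs
	__m128 b_im = _mm_movehdup_ps(b);   // b.im b.im, for both pairs

	__m128 t1 = _mm_mul_ps(a, b_re);    // a.re*b.re, a.im*b.re
	// Swap re/im within each pair.
	__m128 a_sw = _mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 3, 0, 1));
	__m128 t2 = _mm_mul_ps(a_sw, b_im); // a.im*b.im, a.re*b.im

	// Even lanes subtract (real parts), odd lanes add (imaginary parts).
	return _mm_addsub_ps(t1, t2);
}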

Some of the concepts presented here are still useful even if you do need to write higher level abstractions, such as a set of 2/4 complex numbers packed into a single SSE/AVX register. Such an implementation has the same data layout as scalar complex numbers, meaning that a templated algorithm operating on packed complex numbers will also be compatible with a class that packs multiple complex numbers into one SSE or AVX register.

Latency and Data Dependencies: The mTx2 < T > Class
As alluded to earlier, there is another issue that templates can help resolve. Instructions have latency: the result of an instruction cannot be consumed by an immediately following instruction without introducing a delay. To hide this latency, you may want multiple independent instruction streams even in simple algorithms; these streams can be interleaved to reduce data dependencies on high-latency instructions. This corresponds to unrolling a loop whose iterations can execute in parallel, then rearranging the instructions to increase the distance between dependent instructions. A simple class that implements this concept, provided your algorithm is templated (as it should be by now), is shown below:

template < class T >
class mTx2
{
public:
	T v0, v1;
 
	mTx2() { }
	// Replicate x to both vectors.
	mTx2(typename T::T x) : v0(x), v1(x) { }
	// Load 2*T::N scalars from px.
	mTx2(const typename T::T *px) : v0(px), v1(px + T::N) { }
	// Copy two existing vectors.
	mTx2(T v0, T v1) : v0(v0), v1(v1) { }
};
 
// Handy type for SSE vector x2.
typedef mTx2 < m128f > m128x2f;
 
template < class T >
mTx2 < T > operator + (const mTx2 < T > &l, const mTx2 < T > &r)
{
	return mTx2 < T > (l.v0 + r.v0, l.v1 + r.v1);
}
template < class T >
mTx2 < T > operator - (const mTx2 < T > &l, const mTx2 < T > &r)
{
	return mTx2 < T > (l.v0 - r.v0, l.v1 - r.v1);
}
template < class T >
mTx2 < T > operator * (const mTx2 < T > &l, const mTx2 < T > &r)
{
	return mTx2 < T > (l.v0 * r.v0, l.v1 * r.v1);
}

The mTx2 < m128f > class is a packet of 2 SSE vectors. When this class is used in a calculation, it creates 2 independent instruction streams, reducing data dependencies. Keep in mind that complex number and vector classes will often create multiple independent instruction streams themselves for most operations, and using mTx2 < m128f > will burn through physical registers very quickly.

Some algorithms have a tightly dependent section in one part while presenting opportunities for interleaving instructions in another. In that case, it can be useful to split an mTx2 < m128f > value into its component registers and process them one at a time, reducing register pressure in the sections where the algorithm provides its own instruction-level parallelism. Template specialization of algorithm functions can make certain cases of this almost trivial to implement, as sketched below.
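A minimal sketch of that idea, assuming the classes above: a full specialization of Dft4 for m128x2f that splits each packet into its component SSE vectors, runs the general Dft4 on each half so the tightly dependent body only juggles m128f values, and reassembles the packets:

template < >
void Dft4 < m128x2f > (Complex < m128x2f > &x0, Complex < m128x2f > &x1,
	Complex < m128x2f > &x2, Complex < m128x2f > &x3)
{
	// First halves of each packet.
	Complex < m128f > a0(x0.x.v0, x0.y.v0), a1(x1.x.v0, x1.y.v0);
	Complex < m128f > a2(x2.x.v0, x2.y.v0), a3(x3.x.v0, x3.y.v0);
	Dft4(a0, a1, a2, a3);

	// Second halves of each packet.
	Complex < m128f > b0(x0.x.v1, x0.y.v1), b1(x1.x.v1, x1.y.v1);
	Complex < m128f > b2(x2.x.v1, x2.y.v1), b3(x3.x.v1, x3.y.v1);
	Dft4(b0, b1, b2, b3);

	// Reassemble the packets.
	x0 = Complex < m128x2f > (m128x2f(a0.x, b0.x), m128x2f(a0.y, b0.y));
	x1 = Complex < m128x2f > (m128x2f(a1.x, b1.x), m128x2f(a1.y, b1.y));
	x2 = Complex < m128x2f > (m128x2f(a2.x, b2.x), m128x2f(a2.y, b2.y));
	x3 = Complex < m128x2f > (m128x2f(a3.x, b3.x), m128x2f(a3.y, b3.y));
}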

Conclusion
This technique requires a lot of trust in your compiler being relatively smart. Relying on smart compilers has historically been a dumb thing to do, so as always, profile your code. Register allocation and instruction scheduling are a big part of achieving good performance with SSE and AVX, and this technique tends to hide those details away where they are easily ignored and the compiler is forced to deal with them.

In practice, for some algorithms, this technique has delivered nearly the 4x/8x theoretical speedups expected from SSE/AVX with barely any work for already-templated algorithms. However, I have also seen performance degradation in cases that required more intricate instruction scheduling or lower register pressure in certain parts of an algorithm. Template specialization proved invaluable for resolving the performance issues in these cases. Profile your code!

After you've profiled your code and decided to optimize your hotspots with SSE/AVX, Intel's intrinsics references and optimization manuals are good places to start.
