Visual Studio Technical Articles
Windows Data Alignment on IPF, x86, and x64
Kang Su Gatlin

March 2006

Applies to:

Summary: Gives developers the information they need to confront the data alignment problems that are critical to the performance of 64-bit and 32-bit applications developed for the Microsoft Windows XP and Microsoft Windows Server 2003 platforms. (17 printed pages)

Introduction

Intel and AMD have introduced two new processor architectures: the Intel Itanium Processor Family (IPF) architecture and the x64 architecture. These processors join the IA-32 Intel Architecture family in the Microsoft Windows desktop/server world. With Microsoft Visual C++ and Microsoft Windows on these platforms, you can get excellent performance, but that performance is contingent upon certain programming practices. One of these programming practices is proper data alignment.

Proper data alignment allows you to get the most out of your 64-bit and 32-bit applications. On the Itanium, it is often not only a matter of performance, but also a matter of correctness. In this document we explain why you should care about data alignment, the costs if you do not, how to get your data aligned, and what to do when you cannot. You will never look at your data access the same way again.

What Is Data Alignment?

All variables have two components associated with them: their value, and their storage location. In this article our concern is the storage location. The storage location of a variable is also called its address, and it is the integer (the mathematical term integer, not the data type) offset in memory where the data begins. The alignment of a given variable is the largest power-of-two value, L, such that the address of the variable, A, modulo L is 0; that is, A mod L = 0. We call such a variable L-byte aligned. Note that when X > Y, and both X and Y are power-of-two values, a variable that is X-byte aligned is also Y-byte aligned.
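The definition above translates directly into a few lines of code. The following is a minimal sketch in standard C++ rather than anything specific to Visual C++; the function names alignment_of_address and is_aligned are our own, chosen only for illustration. It finds L for a given address A by isolating the lowest set bit of A, which is the largest power of two that divides A.

```cpp
#include <cassert>
#include <cstdint>

// Returns the largest power of two L such that address % L == 0.
// Address 0 is divisible by every power of two, so 0 is returned as a
// sentinel in that case (an assumption of this sketch).
std::uintptr_t alignment_of_address(std::uintptr_t address) {
    // (~address + 1) is the two's-complement negation of address;
    // AND-ing it with address keeps only the lowest set bit.
    return address & (~address + 1);
}

// Returns true when address is a multiple of boundary.
// boundary must be a nonzero power of two.
bool is_aligned(std::uintptr_t address, std::uintptr_t boundary) {
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (address & (boundary - 1)) == 0;
}
```

For example, an address of 24 is 8-byte aligned, and is therefore also 4-, 2-, and 1-byte aligned, while an address of 7 is only 1-byte aligned.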
In Listing 1, we give a code example to illustrate where variables get stored/aligned. Don't worry if you do not understand why things are aligned where they are. You will understand all of this by the end of the paper. We do encourage you to have fun and play with the example (reorder the local variables and class member variables, and then see what happens to the addresses).

Listing 1. Data alignment example

#include <stdio.h>

int main()
{
    char a;
    char b;
    class S1
    {
    public:
        char m_1;          // 1-byte element
                           // 3-bytes of padding are placed here
        int m_2;           // 4-byte element
        double m_3, m_4;   // 8-byte elements
    };
    S1 x;
    long y;
    S1 z[5];
    printf("b = %p\n", &b);
    printf("x = %p\n", &x);
    printf("x.m_2 = %p\n", &x.m_2);
    printf("x.m_3 = %p\n", &x.m_3);
    printf("y = %p\n", &y);
    printf("z[0] = %p\n", z);
    printf("z[1] = %p\n", &z[1]);
    return 0;
}

In Listing 2, we show the output of what Listing 1 might print. Remember that this is just what it prints on my computer. Your computer will almost certainly print different numbers. That is to be expected.

Listing 2. Output from example in Listing 1

b = 000006FBFFB8FEB1
x = 000006FBFFB8FE98
x.m_2 = 000006FBFFB8FE9C
x.m_3 = 000006FBFFB8FEA0
y = 000006FBFFB8FE90
z[0] = 000006FBFFB8FEB8
z[1] = 000006FBFFB8FED0

So, from the example in Listings 1 and 2, you can now see how each of the variables is aligned. The char, b, is aligned on a 1-byte boundary (0xB1 % 2 = 1). The class, x, is aligned on an 8-byte boundary (0x98 % 8 = 0). The member, x.m_2, is aligned on a 4-byte boundary (0x9C % 8 = 4). x.m_3 is on an 8-byte boundary, as is y. z[0] and z[1] are also 8-byte aligned (we omit the modulo math for those last sets of variables, because it is straightforward).

If we look at the class S1, we see that the whole class has become 8-byte aligned. The packing within the class is not optimal: x.m_2 starts 4 bytes after x.m_1 even though x.m_1 is merely a 1-byte element, so 3 bytes of padding sit between the two members.
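If you want to inspect a layout without printing addresses, the standard offsetof, sizeof, and alignof facilities report the same information. The sketch below is an illustration under assumptions, not part of the original listings: it repeats the member order of S1 at namespace scope, and the Wasteful and Compact structures and the padding_before_m_2 function are hypothetical names that we introduce to show how reordering members changes the padding. Exact sizes depend on the compiler and target, so the code relies only on relationships that we expect to hold on typical Windows and Linux targets.

```cpp
#include <cassert>
#include <cstddef>

// Same member order as class S1 in Listing 1.
struct S1 {
    char   m_1;        // 1-byte element, followed by padding
    int    m_2;        // 4-byte element
    double m_3, m_4;   // 8-byte elements
};

// Number of padding bytes the compiler inserted between m_1 and m_2.
std::size_t padding_before_m_2() {
    return offsetof(S1, m_2) - (offsetof(S1, m_1) + sizeof(char));
}

// Hypothetical example: small members on both sides of a large one
// force padding after each of them.
struct Wasteful {
    char   a;
    double b;
    char   c;
};

// Same members, largest first: the two chars share one padded tail.
struct Compact {
    double b;
    char   a;
    char   c;
};
```

With a 4-byte int that is 4-byte aligned, padding_before_m_2() returns 3, which matches the comment in Listing 1. Placing the largest members first, as in Compact, is a simple way to reduce padding without touching the padding limit.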
The Itanium and x64 compilers provide for data items of natural lengths of 1, 2, 4, 8, 10, and 16 bytes. All types are aligned on their natural lengths, except items that are greater than 8 bytes in length: those are aligned on the next power-of-two boundary. For example, 10-byte data items are aligned on 16-byte boundaries. The x86 compiler supports aligning on boundaries of the natural lengths of 1, 2, 4, and 8 bytes.

A relatively simple way to determine the alignment of a given type is the __alignof(type) operator. (The macro equivalent is TYPE_ALIGNMENT(type).) This operator returns the alignment requirement of the variable or type passed to it.

Stack Alignment

On both of the 64-bit platforms, the top of each stack frame is 16-byte aligned. Although this uses more space than is needed, it guarantees that the compiler can place all data on the stack in a way that all elements are aligned.

The x86 compiler uses a different method
for aligning the stack. By default, the stack is 4-byte aligned.
Although this is space efficient, you can see that there are some data
types that need to be 8-byte aligned, and that, in order to get good
performance, 16-byte alignment is sometimes needed. The compiler can
determine, on some occasions, that dynamic 8-byte stack alignment would
be beneficial, notably when there are doubles on the stack.

The compiler does this in two ways. First, the compiler can use link-time code generation (LTCG), when specified by the user at compile and link time, to generate the call-tree for the complete program. With this, it can determine regions of the call-tree where 8-byte stack alignment would be beneficial, and it determines call-sites where the dynamic stack alignment gets the best payoff. The second way is used when the function has doubles on the stack, but, for whatever reason, has not yet been 8-byte aligned. The compiler applies a heuristic (which improves with each iteration of the compiler) to determine whether the function should be dynamically 8-byte aligned.

Note: A downside to dynamic 8-byte stack alignment, with respect to performance, is that frame pointer omission (/Oy) effectively gets turned off. Register EBP must be used to reference the stack with dynamic 8-byte stack alignment, and therefore it cannot be used as a general register in the function.

Structure and Union Layout

The layout with respect to alignment in structures and unions is dependent on a few simple rules. We can break structure and union alignment into two components: inter-structure/union alignment, and intra-structure alignment. (There is no intra-union alignment.)

Inter-structure/union alignment is the simpler case. The rule here is that the compiler aligns the structure to the largest alignment requirement of any of its members. Unions follow the same rule.

Intra-structure alignment works by the principle that the members are aligned by the compiler at their natural boundaries. The compiler does this through padding, inserting as much padding as necessary up to the padding limit. The padding limit is set by the compilation switch /Zpn. The default for this switch is /Zp8. The programmer can also use #pragma pack at the point of declaration of the structure to set the padding limit from that point in the translation unit onward.
That is, it does not affect structures declared prior to the #pragma pack. Access to structure members that are packed may result in access to data that is unaligned. The compiler inserts the fix-up code for these members, which means that the access will not result in an exception, but it will result in slower and more bloated code. (The fix-up code and exception may not make sense yet, but you will understand them by the end of this article.)

The padding limits (#pragma pack and /Zpn) should be used with care. Unless most of your work consists of simply moving data, without reading or writing particular elements, or you are space constrained, the trade-offs involved with using padding limits that violate the alignment rules usually do not work in the programmer's favor.

Why Is Alignment a Concern?

Okay, so now you know what it means for a variable to be aligned. Why do we care about alignment? Well, as you may have guessed, the reason is performance. On the Itanium platform, the reason is correctness as well, due to the way misalignment is handled.

Now the question is, Why? What is the underlying reason that we care about alignment? Certainly, no computer architect arbitrarily decided to make our lives difficult. No, but these alignment issues are, in fact, a remnant of architectural trade-offs made by computer architects.

On most modern RISC-based designs, data can be accessed only at the boundary defined by the natural length of the data being requested. This fills the destination register with the data of that length. The implication of this is that the computer gets data in natural-length chunks from addresses that are a multiple of the natural length. What this further implies is that reading data from addresses that are not a multiple of the natural length will be problematic (it may slow down or crash the application).
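To make the two points above concrete, the following sketch shows how a padding limit of 1 moves a 4-byte member to an address that is not a multiple of its natural length, and one portable way to read such a value without a misaligned load in the source code. It is a minimal illustration under assumptions: the structure names Natural and Packed and the helper read_u32 are ours, and the offsets in the comments assume a 4-byte, 4-byte-aligned std::uint32_t. Copying through memcpy is a standard C++ idiom that we substitute here; it is not the fix-up code that the Visual C++ compiler generates for packed members.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Default packing: the compiler pads after tag so that value sits at
// an offset that is a multiple of its natural length.
struct Natural {
    char          tag;
    std::uint32_t value;   // expected at offset 4
};

// Padding limit of 1 byte, comparable to compiling with /Zp1.
#pragma pack(push, 1)
struct Packed {
    char          tag;
    std::uint32_t value;   // expected at offset 1: misaligned for a 4-byte type
};
#pragma pack(pop)

// Reads a 4-byte value from an arbitrary byte address. memcpy makes no
// alignment assumption about its arguments, so this is safe even when
// bytes is not 4-byte aligned.
std::uint32_t read_u32(const unsigned char* bytes) {
    std::uint32_t result;
    std::memcpy(&result, bytes, sizeof result);
    return result;
}
```

The packed structure saves 3 bytes per instance, but every access to value now has to cope with an address that may not be a multiple of 4, which is exactly the situation the rest of this section describes.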
For example, a 32-bit computer with a word boundary starting at 0 can load data from bytes at location 0 to 3 in one load, or 4 to 7 in one load, or 40 to 43 in one load, but NOT 2 to 5 in one load (because bytes 2 to 5 span two words). What this means is that if the computer actually needed to retrieve the 32-bit value from location 2 to 5, it would have to retrieve the data from 0 to 3, and also retrieve the value from location 4 to 7, and then perform some operations to properly extract and shift the bytes that it needs. Depending on the computer system, either the operating system or compiler does this for you. If they do not, then the hardware can raise an exception (and you do not want that to happen; as a worst case, it could crash).

When the software bails you out, this not only requires some extra logic, but it also takes extra memory accesses. In fact, for many applications on modern computers, the memory system is the performance bottleneck, thus making extra memory requests very costly. In the particular example of this paragraph, it will take two memory accesses to get the 32-bit value from 2 to 5, rather than the one memory access it would take to get the 32-bit value from an aligned address. See Figure 1, because a visual representation might help to make more sense of this potentially tricky topic.

Figure 1. Loading bytes at addresses 2 to 5

Figure 1 shows: a) loading the first word (bytes 0 to 3); b) extracting bytes 2 to 3 from the loaded word; c) loading the second word; and d) extracting the first two bytes from the second loaded word and appending them to the previously extracted bytes.

This notion of data alignment goes beyond the word-size of the given computer architecture, extending up the memory hierarchy, through the multiple levels of cache, translation lookaside buffer, and pages. Each of these, like the 32-bit words, has an associated unit chunk size. Caches have cache lines that are on the order of 32 to 128 bytes.
Pages go from 1024 bytes to megabytes in size. This is all done to make our programs perform more efficiently. We just need to know how to deal with it when it bites us.

Data Alignment Exceptions and Fix-Ups

The obvious way to deal with alignment issues is to avoid them; however, in the real world, that is not always possible. To help generate correct programs, Microsoft Visual C++ and Microsoft Windows have some mechanisms to help the programmer. These do not come without some performance impact, but they do assist in rapid development and/or porting of applications.

The first question that comes to mind might be, "What if I violate the alignment restrictions?" That is, what happens if I generate an alignment fault? Well, a few things can happen, and none of them are good.

In Windows, an application program that generates an alignment fault will raise an exception, EXCEPTION_DATATYPE_MISALIGNMENT. On the Itanium, by default, the operating system (OS) will make this exception visible to the application, and an exception handler might be useful in these cases. If you do not set up a handler, then your program will hang or crash. In Listing 3, we provide an example that shows how to catch the EXCEPTION_DATATYPE_MISALIGNMENT exception.

Listing 3. Code to catch alignment exception on Itanium

#include <windows.h>
#include <stdio.h>
#include <string.h>   // memset

// Exception filter: handle only misalignment faults and pass everything else on.
// STATUS_DATATYPE_MISALIGNMENT is the same value as EXCEPTION_DATATYPE_MISALIGNMENT.
int mswindows_handle_hardware_exceptions (DWORD code)
{
    printf("Handling exception\n");
    if (code == STATUS_DATATYPE_MISALIGNMENT)
    {
        printf("misalignment fault!\n");
        return EXCEPTION_EXECUTE_HANDLER;
    }
    else
        return EXCEPTION_CONTINUE_SEARCH;
}

int main()
{
    __try
    {
        char temp[16];   // large enough that the 8-byte read at offset 3 stays in bounds
        memset(temp, 0, sizeof(temp));
        double *val;
        val = (double *)(&temp[3]);   // pointer to a double at an odd offset
        printf("%lf\n", *val);        // the misaligned read happens here
    }
    __except(mswindows_handle_hardware_exceptions (GetExceptionCode ()))
    {}
    return 0;
}

The application can change the behavior of the alignment fault from the default, to one where the alignment fault is fixed up.
This is done with the Win32 API call SetErrorMode, with the SEM_NOALIGNMENTFAULTEXCEPT flag set. This allows the OS to handle the alignment fault, but at considerable performance cost. There are two things to note: 1) this is on a per-process basis, so each process should set this before the first alignment fault, and 2) SEM_NOALIGNMENTFAULTEXCEPT is sticky; that is, if this bit is ever set in an application through SetErrorMode, then it can never be reset for the duration of the application (inadvertently or otherwise).

On the x86 architecture, the operating system does not make the alignment fault visible to the application. On x86 and x64, you will also suffer performance degradation on a misaligned access, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.

On the x64 architecture, the alignment exceptions are disabled by default, and the fix-ups are done by the hardware. The application can enable alignment exceptions by setting a couple of register bits, in which case the exceptions will be raised unless the user has the operating system mask the exceptions with SEM_NOALIGNMENTFAULTEXCEPT. (For details, see the AMD Architecture Programmer's Manual Volume 2: System Programming.)

With that said, there are situations on the x86 and x64 platforms where unaligned access will generate a general-protection exception. (Note that these are general-protection exceptions, not alignment-check exceptions.) This is when the misalignment occurs on a 128-bit type, specifically SSE/SSE2-based types.

In some experimental runs, with the code in Listing 4 (we used 9,000,000 iterations, with 0 and 3 offset representing aligned and unaligned, respectively), we saw that on a slower Pentium III (731 MHz, running Microsoft Windows XP Professional), the program with the unaligned access runs about 3.25 times slower than the program with the aligned access.
On a faster Pentium IV (2.53 GHz, running Windows XP Professional), the program with an unaligned access runs about 2 times slower than the program with the aligned access. This is definitely not the type of performance hit you want to take.

Unfortunately, it gets even worse on the Itanium Processor Family. With the same test, running on an Itanium2 at 900 MHz with Microsoft Windows Server 2003 (but only for 90,000 iterations, due to how long the test takes to run), the unaligned program runs 459 times slower! As you can see, unaligned access in an inner loop can devastate the performance of your application. So, even with the OS fix-up, which prevents your application from crashing, you should avoid unaligned access.

Listing 4. Example code to compare OS fix-up unaligned vs. aligned

#include <stdio.h>
#include <stdlib.h>
#include <sys/timeb.h>
#include <time.h>
#include <windows.h>

// UINT is redefined as a pointer-sized unsigned integer for address arithmetic.
#ifdef _WIN64
#define UINT unsigned __int64
#define ENDPART QuadPart
#else
#define UINT unsigned int
#define ENDPART LowPart
#endif

int main(int argc, char* argv[])
{
    // Ask the OS to fix up alignment faults instead of raising exceptions.
    SetErrorMode(GetErrorMode() | SEM_NOALIGNMENTFAULTEXCEPT);
    UINT iters, offset;
    if(argc < 2)
        iters = 9000000;
    else
        iters = atoi(argv[1]);
    if(argc < 3)
        offset = 0;
    else
        offset = atoi(argv[2]);
    // Cast so that the arguments match the format on both 32-bit and 64-bit builds.
    printf("iters = %u, offset = %u\n", (unsigned int)iters, (unsigned int)offset);
    double *dest, *origsource;
    double *source;
    dest = new double[128];
    origsource = new double[150];
    // A nonzero byte offset makes source misaligned for double.
    source = (double *)((UINT)origsource + offset);
    printf("dest = %p source = %p\n", (void *)dest, (void *)source);
    LARGE_INTEGER startCount, endCount, freq;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&startCount);
    for (UINT x = 0; x < iters; x++)
        for(UINT i = 0; i < 128; ++i)
            dest[i] = source[i];
    QueryPerformanceCounter(&endCount);
    printf("elapsed time = %lf\nTo keep stuff from being optimized %lf\n",
        (double)(endCount.ENDPART-startCount.ENDPART)/freq.ENDPART,
        dest[75]);
    delete[] origsource;
    delete[] dest;
    return 0;
}

Compiler Support for Alignment

Sometimes, through explicit syntax, the
compiler can help with these alignment issues. In this section, we give a few extensions that you can use in the source code to either minimize the cost of unaligned access, or to help ensure aligned access.

__unaligned keyword

As we stated earlier, by default, the compiler will align data on their natural boundaries. Most of the time, this is sufficient, and there will not be a problem; however, there can be situations where an alignment issue will exist, with no clear way to work around it (or it would take too much effort to do so). When you, the programmer, can determine statically which variables might be accessed on unaligned boundaries, you can specify these variables as being unaligned, by using the __unaligned keyword (the macro equivalent is UNALIGNED).

This keyword is useful in that the compiler will insert the code to access the variable on an unaligned boundary, and it will not fault. It does this by inserting extra code that will finesse its way around the unaligned boundary, but this does not come without a trade-off. These extra instructions will slow your code down, plus increase the code size. Unfortunately, these extra instructions are generated even in places where it might be provable that the data is aligned! So use this keyword with care.

We can modify the program of Listing 4 by using the __unaligned keyword in a variable declaration. In this example, we change the declaration of source to the following:

__unaligned double *source;

This program will now run correctly on the Itaniums, even if you do not enable the operating system to fix up the alignment faults, although it will suffer some performance degradation. This is still better than having your program crash or suffer the severe performance penalty of the OS fix-up. (Keep in mind that, as noted earlier, the compiler inserts code to handle misaligned access, even where it is provable that the data is aligned.
The OS goes into its fix-up code only when an exception occurs, and these occur only when the misaligned access actually happens.)

In Figure 2, we have a chart that gives the running time on an Itanium 2 for the example program of Listing 4 when using various data access methods. The program executes fastest when the data is aligned, and the __unaligned keyword is not used. It runs next fastest when the data is aligned, but the __unaligned keyword is used. (Recall that if you use the __unaligned keyword, you pay a performance penalty, even if your data is aligned.) You run slightly slower if you use the __unaligned keyword on unaligned data. Lastly, you run much slower if you access unaligned data, but you have called SetErrorMode with SEM_NOALIGNMENTFAULTEXCEPT.

Note: In this chart, the y-axis is on a log10 scale.

Figure 2. Comparative runtimes of test program to illustrate effect of different types of accesses

__declspec(align(#))

So, we have dealt with the problem of a variable that you know is going to have unaligned access, but what about when you have a variable, and you would like it to be allocated on a boundary that is different from its natural boundary? For example, when using SSE2 instructions, you may want to align your operands on a 16-byte boundary, or you may want to align certain variables on cache-line boundaries. __declspec(align(#)) (where # is a power of two) is made for such purposes. In Listing 5, we give an example of its use.

Listing 5.
Code to demonstrate how __declspec(align(#)) works

#include <stdio.h>

class ClassA
{
public:
    char d1;
    __declspec(align(256)) char d2;
    double d3;
};

int main()
{
    __declspec(align(32)) double a;
    double b;
    __declspec(align(512)) char c;
    ClassA d;
    // The sample output below comes from a 32-bit build; %0x truncates
    // pointers on 64-bit targets, where %p should be used instead.
    printf("sizeof(a) = %d, address(a) = %0x\n", (int)sizeof(a), &a);
    printf("sizeof(b) = %d, address(b) = %0x\n", (int)sizeof(b), &b);
    printf("sizeof(c) = %d, address(c) = %0x\n", (int)sizeof(c), &c);
    printf("sizeof(d) = %d, address(d.d2) = %0x\n", (int)sizeof(d), &d.d2);
    return 0;
}

The output might look something like the following (taken from my computer):

sizeof(a) = 8, address(a) = 12fde0
sizeof(b) = 8, address(b) = 12fdd8
sizeof(c) = 1, address(c) = 12fa00
sizeof(d) = 512, address(d.d2) = 12f900

Note the sizeof of the class. The sizeof value for any structure/class is the offset of the final member, plus that member's size, rounded up to the nearest multiple of the largest member alignment value, or the whole structure/class alignment value, whichever is greater. (This definition is taken from MSDN's entry on align.)

The CRT and Intrinsics

__declspec(align) is a useful tool, but it cannot align dynamic data off of the heap. For this, the C runtime library (CRT) gives a set of aligned memory allocation routines. These are listed below (and come with <malloc.h>):
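As a minimal sketch of how routines of this kind are typically used, the wrapper below allocates a heap block on a requested boundary and releases it again. With Visual C++ it calls _aligned_malloc and _aligned_free from <malloc.h>. The other branch uses posix_memalign, which is our substitution for compilers that do not ship the Microsoft CRT and is not something the article's platforms provide. The names allocate_aligned, free_aligned, and is_aligned_to are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>
#if defined(_MSC_VER)
#include <malloc.h>   // _aligned_malloc, _aligned_free
#endif

// Allocates size bytes whose address is a multiple of alignment.
// alignment must be a power of two; the non-Microsoft branch also needs
// it to be a multiple of sizeof(void*). Returns a null pointer on failure.
void* allocate_aligned(std::size_t size, std::size_t alignment) {
#if defined(_MSC_VER)
    return _aligned_malloc(size, alignment);
#else
    void* block = nullptr;
    if (posix_memalign(&block, alignment, size) != 0)
        return nullptr;
    return block;
#endif
}

// Releases a block obtained from allocate_aligned. Memory returned by
// _aligned_malloc must be released with _aligned_free, not with free.
void free_aligned(void* block) {
#if defined(_MSC_VER)
    _aligned_free(block);
#else
    free(block);
#endif
}

// Returns true when p is a multiple of alignment.
bool is_aligned_to(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A caller that needs, say, a buffer on a 64-byte boundary would call allocate_aligned(size, 64), check the result for a null pointer, and later hand the same pointer to free_aligned.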
See Data Alignment on MSDN for more information on these routines.

One of the best ways to get performance is to use code that programmers have spent a lot of time tuning. The supplied CRT memory routines (strncpy, memcpy, memset, memmove, and so on) are a great example of this. The CRT routines are hand-coded routines (often in assembly) that are tuned to the particular architecture, and they align the source and destination so that, for large moves, the costs of the unaligned accesses are minimized. Alternatively, the user can use the /Oi flag or #pragma intrinsic to have the compiler expand these routines inline.

The IPF compiler will also use type information to assist in expanding the inline intrinsics. The compiler will examine the types of pointers to the source and destination addresses, and from this, it will infer the alignment of these addresses. If the pointer types are not correct, you might take an alignment exception, or the program will run slower (with the dreaded OS fix-ups).

In Listing 6, we give code to show the
effects of aligned versus unaligned accesses on code that uses the
compiler intrinsics for memcpy or the CRT assembly language
hand-tuned routines. To use the CRT assembly language hand-tuned
routines, make sure to insert the #pragma function directive.

Listing 6. Code to demonstrate the effect of intrinsic and CRT routines on aligned vs. unaligned accesses

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <windows.h>
#ifdef _WIN64
#define UINT unsigned __int64
#define ENDPART QuadPart
#else
#define UINT unsigned int
#define ENDPART LowPart
#endif
Figures 3 and 4 show the relative performance of each of the four configurations on memcpys of various sizes, on a Pentium III and an Itanium2 computer, respectively. We generated this data with the code from Listing 6, using the following parameters:

exename 1000000 size offset

where 8 ≤ size ≤ 4096 and 0 ≤ offset ≤ 1.

Figure 3. The time to perform a memcpy using aligned vs. unaligned data and CRT vs. intrinsic routines on a Pentium III

On the Pentium III, for aligned copies, it does not matter too much whether you use CRT or intrinsic. However, for large unaligned copies, using the CRT version is a big win.

On the Itanium2, we compare only the CRT versions, because the compiler almost always uses the CRT versions, even when the programmer specifies /Oi or #pragma intrinsic. In Figure 4, we compare unaligned versus aligned CRT calls. You can clearly see that using aligned data results in better performance. The lesson here is not subtle at all.

Figure 4. The time to perform a memcpy using aligned vs. unaligned data with CRT routines on an Itanium2

Some Quick Tips on How to Avoid Alignment Issues

If you are short on time, and just want a quick section to refer to, you have found the right place. Here are some quick tips to help deal with data alignment related issues:
What About Instruction Alignment?

Well, you are almost at the end of this article, and some of you may be wondering, "You've talked about data alignment, but what about instruction alignment? Aren't instructions also stored in memory?" The answer is, instruction alignment is also an issue, but it is not covered in this article, because most programmers do not have to deal with it at all. Instruction alignment is mostly an issue for compiler writers. The one type of general-purpose programmer who might still care about instruction alignment would be the assembly-language programmer, especially if he or she is not using an assembler.

Conclusion

Hopefully, you will now feel confident that you know the ins and outs of data alignment when you sit down to do Windows development. This article has covered how to avoid many data-alignment faults, what to do when they are inevitable, and the various costs associated with them. This knowledge will be useful for all Windows development, but it will prove especially useful when porting code from x86 to Itanium, where data alignment plays a front-and-center role. In the end, the result will be faster, more reliable code.
About the author

Kang Su Gatlin is a Program Manager at Microsoft in the Visual C++ group. He received his PhD from UC San Diego. His focus is on high-performance computation and optimization.