Optimisations – Flameeyes's Weblog

展开全文

Tag: Optimisations

The subtlety of modern CPUs, or the search for the phantom bug

Yesterday I have released a new version of unpaper which is now in Portage, even though is dependencies are not exactly straightforward after making it use libav. But when I packaged it, I realized that the tests were failing — but I have been sure to run the tests all the time while making changes to make sure not to break the algorithms which (as you may remember) I have not designed or written — I don’t really have enough math to figure out what’s going on with them. I was able to simplify a few things but I needed Luca’s help for the most part.

Turned out that the problem only happened when building with -O2 -march=native so I decided to restrict tests and look into it in the morning again. Indeed, on Excelsior, using -march=native would cause it to fail, but on my laptop (where I have been running the test after every single commit), it would not fail. Why? Furthermore, Luca was also reporting test failures on his laptop with OSX and clang, but I had not tested there to begin with.

A quick inspection of one of the failing tests’ outputs with vbindiff showed that the diffs would be quite minimal, one bit off at some non-obvious interval. It smelled like a very minimal change. After complaining on G+, Måns pushed me to the right direction: some instruction set that differs between the two.

My laptop uses the core-avx-i arch, while the server uses bdver1. They have different levels of SSE4 support – AMD having their own SSE4a implementation – and different extensions. I should probably have paid more attention here and noticed how the Bulldozer has FMA4 instructions, but I did not, it’ll show important later.

I decided to start disabling extensions in alphabetical order, mostly expecting the problem to be in AMD’s implementation of some instructions pending some microcode update. When I disabled AVX, the problem went away — AVX has essentially a new encoding of instructions, so enabling AVX causes all the instructions otherwise present in SSE to be re-encoded, and is a dependency for FMA4 instructions to be usable.

The problem was reducing the code enough to be able to figure out if the problem was a bug in the code, in the compiler, in the CPU or just in the assumptions. Given that unpaper is over five thousands lines of code and comments, I needed to reduce it a lot. Luckily, there are ways around it.

The first step is to look in which part of the code the problem appears. Luckily unpaper is designed with a bunch of functions that run one after the other. I started disabling filters and masks and I was able to limit the problem to the deskewing code — which is when most of the problems happened before.

But even the deskewing code is a lot — and it depends on at least some part of the general processing to be run, including loading the file and converting it to an AVFrame structure. I decided to try to reduce the code to a standalone unit calling into the full deskewing code. But when I copied over and looked at how much code was involved, between the skew detection and the actual rotation, it was still a lot. I decided to start looking with gdb to figure out which of the two halves was misbehaving.

The interface between the two halves is well-defined: the first return the detected skew, and the latter takes the rotation to apply (the negative value to what the first returned) and the image to apply it to. It’s easy. A quick look through gdb on the call to rotate() in both a working and failing setup told me that the returned value from the first half matched perfectly, this is great because it meant that the surface to inspect was heavily reduced.

Since I did not want to have to test all the code to load the file from disk and decode it into a RAW representation, I looked into the gdb manual and found the dump commands that allows you to dump part of the process’s memory into a file. I dumped the AVFrame::data content, and decided to use that as an input. At first I decided to just compile it into the binary (you only need to use xxd -i to generate C code that declares the whole binary file as a byte array) but it turns out that GCC is not designed to compile efficiently a 17MB binary blob passed in as a byte array. I then opted in for just opening the raw binary file and fread() it into the AVFrame object.

My original plan involved using creduce to find the minimal set of code needed to trigger the problem, but it was tricky, especially when trying to match a complete file output to the md5. I decided to proceed with the reduction manually, starting from all the conditional for pixel formats that were not exercised… and then I realized that I could split again the code in two operations. Indeed while the main interface is only rotate(), there were two logical parts of the code in use, one translating the coordinates before-and-after the rotation, and the interpolation code that would read the old pixels and write the new ones. This latter part also depended on all the code to set the pixel in place starting from its components.

By writing as output the calls to the interpolation function, I was able to restrict the issue to the coordinate translation code, rather than the interpolation one, which made it much better: the reduced test case went down to a handful of lines:

void rotate(const float radians, AVFrame *source, AVFrame *target) {
    const int w = source->width;
    const int h = source->height;

    // create 2D rotation matrix
    const float sinval = sinf(radians);
    const float cosval = cosf(radians);
    const float midX = w / 2.0f;
    const float midY = h / 2.0f;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const float srcX = midX + (x - midX) * cosval + (y - midY) * sinval;
            const float srcY = midY + (y - midY) * cosval - (x - midX) * sinval;
            externalCall(srcX, srcY);
        }
    }
}

Here externalCall being a simple function to extrapolate the values, the only thing it does is printing them on the standard error stream. In this version there is still reference to the input and output AVFrame objects, but as you can notice there is no usage of them, which means that now the testcase is self-contained and does not require any input or output file.

Much better but still too much code to go through. The inner loop over x was simple to remove, just hardwire it to zero and the compiler still was able to reproduce the problem, but if I hardwired y to zero, then the compiler would trigger constant propagation and just pre-calculate the right value, whether or not AVX was in use.

At this point I was able to execute creduce; I only needed to check for the first line of the output to match the “incorrect” version, and no input was requested (the radians value was fixed). Unfortunately it turns out that using creduce with loops is not a great idea, because it is well possible for it to reduce away the y++ statement or the y < h comparison for exit, and then you’re in trouble. Indeed it got stuck multiple times in infinite loops on my code.

But it did help a little bit to simplify the calculation. And with again a lot of help by Måns on making sure that the sinf()/cosf() functions would not return different values – they don’t, also they are actually collapsed by the compiler to a single call to sincosf(), so you don’t have to write ugly code to leverage it! – I brought down the code to

extern void externCall(float);
extern float sinrotation();
extern float cosrotation();

static const float midX = 850.5f;
static const float midY = 1753.5f;

void main() {
    const float srcX = midX * cosrotation() - midY * sinrotation();
    externCall(srcX);
}

No external libraries, not even libm. The external functions are in a separate source file, and beside providing fixed values for sine and cosine, the externCall() function only calls printf() with the provided value. Oh if you’re curious, the radians parameter became 0.6f, because 0, 1 and 0.5 would not trigger the behaviour, but 0.6 which is the truncated version of the actual parameter coming from the test file, would.

Checking the generated assembly code for the function then pointed out the problem, at least to Måns who actually knows Intel assembly. Here follows a diff of the code above, built with -march=bdver1 and with -march=bdver1 -mno-fma4 — because turns out the instruction causing the problem is not an AVX one but an FMA4, more on that after the diff.

        movq    -8(%rbp), %rax
        xorq    %fs:40, %rax
        jne     .L6
-       vmovss  -20(%rbp), %xmm2
-       vmulss  .LC1(%rip), %xmm0, %xmm0
-       vmulss  .LC0(%rip), %xmm2, %xmm1
+       vmulss  .LC1(%rip), %xmm0, %xmm0
+       vmovss  -20(%rbp), %xmm1
+       vfmsubss        %xmm0, .LC0(%rip), %xmm1, %xmm0
        leave
        .cfi_remember_state
        .cfi_def_cfa 7, 8
-       vsubss  %xmm0, %xmm1, %xmm0
        jmp     externCall@PLT
 .L6:
        .cfi_restore_state

It’s interesting that it’s changing the order of the instructions as well, as well as the constants — for this diff I have manually swapped .LC0 and .LC1 on one side of the diff, as they would just end up with different names due to instruction ordering.

As you can see, the FMA4 version has one instruction less: vfmsubss replaces both one of the vmulss and the one vsubss instruction. vfmsubss is a FMA4 instruction that performs a Fused Multiply and Subtract operation — midX * cosrotation() - midY * sinrotation() indeed has a multiply and subtract!

Originally, since I was disabling the whole AVX instruction set, all the vmulss instructions would end up replaced by mulss which is the SSE version of the same instruction. But when I realized that the missing correspondence was vfmsubss and I googled for it, it was obvious that FMA4 was the culprit, not the whole AVX.

Great, but how does that explain the failure on Luca’s laptop? He’s not so crazy so use an AMD laptop — nobody would be! Well, turns out that Intel also have their Fused Multiply-Add instruction set, just only with three operands rather than four, starting from Haswell CPUs, which include… Luca’s laptop. A quick check on my NUC which also has a Haswell CPU confirms that the problem exists also for the core-avx2 architecture, even though the code diff is slightly less obvious:

        movq    -24(%rbp), %rax
        xorq    %fs:40, %rax
        jne     .L6
-       vmulss  .LC1(%rip), %xmm0, %xmm0
-       vmovd   %ebx, %xmm2
-       vmulss  .LC0(%rip), %xmm2, %xmm1
+       vmulss  .LC1(%rip), %xmm0, %xmm0
+       vmovd   %ebx, %xmm1
+       vfmsub132ss     .LC0(%rip), %xmm0, %xmm1
        addq    $24, %rsp
+       vmovaps %xmm1, %xmm0
        popq    %rbx
-       vsubss  %xmm0, %xmm1, %xmm0
        popq    %rbp
        .cfi_remember_state
        .cfi_def_cfa 7, 8

Once again I swapped .LC0 and .LC1 afterwards for consistency.

The main difference here is that the instruction for fused multiply-subtract is vfmsub132ss and a vmovaps is involved as well. If I read the documentation correctly this is because it stores the result in %xmm1 but needs to move it to %xmm0 to pass it to the external function. I’m not enough of an expert to tell whether gcc is doing extra work here.

So why is this instruction causing problems? Well, Måns knew and pointed out that the result is now more precise, thus I should not work it around. Wikipedia, as linked before, points also out why this happens:

A fused multiply–add is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b×c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire sum a+b×c to its full precision before rounding the final result down to N significant bits.

Unfortunately this does mean that we can’t have bitexactness of images for CPUs that implement fused operations. Which means my current test harness is not good, as it compares the MD5 of the output with the golden output from the original test. My probable next move is to use cmp to count how many bytes differ from the “golden” output (the version without optimisations in use), and if the number is low, like less than 1‰, accept it as valid. It’s probably not ideal and could lead to further variation in output, but it might be a good start.

Optimally, as I said a long time ago I would like to use a tool like pdiff to tell whether there is actual changes in the pixels, and identify things like 1-pixel translation to any direction, which would be harmless… but until I can figure something out, it’ll be an imperfect testsuite anyway.

A huge thanks to Måns for the immense help, without him I wouldn’t have figured it out so quickly.

Some _FORTIFY_SOURCE far-fetched warnings are… funny!

So the new Portage releases finally kills when sure-enough overflows are built and that’s for me good news: we’ll finally get to fix these problems rather than ignoring them until they hit a security issue.

But then, Kumba reported a failure in libtirpc, which I was the last one to play with, to fix NFS with Kerberos — as you can understand, it was something I felt I had to fix myself, since I might as well have been the cause.

Interestingly, the problem I could reproduce on my user system, but not on the tinderbox; but that was enough for me to have an idea where to start, so I finally was able to reduce the problem to this synthetic test case out of the libtirpc 0.2.1 sources:

#include 
#include 
#include 

int main(int argc, char *argv[]) {
  struct sockaddr_in addr = {
    .sin_family = AF_INET,
    .sin_port = 0
  };
  struct hostent *hp;

  if ((hp = gethostbyname(argv[1])) == NULL)
    return -1;

  if (hp->h_length > sizeof(addr))
    hp->h_length = sizeof(addr);

  memcpy(&addr.sin_addr.s_addr, hp->h_addr, (size_t)hp->h_length);

  return 0;
}

Interestingly enough, if you compile it with the minimum flags required for _FORTIFY_SOURCE diagnostics to be emitted – gcc -O2 -Wall test.c – you won’t find any warning in your output; there isn’t a sure-fire overflow in that case. So let’s look deeper at the code.

On the highest level you can look at the code, there just isn’t space for any overflow: we’re calling gethostbyname() which happens to only return IPv4 addresses, which are always 32-bit; so the h_length member of hostent is never going to be anything else than 4 (size in bytes of the address). So the code is, by any mean, safe.

If you look at the actual code much harder, you can notice that there is an extra failsafe: in any case you’re going to copy at most sizeof(addr) bytes… and here’s the catch; addr is a sockaddr structure, and as all of the other sockaddr structures, start with a single, 16-bit value indicating the actual address family. Additionally, what you’re actually going to copy is the IP address itself, (addr.sin_addr.s_addr which is neither the first nor the second member of addr, but it’s following the actual port. This does not only mean that the offset of s_addr is 8, but also that the size of the complete structure is 8, not 4, so the example code above can be rewritten as:

#include 
#include 
#include 

int main(int argc, char *argv[]) {
  struct sockaddr_in addr = {
    .sin_family = AF_INET,
    .sin_port = 0
  };
  struct hostent *hp;

  if ((hp = gethostbyname(argv[1])) == NULL)
    return -1;

  memcpy(&addr.sin_addr.s_addr, hp->h_addr, MAX(hp->h_length, 8));

  return 0;
}

It should be now much easier to identify where the problem is: you can have hp->h_length to be 4 (or less — more to follow) or it is between 5 and 8 inclusive. But once again, GCC is not complaining it with -O2, because it is not surely an overflow for the compiler, as the value is actually variable: it can be included in either the 0-8 range, or between 0x80000000-0xFFFFFFFF (the negative range). What is causing it to understand there is a dangerous situation? An old friend of mine, actually: -ftracer.

The description from the man page is not really something you look out for, but here it is:

Perform tail duplication to enlarge superblock size. This transformation simplifies the control flow of the function allowing other optimizations to do better job.

What it means is that it somehow allows different branches to duplicate their code, allowing for further optimisations of corner cases. I have reported before that this was causing a build failure in MPlayer last year, because a segment of handwritten inline asm was using global labels rather than local labels, and while it was present on an already-joined branch, gcc -ftracer was splitting it in two to optimise it better; if a single copy of the asm was there, the global labels worked just as fine, but once it was inlined twice, you’d have a conflict of labels. It was trivial to fix but took a while to track it down.

Now, thanks to the help of -S -O2 -ftracer -fverbose-asm I could tell what was going on in the code by looking at the generated asm; if I try to translate that back to C code (you don’t really want to look at x86 asm code when trying to track down a logic bug!), the result would be something very similar to this (it would be closer if I used goto instructions, but I’d rather not):

#include 
#include 
#include 

int main(int argc, char *argv[]) {
  struct sockaddr_in addr = {
    .sin_family = AF_INET,
    .sin_port = 0
  };
  struct hostent *hp;

  if ((hp = gethostbyname(argv[1])) == NULL)
    return -1;

  if (hp->h_length >= sizeof(addr))
    memcpy(&addr.sin_addr.s_addr, hp->h_addr, sizeof(addr));
  else
    memcpy(&addr.sin_addr.s_addr, hp->h_addr, (size_t)hp->h_length);

  return 0;
}

And indeed, if you build that code with gcc -O2 -Wall, it will report the same overflow warning.

My first look at fixing this was only fixing the obvious problem: it should have compared h_length with addr.sin_addr.s_addr rather than just addr, so it would compare to 4, not 8… and that’s what I first committed; then I actually thought a bit more at the code, and noted that there are quite a few problems with code such as that:

if the address is reported as IPv6 (although I admit I’m not sure if gethostbyname() does that at all!), there is no way that the such-reduced address works at all; for my own IPv6 subnet, it would report 2002:5e5f which is not even telling my full IPv4 address (it is a 6-to-4 mapping);
if the address has a size shorter than 32-bit, it’ll cause an overflow on the other side, as more data is copied from the hostent array than is present — do note that it it’s not a security problem or a possible buffer overflow, as the returned array is NULL-terminated, so it either reads garbage from the following entries, or it reads the 0-bytes of the NULL termination; any recent architecture has 32-bit sized NULLs as far as I know;
everything can break if the hostent result is actually providing a negative length; in such case you’d be reading a huge number of bytes, and causing a segfault at one point or another; I don’t think that case is going to happen ever though, so it shouldn’t be of concern.

Given these details, I updated the patch during the same sync cycle, so that if the length is not the same between the hostent and addr.sin_addr.s_addr, the getrpcport() call will simply fail.

Is the bug exploitable? I don’t think so but I wouldn’t bet my life on it.

I’m not extremely sure if it’s the client or the server to call getrpcport(), but it is going to ask the question to a nameserver and, from there, it means that you’re working on untrusted data. If gethostbyname() reports AF_INET6 hosts as well, an attacker could rewrite the four bytes below the addr object; if the code is built with optimisations turned on, that’s the only object on the stack, and I have sincerely no idea whether it’s the parameters or the return value that’s present at that point on the stack; if it’s the parameters, it shouldn’t really matter at all, return value might be a bit trickier, but on 64-bit systems it wouldn’t be able to do much good beside possibly causing either server or client to crash by jumping on an non-executable memory area.

At any rate, I hope you now have a bit more idea of why I’m interested that this stuff is actually looked at properly, and not simply ignored.

Mailbox: does GCC miscompile at -O0?

Jeremy asks via email:

I’m curious about the mention of gcc -O0 producing broken code in this article: do you have a reference to a particular bug for that? I think the article would benefit from being more specific about which versions of gcc produce broken code for -O0, otherwise readers will be scared off from using it for ever.

What he refers to is the Gentoo backtracing guide I wrote that gives some indications about using Portage features and ebuild support to get proper backtraces (stack traces), and how to avoid debug code instead.

If you really need a precise backtrace, you should use -O1 (using the optimisations of level 0 is not recommended as there are known issues with gcc building invalid code).

I’ll update the guide today if I can find the time, so you’ll probably find a more correct phrase up there, soonish.

The issue here is actually very complex, which is why I think a reply on the blog is worth the hassle. The first note here to do is GCC itself does not really miscompile with -O0 optimisations. The problems lie in the code of the software itself, which makes it much more difficult to deal with (when the problem is in the compiler you can isolate the problem and wait till it’s fixed to make sure you can use the feature; if the problem is in the code that gets compiled, then we cannot really be sure it all works, not unless you build and test everything and even then it might not be easy.

Anyway, -O0 does a little more than just disabling some optimisations, it changes a lot in the way code is handled, first of all by changing the preprocessor definitions:

flame@yamato ~ % diff -u <(echo | gcc -O0 -dM -E -) <(echo | gcc -O1 -dM -E -)
--- /proc/self/fd/11    2010-06-14 02:59:32.973351250 +0200
+++ /proc/self/fd/12    2010-06-14 02:59:32.974351316 +0200
@@ -31,6 +31,7 @@
 #define __UINTMAX_TYPE__ long unsigned int
 #define __linux 1
 #define __DEC32_EPSILON__ 1E-6DF
+#define __OPTIMIZE__ 1
 #define __unix 1
 #define __UINT32_MAX__ 4294967295U
 #define __LDBL_MAX_EXP__ 16384
@@ -89,7 +90,6 @@
 #define __UINT16_MAX__ 65535
 #define __DBL_HAS_DENORM__ 1
 #define __UINT8_TYPE__ unsigned char
-#define __NO_INLINE__ 1
 #define __FLT_MANT_DIG__ 24
 #define __VERSION__ "4.5.0"
 #define __UINT64_C(c) c ## UL

Beside showing the second big problem that I’ll talk about (disabling inline functions), it should remind you of my corner case testing from two years ago: disabling optimisation explicitly causes headers to change as they hide or show different interfaces, “proper” or “fast”. As you might guess already, this makes it slightly harder to actually track down a crash, or get an absolute stack trace, because it’s the (already pre-processed) code that changes itself; the crash might very well be only happening in the codepath taken when the optimised functions are used.

This gets combined with no inlining of functions: some of the system headers will replace definitions of some functions with inline wrappers when it is possible. These inlines often are faster, but they also take care of adding the warnings with _FORTIFY_SOURCE (the feature of recent GCC and GLIBC that adds warnings at build-time for probably-wrong behaviour of source code, ignoring return values, providing wrong parameters, or causing buffer overflows).

The two combined make using the system library (GLIBC) quite different between optimised and non-optimised builds; it’s almost like having two different basic C libraries, as you might guess, it’s important, to know where something’s crashing, to use the same C library to get the trace. In quite a few cases, rebuilding with -O0 can stop something from crashing, but it’s almost never caused by the compiler miscompiling something, it’ s usually due to the code only working with the “proper” interfaces, but failing when using “less proper” interfaces that are faster, but rely on the rules to be followed literally.

One example of this are the functions ntohs and ntohl: they are present as functions in the C library but they are also present as macros in the GLIBC includes, so that they can be expanded at build-time (especially when used on constants). This causes quite some change on the handling of temporary variables, if you play with pointers and type-punning (very long story now) so the behaviour of the ntohs or ntohl calls differs, and upstream is likely to have tested only one behaviour, rather than both.

Finally, there are possible build failures tied with the use of -O0, because, among other things, the Dead Code Elimination pass is hindered. Dead Code Elimination is a safe optimisation, so it’s executed even at -O0, its objective is to remove code that is never going to be executed, like an if (0) code. Using if (0) rather than adding #ifdef all over the code is a decent way to make sure that the syntax of the code is correct, even for those parts that are not enabled (as long as you have declarations of functions and the like). Unfortunately, without optimisations enabled, DCE will only work over explicit if (0) blocks, and not indirect ones. So for instance if you got code like this:

int bar() {
#if ENABLE_FOO
  int a = is_foo_enabled_runtime();
#else
  int a = 0;
#endif

  if ( a )
     process_foo_parameters();

  process_parameters();

  if ( a )
   foo();
  else
   not_foo();
}

DCE will only drop the if (a) branches when constant propagation is also applied, and will produce code with undefined references to process_foo_parameters() and foo() when -O0 is used for the build. This trick is often used in the wild to write code that is readable and validated for syntax even when features are not enabled, without using a huge quantity of #ifdef statements. And as I just shown, it can easily break when built with -O0. Among the others, FFmpeg uses this kind of code, so it cannot really be compiled at an absolute -O0.

Note: this mostly applies to GCC, as other compilers might do constant propagation even at -O0 given that it is a fairly stable and safe optimisation. Clang, though, behaves the same way.

While it can be argued that software shouldn’t rely on specific features of the C compilers but just on the C language specification, it is quite impossible for us to vouch that all the software out there will build and work properly when built with no optimisation at all, for the reasons I just explained. Gentoo of course is ready to accept bugs for those packages, but don’t expect most of us to go out of our way to fix it in the tree if you don’t provide a patch; alternatively, talk with upstream and get them to fix their code, or work it around somehow. If you find a package that fails to build or (worse) fail to work properly when built with -O0, please open a bug against the tracker after tracking down where the issue here. If you give us a patch we’ll likely merge it soon, otherwise it would still be feasible to track down the packages with problems.

Upstream, rice it down!

While Gentoo often gets a bad name because of the so-called ricers and upstream developer complains that we allow users to shoot themselves in the foot by setting CFLAGS as they please, it has to be said that not all upstream projects are good in that regard. For instance, there are a number of projects that, unless you enable debug support, will force you to optimise (or even over-optimise) the code, which is obviously not the best of ideas (this does not count in things like FFmpeg that rely on Dead Code Elimination to link properly — in those cases we should be even more careful but let’s leave it alone for now).

Now, what is the problem with forcing optimisation for non-debug builds? Well, sometimes you might not want to have debug support (extra verbosity, assertions, …) but you might still want to be able to fetch a proper backtrace; in such cases you have a non-debug build that needs to turn down optimisations. Why should I be forced to optimise? Most of the time, I shouldn’t.

Over-optimisation is even nastier: when upstream forces stuff like -O3, they might not even understand that the code might easily slow down further. Why is that? Well one of the reasons is -funroll-loops: declaring all loops to be slower than unrolled code is an over-generalisation that you cannot pretend to keep up with, if you have a minimum of CPU theory in mind. Sure, the loop instructions have an higher overhead than just pushing the instruction pointer further, but unrolled loops (especially when they are pretty complex) become CPU cache-hungry; where a loop might stay hot within the cache for many iterations, an unrolled version will most likely require more than a couple of fetch operations from memory.

Now, to be honest, this was much more of an issue with the first x86-64 capable processors, because of their risible cache size (it was vaguely equivalent to the cache available for the equivalent 32-bit only CPUs, but with code that almost literally doubled its size). This was the reason why some software, depending on a series of factors, ended up being faster when compiled with -Os rather than -O2 (optimise for size, the code size decreases and it uses less CPU cache).

At any rate, -O3 is not something I’m very comfortable to work with; while I agree with Mark that we shouldn’t filter or exclude compiler flags (unless they are deemed experimental, as is the case for graphite) based on compiler bugs – they should be fixed – I also would prefer avoiding to hit those bugs in production systems. And since -O3 is much more likely to hit them, I’d rather stay the hell away from it. Jesting about that, yesterday I produced a simple hack for the GCC spec files:

flame@yamato gcc-specs % diff -u orig.specs frigging.specs
--- orig.specs  2010-04-14 12:54:48.182290183 +0200
+++ frigging.specs  2010-04-14 13:00:48.426540173 +0200
@@ -33,7 +33,7 @@
 %(cc1_cpu) %{profile:-p}

 *cc1_options:
-%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}} %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*} %{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}%{!c:%{!S:-auxbase %b}} %{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs} %{v:-version} %{pg:-p} %{p} %{f*} %{undef} %{Qn:-fno-ident} %{--help:--help} %{--target-help:--target-help} %{--help=*:--help=%(VALUE)} %{!fsyntax-only:%{S:%W{o*}%{!o*:-o %b.s}}} %{fsyntax-only:-o %j} %{-param*} %{fmudflap|fmudflapth:-fno-builtin -fno-merge-constants} %{coverage:-fprofile-arcs -ftest-coverage}
+%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}} %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*} %{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}%{!c:%{!S:-auxbase %b}} %{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs} %{v:-version} %{pg:-p} %{p} %{f*} %{undef} %{Qn:-fno-ident} %{--help:--help} %{--target-help:--target-help} %{--help=*:--help=%(VALUE)} %{!fsyntax-only:%{S:%W{o*}%{!o*:-o %b.s}}} %{fsyntax-only:-o %j} %{-param*} %{fmudflap|fmudflapth:-fno-builtin -fno-merge-constants} %{coverage:-fprofile-arcs -ftest-coverage} %{O3:%eYou're frigging kidding me, right?} %{O4:%eIt's a joke, isn't it?} %{O9:%eOh no, you didn't!}

 *cc1plus:

flame@yamato gcc-specs % gcc -O2 hellow.c -o hellow; echo $?   
0
flame@yamato gcc-specs % gcc -O3 hellow.c -o hellow; echo $?
gcc: You're frigging kidding me, right?
1
flame@yamato gcc-specs % gcc -O4 hellow.c -o hellow; echo $?
gcc: It's a joke, isn't it?
1
flame@yamato gcc-specs % gcc -O9 hellow.c -o hellow; echo $?
gcc: Oh no, you didn't!
1
flame@yamato gcc-specs % gcc -O9 -O2 hellow.c -o hellow; echo $?
0

Of course, there is no way I could put this in production as it is. While the spec files allow enough flexibility to hit the case for the latest optimisation level (the one that is actually applied), rather than for any parameter passed, they lack an “emit warning” instruction, the instruction above, as you can see from the value of $? is “error out”. While I could get it running in the tinderbox, it would probably produce so much noise and for failing packages that I’d spend each day just trying to find why something failed.

But if somebody feels like giving it a try, it would be nice to ask the various upstream to rice it down themselves, rather than always being labelled as the ricer-distribution.

P.S.: building with no optimisation at all may cause problems; in part because of reliance on features such as DCE, as stated above, and as used by FFmpeg; in part because headers, including system headers might change behaviour and cause the packages to fail.

Garbage-collecting sections is not for production

Some time ago I wrote about using --gc-sections to avoid unused functions to creep into final code. Today instead I’d like to show how that can be quite a problem if it was used indiscriminately.

I’m still using at least for some projects the diagnostic of --gc-sections to identify stuff that is unused but still kept around. Today I noticed one bad thing with it and pulseaudio:

/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.2/../../../../x86_64-pc-linux-gnu/bin/ld: Removing unused section '.data.__padsp_disabled__' in file 'pulseaudio-main.o'

The __padsp_disabled__ symbol is declared in main.c to avoid pulseaudio’s access to OSS devices to be wrapped around by the padsp script. When I first have seen this, I thought the problem was a missing #ifdef directive: if I didn’t ask for the wrapper, it might still have declared the (unused) symbol. That was not the case.

Looking at the code, I found what the problem was: the symbol is (obviously) never used by pulseaudio itself; it is, rather, checked through dlsym() by the DSP wrapper library. For this reason, for compiler and linker, the symbol looks pretty much unused, and when asking for it to be dropped explicitly, it is. Since the symbol is loaded via the runtime linker, neither building nor executing pulseaudio will have any problem. And indeed, the only problem would be when running pulseaudio as a children of padsp, and using the OSS output module (so not on most Linux systems).

This shows how just using -fdata-sections -ffunction-sections -Wl,--gc-sections is not safe at all and why you shouldn’t get excited about GCC and ld optimisations without understanding how they work in detail.

In particular, even I thought that it would be easier to work around than it actually seem to be: while GCC provides a used attribute that allows to declare a variable (or a function) as used even though the compiler can’t tell that by itself (it’s often used together with inline hand-written ASM the compiler doesn’t check for), this does not propagate to the linker, so it won’t save the section from being emitted. The only solution I can think of is adding one instruction that sets the variable to itself, but that’s probably going to be optimised away. Or giving a way for gcc to explicit that the section is used.

Does as needed link make software faster?

Christian Ruppert (idl0r) asked me today whether --as-needed makes software faster or smaller. I guess this is one of the most confusing points about --as-needed; focus of tons of hearsay, and with different assertions all around the place. So I’ll do my best to explain this once and for all.

In perfect conditions, that is, if --as-needed were not doing anything at all, then it wouldn’t be changing anything. The flag is not magical, it does not optimise the output at all. The same exact result you would have if libtool wasn’t pushing all crap down the line, and if all the build systems only requested the correct dependencies.

When it does matter is when overlink is present. To understand what the term overlink refers to check my old post that explains a bit what --as-needed does, and shows the overlink case, the perfect link case, and what really happens.

Now, of course you’ll find reports of users saying that --as-needed makes software faster or smaller. Is this true, or false? It’s not easy to answer one straight answer because it depends on what it’s happening with and without the flag. If with the flag there are libraries loaded, directly and indirectly, that are not used (neither directly nor indirectly), then the process spawned from the executable will have less libraries loaded in the address space, and thus both be faster to load (no need to read, map in memory, and relocate those libraries) and smaller in memory (some libraries are “free” in memory, but most will have relocations, especially if immediate bindings (“now” bindings) are used, like happens for setuid executables.

Indeed, the biggest improvements you can have when comparing the with and without cases in a system, or in software, that uses immediate bindings. In that case, all the symbols from shared objects are bound at load, instead than at runtime, so the startup time for the processes are cut down sensibly. This does not only involve hardened systems, or setuid binaries, but also applications using plugins, that may be requesting immediate bindings (to reject the plugin, rather than aborting at runtime, in case of missing symbols).

For A Parallel World. Home Exercise n.1: a drop-in dynamic replacement for memcpy()

Since I’ve written about OpenMP I’ve been trying to find time to test it on real usage scenarios; unfortunately between health being far from optimal the past few days with general aches, and work piling up, I haven’t been able to get to work on it at all. It would be much nicer if I could get a job that would allow me to spend time on these things, but I don’t want to rant about that since I’m happy to have jobs from time to time, as it is.

Yesterday I toyed a bit around with OpenMP and xine-lib, I wanted to try implementing a memcpy() replacement that could use OpenMP to work in parallel, it’s especially useful for multimedia applications. Beside some issues with autotools and OpenMP which I’m going to address in a future post, I ended up with a few more things in my mind (the usual problem with trying out new things: you want to achieve one result, and you get material for other three tests; now I know why Mythbusters starts with one idea and then end up doing four or five similar tests).

My parallel memcpy() replacement was just as lame as my byteswapping attempt from before, a single for parallelised with the proper pragma code. Just to make it not too lame, I used 64-bit copies (unaligned, but I would expect that not to matter on x86-64 at least, it was just a test). The reason why I didn’t go for a less lame method is that from a second test on byteswapping, which I didn’t have time to write about yet, I found that using more complex tricks does not really help. While splitting the memory area to swap in X equally-sized areas, with X being the maximum number of threads OpenMP is managing, identified dynamically, caused a slight improvement on the huge areas (half a gigabyte and a full gigabyte), it didn’t make any acceptable difference (considering the cost of the more complex code) on smaller blocks, and I really doubt that such huge memory areas would ever go swapped all at once. Splitting the area in page-sized (4KiB) blocks actually made the code slower, I guess, since I didn’t go deeper to check, that the problem there is that the threads are usually all executed on either the same core or on anyway on the same CPU, which means that the pages are all mapped on the memory directly connected to that CPU; splitting it up in pages might make it swap between the two different CPUs and thus make it slower. I’ll look more deeply into that when I have time.

Unfortunately, using this particular memcpy() implementation didn’t let me start xine properly, I probably made a mistake, maybe unaligned 64-bit copies on x86-64 don’t work, just like on any other architecture, but I didn’t go around trying to fix that for the very same reason why I’m writing this post.

It turns out that xine, just like MPlayer, and other multimedia application, have their own implementation of “fast memcpy()”, using SIMD instructions (MMX, MMXEXT, SSE, AltiVec, …). They benchmark at runtime which one has the best result (on my system it’s either the Linux kernel implementation, not sure which version, or the MMX version), and then they use that. This has some problems that are obvious, and some that are much less obvious. The first problem is that the various implementations do have to take care of some similar issues which cause code duplication (handling of small memory area copies, handling of unaligned copies and so on). The second is much more subtle and it’s what I think is a main issue to be solved.

When a programmer in a C program uses functions like memcpy(), strlen() and others, the compilation process (with optimisations) will hit some particular code called “builtins”. Basically the compiler knows how to deal with it, and emits different machine code depending on the way the function is called. This usually happens when the parameters to the call are known at build time, because they are either constant or can be derived (for static functions) from the way the function is called. How this affects mathematical functions and functions like strlen() can be better understood reading an old article of mine; for what concerns memcpy(), I’ll try to be brief but explain it here.

Let’s take a very easy function that copies an opaque type that is, in all truth, a 64-bit data field:

#include 

void copy_opaque(void *to, void *from) {
  memcpy(to, from, 8);
}

Building this code on x86-64 with GCC 4.3 and no optimisation enabled will produce this code:

copy_opaque:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $16, %rsp
        movq    %rdi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movq    -16(%rbp), %rsi
        movq    -8(%rbp), %rdi
        movl    $8, %edx
        call    memcpy
        leave
        ret

As you can see there is a call to memcpy() after setting up the parameters, just like one would expect. But turn on the optimisation with -O2 and the resulting code is quite different:

copy_opaque:
        movq    (%rsi), %rax
        movq    %rax, (%rdi)
        ret

The function has been reduced to two instructions, plus the return, with no stack usage. This because the compiler knows that for 64-bit copies, it can just emit straight memory access and simplify the code quite a bit. The memcpy() function is not a static inline, but the compiler knows its interface and can produce optimised code just fine when using builtin. Similarly, when using -O2 -fno-builtin to ask the compiler not to use builtins knowledge, for instance because you’re using special access functions, you can see that the resulting code is still composed of two instructions, but of a different type:

copy_opaque:
        movl    $8, %edx
        jmp     memcpy

Let’s go back to the builtin though, since that’s what it’s important to know before I can explain why the dynamically-chosen implementation in xine and company is quite suboptimal.

When you change the size of the memory area to copy in copy_opaque() from 8 to a different constant, you can see that the code changes accordingly. If you use a number that is not a multiple of 8 (that is the biggest size that x86-64 can deal with without SIMD), you can see that the “tail” of the area is copied using smaller move operations, but it’s still expanded. If you compare the output with multiple power-of-two values, you can see that up to 128 it inlines multiple movq instructions, while starting with 256, it uses rep movsq. With very big values, like (1 << 20), the compiler emits a straight memcpy() call. This is because the compiler can assess the overhead of the call and decide when it’s big enough to use a function rather than inlining code.

It can also decide this based on what type of optimisation is requested, for instance I said above that rep movsq starts to get used after the value 256 (1 << 8), but that was intended with the -O2 level; with -Os, it’s already when you have more than two 64-bit words.

Since the library functions like memcpy() and similar are very widely used, the fact that the compiler can emit much simpler code is very useful. But this works just as long as the compiler knows about them. As I said, turning off the builtin replacement will cause the code to be compiled “literally” with a call to the function, which might have a much higher overhead than a straight copy. Now it might be quite easier to grasp what the problem is with the dynamic memcpy() replacement used by xine and other software.

Let’s change the code above to something like this:

#include 

extern void *fast_memcpy(void *to, void *from, size_t foo);

void copy_opaque(void *to, void *from, size_t foo) {
  fast_memcpy(to, from, 8);
}

Now, even turning on the optimisations, won’t make any difference, the compiler will always emit a call to memcpy():

copy_opaque:
        movl    $8, %edx
        jmp     fast_memcpy

As you might guess, this is certainly slower than the straight copy that we had before, even if the memcpy() replacement is blazing fast. The jump will also require a symbol resolution since fast_memcpy() is not statically defined, so it’ll have to pass through the PLT (Procedure Linking Table) which is an expensive operation. Even if the symbol were defined internally to the same library, this would still most likely cause a call to the GOT (Global Offset Table) for shared objects.

By redefining the memcpy() function, xine and others are actually slowing the code down, at least when the size of the copy is known, a constant at build time. GCC extensions actually allow to define a macro, or even better a static inline function, that can discern whether a compile-time constant is used, and then fall back to the original memcpy() call, which the compiler will mangle as it prefers, but this is quite complex, and in my opinion not worth the bother.

Why do I say this? Well the first issue is that sometimes even if a value is not properly a constant at build-time, the compiler can find some particular code path where the function can be replaced, and thus emit adaptive code. The second is that you might just as well always use properly optimised memcpy() functions when needed, and if the C library does not provide anything as fast, you just need to use the ~~Force~~ ELF.

When the C library does not provide functions optimised for your architecture, for compatibility or any other reason, you can try to replace them through the ELF feature called symbol interposing, which basically work in the same as symbol collisions (I have some “slides” I’ve been working on for a while on the subject, but I’ll talk more extensively about this in a future post), and allows to intercept or replace calls to C library functions. It’s the same method used to implement the sandbox used by Portage, or the OSS wrappers for ALSA, PulseAudio, ESounD and so on.

What I’d like to see is a configurable library that would allow to choose between different memcpy() implementations, maybe on a per-size basis too, parallel and non-parallel, at runtime, through a file in /etc. This is quite possible, and similar features, with replacement for many common library functions, are available with freevec (which unfortunately only implements AltiVec on 32-bit PPC). But a more arch-neutral way to handle this would most likely help out.

Anyway, if somebody is up to take the challenge, I’d be glad to test, and to add to the tinderbox to test on the packages’ testsuites too. Let’s see what happens.

I’m in yur ALSA… killing your dirties

Well, maybe not right now.

Since I doubt you’d be able to understand what I mean from the lolcat-speech title, let me try to summarise it in a language that nears a lot more what people actually speak (yes I know it’s still going to be too technical for some of the readers, but I guess that cannot really be helped).

Last night I couldn’t sleep, for a series of reason, not last that to make sure I could implement some stuff for my job while waiting for the actual definitive specs, I took three coffee cups, which while making me feel very nice, stops me from sleeping; not so nice when your neighbours woke you up two days in a row fighting, but I can manage.

Since at the time I was waiting for the chroot to complete some builds so I could check and submit a few more bugs (the count of “My Bugs” search on bugzilla now reaches 1200 bugs and has some reserve too), I decided to try something different. I already have been adding to my git repositories changes to a few libraries I contributed to in the past enough buildsystem so that --no-undefined is added, so last night I decided to go with doing some work on ALSA upstream repositories.

I already had checked out three of the repositories when 1.0.18 was added to the tree, since I had to fix an --as-needed issue and decided to just go on and submit all the patches to upstream for merge, this time I checked out alsa-lib, added --no-undefined and then started some analysis with ruby-elf tools cowstats and missingstatic, as well as removed a few compiler warnings, just to make sure I wouldn’t be distracted by faux problems.

The result should now be that the alsa-lib and alsa-plugins libraries have a few dirty pages less, and that the code is a bit more solid than before, with added static and const modifiers where needed. It wasn’t much of a work, but I forgot once again to add -s to the git commits so I had to rewrite history to get the Signed-off-by header to all the commits; if somebody knows how to set git per-repository to always use -s when committing, I’d be very glad.

On the other hand, this task shown me that cowstats still had and has some problems, in particular, it lacked a way to separate .data.rel from .data.rel.ro sections data. This is important to distinguish between the two since .data.rel.ro is fully prelinkable, which means after a prelink it would always be loaded from the disk without further relocation, while .data could still cause copy on write because it can be changed at runtime even after relocation.

This is even further understood by noticing that shared objects built with GCC under Linux have .data, .data and .data.rel.ro, but no .data.rel which is instead merged back into .data itself. But because of this the “real” data count in cowstats is entirely out of reality. I’ll have to rewrite that part most likely.

Anyway, I’ve done my best and hopefully tomorrow more of my patches will be merged in, so that alsa-lib’s dirty pages get reduced again. Unfortunately even after my changes, with all the plugins enabled, and in the worst case scenario, libasound.so will go on requiring more than 28KiB of dirty pages per process (minus forks and various preloads). Which is not nice at all. Prelinking can reduce the dirty pages removing these 28KiB (which are all of .data.rel.ro), and then it would just require a couple of pages.

There is one question though that now is driving me nuts: hasn’t Nokia worked on ALSA for their N800 tablets? I know alsa-plugins has a Maemo plugin (which I also cleaned up a bit last night, as it had quite a few hacks on the autotools side, and an unwarranted use of pow() instead of using left shift), but I’d expect Nokia to know better about having so many dirty pages…

Anyway, for all the remaining users, I strongly suggest you look into removing some of the plugins that ship with ALSA, like the iec958 plugin if you don’t use digital pass-through. By cutting down the amount of built-in plugins you should be able to reduce sensibly the memory that alsa and the applications using alsa would be using on your system.

-I also wonder why didn’t I add an USE flag to disable the alisp feature- Sorry, of course I wouldn’t be able to find an alisp USE flag if I check the output of emerge -pv alsa-utils. D’oh!. Why does ALSA need a LISP interpreter anyway?

Testing the corner cases

One interesting thing of using chroots to check things out is that often enough you stumble across different corner cases when you get to test one particular aspect of packages.

For instance, when I was testing linking collisions, I found a lot of included libraries. This time testing for flags being respected I found some other corner cases.

It might some funky, but it has been common knowledge for a while that gcc -O0 sometimes produced bad code, and sometimes it failed to build some packages. Unfortunately it’s difficult to track it down to specific problems when you’re “training” somebody in handling the compiler. Today, I found one of these cases.

I was going to merge sys-block/unieject in my flagstesting chroot so I could make sure it worked properly, for this, it needed dev-libs/confuse, which I use for configuration files parsing. All at once, I found this failure:

 i686-pc-linux-gnu-gcc -DLOCALEDIR="/usr/share/locale" -DHAVE_CONFIG_H -I. -I. -I.. -Wall -pipe -include /var/tmp/portage/dev-libs/confuse-2.6-r1/temp/flagscheck.h -MT confuse.lo -MD -MP -MF .deps/confuse.Tpo -c confuse.c  -fPIC -DPIC -o .libs/confuse.o
confuse.c: In function 'cfg_init':
confuse.c:1112: warning: implicit declaration of function 'setlocale'
confuse.c:1112: error: 'LC_MESSAGES' undeclared (first use in this function)
confuse.c:1112: error: (Each undeclared identifier is reported only once
confuse.c:1112: error: for each function it appears in.)
confuse.c:1113: error: 'LC_CTYPE' undeclared (first use in this function)
make[2]: *** [confuse.lo] Error 1
make[2]: Leaving directory `/var/tmp/portage/dev-libs/confuse-2.6-r1/work/confuse-2.6/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/var/tmp/portage/dev-libs/confuse-2.6-r1/work/confuse-2.6'
make: *** [all] Error 2

This was funny to see as I did merge confuse lately on my main system on Yamato, and I do have nls enabled there, too. It didn’t fail, so it’s not even a glibc cleanup-related failure.

Time to dig into the code, where is setlocale used in confuse?

#if defined(ENABLE_NLS) && defined(HAVE_GETTEXT)
    setlocale(LC_MESSAGES, "");
    setlocale(LC_CTYPE, "");
    bindtextdomain(PACKAGE, LOCALEDIR);
#endif

As I used this before, I know that locale.h provides setlocale() function, but usually it’s included through gettext’s own libintl.h header file, so where is that included? A common problem here would be to have different preprocessor tests between the include and the use so that one is applied but not the other.

#if defined(ENABLE_NLS) && defined(HAVE_GETTEXT)
# include 
# define _(str) dgettext(PACKAGE, str)
#else
# define _(str) str
#endif
#define N_(str) str

It seems to be entirely fine, the only problem would be if libintl.h didn’t include locale.h, but why would it then work on the rest of the rest of the system?

The focal point here is to check why libintl.h includes locale.h in one place and not the other. Let’s then look at the file itself:

/* Optimized version of the function above.  */
#if defined __OPTIMIZE__ && !defined __cplusplus

[...]

/* We need LC_MESSAGES for `dgettext'.  */
# include 

[...]

#endif  /* Optimizing.  */

So not for any kind of assurance, but just because there’s a technical need, libintl.h brings in the declaration of setlocale(), and only if you have optimisations enabled. Guess what? my chroot has no optimisations enabled, as I don’t need to execute the code, but just build it.

The fix here is very easy, just include locale.h explicitly; I’ll be sending a patch upstream and submitting one to Gentoo, but it puts an important shadow over the correctness of Free Software when building with optimisations disabled. I suppose this is one other thing that I’ll be testing for in the future, in my checklist.

Respecting CFLAGS and CXXFLAGS, reality testing!

As I have written in my post Flags and flags, I think that one way out of the hardened problem would be to actually respect the CFLAGS and CXXFLAGS the user requests so that they actually apply to the ebuilds. Unfortunately, not all the ebuilds in the tree respect the flags, and finding out which ones do and which ones don’t hasn’t been, up to now, an easy task.

There are many reasons for this, the most common one is to look at the build output and spot that all the compile lines lack your custom flags, but this is difficult to automate, another option is to inject a fake definition option (-DIWASHERE) and grep for it in the build logs, but this is messed up if you consider that a package might ignore CFLAGS just for a subset of its final outputs.

While I was without Enterprise I spent some time thinking about this and I came to find a possible solution, which I’m going to experiment on Yamato, starting tonight (which is Friday 29th for what it’s worth).

The trick is that GCC provides a flag that allows you to include an extra file, unknown to the rest of the code. With a properly structured file, you can easily inject some beacon that you can later pick up.

And with a proper beacon injected in the build files, it shouldn’t be a problem to check using scanelf or similar tools if the flags were respected.

The trick here is all in the choice of the beacon and in looking it up; the first requirement for the proper beacon is to make sure it does not intrude and disrupt the code or the compilation, this means it has to have a name that is not common, and thus does not risk to collide with other pieces of code, and won’t clash between different translation units.

To solve this, the name can be just very long so that it’s impractical that somebody might have used it for a funciton or variable name, let’s say we call that beacon cflags_test_cflags_respected. This is the first step, but it still doesn’t solve the problem of clashing traslation units. If we were to write it like this:

const int cflags_test_cflags_respected = 1234;

two translation units with that in them, linked together, will cause a linker error that will stop the build. This cannot happen or it’ll make our test useless. The solution is to make the symbol a common symbol. In C, common symbols are usually the ones that are declared without an initialisation value, like this:

int cflags_test_cflags_respected;

Unfortunately this syntax doesn’t work on C++, as the notion of common symbol hasn’t crossed that language barrier. Which means that we have to go deeper in the stack of languages to find the way to create the common symbol. It’s not difficult, once you decide to use the assembly language:

asm(".comm cflags_test_cflags_respected,1,1");

will create a common symbol of size 1 byte. It won’t be perfect as it might increase the size of .bss section for a program by one byte, and thus screw up perfect non-.bss programs, but we’re interested in the tests rather than the performance, as of this moment.

There is still one little problem though: the asm construct is not accepted by the C99 language, so we’ll have to use the new one instead: __asm__, that works just in the same way.

But before we go on with this, there is something else to take care of. As I have written in the entry linked at the start of this one, there are packages that mix CFLAGS and CXXFLAGS. As we’re here, it could be easy to just add some more test beacons that track down for us if the package has used CFLAGS to build C++ code or CXXFLAGS to build C code. With this in mind, i came to create two files: flagscheck.h and flagscheck.hpp, respectively to be injected through CFLAGS and CXXFLAGS.

flame@yamato ~ % sudo cat /media/chroots/flagstesting/etc/portage/flagscheck.h
#ifdef __cplusplus
__asm__(".comm cflags_test_cxxflags_in_cflags,1,1");
#else
__asm__(".comm cflags_test_cflags_respected,1,1");
#endif
flame@yamato ~ % sudo cat /media/chroots/flagstesting/etc/portage/flagscheck.hpp
#ifndef __cplusplus
__asm__(".comm cflags_test_cflags_in_cxxflags,1,1");
#else
__asm__(".comm cflags_test_cxxflags_respected,1,1");
#endif

And here we are, now it’s just time to inject these in the variables and check for the output. But I’m still not satisfied with this. There are packages that, mistakenly, save their own CFLAGS and propose them to other programs that are linked against; to avoid these to falsify our tests, I’m going to make the injection unique on a package level.

Thanks to Portage, we can create two functions in the bashrc file, pre_src_unpack and post_src_unpack, in the former, we’re going to copy the two header files in the ${T} directory of the package (the temporary directory), then we can mess with the flags variables and insert the -include command. This way, each package will get its own particular path; when a library passes the CFLAGS assigned to itself to another package, it will fail to build.

pre_src_compile() {
    ln -s /etc/portage/flagscheck.{h,hpp} "${T}"

    CFLAGS="${CFLAGS} -include ${T}/flagscheck.h"
    CXXFLAGS="${CXXFLAGS} -include ${T}/flagscheck.hpp"
}

After the build completed, it’s time to check the results, luckily pax-utils contains scanelf, which makes it piece of cake to check whether one of the four symbols is defined, or if none is (and thus all the flags were ignored), the one line function is as follow:

post_src_compile() {
    scanelf "${WORKDIR}" 
        -E ET_REL -R -s 
        cflags_test_cflags_respected,cflags_test_cflags_in_cxxflags,cflags_test_cxxflags_respected,cflags_test_cxxflags_in_cflags
}

At this point you just have to look for the ET_REL output:

ET_REL cflags_test_cflags_respected /var/tmp/portage/sys-apps/which-2.19/work/which-2.19/tilde/tilde.o 
ET_REL cflags_test_cflags_respected /var/tmp/portage/sys-apps/which-2.19/work/which-2.19/tilde/shell.o 
ET_REL  -  /var/tmp/portage/sys-apps/which-2.19/work/which-2.19/getopt.o 
ET_REL cflags_test_cflags_respected /var/tmp/portage/sys-apps/which-2.19/work/which-2.19/bash.o 
ET_REL  -  /var/tmp/portage/sys-apps/which-2.19/work/which-2.19/getopt1.o 
ET_REL cflags_test_cflags_respected /var/tmp/portage/sys-apps/which-2.19/work/which-2.19/which.o

And it’s time to find out why getopt.o and getopt1.o are not respecting CFLAGS while the rest of the build is.