Inline assembly: a case study

astrotycoon 2015-02-04


Our first article, "Introduction to inline assembly", covers a case study of inline assembly and the benefits of using inline assembly in C/C++ programs. We also describe the basic syntax of inline assembly and show readers how to read and write it.

1. Case study: thread lock in a multithreaded program

With multicore processors now mature and widespread, recent applications are normally multithreaded to maximize processor throughput and boost performance. pthread (POSIX threads) is a well-known thread library on Linux and is used by many multithreaded applications written in C/C++.

As we all know, when writing applications for the shared-memory model in a multiprocessor environment, we have to deal with data synchronization. The pthread library supplies programmers with several mechanisms for data synchronization, and one of them is the mutual-exclusion lock, called a mutex lock. Check the simple example below. We have two threads that access a shared variable called shared_var. One thread increments it by one and the other decrements it by one. So before accessing the variable, each function has to acquire the lock to ensure that no other thread can modify the shared variable at the same time.


// code snippet 1
#include <unistd.h>
#include <stdio.h>
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int shared_var;

// Thread body: add one to shared_var, forever.
void *increment(void *args)
{
    while (1) {
        pthread_mutex_lock(&lock);      // enter the critical section
        shared_var += 1;
        printf("incre:%d\n", shared_var);
        pthread_mutex_unlock(&lock);    // leave the critical section
        sleep(0);
    }
}

// Thread body: subtract one from shared_var, forever.
void *decrement(void *args)
{
    while (1) {
        pthread_mutex_lock(&lock);
        shared_var -= 1;
        printf("decre:%d\n", shared_var);
        pthread_mutex_unlock(&lock);
        sleep(0);
    }
}

int main()
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, decrement, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

From the C code above we see that we call pthread_mutex_lock to acquire a lock before accessing shared_var, all in plain C. But what does pthread_mutex_lock really do at the low level?

2. Lock implementation at the low level

In this section we take a closer look at the mutual-exclusion lock from a low-level perspective. Intuitively, it is very easy to implement a lock. A lock is a shared variable: if the variable equals 1 the lock is locked; otherwise the lock is free and we can set it to 1 to lock it.
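The intuition above can be written down directly. The following is a deliberately naive sketch: the names naive_lock_t, naive_try_lock and naive_unlock are made up for illustration and are not part of pthread. It behaves as expected in a single thread, but the test and the set are two separate steps, so it is not safe when two processors run it at the same time.

```c
#include <assert.h>

/* A naive lock: 0 means unlocked, 1 means locked.
   Hypothetical names, for illustration only. NOT thread-safe:
   another thread can slip in between the test and the set. */
typedef struct {
    int locked;
} naive_lock_t;

/* Returns 1 if the lock was free and is now taken, 0 if it was already taken. */
static int naive_try_lock(naive_lock_t *l)
{
    if (l->locked == 0) {   /* step 1: test */
        l->locked = 1;      /* step 2: set (not atomic with step 1) */
        return 1;
    }
    return 0;
}

static void naive_unlock(naive_lock_t *l)
{
    assert(l->locked == 1); /* releasing a free lock is a bug */
    l->locked = 0;
}
```

Calling naive_try_lock twice in a row on a fresh lock returns 1 and then 0, which is the behavior we want from a lock. The missing piece is making step 1 and step 2 indivisible.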

So how does pthread implement a lock? If we dig into the source code of the pthread library we get a clear view of the implementation. A lock is implemented with a compare_and_swap method, which matches our intuition quite well. The code below is extracted from the source code of the pthread library for the PowerPC architecture.


// code snippet 2
PT_EI int
__compare_and_swap(long int *p, long int oldval, long int newval)
{
    int ret;

    __asm__ __volatile__ (
        "0: lwarx %0,0,%1 ;"
        " xor. %0,%3,%0;"
        " bne 1f;"
        " stwcx. %2,0,%1;"
        " bne- 0b;"
        "1: "
        : "=&r"(ret)
        : "r"(p), "r"(newval), "r"(oldval)
        : "cr0", "memory");
    /* This version of __compare_and_swap is to be used when acquiring
       a lock, so we don't need to worry about whether other memory
       operations have completed, but we do need to be sure that any loads
       after this point really occur after we have acquired the lock. */
    __asm__ __volatile__ ("isync" : : : "memory");
    return ret == 0;
}

At first glance we might conclude that it is written in assembly, but look again: it is not quite the ordinary assembly we have used before. It is inline assembly. Inline assembly is a very powerful feature of modern compilers. It gives C/C++ programmers the ability to manipulate hardware directly, and it is handy, fast and powerful in performance-critical and hardware-related code. So we need to learn how to write inline assembly correctly and efficiently to make our code more powerful.

In the example we are discussing, we want to implement a lock using the compare_and_swap method. By our convention, 0 means unlocked and 1 means locked, so we can call __compare_and_swap(lock,0,1), which means:


if lock == 0 then
    set lock = 1 to lock it;
    return true
else if lock == 1 then
    it is already locked;
    return false
Now we know the internal logic of the lock. Since such straightforward pseudocode is enough to describe a lock, why bother writing complicated inline assembly in a C/C++ program? When one processor accesses a shared variable in shared memory, another processor might be accessing it at the same moment, which can corrupt it. So the test and the update of the lock must happen as one atomic operation, and that needs hardware support (on some machines, by locking the bus). Plain C, before C11 introduced atomics, has no way to express this, while assembly can, because it drives the hardware through instructions. However, it is really complicated to write separate assembly in a system mainly written in C/C++, and that is why inline assembly was born. We can easily use assembly inside C/C++ code and do anything assembly can do. It is really cool and useful.
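The contract of __compare_and_swap(lock,0,1) can also be tried out without writing PowerPC assembly. The sketch below assumes a GCC-compatible compiler and uses its __sync_bool_compare_and_swap and __sync_lock_release builtins as a stand-in for the hand-written routine. The names spin_try_lock, spin_lock and spin_unlock are hypothetical, and this is not how the pthread library itself is written.

```c
#include <assert.h>

/* Convention from the article: 0 means unlocked, 1 means locked.
   Hypothetical helper names; assumes GCC/Clang __sync builtins. */

/* Atomically: if (*lock == 0) { *lock = 1; return 1; } else return 0; */
static int spin_try_lock(long *lock)
{
    return __sync_bool_compare_and_swap(lock, 0L, 1L);
}

/* Busy-wait until the compare-and-swap succeeds. */
static void spin_lock(long *lock)
{
    while (!spin_try_lock(lock))
        ;   /* the lock is held by someone else, try again */
}

/* Release the lock by storing 0 with release semantics. */
static void spin_unlock(long *lock)
{
    assert(*lock == 1);   /* releasing a free lock is a bug */
    __sync_lock_release(lock);
}
```

A caller would declare long lock = 0; and wrap each access to the shared variable in spin_lock(&lock) and spin_unlock(&lock), in the same way the first code snippet uses pthread_mutex_lock and pthread_mutex_unlock.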

3. Use of inline assembly

In the last section we saw what inline assembly can do and why we should use it. In this section we introduce how to use inline assembly based on the example above, including its basic syntax, constraints and clobbered list.

Basic syntax and constraints

First of all, let us look at the basic structure of inline assembly in a C/C++ program. All the assembly instructions must be enclosed in __asm__(), which tells the compiler that we are now using inline assembly. Each assembly instruction is written as a string in double quotation marks, like "0: lwarx %0,0,%1 ;".

You may have many questions. For example, what do %0 and %1 mean in the instruction, and what are the things starting with a colon below the instructions? These are the distinctive features of inline assembly that bridge the assembly and C expressions; we describe them in detail below.

The whole structure of inline assembly is: asm(code : output operand list : input operand list : clobber list); In our example, the output operand list is "=&r"(ret), the input operand list is "r"(p), "r"(newval), "r"(oldval), and the clobbered list is "cr0", "memory". Don't hurry; we will describe them one by one.

The output and input operand lists can have several entries, separated by commas. Each entry has two items: the first is a string called a constraint, and the second is a C expression enclosed in parentheses. That is how C expressions are passed into the instructions.

Let us talk about constraints first. Take the simpler one, "r"(p), as an example. What does "r" mean? r means register, more specifically a general-purpose register. The whole entry means: pass the C expression p in a general-purpose register allocated by the compiler. In the output entry "=&r"(ret), the "=" marks the operand as written by the assembly, and the "&" tells the compiler not to place it in the same register as any input. Of course, there are many other constraints that describe the type of an operand, but here we only cover the basic one, "r", that appears in our example.

Now we know that C expressions are passed into inline assembly and that the compiler allocates registers to hold them, but how do the instructions use these C expressions? The trick is the %0 and %1 in the instruction "0: lwarx %0,0,%1 ;". All the operands in the output and input lists are numbered. The first operand in the output list is number zero, and numbering continues in order through the output list and then the input list. In our example %0 is ret, %1 is p, %2 is newval and %3 is oldval.

Now you can guess what %0 and %1 mean in the instruction "0: lwarx %0,0,%1 ;". You are right: %0 is the first operand and %1 is the second. Every operand reference must be preceded by % to tell the compiler that we are using an operand defined in the output or input list; otherwise it is treated as a constant value.

Now we have learned a lot, right? It is time to check what the inline assembly in the example really does. The comments describe the behavior of each instruction.
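To see the operand numbering on a smaller scale, here is a minimal sketch of a two-input addition. The function name add_ints is hypothetical. The x86 branch is the one most readers can run, the PowerPC branch mirrors the article's target architecture and is our assumption of the equivalent instruction, and any other architecture falls back to plain C.

```c
#include <assert.h>

/* Returns a + b using inline assembly where a branch is available.
   Operand numbers: %0 = sum (output list), %1 = a, %2 = b (input list). */
static int add_ints(int a, int b)
{
    int sum;
#if defined(__x86_64__) || defined(__i386__)
    __asm__ ("movl %1, %0\n\t"   /* sum = a */
             "addl %2, %0"       /* sum = sum + b */
             : "=&r"(sum)        /* output: written before b is read, hence '&' */
             : "r"(a), "r"(b)    /* inputs: each in a general register */
             : "cc");            /* clobber: addl changes the condition flags */
#elif defined(__powerpc__)
    __asm__ ("add %0,%1,%2"      /* sum = a + b */
             : "=r"(sum)
             : "r"(a), "r"(b));
#else
    sum = a + b;                 /* fallback: no inline assembly on this target */
#endif
    return sum;
}

/* Small usage example. */
void check_add_ints(void)
{
    assert(add_ints(2, 3) == 5);
}
```

The compiler picks the actual registers. We only say which C expression goes with which operand number, and the % references in the instruction strings are replaced with the chosen register names.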


__asm__ __volatile__ (
    "0: lwarx %0,0,%1 ;"    // load *p into ret
    " xor. %0,%3,%0;"       // xor oldval with *p and store the result in ret
    " bne 1f;"              // if ret != 0 then jump to 1:
    " stwcx. %2,0,%1;"      // store newval into *p
    " bne- 0b;"             // if the store failed, jump back to 0:
    "1: "
    : "=&r"(ret)
    : "r"(p), "r"(newval), "r"(oldval)
    : "cr0", "memory");

It does exactly what we described with pseudocode in section 2. In section 2 we also mentioned that the lock operation needs hardware support to be atomic. Here the instructions lwarx and stwcx. supply that support: lwarx loads *p and places a reservation on its address, and stwcx. stores newval only if the reservation is still intact, that is, if no other processor has written to that location in the meantime. If the reservation was lost, the store fails and the loop retries from label 0, so the compare and the update of *p behave as one atomic operation.
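The control flow of the assembly can be restated in C, one instruction per line. This is only a model under our own assumptions: it relies on the GCC/Clang __atomic builtins, the name compare_and_swap_model is hypothetical, and a weak compare-exchange stands in for the store-conditional, which is an approximation of a real hardware reservation.

```c
#include <assert.h>

/* C-level model of the lwarx/stwcx. loop (hypothetical name).
   Returns 1 if *p was equal to oldval and has been set to newval,
   0 if *p held some other value. */
static int compare_and_swap_model(long *p, long oldval, long newval)
{
    for (;;) {                                            /* label 0:          */
        long cur = __atomic_load_n(p, __ATOMIC_RELAXED);  /* lwarx: load *p    */
        long ret = oldval ^ cur;                          /* xor.: 0 iff equal */
        if (ret != 0)                                     /* bne 1f            */
            return 0;
        /* stwcx.: store newval only if *p is unchanged since the load.
           Acquire ordering on success plays the role of the isync. */
        if (__atomic_compare_exchange_n(p, &cur, newval, 1,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return 1;
        /* bne- 0b: the store failed, go around again */
    }
}

/* Small usage example: take a free lock, then fail to take it again. */
void check_compare_and_swap_model(void)
{
    long lock = 0;
    assert(compare_and_swap_model(&lock, 0, 1) == 1);
    assert(compare_and_swap_model(&lock, 0, 1) == 0);
}
```

The retry branch matters because a store-conditional may fail even when nobody changed the value, so a single failure does not mean the lock is taken.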

Clobbered list and volatile

All right, we have learned a good deal about inline assembly. Now we introduce two more advanced features, the clobbered list and volatile, which are rather complicated and tricky.

Both are concerned with compiler optimization, which is why they are complicated. volatile is a modifier that follows __asm__ and tells the compiler that the inline assembly block has side effects, so the optimizer must not delete the block or move it away from where we put it. Sometimes the compiler thinks it is cleverer than the programmer and rearranges instructions as an optimization. Most of the time this leads to a better result, but in our example it could do the opposite. We don't want the compiler to change anything about our instructions, so we use the volatile modifier.

So what about the clobbered list? Why should we use it? From the programmer's perspective we do not need it, but the compiler really does. A clobbered list tells the compiler which registers are modified implicitly by the inline assembly instructions. In our example, the instructions xor. and stwcx. implicitly change CR0, so we must list CR0 in the clobbered list. But why should the programmer tell the compiler this? Again it is about compiler optimization. CR0 does not appear in the operand lists, so the compiler might assume that CR0 is unchanged. If the compiler had left a comparison result in CR0 before the block, it might use that result afterwards without recomputing it, which would be wrong because xor. and stwcx. have overwritten it. That is why the programmer must tell the compiler that CR0 is modified and that it must never rely on its old value.

I know it has been a long read, and this is the last item. Come on! The "memory" in the clobbered list means that the instructions may have modified memory, so the compiler must not keep memory values cached in registers across the block and has to reload variables from memory after the inline assembly is executed.
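The "memory" clobber is useful even with no instructions at all. A common idiom in GCC-compatible compilers is an empty template with only a "memory" clobber, which acts as a compiler-level barrier. The sketch below is a minimal illustration; the names compiler_barrier, payload, ready and publish are hypothetical. Note that it constrains only the compiler: ordering between processors still needs a real hardware instruction, such as the isync in code snippet 2.

```c
#include <assert.h>

/* Empty template plus "memory" clobber: emits no instruction, but the
   compiler must assume that any memory may have been read or written here. */
#define compiler_barrier() __asm__ __volatile__("" : : : "memory")

static int payload;   /* data handed to another part of the program */
static int ready;     /* flag announcing that payload is valid */

/* Hypothetical helper: write the data first, then raise the flag. */
static void publish(int value)
{
    payload = value;
    compiler_barrier();   /* the compiler may not sink the payload store below this point */
    ready = 1;
}

/* Small usage example. */
void check_publish(void)
{
    publish(42);
    assert(ready == 1 && payload == 42);
}
```

Without the barrier the compiler would be free to emit the two stores in either order, because nothing in plain C ties them together.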

Conclusion:

This is our first article about inline assembly. We think inline assembly is really cool and really powerful. It gives us more ways to manipulate hardware and boost the performance of our applications.

In this session we have described a real case of inline assembly and analyzed its basic syntax, constraints and clobbered list. This is just the first session of our inline assembly series, and we will continue to introduce inline assembly in the following articles. Please enjoy it.

Reference:

1. GCC-Inline-Assembly-HOWTO

2. ARM GCC Inline Assembler Cookbook

3. POSIX thread: http://ftp./gnu/glibc/

4. https://computing./tutorials/pthreads/