BASM for beginners, lesson 2

aaie_ 2012-02-20

展开全文

This is the second chapter of the introduction to BASM programming with Delphi. The first chapter was a short introduction to integer code and this second one is about floating point code. Our example functions can evaluate a second order polynomial. The parameters A, B and C that defines the polynomial is coded as local constants. Input to the function is the variable X of type double and the result is also of type double. The function looks like this.

function SecondOrderPolynomial1(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

begin
Result := A*X*X + B*X + C;
end;

Copying the assembler code from the cpu view gives us this.

function SecondOrderPolynomial2(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

begin
{
push ebp
mov   ebp,esp
add   esp,-$08
}
Result := A*X*X + B*X + C;
{
fld   qword ptr [A]
fmul qword ptr [ebp+$08]
fmul qword ptr [ebp+$08]
fld   qword ptr [B]
fmul qword ptr [ebp+$08]
faddp st(1)
fadd qword ptr [C]
fstp qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]
}
{
pop   ecx
pop    ecx
pop    ebp
}
end;

Let us explain the asm code line by line. The begin results in this code,

begin
{
push ebp
mov   ebp,esp
add   esp,-$08
}

which sets up a stackframe for the function. A stack frame is just a piece of memory that is reserved for the stack. A stack frame is accessed through two pointers, the basepointer and the stack pointer. The basepointer is in ebp and the stack pointer is in esp. These two registers are reserved for use by these pointers only. The line push ebp backs up the basepointer, The line mov ebp, esp sets up a new basepointer which is pointing to the top of the stack. The line add esp, -$08 moves the stack pointer 8 bytes down. As a curiosity the stacks grows downward and the last line could more intuitively have been sub esp, 8. The new stack frame that was created by these three lines is standing on top of, or actually hanging under, the last stackframe, which was probably allocated by the function that called our SecondOrderPolynomial function.

The next line of Pascal was compiled into no less than 9 lines of ASM.

Result := A*X*X + B*X + C;
{
fld   qword ptr [A]
fmul qword ptr [ebp+$08]
fmul qword ptr [ebp+$08]
fld   qword ptr [B]
fmul qword ptr [ebp+$08]
faddp st(1)
fadd qword ptr [C]
fstp qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]

For those that is used to HP calculators floating point code is very easy to understand. The first line, fld   qword ptr [A], loads the constant A onto the floating point register stack. The line, fmul qword ptr [ebp+$08], multiplies A with X. This makes sense by watching the Pascal code, but what means "qword ptr [ebp+$08]". qword ptr says "pointer to a quadword, which is the size of a double. (64 bit). The valueof the pointer is between the brackets in [ebp+$08]. ebp is the basepointer and $08 is - well just 8. Because the stack grows down the memory location 8 bytes above the basepointer is in the previous stack frame. Here X was placed by the function which called our function. This placement of a double variable is decided by the register calling convention. A double variable does not fit into the 32 bit integer registers, but it fits perfectly onto the floating point registers. Borland decided to pass double variables via the stack, but passing them in floating point registers would have been more efficient. The next 3 lines need no further explanation, but the line, faddp st(1), needs some. All floating point instructions start with an f. add is addition. st(1) is floating point register 1, which is the second because st(0) is the first! The floating point registers are combined into a stack and instructions implicitly works on the top of the stack which is st(0). faddp st(1) is the same as faddp st(0), st(1) and it adds register st(0) to register st(1) and place the result in st(1). The p in faddp means pop st(0) of the stack. This way the result ends up in st(0). The line fadd qword ptr [C] completes the calculations and the only thing left is to place the result in st(0). It is actually already there and the two lines

fstp qword ptr [ebp-$08]
fld   qword ptr [ebp-$08]

are redundant. They just copy the result into the stack frame and loads it back in again. Such a waste of precious time and energy :-). The line wait makes sure that any exceptions that migth have been raised by one of the floating point instructions is checked. See Intel SW Developers Manual Volume 2 page 822 for the full explanation.

Then there is only three lines of asm back to explain.

{
pop   ecx
pop    ecx
pop    ebp
}
end;

These are removing the stack frame, by restoring the values of esp and ebp back to the values they had when the function was entered. This code is much more intuitive and does the same thing

add esp, 4
pop ebp

it is also more effective and I do not know why the compiler is incrementing the stack pointer in this cumbersome way. Remember that ecx can be used for free and assigning values to it is just like pouring them into a waste bucket.

Now we only need to investigate what is hiding behind the [A] in the line fld qword ptr [A]. We know that A must be a pointer to the place where A is placed in memory. The address of A is coded in the instruction. This is the full line from the cpu view.

00451E40 DD05803C4500     fld qword ptr [B]

00451E40 is the address of this instruction in the exe file. DD05803C4500 is the machine code for the line and fld qword ptr [B] is the more human readable format of it. By consulting the Intel SW Developers Manual Volume 2 on page 280 we see that the opcode for fld is D9, DD, DB or D9C0 depending on the type of data it should load. We recognize DD which is the opcode for fld double. What is left is 05803C4500. 05 is (Somebody help me ! ). 803C4500 is the 32 bit address of A.

Let us convert the function into a pure BASM function now that we have finished analyzing it.

function SecondOrderPolynomial3(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
push ebp
mov   ebp,esp
add   esp,-$08
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul qword ptr [ebp+$08]
fmul qword ptr [ebp+$08]
fld   qword ptr [B]
fmul qword ptr [ebp+$08]
faddp //st(1)
fadd qword ptr [C]
fstp qword ptr [ebp-$08]
wait
fld   qword ptr [ebp-$08]
pop   ecx
pop    ecx
pop    ebp
end;

Now comes a few surprises. First the function will not compile. faddp st(1) is not recognized as a valid combination of opcode and operands. By again consulting the Intel manual we learn that faddp comes in one version only. It operates on st(0), st(1) and it is not necessary to write faddp st(0), st(1) and the short form faddp is the only valid one. We comment out st(1) and it compiles now.

Second surprise. Calling the function with X = 2 yields the calculation Y = 2^2+2*2+3 = 11. SecondOrderPolynomial3 returns 3! We must open the FPU view as well as the CPU view and trace through the code and watch what is happening. It is seen that A=1 is correctly loaded into st(0) by the fourth line, but the 5. line that should multiply A by X, 1 by 2, is resulting in st(0) being a very small number, in effect 0. This tells us that X is near zero instead of 2. Two things can be wrong. The calling code is transferring a wrong value of X or we are addressing X incorrectly. By comparing the calling code when calling function SecondOrderPolynomial3 and SecondOrderPolynomial1 we see that it is the same and this is not the bug. It would also be quite surprising if Delphi was suddenly getting this wrong! Try to step through the calling code while watching the memory pane in the cpu view. The little green arrow is the position of the stackpointer. The calling code looks like this.

push dword ptr [ebp-$0c]
push dword ptr [ebp-$10]
call SecondOrderPolynomial1

Two pointers are pushed onto the stack. One of them is a pointer to X. I do not what the other one is. By watching the memory pane we see that the first one is the pointer to X and the second one is a nil pointer. When we trace into the function we see that the first two lines are repeated. The compiler automatically inserted the push ebp and mov ebp, esp lines. Because push decrements the stack pointer by 4, the reference to X went wrong. These two first lines are removed and everything is ok again.

Now we have finished analyzing the code and know what it does, we can begin optimizing it.

Let us first change the two fstp/fld lines that we already have seen is redundant.

function SecondOrderPolynomial4(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
//push ebp
//mov   ebp,esp
add   esp,-$08
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul qword ptr [ebp+$08]
fmul qword ptr [ebp+$08]
fld   qword ptr [B]
fmul qword ptr [ebp+$08]
faddp //st(1)
fadd qword ptr [C]
//fstp qword ptr [ebp-$08]
wait
//fld    qword ptr [ebp-$08]
pop   ecx
pop    ecx
pop    ebp
end;

This was the only references to the stack frame which is not needed now.

function SecondOrderPolynomial5(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
//push ebp
//mov   ebp,esp
//add   esp,-$08
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul qword ptr [ebp+$08]
fmul qword ptr [ebp+$08]
fld   qword ptr [B]
fmul qword ptr [ebp+$08]
faddp //st(1)
fadd qword ptr [C]
wait
//pop   ecx
//pop   ecx
//pop   ebp
end;

That removed another 6 lines and reduces the function to this.

function SecondOrderPolynomial6(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul qword ptr [ebp+$08]
fmul qword ptr [ebp+$08]
fld   qword ptr [B]
fmul qword ptr [ebp+$08]
faddp
fadd qword ptr [C]
wait
end;

X is loaded from memory into the FPU 3 times. It would be more effective to load it once and then reuse it.

function SecondOrderPolynomial7(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
//Result := A*X*X + B*X + C;
fld   qword ptr [ebp+$08]
fld   qword ptr [A]
fmul st(0), st(1)
fmul st(0), st(1)
fld   qword ptr [B]
fmul st(0), st(2)
ffree st(2)
faddp
fadd qword ptr [C]
wait
end;

I magically came up with this code. The first line loads X. The second line loads A. The third line multiplies A with X. The 4. line multiplies a*X know in st(0) with X. Then we have calculated the first term. Calculating the second term is done by loading B and multiply it with X. This was the last time we needed X in we free's the register, st(2), holding it. Now we add term 1 and 2 and popping term 2 of the stack. The only thing left to do is adding C. The result is now in st(0) and the other registers are empty. Then we check for exceptions with wait and is done. It is seen that no redundant work is done and this implementation is near optimal.

There exits seven instructions for loading often used constants into the FPU. One of these constants is 1, which can be loaded with the instruction fld1. Using it saves one load from memory, which can be costly in terms of clock cycles if data are not properly aligned.

function SecondOrderPolynomial8(X : Double) : Double;

const
//A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
//Result := A*X*X + B*X + C;
fld   qword ptr [ebp+$08]
//fld   qword ptr [A]
fld1
fmul st(0), st(1)
fmul st(0), st(1)
fld   qword ptr [B]
fmul st(0), st(2)
ffree st(2)
faddp
fadd qword ptr [C]
wait
end;

This ended the second lesson. Stay tuned for more.

Regards

Dennis