This post provides a detailed view of the MSDN article CRT Initialization. Just paste some content here:

The CRT obtains the list of function pointers from the Visual C++ compiler. When the compiler sees a global initializer, it generates a dynamic initializer in the .CRT$XCU section (where CRT is the section name and XCU is the group name). To obtain a list of those dynamic initializers run the command dumpbin /all main.obj, and then search the .CRT$XCU section (when main.cpp is compiled as a C++ file, not a C file).

The CRT defines two pointers:
– __xc_a in .CRT$XCA
– __xc_z in .CRT$XCZ

Both groups do not have any other symbols defined except __xc_a and __xc_z. Now, when the linker reads various .CRT groups, it combines them in one section and orders them alphabetically. This means that the user-defined global initializers (which the Visual C++ compiler puts in .CRT$XCU) will always come after .CRT$XCA and before .CRT$XCZ.

So, the CRT library uses both __xc_a and __xc_z to determine the start and end of the global initializers list because of the way in which they are laid out in memory after the image is loaded.

Let’s run our VS debugger to further investigate the CRT implementation. I’m using VS2010, and a global instance of class A is declared and initialized:

Now set the breakpoints in the constructor and destructor, and start debugging. I’ve tried exe/dll and dynamic/static CRT combinations to view the call stacks:

_initterm is defined as follow. It is used to walk through __xc_a and __xc_z mentioned above:

__xc_a, __xc_z and other section groups are defined as:

gcc uses similar technology to deal with pre/post-main stuff. The section names are .init and .fini .

Code snippet:

Output:

Exception safety is ensured, when using shared_ptr. Memory allocated by m_a is freed even when an exception is thrown. The trick is: the destructor of class shared_ptr is invoked after the destructor of class C.

Copied from Wikipedia:

An intrinsic function is a function available for use in a given programming language whose implementation is handled specially by the compiler. Typically, it substitutes a sequence of automatically generated instructions for the original function call, similar to an inline function. Unlike an inline function though, the compiler has an intimate knowledge of the intrinsic function and can therefore better integrate it and optimize it for the situation. This is also called builtin function in many languages.

A code snippet is written to check the code generation when intrinsic is enabled or not:

Generated assembly:

Only printf() is in code. No abs() nor memcpy(). Since they are intrinsic, as listed here in gcc’s online document.

Intrinsic can be explicitly disabled. For instance, CRT intrinsic must be disabled for kernel development. Add -fno-builtin flag to gcc, or remove /Oi switch in MSVC. Only paste the generated code in gcc case here:

There _are_ abs() and memcpy() now. General MSVC intrinsic can be found here.

Intrinsic is easier than inline assembly. It is used to increase performance in most cases. Both gcc and MSVC provide intrinsic support for Intel’s MMX, SSE and SSE2 instrument set. Code snippet to use MMX:

Assembly looks like:

You see MMX registers and instruments this time. -mmmx flag is required to build for gcc. MSVC also generate similar code. Reference for these instrument set is available on Intel’s website.

A simple benchmark to use SSE is avalable here.

There was a misleading in my knowledge of a conditional jump: It checks only the result of CMP and TEST instruments. So when it appears after other instruments like ADD or SUB, I can find no clue on how it works.

Actually, a conditional jump checks flags in the EFLAGS control register. From Intel’s manual, vol 1, 3.4.3:

The status flags (bits 0, 2, 4, 6, 7, and 11) of the EFLAGS register indicate the results of arithmetic instructions, such as the ADD, SUB, MUL, and DIV instructions. The status flag functions are:

CF (bit 0) Carry flag: Set if an arithmetic operation generates a carry or a borrow out of the most-significant bit of the result; cleared otherwise. This flag indicates an overflow condition for unsigned-integer arithmetic. It is also used in multiple-precision arithmetic.

PF (bit 2) Parity flag: Set if the least-significant byte of the result contains an even number of 1 bits; cleared otherwise.
AF (bit 4) Adjust flag: Set if an arithmetic operation generates a carry or a borrow out of bit 3 of the result; cleared otherwise. This flag is used in binary-coded decimal (BCD) arithmetic.

ZF (bit 6) Zero flag: Set if the result is zero; cleared otherwise.

SF (bit 7) Sign flag: Set equal to the most-significant bit of the result, which is the sign bit of a signed integer. (0 indicates a positive value and 1 indicates a negative value.)

OF (bit 11) Overflow flag: Set if the integer result is too large a positive number or too small a negative number (excluding the sign-bit) to fit in the destination operand; cleared otherwise. This flag indicates an overflow condition for signed-integer (two’s complement) arithmetic.

And again from vol 2a, section Jcc Jump if Condition is met, more details. I just copy content from here:

Instruction Description signed? Flags short
jump
opcodes
near
jump
opcodes
JO Jump if overflow OF = 1 70 0F 80
JNO Jump if not overflow OF = 0 71 0F 81
JS Jump if sign SF = 1 78 0F 88
JNS Jump if not sign SF = 0 79 0F 89
JE
JZ
Jump if equal
Jump if zero
ZF = 1 74 0F 84
JNE
JNZ
Jump if not equal
Jump if not zero
ZF = 0 75 0F 85
JB
JNAE
JC
Jump if below
Jump if not above or equal
Jump if carry
unsigned CF = 1 72 0F 82
JNB
JAE
JNC
Jump if not below
Jump if above or equal
Jump if not carry
unsigned CF = 0 73 0F 83
JBE
JNA
Jump if below or equal
Jump if not above
unsigned CF = 1 or ZF = 1 76 0F 86
JA
JNBE
Jump if above
Jump if not below or equal
unsigned CF = 0 and ZF = 0 77 0F 87
JL
JNGE
Jump if less
Jump if not greater or equal
signed SF <> OF 7C 0F 8C
JGE
JNL
Jump if greater or equal
Jump if not less
signed SF = OF 7D 0F 8D
JLE
JNG
Jump if less or equal
Jump if not greater
signed ZF = 1 or SF <> OF 7E 0F 8E
JG
JNLE
Jump if greater
Jump if not less or equal
signed ZF = 0 and SF = OF 7F 0F 8F
JP
JPE
Jump if parity
Jump if parity even
PF = 1 7A 0F 8A
JNP
JPO
Jump if not parity
Jump if parity odd
PF = 0 7B 0F 8B
JCXZ
JECXZ
Jump if %CX register is 0
Jump if %ECX register is 0
%CX = 0
%ECX = 0
E3 E3

There are signed and unsigned versions when comparing: JA Vs JG, JB Vs JL etc.. Let’s take JA and JG to explain the difference. For JA, it’s clear that it requires CF=0(no borrow bit) and ZF=0(not equal). For JG, when two operands are both positive or negative, it requires ZF=0 and SF=OF=0. When two operands have different signs, it requires ZF=0 and the first operand is positive, thus requires SF=OF=1.

Note, the following 2 lines(AT&T syntax) are equivalent. CPU does arithmetic calculation, it does not care about whether it is signed or unsigned. It only set flags. It is we that make the signed or unsigned jump decision.

Last, I’d like to use ndisasm(install nasm package to get it) to illustrate how jump instruments are encoded, including short jump, near jump and far jump:

MSVC’s inline assembly is easier to use, as compared to gcc’s version. It is easier to write right code than wrong one, I think. Still a simple add function is used to illustrate:

The corresponding inline version:

__asm keyword is used to specify a inline assembly block. From MSDN, there is another asm keyword which is not recommended:

Visual C++ support for the Standard C++ asm keyword is limited to the fact that the compiler will not generate an error on the keyword. However, an asm block will not generate any meaningful code. Use __asm instead of asm.

Symbols in C/C++ code can be used directly in inline assembly. This is much more convenient than gcc. And it is also not necessary to load parameters into registers before usage as in gcc. MSVC does the job right even in optimization case.

NOTE: Inline assembly is not supported on the Itanium and x64 processors.

Let’s see the generated code:

Output:

Function parameters are located in [ebp+12] and [ebp+8] as referred by symbol a and b. Then, what happened if registers other than scratch registers are specified?

Output assembly code:

As you see, MSVC automatically preserves ebx for us. From MSDN:

When using __asm to write assembly language in C/C++ functions, you don’t need to preserve the EAX, EBX, ECX, EDX, ESI, or EDI registers.

Let’s see the case when stdcall calling convention is used:

Output:

In stdcall, stack is cleaned up by callee. So, there’s a ret 8 before return. And the function name is mangled to _add4@8.

MSVC also supports fastcall calling convention, but it causes register conflicts as mentioned on MSDN, and is not recommended. Just test it here, the code happens to work:)

Output:

Function parameters are passed in ecx and edx when using fastcall. But they are saved to stack first. It seems we get no benefit using this calling convention. Maybe MSVC does not implement it well. The function name is mangled to @add5@8.

Last, we can tell MSVC that we want to write our own prolog/epilog code sequences using __declspec(naked) directive:

Output:

Normal prolog/epilog is used here. MSVC does not generate duplicate these code when using __declspec(naked) directive.