Tag: gcc

Two-phase Lookup in C++ Templates

December 12, 2014 by gonwan

This is a quick note on C++ Templates: The Complete Guide. The name taxonomy comes first, in Chapter 9:

Qualified name: This term is not defined in the standard, but we use it to refer to names that undergo so-called qualified lookup. Specifically, this is a qualified-id or an unqualified-id that is used after an explicit member access operator (. or ->). Examples are S::x, this->f, and p->A::m. However, just class_mem in a context that is implicitly equivalent to this->class_mem is not a qualified name: The member access must be explicit.

Unqualified name: An unqualified-id that is not a qualified name. This is not a standard term but corresponds to names that undergo what the standard calls unqualified lookup.

Dependent name: A name that depends in some way on a template parameter. Certainly any qualified or unqualified name that explicitly contains a template parameter is dependent. Furthermore, a qualified name that is qualified by a member access operator (. or ->) is dependent if the type of the expression on the left of the access operator depends on a template parameter. In particular, b in this->b is a dependent name when it appears in a template. Finally, the identifier ident in a call of the form ident(x, y, z) is a dependent name if and only if any of the argument expressions has a type that depends on a template parameter.

Nondependent name: A name that is not a dependent name by the above description.

And the definition from Chapter 10:

This leads to the concept of two-phase lookup: The first phase is the parsing of a template, and the second phase is its instantiation.

During the first phase, nondependent names are looked up while the template is being parsed using both the ordinary lookup rules and, if applicable, the rules for argument-dependent lookup (ADL). Unqualified dependent names (which are dependent because they look like the name of a function in a function call with dependent arguments) are also looked up that way, but the result of the lookup is not considered complete until an additional lookup is performed when the template is instantiated.

During the second phase, which occurs when templates are instantiated at a point called the point of instantiation (POI), dependent qualified names are looked up (with the template parameters replaced with the template arguments for that specific instantiation), and an additional ADL is performed for the unqualified dependent names.

To summarize: nondependent names are looked up in the first phase, qualified dependent names are looked up in the second phase, and unqualified dependent names are looked up in both phases. Some code to illustrate how this works:

#include <iostream>

template <typename T>
struct Base {
    typedef int I;
};

template <typename T>
struct Derived : Base<T> {
    void foo() {
        //typename Base<T>::I i = 1.024;
        I i = 1.024;
        std::cout << i << std::endl;
    }
};

template <>
struct Base<void> {
    //const static int I = 0;
    typedef double I;
};

int main() {
    Derived<bool> d1;
    d1.foo();
    Derived<void> d2;
    d2.foo();
    return 0;
}

Now look into Derived::foo(). I is a nondependent name, so it should be looked up only in the first phase. But at that point the compiler cannot determine its type, because the dependent base class Base<T> is not searched before T is known: when instantiated as Derived<bool>, I is int; when instantiated as Derived<void>, I is double. So I has to be looked up in the second phase. We can write typename Base<T>::I i = 1.024; to delay the lookup, since I is now a qualified dependent name.

Unfortunately, two-phase lookup (part of the C++03 standard) is not fully supported in VC++, even in VC++ 2013. VC++ compiles the code and gives the result you would probably expect (output 1 and 1.024). gcc-4.6 rejects it with errors like:

temp1.cpp: In member function ‘void Derived<T>::foo()’:
temp1.cpp:12:9: error: ‘I’ was not declared in this scope
temp1.cpp:12:11: error: expected ‘;’ before ‘i’
temp1.cpp:13:22: error: ‘i’ was not declared in this scope

Another code snippet:

#ifdef _USE_STRUCT
/* ADL for dependent calls in two-phase lookup only
 * works for types that have an associated namespace. */
struct Int {
    Int(int) { }
};
#else
typedef int Int;
#endif

template <typename T>
void f(T i) {
    g(i);
}

void g(Int i) {
}

int main() {
    f(Int(1024));
    return 0;
}

When the compiler sees f(), g() has not been declared. If f() were a nontemplate function, this code would not compile. Since f() is a function template and g is an unqualified dependent name (its argument i has the dependent type T), ordinary lookup in the first phase finds nothing, but ADL is performed again in the second phase, at the point of instantiation, where the declaration of g() is visible. Note that a user-defined type like the struct Int is required here. Since int is a primitive type, it has no associated namespace, and no ADL is performed.

VC++ 2013 still compiles this code without complaint. There are hints that the upcoming VC++ 2015 release will not support two-phase lookup either. gcc claims full support for two-phase lookup as of gcc-4.7. I used gcc-4.8 with the typedef variant, and the error output looks like:

temp2.cpp: In instantiation of ‘void f(T) [with T = int]’:
temp2.cpp:20:16:   required from here
temp2.cpp:13:8: error: ‘g’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
     g(i);
        ^
temp2.cpp:16:6: note: ‘void g(Int)’ declared here, later in the translation unit
 void g(Int i) {
      ^

The code compiles fine when Int is the user-defined struct (build with the -D_USE_STRUCT switch).

Pre/Post-main Function Call Implementation in C

February 13, 2014 by gonwan

In C++, a pre/post-main function call can be implemented with a global class instance: its constructor and destructor are invoked automatically before and after the main function. C has no such mechanism, but there is a glib implementation that helps. You may want to read my previous post about the CRT sections of MSVC first. I just copied the code and did some renaming:

#include <stdlib.h>
#if defined (_MSC_VER)
#if (_MSC_VER >= 1500)
/* Visual Studio 2008 and later have __pragma */
#define HAS_CONSTRUCTORS
#define DEFINE_CONSTRUCTOR(_func) \
    static void _func(void); \
    static int _func ## _wrapper(void) { _func(); return 0; } \
    __pragma(section(".CRT$XCU",read)) \
    __declspec(allocate(".CRT$XCU")) static int (* _array ## _func)(void) = _func ## _wrapper;
#define DEFINE_DESTRUCTOR(_func) \
    static void _func(void); \
    static int _func ## _constructor(void) { atexit (_func); return 0; } \
    __pragma(section(".CRT$XCU",read)) \
    __declspec(allocate(".CRT$XCU")) static int (* _array ## _func)(void) = _func ## _constructor;
#elif (_MSC_VER >= 1400)
/* Visual Studio 2005 */
#define HAS_CONSTRUCTORS
#pragma section(".CRT$XCU",read)
#define DEFINE_CONSTRUCTOR(_func) \
    static void _func(void); \
    static int _func ## _wrapper(void) { _func(); return 0; } \
    __declspec(allocate(".CRT$XCU")) static int (* _array ## _func)(void) = _func ## _wrapper;
#define DEFINE_DESTRUCTOR(_func) \
    static void _func(void); \
    static int _func ## _constructor(void) { atexit (_func); return 0; } \
    __declspec(allocate(".CRT$XCU")) static int (* _array ## _func)(void) = _func ## _constructor;
#else
/* Visual Studio 2003 and earlier versions should use #pragma data_seg() to define pre/post-main functions. */
#error Pre/Post-main function not supported on your version of Visual Studio.
#endif
#elif (__GNUC__ > 2) || (__GNUC__ == 2 && __GNUC_MINOR__ >= 7)
#define HAS_CONSTRUCTORS
#define DEFINE_CONSTRUCTOR(_func) static void __attribute__((constructor)) _func (void);
#define DEFINE_DESTRUCTOR(_func) static void __attribute__((destructor)) _func (void);
#else
/* not supported */
#endif

One limitation of the glib code is the lack of support for VS2003 and earlier versions. For those, #pragma data_seg() can be used to place the function pointers in the CRT sections directly:

/*
 * cl ctor.c
 * gcc ctor.c -o ctor
 */
#include "ctor.h"
#include <stdio.h>

#ifdef HAS_CONSTRUCTORS
DEFINE_CONSTRUCTOR(before)
DEFINE_DESTRUCTOR(after)
#else
#ifdef _MSC_VER
static void before(void);
static void after(void);
#pragma data_seg(".CRT$XCU")
static void (*msc_ctor)(void) = before;
#pragma data_seg(".CRT$XPU")
static void (*msc_dtor)(void) = after;
#pragma data_seg()
#endif
#endif

void before()
{
    printf("before main\n");
}

void after()
{
    printf("after main\n");
}

int main()
{
    printf("in main\n");
    return 0;
}

Output from msvc/gcc:

before main
in main
after main

MSVC CRT Initialization

February 13, 2014 by gonwan

This post takes a closer look at the MSDN article CRT Initialization. Some content pasted from it:

The CRT obtains the list of function pointers from the Visual C++ compiler. When the compiler sees a global initializer, it generates a dynamic initializer in the .CRT$XCU section (where CRT is the section name and XCU is the group name). To obtain a list of those dynamic initializers run the command dumpbin /all main.obj, and then search the .CRT$XCU section (when main.cpp is compiled as a C++ file, not a C file).

The CRT defines two pointers:
– __xc_a in .CRT$XCA
– __xc_z in .CRT$XCZ

Both groups do not have any other symbols defined except __xc_a and __xc_z. Now, when the linker reads various .CRT groups, it combines them in one section and orders them alphabetically. This means that the user-defined global initializers (which the Visual C++ compiler puts in .CRT$XCU) will always come after .CRT$XCA and before .CRT$XCZ.

So, the CRT library uses both __xc_a and __xc_z to determine the start and end of the global initializers list because of the way in which they are laid out in memory after the image is loaded.

Let’s run our VS debugger to further investigate the CRT implementation. I’m using VS2010, and a global instance of class A is declared and initialized:

#include <iostream>

class A
{
public:
    A();
    ~A();
};

A::A()
{
    std::cout << "in A::A()" << std::endl;
}

A::~A()
{
    std::cout << "in A::~A()" << std::endl;
}

A a;

Now set the breakpoints in the constructor and destructor, and start debugging. I’ve tried exe/dll and dynamic/static CRT combinations to view the call stacks:

1) exe with crt dynamic linked:
  crtexe.c: (w)mainCRTStartup()
    +--> crtexe.c: __tmainCRTStartup()
           +--> crt0dat.c: _initterm()
2) exe with crt static linked:
  crt0.c: _tmainCRTStartup()
    +--> crt0.c: __tmainCRTStartup()
           +--> crt0dat.c: _cinit()
                  +--> crt0dat.c: _initterm()
3) dll with crt dynamic linked:
  crtdll.c: _DllMainCRTStartup()
    +--> crtdll.c: __DllMainCRTStartup()
           +--> crtdll.c: _CRT_INIT()
                  +--> crt0dat.c: _initterm()
4) dll with crt static linked:
  dllcrt0.c: _DllMainCRTStartup()
    +--> dllcrt0.c: __DllMainCRTStartup()
           +--> dllcrt0.c: _CRT_INIT()
                  +--> crt0dat.c: _cinit()
                         +--> crt0dat.c: _initterm()

_initterm is defined as follows. It walks the function pointers between __xc_a and __xc_z mentioned above:

// crt0dat.c
void __cdecl _initterm (
        _PVFV * pfbegin,
        _PVFV * pfend
        )
{
        /*
         * walk the table of function pointers from the bottom up, until
         * the end is encountered.  Do not skip the first entry.  The initial
         * value of pfbegin points to the first valid entry.  Do not try to
         * execute what pfend points to.  Only entries before pfend are valid.
         */
        while ( pfbegin < pfend )
        {
            /*
             * if current table entry is non-NULL, call thru it.
             */
            if ( *pfbegin != NULL )
                (**pfbegin)();
            ++pfbegin;
        }
}

__xc_a, __xc_z and other section groups are defined as:

// crt0dat.c
/*
 * pointers to initialization sections
 */
extern _CRTALLOC(".CRT$XIA") _PIFV __xi_a[];
extern _CRTALLOC(".CRT$XIZ") _PIFV __xi_z[];    /* C initializers */
extern _CRTALLOC(".CRT$XCA") _PVFV __xc_a[];
extern _CRTALLOC(".CRT$XCZ") _PVFV __xc_z[];    /* C++ initializers */
extern _CRTALLOC(".CRT$XPA") _PVFV __xp_a[];
extern _CRTALLOC(".CRT$XPZ") _PVFV __xp_z[];    /* C pre-terminators */
extern _CRTALLOC(".CRT$XTA") _PVFV __xt_a[];
extern _CRTALLOC(".CRT$XTZ") _PVFV __xt_z[];    /* C terminators */
// sect_attribs.h
#define _CRTALLOC(x) __declspec(allocate(x))

gcc uses a similar technique for pre/post-main code. The section names are .init and .fini (on ELF targets, the function pointer tables themselves live in .ctors/.dtors or .init_array/.fini_array).

Compiler Intrinsic Functions

October 30, 2013 by gonwan

Copied from Wikipedia:

An intrinsic function is a function available for use in a given programming language whose implementation is handled specially by the compiler. Typically, it substitutes a sequence of automatically generated instructions for the original function call, similar to an inline function. Unlike an inline function though, the compiler has an intimate knowledge of the intrinsic function and can therefore better integrate it and optimize it for the situation. This is also called builtin function in many languages.

Here is a code snippet to compare the generated code with intrinsics enabled and disabled:

/*
 * # gcc -S intrinsic.c -o intrinsic.s
 * # gcc -S -fno-builtin intrinsic.c -o intrinsic2.s
 * # cl /c /Oi intrinsic.c /FAs /Faintrinsic.asm
 * # cl /c intrinsic.c /FAs /Faintrinsic2.asm
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const char *c = "Hello World!";
char c2[16];

int main(int argc, char *argv[])
{
    int a = abs(argc);
    memcpy(c2, c, 12);
    printf("%d,%s\n", a, c2);
    return 0;
}

Generated assembly:

main:
    pushl   %ebp
    movl    %esp, %ebp
    andl    $-16, %esp
    subl    $32, %esp
    movl    8(%ebp), %eax
    sarl    $31, %eax
    movl    %eax, %edx
    xorl    8(%ebp), %edx
    movl    %edx, 28(%esp)
    subl    %eax, 28(%esp)
    movl    c, %eax
    movl    %eax, %edx
    movl    $c2, %eax
    movl    (%edx), %ecx
    movl    %ecx, (%eax)
    movl    4(%edx), %ecx
    movl    %ecx, 4(%eax)
    movl    8(%edx), %edx
    movl    %edx, 8(%eax)
    movl    $.LC1, %eax
    movl    $c2, 8(%esp)
    movl    28(%esp), %edx
    movl    %edx, 4(%esp)
    movl    %eax, (%esp)
    call    printf
    movl    $0, %eax
    leave
    ret

Only printf() is called in the generated code; there are no calls to abs() or memcpy(). Both are intrinsic (built-in) functions, as listed in gcc’s online documentation.

Intrinsics can be explicitly disabled. For instance, CRT intrinsics must be disabled for kernel development. Add the -fno-builtin flag for gcc, or remove the /Oi switch for MSVC. Only the code generated by gcc is pasted here:

main:
    pushl   %ebp
    movl    %esp, %ebp
    andl    $-16, %esp
    subl    $32, %esp
    movl    8(%ebp), %eax
    movl    %eax, (%esp)
    call    abs
    movl    %eax, 28(%esp)
    movl    c, %eax
    movl    %eax, %edx
    movl    $c2, %eax
    movl    $12, 8(%esp)
    movl    %edx, 4(%esp)
    movl    %eax, (%esp)
    call    memcpy
    movl    $.LC1, %eax
    movl    $c2, 8(%esp)
    movl    28(%esp), %edx
    movl    %edx, 4(%esp)
    movl    %eax, (%esp)
    call    printf
    movl    $0, %eax
    leave
    ret

Now there _are_ calls to abs() and memcpy(). The general MSVC intrinsics are listed in Microsoft’s MSVC documentation.

Intrinsics are easier to use than inline assembly, and in most cases they are used to improve performance. Both gcc and MSVC provide intrinsics for Intel’s MMX, SSE and SSE2 instruction sets. A code snippet that uses MMX:

/*
 * # gcc -O2 -S -mmmx intrinsic_mmx.c -o intrinsic_mmx.s
 * # cl /O2 /c intrinsic_mmx.c /FAs /Faintrinsic_mmx.asm
 */
#include <stdio.h>
#include <mmintrin.h>

int main()
{
    __m64 m1, m2, m3;
    int out1, out2;
    int in1[] = { 222, 111 };
    int in2[] = { 444, 333 };
#if 0
    m1 = _mm_setr_pi32(in1[0], in1[1]);
    m2 = _mm_setr_pi32(in2[0], in2[1]);
#else
    m1 = *(__m64 *)in1;
    m2 = *(__m64 *)in2;
#endif
    m3 = _mm_add_pi32(m1, m2); 
    out1 = _mm_cvtsi64_si32(m3);
    m3  = _mm_srli_si64(m3, 32);
    out2 = _mm_cvtsi64_si32(m3);
    _mm_empty();
    printf("out1=%d,out2=%d\n", out1, out2);
    return 0;
}

Assembly looks like:

main:
    pushl   %ebp
    movl    %esp, %ebp
    andl    $-16, %esp
    subl    $16, %esp
    movq    .LC1, %mm0
    paddd   .LC2, %mm0
    movd    %mm0, 8(%esp)
    psrlq   $32, %mm0
    movd    %mm0, 12(%esp)
    emms
    movl    $.LC0, 4(%esp)
    movl    $1, (%esp)
    call    __printf_chk
    xorl    %eax, %eax
    leave
    ret

You see MMX registers and instructions this time. The -mmmx flag is required when building with gcc. MSVC generates similar code. A reference for these instruction sets is available on Intel’s website.

A simple benchmark that uses SSE is available here.

GCC Inline Assembly

October 22, 2013 by gonwan

Inline assembly is used in the Linux kernel to optimize performance or to access hardware, so I decided to check it out first. Before digging deeper, you may want to read the GCC Inline Assembly HOWTO to get a general understanding. In C, a simple add function looks like:

int add1(int a, int b)
{
    return a + b;
}

Its inline assembly version may be:

int add2(int a, int b)
{
    __asm__ __volatile__ ("movl 12(%ebp), %eax\n\t"
                          "movl 8(%ebp), %edx\n\t"
                          "addl %edx, %eax"
    );
}

Or simpler:

int add3(int a, int b)
{
    __asm__ __volatile__ ("movl 12(%ebp), %eax\n\t"
                          "addl 8(%ebp), %eax"
    );
}

Here’s the code gcc generates for them:

# gcc -S testasm_linux.c -o testasm_linux.s

Output:

add2:
    pushl   %ebp
    movl    %esp, %ebp
#APP
# 21 "testasm_linux.c" 1
    movl 12(%ebp), %eax
    movl 8(%ebp), %edx
    addl %edx, %eax
# 0 "" 2
#NO_APP
    popl    %ebp
    ret
add3:
    pushl   %ebp
    movl    %esp, %ebp
#APP
# 31 "testasm_linux.c" 1
    movl 12(%ebp), %eax
    addl 8(%ebp), %eax
# 0 "" 2
#NO_APP
    popl    %ebp
    ret

Our inline assembly is surrounded by the #APP and #NO_APP comments. Redundant gcc directives have been removed; what remains is just the function prolog/epilog code. add2() and add3() work fine with the default gcc flags, but not when the -O2 optimization flag is passed. From the output of gcc -S -O2 (try it yourself), I found that the calls to these 2 functions are inlined into their caller, so there is no function call at all. These 2 issues prevent the inline assembly from working:
– Depending on %eax to be the return value. It is silently ignored in -O2.
– Depending on 12(%ebp) and 8(%ebp) as the parameters of the function. It is not guaranteed that the parameters are there in -O2.

To solve issue 1, an explicit return should be used:

int add4(int a, int b)
{
    int res;
    /* note the double % */
    __asm__ __volatile__ ("movl 12(%%ebp), %%eax\n\t"
                          "addl 8(%%ebp), %%eax"
                          : "=a" (res)
    );
    return res;
}

To solve issue 2, parameters are required to be loaded in registers first:

int add5(int a, int b)
{
    int res;
    __asm__ __volatile__ ("movl %%ecx, %%eax\n\t"
                          "addl %%edx, %%eax"
                          : "=a" (res)
                          : "c" (a), "d" (b)
    );
    return res;
}

add5() now works in -O2. The default calling convention for gcc is cdecl. %eax, %ecx and %edx can be used freely in a function; it is the caller’s duty to preserve them. These are the so-called scratch registers. So what if we use registers other than the scratch registers, like %esi and %edi?

int add6(int a, int b)
{
    int res;
    __asm__ __volatile__ ("movl %%esi, %%eax\n\t"
                          "addl %%edi, %%eax"
                          : "=a" (res)
                          : "S" (a), "D" (b)
    );
    return res;
}

Again with gcc -S:

add6:
    pushl   %ebp
    movl    %esp, %ebp
    pushl   %edi
    pushl   %esi
    pushl   %ebx
    subl    $20, %esp
    movl    8(%ebp), %esi
    movl    %esi, -32(%ebp)
    movl    12(%ebp), %edx
    movl    -32(%ebp), %esi
    movl    %edx, %edi
#APP
# 65 "testasm_linux.c" 1
    movl %esi, %eax
    addl %edi, %eax
# 0 "" 2
#NO_APP
    movl    %eax, %ebx
    movl    %ebx, -16(%ebp)
    movl    -16(%ebp), %eax
    addl    $20, %esp
    popl    %ebx
    popl    %esi
    popl    %edi
    popl    %ebp
    ret

It seems that gcc’s code generation at the default optimization level is not so efficient :) But you should notice that %esi and %edi are pushed onto the stack before they are used, and popped when the function finishes. gcc generates this code automatically, since you have specified %esi (“S”) and %edi (“D”) in the input list of the inline assembly. Actually, the code can be simpler by specifying %eax as both input and output:

int add7(int a, int b)
{
    int res;
    __asm__ __volatile__ ("addl %%edx, %%eax"
                          : "=a" (res)
                          : "a" (a), "d" (b)
    );
    return res;
}

We can also let gcc pick any general register (“r”) available in the current context:

int add8(int a, int b)
{
    int res;
    __asm__ __volatile__ ("movl %1, %%eax\n\t"
                          "addl %2, %%eax"
                          : "=a" (res)
                          : "r" (a), "r" (b)
    );
    return res;
}

And the generated code looks wrong again:

add8:
    pushl   %ebp
    movl    %esp, %ebp
    pushl   %ebx
    subl    $20, %esp
    movl    8(%ebp), %eax
    movl    %eax, -24(%ebp)
    movl    12(%ebp), %edx
    movl    -24(%ebp), %eax
#APP
# 88 "testasm_linux.c" 1
    movl %eax, %eax
    addl %edx, %eax
# 0 "" 2
#NO_APP
    movl    %eax, %ebx
    movl    %ebx, -8(%ebp)
    movl    -8(%ebp), %eax
    addl    $20, %esp
    popl    %ebx
    popl    %ebp
    ret

%eax is moved to %eax? gcc selected %eax and %edx as the general registers to use. The code happens to do the right job, but this is a potential pitfall: had gcc placed b in %eax, the first movl would have overwritten it before the addl reads it. A clobber list can be used to avoid this:

int add9(int a, int b)
{
    int res;
    /*
     * The clobber list tells gcc which registers(or memory) are changed by the asm,
     * but not listed as an output.
     */
    __asm__ __volatile__ ("movl %1, %0\n\t"
                          "addl %2, %0\n\t"
                          "movl %0, %%eax"
                          /* & (earlyclobber): %0 is written before %2 is read */
                          : "=&r" (res)
                          : "r" (a), "r" (b)
                          : "%eax"
    );
    return res;
}

As commented inline: the clobber list tells gcc which registers (or memory) are changed by the asm but not listed as outputs. gcc no longer considers %eax when choosing general registers for the operands. gcc also generates code to preserve (push onto the stack) the registers in the clobber list if necessary.
