Skip to main content

Macro "container_of"

Understanding container_of macro in the Linux kernel

When you begin with the kernel, and you start to look around and read the code, you will eventually come across this magical preprocessor construct.
What does it do? Well, precisely what its name indicates. It takes three arguments – a pointertype of the container, and the name of the member the pointer refers to. The macro will then expand to a new address pointing to the container which accommodates the respective member. It is indeed a particularly clever macro, but how the hell can this possibly work? Let me illustrate…
The first diagram illustrates the principle of the container_of(ptr, type, member) macro for who might find the above description too clumsy.



I
llustration of how the containter_of macro works.

Bellow is the actual implementation of the macro from Linux Kernel:

#define container_of(ptr, type, member) ({                      \
        const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
        (type *)( (char *)__mptr - offsetof(type,member) );})

At first glance, this might look like a whole lot of magic, but it isn’t quite so. Let’s take it step by step.

Statements in Expressions

The first thing to gain your attention might be the structure of the whole expression. The statement should return a pointer, right? But there is just some kind of weird ({}) block with two statements in it. This in fact is a GNU extension to C language called braced-group within expression. The compiler will evaluate the whole block and use the value of the last statement contained in the block. Take for instance the following code. It will print 5.

int x = ({1; 2;}) + 3;
printf("%d\n", x);

typeof()

This is a non-standard GNU C extension. It takes one argument and returns its type. Its exact semantics is throughly described in gcc documentation.

int x = 5;
typeof(x) y = 6;
printf("%d %d\n", x, y);

Zero Pointer Dereference

But what about the zero pointer dereference? Well, it’s a little pointer magic to get the type of the member. It won’t crash, because the expression itself will never be evaluated. All the compiler cares for is its type. The same situation occurs in case we ask back for the address. The compiler again doesn’t care for the value, it will simply add the offset of the member to the address of the structure, in this particular case 0, and return the new address.

struct s {
        char m1;
        char m2;
};

/* This will print 1 */
printf("%d\n", &((struct s*)0)->m2);

Also note that the following two definitions are equivalent:

typeof(((struct s *)0)->m2) c;
char c;

offsetof(st, m)

This macro will return a byte offset of a member to the beginning of the structure. It is even part of the standard library (available in stddef.h ). Not in the kernel space though, as the standard C library is not present there. It is a little bit of the same 0 pointer dereference magic as we saw earlier and to avoid that modern compilers usually offer a built-in function, that implements that. Here is the messy version (from the kernel):

#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

It returns an address of a member called MEMBER of a structure of type TYPE that is stored in memory from address 0 (which happens to be the offset we’re looking for).

Putting It All Together


#define container_of(ptr, type, member) ({                      \
        const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
        (type *)( (char *)__mptr - offsetof(type,member) );})

When you look more closely at the original definition from the beginning of this post, you will start wondering if the first line is really good for anything. You will be right. The first line is not intrinsically important for the result of the macro, but it is there for type checking purposes. And what the second line really does? It subtracts the offset of the structure’s member from its addressyielding the address of the container structure. That’s it!
After you strip all the magical operators, constructs and tricks, it is that simple :-).
Note: This has been copied from  http://radek.io/2012/11/10/magical-container_of-macro/

Comments

Popular posts from this blog

Spinlock implementation in ARM architecture

SEV and WFE are the main instructions used for implementing spinlock in case of ARM architecture . Let's look briefly at those two instructions before looking into actual spinlock implementation. SEV SEV causes an event to be signaled to all cores within a multiprocessor system. If SEV is implemented, WFE must also be implemented. Let's look briefly at those two instructions before looking into actual spinlock implementation. WFE If the Event Register is not set, WFE suspends execution until one of the following events occurs: an IRQ interrupt, unless masked by the CPSR I-bit an FIQ interrupt, unless masked by the CPSR F-bit an Imprecise Data abort, unless masked by the CPSR A-bit a Debug Entry request, if Debug is enabled an Event signaled by another processor using the SEV instruction. In case of  spin_lock_irq( )/spin_lock_irqsave( ) , as IRQs are disabled, the only way to to resume after WFE intruction has executed is to execute SEV ins

Explanation of "struct task_struct"

This document tries to explain clearly what fields in the structure task_struct do. It's not complete and everyone is welcome to add information. Let's start by saying that each process under Linux is defined by a structure task_struct. The following information are available (on kernel 2.6.7): - volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */ - struct thread_info *thread_info; a pointer to a thread_info... - atomic_t usage; used by get_task_struct(). It's also set in kernel/fork.c. This value acts like a reference count on the task structure of a process. It can be used if we don't want to hold the tasklist_lock. - unsigned long flags;    /* per process flags, defined below */ process flag can be, for example, PF_DEAD when exit_notify() is called. List is of possible values is in include/linux/sched.h - unsigned long ptrace; used by ptrace a system call that provides the ability to a parent process to observe and con