Core dump & Memory Alignment

Core dump

A core dump is a file containing a process's address space (memory) when the process terminates unexpectedly. Core dumps may be produced on demand (such as by a debugger) or automatically upon termination. Core dumps are triggered by the kernel in response to program crashes, and may be passed to a helper program (such as systemd-coredump) for further processing. A core dump is not typically used by an average user, but may be passed on to developers upon request, where it can be invaluable as a post-mortem snapshot of the program's state at the time of the crash, especially if the fault is hard to reproduce reliably.
systemd's default behavior is to collect core dumps for all processes and store them in /var/lib/systemd/coredump.
The default action of certain signals is to cause a process to terminate and produce a core dump file, a disk file containing an image of the process's memory at the time of termination. This image can be used in a debugger (e.g., gdb) to inspect the state of the program at the time that it terminated.
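As a quick illustration, here is a minimal C sketch of a program that terminates with a core-dumping signal. The binary name crashme and the debugger commands in the comments are examples of my own, and they assume core dumps are enabled on the system (for instance via "ulimit -c unlimited" or systemd-coredump).

/* Minimal sketch: terminate with a core-dumping signal.
 * Assumes core dumps are enabled, e.g. "ulimit -c unlimited" in the shell,
 * or systemd-coredump collecting them into /var/lib/systemd/coredump.
 * The resulting dump for a hypothetical binary "crashme" could then be
 * inspected post-mortem with, for example:
 *     gdb ./crashme core        (classic core file)
 *     coredumpctl gdb crashme   (systemd-coredump)
 */
#include <stdlib.h>

int main(void)
{
    abort();   /* raises SIGABRT; its default action terminates and dumps core */
    return EXIT_SUCCESS;
}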

Aligned vs Unaligned Memory Access

The problem starts in the memory chip. Memory can only be read a certain number of bytes at a time. If you try to read from an unaligned memory address, your request causes the RAM chip to perform two read operations. For instance, assuming the RAM works in units of 8 bytes, reading 8 bytes from a relative offset of 5 forces the chip to do two reads: the first returns bytes 0-7, the second bytes 8-15. As a result, an already relatively slow memory access becomes even slower.
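To make this concrete, here is a small C sketch (my own illustration, not from the original article) of an 8-byte load taken from offset 5 of a buffer; memcpy is used so the unaligned access stays well-defined C.

/* Sketch: an 8-byte load from an unaligned offset inside a buffer.
 * memcpy keeps the access well-defined C; compilers typically lower it
 * to an ordinary (possibly unaligned) load on x86. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char buf[16] = {0};
    uint64_t value;

    memcpy(&value, buf + 5, sizeof value);   /* bytes 5..12: straddles an 8-byte unit */
    printf("value = %llu\n", (unsigned long long)value);
    return 0;
}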
Luckily, hardware designers learned to mitigate this problem a long time ago, although the solution is not absolute and the underlying issue remains. In modern computers the CPU caches the memory it reads. The cache is built of cache lines, typically 32 or 64 bytes long. Even when you read just one byte from memory, the CPU fetches a complete cache line and places it into the cache. This way, reading 8 bytes from offset 5 or from offset 0 makes no difference: the CPU reads 64 bytes either way.
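On Linux with glibc you can query the cache line size at run time. The sketch below uses the glibc-specific _SC_LEVEL1_DCACHE_LINESIZE name for sysconf(); it is an assumption that your system reports a value for it at all.

/* Sketch: query the L1 data cache line size on Linux/glibc via sysconf().
 * _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension; it may return 0 or -1
 * on systems that do not report it. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line size: %ld bytes\n", line);
    return 0;
}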
Things get less pretty, however, when a read crosses a cache line boundary. In that case the CPU has to fetch two complete cache lines, i.e. 128 bytes instead of just 8. This is a huge overhead, and it is the overhead I would like to demonstrate.
You may have noticed that I only talk about reads, not about writes. This is because memory reads should give us good enough results on their own. When you write data to memory, the CPU does two things: first it loads the cache line into the cache, then it modifies part of that line in the cache. Note that it does not write the cache line back to memory immediately; instead it waits for a more convenient time (for instance when it is less busy or, on SMP systems, when another processor needs the cache line).
We have already seen that, at least theoretically, the thing that hurts the performance of unaligned memory access is the time it takes the CPU to transfer memory into the cache. Compared to this, the time it takes the CPU to modify a few bytes in the cache line is negligible. As a result, it is enough to test only the reads.
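Below is a rough read-only benchmark sketch in the spirit of the article's test (my own approximation, not the original code): it sums one 8-byte load per cache line, starting either on a 64-byte boundary or 60 bytes into it so that every load straddles two lines. The buffer size, repeat count and the 64-byte line size are assumptions, and actual timings will vary with the CPU and its prefetchers.

/* Sketch: compare 8-byte loads that sit inside one 64-byte cache line
 * with loads that straddle two lines. Sizes and counts are assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (64 * 1024 * 1024)   /* 64 MiB, larger than typical caches */
#define LINE_SIZE  64
#define REPEAT     20

static uint64_t sum_at_offset(const unsigned char *buf, size_t offset)
{
    uint64_t sum = 0;
    for (size_t i = offset; i + sizeof(uint64_t) <= BUF_SIZE; i += LINE_SIZE) {
        uint64_t v;
        memcpy(&v, buf + i, sizeof v);   /* one 8-byte load per cache line */
        sum += v;
    }
    return sum;
}

static double run(const unsigned char *buf, size_t offset)
{
    struct timespec t0, t1;
    uint64_t sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPEAT; r++)
        sink += sum_at_offset(buf, offset);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    fprintf(stderr, "(checksum %llu)\n", (unsigned long long)sink);  /* keep the loop alive */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    unsigned char *buf = aligned_alloc(LINE_SIZE, BUF_SIZE);
    if (!buf)
        return 1;
    memset(buf, 1, BUF_SIZE);

    printf("aligned reads   : %.3f s\n", run(buf, 0));
    printf("straddling reads: %.3f s\n", run(buf, LINE_SIZE - 4));
    free(buf);
    return 0;
}

Compiling with something like gcc -O2 and comparing the two printed times should show the straddling reads taking noticeably longer on most x86 machines, though hardware prefetching can narrow the gap.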

http://www.alexonlinux.com/aligned-vs-unaligned-memory-access
