Blog
Linux IPC Shootout: Shared Memory versus Unix Domain Sockets
A rigorous C benchmark comparing shared memory, Unix Domain Socket (UDS) stream, and UDS datagram for latency and throughput across embedded-relevant payload sizes.
01-05-2026
This started over lunch. A colleague and I were curious what the fastest and most reliable method of inter-process communication (IPC) is on Linux. IPC refers to any mechanism by which one process transfers data to another process on the same machine. Pipes are simple, sockets are portable, shared memory is fast but carries complexity. Neither of us really knew the answer, so I built a benchmark suite to find out. This post is a summary of what I learned.
The three mechanisms tested were shared memory, Unix Domain Sockets (UDS) in stream mode, and UDS in datagram mode. The shared-memory variant uses an SPSC (single-producer, single-consumer) design: exactly one thread writes to the shared buffer and exactly one thread reads from it, which eliminates any need for multi-producer arbitration. UDS transports data between local processes through the kernel using the standard socket API, bypassing the network stack entirely. Latency and throughput were measured across payload sizes from 32 bytes to 8 KB, corresponding to CAN frames, sensor readings, control commands, and audio data in embedded systems.
| Mechanism | Transport | Synchronization |
|---|---|---|
| Shared Memory | mmap(MAP_SHARED \| MAP_ANONYMOUS) | Lock-free SPSC ring buffer, __atomic acquire/release |
| UDS SOCK_STREAM | socketpair(AF_UNIX, SOCK_STREAM) | Kernel send/recv, 4 MB socket buffers, length-prefix framing |
| UDS SOCK_DGRAM | socketpair(AF_UNIX, SOCK_DGRAM) | Kernel send/recv, 4 MB socket buffers, native message boundaries |

| Payload Size | Representative Use Case |
|---|---|
| 32 B | CAN FD frame, small control command |
| 128 B | Sensor sample batch |
| 1 KB | Log entry, small image tile |
| 4 KB | Page-sized buffer |
| 8 KB | Audio frame, video macroblock |
Benchmark integrity required control over confounding factors. Timing uses CLOCK_MONOTONIC_RAW, a clock source immune to NTP and adjtime adjustments. The producer process is pinned to core 0 and the consumer to core 1 via sched_setaffinity to prevent scheduler migrations from introducing latency variation. Each message size receives 200 untimed warmup rounds to populate caches and TLBs (Translation Lookaside Buffers, which cache virtual-to-physical address translations), ensuring measurements reflect steady-state behaviour. Latency is computed from 5000 timed round-trip iterations per size. Throughput is measured as an 8 MB unidirectional transfer with wall-clock timing and one-byte acknowledgement upon completion. A volatile accumulator forces the compiler to materialise all payload reads, preventing dead-code elimination from falsifying the result.
The shared memory benchmark uses a lock-free SPSC ring buffer allocated via mmap(MAP_SHARED | MAP_ANONYMOUS). Two independent ring buffers are created so that parent and child can communicate in both directions without interference.
```c
typedef struct {
    _Atomic uint32_t head;       // producer's publish index
    _Atomic uint32_t tail;       // consumer's release index
    uint32_t sizes[RING_SLOTS];  // per-slot payload lengths
    uint8_t  data[];             // flexible array member: slot storage
} ring_hdr_t;
```
The producer writes payload data and its size to the slot indexed by head, then issues an atomic release-store to publish the new head value. The consumer spins on head with an acquire-load until a new value appears, which guarantees visibility of the producer's writes; it then reads the published slot and advances tail with a release-store so the producer can reuse the slot. No system calls occur in the data path. All synchronisation is performed through atomic loads and stores with acquire-release semantics. The SPSC constraint makes a CAS loop (Compare-And-Swap loop) unnecessary, as only one thread ever writes each index.
| Size | Shared Memory (ns) | UDS Stream (ns) | UDS Datagram (ns) |
|---|---|---|---|
| 32 B | 270 | 5910 | 4640 |
| 64 B | 330 | 5750 | 4820 |
| 128 B | 430 | 7740 | 4800 |
| 256 B | 650 | 7330 | 5440 |
| 512 B | 1090 | 8110 | 6180 |
| 1 KB | 1980 | 8290 | 6860 |
| 2 KB | 3750 | 12190 | 9100 |
| 4 KB | 7290 | 15210 | 12760 |
| 8 KB | 14360 | 23280 | 19850 |
| Size | Shared Memory (MB/s) | UDS Stream (MB/s) | UDS Datagram (MB/s) |
|---|---|---|---|
| 32 B | 269.7 | 23.7 | 29.5 |
| 64 B | 411.8 | 47.8 | 60.6 |
| 128 B | 526.7 | 87.1 | 129.2 |
| 256 B | 518.8 | 145.2 | 238.2 |
| 512 B | 488.0 | 223.4 | 317.6 |
| 1 KB | 506.0 | 303.9 | 394.4 |
| 2 KB | 484.9 | 362.8 | 453.1 |
| 4 KB | 480.8 | 407.6 | 483.9 |
| 8 KB | 482.8 | 451.2 | 526.7 |
Shared memory achieves a median round-trip latency of 270 ns at 32 bytes, which is 17 to 22 times lower than either UDS variant (4640 ns for datagram, 5910 ns for stream). This advantage persists across the entire size range and stems from a single architectural property: the data path contains zero system calls. The SPSC ring buffer performs all coordination through userspace atomic operations, whereas every UDS transfer must enter the kernel to copy data through a socket buffer.
UDS datagram mode consistently outperforms stream mode for small messages because it preserves message boundaries natively, removing the need for length-prefix framing. At 32 bytes, datagram latency is 4640 ns compared to 5910 ns for stream.
Throughput exhibits a different pattern. Shared memory peaks at 526.7 MB/s (128 B payload) and stabilises around 480 MB/s. UDS datagram throughput climbs steadily with payload size and reaches 526.7 MB/s at 8 KB, matching shared memory's peak and edging past its steady-state figure. Large payloads amortise the per-syscall overhead to the point where the kernel's datagram pipeline competes with userspace shared memory on throughput.
For applications requiring deterministic sub-microsecond latency (engine control loops, audio processing pipelines, sensor fusion), shared memory with a lock-free SPSC ring buffer is the appropriate choice. It offers approximately 20 times lower latency than Unix Domain Sockets at small message sizes, with throughput that remains competitive across all payloads. For applications where integration simplicity takes priority over peak performance, UDS datagrams provide a practical alternative with latency in the low microsecond range and throughput scaling to match shared memory on transfers above 1 KB.
The benchmark suite is available on GitHub. A single make run reproduces the complete measurement set.