
Linux IPC Shootout: Shared Memory versus Unix Domain Sockets

A rigorous C benchmark comparing shared memory, Unix Domain Socket (UDS) stream, and UDS datagram for latency and throughput across embedded-relevant payload sizes.

Date: 01-05-2026

Tags: C, Linux, IPC, Performance, Embedded

Why

This started over lunch. A colleague and I were curious what the fastest and most reliable method of inter-process communication (IPC) is on Linux. IPC refers to any mechanism by which one process transfers data to another process on the same machine. Pipes are simple, sockets are portable, shared memory is fast but carries complexity. Neither of us really knew the answer, so I built a benchmark suite to find out. This post is a summary of what I learned.

The three mechanisms tested were shared memory, Unix Domain Sockets (UDS) in stream mode, and UDS in datagram mode. SPSC, which stands for single-producer, single-consumer, describes a concurrency pattern where exactly one thread writes to a shared buffer and exactly one thread reads from it, eliminating the need for multi-producer arbitration. UDS is a kernel facility, available on Linux and other Unix-like systems, that transports data between local processes using the standard socket API, bypassing the network stack entirely. Latency and throughput were measured across payload sizes from 32 bytes to 8 KB, corresponding to CAN frames, sensor readings, control commands, and audio data in embedded systems.

What I tested

The three IPC mechanisms and their respective transport and synchronization strategies.
| Mechanism | Transport | Synchronization |
| --- | --- | --- |
| Shared Memory | mmap(MAP_SHARED \| MAP_ANONYMOUS) | Lock-free SPSC ring buffer, __atomic acquire/release |
| UDS SOCK_STREAM | socketpair(AF_UNIX, SOCK_STREAM) | Kernel send/recv, 4 MB socket buffers, length-prefix framing |
| UDS SOCK_DGRAM | socketpair(AF_UNIX, SOCK_DGRAM) | Kernel send/recv, 4 MB socket buffers, native message boundaries |
Message sizes and their corresponding embedded applications.
| Payload Size | Representative Use Case |
| --- | --- |
| 32 B | CAN FD frame, small control command |
| 128 B | Sensor sample batch |
| 1 KB | Log entry, small image tile |
| 4 KB | Page-sized buffer |
| 8 KB | Audio frame, video macroblock |
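The socket-based variants share one setup path. As a minimal sketch of how such a pair might be created (the helper name `make_uds_pair` and its exact shape are my own, not taken from the benchmark's source):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Create a connected UDS pair of the given type (SOCK_STREAM or
   SOCK_DGRAM) and request enlarged kernel socket buffers on both ends.
   Returns 0 on success with the two endpoints in fds[0] and fds[1]. */
static int make_uds_pair(int type, int fds[2]) {
    if (socketpair(AF_UNIX, type, 0, fds) < 0)
        return -1;
    int bufsz = 4 * 1024 * 1024;  /* 4 MB, matching the benchmark config */
    for (int i = 0; i < 2; i++) {
        setsockopt(fds[i], SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz));
        setsockopt(fds[i], SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz));
    }
    return 0;
}
```

Note that the kernel clamps SO_SNDBUF/SO_RCVBUF to the wmem_max/rmem_max sysctl limits, so the effective buffer may be smaller than requested unless those limits are raised.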

Methodology

Benchmark integrity required control over confounding factors. Timing uses CLOCK_MONOTONIC_RAW, a clock source immune to NTP and adjtime adjustments. The producer process is pinned to core 0 and the consumer to core 1 via sched_setaffinity to prevent scheduler migrations from introducing latency variation. Each message size receives 200 untimed warmup rounds to populate caches and TLBs (Translation Lookaside Buffers, which cache virtual-to-physical address translations), ensuring measurements reflect steady-state behaviour. Latency is computed from 5000 timed round-trip iterations per size. Throughput is measured as an 8 MB unidirectional transfer with wall-clock timing and one-byte acknowledgement upon completion. A volatile accumulator forces the compiler to materialise all payload reads, preventing dead-code elimination from falsifying the result.
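The timing and pinning steps above can be sketched in a few lines. This is an illustrative fragment, not the benchmark's actual code; the helper names are mine:

```c
#define _GNU_SOURCE           /* for sched_setaffinity and CPU_* macros */
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Nanosecond timestamp from a clock immune to NTP/adjtime slewing. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Pin the calling process to one core so scheduler migrations cannot
   add latency variation between iterations. */
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

The producer would call `pin_to_core(0)` and the consumer `pin_to_core(1)`, with each timed round-trip bracketed by two `now_ns()` calls.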

Shared memory design

The shared memory benchmark uses a lock-free SPSC ring buffer allocated via mmap(MAP_SHARED | MAP_ANONYMOUS). Two independent ring buffers are created so that parent and child can communicate in both directions without interference.

```c
typedef struct {
    _Atomic uint32_t head;               /* written only by the producer */
    _Atomic uint32_t tail;               /* written only by the consumer */
    uint32_t         sizes[RING_SLOTS];  /* per-slot payload lengths     */
    uint8_t          data[];             /* flexible array: slot storage */
} ring_hdr_t;
```

The producer writes payload data and its size to the slot indexed by head, then issues an atomic release-store to publish the new head value. The consumer spins until head advances past tail, performing an acquire-load of head that guarantees visibility of the producer's writes, reads the published slot, and then advances tail with a release-store so the producer may reuse the slot. No system calls occur in the data path. All synchronisation is performed through atomic loads and stores with acquire-release semantics. The SPSC constraint makes a CAS loop (Compare-And-Swap loop) unnecessary: head is written by exactly one thread and tail by exactly one thread, so neither index is ever contended.
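The push/pop protocol can be sketched as follows. This is a simplified stand-in, assuming fixed-size slots and C11 stdatomic rather than the `__atomic` builtins; the actual benchmark uses a flexible data array and likely differs in detail:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 8   /* power of two, so masking wraps the index */
#define SLOT_BYTES 64

typedef struct {
    _Atomic uint32_t head;               /* written only by the producer */
    _Atomic uint32_t tail;               /* written only by the consumer */
    uint32_t sizes[RING_SLOTS];
    uint8_t  data[RING_SLOTS][SLOT_BYTES];
} ring_t;

/* Producer: copy the payload first, then publish with a release-store
   of head so the consumer's acquire-load sees a complete slot. */
static int ring_push(ring_t *r, const void *buf, uint32_t len) {
    uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - t == RING_SLOTS) return -1;          /* ring full */
    uint32_t slot = h & (RING_SLOTS - 1);
    memcpy(r->data[slot], buf, len);
    r->sizes[slot] = len;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return 0;
}

/* Consumer: acquire-load head to observe the payload writes, copy the
   slot out, then release tail so the producer may overwrite it. */
static int ring_pop(ring_t *r, void *buf, uint32_t *len) {
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t == h) return -1;                       /* ring empty */
    uint32_t slot = t & (RING_SLOTS - 1);
    *len = r->sizes[slot];
    memcpy(buf, r->data[slot], *len);
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return 0;
}
```

Because head and tail are monotonically increasing counters masked into the ring, the full/empty tests stay correct even across unsigned wraparound.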

Results

Round-trip latency

Median round-trip latency in nanoseconds. Lower is better.
| Size | Shared Memory (ns) | UDS Stream (ns) | UDS Datagram (ns) |
| --- | --- | --- | --- |
| 32 B | 270 | 5910 | 4640 |
| 64 B | 330 | 5750 | 4820 |
| 128 B | 430 | 7740 | 4800 |
| 256 B | 650 | 7330 | 5440 |
| 512 B | 1090 | 8110 | 6180 |
| 1 KB | 1980 | 8290 | 6860 |
| 2 KB | 3750 | 12190 | 9100 |
| 4 KB | 7290 | 15210 | 12760 |
| 8 KB | 14360 | 23280 | 19850 |

Unidirectional throughput

Unidirectional throughput in megabytes per second. Higher is better.
| Size | Shared Memory (MB/s) | UDS Stream (MB/s) | UDS Datagram (MB/s) |
| --- | --- | --- | --- |
| 32 B | 269.7 | 23.7 | 29.5 |
| 64 B | 411.8 | 47.8 | 60.6 |
| 128 B | 526.7 | 87.1 | 129.2 |
| 256 B | 518.8 | 145.2 | 238.2 |
| 512 B | 488.0 | 223.4 | 317.6 |
| 1 KB | 506.0 | 303.9 | 394.4 |
| 2 KB | 484.9 | 362.8 | 453.1 |
| 4 KB | 480.8 | 407.6 | 483.9 |
| 8 KB | 482.8 | 451.2 | 526.7 |

Discussion

Shared memory achieves a median round-trip latency of 270 ns at 32 bytes, which is 17 to 22 times lower than either UDS variant (4640 ns for datagram, 5910 ns for stream). This advantage persists across the entire size range and stems from a single architectural property: the data path contains zero system calls. The SPSC ring buffer performs all coordination through userspace atomic operations, whereas every UDS transfer must enter the kernel to copy data through a socket buffer.

UDS datagram mode consistently outperforms stream mode for small messages because it preserves message boundaries natively, removing the need for length-prefix framing. At 32 bytes, datagram latency is 4640 ns compared to 5910 ns for stream.
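The framing cost is easy to see in code. As an illustrative sketch of length-prefix framing over a stream socket (helper names are mine; production code would also need to handle EINTR and partial writes on the send side):

```c
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* SOCK_STREAM has no message boundaries, so every payload must be
   preceded by a 4-byte length the receiver uses to reassemble it. */
static int send_framed(int fd, const void *buf, uint32_t len) {
    if (write(fd, &len, sizeof(len)) != sizeof(len)) return -1;
    return write(fd, buf, len) == (ssize_t)len ? 0 : -1;
}

/* Streams may deliver short reads, so loop until 'len' bytes arrive. */
static int read_full(int fd, void *buf, size_t len) {
    uint8_t *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0) return -1;
        p += n;
        len -= n;
    }
    return 0;
}

static int recv_framed(int fd, void *buf, uint32_t *len) {
    if (read_full(fd, len, sizeof(*len)) < 0) return -1;
    return read_full(fd, buf, *len);
}
```

With SOCK_DGRAM none of this exists: one send is one recv, which is where the small-message advantage comes from.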

Throughput exhibits a different pattern. Shared memory peaks at 526.7 MB/s (128 B payload) and stabilises around 480 MB/s. UDS datagram throughput climbs steadily with payload size, reaching 526.7 MB/s at 8 KB, which matches shared memory's peak and slightly exceeds its 482.8 MB/s at the same size. Large payloads amortise the per-syscall overhead to the point where the kernel's datagram pipeline competes with userspace shared memory on throughput.

Conclusion

For applications requiring deterministic sub-microsecond latency (engine control loops, audio processing pipelines, sensor fusion), shared memory with a lock-free SPSC ring buffer is the appropriate choice. It offers approximately 20 times lower latency than Unix Domain Sockets at small message sizes, with throughput that remains competitive across all payloads. For applications where integration simplicity takes priority over peak performance, UDS datagrams provide a practical alternative with latency in the low microsecond range and throughput scaling to match shared memory on transfers above 1 KB.

The benchmark suite is available on GitHub. A single make run reproduces the complete measurement set.