
HIP Deep Dive: 10B Galaxy-Pair Angles on AMD (~0.15s)

Full spoiler deep dive of my native HIP solution: kernel structure, symmetry math, LDS design, mmap I/O, tuning workflow, and correctness checks.

Date

14-02-2026

Tags
HPC, GPU, HIP, CUDA, AMD

This is the full, spoiler-allowed deep dive of my HIP solution for the galaxy-correlation assignment.

No abstractions, no hand-waving: this is the exact strategy that took me from an early ~1.60s implementation to ~0.154s wall-clock on later optimized runs (best observed ~0.152s in the latest phase).

If you want to see the data, I also built a small 3D viewer for it: Galaxy Visualization

Problem statement and target metric

Input:

  • 100,000 real galaxies
  • 100,000 synthetic (random) galaxies

Each row contains right ascension α and declination δ in arcminutes, converted to radians during ingest.

Core workload:

  • Build histograms for DR, DD, and RR
  • Total pair workload is N^2 = 100,000^2 = 10^10 comparisons per histogram family
  • Angular bins: 360 * 4 = 1440 bins

The spherical-angle computation can be written as:

\theta_{12} = \arccos\left(\sin(\delta_1)\sin(\delta_2) + \cos(\delta_1)\cos(\delta_2)\cos(\alpha_1 - \alpha_2)\right)
// Pseudo-C: angular separation for one pair
float dot = sinf(delta1) * sinf(delta2)
          + cosf(delta1) * cosf(delta2) * cosf(alpha1 - alpha2);
dot = fminf(fmaxf(dot, -1.0f), 1.0f);
float theta12 = acosf(dot);

The final omega value per bin is:

\omega = \frac{DD - 2\,DR + RR}{RR}
// Pseudo-C: Landy-Szalay estimator at one histogram bin (assumes RR[bin] > 0)
float omega = (DD[bin] - 2.0f * DR[bin] + RR[bin]) / RR[bin];

Acronym guide

  • HIP: Heterogeneous-computing Interface for Portability (AMD's CUDA-like GPU programming model)
  • LDS: Local Data Share (AMD on-chip shared memory, similar in role to CUDA shared memory)
  • RDNA 2: AMD GPU architecture generation used for this tuning pass
  • DR/DD/RR: Data-Random, Data-Data, Random-Random pair histograms
  • RA/DEC: Right Ascension / Declination sky coordinates
  • mmap: Memory mapping file I/O directly into process address space

End-to-end architecture

The implementation is intentionally simple at a high level:

  1. Read both catalogs with mmap + custom ASCII parsers.
  2. Copy RA/DEC arrays to GPU.
  3. Launch one HIP kernel over a 2D (i, j) index space.
  4. Accumulate DR/DD/RR in shared-memory histograms.
  5. Flush per-block histograms into global 64-bit histograms.
  6. Copy histograms back, validate sums, compute omega, write output.

Why this maps well to GPU

Pair generation is embarrassingly parallel. Histogram updates are not.

That means the real challenge is minimizing contention and global atomic pressure while preserving exact counts.

I map one thread to one (i, j) pair:

const int i = blockIdx.x * blockDim.x + threadIdx.x;
const int j = blockIdx.y * blockDim.y + threadIdx.y;

And use a 2D launch grid:

dim3 threadsInBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
dim3 threadBlocks(
    (N + threadsInBlock.x - 1) / threadsInBlock.x,
    (N + threadsInBlock.y - 1) / threadsInBlock.y);

Kernel deep dive

1) Shared-memory histogram staging

Each block allocates three shared histograms:

  • s_hist_DR
  • s_hist_DD
  • s_hist_RR

Instead of 1440 bins, I allocate 1536 padded bins per histogram to reduce LDS bank conflict pressure on RDNA-class hardware.

const int num_bins = 1440;
const int num_bins_padded = 1536;
extern __shared__ unsigned int s_hist[];

unsigned int *s_hist_DR = s_hist;
unsigned int *s_hist_DD = s_hist_DR + num_bins_padded;
unsigned int *s_hist_RR = s_hist_DD + num_bins_padded;

Shared memory footprint per block is:

3 \times 1536 \times 4 \text{ bytes} = 18432 \text{ bytes}
// Pseudo-C: shared-memory budget per block
int histograms = 3;                 // DR, DD, RR
int padded_bins = 1536;
int bytes_per_counter = 4;          // uint32
int shared_bytes = histograms * padded_bins * bytes_per_counter; // 18432

That is a good fit for occupancy while still giving enough local accumulation capacity.

2) Numerical safety in binning

Floating-point dot products can drift slightly outside [-1, 1], so I clamp before acosf:

expr = fminf(fmaxf(expr, -1.0f), 1.0f);
// 57.29577951308232 = 180 / pi, converting radians to degrees
int histogram_index = int(acosf(expr) * 57.29577951308232f * binsperdegree);
histogram_index = min(max(histogram_index, 0), num_bins - 1);

This removes NaNs at the domain boundary and avoids out-of-range bin writes.

3) DR, DD, RR computation path

Per valid thread:

  • Always compute DR for (i, j).
  • Compute DD and RR only for upper triangle (j >= i).
  • Use a symmetric increment:
    • diagonal (i = j): +1
    • off-diagonal (i ≠ j): +2
if (j >= i)
{
    const unsigned int symmetric_increment = 1U + static_cast<unsigned int>(j != i);
    // DD update with symmetric_increment
    // RR update with symmetric_increment
}

This cuts DD/RR math work roughly in half while preserving exact totals.

4) Why symmetry is exact

For DD/RR, the full matrix has N^2 entries. Upper triangle with mirrored weighting gives:

\sum_{i=0}^{N-1}\sum_{j=i}^{N-1} \left(1 + [j \neq i]\right) = N + 2\cdot\frac{N(N-1)}{2} = N^2
// Pseudo-C: exact symmetry-preserving accumulation
for (int i = 0; i < N; ++i) {
    for (int j = i; j < N; ++j) {
        int w = (j == i) ? 1 : 2;
        DD[bin_dd(i, j)] += w;
        RR[bin_rr(i, j)] += w;
    }
}

So we keep mathematically exact counts and avoid duplicate trigonometric work.

5) Atomic strategy

  • Shared memory atomics: 32-bit atomicAdd to LDS arrays.
  • Global memory atomics: 64-bit atomicAdd when flushing block totals.
// Flush-loop body: b iterates over the block's histogram bins (loop not shown).
if (s_hist_DR[b] > 0)
    atomicAdd(&d_histogram_DR[b], (unsigned long long int)s_hist_DR[b]);

Rationale:

  • LDS updates are much cheaper than global updates.
  • 32-bit in LDS is enough per block (bounded per-kernel contribution).
  • 64-bit global counters safely hold totals up to 10^10.

Input pipeline: why mmap mattered

I replaced fscanf loops with:

  • open + fstat
  • mmap whole file
  • custom fast ASCII parsers (parse_int_fast, parse_float_fast)

This cut startup cost and made timing more stable.

int fd = open(file_path, O_RDONLY);   // fd checked against -1 before use
void *mapped_data = mmap(NULL, (size_t)file_info.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (mapped_data == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }

I still keep strict validation:

  • verify line count availability
  • fail on malformed rows
  • convert arcminutes to radians at ingest

Build and hardware tuning choices

From the build script and runtime tuning:

  • -O3
  • -ffast-math
  • -munsafe-fp-atomics
  • --offload-arch=gfx1030
  • -mwavefrontsize64

I tested multiple block shapes (16x32, 32x16, 16x16, and 32x32). The right answer depends on occupancy, LDS pressure, and register usage on your exact GPU/driver stack.

Correctness guardrails

I fail hard if histogram sums are wrong:

if (histogramDRsum != 10000000000L) exit(EXIT_FAILURE);
if (histogramDDsum != 10000000000L) exit(EXIT_FAILURE);
if (histogramRRsum != 10000000000L) exit(EXIT_FAILURE);

This catches race/logic regressions immediately.

On Linux x86-64 (LP64), long is 64-bit, so the 10^10 literals in these host-side checks fit without overflow.

Timing protocol and measured progression

My benchmark protocol is 5 back-to-back runs:

  • Run 1: warm-up/cache effects
  • Runs 2-5: steady-state behavior

I also print explicit phase timing:

  • input
  • kernel
  • output
  • end-to-end GPU phase

Runtime progression for the galaxy pair angle computation across different implementation stages:

| Date           | Version                     | Runtime (s) | Improvement vs. Previous | Improvement vs. Initial |
| -------------- | --------------------------- | ----------- | ------------------------ | ----------------------- |
| Course Project | Initial                     | 1.60        | -                        | -                       |
| June 2025      | HIP + Shared Memory         | 0.85        | 46.9%                    | 46.9%                   |
| February 2026  | Further Optimized           | 0.33        | 61.2%                    | 79.4%                   |
| February 2026  | Final Tuning                | 0.26        | 21.2%                    | 83.7%                   |
| 2026-02-14     | Native HIP + I/O + Symmetry | 0.154       | 40.8%                    | 90.4%                   |

Best observed run in the latest phase: 0.152s.

Key utility pattern: fail-fast HIP errors

#define HIP_ERR_CHECK(ans)                    \
    {                                         \
        gpuAssert((ans), __FILE__, __LINE__); \
    }

static inline void
gpuAssert(hipError_t code, const char *file, int line, bool abort = true)
{
    if (code != hipSuccess)
    {
        fprintf(stderr, "   GPUassert: %s %s %d\n", hipGetErrorString(code), file, line);
        if (abort)
            exit(code);
    }
}

Practical lessons from this implementation

  1. Histogram contention dominates long before floating-point math does.
  2. Shared-memory staging + padded bins is worth real time on AMD GPUs.
  3. Symmetry in DD/RR is one of the highest-impact algorithmic wins.
  4. Input path (mmap + parser) matters once kernels are fast.
  5. Hard correctness checks save days of benchmarking bad results.