
HIP Deep Dive: 10B Galaxy-Pair Angles on AMD (~0.15s)

Full spoiler deep dive of my native HIP solution: kernel structure, symmetry math, LDS design, mmap I/O, tuning workflow, and correctness checks.

Date

14-02-2026

Tags
HPC, GPU, HIP, CUDA, AMD

This is the full, spoiler-allowed deep dive of my HIP solution for the galaxy-correlation assignment.

No abstractions, no hand-waving: this is the exact strategy that took me from an early ~1.60s implementation to ~0.154s wall-clock on later optimized runs (best observed ~0.152s in the latest phase).

If you want to see the data, I also built a small 3D viewer for it: Galaxy Visualization

Problem statement and target metric

Input:

  • 100,000 real galaxies
  • 100,000 synthetic (random) galaxies

Each row contains right ascension α and declination δ in arcminutes, converted to radians during ingest.

Core workload:

  • Build histograms for DR, DD, and RR
  • Total pair workload is N^2 = 100,000^2 = 10^10 comparisons per histogram family
  • Angular bins: 360 * 4 = 1440 bins

The spherical-angle computation can be written as:

\theta_{12} = \arccos\left(\sin(\delta_1)\sin(\delta_2) + \cos(\delta_1)\cos(\delta_2)\cos(\alpha_1 - \alpha_2)\right)
// Pseudo-C: angular separation for one pair
float dot = sinf(delta1) * sinf(delta2)
          + cosf(delta1) * cosf(delta2) * cosf(alpha1 - alpha2);
dot = fminf(fmaxf(dot, -1.0f), 1.0f);
float theta12 = acosf(dot);

The final omega value per bin is:

\omega = \frac{DD - 2\,DR + RR}{RR}
// Pseudo-C: Landy-Szalay estimator at one histogram bin (assumes RR[bin] > 0)
float omega = (DD[bin] - 2.0f * DR[bin] + RR[bin]) / RR[bin];

Acronym guide

  • HIP: Heterogeneous-computing Interface for Portability (AMD's CUDA-like GPU programming model)
  • LDS: Local Data Share (AMD on-chip shared memory, similar in role to CUDA shared memory)
  • RDNA 2: AMD GPU architecture generation used for this tuning pass
  • DR/DD/RR: Data-Random, Data-Data, Random-Random pair histograms
  • RA/DEC: Right Ascension / Declination sky coordinates
  • mmap: Memory mapping file I/O directly into process address space

End-to-end architecture

The implementation is intentionally simple at a high level:

  1. Read both catalogs with mmap + custom ASCII parsers.
  2. Copy RA/DEC arrays to GPU.
  3. Launch one HIP kernel over a 2D (i, j) index space.
  4. Accumulate DR/DD/RR in shared-memory histograms.
  5. Flush per-block histograms into global 64-bit histograms.
  6. Copy histograms back, validate sums, compute omega, write output.

Why this maps well to GPU

Pair generation is embarrassingly parallel. Histogram updates are not.

That means the real challenge is minimizing contention and global atomic pressure while preserving exact counts.

I map one thread to one (i, j) pair:

const int i = blockIdx.x * blockDim.x + threadIdx.x;
const int j = blockIdx.y * blockDim.y + threadIdx.y;

And use a 2D launch grid:

dim3 threadsInBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
dim3 threadBlocks(
    (N + threadsInBlock.x - 1) / threadsInBlock.x,
    (N + threadsInBlock.y - 1) / threadsInBlock.y);

Kernel deep dive

1) Shared-memory histogram staging

Each block allocates three shared histograms:

  • s_hist_DR
  • s_hist_DD
  • s_hist_RR

Instead of 1440 bins, I allocate 1536 padded bins per histogram to reduce LDS bank conflict pressure on RDNA-class hardware.

const int num_bins = 1440;
const int num_bins_padded = 1536;
extern __shared__ unsigned int s_hist[];

unsigned int *s_hist_DR = s_hist;
unsigned int *s_hist_DD = s_hist_DR + num_bins_padded;
unsigned int *s_hist_RR = s_hist_DD + num_bins_padded;

Shared memory footprint per block is:

3 \times 1536 \times 4 \text{ bytes} = 18432 \text{ bytes}
// Pseudo-C: shared-memory budget per block
int histograms = 3;                 // DR, DD, RR
int padded_bins = 1536;
int bytes_per_counter = 4;          // uint32
int shared_bytes = histograms * padded_bins * bytes_per_counter; // 18432

That is a good fit for occupancy while still giving enough local accumulation capacity.

2) Numerical safety in binning

Floating-point dot products can drift slightly outside [-1, 1], so I clamp before acosf:

expr = fminf(fmaxf(expr, -1.0f), 1.0f);
// 57.29577951308232 = 180 / pi, converting radians to degrees
int histogram_index = int(acosf(expr) * 57.29577951308232f * binsperdegree);
histogram_index = min(max(histogram_index, 0), num_bins - 1);

This removes NaNs at the domain boundary and avoids out-of-range bin writes.

3) DR, DD, RR computation path

Per valid thread:

  • Always compute DR for (i, j).
  • Compute DD and RR only for upper triangle (j >= i).
  • Use a symmetric increment:
    • diagonal (i = j): +1
    • off-diagonal (i ≠ j): +2
if (j >= i)
{
    const unsigned int symmetric_increment = 1U + static_cast<unsigned int>(j != i);
    // DD update with symmetric_increment
    // RR update with symmetric_increment
}

This cuts DD/RR math work roughly in half while preserving exact totals.

4) Why symmetry is exact

For DD/RR, the full matrix has N^2 entries. Upper triangle with mirrored weighting gives:

\sum_{i=0}^{N-1}\sum_{j=i}^{N-1} \left(1 + [j \neq i]\right) = N + 2\cdot\frac{N(N-1)}{2} = N^2
// Pseudo-C: exact symmetry-preserving accumulation
for (int i = 0; i < N; ++i) {
    for (int j = i; j < N; ++j) {
        int w = (j == i) ? 1 : 2;
        DD[bin_dd(i, j)] += w;
        RR[bin_rr(i, j)] += w;
    }
}

So we keep mathematically exact counts and avoid duplicate trigonometric work.

5) Atomic strategy

  • Shared memory atomics: 32-bit atomicAdd to LDS arrays.
  • Global memory atomics: 64-bit atomicAdd when flushing block totals.
// Flush-loop body: b iterates over the block's histogram bins (loop not shown).
if (s_hist_DR[b] > 0)
    atomicAdd(&d_histogram_DR[b], (unsigned long long int)s_hist_DR[b]);

Rationale:

  • LDS updates are much cheaper than global updates.
  • 32-bit in LDS is enough per block (bounded per-kernel contribution).
  • 64-bit global counters safely hold totals up to 10^10.

Input pipeline: why mmap mattered

I replaced fscanf loops with:

  • open + fstat
  • mmap whole file
  • custom fast ASCII parsers (parse_int_fast, parse_float_fast)

This cut startup cost and made timing more stable.

int fd = open(file_path, O_RDONLY);   // fd checked against -1 before use
void *mapped_data = mmap(NULL, (size_t)file_info.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (mapped_data == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }

I still keep strict validation:

  • verify line count availability
  • fail on malformed rows
  • convert arcminutes to radians at ingest

Build and hardware tuning choices

From the build script and runtime tuning:

  • -O3
  • -ffast-math
  • -munsafe-fp-atomics
  • --offload-arch=gfx1030
  • -mwavefrontsize64

I tested multiple block shapes (16x32, 32x16, 16x16, and 32x32). The right answer depends on occupancy, LDS pressure, and register usage on your exact GPU/driver stack.

Correctness guardrails

I fail hard if histogram sums are wrong:

if (histogramDRsum != 10000000000L) exit(EXIT_FAILURE);
if (histogramDDsum != 10000000000L) exit(EXIT_FAILURE);
if (histogramRRsum != 10000000000L) exit(EXIT_FAILURE);

This catches race/logic regressions immediately.

On Linux x86-64 (LP64), long is 64-bit, so the 10^10 literals in these host-side checks fit without overflow.

Timing protocol and measured progression

My benchmark protocol is 5 back-to-back runs:

  • Run 1: warm-up/cache effects
  • Runs 2-5: steady-state behavior

I also print explicit phase timing:

  • input
  • kernel
  • output
  • end-to-end GPU phase

Runtime progression for the galaxy pair angle computation across different implementation stages:

| Date           | Version                     | Runtime (s) | Improvement vs. Previous | Improvement vs. Initial |
| -------------- | --------------------------- | ----------- | ------------------------ | ----------------------- |
| Course Project | Initial                     | 1.60        | -                        | -                       |
| June 2025      | HIP + Shared Memory         | 0.85        | 46.9%                    | 46.9%                   |
| February 2026  | Further Optimized           | 0.33        | 61.2%                    | 79.4%                   |
| February 2026  | Final Tuning                | 0.26        | 21.2%                    | 83.7%                   |
| 2026-02-14     | Native HIP + I/O + Symmetry | 0.154       | 40.8%                    | 90.4%                   |

Best observed run in the latest phase: 0.152s.

Key utility pattern: fail-fast HIP errors

#define HIP_ERR_CHECK(ans)                    \
    {                                         \
        gpuAssert((ans), __FILE__, __LINE__); \
    }

static inline void
gpuAssert(hipError_t code, const char *file, int line, bool abort = true)
{
    if (code != hipSuccess)
    {
        fprintf(stderr, "   GPUassert: %s %s %d\n", hipGetErrorString(code), file, line);
        if (abort)
            exit(code);
    }
}

Practical lessons from this implementation

  1. Histogram contention dominates long before floating-point math does.
  2. Shared-memory staging + padded bins is worth real time on AMD GPUs.
  3. Symmetry in DD/RR is one of the highest-impact algorithmic wins.
  4. Input path (mmap + parser) matters once kernels are fast.
  5. Hard correctness checks save days of benchmarking bad results.