Blog
HIP Deep Dive: 10B Galaxy-Pair Angles on AMD (~0.15s)
Full spoiler deep dive of my native HIP solution: kernel structure, symmetry math, LDS design, mmap I/O, tuning workflow, and correctness checks.
14-02-2026
This is the full, spoiler-allowed deep dive of my HIP solution for the galaxy-correlation assignment.
No abstractions, no hand-waving: this is the exact strategy that took me from an early ~1.60s implementation to ~0.154s wall-clock on later optimized runs (best observed ~0.152s in the latest phase).
If you want to see the data, I also built a small 3D viewer for it: Galaxy Visualization
Input:
Each row contains right ascension and declination in arcminutes, converted to radians during ingest.
Core workload: binning all pair angles into 360 * 4 = 1440 histogram bins.

The spherical-angle computation can be written as:
// Pseudo-C: angular separation for one pair
float dot = sinf(delta1) * sinf(delta2)
+ cosf(delta1) * cosf(delta2) * cosf(alpha1 - alpha2);
dot = fminf(fmaxf(dot, -1.0f), 1.0f);
float theta12 = acosf(dot);
The final omega value per bin is:
// Pseudo-C: Landy-Szalay estimator at one histogram bin
float omega = (DD[bin] - 2.0f * DR[bin] + RR[bin]) / RR[bin];
mmap: memory-mapping file I/O directly into the process address space.

The implementation is intentionally simple at a high level:
- mmap + custom ASCII parsers.

Pair generation is embarrassingly parallel. Histogram updates are not.
That means the real challenge is minimizing contention and global atomic pressure while preserving exact counts.
I map one thread to one pair:
const int i = blockIdx.x * blockDim.x + threadIdx.x;
const int j = blockIdx.y * blockDim.y + threadIdx.y;
And use a 2D launch grid:
dim3 threadsInBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
dim3 threadBlocks(
(N + threadsInBlock.x - 1) / threadsInBlock.x,
(N + threadsInBlock.y - 1) / threadsInBlock.y);
Each block allocates three shared histograms:

- s_hist_DR
- s_hist_DD
- s_hist_RR

Instead of 1440 bins, I allocate 1536 padded bins per histogram to reduce LDS bank-conflict pressure on RDNA-class hardware.
const int num_bins = 1440;
const int num_bins_padded = 1536;
extern __shared__ unsigned int s_hist[];
unsigned int *s_hist_DR = s_hist;
unsigned int *s_hist_DD = s_hist_DR + num_bins_padded;
unsigned int *s_hist_RR = s_hist_DD + num_bins_padded;
Shared memory footprint per block is:
// Pseudo-C: shared-memory budget per block
int histograms = 3; // DR, DD, RR
int padded_bins = 1536;
int bytes_per_counter = 4; // uint32
int shared_bytes = histograms * padded_bins * bytes_per_counter; // 18432
That is a good fit for occupancy while still giving enough local accumulation capacity.
Floating-point dot products can drift slightly outside [-1, 1], so I clamp before acosf:
expr = fminf(fmaxf(expr, -1.0f), 1.0f); // clamp into the acosf domain
int histogram_index = int(acosf(expr) * 57.29577951308232f * binsperdegree); // 57.2957... = 180/pi
histogram_index = min(max(histogram_index, 0), num_bins - 1); // guard the bin range
This removes NaNs at the domain boundary and avoids out-of-range bin writes.
Per valid thread, only the upper triangle (j >= i) is processed: the diagonal (j == i) contributes +1, and every off-diagonal pair contributes +2 to cover its mirror.

if (j >= i)
{
const unsigned int symmetric_increment = 1U + static_cast<unsigned int>(j != i);
// DD update with symmetric_increment
// RR update with symmetric_increment
}
This cuts DD/RR math work roughly in half while preserving exact totals.
For DD/RR, the full matrix has N * N entries. The upper triangle with mirrored weighting gives:
// Pseudo-C: exact symmetry-preserving accumulation
for (int i = 0; i < N; ++i) {
for (int j = i; j < N; ++j) {
int w = (j == i) ? 1 : 2;
DD[bin_dd(i, j)] += w;
RR[bin_rr(i, j)] += w;
}
}
So we keep mathematically exact counts and avoid duplicate trigonometric work.
- Per-pair updates: atomicAdd into the LDS arrays.
- Global atomicAdd only when flushing block totals, skipping empty bins:

if (s_hist_DR[b] > 0)
atomicAdd(&d_histogram_DR[b], (unsigned long long int)s_hist_DR[b]);
Rationale: LDS atomics are far cheaper than global atomics, and skipping zero bins keeps global traffic proportional to the bins a block actually touched.
Why mmap mattered

I replaced fscanf loops with:

- open + fstat
- mmap of the whole file
- custom parsers (parse_int_fast, parse_float_fast)

This reduced startup cost and made timing more stable.
int fd = open(file_path, O_RDONLY);
void *mapped_data = mmap(NULL, (size_t)file_info.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
I still keep strict validation:
From the build script and runtime tuning:

- -O3
- -ffast-math
- -munsafe-fp-atomics
- --offload-arch=gfx1030
- -mwavefrontsize64

I tested multiple block shapes (16x32, 32x16, 16x16, and 32x32). The right answer depends on occupancy, LDS pressure, and register usage on your exact GPU/driver stack.
I fail hard if histogram sums are wrong:
if (histogramDRsum != 10000000000L) exit(EXIT_FAILURE);
if (histogramDDsum != 10000000000L) exit(EXIT_FAILURE);
if (histogramRRsum != 10000000000L) exit(EXIT_FAILURE);
This catches race/logic regressions immediately.
On Linux x86-64, long is 64-bit, so 10^10 fits on host-side sum checks.
My benchmark protocol is five back-to-back runs, with explicit per-phase timing printed for each run.
| Date | Version | Runtime (s) | Improvement vs. Previous | Improvement vs. Initial |
|---|---|---|---|---|
| Course Project | Initial | 1.60 | - | - |
| June 2025 | HIP + Shared Memory | 0.85 | 46.9% | 46.9% |
| February 2026 | Further Optimized | 0.33 | 61.2% | 79.4% |
| February 2026 | Final Tuning | 0.26 | 21.2% | 83.7% |
| 2026-02-14 | Native HIP + I/O + Symmetry | 0.154 | 40.8% | 90.4% |
Best observed run in the latest phase: 0.152s.
#define HIP_ERR_CHECK(ans) \
    do { \
        gpuAssert((ans), __FILE__, __LINE__); \
    } while (0)

static inline void
gpuAssert(hipError_t code, const char *file, int line, bool abort = true)
{
    if (code != hipSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", hipGetErrorString(code), file, line);
        if (abort)
            exit(code);
    }
}
Fast host-side I/O (the mmap + parser path) matters once kernels are fast.