
Galaxy Pair Angles: From Course Project to 0.85s

A non-spoiler look at a GPU programming course project at Åbo Akademi: computing 10B galaxy-pair angles fast, and then optimizing it further after graduation with HIP on AMD.

Published: 01-07-2025

Tags: HPC, GPU, HIP, CUDA

This was one of the GPU programming assignments at Åbo Akademi, and it’s still one of my favorite “simple to state, brutal to run” problems.

I’m going to keep this post non-spoiler for future students taking the course: you’ll get the problem framing, the performance story and a few implementation patterns, but not the full kernel logic.

If you want to see the data, I also built a small 3D viewer for it: Galaxy Visualization.

The challenge

You’re given two sky catalogs:

  • 100,000 real galaxies
  • 100,000 synthetic (random) galaxies

Each row is a right ascension α and a declination δ in arcminutes (converted to radians when read). The core workload is building angle histograms for galaxy pairs (DR, DD, RR), which means computing 10 billion pairwise angles.

The spherical-angle computation can be written as:

θ₁₂ = arccos( sin(δ₁)·sin(δ₂) + cos(δ₁)·cos(δ₂)·cos(α₁ − α₂) )
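As a host-side sketch (not the kernel itself), the formula translates directly into plain C++. The names `arcmin_to_rad` and `pair_angle` are mine for illustration; the conversion factor follows from 1 degree = 60 arcminutes:

```cpp
#include <cmath>

// Convert arcminutes to radians: 1 degree = 60 arcminutes.
inline double arcmin_to_rad(double arcmin)
{
	return arcmin * M_PI / (180.0 * 60.0);
}

// Angle between two sky positions (right ascension a, declination d, in radians).
inline double pair_angle(double a1, double d1, double a2, double d2)
{
	double c = sin(d1) * sin(d2) + cos(d1) * cos(d2) * cos(a1 - a2);
	// Clamp: floating-point drift can push the argument just outside [-1, 1],
	// and acos would then return NaN.
	if (c > 1.0) c = 1.0;
	if (c < -1.0) c = -1.0;
	return acos(c);
}
```

The clamp matters in practice: for identical points the dot product can come out as 1.0000000000000002, and without it the angle silently becomes NaN.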

My approach (and why HIP mattered)

Most students wrote CUDA-first solutions. I went the other way: I focused on running the GPU work through HIP so I could iterate and benchmark on AMD hardware.

The version I kept optimizing after my studies uses a shared-memory strategy (details omitted here) to reduce global contention when building histograms.

Results (measured)

These are the key facts from my report and the later optimization work:

  • The core workload, 10 billion galaxy-pair angles, initially ran in ~1.6 s.
  • As of June 2025, the shared-memory kernel, compiled with HIP and run on an AMD RX 6900 XT, brings that down to 0.85 s, a 46.8% improvement.
  • In the course context, my report version was among the fastest submissions (not that it was a competition).

A few non-spoiler code patterns

The newer optimized code lives in galaxy_cuda.cpp. Even though it’s written in CUDA-like syntax, it fits the same workflow described in my report (hipify + hipcc) when targeting AMD.
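For reference, that workflow looks roughly like this; file names are illustrative, `hipify-perl` ships with ROCm, and `gfx1030` is the architecture string for the RX 6900 XT:

```shell
# Translate CUDA API calls in the source to their HIP equivalents.
hipify-perl galaxy_cuda.cpp > galaxy_hip.cpp

# Compile for the AMD target with hipcc.
hipcc --offload-arch=gfx1030 -O3 galaxy_hip.cpp -o galaxy
```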

1) Fail fast on GPU errors

#include <cstdio>
#include <cstdlib>

#define gpuErrchk(ans)                        \
	{                                         \
		gpuAssert((ans), __FILE__, __LINE__); \
	}

inline void
gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
	if (code != cudaSuccess)
	{
		fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
		if (abort) exit(code);
	}
}

2) Launch configuration and shared-memory budgeting

This problem is “embarrassingly parallel” in the pair enumeration sense, but not in the histogram accumulation sense — the kernel launch and shared-memory size become part of the performance story:

dim3 threadsInBlock(32, 32);
// (N + b - 1) / b is already a ceiling division on integers,
// so no floating-point ceil() is needed.
dim3 threadBlocks(
	(N + threadsInBlock.x - 1) / threadsInBlock.x,
	(N + threadsInBlock.y - 1) / threadsInBlock.y);

size_t sharedMemSize = 3 * binsperdegree * totaldegrees * sizeof(unsigned long long int);
fill_histograms<<<threadBlocks, threadsInBlock, sharedMemSize>>>(
	d_real_rasc, d_real_decl, d_rand_rasc, d_rand_decl,
	d_histogram_DR, d_histogram_DD, d_histogram_RR);

3) Shared-memory histogram layout (high-level)

Inside the kernel, the histograms are laid out in shared memory and initialized cooperatively. I’m intentionally not including the angle math or binning logic here.

extern __shared__ unsigned long long int s_hist[];
unsigned long long int *s_hist_DR = s_hist;
unsigned long long int *s_hist_DD = s_hist_DR + num_bins;
unsigned long long int *s_hist_RR = s_hist_DD + num_bins;

for (int b = threadIdx.y * blockDim.x + threadIdx.x; b < num_bins; b += blockDim.x * blockDim.y)
{
	s_hist_DR[b] = 0;
	s_hist_DD[b] = 0;
	s_hist_RR[b] = 0;
}
__syncthreads();
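The natural counterpart at the end of the kernel is a cooperative flush of the shared histograms into global memory. This is a generic pattern rather than my kernel's exact code, and it assumes the `d_histogram_*` pointers are `unsigned long long int *`:

```cuda
__syncthreads(); // all block-local accumulation must be finished first

// Each thread flushes a strided subset of bins: one atomicAdd per bin per
// block, instead of one per pair, is what keeps global contention low.
for (int b = threadIdx.y * blockDim.x + threadIdx.x; b < num_bins; b += blockDim.x * blockDim.y)
{
	atomicAdd(&d_histogram_DR[b], s_hist_DR[b]);
	atomicAdd(&d_histogram_DD[b], s_hist_DD[b]);
	atomicAdd(&d_histogram_RR[b], s_hist_RR[b]);
}
```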

What I learned

This project was a good reminder that performance isn’t just “more threads”:

  • The math is the easy part; memory traffic and contention dominate.
  • Small structural changes (like how you accumulate histograms) can matter more than micro-optimizing trig.
  • Being able to iterate on real hardware (in my case AMD + HIP) helped a lot.