
Project

HIP Image FX

GPU-accelerated image processing framework with production-ready autotuning for optimal kernel configurations.

Role

GPU Compute

Date

11-01-2026

Tech
HIP, OpenMP, Meson, ROCm

Highlights

  • GPU-accelerated filters on AMD HIP with a CPU fallback
  • Production-ready autotuning system for optimal kernel configurations
  • CLI supports single images and full directory batch processing
  • Benchmarks documented (GPU vs CPU, including OpenMP)

Overview

HIP Image FX applies fast image filters (grayscale, negative, Gaussian blur) on AMD GPUs, with a CPU fallback for portability. The CLI supports both single-image and batch directory processing. The framework features a production-ready autotuning system that automatically discovers optimal kernel configurations for your specific GPU through empirical measurement.

Problem statement: what are we trying to solve?

This project began as a practical GPU programming challenge: process real images quickly on AMD hardware, but keep the tool useful even when a GPU is unavailable. The core problem is not only writing kernels. The real engineering challenge is balancing compute speed, memory transfer overhead, and ease of use in one command-line workflow.

The target outcome is a reliable image-processing tool that can run the same filters on both HIP and CPU paths, handle single files and large directories, and make performance behavior measurable across resolutions and batch sizes. The goal is to understand when GPU acceleration is truly beneficial, where it is limited by transfer cost, and how to tune execution safely without forcing users to hand-configure low-level parameters.

Figure: Benchmark (Jan 2026): GPU speedup vs CPU across resolutions (single-thread and OpenMP).

Problem / approach

  • Accelerate common image filters on GPUs while maintaining CPU fallback compatibility.
  • Provide simple CLI interface for filter selection and I/O configuration.
  • Investigate when GPU acceleration is worthwhile vs CPU for different filter types.

Prerequisites

  • AMD ROCm stack (HIP)
  • OpenMP for the CPU path
  • Meson build system

Solution

  • HIP kernels for each filter with a --use-cpu fallback option.
  • Meson build system targeting ROCm with HIP toolchain.
  • Flexible CLI supporting single images and batch directory processing.
  • Production-ready autotuning framework for self-optimizing kernels.

Autotuning Framework

The autotuning system automatically finds the fastest GPU kernel configuration for your hardware through empirical benchmarking:

# First run: autotuning (~100-200ms overhead)
./build/hip-img-fx --input photo.jpg --output result.jpg --filter grayscale
# Output: [AutoTune] Benchmarking... Selected [16x8] (0.034ms)

# Subsequent runs: cached configuration (zero overhead)
./build/hip-img-fx --input photo.jpg --output result.jpg --filter grayscale
# Output: [AutoTune] Using cached [16x8]

Key Features:

  • Zero-configuration optimization: Tests 15-30 block size configurations automatically
  • Performance impact: 12-18% faster than default configurations
  • Three-tier caching: Thread-local → Persistent JSON → Benchmarking (only on cache miss)
  • Production-ready: Embedded defaults for common GPUs, compile-time safety with C++20 concepts
  • GPU-aware: Separate optimal configurations per architecture (gfx1030, gfx1100, etc.)

The framework handles all tuning transparently, with no code changes needed. For custom kernels, a comprehensive API enables integration with minimal boilerplate.
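At its core, the benchmarking tier reduces to timing each candidate block size and keeping the fastest. The sketch below shows only that selection step, with illustrative names (`BlockDim`, `select_fastest`) that are not the project's actual API; the real framework layers the thread-local and persistent JSON caches on top of this:

```cpp
#include <cassert>
#include <limits>
#include <map>

// Illustrative sketch of the benchmark-and-select step; not the project's API.
struct BlockDim {
    int x, y;
    bool operator<(const BlockDim& o) const {
        return x < o.x || (x == o.x && y < o.y);
    }
};

// Given empirically measured kernel times per candidate block size,
// keep the fastest one (the "[AutoTune] ... Selected [16x8]" step).
BlockDim select_fastest(const std::map<BlockDim, double>& timings_ms) {
    BlockDim best{0, 0};
    double best_ms = std::numeric_limits<double>::max();
    for (const auto& [cfg, ms] : timings_ms) {
        if (ms < best_ms) { best_ms = ms; best = cfg; }
    }
    return best;
}
```

In the full framework, the winning configuration would then be written to the persistent cache keyed by GPU architecture, so later runs skip the benchmark entirely.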

Build & run

  • Configure: meson setup build --native-file native/hip.ini --reconfigure
  • Compile: ninja -C build
  • Single image: ./build/hip-img-fx --input input.jpg --filter grayscale --output output.jpg
  • Batch: ./build/hip-img-fx --input examples --filter grayscale --output examples/output
  • Filters: grayscale, negative, gaussian-blur

CLI help excerpt

hip-img-fx --help shows options for input/output (file or dir), filter selection, --use-cpu, and notes on batch vs single-image modes.

./build/hip-img-fx --help
============================
Running HIP Image FX v1.0.0
============================

Usage: hip-img-fx [options]
Options:
  --input <input_file|input_dir>     Specifies the input file or directory path.
  --output <output_file|output_dir>  Specifies the output file or directory path.
  --filter <filter_type>             Specifies the type of filter to apply 
                                     (e.g., "grayscale", "negative", "gaussian-blur").
  --use-cpu                          Use CPU for processing instead of GPU.
  --batch-size <N>                   Number of images to process per GPU batch (default: 64).
  --help                             Displays this help information.

Notes:
  - For batch processing, specify both --input and --output as directories.
  - For single image processing, specify both as files.
  - Supported filters: grayscale, negative, gaussian-blur

Example run/output (GPU batch)

============================
Running HIP Image FX v1.0.0
============================
Input: /home/avic/Pictures/train/
Output: /home/avic/Pictures/train_output
Filter Type: GRAYSCALE
Batch Size: 64
Using GPU for processing.
    HIP Device Count: 1
    Device 0: AMD Radeon RX 6900 XT
        Compute Capability: ------------ = 10.3
        Total Global Memory: ----------- = 17163091968
        Shared Memory per Block: ------- = 65536
        Registers per Block: ----------- = 32768
        Warp Size: --------------------- = 32
        Max Threads per Block: --------- = 1024
        Max Threads Dimension: --------- = (1024, 1024, 1024)
        Max Grid Size: ----------------- = (2147483647, 65536, 65536)
        Clock Rate: -------------------- = 2660000
        Total Constant Memory: --------- = 2147483647
        Multiprocessor Count: ---------- = 40
        L2 Cache Size: ----------------- = 4194304
        Max Threads per Multiprocessor:  = 2048
        Unified Addressing: ------------ = 0
        Memory Clock Rate: ------------- = 1000000
        Memory Bus Width: -------------- = 256
        Peak Memory Bandwidth: --------- = 64.000000

num threads: 32
GPU batch size: 64
Loaded 6499 images for batch processing.
[AutoTuner] Loaded 9 embedded default configurations
Batch processing complete: 6499 images processed.
Total processing time: 00m 01s 420ms

Project structure

hip-img-fx/
├── src/
│   ├── app/                    # Application entry point & batch processing
│   ├── cli/                    # Command-line argument parsing
│   ├── core/                   # GPU utilities, timing, image I/O
│   │   ├── gpu_utils.cpp/.h    # HIP pipeline, events, streams
│   │   └── image.cpp/.h        # STB-based image loading
│   └── filters/                # HIP kernels & CPU implementations
│       ├── grayscale.hip.cpp
│       ├── negative.hip.cpp
│       └── gaussian_blur.hip.cpp
├── bench/
│   ├── run_bench.cpp           # Benchmark harness
│   ├── scripts/
│   │   ├── run_benchmark.sh    # Automated benchmark runner
│   │   └── analyze_results.py  # Performance analysis tool
│   └── results/                # CSV output directory
├── examples/                   # Sample images
├── native/
│   └── hip.ini                 # Meson HIP configuration
└── meson.build                 # Build system

Filter examples (before → after)

  • Grayscale: original frame → HIP grayscale
  • Gaussian blur: original frame → HIP Gaussian blur
  • Negative: original frame → HIP negative

Test System

  • GPU: AMD Radeon RX 6900 XT (gfx1030)
  • CPU: AMD Ryzen 3950X (32 threads OpenMP)
  • HIP toolchain: hipcc (ROCm clang 19.0)

Benchmark (Jan 2026)

Benchmarks use a synchronous pipeline (H2D → kernel → D2H) across 3 filters, 4 resolutions (512² → 4096²), and 5 batch sizes (1/8/16/32/64).

Key Results

  • Peak speedup vs OpenMP: 41× (Gaussian blur, 1024², batch 8)
  • Peak speedup vs single-thread CPU: 582× (Gaussian blur, 4096²)
  • Average transfer overhead: ~72% of total GPU time
  • Compute-bound workload: Gaussian blur achieves 25-41× vs OpenMP with 22-29% transfer overhead
  • Memory-bound workloads: Grayscale/negative show 89-97% transfer overhead, often near parity with OpenMP (0.2-7× speedup)

Performance Analysis

  • Speedup vs resolution (vs single-thread CPU and vs OpenMP)
  • GPU time breakdown: H2D / kernel / D2H
  • Transfer overhead (% of total GPU time)
  • Effective bandwidth (transfer-limited behavior)
  • Absolute times (CPU vs GPU) by configuration
  • Batch size scaling (1 vs 32 vs 64)

Performance by Filter

Gaussian Blur (Compute-Intensive)

Gaussian Blur Performance Data
| Resolution | Batch Size | GPU Time | Kernel Time | Transfer % | Speedup vs OpenMP | Speedup vs CPU |
|------------|------------|----------|-------------|------------|-------------------|----------------|
| 512²  | 1  | 0.426 ms  | 0.217 ms  | 49.0% | 25.85× | 358.29× |
| 512²  | 8  | 0.381 ms  | 0.204 ms  | 46.5% | 27.45× | 385.88× |
| 512²  | 16 | 0.400 ms  | 0.202 ms  | 49.5% | 39.50× | 361.94× |
| 512²  | 32 | 0.396 ms  | 0.202 ms  | 49.0% | 25.26× | 352.83× |
| 512²  | 64 | 0.403 ms  | 0.204 ms  | 49.4% | 24.62× | 363.66× |
| 1024² | 1  | 1.129 ms  | 0.824 ms  | 27.0% | 30.42× | 501.63× |
| 1024² | 8  | 1.133 ms  | 0.808 ms  | 28.7% | 40.59× | 505.09× |
| 1024² | 16 | 1.149 ms  | 0.821 ms  | 28.5% | 30.08× | 495.33× |
| 1024² | 32 | 1.154 ms  | 0.820 ms  | 29.0% | 33.16× | 494.37× |
| 1024² | 64 | 1.156 ms  | 0.823 ms  | 28.8% | 35.04× | 498.23× |
| 2048² | 1  | 4.269 ms  | 3.217 ms  | 24.7% | 32.15× | 529.63× |
| 2048² | 8  | 4.340 ms  | 3.295 ms  | 24.1% | 34.18× | 531.13× |
| 2048² | 16 | 4.335 ms  | 3.279 ms  | 24.4% | 37.87× | 530.14× |
| 2048² | 32 | 4.315 ms  | 3.285 ms  | 23.9% | 32.01× | 520.04× |
| 2048² | 64 | 4.335 ms  | 3.301 ms  | 23.9% | 31.66× | 529.82× |
| 4096² | 1  | 16.910 ms | 13.139 ms | 22.3% | 38.36× | 581.63× |
| 4096² | 8  | 16.947 ms | 13.133 ms | 22.5% | 35.79× | 579.58× |
| 4096² | 16 | 16.902 ms | 13.145 ms | 22.2% | 35.44× | 576.04× |
| 4096² | 32 | 16.967 ms | 13.212 ms | 22.1% | 36.40× | 575.91× |
| 4096² | 64 | 17.149 ms | 13.370 ms | 22.0% | 33.09× | 572.22× |

Analysis: Gaussian blur is the ideal GPU workload. The 11×11 convolution kernel (121 operations per pixel) provides sufficient compute intensity to amortize transfer costs. Transfer overhead decreases from ~49% to ~22% as resolution increases. Speedup remains excellent (25-41× vs OpenMP) across all tested resolutions and batch sizes, with peak performance at 1024² (batch 8: 40.6× vs OpenMP, 505× vs single-thread CPU).

Grayscale (Memory-Bound)

Grayscale Performance Data
| Resolution | Batch Size | GPU Time | Kernel Time | Transfer % | Speedup vs OpenMP | Speedup vs CPU |
|------------|------------|----------|-------------|------------|-------------------|----------------|
| 512²  | 1  | 0.167 ms | 0.014 ms | 91.6% | 2.82× | 2.58×  |
| 512²  | 8  | 0.187 ms | 0.008 ms | 95.7% | 0.22× | 4.06×  |
| 512²  | 16 | 0.204 ms | 0.007 ms | 96.6% | 0.50× | 2.76×  |
| 512²  | 32 | 0.196 ms | 0.006 ms | 96.9% | 0.50× | 2.34×  |
| 512²  | 64 | 0.204 ms | 0.007 ms | 96.6% | 0.74× | 2.09×  |
| 1024² | 1  | 0.319 ms | 0.036 ms | 88.7% | 0.85× | 10.36× |
| 1024² | 8  | 0.352 ms | 0.025 ms | 92.9% | 0.81× | 5.42×  |
| 1024² | 16 | 0.356 ms | 0.029 ms | 91.9% | 1.15× | 5.24×  |
| 1024² | 32 | 0.360 ms | 0.031 ms | 91.4% | 0.62× | 5.30×  |
| 1024² | 64 | 0.363 ms | 0.031 ms | 91.5% | 0.72× | 4.90×  |
| 2048² | 1  | 1.141 ms | 0.108 ms | 90.5% | 1.10× | 7.73×  |
| 2048² | 8  | 1.227 ms | 0.123 ms | 90.0% | 1.00× | 6.15×  |
| 2048² | 16 | 1.237 ms | 0.122 ms | 90.1% | 0.85× | 5.90×  |
| 2048² | 32 | 1.236 ms | 0.122 ms | 90.1% | 0.77× | 5.85×  |
| 2048² | 64 | 1.241 ms | 0.123 ms | 90.1% | 0.78× | 6.13×  |
| 4096² | 1  | 4.303 ms | 0.401 ms | 90.7% | 1.07× | 6.93×  |
| 4096² | 8  | 4.479 ms | 0.487 ms | 89.1% | 1.03× | 6.60×  |
| 4096² | 16 | 4.409 ms | 0.484 ms | 89.0% | 0.95× | 6.56×  |
| 4096² | 32 | 4.340 ms | 0.483 ms | 88.9% | 0.98× | 6.63×  |
| 4096² | 64 | 4.450 ms | 0.491 ms | 89.0% | 1.01× | 6.62×  |

Analysis: Grayscale conversion is severely memory-bound. Kernel execution stays below 0.5 ms even for 4096² images, while transfers dominate (typically 89-97% overhead). Performance is inconsistent across batch sizes, and OpenMP beats the GPU outright in many configurations (speedup < 1×). Best case: 2.82× vs OpenMP at 512² batch 1; worst case: 0.22× at 512² batch 8.
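To make the per-pixel cost concrete, a CPU-side grayscale pass might look like the following. The BT.601 luminance weights are an assumption (the project's exact coefficients are not shown here), and `grayscale` is an illustrative name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for a grayscale filter. The BT.601 luminance weights
// are an assumption; the project's exact coefficients are not shown here.
std::vector<uint8_t> grayscale(const std::vector<uint8_t>& rgb) {
    std::vector<uint8_t> out(rgb.size() / 3);
    for (std::size_t i = 0; i < out.size(); ++i) {
        const float r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        // Weighted sum per pixel; +0.5f rounds to nearest before truncation.
        out[i] = static_cast<uint8_t>(0.299f * r + 0.587f * g + 0.114f * b + 0.5f);
    }
    return out;
}
```

A handful of arithmetic operations per three input bytes is far too little work to hide a PCIe round trip, which is exactly the 89-97% transfer overhead the table shows.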

Negative (Memory-Bound)

Negative Filter Performance Data
| Resolution | Batch Size | GPU Time | Kernel Time | Transfer % | Speedup vs OpenMP | Speedup vs CPU |
|------------|------------|----------|-------------|------------|-------------------|----------------|
| 512²  | 1  | 0.174 ms | 0.013 ms | 92.5% | 3.77× | 4.50×  |
| 512²  | 8  | 0.183 ms | 0.006 ms | 96.7% | 1.44× | 4.31×  |
| 512²  | 16 | 0.199 ms | 0.005 ms | 97.5% | 0.44× | 3.99×  |
| 512²  | 32 | 0.196 ms | 0.005 ms | 97.4% | 0.63× | 2.83×  |
| 512²  | 64 | 0.211 ms | 0.005 ms | 97.6% | 1.35× | 2.75×  |
| 1024² | 1  | 0.314 ms | 0.029 ms | 90.8% | 6.97× | 10.99× |
| 1024² | 8  | 0.345 ms | 0.018 ms | 94.8% | 0.95× | 7.29×  |
| 1024² | 16 | 0.351 ms | 0.021 ms | 94.0% | 0.64× | 6.58×  |
| 1024² | 32 | 0.351 ms | 0.021 ms | 94.0% | 0.89× | 6.44×  |
| 1024² | 64 | 0.354 ms | 0.021 ms | 94.1% | 0.90× | 6.34×  |
| 2048² | 1  | 1.119 ms | 0.078 ms | 93.0% | 0.96× | 8.97×  |
| 2048² | 8  | 1.230 ms | 0.087 ms | 92.9% | 0.93× | 7.52×  |
| 2048² | 16 | 1.236 ms | 0.090 ms | 92.7% | 0.95× | 7.41×  |
| 2048² | 32 | 1.228 ms | 0.090 ms | 92.7% | 0.93× | 7.21×  |
| 2048² | 64 | 1.229 ms | 0.091 ms | 92.6% | 0.94× | 7.38×  |
| 4096² | 1  | 4.231 ms | 0.297 ms | 93.0% | 1.08× | 8.55×  |
| 4096² | 8  | 4.363 ms | 0.355 ms | 91.9% | 1.75× | 8.50×  |
| 4096² | 16 | 4.249 ms | 0.356 ms | 91.6% | 1.12× | 8.30×  |
| 4096² | 32 | 4.247 ms | 0.357 ms | 91.6% | 1.06× | 8.35×  |
| 4096² | 64 | 4.343 ms | 0.363 ms | 91.6% | 1.13× | 8.19×  |

Analysis: Like grayscale, the negative filter is memory-bound, with ~91-98% transfer overhead. The simple per-byte inversion executes in under 0.4 ms even at 4096², so transfers dominate. Performance is frequently near parity with CPU OpenMP. Best case: 6.97× at 1024² batch 1; worst case: 0.44× at 512² batch 16.
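For contrast with the blur's 121 operations per pixel, the entire negative filter is one subtraction per byte. This CPU-side sketch (illustrative, not the project's code) mirrors what the HIP kernel does per element:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU reference for a negative filter: one subtraction per channel byte.
// Illustrative sketch, not the project's code.
std::vector<uint8_t> negative(std::vector<uint8_t> pixels) {
    for (uint8_t& v : pixels) {
        v = static_cast<uint8_t>(255 - v); // invert each channel byte
    }
    return pixels;
}
```

With so little arithmetic per byte, the kernel finishes almost instantly and the H2D/D2H transfers set the floor on total time.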

Key Takeaways

Compute-Bound vs Memory-Bound Performance

Gaussian blur (O(n² × kernel_size²)) demonstrates when GPU acceleration excels: sufficient compute intensity to amortize transfer costs, achieving 25-41× vs OpenMP and up to 582× vs single-thread CPU. Grayscale and negative reveal the limitation: simple per-pixel operations (0.22-6.97× vs OpenMP) where PCIe transfers consume 89-98% of GPU time.
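A toy cost model makes this contrast concrete. The sketch below is illustrative only; all throughput parameters are placeholders, not measurements from this system:

```cpp
#include <cassert>

// Toy cost model for the synchronous H2D -> kernel -> D2H pipeline.
// All throughput parameters are illustrative placeholders, not measurements.
double transfer_share(double pixels, double ops_per_pixel, double ops_per_ms,
                      double bytes_per_pixel, double bytes_per_ms) {
    double kernel_ms   = pixels * ops_per_pixel / ops_per_ms;
    double transfer_ms = 2.0 * pixels * bytes_per_pixel / bytes_per_ms; // H2D + D2H
    return transfer_ms / (transfer_ms + kernel_ms);
}
```

Plugging in ~121 ops/pixel for the 11×11 blur versus a handful for grayscale drops the modeled transfer share sharply, mirroring the measured 22-29% versus 89-98% split.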

Batch Processing Impact

Extensive testing across 5 batch sizes (1/8/16/32/64) reveals:

  • Compute-bound filters: Minimal impact (~2-5% variation), batch 8 often optimal
  • Memory-bound filters: Significant variability as transfer overhead shifts, smaller batches (1-8) often better
  • Single large allocation: Amortizes some overhead but doesn't overcome fundamental memory-bound limitations
  • Optimal batch size: Highly dependent on filter complexity, resolution, and GPU architecture

Resolution Scaling

As resolution increases, kernel execution grows quadratically while transfer overhead (as a percentage) decreases. This makes GPU acceleration increasingly attractive for larger images, particularly for compute-intensive operations. At 4096², even memory-bound filters show 6-9× speedup vs single-thread CPU, though they remain only at or near parity with OpenMP.

Autotuning System

The production-ready autotuning framework demonstrates that kernel configuration matters: empirical testing shows 12-18% performance gains over fixed defaults. The three-tier caching system (thread-local → persistent JSON → benchmarking) ensures optimal performance with zero overhead after first run.

Learning Outcomes

This project demonstrates:

  1. HIP Fundamentals

    • Kernel launch configuration (hipLaunchKernelGGL)
    • Memory management (hipMalloc, hipMemcpy)
    • Event-based profiling (hipEventRecord, hipEventElapsedTime)
    • Batch processing with contiguous memory allocation
  2. GPU Performance Engineering

    • Identifying memory-bound vs compute-bound kernels
    • Analyzing bandwidth utilization
    • Optimizing memory access patterns (coalescing)
    • Understanding when GPU acceleration is beneficial vs CPU
    • Batch size tuning for different workload characteristics
    • Empirical autotuning for optimal kernel configurations
  3. Production Engineering Practices

    • Reproducible benchmark infrastructure
    • Statistical analysis (mean, std dev)
    • Performance regression detection
    • Clear documentation of optimization tradeoffs
    • Data-driven architectural decisions
    • Self-optimizing systems through autotuning frameworks
    • Compile-time safety with C++20 concepts
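The statistical reduction mentioned above can be as small as the following sketch; the project's harness may well use the sample (N-1) variance instead of the population variance assumed here:

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Minimal mean / standard-deviation reduction for benchmark samples.
// Sketch only; a real harness may use the sample (N-1) variance instead.
std::pair<double, double> mean_stddev(const std::vector<double>& xs) {
    double mean = 0.0;
    for (double x : xs) mean += x;
    mean /= static_cast<double>(xs.size());
    double var = 0.0;
    for (double x : xs) var += (x - mean) * (x - mean);
    var /= static_cast<double>(xs.size()); // population variance
    return {mean, std::sqrt(var)};
}
```

Reporting the deviation alongside the mean is what makes regression detection possible: a run is suspicious when it falls several deviations outside the recorded baseline.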

Future Work

High-Priority Optimizations

  1. Separable Gaussian Blur (3-5× speedup expected)

    • Split 11×11 2D convolution into 2× 11×1 passes
    • Reduces memory reads from 121 to 22 per pixel
  2. Tile-Based Processing with Shared Memory

    • Load image tiles into shared memory
    • Reuse data across threads
    • Reduces global memory pressure
  3. Multi-GPU Support

    • Distribute workload across multiple GPUs
    • Peer-to-peer transfers between GPUs
  4. Half-Precision (FP16) Kernels

    • Leverage RDNA matrix acceleration
    • 2× throughput for blur operations
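The separable-blur optimization in item 1 can be sketched on the CPU as two 1D passes. Grayscale floats and clamp-to-edge borders are simplifying assumptions here, and the function names are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of the separable-blur idea: a horizontal 1D pass followed by a
// vertical 1D pass replaces the single 11x11 2D convolution. Grayscale
// floats and clamp-to-edge borders are simplifying assumptions.
using Image = std::vector<float>; // row-major, width * height

static Image pass1d(const Image& src, int w, int h,
                    const std::vector<float>& k, bool horizontal) {
    const int r = static_cast<int>(k.size()) / 2;
    Image dst(src.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int t = -r; t <= r; ++t) {
                const int sx = horizontal ? std::clamp(x + t, 0, w - 1) : x;
                const int sy = horizontal ? y : std::clamp(y + t, 0, h - 1);
                acc += k[t + r] * src[sy * w + sx];
            }
            dst[y * w + x] = acc;
        }
    }
    return dst;
}

// Two 1D passes: an 11-tap kernel reads 22 values per pixel instead of 121.
Image separable_blur(const Image& src, int w, int h, const std::vector<float>& k) {
    return pass1d(pass1d(src, w, h, k, /*horizontal=*/true), w, h, k, false);
}
```

A normalized kernel leaves a constant image unchanged, which is a convenient correctness check before porting the same structure to HIP.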

Research Directions

  • Warp-level primitives: Use shuffle intrinsics for reduction operations
  • Texture memory: Test performance with texture cache for blur operations
  • Async Streams: Overlap H2D/kernel/D2H using HIP streams (requires careful benchmarking)

Credits

Example images (unsplash.com):