Parallel Patterns
Reduction, scan, and histogram
The Building Blocks of GPU Computing
Every complex GPU algorithm is built from a handful of fundamental patterns. Learn these patterns, and you can decompose almost any parallel problem into familiar pieces.
The patterns we cover here—reduction, prefix sum, and histogram—appear constantly in GPU programming. Image processing, physics simulation, sorting, graph algorithms, machine learning: all rely on these primitives. They are the vocabulary of parallel thinking.
Each pattern solves a specific communication problem. How do thousands of threads combine their results into a single value? How does each thread learn what all the threads before it computed? How do threads safely count occurrences when multiple threads might count the same category? These questions have elegant parallel answers.
Parallel Reduction
Reduction combines many values into one. Sum an array. Find the maximum. Compute the logical AND of a set of booleans. Any associative operation can be parallelized through reduction.
The sequential approach is obvious: iterate through the array, accumulating a running result. But this is fundamentally serial—each step depends on the previous. How do we parallelize an inherently cumulative operation?
The trick is the tree structure. In the first step, pair up adjacent elements and combine them. This produces half as many values. Repeat: pair and combine, pair and combine. After log₂(n) steps, a single value remains.
Interactive: Parallel Reduction
Each level, adjacent pairs are summed in parallel. After log₂(n) levels, only the final sum remains. The highlighted nodes show which values are being combined.
Watch how the work proceeds in waves. Each level of the tree can execute in parallel—all pairs at a given level are independent. The number of active threads halves at each step, but the total work remains O(n). The critical insight is that the depth of the computation is only O(log n), a massive speedup over the O(n) sequential depth.
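The same idea can be seen in a minimal CPU sketch in Python (illustrative only, assuming a power-of-two input; the helper name tree_reduce is ours):

```python
# CPU sketch of tree reduction: each pass combines adjacent pairs,
# halving the number of values, until one remains.
def tree_reduce(values, combine=lambda a, b: a + b):
    levels = 0
    while len(values) > 1:
        # All pairs at this level are independent -> parallel on a GPU
        values = [combine(values[i], values[i + 1])
                  for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels

total, depth = tree_reduce([3, 1, 4, 1, 5, 9, 2, 6])
print(total, depth)  # 31 in log2(8) = 3 levels
```

Each list comprehension corresponds to one level of the tree: on a GPU, all pairs inside it would be combined simultaneously.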
// Workgroup reduction using shared memory
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> partial_sums: array<f32>;

var<workgroup> shared_data: array<f32, 256>;

@compute @workgroup_size(256)
fn reduce(@builtin(local_invocation_id) local_id: vec3<u32>,
          @builtin(workgroup_id) wg_id: vec3<u32>) {
    let tid = local_id.x;
    let gid = wg_id.x * 256u + tid;

    // Load data into shared memory
    shared_data[tid] = input[gid];
    workgroupBarrier();

    // Tree reduction
    for (var stride = 128u; stride > 0u; stride = stride >> 1u) {
        if (tid < stride) {
            shared_data[tid] = shared_data[tid] + shared_data[tid + stride];
        }
        workgroupBarrier();
    }

    // Thread 0 writes the result
    if (tid == 0u) {
        partial_sums[wg_id.x] = shared_data[0];
    }
}

The pattern has two phases. First, each workgroup reduces its portion of the data to a single value, storing it in partial_sums. Then a second kernel (or a single workgroup) reduces these partial sums to the final result.
The workgroupBarrier() calls are essential. Each iteration reads values written in the previous iteration. Without the barrier, threads might read stale data, producing incorrect results.
Bank Conflicts
Shared memory is divided into banks—typically 32 banks in modern GPUs. When multiple threads access different addresses in the same bank simultaneously, the accesses are serialized, creating a bank conflict that degrades performance.
A reduction written with interleaved addressing, where thread t combines the elements at indices 2·stride·t and 2·stride·t + stride and the stride doubles each step, suffers badly here: the active threads' addresses are spaced 2·stride words apart, and once that spacing is a multiple of the bank count they pile into just a few banks.
The kernel above instead uses the common optimization of sequential addressing: the active threads form the contiguous range tid < stride and access the contiguous indices tid and tid + stride, so the 32 threads of a warp fall into 32 distinct banks. The choice of access pattern can make or break reduction performance.
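To make the difference concrete, here is a small Python illustration (assuming 32 banks of word-sized entries; the helper banks_hit is ours) that maps each thread's shared-memory address to a bank for one step of each scheme:

```python
# Which bank each of 32 threads hits in one reduction step,
# under interleaved vs. sequential addressing (32 banks assumed).
BANKS = 32

def banks_hit(indices):
    return [i % BANKS for i in indices]

stride = 16
# Interleaved addressing: thread t reads index 2*stride*t
interleaved = banks_hit([2 * stride * t for t in range(32)])
# Sequential addressing: thread t reads index t + stride
sequential = banks_hit([t + stride for t in range(32)])

print(len(set(interleaved)))  # 1  -> all 32 threads collide in one bank
print(len(set(sequential)))   # 32 -> conflict-free
```

With interleaved addressing every thread lands in bank 0 and the 32 accesses serialize; with sequential addressing each thread gets its own bank.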
Prefix Sum (Scan)
Prefix sum, also called scan, computes all running totals simultaneously. Given input [a₀, a₁, a₂, …], the exclusive prefix sum produces [0, a₀, a₀+a₁, …] and the inclusive scan produces [a₀, a₀+a₁, a₀+a₁+a₂, …].
This seems inherently sequential—how can we compute the sum of the first k elements before computing the sum of the first k - 1? The parallel algorithm is less intuitive than reduction but equally elegant.
Interactive: Prefix Sum (Hillis-Steele)
The Hillis-Steele algorithm: in each step, element i adds the value from position i - offset. The offset doubles each step (1, 2, 4, ...) until all prefix sums are computed.
The Hillis-Steele algorithm proceeds in log₂(n) steps. In step d, each element i adds the value at position i - 2ᵈ (if it exists). After step 0, each element contains the sum of itself and its left neighbor. After step 1, it contains the sum of itself and its three predecessors. After the final step, each element i contains the sum of elements 0 through i.
// Hillis-Steele inclusive scan (for small arrays within a workgroup)
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

var<workgroup> temp: array<f32, 256>;

@compute @workgroup_size(256)
fn scan(@builtin(local_invocation_id) local_id: vec3<u32>) {
    let tid = local_id.x;

    // Load input
    temp[tid] = input[tid];
    workgroupBarrier();

    // Hillis-Steele scan
    for (var offset = 1u; offset < 256u; offset = offset << 1u) {
        var val = temp[tid];
        if (tid >= offset) {
            val = val + temp[tid - offset];
        }
        workgroupBarrier();
        temp[tid] = val;
        workgroupBarrier();
    }

    output[tid] = temp[tid];
}

Note the double barrier in the loop. The first ensures all threads have read their input before any thread writes. The second ensures all threads have written before the next iteration reads. This "read-barrier-write-barrier" pattern is common when values depend on neighbors from the previous iteration.
The Hillis-Steele algorithm does O(n log n) total work, more than the sequential O(n). For large arrays, the Blelloch algorithm achieves O(n) work while maintaining O(log n) depth, at the cost of more complex code.
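As a rough illustration of what the Blelloch scheme does, here is a CPU sketch in Python of its two phases, an upsweep (a reduction tree) followed by a downsweep, assuming a power-of-two input (the helper name is ours):

```python
# CPU sketch of the Blelloch work-efficient exclusive scan.
def blelloch_exclusive_scan(a):
    a = list(a)
    n = len(a)
    # Upsweep: build partial sums in a tree, like a reduction.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    # Downsweep: clear the root, then push prefixes back down,
    # swapping each left child with its parent's running total.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            a[i - d], a[i] = a[i], a[i] + a[i - d]
        d //= 2
    return a

print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

Each inner loop touches n/(2d) elements, so the total work across all passes sums to O(n) rather than O(n log n).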
Why Prefix Sum Matters
Prefix sum may seem like a narrow operation, but it solves a fundamental problem in parallel computing: compaction. Suppose you have an array and want to keep only elements that satisfy some predicate. Sequentially, you would iterate and append matching elements to an output buffer. In parallel, every thread wants to write simultaneously—but to which index?
Prefix sum provides the answer. First, each thread marks its element as 1 (keep) or 0 (discard). Then compute the exclusive prefix sum of these marks. The prefix sum value at position i tells thread i exactly where to write its element in the compacted output.
This pattern appears everywhere: stream compaction, radix sort, sparse matrix operations, collision detection. Prefix sum is the "parallel append."
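A CPU sketch of compaction in Python (the helper name compact is ours; on a GPU, each loop iteration would be one thread):

```python
# Stream compaction via exclusive prefix sum.
# keep_if marks each element 1 (keep) or 0 (discard); the scan of the
# marks gives every surviving element a unique output slot, so all
# writes could happen in parallel without collisions.
def compact(values, keep_if):
    marks = [1 if keep_if(v) else 0 for v in values]
    # Exclusive prefix sum of the marks
    scan, total = [], 0
    for m in marks:
        scan.append(total)
        total += m
    out = [None] * total
    for i, v in enumerate(values):
        if marks[i]:
            out[scan[i]] = v  # scan[i] is this thread's write slot
    return out

print(compact([5, -2, 8, -1, 3], lambda v: v > 0))  # [5, 8, 3]
```

Note that the output preserves the input order: the scan assigns slots monotonically, which is what makes radix sort stable.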
Histogram
A histogram counts how many elements fall into each category. Given values in some range, how many 0s are there? How many 1s? How many 2s?
The challenge is that multiple threads might try to increment the same counter simultaneously. Without coordination, increments are lost. If two threads both read count 5, increment to 6, and write 6, the count should be 7 but ends up as 6.
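The interleaving can be written out explicitly in Python, with each "thread" reduced to a pair of statements:

```python
# The lost-update hazard, made explicit: two "threads" interleave
# their read-modify-write on the same counter.
count = 5
r_a = count      # thread A reads 5
r_b = count      # thread B reads 5, before A has written back
count = r_a + 1  # thread A writes 6
count = r_b + 1  # thread B also writes 6 -- A's increment is lost
print(count)     # 6, though two increments should give 7
```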
Interactive: Parallel Histogram
Sequential processing updates one bin at a time. Parallel processing updates all bins simultaneously, but requires atomic operations when multiple threads target the same bin (shown in red).
The solution involves atomics, which we cover in depth in the next chapter. For now, understand that atomicAdd performs a read-modify-write cycle that cannot be interrupted by other threads.
@group(0) @binding(0) var<storage, read> data: array<u32>;
@group(0) @binding(1) var<storage, read_write> histogram: array<atomic<u32>, 256>;

@compute @workgroup_size(256)
fn build_histogram(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let idx = global_id.x;
    if (idx < arrayLength(&data)) {
        let value = data[idx];
        let bin = value % 256u; // Simple binning for 0-255 values
        atomicAdd(&histogram[bin], 1u);
    }
}

This direct approach works but may be slow when many threads update the same bin. A common optimization is to build local histograms in shared memory within each workgroup, then merge them into the global histogram. This trades atomic operations on global memory for faster shared memory operations, with a final atomic merge.
@group(0) @binding(0) var<storage, read> data: array<u32>;
@group(0) @binding(1) var<storage, read_write> global_histogram: array<atomic<u32>, 256>;

var<workgroup> local_histogram: array<atomic<u32>, 256>;

@compute @workgroup_size(256)
fn build_histogram_optimized(@builtin(local_invocation_id) local_id: vec3<u32>,
                             @builtin(global_invocation_id) global_id: vec3<u32>) {
    // Initialize local histogram (atomic values require atomicStore)
    atomicStore(&local_histogram[local_id.x], 0u);
    workgroupBarrier();

    // Each thread processes one element, updating the local histogram
    let idx = global_id.x;
    if (idx < arrayLength(&data)) {
        let value = data[idx];
        let bin = value % 256u;
        atomicAdd(&local_histogram[bin], 1u);
    }
    workgroupBarrier();

    // Merge into global histogram
    atomicAdd(&global_histogram[local_id.x], atomicLoad(&local_histogram[local_id.x]));
}

The Pattern Language
These three patterns—reduction, scan, and histogram—form a pattern language for parallel algorithms.
Parallel Pattern Building Blocks
When you face a new parallel problem, ask:
- Am I combining many values into one? → Reduction
- Does each element need information about all preceding elements? → Prefix Sum
- Am I counting or categorizing? → Histogram
Complex algorithms compose these primitives. Radix sort uses prefix sum for redistribution and histogram to count digits. Particle simulation uses reduction to compute forces and scan for memory allocation. Graph algorithms use scan for frontier expansion and reduction for convergence checks.
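As one concrete composition, here is a single digit pass of a radix sort sketched in Python (the helper name radix_pass is ours; each loop stands in for a parallel kernel):

```python
# One digit pass of a radix sort, composed from two of the primitives:
# a histogram counts each digit, an exclusive scan of the counts gives
# each bucket its start offset, and elements then scatter into place.
def radix_pass(values, shift=0, bits=4):
    nbins = 1 << bits
    mask = nbins - 1
    digit = lambda v: (v >> shift) & mask
    # Histogram of the current digit
    counts = [0] * nbins
    for v in values:
        counts[digit(v)] += 1
    # Exclusive prefix sum -> bucket start offsets
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    # Scatter (stable within each bucket)
    out = [None] * len(values)
    for v in values:
        d = digit(v)
        out[offsets[d]] = v
        offsets[d] += 1
    return out

print(radix_pass([170, 45, 75, 90, 2, 24, 802, 66]))
# [2, 802, 66, 24, 170, 90, 75, 45]
```

Repeating the pass for each digit position, from least to most significant, yields a full sort; on a GPU the scatter step relies on the scan-assigned offsets so that no two threads write the same slot.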
Learn the patterns deeply. Understand their memory access patterns, their work-depth tradeoffs, their bank conflict behaviors. These details separate functional code from fast code.
Key Takeaways
- Parallel reduction combines n values in O(log n) depth using a tree structure
- Prefix sum (scan) computes all running totals, enabling parallel compaction and index assignment
- Histogram requires atomic operations to safely count concurrent increments
- Bank conflicts in shared memory can serialize parallel accesses—address patterns matter
- These patterns compose to build complex algorithms like sorting, searching, and simulation
- The double-barrier pattern (read-barrier-write-barrier) prevents race conditions when neighbors exchange data