Memory Hierarchy
Registers, shared memory, and global memory
The Memory Wall
GPUs contain thousands of cores capable of performing trillions of arithmetic operations per second. Yet most GPU programs spend the majority of their time waiting—not computing, but fetching data.
This is the memory wall: the growing gap between how fast we can compute and how fast we can move data. A modern GPU can execute hundreds of floating-point operations in the time it takes to load a single value from global memory. If your threads constantly read from slow memory, all that compute power sits idle.
The GPU architects' solution is a memory hierarchy—layers of memory with different sizes, speeds, and scopes. Understanding this hierarchy is the difference between writing code that crawls and code that flies.
Interactive: Memory Latency Comparison
Scale matters: A global memory read takes 400× longer than a register access. During that wait, the GPU could perform hundreds of arithmetic operations.
The numbers are stark. Reading from a register is essentially free—one cycle. Reading from shared memory costs around 10-20 cycles. Reading from global memory? Hundreds of cycles, sometimes over 400. Every algorithm design decision should account for these ratios.
Registers: The Fastest Memory
At the top of the hierarchy sit registers—the fastest memory available to a GPU thread. Each thread has its own private set of registers, invisible to all other threads. When you declare a local variable in your shader, it lives in a register.
@compute @workgroup_size(64)
fn main(@builtin(local_invocation_id) local_id: vec3<u32>) {
// These live in registers - extremely fast
var sum: f32 = 0.0;
var temp: f32;
// Register access costs ~1 cycle
temp = f32(local_id.x) * 2.0;
sum = sum + temp;
}
Registers have two critical properties. First, they are private. No other thread can see or modify your registers. This makes them ideal for intermediate calculations that do not need to be shared. Second, they are limited. Each thread gets only so many registers, and using too many forces the compiler to "spill" values to slower memory.
The challenge is that you cannot directly control register allocation. The compiler decides which variables go into registers based on usage patterns. What you can control is keeping your computations local—the more intermediate values you compute and consume without storing them elsewhere, the more the compiler can keep them in registers.
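A minimal sketch of this trade-off (the exact spill threshold varies by GPU, driver, and compiler, so treat the comments as tendencies rather than guarantees):

```wgsl
@compute @workgroup_size(64)
fn main(@builtin(local_invocation_id) local_id: vec3<u32>) {
    // Risky: a large per-thread array may exceed the register
    // budget, forcing the compiler to spill it to slower memory.
    var history: array<f32, 64>;

    // Register-friendly: a running accumulator that is computed
    // and consumed immediately, never stored elsewhere.
    var acc: f32 = 0.0;
    for (var i = 0u; i < 64u; i = i + 1u) {
        acc = acc + f32(local_id.x + i) * 0.5;
    }
    history[0] = acc;
}
```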
Shared Memory: Workgroup Cooperation
Threads within a workgroup often need to exchange data. Passing through global memory would be painfully slow. Instead, GPUs provide shared memory (called workgroup memory in WebGPU/WGSL)—a small, fast memory space visible to all threads in the same workgroup.
var<workgroup> shared_data: array<f32, 256>;
@compute @workgroup_size(256)
fn main(@builtin(local_invocation_id) local_id: vec3<u32>) {
let idx = local_id.x;
// Write to shared memory
shared_data[idx] = compute_something(idx);
// Synchronize - ensure all threads have written
workgroupBarrier();
// Now safely read neighbors' values
let left = shared_data[(idx + 255) % 256];
let right = shared_data[(idx + 1) % 256];
}
Shared memory access costs roughly 10-20 cycles—an order of magnitude slower than registers, but an order of magnitude faster than global memory. This makes it invaluable for algorithms that need threads to cooperate.
The classic pattern is tile loading: instead of each thread reading from global memory independently, the workgroup collaborates to load a tile of data into shared memory, synchronizes, then processes that tile. Each global memory location is read once, not dozens of times.
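A sketch of the tile-loading pattern for a 1D three-point stencil (buffer bindings are assumed, and the halo loads are simplified—the first and last workgroups would need bounds checks in real code):

```wgsl
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

const TILE: u32 = 256u;
var<workgroup> tile: array<f32, 258>; // TILE + 2 halo elements

@compute @workgroup_size(256)
fn main(@builtin(local_invocation_id) lid: vec3<u32>,
        @builtin(global_invocation_id) gid: vec3<u32>) {
    // Each thread loads one element; the edge threads also load the halo.
    tile[lid.x + 1u] = input[gid.x];
    if (lid.x == 0u) { tile[0u] = input[gid.x - 1u]; }        // simplified: assumes gid.x > 0
    if (lid.x == TILE - 1u) { tile[TILE + 1u] = input[gid.x + 1u]; }

    workgroupBarrier();

    // Three reads from fast shared memory instead of three global reads.
    output[gid.x] = tile[lid.x] + tile[lid.x + 1u] + tile[lid.x + 2u];
}
```

Without the tile, each thread would issue three global reads and most locations would be fetched three times; with it, each global location is fetched exactly once.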
Interactive: Shared Memory Tiling Pattern
With shared memory: Load data once from global to shared, then all threads read from fast shared memory. Total global accesses equal tile size.
Notice how the tile-based approach dramatically reduces global memory accesses. Without tiling, each output element might require multiple global memory reads. With tiling, the workgroup loads data once into shared memory, then all threads read from the fast local copy.
The synchronization primitive workgroupBarrier() is essential. It ensures all threads have finished their writes before any thread starts reading. Without this barrier, threads might read stale or uninitialized values—a race condition that produces unpredictable results.
Global Memory: Large but Slow
At the base of the hierarchy sits global memory—the large VRAM that holds your buffers and textures. Every thread can access global memory, but each access costs hundreds of cycles.
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
let idx = global_id.x;
// Each of these reads from global memory - expensive!
let a = input[idx];
let b = input[idx + 1];
let c = input[idx + 2];
output[idx] = a + b + c;
}
Global memory is where your data lives. Every buffer you create, every texture you upload—all of it sits in global memory. The art of GPU programming is minimizing how often your threads need to reach down into this slow layer.
Think of global memory as a distant warehouse. You can store enormous amounts there, but every trip takes time. Smart programmers batch their "trips"—loading chunks of data into shared memory or registers, processing them intensively, then writing results back in bulk.
Memory Coalescing
Not all global memory accesses are equally expensive. When adjacent threads access adjacent memory locations, the hardware can combine those requests into a single coalesced transaction. When threads access scattered locations, each request becomes a separate transaction—dramatically slower.
Interactive: Coalesced vs Scattered Access
Adjacent threads access adjacent memory locations. The GPU combines these into a single efficient transaction.
In the coalesced case, threads 0, 1, 2, 3... read from addresses 0, 1, 2, 3... The memory controller sees this as one contiguous request and services it efficiently—perhaps a single 128-byte fetch that satisfies all threads.
In the scattered case, each thread reads from a random or strided location. The memory controller must issue separate requests for each thread, and the total latency balloons.
The rule is straightforward: arrange your data so that consecutive threads access consecutive memory. If your algorithm naturally accesses memory in a scattered pattern, consider restructuring your data layout or using shared memory to gather the scattered values first.
// BAD: Strided access - each thread jumps by STRIDE
// Threads 0,1,2,3 read indices 0, STRIDE, 2*STRIDE, 3*STRIDE
let value = data[local_id.x * STRIDE];
// GOOD: Coalesced access - consecutive threads read consecutive indices
// Threads 0,1,2,3 read indices 0, 1, 2, 3
let value = data[local_id.x];
The Texture Cache
Textures get special treatment. Unlike raw buffer reads, texture sampling goes through a texture cache optimized for 2D spatial locality. When you sample a pixel, the hardware assumes you will likely sample nearby pixels too—a pattern common in graphics (neighboring fragments sample adjacent texels) and image processing.
This cache behavior means textures can be faster than buffers for workloads with spatial locality, even though both technically live in global memory. If your access pattern is roughly 2D-local (such as image filtering or terrain sampling), textures might outperform equivalent buffer implementations.
However, the texture cache is not free. It consumes limited cache space, and its benefit depends entirely on your access pattern exhibiting spatial locality. For linear, non-spatial data, raw storage buffers are typically the better choice.
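As a sketch, a compute shader can read image data through the texture path with textureLoad (the binding layout and a single-channel average are illustrative assumptions):

```wgsl
@group(0) @binding(0) var src: texture_2d<f32>;
@group(0) @binding(1) var<storage, read_write> result: array<f32>;

@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let p = vec2<i32>(gid.xy);
    // Neighboring texel reads are 2D-local, so after the first fetch
    // the texture cache is likely to serve the others cheaply.
    let center = textureLoad(src, p, 0);
    let right  = textureLoad(src, p + vec2<i32>(1, 0), 0);
    let down   = textureLoad(src, p + vec2<i32>(0, 1), 0);
    let dims = textureDimensions(src);
    result[gid.y * dims.x + gid.x] = (center.r + right.r + down.r) / 3.0;
}
```

The 8×8 workgroup shape matters here: it keeps each workgroup's reads clustered in a 2D neighborhood, which is exactly the locality the texture cache is built for.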
The Full Picture
Interactive: Memory Hierarchy Overview
Registers: 1 cycle
Shared Memory: ~15 cycles
L2 Cache: ~50 cycles
Global Memory (VRAM): ~400 cycles
Data flows from global memory up through the cache hierarchy to reach registers, where computation happens fastest.
The memory hierarchy forms a pyramid. At the peak, registers: tiny, private, and blazingly fast. Below them, shared memory: bigger, shared within a workgroup, still fast. At the base, global memory: enormous capacity, accessible everywhere, but slow.
Effective GPU programming means keeping hot data at the top of the pyramid. Load from global memory once, process intensively in registers and shared memory, write back once. Every algorithm should be designed with this hierarchy in mind.
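The load-once / process-locally / write-once pattern in miniature: a workgroup sum reduction (a sketch that assumes the input length is a multiple of 256; partial sums would be combined in a second pass):

```wgsl
@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> partial_sums: array<f32>;

var<workgroup> scratch: array<f32, 256>;

@compute @workgroup_size(256)
fn main(@builtin(local_invocation_id) lid: vec3<u32>,
        @builtin(global_invocation_id) gid: vec3<u32>,
        @builtin(workgroup_id) wid: vec3<u32>) {
    // Load once: one global read per thread.
    scratch[lid.x] = input[gid.x];
    workgroupBarrier();

    // Process intensively: tree reduction entirely in shared memory.
    for (var stride = 128u; stride > 0u; stride = stride / 2u) {
        if (lid.x < stride) {
            scratch[lid.x] = scratch[lid.x] + scratch[lid.x + stride];
        }
        workgroupBarrier();
    }

    // Write once: a single global write per workgroup.
    if (lid.x == 0u) {
        partial_sums[wid.x] = scratch[0u];
    }
}
```

Each thread touches global memory exactly twice in the worst case (one read, and for thread 0, one write); everything in between happens in registers and shared memory.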
Key Takeaways
- The memory wall is the fundamental bottleneck—computation is fast, memory access is slow
- Registers are the fastest (1 cycle), private to each thread, but limited in quantity
- Shared memory (10-20 cycles) enables fast data sharing within a workgroup
- Global memory (200-400+ cycles) is large but slow—minimize accesses
- Memory coalescing dramatically improves global memory throughput when adjacent threads access adjacent addresses
- Texture caches exploit 2D spatial locality for image-like access patterns
- The pyramid rule: load once from global → process in shared/registers → write once to global