Workgroup Organization
Choosing workgroup sizes
The Two-Level Hierarchy
Compute shaders organize threads at two levels. At the top, you dispatch workgroups—independent units of work that can run on any compute unit. Inside each workgroup, threads cooperate and can share data.
This hierarchy exists because GPU hardware works in groups. Threads within a workgroup execute on the same compute unit, share fast local memory, and can synchronize with each other. Threads in different workgroups cannot directly communicate—they run independently, possibly on different parts of the chip.
Understanding this hierarchy is essential for writing efficient compute shaders. The workgroup size affects memory access patterns, occupancy, and the ability to use shared memory.
Workgroup Size Declaration
You declare workgroup size using the @workgroup_size attribute. The size can be 1D, 2D, or 3D.
// 1D workgroup: 256 threads in a line
@compute @workgroup_size(256)
fn process_array() { ... }
// 2D workgroup: 16×16 = 256 threads in a grid
@compute @workgroup_size(16, 16)
fn process_image() { ... }
// 3D workgroup: 8×8×4 = 256 threads in a volume
@compute @workgroup_size(8, 8, 4)
fn process_volume() { ... }

All three examples total 256 threads, organized differently. The choice depends on your data layout. For linear arrays, 1D makes sense. For images, 2D matches the pixel grid. For voxel data, 3D aligns with the volume structure.
The workgroup dimensions need not be powers of two, but powers of two often yield better hardware utilization. Common sizes include 64, 128, and 256 threads per workgroup.
Dispatch Dimensions
When dispatching, you specify how many workgroups to launch in each dimension:
passEncoder.dispatchWorkgroups(numX, numY, numZ);

For processing a 1920×1080 image with 16×16 workgroups:
// Each workgroup processes 16×16 pixels
// We need ceil(1920/16) × ceil(1080/16) workgroups
passEncoder.dispatchWorkgroups(
  Math.ceil(1920 / 16), // 120 workgroups in X
  Math.ceil(1080 / 16)  // 68 workgroups in Y
);

This launches 120 × 68 = 8,160 workgroups, each containing 256 threads, for a total of about 2 million threads. The GPU schedules these workgroups across all available compute units.
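Because the workgroup counts are rounded up, the last row and column of workgroups contain threads that fall outside the 1920×1080 image. A typical shader-side guard looks like this (a sketch, assuming an `input_image` texture binding):

```wgsl
@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) gid: vec3u) {
    // Skip threads past the image edge before touching memory.
    let dims = textureDimensions(input_image);
    if (gid.x >= dims.x || gid.y >= dims.y) {
        return;
    }
    // ... process the pixel at gid.xy ...
}
```

An early return is safe here because this shader has no barriers; when a shader does use workgroupBarrier(), guard the memory accesses instead of returning early, so every thread still reaches the barrier.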
Thread Indices
Inside a compute shader, several built-in values identify each thread:
@compute @workgroup_size(16, 16)
fn main(
    @builtin(global_invocation_id) global_id: vec3u,
    @builtin(local_invocation_id) local_id: vec3u,
    @builtin(workgroup_id) wg_id: vec3u,
    @builtin(num_workgroups) num_wgs: vec3u
) {
    // global_id: unique across the entire dispatch
    // local_id: position within this workgroup (0 to workgroup_size - 1)
    // wg_id: which workgroup this thread belongs to
    // num_wgs: total workgroups dispatched
}

The relationship between them: each component of global_id equals wg_id × workgroup_size + local_id.
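That per-component arithmetic can be sketched in host-side JavaScript (illustrative only; on the GPU these values are provided as built-ins):

```javascript
// Compute a thread's global_invocation_id from its workgroup id,
// the declared workgroup size, and its local_invocation_id.
// Per component: global = wg * size + local.
function globalInvocationId(wgId, wgSize, localId) {
  return wgId.map((wg, i) => wg * wgSize[i] + localId[i]);
}

// Thread at local (5, 3, 0) inside workgroup (2, 1, 0),
// with a 16×16×1 workgroup size:
const gid = globalInvocationId([2, 1, 0], [16, 16, 1], [5, 3, 0]);
console.log(gid); // [37, 19, 0]
```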
For image processing, global_id.xy typically maps directly to pixel coordinates. For array processing, global_id.x maps to the array index. The local_id becomes important when using shared memory, which we will cover in the next chapter.
Occupancy: Filling the GPU
GPUs have many compute units, each capable of running multiple workgroups concurrently. Occupancy measures how well you are utilizing this capacity.
Low occupancy means compute units sit idle while others work. High occupancy means the GPU stays busy, hiding memory latency by switching between threads.
Several factors affect occupancy. Workgroup size matters—if your workgroups are too small, you might not have enough threads to fill a compute unit. If they are too large, fewer workgroups fit. Register usage also matters—each thread needs registers, and using too many limits how many threads can run simultaneously.
Occupancy depends on workgroup size and resource usage. Higher is generally better for hiding memory latency.
The relationship between workgroup size and occupancy is not linear. A size of 64 might give good occupancy. Doubling to 128 might improve it. But 256 could reduce occupancy if threads use too many registers.
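A toy model makes the non-linearity concrete. The hardware limits below (2,048 resident threads and a 65,536-register file per compute unit) are assumptions for illustration; real limits vary by architecture:

```javascript
// Illustrative occupancy model. The limits are assumed, not real
// values for any particular GPU.
const MAX_THREADS_PER_UNIT = 2048;  // resident threads per compute unit
const REGISTERS_PER_UNIT = 65536;   // 32-bit registers per compute unit

function occupancy(workgroupSize, registersPerThread) {
  // How many whole workgroups fit under each limit?
  const byThreads = Math.floor(MAX_THREADS_PER_UNIT / workgroupSize);
  const byRegisters = Math.floor(
    REGISTERS_PER_UNIT / (workgroupSize * registersPerThread));
  const resident = Math.min(byThreads, byRegisters);
  // Occupancy = resident threads / maximum resident threads.
  return (resident * workgroupSize) / MAX_THREADS_PER_UNIT;
}

console.log(occupancy(256, 32)); // 1    – register use fits exactly
console.log(occupancy(256, 64)); // 0.5  – registers halve resident threads
```

In this model, doubling registers per thread from 32 to 64 halves occupancy at a workgroup size of 256, matching the behavior described above.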
Guidelines for Choosing Sizes
Start with 64 or 256 threads per workgroup. These sizes work well on most hardware and give reasonable occupancy.
Match the workgroup shape to your data. For 2D data, use 2D workgroups—typically 8×8, 16×16, or 8×32. For 1D data, use 1D workgroups. Mismatched shapes lead to awkward index calculations and poor memory access patterns.
Use multiples of 32. Modern GPUs execute threads in groups called warps (NVIDIA) or wavefronts (AMD), typically 32 or 64 threads. A workgroup size that is not a multiple of this wastes execution slots.
Keep resource usage in mind. More registers per thread means fewer concurrent threads. Complex shaders with many variables might need smaller workgroups to maintain occupancy.
Test on target hardware. Optimal sizes vary between GPU architectures. What works best on one GPU might not be optimal on another. Profile on your target platform.
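Whatever size you pick must also respect the device's limits. The WebGPU defaults are 256 invocations per workgroup, with per-axis maxima of 256 (X, Y) and 64 (Z); actual devices may report higher values through device.limits. A small validity check, as a sketch:

```javascript
// Check a proposed workgroup size against WebGPU limits.
// The defaults below are the spec's baseline values; pass the
// device.limits object for the real limits of a given adapter.
function checkWorkgroupSize([x, y = 1, z = 1], limits = {
  maxComputeInvocationsPerWorkgroup: 256,
  maxComputeWorkgroupSizeX: 256,
  maxComputeWorkgroupSizeY: 256,
  maxComputeWorkgroupSizeZ: 64,
}) {
  return x * y * z <= limits.maxComputeInvocationsPerWorkgroup &&
         x <= limits.maxComputeWorkgroupSizeX &&
         y <= limits.maxComputeWorkgroupSizeY &&
         z <= limits.maxComputeWorkgroupSizeZ;
}

console.log(checkWorkgroupSize([16, 16])); // true  (256 invocations)
console.log(checkWorkgroupSize([32, 32])); // false (1024 > 256 default)
```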
Practical Example: Histogram
Computing a histogram demonstrates workgroup organization. Each workgroup processes a region of the image and counts pixels in local bins, then merges results.
// Assumed bindings: the image to analyze and a 256-bin global histogram.
@group(0) @binding(0) var input_image: texture_2d<f32>;
@group(0) @binding(1) var<storage, read_write> global_bins: array<atomic<u32>, 256>;

var<workgroup> local_bins: array<atomic<u32>, 256>;

@compute @workgroup_size(16, 16)
fn histogram(
    @builtin(global_invocation_id) global_id: vec3u,
    @builtin(local_invocation_id) local_id: vec3u
) {
    let flat_local = local_id.y * 16u + local_id.x;

    // Initialize local bins (one thread per bin)
    if (flat_local < 256u) {
        atomicStore(&local_bins[flat_local], 0u);
    }
    workgroupBarrier();

    // Each in-bounds thread counts its pixel. Guard rather than return
    // early: every thread must still reach the barriers below.
    let dims = textureDimensions(input_image);
    if (global_id.x < dims.x && global_id.y < dims.y) {
        let pixel = textureLoad(input_image, global_id.xy, 0);
        let luminance = u32(dot(pixel.rgb, vec3f(0.299, 0.587, 0.114)) * 255.0);
        atomicAdd(&local_bins[luminance], 1u);
    }
    workgroupBarrier();

    // Merge to global histogram
    if (flat_local < 256u) {
        atomicAdd(&global_bins[flat_local], atomicLoad(&local_bins[flat_local]));
    }
}

The 16×16 workgroup (256 threads) matches the 256 histogram bins. This is intentional: it simplifies initialization and merging, because each thread handles exactly one bin during the setup and merge phases.
Multi-Dimensional Dispatch
For volume processing, you might use 3D dispatch:
@compute @workgroup_size(4, 4, 4)
fn process_volume(@builtin(global_invocation_id) id: vec3u) {
    let value = volume[id.z][id.y][id.x];
    // Process voxel...
}

With a 256³ volume, dispatch (64, 64, 64) workgroups. Each workgroup processes a 4×4×4 = 64-voxel region.
The 3D structure helps with spatial locality. Threads in the same workgroup access nearby voxels, which improves cache behavior compared to flattening the volume into a 1D array.
Indirect Dispatch
Sometimes you do not know the dispatch size at command recording time. Indirect dispatch reads workgroup counts from a GPU buffer:
passEncoder.dispatchWorkgroupsIndirect(indirectBuffer, offset);

The buffer contains three 32-bit unsigned integers: the workgroup counts for X, Y, and Z. This enables GPU-driven workflows, where a previous compute pass determines how much work the next pass should do.
A particle system might use this pattern: one pass counts active particles and writes the dispatch size, then the main pass runs with exactly the right number of workgroups.
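The indirect buffer's layout is simple enough to sketch on the CPU. In the GPU-driven pattern just described, a compute shader would write these values instead; the particle count and workgroup size here are assumed for illustration:

```javascript
// Three u32 values: workgroup counts in X, Y, Z.
const WORKGROUP_SIZE = 64;       // assumed threads per workgroup
const activeParticles = 10_000;  // assumed: written by the counting pass

const dispatchArgs = new Uint32Array([
  Math.ceil(activeParticles / WORKGROUP_SIZE), // X: 157 workgroups
  1,                                           // Y
  1,                                           // Z
]);

// The buffer must be created with GPUBufferUsage.INDIRECT, then:
// device.queue.writeBuffer(indirectBuffer, 0, dispatchArgs);
// passEncoder.dispatchWorkgroupsIndirect(indirectBuffer, 0);
```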
Key Takeaways
- Workgroup size is declared with @workgroup_size(x, y, z) and affects thread organization
- Dispatch dimensions specify how many workgroups to launch in each direction
- global_invocation_id uniquely identifies a thread across the entire dispatch; local_invocation_id identifies its position within the workgroup
- Occupancy measures GPU utilization; aim for workgroup sizes of 64-256 threads
- Match workgroup shape to your data: 2D workgroups for images, 3D for volumes
- Use multiples of 32 or 64 to align with hardware execution units