Workgroup Organization
Choosing workgroup sizes
The Two-Level Hierarchy
Compute shaders organize threads at two levels. At the top, you dispatch workgroups—independent units of work that can run on any compute unit. Inside each workgroup, threads cooperate and can share data.
This hierarchy exists because GPU hardware works in groups. Threads within a workgroup execute on the same compute unit, share fast local memory, and can synchronize with each other. Threads in different workgroups cannot directly communicate—they run independently, possibly on different parts of the chip.
Understanding this hierarchy is essential for writing efficient compute shaders. The workgroup size affects memory access patterns, occupancy, and the ability to use shared memory.
Workgroup Size Declaration
You declare workgroup size using the @workgroup_size attribute. The size can be 1D, 2D, or 3D.
// 1D workgroup: 256 threads in a line
@compute @workgroup_size(256)
fn process_array() { ... }
// 2D workgroup: 16×16 = 256 threads in a grid
@compute @workgroup_size(16, 16)
fn process_image() { ... }
// 3D workgroup: 8×8×4 = 256 threads in a volume
@compute @workgroup_size(8, 8, 4)
fn process_volume() { ... }

All three examples total 256 threads, organized differently. The choice depends on your data layout. For linear arrays, 1D makes sense. For images, 2D matches the pixel grid. For voxel data, 3D aligns with the volume structure.
The workgroup dimensions need not be powers of two, but powers of two often yield better hardware utilization. Common sizes include 64, 128, and 256 threads per workgroup.
Dispatch Dimensions
When dispatching, you specify how many workgroups to launch in each dimension:
passEncoder.dispatchWorkgroups(numX, numY, numZ);

For processing a 1920×1080 image with 16×16 workgroups:
// Each workgroup processes 16×16 pixels
// We need ceil(1920/16) × ceil(1080/16) workgroups
passEncoder.dispatchWorkgroups(
  Math.ceil(1920 / 16), // 120 workgroups in X
  Math.ceil(1080 / 16)  // 68 workgroups in Y
);

This launches 120 × 68 = 8,160 workgroups, each containing 256 threads, for a total of about 2 million threads. The GPU schedules these workgroups across all available compute units.
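Because the workgroup counts are rounded up, the last row and column of workgroups contain threads that fall outside the 1920×1080 image. A typical shader-side guard looks like this (a sketch, assuming an `input_image` texture binding):

```wgsl
@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) gid: vec3u) {
    // Skip threads past the image edge before touching memory.
    let dims = textureDimensions(input_image);
    if (gid.x >= dims.x || gid.y >= dims.y) {
        return;
    }
    // ... process the pixel at gid.xy ...
}
```

An early return is safe here because this shader has no barriers; when a shader does use workgroupBarrier(), guard the memory accesses instead of returning early, so every thread still reaches the barrier.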
Thread Indices
Inside a compute shader, several built-in values identify each thread:
@compute @workgroup_size(16, 16)
fn main(
    @builtin(global_invocation_id) global_id: vec3u,
    @builtin(local_invocation_id) local_id: vec3u,
    @builtin(workgroup_id) wg_id: vec3u,
    @builtin(num_workgroups) num_wgs: vec3u
) {
    // global_id: unique across the entire dispatch
    // local_id: position within this workgroup (0 to workgroup_size - 1)
    // wg_id: which workgroup this thread belongs to
    // num_wgs: total workgroups dispatched
}

The relationship between them: each component of global_id equals wg_id × workgroup_size + local_id.
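That per-component arithmetic can be sketched in host-side JavaScript (illustrative only; on the GPU these values are provided as built-ins):

```javascript
// Compute a thread's global_invocation_id from its workgroup id,
// the declared workgroup size, and its local_invocation_id.
// Per component: global = wg * size + local.
function globalInvocationId(wgId, wgSize, localId) {
  return wgId.map((wg, i) => wg * wgSize[i] + localId[i]);
}

// Thread at local (5, 3, 0) inside workgroup (2, 1, 0),
// with a 16×16×1 workgroup size:
const gid = globalInvocationId([2, 1, 0], [16, 16, 1], [5, 3, 0]);
console.log(gid); // [37, 19, 0]
```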
For image processing, global_id.xy typically maps directly to pixel coordinates. For array processing, global_id.x maps to the array index. The local_id becomes important when using shared memory, which we will cover in the next chapter.
Occupancy: Filling the GPU
GPUs have many compute units, each capable of running multiple workgroups concurrently. Occupancy measures how well you are utilizing this capacity.
Low occupancy means compute units sit idle while others work. High occupancy means the GPU stays busy, hiding memory latency by switching between threads.
Several factors affect occupancy. Workgroup size matters—if your workgroups are too small, you might not have enough threads to fill a compute unit. If they are too large, fewer workgroups fit. Register usage also matters—each thread needs registers, and using too many limits how many threads can run simultaneously.
Occupancy depends on workgroup size and resource usage. Higher is generally better for hiding memory latency.
The relationship between workgroup size and occupancy is not linear. A size of 64 might give good occupancy. Doubling to 128 might improve it. But 256 could reduce occupancy if threads use too many registers.
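A toy model makes the non-linearity concrete. The hardware limits below (2,048 resident threads and a 65,536-register file per compute unit) are assumptions for illustration; real limits vary by architecture:

```javascript
// Illustrative occupancy model. The limits are assumed, not real
// values for any particular GPU.
const MAX_THREADS_PER_UNIT = 2048;  // resident threads per compute unit
const REGISTERS_PER_UNIT = 65536;   // 32-bit registers per compute unit

function occupancy(workgroupSize, registersPerThread) {
  // How many whole workgroups fit under each limit?
  const byThreads = Math.floor(MAX_THREADS_PER_UNIT / workgroupSize);
  const byRegisters = Math.floor(
    REGISTERS_PER_UNIT / (workgroupSize * registersPerThread));
  const resident = Math.min(byThreads, byRegisters);
  // Occupancy = resident threads / maximum resident threads.
  return (resident * workgroupSize) / MAX_THREADS_PER_UNIT;
}

console.log(occupancy(256, 32)); // 1    – register use fits exactly
console.log(occupancy(256, 64)); // 0.5  – registers halve resident threads
```

In this model, doubling registers per thread from 32 to 64 halves occupancy at a workgroup size of 256, matching the behavior described above.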
Guidelines for Choosing Sizes
Start with 64 or 256 threads per workgroup. These sizes work well on most hardware and give reasonable occupancy.
Match the workgroup shape to your data. For 2D data, use 2D workgroups—typically 8×8, 16×16, or 8×32. For 1D data, use 1D workgroups. Mismatched shapes lead to awkward index calculations and poor memory access patterns.
Use multiples of 32. Modern GPUs execute threads in groups called warps (NVIDIA) or wavefronts (AMD), typically 32 or 64 threads. A workgroup size that is not a multiple of this wastes execution slots.
Keep resource usage in mind. More registers per thread means fewer concurrent threads. Complex shaders with many variables might need smaller workgroups to maintain occupancy.
Test on target hardware. Optimal sizes vary between GPU architectures. What works best on one GPU might not be optimal on another. Profile on your target platform.
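Whatever size you pick must also respect the device's limits. The WebGPU defaults are 256 invocations per workgroup, with per-axis maxima of 256 (X, Y) and 64 (Z); actual devices may report higher values through device.limits. A small validity check, as a sketch:

```javascript
// Check a proposed workgroup size against WebGPU limits.
// The defaults below are the spec's baseline values; pass the
// device.limits object for the real limits of a given adapter.
function checkWorkgroupSize([x, y = 1, z = 1], limits = {
  maxComputeInvocationsPerWorkgroup: 256,
  maxComputeWorkgroupSizeX: 256,
  maxComputeWorkgroupSizeY: 256,
  maxComputeWorkgroupSizeZ: 64,
}) {
  return x * y * z <= limits.maxComputeInvocationsPerWorkgroup &&
         x <= limits.maxComputeWorkgroupSizeX &&
         y <= limits.maxComputeWorkgroupSizeY &&
         z <= limits.maxComputeWorkgroupSizeZ;
}

console.log(checkWorkgroupSize([16, 16])); // true  (256 invocations)
console.log(checkWorkgroupSize([32, 32])); // false (1024 > 256 default)
```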
Practical Example: Histogram
Computing a histogram demonstrates workgroup organization. Each workgroup processes a region of the image and counts pixels in local bins, then merges results.
// Assumed bindings: the image to analyze and a 256-bin global histogram.
@group(0) @binding(0) var input_image: texture_2d<f32>;
@group(0) @binding(1) var<storage, read_write> global_bins: array<atomic<u32>, 256>;

var<workgroup> local_bins: array<atomic<u32>, 256>;

@compute @workgroup_size(16, 16)
fn histogram(
    @builtin(global_invocation_id) global_id: vec3u,
    @builtin(local_invocation_id) local_id: vec3u
) {
    let flat_local = local_id.y * 16u + local_id.x;

    // Initialize local bins (one thread per bin)
    if (flat_local < 256u) {
        atomicStore(&local_bins[flat_local], 0u);
    }
    workgroupBarrier();

    // Each in-bounds thread counts its pixel. Guard rather than return
    // early: every thread must still reach the barriers below.
    let dims = textureDimensions(input_image);
    if (global_id.x < dims.x && global_id.y < dims.y) {
        let pixel = textureLoad(input_image, global_id.xy, 0);
        let luminance = u32(dot(pixel.rgb, vec3f(0.299, 0.587, 0.114)) * 255.0);
        atomicAdd(&local_bins[luminance], 1u);
    }
    workgroupBarrier();

    // Merge to global histogram
    if (flat_local < 256u) {
        atomicAdd(&global_bins[flat_local], atomicLoad(&local_bins[flat_local]));
    }
}

The 16×16 workgroup (256 threads) matches the 256 histogram bins. This is intentional: it simplifies initialization and merging, because each thread handles exactly one bin during the setup and merge phases.
Multi-Dimensional Dispatch
For volume processing, you might use 3D dispatch:
@compute @workgroup_size(4, 4, 4)
fn process_volume(@builtin(global_invocation_id) id: vec3u) {
    let value = volume[id.z][id.y][id.x];
    // Process voxel...
}

With a 256³ volume, dispatch (64, 64, 64) workgroups. Each workgroup processes a 4×4×4 = 64-voxel region.
The 3D structure helps with spatial locality. Threads in the same workgroup access nearby voxels, which improves cache behavior compared to flattening the volume into a 1D array.
Indirect Dispatch
Sometimes you do not know the dispatch size at command recording time. Indirect dispatch reads workgroup counts from a GPU buffer:
passEncoder.dispatchWorkgroupsIndirect(indirectBuffer, offset);

The buffer contains three 32-bit unsigned integers: the workgroup counts for X, Y, and Z. This enables GPU-driven workflows, where a previous compute pass determines how much work the next pass should do.
A particle system might use this pattern: one pass counts active particles and writes the dispatch size, then the main pass runs with exactly the right number of workgroups.
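The indirect buffer's layout is simple enough to sketch on the CPU. In the GPU-driven pattern just described, a compute shader would write these values instead; the particle count and workgroup size here are assumed for illustration:

```javascript
// Three u32 values: workgroup counts in X, Y, Z.
const WORKGROUP_SIZE = 64;       // assumed threads per workgroup
const activeParticles = 10_000;  // assumed: written by the counting pass

const dispatchArgs = new Uint32Array([
  Math.ceil(activeParticles / WORKGROUP_SIZE), // X: 157 workgroups
  1,                                           // Y
  1,                                           // Z
]);

// The buffer must be created with GPUBufferUsage.INDIRECT, then:
// device.queue.writeBuffer(indirectBuffer, 0, dispatchArgs);
// passEncoder.dispatchWorkgroupsIndirect(indirectBuffer, 0);
```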
Key Takeaways
- Workgroup size is declared with @workgroup_size(x, y, z) and affects thread organization
- Dispatch dimensions specify how many workgroups to launch in each direction
- global_invocation_id uniquely identifies a thread across the entire dispatch; local_invocation_id identifies its position within the workgroup
- Occupancy measures GPU utilization; aim for workgroup sizes of 64-256 threads
- Match workgroup shape to your data: 2D workgroups for images, 3D for volumes
- Use multiples of 32 or 64 to align with hardware execution units