The GPU Execution Model
Threads, workgroups, and the SIMT paradigm
Threads Everywhere
A GPU is not a single powerful processor. It is thousands of tiny processors working in parallel. Each of these tiny processors runs what we call a thread—a single instance of your shader code operating on a single piece of data.
When you launch a compute shader, you're not running one copy of your code. You're running thousands of copies simultaneously, each operating on different data. If you want to process a 1920×1080 image, you might launch over two million threads—one for every pixel.
This is the fundamental shift in thinking required for GPU programming. On a CPU, you write code that processes items one at a time, perhaps with a loop. On a GPU, you write code for a single item, then launch it millions of times in parallel.
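The shift can be sketched in plain JavaScript (the function names here are illustrative, not a real API). The serial version owns the loop; the kernel version is the body you would write for the GPU, with the loop index supplied from outside:

```javascript
// CPU style: one loop walks every pixel in sequence.
function brightenSerial(pixels) {
  const out = [];
  for (let i = 0; i < pixels.length; i++) out.push(pixels[i] + 10);
  return out;
}

// GPU style: write the body for ONE pixel; the hardware supplies the index.
function brightenKernel(pixels, i) {
  return pixels[i] + 10; // each thread runs exactly this, with its own i
}

const image = [0, 50, 100];
console.log(brightenSerial(image));                         // [10, 60, 110]
console.log(image.map((_, i) => brightenKernel(image, i))); // [10, 60, 110]
```

The map call stands in for the GPU launch: the same kernel body, applied once per element, with no loop in the kernel itself.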
Each thread is deliberately simple. It has access to a small amount of private memory, knows its own identity (which thread am I?), and executes the same shader code as every other thread. The simplicity is intentional—keeping threads lightweight allows the GPU to run so many of them.
Workgroups
Threads don't operate in isolation. They're organized into workgroups—small teams of threads that can cooperate with each other.
A typical workgroup might contain 64, 128, or 256 threads. The exact number depends on your shader code and what you're trying to accomplish. Threads within a workgroup share a special kind of memory called workgroup shared memory, and they can synchronize with each other—waiting until all threads in the group reach the same point before continuing.
Interactive: Workgroups and Threads
The workgroup is the fundamental unit of scheduling on the GPU. The hardware schedules entire workgroups onto processor cores. All threads in a workgroup are guaranteed to run together, which is why they can safely share memory and synchronize.
Threads in different workgroups cannot share memory directly or synchronize with each other. If workgroup A needs to communicate with workgroup B, they must write to global memory and coordinate through other means. This constraint exists because workgroups may run at different times, on different parts of the GPU.
The Grid of Workgroups
When you dispatch a compute shader, you specify a 3D grid of workgroups. Each workgroup is identified by its position in this grid: (x, y, z). The dispatch call defines how many workgroups to launch in each dimension.
For a 1920×1080 image with workgroups of 16×16 threads, you would dispatch a grid of 120×68 workgroups: 1920÷16 = 120 exactly, while 1080÷16 = 67.5 rounds up to 68 so the bottom edge of the image is still covered. That's 8,160 workgroups, each containing 256 threads, for a total of about 2.1 million threads.
The 3D structure isn't just convenience—it maps naturally to many computational problems. 2D images use two dimensions. 3D volumes use all three. 1D arrays can use a single dimension. The grid dimensions exist to help you organize threads in a way that matches your data.
From the GPU's perspective, the grid is simply a collection of independent workgroups. It schedules them in whatever order is efficient, potentially running many workgroups simultaneously if the hardware has capacity. The grid represents the total amount of work; the GPU decides how to execute it.
SIMT: Single Instruction, Multiple Threads
Here's where GPU architecture becomes clever. The GPU doesn't have one processor per thread—that would require millions of processors. Instead, it uses SIMT (Single Instruction, Multiple Threads) execution.
In SIMT, many threads execute the same instruction at the same time, but each operates on different data. Imagine a line of workers all hammering simultaneously, but each hammering a different nail. One instruction ("hammer"), many results.
Animation: SIMT Execution
SIMT (Single Instruction, Multiple Threads) executes the same instruction across many threads simultaneously. Each thread operates on different data, but they all perform the same operation at the same time.
The threads executing in lockstep are called a warp (NVIDIA) or wavefront (AMD). A warp typically contains 32 threads (AMD wavefronts have traditionally been 64). When the instruction decoder fetches an instruction, it doesn't execute it once—it broadcasts it to 32 threads simultaneously.
This is why GPUs achieve such high throughput. A single instruction fetch and decode drives 32 parallel operations. The overhead of instruction processing is amortized across many threads. Where a CPU might execute one multiply instruction at a time, a GPU executes 32 multiplies for barely more cost.
But SIMT comes with constraints. Those 32 threads must execute the same instruction. They can have different data in their registers, but they must perform the same operation. What happens when your code has an if-statement, and different threads want to take different branches?
Divergence and Its Cost
Consider shader code with a conditional:
```wgsl
if (value >= threshold) {
    result = value * 2;
} else {
    result = value + 10;
}
```

In a warp of 32 threads, some threads might have values above the threshold, others below. They want to execute different branches.
The GPU handles this through predication. It executes both branches, but masks off threads that shouldn't participate. First, it runs the if-branch with only the qualifying threads active. Then it runs the else-branch with the other threads active. Threads sit idle while their branch isn't executing.
Interactive: Thread Divergence
When all threads follow the same path (for example, every thread computing result = value * 2), execution is fully parallel. No time is wasted.
This is called divergence, and it's costly. If half your threads take the if-branch and half take the else-branch, you've effectively doubled the execution time for that section of code. The worst case is when only one thread takes a different path—the entire warp must still execute both branches, wasting 31/32 of the potential parallelism.
Divergence is one of the main performance pitfalls in GPU programming. Code that looks innocent can devastate performance if it causes threads in the same warp to take different paths. The most efficient GPU code keeps threads in lockstep, ensuring they all execute the same instructions.
Not all divergence is equally bad. If entire warps take the same branch—just different warps take different branches—there's no penalty within each warp. The key is what happens among the 32 threads that must execute together.
Thread Identity
Each thread needs to know who it is. Without identity, every thread would compute the same result—pointless parallelism. Thread identity tells each thread which portion of the data it's responsible for.
In WGSL, three built-in values provide this information:
local_invocation_id identifies a thread within its workgroup. If your workgroup is 8×8 threads, this ranges from (0,0) to (7,7). This is useful when accessing shared memory within the workgroup.
workgroup_id identifies which workgroup this thread belongs to in the dispatch grid. If you dispatched a 10×10 grid of workgroups, this ranges from (0,0) to (9,9).
global_invocation_id uniquely identifies the thread across the entire dispatch. It combines workgroup position and local position: global_id = workgroup_id * workgroup_size + local_id. This is what you typically use to determine which data element to process.
Interactive: Thread Identity
Click any cell to see how its position maps to thread IDs. Each workgroup (outlined) contains a fixed number of threads. The global ID uniquely identifies every thread in the entire dispatch.
The relationship between these IDs is deterministic. Given the dispatch parameters and workgroup size, you can always calculate one from the others. But having all three available lets you choose whichever is most convenient for your algorithm.
When processing a 2D image, you might use global_invocation_id.xy directly as pixel coordinates. When implementing algorithms that require workgroup cooperation, you'll use local_invocation_id to coordinate within the group and workgroup_id to identify which chunk of the problem you're working on.
Key Takeaways
- GPUs run thousands of lightweight threads in parallel—each thread executes the same shader code on different data
- Threads are organized into workgroups that can share memory and synchronize with each other
- Workgroups form a 3D grid; the dispatch call defines how many workgroups to launch
- SIMT (Single Instruction, Multiple Threads) executes the same instruction across many threads simultaneously
- Divergence occurs when threads in a warp take different branches, forcing sequential execution of both paths
- Thread identity comes from local_invocation_id (within workgroup), workgroup_id (which workgroup), and global_invocation_id (unique across dispatch)