Why GPUs Exist
Understanding the fundamental problem that graphics processors were built to solve
Your computer has two processors that think in fundamentally different ways. The CPU excels at complex, sequential reasoning—following intricate chains of logic where each step depends on the last. The GPU takes a different approach: it handles thousands of simple tasks simultaneously, trading sophistication for raw parallelism. Understanding when to use each is the first step toward mastering graphics programming.
The Sequential Bottleneck
A CPU executes instructions one after another. Even with multiple cores, modern CPUs have perhaps 8 to 16 cores, and each core conceptually follows a strict sequence: fetch instruction, decode, execute, write result, repeat. This sequential model maps naturally to how we think about computation—first do A, then use A's result to do B.
But many problems don't have inherent dependencies. Consider computing the brightness of every pixel on a screen. The color at pixel (100, 200) has nothing to do with the color at pixel (500, 300). Each calculation is independent. If we have a million pixels and process them one at a time, we waste time waiting when we could be computing.
The CPU's sequential execution becomes a bottleneck not because the processor is slow, but because the problem doesn't require sequence. We're forcing inherently parallel work into a serial pipeline.
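To make the independence concrete, here is a minimal sketch in Python. The `brightness` formula is a made-up placeholder, not a real shading calculation; the point is that computing pixels in any order gives identical results.

```python
import random

WIDTH, HEIGHT = 8, 8

def brightness(x, y):
    # Placeholder formula: depends only on this pixel's own coordinates --
    # no neighbors, no shared state, no ordering constraint.
    return (x * 31 + y * 17) % 256

coords = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]

# Compute in scan order...
in_order = {c: brightness(*c) for c in coords}

# ...and in a random order, as parallel hardware effectively does.
shuffled = coords[:]
random.shuffle(shuffled)
any_order = {c: brightness(*c) for c in shuffled}

# Identical results: there is no dependency chain to respect.
assert in_order == any_order
```

Because order never matters, the work can be handed to any number of processors with no coordination at all.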
The Pixel Problem
A 1920×1080 display has 2,073,600 pixels. At 60 frames per second, that's 124 million pixel calculations every second. At 4K resolution and 120 Hz, you're looking at nearly a billion pixel updates per second.
Each pixel's color might require sampling textures, calculating lighting, blending transparency, and applying post-processing effects. Even if each calculation takes only a few nanoseconds, doing them sequentially adds up. A CPU core running at 4 GHz can execute roughly 4 billion simple operations per second—but that's total operations, not complete pixel calculations.
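A quick back-of-envelope check of the numbers above, as a Python sketch:

```python
# Back-of-envelope check of the pixel counts quoted in the text.
pixels_1080p = 1920 * 1080            # 2,073,600 pixels
per_second_60hz = pixels_1080p * 60   # ~124 million pixel calculations/s

pixels_4k = 3840 * 2160               # 8,294,400 pixels
per_second_120hz = pixels_4k * 120    # ~995 million/s -- nearly a billion

# A 4 GHz core doing ~4e9 simple ops/s leaves a budget of only ~4 operations
# per pixel at 4K/120 Hz -- far short of texturing, lighting, and blending.
ops_per_pixel_budget = 4_000_000_000 / per_second_120hz
```

Even this generous estimate ignores memory stalls and branches, so the real sequential budget is tighter still.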
The insight that led to GPUs: pixel calculations are embarrassingly parallel—and yes, that is the actual technical term. Each pixel can be computed without knowing anything about its neighbors. There's no dependency chain, no ordering constraint, no shared state that needs synchronization.
Interactive: Sequential vs parallel pixel rendering
Sequential: one pixel at a time. Parallel: all 256 pixels at once.
Speedup: 20× faster
Graphics hardware emerged from this realization. Instead of one very fast processor trying to handle millions of pixels, why not have thousands of smaller processors each handling a few pixels? The individual processors can be simpler and slower—what matters is that they work simultaneously.
SIMD and Beyond
Before GPUs, CPU designers already recognized that some workloads benefit from parallelism. SIMD—Single Instruction, Multiple Data—lets one instruction operate on multiple values at once. Instead of adding two numbers, a SIMD instruction adds two vectors of numbers in a single operation.
Interactive: Scalar vs SIMD operations
Scalar (one at a time)
SIMD (all at once)
Scalar: 8 separate multiply instructions. SIMD: 1 instruction multiplies all 8 values simultaneously.
Modern CPUs include SIMD instructions (SSE, AVX) that can process 4, 8, or even 16 values simultaneously. This helps, but it's limited. A CPU might have a few SIMD units per core. A GPU takes the same principle and scales it radically: thousands of simple processors, all executing the same instruction on different data.
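The idea can be sketched with a toy model in Python. Real SIMD happens in hardware registers via intrinsics or auto-vectorization; here, `simd_mul` is just a stand-in that shows one logical "instruction" covering all lanes at once.

```python
# Toy model of an 8-lane SIMD multiply (illustrative only -- real SIMD
# executes in hardware registers, e.g. via SSE/AVX instructions).
def simd_mul(a, b):
    # One logical operation applied across every lane simultaneously.
    return tuple(x * y for x, y in zip(a, b))

a = (1, 2, 3, 4, 5, 6, 7, 8)
b = (10, 10, 10, 10, 10, 10, 10, 10)

# Scalar path: 8 separate multiply "instructions".
scalar = tuple(a[i] * b[i] for i in range(8))

# SIMD path: 1 logical instruction covering all 8 lanes.
vector = simd_mul(a, b)

assert scalar == vector == (10, 20, 30, 40, 50, 60, 70, 80)
```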
The GPU's architecture is often called SIMT—Single Instruction, Multiple Threads. Groups of threads (typically 32 or 64) execute in lockstep, all running the same instruction but on different data. This is why GPU programming uses the concept of "shaders": you write one program that describes what happens to a single pixel or vertex, and the hardware runs thousands of copies simultaneously.
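A sketch of the shader programming model, with the GPU's scheduler simulated by a plain loop. The `pixel_shader` body is an invented placeholder; what matters is the shape: you write the program for one invocation, and the dispatch runs one copy per pixel.

```python
# SIMT sketch: write the program for ONE invocation; the hardware runs
# thousands of copies. Here a list comprehension stands in for the scheduler.
def pixel_shader(invocation_id, width):
    # Each invocation derives its own coordinates from its id, as GPU
    # threads do, then runs the same instructions on its own data.
    x, y = invocation_id % width, invocation_id // width
    return (x ^ y) & 0xFF  # placeholder color calculation

WIDTH, HEIGHT = 16, 16

# "Dispatch": one copy of the same program per pixel.
framebuffer = [pixel_shader(i, WIDTH) for i in range(WIDTH * HEIGHT)]
assert len(framebuffer) == 256
```

On real hardware the copies would run in lockstep groups of 32 or 64, not one after another—but the program you write is identical.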
Throughput vs Latency
CPUs optimize for latency—the time to complete a single task. They have large caches, sophisticated branch predictors, and out-of-order execution to minimize the time between starting a calculation and getting the result. This makes them excellent for interactive applications, operating system tasks, and complex decision-making.
GPUs optimize for throughput—the total work completed per unit time. Individual operations might take longer (higher latency), but the GPU compensates by having thousands in flight simultaneously. If one group of threads is waiting for memory, another group runs. The goal is to keep all the processors busy, not to minimize any single task's completion time.
Interactive: The throughput-latency tradeoff
Winner: GPU
Crossover point: ~81 tasks
The GPU has higher startup latency but scales better. Below the crossover point, the CPU wins. Above it, the GPU's parallelism dominates.
Think of it like shipping packages. A sports car (CPU) can deliver one package very quickly. A cargo ship (GPU) is slower per package, but it carries thousands at once. For moving one urgent document, take the sports car. For shipping a warehouse of goods, the cargo ship wins despite being slower per item.
The crossover point matters. For small problems, the GPU's setup overhead dominates—you spend more time launching the computation than actually computing. For large problems with independent elements, the GPU's massive parallelism makes it orders of magnitude faster.
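The crossover can be modeled in a few lines. All constants here are invented for illustration (the GPU gets a fixed launch overhead but 64 cores' worth of throughput, echoing the demo above); real numbers depend on the hardware and driver.

```python
# Toy latency/throughput model -- every constant below is invented.
def cpu_time(n):
    # CPU: no launch overhead, 1 time unit per task.
    return n * 1.0

def gpu_time(n):
    # GPU: 80 time units of launch overhead, then 64 tasks per time unit.
    return 80.0 + n / 64.0

# First task count where the GPU pulls ahead.
crossover = next(n for n in range(1, 10_000) if gpu_time(n) < cpu_time(n))
# For these constants, crossover == 82 -- close to the ~81 in the demo.
```

Below the crossover the overhead dominates; above it, the per-task term does, and the GPU's advantage grows linearly with problem size.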
When Parallelism Wins
Not every problem benefits from parallelism. The key question: can the work be divided into independent pieces?
Image processing is a natural fit. Applying a filter to a photo means computing each output pixel from nearby input pixels. No pixel's calculation affects another's. Physics simulations work similarly—update each particle's position based on forces, and the calculations are independent (at least within a single time step).
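The filter case can be sketched directly. In this 3×3 box blur, every output pixel reads only the input image, so any output pixel can be computed in isolation—ideal for one GPU thread each.

```python
# Sketch: a 3x3 box blur. Outputs depend only on the INPUT image, so every
# output pixel is independent of every other output pixel.
W, H = 4, 4
src = [[x + y * W for x in range(W)] for y in range(H)]

def blur_pixel(x, y):
    # Average the 3x3 neighborhood, clamping at the image borders.
    total, count = 0, 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < W and 0 <= ny < H:
                total += src[ny][nx]
                count += 1
    return total / count

# Each call touches only src -- no output pixel reads another output pixel.
out = [[blur_pixel(x, y) for x in range(W)] for y in range(H)]
```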
Machine learning turned out to be an unexpected windfall for GPU computing. Neural networks involve massive matrix multiplications and element-wise operations—exactly the kind of independent, repetitive computation GPUs excel at. Training a modern language model would take years on CPUs; GPUs reduce this to days or weeks.
Interactive: Which tasks benefit from GPU parallelism?
Click a task to see why it does or doesn't benefit from GPU parallelism.
Some problems resist parallelization. Parsing a document where each line's interpretation depends on previous lines. Walking a linked list. Evaluating a deeply nested expression. These have inherent sequential dependencies that no amount of parallel hardware can overcome. For these, the CPU's low-latency sequential execution remains king.
The art is recognizing which category your problem falls into, and restructuring computations to expose parallelism when possible.
Interactive: CPU vs GPU processing 64 tasks
CPU
4 fast cores
GPU
64 slower cores
Each CPU core completes tasks quickly, but there are only 4 of them. The GPU has 64 cores working simultaneously—even though each is slower, the total throughput is higher.
The Modern GPU
Today's GPUs contain thousands of cores organized into groups that share instruction decoders and memory. NVIDIA calls these groups "Streaming Multiprocessors," AMD calls them "Compute Units." Each can run hundreds of threads, and a high-end GPU might have 80 or more of these units.
The programming model abstracts this complexity. In WebGPU, you'll write compute shaders that describe the work for a single invocation—what one thread does. You then dispatch thousands or millions of invocations, and the GPU schedules them across its hardware. The goal is to express enough parallelism that the hardware never runs dry.
Memory bandwidth often limits real-world performance more than raw compute. GPUs have specialized high-bandwidth memory (HBM or GDDR) that can transfer hundreds of gigabytes per second, but this is still the bottleneck for many workloads. Effective GPU programming means keeping data on-device, minimizing transfers, and structuring access patterns to maximize memory throughput.
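A rough bandwidth budget makes the point. The numbers below are illustrative, not a specific GPU: one read-plus-write pass over a 4K RGBA framebuffer at 120 Hz, against an assumed 400 GB/s of memory bandwidth.

```python
# Rough bandwidth budget (illustrative numbers, not a specific GPU).
frame_bytes = 3840 * 2160 * 4             # ~33 MB per RGBA8 frame
traffic_per_sec = frame_bytes * 2 * 120   # read + write, 120 frames/s

# Against an assumed 400 GB/s, one full-screen pass uses about 2% of the
# budget -- but real pipelines make many passes and sample many textures
# per pixel, which is why bandwidth fills up fast.
fraction = traffic_per_sec / 400e9
```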
Key Takeaways
- CPUs execute sequentially and optimize for low latency on individual tasks; GPUs execute massively in parallel and optimize for throughput across many tasks
- The pixel problem—millions of independent color calculations per frame—drove the development of parallel graphics hardware
- SIMD (Single Instruction Multiple Data) lets one instruction operate on many values; GPUs extend this to thousands of simultaneous threads
- GPUs win when work can be divided into independent pieces; sequential dependencies favor CPUs
- Memory bandwidth, not raw compute power, often limits GPU performance in practice