Neural Network Foundations

Parameters, gradients, and GPU computation

Before diving into transformers, we need to build a shared vocabulary. This chapter covers the foundational concepts that underpin all modern deep learning—the ideas you will encounter repeatedly as we explore attention mechanisms, embeddings, and large language models.

What is a Neural Network?

At its core, a neural network is a mathematical function with adjustable knobs. That is it. We feed in some input—an image, a sentence, a sequence of numbers—and the network produces an output. What makes them useful is those adjustable knobs.

Think of it like a mixing board in a recording studio. Each slider and dial changes how the sound comes out. A neural network has millions (or billions) of these knobs, and finding the right settings is what we call training.

More formally, a neural network computes:

y = f(x; \theta)

Here, x is our input, y is the output, and θ represents all those adjustable knobs—the parameters. The function f describes the architecture: how the parameters combine with the input to produce the output.

The key realization: neural networks are just flexible math functions. They are not magic, not truly "intelligent." They are functions with so many parameters that they can approximate almost any input-output relationship, given enough data to tune those parameters.
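The "flexible function with knobs" picture fits in a few lines of Python. Below is a toy sketch of a single neuron; the function name `neuron` and the choice of ReLU are ours for illustration, not a standard API:

```python
# A "neural network" in miniature: one neuron as a plain Python function.
# The parameters w and b are the adjustable knobs; the function body is
# the architecture.

def neuron(x, w, b):
    """Weighted sum plus bias, passed through a ReLU nonlinearity."""
    z = w * x + b
    return max(0.0, z)  # ReLU: clip negative values to zero

# Same input, different knob settings, different outputs.
print(neuron(2.0, w=0.5, b=1.0))   # 2.0
print(neuron(2.0, w=-3.0, b=1.0))  # 0.0 (ReLU clipped -5.0)
```

Training is nothing more than searching for the w and b (times a few billion) that make outputs like these match the data.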

Parameters vs Hyperparameters

Two terms that often confuse newcomers: parameters and hyperparameters. They sound similar but serve very different roles.

Parameters are the values the network learns during training. These include:

  • Weights: Numbers that scale how much influence one neuron has on another
  • Biases: Numbers that shift the activation of a neuron up or down

Every connection in the network has an associated weight. Every neuron has a bias. A model like GPT-3 has 175 billion of these parameters, each one a number that was adjusted during training.

Hyperparameters are choices we make before training begins. The network does not learn these—we set them:

  • Learning rate: How big of a step to take during each update
  • Batch size: How many examples to process before updating parameters
  • Number of layers: The depth of the network
  • Hidden dimension: The width of each layer
  • Number of epochs: How many times to cycle through the training data

Here is a quick reference:

Term            | Learned?             | Examples
----------------|----------------------|----------------------------------
Parameters      | Yes, during training | Weights, biases
Hyperparameters | No, set by humans    | Learning rate, batch size, epochs
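One way to see the distinction: hyperparameters fix the shape of the network, and that shape determines how many parameters exist to be learned. A sketch with illustrative sizes (the dimensions and the `count_parameters` helper are hypothetical, not from any particular model):

```python
# Hyperparameters: chosen by us, before training.
hyperparams = {
    "learning_rate": 3e-4,
    "batch_size": 32,
    "num_layers": 2,
    "hidden_dim": 128,
    "num_epochs": 10,
}

def count_parameters(input_dim, hidden_dim, output_dim, num_layers):
    """Count weights + biases in a simple fully connected stack."""
    dims = [input_dim] + [hidden_dim] * num_layers + [output_dim]
    total = 0
    for d_in, d_out in zip(dims, dims[1:]):
        total += d_in * d_out + d_out  # weight matrix + bias vector
    return total

# Parameters: how many values the network will learn, given those choices.
print(count_parameters(784, hyperparams["hidden_dim"], 10,
                       hyperparams["num_layers"]))  # 118282
```

Change `hidden_dim` or `num_layers` and the parameter count changes with it; the parameters themselves only get their values during training.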

Getting hyperparameters right is an art. Too high a learning rate and training becomes unstable. Too low and it takes forever. The best practitioners develop intuition through experimentation.

How Neural Networks Learn

Training a neural network means finding the parameter values that make the network's outputs match what we want. This happens through three interconnected ideas: the loss function, gradient descent, and backpropagation.

The Loss Function

The loss function measures how wrong our predictions are. If the network predicts that an image shows a cat but it is actually a dog, the loss should be high. If the prediction is correct, the loss should be low.

For classification, we often use cross-entropy loss. For regression, mean squared error. The specific choice depends on the task, but the principle is universal: lower loss means better predictions.

\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i; \theta), y_i)

This formula says: for each of our N training examples, compare the network's output f(x_i; θ) to the true label y_i, measure the error with ℓ, and average over all examples.
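Both of the losses mentioned above are short enough to write out directly. A minimal sketch in pure Python (real frameworks provide these, but the math is just this):

```python
import math

def mse(preds, targets):
    """Mean squared error: average of (prediction - target)^2."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cross_entropy(probs, labels):
    """Average negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

print(mse([2.5, 0.0], [3.0, -0.5]))        # 0.25
# A confident, correct classifier gets low loss; a hedging one gets more.
print(cross_entropy([[0.9, 0.1]], [0]))    # ~0.105
print(cross_entropy([[0.5, 0.5]], [0]))    # ~0.693
```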

Gradient Descent

Now we need to minimize the loss. Imagine you are blindfolded on a hilly landscape and want to reach the lowest point. What do you do? Feel the slope beneath your feet and step downhill.

That is gradient descent. The gradient tells us the direction of steepest ascent—the uphill direction. We take a step in the opposite direction, downhill, to reduce the loss.

\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}

Here η is the learning rate (how big a step to take) and ∇_θ L is the gradient of the loss with respect to our parameters. The gradient points uphill, so we subtract it to go downhill.
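The update rule can be watched working on a loss simple enough to minimize by eye, L(θ) = (θ − 3)², whose minimum is at θ = 3 and whose gradient is 2(θ − 3). A minimal sketch:

```python
# Gradient descent on L(theta) = (theta - 3)^2.
theta = 0.0   # start far from the minimum
eta = 0.1     # learning rate

for step in range(50):
    grad = 2 * (theta - 3)     # dL/dtheta, points uphill
    theta = theta - eta * grad # step downhill

print(round(theta, 4))  # very close to 3.0
```

Each step shrinks the distance to the minimum by a constant factor (here 0.8), which is why the ball in descent visualizations slows down as it approaches the bottom.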

Interactive: Gradient Descent

[Visualization: a ball representing the current parameters steps downhill on a loss surface; a slider adjusts the learning rate so you can see its effect on convergence.]

Watch how the learning rate affects convergence. Too small and progress is slow. Too large and you might overshoot the minimum entirely, bouncing back and forth without settling.

Backpropagation

Computing the gradient ∇_θ L is not trivial. With billions of parameters, how do we know how each one affects the final loss?

The answer is backpropagation—a clever application of the chain rule from calculus. It works in two phases:

  1. Forward pass: Feed the input through the network layer by layer, computing activations until we get the output and loss.

  2. Backward pass: Starting from the loss, propagate gradients backward through the network. Each layer computes how its inputs affected the loss, passing that information to the previous layer.

By the end of the backward pass, we know the gradient for every single parameter in the network.
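The two passes can be traced by hand on a network small enough to hold in your head: y = w₂ · relu(w₁ · x) with loss L = (y − t)². A sketch (the `forward_backward` function is ours for illustration):

```python
# Backpropagation by hand on a two-weight network.
def forward_backward(x, t, w1, w2):
    # Forward pass: compute activations, then the loss.
    z = w1 * x
    h = max(0.0, z)             # ReLU
    y = w2 * h
    loss = (y - t) ** 2

    # Backward pass: chain rule, from the loss toward the inputs.
    dL_dy = 2 * (y - t)
    dL_dw2 = dL_dy * h                        # y = w2 * h
    dL_dh = dL_dy * w2
    dL_dz = dL_dh * (1.0 if z > 0 else 0.0)   # ReLU passes gradient only if z > 0
    dL_dw1 = dL_dz * x                        # z = w1 * x
    return loss, dL_dw1, dL_dw2

print(forward_backward(x=2.0, t=2.0, w1=0.5, w2=1.0))  # (1.0, -4.0, -2.0)
```

Notice that the backward pass reuses values (h, z) saved during the forward pass; this is exactly why activations must be kept in memory during training, a point that returns in the memory section below.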

The Training Loop

Putting it all together, here is the training loop in pseudocode:

```python
for epoch in range(num_epochs):
    for batch in training_data:
        # Forward pass
        predictions = model(batch.inputs)
        loss = loss_function(predictions, batch.targets)

        # Backward pass
        gradients = compute_gradients(loss, model.parameters)

        # Update parameters in place, so the model sees the new values
        # (plain `param = param - ...` would only rebind the loop variable)
        for param, grad in zip(model.parameters, gradients):
            param -= learning_rate * grad
```

We repeat this loop thousands or millions of times. Each iteration nudges the parameters slightly toward better predictions. Over time, the loss decreases and the network improves.
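The pseudocode above becomes fully runnable on a toy problem: fitting y = wx + b to points drawn from the line y = 2x + 1. This sketch simplifies to full-batch updates and pure Python, with no framework; all names are illustrative:

```python
import random

# Data sampled exactly from the true line y = 2x + 1.
random.seed(0)
data = [(x, 2 * x + 1) for x in [random.uniform(-1, 1) for _ in range(100)]]

w, b = 0.0, 0.0
learning_rate = 0.1

for epoch in range(200):
    # Forward + backward in one step: gradients of the mean squared error.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # Update parameters.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # approaches 2.0 and 1.0
```

Two parameters converge in a fraction of a second; the same loop with billions of parameters and nonlinear layers is what takes warehouse-scale hardware.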

Why GPUs?

You might wonder why training large models requires specialized hardware. The answer comes down to the nature of neural network computation.

A neural network is fundamentally a series of matrix multiplications. When we pass a batch of inputs through a layer, we multiply an input matrix by a weight matrix. This operation is repeated layer after layer, millions of times per training step.

Matrix multiplication is embarrassingly parallel. To multiply two matrices, we compute many independent dot products that do not depend on each other. This is where GPUs shine.
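The independence is visible directly in code: each output element C[i][j] is its own dot product of row i and column j, touching no other output. A plain-Python sketch (a real GPU would assign each element to a separate core):

```python
# Matrix multiplication as independent dot products.
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k))  # one dot product
             for j in range(m)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

Here the n × m dot products run one after another; nothing in the math forces that ordering, which is precisely the parallelism a GPU exploits.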

Interactive: CPU vs GPU Processing

[Visualization: 64 operations dispatched to a CPU's 8 powerful cores, finishing in 8 cycles, versus a GPU's thousands of small cores finishing all 64 in a single cycle.]

A CPU has a few powerful cores (typically 8-16) optimized for complex, sequential tasks. A GPU has thousands of simpler cores (10,000+) designed to do the same operation on many pieces of data simultaneously.

When you multiply two large matrices:

  • A CPU computes elements one at a time (or 8 at a time with its cores)
  • A GPU computes thousands of elements simultaneously

The core advantage: GPUs do thousands of multiplications at the same time. This is why a task that takes hours on a CPU can finish in minutes on a GPU.

Memory Considerations

Speed is not the only concern. Large models need massive amounts of memory. A 7 billion parameter model in 16-bit precision requires:

  • 14 GB just to store the parameters
  • 28 GB for optimizer states (Adam keeps momentum and variance for each parameter)
  • Plus memory for activations, gradients, and the KV cache (we will cover this in later chapters)
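The arithmetic behind those figures is worth doing once. A back-of-envelope sketch (the activation and KV-cache numbers are rough illustrative values, not fixed constants; real usage depends on batch size and sequence length):

```python
# Training-memory estimate for a 7B-parameter model in FP16 (2 bytes/value).
params = 7e9
weights_gb = params * 2 / 1e9     # 14.0 GB of raw weights
optimizer_gb = 2 * weights_gb     # Adam: momentum + variance -> 28.0 GB
activations_gb = 4.0              # illustrative; grows with batch/sequence
kv_cache_gb = 2.0                 # illustrative; covered in later chapters

total = weights_gb + optimizer_gb + activations_gb + kv_cache_gb
print(total)  # 48.0
```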

Numerical Precision

Each parameter is stored as a floating-point number, but how many bits we use to represent it affects both memory and accuracy:

  • FP32 (32-bit floating point): Full precision. Each number uses 4 bytes. Most accurate but most memory-hungry.
  • FP16 (16-bit floating point): Half precision. Each number uses 2 bytes. Good balance of accuracy and efficiency.
  • INT8 (8-bit integer): Quarter the memory of FP32. Used primarily for inference, not training, with some accuracy loss.

A 7B parameter model stores 7 billion numbers. In FP32, that is 28 GB just for the weights. In FP16, it is 14 GB.
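Those sizes follow directly from bytes-per-value, as a quick calculation shows:

```python
# Raw weight storage for a 7B-parameter model at each precision.
params = 7e9
bytes_per_value = {"FP32": 4, "FP16": 2, "INT8": 1}

for fmt, nbytes in bytes_per_value.items():
    print(f"{fmt}: {params * nbytes / 1e9:.0f} GB")
# FP32: 28 GB, FP16: 14 GB, INT8: 7 GB
```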

Interactive: Memory Breakdown

Memory usage while training a 7B parameter model in FP16:

Component        | Memory | Notes
-----------------|--------|------------------------------------------------
Model parameters | 14 GB  | The actual weights of the neural network
Optimizer states | 28 GB  | Adam stores momentum and variance per parameter
Activations      | 4 GB   | Intermediate values saved for backpropagation
KV cache         | 2 GB   | Cached key-value pairs for attention (covered in later chapters)
Total            | 48 GB  |

Half precision halves memory while maintaining good accuracy.

Training a 7B model requires significant GPU memory. The optimizer states alone can take 2× the parameter memory! This is why mixed-precision training and quantization are essential for large models.

This is why you hear about models requiring 80 GB GPUs or clusters of multiple GPUs. A single consumer GPU with 24 GB cannot even hold the optimizer states for a 7B model during training.

Multi-GPU Training

When one GPU is not enough, we distribute the workload:

  • Data parallelism: Each GPU processes a different batch of data, then they average their gradients
  • Model parallelism: Different layers live on different GPUs
  • Tensor parallelism: Individual layers are split across GPUs
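Data parallelism, the most common of the three, reduces to a simple pattern: compute gradients locally, then average. A toy sketch, with plain Python lists standing in for GPUs and the average standing in for the all-reduce step (the model is the one-parameter fit ŷ = w·x):

```python
# Data parallelism in miniature.
def local_gradient(w, batch):
    """Gradient of mean squared error for y_hat = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
shards = [  # each shard plays the role of one GPU's batch
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]

# Each "GPU" computes its own gradient independently...
grads = [local_gradient(w, shard) for shard in shards]
# ...then they synchronize by averaging (the all-reduce step).
avg_grad = sum(grads) / len(grads)
w -= 0.1 * avg_grad
print(w)  # one step toward the true slope of 2.0
```

The averaged update is identical to what a single GPU would compute on the combined batch, which is why data parallelism scales so cleanly until communication costs dominate.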

Training GPT-3 required thousands of GPUs working together, coordinating their computations and sharing gradients. The engineering challenge is immense—but the fundamental operations are still just matrix multiplications and gradient updates.

Key Takeaways

  • Neural networks are mathematical functions with learnable parameters (weights and biases)
  • Parameters are learned during training; hyperparameters are set by humans before training
  • The loss function measures prediction error; we minimize it using gradient descent
  • Backpropagation computes gradients efficiently using the chain rule
  • GPUs excel at training because neural networks are dominated by parallel matrix operations
  • Large models require careful memory management—training a 7B model needs 40+ GB of GPU memory
  • Multi-GPU setups distribute computation and memory across multiple devices