μ-Parametrization Explained
An Interactive Guide to Tuning Large Neural Networks
Anatomy of a Wide Layer
To understand how to scale a network, we first need to identify its "width." This is determined by two key dimensions: fan_in and fan_out.
In code, fan_in is the number of input features, while fan_out is the number of output features. Together, they define the shape of a layer's weight matrix.
import torch.nn as nn

# A layer's "width" is defined by its input and output dimensions.
projection_layer = nn.Linear(
    in_features=4,  # fan_in
    out_features=6  # fan_out
)

print(projection_layer.weight.shape)
# torch.Size([6, 4]), i.e. [fan_out, fan_in]
The Core Problem & The μP Fix
When each weight's variance is held fixed, a neuron's output variance grows in proportion to fan_in, so wider layers produce ever-larger outputs and training becomes unstable. See how Standard Parametrization (SP) breaks and how μ-Parametrization (μP) fixes it.
[Interactive demo: slider for the input vector size (fan_in); fan_out is held constant.]
With SP, the output distribution spreads out as fan_in grows, leading to unstable training. Now, switch to μP to see the fix.
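The same effect can be reproduced outside the demo. The sketch below is a minimal illustration, assuming inputs and weights drawn from a unit-variance Gaussian (the demo's framing): it measures the output standard deviation of a single linear map at several widths, with and without a width-dependent rescaling of the weights. The 1/√fan_in factor used here is the classic way to make the forward pass width-independent; μP's full recipe additionally adjusts the output layer and the per-layer learning rates.

import torch

torch.manual_seed(0)

# Assumption (matching the demo's framing): inputs and weights are drawn from
# a unit-variance Gaussian. The "raw" column is the unscaled SP-style output;
# the "rescaled" column divides the weights by sqrt(fan_in) so the output
# scale no longer depends on width.
fan_out = 6
for fan_in in (4, 64, 1024, 4096):
    x = torch.randn(2000, fan_in)
    w = torch.randn(fan_out, fan_in)

    raw_std = (x @ w.T).std().item()                        # grows like sqrt(fan_in)
    rescaled_std = (x @ (w / fan_in**0.5).T).std().item()   # stays near 1.0

    print(f"fan_in={fan_in:5d}  raw std={raw_std:7.2f}  rescaled std={rescaled_std:5.2f}")

Under these assumptions the raw standard deviation grows roughly as √fan_in, which is the spreading distribution the SP side of the demo shows, while the rescaled version stays flat across widths.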
The Scaling Rule: Calculating the Divisor
μP keeps training stable by scaling each layer's learning rate according to how much the model's width has changed relative to a tuned base model. Select a base and target model to see the rule.
[Interactive demo: pick a Base Model (tuned) and a Target Model (large); the widget visualizes the base and target weight matrices and reports the Fan-in Scale, Fan-out Scale, and the resulting Learning Rate Divisor (divide the base LR by this value).]
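As a minimal sketch of what the widget computes, assuming the Adam-style μP rule in which matrix-like (hidden) weights get a learning rate proportional to 1/fan_in, the divisor is simply the fan-in scale factor. The widths and base learning rate below are illustrative values, not taken from the guide.

def lr_divisor(base_fan_in: int, target_fan_in: int) -> float:
    """How much to divide the tuned base learning rate by (the fan-in scale)."""
    return target_fan_in / base_fan_in

base_lr = 3e-3        # tuned on the small base model (illustrative value)
base_fan_in = 256     # base model hidden width (illustrative)
target_fan_in = 4096  # target model hidden width (illustrative)

divisor = lr_divisor(base_fan_in, target_fan_in)  # 16.0
target_lr = base_lr / divisor                     # 1.875e-04

print(f"Learning Rate Divisor: {divisor}")
print(f"Target hidden-layer LR: {target_lr}")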
The Payoff: Zero-Shot Hyperparameter Transfer
Because μP guarantees that a model's behavior remains stable as it gets wider, it unlocks a powerful capability: we can invest time finding the optimal hyperparameters (like learning rate and initialization) on a small, cheap-to-train model. Then, using the scaling rules of μP, we can transfer those hyperparameters directly to a massive, production-scale model and expect it to train optimally on the first try. This "zero-shot" transfer dramatically reduces the cost and complexity of training state-of-the-art neural networks.