μ-Parametrization Explained
An Interactive Guide to Tuning Large Neural Networks
Anatomy of a Wide Layer
To understand how to scale a network, we first need to identify its "width." This is determined by two key dimensions: fan_in and fan_out.
In code, fan_in is the number of input features, while fan_out is the number of output features. Together, they define the shape of a layer's weight matrix.
import torch.nn as nn

# A layer's "width" is defined by its input and output dimensions.
projection_layer = nn.Linear(
    in_features=4,  # fan_in
    out_features=6  # fan_out
)

print(projection_layer.weight.shape)
# torch.Size([6, 4]), i.e. [fan_out, fan_in]
The Core Problem & The μP Fix
When each weight's variance is held fixed, a neuron's output variance grows in proportion to fan_in, so wider layers produce ever-larger outputs and training becomes unstable. See how Standard Parametrization (SP) breaks and how μ-Parametrization (μP) fixes it.
[Interactive demo: slider for the input vector size (fan_in); fan_out is held constant.]
With SP, the output distribution spreads out as fan_in grows, leading to unstable training. Now, switch to μP to see the fix.
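The same effect can be reproduced outside the demo. The sketch below is a minimal illustration, assuming inputs and weights drawn from a unit-variance Gaussian (the demo's framing): it measures the output standard deviation of a single linear map at several widths, with and without a width-dependent rescaling of the weights. The 1/√fan_in factor used here is the classic way to make the forward pass width-independent; μP's full recipe additionally adjusts the output layer and the per-layer learning rates.

import torch

torch.manual_seed(0)

# Assumption (matching the demo's framing): inputs and weights are drawn from
# a unit-variance Gaussian. The "raw" column is the unscaled SP-style output;
# the "rescaled" column divides the weights by sqrt(fan_in) so the output
# scale no longer depends on width.
fan_out = 6
for fan_in in (4, 64, 1024, 4096):
    x = torch.randn(2000, fan_in)
    w = torch.randn(fan_out, fan_in)

    raw_std = (x @ w.T).std().item()                        # grows like sqrt(fan_in)
    rescaled_std = (x @ (w / fan_in**0.5).T).std().item()   # stays near 1.0

    print(f"fan_in={fan_in:5d}  raw std={raw_std:7.2f}  rescaled std={rescaled_std:5.2f}")

Under these assumptions the raw standard deviation grows roughly as √fan_in, which is the spreading distribution the SP side of the demo shows, while the rescaled version stays flat across widths.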
The Scaling Rule: Calculating the Divisor
μP keeps training stable by scaling each layer's learning rate according to how much the model's width has changed relative to a tuned base model. Select a base and target model to see the rule.
[Interactive demo: pick a Base Model (tuned) and a Target Model (large); the widget visualizes the base and target weight matrices and reports the Fan-in Scale, Fan-out Scale, and the resulting Learning Rate Divisor (divide the base LR by this value).]
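As a minimal sketch of what the widget computes, assuming the Adam-style μP rule in which matrix-like (hidden) weights get a learning rate proportional to 1/fan_in, the divisor is simply the fan-in scale factor. The widths and base learning rate below are illustrative values, not taken from the guide.

def lr_divisor(base_fan_in: int, target_fan_in: int) -> float:
    """How much to divide the tuned base learning rate by (the fan-in scale)."""
    return target_fan_in / base_fan_in

base_lr = 3e-3        # tuned on the small base model (illustrative value)
base_fan_in = 256     # base model hidden width (illustrative)
target_fan_in = 4096  # target model hidden width (illustrative)

divisor = lr_divisor(base_fan_in, target_fan_in)  # 16.0
target_lr = base_lr / divisor                     # 1.875e-04

print(f"Learning Rate Divisor: {divisor}")
print(f"Target hidden-layer LR: {target_lr}")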
The Payoff: Zero-Shot Hyperparameter Transfer
Because μP guarantees that a model's behavior remains stable as it gets wider, it unlocks a powerful capability: we can invest time finding the optimal hyperparameters (like learning rate and initialization) on a small, cheap-to-train model. Then, using the scaling rules of μP, we can transfer those hyperparameters directly to a massive, production-scale model and expect it to train optimally on the first try. This "zero-shot" transfer dramatically reduces the cost and complexity of training state-of-the-art neural networks.