The Leela Chess Zero neural network is largely based on DeepMind's AlphaGo Zero[^1] and AlphaZero[^2] architectures, with some changes.
Network topology

The core of the network is a residual tower with Squeeze-and-Excitation[^3] (SE) layers. The number of residual `BLOCKS` and `FILTERS` (channels) per block differs between networks. Typical values for `BLOCKS`×`FILTERS` are 10×128, 20×256 and 24×320. SE layers have `SE_CHANNELS` channels (typically 32 or so).
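As a rough sense of scale, here is a back-of-the-envelope sketch of the parameter count of one residual block, using the layer shapes described on this page (3×3 kernels with biases; the function names are illustrative, not from the Lc0 code):

```python
def conv_params(cin, cout, k=3):
    # k×k convolution weights plus a per-output-channel bias
    return cin * cout * k * k + cout

def fc_params(nin, nout):
    # MatMul weights plus bias
    return nin * nout + nout

def se_block_params(filters, se_channels):
    """Parameters of one residual block with an SE layer, per the
    layer shapes described on this page (biases included; batch
    normalization is already folded into the conv weights)."""
    convs = 2 * conv_params(filters, filters)
    se = fc_params(filters, se_channels) + fc_params(se_channels, 2 * filters)
    return convs + se

print(se_block_params(256, 32))  # one block of a 20×256 net: 1205280
```

So a 20×256 network spends roughly 1.2M parameters per block, about 24M in the residual tower alone.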
The input to the neural network is 112 planes of 8×8 each. The network consists of a "body" (the residual tower) and several output "heads" attached to it.

All convolution layers also include biases. A fully connected layer is a MatMul plus a bias add.
Body

* Input convolution: from 112×8×8 to `FILTERS`×8×8.
* Residual tower consisting of `BLOCKS` blocks; each block is:
  * Convolution from `FILTERS`×8×8 to `FILTERS`×8×8.
  * Convolution from `FILTERS`×8×8 to `FILTERS`×8×8.
  * SE layer (only in network type `NETWORK_SE_WITH_HEADFORMAT` [current]), i.e.:
    * Global average pooling layer (`FILTERS`×8×8 to `FILTERS`).
    * Fully connected layer (`FILTERS` to `SE_CHANNELS`).
    * ReLU.
    * Fully connected layer (`SE_CHANNELS` to 2×`FILTERS`).
    * The 2×`FILTERS` output is split into two `FILTERS`-sized vectors `W` and `B`.
    * `Z = Sigmoid(W)`.
    * The output of the SE layer is `(Z × input) + B`.
  * Adding the residual tower skip connection.
  * ReLU activation function.
All convolutions have kernel size 3×3 and stride 1.
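The SE computation described above can be sketched in NumPy as follows (a minimal sketch assuming NCHW layout for a single position; the argument names are illustrative):

```python
import numpy as np

def se_layer(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation layer as described above.
    x: (FILTERS, 8, 8) input; w1/b1: FILTERS -> SE_CHANNELS;
    w2/b2: SE_CHANNELS -> 2*FILTERS. Names are illustrative."""
    filters = x.shape[0]
    pooled = x.mean(axis=(1, 2))                  # global average pooling: FILTERS
    hidden = np.maximum(w1 @ pooled + b1, 0.0)    # fully connected + ReLU: SE_CHANNELS
    out = w2 @ hidden + b2                        # fully connected: 2*FILTERS
    w, b = out[:filters], out[filters:]           # split into W and B
    z = 1.0 / (1.0 + np.exp(-w))                  # Z = Sigmoid(W)
    return z[:, None, None] * x + b[:, None, None]  # (Z × input) + B
```

Note that with all-zero second-layer weights the gate `Z` is 0.5 everywhere, so the SE layer simply halves its input.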
Batch normalization is already folded into the weights, so no normalization needs to be done during inference.
Policy head

Format: `POLICY_CONVOLUTION` [current]

* Convolution from `FILTERS`×8×8 to `FILTERS`×8×8.
* Convolution from `FILTERS`×8×8 to 80×8×8.
* A vector of length 1858 is gathered from the 80×8×8 output using a fixed move mapping (only 73×8×8 is actually used; the rest is padding).
* (Note: there is no activation function on the output.)
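The gather step can be sketched as below. The real 1858-entry move mapping is defined in the Lc0 source; the identity mapping used here is purely a placeholder assumption:

```python
import numpy as np

def gather_policy(conv_out, mapping):
    """conv_out: (80, 8, 8) policy-head output; mapping: 1858 indices
    into the flattened 80*8*8 tensor. Returns the 1858 policy logits."""
    return conv_out.reshape(80 * 8 * 8)[mapping]

# Placeholder mapping (an assumption; NOT the real Lc0 move mapping).
mapping = np.arange(1858)
logits = gather_policy(np.zeros((80, 8, 8)), mapping)
```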
Format: `POLICY_CLASSICAL`

`POLICY_CONV_SIZE` is a parameter.

* Convolution from `FILTERS`×8×8 to `POLICY_CONV_SIZE`×8×8.
* Fully connected from `POLICY_CONV_SIZE`×8×8 to a vector of length 1858.
* (Note: there is no activation function on the output.)
Value head

Common part:

* Convolution from `FILTERS`×8×8 to 32×8×8.
* Fully connected from 32×8×8 to a vector of length 128.
* ReLU.

Format: `VALUE_WDL` [current]

* Fully connected from the vector of length 128 to a vector of length 3.
* Softmax.

Format: `VALUE_CLASSICAL`

* Fully connected from the vector of length 128 to a scalar.
* Tanh.
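For the `VALUE_WDL` head, the softmax output can be read as win/draw/loss probabilities. Reducing them to a single scalar as `P(win) − P(loss)` is a common convention (an assumption here; this page does not specify it):

```python
import numpy as np

def value_wdl(logits):
    """Softmax over the three value-head outputs -> (win, draw, loss)."""
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

wdl = value_wdl(np.array([1.0, 0.0, -1.0]))
q = wdl[0] - wdl[2]  # scalar evaluation in [-1, 1] (convention, not from this page)
```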
Moves left head

`MLH_CHANNELS` and `FC_SIZE` are parameters.

* Convolution from `FILTERS`×8×8 to `MLH_CHANNELS`×8×8.
* Fully connected from `MLH_CHANNELS`×8×8 to a vector of size `FC_SIZE`.
* ReLU.
* Fully connected from the vector of size `FC_SIZE` to a scalar.
* ReLU.
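The fully connected tail of the moves-left head can be sketched as below (the initial convolution is omitted and the weight names are illustrative); the final ReLU keeps the moves-left estimate non-negative:

```python
import numpy as np

def moves_left_tail(features, fc1_w, fc1_b, fc2_w, fc2_b):
    """features: flattened MLH_CHANNELS*8*8 conv output.
    fc1: -> FC_SIZE, followed by ReLU; fc2: -> scalar, followed by ReLU."""
    hidden = np.maximum(fc1_w @ features + fc1_b, 0.0)
    return max(float(fc2_w @ hidden + fc2_b), 0.0)
```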

[^1]: AlphaGo Zero: https://deepmind.com/research/publications/mastering-game-go-without-human-knowledge (scroll down for the paper link).

[^2]: AlphaZero: https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go (scroll down for the paper link).

[^3]: Squeeze-and-Excitation networks: https://arxiv.org/abs/1709.01507