LLM Scaling Laws Paper Notes - Scaling Laws for Neural Language Models

20 March 2024
13 min read
Tags: LLM, Scaling Laws, Paper Notes



Background and Overview

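In 2020, OpenAI published "Scaling Laws for Neural Language Models", which studies how the training loss of Transformer-based language models relates to the number of model parameters N, the dataset size D, and the amount of training compute C.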

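In April 2022, Google DeepMind revisited scaling laws in "Training Compute-Optimal Large Language Models". They argue that current large models are significantly undertrained: Chinchilla, a 70B-parameter model trained on four times as much data as the 280B-parameter Gopher, achieves better results (a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, an improvement of about 7% over Gopher).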

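However, in the blog post "Chinchilla's Death", published by Thaddée Tyl in February 2023, the author argues that a smaller model can overtake a larger one as long as it is trained for long enough.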

Paper

Scaling Laws for Neural Language Models

Key Findings

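- Model performance improves as model size N, dataset size D, and training compute C increase, and depends only weakly on model shape (depth and width) and on the number of self-attention heads.
- When not bottlenecked by the other two factors, performance has a power-law relationship with each of N, D, and C.
- Performance improves predictably when N and D are scaled up together; the paper finds that each 8x increase in model size needs only about a 5x increase in data to avoid a performance penalty.
- Training curves follow predictable power laws whose parameters are roughly independent of model size, so the early part of a training curve can be extrapolated to roughly predict the loss after much longer training.
- Evaluating on text whose distribution differs from the training set incurs a penalty (a higher loss), but the penalty is roughly constant, so improvements on the training distribution carry over to the other distribution.
- Large models are more sample-efficient than small ones: they reach the same performance with fewer optimization steps (Figure 2 of the paper) and fewer data points (Figure 4 of the paper).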


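Training to convergence is inefficient: for a fixed compute budget C, the best performance comes from training a very large model and stopping well before it has converged (Figure 3 of the paper). The paper gives the data requirement as growing only slowly with compute, $D\sim C^{0.27}$.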

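Figure 3 of the paper shows how much each factor contributes as the compute budget C grows. Most of the increase should go to model size, then to data (used through larger batch sizes and to avoid reusing data), while serial steps (more training iterations) contribute very little.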

The ideal batch size for training is roughly a power of the loss alone, and it can still be determined by measuring the gradient noise scale.

Summary of the Scaling Laws

Parameters:

non-embedding parameters $N$

the dataset size $D$

optimally allocated compute budget $C_{min}$

Notation
- $L$ – the cross entropy loss in nats. Typically it will be averaged over the tokens in a context, but in some cases we report the loss for specific tokens within the context.
- $N$ – the number of model parameters, excluding all vocabulary and positional embeddings
- $C \approx 6NBS$ – an estimate of the total non-embedding training compute, where $B$ is the batch size, and $S$ is the number of training steps (i.e. parameter updates). We quote numerical values in PF-days, where one $\text{PF-day}= 10^{15} \times 24 \times 3600 = 8.64 \times 10^{19}$ floating point operations.
- $D$ – the dataset size in tokens
- $B_{crit}$ – the critical batch size [MKAT18], defined and discussed in Section 5.1. Training at the critical batch size provides a roughly optimal compromise between time and compute efficiency.
- $C_{min}$ – an estimate of the minimum amount of non-embedding compute to reach a given value of the loss. This is the training compute that would be used if the model were trained at a batch size much less than the critical batch size.
- $S_{min}$ – an estimate of the minimal number of training steps needed to reach a given value of the loss. This is also the number of training steps that would be used if the model were trained at a batch size much greater than the critical batch size.
- $\alpha_X$ – power-law exponents for the scaling of the loss as $L(X) \propto 1/X^{\alpha_X}$ where $X$ can be any of $N, D, C, S, B, C^{min}$.
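
As a rough illustration of the compute estimate $C \approx 6NBS$ and the PF-day unit defined above, here is a minimal Python sketch. The function name `training_compute_pf_days` and the example run are assumptions made for illustration only, and the batch size is assumed to be measured in tokens, so that $B \cdot S$ is the total number of tokens processed.

```python
# One PF-day: 10^15 floating point operations per second, sustained for one day.
PF_DAY_FLOPS = 1e15 * 24 * 3600  # = 8.64e19


def training_compute_pf_days(n_params: float, batch_tokens: float, steps: float) -> float:
    """Estimate total non-embedding training compute, C ~= 6 * N * B * S, in PF-days."""
    flops = 6.0 * n_params * batch_tokens * steps
    return flops / PF_DAY_FLOPS


# Illustrative (hypothetical) run: 1.5e9 non-embedding parameters,
# a batch of 2**19 tokens, and 2.5e5 parameter updates.
example_pf_days = training_compute_pf_days(1.5e9, 2**19, 2.5e5)
print(f"C ~= {example_pf_days:.1f} PF-days")
```

Because the estimate is linear in each of $N$, $B$, and $S$, doubling any one of them doubles the estimated compute.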

When not limited by the other two factors, the test loss can be predicted by the following formulas:

Limited by N

$$ \begin{equation} L(N)=(N_c/N)^{\alpha_N} \end{equation} $$

$$ \alpha_N \sim 0.076,N_c \sim 8.8 \times 10^{13} \text{(non-embedding parameters)} $$

Limited by D (with early stopping)

$$ \begin{equation} L(D)=(D_c/D)^{\alpha_D} \end{equation} $$

$$ \alpha_D \sim 0.095, D_c \sim 5.4 \times 10^{13}\text{(tokens)} $$

Limited by C

$$ \begin{equation}L(C_{min})=(C^{min}_c/C_{min})^{\alpha^{min}_C} \end{equation} $$

$$ \alpha_C^{min} \sim 0.050, C^{min}_c \sim 3.1 \times 10^8 \text{(PF-days)} $$

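Interpretation:

In the three formulas above, $\alpha_N, \alpha_D, \alpha_C^{min}$ are the power-law exponents that determine how much performance improves as we scale up $N, D, C_{min}$.

For example, doubling the number of model parameters reduces the loss by a factor of $2^{-\alpha_N}\approx 0.95$, so the new loss is about 0.95 times the previous one. The exact values of $N_c, D_c, C_c^{min}$ depend on the vocabulary size and tokenization, so they have no fundamental meaning and only indicate orders of magnitude.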
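
The three single-variable laws and the doubling example can be checked with a short Python sketch. The constants are the fitted values quoted above; the helper names `power_law_loss` and `loss_ratio` are hypothetical and chosen only for this illustration.

```python
# Fitted constants quoted above (from the paper).
ALPHA_N, N_C = 0.076, 8.8e13         # N_c in non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13         # D_c in tokens
ALPHA_C_MIN, C_C_MIN = 0.050, 3.1e8  # C_c^min in PF-days


def power_law_loss(x: float, x_c: float, alpha: float) -> float:
    """L(X) = (X_c / X) ** alpha_X, in nats."""
    return (x_c / x) ** alpha


def loss_ratio(scale: float, alpha: float) -> float:
    """Factor by which the loss changes when X is multiplied by `scale`."""
    return scale ** (-alpha)


for name, alpha in (("N", ALPHA_N), ("D", ALPHA_D), ("C_min", ALPHA_C_MIN)):
    print(f"doubling {name}: loss x {loss_ratio(2.0, alpha):.3f}")
```

Since the exponents are small, each doubling only removes a few percent of the loss, which is why the paper's trends span many orders of magnitude in $N$, $D$, and $C_{min}$.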

The paper also gives the relationship between batch size and loss:

$$ \begin{equation}B_{crit}(L)=\frac{B_*}{L^{1/\alpha_B}} \end{equation} $$

$$ B_* \sim 2\cdot10^8 \text{ tokens}, \alpha_B\sim0.21 $$
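
To get a feel for how quickly the critical batch size grows as the loss falls, the formula can be evaluated directly. This is a minimal sketch using the constants quoted above; the function name `critical_batch_size` and the sample loss values are illustrative assumptions, not figures from the paper.

```python
B_STAR = 2e8    # tokens
ALPHA_B = 0.21


def critical_batch_size(loss: float) -> float:
    """B_crit(L) = B_* / L ** (1 / alpha_B), in tokens."""
    return B_STAR / loss ** (1.0 / ALPHA_B)


# Hypothetical loss values in nats, chosen only to show the trend.
for loss in (4.0, 3.0, 2.0):
    print(f"L = {loss:.1f} nats -> B_crit ~ {critical_batch_size(loss):.3g} tokens")
```

Because $1/\alpha_B \approx 4.8$, a modest drop in loss corresponds to a large increase in the critical batch size.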

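Equations (1) and (2) together imply that as we increase the model size we should increase the dataset size correspondingly, as $D\propto N^{\frac{\alpha_{N}}{\alpha_D}} \sim N^{0.74}$. The authors also find a single equation, combining (1) and (2), that governs the joint dependence on N and D and the degree of overfitting: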

$$ \begin{equation}L(N,D)=\left[\left(\frac{N_c}{N}\right)^{\frac{\alpha_N}{\alpha_D}}+\frac{D_c}{D} \right]^{\alpha_D}\end{equation} $$

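The authors conjecture that this functional form may also parameterize the trained log-likelihood for other generative modeling tasks.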
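
The combined law and the "8x model, about 5x data" rule of thumb can be sketched in a few lines of Python. The default constants below simply reuse the single-variable fits quoted earlier, which is an assumption of this sketch: the paper's joint fit may use slightly different values. In particular, the rounded exponents above give $0.076/0.095 = 0.8$, so the quoted $0.74$ presumably reflects the joint-fit exponents. The function name `loss_n_d` is hypothetical.

```python
def loss_n_d(n: float, d: float,
             alpha_n: float = 0.076, alpha_d: float = 0.095,
             n_c: float = 8.8e13, d_c: float = 5.4e13) -> float:
    """L(N, D) = [ (N_c / N) ** (alpha_N / alpha_D) + D_c / D ] ** alpha_D, in nats."""
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d


# With D ~ N ** 0.74, an 8x larger model calls for this many times more data:
data_factor = 8 ** 0.74
print(f"8x parameters -> about {data_factor:.2f}x tokens")

# With effectively unlimited data the combined law reduces to L(N) = (N_c / N) ** alpha_N.
n = 1e9
print(loss_n_d(n, d=1e30), (8.8e13 / n) ** 0.076)
```

The second check shows the limiting behaviour that makes the formula consistent with equation (1): the $D_c/D$ term vanishes as $D$ grows, leaving only the model-size term.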

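The learning curve can also be expressed as a function of the number of training steps, which makes it possible to determine the optimal number of training steps: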

$$ \begin{equation}L(N,S)=\left(\frac{N_c}{N}\right)^{\alpha_N}+\left(\frac{S_c}{S_{min}(S)}\right)^{\alpha_S}\end{equation} $$

where $S_c \approx 2.1 \times 10^3$, $\alpha_S \approx 0.76$, and $S_{min}(S)$ is the minimum possible number of optimization steps (parameter updates) estimated using Equation

For a fixed compute budget C, with no other constraints, the following relations are also obtained:

$$ \begin{equation}N \propto C^{\alpha^{min}_C /\alpha_N}, B \propto C^{\alpha^{min}_C /\alpha_B}, S \propto C^{\alpha^{min}_C /\alpha_S}, D = B \cdot S\end{equation} $$

where

$$ \begin{equation}\alpha^{min}_C=1/(1/\alpha_S+1/\alpha_B+1/\alpha_N)\end{equation} $$

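The paper reports that this closely matches the empirically optimal results $N \propto C_{min}^{0.73}, B \propto C_{min}^{0.24}, \text{ and } S \propto C_{min}^{0.03}$, which leads to the following conclusions: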

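- As the compute budget C grows, it should be spent mainly on larger models rather than on longer training or more data.
- As models grow larger, they also become more sample-efficient.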
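
The exponent relation for $\alpha^{min}_C$ can be checked numerically. The sketch below plugs in the rounded exponents quoted in this post ($\alpha_N = 0.076$, $\alpha_B = 0.21$, $\alpha_S = 0.76$); the function name `compute_optimal_exponents` is hypothetical. With these inputs the predicted exponents come out at roughly 0.68, 0.25, and 0.07 for $N$, $B$, and $S$, in the same range as the empirical values 0.73, 0.24, and 0.03.

```python
def compute_optimal_exponents(alpha_n: float = 0.076,
                              alpha_b: float = 0.21,
                              alpha_s: float = 0.76) -> dict:
    """Exponents of N, B and S as powers of compute under a fixed compute budget."""
    # alpha_C^min = 1 / (1/alpha_S + 1/alpha_B + 1/alpha_N)
    alpha_c_min = 1.0 / (1.0 / alpha_s + 1.0 / alpha_b + 1.0 / alpha_n)
    return {
        "alpha_C_min": alpha_c_min,
        "N": alpha_c_min / alpha_n,
        "B": alpha_c_min / alpha_b,
        "S": alpha_c_min / alpha_s,
    }


exponents = compute_optimal_exponents()
for key, value in exponents.items():
    print(f"{key}: {value:.3f}")
```

By construction the three exponents sum to 1, which reflects $C \propto N \cdot B \cdot S$: every extra unit of compute is split between a larger model, a larger batch, and more serial steps, with the model taking by far the largest share.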

Methodology

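The models are trained on WebText2, an extended version of the WebText dataset ($2.29\times 10^{10}$ tokens), tokenized with byte-pair encoding with a vocabulary size of $n_{vocab}=50257$. The performance metric (loss) is the cross-entropy loss averaged over a 1024-token context. The models are decoder-only Transformers; LSTMs and other Transformer variants (Universal Transformers) are also trained for comparison.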

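Unless otherwise noted, models are trained with the Adam optimizer for $2.5 \times 10^5$ steps, with a batch size of 512 sequences of 1024 tokens each. Because of memory constraints, the largest models are trained with the Adafactor optimizer instead.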

Unless otherwise noted, the learning rate schedule is a 3000-step warmup followed by a cosine decay to zero.
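
The learning rate schedule described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact implementation: the warmup is taken to be linear, the cosine decay is assumed to span the remaining steps up to $2.5 \times 10^5$, and the peak learning rate is a hypothetical placeholder.

```python
import math


def learning_rate(step: int, peak_lr: float,
                  warmup_steps: int = 3000, total_steps: int = 250_000) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


PEAK_LR = 1e-3  # hypothetical value, for illustration only
for step in (0, 1500, 3000, 126_500, 250_000):
    print(step, f"{learning_rate(step, PEAK_LR):.2e}")
```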

Counting Model Parameters

To count model parameters and compute, the model hyperparameters are defined as follows:

| Symbol | Meaning |
| --- | --- |
| $n_{layer}$ | number of layers |
| $d_{model}$ | dimension of the residual stream |
| $d_{ff}$ | dimension of the intermediate feed-forward layer |
| $d_{attn}$ | dimension of the attention output |
| $n_{heads}$ | number of attention heads per layer |
| $n_{ctx}$ | number of tokens in the input context (1024 unless otherwise noted) |

$N$ denotes the model size, defined here as the number of parameters excluding embeddings:

$$ \begin{aligned} N&\approx 2d_{model}n_{layer}(2d_{attn}+d_{ff}) \\ &= 12n_{layer}d^2_{model}\end{aligned} $$

where the second line assumes the standard setting

$$ d_{attn}=d_{ff}/4=d_{model} $$

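This count leaves out the $n_{vocab}d_{model}$ and $n_{ctx}d_{model}$ parameters of the embedding layers. A forward pass requires roughly the following amount of compute per token, $C_{forward}$: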

$$ C_{forward} \approx 2N+2n_{layer}n_{ctx}d_{model} $$
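
The parameter count and the forward-pass estimate above translate directly into code. The sketch below is illustrative: the function names are hypothetical, and the example configuration (48 layers, $d_{model} = 1600$) is an assumed shape chosen because it lands near the 1.5 billion non-embedding parameters mentioned for the largest models.

```python
from typing import Optional


def non_embedding_params(n_layer: int, d_model: int,
                         d_attn: Optional[int] = None,
                         d_ff: Optional[int] = None) -> int:
    """N ~= 2 * d_model * n_layer * (2 * d_attn + d_ff)."""
    # Standard setting: d_attn = d_model and d_ff = 4 * d_model, giving 12 * n_layer * d_model^2.
    d_attn = d_model if d_attn is None else d_attn
    d_ff = 4 * d_model if d_ff is None else d_ff
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)


def forward_flops_per_token(n_layer: int, d_model: int, n_ctx: int = 1024) -> int:
    """C_forward ~= 2 * N + 2 * n_layer * n_ctx * d_model (standard setting assumed)."""
    n = non_embedding_params(n_layer, d_model)
    return 2 * n + 2 * n_layer * n_ctx * d_model


n_params = non_embedding_params(n_layer=48, d_model=1600)
c_forward = forward_flops_per_token(n_layer=48, d_model=1600)
print(f"N ~= {n_params:,}")
print(f"C_forward ~= {c_forward:,} operations per token")
print(f"context-dependent share: {1 - 2 * n_params / c_forward:.1%}")
```

For this shape the context-dependent term is only a small fraction of the total, which is consistent with approximating compute by the parameter-dependent term alone.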


Experimental Results

Experimental variables:

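- Model size (from 768 to 1.5 billion non-embedding parameters)
- Dataset size (from 22 million to 23 billion tokens)
- Shape (including depth, width, attention heads, and feed-forward dimension)
- Context length (1024 in most runs, with some experiments on shorter contexts)
- Batch size ($2^{19}$ tokens in most runs, varied to measure the critical batch size)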


Conclusions:

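- With the non-embedding model size $N$ fixed, model shape has very little effect on performance: even large changes in shape shift the loss by only a few percent.
- When embedding parameters are counted, performance appears to depend strongly on the number of layers (left panel of the paper's figure); when they are excluded, models with different numbers of layers fall on essentially the same trend (right panel), except for models with fewer than two layers.
- The same trends hold for LSTMs, although LSTMs perform slightly worse than Transformers.
- The power-law relations hold: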

$$ \begin{aligned} L(N) &\approx (N_c/N)^{\alpha_N} \\ L(D) &\approx (D_c/D)^{\alpha_D} \\ L(C_{min})&\approx (C_{c}^{min}/C_{min})^{\alpha_C^{min}} \end{aligned} $$
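
One generic way to estimate an exponent such as $\alpha_N$ from measured (size, loss) pairs is a straight-line fit in log-log space, since $\log L = \alpha_N \log N_c - \alpha_N \log N$. The sketch below is not necessarily the fitting procedure used in the paper. The data points are synthetic, generated from the formula itself, so the fit only demonstrates that the method recovers the constants; the function name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(xs, losses):
    """Least-squares line through (log X, log L); returns (alpha, x_c) for L = (x_c / X) ** alpha."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(loss) for loss in losses]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
             / sum((a - mean_x) ** 2 for a in lx))
    intercept = mean_y - slope * mean_x
    alpha = -slope                      # slope of the line is -alpha
    x_c = math.exp(intercept / alpha)   # intercept is alpha * log(x_c)
    return alpha, x_c


# Synthetic points generated from L(N) = (N_c / N) ** alpha_N with the constants quoted above.
sizes = [1e6, 1e7, 1e8, 1e9]
synthetic_losses = [(8.8e13 / n) ** 0.076 for n in sizes]
alpha_fit, n_c_fit = fit_power_law(sizes, synthetic_losses)
print(f"alpha_N ~= {alpha_fit:.3f}, N_c ~= {n_c_fit:.2e}")
```

With real measurements the points would scatter around the line, and the quality of the fit would indicate how well a single power law describes the data.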

Training Compute-Optimal Large Language Models

References

[1] https://arxiv.org/pdf/2001.08361.pdf

[2] https://arxiv.org/pdf/2203.15556.pdf

[3] https://espadrine.github.io/blog/posts/chinchilla-s-death.html

[4] https://arxiv.org/pdf/2109.10686.pdf

[5] https://self-supervised.cs.jhu.edu/sp2023/files/17.retrieval-augmentation.pdf