ML Learning Notes — Training My Own Language Model From Scratch

Workflow Summary

Today I sorted out three ways to "train my own model":

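1. Fine-tuning (adapting an existing model)

Fine-tune an existing model (such as Qwen) with LoRA, training only about 0.1% of the parameters.

The result is an adapter file (a few MB) that has to be loaded on top of the original model.
Best for: adapting a model to a specific style or task.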
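To make the "tiny fraction of the parameters" point concrete, here is a rough sketch of the arithmetic for a single linear layer. The function name and the layer sizes are hypothetical, picked only for illustration; the notes do not say which rank or which matrices were adapted.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for one linear layer: full fine-tuning vs. LoRA."""
    full = d_in * d_out            # every entry of the frozen weight matrix W
    lora = rank * (d_in + d_out)   # A is (rank x d_in), B is (d_out x rank)
    return full, lora


# Hypothetical sizes, chosen only for illustration
full, lora = lora_param_counts(d_in=1024, d_out=1024, rank=8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

The per-matrix ratio here is a little over 1%. The model-wide fraction is lower still, because adapters are usually attached to only some of the weight matrices while the embeddings and everything else stay frozen.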

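2. Knowledge Distillation

Use a large model (the teacher) to train a small model that I define myself (the student).

The student learns to imitate the teacher's output distribution (KL divergence loss).
The result is a standalone model that is entirely my own.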
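A minimal sketch of what such a KL divergence loss might look like in PyTorch. The `temperature` parameter and the `T^2` scaling follow the common distillation recipe and are assumptions, not something these notes confirm. It also assumes the student and teacher share one vocabulary, so their logits line up position by position.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Both inputs have shape (n_positions, vocab_size).
    """
    # F.kl_div expects log-probabilities as input and probabilities as target
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" sums over the vocabulary and averages over positions
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures
    return kl * temperature ** 2
```

The teacher logits would normally be computed under `torch.no_grad()` so that only the student receives gradients.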

3. Fine-tuning my own model

After distillation, fine-tune the student on task-specific data with a cross-entropy loss.
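For a language model, the cross-entropy loss is usually computed on next-token prediction: the logits at position t are scored against the token at position t+1. A small sketch under that assumption (the helper name `next_token_loss` is hypothetical):

```python
import torch
import torch.nn.functional as F


def next_token_loss(logits, input_ids):
    """Cross-entropy between each position's logits and the following token.

    logits: (batch, seq_len, vocab_size), input_ids: (batch, seq_len)
    """
    shift_logits = logits[:, :-1, :]   # predictions for positions 0 .. T-2
    shift_labels = input_ids[:, 1:]    # targets are the tokens 1 .. T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```

With completely uninformative (all-equal) logits, this loss equals ln(vocab_size), which is a handy sanity check for the starting value of a training run.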

Custom Model Architecture

import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, vocab_size=151936, d_model=256, n_heads=8, n_layers=4, max_len=512):
        super().__init__()  # layer definitions are not shown in these notes

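Layer-by-Layer Structure

Input token ids
↓
Embedding (vocab_size → d_model)         # token → vector
+ Position Embedding (max_len → d_model) # add position information
↓
TransformerDecoderLayer × n_layers       # each layer contains:
├── Self-Attention (n_heads)             # relates each token to its context
└── FFN (d_model → d_model×4 → d_model)  # nonlinear transformation
↓
Linear (d_model → vocab_size)            # output logits over the vocabulary (softmax turns them into token probabilities)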
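The diagram above could be written out roughly as follows. This is a sketch, not the actual class from these notes: the names `MyModelSketch`, `tok_emb`, `pos_emb`, `blocks`, and `lm_head` are made up for illustration. It also swaps in `nn.TransformerEncoderLayer` with a causal mask, a common way to build a decoder-only stack in PyTorch, because `nn.TransformerDecoderLayer` additionally expects an encoder `memory` input.

```python
import torch
import torch.nn as nn


class MyModelSketch(nn.Module):
    """Decoder-only language model following the diagram (illustrative only)."""

    def __init__(self, vocab_size=151936, d_model=256, n_heads=8, n_layers=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token -> vector
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=4 * d_model,                   # FFN: d_model -> 4x -> d_model
            batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)      # logits over the vocabulary

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        # True above the diagonal = "may not attend", so each token sees only the past
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=input_ids.device),
            diagonal=1,
        )
        x = self.blocks(x, mask=causal_mask)
        return self.lm_head(x)                             # (batch, seq_len, vocab_size)
```

A quick way to check the causal mask is to change the last input token and confirm that the logits at all earlier positions stay the same.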

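Tunable Parameters

Parameter	Current value	Effect
`d_model`	256	Width of each layer; a larger value gives the model more capacity
`n_layers`	4	Depth; more layers can model more complex dependencies
`n_heads`	8	Number of attention heads; must divide d_model evenly
`dim_feedforward`	1024	FFN hidden size, usually d_model×4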
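One way to keep these settings together and catch an invalid `n_heads` early is a small config object. `ModelConfig` and `head_dim` are hypothetical names introduced here for illustration; they are not from the original code.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    """Hypothetical container for the hyperparameters in the table above."""
    vocab_size: int = 151936
    d_model: int = 256
    n_layers: int = 4
    n_heads: int = 8
    dim_feedforward: int = 1024
    max_len: int = 512

    def __post_init__(self):
        # Each head gets an equal slice of d_model, so the split must be exact
        if self.d_model % self.n_heads != 0:
            raise ValueError(
                f"d_model={self.d_model} is not divisible by n_heads={self.n_heads}"
            )

    @property
    def head_dim(self):
        return self.d_model // self.n_heads


cfg = ModelConfig()
print(cfg.head_dim)  # 256 // 8
```

With the current values each attention head works on a 32-dimensional slice of the 256-dimensional hidden state.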

Reference Comparison

Model	d_model	n_layers	n_heads	Parameters
My model	256	4	8	82M
GPT-2 small	768	12	12	117M
Qwen 0.5B	1024	24	16	494M
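A model with only 4 narrow layers landing at 82M may look surprising. The back-of-the-envelope count below suggests most of it sits in the embedding and output layers because of the large vocabulary. It assumes untied input and output embeddings, an output layer with a bias, and PyTorch-style layers that carry both a self-attention and a cross-attention block, as `nn.TransformerDecoderLayer` does. None of this is confirmed by the notes, and `count_params` is a hypothetical helper.

```python
def count_params(vocab_size=151936, d_model=256, n_layers=4,
                 dim_ff=1024, max_len=512, cross_attention=True):
    """Rough parameter count for the architecture described above."""
    tok_emb = vocab_size * d_model
    pos_emb = max_len * d_model
    lm_head = d_model * vocab_size + vocab_size      # weight + bias, not tied

    attn = 4 * d_model * d_model + 4 * d_model       # Q, K, V, output projections + biases
    ffn = (d_model * dim_ff + dim_ff) + (dim_ff * d_model + d_model)
    n_attn = 2 if cross_attention else 1             # assumption: decoder layer has both
    norms = (n_attn + 1) * 2 * d_model               # one LayerNorm per sublayer
    per_layer = n_attn * attn + ffn + norms

    return {
        "embeddings": tok_emb + pos_emb,
        "lm_head": lm_head,
        "layers": n_layers * per_layer,
        "total": tok_emb + pos_emb + lm_head + n_layers * per_layer,
    }


for name, value in count_params().items():
    print(f"{name:>10}: {value:,}")
```

Under these assumptions the transformer layers account for only about 4M of the total. The two vocabulary-sized matrices account for roughly 78M.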

Model Files (on EC2)

~/my-student-model.pt: the distilled model
~/my-finetuned-student.pt: the final model after distillation + fine-tuning
~/my-finetuned-model/: LoRA adapter (based on Qwen)

Results of the First Run

What is the capital of France?
the the the the the the the the the the the the the the the
the the the the the the the the the the the the the the the
— Completed in 7.836s

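It runs, but the output is just "the" repeated.

This is expected: the model is too small, the data too limited, and training too short for it to have learned to generate meaningful text. This is repetition collapse: the model has settled into a poor local optimum where it only predicts the most frequent token.

Next things to try:

Add temperature sampling (temperature > 1.0 flattens the distribution) to break the repetition
Add more training data / more epochs
Or simply scale up the model (n_layers 8+, d_model 512+)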
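As a sketch of the first option, temperature sampling divides the logits by a temperature before the softmax and then draws a token at random instead of always taking the argmax. The function name and the toy logits are hypothetical. Sampling only changes decoding, so it cannot make up for a model that has not learned much yet.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Draw one token index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                                 # subtract max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    # random.choices normalizes the weights internally
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Toy logits where index 0 (think "the") dominates
toy_logits = [5.0, 1.0, 0.5, 0.0]
rng = random.Random(0)
print([sample_with_temperature(toy_logits, temperature=2.0, rng=rng) for _ in range(10)])
```

Temperatures well below 1.0 approach greedy decoding, while higher values spread probability onto less frequent tokens.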
