Llama 2 on Windows, step by step: environment setup, training, deployment, and fixes for common problems

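For work I needed to start looking into LLMs and build on them for our own business use.

A short rant first. From first reading about LLMs to getting training running took me two weeks. I went through at least 20 video tutorials and 100 blog tutorials, and not one of them could be followed from start to finish and run without errors. I really urge everyone to raise the quality of their tutorials: verify them yourself before publishing, and update them or take them down promptly when they are wrong.

Now on to the actual setup.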

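1. Environment

1.1 Windows 11

1.2 NVIDIA 4080 or 4090 GPU (there is a pitfall here, more on that later)

2. Install CUDA and cuDNN to match your GPU

See my other blog post for details (almost none of the LLM tutorials I read include this step). It is mandatory if you want to train on an NVIDIA GPU. Luckily I had done image recognition work before and knew that PyTorch needs CUDA installed.

Configuring TensorFlow on Windows with an NVIDIA GPU (CSDN blog)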

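3. Clone the llama-recipes project

3.1 llama-recipes is the repository for fine-tuning Llama 2 and building on top of it. I had previously followed other tutorials that use pre-packaged repositories such as mlc-chat. If you want to do your own development, I recommend a low-level, first-party library like this one; a third-party project adds an extra layer on top, which makes it less flexible to change.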

git clone https://github.com/facebookresearch/llama-recipes .

GitHub - facebookresearch/llama-recipes: Examples and recipes for Llama 2 model

3.2 Create a virtual environment (standard practice, nothing more to say)

python -m venv [env folder]
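Before installing anything, it can help to confirm that the interpreter you are about to use really is the one inside the virtual environment. A minimal sketch using only the standard library; the helper name `in_virtualenv` is my own and not part of any tool:

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside venv:", in_virtualenv())
```

If this prints `False`, the environment has not been activated in the current terminal yet.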

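3.3 Install the dependencies (pay attention, there is a big pitfall here)

Online tutorials all just run pip install -r requirements.txt without thinking. It feels great, but it basically never works. Let's first look at what requirements.txt contains.

There are two pitfalls here.

First, torch comes in separate CPU and GPU builds, and the build has to be compatible with the other libraries and with your own GPU. A blind install will most likely give you the CPU build, and you will find your model running on the CPU instead of the GPU. You can use the following code to confirm whether the GPU is usable: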

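import torch
print(torch.__version__)  # installed PyTorch version
x = torch.rand(5, 3)  # small random tensor to confirm torch works at all
print(x)
print(torch.cuda.is_available())  # True means PyTorch can see a CUDA GPU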
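A slightly more detailed version of the same check is sketched below. It also reports which CUDA version the installed wheel was built against, which is the quickest way to tell a CPU-only build from a GPU build. The function name `describe_torch` is my own; the `torch` attributes it reads are standard PyTorch ones.

```python
def describe_torch() -> dict:
    """Summarize the installed PyTorch build without crashing if it is missing."""
    try:
        import torch
    except ImportError:
        return {"installed": False}

    info = {
        "installed": True,
        "version": torch.__version__,
        # None on CPU-only builds, a version string on CUDA builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_torch().items():
    print(f"{key}: {value}")
```

If `built_with_cuda` is `None`, the CPU build was installed and no driver or CUDA change will make `cuda_available` become `True`; the wheel itself has to be replaced.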

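Second, the bitsandbytes library. I lost a whole day to it. Read its documentation: this library does not support Windows!

If you install it directly, you will most likely hit this error (I forgot to take a screenshot):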

runtimeerror:
cuda setup failed despite GPU being available, Please run the following command to get more information:
...

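It roughly says that CUDA cannot be found. There are all kinds of fixes online, such as installing bitsandbytes-windows or compiling bitsandbytes yourself. I don't know whether those worked at the time they were written, but as of now I treat all of them as not working.

Now for the approach that does work:

3.3.1 Install PyTorch; see my other blog post for details

YOLO installation (NVIDIA GPU) (CSDN blog)

3.3.2 Install bitsandbytes

This is a Windows-compatible build that I found by searching; it can be installed directly with pip

bitsandbytesforwindows resource (CSDN library)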
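Whichever build you end up with, it is worth checking that it imports cleanly before going any further, because a broken bitsandbytes install may fail at import time with the CUDA setup error shown earlier. A small sketch; `try_import` is a hypothetical helper of mine, not something bitsandbytes provides:

```python
import importlib


def try_import(module_name: str):
    """Import a module and report (ok, detail) instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # a failed CUDA setup may surface as RuntimeError
        return False, f"{type(exc).__name__}: {exc}"
    return True, getattr(module, "__version__", "unknown version")


ok, detail = try_import("bitsandbytes")
print("bitsandbytes import ok:", ok, "|", detail)
```

Run it inside the virtual environment; a `False` here means training with quantization will not start either.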

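3.3.3 Edit requirements.txt and delete the torch and bitsandbytes entries, then run pip install -r requirements.txt to install the remaining dependencies. At this point the environment is fully installed.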
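If you would rather not edit the file by hand, the deletion in 3.3.3 can be scripted. This is only a sketch: the function names and the output file name `requirements-windows.txt` are my own choices, and the name matching is deliberately simple.

```python
import re
from pathlib import Path


def strip_packages(lines, names):
    """Return the requirement lines whose package name is not in `names`."""
    blocked = {n.lower() for n in names}
    kept = []
    for line in lines:
        stripped = line.strip()
        # The package name is everything before a version specifier,
        # extras bracket, environment marker, or whitespace.
        name = re.split(r"[<>=!~\[;@\s]", stripped, maxsplit=1)[0].lower()
        if stripped and not stripped.startswith("#") and name in blocked:
            continue
        kept.append(line)
    return kept


def filter_requirements(src, dst, names=("torch", "bitsandbytes")):
    """Write a copy of `src` to `dst` without the packages in `names`."""
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    Path(dst).write_text("\n".join(strip_packages(lines, names)) + "\n", encoding="utf-8")
```

Calling `filter_requirements("requirements.txt", "requirements-windows.txt")` leaves the original file untouched, and you then install from the filtered copy. Note that only exact names are removed, so an entry such as `torchvision` would be kept.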

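4. Download the Llama 2 model

I won't go into how to download it. Basically you need to register accounts with Meta and Hugging Face, after which you can download the model.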


Training

1. Prepare the dataset

1.1 You can download a dataset or prepare your own, for example the GuanacoDataset dataset

https:///datasets/JosephusCheung/GuanacoDataset

1.2 We only need one of its files, for example guanaco_non_chat-utf-8.json. Copy the file into llama_recipes/datasets and rename it to alpaca_data.json; the program will find it on its own.
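The copy-and-rename in 1.2 can also be done from Python, which makes it easy to repeat when you switch datasets. A minimal sketch; `install_dataset` is a helper name I made up, and the paths you pass in depend on where you cloned the repository.

```python
import shutil
from pathlib import Path


def install_dataset(source_file, datasets_dir, target_name="alpaca_data.json"):
    """Copy `source_file` into `datasets_dir` under the name the loader expects."""
    source = Path(source_file)
    if not source.is_file():
        raise FileNotFoundError(f"dataset file not found: {source}")
    target_dir = Path(datasets_dir)
    if not target_dir.is_dir():
        # Refuse to create the folder: a missing folder usually means a wrong path.
        raise NotADirectoryError(f"datasets folder not found: {target_dir}")
    target = target_dir / target_name
    shutil.copyfile(source, target)
    return target
```

For example, `install_dataset("guanaco_non_chat-utf-8.json", "llama_recipes/datasets")` when run from the folder that contains both.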

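2. Start training

2.1 Training

A quick search online turns up a long string of baffling commands, followed by "now start training". Again, they most likely won't run (I tried them all, and I give up on them). Here is the hand-holding version.

Pay attention! The key to training is the file llama_recipes\configs\training.py. Once you have hold of this root, there is no need to write those long, weird commands at all. Let's look at what this file contains and how to modify it: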

# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
from dataclasses import dataclass
@dataclass
class train_config:
    model_name: str="C:/install11/llama-recipes/Llama-2-7b-hf" # default is "PATH/to/LLAMA/7B"; change it to your own model folder, e.g. C:\install11\llama-recipes\Llama-2-7b-hf
    enable_fsdp: bool=False
    low_cpu_fsdp: bool=False
    run_validation: bool=True
    batch_size_training: int=1 # set according to your own setup
    batching_strategy: str="packing" 
    context_length: int=4096
    gradient_accumulation_steps: int=1
    gradient_clipping: bool = False
    gradient_clipping_threshold: float = 1.0
    num_epochs: int=1 # set according to your own setup
    num_workers_dataloader: int=1
    lr: float=1e-4
    weight_decay: float=0.0
    gamma: float= 0.85
    seed: int=42
    use_fp16: bool=True
    mixed_precision: bool=True
    val_batch_size: int=1
    dataset = "alpaca_dataset" # default is samsum_dataset; change it to alpaca_dataset
    peft_method: str = "lora" # LoRA is required
    use_peft: bool=True # default is False; change it to True
    output_dir: str = "C:/install11/llama-recipes-main/lora" # default is "PATH/to/save/PEFT/model"; change it to your own folder, e.g. C:/install11/llama-recipes-main/lora
    freeze_layers: bool = False
    num_freeze_layers: int = 1
    quantization: bool = True # default is False; change it to True
    one_gpu: bool = False
    save_model: bool = True
    dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" 
    dist_checkpoint_folder: str="fine-tuned" 
    save_optimizer: bool=False 
    use_fast_kernels: bool = False 
    save_metrics: bool = False

Save the file after editing, then run the following command and training starts. There is no need for those endlessly long commands.

python.exe -m llama_recipes.finetuning
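Because every path now lives in training.py, a typo there only shows up after the run has started. A small pre-flight check can catch the obvious ones first. This is a sketch of my own, not part of llama-recipes; it assumes the model folder is in Hugging Face format, where a config.json sits next to the weights.

```python
from pathlib import Path


def preflight(model_dir, dataset_file):
    """Return a list of readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    model = Path(model_dir)
    if not model.is_dir():
        problems.append(f"model folder does not exist: {model}")
    elif not (model / "config.json").is_file():
        # Hugging Face format checkpoints ship a config.json with the weights.
        problems.append(f"no config.json in {model}; is this the -hf version?")
    if not Path(dataset_file).is_file():
        problems.append(f"dataset file does not exist: {dataset_file}")
    return problems
```

Call it with the same values you put into `model_name` and the dataset location from step 1.2, and only launch training when the returned list is empty.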

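2.2 Problems during training

Problem 1: You will most likely hit OOM. The hardware bottleneck is mainly VRAM; on my 4090, VRAM usage stayed high for the whole run. So if you hit OOM, switch to a different machine. There are some optimization tricks online, but in my tests they made little difference: if the hardware isn't up to it, it isn't up to it.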
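To get a feel for why VRAM is the bottleneck, here is a back-of-the-envelope calculation of the memory needed just to hold the model weights. It is a rough sketch under two assumptions of mine: a nominal 7 billion parameters for the 7B model, and weights only. Activations, gradients, optimizer state, and the LoRA adapters all come on top and are not counted here.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# 7e9 is a nominal parameter count used for illustration only.
for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8)]:
    print(f"{label}: {weight_memory_gib(7e9, bits):.1f} GiB")
```

At 16 bits per parameter the weights alone come to roughly 13 GiB, which already takes up a large share of a 24 GB card before any training overhead is added.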

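Problem 2: If you use the same dataset as I did, you will hit an encoding problem when it is read. Modify the source at the line that raises the error so that the file is opened with the encoding 'utf-8'.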
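The fix amounts to passing `encoding="utf-8"` to the `open()` call that reads the dataset. The exact line in the library depends on your version, so the snippet below is only a standalone illustration of the pattern, with `load_json_utf8` as a made-up helper name:

```python
import json


def load_json_utf8(path):
    """Read a JSON file as UTF-8 regardless of the Windows locale encoding."""
    # Without encoding=..., open() falls back to the locale's preferred
    # encoding on Windows, which is often not UTF-8 and can raise
    # UnicodeDecodeError on files that contain non-ASCII text.
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The same one-argument change applies wherever the traceback points.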