1. Project Introduction
The 🤗 Transformers library is published by the organization known as Hugging Face ("抱抱脸"). The project has the following highlights:
- First of all, it is an NLP model hub: the remote side stores the computation-graph source code and pretrained weights of many models, which are downloaded on demand over the internet.
- A model alone is not enough to run a demo, so the library introduces the pipeline design, which chains {tokenizer preprocessing, model inference, argmax-id-to-text postprocessing} into one workflow and exposes a unified, convenient API for applications, tuning, and learning.
- Besides prediction, it also provides classes such as Trainer, so users can train & evaluate models in the role of a developer, with support for single-machine multi-GPU training, TensorBoard logging, and so on.
The Transformer is a milestone in ML and NLP, and hundreds of model variants build on the idea, so this library can also be seen as an NLP community: it gathers the popular model resources, and since everyone studies the same things, exchange becomes easier.
2. Installation on Windows
You can install it via pip (recommended) or conda. Common issues include:
- h5py conflict
After installing via conda, the h5py module raised errors because the pip-installed h5py conflicted with the conda-installed one; uninstalling the former resolved it.
- Slow model downloads
When a model is loaded for the first time, it is downloaded automatically and cached under C:\Users\yichu\.cache\huggingface\transformers\, but the file names are long base64-style hashes and not human-readable. Besides being slow, a failed download is also hard to resume.
Instead, you can download the corresponding model from the model hub [3] to local disk and pass the full local path as model in the code.
import transformers

# The directory contains the files [config.json, tf_model.h5, tokenizer_config.json, vocab.txt]
model_path = r'D:\model_repository\transformer\distilbert-base-uncased-finetuned-sst-2-english'
model = transformers.pipeline('sentiment-analysis', model=model_path)
- PyTorch DataLoader multi-worker errors
Manually change num_workers=8 to num_workers=0 in transformers.pipelines.base.Pipeline.__call__(self, inputs, *args, num_workers=8, **kwargs) (see the alternative after this list).
- Enabling logging
The library has a centralized logging module; all logging is off by default, and the verbosity can be set with one call: transformers.logging.set_verbosity(transformers.logging.INFO).
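Regarding the DataLoader item above: judging from the __call__ signature shown there, it may be possible to avoid editing the library source and instead pass num_workers=0 at call time. This is an untested alternative sketch, using the pipeline object model created in the model-download item:

result = model(["some input text"], num_workers=0)  # keyword forwarded to __call__, overriding the default of 8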
3. Tokenizer
See the official docs [4]. The tokenizer handles the data processing from raw input to model input. Viewed as a pipeline, the steps are:
- tokenization (splitting text into tokens)
- vocabulary lookup (token to id)
- truncation and padding
- attention-mask generation
- adding special tokens
Batch encoding is done with tokenizers.Tokenizer.encode_batch(self, input), where input is a batch of raw text sequences; a sketch with a pretrained tokenizer follows below.
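A minimal sketch of these steps, assuming the SST-2 checkpoint downloaded earlier (the exact ids depend on its vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)  # local path from Section 2
# One call covers tokenization, vocab lookup, truncation/padding, masking and special tokens.
batch = tokenizer(["I love this movie", "terrible"],
                  padding=True, truncation=True, max_length=16, return_tensors="tf")
print(batch["input_ids"])       # int ids, starting with [CLS] and ending with [SEP]
print(batch["attention_mask"])  # 1 for real tokens, 0 for padding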
Training your own Tokenizer
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import CharDelimiterSplit
from tokenizers.processors import BertProcessing
from tokenizers.trainers import WordLevelTrainer

def train_tokenizer():
    tokenizer = Tokenizer(model=WordLevel(unk_token="[UNK]"))
    special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[MASK]", "[PAD]"]
    trainer = WordLevelTrainer(special_tokens=special_tokens, min_frequency=5)
    tokenizer.pre_tokenizer = CharDelimiterSplit('|')   # tokens are '|'-separated
    tokenizer.enable_truncation(max_length=100)
    tokenizer.enable_padding(length=100)
    # text_per_line_generator() yields one raw text per line, defined elsewhere
    tokenizer.train_from_iterator(text_per_line_generator(), trainer)
    # or train directly from files: tokenizer.train(paths, trainer)
    # log: INFO:__main__:Train tokenizer done, vocab size is 4196
    tokenizer.post_processor = BertProcessing(
        cls=("[CLS]", tokenizer.token_to_id("[CLS]")),
        sep=("[SEP]", tokenizer.token_to_id("[SEP]")),
    )
    tokenizer.save(path)   # path to a single JSON file, defined elsewhere
Tokenizer persistence and loading
To save the tokenizer to a single file that contains all of its configuration and vocabulary, just use the save() method; you can then reload the tokenizer from that file with the Tokenizer.from_file() class method, as sketched below.
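A minimal sketch (the file name is a placeholder; the 'hello|world' input assumes the '|'-delimited format used when training above):

from tokenizers import Tokenizer

tokenizer.save("my_tokenizer.json")                   # one JSON file: config + vocab + post-processor
tokenizer = Tokenizer.from_file("my_tokenizer.json")

enc = tokenizer.encode("hello|world")
print(enc.tokens)                                     # includes the [CLS]/[SEP] added by BertProcessing, plus padding
print(enc.ids, enc.attention_mask)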
4. NLP Model
We walk through the TFDistilBertForSequenceClassification class.
Main class dependencies
class TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel, TFSequenceClassificationLoss):
    def __init__(self, config, *inputs, **kwargs):
        self.distilbert = TFDistilBertMainLayer(config, name="distilbert")
        self.pre_classifier = tf.keras.layers.Dense(config.dim, ...)
        self.classifier = tf.keras.layers.Dense(config.num_labels, ...)

    def call(self, ...):
        hidden_state = distilbert_output[0]                 # (bs, seq_len, dim)
        pooled_output = hidden_state[:, 0]                  # (bs, dim), the [CLS] position
        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)
        pooled_output = self.dropout(pooled_output, training=inputs["training"])  # (bs, dim)
        logits = self.classifier(pooled_output)             # (bs, num_labels)
# Parent class of the class above; its main contribution is the tf.function decorator,
# which fixes the model's input signature, used for saved_model export etc.
class TFDistilBertPreTrainedModel(TFPreTrainedModel):
    @tf.function(
        input_signature=[
            {
                "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
                "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
            }
        ]
    )
    def serving(self, inputs):
        output = self.call(inputs)
        return self.serving_output(output)
class TFSequenceClassificationLoss:
    def compute_loss(self, labels, logits):
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
        return loss_fn(labels, logits)
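For intuition, a tiny standalone check of what this loss does (illustrative numbers): with from_logits=True and Reduction.NONE it returns one cross-entropy value per example rather than a single averaged scalar.

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
labels = tf.constant([1, 0])                     # integer class ids, shape (bs,)
logits = tf.constant([[0.2, 2.3], [1.5, -0.7]])  # raw scores, shape (bs, num_labels)
print(loss_fn(labels, logits))                   # shape (2,): one loss value per example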
class TFDistilBertMainLayer(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        # contains word_embeddings and position_embeddings
        self.embeddings = TFEmbeddings(config, name="embeddings")     # Embeddings
        # contains a stack of TFTransformerBlock layers
        self.transformer = TFTransformer(config, name="transformer")  # Encoder
class TFTransformer(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        self.layer = [TFTransformerBlock(config, name=f"layer_._{i}") for i in range(config.n_layers)]

    # returns three things:
    #   hidden_state: output of the final layer, tf.Tensor of shape (bs, seq_length, dim)
    #   all_hidden_states: the hidden_state of every layer
    #   all_attentions: the attention weights of every layer
    def call(self, x, attn_mask, head_mask, output_attentions, output_hidden_states, return_dict, training=False):
        pass
class TFTransformerBlock(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        self.attention = TFMultiHeadSelfAttention(config, name="attention")
        self.sa_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name="sa_layer_norm")
        self.ffn = TFFFN(config, name="ffn")
        self.output_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name="output_layer_norm")
Model input / output
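A minimal sketch of the model's IO boundary, assuming the local model directory from Section 2 (shapes follow the comments above):

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_path)  # local path from Section 2
model = TFAutoModelForSequenceClassification.from_pretrained(model_path)

inputs = tokenizer(["a touching and funny film"], return_tensors="tf")
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(outputs.logits.shape)  # (bs, num_labels), e.g. (1, 2) for SST-2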
5. Pipeline
As the Model section above shows, the model has its own IO boundary. In real use, the input is text and the output is a label, hence the pipeline, which also abstracts the preprocessing and postprocessing and chains everything together, so it works out of the box.
class transformers.pipelines.base.Pipeline():
    def __init__(
        self,
        model: Union["PreTrainedModel", "TFPreTrainedModel"],
        tokenizer: Optional[PreTrainedTokenizer] = None,
        ...):
        pass

    def __call__(self, inputs, *args, num_workers=8, **kwargs):
        return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

    def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
        model_inputs = self.preprocess(inputs, **preprocess_params)
        model_outputs = self.forward(model_inputs, **forward_params)
        outputs = self.postprocess(model_outputs, **postprocess_params)
        return outputs
The Pipeline above is the abstract base class; a concrete subclass that pairs with TFDistilBertForSequenceClassification is TextClassificationPipeline, shown below.
class TextClassificationPipeline(Pipeline):
    def preprocess(self, inputs, **tokenizer_kwargs) -> Dict[str, GenericTensor]:
        return_tensors = self.framework
        return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)

    def _forward(self, model_inputs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs, function_to_apply=None, return_all_scores=False):
        outputs = model_outputs["logits"][0]
        outputs = outputs.numpy()
        if self.model.config.problem_type == "single_label_classification" or self.model.config.num_labels > 1:
            scores = softmax(outputs)
        return {"label": self.model.config.id2label[scores.argmax().item()], "score": scores.max().item()}
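Putting it together, calling the pipeline built in Section 2 runs exactly these three steps; the output below is illustrative (the exact score and wrapping depend on the library version):

clf = transformers.pipeline('sentiment-analysis', model=model_path)
print(clf("I love this movie"))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]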
6. Model export
Exporting TFDistilBertForSequenceClassification directly as a saved_model yields the signature:
structured_input_signature ((), {'input_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_ids'), 'attention_mask': TensorSpec(shape=(None, None), dtype=tf.int32, name='attention_mask')})
structured_outputs {'logits': TensorSpec(shape=(None, 2), dtype=tf.float32, name='logits')}
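A minimal sketch of such an export with plain tf.saved_model, reusing the local model directory from Section 2 and a placeholder export path; the printed signature should match the one above.

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(model_path)  # local path from Section 2
export_dir = r'D:\model_repository\export\sst2_savedmodel'                # placeholder path
tf.saved_model.save(model, export_dir, signatures=model.serving)          # register the serving function shown in Section 4

loaded = tf.saved_model.load(export_dir)
infer = loaded.signatures["serving_default"]
print(infer.structured_input_signature)
print(infer.structured_outputs)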
References
- [1] GitHub: https://github.com/huggingface/transformers
- [2] Official docs: https://huggingface.co/docs/transformers
- [3] Model hub: https://huggingface.co/models
- [4] Tokenizers official docs: https://huggingface.co/docs/tokenizers