
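Preface

This post covers an ACL 2022 paper on a distillation model for cross-lingual NER. The pipeline has two main stages: 1) training the Teacher Model; 2) distilling the Teacher Model into a Student Model. The distillation resembles conventional soft-label distillation. The Teacher Model is trained in a multi-task fashion: one task is NER and the other computes the similarity between a pair of sentences. The overall approach still treats NER as sequence labeling, and it is a nice idea.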

Paper title:

An Unsupervised Multiple-Task and Multiple-Teacher Model for Cross-lingual Named Entity Recognition

Paper link:

https://aclanthology.org/2022.acl-long.14.pdf

Model Architecture

2.1 Teacher Model

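▲ Figure 1. Teacher Model training architecture

As the figure above shows, the Teacher Model is trained on two kinds of labeled data: conventional single-sentence sequence labeling data, and sentence-pair sequence labeling data. Features are extracted by three independent encoders. The first task is the usual NER task: the encoder output passes through a linear layer that maps it to a feature matrix with one column per label, and that matrix is normalized with softmax (as I understand it, this is the BERT+Softmax model commonly used for NER). The loss between the normalized matrix and the input labels is CrossEntropyLoss. Note that the authors use Multilingual BERT (mBERT) as the encoder.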

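First, mBERT extracts features from the input text sequence, producing one hidden vector per token.

The hidden vectors are then passed through a linear transformation and normalized with softmax, which gives each token a probability distribution over the labels.

That is the first task of the Teacher Model: run NER directly on the labeled sequence, with cross-entropy as the loss function.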
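The three steps above can be written compactly. The notation here is assumed for illustration and is not taken from the paper: $x$ is the input token sequence of length $N$, $h_i$ the hidden vector of token $i$, $W$ and $b$ the parameters of the linear layer, $C$ the number of labels, and $y_{i,c}$ the one-hot gold label.

```latex
\begin{aligned}
h &= \mathrm{mBERT}(x), \\
p(x_i) &= \mathrm{softmax}(W h_i + b), \\
\mathcal{L}_{\mathrm{NER}} &= -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \, \log p_c(x_i).
\end{aligned}
```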

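The other task takes sentence-pair sequence labeling data as input. Two independent encoders encode the two sentences and produce their last_hidden_state. The cosine similarity (cosine_similar) of the two outputs is then computed and passed through a sigmoid activation, which gives a similarity vector for the two sequences.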
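A minimal sketch of this token-level similarity in plain Python, assuming the two sequences are compared position by position (as the code at the end of this post does). The function names and the toy hidden vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def token_similarity(hidden_a, hidden_b):
    """Sigmoid of the cosine similarity for each aligned token position."""
    return [sigmoid(cosine(u, v)) for u, v in zip(hidden_a, hidden_b)]


# Toy hidden states: 3 tokens per sentence, 2 dimensions per token (made-up numbers).
hidden_a = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
hidden_b = [[1.0, 0.0], [0.5, -0.5], [0.0, -1.0]]
sims = token_similarity(hidden_a, hidden_b)  # one similarity value per token
```

Because cosine similarity lies in [-1, 1], the sigmoid of it stays between roughly 0.27 and 0.73: identical directions give about 0.73, orthogonal vectors give 0.5, and opposite directions give about 0.27.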

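This resembles a sentence-similarity (sentence_similar) operation, except that the similarity is computed for every token in the sequence. Comparing the label sequences of the sentence pair yields a contrast label per token: 1 when the two tokens carry the same label, and 0 otherwise. The similarity loss is then defined on the predicted similarity and this contrast label.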
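One way to write the similarity task down, with notation assumed here for illustration: $h_i$ and $h'_i$ are the hidden vectors of the $i$-th tokens of the two sentences, $y_i$ and $y'_i$ their labels, $\sigma$ the sigmoid function, and $N$ the sequence length. The comparison is taken position by position, following the code at the end of this post.

```latex
\begin{aligned}
s_i &= \sigma\big(\cos(h_i, h'_i)\big), \\
t_i &= \begin{cases} 1, & y_i = y'_i \\ 0, & \text{otherwise} \end{cases} \\
\mathcal{L}_{\mathrm{sim}} &= -\frac{1}{N} \sum_{i=1}^{N} \big[ t_i \log s_i + (1 - t_i) \log (1 - s_i) \big].
\end{aligned}
```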

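The loss here is BinaryCrossEntropy. Its target is the contrast label sequence obtained by comparing the label sequences of the two sentences position by position, which makes the sim_loss computation easy to follow.

Tips: Equation (6) uses binary cross-entropy (BCE) as the loss. My reading is that it performs a binary classification on the similarity of each token in the input pair, with the goal of pulling tokens that share a label closer together, that is, giving them a higher similarity. BCE measures how good a binary classifier's predictions are. Put simply, when the label y is 1, the loss should approach 0 as the prediction p(y) approaches 1; conversely, as p(y) approaches 0 the loss should become very large, which matches the behavior of the log function.

That is the overall design of the Teacher Model: two tasks improve its accuracy and generalization. For entity recognition, the sentence-pair similarity idea pulls tokens with the same label closer together, and combining it with the conventional NER model (mBERT+softmax) gives the model a more directed learning signal than a single label sequence alone. I think this is a good idea.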
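To make the contrast labels and the BCE-based sim_loss described above concrete, here is a small plain-Python sketch. The label sequences, the predicted similarities, and the helper names are made up for illustration.

```python
import math


def contrast_labels(labels_a, labels_b):
    """1.0 where the two tokens carry the same label, else 0.0."""
    return [1.0 if a == b else 0.0 for a, b in zip(labels_a, labels_b)]


def binary_cross_entropy(preds, targets, eps=1e-12):
    """Mean BCE between predicted similarities in (0, 1) and 0/1 targets."""
    total = 0.0
    for p, t in zip(preds, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(preds)


labels_a = ["B-PER", "I-PER", "O", "O"]
labels_b = ["B-PER", "O", "O", "B-LOC"]
t = contrast_labels(labels_a, labels_b)

# Similarities that agree with t give a small loss; inverted ones give a large loss.
good = binary_cross_entropy([0.9, 0.1, 0.9, 0.1], t)
bad = binary_cross_entropy([0.1, 0.9, 0.1, 0.9], t)
```

With these toy inputs `t` is `[1.0, 0.0, 1.0, 0.0]`, and `good` is much smaller than `bad`, which is the behavior the Tips paragraph describes.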

2.2 Student Model Distilled

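▲ Figure 2. Teacher Model--Student Model Distilled

The Teacher Model training analyzed above is not the main point. In my view the highlight of the paper is how the authors approach distillation. As the distillation diagram shows, the Student Model also uses a two-tower mBERT as its encoder and takes Unlabeled Pairwise Data as input. It simply unifies the Teacher Model's multiple tasks into one model, so the architecture changes little. The distillation procedure is the usual one: the Teacher Model predicts and the Student Model learns.

2.2.1 Teacher Model Inference

There is not much to say about Teacher Model inference: unlabeled data is fed to the model to obtain ner_logits and similar_logits, which is routine for distillation. What deserves attention is the form of the input at inference time. I can see two readings. One is that the input is a sentence pair, and only one sentence of the pair is fed to Recognizer_teacher for recognition. The other is that the authors use BERT's sentence-pair input format, so the input is a single sequence containing both sentences separated by the [SEP] token. I do not know which one is correct, and readers who do are welcome to tell me. I am likewise unsure which input format is used when training the Teacher Model.

2.2.2 Student Model Learning

The Student Model takes pairs of target-language text sequences as input. Its encoder is also a two-tower mBERT that encodes each target sequence separately, again following the basic BERT+Softmax recipe, and the token-level similarity between the two sequences is computed along the way.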

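After obtaining the hidden_state of the two sequences, a linear layer followed by softmax normalization gives the predicted label distribution for each token.

As in the Teacher Model, the token-level similarity between the two target sequences is also computed.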

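Because this is distillation, each sequence pair fed to the Student Model is also fed to the Teacher Model for inference, which yields teacher_ner_logits and teacher_similar_logits. teacher_ner_logits is compared with each of the two student NER outputs through CrossEntropyLoss, giving two teacher-student (TS) losses, one per sequence. teacher_similar_logits is compared with the student's similarity output to give Similar_Loss. The losses are then summed to form the distillation loss (DistilledLoss).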
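A minimal sketch of one teacher-student NER loss, assuming the teacher output is a per-token probability distribution and the student output is a vector of raw label scores. The helper names and numbers are hypothetical; the loss is the usual soft-label cross-entropy, which is one common way to implement this kind of distillation term.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one token's label scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def soft_cross_entropy(teacher_probs, student_scores):
    """Mean over tokens of -sum_c teacher_probs[c] * log(student_probs[c])."""
    total = 0.0
    for q, scores in zip(teacher_probs, student_scores):
        p = softmax(scores)
        total += -sum(qc * math.log(pc) for qc, pc in zip(q, p))
    return total / len(teacher_probs)


# Two tokens, three labels (toy numbers).
teacher_probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]
agreeing = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]      # student favors the teacher's top label
disagreeing = [[0.0, 2.0, 0.0], [2.0, 0.0, 0.0]]   # student favors a different label

loss_agree = soft_cross_entropy(teacher_probs, agreeing)
loss_disagree = soft_cross_entropy(teacher_probs, disagreeing)
```

The loss is lower when the student's distribution follows the teacher's, so minimizing it pulls the student toward the teacher's predictions.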

The authors also weight each term: each TS loss gets a weight α, Similar_Loss gets a weight β, and the final DistilledLoss gets a weight γ. This weighting reduces the noise that the Student Model picks up from the Teacher Model.
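Following the code implementation at the end of this post (not necessarily the exact form in the paper), the weighted combination can be written as below. Here $\alpha_1$ and $\alpha_2$ weight the two teacher-student NER losses, $\beta$ weights the similarity loss, and $\gamma$ scales the total; the loss subscripts are labels chosen for this sketch.

```latex
\mathcal{L}_{\mathrm{distill}}
  = \gamma \left( \alpha_1 \, \mathcal{L}_{\mathrm{TS},1}
                + \alpha_2 \, \mathcal{L}_{\mathrm{TS},2}
                + \beta \, \mathcal{L}_{\mathrm{Similar}} \right)
```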

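I read these weights as parameters that steer what the Student Model learns. Take α first. Because the Student Model's input is unlabeled, distillation has to align the student's output student_ner_logits with the teacher's prediction teacher_ner_logits as closely as possible. The distribution of the unlabeled data is unknown, so a weight is applied to the Teacher Model's predicted labels, treating each unlabeled input sequence as if it were a class with few examples. This is comparable to weighting labels with per-class coefficients when the labels are imbalanced. The authors also show that α increases with the Teacher Model's output (see Figure 3).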

▲ Figure 3. The α parameter versus Weight and F1

The paper also gives the formula for α; specifically, it depends on the Student Model's sequence encoding.
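The exact formula is not reproduced in this post. The code implementation at the end computes alpha as the mean squared maximum class probability of the softmax output, and the sketch below mirrors that computation in plain Python. The function name and the toy distributions are hypothetical.

```python
def confidence_weight(probs):
    """Mean over tokens of the squared maximum class probability."""
    return sum(max(p) ** 2 for p in probs) / len(probs)


# Per-token label distributions (toy numbers, two tokens and three labels each).
confident = [[0.98, 0.01, 0.01], [0.95, 0.03, 0.02]]
uncertain = [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]]

alpha_confident = confidence_weight(confident)
alpha_uncertain = confidence_weight(uncertain)
```

Peaked distributions give a weight close to 1 and flat ones a weight close to 0, so confidently predicted sequences contribute more to the loss.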

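β weights Similar_Loss, that is, the cross-entropy loss between the Teacher Model's similarity matrix and the Student Model's similarity matrix. The idea is that β is large when the Teacher Model's Similar_logits is close to 0 or 1 and small when it is close to 0.5, so that the Student Model learns from informative signals rather than ambiguous ones.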
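One function with exactly this shape is (2s - 1)^2, which is also what the code at the end of this post uses for `beta`. The sketch below evaluates it on a few similarity values; the function name is hypothetical, and whether the paper uses this exact form is an assumption.

```python
def similarity_weight(sim):
    """(2s - 1)^2: 0 at s = 0.5, rising to 1 at s = 0 or s = 1."""
    return (2.0 * sim - 1.0) ** 2


# Weight for a range of similarity values between 0 and 1.
weights = {s: similarity_weight(s) for s in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

The weight is symmetric around 0.5, so a similarity of 0.25 and one of 0.75 are treated as equally informative.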

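Finally, γ adjusts for the consistency between the NER task and the Similarity task. For two input tokens, the Student Model should learn from the teacher's two tasks where the NER task predicts with high accuracy and the Similarity task gives a similarity far from 0.5, and vice versa.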
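The code at the end of this post implements `gamma` as one minus the mean absolute gap between two similarity estimates: one derived from the NER outputs and one from the similarity branch. The plain-Python sketch below follows that reading; the function name and the numbers are hypothetical, and the paper's own formula may differ.

```python
def consistency_weight(ner_similarity, similarity_output):
    """1 - mean |a - b| over token positions; 1.0 means the two tasks agree."""
    diffs = [abs(a - b) for a, b in zip(ner_similarity, similarity_output)]
    return 1.0 - sum(diffs) / len(diffs)


# Per-token similarity estimates from the two tasks (toy numbers).
gamma_agree = consistency_weight([0.9, 0.1, 0.8], [0.9, 0.1, 0.8])
gamma_conflict = consistency_weight([0.9, 0.1, 0.8], [0.1, 0.9, 0.2])
```

When the two tasks tell the same story the weight is 1; the more they contradict each other, the more the total loss is scaled down.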

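Experimental Results

The authors run experiments on the CoNLL and WikiAnn datasets. The amount of data used is shown in the figure below:

▲ Figure 4. CoNLL and WikiAnn data

The authors also compare against several existing SOTA models:

▲ Figure 5. Comparison results

As the comparison shows, the MTMT model performs well across the board. It is slightly behind the BERT-f model on Chinese and leads by a large margin on some of the other languages.

A Simple Code Implementation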

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2022/5/30 13:59
# @Author: SinGaln

"""
An Unsupervised Multiple-Task and Multiple-Teacher Model for Cross-lingual Named Entity Recognition
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertPreTrainedModel, logging

logging.set_verbosity_error()


class TeacherNER(BertPreTrainedModel):
    def __init__(self, config, num_labels):
        """
        The teacher model is trained on labeled data.
        The article describes three encoders; this simplified version
        shares a single mBERT encoder across them.
        :param config: mBERT configuration
        :param num_labels: number of NER labels
        """
        super().__init__(config)
        self.config = config
        self.num_labels = num_labels
        # Named `bert` to match BertPreTrainedModel.base_model_prefix, so that
        # from_pretrained() can map the checkpoint weights onto the encoder.
        self.bert = BertModel(config=config)
        self.fc = nn.Linear(config.hidden_size, num_labels)

    def forward(self, batch_token_input_ids, batch_attention_mask, batch_token_type_ids, batch_labels, training=True,
                batch_pair_input_ids=None, batch_pair_attention_mask=None, batch_pair_token_type_ids=None,
                batch_t=None):
        """
        :param batch_token_input_ids: token ids of the single sentence
        :param batch_attention_mask: attention_mask of the single sentence
        :param batch_token_type_ids: token_type_ids of the single sentence
        :param batch_labels: NER labels of the single sentence
        :param batch_pair_input_ids: token ids of the sentence pair
        :param batch_pair_attention_mask: attention_mask of the sentence pair
        :param batch_pair_token_type_ids: token_type_ids of the sentence pair
        :param batch_t: contrast labels (1 where the pair labels match, else 0)
        """
        # Recognizer Teacher: per-token label logits and probabilities
        single_hidden = self.bert(input_ids=batch_token_input_ids, attention_mask=batch_attention_mask,
                                  token_type_ids=batch_token_type_ids).last_hidden_state
        single_logits = self.fc(single_hidden)
        single_output = F.softmax(single_logits, dim=-1)
        # Evaluator Teacher (similar to a two-tower model)
        pair_output1 = self.bert(input_ids=batch_pair_input_ids[0], attention_mask=batch_pair_attention_mask[0],
                                 token_type_ids=batch_pair_token_type_ids[0]).last_hidden_state
        pair_output2 = self.bert(input_ids=batch_pair_input_ids[1], attention_mask=batch_pair_attention_mask[1],
                                 token_type_ids=batch_pair_token_type_ids[1]).last_hidden_state
        # Sigmoid of the per-token cosine similarity between the two outputs
        pair_output = torch.sigmoid(torch.cosine_similarity(pair_output1, pair_output2, dim=-1))
        if training:
            # The training loss is the sum of the two task losses.
            # F.cross_entropy applies log-softmax itself, so it takes the raw logits.
            loss1 = F.cross_entropy(single_logits.view(-1, self.num_labels), batch_labels.view(-1))
            loss2 = F.binary_cross_entropy(pair_output, batch_t.type(torch.float))
            loss = loss1 + loss2
            return single_output, loss
        else:
            return single_output, pair_output


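class StudentNER(BertPreTrainedModel):
    def __init__(self, config, num_labels):
        """
        The student model also uses a two-tower structure.
        :param config: mBERT configuration
        :param num_labels: number of NER labels
        """
        super().__init__(config)
        self.config = config
        self.num_labels = num_labels
        # Named `bert` to match BertPreTrainedModel.base_model_prefix, so that
        # from_pretrained() can map the checkpoint weights onto the encoder.
        self.bert = BertModel(config=config)
        self.fc1 = nn.Linear(config.hidden_size, num_labels)
        self.fc2 = nn.Linear(config.hidden_size, num_labels)

    def forward(self, batch_pair_input_ids, batch_pair_attention_mask, batch_pair_token_type_ids, batch_pair_labels,
                teacher_logits, teacher_similar):
        """
        :param batch_pair_input_ids: token ids of the sentence pair
        :param batch_pair_attention_mask: attention_mask of the sentence pair
        :param batch_pair_token_type_ids: token_type_ids of the sentence pair
        :param batch_pair_labels: unused here; the student learns from the teacher outputs
        :param teacher_logits: teacher NER probabilities (softmax output), shape (batch, seq_len, num_labels)
        :param teacher_similar: teacher token similarity, shape (batch, seq_len)
        """
        output1 = self.bert(input_ids=batch_pair_input_ids[0], attention_mask=batch_pair_attention_mask[0],
                            token_type_ids=batch_pair_token_type_ids[0]).last_hidden_state
        output2 = self.bert(input_ids=batch_pair_input_ids[1], attention_mask=batch_pair_attention_mask[1],
                            token_type_ids=batch_pair_token_type_ids[1]).last_hidden_state
        soft_output1, soft_output2 = self.fc1(output1), self.fc2(output2)
        soft_logits1, soft_logits2 = F.softmax(soft_output1, dim=-1), F.softmax(soft_output2, dim=-1)
        # alpha: mean squared maximum class probability of each student output
        alpha1 = torch.square(torch.max(soft_logits1, dim=-1)[0]).mean()
        alpha2 = torch.square(torch.max(soft_logits2, dim=-1)[0]).mean()
        output_similar = torch.sigmoid(torch.cosine_similarity(soft_output1, soft_output2, dim=-1))
        soft_similar = torch.sigmoid(torch.cosine_similarity(soft_logits1, soft_logits2, dim=-1))
        # beta: the article describes it in terms of the teacher similarity;
        # this implementation derives it from the student's own similarity output.
        beta = torch.square(2 * output_similar - 1).mean()
        # gamma: agreement between the two student similarity estimates
        gamma = 1 - torch.abs(soft_similar - output_similar).mean()
        # Distillation losses
        # Soft-label cross-entropy between the teacher probabilities and student output 1
        loss1 = alpha1 * (-(teacher_logits * F.log_softmax(soft_output1, dim=-1)).sum(dim=-1).mean())
        # Loss between the teacher similarity and the student similarity
        loss2 = beta * F.binary_cross_entropy(soft_similar, teacher_similar)
        # Soft-label cross-entropy between the teacher probabilities and student output 2
        # (for simplicity the same teacher predictions are reused for both sentences)
        loss3 = alpha2 * (-(teacher_logits * F.log_softmax(soft_output2, dim=-1)).sum(dim=-1).mean())
        # Final loss
        loss = gamma * (loss1 + loss2 + loss3).mean()
        return loss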


if __name__ == "__main__":
    from transformers import BertConfig

    pretrain_path = "./pytorch_mbert_model"

    # Random toy inputs: batch of 2 sentence pairs, 128 tokens each
    batch_pair1_input_ids = torch.randint(1, 100, (2, 128))
    batch_pair1_attention_mask = torch.ones_like(batch_pair1_input_ids)
    batch_pair1_token_type_ids = torch.zeros_like(batch_pair1_input_ids)
    batch_labels1 = torch.randint(1, 10, (2, 128))
    batch_labels2 = torch.randint(1, 10, (2, 128))
    # t: compare the two label sequences, 1 where they match and 0 where they differ
    batch_t = (batch_labels1 == batch_labels2).float()

    batch_pair2_input_ids = torch.randint(1, 100, (2, 128))
    batch_pair2_attention_mask = torch.ones_like(batch_pair2_input_ids)
    batch_pair2_token_type_ids = torch.zeros_like(batch_pair2_input_ids)

    batch_all_labels = [batch_labels1, batch_labels2]
    batch_all_input_ids = [batch_pair1_input_ids, batch_pair2_input_ids]
    batch_all_attention_mask = [batch_pair1_attention_mask, batch_pair2_attention_mask]
    batch_all_token_type_ids = [batch_pair1_token_type_ids, batch_pair2_token_type_ids]

    config = BertConfig.from_pretrained(pretrain_path)
    # Teacher model training
    teacher_model = TeacherNER.from_pretrained(pretrain_path, config=config, num_labels=10)
    outputs, loss = teacher_model(batch_token_input_ids=batch_pair1_input_ids,
                                  batch_attention_mask=batch_pair1_attention_mask,
                                  batch_token_type_ids=batch_pair1_token_type_ids, batch_labels=batch_labels1,
                                  batch_pair_input_ids=batch_all_input_ids,
                                  batch_pair_attention_mask=batch_all_attention_mask,
                                  batch_pair_token_type_ids=batch_all_token_type_ids,
                                  training=True, batch_t=batch_t)
    # Student model distillation: the teacher only predicts, so no gradients are needed
    with torch.no_grad():
        teacher_logits, teacher_similar = teacher_model(batch_token_input_ids=batch_pair1_input_ids,
                                                        batch_attention_mask=batch_pair1_attention_mask,
                                                        batch_token_type_ids=batch_pair1_token_type_ids,
                                                        batch_labels=batch_labels1,
                                                        batch_pair_input_ids=batch_all_input_ids,
                                                        batch_pair_attention_mask=batch_all_attention_mask,
                                                        batch_pair_token_type_ids=batch_all_token_type_ids,
                                                        training=False)

    student_model = StudentNER.from_pretrained(pretrain_path, config=config, num_labels=10)
    loss_all = student_model(batch_pair_input_ids=batch_all_input_ids,
                             batch_pair_attention_mask=batch_all_attention_mask,
                             batch_pair_token_type_ids=batch_all_token_type_ids,
                             batch_pair_labels=batch_all_labels, teacher_logits=teacher_logits,
                             teacher_similar=teacher_similar)
    print(loss_all)