[oneAPI] English Text Entailment Based on a Pretrained BERT Model

Competition: https://marketing.csdn.net/p/f3e44fbfe46c465f4d9d6c23e38e0517
Intel® DevCloud for oneAPI: https://devcloud.intel.com/oneapi/get_started/aiAnalyticsToolkitSamples/

Intel® DevCloud for oneAPI and Intel® Optimization for PyTorch
We built our experiment environment on the Intel® DevCloud for oneAPI platform. Its fully virtualized setup let us focus on developing and optimizing the model, without having to worry about configuring or maintaining the underlying environment. To further improve training efficiency, we applied Intel® Optimization for PyTorch (Intel® Extension for PyTorch, IPEX) to our PyTorch model.
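As a minimal sketch of how this optimization is wired in (assuming the intel_extension_for_pytorch package is installed; a plain linear layer stands in here for the real BERT classifier), applying IPEX only takes one extra call before the training loop:

import torch
import intel_extension_for_pytorch as ipex

# Any nn.Module works here; a linear layer stands in for the BERT classifier.
model = torch.nn.Linear(768, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

model.train()  # IPEX expects training mode when an optimizer is passed in
# ipex.optimize returns optimized versions of both objects; use these from here on.
model, optimizer = ipex.optimize(model, optimizer=optimizer)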
English Text Entailment Based on a Pretrained BERT Model
Natural Language Inference (NLI) is an important task in natural language processing, and textual entailment is one of its concrete sub-tasks. MNLI (MultiNLI, the Multi-Genre Natural Language Inference corpus) is a widely used NLI dataset designed to evaluate how well a model understands entailment relations between texts.
In the MNLI task, given a premise sentence and a hypothesis sentence, the model must decide whether the hypothesis can be inferred from the premise. Three relation classes are involved: entailment, neutral, and contradiction. For example, given the premise "A cat is sitting on the couch." and the hypothesis "A cat is on a piece of furniture.", the model should judge the relation between the two sentences to be entailment.
MNLI is deliberately challenging: a model must go beyond the literal meaning of the sentences and perform logical reasoning and contextual understanding. Solving MNLI matters for building NLP models with deep semantic understanding, with applications in question answering, text comprehension, and semantic reasoning.
A BERT-based textual-entailment (text-pair classification) task is, in essence, still the classification of a single token sequence. Following the BERT design, however, the text-pair task must use Segment Embeddings during dataset construction to distinguish the two sequences. In other words, compared with ordinary single-text classification, what changes in the text-pair task is how the model input is built.
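To make that concrete, here is a minimal sketch of how a sentence pair is packed into one sequence together with its Segment Embedding (token_type) ids; the word-piece ids below are illustrative placeholders rather than entries from a real vocabulary:

# Sketch of the text-pair input layout used by BERT.
CLS_IDX, SEP_IDX = 101, 102           # [CLS] and [SEP] ids in BERT's English vocab
premise_ids = [1037, 4937, 2003]      # e.g. "a cat is" (hypothetical ids)
hypothesis_ids = [1037, 4937]         # e.g. "a cat"    (hypothetical ids)

token_ids = [CLS_IDX] + premise_ids + [SEP_IDX] + hypothesis_ids + [SEP_IDX]
# Segment ids: 0 covers "[CLS] premise [SEP]", 1 covers "hypothesis [SEP]".
segment_ids = [0] * (len(premise_ids) + 2) + [1] * (len(hypothesis_ids) + 1)
assert len(token_ids) == len(segment_ids)  # one segment id per input position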
Corpus Overview
Here we use the MNLI corpus (The Multi-Genre Natural Language Inference Corpus) mentioned in the paper, a natural language inference dataset: given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts it (contradiction), or neither (neutral). Two raw samples look like this:
{"annotator_labels": ["entailment", "neutral", "entailment", "neutral", "entailment"], "genre": "oup", "gold_label": "entailment", "pairID": "82890e", "promptID": "82890", "sentence1": "From Home Work to Modern Manufacture", "sentence1_binary_parse": "( From ( ( Home Work ) ( to ( Modern Manufacture ) ) ) )", "sentence1_parse": "(ROOT (PP (IN From) (NP (NP (NNP Home) (NNP Work)) (PP (TO to) (NP (NNP Modern) (NNP Manufacture))))))", "sentence2": "Modern manufacturing has changed over time.", "sentence2_binary_parse": "( ( Modern manufacturing ) ( ( has ( changed ( over time ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (NNP Modern) (NN manufacturing)) (VP (VBZ has) (VP (VBN changed) (PP (IN over) (NP (NN time))))) (. .)))"}
{"annotator_labels": ["neutral", "neutral", "entailment", "neutral", "neutral"], "genre": "nineeleven", "gold_label": "neutral", "pairID": "16525n", "promptID": "16525", "sentence1": "They were promptly executed.", "sentence1_binary_parse": "( They ( ( were ( promptly executed ) ) . ) )", "sentence1_parse": "(ROOT (S (NP (PRP They)) (VP (VBD were) (VP (ADVP (RB promptly)) (VBN executed))) (. .)))", "sentence2": "They were executed immediately upon capture.", "sentence2_binary_parse": "( They ( ( were ( ( executed immediately ) ( upon capture ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (PRP They)) (VP (VBD were) (VP (VBN executed) (ADVP (RB immediately)) (PP (IN upon) (NP (NN capture))))) (. .)))"}

Because this dataset is also used for other tasks, each sample carries, in addition to the two sentences and the label that we need, extra fields such as the syntactic parse of each sentence. After downloading the data, simply run the project's format.py script to split the raw data into training, validation, and test sets (a simplified sketch of this conversion follows the two formatted lines below). The formatted data looks like this:
From Home Work to Modern Manufacture_!_Modern manufacturing has changed over time._!_1
They were promptly executed._!_They were executed immediately upon capture._!_2
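A simplified sketch of what such a conversion can look like is given below. Note that the label-to-id mapping is an assumption inferred from the two formatted samples above (entailment -> 1, neutral -> 2); contradiction -> 0 is assumed.

import json

# Hypothetical mapping, consistent with the formatted samples above.
LABEL2ID = {'contradiction': 0, 'entailment': 1, 'neutral': 2}

def format_mnli(in_path, out_path):
    """Flatten one MNLI jsonl file into the `sentence1_!_sentence2_!_label` format."""
    with open(in_path, encoding='utf8') as fin, \
            open(out_path, 'w', encoding='utf8') as fout:
        for line in fin:
            sample = json.loads(line)
            label = sample['gold_label']
            if label not in LABEL2ID:  # skip samples without annotator consensus ('-')
                continue
            fout.write(f"{sample['sentence1']}_!_{sample['sentence2']}"
                       f"_!_{LABEL2ID[label]}\n")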
Dataset Construction

We define a class whose initialization builds the vocabulary and related assets from the training corpus:
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
# build_vocab, pad_sequence (a variant that accepts max_len) and the cache
# decorator come from the project's own utility modules.


class LoadSingleSentenceClassificationDataset:
    def __init__(self,
                 vocab_path='./vocab.txt',
                 tokenizer=None,
                 batch_size=32,
                 max_sen_len=None,
                 split_sep='\n',
                 max_position_embeddings=512,
                 pad_index=0,
                 is_sample_shuffle=True):
        """
        :param vocab_path: path to the local vocab.txt
        :param tokenizer:
        :param batch_size:
        :param max_sen_len: padding policy for each batch:
               max_sen_len = None   -> pad to the longest sample in each batch;
               max_sen_len = 'same' -> pad to the longest sample in the whole dataset;
               max_sen_len = 50 (any int) -> pad every sample to this fixed length,
               truncating anything longer
        :param split_sep: separator between the text and the label, '\n' by default
        :param max_position_embeddings: maximum sample length; anything beyond it is truncated
        :param is_sample_shuffle: whether to shuffle the training set. This only applies
               to the training set: when the DataLoaders are built later, the validation
               and test sets keep a fixed order. Do not enable shuffling for them when
               modifying the code, because with shuffle=True the sample order changes on
               every pass over data_iter, so the labels returned at prediction time would
               no longer match the original order, which is inconvenient to handle.
        """
        self.tokenizer = tokenizer
        self.vocab = build_vocab(vocab_path)
        self.PAD_IDX = pad_index
        self.SEP_IDX = self.vocab['[SEP]']
        self.CLS_IDX = self.vocab['[CLS]']
        # self.UNK_IDX = '[UNK]'
        self.batch_size = batch_size
        self.split_sep = split_sep
        self.max_position_embeddings = max_position_embeddings
        if isinstance(max_sen_len, int) and max_sen_len > max_position_embeddings:
            max_sen_len = max_position_embeddings
        self.max_sen_len = max_sen_len
        self.is_sample_shuffle = is_sample_shuffle

    @cache
    def data_process(self, filepath, postfix='cache'):
        """
        Convert every token in each sentence to its vocabulary index and
        return the length of the longest sample in the dataset.
        :param filepath: dataset path
        """
        raw_iter = open(filepath, encoding='utf8').readlines()
        data = []
        max_len = 0
        for raw in tqdm(raw_iter, ncols=80):
            line = raw.rstrip('\n').split(self.split_sep)
            s, l = line[0], line[1]
            tmp = [self.CLS_IDX] + [self.vocab[token] for token in self.tokenizer(s)]
            if len(tmp) > self.max_position_embeddings - 1:
                tmp = tmp[:self.max_position_embeddings - 1]  # the pretrained BERT model only takes the first 512 tokens
            tmp += [self.SEP_IDX]
            tensor_ = torch.tensor(tmp, dtype=torch.long)
            l = torch.tensor(int(l), dtype=torch.long)
            max_len = max(max_len, tensor_.size(0))
            data.append((tensor_, l))
        return data, max_len

    def load_train_val_test_data(self, train_file_path=None,
                                 val_file_path=None,
                                 test_file_path=None,
                                 only_test=False):
        postfix = str(self.max_sen_len)
        test_data, _ = self.data_process(filepath=test_file_path, postfix=postfix)
        test_iter = DataLoader(test_data, batch_size=self.batch_size,
                               shuffle=False, collate_fn=self.generate_batch)
        if only_test:
            return test_iter
        train_data, max_sen_len = self.data_process(filepath=train_file_path,
                                                    postfix=postfix)  # process all samples
        if self.max_sen_len == 'same':
            self.max_sen_len = max_sen_len
        val_data, _ = self.data_process(filepath=val_file_path, postfix=postfix)
        train_iter = DataLoader(train_data, batch_size=self.batch_size,  # build the DataLoaders
                                shuffle=self.is_sample_shuffle, collate_fn=self.generate_batch)
        val_iter = DataLoader(val_data, batch_size=self.batch_size,
                              shuffle=False, collate_fn=self.generate_batch)
        return train_iter, test_iter, val_iter

    def generate_batch(self, data_batch):
        batch_sentence, batch_label = [], []
        for (sen, label) in data_batch:  # process every sample in the batch
            batch_sentence.append(sen)
            batch_label.append(label)
        batch_sentence = pad_sequence(batch_sentence,  # [max_len, batch_size]
                                      padding_value=self.PAD_IDX,
                                      batch_first=False,
                                      max_len=self.max_sen_len)
        batch_label = torch.tensor(batch_label, dtype=torch.long)
        return batch_sentence, batch_label


class LoadPairSentenceClassificationDataset(LoadSingleSentenceClassificationDataset):
    def __init__(self, **kwargs):
        super(LoadPairSentenceClassificationDataset, self).__init__(**kwargs)

    @cache
    def data_process(self, filepath, postfix='cache'):
        """
        Convert every token in each sentence pair to its vocabulary index and
        return the length of the longest sample in the dataset.
        :param filepath: dataset path
        """
        raw_iter = open(filepath, encoding='utf8').readlines()
        data = []
        max_len = 0
        for raw in tqdm(raw_iter, ncols=80):
            line = raw.rstrip('\n').split(self.split_sep)
            s1, s2, l = line[0], line[1], line[2]
            token1 = [self.vocab[token] for token in self.tokenizer(s1)]
            token2 = [self.vocab[token] for token in self.tokenizer(s2)]
            tmp = [self.CLS_IDX] + token1 + [self.SEP_IDX] + token2
            if len(tmp) > self.max_position_embeddings - 1:
                tmp = tmp[:self.max_position_embeddings - 1]  # the pretrained BERT model only takes the first 512 tokens
            tmp += [self.SEP_IDX]
            seg1 = [0] * (len(token1) + 2)  # +2 accounts for [CLS] and the middle [SEP]
            seg2 = [1] * (len(tmp) - len(seg1))
            segs = torch.tensor(seg1 + seg2, dtype=torch.long)
            tensor_ = torch.tensor(tmp, dtype=torch.long)
            l = torch.tensor(int(l), dtype=torch.long)
            max_len = max(max_len, tensor_.size(0))
            data.append((tensor_, segs, l))
        return data, max_len

    def generate_batch(self, data_batch):
        batch_sentence, batch_seg, batch_label = [], [], []
        for (sen, seg, label) in data_batch:  # process every sample in the batch
            batch_sentence.append(sen)
            batch_seg.append(seg)
            batch_label.append(label)
        batch_sentence = pad_sequence(batch_sentence,
                                      padding_value=self.PAD_IDX,
                                      batch_first=False,
                                      max_len=self.max_sen_len)  # [max_len, batch_size]
        batch_seg = pad_sequence(batch_seg,
                                 padding_value=self.PAD_IDX,
                                 batch_first=False,
                                 max_len=self.max_sen_len)  # [max_len, batch_size]
        batch_label = torch.tensor(batch_label, dtype=torch.long)
        return batch_sentence, batch_seg, batch_label
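Before moving on, here is a hedged usage sketch of the pair loader. The file names and vocab path are placeholders, and BertTokenizer comes from the Hugging Face transformers package rather than from the project itself:

from transformers import BertTokenizer

# Tokenize with the pretrained model's own WordPiece vocabulary.
bert_tokenize = BertTokenizer.from_pretrained('bert-base-uncased').tokenize
data_loader = LoadPairSentenceClassificationDataset(
    vocab_path='./bert_base_uncased_english/vocab.txt',  # placeholder path
    tokenizer=bert_tokenize,
    batch_size=32,
    max_sen_len=None,      # pad to the longest sample in each batch
    split_sep='_!_')       # matches the formatted data shown earlier
train_iter, test_iter, val_iter = data_loader.load_train_val_test_data(
    'train.txt', 'val.txt', 'test.txt')  # placeholder file names
for sample, seg, label in train_iter:
    # sample/seg: [max_len, batch_size]; label: [batch_size]
    print(sample.shape, seg.shape, label.shape)
    break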
Model Training

We use a TaskForPairSentenceClassification module to carry out the fine-tuning of the classification model.
First, we need to define a configuration class to manage the hyperparameters of the classification model; in this project that is the BertConfig class shown below:
import copy
import json
import logging
import six


class BertConfig(object):
    """Configuration for BertModel."""

    def __init__(self,
                 vocab_size=21128,
                 hidden_size=768,
                 num_hidden_layers=12,
                 num_attention_heads=12,
                 intermediate_size=3072,
                 pad_token_id=0,
                 hidden_act="gelu",
                 hidden_dropout_prob=0.1,
                 attention_probs_dropout_prob=0.1,
                 max_position_embeddings=512,
                 type_vocab_size=2,
                 initializer_range=0.02):
        """Constructs BertConfig.

        Args:
            vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
            hidden_size: Size of the encoder layers and the pooler layer.
            num_hidden_layers: Number of hidden layers in the Transformer encoder.
            num_attention_heads: Number of attention heads for each attention layer in
                the Transformer encoder.
            intermediate_size: The size of the "intermediate" (i.e., feed-forward)
                layer in the Transformer encoder.
            hidden_act: The non-linear activation function (function or string) in the
                encoder and pooler.
            hidden_dropout_prob: The dropout probability for all fully connected
                layers in the embeddings, encoder, and pooler.
            attention_probs_dropout_prob: The dropout ratio for the attention
                probabilities.
            max_position_embeddings: The maximum sequence length that this model might
                ever be used with. Typically set this to something large just in case
                (e.g., 512 or 1024 or 2048).
            type_vocab_size: The vocabulary size of `token_type_ids` passed into
                `BertModel`.
            initializer_range: The stdev of the truncated_normal_initializer for
                initializing all weight matrices.
        """
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.pad_token_id = pad_token_id
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range

    @classmethod
    def from_dict(cls, json_object):
        """Constructs a BertConfig from a Python dictionary of parameters."""
        config = BertConfig(vocab_size=None)
        for (key, value) in six.iteritems(json_object):
            config.__dict__[key] = value
        return config

    @classmethod
    def from_json_file(cls, json_file):
        """Constructs a BertConfig from a json file of parameters."""
        with open(json_file, 'r') as reader:
            text = reader.read()
        logging.info(f"Successfully loaded BERT config file {json_file}")
        return cls.from_dict(json.loads(text))

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
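In practice, the configuration is typically loaded straight from the config.json shipped with the downloaded pretrained weights. A small usage sketch (the directory name is a placeholder):

# Placeholder directory holding the downloaded pretrained weights and config.
config = BertConfig.from_json_file('./bert_base_uncased_english/config.json')
print(config.to_json_string())  # inspect the effective hyperparameters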
Finally, we only need to define a train() function to complete the model training. The code is as follows:

import os
import time

import torch
import intel_extension_for_pytorch as ipex
from transformers import BertTokenizer, get_scheduler


def train(config):
    model = BertForSentenceClassification(config,
                                          config.pretrained_model_dir)
    model_save_path = os.path.join(config.model_save_dir, 'model.pt')
    if os.path.exists(model_save_path):
        loaded_paras = torch.load(model_save_path)
        model.load_state_dict(loaded_paras)
        logging.info("## Successfully loaded an existing model; resuming training......")
    model = model.to(config.device)
    optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)
    model.train()
    # Apply Intel Extension for PyTorch optimization to the model and optimizer objects.
    model, optimizer = ipex.optimize(model, optimizer=optimizer)

    bert_tokenize = BertTokenizer.from_pretrained(
        config.pretrained_model_dir).tokenize
    data_loader = LoadPairSentenceClassificationDataset(
        vocab_path=config.vocab_path,
        tokenizer=bert_tokenize,
        batch_size=config.batch_size,
        max_sen_len=config.max_sen_len,
        split_sep=config.split_sep,
        max_position_embeddings=config.max_position_embeddings,
        pad_index=config.pad_token_id)
    train_iter, test_iter, val_iter = \
        data_loader.load_train_val_test_data(config.train_file_path,
                                             config.val_file_path,
                                             config.test_file_path)
    lr_scheduler = get_scheduler(name='linear',
                                 optimizer=optimizer,
                                 num_warmup_steps=int(len(train_iter) * 0),
                                 num_training_steps=int(config.epochs * len(train_iter)))
    max_acc = 0
    for epoch in range(config.epochs):
        losses = 0
        start_time = time.time()
        for idx, (sample, seg, label) in enumerate(train_iter):
            sample = sample.to(config.device)  # [src_len, batch_size]
            label = label.to(config.device)
            seg = seg.to(config.device)
            padding_mask = (sample == data_loader.PAD_IDX).transpose(0, 1)
            loss, logits = model(input_ids=sample,
                                 attention_mask=padding_mask,
                                 token_type_ids=seg,
                                 position_ids=None,
                                 labels=label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            losses += loss.item()
            acc = (logits.argmax(1) == label).float().mean()
            if idx % 10 == 0:
                logging.info(f"Epoch: {epoch}, Batch[{idx}/{len(train_iter)}], "
                             f"Train loss: {loss.item():.3f}, Train acc: {acc:.3f}")
        end_time = time.time()
        train_loss = losses / len(train_iter)
        logging.info(f"Epoch: {epoch}, Train loss: "
                     f"{train_loss:.3f}, Epoch time = {(end_time - start_time):.3f}s")
        if (epoch + 1) % config.model_val_per_epoch == 0:
            acc = evaluate(val_iter, model, config.device, data_loader.PAD_IDX)
            logging.info(f"Accuracy on val {acc:.3f}")
            if acc > max_acc:
                max_acc = acc
                torch.save(model.state_dict(), model_save_path)

Results

References

English text entailment based on a pretrained BERT model: https://www.ylkz.life/deeplearning/p10407402/