Introduction
Today we will implement ESIM for text matching. It is a classic interaction-based text matching model, and the first model in this series whose test-set accuracy exceeds 80%.
Let's see how it is implemented.
Model architecture

We mainly implement the ESIM network shown on the left of the architecture figure.
From bottom to top, it consists of:
- Input Encoding layer: encodes the premise and the hypothesis. The words of each sentence are converted into word vectors to obtain a vector sequence, and the two vector sequences are fed into their own Bi-LSTM to extract semantic features.
- Local Inference Modeling layer: this is the attention layer. It uses attention to capture local features between the LSTM output vectors and then constructs additional features with element-wise operations (summarized in the equations below).
- Inference Composition layer: like the input encoding layer, this is also a Bi-LSTM. After the attention features between the two texts have been captured, it further fuses them and extracts semantic features.
- Prediction layer: concatenates the vectors obtained from average pooling and max pooling, then applies Softmax for classification.
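For reference, the local inference and composition steps that the code below implements can be written compactly as follows (notation as in the ESIM paper, with $\bar{a}_i$ and $\bar{b}_j$ denoting the Bi-LSTM outputs of the two sentences):

e_{ij} = \bar{a}_i^\top \bar{b}_j, \qquad
\tilde{a}_i = \sum_j \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})} \bar{b}_j, \qquad
\tilde{b}_j = \sum_i \frac{\exp(e_{ij})}{\sum_k \exp(e_{kj})} \bar{a}_i

m_a = [\bar{a}; \tilde{a}; \bar{a} - \tilde{a}; \bar{a} \odot \tilde{a}], \qquad
m_b = [\bar{b}; \tilde{b}; \bar{b} - \tilde{b}; \bar{b} \odot \tilde{b}]

v = [v_{a,\text{avg}}; v_{a,\text{max}}; v_{b,\text{avg}}; v_{b,\text{max}}], \qquad \text{prediction} = \text{softmax}(\text{MLP}(v))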
Model implementation
import torch
from torch import nn


class ESIM(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        embedding_size: int,
        hidden_size: int,
        num_classes: int,
        lstm_dropout: float = 0.1,
        dropout: float = 0.5,
    ) -> None:
        """
        Args:
            vocab_size (int): the size of the Vocabulary
            embedding_size (int): the size of each embedding vector
            hidden_size (int): the size of the hidden layer
            num_classes (int): the output size
            lstm_dropout (float, optional): dropout ratio in lstm layer. Defaults to 0.1.
            dropout (float, optional): dropout ratio in linear layer. Defaults to 0.5.
        """
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_size)
        # lstm for input embedding
        # note: this assumes embedding_size == hidden_size (both are 300 in this article)
        self.lstm_a = nn.LSTM(
            hidden_size,
            hidden_size,
            batch_first=True,
            bidirectional=True,
            dropout=lstm_dropout,
        )
        self.lstm_b = nn.LSTM(
            hidden_size,
            hidden_size,
            batch_first=True,
            bidirectional=True,
            dropout=lstm_dropout,
        )

First there is an embedding layer, and then each of the two input sentences has its own Bi-LSTM.
Next we define the inference composition layer:

        # lstm for augment inference vector
        self.lstm_v_a = nn.LSTM(
            8 * hidden_size,
            hidden_size,
            batch_first=True,
            bidirectional=True,
            dropout=lstm_dropout,
        )
        self.lstm_v_b = nn.LSTM(
            8 * hidden_size,
            hidden_size,
            batch_first=True,
            bidirectional=True,
            dropout=lstm_dropout,
        )
Finally comes the prediction layer, which is a multi-layer feed-forward network:

        self.predict = nn.Sequential(
            nn.Linear(8 * hidden_size, 2 * hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes),
        )

The complete implementation of the initializer is:
class ESIM(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        embedding_size: int,
        hidden_size: int,
        num_classes: int,
        lstm_dropout: float = 0.1,
        dropout: float = 0.5,
    ) -> None:
        """
        Args:
            vocab_size (int): the size of the Vocabulary
            embedding_size (int): the size of each embedding vector
            hidden_size (int): the size of the hidden layer
            num_classes (int): the output size
            lstm_dropout (float, optional): dropout ratio in lstm layer. Defaults to 0.1.
            dropout (float, optional): dropout ratio in linear layer. Defaults to 0.5.
        """
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_size)
        # lstm for input embedding
        self.lstm_a = nn.LSTM(
            hidden_size,
            hidden_size,
            batch_first=True,
            bidirectional=True,
            dropout=lstm_dropout,
        )
        self.lstm_b = nn.LSTM(
            hidden_size,
            hidden_size,
            batch_first=True,
            bidirectional=True,
            dropout=lstm_dropout,
        )
        # lstm for augment inference vector
        self.lstm_v_a = nn.LSTM(
            8 * hidden_size,
            hidden_size,
            batch_first=True,
            bidirectional=True,
            dropout=lstm_dropout,
        )
        self.lstm_v_b = nn.LSTM(
            8 * hidden_size,
            hidden_size,
            batch_first=True,
            bidirectional=True,
            dropout=lstm_dropout,
        )
        self.predict = nn.Sequential(
            nn.Linear(8 * hidden_size, 2 * hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes),
        )

ReLU is used as the activation function, and each activation is followed by a Dropout.
The key part is the forward method:

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """
        Args:
            a (torch.Tensor): input sequence a with shape (batch_size, a_seq_len)
            b (torch.Tensor): input sequence b with shape (batch_size, b_seq_len)

        Returns:
            torch.Tensor: the logits with shape (batch_size, num_classes)
        """
        # a_embed (batch_size, a_seq_len, embedding_size)
        a_embed = self.embedding(a)
        # b_embed (batch_size, b_seq_len, embedding_size)
        b_embed = self.embedding(b)
        # a_bar (batch_size, a_seq_len, 2 * hidden_size)
        a_bar, _ = self.lstm_a(a_embed)
        # b_bar (batch_size, b_seq_len, 2 * hidden_size)
        b_bar, _ = self.lstm_b(b_embed)
        # score (batch_size, a_seq_len, b_seq_len)
        score = torch.matmul(a_bar, b_bar.permute(0, 2, 1))
        # softmax: (batch_size, a_seq_len, b_seq_len) x (batch_size, b_seq_len, 2 * hidden_size)
        # a_tilde (batch_size, a_seq_len, 2 * hidden_size)
        a_tilde = torch.matmul(torch.softmax(score, dim=2), b_bar)
        # permute: (batch_size, b_seq_len, a_seq_len) x (batch_size, a_seq_len, 2 * hidden_size)
        # b_tilde (batch_size, b_seq_len, 2 * hidden_size)
        b_tilde = torch.matmul(torch.softmax(score, dim=1).permute(0, 2, 1), a_bar)
        # m_a (batch_size, a_seq_len, 8 * hidden_size)
        m_a = torch.cat([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], dim=-1)
        # m_b (batch_size, b_seq_len, 8 * hidden_size)
        m_b = torch.cat([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], dim=-1)
        # v_a (batch_size, a_seq_len, 2 * hidden_size)
        v_a, _ = self.lstm_v_a(m_a)
        # v_b (batch_size, b_seq_len, 2 * hidden_size)
        v_b, _ = self.lstm_v_b(m_b)
        # avg/max pooling -> (batch_size, 2 * hidden_size) each
        avg_a = torch.mean(v_a, dim=1)
        avg_b = torch.mean(v_b, dim=1)
        max_a, _ = torch.max(v_a, dim=1)
        max_b, _ = torch.max(v_b, dim=1)
        # v (batch_size, 8 * hidden_size)
        v = torch.cat([avg_a, max_a, avg_b, max_b], dim=-1)

        return self.predict(v)
Annotating the shape of each tensor makes the method fairly easy to follow.

First, the embedding vectors are looked up for the two inputs separately.

Next, each sequence is fed into its own LSTM, producing an output at every time step. Note that because the LSTM is bidirectional, the last dimension is 2 * hidden_size.

Then the attention scores between the two are computed: score has shape (batch_size, a_seq_len, b_seq_len).

After that we compute the attention of a over b and of b over a. Pay attention to the dim argument of softmax here: the attended representation of a is a weighted sum of b's outputs, so softmax(score, dim=2) normalizes the weights over the b_seq_len dimension (see the small shape check after this walkthrough).

Then comes the enhanced local inference: the vectors, their difference, and their element-wise product are concatenated, giving a last dimension of 8 * hidden_size, which is why the input size of lstm_v_a is 8 * hidden_size.

Next, average pooling and max pooling are applied to the resulting a and b representations, and the four pooled vectors are concatenated, again yielding a dimension of 8 * hidden_size.

Finally, the prediction layer, a multi-layer feed-forward network, applies a few non-linear transformations and maps the vector to the output size; whichever class has the larger value in the last dimension is taken as the prediction.
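To make the softmax dimensions concrete, here is a minimal, self-contained shape check with random tensors (the sizes are made up for illustration) that mirrors the attention step above:

import torch

batch_size, a_seq_len, b_seq_len, hidden_size = 2, 5, 7, 4
a_bar = torch.randn(batch_size, a_seq_len, 2 * hidden_size)
b_bar = torch.randn(batch_size, b_seq_len, 2 * hidden_size)

# score[i, j, k]: similarity between token j of sentence a and token k of sentence b
score = torch.matmul(a_bar, b_bar.permute(0, 2, 1))
print(score.shape)  # torch.Size([2, 5, 7])

# For a_tilde the weights must sum to 1 over b's tokens, hence softmax over dim=2
weights_a = torch.softmax(score, dim=2)
print(weights_a.sum(dim=2))  # all ones
a_tilde = torch.matmul(weights_a, b_bar)
print(a_tilde.shape)  # torch.Size([2, 5, 8]) == (batch_size, a_seq_len, 2 * hidden_size)

# For b_tilde the weights must sum to 1 over a's tokens, hence softmax over dim=1, then transpose
weights_b = torch.softmax(score, dim=1).permute(0, 2, 1)
b_tilde = torch.matmul(weights_b, a_bar)
print(b_tilde.shape)  # torch.Size([2, 7, 8]) == (batch_size, b_seq_len, 2 * hidden_size)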
Data preparation

This part is almost identical to the previous posts in this series, so the code is pasted directly:
from collections import defaultdict
from tqdm import tqdm
import numpy as np
import json
from torch.utils.data import Dataset
import pandas as pd
from typing import Tuple

UNK_TOKEN = "UNK"
PAD_TOKEN = "PAD"


class Vocabulary:
    """Class to process text and extract vocabulary for mapping"""

    def __init__(self, token_to_idx: dict = None, tokens: list[str] = None) -> None:
        """
        Args:
            token_to_idx (dict, optional): a pre-existing map of tokens to indices. Defaults to None.
            tokens (list[str], optional): a list of unique tokens with no duplicates. Defaults to None.
        """
        assert any(
            [tokens, token_to_idx]
        ), "At least one of these parameters should be set as not None."
        if token_to_idx:
            self._token_to_idx = token_to_idx
        else:
            self._token_to_idx = {}
            if PAD_TOKEN not in tokens:
                tokens = [PAD_TOKEN] + tokens

            for idx, token in enumerate(tokens):
                self._token_to_idx[token] = idx

        self._idx_to_token = {idx: token for token, idx in self._token_to_idx.items()}

        self.unk_index = self._token_to_idx[UNK_TOKEN]
        self.pad_index = self._token_to_idx[PAD_TOKEN]

    @classmethod
    def build(
        cls,
        sentences: list[list[str]],
        min_freq: int = 2,
        reserved_tokens: list[str] = None,
    ) -> "Vocabulary":
        """Construct the Vocabulary from sentences

        Args:
            sentences (list[list[str]]): a list of tokenized sequences
            min_freq (int, optional): the minimum word frequency to be saved. Defaults to 2.
            reserved_tokens (list[str], optional): the reserved tokens to add into the Vocabulary. Defaults to None.

        Returns:
            Vocabulary: a Vocabulary instance
        """
        token_freqs = defaultdict(int)
        for sentence in tqdm(sentences):
            for token in sentence:
                token_freqs[token] += 1

        unique_tokens = (reserved_tokens if reserved_tokens else []) + [UNK_TOKEN]
        unique_tokens += [
            token
            for token, freq in token_freqs.items()
            if freq >= min_freq and token != UNK_TOKEN
        ]
        return cls(tokens=unique_tokens)

    def __len__(self) -> int:
        return len(self._idx_to_token)

    def __getitem__(self, tokens: list[str] | str) -> list[int] | int:
        """Retrieve the indices associated with the tokens or the index with the single token

        Args:
            tokens (list[str] | str): a list of tokens or single token

        Returns:
            list[int] | int: the indices or the single index
        """
        if not isinstance(tokens, (list, tuple)):
            return self._token_to_idx.get(tokens, self.unk_index)
        return [self.__getitem__(token) for token in tokens]

    def lookup_token(self, indices: list[int] | int) -> list[str] | str:
        """Retrieve the tokens associated with the indices or the token with the single index

        Args:
            indices (list[int] | int): a list of indices or a single index

        Returns:
            list[str] | str: the corresponding tokens (or token)
        """
        if not isinstance(indices, (list, tuple)):
            return self._idx_to_token[indices]
        return [self._idx_to_token[index] for index in indices]

    def to_serializable(self) -> dict:
        """Returns a dictionary that can be serialized"""
        return {"token_to_idx": self._token_to_idx}

    @classmethod
    def from_serializable(cls, contents: dict) -> "Vocabulary":
        """Instantiates the Vocabulary from a serialized dictionary

        Args:
            contents (dict): a dictionary generated by to_serializable

        Returns:
            Vocabulary: the Vocabulary instance
        """
        return cls(**contents)

    def __repr__(self):
        return f"Vocabulary(size={len(self)})"


class TMVectorizer:
    """The Vectorizer which vectorizes the Vocabulary"""

    def __init__(self, vocab: Vocabulary, max_len: int) -> None:
        """
        Args:
            vocab (Vocabulary): maps characters to integers
            max_len (int): the max length of the sequence in the dataset
        """
        self.vocab = vocab
        self.max_len = max_len

    def _vectorize(
        self, indices: list[int], vector_length: int = -1, padding_index: int = 0
    ) -> np.ndarray:
        """Vectorize the provided indices

        Args:
            indices (list[int]): a list of integers that represent a sequence
            vector_length (int, optional): an argument for forcing the length of the index vector. Defaults to -1.
            padding_index (int, optional): the padding index to use. Defaults to 0.

        Returns:
            np.ndarray: the vectorized index array
        """
        if vector_length < 0:
            vector_length = len(indices)

        vector = np.zeros(vector_length, dtype=np.int64)
        if len(indices) > vector_length:
            vector[:] = indices[:vector_length]
        else:
            vector[: len(indices)] = indices
            vector[len(indices) :] = padding_index

        return vector

    def _get_indices(self, sentence: list[str]) -> list[int]:
        """Return the vectorized sentence

        Args:
            sentence (list[str]): list of tokens

        Returns:
            indices (list[int]): list of integers representing the sentence
        """
        return [self.vocab[token] for token in sentence]

    def vectorize(
        self, sentence: list[str], use_dataset_max_length: bool = True
    ) -> np.ndarray:
        """Return the vectorized sequence

        Args:
            sentence (list[str]): raw sentence from the dataset
            use_dataset_max_length (bool): whether to use the global max vector length

        Returns:
            the vectorized sequence with padding
        """
        vector_length = -1
        if use_dataset_max_length:
            vector_length = self.max_len

        indices = self._get_indices(sentence)
        vector = self._vectorize(
            indices, vector_length=vector_length, padding_index=self.vocab.pad_index
        )
        return vector

    @classmethod
    def from_serializable(cls, contents: dict) -> "TMVectorizer":
        """Instantiates the TMVectorizer from a serialized dictionary

        Args:
            contents (dict): a dictionary generated by to_serializable

        Returns:
            TMVectorizer:
        """
        vocab = Vocabulary.from_serializable(contents["vocab"])
        max_len = contents["max_len"]
        return cls(vocab=vocab, max_len=max_len)

    def to_serializable(self) -> dict:
        """Returns a dictionary that can be serialized

        Returns:
            dict: a dict contains Vocabulary instance and max_len attribute
        """
        return {"vocab": self.vocab.to_serializable(), "max_len": self.max_len}

    def save_vectorizer(self, filepath: str) -> None:
        """Dump this TMVectorizer instance to file

        Args:
            filepath (str): the path to store the file
        """
        with open(filepath, "w") as f:
            json.dump(self.to_serializable(), f)

    @classmethod
    def load_vectorizer(cls, filepath: str) -> "TMVectorizer":
        """Load TMVectorizer from a file

        Args:
            filepath (str): the path stored the file

        Returns:
            TMVectorizer:
        """
        with open(filepath) as f:
            return TMVectorizer.from_serializable(json.load(f))


class TMDataset(Dataset):
    """Dataset for text matching"""

    def __init__(self, text_df: pd.DataFrame, vectorizer: TMVectorizer) -> None:
        """
        Args:
            text_df (pd.DataFrame): a DataFrame which contains the processed data examples
            vectorizer (TMVectorizer): a TMVectorizer instance
        """
        self.text_df = text_df
        self._vectorizer = vectorizer

    def __getitem__(self, index: int) -> Tuple[np.ndarray, np.ndarray, int]:
        row = self.text_df.iloc[index]
        return (
            self._vectorizer.vectorize(row.sentence1),
            self._vectorizer.vectorize(row.sentence2),
            row.label,
        )

    def get_vectorizer(self) -> TMVectorizer:
        return self._vectorizer

    def __len__(self) -> int:
        return len(self.text_df)
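As a quick sanity check, here is a small usage sketch of Vocabulary and TMVectorizer (the sentences are made up and the printed indices are only illustrative; actual ids depend on the corpus):

sentences = [["今天", "天气", "怎么样"], ["今天", "天气", "如何"], ["今天", "吃", "什么"]]

vocab = Vocabulary.build(sentences, min_freq=1)
vectorizer = TMVectorizer(vocab, max_len=5)

indices = vectorizer.vectorize(["今天", "天气", "如何"])
print(indices)                               # e.g. [2 3 5 0 0], padded with pad_index up to max_len
print(vocab.lookup_token(int(indices[0])))   # 今天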
Model training

Define some helper functions and the metrics:
import os
import jieba


def make_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)


def tokenize(sentence: str):
    return list(jieba.cut(sentence))


def build_dataframe_from_csv(dataset_csv: str) -> pd.DataFrame:
    df = pd.read_csv(
        dataset_csv,
        sep="\t",
        header=None,
        names=["sentence1", "sentence2", "label"],
    )
    df.sentence1 = df.sentence1.apply(tokenize)
    df.sentence2 = df.sentence2.apply(tokenize)

    return df


def metrics(y: torch.Tensor, y_pred: torch.Tensor) -> Tuple[float, float, float, float]:
    TP = ((y_pred == 1) & (y == 1)).sum().float()  # True Positive
    TN = ((y_pred == 0) & (y == 0)).sum().float()  # True Negative
    FN = ((y_pred == 0) & (y == 1)).sum().float()  # False Negative
    FP = ((y_pred == 1) & (y == 0)).sum().float()  # False Positive
    p = TP / (TP + FP).clamp(min=1e-8)  # Precision
    r = TP / (TP + FN).clamp(min=1e-8)  # Recall
    F1 = 2 * r * p / (r + p).clamp(min=1e-8)  # F1 score
    acc = (TP + TN) / (TP + TN + FP + FN).clamp(min=1e-8)  # Accuracy
    return acc, p, r, F1
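For example, a quick check of metrics with a few hand-made labels (values chosen only for illustration):

y = torch.tensor([1, 1, 0, 0, 1])
y_pred = torch.tensor([1, 0, 0, 1, 1])

acc, p, r, f1 = metrics(y, y_pred)
print(f"acc={acc:.3f} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# acc=0.600 precision=0.667 recall=0.667 f1=0.667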
Define the evaluation and training functions:
from torch.utils.data import DataLoader


def evaluate(
    data_iter: DataLoader, model: nn.Module
) -> Tuple[float, float, float, float]:
    y_list, y_pred_list = [], []
    model.eval()
    for x1, x2, y in tqdm(data_iter):
        x1 = x1.to(device).long()
        x2 = x2.to(device).long()
        y = torch.LongTensor(y).to(device)

        output = model(x1, x2)

        pred = torch.argmax(output, dim=1).long()
        y_pred_list.append(pred)
        y_list.append(y)

    y_pred = torch.cat(y_pred_list, 0)
    y = torch.cat(y_list, 0)
    acc, p, r, f1 = metrics(y, y_pred)
    return acc, p, r, f1


def train(
    data_iter: DataLoader,
    model: nn.Module,
    criterion: nn.CrossEntropyLoss,
    optimizer: torch.optim.Optimizer,
    print_every: int = 500,
    verbose=True,
) -> None:
    model.train()

    for step, (x1, x2, y) in enumerate(tqdm(data_iter)):
        x1 = x1.to(device).long()
        x2 = x2.to(device).long()
        y = torch.LongTensor(y).to(device)

        output = model(x1, x2)

        loss = criterion(output, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if verbose and (step + 1) % print_every == 0:
            pred = torch.argmax(output, dim=1).long()
            acc, p, r, f1 = metrics(y, pred)

            print(
                f"TRAIN iter={step+1} loss={loss.item():.6f} accuracy={acc:.3f} precision={p:.3f} recal={r:.3f} f1 score={f1:.4f}"
            )
Define the arguments:

from argparse import Namespace

args = Namespace(
    dataset_csv="text_matching/data/lcqmc/{}.txt",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir=f"{os.path.dirname(__file__)}/model_storage",
    reload_model=False,
    cuda=True,
    learning_rate=4e-4,
    batch_size=128,
    num_epochs=10,
    max_len=50,
    embedding_dim=300,
    hidden_size=300,
    num_classes=2,
    lstm_dropout=0.8,
    dropout=0.5,
    min_freq=2,
    print_every=500,
    verbose=True,
)
The model is quite simple but works very well, so it can serve as a good baseline. During training I found that accuracy on the training set reached 100%, which indicates some overfitting, so the LSTM dropout was eventually increased to 0.8. Finally, let's see how the training goes:

make_dirs(args.save_dir)

if args.cuda:
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
else:
    device = torch.device("cpu")

print(f"Using device: {device}.")

vectorizer_path = os.path.join(args.save_dir, args.vectorizer_file)

train_df = build_dataframe_from_csv(args.dataset_csv.format("train"))
test_df = build_dataframe_from_csv(args.dataset_csv.format("test"))
dev_df = build_dataframe_from_csv(args.dataset_csv.format("dev"))

if os.path.exists(vectorizer_path):
    print("Loading vectorizer file.")
    vectorizer = TMVectorizer.load_vectorizer(vectorizer_path)
    args.vocab_size = len(vectorizer.vocab)
else:
    print("Creating a new Vectorizer.")

    train_sentences = train_df.sentence1.to_list() + train_df.sentence2.to_list()

    vocab = Vocabulary.build(train_sentences, args.min_freq)
    args.vocab_size = len(vocab)

    print(f"Builds vocabulary : {vocab}")

    vectorizer = TMVectorizer(vocab, args.max_len)

    vectorizer.save_vectorizer(vectorizer_path)

train_dataset = TMDataset(train_df, vectorizer)
test_dataset = TMDataset(test_df, vectorizer)
dev_dataset = TMDataset(dev_df, vectorizer)

train_data_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True)
dev_data_loader = DataLoader(dev_dataset, batch_size=args.batch_size)
test_data_loader = DataLoader(test_dataset, batch_size=args.batch_size)

print(f"Arguments : {args}")
model = ESIM(
    args.vocab_size,
    args.embedding_dim,
    args.hidden_size,
    args.num_classes,
    args.lstm_dropout,
    args.dropout,
)

print(f"Model: {model}")

model_saved_path = os.path.join(args.save_dir, args.model_state_file)
if args.reload_model and os.path.exists(model_saved_path):
    model.load_state_dict(torch.load(model_saved_path))
    print("Reloaded model")
else:
    print("New model")

model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)
criterion = nn.CrossEntropyLoss()

for epoch in range(args.num_epochs):
    train(
        train_data_loader,
        model,
        criterion,
        optimizer,
        print_every=args.print_every,
        verbose=args.verbose,
    )
    print("Begin evalute on dev set.")
    with torch.no_grad():
        acc, p, r, f1 = evaluate(dev_data_loader, model)
        print(
            f"EVALUATE [{epoch+1}/{args.num_epochs}] accuracy={acc:.3f} precision={p:.3f} recal={r:.3f} f1 score={f1:.4f}"
        )

model.eval()

acc, p, r, f1 = evaluate(test_data_loader, model)
print(f"TEST accuracy={acc:.3f} precision={p:.3f} recal={r:.3f} f1 score={f1:.4f}")
python .\text_matching\esim\train.py
Using device: cuda:0.
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.563 seconds.
Prefix dict has been built successfully.
Loading vectorizer file.
Arguments : Namespace(dataset_csv='text_matching/data/lcqmc/{}.txt', vectorizer_file='vectorizer.json', model_state_file='model.pth', save_dir='D:\\workspace\\nlp-in-action\\text_matching\\esim/model_storage', reload_model=False, cuda=True, learning_rate=0.0004, batch_size=128, num_epochs=10, max_len=50, embedding_dim=300, hidden_size=300, num_classes=2, lstm_dropout=0.8, dropout=0.5, min_freq=2, print_every=500, verbose=True, vocab_size=35925)
D:\workspace\nlp-in-action\.venv\lib\site-packages\torch\nn\modules\rnn.py:67: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.8 and num_layers=1
  warnings.warn("dropout option adds dropout after all but last
Model: ESIM(
  (embedding): Embedding(35925, 300)
  (lstm_a): LSTM(300, 300, batch_first=True, dropout=0.8, bidirectional=True)
  (lstm_b): LSTM(300, 300, batch_first=True, dropout=0.8, bidirectional=True)
  (lstm_v_a): LSTM(2400, 300, batch_first=True, dropout=0.8, bidirectional=True)
  (lstm_v_b): LSTM(2400, 300, batch_first=True, dropout=0.8, bidirectional=True)
  (predict): Sequential(
    (0): Linear(in_features=2400, out_features=600, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.5, inplace=False)
    (3): Linear(in_features=600, out_features=300, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.5, inplace=False)
    (6): Linear(in_features=300, out_features=2, bias=True)
  )
)
New model
 27%|██████████▋                             | 499/1866 [01:33<04:13,  5.40it/s]
TRAIN iter=500 loss=0.500601 accuracy=0.750 precision=0.714 recal=0.846 f1 score=0.7746
 54%|█████████████████████▌                  | 999/1866 [03:04<02:40,  5.41it/s]
TRAIN iter=1000 loss=0.250350 accuracy=0.883 precision=0.907 recal=0.895 f1 score=0.9007
 80%|████████████████████████████████        | 1499/1866 [04:35<01:07,  5.43it/s]
TRAIN iter=1500 loss=0.311494 accuracy=0.844 precision=0.868 recal=0.868 f1 score=0.8684
100%|████████████████████████████████████████| 1866/1866 [05:42<00:00,  5.45it/s]
Begin evalute on dev set.
100%|████████████████████████████████████████| 69/69 [00:04<00:00, 17.10it/s]
EVALUATE [1/10] accuracy=0.771 precision=0.788 recal=0.741 f1 score=0.7637
...
TRAIN iter=500 loss=0.005086 accuracy=1.000 precision=1.000 recal=1.000 f1 score=1.0000
...
EVALUATE [10/10] accuracy=0.815 precision=0.807 recal=0.829 f1 score=0.8178
100%|████████████████████████████████████████| 98/98 [00:05<00:00, 17.38it/s]
TEST accuracy=0.817 precision=0.771 recal=0.903 f1 score=0.8318

We can see that the training loss dropped to the 0.005 level for the first time and the training accuracy even reached 100%, but the final accuracy on the test set is still somewhat lower than on the training set, at only 81.7%. Even so, compared with the previous models in this series it is already quite good.
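To finish, here is a small usage sketch (not part of the original training script) showing how the trained model and vectorizer could be used to compare a new sentence pair. The example sentences are made up, and it assumes label 1 means the pair matches, as in LCQMC:

model.eval()

sent1, sent2 = "今天天气怎么样", "今天天气如何"  # hypothetical example pair
x1 = torch.tensor(vectorizer.vectorize(tokenize(sent1))).unsqueeze(0).to(device).long()
x2 = torch.tensor(vectorizer.vectorize(tokenize(sent2))).unsqueeze(0).to(device).long()

with torch.no_grad():
    logits = model(x1, x2)
    pred = torch.argmax(logits, dim=1).item()

print("match" if pred == 1 else "not match")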