The flood of spam SMS messages is a serious nuisance in daily life, and scam messages in particular threaten people's personal information and financial security. Building a mechanism that automatically intercepts and filters spam SMS therefore has clear practical value. Based on a Chinese spam SMS dataset, this article compares the spam classification performance of seven text classification algorithms: Naive Bayes, logistic regression, random forest, SVM, LSTM, BiLSTM, and BERT.
1. Dataset Setup and Analysis
The dataset contains 679,365 normal messages and 75,478 spam messages, so spam accounts for roughly 10% of all messages. The data are randomly split into a training set and a test set at a 7:3 ratio (a minimal sketch of the split follows the table below). The distribution of the two sets is shown in the following table:
| Category | Training set | Test set |
| --- | --- | --- |
| Normal SMS (positive class) | 475,560 | 203,805 |
| Spam SMS (negative class) | 52,830 | 22,648 |
| Total | 528,390 | 226,453 |
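The split itself is straightforward. The following minimal sketch reproduces the 7:3 random split described above; the raw file name messages.csv, the tab-separated label/text layout, the fixed random seed, and the stratify option are illustrative assumptions rather than details taken from the original experiment:

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed raw file: one message per line, "label<TAB>text"
data = pd.read_csv('messages.csv', names=['labels', 'text'], sep='\t')

train_data, test_data = train_test_split(
    data,
    test_size=0.3,            # 7:3 train/test split
    random_state=42,          # fixed seed so the split is reproducible
    stratify=data['labels'])  # optional: keep the ~10% spam ratio in both sets

train_data.to_csv('train.csv', sep='\t', index=False, header=False)
test_data.to_csv('test.csv', sep='\t', index=False, header=False)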
In addition, word clouds of the normal and spam messages in the training set give an intuitive picture of their textual characteristics. One word cloud was drawn from 200 words randomly sampled from the 500 most frequent words in normal messages, and another from 200 words randomly sampled from the 500 most frequent words in spam messages. The frequent terms of the two classes differ noticeably. Normal messages mostly relate to daily life, covering personal feelings ("哈哈哈", "宝宝"), current news ("记者", "发布"), and everyday matters such as travel and health ("飞机", "医疗"). Spam messages are mostly marketing related, covering promotion strength ("元起", "钜", "超值", "最低"), time pressure ("赶紧", "机会"), promotion methods ("抽奖", "话费"), and seasonal occasions ("妇女节", "三月").
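The word clouds can be produced along the following lines. This is only a sketch, not the author's original plotting script; it assumes the third-party wordcloud package and a local Chinese font file:

import random
from collections import Counter
from wordcloud import WordCloud

def draw_wordcloud(segmented_texts, out_path):
    # segmented_texts: space-joined, stop-word-filtered messages (as in the preprocessing below)
    counter = Counter(w for text in segmented_texts for w in text.split())
    top500 = [w for w, _ in counter.most_common(500)]       # 500 most frequent words
    sample = random.sample(top500, 200)                     # 200 words sampled from the top 500
    freqs = {w: counter[w] for w in sample}
    wc = WordCloud(font_path='simhei.ttf',                  # a Chinese font is needed to render the terms
                   background_color='white', width=800, height=600)
    wc.generate_from_frequencies(freqs).to_file(out_path)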
2. Algorithm Implementation
On this dataset, four traditional machine learning models are selected, Naive Bayes, logistic regression, random forest, and SVM, together with three deep learning approaches, LSTM, BiLSTM, and the pre-trained model BERT, for a comparative experiment. The strengths and weaknesses of the seven algorithms are summarized in the following table:
| Algorithm | Strengths | Weaknesses |
| --- | --- | --- |
| Naive Bayes | Solid mathematical foundation; simple to implement; efficient training and prediction. | The conditional-independence assumption rarely holds in practice, so performance suffers when features are strongly correlated; the assumed prior distribution affects results; performs poorly on class-imbalanced data. |
| Logistic regression | Simple to implement; fast to train. | Hard to fit non-linear data; performs poorly in very large feature spaces; the decision threshold is hard to choose; prone to underfitting. |
| Random forest | Training parallelizes well and is fast on large datasets; handles high-dimensional data; provides feature-importance scores. | Prone to overfitting when the data are noisy. |
| SVM | Handles both linear and non-linear data; good generalization ability. | Parameter tuning and kernel selection rely heavily on experience and are somewhat arbitrary. |
| LSTM | Exploits word-order information. | Only uses word order in the forward direction. |
| BiLSTM | Exploits context in both directions. | Needs a long training time to converge. |
| BERT | Strongest ability to capture contextual information. | The [MASK] tokens used in pre-training cause a mismatch between pre-training and fine-tuning, which can hurt performance; convergence takes even more time. |
The implementation details of each algorithm are described in turn below.
2.1 Naive Bayes
The SMS text is first segmented with the jieba tokenizer and stop words are removed; unigram and bigram features are then extracted and the segmented text is vectorized with TF-IDF; finally a Naive Bayes classifier is trained. The model is scikit-learn's MultinomialNB with default parameters: the feature likelihoods are assumed to follow a multinomial distribution, Laplace smoothing is applied (alpha=1.0), and the class priors are estimated from the class frequencies in the training data.
The code is as follows:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import jieba
import re
from time import time
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

# Read the stop-word list
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

if __name__ == '__main__':
    # Load the training set
    print('Loading train dataset ...')
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Load the test set
    print('Loading test dataset ...')
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Total number of labeled documents(train): %d .' % len(train_data))
    print('Total number of labeled documents(test): %d .' % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the number of samples of each class in the training set
    d = {'labels': train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    print(df_label)

    # Load the stop words
    print('Loading stopwords ...')
    t = time()
    stopwords = stopwordslist('stopwords.txt')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Segment the text with jieba and filter out stop words
    print('Starting word segmentation on train dataset...')
    t = time()
    X_train = X_train.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Starting word segmentation on test dataset...')
    t = time()
    X_test = X_test.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Build TF-IDF vectors over unigrams and bigrams
    print('Vectorizing train dataset...')
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Vectorizing test dataset...')
    t = time()
    X_test = tfidf.transform(X_test)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model (defaults: alpha=1.0 Laplace smoothing, class priors estimated from the data)
    print('Training model...')
    t = time()
    model = MultinomialNB()
    model.fit(X_train, y_train)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Predicting test dataset...')
    t = time()
    y_pred = model.predict(X_test)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Confusion matrix and classification report
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))
2.2 Logistic Regression
The text is vectorized in the same way as for Naive Bayes. The model is scikit-learn's LogisticRegression with default parameters: the regularization parameter C is 1.0, L2 regularization is used, and the tolerance for stopping the iterations is 0.0001.
The code is as follows:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import jieba
import re
from time import time
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

# Read the stop-word list
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

if __name__ == '__main__':
    # Load the training set
    print('Loading train dataset ...')
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Load the test set
    print('Loading test dataset ...')
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Total number of labeled documents(train): %d .' % len(train_data))
    print('Total number of labeled documents(test): %d .' % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the number of samples of each class in the training set
    d = {'labels': train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    print(df_label)

    # Load the stop words
    print('Loading stopwords ...')
    t = time()
    stopwords = stopwordslist('stopwords.txt')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Segment the text with jieba and filter out stop words
    print('Starting word segmentation on train dataset...')
    t = time()
    X_train = X_train.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Starting word segmentation on test dataset...')
    t = time()
    X_test = X_test.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Build TF-IDF vectors over unigrams and bigrams
    print('Vectorizing train dataset...')
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Vectorizing test dataset...')
    t = time()
    X_test = tfidf.transform(X_test)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model (defaults: C=1.0, L2 penalty, tol=1e-4)
    print('Training model...')
    t = time()
    model = LogisticRegression(random_state=0)
    model.fit(X_train, y_train)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Predicting test dataset...')
    t = time()
    y_pred = model.predict(X_test)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Confusion matrix and classification report
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))
2.3 Random Forest
The text is vectorized in the same way as for Naive Bayes. The model is scikit-learn's RandomForestClassifier with default parameters: 100 decision trees, no out-of-bag samples used for model evaluation, and the Gini impurity as the split criterion of the CART trees.
The code is as follows:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import jieba
import re
from time import time
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

# Read the stop-word list
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

if __name__ == '__main__':
    # Load the training set
    print('Loading train dataset ...')
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Load the test set
    print('Loading test dataset ...')
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Total number of labeled documents(train): %d .' % len(train_data))
    print('Total number of labeled documents(test): %d .' % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the number of samples of each class in the training set
    d = {'labels': train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    print(df_label)

    # Load the stop words
    print('Loading stopwords ...')
    t = time()
    stopwords = stopwordslist('stopwords.txt')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Segment the text with jieba and filter out stop words
    print('Starting word segmentation on train dataset...')
    t = time()
    X_train = X_train.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Starting word segmentation on test dataset...')
    t = time()
    X_test = X_test.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Build TF-IDF vectors over unigrams and bigrams
    print('Vectorizing train dataset...')
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Vectorizing test dataset...')
    t = time()
    X_test = tfidf.transform(X_test)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model (defaults: 100 trees, Gini criterion, oob_score=False)
    print('Training model...')
    t = time()
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Predicting test dataset...')
    t = time()
    y_pred = model.predict(X_test)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Confusion matrix and classification report
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))
2.4 SVM
The text is vectorized in the same way as for Naive Bayes. The model is scikit-learn's LinearSVC with default parameters: a linear kernel, penalty parameter C of 1.0, L2 regularization, the dual optimization formulation, a maximum of 1,000 iterations, and a stopping tolerance of 0.0001.
The code is as follows:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import jieba
import re
from time import time
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

# Read the stop-word list
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

if __name__ == '__main__':
    # Load the training set
    print('Loading train dataset ...')
    t = time()
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Load the test set
    print('Loading test dataset ...')
    t = time()
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Total number of labeled documents(train): %d .' % len(train_data))
    print('Total number of labeled documents(test): %d .' % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the number of samples of each class in the training set
    d = {'labels': train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    print(df_label)

    # Load the stop words
    print('Loading stopwords ...')
    t = time()
    stopwords = stopwordslist('stopwords.txt')
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Segment the text with jieba and filter out stop words
    print('Starting word segmentation on train dataset...')
    t = time()
    X_train = X_train.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Starting word segmentation on test dataset...')
    t = time()
    X_test = X_test.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Build TF-IDF vectors over unigrams and bigrams
    print('Vectorizing train dataset...')
    t = time()
    tfidf = TfidfVectorizer(norm='l2', ngram_range=(1, 2))
    X_train = tfidf.fit_transform(X_train)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Vectorizing test dataset...')
    t = time()
    X_test = tfidf.transform(X_test)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print(X_train.shape)
    print(X_test.shape)
    print('-----------------------------')
    print(X_train)
    print('-----------------------------')
    print(X_test)

    # Train the model (defaults: C=1.0, L2 penalty, dual form, max_iter=1000, tol=1e-4)
    print('Training model...')
    t = time()
    model = LinearSVC()
    model.fit(X_train, y_train)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    print('Predicting test dataset...')
    t = time()
    y_pred = model.predict(X_test)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    # Confusion matrix and classification report
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))
2.5 LSTM
The SMS text is first segmented with jieba and stop words are removed. The vocabulary is limited to the 50,000 most frequent words, the maximum sequence length is set to 100, and the 200-dimensional Tencent word vectors are used to build the weight matrix of the embedding layer. SpatialDropout1D with a rate of 0.2 is applied to the embedding output, randomly zeroing entire 1D feature maps. The result is fed into an LSTM layer with 300 units, followed by a fully connected layer whose softmax output gives the classification. The loss is cross-entropy, the batch size is 64, and the model is trained for 10 epochs.
The code is as follows:
# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import jieba
import re
from time import time
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils.np_utils import to_categorical
from keras.callbacks import EarlyStopping
from keras.layers import Dropout
from gensim.models import KeyedVectors
import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

# Read the stop-word list
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

if __name__ == '__main__':
    # Load the training and test sets
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print('Total number of labeled documents(train): %d .' % len(train_data))
    print('Total number of labeled documents(test): %d .' % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the number of samples of each class in the training set
    d = {'labels': train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    print(df_label)

    # Load the stop words, segment the text, and filter out stop words
    stopwords = stopwordslist('stopwords.txt')
    X_train = X_train.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    X_test = X_test.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))

    # Keep only the 50,000 most frequent words
    MAX_NB_WORDS = 50000
    # Maximum sequence length
    MAX_SEQUENCE_LENGTH = 100
    # Dimension of the embedding layer
    EMBEDDING_DIM = 200

    tokenizer = Tokenizer(num_words=MAX_NB_WORDS,
                          filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~',
                          lower=True)
    tokenizer.fit_on_texts(X_train)
    word_index = tokenizer.word_index
    print('There are %s different words.' % len(word_index))

    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)

    # Pad the sequences to a uniform length
    X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)
    X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)

    # One-hot encode the labels
    y_train = pd.get_dummies(y_train).values
    y_test = pd.get_dummies(y_test).values
    print(X_train.shape, y_train.shape)
    print(X_test.shape, y_test.shape)

    # Load the Tencent word vectors and build the embedding matrix
    wv_from_text = KeyedVectors.load_word2vec_format('tencent.txt', binary=False, unicode_errors='ignore')
    embedding_matrix = np.zeros((MAX_NB_WORDS, EMBEDDING_DIM))
    for word, i in word_index.items():
        if i >= MAX_NB_WORDS:
            continue
        try:
            embedding_matrix[i] = wv_from_text.get_vector(word)
        except KeyError:
            continue
    del wv_from_text

    # Define the model
    print('Training model...')
    t = time()
    model = Sequential()
    model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X_train.shape[1],
                        weights=[embedding_matrix], trainable=False))
    model.add(SpatialDropout1D(0.2))
    model.add(LSTM(300, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())

    epochs = 10
    batch_size = 64
    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                        validation_split=0.1,
                        callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))

    accr = model.evaluate(X_test, y_test)
    print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0], accr[1]))

    print('Predicting test dataset...')
    t = time()
    y_pred = model.predict(X_test)
    print('Done in {0} seconds\n'.format(round(time() - t, 2)))
    y_pred = y_pred.argmax(axis=1)
    y_test = y_test.argmax(axis=1)

    # Confusion matrix and classification report
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))
2.6 BiLSTM
The settings are essentially the same as for the LSTM; the only changes are that the unidirectional LSTM layer is replaced by a bidirectional one and the model is trained for 60 epochs.
The code is as follows:
# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import jieba
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Bidirectional
from keras.utils.np_utils import to_categorical
from keras.callbacks import EarlyStopping
from keras.layers import Dropout
from gensim.models import KeyedVectors
import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

# Read the stop-word list
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

if __name__ == '__main__':
    # Load the training and test sets
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print('Total number of labeled documents(train): %d .' % len(train_data))
    print('Total number of labeled documents(test): %d .' % len(test_data))

    X_train = train_data['text']
    X_test = test_data['text']
    y_train = train_data['labels']
    y_test = test_data['labels']

    # Count the number of samples of each class in the training set
    d = {'labels': train_data['labels'].value_counts().index, 'count': train_data['labels'].value_counts()}
    df_label = pd.DataFrame(data=d).reset_index(drop=True)
    print(df_label)

    # Load the stop words, segment the text, and filter out stop words
    stopwords = stopwordslist('stopwords.txt')
    X_train = X_train.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))
    X_test = X_test.apply(lambda x: ' '.join([w for w in list(jieba.cut(x)) if w not in stopwords]))

    # Keep only the 50,000 most frequent words
    MAX_NB_WORDS = 50000
    # Maximum sequence length
    MAX_SEQUENCE_LENGTH = 100
    # Dimension of the embedding layer
    EMBEDDING_DIM = 200

    tokenizer = Tokenizer(num_words=MAX_NB_WORDS,
                          filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~',
                          lower=True)
    tokenizer.fit_on_texts(X_train)
    word_index = tokenizer.word_index
    print('There are %s different words.' % len(word_index))

    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)

    # Pad the sequences to a uniform length
    X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)
    X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)

    # One-hot encode the labels
    y_train = pd.get_dummies(y_train).values
    y_test = pd.get_dummies(y_test).values
    print(X_train.shape, y_train.shape)
    print(X_test.shape, y_test.shape)

    # Load the Tencent word vectors and build the embedding matrix
    wv_from_text = KeyedVectors.load_word2vec_format('tencent.txt', binary=False, unicode_errors='ignore')
    embedding_matrix = np.zeros((MAX_NB_WORDS, EMBEDDING_DIM))
    for word, i in word_index.items():
        if i >= MAX_NB_WORDS:
            continue
        try:
            embedding_matrix[i] = wv_from_text.get_vector(word)
        except KeyError:
            continue
    del wv_from_text

    # Define the model
    model = Sequential()
    model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X_train.shape[1],
                        weights=[embedding_matrix], trainable=False))
    model.add(SpatialDropout1D(0.2))
    model.add(Bidirectional(LSTM(300)))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())

    epochs = 10
    batch_size = 64
    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                        validation_split=0.1,
                        callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

    accr = model.evaluate(X_test, y_test)
    print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0], accr[1]))

    y_pred = model.predict(X_test)
    y_pred = y_pred.argmax(axis=1)
    y_test = y_test.argmax(axis=1)

    # Confusion matrix and classification report
    conf_mat = confusion_matrix(y_test, y_pred)
    print(conf_mat)
    print('accuracy %s' % accuracy_score(y_pred, y_test))
    print(classification_report(y_test, y_pred, digits=4))
2.7 BERT
The BERT-Base-Chinese pre-trained model is fine-tuned on the training set with a learning rate of 1e-5, a maximum sequence length of 128, a batch size of 8, and 2 training epochs.
The code is as follows:
import pandas as pd
from simpletransformers.model import TransformerModel
from sklearn.metrics import f1_score, accuracy_score

def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')

if __name__ == '__main__':
    # Load the training and test sets
    train_data = pd.read_csv('train.csv', names=['labels', 'text'], sep='\t')
    test_data = pd.read_csv('test.csv', names=['labels', 'text'], sep='\t')
    print('Total number of labeled papers(train): %d .' % len(train_data))
    print('Total number of labeled papers(test): %d .' % len(test_data))

    # Build the model: bert-base-chinese
    model = TransformerModel('bert', 'bert-base-chinese', num_labels=2,
                             args={'learning_rate': 1e-5, 'num_train_epochs': 2,
                                   'reprocess_input_data': True, 'overwrite_output_dir': True,
                                   'fp16': False})
    # bert-base-multilingual: replace the first two arguments with 'bert', 'bert-base-multilingual-cased'
    # roberta: replace the first two arguments with 'roberta', 'roberta-base'
    # xlmroberta: replace the first two arguments with 'xlmroberta', 'xlm-roberta-base'

    # Fine-tune on the training set
    model.train_model(train_data)

    # Evaluate on the test set
    result, model_outputs, wrong_predictions = model.eval_model(test_data, f1=f1_multiclass, acc=accuracy_score)
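eval_model reports aggregate metrics only; to obtain the per-message predictions needed for the confusion-matrix-based metrics in Section 3, something like the following can be appended. This is a sketch that assumes the installed simpletransformers version exposes the usual predict() method, which returns predicted labels and raw model outputs:

    # Sketch: per-message predictions for a confusion matrix and classification report
    # (assumes model.predict() is available in this simpletransformers version).
    from sklearn.metrics import confusion_matrix, classification_report
    predictions, raw_outputs = model.predict(test_data['text'].tolist())
    print(confusion_matrix(test_data['labels'], predictions))
    print(classification_report(test_data['labels'], predictions, digits=4))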
3. Results Comparison
To analyze the algorithms quantitatively, let normal messages be the positive class with P (Positive) samples and spam messages the negative class with N (Negative) samples, and let T (True) and F (False) denote correctly and incorrectly classified samples. Then TP (true positives) is the number of normal messages classified correctly, FP (false positives) is the number of spam messages misclassified as normal, TN (true negatives) is the number of spam messages classified correctly, and FN (false negatives) is the number of normal messages misclassified as spam. On this basis, the experiments use the following five evaluation metrics:
(1) Weighted precision (Precision-weighted): $\text{Precision-weighted} = (\text{Precision}_P \cdot P + \text{Precision}_N \cdot N)/(P+N)$, where $\text{Precision}_P = TP/(TP+FP)$ and $\text{Precision}_N = TN/(TN+FN)$.
(2) Weighted recall (Recall-weighted): $\text{Recall-weighted} = (\text{Recall}_P \cdot P + \text{Recall}_N \cdot N)/(P+N)$, where $\text{Recall}_P = TP/(TP+FN)$ and $\text{Recall}_N = TN/(TN+FP)$.
(3) Weighted F1 score (F1-score-weighted): $\text{F1-weighted} = (F1_P \cdot P + F1_N \cdot N)/(P+N)$, where $F1_P = 2 \cdot \text{Precision}_P \cdot \text{Recall}_P/(\text{Precision}_P + \text{Recall}_P)$ and $F1_N = 2 \cdot \text{Precision}_N \cdot \text{Recall}_N/(\text{Precision}_N + \text{Recall}_N)$.
(4) False negative rate (FNR): $\text{FNR} = FN/(TP+FN)$, i.e., the number of normal messages predicted as spam divided by the actual number of normal messages.
(5) True negative rate (TNR): $\text{TNR} = TN/(TN+FP)$, i.e., the number of spam messages correctly identified divided by the actual number of spam messages; this is the recall of the spam class.
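All five metrics can be read directly off a 2x2 confusion matrix. The following is a minimal sketch; the helper name and the assumption that label 0 denotes normal messages (listed first) and label 1 denotes spam are illustrative:

import numpy as np

def spam_filter_metrics(conf_mat):
    # Assumption: conf_mat = [[TP, FN], [FP, TN]], as produced by
    # confusion_matrix(y_true, y_pred, labels=[0, 1]) with 0 = normal, 1 = spam.
    (TP, FN), (FP, TN) = np.asarray(conf_mat)
    P, N = TP + FN, TN + FP
    precision_p, precision_n = TP / (TP + FP), TN / (TN + FN)
    recall_p, recall_n = TP / (TP + FN), TN / (TN + FP)
    f1_p = 2 * precision_p * recall_p / (precision_p + recall_p)
    f1_n = 2 * precision_n * recall_n / (precision_n + recall_n)
    return {
        'precision_weighted': (precision_p * P + precision_n * N) / (P + N),
        'recall_weighted': (recall_p * P + recall_n * N) / (P + N),
        'f1_weighted': (f1_p * P + f1_n * N) / (P + N),
        'FNR': FN / (TP + FN),   # normal messages wrongly blocked
        'TNR': TN / (TN + FP),   # spam messages correctly blocked
    }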
For the spam filtering scenario, a good text classifier should make the weighted precision, weighted recall, weighted F1, and TNR as high as possible (i.e., a high spam interception rate) while keeping the FNR as low as possible (i.e., few normal messages misjudged as spam). For users, having a normal message blocked as spam is far less tolerable than having a spam message slip through; for operators, it is better to let some spam pass than to disrupt normal use.
| Model | Precision (weighted) | Recall (weighted) | F1 (weighted) | FNR | TNR |
| --- | --- | --- | --- | --- | --- |
| Naive Bayes | 0.9764 | 0.9761 | 0.9748 | 0.0010 | 0.7700 |
| Logistic regression | 0.9886 | 0.9887 | 0.9887 | 0.0061 | 0.9414 |
| Random forest | 0.9809 | 0.9808 | 0.9800 | 0.0012 | 0.8181 |
| SVM | 0.9925 | 0.9924 | 0.9924 | 0.0052 | 0.9713 |
| LSTM | 0.9963 | 0.9963 | 0.9963 | 0.0015 | 0.9771 |
| BiLSTM | 0.9964 | 0.9964 | 0.9964 | 0.0009 | 0.9720 |
| BERT | 0.9991 | 0.9991 | 0.9991 | 0.0002 | 0.9926 |
The table above shows the results of the seven text classification algorithms. Several observations can be made.
First, BERT achieves the highest weighted F1 and TNR and the lowest FNR, so it filters spam best. A likely reason is that pre-training on large general-purpose corpora gives BERT a stronger ability to capture textual features.
Second, BiLSTM and LSTM have nearly identical weighted F1, so their overall classification quality is similar, but they differ on FNR and TNR: BiLSTM misclassifies fewer normal messages (lower FNR), while LSTM intercepts a larger share of spam (higher TNR).
Third, SVM and logistic regression have similar weighted F1, but SVM is slightly better overall, edging out logistic regression on all five metrics. A possible explanation is that SVM relies only on the support vectors, the few samples most relevant to the decision boundary, whereas logistic regression uses all samples and is therefore more sensitive to outliers and to the class imbalance, which hurts its performance.
Fourth, Naive Bayes and random forest perform relatively poorly on weighted F1 and TNR. A likely reason is that the class imbalance affects both models, causing them to overfit somewhat to the normal-message class. In addition, the conditional-independence assumption of Naive Bayes does not hold in practice, which further limits its classification performance.