Contents
1. Reading the Data
2. Train/Test Split
3. Text Vectorization
4. Building the CNN Model
5. Training and Testing
Reference: 基于深度学习的自然语言处理 (Natural Language Processing Based on Deep Learning)
1. Reading the Data

Data file: yelp_labelled.txt, 1000 sentences, each followed by a tab-separated 0/1 sentiment label.
import numpy as np
import pandas as pd

data = pd.read_csv("yelp_labelled.txt", sep="\t", names=["sentence", "label"])
data.head()  # 1000 rows in total

# features X and labels y
sentence = data["sentence"].values
label = data["label"].values
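For orientation, the file is plain text with one example per line: the sentence, a tab, then the label. A minimal sanity check after loading (the sample rows in the comment are illustrative, not guaranteed file contents):

# Expected layout of yelp_labelled.txt (tab-separated), e.g.:
#   Wow... Loved this place.<TAB>1
#   Crust is not good.<TAB>0
print(data.shape)                    # (1000, 2)
print(data["label"].value_counts())  # quick class-balance check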
2. Train/Test Split

# split into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(sentence, label, test_size=0.3, random_state=1)
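With test_size=0.3, 700 of the 1000 sentences go to training and 300 to testing:

print(len(X_train), len(X_test))  # 700 300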
3. Text Vectorization

Train the tokenizer, then convert the text into sequences of token ids:
# text vectorization
import keras
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=6000)
tokenizer.fit_on_texts(X_train)                  # fit the tokenizer on the training set only
X_train = tokenizer.texts_to_sequences(X_train)  # convert to [[ids...], [ids...], ...]
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 corresponds to no word; it is reserved for padding
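A quick peek at what the tokenizer produced; the printed ids below are purely illustrative, since word_index depends on the fitted split (the Embedding layer's parameter count in the model summary later implies vocab_size came out to 1676 here):

print(vocab_size)                                           # e.g. 1676
print(tokenizer.texts_to_sequences(["the food was good"]))  # e.g. [[1, 30, 11, 25]]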
Pad the id sequences so they all share the same length:

maxlen = 100

# pad so every sentence has equal length
from keras.preprocessing.sequence import pad_sequences

X_train = pad_sequences(X_train, maxlen=maxlen, padding="post")  # "post" pads zeros at the end; "pre" pads at the front
X_test = pad_sequences(X_test, maxlen=maxlen, padding="post")
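After padding, both sets are dense integer matrices of width maxlen:

print(X_train.shape)  # (700, 100)
print(X_test.shape)   # (300, 100)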
4. Building the CNN Model

from keras import layers

embeddings_dim = 150
filters = 64
kernel_size = 5
batch_size = 64

nn_model = keras.Sequential()
nn_model.add(layers.Embedding(input_dim=vocab_size, output_dim=embeddings_dim, input_length=maxlen))
nn_model.add(layers.Conv1D(filters=filters, kernel_size=kernel_size, activation="relu"))
nn_model.add(layers.GlobalMaxPool1D())
nn_model.add(layers.Dropout(0.3))
# GlobalMaxPool1D above drops the time dimension; the Lambda layer below adds a trailing dimension back
nn_model.add(layers.Lambda(lambda x: keras.backend.expand_dims(x, axis=-1)))
nn_model.add(layers.Conv1D(filters=filters, kernel_size=kernel_size, activation="relu"))
nn_model.add(layers.GlobalMaxPool1D())
nn_model.add(layers.Dropout(0.3))
nn_model.add(layers.Dense(10, activation="relu"))
nn_model.add(layers.Dense(1, activation="sigmoid"))  # binary classification: sigmoid; multi-class: softmax

Reference articles: Embedding层详解; Keras: GlobalMaxPooling vs. MaxPooling
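To make the pooling distinction from the second reference concrete, here is a small shape experiment (a sketch assuming a TensorFlow 2.x-style Keras, where layers can be called eagerly on arrays):

import numpy as np
from keras import layers

x = np.random.rand(2, 96, 64).astype("float32")   # (batch, steps, channels)
print(layers.MaxPooling1D(pool_size=2)(x).shape)  # (2, 48, 64): pools locally, keeps the time axis
print(layers.GlobalMaxPool1D()(x).shape)          # (2, 64): one max over the entire time axis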
Configure the model:

nn_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
nn_model.summary()

from keras.utils import plot_model
plot_model(nn_model, to_file="model.jpg")  # draw the model architecture to a file (requires pydot and graphviz)

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_4 (Embedding)      (None, 100, 150)          251400
_________________________________________________________________
conv1d_8 (Conv1D)            (None, 96, 64)            48064
_________________________________________________________________
global_max_pooling1d_7 (Glob (None, 64)                0
_________________________________________________________________
dropout_7 (Dropout)          (None, 64)                0
_________________________________________________________________
lambda_4 (Lambda)            (None, 64, 1)             0
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 60, 64)            384
_________________________________________________________________
global_max_pooling1d_8 (Glob (None, 64)                0
_________________________________________________________________
dropout_8 (Dropout)          (None, 64)                0
_________________________________________________________________
dense_6 (Dense)              (None, 10)                650
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 11
=================================================================
Total params: 300,509
Trainable params: 300,509
Non-trainable params: 0
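The parameter counts are easy to verify by hand; in particular 251400 / 150 = 1676, which reveals the fitted vocab_size on this split:

assert 1676 * 150 == 251400         # Embedding: vocab_size * embeddings_dim
assert (150 * 5 + 1) * 64 == 48064  # Conv1D: (in_channels * kernel_size + 1 bias) * filters
assert (1 * 5 + 1) * 64 == 384      # second Conv1D sees 1 input channel after expand_dims
assert 64 * 10 + 10 == 650          # Dense(10)
assert 10 * 1 + 1 == 11             # Dense(1)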
5. Training and Testing

history = nn_model.fit(X_train, y_train, batch_size=batch_size, epochs=50, verbose=2,
                       validation_data=(X_test, y_test))
# verbose controls logging: 0 = silent, 1 = progress bar, 2 = one line per epoch

loss, accuracy = nn_model.evaluate(X_train, y_train, verbose=1)
print("Training set: loss {0:.3f}, accuracy {1:.3f}".format(loss, accuracy))
loss, accuracy = nn_model.evaluate(X_test, y_test, verbose=1)
print("Test set: loss {0:.3f}, accuracy {1:.3f}".format(loss, accuracy))
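history.history is a plain dict of per-epoch metrics; with metrics=["accuracy"] and validation_data set, it holds "loss", "accuracy", "val_loss", and "val_accuracy", which is exactly what the DataFrame plot below draws:

print(history.history.keys())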
# plot the training curves
from matplotlib import pyplot as plt

pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1)  # set the vertical range to [0, 1]
plt.show()

Output:
Epoch 1/50
11/11 - 1s - loss: 0.6933 - accuracy: 0.5014 - val_loss: 0.6933 - val_accuracy: 0.4633
Epoch 2/50
11/11 - 0s - loss: 0.6931 - accuracy: 0.5214 - val_loss: 0.6935 - val_accuracy: 0.4633
Epoch 3/50
11/11 - 1s - loss: 0.6930 - accuracy: 0.5257 - val_loss: 0.6936 - val_accuracy: 0.4633
... (intermediate epochs omitted)
Epoch 48/50
11/11 - 0s - loss: 0.0024 - accuracy: 1.0000 - val_loss: 0.7943 - val_accuracy: 0.7600
Epoch 49/50
11/11 - 1s - loss: 0.0016 - accuracy: 1.0000 - val_loss: 0.7970 - val_accuracy: 0.7600
Epoch 50/50
11/11 - 0s - loss: 0.0027 - accuracy: 1.0000 - val_loss: 0.7994 - val_accuracy: 0.7600
22/22 [==============================] - 0s 4ms/step - loss: 9.0586e-04 - accuracy: 1.0000
Training set: loss 0.001, accuracy 1.000
10/10 [==============================] - 0s 5ms/step - loss: 0.7994 - accuracy: 0.7600
Test set: loss 0.799, accuracy 0.760

Training accuracy is essentially perfect (1.000) while test accuracy is only 0.760, so the model is clearly overfitting: it memorizes the training set but generalizes poorly.
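One common mitigation (a sketch only; the original run above did not use it) is early stopping on val_loss in place of the fixed 50-epoch fit call:

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
history = nn_model.fit(X_train, y_train, batch_size=batch_size, epochs=50, verbose=2,
                       validation_data=(X_test, y_test), callbacks=[early_stop])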
Finally, a quick ad-hoc test:

text = ["i am not very good.", "i am very good."]
x = tokenizer.texts_to_sequences(text)
x = pad_sequences(x, maxlen=maxlen, padding="post")
pred = nn_model.predict(x)
print("Predicted class for '{}':".format(text[0]), 1 if pred[0][0] > 0.5 else 0)
print("Predicted class for '{}':".format(text[1]), 1 if pred[1][0] > 0.5 else 0)

Output:
Predicted class for 'i am not very good.': 0
Predicted class for 'i am very good.': 1