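CRNN is a classic and widely used recognition algorithm in the OCR field. Its theoretical basis is covered in my previous article; this article focuses on the code implementation of CRNN and on its recognition results.

Data processing

We synthesized a large batch of text images ourselves with image-processing techniques, 3.6 million samples in total:

[Figure: samples of the generated text images]

We split the samples into a training set and a test set at a ratio of 10:1 and stored each as a separate text file. The labels in the text files have the following format:

[Figure: label format in the text files]

What we have at this point is the raw dataset. For deep learning on images, the raw dataset is usually converted to lmdb format to make the subsequent training easier, so we convert this dataset to lmdb as well. The code below performs the conversion. The idea is simple: read each image together with its text label, store the pair in a dictionary (cache), and then use the put function of the lmdb package to write the cached key/value pairs into the lmdb store. The cache is flushed with put every time it has collected 1000 samples.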
import lmdb
import cv2
import numpy as np
import os


def checkImageIsValid(imageBin):
    if imageBin is None:
        return False
    try:
        # np.frombuffer replaces the deprecated np.fromstring
        imageBuf = np.frombuffer(imageBin, dtype=np.uint8)
        img = cv2.imdecode(imageBuf, cv2.IMREAD_GRAYSCALE)
        imgH, imgW = img.shape[0], img.shape[1]
    except Exception:
        return False
    else:
        if imgH * imgW == 0:
            return False
    return True


def writeCache(env, cache):
    with env.begin(write=True) as txn:
        for k, v in cache.items():
            # lmdb stores bytes, so encode str keys and values first
            key = k.encode('utf-8')
            value = v if isinstance(v, bytes) else v.encode('utf-8')
            txn.put(key, value)


def createDataset(outputPath, imagePathList, labelList, lexiconList=None, checkValid=True):
    """
    Create LMDB dataset for CRNN training.

    ARGS:
        outputPath    : LMDB output path
        imagePathList : list of image path
        labelList     : list of corresponding groundtruth texts
        lexiconList   : (optional) list of lexicon lists
        checkValid    : if true, check the validity of every image
    """
    assert (len(imagePathList) == len(labelList))
    nSamples = len(imagePathList)
    env = lmdb.open(outputPath, map_size=1099511627776)
    cache = {}
    cnt = 1
    for i in range(nSamples):
        # each line is "<image path> <label>"; keep only the path
        imagePath = ''.join(imagePathList[i]).split()[0].replace('\n', '').replace('\r\n', '')
        # print(imagePath)
        label = ''.join(labelList[i])
        print(label)
        # if not os.path.exists(imagePath):
        #     print('%s does not exist' % imagePath)
        #     continue

        # read the raw image bytes (binary mode)
        with open('.' + imagePath, 'rb') as f:
            imageBin = f.read()

        if checkValid:
            if not checkImageIsValid(imageBin):
                print('%s is not a valid image' % imagePath)
                continue
        imageKey = 'image-%09d' % cnt
        labelKey = 'label-%09d' % cnt
        cache[imageKey] = imageBin
        cache[labelKey] = label
        if lexiconList:
            lexiconKey = 'lexicon-%09d' % cnt
            cache[lexiconKey] = ' '.join(lexiconList[i])
        # flush the cache to lmdb every 1000 samples
        if cnt % 1000 == 0:
            writeCache(env, cache)
            cache = {}
            print('Written %d / %d' % (cnt, nSamples))
        cnt += 1
        print(cnt)
    nSamples = cnt - 1
    cache['num-samples'] = str(nSamples)
    writeCache(env, cache)
    print('Created dataset with %d samples' % nSamples)


OUT_PATH = '../crnn_train_lmdb'
IN_PATH = './train.txt'

if __name__ == '__main__':
    outputPath = OUT_PATH
    if not os.path.exists(OUT_PATH):
        os.mkdir(OUT_PATH)
    with open(IN_PATH) as imgdata:
        imagePathList = list(imgdata)

    labelList = []
    for line in imagePathList:
        word = line.split()[1]
        labelList.append(word)
    createDataset(outputPath, imagePathList, labelList)

Running the code above produces the lmdb databases for the training set and the test set.

One more step of the data preparation deserves emphasis: turning the text labels into numbers, i.e. representing every character (Chinese characters, English letters, punctuation marks) by an integer. For example, “我” maps to id 1, “l” to id 1000, “” to id 90, and so on. A dictionary is all that is needed to store this encoding. During training the labels are first encoded; at prediction time the network output is decoded back into text.
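The label-to-id mapping described above can be sketched in plain Python, without PyTorch. This is only an illustration under assumptions: the four-character alphabet and the helper names build_char_table and encode_batch are made up for the example and are not part of the project. It follows the same convention as the converter class that comes next, where id 0 is kept free for the CTC blank.

```python
def build_char_table(alphabet):
    """Map every character to an integer id; 0 stays reserved for the CTC blank."""
    return {ch: i + 1 for i, ch in enumerate(alphabet)}


def encode_batch(texts, table):
    """Flatten a batch of strings into one id list plus the length of each string."""
    flat_ids, lengths = [], []
    for text in texts:
        flat_ids.extend(table[ch] for ch in text)
        lengths.append(len(text))
    return flat_ids, lengths


# Toy alphabet (an assumption for this sketch, not the real 6732-class alphabet)
alphabet = "abc我"
table = build_char_table(alphabet)
flat_ids, lengths = encode_batch(["ab", "我c"], table)
print(flat_ids, lengths)
```

The flat id list together with the per-label lengths is the layout that a CTC loss expects for its targets, which is why the encoder returns both.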
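import torch


class strLabelConverter(object):
    """Convert between str and label.

    NOTE:
        Insert `blank` to the alphabet for CTC.

    Args:
        alphabet (str): set of the possible characters.
        ignore_case (bool, default=False): whether or not to ignore all of the case.
    """

    def __init__(self, alphabet, ignore_case=False):
        self._ignore_case = ignore_case
        if self._ignore_case:
            alphabet = alphabet.lower()
        self.alphabet = alphabet + '-'  # for `-1` index

        self.dict = {}
        for i, char in enumerate(alphabet):
            # NOTE: 0 is reserved for 'blank' required by wrap_ctc
            self.dict[char] = i + 1

    def encode(self, text):
        """Support batch or single str.

        Args:
            text (str or list of str): texts to convert.

        Returns:
            torch.IntTensor [length_0 + length_1 + ... length_{n - 1}]: encoded texts.
            torch.IntTensor [n]: length of each text.
        """
        length = []
        result = []
        for item in text:
            # labels read back from lmdb are bytes; decode them first
            if isinstance(item, bytes):
                item = item.decode('utf-8', 'strict')
            length.append(len(item))
            for char in item:
                index = self.dict[char]
                result.append(index)
        text = result
        # print(text, length)
        return (torch.IntTensor(text), torch.IntTensor(length))

    def decode(self, t, length, raw=False):
        """Decode encoded texts back into strs.

        Args:
            torch.IntTensor [length_0 + length_1 + ... length_{n - 1}]: encoded texts.
            torch.IntTensor [n]: length of each text.

        Raises:
            AssertionError: when the texts and its length does not match.

        Returns:
            text (str or list of str): texts to convert.
        """
        if length.numel() == 1:
            length = int(length[0])
            assert t.numel() == length, "text with length: {} does not match declared length: {}".format(t.numel(), length)
            if raw:
                return ''.join([self.alphabet[i - 1] for i in t])
            else:
                char_list = []
                for i in range(length):
                    # skip blanks (0) and collapse consecutive repeats
                    if t[i] != 0 and (not (i > 0 and t[i - 1] == t[i])):
                        char_list.append(self.alphabet[t[i] - 1])
                return ''.join(char_list)
        else:
            # batch mode
            assert t.numel() == length.sum(), "texts with length: {} does not match declared length: {}".format(t.numel(), length.sum())
            texts = []
            index = 0
            for i in range(length.numel()):
                l = int(length[i])
                texts.append(
                    self.decode(
                        t[index:index + l], torch.IntTensor([l]), raw=raw))
                index += l
            return texts

Network design

According to the CRNN paper, CRNN is built from three parts, CNN -> RNN -> CTC, which correspond to the convolutional layers, the recurrent layers and the transcription layer. The CNN part extracts low-level features; the RNN part uses a BiLSTM to learn the sequential context and predict the label distribution; CTC aligns the sequence and outputs the prediction.

[Figure: overall CRNN pipeline]

To feed the features into the recurrent layers, the following processing is applied:

First the image is scaled to 32×W×3.
After the CNN it becomes 1×(W/4)×512.
Then, for the LSTM, setting T = W/4 and D = 512 lets the features be fed into the LSTM.

That is the ideal procedure during training. The network input described in the CRNN paper, however, is a normalized 100×32 grayscale image, i.e. the height is fixed at 32 pixels.

[Figure: CRNN network architecture]

The CNN follows the classic VGG16. Note that in the 3rd and 4th max pooling layers CRNN uses a rectangular 1×2 pooling window (w×h) instead of the square 2×2 window of the classic VGG16. The reason for this change is that most text images are short and wide, so their feature maps are short and wide rectangles as well; a 1×2 pooling window suits English letter recognition better, for example when distinguishing i from l. The VGG16 part also adds BatchNormalization modules to speed up convergence. One more point: the input of CRNN is a grayscale image, i.e. the image depth is 1. The output of the CNN part is a 512×1×16 (c×h×w) feature map.

[Figure: CNN feature sequence]

Next comes the RNN part. It uses a bidirectional LSTM with 256 hidden units, and CRNN stacks two BiLSTM layers to form it. The output of the RNN part has shape [s, b, class_num], where class_num is the total number of character classes.

Note that the LSTM module in PyTorch expects a 3-dimensional tensor as input, and the meaning of each dimension must not be mixed up. The first dimension is the sequence, the second is the mini-batch, and the third holds the elements of the input. If the application does not use mini-batches, that dimension can be set to 1, but it still has to be present.

LSTM input:

input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence.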
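The input can also be a packed variable length sequence.
input shape (a, b, c)
a: seq_len - sequence length
b: batch
c: input_size - number of input features

To satisfy this input requirement, the CNN output has to be rearranged into the form [seq_len, batch, input_size]. Concretely, the squeeze function first removes the h dimension, and the permute function then reorders the remaining dimensions from [b, c, w] to [w, b, c], which is [seq_len, batch, input_size]. The resulting size is [16, batch, 512], and this tensor can then be fed into the RNN layer.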
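The sequence length that comes out of the CNN can be checked with plain arithmetic before any tensor is created. The sketch below is an assumption-laden illustration: it hard-codes the kernel, stride and padding values as they appear in the Vgg_16 class listed later in this article, and the helper names conv_out and vgg_output_hw are invented for the example. It applies the standard output-size formula floor((size + 2*pad - kernel) / stride) + 1 to height and width separately.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a conv or max-pool layer along one axis (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def vgg_output_hw(h, w):
    """Trace (height, width) through the layer settings of the Vgg_16 class."""
    # (kernel_h, kernel_w, stride_h, stride_w, padding)
    layers = [
        (3, 3, 1, 1, 1),  # convolution1
        (2, 2, 2, 2, 0),  # pooling1
        (3, 3, 1, 1, 1),  # convolution2
        (2, 2, 2, 2, 0),  # pooling2
        (3, 3, 1, 1, 1),  # convolution3
        (3, 3, 1, 1, 1),  # convolution4
        (1, 2, 2, 1, 0),  # pooling3: MaxPool2d((1, 2), stride=(2, 1))
        (3, 3, 1, 1, 1),  # convolution5
        (3, 3, 1, 1, 1),  # convolution6
        (1, 2, 2, 1, 0),  # pooling4: MaxPool2d((1, 2), stride=(2, 1))
        (2, 2, 1, 1, 0),  # convolution7 (kernel 2, no padding)
    ]
    for kh, kw, sh, sw, pad in layers:
        h = conv_out(h, kh, sh, pad)
        w = conv_out(w, kw, sw, pad)
    return h, w


print(vgg_output_hw(32, 100))
print(vgg_output_hw(32, 76))
```

Under these assumptions a 32×100 input ends up with height 1 and width 22, which agrees with the w = 22 comment in trainBatch, while a width of 16 would correspond to an input about 76 pixels wide. The height collapsing to exactly 1 is what the assert h == 1 check in the forward pass relies on.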
x = self.cnn(x)
b, c, h, w = x.size()
# print(x.size()): b,c,h,w
assert h == 1  # the height of conv must be 1
x = x.squeeze(2)  # remove h dimension, b * 512 * width
x = x.permute(2, 0, 1)  # [b, c, w] -> [w, b, c] = [seq_len, batch, input_size]
x = self.rnn(x)

The output format of the RNN layer is shown below. Because we use a bidirectional LSTM, the feature dimension of the output is hidden_unit * 2.

Outputs: output, (h_n, c_n)
output of shape (seq_len, batch, num_directions * hidden_size)
h_n of shape (num_layers * num_directions, batch, hidden_size)
c_n of shape (num_layers * num_directions, batch, hidden_size)

A linear layer, self.embedding1 = torch.nn.Linear(hidden_unit * 2, 512), then brings the output dimension back to 512 before it is fed into the second LSTM layer. The second LSTM layer is followed by another linear layer, torch.nn.Linear(hidden_unit * 2, class_num), so that the output of the whole RNN part has one channel per character class.

import torch
import torch.nn.functional as F


class Vgg_16(torch.nn.Module):

    def __init__(self):
        super(Vgg_16, self).__init__()
        self.convolution1 = torch.nn.Conv2d(1, 64, 3, padding=1)
        self.pooling1 = torch.nn.MaxPool2d(2, stride=2)
        self.convolution2 = torch.nn.Conv2d(64, 128, 3, padding=1)
        self.pooling2 = torch.nn.MaxPool2d(2, stride=2)
        self.convolution3 = torch.nn.Conv2d(128, 256, 3, padding=1)
        self.convolution4 = torch.nn.Conv2d(256, 256, 3, padding=1)
        self.pooling3 = torch.nn.MaxPool2d((1, 2), stride=(2, 1))  # notice stride of the non-square pooling
        self.convolution5 = torch.nn.Conv2d(256, 512, 3, padding=1)
        self.BatchNorm1 = torch.nn.BatchNorm2d(512)
        self.convolution6 = torch.nn.Conv2d(512, 512, 3, padding=1)
        self.BatchNorm2 = torch.nn.BatchNorm2d(512)
        self.pooling4 = torch.nn.MaxPool2d((1, 2), stride=(2, 1))
        self.convolution7 = torch.nn.Conv2d(512, 512, 2)

    def forward(self, x):
        x = F.relu(self.convolution1(x), inplace=True)
        x = self.pooling1(x)
        x = F.relu(self.convolution2(x), inplace=True)
        x = self.pooling2(x)
        x = F.relu(self.convolution3(x), inplace=True)
        x = F.relu(self.convolution4(x), inplace=True)
        x = self.pooling3(x)
        x = self.convolution5(x)
        x = F.relu(self.BatchNorm1(x), inplace=True)
        x = self.convolution6(x)
        x = F.relu(self.BatchNorm2(x), inplace=True)
        x = self.pooling4(x)
        x = F.relu(self.convolution7(x), inplace=True)
        return x  # b*512x1x16


class RNN(torch.nn.Module):

    def __init__(self, class_num, hidden_unit):
        super(RNN, self).__init__()
        self.Bidirectional_LSTM1 = torch.nn.LSTM(512, hidden_unit, bidirectional=True)
        self.embedding1 = torch.nn.Linear(hidden_unit * 2, 512)
        self.Bidirectional_LSTM2 = torch.nn.LSTM(512, hidden_unit, bidirectional=True)
        self.embedding2 = torch.nn.Linear(hidden_unit * 2, class_num)

    def forward(self, x):
        x = self.Bidirectional_LSTM1(x)  # LSTM output: output, (h_n, c_n)
        T, b, h = x[0].size()  # x[0]: (seq_len, batch, num_directions * hidden_size)
        x = self.embedding1(x[0].view(T * b, h))  # pytorch view() reshape as [T * b, nOut]
        x = x.view(T, b, -1)  # [16, b, 512]
        x = self.Bidirectional_LSTM2(x)
        T, b, h = x[0].size()
        x = self.embedding2(x[0].view(T * b, h))
        x = x.view(T, b, -1)
        return x  # [16, b, class_num]


# output: [s, b, class_num]
class CRNN(torch.nn.Module):

    def __init__(self, class_num, hidden_unit=256):
        super(CRNN, self).__init__()
        self.cnn = torch.nn.Sequential()
        self.cnn.add_module('vgg_16', Vgg_16())
        self.rnn = torch.nn.Sequential()
        self.rnn.add_module('rnn', RNN(class_num, hidden_unit))

    def forward(self, x):
        x = self.cnn(x)
        b, c, h, w = x.size()
        # print(x.size()): b,c,h,w
        assert h == 1  # the height of conv must be 1
        x = x.squeeze(2)  # remove h dimension, b * 512 * width
        x = x.permute(2, 0, 1)  # [b, c, w] -> [w, b, c] = [seq_len, batch, input_size]
        # x = x.transpose(0, 2)
        # x = x.transpose(1, 2)
        x = self.rnn(x)
        return x

Loss function design

With the CNN and RNN parts in place, we now design the transcription layer, which translates the output of the RNN part into the final recognized text and thereby makes variable-length text recognition possible. The PyTorch version used here has no built-in CTC loss (torch.nn.CTCLoss only arrived in PyTorch 1.0), so the loss is built on a third-party CTC loss implementation from GitHub. It is installed as follows:

git clone https://github.com/SeanNaren/warp-ctc.git
cd warp-ctc
mkdir build; cd build
cmake ..
make
cd ../pytorch_binding/
python setup.py install
cd ../build
cp libwarpctc.so ../../usr/lib

Once the installation has finished, the CTC loss can be called directly. A small example illustrates its usage.

import torch
from warpctc_pytorch import CTCLoss
ctc_loss = CTCLoss()
# expected shape of seqLength x batchSize x alphabet_size
probs = torch.FloatTensor([[[0.1, 0.6, 0.1, 0.1, 0.1], [0.1, 0.1, 0.6, 0.1, 0.1]]]).transpose(0, 1).contiguous()
labels = torch.IntTensor([1, 2])
label_sizes = torch.IntTensor([2])
probs_sizes = torch.IntTensor([2])
probs.requires_grad_(True)  # tells autograd to compute gradients for probs
cost = ctc_loss(probs, labels, probs_sizes, label_sizes)
cost.backward()

CTCLoss(size_average=False, length_average=False)
    # size_average (bool): normalize the loss by the batch size (default: False)
    # length_average (bool): normalize the loss by the total number of frames in the batch. If True, supersedes size_average (default: False)

forward(acts, labels, act_lens, label_lens)
    # acts: Tensor of (seqLength x batch x outputDim) containing output activations from network (before softmax)
    # labels: 1 dimensional Tensor containing all the targets of the batch in one large sequence
    # act_lens: Tensor of size (batch) containing size of each output sequence from the network
    # label_lens: Tensor of (batch) containing label length of each example

As the code shows, CTCLoss takes [probs, labels, probs_sizes, label_sizes] as input, i.e. the predictions, the labels, the length of each prediction sequence and the length of each label. We now design the CTC loss for CRNN along the same lines.
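One constraint is worth checking when the size arguments are assembled: CTC can only align a label if the prediction sequence has at least as many frames as the label has characters, plus one extra frame for every pair of identical neighbouring characters (a blank has to separate them). The plain-Python sketch below builds the three size-related arguments for a batch and performs that check. The function names ctc_min_frames and build_ctc_sizes are hypothetical helpers for this illustration, not part of warp-ctc or of the project code.

```python
def ctc_min_frames(label_ids):
    """Smallest number of frames that can still emit this label under CTC."""
    repeats = sum(1 for a, b in zip(label_ids, label_ids[1:]) if a == b)
    return len(label_ids) + repeats


def build_ctc_sizes(batch_labels, seq_len):
    """Return (flat labels, label sizes, prediction sizes) for one batch.

    batch_labels: list of id lists, one per sample (ids start at 1, 0 is blank)
    seq_len:      number of frames the network outputs per sample
    """
    for label in batch_labels:
        needed = ctc_min_frames(label)
        if needed > seq_len:
            raise ValueError(
                "label needs %d frames but only %d are available" % (needed, seq_len))
    flat_labels = [i for label in batch_labels for i in label]
    label_sizes = [len(label) for label in batch_labels]
    preds_sizes = [seq_len] * len(batch_labels)
    return flat_labels, label_sizes, preds_sizes


flat_labels, label_sizes, preds_sizes = build_ctc_sizes([[1, 2], [3, 3, 4]], seq_len=16)
print(flat_labels, label_sizes, preds_sizes)
```

With a fixed-width input every sample produces the same number of frames, which is why the prediction sizes are simply the sequence length repeated once per sample.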
preds = net(image)
preds_size = Variable(torch.IntTensor([preds.size(0)] * batch_size))  # preds.size(0) = w = 16
cost = criterion(preds, text, preds_size, length) / batch_size  # length holds the length of every text label; dividing by batch_size gives the average loss
cost.backward()

Network training design

Next we flesh out the actual training procedure. We also wrote a trainBatch function that performs one gradient update on a batch.

def trainBatch(net, criterion, optimizer, train_iter):
    data = next(train_iter)
    cpu_images, cpu_texts = data
    batch_size = cpu_images.size(0)
    lib.dataset.loadData(image, cpu_images)
    t, l = converter.encode(cpu_texts)
    lib.dataset.loadData(text, t)
    lib.dataset.loadData(length, l)

    preds = net(image)
    # print("preds.size=%s" % preds.size)
    preds_size = Variable(torch.IntTensor([preds.size(0)] * batch_size))  # preds.size(0) = w = 22
    cost = criterion(preds, text, preds_size, length) / batch_size  # length: a list that contains the len of text label in a batch
    net.zero_grad()
    cost.backward()
    optimizer.step()
    return cost

The overall training flow is: create the CTC loss object -> create the CRNN network object -> initialize the image, text and length tensors -> initialize the optimizer, then start looping over the epochs; every given number of iterations the model is validated and saved. The CRNN paper uses the Adadelta optimizer, but in my experiments Adadelta converged very slowly, so I switched to RMSprop, which sped up convergence considerably.

criterion = CTCLoss()

net = Net.CRNN(n_class)
print(net)

net.apply(lib.utility.weights_init)

image = torch.FloatTensor(Config.batch_size, 3, Config.img_height, Config.img_width)
text = torch.IntTensor(Config.batch_size * 5)
length = torch.IntTensor(Config.batch_size)

if cuda:
    net.cuda()
    image = image.cuda()
    criterion = criterion.cuda()

image = Variable(image)
text = Variable(text)
length = Variable(length)

loss_avg = lib.utility.averager()

optimizer = optim.RMSprop(net.parameters(), lr=Config.lr)
# optimizer = optim.Adadelta(net.parameters(), lr=Config.lr)
# optimizer = optim.Adam(net.parameters(), lr=Config.lr,
#                        betas=(Config.beta1, 0.999))

for epoch in range(Config.epoch):
    train_iter = iter(train_loader)
    i = 0
    while i < len(train_loader):
        for p in net.parameters():
            p.requires_grad = True
        net.train()

        cost = trainBatch(net, criterion, optimizer, train_iter)
        loss_avg.add(cost)
        i += 1

        if i % Config.display_interval == 0:
            print('[%d/%d][%d/%d] Loss: %f' %
                  (epoch, Config.epoch, i, len(train_loader), loss_avg.val()))
            loss_avg.reset()

        if i % Config.test_interval == 0:
            val(net, test_dataset, criterion)

        # do checkpointing
        if i % Config.save_interval == 0:
            torch.save(
                net.state_dict(), '{0}/netCRNN_{1}_{2}.pth'.format(Config.model_dir, epoch, i))

Training process and test design
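The figure below shows the CRNN training process. There are 6732 character classes, training runs for 20 epochs in total, and batch_size is set to 64, which gives 51244 iterations per epoch.

[Figure: CRNN training process]

After 4 epochs the loss has dropped to about 0.1 and the accuracy has risen to 0.98.

[Figure: loss and accuracy during training]

Next we write the inference code. It first initializes the CRNN network and loads the trained model, then reads the image to be recognized and resizes it to a grayscale image of height 32, feeds that image into the network, and finally decodes the network output into text.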
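The last step, decoding the network output into text, follows the usual greedy CTC rule: take the best class for every frame, collapse consecutive repeats, and drop the blanks. The sketch below shows that rule on nested Python lists so that it runs without PyTorch; the function name greedy_ctc_decode, the three-letter alphabet and the hand-written scores are assumptions made for the illustration. It applies the same skip condition as strLabelConverter.decode with raw=False.

```python
def greedy_ctc_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding of per-frame class scores.

    frame_scores: list of frames, each a list of scores with index 0 = blank
    alphabet:     string whose character k has class id k + 1
    """
    # 1) best class per frame
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2) collapse repeats and 3) drop blanks
    chars = []
    prev = blank
    for idx in best:
        if idx != blank and idx != prev:
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)


# Seven frames over the classes [blank, a, b, c]; the per-frame argmax is 1 1 0 1 2 2 0
frames = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.8, 0.1, 0.05, 0.05],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.9, 0.05, 0.03, 0.02],
]
decoded = greedy_ctc_decode(frames, "abc")
print(decoded)
```

The blank between the two runs of class 1 is what lets the doubled letter survive the collapsing step, so the seven frames decode to a three-character string.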
import time
import torch
import os
from torch.autograd import Variable
import lib.convert
import lib.dataset
from PIL import Image
import Net.net as Net
import alphabets
import sys
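import Config

os.environ['CUDA_VISIBLE_DEVICES'] = '4'

crnn_model_path = './bs64_model/netCRNN_9_48000.pth'
IMG_ROOT = './test_images'
running_mode = 'gpu'
alphabet = alphabets.alphabet
nclass = len(alphabet) + 1  # +1 for the CTC blank


def crnn_recognition(cropped_image, model):
    converter = lib.convert.strLabelConverter(alphabet)  # label converter
    image = cropped_image.convert('L')  # convert the image to grayscale

    ### Testing images are scaled to have height 32. Widths are
    # proportionally scaled with heights, but at least 100 pixels
    w = int(image.size[0] / (280 * 1.0 / Config.infer_img_w))
    # scale = image.size[1] * 1.0 / Config.img_height
    # w = int(image.size[0] / scale)

    transformer = lib.dataset.resizeNormalize((w, Config.img_height))
    image = transformer(image)
    if torch.cuda.is_available():
        image = image.cuda()
    image = image.view(1, *image.size())
    image = Variable(image)

    model.eval()
    preds = model(image)

    # pick the best class for every frame
    _, preds = preds.max(2)
    preds = preds.transpose(1, 0).contiguous().view(-1)

    preds_size = Variable(torch.IntTensor([preds.size(0)]))
    sim_pred = converter.decode(preds.data, preds_size.data, raw=False)  # decode the prediction into text
    print('results: {0}'.format(sim_pred))


if __name__ == '__main__':
    # crnn network
    model = Net.CRNN(nclass)

    # load the trained model; CPU and GPU loading differ and are handled separately
    if running_mode == 'gpu' and torch.cuda.is_available():
        model = model.cuda()
        model.load_state_dict(torch.load(crnn_model_path))
    else:
        model.load_state_dict(torch.load(crnn_model_path, map_location='cpu'))

    print('loading pretrained model from {0}'.format(crnn_model_path))

    files = sorted(os.listdir(IMG_ROOT))  # sort by file name
    for file in files:
        started = time.time()
        full_path = os.path.join(IMG_ROOT, file)
        print()
        print('ocr image is %s' % full_path)
        image = Image.open(full_path)

        crnn_recognition(image, model)
        finished = time.time()
        print('elapsed time: {0}'.format(finished - started))

Recognition results and summary

First I took a few images from the test set and fed them to the model; all of them were recognized correctly.

[Figure: recognition results on test-set images]

I also cropped text lines at random from some document pictures and scanned images and fed them to the model. The results were good as well, with the text recognized essentially correctly, which suggests that the model generalizes well beyond the synthetic training data.

[Figure: recognition results on document and scanned images]

Finally I cropped text images from scanned VAT invoices to see whether the model still delivers stable recognition results:

[Figure: recognition results on scanned VAT invoices]

A short summary: for end-to-end recognition of variable-length text, CRNN is the most classic algorithm, and in practice it works very well. The results above show that although the training data was generated by ourselves, the model recognizes text from PDF documents, scanned images and similar sources quite well. To further improve recognition on text images from a specific domain, simply add a large number of such images to the training set. The complete CRNN code is available on my Github.

Reposted from: https://www.cnblogs.com/skyfsm/p/10345305.html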