[Text Mining and Knowledge Discovery] 02. Named Entity Recognition: Threat-Intelligence Entity Recognition with BiLSTM-CRF

In August 2023 the author started a new column, "Text Mining and Knowledge Discovery", which combines Python, big-data analysis, and artificial intelligence to share content on text mining, knowledge graphs, knowledge discovery, and library and information science. The material previews the author's book "Text Mining and Knowledge Discovery (Python Edition)", expected in 2024, which introduces the field systematically in an accessible, richly illustrated style across 20 chapters and over a hundred examples. Your follows, likes, and shares are the greatest support for Xiuzhang. Knowledge is priceless and people have heart; may we all stay happy and keep growing together along the way.

The previous post introduced the literature-visualization tool CiteSpace, using CNKI papers on "Dream of the Red Chamber" for topic mining, keyword clustering, and topic-evolution analysis. This post explains how to implement threat-intelligence entity recognition: a BiLSTM-CRF model extracts ATT&CK-related tactic and technique entities, an important building block for constructing security knowledge graphs. It is an introductory article; I hope it helps.

Version information:
- keras-contrib 2.0.8
- keras 2.3.1
- tensorflow 2.2.0

Common NER model architectures are shown in the figure (source: https://aclanthology.org/2021.acl-short.4/).

Contents:
I. ATT&CK data collection
II. Data splitting and content statistics (1. paragraph splitting; 2. sentence splitting)
III. Data annotation
IV. Dataset division
V. CRF-based entity recognition (1. installing keras-contrib; 2. installing Keras; 3. complete code)
VI. BiLSTM-CRF-based entity recognition
VII. Summary

Code download: https://github.com/eastmountyxz/Text-Mining-Knowledge-Discovery

Earlier posts in this column:
- [Text Mining and Knowledge Discovery] 01. Topic-evolution analysis of "Dream of the Red Chamber": getting started with the literature-visualization tool CiteSpace
- [Text Mining and Knowledge Discovery] 02. Named entity recognition: threat-intelligence entity recognition with BiLSTM-CRF (this post)

I. ATT&CK Data Collection

Readers familiar with threat intelligence will know MITRE's ATT&CK site. This section collects the attack tactic and technique data of APT groups from that site for the entity-recognition experiments. The site is:

http://attack.mitre.org

Step 1. Analyze the page source of the ATT&CK groups list, locate the APT group names, and collect them systematically. Install the BeautifulSoup package first; the code (01-get-aptentity.py) is as follows:

#encoding:utf-8
#By:Eastmount CSDN
import re
import requests
from lxml import etree
from bs4 import BeautifulSoup
import urllib.request

#-------------------------------------------------------------------------
#Get APT group names and links

#Browser headers (a dict) to mimic a normal client
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
url = 'https://attack.mitre.org/groups/'

#Request the page
r = requests.get(url=url, headers=headers).text

#Parse the DOM tree
html_etree = etree.HTML(r)
names = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/text()')
print(names)
print(len(names), names[0])
filename = []
for name in names:
    filename.append(name.strip())
print(filename)

#Links
urls = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/@href')
print(urls)
print(len(urls), urls[0])
print('\n')

The output, shown in the figure, contains the APT group names and their corresponding URLs.

Step 2. Visit each group's URL and collect its detailed description text.

Step 3. Collect the corresponding techniques (TTPs); their location in the page source is shown in the figure.

Step 4. Put it together. The complete crawler (01-spider-mitre.py) is:

#encoding:utf-8
#By:Eastmount CSDN
import re
import requests
from lxml import etree
from bs4 import BeautifulSoup
import urllib.request

#-------------------------------------------------------------------------
#Get APT group names and links
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
url = 'https://attack.mitre.org/groups/'

r = requests.get(url=url, headers=headers).text
html_etree = etree.HTML(r)
names = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/text()')
print(names)
print(len(names), names[0])
urls = html_etree.xpath('//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a/@href')
print(urls)
print(len(urls), urls[0])
print('\n')

#-------------------------------------------------------------------------
#Fetch the detailed description of each group
k = 0
while k < len(names):
    filename = str(names[k]).strip() + ".txt"
    url = "https://attack.mitre.org" + urls[k]
    print(url)

    #Request the group page
    page = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(page)
    contents = page.read()
    soup = BeautifulSoup(contents, "html.parser")

    #Description paragraphs
    content = ""
    for tag in soup.find_all(attrs={"class": "description-body"}):
        contents = tag.find_all("p")
        for con in contents:
            content += con.get_text().strip() + "###\n"   #'###' marks a sentence end (used for splitting in Part II)

    #Techniques listed in the table
    for tag in soup.find_all(attrs={"class": "table techniques-used table-bordered mt-2"}):
        contents = tag.find("tbody").find_all("tr")
        for con in contents:
            value = con.find("p").get_text()   #rows have 4 or 5 columns, so take the <p> text
            content += value.strip() + "###\n"

    #Remove reference brackets such as [1]
    result = re.sub(r'\[.*?\]', '', content)
    print(result)

    #Write to file
    filename = "Mitre//" + filename
    print(filename)
    f = open(filename, "w", encoding="utf-8")
    f.write(result)
    f.close()
    k += 1

The output is shown in the figure; information on 100 groups was collected in total. The contents of a single file are shown in the next figure.

Note: the site's layout keeps changing and improving, so the durable skill here is the basic method of data collection and DOM-tree locating, which adapts to any page. Readers can also try collecting every paragraph, and even the content behind linked URLs, as an extension exercise; a small robustness check is sketched below.
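Since the layout does change, a cheap safeguard is to validate what the XPath returns before writing any files. A minimal sketch, assuming the same page and XPath queries as above (this check is not part of the original scripts):

#encoding:utf-8
#Layout sanity check: a sketch, assuming the XPath used in the crawler above.
import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://attack.mitre.org/groups/', headers=headers).text
html_etree = etree.HTML(r)
xp = '//*[@class="table table-bordered table-alternate mt-2"]/tbody/tr/td[2]/a'
names = html_etree.xpath(xp + '/text()')
urls = html_etree.xpath(xp + '/@href')

#Empty results usually mean the table's class name changed;
#mismatched lengths mean the column layout shifted.
if not names or len(names) != len(urls):
    raise RuntimeError('ATT&CK groups page layout changed: re-inspect the DOM and update the XPath')
print(len(names), 'groups found')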
II. Data Splitting and Content Statistics

1. Paragraph splitting

To enlarge the dataset and make downstream NLP processing easier, the text is re-segmented. The method: split on the "###" markers defined earlier and write every five sentences to a new TXT file, named like "10XX-GroupName".

The complete code (02-dataset-split.py):

#encoding:utf-8
#By:Eastmount CSDN
import re
import os

#------------------------------------------------------------------------
#Get file names under a path
def get_filepath(path):
    files = os.listdir(path)
    return files

#------------------------------------------------------------------------
#Read a file's content as one string
def get_content(filename):
    content = ""
    with open(filename, "r", encoding="utf8") as f:
        for line in f.readlines():
            content += line.replace("\n", "")
    return content

#------------------------------------------------------------------------
#Split text on the custom delimiter
def split_text(text):
    pattern = "###"
    nums = text.split(pattern)
    return nums

#------------------------------------------------------------------------
#Main
if __name__ == '__main__':
    path = "Mitre"
    savepath = "Mitre-Split"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")

    k = 0
    begin = 1001   #numbering counter
    while k < len(filenames):
        filename = "Mitre//" + filenames[k]
        print(filename)
        content = get_content(filename)
        print(content)

        #split into sentences
        nums = split_text(content)

        #write every five sentences to one TXT file
        n = 0
        result = ""
        while n < len(nums):
            if n > 0 and (n % 5) == 0:   #flush the current chunk
                savename = savepath + "//" + str(begin) + "-" + filenames[k]
                print(savename)
                f = open(savename, "w", encoding="utf8")
                f.write(result)
                result = nums[n].lstrip() + "###"   #first sentence of the next chunk
                begin += 1
                f.close()
            else:
                result += nums[n].lstrip() + "###"
            n += 1
        k += 1

The split produces 381 files in the "Mitre-Split" folder; a single file is shown in the figure.

2. Sentence splitting

Before annotation, a named-entity-recognition corpus needs two more steps:
- split paragraphs into sentences;
- split sentences into words, one word per line, so that each word can carry its own label. The key call is text.split(" ").

The effect after splitting is shown in the figure. The complete code below generates the "Mitre-Split-Word" folder:

#encoding:utf-8
#By:Eastmount CSDN
import re
import os

#------------------------------------------------------------------------
#Get file names under a path
def get_filepath(path):
    files = os.listdir(path)
    return files

#------------------------------------------------------------------------
#Read a file's content as one string
def get_content(filename):
    content = ""
    with open(filename, "r", encoding="utf8") as f:
        for line in f.readlines():
            content += line.replace("\n", "")
    return content

#------------------------------------------------------------------------
#Split on spaces to get English words
def split_word(text):
    nums = text.split(" ")
    return nums

#------------------------------------------------------------------------
#Main
if __name__ == '__main__':
    path = "Mitre-Split"
    savepath = "Mitre-Split-Word"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")

    k = 0
    while k < len(filenames):
        filename = path + "//" + filenames[k]
        print(filename)
        content = get_content(filename)
        content = content.replace("###", "\n")

        #split into words
        nums = split_word(content)
        savename = savepath + "//" + filenames[k]
        f = open(savename, "w", encoding="utf8")
        for n in nums:
            if n != "":
                #strip punctuation
                n = n.replace(",", "")
                n = n.replace(";", "")
                n = n.replace("!", "")
                n = n.replace("?", "")
                n = n.replace(":", "")
                n = n.replace('"', "")
                n = n.replace("(", "")
                n = n.replace(")", "")
                n = n.replace("’", "")
                n = n.replace("'s", "")   #drop possessive 's
                #strip sentence-ending periods, keeping abbreviations such as U.S. and U.K.
                if ("." in n) and (n not in ["U.S.", "U.K."]):
                    n = n.rstrip(".")
                    n = n.rstrip(".\n")
                    n = n + "\n"   #the extra newline leaves a blank line marking the sentence boundary
                f.write(n + "\n")
        f.close()
        k += 1
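The section title also promises content statistics, which the original scripts do not show. A small sketch that reports file, sentence, and word counts over the split corpus, assuming the Mitre-Split-Word layout above (one word per line, blank line = sentence end):

#encoding:utf-8
#Content statistics over the split corpus: a sketch, not part of the original scripts.
import os

path = 'Mitre-Split-Word'
total_words, total_sents = 0, 0
for fname in os.listdir(path):
    with open(os.path.join(path, fname), 'r', encoding='utf8') as f:
        for line in f:
            if line.strip() == '':
                total_sents += 1   #blank line marks a sentence boundary
            else:
                total_words += 1   #one token per line
print('files:', len(os.listdir(path)))
print('sentences:', total_sents, 'words:', total_words)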
III. Data Annotation

The annotation takes a brute-force approach: define name dictionaries for each entity type and label matching tokens with the BIO scheme. The entity types follow ATT&CK tactics and techniques; the result can be corrected manually afterwards, and more entity types can be defined as needed.

BIO label scheme and entity statistics:

Entity type           Tag     Count   Examples
APT group             B-AG    128     APT32, Lazarus Group
Vulnerability         B-AV    56      CVE-2009-0927
Region or location    B-RL    72      America, Europe
Targeted industry     B-AI    34      companies, finance
Attack method         B-AM    65      C&C, RAT, DDoS
Software used         B-SI    48      7-Zip, Microsoft
Operating system      B-OS    10      Linux, Windows

Common annotation tools:
- Image annotation: labelme, LabelImg, Labelbox, RectLabel, CVAT, VIA
- Semi-automatic OCR annotation: PPOCRLabel
- NLP annotation: label-studio
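For concreteness, here is what one annotated sentence looks like in the one-word-per-line format produced below. This is a hand-made illustration, not a row taken from the actual dataset:

APT32         B-AG
has           O
targeted      O
government    O
institutions  B-AI
in            O
Vietnam       B-RL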
The complete code (04-BIO-data-annotation.py) is as follows:

#encoding:utf-8
import re
import os
import csv

#-----------------------------------------Entity dictionaries-----------------------------------------
#APT groups
aptName = ['admin@338', 'Ajax Security Team', 'APT-C-36', 'APT1', 'APT12', 'APT16', 'APT17', 'APT18', 'APT19', 'APT28', 'APT29', 'APT3', 'APT30', 'APT32',
           'APT33', 'APT37', 'APT38', 'APT39', 'APT41', 'Axiom', 'BlackOasis', 'BlackTech', 'Blue Mockingbird', 'Bouncing Golf', 'BRONZE BUTLER',
           'Carbanak', 'Chimera', 'Cleaver', 'Cobalt Group', 'CopyKittens', 'Dark Caracal', 'Darkhotel', 'DarkHydrus', 'DarkVishnya', 'Deep Panda',
           'Dragonfly', 'Dragonfly 2.0', 'DragonOK', 'Dust Storm', 'Elderwood', 'Equation', 'Evilnum', 'FIN10', 'FIN4', 'FIN5', 'FIN6', 'FIN7', 'FIN8',
           'Fox Kitten', 'Frankenstein', 'GALLIUM', 'Gallmaker', 'Gamaredon Group', 'GCMAN', 'GOLD SOUTHFIELD', 'Gorgon Group', 'Group5', 'HAFNIUM',
           'Higaisa', 'Honeybee', 'Inception', 'Indrik Spider', 'Ke3chang', 'Kimsuky', 'Lazarus Group', 'Leafminer', 'Leviathan', 'Lotus Blossom',
           'Machete', 'Magic Hound', 'menuPass', 'Moafee', 'Mofang', 'Molerats', 'MuddyWater', 'Mustang Panda', 'Naikon', 'NEODYMIUM', 'Night Dragon',
           'OilRig', 'Operation Wocao', 'Orangeworm', 'Patchwork', 'PittyTiger', 'PLATINUM', 'Poseidon Group', 'PROMETHIUM', 'Putter Panda', 'Rancor',
           'Rocke', 'RTM', 'Sandworm Team', 'Scarlet Mimic', 'Sharpshooter', 'Sidewinder', 'Silence', 'Silent Librarian', 'SilverTerrier', 'Sowbug', 'Stealth Falcon',
           'Stolen Pencil', 'Strider', 'Suckfly', 'TA459', 'TA505', 'TA551', 'Taidoor', 'TEMP.Veles', 'The White Company', 'Threat Group-1314', 'Threat Group-3390',
           'Thrip', 'Tropic Trooper', 'Turla', 'Volatile Cedar', 'Whitefly', 'Windigo', 'Windshift', 'Winnti Group', 'WIRTE', 'Wizard Spider', 'ZIRCONIUM',
           'UNC2452', 'NOBELIUM', 'StellarParticle']

#Vulnerabilities with special names
cveName = ['CVE-2009-3129', 'CVE-2012-0158', 'CVE-2009-4324', 'CVE-2009-0927', 'CVE-2011-0609', 'CVE-2011-0611', 'CVE-2012-0158',
           'CVE-2017-0262', 'CVE-2015-4902', 'CVE-2015-1701', 'CVE-2014-4076', 'CVE-2015-2387', 'CVE-2015-1701', 'CVE-2017-0263']

#Regions and locations
locationName = ['China-based', 'China', 'North', 'Korea', 'Russia', 'South', 'Asia', 'US', 'U.S.', 'UK', 'U.K.', 'Iran', 'Iranian', 'America', 'Colombian',
                'Chinese', 'People’s', 'Liberation', 'Army', 'PLA', 'General', 'Staff', 'Department’s', 'GSD', 'MUCD', 'Unit', '61398', 'Chinese-based',
                'Russias', 'General', 'Staff', 'Main', 'Intelligence', 'Directorate', 'GRU', 'GTsSS', 'unit', '26165', '74455', 'Georgian', 'SVR',
                'Europe', 'Asia', 'Hong Kong', 'Vietnam', 'Cambodia', 'Thailand', 'Germany', 'Spain', 'Finland', 'Israel', 'India', 'Italy', 'South Asia',
                'Korea', 'Kuwait', 'Lebanon', 'Malaysia', 'United', 'Kingdom', 'Netherlands', 'Southeast', 'Asia', 'Pakistan', 'Canada', 'Bangladesh',
                'Ukraine', 'Austria', 'France', 'Korea']

#Targeted industries
industryName = ['financial', 'economic', 'trade', 'policy', 'defense', 'industrial', 'espionage', 'government', 'institutions', 'institution', 'petroleum',
                'industry', 'manufacturing', 'corporations', 'media', 'outlets', 'high-tech', 'companies', 'governments', 'medical', 'defense', 'finance',
                'energy', 'pharmaceutical', 'telecommunications', 'high', 'tech', 'education', 'investment', 'firms', 'organizations', 'research', 'institutes']

#Attack methods
methodName = ['RATs', 'RAT', 'SQL', 'injection', 'spearphishing', 'spear', 'phishing', 'backdoors', 'vulnerabilities', 'vulnerability', 'commands', 'command',
              'anti-censorship', 'keystrokes', 'VBScript', 'malicious', 'document', 'scheduled', 'tasks', 'C2', 'C&C', 'communications', 'batch', 'script',
              'shell', 'scripting', 'social', 'engineering', 'privilege', 'escalation', 'credential', 'dumping', 'control', 'obfuscates', 'obfuscate', 'payload', 'upload',
              'payloads', 'encode', 'decrypts', 'attachments', 'attachment', 'inject', 'collect', 'large-scale', 'scans', 'persistence', 'brute-force/password-spray',
              'password-spraying', 'backdoor', 'bypass', 'hijacking', 'escalate', 'privileges', 'lateral', 'movement', 'Vulnerability', 'timestomping',
              'keylogging', 'DDoS', 'bootkit', 'UPX']

#Software used
softwareName = ['Microsoft', 'Word', 'Office', 'Firefox', 'Google', 'RAR', 'WinRAR', 'zip', 'GETMAIL', 'MAPIGET', 'Outlook', 'Exchange', 'Adobes', 'Adobe',
                'Acrobat', 'Reader', 'RDP', 'PDFs', 'PDF', 'RTF', 'XLSM', 'USB', 'SharePoint', 'Forfiles', 'Delphi', 'COM', 'Excel', 'NetBIOS',
                'Tor', 'Defender', 'Scanner', 'Gmail', 'Yahoo', 'Mail', '7-Zip', 'Twitter', 'gMSA', 'Azure', 'Exchange', 'OWA', 'SMB', 'Netbios',
                'WinRM']

#Operating systems
osName = ['Windows', 'windows', 'Mac', 'Linux', 'Android', 'android', 'linux', 'mac', 'unix', 'Unix']

#Track the entity values found while annotating
saveCVE = cveName
saveAPT = aptName
saveLocation = locationName
saveIndustry = industryName
saveMethod = methodName
saveSoftware = softwareName
saveOS = osName

#------------------------------------------------------------------------
#Get file names under a path
def get_filepath(path):
    files = os.listdir(path)
    return files

#------------------------------------------------------------------------
#Read a file's content as a list of lines
def get_content(filename):
    content = []
    with open(filename, "r", encoding="utf8") as f:
        for line in f.readlines():
            content.append(line.strip())
    return content

#------------------------------------------------------------------------
#Assign a BIO label to each word
def data_annotation(text):
    n = 0
    nums = []
    while n < len(text):
        word = text[n].strip()
        if word == "":          #blank line = sentence boundary
            n += 1
            nums.append("")
            continue
        #APT group
        if word in aptName:
            nums.append("B-AG")
        #Vulnerability
        elif "CVE-" in word or "MS-" in word:
            nums.append("B-AV")
            print("CVE:", word)
            if word not in saveCVE:
                saveCVE.append(word)
        #Region or location
        elif word in locationName:
            nums.append("B-RL")
        #Targeted industry
        elif word in industryName:
            nums.append("B-AI")
        #Attack method
        elif word in methodName:
            nums.append("B-AM")
        #Software used
        elif word in softwareName:
            nums.append("B-SI")
        #Operating system
        elif word in osName:
            nums.append("B-OS")
        #Special case: multi-word APT group names
        #Ajax Security Team, Deep Panda, Sandworm Team, Cozy Bear, The Dukes, Dark Halo
        elif ((word in "Ajax Security Team") and (text[n+1].strip() in "Ajax Security Team") and word != "a" and word != "it") or \
             ((word in "Ajax Security Team") and (text[n-1].strip() in "Ajax Security Team") and word != "a" and word != "it") or \
             ((word == "Deep") and (text[n+1].strip() == "Panda")) or \
             ((word == "Panda") and (text[n-1].strip() == "Deep")) or \
             ((word == "Sandworm") and (text[n+1].strip() == "Team")) or \
             ((word == "Team") and (text[n-1].strip() == "Sandworm")) or \
             ((word == "Cozy") and (text[n+1].strip() == "Bear")) or \
             ((word == "Bear") and (text[n-1].strip() == "Cozy")) or \
             ((word == "The") and (text[n+1].strip() == "Dukes")) or \
             ((word == "Dukes") and (text[n-1].strip() == "The")) or \
             ((word == "Dark") and (text[n+1].strip() == "Halo")) or \
             ((word == "Halo") and (text[n-1].strip() == "Dark")):
            nums.append("B-AG")
            if "Deep Panda" not in saveAPT:
                saveAPT.append("Deep Panda")
            if "Sandworm Team" not in saveAPT:
                saveAPT.append("Sandworm Team")
            if "Cozy Bear" not in saveAPT:
                saveAPT.append("Cozy Bear")
            if "The Dukes" not in saveAPT:
                saveAPT.append("The Dukes")
            if "Dark Halo" not in saveAPT:
                saveAPT.append("Dark Halo")
        #Special case: industry
        elif ((word == "legal") and (text[n+1].strip() == "services")) or \
             ((word == "services") and (text[n-1].strip() == "legal")):
            nums.append("B-AI")
            if "legal services" not in saveIndustry:
                saveIndustry.append("legal services")
        #Special case: attack methods
        #watering hole attack, bypass application control, take screenshots
        elif ((word in "watering hole attack") and (text[n+1].strip() in "watering hole attack") and word != "a" and text[n+1].strip() != "a") or \
             ((word in "watering hole attack") and (text[n-1].strip() in "watering hole attack") and word != "a" and text[n+1].strip() != "a") or \
             ((word in "bypass application control") and (text[n+1].strip() in "bypass application control") and word != "a" and text[n+1].strip() != "a") or \
             ((word in "bypass application control") and (text[n-1].strip() in "bypass application control") and word != "a" and text[n-1].strip() != "a") or \
             ((word == "take") and (text[n+1].strip() == "screenshots")) or \
             ((word == "screenshots") and (text[n-1].strip() == "take")):
            nums.append("B-AM")
            if "watering hole attack" not in saveMethod:
                saveMethod.append("watering hole attack")
            if "bypass application control" not in saveMethod:
                saveMethod.append("bypass application control")
            if "take screenshots" not in saveMethod:
                saveMethod.append("take screenshots")
        #Special case: software
        #MAC address, IP address, Port 22, Delivery Service, McAfee Email Protection
        elif ((word == "MAC") and (text[n+1].strip() == "address")) or \
             ((word == "address") and (text[n-1].strip() == "MAC")) or \
             ((word == "IP") and (text[n+1].strip() == "address")) or \
             ((word == "address") and (text[n-1].strip() == "IP")) or \
             ((word == "Port") and (text[n+1].strip() == "22")) or \
             ((word == "22") and (text[n-1].strip() == "Port")) or \
             ((word == "Delivery") and (text[n+1].strip() == "Service")) or \
             ((word == "Service") and (text[n-1].strip() == "Delivery")) or \
             ((word in "McAfee Email Protection") and (text[n+1].strip() in "McAfee Email Protection")) or \
             ((word in "McAfee Email Protection") and (text[n-1].strip() in "McAfee Email Protection")):
            nums.append("B-SI")
            if "MAC address" not in saveSoftware:
                saveSoftware.append("MAC address")
            if "IP address" not in saveSoftware:
                saveSoftware.append("IP address")
            if "Port 22" not in saveSoftware:
                saveSoftware.append("Port 22")
            if "Delivery Service" not in saveSoftware:
                saveSoftware.append("Delivery Service")
            if "McAfee Email Protection" not in saveSoftware:
                saveSoftware.append("McAfee Email Protection")
        #Special case: locations
        #Russias Foreign Intelligence Service, the Middle East
        elif ((word in "Russias Foreign Intelligence Service") and (text[n+1].strip() in "Russias Foreign Intelligence Service")) or \
             ((word in "Russias Foreign Intelligence Service") and (text[n-1].strip() in "Russias Foreign Intelligence Service")) or \
             ((word in "the Middle East") and (text[n+1].strip() in "the Middle East")) or \
             ((word in "the Middle East") and (text[n-1].strip() in "the Middle East")):
            nums.append("B-RL")
            if "Russias Foreign Intelligence Service" not in saveLocation:
                saveLocation.append("Russias Foreign Intelligence Service")
            if "the Middle East" not in saveLocation:
                saveLocation.append("the Middle East")
        else:
            nums.append("O")
        n += 1
    return nums

#------------------------------------------------------------------------
#Main
if __name__ == '__main__':
    path = "Mitre-Split-Word"
    savepath = "Mitre-Split-Word-BIO"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")

    k = 0
    while k < len(filenames):
        filename = path + "//" + filenames[k]
        print("-------------------------")
        print(filename)
        content = get_content(filename)

        #annotate
        nums = data_annotation(content)
        print(len(content), len(nums))

        #save as CSV (word,label)
        filename = filenames[k].replace(".txt", ".csv")
        savename = savepath + "//" + filename
        f = open(savename, "w", encoding="utf8", newline="")
        fwrite = csv.writer(f)
        fwrite.writerow(["word", "label"])
        n = 0
        while n < len(content):
            fwrite.writerow([content[n], nums[n]])
            n += 1
        f.close()
        print("-------------------------\n\n")
        k += 1

    #--------------------------------------------------------------------
    #Print the collected entity values
    saveCVE = sorted(set(saveCVE))   #dedup (the source list contains repeats)
    print(saveCVE)
    print("CVE vulnerabilities:", len(saveCVE))

    saveAPT.sort()
    print(saveAPT)
    print("APT groups:", len(saveAPT))

    saveLocation.sort()
    print(saveLocation)
    print("Regions/locations:", len(saveLocation))

    saveIndustry.sort()
    print(saveIndustry)
    print("Industries:", len(saveIndustry))

    saveSoftware.sort()
    print(saveSoftware)
    print("Software:", len(saveSoftware))

    saveMethod.sort()
    print(saveMethod)
    print("Methods:", len(saveMethod))

    saveOS.sort()
    print(saveOS)
    print("Operating systems:", len(saveOS))

The output is shown in the figure.

Note: correcting and refining the annotation is left for the reader to think through. Also, the labeler above only ever emits B- tags, so the I- (inside) part of the BIO scheme still needs adjusting; a sketch follows below. The more accurate the annotation, the more it benefits all entity-recognition research.
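One concrete adjustment: the labeler emits B- for every matched token, so a multi-word entity such as Lazarus Group comes out as B-AG B-AG rather than B-AG I-AG. A small post-processing sketch (label names as defined above; note it would also merge two distinct adjacent entities of the same type, which the single-pass labeler cannot distinguish anyway):

#Convert runs of identical B-X tags into proper B-X I-X sequences: a sketch.
def fix_bio(labels):
    fixed = []
    prev = 'O'
    for tag in labels:
        if tag.startswith('B-') and prev == tag:
            fixed.append('I-' + tag[2:])   #continuation of the same entity
        else:
            fixed.append(tag)
        prev = tag if tag != '' else 'O'   #blank lines (sentence breaks) reset the run
    return fixed

print(fix_bio(['B-AG', 'B-AG', 'O', 'B-RL']))   #['B-AG', 'I-AG', 'O', 'B-RL']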
IV. Dataset Division

Before training the recognition models, the annotated dataset is randomly divided into training, test, and validation sets: the files in Mitre-Split-Word-BIO are randomly distributed into three folders (train, test, val). The random file split itself is not shown in the original scripts; a sketch follows at the end of this section. The code below then merges each folder into a single TXT file that the later training and testing scripts consume:

- dataset-train.txt, dataset-test.txt, dataset-val.txt

as shown in the figure. The complete merging code:

#encoding:utf-8
#By:Eastmount CSDN
import re
import os
import csv

#------------------------------------------------------------------------
#Get file names under a path
def get_filepath(path):
    files = os.listdir(path)
    return files

#------------------------------------------------------------------------
#Convert one CSV file into "word label" lines
def get_content(filename):
    content = ""
    fr = open(filename, "r", encoding="utf8")
    reader = csv.reader(fr)
    k = 0
    for r in reader:
        if k > 0 and (r[0] != "" or r[0] != " ") and r[1] != "":   #k > 0 skips the header row
            content += r[0] + " " + r[1] + "\n"
        elif (r[0] == "" or r[0] == " ") and r[1] != "":
            content += "UNK " + r[1] + "\n"   #empty word but labeled: write a placeholder
        elif (r[0] == "" or r[0] == " ") and r[1] == "":
            content += "\n"                   #blank line = sentence boundary
        k += 1
    return content

#------------------------------------------------------------------------
#Main
if __name__ == '__main__':
    path = "train"
    #path = "test"
    #path = "val"
    filenames = get_filepath(path)
    print(filenames)
    print("\n")

    savefilename = "dataset-train.txt"
    #savefilename = "dataset-test.txt"
    #savefilename = "dataset-val.txt"
    f = open(savefilename, "w", encoding="utf8")

    k = 0
    while k < len(filenames):
        filename = path + "//" + filenames[k]
        print(filename)
        content = get_content(filename)
        print(content)
        f.write(content)
        k += 1
    f.close()

The run result is shown in the figure.
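As noted above, the random file split is not shown in the original post. A minimal sketch; the 8:1:1 ratio, the fixed seed, and the target folder names are assumptions:

#encoding:utf-8
#Random 8:1:1 file split into train/test/val: a sketch, not the author's original code.
import os
import random
import shutil

src = 'Mitre-Split-Word-BIO'
files = os.listdir(src)
random.seed(42)        #fix the seed so the split is reproducible
random.shuffle(files)

n = len(files)
splits = {'train': files[:int(0.8 * n)],
          'test':  files[int(0.8 * n):int(0.9 * n)],
          'val':   files[int(0.9 * n):]}
for folder, names in splits.items():
    os.makedirs(folder, exist_ok=True)
    for name in names:
        shutil.copy(os.path.join(src, name), os.path.join(folder, name))
    print(folder, len(names))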
V. CRF-Based Entity Recognition

With the data ready, we can start the entity-recognition experiments, beginning with the representative Conditional Random Fields (CRF) model. For CRF theory, readers are referred to the literature.

1. Install keras-contrib

The CRF layer used here comes from keras-contrib.

Step 1. A direct "pip install keras-contrib" may fail, and the remote install may fail too:

pip install git+https://www.github.com/keras-team/keras-contrib.git

possibly still raising: ModuleNotFoundError: No module named 'keras_contrib'.

Step 2. The author therefore downloaded the package from GitHub and installed it locally (keras-contrib version 2.0.8):

https://github.com/keras-team/keras-contrib

git clone https://www.github.com/keras-team/keras-contrib.git
cd keras-contrib
python setup.py install

A successful installation is shown in the figure. The code and packages can also be downloaded from my repository:

https://github.com/eastmountyxz/When-AI-meet-Security

2. Install Keras

The keras and TensorFlow packages are required as well. If TensorFlow downloads too slowly, set the Tsinghua mirror; version 2.2 is what was actually installed:

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install tensorflow==2.2
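The training script below reads a dictionary file, char_vocabs.txt, whose construction is not shown in the post. A minimal sketch that builds it from the merged datasets (one token per line; the file names match those above):

#encoding:utf-8
#Building char_vocabs.txt from the merged datasets: a sketch, not the author's original code.
files = ['dataset-train.txt', 'dataset-test.txt', 'dataset-val.txt']
vocab = []
for name in files:
    with open(name, 'r', encoding='utf8') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue            #blank line = sentence boundary
            word = line.split()[0]  #each line is "word label"
            if word not in vocab:
                vocab.append(word)
with open('char_vocabs.txt', 'w', encoding='utf8') as f:
    for w in vocab:
        f.write(w + '\n')
print('vocab size:', len(vocab))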
3. Complete code

Recommended references:
- https://github.com/huanghao128/zh-nlp-demo
- https://blog.csdn.net/qq_35549634/article/details/106861168

#encoding:utf-8
#By:Eastmount CSDN
import re
import os
import csv
import numpy as np
import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.models import Model
from keras.layers import Masking, Embedding, Bidirectional, LSTM, Dense
from keras.layers import Input, TimeDistributed, Activation
from keras.models import load_model
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy
from keras import backend as K
from sklearn import metrics

#------------------------------------------------------------------------
#Step 1: Data preprocessing
#------------------------------------------------------------------------
train_data_path = "dataset-train.txt"   #training data
test_data_path = "dataset-test.txt"     #test data
val_data_path = "dataset-val.txt"       #validation data
char_vocab_path = "char_vocabs.txt"     #dictionary file

special_words = ['PAD', 'UNK']          #special tokens

#BIO labels
label2idx = {'O': 0, 'B-AG': 1, 'B-AV': 2, 'B-RL': 3,
             'B-AI': 4, 'B-AM': 5, 'B-SI': 6, 'B-OS': 7}

#index -> BIO label
idx2label = {idx: label for label, idx in label2idx.items()}
print(idx2label)

#read the vocabulary file
with open(char_vocab_path, "r", encoding="utf8") as fo:
    char_vocabs = [line.strip() for line in fo]
char_vocabs = special_words + char_vocabs
print(char_vocabs)
print("--------------------------------------------\n\n")

#token <-> index maps, e.g. {'PAD': 0, 'UNK': 1, 'APT-C-36': 2, ...}
idx2vocab = {idx: char for idx, char in enumerate(char_vocabs)}
vocab2idx = {char: idx for idx, char in idx2vocab.items()}
print(idx2vocab)
print("--------------------------------------------\n\n")
print(vocab2idx)
print("--------------------------------------------\n\n")

#------------------------------------------------------------------------
#Step 2: Read the corpus
#------------------------------------------------------------------------
def read_corpus(corpus_path, vocab2idx, label2idx):
    datas, labels = [], []
    with open(corpus_path, encoding='utf-8') as fr:
        lines = fr.readlines()
    sent_, tag_ = [], []
    for line in lines:
        if line != '\n':   #still inside a sentence
            line = line.strip()
            [char, label] = line.split()
            sent_.append(char)
            tag_.append(label)
        else:              #blank line: close the sentence
            sent_ids = [vocab2idx[char] if char in vocab2idx else vocab2idx['UNK'] for char in sent_]
            tag_ids = [label2idx[label] if label in label2idx else 0 for label in tag_]
            datas.append(sent_ids)
            labels.append(tag_ids)
            sent_, tag_ = [], []
    return datas, labels

#raw data
train_datas_, train_labels_ = read_corpus(train_data_path, vocab2idx, label2idx)
test_datas_, test_labels_ = read_corpus(test_data_path, vocab2idx, label2idx)

#sanity output: 1639 1639 923 923
print(len(train_datas_), len(train_labels_), len(test_datas_), len(test_labels_))
print(train_datas_[5])
print([idx2vocab[idx] for idx in train_datas_[5]])
print(train_labels_[5])
print([idx2label[idx] for idx in train_labels_[5]])

#------------------------------------------------------------------------
#Step 3: Padding and one-hot encoding
#------------------------------------------------------------------------
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)

#padding
print('padding sequences')
train_datas = sequence.pad_sequences(train_datas_, maxlen=MAX_LEN)
train_labels = sequence.pad_sequences(train_labels_, maxlen=MAX_LEN)
test_datas = sequence.pad_sequences(test_datas_, maxlen=MAX_LEN)
test_labels = sequence.pad_sequences(test_labels_, maxlen=MAX_LEN)
print('x_train shape:', train_datas.shape)
print('x_test shape:', test_datas.shape)    # (1639, 100) (923, 100)

#one-hot encode the labels
train_labels = keras.utils.to_categorical(train_labels, CLASS_NUMS)
test_labels = keras.utils.to_categorical(test_labels, CLASS_NUMS)
print('trainlabels shape:', train_labels.shape)
print('testlabels shape:', test_labels.shape)   # (1639, 100, 8) (923, 100, 8)

#------------------------------------------------------------------------
#Step 4: Build the CRF model
#------------------------------------------------------------------------
EPOCHS = 20
BATCH_SIZE = 64
EMBED_DIM = 128
HIDDEN_SIZE = 64
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)
K.clear_session()
print(VOCAB_SIZE, CLASS_NUMS, "\n")   #3860 8

#model: Embedding + Dense + CRF
inputs = Input(shape=(MAX_LEN,), dtype='int32')
x = Masking(mask_value=0)(inputs)
x = Embedding(VOCAB_SIZE, 32, mask_zero=False)(x)
x = TimeDistributed(Dense(CLASS_NUMS))(x)
outputs = CRF(CLASS_NUMS)(x)
model = Model(inputs=inputs, outputs=outputs)
model.summary()

flag = "test"   #set to "train" for the first run
if flag == "train":
    #train
    model.compile(loss=crf_loss, optimizer='adam', metrics=[crf_viterbi_accuracy])
    model.fit(train_datas, train_labels, epochs=EPOCHS, verbose=1, validation_split=0.1)
    score = model.evaluate(test_datas, test_labels, batch_size=BATCH_SIZE)
    print(model.metrics_names)
    print(score)
    model.save('ch_ner_model.h5')
else:
    #------------------------------------------------------------------------
    #Step 5: Model prediction
    #------------------------------------------------------------------------
    char_vocab_path = "char_vocabs.txt"   #dictionary file
    model_path = "ch_ner_model.h5"        #model file
    ner_labels = {'O': 0, 'B-AG': 1, 'B-AV': 2, 'B-RL': 3,
                  'B-AI': 4, 'B-AM': 5, 'B-SI': 6, 'B-OS': 7}
    special_words = ['PAD', 'UNK']
    MAX_LEN = 100

    #predict
    model = load_model(model_path, custom_objects={'CRF': CRF}, compile=False)
    y_pred = model.predict(test_datas)
    y_labels = np.argmax(y_pred, axis=2)       #predicted label indices
    z_labels = np.argmax(test_labels, axis=2)  #true label indices
    word_labels = test_datas                   #word indices

    k = 0
    final_y = []     #predicted labels
    final_z = []     #true labels
    final_word = []  #corresponding words
    while k < len(y_labels):
        y = y_labels[k]
        for idx in y:
            final_y.append(idx2label[idx])
        z = z_labels[k]
        for idx in z:
            final_z.append(idx2label[idx])
        word = word_labels[k]
        for idx in word:
            final_word.append(idx2vocab[idx])
        k += 1
    print('Final result size:', len(final_y), len(final_z))

    #count right/wrong predictions over non-O positions
    n = 0
    numError = 0
    numRight = 0
    while n < len(final_y):
        if final_y[n] != final_z[n] and final_z[n] != 'O':
            numError += 1
        if final_y[n] == final_z[n] and final_z[n] != 'O':
            numRight += 1
        n += 1
    print('Wrong predictions:', numError)
    print('Right predictions:', numRight)
    print('Acc:', numRight * 1.0 / (numError + numRight))
    print(y_pred.shape)
    print(len(test_datas_), len(test_labels_))
    print('Words:', [idx2vocab[idx] for idx in test_datas_[0]])
    print('True labels:', [idx2label[idx] for idx in test_labels_[0]])

    #save to file
    fw = open('Final_CRF_Result.csv', 'w', encoding='utf8', newline='')
    fwrite = csv.writer(fw)
    fwrite.writerow(['pre_label', 'real_label', 'word'])
    n = 0
    while n < len(final_y):
        fwrite.writerow([final_y[n], final_z[n], final_word[n]])
        n += 1
    fw.close()

The constructed model is shown in the figure. Train first, then change the flag variable to "test" and rerun to evaluate. The training output ends like this:

1475/1475 [==============================] - loss: 0.0136 - crf_viterbi_accuracy: 0.9984
['loss', 'crf_viterbi_accuracy']
[0.021301430796362854, 0.9972449541091919]
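The accuracy above lumps all non-O positions together. For per-entity precision, recall, and F1, sklearn (already imported in the script) can report metrics per label. A sketch that reuses final_y and final_z from the test branch above:

#Per-label precision/recall/F1: a sketch reusing final_y / final_z from the test branch.
from sklearn.metrics import classification_report

#drop positions where both prediction and truth are 'O' (padding and true negatives)
keep = [(t, p) for p, t in zip(final_y, final_z) if not (t == 'O' and p == 'O')]
y_true = [t for t, p in keep]
y_pred = [p for t, p in keep]
print(classification_report(y_true, y_pred, digits=4))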
VI. BiLSTM-CRF-Based Entity Recognition

The following code builds a BiLSTM-CRF model for entity recognition. It follows the same pipeline as the CRF script; the differences are the added bidirectional LSTM layer, the larger embedding size, and 12 training epochs.

#encoding:utf-8
#By:Eastmount CSDN
import re
import os
import csv
import numpy as np
import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.models import Model
from keras.layers import Masking, Embedding, Bidirectional, LSTM, Dense
from keras.layers import Input, TimeDistributed, Activation
from keras.models import load_model
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy
from keras import backend as K
from sklearn import metrics

#------------------------------------------------------------------------
#Step 1: Data preprocessing
#------------------------------------------------------------------------
train_data_path = "dataset-train.txt"   #training data
test_data_path = "dataset-test.txt"     #test data
val_data_path = "dataset-val.txt"       #validation data
char_vocab_path = "char_vocabs.txt"     #dictionary file
special_words = ['PAD', 'UNK']          #special tokens

#BIO labels
label2idx = {'O': 0, 'B-AG': 1, 'B-AV': 2, 'B-RL': 3,
             'B-AI': 4, 'B-AM': 5, 'B-SI': 6, 'B-OS': 7}

#index -> BIO label
idx2label = {idx: label for label, idx in label2idx.items()}
print(idx2label)

#read the vocabulary file
with open(char_vocab_path, "r", encoding="utf8") as fo:
    char_vocabs = [line.strip() for line in fo]
char_vocabs = special_words + char_vocabs

#token <-> index maps, e.g. {'PAD': 0, 'UNK': 1, 'APT-C-36': 2, ...}
idx2vocab = {idx: char for idx, char in enumerate(char_vocabs)}
vocab2idx = {char: idx for idx, char in idx2vocab.items()}

#------------------------------------------------------------------------
#Step 2: Read the corpus
#------------------------------------------------------------------------
def read_corpus(corpus_path, vocab2idx, label2idx):
    datas, labels = [], []
    with open(corpus_path, encoding='utf-8') as fr:
        lines = fr.readlines()
    sent_, tag_ = [], []
    for line in lines:
        if line != '\n':   #still inside a sentence
            line = line.strip()
            [char, label] = line.split()
            sent_.append(char)
            tag_.append(label)
        else:              #blank line: close the sentence
            sent_ids = [vocab2idx[char] if char in vocab2idx else vocab2idx['UNK'] for char in sent_]
            tag_ids = [label2idx[label] if label in label2idx else 0 for label in tag_]
            datas.append(sent_ids)
            labels.append(tag_ids)
            sent_, tag_ = [], []
    return datas, labels

#raw data
train_datas_, train_labels_ = read_corpus(train_data_path, vocab2idx, label2idx)
test_datas_, test_labels_ = read_corpus(test_data_path, vocab2idx, label2idx)

#------------------------------------------------------------------------
#Step 3: Padding and one-hot encoding
#------------------------------------------------------------------------
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)

print('padding sequences')
train_datas = sequence.pad_sequences(train_datas_, maxlen=MAX_LEN)
train_labels = sequence.pad_sequences(train_labels_, maxlen=MAX_LEN)
test_datas = sequence.pad_sequences(test_datas_, maxlen=MAX_LEN)
test_labels = sequence.pad_sequences(test_labels_, maxlen=MAX_LEN)
print('x_train shape:', train_datas.shape)
print('x_test shape:', test_datas.shape)

train_labels = keras.utils.to_categorical(train_labels, CLASS_NUMS)
test_labels = keras.utils.to_categorical(test_labels, CLASS_NUMS)
print('trainlabels shape:', train_labels.shape)
print('testlabels shape:', test_labels.shape)

#------------------------------------------------------------------------
#Step 4: Build the BiLSTM-CRF model
#------------------------------------------------------------------------
EPOCHS = 12
BATCH_SIZE = 64
EMBED_DIM = 128
HIDDEN_SIZE = 64
MAX_LEN = 100
VOCAB_SIZE = len(vocab2idx)
CLASS_NUMS = len(label2idx)
K.clear_session()
print(VOCAB_SIZE, CLASS_NUMS, "\n")   #3860 8

#model: Embedding + BiLSTM + Dense + CRF
inputs = Input(shape=(MAX_LEN,), dtype='int32')
x = Masking(mask_value=0)(inputs)
x = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=False)(x)   #masking disabled
x = Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=True))(x)
x = TimeDistributed(Dense(CLASS_NUMS))(x)
outputs = CRF(CLASS_NUMS)(x)
model = Model(inputs=inputs, outputs=outputs)
model.summary()

flag = "train"   #switch to "test" after training
if flag == "train":
    #train
    model.compile(loss=crf_loss, optimizer='adam', metrics=[crf_viterbi_accuracy])
    model.fit(train_datas, train_labels, epochs=EPOCHS, verbose=1, validation_split=0.1)
    score = model.evaluate(test_datas, test_labels, batch_size=BATCH_SIZE)
    print(model.metrics_names)
    print(score)
    model.save('bilstm_ner_model.h5')
else:
    #------------------------------------------------------------------------
    #Step 5: Model prediction
    #------------------------------------------------------------------------
    char_vocab_path = "char_vocabs.txt"   #dictionary file
    model_path = "bilstm_ner_model.h5"    #model file
    ner_labels = {'O': 0, 'B-AG': 1, 'B-AV': 2, 'B-RL': 3,
                  'B-AI': 4, 'B-AM': 5, 'B-SI': 6, 'B-OS': 7}
    special_words = ['PAD', 'UNK']
    MAX_LEN = 100

    #predict
    model = load_model(model_path, custom_objects={'CRF': CRF}, compile=False)
    y_pred = model.predict(test_datas)
    y_labels = np.argmax(y_pred, axis=2)       #predicted label indices
    z_labels = np.argmax(test_labels, axis=2)  #true label indices
    word_labels = test_datas                   #word indices

    k = 0
    final_y = []     #predicted labels
    final_z = []     #true labels
    final_word = []  #corresponding words
    while k < len(y_labels):
        y = y_labels[k]
        for idx in y:
            final_y.append(idx2label[idx])
        z = z_labels[k]
        for idx in z:
            final_z.append(idx2label[idx])
        word = word_labels[k]
        for idx in word:
            final_word.append(idx2vocab[idx])
        k += 1
    print('Final result size:', len(final_y), len(final_z))

    #count right/wrong predictions over non-O positions
    n = 0
    numError = 0
    numRight = 0
    while n < len(final_y):
        if final_y[n] != final_z[n] and final_z[n] != 'O':
            numError += 1
        if final_y[n] == final_z[n] and final_z[n] != 'O':
            numRight += 1
        n += 1
    print('Wrong predictions:', numError)
    print('Right predictions:', numRight)
    print('Acc:', numRight * 1.0 / (numError + numRight))
    print('Words:', [idx2vocab[idx] for idx in test_datas_[0]])
    print('True labels:', [idx2label[idx] for idx in test_labels_[0]])
The constructed model is shown in the figure.
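Neither script shows inference on raw text. A sketch of tagging one new sentence with the saved BiLSTM-CRF model; it assumes vocab2idx and idx2label from the script above are in scope, and the sentence itself is made up:

#Tagging a new sentence with the saved model: a sketch.
import numpy as np
from keras.preprocessing import sequence
from keras.models import load_model
from keras_contrib.layers import CRF

MAX_LEN = 100
model = load_model('bilstm_ner_model.h5', custom_objects={'CRF': CRF}, compile=False)

sentence = 'APT32 has targeted government institutions in Vietnam'.split()
ids = [vocab2idx.get(w, vocab2idx['UNK']) for w in sentence]
x = sequence.pad_sequences([ids], maxlen=MAX_LEN)
pred = np.argmax(model.predict(x), axis=2)[0]
tags = [idx2label[i] for i in pred[-len(sentence):]]   #pad_sequences pre-pads, so real tokens sit at the tail
print(list(zip(sentence, tags)))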
Comparative experiments and hyperparameter tuning are left for readers to try; I will share tuning notes when time allows.

VII. Summary

That is the end of this article; I hope it helps. A follow-up post will bring in the classic BERT model. The busy 2023 really was busy: projects, proposals, papers, graduation, and work. Once things quiet down I will write a few solid security posts. Thank you for the support and companionship, above all my family's encouragement. Keep going!

Life is made of crossroads: one game, one struggle, one gain or loss after another. Different choices, different kinds of wonderful. Tired and busy as things are, seeing little Luo is satisfaction enough. Thanks to my family for their company; may little Luo grow up happy and healthy. Love you all; back to work!

(By: Eastmount, 2024-01-31, written at the Provincial Library, http://blog.csdn.net/eastmount/)