

news 2025/12/20 11:08:04

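Using XPath to scrape the link to every wallpaper on a listing page

XPath is an expression language designed for XML. It can extract data directly from an XML document, and it can be used on HTML as well. HTML is not strictly a subset of XML, but lxml's HTML parser builds the same kind of element tree, so the same XPath expressions apply.

1. Install the lxml package. In a terminal, run:

    pip install lxml

Then load the file to be parsed and build an element tree from it:

    from lxml import etree
    # Alternatively:
    # from lxml import html
    # etree = html.etree

    # Load the data to be parsed
    with open("text.html", mode="r", encoding="utf-8") as f:
        pageSource = f.read()
    # print(pageSource)

    # Parse the data; etree.HTML returns an Element object
    et = etree.HTML(pageSource)
    # print(et)  # <Element html at 0x1f349424980>

    # Extract content from the element using XPath syntax
    result = et.xpath("/html/body/span/text()")  # text() extracts the text inside the tag
    print(result)

Loading the HTML file is only the first step. To extract data you have to locate it in the document. In text.html, for example, /html/body/span locates <span>我爱你</span>; to get the text inside it, append /text(), which returns the text at the located position.

text.html:

    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Title</title>
    </head>
    <body>
        <span>我爱你</span>
        <ul>
            <li><a href="http://www.baidu.com">百度</a></li>
            <li><a href="http://www.google.com">谷歌</a></li>
            <li><a href="http://www.sohu.com">搜孤</a></li>
        </ul>
        <ol>
            <li><a href="http://feiji">飞机</a></li>
            <li><a href="http://dapao">大炮</a></li>
            <li><a href="http://huoche">火车</a></li>
        </ol>
        <div class="job">浙江</div>
        <div class="common">美女</div>
    </body>
    </html>

<a href="...">XX</a> is a hyperlink. XX is the text shown on the page; when the user clicks it, the browser jumps to the address in href.

Requirement: extract the text of the hyperlinks. The Baidu, Google and Sohu links are wrapped in ul, while the plane, cannon and train links are wrapped in ol. You could write two extraction expressions, or use the wildcard *. The wildcard matches any element at that level, so only the rest of the path has to match.

    result = et.xpath("/html/body/*/li/a/text()")
    print(result)
    # Output: ['百度', '谷歌', '搜孤', '飞机', '大炮', '火车']

Requirement: extract an attribute of the a tag, that is, the link address. In XPath, @ denotes an attribute, so @href extracts the href attribute of the a tag.

    # Extract the attribute of the a tag, i.e. the link address
    result = et.xpath("/html/body/*/li/a/@href")  # @href selects the href attribute
    print(result)
    # Output: ['http://www.baidu.com', 'http://www.google.com', 'http://www.sohu.com', 'http://feiji', 'http://dapao', 'http://huoche']

Optimization: the path above spells out too many levels. Use // instead, which matches at any position in the document.

    result = et.xpath("//a/@href")  # // matches at any position
    print(result)
    # Output: ['http://www.baidu.com', 'http://www.google.com', 'http://www.sohu.com', 'http://feiji', 'http://dapao', 'http://huoche']

Restriction: suppose you want the content of a div, but only 浙江 and not 美女. Restrict the match on the div's class attribute with [@attribute='value']. This returns only the div you want instead of every div.

    result = et.xpath("//div[@class='job']/text()")  # [@attribute='value'] restricts the match by attribute
    print(result)
    # Output: ['浙江']
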
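The expressions covered so far can be tried without a separate text.html by parsing an inline string. The following is a minimal sketch, not the author's code: `SAMPLE_HTML` is a trimmed, hypothetical copy of the sample page and `run_queries` is a helper name introduced only for illustration. It assumes lxml is installed.

```python
from lxml import etree

# Hypothetical inline copy of the sample page, trimmed to a few elements.
SAMPLE_HTML = """
<html><body>
<span>我爱你</span>
<ul>
  <li><a href="http://www.baidu.com">百度</a></li>
  <li><a href="http://www.google.com">谷歌</a></li>
</ul>
<ol>
  <li><a href="http://feiji">飞机</a></li>
</ol>
<div class="job">浙江</div>
<div class="common">美女</div>
</body></html>
"""


def run_queries(source):
    """Hypothetical helper: evaluate each expression and return the results by name."""
    et = etree.HTML(source)
    return {
        # Absolute path plus text() for the text inside the span
        "span_text": et.xpath("/html/body/span/text()"),
        # * matches both ul and ol at that level
        "link_text": et.xpath("/html/body/*/li/a/text()"),
        # // matches a tags anywhere; @href selects the attribute
        "link_href": et.xpath("//a/@href"),
        # The predicate keeps only the div whose class is "job"
        "job_div": et.xpath('//div[@class="job"]/text()'),
    }


results = run_queries(SAMPLE_HTML)
for name, value in results.items():
    print(name, value)
```

Each query returns a list, even when only one node matches, which is why single values are usually read with an index such as `[0]`.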
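Requirement: get the information inside ul, with each link text paired with its own href. Approach: iterate over the li elements and extract the data from each one in turn. ./ denotes the current element.

    # With a loop
    result = et.xpath("/html/body/ul/li")
    for item in result:
        href = item.xpath("./a/@href")[0]   # ./ is relative to the current element
        text = item.xpath("./a/text()")[0]  # ./ is relative to the current element
        print(href, text)
    # Output:
    # http://www.baidu.com 百度
    # http://www.google.com 谷歌
    # http://www.sohu.com 搜孤

Analyzing the main page

An excerpt of the page source (the href resolves to https://desk.zol.com.cn/bizhi/3114_39082_2.html):

    </li><li class="photo-list-padding"><a class="pic" href="/bizhi/3114_39082_2.html" target="_blank" hidefocus="true"><span title="小黄人可爱高清壁纸大全">小黄人可爱高清壁纸大全

To find the data, start with the a tag and look at its href. Open that address to confirm it leads to the picture. Then fetch the page source and extract the href from the a tags. Because the page contains many a tags, the match has to be restricted: here we want the ones inside <ul class="pic-list2 clearfix">, that is, the href values under ul > li > a.

Problems and fixes

1. If the text comes back garbled, check the encoding declared in the page source. Here it is gb2312, which GBK covers as a superset, so setting resp.encoding to gbk is enough.
2. If you get 503 Service Unavailable, try sending request headers that match those of a real browser.

    import requests
    from lxml import etree

    # Fetch the page source
    url = "https://desk.zol.com.cn/dongman/good_1.html"
    head = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0"
    }
    resp = requests.get(url, headers=head)
    resp.encoding = "gbk"
    txt = resp.text
    # print(txt)

    # Extract the href values from the source
    # <ul class="pic-list2 clearfix"> <li class="photo-list-padding">
    et = etree.HTML(txt)
    result = et.xpath('//ul[@class="pic-list2 clearfix"]/li/a/@href')
    print(result)
    # Output:
    # ['https://down10.zol.com.cn/desktoptools/XZDesktop_5018_3.1.3.6.exe',
    # '//desk.zol.com.cn/bizhi/9109_111583_2.html', '/bizhi/8676_107002_2.html',
    # '/bizhi/8530_105480_2.html', '/bizhi/8376_103851_2.html',
    # '/bizhi/8365_103747_2.html', '/bizhi/8339_103479_2.html',
    # '/bizhi/8336_103439_2.html', '/bizhi/8317_103216_2.html',
    # '/bizhi/8287_102877_2.html', '/bizhi/8286_102865_2.html',
    # '/bizhi/8280_102793_2.html', '/bizhi/8264_102617_2.html',
    # '/bizhi/8263_102613_2.html', '/bizhi/8261_102593_2.html',
    # '/bizhi/8260_102585_2.html', '/bizhi/8246_102427_2.html',
    # '/bizhi/8245_102410_2.html', '/bizhi/8242_102394_2.html',
    # '/bizhi/8234_102309_2.html', '/bizhi/8231_102269_2.html']

These links cannot be opened as they are, because most of them lack the domain, so prepend it. Note that this only suits the entries that start with /bizhi/. The first two entries in the output already carry a host, so plain concatenation does not produce a valid address for them.

    domin = "https://desk.zol.com.cn"
    for item in result:
        url = domin + item
        print(url)

All the code

    import requests
    from lxml import etree

    # Fetch the page source
    url = "https://desk.zol.com.cn/dongman/good_1.html"
    head = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0"
    }
    resp = requests.get(url, headers=head)
    resp.encoding = "gbk"
    txt = resp.text
    print(txt)

    # Extract the href values from the source
    # <ul class="pic-list2 clearfix"> <li class="photo-list-padding">
    et = etree.HTML(txt)
    result = et.xpath('//ul[@class="pic-list2 clearfix"]/li/a/@href')
    print(result)

    domin = "https://desk.zol.com.cn"
    for item in result:
        url = domin + item
        print(url)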
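The domain-joining step above concatenates strings, which only suits hrefs that start with a single slash. One possible alternative, shown here as a sketch rather than the author's method, is the standard library's `urllib.parse.urljoin`, which resolves an href against the page URL the way a browser would. `to_absolute` and `sample` are hypothetical names introduced for illustration; the three sample hrefs are copied from the scraped output.

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the article).
PAGE_URL = "https://desk.zol.com.cn/dongman/good_1.html"


def to_absolute(hrefs, base=PAGE_URL):
    """Hypothetical helper: resolve each scraped href against the page URL."""
    return [urljoin(base, href) for href in hrefs]


# The three shapes of href that appear in the scraped output:
# already absolute, protocol-relative (//host/...), and root-relative (/path).
sample = [
    "https://down10.zol.com.cn/desktoptools/XZDesktop_5018_3.1.3.6.exe",
    "//desk.zol.com.cn/bizhi/9109_111583_2.html",
    "/bizhi/8676_107002_2.html",
]

for link in to_absolute(sample):
    print(link)
```

With this approach the absolute link is left unchanged, the protocol-relative link inherits https from the page URL, and the root-relative link gets the page's host. The first entry points to an .exe download rather than a wallpaper page, so a real scraper would probably also filter on the /bizhi/ path.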


http://www.pierceye.com/news/403003/