

news 2025/12/21 9:45:11

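1 Searching the document tree
1.1 find and find_all
1.2 Scraping images
2 Other bs4 usage
3 CSS selectors
4 Basic selenium usage
4.1 Simulated login
5 Other selenium usage
5.1 Headless browser
5.2 Locating tags

Review: traversing the document tree

1. Sending a request through a proxy with requests: proxies = {'https': '192.168.1.12:8090'}
2. Using proxies
   - high-anonymity versus transparent proxies
   - free proxies: scrape a free proxy list (open-source projects exist for this), for example https://www.zdaye.com/free/, and validate each proxy before use
   - paid proxies
3. Getting the visitor's IP in Django, over the public internet
   - if Django runs on a private network, access from inside the LAN works fine
   - a request that goes out to the public internet cannot find its way back in
   - intranet penetration (tunnelling) bridges the public and private networks
4. Scraping a video site
   - fetch the videos one by one, work out the address pattern, and extract it with a regex
   - parse out the video id and the video address
   - send the Referer header
   - if a video will not play, compare a playable address with an unplayable one and find the difference
5. Scraping news: requests + bs4, using find_all and find
6. Introduction to bs4: it is a parsing library (it also handles XML); specify the parser, lxml or html.parser
7. Traversing the document tree
   - soup = BeautifulSoup()
   - soup.body.title returns an object with the same methods and attributes, because it is a Tag and BeautifulSoup inherits from Tag
   - since the BeautifulSoup class inherits from Tag, every tag you get is a Tag object, and traversal, attribute access, and text access all work the same way as on the BeautifulSoup object
   - the dot syntax finds a tag, but only the first match
   - dots can be chained: .tag.tag
   - tag name: soup.body.name
   - tag attributes: soup.tag.attrs.get(attribute name); class comes back as a list
   - tag text: text (the text of all descendants joined together), string (set only when the tag has exactly one piece of content of its own), strings (a generator over all descendant strings)
   - child nodes, sibling nodes, parent nodes

1 Searching the document tree

- find_all returns every match, as a list
- find returns a single match, as a Tag object

1.1 find and find_all

    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b><span>lqz</span></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, 'html.parser')

    # Five kinds of filter: string, regular expression, list, True, function

    #### String filter
    # - can match by tag name, by attribute, or by text content
    # - in every case the value is compared as a plain string

    # p = soup.find('p')
    # Find the p tag whose class is "story"
    # p = soup.find(name='p', class_='story')

    #### By tag name, by attribute, or by text content
    # (string= is the current keyword; older bs4 code spells it text=)
    # obj = soup.find(name='span', string='lqz')
    # obj = soup.find(href='http://example.com/tillie')

    # Attributes can also be passed as a dict
    # obj = soup.find(attrs={'class': 'title'})
    # print(obj)

Regular-expression filter: whether you match by tag name, attribute, or text content, a compiled pattern from the re module can be used in place of the string, for example to find every tag whose name starts with b.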
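The string and regular-expression filters can be tried without any network access. The following is a minimal sketch, assuming bs4 is installed; it uses a shortened copy of the sample document, and the variable names `story`, `b_tags`, and `http_link_ids` are illustrative only.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the sample document, inlined so the sketch is self-contained.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# String filter: exact tag name plus exact class value; find returns one Tag.
story = soup.find(name="p", class_="story")

# Regex filter on the tag name: every tag whose name starts with "b".
b_tags = [tag.name for tag in soup.find_all(name=re.compile("^b"))]

# Regex filter on an attribute: every tag whose href starts with "http:".
http_link_ids = [a["id"] for a in soup.find_all(href=re.compile("^http:"))]

print(story["class"])
print(b_tags)
print(http_link_ids)
```

Both filters go through the same keyword arguments; only the type of the value changes, which is why the list, True, and function filters follow the same pattern.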
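    import re

    # obj = soup.find_all(name=re.compile('^b'))
    # obj = soup.find_all(name=re.compile('y$'))
    # obj = soup.find_all(href=re.compile('^http:'))
    # obj = soup.find_all(string=re.compile('i'))
    # print(obj)

    ### List filter: tag name, attribute, or text content matches any item in the list
    # obj = soup.find_all(name=['p', 'a'])
    # obj = soup.find_all(class_=['sister', 'title'])
    # print(obj)

    ### True filter: matches any tag that has the name or attribute at all
    # obj = soup.find_all(id=True)
    # obj = soup.find_all(href=True)
    # obj = soup.find_all(name='img', src=True)
    # print(obj)

    ### Function filter: the function receives each tag and returns True or False
    def has_class_but_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')

    print(soup.find_all(name=has_class_but_no_id))

1.2 Scraping images

    import os

    import requests
    from bs4 import BeautifulSoup

    res = requests.get('https://pic.netbian.com/tupian/32518.html')
    res.encoding = 'gbk'
    # print(res.text)

    soup = BeautifulSoup(res.text, 'html.parser')
    ul = soup.find('ul', class_='clearfix')
    img_list = ul.find_all(name='img', src=True)

    # Make sure the output directory exists before writing into it
    os.makedirs('./img', exist_ok=True)
    for img in img_list:
        try:
            url = img.attrs.get('src')
            # Relative addresses need the site prefix
            if not url.startswith('http'):
                url = 'https://pic.netbian.com' + url
            print(url)
            res1 = requests.get(url)
            # Use the part after the last "-" as the file name
            name = url.split('-')[-1]
            with open('./img/%s' % name, 'wb') as f:
                for chunk in res1.iter_content():
                    f.write(chunk)
        except Exception as e:
            # Report the failure and move on to the next image
            print('download failed:', e)
            continue

2 Other bs4 usage

1. Besides traversing and searching the document tree, bs4 can also modify XML. Java configuration files are usually written in XML; common configuration formats are .conf, .ini, .yaml, and .xml.
2. Other find_all parameters:
   - limit: the maximum number of results; with limit=1 you get one item
   - recursive
3. Searching and traversing can be mixed, and reading attributes and text works exactly as before.

3 CSS selectors

    # id selector:     #id
    # tag selector:    tag name
    # class selector:  .class name

    # The ones to remember
    #id
    .sister
    head
    div>a    # a elements that are direct children of div
    div a    # a elements anywhere below div

Once you know CSS selectors, you can use them with every parsing library.

    import requests
    from bs4 import BeautifulSoup

    res = requests.get('https://www.cnblogs.com/liuqingzheng/p/16005896.html')
    # print(res.text)
    soup = BeautifulSoup(res.text, 'html.parser')
    # The title value "下载哔哩哔哩视频" means "download Bilibili videos"
    # a = soup.find(name='a', title='下载哔哩哔哩视频')
    # print(a.attrs.get('href'))

    # A selector like this one can be copied straight from the browser
    # p = soup.select('#cnblogs_post_body p:nth-child(2) a:nth-child(5)')[0].attrs.get('href')
    p = soup.select('a[title="下载哔哩哔哩视频"]')[0].attrs.get('href')
    print(p)

4 Basic selenium usage

This module can send requests, parse the result, and also execute JavaScript. selenium started out as an automated testing tool; crawlers use it mainly because requests cannot execute JavaScript.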
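What that limitation means in practice can be shown offline. This is a small sketch, assuming bs4 is installed; the page in `raw_html` is made up for illustration and stands in for the raw response that requests would hand to the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the visible text is filled in by JavaScript at load time.
raw_html = """
<html><body>
<div id="app"></div>
<script>document.getElementById("app").textContent = "price: 42";</script>
</body></html>
"""

# A static parser only sees the markup as it was sent; the script is never run.
soup = BeautifulSoup(raw_html, "html.parser")
app_text = soup.find(id="app").get_text()
script_source = soup.script.string

print(repr(app_text))   # the div is still empty
print(script_source)    # the JavaScript is present only as text
```

A real browser would run the script and fill the div, which is the gap selenium closes by driving an actual browser.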
- selenium handles automated testing for the web; Appium does the same for apps
- selenium can drive a browser and imitate what a person does

How to use it:

1. Download the browser driver:
   - https://registry.npmmirror.com/binary.html?path=chromedriver/
   - https://googlechromelabs.github.io/chrome-for-testing/
   - https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/119.0.6045.105/win64/chromedriver-win64.zip

   The driver has to match both the browser type (IE, Firefox, Chrome) and its version. Taking Chrome as the example: Chrome has many versions, and each one needs the matching driver version.
2. Install selenium.
3. Write Python code that operates the browser:

    import time
    from selenium import webdriver

    # Just like a person would: open Chrome and get a browser object
    bro = webdriver.Chrome()
    # Type the address into the address bar
    bro.get('https://www.baidu.com')
    time.sleep(5)
    bro.close()

4.1 Simulated login

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    bro = webdriver.Chrome()
    bro.get('https://www.baidu.com')
    bro.implicitly_wait(10)  # when a tag is not found in the page, wait up to 10 seconds for it
    # Maximize the window
    bro.maximize_window()
    # print(bro.page_source)  # HTML of the current page
    # Find the login button with a locator (a CSS selector would also work)
    # a_login = bro.find_element(by=By.NAME, value='tj_login')
    # a_login = bro.find_element(by=By.ID, value='s-top-loginbtn')
    a_login = bro.find_element(by=By.LINK_TEXT, value='登录')  # link text of the a tag ("登录" = "log in")
    time.sleep(2)
    # Click it
    a_login.click()

    # Find the SMS login tab and click it
    sms_login = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__changeSmsCodeItem')
    sms_login.click()
    time.sleep(1)
    # Switch back to username and password login
    user_login = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__changePwdCodeItem')
    user_login.click()
    time.sleep(1)
    username = bro.find_element(by=By.NAME, value='userName')
    # Type into the input box
    username.send_keys('lqzqq.com')
    password = bro.find_element(by=By.ID, value='TANGRAM__PSP_11__password')
    # Type into the input box
    password.send_keys('lqzqq.com')

    # Tick the agreement checkbox
    agree = bro.find_element(By.ID, 'TANGRAM__PSP_11__isAgree')
    agree.click()
    time.sleep(1)

    submit = bro.find_element(By.ID, 'TANGRAM__PSP_11__submit')
    submit.click()

    time.sleep(3)
    bro.close()

5 Other selenium usage

5.1 Headless browser

A crawler only needs the data, so the browser does not have to be visible. A headless browser hides the graphical interface.

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument('--blink-settings=imagesEnabled=false')  # do not load images, which speeds things up
    chrome_options.add_argument('--headless')  # do not show a browser window
    # On Linux, if the system has no graphical display, startup fails without the --headless flag
    bro = webdriver.Chrome(options=chrome_options)

    bro.get('https://www.cnblogs.com/liuqingzheng/p/16005896.html')

    print(bro.page_source)
    time.sleep(3)
    bro.close()

5.2 Locating tags

1. Locating tags

    By.ID                 # by id
    By.NAME               # by the name attribute
    By.TAG_NAME           # by tag name
    By.CLASS_NAME         # by class name
    By.LINK_TEXT          # by the text of an a tag
    By.PARTIAL_LINK_TEXT  # by the text of an a tag, partial match
    # --------- the ones above are selenium's own ---------
    By.CSS_SELECTOR       # by CSS selector
    By.XPATH              # by XPath

2. Getting a tag's attributes, text, size, and position

    print(tag.get_attribute('src'))
    print(tag.id)        # not the HTML id; can be ignored
    print(tag.location)
    print(tag.tag_name)
    print(tag.size)

A complete example:

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument('--blink-settings=imagesEnabled=false')  # do not load images, which speeds things up
    chrome_options.add_argument('--headless')  # no browser window; required on Linux systems without a display
    bro = webdriver.Chrome(options=chrome_options)

    bro.get('https://www.cnblogs.com/liuqingzheng/p/16005896.html')

    #### Not recommended; prefer the lookups that selenium provides
    # soup = BeautifulSoup(bro.page_source, 'html.parser')
    # print(soup.find(title='下载哔哩哔哩视频').attrs.get('href'))

    #### After finding a tag, read its attributes, text, position, and size
    div = bro.find_element(By.ID, 'cnblogs_post_body')
    # res = div.get_attribute('class')  # read an attribute
    print(div.get_attribute('class'))
    print(div.id)        # not the HTML id; can be ignored
    print(div.location)  # position in the page, as x and y coordinates
    print(div.tag_name)  # tag name
    print(div.size)      # size of the tag, as width and height
    print(div.text)      # text content

    ## Find every div in the page
    # divs = bro.find_elements(By.TAG_NAME, 'div')
    # print(len(divs))

    # By class name
    # div = bro.find_element(By.CLASS_NAME, 'postDesc').text
    # print(div)

    # By CSS selector
    # div = bro.find_element(By.CSS_SELECTOR, 'div.postDesc').text
    # div = bro.find_element(By.CSS_SELECTOR, '#topics div div.postDesc').text
    # print(div)

    # By XPath (XPath syntax is a topic to study on its own)
    # div = bro.find_element(By.XPATH, '//*[@id="topics"]/div/div[3]').text
    # print(div)

    time.sleep(1)
    bro.close()
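The XPath in the last example reads as: any element whose id is topics, then a div child, then that div's third div child. The sketch below checks that reading without a browser. It is not the selenium API: it uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset, and the markup in `page` is a made-up, well-formed stand-in for the blog page's structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the structure of the blog page.
page = """
<html><body>
  <div id="topics">
    <div class="post">
      <div class="postTitle">title</div>
      <div class="postBody">body</div>
      <div class="postDesc">description</div>
    </div>
  </div>
</body></html>
""".strip()
root = ET.fromstring(page)

# ElementTree paths are relative to the element they start from, hence the leading ".".
node = root.find(".//*[@id='topics']/div/div[3]")

print(node.get("class"), node.text)
```

The position in `div[3]` counts only div siblings and starts at 1, so it selects the same element as the class-based CSS selector div.postDesc in this fragment.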


http://www.pierceye.com/news/989737/