Contents

- Basic crawler workflow
- Request and response
- Demo
- Parsing methods
- The requests library
  - Basic GET requests: 1. basic usage; 2. GET with parameters; 3. parsing JSON; 4. getting binary data; 5. adding headers
  - Basic POST requests
  - Response and status code checking
  - Advanced operations
- The beautifulsoup library: crawling Autohome example

## Basic crawler workflow

1. Initiate a request: send a request to the target site through an HTTP library, i.e. send a Request, which can carry extra information such as headers, then wait for the server to respond.
2. Get the response content: if the server responds normally, you get a Response whose body is the page content you wanted. It may be HTML, a JSON string, or binary data such as images or videos.
3. Parse the content: HTML can be parsed with regular expressions or a web-parsing library; JSON can be converted directly into a JSON object; binary data can be saved or processed further.
4. Save the data: as plain text, in a database, or in a file of some specific format.

## Request and response

- The browser sends a message to the server hosting the URL; this is the HTTP request.
- The server receives the browser's message, processes it according to its content, and sends a message back to the browser; this is the HTTP response.
- The browser receives the response, processes it, and renders the result.

### request

- Request method: mainly GET and POST, plus HEAD, PUT, DELETE, OPTIONS, and others.
- Request URL (Uniform Resource Locator): any web page, image, or document can be uniquely identified by a URL.
- Request headers: header information sent along with the request, such as User-Agent, Host, and Cookies.
- Request body: extra data carried by the request, such as the form data of a form submission.

### response

- Response status: there are many status codes, e.g. 200 (success), 301 (redirect), 404 (page not found), 502 (server error).
- Response headers: content type, content length, server information, cookies to set, and so on.
- Response body: the most important part; it contains the requested resource, such as the page HTML or the binary data of an image.

## Demo

```python
import requests

# Fetch a page
uheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'}
response = requests.get('http://www.baidu.com', headers=uheaders)
print(response.text)
print(response.headers)
print(response.status_code)

# Fetch an image
response = requests.get('https://www.baidu.com/img/baidu_jgylogo3.gif')
res = response.content
with open('1.gif', 'wb') as f:
    f.write(res)
```

## Parsing methods

- Direct processing
- JSON parsing
- Regular expressions
- beautifulsoup
- pyquery
- xpath

## The requests library

The various request methods:

```python
import requests

requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')
```
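Each of these helpers simply builds and sends an HTTP request with the corresponding verb. The mapping can be inspected without touching the network by preparing a request instead of sending it; a small sketch (the httpbin URL is only a placeholder):

```python
import requests

# Build requests without sending them, to show the verb each helper would use
for method in ('POST', 'PUT', 'DELETE', 'HEAD', 'OPTIONS'):
    prepared = requests.Request(method, 'http://httpbin.org/anything').prepare()
    print(prepared.method, prepared.url)
```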
### Basic GET requests

#### 1. Basic usage

```python
import requests

response = requests.get('http://httpbin.org/get')
print(response.text)
```

Output:

```
{"args": {}, "headers": {"Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Connection": "close", "Host": "httpbin.org", "User-Agent": "python-requests/2.19.1"}, "origin": "115.214.23.142", "url": "http://httpbin.org/get"}
```

#### 2. GET with parameters

```python
import requests

response = requests.get('http://httpbin.org/get?name=germey&age=22')
print(response.text)

# Equivalent: pass the parameters as a dict
data = {'name': 'germey', 'age': 22}
response = requests.get('http://httpbin.org/get', params=data)
```

Output:

```
{"args": {"age": "22", "name": "germey"}, "headers": {"Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Connection": "close", "Host": "httpbin.org", "User-Agent": "python-requests/2.19.1"}, "origin": "115.214.23.142", "url": "http://httpbin.org/get?name=germey&age=22"}
```
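The `params` dict is URL-encoded into the query string for you. This can be checked without sending anything by preparing the request; a sketch (not part of the original post):

```python
import requests

req = requests.Request('GET', 'http://httpbin.org/get',
                       params={'name': 'germey', 'age': 22})
prepared = req.prepare()
print(prepared.url)  # the query string is built from the dict
```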
#### 3. Parsing JSON

```python
import requests

response = requests.get('http://httpbin.org/get')
print(type(response.text))
print(response.json())
print(type(response.json()))
```

Output:

```
<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.19.1'}, 'origin': '115.214.23.142', 'url': 'http://httpbin.org/get'}
<class 'dict'>
```
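`response.json()` is essentially `json.loads(response.text)`, so the same conversion can be sketched with the standard library alone (the sample string below stands in for a real response body):

```python
import json

# A response body like the one above, as a plain string
text = '{"args": {}, "origin": "115.214.23.142", "url": "http://httpbin.org/get"}'
data = json.loads(text)  # roughly what response.json() does internally
print(type(data))
print(data['url'])
```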
#### 4. Getting binary data

```python
import requests

response = requests.get('http://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:  # the original post left open() blank; a filename and 'wb' mode are needed
    f.write(response.content)
```

#### 5. Adding headers

```python
import requests

headers = {'User-Agent': ''}  # fill in a real browser User-Agent string here
response = requests.get('http://www.zhihu.com/explore', headers=headers)
print(response.text)
```

### Basic POST requests

```python
import requests

data = {'name': 'germey', 'age': 22}
headers = {'User-Agent': ''}  # fill in a real browser User-Agent string here
response = requests.post('http://httpbin.org/post', data=data, headers=headers)
print(response.json())
```

Output:

```
{'args': {}, 'data': '', 'files': {}, 'form': {'age': '22', 'name': 'germey'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Content-Length': '18', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': ''}, 'json': None, 'origin': '115.214.23.142', 'url': 'http://httpbin.org/post'}
```
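As the `form` and `Content-Type` fields in the output show, `data=` is sent as a URL-encoded form body. This too can be checked without a network round trip by preparing the request; a sketch:

```python
import requests

req = requests.Request('POST', 'http://httpbin.org/post',
                       data={'name': 'germey', 'age': 22})
prepared = req.prepare()
print(prepared.headers['Content-Type'])  # form encoding is set automatically
print(prepared.body)                     # the dict becomes a urlencoded string
```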
### Response

response attributes:

```python
import requests

response = requests.get('http://www.jianshu.com')
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)
```

Output:

```
<class 'int'> 403
<class 'requests.structures.CaseInsensitiveDict'> {'Date': 'Wed, 31 Oct 2018 06:25:29 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'Tengine', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Content-Encoding': 'gzip', 'X-Via': '1.1 dianxinxiazai180:5 (Cdn Cache Server V2.0), 1.1 PSzjjxdx10wx178:11 (Cdn Cache Server V2.0)'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[]>
<class 'str'> https://www.jianshu.com/
<class 'list'> [<Response [301]>]
```

### Status code checking

There are many status codes to check against. `requests.codes` provides named constants (for example, `requests.codes.ok` is 200), so status checks can be written against names instead of bare numbers.
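Instead of comparing codes by hand, an error status can also be turned into an exception with `raise_for_status()`. A sketch (the Response object is built by hand here purely to avoid a network call; normally `requests.get` returns it):

```python
import requests

response = requests.models.Response()  # hand-built for illustration only
response.status_code = 404

print(response.status_code == requests.codes.ok)  # False
try:
    response.raise_for_status()  # raises HTTPError for 4xx/5xx codes
except requests.exceptions.HTTPError as e:
    print('request failed:', e)
```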
### Advanced operations

#### File upload

```python
import requests

files = {'file': open('1.jpg', 'rb')}
response = requests.post('http://httpbin.org/post', files=files)
print(response.text)
```

#### Getting cookies

```python
import requests

response = requests.get('http://www.baidu.com')
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)
```

Output:

```
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315
```

#### Session maintenance

```python
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
response = s.get('http://httpbin.org/cookies')
print(response.text)
```

Output:

```
{"cookies": {"number": "123456789"}}
```

#### Certificate verification

requests verifies TLS certificates by default; for sites with invalid certificates, pass `verify=False` to skip verification.

#### Proxy settings

Pass a `proxies` dict mapping each scheme to a proxy URL via the `proxies` argument.
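A sketch of how the two options are passed (the proxy address below is only a placeholder, and the actual calls are left commented out because they need a reachable proxy):

```python
import requests

proxies = {
    'http': 'http://127.0.0.1:9743',   # placeholder proxy address
    'https': 'http://127.0.0.1:9743',  # placeholder proxy address
}
# requests.get('https://www.taobao.com', proxies=proxies)  # route through the proxy
# requests.get('https://example.com', verify=False)        # skip certificate verification
print(sorted(proxies))
```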
#### Timeout settings

```python
import requests
from requests.exceptions import ReadTimeout

try:
    response = requests.get('https://www.taobao.com', timeout=1)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
```

#### Authentication

```python
import requests

r = requests.get('', auth=('user', '123'))  # the URL was left blank in the original post
print(r.status_code)
```

#### Exception handling

```python
import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    response = requests.get('http://httpbin.org/get', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except ConnectionError:
    print('connect error')
except RequestException:
    print('Error')
```
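Ordering the `except` clauses from specific to general works because requests' exceptions form a class hierarchy, which can be checked directly:

```python
from requests.exceptions import (ConnectionError, ConnectTimeout, ReadTimeout,
                                 RequestException, Timeout)

# The narrower exception classes all derive from RequestException
print(issubclass(ReadTimeout, Timeout))               # True
print(issubclass(ConnectTimeout, ConnectionError))    # True
print(issubclass(ConnectionError, RequestException))  # True
print(issubclass(Timeout, RequestException))          # True
```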
## The beautifulsoup library

### Example: crawling Autohome

```python
import requests                # pretend to be a browser and send the HTTP request
from bs4 import BeautifulSoup  # parse an HTML string into an object with .find/.find_all

response = requests.get('https://www.autohome.com.cn/news/')
response.encoding = 'gbk'  # the site is gbk-encoded
soup = BeautifulSoup(response.text, 'html.parser')
div = soup.find(name='div', attrs={'id': 'auto-channel-lazyload-article'})
li_list = div.find_all(name='li')
for li in li_list:
    title = li.find(name='h3')
    if not title:
        continue
    p = li.find(name='p')
    a = li.find(name='a')
    print(title.text)           # title
    print(a.attrs.get('href'))  # title link; attrs is a dict, so fetch the value by key
    print(p.text)               # summary
    img = li.find(name='img')   # image
    src = img.get('src')
    src = 'https:' + src
    print(src)
    file_name = src.rsplit('/', maxsplit=1)[1]
    ret = requests.get(src)
    with open(file_name, 'wb') as f:
        f.write(ret.content)    # binary data
```

Reprinted from: https://www.cnblogs.com/qiuyicheng/p/10753117.html
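The same `find`/`find_all` pattern can be exercised offline against a fixed HTML snippet (the markup below is a made-up miniature of the page structure, not Autohome's real HTML):

```python
from bs4 import BeautifulSoup

html = '''
<div id="auto-channel-lazyload-article">
  <ul>
    <li><h3>Some headline</h3>
        <a href="//www.autohome.com.cn/news/1.html"></a>
        <p>Article summary</p></li>
  </ul>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find(name='div', attrs={'id': 'auto-channel-lazyload-article'})
li = div.find(name='li')
print(li.find(name='h3').text)                     # the headline text
print('https:' + li.find(name='a').attrs['href'])  # protocol-relative link made absolute
```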