

news 2025/11/20 1:12:35

1. Introduction

Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping). It lets you extract the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used far more widely: for data mining, monitoring, and automated testing, for retrieving data returned by APIs (for example Amazon Associates Web Services), and as a general-purpose web crawler.

Scrapy is built on Twisted, a popular event-driven networking framework for Python, so it achieves concurrency with non-blocking (asynchronous) code. The overall architecture is roughly as follows.

The data flow in Scrapy is controlled by the execution engine, and goes like this:

1. The Engine gets the initial Requests to crawl from the Spider.
2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
3. The Scheduler returns the next Requests to the Engine.
4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
9. The process repeats (from step 1) until there are no more requests from the Scheduler.

Components

Engine (ENGINE)
    Controls the data flow between all components of the system and triggers events when certain actions occur. See the data flow description above for details.

Scheduler (SCHEDULER)
    Accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs.

Downloader (DOWNLOADER)
    Downloads page content and returns it to the ENGINE. The downloader is built on Twisted's efficient asynchronous model.

Spiders (SPIDERS)
    Classes written by the developer that parse responses and either extract items or issue new requests.

Item Pipelines (ITEM PIPELINES)
    Process items after they have been extracted, mainly cleaning, validation, and persistence (for example, saving to a database).

Downloader Middlewares
    Sit between the Scrapy engine and the downloader. They process requests passing from the ENGINE to the DOWNLOADER and responses passing from the DOWNLOADER back to the ENGINE. You can use this middleware to:
    - process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
    - change received response before passing it to a spider;
    - send a new Request instead of passing received response to a spider;
    - pass response to a spider without fetching a web page;
    - silently drop some requests.

Spider Middlewares
    Sit between the ENGINE and the SPIDERS. Their main job is to process the spiders' input (responses) and output (items and requests).

2. Installation

    # Windows
    1. pip3 install wheel      # enables installing packages from wheel files; wheel index: https://www.lfd.uci.edu/~gohlke/pythonlibs
    2. pip3 install lxml
    3. pip3 install pyopenssl
    4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
    5. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    6. pip3 install <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
    7. pip3 install scrapy

    # Linux
    1. pip3 install scrapy

3. Command-line tool

    # 1. Getting help
    scrapy -h
    scrapy <command> -h

    # 2. There are two kinds of commands. Project-only commands must be run
    #    from inside a project directory; Global commands can be run anywhere.
    Global commands:
        startproject   # create a project
        genspider      # create a spider
        settings       # inside a project directory, shows that project's settings
        runspider      # run a standalone Python file without creating a project
        shell          # scrapy shell <url>: interactive debugging, e.g. checking whether selector rules are correct
        fetch          # fetch a single page independently of any project; can show the headers
        view           # download the page and open it in a browser, which shows which data is loaded by ajax
        version        # scrapy version shows the Scrapy version; scrapy version -v also shows dependency versions
    Project-only commands:
        crawl          # run a spider; requires a project; make sure ROBOTSTXT_OBEY = False in the settings file
        check          # check the project's spiders (runs the spider contract checks)
        list           # list the names of the spiders in the project
        edit           # open a spider in an editor; rarely used
        parse          # scrapy parse <url> --callback <callback>: verifies that a callback works as expected
        bench          # scrapy bench: quick benchmark

    # 3. Running global commands: make sure you are NOT inside a project
    #    directory, so that no project's settings interfere
    scrapy startproject MyProject

    cd MyProject
    scrapy genspider baidu www.baidu.com

    scrapy settings --get XXX      # inside a project directory this shows the project's value

    scrapy runspider baidu.py

    scrapy shell https://www.baidu.com
        response
        response.status
        response.body
        view(response)

    scrapy view https://www.taobao.com      # if the page looks incomplete, the missing parts are loaded by ajax; a quick way to locate the problem

    scrapy fetch --nolog --headers https://www.taobao.com

    scrapy version       # Scrapy version
    scrapy version -v    # versions of the dependencies

    # 4. Running project commands: change into the project directory first
    scrapy crawl baidu
    scrapy check
    scrapy list
    scrapy parse http://quotes.toscrape.com/ --callback parse
    scrapy bench

4. Project structure and a first spider application

    project_name/
        scrapy.cfg
        project_name/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                spider1.py
                spider2.py
                spider3.py

File descriptions:

    scrapy.cfg     Main project configuration, used when deploying Scrapy. Crawler-related settings live in settings.py.
    items.py       Templates for the scraped data, used to give it structure (similar to a Django Model).
    pipelines.py   Data-processing behavior, typically persisting the structured data.
    settings.py    Configuration such as recursion depth, concurrency, and download delay. Important: option names must be upper case, otherwise they are ignored. Correct form: USER_AGENT = 'xxxx'
    spiders/       Spider directory, where you create files and write the crawling rules.

Note: spider files are usually named after the domain of the target site.

    # Create entrypoint.py in the project directory
    from scrapy.cmdline import execute
    execute(['scrapy', 'crawl', 'xiaohua'])

    # Re-wrap stdout with the gb18030 encoding
    import io
    import sys
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

5. Spiders

5.1 Introduction

#1 Spiders are classes that define how a site (or a group of sites) will be crawled, including how to perform the crawl and how to extract structured data from the pages.
#2 In other words, a Spider is where you define the custom crawling and parsing behavior for a particular site (or group of sites).

5.2 What a Spider does, in a loop

#1 It generates the initial Requests for the first URLs and attaches a callback to each. The first requests are defined in start_requests(), which by default builds a Request for every URL in the start_urls list; the default callback is parse. The callback is triggered automatically when the download finishes and a response comes back.
#2 In the callback it parses the response and returns a value. Four kinds of return value are allowed: a dict of parsed data, an Item object, a new Request object (which also needs a callback), or an iterable containing Items or Requests.
#3 Inside the callback the page is usually parsed with Scrapy's own Selectors, but you can equally use BeautifulSoup, lxml, or whatever you prefer.
#4 Finally, the returned Items are persisted to a database through the Item Pipeline component (https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline) or exported to files through Feed exports (https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports).

5.3 The five spider classes

    #1 scrapy.spiders.Spider          # scrapy.Spider is the same as scrapy.spiders.Spider
    #2 scrapy.spiders.CrawlSpider
    #3 scrapy.spiders.XMLFeedSpider
    #4 scrapy.spiders.CSVFeedSpider
    #5 scrapy.spiders.SitemapSpider

5.4 Importing and using them

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.spiders import Spider, CrawlSpider, XMLFeedSpider, CSVFeedSpider, SitemapSpider

    class AmazonSpider(scrapy.Spider):  # custom class inheriting from a base class provided by Scrapy
        name = 'amazon'
        allowed_domains = ['www.amazon.cn']
        start_urls = ['http://www.amazon.cn/']

        def parse(self, response):
            pass

5.5 class scrapy.spiders.Spider

This is the simplest spider class; every other spider class inherits from it, including the ones you write yourself. It provides no special functionality, only a default start_requests() method that reads URLs from start_urls and sends a request for each, with parse as the default callback.

    class AmazonSpider(scrapy.Spider):
        name = 'amazon'
        allowed_domains = ['www.amazon.cn']
        start_urls = ['http://www.amazon.cn/']

        custom_settings = {
            'BOT_NAME': 'Egon_Spider_Amazon',
            # The built-in setting for default headers is DEFAULT_REQUEST_HEADERS
            'DEFAULT_REQUEST_HEADERS': {
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language': 'en',
            },
        }

        def parse(self, response):
            pass

#1 name = 'amazon'
    The spider's name. Scrapy locates the spider by this value, so it is required and must be unique. In Python 2 this must be ASCII only.
#2 allowed_domains = ['www.amazon.cn']
    The domains the spider may crawl. If OffsiteMiddleware is enabled (it is by default), URLs whose domain is neither in this list nor a subdomain of an entry will not be crawled. If the target URL is https://www.example.com/1.html, add 'example.com' to the list.
#3 start_urls = ['http://www.amazon.cn/']
    If no particular URLs are specified, the first requests are generated from this list.
#4 custom_settings
    A dict of settings that override the project-level settings while this spider runs. It must be defined as a class attribute, because settings are loaded before the class is instantiated.
#5 settings
    self.settings['SETTING_NAME'] gives access to the values in settings.py. If you define custom_settings, your own values take precedence.
#6 logger
    A logger whose name defaults to the spider's name:
        self.logger.debug('%s' % self.settings['BOT_NAME'])
#7 crawler (for reference)
    This attribute is set by the from_crawler() class method.
#8 from_crawler(crawler, *args, **kwargs) (for reference)
    You probably won't need to override this directly because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.
#9 start_requests()
    Issues the first Requests and must return an iterable. Scrapy calls it once, when the spider is opened. By default it takes each URL from start_urls and generates Request(url, dont_filter=True). (For dont_filter, see the deduplication rules below.) Override this method to change the initial requests, for example to start with a POST request:

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def start_requests(self):
            return [scrapy.FormRequest("http://www.example.com/login",
                                       formdata={'user': 'john', 'pass': 'secret'},
                                       callback=self.logged_in)]

        def logged_in(self, response):
            # here you would extract links to follow and return Requests for
            # each of them, with another callback
            pass

#10 parse(response)
    The default callback. Every callback must return an iterable of Request and/or dicts or Item objects.
#11 log(message[, level, component]) (for reference)
    Wrapper that sends a log message through the Spider's logger, kept for backwards compatibility. For more information see Logging from Spiders.
#12 closed(reason)
    Triggered automatically when the spider finishes.

Deduplication rules

Deduplication rules should be shared by all spiders: once one spider has crawled a URL, the others should not crawl it again. There are several ways to implement this.

    # Method 1
    # 1. Add a class attribute
    visited = set()

    # 2. In the parse callback
    def parse(self, response):
        if response.url in self.visited:
            return None
        .......
        self.visited.add(response.url)

    # Method 1, improved: URLs can be long, so store a hash of the URL instead
    from hashlib import md5

    def parse(self, response):
        url = md5(response.request.url.encode('utf-8')).hexdigest()
        if url in self.visited:
            return None
        .......
        self.visited.add(url)

    # Method 2: Scrapy's built-in deduplication
    # settings file:
    DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'  # the default filter; the seen set is kept in memory
    DUPEFILTER_DEBUG = False
    JOBDIR = '/root/'  # directory for the log of seen requests; the final path is /root/requests.seen, so the seen set is kept in a file

    # Scrapy's default filter is RFPDupeFilter, so all we need is
    #     Request(..., dont_filter=False)
    # With dont_filter=True, Scrapy leaves this URL out of deduplication.

    # Method 3: write a custom filter modelled on RFPDupeFilter
    from scrapy.dupefilters import RFPDupeFilter  # read the source and model yours on BaseDupeFilter

    # Step 1: create dup.py in the project directory
    class UrlFilter(object):
        def __init__(self):
            self.visited = set()  # or store it in a database

        @classmethod
        def from_settings(cls, settings):
            return cls()

        def request_seen(self, request):
            if request.url in self.visited:
                return True
            self.visited.add(request.url)
            return False

        def open(self):  # can return deferred
            pass

        def close(self, reason):  # can return a deferred
            pass

        def log(self, request, spider):  # log that a request has been filtered
            pass

    # Step 2: settings.py
    DUPEFILTER_CLASS = '<project name>.dup.UrlFilter'

    # Source reference
    from scrapy.core.scheduler import Scheduler
    # see Scheduler.enqueue_request: self.df.request_seen(request)

Examples

    # Example 1
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html',
            'http://www.example.com/2.html',
            'http://www.example.com/3.html',
        ]

        def parse(self, response):
            self.logger.info('A response from %s just arrived!', response.url)

    # Example 2: one callback returning several Requests and Items
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html',
            'http://www.example.com/2.html',
            'http://www.example.com/3.html',
        ]

        def parse(self, response):
            for h3 in response.xpath('//h3').extract():
                yield {"title": h3}

            for url in response.xpath('//a/@href').extract():
                yield scrapy.Request(url, callback=self.parse)

    # Example 3: set the initial URLs directly in start_requests(); start_urls is then unused
    import scrapy
    from myproject.items import MyItem

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']

        def start_requests(self):
            yield scrapy.Request('http://www.example.com/1.html', self.parse)
            yield scrapy.Request('http://www.example.com/2.html', self.parse)
            yield scrapy.Request('http://www.example.com/3.html', self.parse)

        def parse(self, response):
            for h3 in response.xpath('//h3').extract():
                yield MyItem(title=h3)

            for url in response.xpath('//a/@href').extract():
                yield scrapy.Request(url, callback=self.parse)

Passing arguments

You may need to pass arguments to a spider from the command line, for example the initial URL:

    # On the command line
    scrapy crawl myspider -a category=electronics

    # Arguments passed from outside arrive in __init__
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def __init__(self, category=None, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            self.start_urls = ['http://www.example.com/categories/%s' % category]
            # ...

    # Note: every argument arrives as a string. For structured data, parse it
    # yourself with something like json.loads.

6. Selectors

This section covers: // versus /, text, extract and extract_first, attributes, nested lookups, default values, lookup by attribute, fuzzy lookup by attribute, regular expressions, relative XPath, and XPath with variables.

response.selector.css() and response.selector.xpath() can be shortened to response.css() and response.xpath().

    #1 // versus /
    >>> response.css('div a::text')
    >>> response.xpath('//body/a')   # the leading // searches the whole document; the / after body means direct children of body
    []
    >>> response.xpath('//body//a')  # the leading // searches the whole document; the // after body means all descendants of body
    [<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>,
     <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>,
     <Selector xpath='//body//a' data='<a href="image3.html">Name: My image 3 <'>,
     <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>,
     <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>]

    #2 text
    >>> response.xpath('//body//a/text()')
    >>> response.css('body a::text')

    #3 extract and extract_first: pull the content out of selector objects
    >>> response.xpath('//div/a/text()').extract()
    ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
    >>> response.css('div a::text').extract()
    ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
    >>> response.xpath('//div/a/text()').extract_first()
    'Name: My image 1 '
    >>> response.css('div a::text').extract_first()
    'Name: My image 1 '

    #4 attributes: in XPath, attribute names take the @ prefix
    >>> response.xpath('//div/a/@href').extract_first()
    'image1.html'
    >>> response.css('div a::attr(href)').extract_first()
    'image1.html'

    #5 nested lookups
    >>> response.xpath('//div').css('a').xpath('@href').extract_first()
    'image1.html'

    #6 default values
    >>> response.xpath('//div[@id="xxx"]').extract_first(default="not found")
    'not found'

    #7 lookup by attribute
    response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract()
    response.css('#images a[href="image3.html"]::text').extract()

    #8 fuzzy lookup by attribute
    response.xpath('//a[contains(@href,"image")]/@href').extract()
    response.css('a[href*="image"]::attr(href)').extract()

    response.xpath('//a[contains(@href,"image")]/img/@src').extract()
    response.css('a[href*="imag"] img::attr(src)').extract()

    response.xpath('//*[@href="image1.html"]')
    response.css('*[href="image1.html"]')

    #9 regular expressions
    response.xpath('//a/text()').re(r'Name: (.*)')
    response.xpath('//a/text()').re_first(r'Name: (.*)')

    #10 relative XPath
    >>> res = response.xpath('//a[contains(@href,"3")]')[0]
    >>> res.xpath('img')
    [<Selector xpath='img' data='<img src="image3_thumb.jpg">'>]
    >>> res.xpath('./img')
    [<Selector xpath='./img' data='<img src="image3_thumb.jpg">'>]
    >>> res.xpath('.//img')
    [<Selector xpath='.//img' data='<img src="image3_thumb.jpg">'>]
    >>> res.xpath('//img')  # this scans from the document root again
    [<Selector xpath='//img' data='<img src="image1_thumb.jpg">'>,
     <Selector xpath='//img' data='<img src="image2_thumb.jpg">'>,
     <Selector xpath='//img' data='<img src="image3_thumb.jpg">'>,
     <Selector xpath='//img' data='<img src="image4_thumb.jpg">'>,
     <Selector xpath='//img' data='<img src="image5_thumb.jpg">'>]

    #11 XPath with variables
    >>> response.xpath('//div[@id=$xxx]/a/text()', xxx='images').extract_first()
    'Name: My image 1 '
    >>> response.xpath('//div[count(a)=$yyy]/@id', yyy=5).extract_first()  # id of the div that contains five a tags
    'images'

7. Items

https://docs.scrapy.org/en/latest/topics/items.html

8. Item Pipeline

(1) You can write several Pipeline classes

#1 If the process_item of a higher-priority pipeline returns a value (or None), it is passed automatically to the process_item of the next pipeline.
#2 If only the first pipeline should run, have its process_item raise an exception: raise DropItem()
#3 You can test spider.name == '<spider name>' to control which spiders use which pipelines.

(2) Demonstration

    from scrapy.exceptions import DropItem

    class CustomPipeline(object):
        def __init__(self, v):
            self.value = v

        @classmethod
        def from_crawler(cls, crawler):
            """
            Scrapy uses getattr to check whether we defined from_crawler;
            if so, it calls it to create the instance.
            """
            val = crawler.settings.getint('MMMM')
            return cls(val)

        def open_spider(self, spider):
            """Runs once, when the spider starts."""
            print('000000')

        def close_spider(self, spider):
            """Runs once, when the spider closes."""
            print('111111')

        def process_item(self, item, spider):
            # Process and persist the item here.

            # Returning the item hands it to the following pipelines.
            return item

            # Raising DropItem discards the item; later pipelines never see it.
            # raise DropItem()

(3) Example: saving items to MongoDB

    # 1. settings.py
    HOST = "127.0.0.1"
    PORT = 27017
    USER = "root"
    PWD = "123"
    DB = "amazon"
    TABLE = "goods"

    ITEM_PIPELINES = {
        'Amazon.pipelines.CustomPipeline': 200,
    }

    # 2. pipelines.py
    from pymongo import MongoClient

    class CustomPipeline(object):
        def __init__(self, host, port, user, pwd, db, table):
            self.host = host
            self.port = port
            self.user = user
            self.pwd = pwd
            self.db = db
            self.table = table

        @classmethod
        def from_crawler(cls, crawler):
            """
            Scrapy uses getattr to check whether we defined from_crawler;
            if so, it calls it to create the instance.
            """
            HOST = crawler.settings.get('HOST')
            PORT = crawler.settings.get('PORT')
            USER = crawler.settings.get('USER')
            PWD = crawler.settings.get('PWD')
            DB = crawler.settings.get('DB')
            TABLE = crawler.settings.get('TABLE')
            return cls(HOST, PORT, USER, PWD, DB, TABLE)

        def open_spider(self, spider):
            """Runs once, when the spider starts."""
            self.client = MongoClient('mongodb://%s:%s@%s:%s' % (self.user, self.pwd, self.host, self.port))

        def close_spider(self, spider):
            """Runs once, when the spider closes."""
            self.client.close()

        def process_item(self, item, spider):
            # Persist the item. insert_one replaces Collection.save,
            # which recent pymongo versions no longer provide.
            self.client[self.db][self.table].insert_one(dict(item))
            return item

9. Downloader Middleware

What downloader middleware is for:
1. Performing the download yourself inside process_request instead of using Scrapy's downloader.
2. Post-processing requests, for example setting request headers, setting cookies, or adding a proxy. Scrapy ships its own proxy component:

    from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
    from urllib.request import getproxies

The middleware methods:

    class DownMiddleware1(object):
        def process_request(self, request, spider):
            """
            Called for every request that needs downloading, through the
            process_request of each downloader middleware in turn.
            :param request:
            :param spider:
            :return:
                None: continue with the following middlewares and download
                Response object: stop running process_request and start running process_response
                Request object: stop the middleware chain and send the Request back to the scheduler
                raise IgnoreRequest: stop running process_request and start running process_exception
            """
            pass

        def process_response(self, request, response, spider):
            """
            Called with the response on its way back from the downloader.
            :param request:
            :param response:
            :param spider:
            :return:
                Response object: handed to the process_response of the other middlewares
                Request object: stop the middleware chain; the request is rescheduled for download
                raise IgnoreRequest: Request.errback is called
            """
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            """
            Called when a download handler or a process_request() (of a
            downloader middleware) raises an exception.
            :param request:
            :param exception:
            :param spider:
            :return:
                None: leave the exception to the following middlewares
                Response object: stop the remaining process_exception methods
                Request object: stop the middleware chain; the request is rescheduled for download
            """
            return None

Example: a proxy pool

    # 1. Create proxy_handle.py in the same directory as middlewares.py
    import requests

    def get_proxy():
        return requests.get("http://127.0.0.1:5010/get/").text

    def delete_proxy(proxy):
        requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

    # 2. middlewares.py
    import time

    from Amazon.proxy_handle import get_proxy, delete_proxy

    class DownMiddleware1(object):
        def process_request(self, request, spider):
            # Return values have the meaning described above.
            proxy = "http://" + get_proxy()
            request.meta['download_timeout'] = 20
            request.meta["proxy"] = proxy
            print('Adding proxy %s for %s' % (proxy, request.url), end='')
            print('meta is', request.meta)

        def process_response(self, request, response, spider):
            # Return values have the meaning described above.
            print('response status', response.status)
            return response

        def process_exception(self, request, exception, spider):
            # Return values have the meaning described above.
            print('Proxy %s failed for %s: %s' % (request.meta['proxy'], request.url, exception))
            time.sleep(5)
            delete_proxy(request.meta['proxy'].split("//")[-1])
            request.meta['proxy'] = 'http://' + get_proxy()

            return request

10. Spider Middleware

10.1 The spider middleware methods

    from scrapy import signals

    class SpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)  # spider_opened fires when the current spider starts
            return s

        def spider_opened(self, spider):
            # spider.logger.info('I am spider 1, sent by egon: %s' % spider.name)
            print('I am spider 1, sent by egon: %s' % spider.name)

        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn't have a response associated.

            # Must return only requests (not items).
            print('start_requests1')
            for r in start_requests:
                yield r

        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.

            # Should return None or raise an exception.
            # 1. None: continue with the process_spider_input of the other middlewares
            # 2. Raising an exception:
            #    the process_spider_input of the remaining middlewares is skipped
            #    and the errback bound to the request is triggered;
            #    the errback's return value travels back through the
            #    middlewares' process_spider_output in reverse order;
            #    if no errback is found, the middlewares'
            #    process_spider_exception methods run in reverse order
            print("input1")
            return None

        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.

            # Must return an iterable of Request, dict or Item objects.
            print('output1')

            # Yielding several times and returning once amount to the same thing.
            # If you are not comfortable with generators: a function containing
            # yield returns a generator and does not run immediately, which can
            # mislead you about the order in which the middlewares execute.
            # for i in result:
            #     yield i
            return result

        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.

            # Should return either None or an iterable of Request, dict
            # or Item objects.
            print('exception1')

10.2 When the spider starts and the initial requests are produced

    # Step 1: uncomment in settings.py
    SPIDER_MIDDLEWARES = {
        'Baidu.middlewares.SpiderMiddleware1': 200,
        'Baidu.middlewares.SpiderMiddleware2': 300,
        'Baidu.middlewares.SpiderMiddleware3': 400,
    }

    # Step 2: middlewares.py
    from scrapy import signals

    class SpiderMiddleware1(object):
        @classmethod
        def from_crawler(cls, crawler):
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)  # spider_opened fires when the current spider starts
            return s

        def spider_opened(self, spider):
            print('I am spider 1, sent by egon: %s' % spider.name)

        def process_start_requests(self, start_requests, spider):
            # Must return only requests (not items).
            print('start_requests1')
            for r in start_requests:
                yield r

    class SpiderMiddleware2(object):
        @classmethod
        def from_crawler(cls, crawler):
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def spider_opened(self, spider):
            print('I am spider 2, sent by egon: %s' % spider.name)

        def process_start_requests(self, start_requests, spider):
            print('start_requests2')
            for r in start_requests:
                yield r

    class SpiderMiddleware3(object):
        @classmethod
        def from_crawler(cls, crawler):
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def spider_opened(self, spider):
            print('I am spider 3, sent by egon: %s' % spider.name)

        def process_start_requests(self, start_requests, spider):
            print('start_requests3')
            for r in start_requests:
                yield r

    # Step 3: analysing the output
    # 1. As soon as the spider starts, this is printed:
    I am spider 1, sent by egon: baidu
    I am spider 2, sent by egon: baidu
    I am spider 3, sent by egon: baidu

    # 2. Then an initial request is produced and passes through spider middlewares 1, 2, 3 in order:
    start_requests1
    start_requests2
    start_requests3

10.3 When process_spider_input returns None

    # Step 1: SPIDER_MIDDLEWARES as in 10.2

    # Step 2: middlewares.py
    from scrapy import signals

    class SpiderMiddleware1(object):
        def process_spider_input(self, response, spider):
            print("input1")

        def process_spider_output(self, response, result, spider):
            print('output1')
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception1')

    class SpiderMiddleware2(object):
        def process_spider_input(self, response, spider):
            print("input2")
            return None

        def process_spider_output(self, response, result, spider):
            print('output2')
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception2')

    class SpiderMiddleware3(object):
        def process_spider_input(self, response, spider):
            print("input3")
            return None

        def process_spider_output(self, response, result, spider):
            print('output3')
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception3')

    # Step 3: analysing the output
    # 1. On its way in, the response passes through spider middlewares 1, 2, 3:
    input1
    input2
    input3

    # 2. After the spider has handled it, the result passes through spider middlewares 3, 2, 1:
    output3
    output2
    output1

10.4 When process_spider_input raises an exception

    # Step 1: SPIDER_MIDDLEWARES as in 10.2

    # Step 2: middlewares.py
    # SpiderMiddleware1 and SpiderMiddleware3 are unchanged from 10.3;
    # only SpiderMiddleware2 now raises in process_spider_input.
    class SpiderMiddleware2(object):
        def process_spider_input(self, response, spider):
            print("input2")
            raise TypeError('input2 raised an exception')

        def process_spider_output(self, response, result, spider):
            print('output2')
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception2')

    # Output
    input1
    input2
    exception3
    exception2
    exception1

    # Analysis
    # 1. The response passes through the process_spider_input of middleware 1,
    #    which returns None, so it moves on to the process_spider_input of middleware 2.
    # 2. The process_spider_input of middleware 2 raises. The remaining
    #    process_spider_input methods are skipped and the exception is handed to
    #    the errback of that request in the spider.
    # 3. No errback is found, so the response is handled neither by the spider's
    #    normal callback nor by an errback: the spider does nothing at all.
    #    The process_spider_exception methods then run in reverse order.
    # 4. A process_spider_exception that returns None declines responsibility:
    #    it does not handle the exception and passes it to the next
    #    process_spider_exception. If all of them return None, the exception is
    #    finally raised by the Engine.

10.5 Specifying an errback

    # Step 1: spider.py
    import scrapy

    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['www.baidu.com']
        start_urls = ['http://www.baidu.com/']

        def start_requests(self):
            yield scrapy.Request(url='http://www.baidu.com/',
                                 callback=self.parse,
                                 errback=self.parse_err,
                                 )

        def parse(self, response):
            pass

        def parse_err(self, res):
            # res holds the exception information. This function has handled the
            # exception, so it is not raised any further and processing continues
            # with process_spider_output.
            # Extract whatever is useful from the exception and return it as an
            # iterable, which waits in the pipe for process_spider_output to take it.
            return [1, 2, 3, 4, 5]

    # Step 2: SPIDER_MIDDLEWARES as in 10.2

    # Step 3: middlewares.py
    from scrapy import signals

    class SpiderMiddleware1(object):
        def process_spider_input(self, response, spider):
            print("input1")

        def process_spider_output(self, response, result, spider):
            print('output1', list(result))
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception1')

    class SpiderMiddleware2(object):
        def process_spider_input(self, response, spider):
            print("input2")
            raise TypeError('input2 raised an exception')

        def process_spider_output(self, response, result, spider):
            print('output2', list(result))
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception2')

    class SpiderMiddleware3(object):
        def process_spider_input(self, response, spider):
            print("input3")
            return None

        def process_spider_output(self, response, result, spider):
            print('output3', list(result))
            return result

        def process_spider_exception(self, response, exception, spider):
            print('exception3')

    # Step 4: analysing the output
    input1
    input2
    output3 [1, 2, 3, 4, 5]  # the return value of parse_err sits in the pipe and can be taken only once; inside output3 you could build a new request from the exception information
    output2 []
    output1 []

11. Custom extensions

Custom extensions resemble Django signals:
1. Django signals are extension hooks reserved by Django; once a signal fires, the attached function runs.
2. The advantage of a Scrapy extension is that it can add functionality at any point we choose, whereas the functionality offered by the other components runs only at fixed points.

    # 1. Create a file next to settings.py; it can be called extentions.py, with this content
    from scrapy import signals

    class MyExtension(object):
        def __init__(self, value):
            self.value = value

        @classmethod
        def from_crawler(cls, crawler):
            val = crawler.settings.getint('MMMM')
            obj = cls(val)

            crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)

            return obj

        def spider_opened(self, spider):
            print('=============>open')

        def spider_closed(self, spider):
            print('=============>close')

    # 2. Enable it in the settings
    EXTENSIONS = {
        "Amazon.extentions.MyExtension": 200
    }

12. settings.py

    #========== Part 1: basic configuration ==========
    # 1. Project name. The default USER_AGENT is built from it, and it is also used as the logger name
    BOT_NAME = 'Amazon'

    # 2. Where the spiders live
    SPIDER_MODULES = ['Amazon.spiders']
    NEWSPIDER_MODULE = 'Amazon.spiders'

    # 3. User-Agent request header sent by the client
    #USER_AGENT = 'Amazon (+http://www.yourdomain.com)'

    # 4. Whether to obey the robots exclusion protocol
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # 5. Whether cookies are supported (handled through a cookiejar); enabled by default
    #COOKIES_ENABLED = False

    # 6. Telnet console for inspecting and controlling the running spider:
    #    connect with telnet <ip> <port>, then issue commands
    #TELNETCONSOLE_ENABLED = False
    #TELNETCONSOLE_HOST = '127.0.0.1'
    #TELNETCONSOLE_PORT = [6023,]

    # 7. Default headers Scrapy sends with HTTP requests
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}

    #========== Part 2: concurrency and delay ==========
    # 1. Maximum number of concurrent requests handled by the downloader overall; default 16
    #CONCURRENT_REQUESTS = 32

    # 2. Maximum number of concurrent requests per domain; default 8
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16

    # 3. Maximum number of concurrent requests per IP; default 0, meaning no limit. Two things to note:
    #    I.  If non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is ignored: the limit is counted per IP, not per domain
    #    II. It also affects DOWNLOAD_DELAY: if non-zero, the download delay is enforced per IP, not per domain
    #CONCURRENT_REQUESTS_PER_IP = 16

    # 4. Without AutoThrottle this is a fixed value: the number of seconds to wait between requests to the same site
    #DOWNLOAD_DELAY = 3

    #========== Part 3: AutoThrottle extension ==========
    # (1) Introduction
    from scrapy.extensions.throttle import AutoThrottle  # http://scrapy.readthedocs.io/en/latest/topics/autothrottle.html#topics-autothrottle
    # Design goals:
    # 1. Be nicer to sites than the default download delay.
    # 2. Automatically adjust Scrapy to the optimum crawling speed, so the user does not
    #    have to tune the download delay. The user only sets the maximum number of
    #    concurrent requests allowed; the extension does the rest.

    # (2) How it works
    # In Scrapy, the download latency is measured as the time elapsed between establishing
    # the TCP connection and receiving the HTTP headers.
    # Note that these latencies are very hard to measure accurately in a cooperative
    # multitasking environment, because Scrapy may be busy processing a spider callback,
    # for example, and unable to attend to downloads. They should still give a reasonable
    # estimate of how busy Scrapy (and ultimately the server) is, and the extension
    # builds on that premise.

    # (3) Throttling algorithm
    # The algorithm adjusts the download delay according to these rules:
    # 1. Spiders start with a download delay of AUTOTHROTTLE_START_DELAY.
    # 2. When a response arrives, the target download delay for the site is
    #    latency / AUTOTHROTTLE_TARGET_CONCURRENCY.
    # 3. The download delay for the next request is set to the average of the
    #    target download delay and the previous download delay.
    # 4. Latencies of non-200 responses are not allowed to decrease the delay.
    # 5. The download delay cannot become lower than DOWNLOAD_DELAY or higher than AUTOTHROTTLE_MAX_DELAY.

    # (4) Configuration
    # Enable with True; default False
    AUTOTHROTTLE_ENABLED = True
    # Initial delay
    AUTOTHROTTLE_START_DELAY = 5
    # Minimum delay
    DOWNLOAD_DELAY = 3
    # Maximum delay
    AUTOTHROTTLE_MAX_DELAY = 10
    # Average number of requests Scrapy should send in parallel to each remote site. It must not exceed
    # CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP. Raising it increases throughput and
    # hammers the target site harder; lowering it makes the crawler more "polite".
    # At any given moment the actual concurrency may be above or below this value: it is a target the
    # crawler tries to approach, not a hard limit.
    AUTOTHROTTLE_TARGET_CONCURRENCY = 16.0
    # Debugging
    AUTOTHROTTLE_DEBUG = True
    CONCURRENT_REQUESTS_PER_DOMAIN = 16
    CONCURRENT_REQUESTS_PER_IP = 16

    #========== Part 4: crawl depth and crawl order ==========
    # 1. Maximum depth the crawler may reach; the current depth is available through meta; 0 means no limit
    # DEPTH_LIMIT = 3

    # 2. Crawl order: 0 means depth-first, LIFO (the default); 1 means breadth-first, FIFO

    # Last in, first out: depth-first
    # DEPTH_PRIORITY = 0
    # SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
    # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
    # First in, first out: breadth-first
    # DEPTH_PRIORITY = 1
    # SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

    # 3. Scheduler queue
    # SCHEDULER = 'scrapy.core.scheduler.Scheduler'
    # from scrapy.core.scheduler import Scheduler

    # 4. URL deduplication
    # DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

    #========== Part 5: middlewares, pipelines, extensions ==========
    # 1. Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'Amazon.middlewares.AmazonSpiderMiddleware': 543,
    #}

    # 2. Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
       # 'Amazon.middlewares.DownMiddleware1': 543,
    }

    # 3. Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # 4. Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       # 'Amazon.pipelines.CustomPipeline': 200,
    }

    #========== Part 6: caching ==========
    # 1. Enabling the cache stores requests that have been sent, and their responses, for later reuse
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage

    # Whether to enable the cache
    # HTTPCACHE_ENABLED = True

    # Cache policy: cache every request; later requests are served straight from the cache
    # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
    # Cache policy: cache according to HTTP response headers such as Cache-Control and Last-Modified
    # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

    # Cache expiration time
    # HTTPCACHE_EXPIRATION_SECS = 0

    # Cache directory
    # HTTPCACHE_DIR = 'httpcache'

    # HTTP status codes that are not cached
    # HTTPCACHE_IGNORE_HTTP_CODES = []

    # Cache storage backend
    # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    #========== Part 7: thread pool ==========
    REACTOR_THREADPOOL_MAXSIZE = 10

    # Default: 10
    # Scrapy is built on Twisted's asynchronous I/O framework; the number of threads available to the
    # downloader is the size of the Twisted thread pool
    # (The maximum limit for Twisted Reactor thread pool size.)

    # About the Twisted thread pool:
    # http://twistedmatrix.com/documents/10.1.0/core/howto/threading.html

    # Thread pool implementation: twisted.python.threadpool.ThreadPool
    # Resizing the Twisted thread pool:
    from twisted.internet import reactor
    reactor.suggestThreadPoolSize(30)

    # Relevant Scrapy source:
    # D:\python3.6\Lib\site-packages\scrapy\crawler.py

    # Supplement:
    # Tool for viewing the number of threads in a process on Windows:
    #     https://docs.microsoft.com/zh-cn/sysinternals/downloads/pslist
    #     or
    #     https://pan.baidu.com/s/1jJ0pMaM
    #
    #     Command:
    #     pslist |findstr python
    #
    # On Linux: top -p <process id>

    #========== Part 8: other default settings, for reference ==========
    # D:\python3.6\Lib\site-packages\scrapy\settings\default_settings.py

13. Custom commands

Create a directory with any name at the same level as spiders, for example commands. Inside it, create crawlall.py (the file name becomes the name of the custom command):

    from scrapy.commands import ScrapyCommand
    from scrapy.utils.project import get_project_settings

    class Command(ScrapyCommand):

        requires_project = True

        def syntax(self):
            return '[options]'

        def short_desc(self):
            return 'Runs all of the spiders'

        def run(self, args, opts):
            # Recent Scrapy versions expose the spider list through spider_loader
            spider_list = self.crawler_process.spider_loader.list()
            for name in spider_list:
                self.crawler_process.crawl(name, **opts.__dict__)
            self.crawler_process.start()

Add the setting COMMANDS_MODULE = '<project name>.<directory name>' to settings.py, then run the command from the project directory: scrapy crawlall
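
The deduplication idea from the Spiders section (store a hash of each URL and report whether it was seen before) can be tried outside Scrapy. The following is only a minimal sketch: `UrlHashFilter` and `FakeRequest` are hypothetical names introduced here, and `FakeRequest` merely stands in for a real request object by exposing a `url` attribute.

```python
import hashlib


class UrlHashFilter:
    """Remembers the MD5 digest of every URL it has been asked about."""

    def __init__(self):
        self.visited = set()

    def request_seen(self, request):
        # Hash the URL so that long URLs cost a fixed 32 characters each.
        digest = hashlib.md5(request.url.encode("utf-8")).hexdigest()
        if digest in self.visited:
            return True
        self.visited.add(digest)
        return False


class FakeRequest:
    """Hypothetical stand-in for a request; only .url is used."""

    def __init__(self, url):
        self.url = url


dupefilter = UrlHashFilter()
urls = [
    "http://www.example.com/1.html",
    "http://www.example.com/2.html",
    "http://www.example.com/1.html",  # repeated on purpose
]
results = [dupefilter.request_seen(FakeRequest(u)) for u in urls]
print(results)  # [False, False, True]
```

Returning an explicit `False` for unseen URLs keeps the method's contract obvious to the caller, which only asks "has this request been seen?".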

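
The AutoThrottle rules listed in the settings section can be worked through numerically. The function below is a sketch of those listed rules only, not Scrapy's actual implementation; the name `next_download_delay` and its parameters are assumptions made for illustration, with defaults taken from the configuration values shown in the article (minimum 3, maximum 10, target concurrency 16.0).

```python
def next_download_delay(prev_delay, latency, status,
                        target_concurrency=16.0,
                        min_delay=3.0, max_delay=10.0):
    """Apply the listed throttling rules to one response."""
    # Rule 2: the target delay is the latency divided by the target concurrency.
    target_delay = latency / target_concurrency
    # Rule 3: the next delay is the average of the previous and the target delay.
    new_delay = (prev_delay + target_delay) / 2.0
    # Rule 5: keep the delay between the minimum and the maximum.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Rule 4: a non-200 response may not decrease the delay.
    if status != 200 and new_delay < prev_delay:
        return prev_delay
    return new_delay


# Rule 1: start from the initial delay (5 seconds in the article's settings).
fast_ok = next_download_delay(5.0, latency=8.0, status=200)
fast_error = next_download_delay(5.0, latency=8.0, status=500)
slow_ok = next_download_delay(5.0, latency=400.0, status=200)
print(fast_ok, fast_error, slow_ok)  # 3.0 5.0 10.0
```

In the first call the average (2.75) falls below the minimum and is raised to 3.0; in the second the same computation is discarded because the response was not a 200; in the third the average (15.0) is capped at the maximum.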

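
In the errback example of the Spider Middleware section, only the first process_spider_output prints the values and the later ones print empty lists. A plain-Python sketch of why that happens, assuming the results travel through the chain as a one-shot iterator (the function names here are hypothetical and nothing below uses Scrapy itself):

```python
def print_and_pass(name, result, log):
    """Mimics the article's hook: list(result) drains a one-shot iterator."""
    log.append((name, list(result)))
    return result  # already exhausted when result was an iterator


def materialize_and_pass(name, result, log):
    """Variant that keeps the items by returning the list it built."""
    items = list(result)
    log.append((name, items))
    return items


def run_chain(hook):
    log = []
    # Assumption: the results arrive as an iterator that can be read once.
    result = iter([1, 2, 3, 4, 5])
    for name in ("output3", "output2", "output1"):
        result = hook(name, result, log)
    return log


drained = run_chain(print_and_pass)
kept = run_chain(materialize_and_pass)
print(drained)  # [('output3', [1, 2, 3, 4, 5]), ('output2', []), ('output1', [])]
print(kept)     # every hook sees [1, 2, 3, 4, 5]
```

If a hook needs to inspect the results and still pass them on, returning the materialized list (or re-yielding each element) avoids handing an exhausted iterator to the next middleware.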