Web Crawlers: From Beginner to Practice (Part 2)
Scrapy Basics
Creating the project
- scrapy startproject scrapyStudy
Running a spider
- scrapy crawl demo1
Scrapy commands
- scrapy startproject <project_name> [project_dir]
- scrapy check -l
- scrapy list
- scrapy genspider example example.com
- scrapy fetch --headers https://www.lflxp.cn/
- scrapy shell 'http://quotes.toscrape.com/page/1/'
- scrapy crawl quotes -o quotes.json
- scrapy crawl quotes -o quotes-humor.json -a tag=humor
- scrapy crawl hao -o hao.json -s FEED_EXPORT_ENCODING=utf-8 # write non-ASCII text as UTF-8 instead of \uXXXX escapes
- scrapy crawl jianshu -o jianshu.json -s FEED_EXPORT_ENCODING=utf-8
- scrapy parse http://www.example.com/ -c parse_item
- scrapy settings --get BOT_NAME
- scrapy runspider myspider.py
- scrapy version -v
- scrapy bench
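The `-s FEED_EXPORT_ENCODING=utf-8` option in the commands above matters because, when that setting is left unset, Scrapy's JSON feed output writes non-ASCII characters as `\uXXXX` escapes. The stdlib-only sketch below reproduces the same effect with Python's `json` module. The sample item is made up, and the code illustrates the escaping behaviour only; it is not Scrapy's exporter.

```python
import json

# Made-up item with non-ASCII text, standing in for a scraped record.
item = {"title": "简书"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written as-is,
# which is what a UTF-8 feed export produces.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode to the same data; only the on-disk text differs.
same_data = json.loads(escaped) == json.loads(readable) == item
print(same_data)
```

Either file parses back to identical data, so the setting mainly affects how readable the exported JSON is to a person.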
Debugging and testing
Reference documentation:
https://docs.scrapy.org/en/latest/topics/selectors.html#topics-selectors
https://docs.scrapy.org/en/latest/intro/tutorial.html
- response.xpath('//a[contains(@href, "github")]/@href').getall() # select, then filter by substring
- response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)') # apply a regex to the selected text
- response.xpath('//a/@href').getall() # select by attribute
- response.css('a').xpath('@href').getall()
- response.css('title::text').get()
- response.css('nav').attrib
- response.css('a').attrib['href']
- response.css('a::attr(href)').get()
- response.css("div.quote")
- quote = response.css("div.quote")[0]
- quote.css("span.text::text").get()
- quote.css("div.tags a.tag::text").getall()
Quickly locating elements by tag + class
We can try extracting it in the shell:
Author: lixueping
Last updated: 2020-03-13
License: MIT