Scrapy Basics Tutorial

Creating a project

  • scrapy startproject scrapyStudy

Running a spider

  • scrapy crawl demo1

Scrapy commands

  • scrapy startproject <project_name> [project_dir]
  • scrapy check -l
  • scrapy list
  • scrapy genspider example example.com
  • scrapy fetch --headers https://www.lflxp.cn/
  • scrapy shell 'http://quotes.toscrape.com/page/1/'
  • scrapy crawl quotes -o quotes.json
  • scrapy crawl quotes -o quotes-humor.json -a tag=humor
  • scrapy crawl hao -o hao.json -s FEED_EXPORT_ENCODING=utf-8 # fixes encoding issues in the output
  • scrapy crawl jianshu -o jianshu.json -s FEED_EXPORT_ENCODING=utf-8
  • scrapy parse http://www.example.com/ -c parse_item
  • scrapy settings --get BOT_NAME
  • scrapy runspider myspider.py
  • scrapy version -v
  • scrapy bench
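Why the `FEED_EXPORT_ENCODING=utf-8` setting matters: by default, JSON output escapes every non-ASCII character as `\uXXXX`, which makes Chinese text unreadable in the exported file. A minimal stdlib sketch of the underlying behavior:

```python
import json

quote = {"text": "人生苦短"}

# Default: non-ASCII characters are \uXXXX-escaped (what you get
# without FEED_EXPORT_ENCODING=utf-8).
escaped = json.dumps(quote)

# ensure_ascii=False writes the characters as readable UTF-8, which is
# the effect of setting FEED_EXPORT_ENCODING=utf-8 on the feed export.
readable = json.dumps(quote, ensure_ascii=False)

print(escaped)
print(readable)
```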

Debugging and testing selectors

https://docs.scrapy.org/en/latest/topics/selectors.html#topics-selectors
https://docs.scrapy.org/en/latest/intro/tutorial.html

  • response.xpath('//a[contains(@href, "github")]/@href').getall() # search and filter by substring

  • response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)') # filter the result again with a regex

  • response.xpath('//a/@href').getall() # select by attribute

  • response.css('a').xpath('@href').getall()

  • response.css('title::text').get()

  • response.css('nav').attrib

  • response.css('a').attrib['href'] # .attrib is a dict, not a callable

  • response.css('a::attr(href)').get()

  • response.css("div.quote")

  • quote = response.css("div.quote")[0]

  • quote.css("span.text::text").get()

  • quote.css("div.tags a.tag::text").getall()

  • Quickly locating elements by tag + class:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

>>> response.css('li.next a::attr(href)').get()
'/page/2/'

>>> response.css('li.next a').attrib['href']
'/page/2/'

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
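The `response.urljoin(next_page)` call above resolves a relative href such as `/page/2/` against the current page URL before building the next `Request`. The standard library's `urllib.parse.urljoin` behaves the same way, which makes the resolution easy to check in isolation:

```python
from urllib.parse import urljoin

# Resolve the relative "next page" link against the page it was found on,
# just as response.urljoin does inside the spider.
base = 'http://quotes.toscrape.com/page/1/'
print(urljoin(base, '/page/2/'))  # http://quotes.toscrape.com/page/2/
```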