Scrapy Basics Tutorial

Creating a project

  • scrapy startproject scrapyStudy

Running a spider

  • scrapy crawl demo1

Scrapy commands

  • scrapy startproject <project_name> [project_dir]
  • scrapy check -l
  • scrapy list
  • scrapy genspider example example.com
  • scrapy fetch --headers https://www.lflxp.cn/
  • scrapy shell 'http://quotes.toscrape.com/page/1/'
  • scrapy crawl quotes -o quotes.json
  • scrapy crawl quotes -o quotes-humor.json -a tag=humor
  • scrapy crawl hao -o hao.json -s FEED_EXPORT_ENCODING=utf-8 # fixes encoding issues in the output
  • scrapy crawl jianshu -o jianshu.json -s FEED_EXPORT_ENCODING=utf-8
  • scrapy parse http://www.example.com/ -c parse_item
  • scrapy settings --get BOT_NAME
  • scrapy runspider myspider.py
  • scrapy version -v
  • scrapy bench
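Why the `FEED_EXPORT_ENCODING=utf-8` setting matters: by default, JSON output escapes every non-ASCII character as `\uXXXX`, which makes Chinese text unreadable in the exported file. A minimal stdlib sketch of the underlying behavior:

```python
import json

quote = {"text": "人生苦短"}

# Default: non-ASCII characters are \uXXXX-escaped (what you get
# without FEED_EXPORT_ENCODING=utf-8).
escaped = json.dumps(quote)

# ensure_ascii=False writes the characters as readable UTF-8, which is
# the effect of setting FEED_EXPORT_ENCODING=utf-8 on the feed export.
readable = json.dumps(quote, ensure_ascii=False)

print(escaped)
print(readable)
```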

Debugging and testing selectors

https://docs.scrapy.org/en/latest/topics/selectors.html#topics-selectors
https://docs.scrapy.org/en/latest/intro/tutorial.html

  • response.xpath('//a[contains(@href, "github")]/@href').getall() # search and filter by substring

  • response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)') # filter the result again with a regex

  • response.xpath('//a/@href').getall() # select by attribute

  • response.css('a').xpath('@href').getall()

  • response.css('title::text').get()

  • response.css('nav').attrib

  • response.css('a').attrib['href'] # .attrib is a dict, not a callable

  • response.css('a::attr(href)').get()

  • response.css("div.quote")

  • quote = response.css("div.quote")[0]

  • quote.css("span.text::text").get()

  • quote.css("div.tags a.tag::text").getall()

  • Quickly locating elements by tag + class:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

>>> response.css('li.next a::attr(href)').get()
'/page/2/'

>>> response.css('li.next a').attrib['href']
'/page/2/'

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
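The `response.urljoin(next_page)` call above resolves a relative href such as `/page/2/` against the current page URL before building the next `Request`. The standard library's `urllib.parse.urljoin` behaves the same way, which makes the resolution easy to check in isolation:

```python
from urllib.parse import urljoin

# Resolve the relative "next page" link against the page it was found on,
# just as response.urljoin does inside the spider.
base = 'http://quotes.toscrape.com/page/1/'
print(urljoin(base, '/page/2/'))  # http://quotes.toscrape.com/page/2/
```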