I recently had an idea on a whim that needed data to back it up, so I picked these tools up as I went. This post records the key points so I don't forget them later; a detailed hands-on write-up will follow.

Choosing a Crawler Framework

Candidates: colly (a Go framework), plus scrapy and pyppeteer on the Python side.

Final choice:

scrapy + pyppeteer

Selection criteria:

  • Quick to get started: colly | scrapy
  • Simple and practical: colly
  • Async/concurrent support: pyppeteer
  • Captures dynamically rendered data: pyppeteer
  • Widely adopted and mature, with plenty of solutions for problems down the road: scrapy + pyppeteer (see the sketch after this list)
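
To make the combination concrete, here is a minimal sketch of how the two can be wired together: a Scrapy downloader middleware that renders each page in headless Chromium before the spider parses it. The class name PyppeteerMiddleware and the blocking run_until_complete bridge are my own illustrative assumptions, not a published API; a real setup would integrate with Twisted's asyncio reactor rather than block it.

import asyncio

from pyppeteer import launch
from scrapy.http import HtmlResponse

# Hypothetical middleware; enable it via DOWNLOADER_MIDDLEWARES in settings.py.
class PyppeteerMiddleware:
    def process_request(self, request, spider):
        # Render the page with pyppeteer; blocking here is a simplification.
        html = asyncio.get_event_loop().run_until_complete(self._render(request.url))
        # Returning a Response makes Scrapy skip its own HTTP download.
        return HtmlResponse(request.url, body=html, encoding='utf-8', request=request)

    async def _render(self, url):
        browser = await launch()
        page = await browser.newPage()
        await page.goto(url)
        html = await page.content()  # DOM after JavaScript has run
        await browser.close()
        return html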

DEMO

colly

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links on the page
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href") // the matched element is e
		fmt.Printf("Link Found: %s\n", link)
		e.Request.Visit(link)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}
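
Assuming the snippet lives in main.go inside a Go module, fetch the dependency and run it with:

go get github.com/gocolly/colly
go run main.go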

scrapy

  • Debug mode: scrapy shell 'http://quotes.toscrape.com/page/1/'
  • Run the spider: scrapy crawl demo1
import scrapy

#  scrapy crawl demo1
class QuotesSpider(scrapy.Spider):
    name = "demo1"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    # First version: dump each raw page to an HTML file
    # def parse(self, response):
    #     page = response.url.split("/")[-2]
    #     filename = 'demo-%s.html' % page
    #     with open(filename, 'wb') as f:
    #         f.write(response.body)
    #     self.log('Saved to file %s' % filename)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
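
The yielded items can be exported straight to a feed file with Scrapy's built-in -o flag, for example:

scrapy crawl demo1 -o quotes.json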

pyppeteer

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()  # launches headless Chromium (downloaded on first run)
    page = await browser.newPage()
    await page.goto('https://www.lflxp.cn')
    await page.screenshot({'path': 'example.png'})  # screenshot of the rendered page
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
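
Screenshots aside, pyppeteer can also hand the rendered DOM back for extraction, which is what matters for crawling. A minimal sketch, assuming the JavaScript-rendered variant of the quotes site and the same span.text selector used in the scrapy demo above:

import asyncio
from pyppeteer import launch

async def scrape():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://quotes.toscrape.com/js/')  # JS-rendered variant of the demo site
    # Run a JS function over every matched node and collect the results
    quotes = await page.querySelectorAllEval(
        'span.text', '(nodes) => nodes.map(n => n.innerText)')
    await browser.close()
    return quotes

print(asyncio.get_event_loop().run_until_complete(scrape()))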