Python Web Scraping Cookbook 1: Getting Started with Scraping

Chapter 1: Getting Started with Scraping

Scraping python.org with Requests and Beautiful Soup
Scraping python.org with urllib3 and Beautiful Soup
Scraping python.org with Scrapy
Scraping python.org with Selenium and PhantomJS

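First make sure you can open https://www.python.org/events/python-events/ in a browser.
Install requests and bs4 (plus lxml, the parser used below), then start with example 1: scraping python.org with Requests and Beautiful Soup.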

# pip3 install requests bs4 lxml

Scraping python.org with Requests and Beautiful Soup

Goal: scrape the name, location, and time of each event listed at https://www.python.org/events/python-events/.

01_events_with_requests.py

import requests
from bs4 import BeautifulSoup


def get_upcoming_events(url):
    # Download the page and parse it with the lxml parser
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')

    # Each <li> inside the "list-recent-events" list is one event
    events = soup.find('ul', {'class': 'list-recent-events'}).find_all('li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        # attrs must be a dict (the original passed a set by mistake)
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)


get_upcoming_events('https://www.python.org/events/python-events/')

Output:

$ python3 01_events_with_requests.py
{'name': 'PyCon US 2018', 'location': 'Cleveland, Ohio, USA', 'time': '09 May – 18 May  2018'}
{'name': 'DjangoCon Europe 2018', 'location': 'Heidelberg, Germany', 'time': '23 May – 28 May  2018'}
{'name': 'PyCon APAC 2018', 'location': 'NUS School of Computing / COM1, 13 Computing Drive, Singapore 117417, Singapore', 'time': '31 May – 03 June  2018'}
{'name': 'PyCon CZ 2018', 'location': 'Prague, Czech Republic', 'time': '01 June – 04 June  2018'}
{'name': 'PyConTW 2018', 'location': 'Taipei, Taiwan', 'time': '01 June – 03 June  2018'}
{'name': 'PyLondinium', 'location': 'London, UK', 'time': '08 June – 11 June  2018'}

Note: the events listed on the page change over time, so your output will differ from the one shown here.
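Because the page content changes, the recipe above can also break: if an event has no `h3`, location span, or `time` tag, `find(...)` returns None and `.text` raises AttributeError. The following is a minimal sketch of a more defensive version, not the author's code. The HTML snippet is made up to imitate the structure of the events list, and the names `SAMPLE_HTML`, `text_or_none`, and `parse_events` are hypothetical. It uses the built-in 'html.parser' so it runs without lxml and without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that imitates the structure of the python.org events list.
SAMPLE_HTML = """
<ul class="list-recent-events menu">
  <li>
    <h3 class="event-title"><a href="/events/1/">Example Conf</a></h3>
    <p><time>01 Jan - 02 Jan 2030</time>
       <span class="event-location">Somewhere, Earth</span></p>
  </li>
  <li>
    <h3 class="event-title"><a href="/events/2/">No Location Meetup</a></h3>
    <p><time>03 Jan 2030</time></p>
  </li>
</ul>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_events(html):
    """Extract name, location and time from every event in the list."""
    soup = BeautifulSoup(html, 'html.parser')
    event_list = soup.find('ul', class_='list-recent-events')
    if event_list is None:  # layout changed, or this is not the events page
        return []
    events = []
    for li in event_list.find_all('li'):
        events.append({
            'name': text_or_none(li.select_one('h3 a')),
            'location': text_or_none(li.find('span', class_='event-location')),
            'time': text_or_none(li.find('time')),
        })
    return events


events = parse_events(SAMPLE_HTML)
for event in events:
    print(event)
```

To use it against the live site, you could pass `requests.get(url, timeout=10).text` to `parse_events`; the 10-second timeout is an arbitrary choice.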

Exercise: use requests to scrape the blog titles on the home page of https://china-testing.github.io/ (10 in total).

Reference answer:

01_blog_title.py

import requests
from bs4 import BeautifulSoup


def get_blog_titles(url):
    # Download the home page and parse it with the lxml parser
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')

    # Each blog post on the home page is an <article> element
    articles = soup.find_all('article')

    for article in articles:
        article_details = {}
        article_details['name'] = article.find('h1').find("a").text
        print(article_details)


get_blog_titles('https://china-testing.github.io/')

Output:

$ python3 01_blog_title.py
{'name': '10分钟学会API测试'}
{'name': 'python数据分析快速入门教程4-数据汇聚'}
{'name': 'python数据分析快速入门教程6-重整'}
{'name': 'python数据分析快速入门教程5-处理缺失数据'}
{'name': 'python库介绍-pytesseract: OCR光学字符识别'}
{'name': '软件自动化测试初学者忠告'}
{'name': '使用opencv转换3d图片'}
{'name': 'python opencv3实例(对象识别和增强现实)2-边缘检测和应用图像过滤器'}
{'name': 'numpy学习指南3rd3:常用函数'}
{'name': 'numpy学习指南3rd2:NumPy基础'}

Scraping python.org with urllib3 and Beautiful Soup

Code: 02_events_with_urlib3.py

import urllib3
from bs4 import BeautifulSoup


def get_upcoming_events(url):
    # A PoolManager handles connection pooling and issues the request
    req = urllib3.PoolManager()
    res = req.request('GET', url)

    # res.data holds the raw response bytes
    soup = BeautifulSoup(res.data, 'html.parser')

    events = soup.find('ul', {'class': 'list-recent-events'}).find_all('li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        # attrs must be a dict (the original passed a set by mistake)
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)


get_upcoming_events('https://www.python.org/events/python-events/')

Requests is built on top of urllib3, so in practice you usually use Requests directly.
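One way to see this relationship is through the transport adapter that Requests uses for each URL scheme. The sketch below is illustrative only: it mounts an `HTTPAdapter` whose retry policy is a urllib3 `Retry` object and then inspects the adapter's pool manager. The retry numbers and status codes are arbitrary assumptions, and no network request is made.

```python
import requests
import urllib3
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry policy defined with urllib3; the values here are illustrative assumptions.
retry_policy = Retry(total=3, backoff_factor=0.5,
                     status_forcelist=[500, 502, 503, 504])

# Mount an adapter so every https:// request made through this session uses the policy.
session = requests.Session()
adapter = HTTPAdapter(max_retries=retry_policy)
session.mount('https://', adapter)

# The adapter delegates the actual HTTP work to a urllib3 PoolManager.
print(type(adapter.poolmanager))
print(adapter.max_retries.total)
```

A later call such as `session.get(url, timeout=10)` would then go through this adapter, with urllib3 handling connection pooling and retries underneath.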

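Scraping python.org with Scrapy

Scrapy is a very popular open-source Python framework for extracting data. It covers the downloading and parsing done by hand in the previous recipes and adds many built-in modules and extensions. It is also our tool of choice when it comes to scraping with Python.
Scrapy offers several powerful features worth mentioning: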

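Built-in extensions for making HTTP requests and handling compression, authentication, caching, user agents, and HTTP headers
Built-in support for selecting and extracting data with selector languages such as CSS and XPath, plus support for selecting content and links with regular expressions
Encoding support for dealing with languages and non-standard encoding declarations
A flexible API for reusing and writing custom middleware and pipelines, giving a clean and simple way to automate tasks such as downloading assets (for example images or media) and storing data in storage such as the file system, S3, or databases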

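There are several ways to use Scrapy. One is programmatic mode, where we create the crawler and spider in code. Another is to generate a Scrapy project from a template or generator and then run it from the command line. This book follows the programmatic mode because it keeps the code in a single file.

Code: 03_events_with_scrapy.py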

import scrapy
from scrapy.crawler import CrawlerProcess


class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://www.python.org/events/python-events/',]
    # Class-level list that collects the scraped events
    found_events = []

    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)


if __name__ == "__main__":
    # Log only errors so the scraped events are easy to read
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()  # blocks until the crawl has finished
    for event in spider.found_events:
        print(event)

Exercise: use Scrapy to scrape the blog titles on the home page of https://china-testing.github.io/ (10 in total).

Reference answer:

03_blog_with_scrapy.py

import scrapy  # needed for scrapy.Spider (missing in the original listing)
from scrapy.crawler import CrawlerProcess


class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'
    start_urls = ['https://china-testing.github.io/',]
    # Class-level list that collects the scraped titles
    found_events = []

    def parse(self, response):
        # Every blog post title is an <h1> inside an <article>
        for event in response.xpath('//article//h1'):
            event_details = dict()
            event_details['name'] = event.xpath('a/text()').extract_first()
            self.found_events.append(event_details)


if __name__ == "__main__":
    process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()  # blocks until the crawl has finished
    for event in spider.found_events:
        print(event)

Scraping python.org with Selenium and PhantomJS

04_events_with_selenium.py

from selenium import webdriver
from selenium.webdriver.common.by import By


def get_upcoming_events(url):
    # Start a Chrome browser window and load the page
    driver = webdriver.Chrome()
    driver.get(url)

    # find_elements(By.XPATH, ...) replaces find_elements_by_xpath,
    # which was removed in newer Selenium 4 releases
    events = driver.find_elements(By.XPATH, '//ul[contains(@class, "list-recent-events")]/li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element(By.XPATH, 'h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element(By.XPATH, 'p/span[@class="event-location"]').text
        event_details['time'] = event.find_element(By.XPATH, 'p/time').text
        print(event_details)

    driver.close()


get_upcoming_events('https://www.python.org/events/python-events/')

Switching to driver = webdriver.PhantomJS('phantomjs') runs the browser without a window. The code is as follows:

05_events_with_phantomjs.py

from selenium import webdriver
from selenium.webdriver.common.by import By


def get_upcoming_events(url):
    # PhantomJS support exists only in Selenium 3.x; it was removed in Selenium 4.
    # 'phantomjs' is the path to the PhantomJS executable.
    driver = webdriver.PhantomJS('phantomjs')
    driver.get(url)

    events = driver.find_elements(By.XPATH, '//ul[contains(@class, "list-recent-events")]/li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element(By.XPATH, 'h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element(By.XPATH, 'p/span[@class="event-location"]').text
        event_details['time'] = event.find_element(By.XPATH, 'p/time').text
        print(event_details)

    driver.close()


get_upcoming_events('https://www.python.org/events/python-events/')

However, Chrome's headless mode, driven through Selenium, is now a better replacement for PhantomJS.

04_events_with_selenium_headless.py

from selenium import webdriver
from selenium.webdriver.common.by import By


def get_upcoming_events(url):
    # Run Chrome without opening a window
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    # options= replaces the deprecated chrome_options= keyword
    driver = webdriver.Chrome(options=options)
    driver.get(url)

    events = driver.find_elements(By.XPATH, '//ul[contains(@class, "list-recent-events")]/li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element(By.XPATH, 'h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element(By.XPATH, 'p/span[@class="event-location"]').text
        event_details['time'] = event.find_element(By.XPATH, 'p/time').text
        print(event_details)

    driver.close()


get_upcoming_events('https://www.python.org/events/python-events/')

Article link: https://www.sbkko.com/ganhuo-22.html
Copyright: content published by SBKKO is partly original; when reposting, please credit the source. If a reposted article infringes your rights, please contact us.
