Python Data Collection with BeautifulSoup

xiaoxiao 2021-02-27

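I have been scraping website data a lot recently and use BeautifulSoup constantly, but I am not yet fluent with it, so I am summarizing its main usage here for future reference. The test data in this article comes from the official BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html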

1. Basic BeautifulSoup usage

1.1 Introduction to BeautifulSoup

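BeautifulSoup is a third-party Python library for extracting data from HTML or XML pages. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.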

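Constructing a BeautifulSoup object takes two arguments. The first is the HTML text string to parse; the second tells BeautifulSoup which parser to use (for example Python's built-in html.parser, or the third-party parsers lxml and html5lib). A BeautifulSoup object is constructed like this:

soup = BeautifulSoup(html_doc, 'lxml')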
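As a minimal sketch of the constructor described above: the short `html` string is a made-up example, and `html.parser` is used here instead of `lxml` only so that the snippet runs without installing anything beyond `bs4`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# First argument: the HTML text. Second argument: the parser name.
# html.parser ships with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.title.string)    # Demo

# The third-party parsers are selected the same way, but must be installed first
# (pip install lxml / pip install html5lib):
# soup = BeautifulSoup(html, "lxml")
# soup = BeautifulSoup(html, "html5lib")
```

Different parsers can build slightly different trees from malformed HTML, so it is usually worth naming the parser explicitly rather than relying on the default.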

1.2 Pretty-printing an HTML document

The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Thu May 4 13:56:00 2017

@author: zch
"""
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Parse the document with the lxml parser (requires the lxml package)
soup = BeautifulSoup(html_doc, 'lxml')

# prettify() returns the parse tree as an indented string
print(soup.prettify())

The formatted output is as follows:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

1.3 Ways to navigate the structured data

(1) Getting the attributes of the HTML document's title
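A small sketch of what inspecting the title might look like. The markup is a trimmed-down version of the test document, and `html.parser` is an assumption made so the snippet needs no extra packages.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the test document (illustrative only).
html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title                    # the first <title> tag in the tree
title_name = title_tag.name               # the tag name
title_text = title_tag.string             # the text inside the tag
parent_name = title_tag.parent.name       # the name of the enclosing tag

print(title_tag)    # <title>The Dormouse's story</title>
print(title_name)   # title
print(title_text)   # The Dormouse's story
print(parent_name)  # head
```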

(2) Getting the attributes of HTML hyperlinks (a)
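The following is one possible way to read link attributes, using a shortened copy of the test markup and `html.parser` (both choices are assumptions for the sake of a self-contained example). Note that `class` is a multi-valued attribute, so BeautifulSoup returns it as a list.

```python
from bs4 import BeautifulSoup

# Shortened copy of the test markup (illustrative only).
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a                   # the first <a> tag in the document
href = first_link["href"]             # dictionary-style attribute access
classes = first_link["class"]         # multi-valued attribute, returned as a list
link_id = first_link.get("id")        # .get() returns None if the attribute is absent
all_hrefs = [a["href"] for a in soup.find_all("a")]

print(href)       # http://example.com/elsie
print(classes)    # ['sister']
print(link_id)    # link1
print(all_hrefs)  # ['http://example.com/elsie', 'http://example.com/lacie']
```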

(3) Getting the attributes of HTML paragraphs (p)
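A brief sketch of paragraph access, again on a reduced copy of the test markup with `html.parser` assumed. `soup.p` only returns the first paragraph; `find_all` is needed to reach the others.

```python
from bs4 import BeautifulSoup

# Reduced copy of the test markup (illustrative only).
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p                      # the first <p> tag only
p_class = first_p["class"]            # class is returned as a list
bold_text = first_p.b.string          # text of the nested <b> tag
p_text = first_p.get_text()           # all text inside the paragraph
all_classes = [p["class"] for p in soup.find_all("p")]

print(p_class)      # ['title']
print(bold_text)    # The Dormouse's story
print(p_text)       # The Dormouse's story
print(all_classes)  # [['title'], ['story']]
```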

(4) Finding matches in the HTML with the find method
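A few common `find` patterns, sketched on a shortened copy of the test markup with `html.parser` assumed. `find` returns the first match only, and returns `None` when nothing matches, so the result is worth checking before it is indexed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the test markup (illustrative only).
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_id = soup.find(id="link3")                     # match on an attribute value
first_sister = soup.find("a", class_="sister")    # class_ avoids the Python keyword
by_attrs = soup.find("a", attrs={"id": "link2"})  # attributes passed as a dict
missing = soup.find("table")                      # no <table> in this document

print(by_id.get_text())    # Tillie
print(first_sister["id"])  # link1
print(by_attrs["href"])    # http://example.com/lacie
print(missing)             # None
```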

2. BeautifulSoup example tests

The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Thu May 4 15:11:23 2017

@author: zch
"""
import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Parse the document with the lxml parser (requires the lxml package)
soup = BeautifulSoup(html_doc, 'lxml')

print('Test 1: get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Test 2: get a link by regular-expression match')
# find() returns the first <a> whose href contains "cie" (the lacie link)
link_node = soup.find('a', href=re.compile(r"cie"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Test 3: get the story text')
# class_ is used because class is a reserved word in Python
p_text = soup.find('p', class_='story')
print(p_text.name, p_text.get_text())
# print(soup.p.get_text())

Test results (figure: screenshot of the console output, not available):

