Python Data Collection with BeautifulSoup

xiaoxiao, 2021-02-27

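I have been scraping website data a lot lately and reach for BeautifulSoup all the time, but I am still not fluent with it. This post collects the main ways of using BeautifulSoup so that I can refer back to them later. The test data comes from the official BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html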

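1. Basic usage of BeautifulSoup

1.1 Introducing BeautifulSoup

BeautifulSoup is a third-party Python library for extracting data from HTML or XML documents. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.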

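A BeautifulSoup object is normally built from two arguments. The first is the HTML text to parse, as a string. The second tells BeautifulSoup which parser to use, for example Python's built-in html.parser or the third-party parsers lxml and html5lib. The object is constructed like this: soup = BeautifulSoup(html_doc, 'lxml')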
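The constructor call described above can be sketched as follows. This is a minimal illustration, not the post's own code: the one-line `html_doc` is a made-up document, and `html.parser` is used only because it ships with Python, whereas `lxml` and `html5lib` have to be installed separately.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical document, used only for this illustration.
html_doc = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# First argument: the markup string. Second argument: the parser name.
# "html.parser" is in the standard library; "lxml" and "html5lib" are
# third-party packages and must be installed before they can be named here.
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.title.string)    # Demo
print(soup.p.get_text())    # Hello
```

Different parsers can build slightly different trees from malformed HTML, so it is usually worth naming the parser explicitly rather than relying on whichever one happens to be installed.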

1.2 Pretty-printing an HTML document

The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Thu May 4 13:56:00 2017

@author: zch
"""
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Parse with lxml (installed separately) and print the tree with indentation
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())

The pretty-printed output looks like this:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

1.3 Ways to navigate the structured data

(1) Getting the properties of the document's title
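A minimal sketch of what reading the title's properties might look like. The document here is trimmed down to the `<title>` element from the sample data, and `html.parser` is an assumption made so that the snippet runs without extra packages.

```python
from bs4 import BeautifulSoup

# Only the <title> part of the sample document is kept for this sketch.
html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title         # first <title> tag in the document

print(title_tag)               # <title>The Dormouse's story</title>
print(title_tag.name)          # title
print(title_tag.string)        # The Dormouse's story
print(title_tag.parent.name)   # head
```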

(2) Getting the properties of the hyperlinks (a)
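One way to read the attributes of the `<a>` tags is sketched below. The fragment is a shortened version of the sample data with two links, and `html.parser` is again an assumption chosen to avoid extra installs.

```python
from bs4 import BeautifulSoup

# A shortened fragment of the sample document with two of its links.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a                 # shortcut for the first <a> tag

print(first_link["href"])           # http://example.com/elsie
print(first_link.get("id"))         # link1
print(first_link["class"])          # ['sister'] (class is multi-valued, so a list)
print(first_link.get("title"))      # None, because the attribute is absent
print(first_link.get_text())        # Elsie

# Collect the href of every link in the fragment.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)
```

Indexing with `tag["name"]` raises `KeyError` when the attribute is missing, while `tag.get("name")` returns `None`, which is often the safer choice on pages you do not control.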

(3) Getting the properties of the paragraphs (p)
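Reading the paragraph properties could look roughly like this. The two paragraphs are a shortened stand-in for the sample data (the story text is cut to a single sentence), and `html.parser` is an assumption.

```python
from bs4 import BeautifulSoup

# Two paragraphs modelled on the sample data; the story text is shortened.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p                    # shortcut for the first <p> tag

print(first_p.name)                 # p
print(first_p["class"])             # ['title']
print(first_p.b.string)             # The Dormouse's story
print(first_p.get_text())           # The Dormouse's story

# soup.p only ever returns the first paragraph; find_all returns all of them.
paragraphs = soup.find_all("p")
classes = [p["class"][0] for p in paragraphs]
print(classes)                      # ['title', 'story']
```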

(4) Finding matching elements in the HTML with the find method
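A few common `find` patterns, sketched on the three links from the sample data. The fragment and the choice of `html.parser` are assumptions for illustration; the filters shown (`id=`, `class_=`, and a compiled regular expression for `href`) are standard BeautifulSoup keyword filters.

```python
import re

from bs4 import BeautifulSoup

# The three links from the sample document, without the surrounding story text.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first match only (or None when nothing matches).
by_id = soup.find(id="link3")
by_class = soup.find("a", class_="sister")   # class_ because class is a Python keyword
by_regex = soup.find("a", href=re.compile(r"lacie"))
missing = soup.find("table")

print(by_id.get_text())      # Tillie
print(by_class.get_text())   # Elsie
print(by_regex["href"])      # http://example.com/lacie
print(missing)               # None

# find_all() returns every match as a list.
all_sisters = soup.find_all("a", class_="sister")
print(len(all_sisters))      # 3
```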

2. BeautifulSoup test example

The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Thu May 4 15:11:23 2017

@author: zch
"""
from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

print('Test 1: get all links')
links = soup.find_all('a')
for link in links:
    # Tag name, href attribute, and visible link text
    print(link.name, link['href'], link.get_text())

print('Test 2: get a link with a regular-expression match')
# find() returns only the first <a> whose href contains "cie"
link_node = soup.find('a', href=re.compile(r"cie"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Test 3: get the story text')
# First <p> whose class is "story"
p_text = soup.find('p', class_='story')
print(p_text.name, p_text.get_text())
# print(soup.p.get_text())

The test results are shown in the figure below:

[Figure: console output of the three tests]
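A side note on tests 2 and 3: `find` returns `None` when nothing matches, so calling `.name` or `.get_text()` on the result fails with `AttributeError` on pages that lack the expected element. The sketch below shows one possible guard. The helper `describe` and the one-paragraph document are hypothetical and only serve the illustration, and `html.parser` is an assumption.

```python
from bs4 import BeautifulSoup

# A hypothetical page that has no "story" paragraph at all.
html_doc = '<p class="intro">No story paragraph here.</p>'
soup = BeautifulSoup(html_doc, "html.parser")


def describe(tag):
    """Return 'name: text' for a tag, or a placeholder when find() found nothing."""
    if tag is None:
        return "not found"
    return f"{tag.name}: {tag.get_text(strip=True)}"


story = soup.find("p", class_="story")   # no match, so this is None
intro = soup.find("p", class_="intro")   # matches the only paragraph

print(describe(story))   # not found
print(describe(intro))   # p: No story paragraph here.
```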

When reposting, please cite the original address: https://www.6miu.com/read-3549.html
