Scraping JD.com product page information with requests and bs4

xiaoxiao 2021-02-27

1. Page URL

Enter 电脑 (computer) in the search box on the JD home page.

The resulting URL is:

https://search.jd.com/Search?keyword=电脑&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&offset=5&wq=电脑&page=1&s=1&click=0

2. Analyzing the URL

Clicking the page buttons in the pagination bar at the bottom of the results gives these URLs:

Page 1: https://search.jd.com/Search?keyword=电脑&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&offset=5&wq=电脑&page=1&s=1&click=0

Page 2: https://search.jd.com/Search?keyword=电脑&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&offset=6&wq=电脑&page=3&s=52&click=0

Page 3: https://search.jd.com/Search?keyword=电脑&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&offset=6&wq=电脑&page=5&s=112&click=0

Here keyword and wq are both the search term, 电脑.

page steps through the odd numbers 1, 3, 5.

Changing page in the URL to 2, 4, 6 returns the same page content as page = 1, 3, 5.
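The odd-number pattern can be written as a small mapping from the page shown in the pagination bar to the page value in the URL. This is only a sketch: jd_page_param is a made-up helper name, and the rule is inferred from the three URLs observed above rather than from any documented JD behavior.

```python
def jd_page_param(visible_page):
    """Map the page number shown in the pagination bar (1, 2, 3, ...)
    to the odd `page` value seen in the URL (1, 3, 5, ...)."""
    if visible_page < 1:
        raise ValueError("visible_page must be >= 1")
    return 2 * visible_page - 1


# The first three result pages map to page=1, page=3, page=5.
page_params = [jd_page_param(n) for n in range(1, 4)]
print(page_params)  # [1, 3, 5]
```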

So we set the URL to crawl to:

https://search.jd.com/Search?keyword=电脑&enc=utf-8&wq=电脑&page=3&s=30
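A browser percent-encodes the non-ASCII keyword for you. When the URL is assembled in code, urllib.parse.urlencode from the standard library can do the same. The sketch below assumes that the parameters kept in the shortened URL are the only ones needed. build_search_url is a hypothetical helper name, and s is simply passed through because its rule is not worked out in this post.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.jd.com/Search"


def build_search_url(keyword, page, s=None):
    """Build a search URL with the keyword percent-encoded as UTF-8."""
    params = {"keyword": keyword, "enc": "utf-8", "wq": keyword, "page": page}
    if s is not None:
        # `s` is copied as given; how JD computes it is not derived here.
        params["s"] = s
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("电脑", page=3, s=30)
print(url)
```

The printed URL carries the same parameters as the one above, with 电脑 replaced by its percent-encoded UTF-8 bytes.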

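3. Code

In the page source, each product sits in its own li element. The title attribute of the first a tag inside that li holds the product description, and the string of the i tag in the following div block holds the price.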
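The lookup described above can be tried offline on a small piece of markup. The HTML below is hand-written to imitate that structure and is not real JD output. extract_goods is a hypothetical helper, and it uses the built-in html.parser backend so that lxml is not required for this sketch.

```python
from bs4 import BeautifulSoup

# Hand-written sample imitating the described layout; not real JD markup.
SAMPLE_HTML = """
<ul>
  <li class="gl-item">
    <div class="p-img"><a title="Example laptop 14 inch" href="#"></a></div>
    <div class="p-price"><strong><em>￥</em><i>4999.00</i></strong></div>
  </li>
  <li class="gl-item">
    <div class="p-img"><a title="Example desktop tower" href="#"></a></div>
    <div class="p-price"><strong><em>￥</em><i>3299.00</i></strong></div>
  </li>
</ul>
"""


def extract_goods(html):
    """Return (title, price) pairs for every li.gl-item in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    goods = []
    for li in soup.find_all("li", class_="gl-item"):
        link = li.find("a", title=True)              # first a tag that has a title
        price_div = li.find("div", class_="p-price")
        price_tag = price_div.i if price_div else None
        if link is None or price_tag is None:
            continue                                  # skip items with a different layout
        goods.append((link["title"], price_tag.get_text(strip=True)))
    return goods


for title, price in extract_goods(SAMPLE_HTML):
    print(title, price)
```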

import requests
from bs4 import BeautifulSoup


# Fetch the page HTML
def getHtmlText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print('success')
        return r.text
    except requests.RequestException:
        print('false')
        return ''


# Parse the page and collect the useful fields
def parseHtml(goods_data, html):
    soup = BeautifulSoup(html, 'lxml')
    lis = soup.find_all('li', class_='gl-item')
    print(len(lis))
    for li in lis:
        try:
            # The title attribute of the first a tag holds the product description
            title = li.a['title']
            # The i tag inside div.p-price holds the price
            price = li.find('div', class_='p-price').i.string
            goods_data.append([title, price])
        except (AttributeError, KeyError, TypeError):
            # Skip items that do not have the expected structure
            continue


# Print the collected data
def displayHtmlGoods(goods_data):
    std = '{0:^100}{1:^8}'
    print(std.format('Product name', 'Price'))
    for title, price in goods_data:
        print(std.format(title, str(price)))


def main():
    url_basic = 'https://search.jd.com/Search?keyword='
    total_pages = 3   # number of pages to crawl
    keyword = '电脑'   # search keyword
    goods_data = []
    for i in range(total_pages):
        page = 1 + i * 2  # the URL uses odd page values: 1, 3, 5
        url = url_basic + keyword + '&enc=utf-8&wq=' + keyword + '&page=' + str(page)
        print(url)
        html = getHtmlText(url)
        parseHtml(goods_data, html)
    displayHtmlGoods(goods_data)


if __name__ == '__main__':
    main()
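The fetch step can be made a little more defensive. The sketch below adds a timeout and a User-Agent header and reports the failure reason. Both additions are assumptions on my part: the post does not say whether JD needs particular headers, and the header value and the name fetch_html are invented for illustration.

```python
import requests

# Assumed header value; the original post sends no custom headers.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}


def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails for any reason."""
    try:
        r = requests.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
        r.raise_for_status()
    except requests.RequestException as exc:
        print(f"request failed: {exc}")
        return None
    # Let requests guess the encoding from the body, as the original code does.
    r.encoding = r.apparent_encoding
    return r.text
```

Returning None instead of the string 'false' lets the caller tell a failed request apart from a page whose text happens to be short, for example with `if html is not None: parseHtml(goods_data, html)`.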

