Using Python scraping together with front-end skills to build a read-and-lookup app for The Economist (010)


010: Scrape the latest listed articles from The Economist with Python and archive them as local files

First, a quick review of fetching the list of latest articles from the homepage, returned as [[a, title], …]:

def getPaperList():
    url = 'https://economist.com'
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    response = urllib.request.urlopen(req)
    html = response.read()
    selector = etree.HTML(html.decode('utf-8'))
    goodpath = '/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/main[1]/div[1]/div[1]/div[1]/div[3]/ul[1]/li'
    art = selector.xpath(goodpath)
    awithtext = []
    try:
        for li in art:
            ap = li.xpath('article[1]/a[1]/div[1]/h3[1]/text()')
            a = li.xpath('article[1]/a[1]/@href')
            awithtext.append([a[0], ap[0]])
    except Exception as err:
        print(err, 'getPaperList')
    finally:
        return awithtext
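All of the snippets in this post assume the usual imports plus a headers dict with a browser-like User-Agent; neither appears in the original, so the following is only a minimal sketch (the User-Agent string is a placeholder):

# Minimal setup assumed by the functions below (not shown in the original post).
# The User-Agent value is a placeholder; any common browser UA string will do.
import json
import os
import time
import urllib.request

from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}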

1. Next, analyze the HTML structure of the article page to be scraped

The markers in the figure above are: 1. flytitle-and-title__flytitle; 2. the real title; 3. the description; 4. the <p> elements at the same DOM level, which together make up the body paragraphs of the article.

2. Scrape the article content:

def getPaper(url):
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    response = urllib.request.urlopen(req)
    html = response.read()
    selector = etree.HTML(html.decode('utf-8'))
    goodpath = '/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/main[1]/div[1]//div[2]/div[1]/article'
    article = selector.xpath(goodpath)
    return article

3. Extract the information for markers 1, 2 and 3, producing [1, 2, 3]:

def getHeadline(article):
    headline = []
    try:
        h1 = article[0].xpath('h1/span')
        for item in h1:
            headline.append(item.text)
        p1 = article[0].xpath('p[1]/text()')
        headline.append(p1[0])
    except Exception as err:
        print(err, 'getHeadline')
    finally:
        return headline

4. Get the article body paragraphs as p = [p, p, p, …]:

def getContent(article):
    parr = []
    try:
        p = article[0].xpath('div[1]/div[3]/p/text()')
        for i in p:
            print(i)
            parr.append(i + '\n')
    except Exception as err:
        print(err, 'getContent')
    finally:
        return parr
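Before wiring everything together in step 5, it helps to sanity-check the three functions on a single article. A quick, illustrative test (the path below is just one of the hrefs from the example data):

# Quick check of getPaper / getHeadline / getContent on one article.
# The path is only an example; substitute any href returned by getPaperList().
test_url = 'https://economist.com/blogs/graphicdetail/2018/04/daily-chart-18'
article = getPaper(test_url)
print(getHeadline(article))                 # [flytitle, title, description]
print(''.join(getContent(article))[:500])   # first 500 characters of the body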

5. Now let the spider perform

if __name__ == '__main__':
    linkArr = getPaperList()
    time.sleep(10)
    tmpLast = []
    toDayDir = './mds/' + todayDate + '/papers/'
    if not os.path.exists(toDayDir):
        os.makedirs(toDayDir)
    for item in linkArr:
        if item[0] not in lastLst:
            tmpLast.append(item[0])
            url = 'https://economist.com' + item[0]
            article = getPaper(url)
            headLine = getHeadline(article)
            try:
                paperRecords[strY][strM][strD].append([item[0], headLine[1]])
                content = getContent(article)
                paperName = '_'.join(item[1].split(' '))
                saveMd = toDayDir + paperName + '.md'
                result = headLine[1:]
                result.extend(content)
                output = '\n'.join(result)
                with open(saveMd, 'w') as fw:
                    fw.write(output)
                time.sleep(10)
            except Exception as err:
                print(err)
    paperRecords['lastLst'] = tmpLast
    with open('spiRecords.json', 'w') as fwp:
        json.dump(paperRecords, fwp)
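The main block relies on a few names defined elsewhere in the project: todayDate, the date-key strings strY/strM/strD, and paperRecords/lastLst loaded from spiRecords.json. That setup code is not shown in the post, so the following is only a sketch of what it might look like, matching the key format in the JSON example of step 6:

# Illustrative bootstrap assumed by the main block (not from the original post).
from datetime import date

today = date.today()
todayDate = today.strftime('%Y_%m_%d')   # e.g. '2018_04_29', matches mds/<date>/papers
strY = 'a%d' % today.year                # e.g. 'a2018'
strM = 'a%d' % today.month               # e.g. 'a4'
strD = 'a%d' % today.day                 # e.g. 'a29'

try:
    with open('spiRecords.json') as fr:
        paperRecords = json.load(fr)
except FileNotFoundError:
    paperRecords = {'lastLst': []}

lastLst = paperRecords.get('lastLst', [])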

6. Notes on some of the data structures used in step 5:

First, the archive directory layout:

mds/2018_04_29/papers  # the dated papers directory is created when the results are generated

Next, the structure of the JSON file that stores the scraping records; an example says it best:

{"a2018": {"a4": {"a29": [ ["/blogs/graphicdetail/2018/04/daily-chart-18", "Success is on the cards for Nintendo"] ] } }, "lastLst": ["/blogs/graphicdetail/2018/04/daily-chart-18","/blogs/buttonwood/2018/04/affording-retirement"] }

lastLst is saved so that the same articles are not scraped twice. You could instead traverse all of the stored records and filter out duplicates, but that costs more and takes a long stretch of code for no obvious benefit.
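For comparison, the "traverse everything" alternative mentioned above would look roughly like this (illustrative sketch only):

# Alternative to lastLst: collect every href ever recorded and skip those.
# Illustrative only; the post argues this is not worth the extra cost.
seen = set()
for year, months in paperRecords.items():
    if year == 'lastLst':
        continue
    for month, days in months.items():
        for day, records in days.items():
            for href, title in records:
                seen.add(href)

newLinks = [item for item in linkArr if item[0] not in seen]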

That wraps up scraping the articles. The next post will cover how the words in the articles are deduplicated.

Finally, on to today's read-and-lookup article.

When reposting, please credit the original: https://www.6miu.com/read-2500243.html
