1 Problem description
The starting page, the Baidu Baike entry for Python (https://baike.baidu.com/item/Python), links to many other entries. By following links between pages, visit 1000 Baike entries.
For each entry, extract its title and summary.
2 Discussion
First fetch the page source, then parse out the data we need.
Here we fetch the page source with urllib or the requests library, then parse it with BeautifulSoup.
Inspecting the page shows that the title sits inside the <h1></h1> tag,
and the summary sits inside the div whose class is lemma-summary.
Links to other entries all follow the form https://baike.baidu.com/item/xxx.
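To make these three selectors concrete, here is a minimal sketch that runs them against a small inline HTML fragment (the fragment is an illustrative stand-in, not real Baike markup):

# Minimal parsing sketch; the HTML fragment below is illustrative only.
from bs4 import BeautifulSoup
import re

html = '''
<h1>Python</h1>
<div class="lemma-summary">Python is a programming language.</div>
<a href="/item/Guido">Guido</a>
<a href="/other/ignored">ignored</a>
'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('h1').text)                           # title
print(soup.find('div', class_='lemma-summary').text)  # summary
# only links of the form /item/xxx count as entry links
print([a['href'] for a in soup.find_all('a', href=re.compile(r'/item/'))])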
3 Implementation
# coding=utf-8
from urllib import request
from bs4 import BeautifulSoup
import re
import traceback
import time

url_new = set()   # URLs discovered but not yet crawled
url_old = set()   # URLs already crawled
start_url = 'https://baike.baidu.com/item/python'
max_url = 1000

def add_url(url):
    # stop collecting once we have gathered enough URLs
    if len(url_new) + len(url_old) > max_url:
        return
    if url not in url_old and url not in url_new:
        url_new.add(url)

def get_url():
    url = url_new.pop()
    url_old.add(url)
    return url

def parse_title_summary(page):
    soup = BeautifulSoup(page, 'html.parser')
    node = soup.find('h1')
    title = node.text
    node = soup.find('div', class_='lemma-summary')
    summary = node.text
    return title, summary

def parse_url(page):
    # collect all entry links (/item/xxx) and make them absolute
    soup = BeautifulSoup(page, 'html.parser')
    links = soup.findAll('a', href=re.compile(r'/item/'))
    res = set()
    keprefix = 'https://baike.baidu.com'
    for i in links:
        res.add(keprefix + i['href'])
    return res

def write2log(text, name='d:/ke-urllib.log'):
    with open(name, 'a+', encoding='utf-8') as fp:
        fp.write('\n')
        fp.write(text)

if __name__ == '__main__':
    url_new.add(start_url)
    print('working')
    time_begin = time.time()
    count = 1
    while url_new:
        url = get_url()
        try:
            resp = request.urlopen(url)
            text = resp.read().decode()
            write2log('.'.join(parse_title_summary(text)))
            urls = parse_url(text)
            for i in urls:
                add_url(i)
            print(str(count), 'ok')
            count += 1
        except Exception:
            traceback.print_exc()
            print(url)
    time_end = time.time()
    print('time elapsed: ', time_end - time_begin)
    print('the end.')
Output:
working
1 ok
(omitted)
983 ok
984 ok
time elapsed: 556.4766345024109
the end.
Replacing urllib with the third-party requests library:
pip install requests
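Before swapping it into the crawler, a quick sanity-check fetch might look like the sketch below (the timeout and the raise_for_status call are my additions for illustration, not part of the original script):

# Minimal fetch sketch with requests; timeout and raise_for_status
# are illustrative additions, not in the original script.
import requests

resp = requests.get('https://baike.baidu.com/item/python', timeout=10)
resp.raise_for_status()              # fail loudly on HTTP errors
text = resp.content.decode('utf-8')  # the page is UTF-8 encoded
print(text[:200])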
(unchanged code above omitted; import requests replaces the urllib import)

if __name__ == '__main__':
    url_new.add(start_url)
    print('working')
    time_begin = time.time()
    count = 1
    while url_new:
        url = get_url()
        try:
            with requests.Session() as s:
                resp = s.get(url)
                text = resp.content.decode()  # decode() defaults to 'utf-8'
                write2log('.'.join(parse_title_summary(text)))
                urls = parse_url(text)
                for i in urls:
                    add_url(i)
                print(str(count), 'ok')
                count += 1
        except Exception:
            traceback.print_exc()
            print(url)
    time_end = time.time()
    print('time elapsed: ', time_end - time_begin)
    print('the end.')
Output:
(omitted)
986 ok
987 ok
988 ok
989 ok
time elapsed: 492.8088216781616
the end.
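A side note on the design: the loop above creates a new Session for every URL, so requests' connection pooling is never really exploited. Below is a sketch of my own (not from the original answer) that hoists the Session out of the loop, reusing the helper functions from section 3:

# Sketch (my modification, not the original answer's code): one shared
# Session pools keep-alive connections to the same host across requests.
# Reuses url_new, get_url, add_url, parse_title_summary, parse_url,
# write2log from section 3.
import requests
import traceback

with requests.Session() as s:
    while url_new:
        url = get_url()
        try:
            text = s.get(url).content.decode()
            write2log('.'.join(parse_title_summary(text)))
            for u in parse_url(text):
                add_url(u)
        except Exception:
            traceback.print_exc()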
A general crawler architecture has four parts:
a scheduler
a URL manager
a page downloader
a page parser
This division is already visible in the functional-style code above.
Below is the object-oriented version.
$ ls
html_downloader.py  html_outputer.py  html_parser.py  spider_main.py  url_manager.py

1. spider main

# coding=utf-8
from ex.url_manager import UrlManager
from ex.html_downloader import HtmlDownloader
from ex.html_parser import HtmlParser
from ex.html_outputer import HtmlOutputer
import traceback, time

class SpiderMain():
    ...  # class body omitted in the original answer

2. URL manager

# coding=utf-8
class UrlManager():
    ...

3. html downloader

# coding=utf-8
import requests

class HtmlDownloader():
    ...

4. html parser

# coding=utf-8
from bs4 import BeautifulSoup
import re

class HtmlParser():
    ...

5. html outputer

# coding=utf-8
class HtmlOutputer():
    ...
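The class bodies were omitted above; here is a minimal single-file sketch of how the five pieces could fit together, assuming each class simply wraps the corresponding function from the functional version. All method names (craw, add_new_url, download, parse, output) are hypothetical, not the original author's:

# Hypothetical sketch: the five classes condensed into one file for brevity.
# Method names are assumptions, not the original answer's code.
# coding=utf-8
from bs4 import BeautifulSoup
import requests
import re

class UrlManager():
    def __init__(self, limit=1000):
        self.url_new, self.url_old, self.limit = set(), set(), limit
    def add_new_url(self, url):
        if len(self.url_new) + len(self.url_old) > self.limit:
            return
        if url not in self.url_new and url not in self.url_old:
            self.url_new.add(url)
    def has_new_url(self):
        return bool(self.url_new)
    def get_new_url(self):
        url = self.url_new.pop()
        self.url_old.add(url)
        return url

class HtmlDownloader():
    def download(self, url):
        return requests.get(url).content.decode()

class HtmlParser():
    def parse(self, page):
        soup = BeautifulSoup(page, 'html.parser')
        title = soup.find('h1').text
        summary = soup.find('div', class_='lemma-summary').text
        links = soup.findAll('a', href=re.compile(r'/item/'))
        urls = {'https://baike.baidu.com' + a['href'] for a in links}
        return (title, summary), urls

class HtmlOutputer():
    def output(self, data, name='ke-oo.log'):
        with open(name, 'a+', encoding='utf-8') as fp:
            fp.write('\n' + '.'.join(data))

class SpiderMain():
    def __init__(self):
        self.urls = UrlManager()
        self.downloader = HtmlDownloader()
        self.parser = HtmlParser()
        self.outputer = HtmlOutputer()
    def craw(self, start_url):
        # scheduler: drive the manager/downloader/parser/outputer loop
        self.urls.add_new_url(start_url)
        while self.urls.has_new_url():
            url = self.urls.get_new_url()
            try:
                page = self.downloader.download(url)
                data, urls = self.parser.parse(page)
                self.outputer.output(data)
                for u in urls:
                    self.urls.add_new_url(u)
            except Exception:
                print('failed:', url)

if __name__ == '__main__':
    SpiderMain().craw('https://baike.baidu.com/item/python')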