
How can I crawl data from Zhihu with Python?

Posted by a forum user on 2022-04-26 03:25

2 answers

Answer from 懂视网 (2022-05-10 13:29):

Required packages:

beautifulsoup4
html5lib
image
requests
redis
PyMySQL


Install all the dependencies with pip:

pip install Image requests beautifulsoup4 html5lib redis PyMySQL


The runtime environment must support Chinese.
Tested on Python 3.5; other environments are not guaranteed to work.
MySQL and Redis must be installed.
Configure the config.ini file: set up MySQL and Redis, and fill in your Zhihu account (an example config.ini is sketched further below).
Import init.sql into the database.


Run
Start crawling:

python get_user.py


Check how many users have been crawled:

python check_redis.py


Results
(Screenshots in the original post: the multithreaded Zhihu user crawler in action.)
Overall approach
1. Simulate a login to Zhihu and save the login cookies.
2. Fetch the HTML of Zhihu pages and keep it for the next step, parsing and extracting information.
3. Parse each page for the users' personalized URLs and put them into Redis. (A note on how Redis is used here: each extracted URL goes into a hash table named already_get_user that marks the user as crawled, so before crawling a user we check already_get_user to avoid duplicates; each URL is also pushed onto the user_queue list, and when we need a new user we pop from that queue. A minimal sketch of this pattern follows the list.)
4. Fetch the user's followee and follower lists and insert those users into Redis as well.
5. Pop a new user from the Redis user_queue and repeat from step 3.
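
The following is a minimal sketch of that dedup/queue pattern, assuming a local Redis instance; the key names match the already_get_user hash and user_queue list described above:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def add_wait_user(name_url):
    # Enqueue only users we have not seen before.
    if not r.hexists('already_get_user', name_url):
        r.hset('already_get_user', name_url, 1)  # mark as already crawled
        r.lpush('user_queue', name_url)          # enqueue for crawling

def next_user():
    # Pop the next user to crawl; returns None when the queue is empty.
    raw = r.rpop('user_queue')
    return raw.decode('utf-8') if raw else None
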
Simulating the Zhihu login
Login comes first. The login functionality is wrapped in a login package so it is easy to integrate and call.
In the header section, Connection is best set to close, otherwise you may run into "max retries exceeded" errors.
The reason is that an ordinary connection is keep-alive, yet it never gets closed.

# HTTP request headers
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
 "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
 "Host": "www.zhihu.com",
 "Referer": "https://www.zhihu.com/",
 "Origin": "https://www.zhihu.com/",
 "Upgrade-Insecure-Requests": "1",
 "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
 "Pragma": "no-cache",
 "Accept-Encoding": "gzip, deflate, br",
 'Connection': 'close'
}
# Verify whether we are logged in
def check_login(self):
    check_url = 'https://www.zhihu.com/settings/profile'
    try:
        login_check = self.__session.get(check_url, headers=self.headers, timeout=35)
    except Exception as err:
        traceback.print_exc()
        print(err)
        print("Login check failed, please check your network")
        sys.exit()
    print("HTTP status code of the login check: " + str(login_check.status_code))
    if int(login_check.status_code) == 200:
        return True
    else:
        return False


We request the profile settings page and use the HTTP status code to verify the login: 200 means logged in, while a 304 generally means we were redirected, i.e. not logged in.

# Fetch the captcha
def get_captcha(self):
    t = str(time.time() * 1000)
    captcha_url = 'http://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = self.__session.get(captcha_url, headers=self.headers, timeout=35)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    # Display the captcha with Pillow's Image;
    # if Pillow is not installed, find captcha.jpg in the source directory and type it in manually
    '''try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except:'''
    print(u'Please find captcha.jpg at %s and enter it manually' % os.path.abspath('captcha.jpg'))
    captcha = input("Please enter the captcha\n> ")
    return captcha


This method fetches the captcha. Zhihu may require a captcha after too many login attempts, and this implements that case.

# Fetch the xsrf token
def get_xsrf(self):
    index_url = 'http://www.zhihu.com'
    # _xsrf is needed for the login request
    try:
        index_page = self.__session.get(index_url, headers=self.headers, timeout=35)
    except:
        print('Failed to fetch the Zhihu page, please check your network connection')
        sys.exit()
    html = index_page.text
    # re.findall returns a list here
    BS = BeautifulSoup(html, 'html.parser')
    xsrf_input = BS.find(attrs={'name': '_xsrf'})
    pattern = r'value="(.*?)"'
    print(xsrf_input)
    self.__xsrf = re.findall(pattern, str(xsrf_input))
    return self.__xsrf[0]


Why fetch the xsrf token? Because xsrf is a defence against cross-site request forgery; see any introduction to CSRF for details.
After obtaining it, store the xsrf in the cookies and send it as a header whenever you call the API; otherwise Zhihu returns a 403.
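
As a minimal sketch of that idea (the header name X-Xsrftoken is an assumption here; session and xsrf are taken from the login code above):

def post_with_xsrf(session, xsrf, url, data=None):
    # Zhihu rejects API calls without the token; send it as a request header.
    headers = {'X-Xsrftoken': xsrf}
    return session.post(url, data=data, headers=headers, timeout=35)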

# Perform the simulated login
def do_login(self):
    try:
        # Skip if already logged in
        if self.check_login():
            print('You are already logged in')
            return
        else:
            if self.config.get("zhihu_account", "username") and self.config.get("zhihu_account", "password"):
                self.username = self.config.get("zhihu_account", "username")
                self.password = self.config.get("zhihu_account", "password")
            else:
                self.username = input('Please enter your username\n> ')
                self.password = input("Please enter your password\n> ")
    except Exception as err:
        traceback.print_exc()
        print(err)
        sys.exit()
    if re.match(r"^1\d{10}$", self.username):
        print("Logging in with a phone number")
        post_url = 'http://www.zhihu.com/login/phone_num'
        postdata = {
            '_xsrf': self.get_xsrf(),
            'password': self.password,
            'remember_me': 'true',
            'phone_num': self.username,
        }
    else:
        print("Logging in with an email address")
        post_url = 'http://www.zhihu.com/login/email'
        postdata = {
            '_xsrf': self.get_xsrf(),
            'password': self.password,
            'remember_me': 'true',
            'email': self.username,
        }
    try:
        login_page = self.__session.post(post_url, postdata, headers=self.headers, timeout=35)
        login_text = json.loads(login_page.text.encode('latin-1').decode('unicode-escape'))
        print(postdata)
        print(login_text)
        # r = 0 is the success code; r = 1 means a captcha is required
        if login_text['r'] == 1:
            sys.exit()
    except:
        postdata['captcha'] = self.get_captcha()
        login_page = self.__session.post(post_url, postdata, headers=self.headers, timeout=35)
        print(json.loads(login_page.text.encode('latin-1').decode('unicode-escape')))
    # Save the login cookies
    self.__session.cookies.save()


This is the core login function. The key point is the requests library, which makes it very convenient to persist everything in a session.
We use a singleton pattern throughout: every part of the program works through the same requests.session object, which keeps the login state consistent.
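
A minimal sketch of that singleton idea (the module-level names here are illustrative, not from the project):

import requests

_session = None

def get_session():
    # Create the shared session once; every caller reuses the same object,
    # so cookies set at login are visible to all subsequent requests.
    global _session
    if _session is None:
        _session = requests.Session()
    return _session
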
The main code that invokes the login looks like this:

# Create the login object
lo = login.login.Login(self.session)
# Simulated login
if lo.check_login():
    print('You are already logged in')
else:
    if self.config.get("zhihu_account", "username") and self.config.get("zhihu_account", "password"):
        username = self.config.get("zhihu_account", "username")
        password = self.config.get("zhihu_account", "password")
    else:
        username = input('Please enter your username\n> ')
        password = input("Please enter your password\n> ")
    lo.do_login(username, password)


That completes the simulated Zhihu login.
Crawling Zhihu users

def __init__(self, threadID=1, name=''):
    # Thread setup
    print("Initializing thread " + str(threadID))
    threading.Thread.__init__(self)
    self.threadID = threadID
    self.name = name
    try:
        print("Thread " + str(threadID) + " initialized successfully")
    except Exception as err:
        print(err)
        print("Thread " + str(threadID) + " failed to start")
    self.threadLock = threading.Lock()
    # Load the configuration
    self.config = configparser.ConfigParser()
    self.config.read("config.ini")
    # Initialize the session (cookielib is http.cookiejar in Python 3)
    requests.adapters.DEFAULT_RETRIES = 5
    self.session = requests.Session()
    self.session.cookies = cookielib.LWPCookieJar(filename='cookie')
    self.session.keep_alive = False
    try:
        self.session.cookies.load(ignore_discard=True)
    except:
        print('Could not load cookies')
    # Create the login object and log in
    lo = Login(self.session)
    lo.do_login()
    # Initialize the Redis connection
    try:
        redis_host = self.config.get("redis", "host")
        redis_port = int(self.config.get("redis", "port"))
        self.redis_con = redis.Redis(host=redis_host, port=redis_port, db=0)
        # Uncomment to flush the Redis db for a fresh start
        # self.redis_con.flushdb()
    except:
        print("Please install Redis or check the Redis connection settings")
        sys.exit()
    # Initialize the database connection
    try:
        db_host = self.config.get("db", "host")
        db_port = int(self.config.get("db", "port"))
        db_user = self.config.get("db", "user")
        db_pass = self.config.get("db", "password")
        db_db = self.config.get("db", "db")
        db_charset = self.config.get("db", "charset")
        self.db = pymysql.connect(host=db_host, port=db_port, user=db_user, passwd=db_pass,
                                  db=db_db, charset=db_charset)
        self.db_cursor = self.db.cursor()
    except:
        print("Please check the database settings")
        sys.exit()
    # System settings
    self.max_queue_len = int(self.config.get("sys", "max_queue_len"))


This is the constructor of get_user.py. It initializes the MySQL connection and the Redis connection, verifies the login, creates the global session object, loads the system configuration, and sets up the threads.
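
For reference, a config.ini consistent with the config.get() calls above might look like the following; every value is a placeholder to replace with your own:

[zhihu_account]
username = your_zhihu_account
password = your_zhihu_password

[redis]
host = 127.0.0.1
port = 6379

[db]
host = 127.0.0.1
port = 3306
user = root
password = your_db_password
db = zhihu
charset = utf8

[sys]
max_queue_len = 100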

# Fetch the front page HTML
def get_index_page(self):
    index_url = 'https://www.zhihu.com/'
    try:
        index_html = self.session.get(index_url, headers=self.headers, timeout=35)
    except Exception as err:
        # Retry on failure
        print("Failed to fetch the page, retrying......")
        print(err)
        traceback.print_exc()
        return None
    return index_html.text

# Fetch a single user's profile (about) page
def get_user_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/about'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("name_url: " + str(name_url) + " failed to fetch, giving up on this user")
        print(err)
        traceback.print_exc()
        return None
    return index_html.text

# Fetch a user's followers page
def get_follower_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/followers'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("name_url: " + str(name_url) + " failed to fetch, giving up on this user")
        print(err)
        traceback.print_exc()
        return None
    return index_html.text

# Fetch a user's followees page
def get_following_page(self, name_url):
    user_page_url = 'https://www.zhihu.com' + str(name_url) + '/followees'
    try:
        index_html = self.session.get(user_page_url, headers=self.headers, timeout=35)
    except Exception as err:
        print("name_url: " + str(name_url) + " failed to fetch, giving up on this user")
        print(err)
        traceback.print_exc()
        return None
    return index_html.text

# Collect the users listed on the front page and push them into Redis
def get_index_page_user(self):
    index_html = self.get_index_page()
    if not index_html:
        return
    BS = BeautifulSoup(index_html, "html.parser")
    self.get_xsrf(index_html)
    user_a = BS.find_all("a", class_="author-link")  # the users' <a> tags
    for a in user_a:
        if a:
            self.add_wait_user(a.get('href'))
        else:
            continue


This part of the code is responsible for fetching the HTML of the various pages.

# Add a user to the waiting queue, using Redis to skip users already crawled
def add_wait_user(self, name_url):
    # Check whether the user has already been crawled
    self.threadLock.acquire()
    if not self.redis_con.hexists('already_get_user', name_url):
        self.counter += 1
        self.redis_con.hset('already_get_user', name_url, 1)
        self.redis_con.lpush('user_queue', name_url)
        print("Added user " + name_url + " to the queue")
    self.threadLock.release()

# Remove a user from Redis when fetching their page failed
def del_already_user(self, name_url):
    self.threadLock.acquire()
    if self.redis_con.hexists('already_get_user', name_url):
        self.counter -= 1
        self.redis_con.hdel('already_get_user', name_url)
    self.threadLock.release()


These are the Redis operations for adding users. When a database insert fails, we call del_already_user to remove the user whose insert went wrong.

# Parse the followers page and collect all of a user's followers
# @param name_url: get_follower_page() fetches the page, from which we read the
#                  user's hash_id and then call the followers API
def get_all_follower(self, name_url):
    follower_page = self.get_follower_page(name_url)
    # Did we get the page?
    if not follower_page:
        return
    BS = BeautifulSoup(follower_page, 'html.parser')
    # Number of followers ('关注者' is the on-page label for "followers")
    follower_num = int(BS.find('span', text='关注者').find_parent().find('strong').get_text())
    # The user's hash_id
    hash_id = json.loads(
        BS.select("#zh-profile-follows-list")[0].select(".zh-general-list")[0].get('data-init')
    )['params']['hash_id']
    # Fetch the follower list
    self.get_xsrf(follower_page)  # refresh the xsrf token
    post_url = 'https://www.zhihu.com/node/ProfileFollowersListV2'
    # Page through all followers, 20 per request: math.ceil(follower_num/20)*20
    for i in range(0, math.ceil(follower_num / 20) * 20, 20):
        post_data = {
            'method': 'next',
            'params': json.dumps({"offset": i, "order_by": "created", "hash_id": hash_id})
        }
        try:
            j = self.session.post(post_url, params=post_data, headers=self.headers,
                                  timeout=35).text.encode('latin-1').decode('unicode-escape')
            pattern = re.compile(r'class="zm-item-link-avatar"[^"]*"([^"]*)"', re.DOTALL)
            j = pattern.findall(j)
            for user in j:
                user = user.replace('\\', '')  # \/people\/xxx -> /people/xxx
                self.add_wait_user(user)  # save to Redis
        except Exception as err:
            print("Failed to fetch the follower list")
            print(err)
            traceback.print_exc()

# Collect all the users this user is following
def get_all_following(self, name_url):
    following_page = self.get_following_page(name_url)
    # Did we get the page?
    if not following_page:
        return
    BS = BeautifulSoup(following_page, 'html.parser')
    # Number of followees ('关注了' is the on-page label for "following")
    following_num = int(BS.find('span', text='关注了').find_parent().find('strong').get_text())
    # The user's hash_id
    hash_id = json.loads(
        BS.select("#zh-profile-follows-list")[0].select(".zh-general-list")[0].get('data-init')
    )['params']['hash_id']
    # Fetch the followee list
    self.get_xsrf(following_page)  # refresh the xsrf token
    post_url = 'https://www.zhihu.com/node/ProfileFolloweesListV2'
    # Page through all followees, 20 per request: math.ceil(following_num/20)*20
    for i in range(0, math.ceil(following_num / 20) * 20, 20):
        post_data = {
            'method': 'next',
            'params': json.dumps({"offset": i, "order_by": "created", "hash_id": hash_id})
        }
        try:
            j = self.session.post(post_url, params=post_data, headers=self.headers,
                                  timeout=35).text.encode('latin-1').decode('unicode-escape')
            pattern = re.compile(r'class="zm-item-link-avatar"[^"]*"([^"]*)"', re.DOTALL)
            j = pattern.findall(j)
            for user in j:
                user = user.replace('\\', '')  # \/people\/xxx -> /people/xxx
                self.add_wait_user(user)  # save to Redis
        except Exception as err:
            print("Failed to fetch the following list")
            print(err)
            traceback.print_exc()


We call Zhihu's API to fetch the complete followee and follower lists and collect users recursively.
Note that the request headers must carry the xsrf token, otherwise the API throws a 403.

# Parse the about page for the user's profile details
def get_user_info(self, name_url):
    about_page = self.get_user_page(name_url)
    # Did we get the page?
    if not about_page:
        print("Failed to fetch the about page, skipping, name_url: " + name_url)
        return
    self.get_xsrf(about_page)
    BS = BeautifulSoup(about_page, 'html.parser')
    # Extract the individual fields (the Chinese literals match the on-page labels)
    try:
        nickname = BS.find("a", class_="name").get_text() if BS.find("a", class_="name") else ''
        user_type = name_url[1:name_url.index('/', 1)]
        self_domain = name_url[name_url.index('/', 1) + 1:]
        gender = 2 if BS.find("i", class_="icon icon-profile-female") else (1 if BS.find("i", class_="icon icon-profile-male") else 3)
        follower_num = int(BS.find('span', text='关注者').find_parent().find('strong').get_text())
        following_num = int(BS.find('span', text='关注了').find_parent().find('strong').get_text())
        agree_num = int(re.findall(r'<strong>(.*)</strong>.*赞同', about_page)[0])
        appreciate_num = int(re.findall(r'<strong>(.*)</strong>.*感谢', about_page)[0])
        star_num = int(re.findall(r'<strong>(.*)</strong>.*收藏', about_page)[0])
        share_num = int(re.findall(r'<strong>(.*)</strong>.*分享', about_page)[0])
        browse_num = int(BS.find_all("span", class_="zg-gray-normal")[2].find("strong").get_text())
        trade = BS.find("span", class_="business item").get('title') if BS.find("span", class_="business item") else ''
        company = BS.find("span", class_="employment item").get('title') if BS.find("span", class_="employment item") else ''
        school = BS.find("span", class_="education item").get('title') if BS.find("span", class_="education item") else ''
        major = BS.find("span", class_="education-extra item").get('title') if BS.find("span", class_="education-extra item") else ''
        job = BS.find("span", class_="position item").get_text() if BS.find("span", class_="position item") else ''
        location = BS.find("span", class_="location item").get('title') if BS.find("span", class_="location item") else ''
        description = BS.find("p", class_="bio ellipsis").get('title') if BS.find("p", class_="bio ellipsis") else ''
        ask_num = int(BS.find_all("a", class_='item')[1].find("span").get_text()) if BS.find_all("a", class_='item')[1] else 0
        answer_num = int(BS.find_all("a", class_='item')[2].find("span").get_text()) if BS.find_all("a", class_='item')[2] else 0
        article_num = int(BS.find_all("a", class_='item')[3].find("span").get_text()) if BS.find_all("a", class_='item')[3] else 0
        collect_num = int(BS.find_all("a", class_='item')[4].find("span").get_text()) if BS.find_all("a", class_='item')[4] else 0
        public_edit_num = int(BS.find_all("a", class_='item')[5].find("span").get_text()) if BS.find_all("a", class_='item')[5] else 0
        replace_data = (pymysql.escape_string(name_url), nickname, self_domain, user_type,
                        gender, follower_num, following_num, agree_num, appreciate_num, star_num,
                        share_num, browse_num, trade, company, school, major, job, location,
                        pymysql.escape_string(description), ask_num, answer_num, article_num,
                        collect_num, public_edit_num)
        replace_sql = '''REPLACE INTO
            user(url,nickname,self_domain,user_type,
            gender,follower,following,agree_num,appreciate_num,star_num,share_num,browse_num,
            trade,company,school,major,job,location,description,
            ask_num,answer_num,article_num,collect_num,public_edit_num)
            VALUES(%s,%s,%s,%s,
            %s,%s,%s,%s,%s,%s,%s,%s,
            %s,%s,%s,%s,%s,%s,%s,
            %s,%s,%s,%s,%s)'''
        try:
            print("Data extracted:")
            print(replace_data)
            self.db_cursor.execute(replace_sql, replace_data)
            self.db.commit()
        except Exception as err:
            print("Database insert failed")
            print("Data extracted:")
            print(replace_data)
            print("Statement: " + self.db_cursor._last_executed)
            self.db.rollback()
            print(err)
            traceback.print_exc()
    except Exception as err:
        print("Failed to extract the data, skipping this user")
        self.redis_con.hdel("already_get_user", name_url)
        self.del_already_user(name_url)
        print(err)
        traceback.print_exc()


Finally we visit the user's about page and parse out the data with regular expressions and BeautifulSoup.
Note that the SQL uses REPLACE INTO rather than INSERT INTO, which neatly avoids duplicate rows.
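
REPLACE INTO behaves like INSERT, except that when the new row collides with an existing row on the primary key or a unique index, the old row is deleted first, so re-crawling a user simply overwrites their record. A minimal illustration, assuming url is the primary key of the user table:

-- The first crawl inserts the row.
REPLACE INTO user(url, nickname) VALUES ('/people/alice', 'Alice');
-- A later crawl of the same url overwrites the old row instead of failing.
REPLACE INTO user(url, nickname) VALUES ('/people/alice', 'Alice Liddell');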

# Start crawling users; the main loop of the program
def entrance(self):
    while 1:
        if int(self.redis_con.llen("user_queue")) < 1:
            self.get_index_page_user()
        else:
            # Pop a name_url off the queue; Redis returns bytes, so decode to utf-8
            name_url = str(self.redis_con.rpop("user_queue").decode('utf-8'))
            print("Processing name_url: " + name_url)
            self.get_user_info(name_url)
            if int(self.redis_con.llen("user_queue")) <= int(self.max_queue_len):
                self.get_all_follower(name_url)
                self.get_all_following(name_url)
            self.session.cookies.save()

def run(self):
    print(self.name + " is running")
    self.entrance()


Finally, the entry point:

if __name__ == '__main__':
    login = GetUser(999, "login thread")
    threads = []
    for i in range(0, 4):
        m = GetUser(i, "thread" + str(i))
        threads.append(m)
    for i in range(0, 4):
        threads[i].start()
    for i in range(0, 4):
        threads[i].join()


This starts the threads; to run a different number of threads, change the 4 to whatever you need (a tidier variant is sketched below).
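
As a small refactoring sketch (same GetUser class as above, with the thread count pulled into one constant):

THREAD_NUM = 4  # number of crawler threads

if __name__ == '__main__':
    login = GetUser(999, "login thread")  # performs the login once up front
    threads = [GetUser(i, "thread" + str(i)) for i in range(THREAD_NUM)]
    for t in threads:
        t.start()  # start every crawler thread
    for t in threads:
        t.join()   # wait for them all to finish
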
Docker
If all of this feels like too much trouble, you can refer to the basic environment I put together with docker:
mysql and redis are both the official images

docker run --name mysql -itd mysql:latest
docker run --name redis -itd redis:latest


Then run the Python image with docker-compose; my docker-compose.yml for Python is:

python:
  container_name: python
  build: .
  ports:
    - "84:80"
  external_links:
    - memcache:memcache
    - mysql:mysql
    - redis:redis
  volumes:
    - /docker_containers/python/www:/var/www/html
  tty: true
  stdin_open: true
  extra_hosts:
    - "python:192.168.102.140"
  environment:
    PYTHONIOENCODING: utf-8




Answer from another user (2022-05-10 10:37). Note that this answer is Python 2 code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Author: Administrator
# @Date: 2015-10-31 15:45:27
# @Last Modified by: Administrator
# @Last Modified time: 2015-11-23 16:57:31
import requests
import sys
import json
import re
reload(sys)
sys.setdefaultencoding('utf-8')

# Return the substring of `test` matched by `pattern`
def find(pattern, test):
    finder = re.search(pattern, test)
    start = finder.start()
    end = finder.end()
    return test[start:end - 1]

# The cookie values below are redacted placeholders from the original post;
# substitute the cookies from your own logged-in browser session.
cookies = {
    '_ga': 'GA1.2.10sdfsdfsdf',
    '_za': '8d570b05-b0b1-4c96-a441-faddff34',
    'q_c1': '23ddd234234',
    '_xsrf': '234id',
    # the name of the next cookie was garbled in the original post; 'cap_id' is a guess
    'cap_id': '"ZTE3NWY2ZTsdfsdfsdfWM2YzYxZmE=|1446435757|15fef3b84e044c122ee0fe8959e606827d333134"',
    'z_c0': '"QUFBQXhWNGZsdfsdRvWGxaeVRDMDRRVDJmSzJFN1JLVUJUT1VYaEtZYS13PT0=|14464e234767|57db366f67cc107a05f1dc8237af24b865573cbe5"',
    '__utmt': '1',
    '__utma': '51854390.109883802f8.1417518721.1447917637.144c7922009.4',
    '__utmb': '518542340.4.10.1447922009',
    '__utmc': '51123390',
    '__utmz': '5185435454sdf06.1.1.utmcsr=zhihu.com|utmcgcn=(referral)|utmcmd=referral|utmcct=/',
    '__utmv': '51854340.1d200-1|2=registration_date=2028=1^3=entry_date=201330318=1',
}

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36',
    'referer': 'http://www.zhihu.com/question/following',
    'host': 'www.zhihu.com',
    'Origin': 'http://www.zhihu.com',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Connection': 'keep-alive',
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Length': '81',
    'Accept-Encoding': 'gzip,deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
}

# After several requests it becomes clear that each load brings in 20 questions,
# and the parameter being passed is offset, increasing in steps of 20

dicc = {"offset": 60}
n = 20
b = 0

# As with crawling images, scrolling down also fires an http request that returns
# JSON, but unlike the simulated login on the home page, sending just the form
# fields got my request rejected. At first I thought the headers were being
# filtered, so I added the headers a browser sends, but it was still refused.

# That left the cookies: this loading request differs from the home-page login,
# so after filling in the remaining cookie values, the request succeeded.
for x in xrange(20, 460, 20):
    n = n + 20
    b = b + 20
    dicc['offset'] = x
    formdata = {'method': 'next', 'params': '{"offset":20}', '_xsrf': '20770d88051f0f45e941570645f5e2e6'}

    # The request wants a JSON string, which is not the same as a python dict, so convert
    formdata['params'] = json.dumps(dicc)
    # print json.dumps(dicc)
    # print dicc

    circle = requests.post("http://www.zhihu.com/node/ProfileFollowedQuestionsV2",
                           cookies=cookies, data=formdata, headers=headers)

    # The response content is much the same once you have crawled it once;
    # the JSON returned for the questions looks like:
    # {"r":0,
    # "msg": ["<div class=\"zm-profile-section-item zg-clear\">\n
    # <span class=\"zm-profile-vote-count\">\n<div class=\"zm-profile-vote-num\">205K<\/div>\n
    # <div class=\"zm-profile-vote-type\">\u6d4f\u89c8<\/div>\n
    # <\/span>\n<div class=\"zm-profile-section-main\">\n
    # <h2 class=\"zm-profile-question\">\n
    # <a class=\"question_link\" target=\"_blank\" href=\"\/question\/21719532\">
    # \u4ec0\u4e48\u4fc3\u4f7f\u4f60\u8d70\u4e0a\u72ec\u7acb\u5f00\u53d1\u8005\u4e4b\u8def\uff1f<\/a>\n
    # <\/h2>\n<div class=\"meta zg-gray\">\n<a data-follow=\"q:link\" class=\"follow-link zg-unfollow meta-item\"
    # href=\"javascript:;\" id=\"sfb-868760\">
    # <i class=\"z-icon-follow\"><\/i>\u53d6\u6d88\u5173\u6ce8<\/a>\n<span class=\"zg-bull\">•<\/span>\n63 \u4e2a\u56de\u7b54\n<span class=\"zg-bull\">•<\/span>\n3589 \u4eba\u5173\u6ce8\n<\/div>\n<\/div>\n<\/div>",
    # "<div class=\"zm-profile-section-item zg-clear\">\n
    # <span class=\"zm-profile-vote-count\">\n
    # <div class=\"zm-profile-vote-num\">157K<\/div>\n
    # <div class=\"zm-profile-vote-type\">\u6d4f\u89c8<\/div>\n
    # <\/span>\n<div class=\"zm-profile-section-main\">\n
    # <h2 class=\"zm-profile-question\">\n
    # <a class=\"question_link\" target=\"_blank\" href=\"\/question\/31764065\">
    # \u672c\u79d1\u6e23\u6821\u7684\u5b66\u751f\u5982\u4f55\u8fdb\u5165\u7f8e\u5e1d\u725b\u6821\u8bfbPhD\uff1f<\/a>\n
    # <\/h2>\n<div class=\"meta zg-gray\">\n
    # <a data-follow=\"q:link\" class=\"follow-link zg-unfollow meta-item\" href=\"javascript:;\" id=\"sfb-4904877\">
    # <i class=\"z-icon-follow\"><\/i>\u53d6\u6d88\u5173\u6ce8<\/a>\n<span class=\"zg-bull\">•
    # <\/span>\n112 \u4e2a\u56de\u7b54\n<span class=\"zg-bull\">•<\/span>\n1582 \u4eba\u5173\u6ce8\n
    # <\/div>\n<\/div>\n<\/div>"]}
    # print circle.content

    # The JSON string likewise has to be converted into a dict before use
    jsondict = json.loads(circle.text)
    msgstr = jsondict['msg']
    # print len(msgstr)

    # Write a regular expression for whatever information you want to extract
    pattern = 'question\/.*?/a>'
    try:
        for y in xrange(0, 20):
            wholequestion = find(pattern, msgstr[y])
            pattern2 = '>.*?<'
            finalquestion = find(pattern2, wholequestion).replace('>', '')
            print str(b + y) + " " + finalquestion

    # Once all the questions have been visited, the next request raises an
    # exception here, and we exit the loop
    except Exception, e:
        print "Total of %s questions" % (b + y)
        break