A Beginner Python Web-Scraping Exercise: Fetching Answer Counts for Questions on *hu

Published: 2020-8-06 11:31


 Author: 蛤鲤鹿鸭    Source: 今日头条 (Jinri Toutiao)

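  Preface
  Python is a handy tool for collecting data, so this exercise uses it to scrape the answer counts of some questions on *hu, purely as practice.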
  1. Import the modules
  import re
  from bs4 import BeautifulSoup
  import requests
  import time
  import json
  import pandas as pd
  import numpy as np
  2. Status code
  # Quick check that requests works: a status code of 200 means the request succeeded
  r = requests.get('https://github.com/explore')
  r.status_code
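  As a small illustration of what the status code tells you, the sketch below maps a numeric code to its standard reason phrase and a coarse category, using only the standard library. The helper name `describe_status` is hypothetical and not part of the original post or of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '200 OK (success)'.

    `describe_status` is an illustrative helper, not a requests API.
    """
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not defined in the standard library enum

    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"


print(describe_status(200))
print(describe_status(404))
```

  With requests, the same number is available as `r.status_code`, and `r.raise_for_status()` raises an exception for 4xx and 5xx responses.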
  3. Scraping *hu
  # Browser header and cookies
  headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}
  cookies = {'cookie':'_zap=3d979dbb-f25b-4014-8770-89045dec48f6; d_c0="APDvML4koQ-PTqFU56egNZNd2wd-eileT3E=|1561292196"; tst=r; _ga=GA1.2.910277933.1582789012; q_c1=9a429b07b08a4ae1afe0a99386626304|1584073146000|1561373910000; _xsrf=bf1c5edf-75bd-4512-8319-02c650b7ad2c; _gid=GA1.2.1983259099.1586575835; l_n_c=1; l_cap_id="NDIxM2M4OWY4N2YwNDRjM2E3ODAxMDdmYmY2NGFiMTQ=|1586663749|ceda775ba80ff485b63943e0baf9968684237435"; r_cap_id="OWY3OGQ1MDJhMjFjNDBiYzk0MDMxMmVlZDIwNzU0NzU=|1586663749|0948d23c731a8fa985614d3ed58edb6405303e99"; cap_id="M2I5NmJkMzRjMjc3NGZjNDhiNzBmNDMyNDQ3NDlmNmE=|1586663749|dacf440ab7ad64214a939974e539f9b86ddb9eac"; n_c=1; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1586585625,1586587735,1586667228,1586667292; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1586667292; SESSIONID=GWBltmMTwz5oFeBTjRm4Akv8pFF6p8Y6qWkgUP4tjp6; JOID=UVkSBEJI6EKgHAipMkwAEWAkvEomDbkAwmJn4mY1kHHPVGfpYMxO3voUDK88UO62JqgwW5Up4hC2kX_KGO9xoKI=; osd=UlEXAU5L4EelEAuhN0kMEmghuUYlBbwFzmFv52M5k3nKUWvqaMlL0vkcCaowU-azI6QzU5As7hO-lHrGG-d0pa4=; capsion_ticket="2|1:0|10:1586667673|14:capsion_ticket|44:YTJkYmIyN2Q4YWI4NDI0Mzk0NjQ1YmIwYmUxZGYyNzY=|b49eb8176314b73e0ade9f19dae4b463fb970c8cbd1e6a07a6a0e535c0ab8ac3"; z_c0="2|1:0|10:1586667694|4:z_c0|92:Mi4xOGc1X0dnQUFBQUFBOE84d3ZpU2hEeVlBQUFCZ0FsVk5ydTVfWHdDazlHMVM1eFU5QjlqamJxWVhvZ2xuWlhTaVJ3|bcd3601ae34951fe72fd3ffa359bcb4acd60462715edcd1e6c4e99776f9543b3"; unlock_ticket="AMCRYboJGhEmAAAAYAJVTbankl4i-Y7Pzkta0e4momKdPG3NRc6GUQ=="; KLBRSID=fb3eda1aa35a9ed9f88f346a7a3ebe83|1586667697|1586660346'}
  start_url = 'https://www.zhihu.com/api/v3/feed/topstory/recommend?session_token=c03069ed8f250472b687fd1ee704dd5b&desktop=true&page_number=5&limit=6&action=pull&ad_interval=-1&before_id=23'
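  The `cookies` argument of requests expects a mapping from cookie name to value, whereas a string copied from the browser is a single `Cookie` header holding many pairs. A minimal sketch of splitting such a header into a dict follows; `parse_cookie_header` is a hypothetical helper and the sample string is made up.

```python
def parse_cookie_header(raw):
    """Split a raw 'name=value; name2=value2' Cookie header into a dict.

    Only the first '=' separates name from value, so values that
    themselves contain '=' (for example base64 padding) stay intact.
    """
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        name, sep, value = pair.partition("=")
        if sep:  # skip fragments that have no '='
            cookies[name.strip()] = value.strip()
    return cookies


# Made-up sample, not a real session
sample = 'tst=r; token="abc=|123"; n_c=1'
print(parse_cookie_header(sample))
```

  The resulting dict could then be passed as `cookies=parse_cookie_header(raw)`. Real session cookies act as login credentials, so they are best kept out of published code.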
  4. Parsing with BeautifulSoup
  s = requests.Session()
  start_url = 'https://www.zhihu.com/'
  html = s.get(url=start_url, headers=headers, cookies=cookies, timeout=5)
  soup = BeautifulSoup(html.content, 'html.parser')
  question = []          # question titles
  question_address = []  # question URLs
  temp1 = soup.find_all('div', class_='Card TopstoryItem TopstoryItem-isRecommend')
  for item in temp1:
      temp2 = item.find_all('div', itemprop='zhihu:question')
      # print(temp2)
      if temp2 != []:  # some cards are column articles etc. with no question block; skip them for now
          question_address.append(temp2[0].find('meta', itemprop='url').get('content'))
          question.append(temp2[0].find('meta', itemprop='name').get('content'))
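  To see what these lookups extract without touching the network, here is a sketch that runs the same kind of search on a small hand-written HTML fragment. The fragment only imitates the structure the selectors assume; the live page markup may differ or may have changed since the post was written, and the example.com URL is a placeholder.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the assumed card structure (not real site output)
html = """
<div class="Card TopstoryItem TopstoryItem-isRecommend">
  <div itemprop="zhihu:question">
    <meta itemprop="url" content="https://example.com/question/1">
    <meta itemprop="name" content="Sample question">
  </div>
</div>
<div class="Card TopstoryItem TopstoryItem-isRecommend">
  <div class="ArticleItem">a column article without a question block</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
question, question_address = [], []
for item in soup.find_all("div", class_="Card TopstoryItem TopstoryItem-isRecommend"):
    blocks = item.find_all("div", itemprop="zhihu:question")
    if blocks:  # cards without a question block are skipped
        question_address.append(blocks[0].find("meta", itemprop="url").get("content"))
        question.append(blocks[0].find("meta", itemprop="name").get("content"))

print(question)
print(question_address)
```

  Only the first card contributes an entry, because the second one has no `zhihu:question` block.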
  5. Store the information
  question_focus_number = []   # follower counts
  question_answer_number = []  # answer counts
  for url in question_address:
      test = s.get(url=url, headers=headers, cookies=cookies, timeout=5)
      soup = BeautifulSoup(test.content, 'html.parser')
      info = soup.find_all('div', class_='QuestionPage')[0]
      # print(info)
      focus_number = info.find('meta', itemprop='zhihu:followerCount').get('content')  # followers
      answer_number = info.find('meta', itemprop='answerCount').get('content')         # answers
      question_focus_number.append(focus_number)
      question_answer_number.append(answer_number)
  6. Organize and output the results
  question_info = pd.DataFrame(list(zip(question, question_focus_number, question_answer_number)), columns=['Question', 'Followers', 'Answers'])
  for item in ['Followers', 'Answers']:
      question_info[item] = np.array(question_info[item], dtype='int')
  question_info.sort_values(by='Followers', ascending=False)
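  Output: [table of question titles with follower and answer counts, sorted by followers; image not included]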
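  The scraped counts arrive as strings, so they need converting before a numeric sort makes sense. The sketch below shows one way to do that on made-up sample data, using `astype(int)` as an alternative to going through NumPy; the column names and values are illustrative assumptions, not scraped results.

```python
import pandas as pd

# Made-up sample values standing in for scraped results (counts arrive as strings)
question = ["Question A", "Question B", "Question C"]
followers = ["120", "4500", "37"]
answers = ["15", "320", "4"]

question_info = pd.DataFrame(
    {"Question": question, "Followers": followers, "Answers": answers}
)

# Convert the string counts to integers so sorting is numeric, not lexical
for column in ["Followers", "Answers"]:
    question_info[column] = question_info[column].astype(int)

# sort_values returns a new DataFrame, so keep the result
ranked = question_info.sort_values(by="Followers", ascending=False).reset_index(drop=True)
print(ranked)
```

  Outside a notebook, the sorted frame has to be assigned or printed like this, since `sort_values` does not modify the DataFrame in place by default.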
  
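  7. Summary
  Simple scraping is not hard, but care is needed once account credentials are involved. Try not to put load on the target server (for example, lengthen the sleep interval between requests); do not use scraped data for commercial purposes; and however capable your tooling, do not casually touch users' private data. Use scraping reasonably, lawfully, and with restraint, or you may bring unnecessary trouble on yourself.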
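  One way to act on the advice about sleep intervals is to pause between consecutive requests. The sketch below is an assumption about how that might look: `polite_fetch` and its `fetch` and `sleep` parameters are hypothetical names, and the delay value is arbitrary.

```python
import time


def polite_fetch(urls, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls.

    `fetch` is any callable that takes a URL and returns a result, for
    example a small wrapper around a requests session. `sleep` is
    injectable so the pacing can be checked without actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Demonstration with a stand-in fetch function; no network access
pauses = []
pages = polite_fetch(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: f"fetched {url}",
    delay=2.0,
    sleep=pauses.append,
)
print(pages)
print(pauses)
```

  In the per-question loop, `fetch` could be something like `lambda u: s.get(u, headers=headers, cookies=cookies, timeout=5)`, leaving `sleep` at its default so the pauses are real.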
