A Few Hands-On Python Web Scraping Exercises for Complete Beginners

Published: 2023-01-04 09:51


Author: 萝卜大杂烩    Source: 萝卜大杂烩

  Beginners often finish learning Python basics and then have no idea how to push their coding skills further. Practicing web scraping on a few crawler-friendly websites is a good way forward: advanced scraping touches many topics, so using it as an entry point both consolidates your Python knowledge and can later lead you into areas such as distributed systems and multithreading. Why not give it a try?
  Below are several very simple, entry-level scraping projects that shouldn't scare anyone off.
  Douban
  Douban is a hugely popular Chinese site that is also very crawler-friendly, with almost no anti-scraping measures in place, which makes it an ideal site to practice on.
  Scraping comments
  We'll use the following page as an example:
  https://movie.douban.com/subject/3878007/
  The comments are paginated. By inspecting the page you can see that the comment URLs look like this:
  https://movie.douban.com/subject/3878007/comments?start=0&limit=20&sort=new_score&status=P&percent_type=l
  Each time you turn a page, start grows by 20, so we can write the following code:
  import requests
  from bs4 import BeautifulSoup
  import pandas as pd

  def get_praise():
      praise_list = []
      for i in range(0, 2000, 20):
          url = 'https://movie.douban.com/subject/3878007/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=h' % str(i)
          req = requests.get(url).text
          content = BeautifulSoup(req, "html.parser")
          check_point = content.title.string
          if check_point != r"没有访问权限":
              comment = content.find_all("span", attrs={"class": "short"})
              for k in comment:
                  praise_list.append(k.string)
          else:
              break
      return praise_list
  We use the range function with a step of 20, and treat a page whose title equals "没有访问权限" (no access permission) as the signal to stop paging.
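  In practice Douban may start returning the "没有访问权限" page quite early if you request pages too aggressively. A common mitigation, not part of the original code, is to send a browser-like User-Agent and pause briefly between pages. A minimal sketch (the header value is just an assumption; any recent browser string works):
  import time
  import requests

  HEADERS = {'User-Agent': 'Mozilla/5.0'}  # assumed browser-like header, not from the original article

  def fetch_page(url):
      # Identify as a browser and wait a bit between requests so repeated
      # paging is less likely to be rejected by the server.
      resp = requests.get(url, headers=HEADERS, timeout=10)
      time.sleep(1)
      return resp.text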
  Next, let's look at the comment tiers.
  Douban splits comments into three tiers; we fetch each tier separately so they can be analyzed later.
  def get_ordinary():
      ordinary_list = []
      for i in range(0, 2000, 20):
          url = 'https://movie.douban.com/subject/3878007/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=m' % str(i)
          req = requests.get(url).text
          content = BeautifulSoup(req, "html.parser")
          check_point = content.title.string
          if check_point != r"没有访问权限":
              comment = content.find_all("span", attrs={"class": "short"})
              for k in comment:
                  ordinary_list.append(k.string)
          else:
              break
      return ordinary_list
  def get_lowest():
      lowest_list = []
      for i in range(0, 2000, 20):
          url = 'https://movie.douban.com/subject/3878007/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=l' % str(i)
          req = requests.get(url).text
          content = BeautifulSoup(req, "html.parser")
          check_point = content.title.string
          if check_point != r"没有访问权限":
              comment = content.find_all("span", attrs={"class": "short"})
              for k in comment:
                  lowest_list.append(k.string)
          else:
              break
      return lowest_list
  As you can see, the three functions differ only in the request URL, where percent_type corresponds to Douban's positive (h), ordinary (m), and negative (l) comments.
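  Because the only difference is the percent_type value (h, m, l), you could also fold the three functions into one parameterized helper. A sketch of that refactor (same logic as above; the __main__ block below still calls the three original functions):
  def get_comments(percent_type):
      # percent_type: 'h' = praise, 'm' = ordinary, 'l' = lowest
      comment_list = []
      for i in range(0, 2000, 20):
          url = ('https://movie.douban.com/subject/3878007/comments'
                 '?start=%s&limit=20&sort=new_score&status=P&percent_type=%s') % (i, percent_type)
          req = requests.get(url).text
          content = BeautifulSoup(req, "html.parser")
          if content.title.string == r"没有访问权限":
              break
          for k in content.find_all("span", attrs={"class": "short"}):
              comment_list.append(k.string)
      return comment_list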
  Finally, save the collected data to files.
  if __name__ == "__main__":
      print("Get Praise Comment")
      praise_data = get_praise()
      print("Get Ordinary Comment")
      ordinary_data = get_ordinary()
      print("Get Lowest Comment")
      lowest_data = get_lowest()
      print("Save Praise Comment")
      praise_pd = pd.DataFrame(columns=['praise_comment'], data=praise_data)
      praise_pd.to_csv('praise.csv', encoding='utf-8')
      print("Save Ordinary Comment")
      ordinary_pd = pd.DataFrame(columns=['ordinary_comment'], data=ordinary_data)
      ordinary_pd.to_csv('ordinary.csv', encoding='utf-8')
      print("Save Lowest Comment")
      lowest_pd = pd.DataFrame(columns=['lowest_comment'], data=lowest_data)
      lowest_pd.to_csv('lowest.csv', encoding='utf-8')
      print("THE END!!!")
  Building word clouds
  Here we use jieba for word segmentation and the wordcloud library to draw the word clouds, again split into the three tiers. A few interfering words such as "一部", "一个", "故事" and some other nouns are filtered out via a stop-word list. None of this is hard; the code is below, followed by a note on the stopwords.txt file it expects.
  import jieba
  import pandas as pd
  from wordcloud import WordCloud
  import numpy as np
  from PIL import Image
  font = r'C:\Windows\Fonts\FZSTK.TTF'
  STOPWORDS = set(map(str.strip, open('stopwords.txt').readlines()))
  def wordcloud_praise():
      df = pd.read_csv('praise.csv', usecols=[1])
      df_list = df.values.tolist()
      comment_after = jieba.cut(str(df_list), cut_all=False)
      words = ' '.join(comment_after)
      img = Image.open('haiwang8.jpg')
      img_array = np.array(img)
      wc = WordCloud(width=2000, height=1800, background_color='white', font_path=font, mask=img_array, stopwords=STOPWORDS)
      wc.generate(words)
      wc.to_file('praise.png')
  def wordcloud_ordinary():
      df = pd.read_csv('ordinary.csv', usecols=[1])
      df_list = df.values.tolist()
      comment_after = jieba.cut(str(df_list), cut_all=False)
      words = ' '.join(comment_after)
      img = Image.open('haiwang8.jpg')
      img_array = np.array(img)
      wc = WordCloud(width=2000, height=1800, background_color='white', font_path=font, mask=img_array, stopwords=STOPWORDS)
      wc.generate(words)
      wc.to_file('ordinary.png')
  def wordcloud_lowest():
      df = pd.read_csv('lowest.csv', usecols=[1])
      df_list = df.values.tolist()
      comment_after = jieba.cut(str(df_list), cut_all=False)
      words = ' '.join(comment_after)
      img = Image.open('haiwang7.jpg')
      img_array = np.array(img)
      wc = WordCloud(width=2000, height=1800, background_color='white', font_path=font, mask=img_array, stopwords=STOPWORDS)
      wc.generate(words)
      wc.to_file('lowest.png')
  if __name__ == "__main__":
      print("Save praise wordcloud")
      wordcloud_praise()
      print("Save ordinary wordcloud")
      wordcloud_ordinary()
      print("Save lowest wordcloud")
      wordcloud_lowest()
      print("THE END!!!")
  Scraping posters
  Scraping posters works in much the same way, so here is the code directly (a couple of caveats follow it):
  import requests
  import json
  def deal_pic(url, name):
      pic = requests.get(url)
      with open(name + '.jpg', 'wb') as f:
          f.write(pic.content)
  def get_poster():
      for i in range(0, 10000, 20):
          url = 'https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=电影&start=%s&genres=爱情' % i
          req = requests.get(url).text
          req_dict = json.loads(req)
          for j in req_dict['data']:
              name = j['title']
              poster_url = j['cover']
              print(name, poster_url)
              deal_pic(poster_url, name)
  if __name__ == "__main__":
      get_poster()
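  Two small caveats that the code above does not handle: a movie title can contain characters that are illegal in file names (such as '/'), and once the search API runs out of results its 'data' list comes back empty, so the loop could stop early instead of blindly running to 10000. A hedged sketch of both guards, reusing requests, json and deal_pic from the code above:
  import re

  def safe_name(title):
      # Replace characters that are not allowed in file names on Windows/Unix.
      return re.sub(r'[\\/:*?"<>|]', '_', title)

  def get_poster_safe():
      # Same paging as get_poster(), but stops when the API returns no more
      # results and sanitizes the title before using it as a file name.
      for i in range(0, 10000, 20):
          url = 'https://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=电影&start=%s&genres=爱情' % i
          req_dict = json.loads(requests.get(url).text)
          if not req_dict['data']:
              break
          for j in req_dict['data']:
              deal_pic(j['cover'], safe_name(j['title']))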
  Rotten Tomatoes
  This is an overseas movie review site that is also well suited to beginner practice. The URL is:
  https://www.rottentomatoes.com/tv/game_of_thrones
  We'll use Game of Thrones as the scraping example.
  import requests
  from bs4 import BeautifulSoup
  # The following imports are not used in the snippet below (presumably intended for later visualization).
  from pyecharts.charts import Line
  import pyecharts.options as opts
  from wordcloud import WordCloud
  import jieba
  baseurl = 'https://www.rottentomatoes.com'
  def get_total_season_content():
      url = 'https://www.rottentomatoes.com/tv/game_of_thrones'
      response = requests.get(url).text
      content = BeautifulSoup(response, "html.parser")
      season_list = []
      div_list = content.find_all('div', attrs={'class': 'bottom_divider media seasonItem '})
      for i in div_list:
          suburl = i.find('a')['href']
          season = i.find('a').text
          rotten = i.find('span', attrs={'class': 'meter-value'}).text
          consensus = i.find('div', attrs={'class': 'consensus'}).text.strip()
          season_list.append([season, suburl, rotten, consensus])
      return season_list
  def get_season_content(url):
      # url = 'https://www.rottentomatoes.com/tv/game_of_thrones/s08#audience_reviews'
      response = requests.get(url).text
      content = BeautifulSoup(response, "html.parser")
      episode_list = []
      div_list = content.find_all('div', attrs={'class': 'bottom_divider'})
      for i in div_list:
          suburl = i.find('a')['href']
          fresh = i.find('span', attrs={'class': 'tMeterScore'}).text.strip()
          episode_list.append([suburl, fresh])
      return episode_list[:5]
  # Hard-coded sample data: (episode suburl, freshness score) pairs for season 8.
  mylist = [['/tv/game_of_thrones/s08/e01', '92%'],
            ['/tv/game_of_thrones/s08/e02', '88%'],
            ['/tv/game_of_thrones/s08/e03', '74%'],
            ['/tv/game_of_thrones/s08/e04', '58%'],
            ['/tv/game_of_thrones/s08/e05', '48%'],
            ['/tv/game_of_thrones/s08/e06', '49%']]
  def get_episode_detail(episode):
      # episode = mylist
      e_list = []
      for i in episode:
          url = baseurl + i[0]
          # print(url)
          response = requests.get(url).text
          content = BeautifulSoup(response, "html.parser")
          # Collapse newlines and runs of whitespace instead of deleting every space.
          critic_consensus = ' '.join(content.find('p', attrs={'class': 'critic_consensus superPageFontColor'}).text.split())
          review_list_left = content.find_all('div', attrs={'class': 'quote_bubble top_critic pull-left cl '})
          review_list_right = content.find_all('div', attrs={'class': 'quote_bubble top_critic pull-right  '})
          review_list = []
          for i_left in review_list_left:
              left_review = i_left.find('div', attrs={'class': 'media-body'}).find('p').text.strip()
              review_list.append(left_review)
          for i_right in review_list_right:
              right_review = i_right.find('div', attrs={'class': 'media-body'}).find('p').text.strip()
              review_list.append(right_review)
          e_list.append([critic_consensus, review_list])
      return e_list
  if __name__ == '__main__':
      total_season_content = get_total_season_content()
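  The __main__ block above only fetches the season overview; the other functions are left for you to wire together. A sketch of how the pieces could be chained, assuming the season links returned by get_total_season_content are site-relative like the episode links (and that the class names above still match the live site):
  if __name__ == '__main__':
      # Drill down: seasons -> episodes -> per-episode critic reviews.
      for season, suburl, rotten, consensus in get_total_season_content():
          episodes = get_season_content(baseurl + suburl)
          details = get_episode_detail(episodes)
          print(season, rotten, len(details), 'episodes scraped')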
  Honor of Kings hero database
  Here I've chosen the following site:
  http://db.18183.com/
  import requests
  from bs4 import BeautifulSoup
  def get_hero_url():
      print('start to get hero urls')
      url = 'http://db.18183.com/'
      url_list = []
      res = requests.get(url + 'wzry').text
      content = BeautifulSoup(res, "html.parser")
      ul = content.find('ul', attrs={'class': "mod-iconlist"})
      hero_url = ul.find_all('a')
      for i in hero_url:
          url_list.append(i['href'])
      print('finish get hero urls')
      return url_list
  def get_details(url):
      print('start to get details')
      base_url = 'http://db.18183.com/'
      detail_list = []
      for i in url:
          # print(i)
          res = requests.get(base_url + i).text
          content = BeautifulSoup(res, "html.parser")
          name_box = content.find('div', attrs={'class': 'name-box'})
          name = name_box.h1.text
          hero_attr = content.find('div', attrs={'class': 'attr-list'})
          attr_star = hero_attr.find_all('span')
          survivability = attr_star[0]['class'][1].split('-')[1]
          attack_damage = attr_star[1]['class'][1].split('-')[1]
          skill_effect = attr_star[2]['class'][1].split('-')[1]
          getting_started = attr_star[3]['class'][1].split('-')[1]
          details = content.find('div', attrs={'class': 'otherinfo-datapanel'})
          # print(details)
          attrs = details.find_all('p')
          attr_list = []
          for attr in attrs:
              attr_list.append(attr.text.split(':')[1].strip())
          detail_list.append([name, survivability, attack_damage,
                              skill_effect, getting_started, attr_list])
      print('finish get details')
      return detail_list
  def save_tocsv(details):
      print('start save to csv')
      with open('all_hero_init_attr_new.csv', 'w', encoding='gb18030') as f:
          f.write('英雄名字,生存能力,攻击伤害,技能效果,上手难度,最大生命,最大法力,物理攻击,'
                  '法术攻击,物理防御,物理减伤率,法术防御,法术减伤率,移速,物理护甲穿透,法术护甲穿透,攻速加成,暴击几率,'
                  '暴击效果,物理吸血,法术吸血,冷却缩减,攻击范围,韧性,生命回复,法力回复\n')
          for i in details:
              try:
                  rowcsv = '{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}'.format(
                      i[0], i[1], i[2], i[3], i[4], i[5][0], i[5][1], i[5][2], i[5][3], i[5][4], i[5][5],
                      i[5][6], i[5][7], i[5][8], i[5][9], i[5][10], i[5][11], i[5][12], i[5][13], i[5][14], i[5][15],
                      i[5][16], i[5][17], i[5][18], i[5][19], i[5][20]
                  )
                  f.write(rowcsv)
                  f.write('\n')
              except Exception:
                  # Skip heroes whose attribute data is missing or shorter than expected.
                  continue
      print('finish save to csv')
  if __name__ == "__main__":
      hero_url = get_hero_url()
      details = get_details(hero_url)
      save_tocsv(details)
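  Once the CSV has been written you can load it straight back into pandas for analysis; the encoding just has to match what save_tocsv used. A small usage sketch:
  import pandas as pd

  # Read the exported hero table back; gb18030 matches the encoding used when writing it.
  hero_df = pd.read_csv('all_hero_init_attr_new.csv', encoding='gb18030')
  print(hero_df.head())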
  That's it for today's three sites; more good practice sites and hands-on code will follow in later posts!