Java Crawler: Traversing a Document Object with Jsoup's DOM Methods

Published: 2020-8-20 10:05


 Author: Cyril_KI    Source: CSDN

  First, the address of the page we will crawl:
  https://wall.alphacoders.com/featured.php?lang=Chinese
  Main steps:
  1. Use Jsoup's connect method to obtain the Document object
  String html = "https://wall.alphacoders.com/featured.php?lang=Chinese";
  Document doc = Jsoup.connect(html).get();
  Printing doc outputs the HTML of the whole page, which is too long to show here.
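  Some sites reject requests that use the default client settings, so it can help to set a user agent and a timeout before calling get(). The following is a minimal sketch under that assumption. The class name FetchConfig, the user agent string, and the 10 second timeout are illustrative choices, not part of the original article.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchConfig {
    // Builds a connection with an explicit user agent and timeout.
    // No request is sent until get() is called on the result.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; JsoupDemo/1.0)") // hypothetical value
                .timeout(10_000); // milliseconds, an assumed value
    }

    public static void main(String[] args) throws IOException {
        Document doc = prepare("https://wall.alphacoders.com/featured.php?lang=Chinese").get();
        System.out.println(doc.title());
    }
}
```

  Keeping the configuration in one method means every request in a crawler can share the same settings.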
  We will use this part of the page as an example:
  <ul class="nav nav-pills"> 
      <li><a href="https://alphacoders.com/site/about-us" rel="nofollow">About Us</a></li> 
      <li><a href="https://alphacoders.com/site/faq" rel="nofollow">FAQ</a></li> 
      <li><a href="https://alphacoders.com/site/privacy" rel="nofollow">Privacy Policy</a></li> 
      <li><a href="https://alphacoders.com/site/tos" rel="nofollow">Terms Of Service</a></li> 
      <li><a href="https://alphacoders.com/site/acceptable_use" rel="nofollow">Acceptable Use</a></li> 
      <li><a href="https://alphacoders.com/site/etiquette" rel="nofollow">Etiquette</a></li> 
      <li><a href="https://alphacoders.com/site/advertising" rel="nofollow">Advertise With Us</a></li> 
      <li><a id="change_consent">Change Consent</a></li> 
  </ul> 
  2. First, find all the ul elements:
  Elements elements = doc.getElementsByTag("ul");
  Printing elements gives the following output:
  <ul class="nav navbar-nav center"> 
   <li> <a title="Submit Wallpapers" href="https://alphacoders.com/site/submit-wallpaper"><i class="el el-circle-arrow-up"></i>&nbsp;提交</a> </li> 
   <li> <a href="https://alphacoders.com/contest"><i class="el el-gift"></i>&nbsp;精美奖品</a> </li> 
  </ul>
  <ul class="nav navbar-nav navbar-right center"> 
   <li> <a href="language.php?lang=Chinese"> <img src="https://static.alphacoders.com/wa/Chinese-flag.png" alt="Chinese-flag"> &nbsp;&nbsp;中文 &nbsp;&nbsp; </a> </li> 
   <li> <a rel="nofollow" href="https://alphacoders.com/users/login"><i class="el el-user"></i>&nbsp;登录</a> </li> 
   <li> <a href="https://alphacoders.com/users/register"><i class="el el-edit"></i>&nbsp;注册</a> </li> 
  </ul>
  <ul class="pagination"> 
   <li class="active"><a id="prev_page" href="#">&lt;&nbsp;上一页</a></li> 
   <li class="active"><a>1</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=2">2</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=3">3</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=4">4</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=5">5</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=6">6</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=7">7</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=8">8</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=9">9</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=10">10</a></li> 
   <li><a>...</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=319">319</a></li> 
   <li><a id="next_page" href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=2">下一页&nbsp;&gt;</a></li> 
  </ul>
  <ul class="pagination"> 
   <li class="active"><a href="#">&lt;&nbsp;上一页</a></li> 
   <li class="active"><a>1</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=2">2</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=3">3</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=4">4</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=5">5</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=6">6</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=7">7</a></li> 
   <li><a>...</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=319">319</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=2">下一页&nbsp;&gt;</a></li> 
  </ul>
  <ul class="pagination"> 
   <li class="active"><a href="#">&lt;&lt;&nbsp;</a></li> 
   <li class="active"><a href="#">&lt;&nbsp;上一页</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=2">下一页&nbsp;&gt;</a></li> 
   <li><a title="末页 (319)" href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=319">&nbsp;&gt;&gt;</a></li> 
  </ul>
  <ul class="pagination"> 
   <li class="active"><a href="#">1</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=2">2</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=3">3</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=4">4</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=5">5</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=6">6</a></li> 
   <li><a href="https://wall.alphacoders.com/featured.php?lang=Chinese&amp;page=7">7</a></li> 
  </ul>
  <ul class="nav nav-pills"> 
   <li><a href="https://alphacoders.com/site/about-us" rel="nofollow">About Us</a></li> 
   <li><a href="https://alphacoders.com/site/faq" rel="nofollow">FAQ</a></li> 
   <li><a href="https://alphacoders.com/site/privacy" rel="nofollow">Privacy Policy</a></li> 
   <li><a href="https://alphacoders.com/site/tos" rel="nofollow">Terms Of Service</a></li> 
   <li><a href="https://alphacoders.com/site/acceptable_use" rel="nofollow">Acceptable Use</a></li> 
   <li><a href="https://alphacoders.com/site/etiquette" rel="nofollow">Etiquette</a></li> 
   <li><a href="https://alphacoders.com/site/advertising" rel="nofollow">Advertise With Us</a></li> 
   <li><a id="change_consent">Change Consent</a></li> 
  </ul>
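  When the full output is this long, a short summary of each ul can be easier to scan. The sketch below prints only the class attribute and the number of direct children of every ul. It parses a small inline HTML string (an assumption made so the example runs without network access), and the class name UlSummary is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class UlSummary {
    // Returns one line per <ul>: its class attribute and its number of direct child elements.
    static List<String> summarize(String html) {
        List<String> lines = new ArrayList<>();
        for (Element ul : Jsoup.parse(html).getElementsByTag("ul")) {
            lines.add("class=\"" + ul.className() + "\" items=" + ul.children().size());
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<ul class=\"pagination\"><li><a>1</a></li><li><a>2</a></li></ul>"
                + "<ul class=\"nav nav-pills\"><li><a href=\"/faq\">FAQ</a></li></ul>";
        summarize(html).forEach(System.out::println);
    }
}
```

  Run against the real page, a listing like this makes it easy to confirm how many ul elements share a given class.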
  3. Notice that only one ul has the class "nav nav-pills". We locate it:
  Elements elements = doc.getElementsByTag("ul");
  //System.out.println(elements);
  Element tempElement = null;
  for (Element element : elements) {
      // className() returns the whole class attribute, so this is an exact string match
      if (element.className().equals("nav nav-pills")) {
          tempElement = element;
          //System.out.println(element.className());
          break;
      }
  }
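  Comparing className() against "nav nav-pills" only matches when the class attribute is exactly that string, so it would miss the same classes written in a different order. As an alternative sketch, Jsoup's CSS selector support can match on the individual classes. This assumes a Jsoup version that provides selectFirst, and the class name FindNavPills is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class FindNavPills {
    // Returns the first <ul> that carries both classes "nav" and "nav-pills",
    // in any order, or null when the document has no such element.
    static Element find(String html) {
        return Jsoup.parse(html).selectFirst("ul.nav.nav-pills");
    }

    public static void main(String[] args) {
        Element ul = find("<ul class=\"nav-pills nav\"><li><a href=\"/faq\">FAQ</a></li></ul>");
        System.out.println(ul != null ? ul.className() : "not found");
    }
}
```

  Note that the selector also matches a ul with additional classes, so it is slightly more permissive than the exact string comparison.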
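  4. Loop over this ul and print the href and rel attributes of every a inside every li:
  Elements li = tempElement.getElementsByTag("li");
  for (Element element : li) {
      Elements element2 = element.getElementsByTag("a");
      for (Element element3 : element2) {
          // attr() returns an empty string when the attribute is absent
          String hrefString = element3.attr("href");
          String relString = element3.attr("rel");
          // Use isEmpty() rather than != "", which compares references, not contents
          if (!hrefString.isEmpty() && !relString.isEmpty()) {
              System.out.println("href=" + hrefString + " " + "rel=" + relString);
          }
      }
  }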
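  A note on testing whether an attribute value is empty: in Java, the == and != operators compare object references for strings, not their contents, so a comparison against "" can give a misleading answer. The small sketch below, which uses only the standard library, shows the difference. The class and method names are hypothetical.

```java
class EmptyCheck {
    // True when the value is non-null and contains at least one character.
    static boolean hasValue(String value) {
        return value != null && !value.isEmpty();
    }

    public static void main(String[] args) {
        String built = new String("");       // a distinct object that is still an empty string
        System.out.println(built != "");      // true: the references differ although both are empty
        System.out.println(built.isEmpty());  // true: the content check gives the intended answer
        System.out.println(hasValue(built));  // false
    }
}
```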
  Final result:
  href=https://alphacoders.com/site/about-us rel=nofollow
  href=https://alphacoders.com/site/faq rel=nofollow
  href=https://alphacoders.com/site/privacy rel=nofollow
  href=https://alphacoders.com/site/tos rel=nofollow
  href=https://alphacoders.com/site/acceptable_use rel=nofollow
  href=https://alphacoders.com/site/etiquette rel=nofollow
  href=https://alphacoders.com/site/advertising rel=nofollow
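  The same result can probably be reached more compactly with a single CSS selector instead of nested loops. The following is a sketch of that alternative, not the approach used in this article. It parses an inline HTML string so it runs offline, and the class name NavLinks is hypothetical. One difference: a[href][rel] matches any a that has both attributes, even if a value is empty.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class NavLinks {
    // Collects "href=... rel=..." for every <a> inside ul.nav.nav-pills
    // that has both an href and a rel attribute.
    static List<String> extract(String html) {
        List<String> lines = new ArrayList<>();
        for (Element a : Jsoup.parse(html).select("ul.nav.nav-pills li a[href][rel]")) {
            lines.add("href=" + a.attr("href") + " rel=" + a.attr("rel"));
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<ul class=\"nav nav-pills\">"
                + "<li><a href=\"https://alphacoders.com/site/faq\" rel=\"nofollow\">FAQ</a></li>"
                + "<li><a id=\"change_consent\">Change Consent</a></li>"
                + "</ul>";
        extract(html).forEach(System.out::println);
    }
}
```

  The a element without href and rel (Change Consent) is skipped by the selector, which mirrors the filtering done in step 4.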
  Complete code:
  import java.io.IOException;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Element;
  import org.jsoup.select.Elements;
  /**
   * @ClassName: Jsoup_Test
   * @description: Finds the ul with class "nav nav-pills" and prints the href and rel of its links.
   * @author: KI
   * @Date: 2020-08-17 20:15:14
   */
  public class Jsoup_Test {
      public static void main(String[] args) throws IOException {
          // Fetch and parse the page (the variable html holds the page URL)
          String html = "https://wall.alphacoders.com/featured.php?lang=Chinese";
          Document doc = Jsoup.connect(html).get();

          System.out.println(doc);
          // Collect every ul, then keep the one whose class attribute is exactly "nav nav-pills"
          Elements elements = doc.getElementsByTag("ul");
          //System.out.println(elements);
          Element tempElement = null;
          for (Element element : elements) {
              if (element.className().equals("nav nav-pills")) {
                  tempElement = element;
                  //System.out.println(element.className());
                  break;
              }
          }
          System.out.println(tempElement);
          if (tempElement == null) {
              return; // no matching ul, so there is nothing to traverse
          }
          // Print href and rel for every a that has both attributes
          Elements li = tempElement.getElementsByTag("li");
          for (Element element : li) {
              Elements element2 = element.getElementsByTag("a");
              for (Element element3 : element2) {
                  String hrefString = element3.attr("href");
                  String relString = element3.attr("rel");
                  // isEmpty() checks content; != "" would compare references
                  if (!hrefString.isEmpty() && !relString.isEmpty()) {
                      System.out.println("href=" + hrefString + " " + "rel=" + relString);
                  }
              }
          }
      }
  }
