Java code
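private CrawlerUrl getNextUrl() throws Throwable {
    CrawlerUrl nextUrl = null;
    // Take URLs off the queue until one passes all three checks or the queue is empty.
    while ((nextUrl == null) && (!urlQueue.isEmpty())) {
        CrawlerUrl crawlerUrl = this.urlQueue.remove();
        if (doWeHavePermissionToVisit(crawlerUrl)     // allowed by robots.txt
                && (!isUrlAlreadyVisited(crawlerUrl)) // not crawled yet
                && isDepthAcceptable(crawlerUrl)) {   // within the depth limit
            nextUrl = crawlerUrl;
        }
    }
    // null means no acceptable URL is left in the queue.
    return nextUrl;
}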
For more details on how to write a robots.txt file, see this article: http://www.bloghuman.com/post/67/
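The permission check called from getNextUrl relies on the site's robots.txt rules. The following is only a minimal sketch of how such a check might work, not the original doWeHavePermissionToVisit implementation: the class RobotsTxtSketch and its methods parseDisallowed and isAllowed are hypothetical names, and the sketch only honours Disallow prefixes in groups addressed to all agents ("User-agent: *"), ignoring Allow rules and wildcards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original crawler.
public class RobotsTxtSketch {

    /** Collects Disallow path prefixes from groups that apply to every agent ("User-agent: *"). */
    public static List<String> parseDisallowed(String robotsTxt) {
        List<String> disallowed = new ArrayList<String>();
        boolean appliesToUs = false;   // are we inside a "User-agent: *" group?
        boolean lastWasAgent = false;  // several User-agent lines may share one group
        for (String rawLine : robotsTxt.split("\r?\n")) {
            // Strip trailing comments and surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean wildcard = value.equals("*");
                appliesToUs = lastWasAgent ? (appliesToUs || wildcard) : wildcard;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && appliesToUs && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
        return disallowed;
    }

    /** Returns true if the path does not start with any disallowed prefix. */
    public static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Sample robots.txt content, made up for illustration.
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/ # scratch\n\n"
                + "User-agent: BadBot\nDisallow: /\n";
        List<String> rules = parseDisallowed(robots);
        System.out.println(isAllowed("/index.html", rules));     // true
        System.out.println(isAllowed("/private/a.html", rules)); // false
    }
}
```

A real crawler would normally fetch and cache the rules once per host, so that the check in getNextUrl does not trigger a new download for every URL.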
getContent uses Apache HttpClient 4.1 internally to fetch the page content. The code is as follows:
Java code
private String getContent(CrawlerUrl url) throws Throwable {
    HttpClient client = new DefaultHttpClient();
    HttpGet httpGet = new HttpGet(url.getUrlString());
    StringBuilder strBuf = new StringBuilder();
    HttpResponse response = client.execute(httpGet);
    if (HttpStatus.SC_OK == response.getStatusLine().getStatusCode()) {
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            // The page is assumed to be UTF-8 encoded.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(entity.getContent(), "UTF-8"));
            try {
                // getContentLength() is negative when the length is unknown
                // (e.g. chunked responses), so read the body regardless of it.
                String line;
                while ((line = reader.readLine()) != null) {
                    // readLine() strips the line terminator; put it back so that
                    // words on adjacent lines are not glued together.
                    strBuf.append(line).append('\n');
                }
            } finally {
                // Closing the content stream releases the underlying connection.
                reader.close();
            }
        }
    }
    markUrlAsVisited(url);
    return strBuf.toString();
}
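For vertical applications, the accuracy of the data is often what matters most. The defining feature of a focused crawler is that it collects only data related to its topic, and that is the job of the isContentRelevant method. A classification or prediction technique could be used here; for simplicity, regular-expression matching is used instead. The main code is as follows: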
Java code
public static boolean isContentRelevant(String content,
        Pattern regexpPattern) {
    boolean retValue = false;
    if (content != null) {
        // The content is lowercased before matching, so the pattern
        // itself must be written in lowercase.
        Matcher m = regexpPattern.matcher(content.toLowerCase());
        retValue = m.find();
    }
    return retValue;
}
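Because isContentRelevant lowercases the content before matching, a pattern containing uppercase letters can never match. The following usage sketch illustrates this; the class name RelevanceDemo, the topic pattern, and the sample strings are made up for the example, and the method body is copied in only so that the example compiles on its own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original crawler.
public class RelevanceDemo {

    // Same logic as isContentRelevant, repeated here to keep the example self-contained.
    public static boolean isContentRelevant(String content, Pattern regexpPattern) {
        if (content == null) {
            return false;
        }
        Matcher m = regexpPattern.matcher(content.toLowerCase());
        return m.find();
    }

    public static void main(String[] args) {
        // Assumed topic pattern, written in lowercase on purpose.
        Pattern topic = Pattern.compile("\\b(lucene|solr)\\b");

        System.out.println(isContentRelevant("Indexing with Lucene and Solr", topic)); // true
        System.out.println(isContentRelevant("Cooking recipes", topic));               // false

        // An uppercase letter in the pattern never matches the lowercased content.
        System.out.println(isContentRelevant("Lucene", Pattern.compile("Lucene")));    // false
    }
}
```

Keeping the topic pattern in lowercase avoids silently rejecting every page.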