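Overview
The crawler needs to scrape prices from a website. Unlike ordinary page scraping, the content here is loaded via AJAX, and the prices are rendered as CSS background images.
Each digit corresponds to a CSS class, for example 'p_h57_5':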
.p_h57_5 {
background: url('http://pic.c-ctrip.com/priceblur/h57/3713de5c594648529f39d031243966dd.gif') no-repeat -590px;
padding: 0 6px;
font-size: 18px;
}
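Both the class assigned to each digit and its background image change dynamically, and the price of every room type has to be captured. Another channel for getting room prices turned up later, but this post records the Selenium & Emgu approach.
Workflow:
1. Open the URL with Selenium
2. Take a full-screen screenshot
3. Use Selenium selectors to read the room type and related information
4. Use Selenium selectors to locate the price DOM element, compute its position relative to the screenshot, crop the price image, then recognize the price with Emgu and print it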
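Step 4 depends on mapping an element's position to a region of the screenshot. The helper this post uses for that (`SnapshotV2`) is not shown, so what follows is only a minimal sketch of the mapping. It assumes the screenshot covers the whole page at a 1:1 pixel scale and that the element's location and size are given in page coordinates. `PriceCrop` and `CropRect` are hypothetical names, not part of Selenium or Emgu.

```C#
using System;
using System.Drawing;

// Hypothetical helper for illustration; not the SnapshotV2 used in this post.
public static class PriceCrop
{
    // Returns the part of the element's bounding box that lies inside the
    // screenshot, expressed in screenshot pixel coordinates.
    // Assumes a 1:1 scale between page coordinates and screenshot pixels.
    public static Rectangle CropRect(Point location, Size size, Size screenshot)
    {
        var element = new Rectangle(location, size);
        var page = new Rectangle(Point.Empty, screenshot);
        // Intersect clips boxes that run past the screenshot edge;
        // it yields Rectangle.Empty when there is no overlap at all.
        return Rectangle.Intersect(element, page);
    }
}
```

With Selenium, `element.Location` and `element.Size` would supply the first two arguments, and the resulting rectangle could be passed to `Bitmap.Clone(Rectangle, PixelFormat)` to cut the price image out of the screenshot.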
Implementation
```C#
using System;
using System.Collections.ObjectModel;
using System.Drawing;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;
// Emgu namespaces are used by the OCR method further down
using Emgu.CV;
using Emgu.CV.OCR;
using Emgu.CV.Structure;

// Inside class Program. regRoomType (a Regex), GetEntereScreenshot and
// SnapshotV2 are defined elsewhere and are not shown in this post.
static void Main(string[] args)
{
    // Open the URL
    ChromeOptions options = new ChromeOptions();
    // Each switch is passed as its own argument
    options.AddArguments("--start-maximized", "--disable-popup-blocking");
    var driver = new ChromeDriver(options);
    driver.Navigate().GoToUrl("http://hotels.ctrip.com/hotel/992765.html");

    // Once the room table exists, the AJAX content has loaded.
    // Until throws WebDriverTimeoutException if it does not appear in time.
    new WebDriverWait(driver, TimeSpan.FromSeconds(1)).Until(
        ExpectedConditions.ElementExists(By.ClassName("htl_room_table")));

    ReadOnlyCollection<IWebElement> elementsList = driver.FindElementsByCssSelector("tr[expand]");

    // Hide the ¥ symbol in front of each price
    driver.ExecuteScript(@"
        var arr = document.getElementsByTagName('dfn');
        for (var i = 0; i < arr.length; i++) {
            arr[i].style.display = 'none';
        }
    ");

    // Full-screen screenshot
    var image2 = GetEntereScreenshot(driver);
    image2.Save(@"Z:\111.jpg");

    // Output
    Console.WriteLine("{0,-20}{1,-20}{2,-20}", "Room", "Type", "Price");
    foreach (IWebElement row in elementsList)
    {
        //var image = row.Snapshot();
        //image.Save(@"Z:\" + Guid.NewGuid() + ".jpg");
        //var str = ORC_((Bitmap)image);

        // Not every row has a room-type cell, so a failed lookup is ignored
        var roomType = "";
        try
        {
            roomType = row.FindElement(By.CssSelector(".room_unfold")).Text;
        }
        catch (Exception) { }
        var roomTypeText = regRoomType.Match(roomType);
        var roomTypeName = row.FindElement(By.CssSelector("span.room_type_name")).Text;

        // Crop the price element out of the full screenshot
        var image = row.FindElement(By.CssSelector("span.base_price")).SnapshotV2(image2);

        // Recognize the price
        var price = ORC_((Bitmap)image);
        Console.WriteLine("{0,-20}{1,-20}{2,-20}", roomTypeText.Value, roomTypeName, price);
    }
    Console.Read();
}
```
Image recognition method
```C#
private static Tesseract _ocr = new Tesseract(
    @"C:\Emgu\emgucv-windows-universal-cuda 2.9.0.1922\bin\tessdata",
    "eng",
    Tesseract.OcrEngineMode.OEM_TESSERACT_CUBE_COMBINED);

static Program()
{
    // Restrict recognition to digits
    _ocr.SetVariable("tessedit_char_whitelist", "0123456789");
}

// Runs OCR on the given image; "" means recognition failed
public static string ORC_(Bitmap img)
{
    string re = "";
    if (img == null) return re;

    Bgr drawColor = new Bgr(Color.Blue);
    try
    {
        using (Image<Bgr, byte> image = new Image<Bgr, byte>(img))
        using (Image<Gray, byte> gray = image.Convert<Gray, byte>())
        {
            _ocr.Recognize(gray);
            Tesseract.Charactor[] charactors = _ocr.GetCharactors();
            foreach (Tesseract.Charactor c in charactors)
            {
                // Outline each recognized character (useful when debugging)
                image.Draw(c.Region, drawColor, 1);
            }
            re = _ocr.GetText();
        }
        return re;
    }
    catch (Exception)
    {
        return re;
    }
}
```
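The recognition method returns the raw OCR text, which may carry whitespace or line breaks around the digits. As a possible follow-up step, the sketch below reduces such a string to an integer price. It assumes prices are whole numbers, which matches the digits-only whitelist above. `PriceText` and `TryParsePrice` are hypothetical names introduced for illustration.

```C#
using System;
using System.Globalization;
using System.Linq;

// Hypothetical post-processing helper; not part of the original program.
public static class PriceText
{
    // Keeps only the ASCII digits of the OCR output and parses them as an int.
    // Returns false when no digits remain or the number does not fit in an int.
    public static bool TryParsePrice(string ocrText, out int price)
    {
        price = 0;
        if (string.IsNullOrEmpty(ocrText)) return false;

        // Drop whitespace, line breaks and any other non-digit characters
        var digits = new string(ocrText.Where(c => c >= '0' && c <= '9').ToArray());

        return int.TryParse(digits, NumberStyles.None, CultureInfo.InvariantCulture, out price);
    }
}
```

Dropping every non-digit character also joins digit groups that the OCR separated with a space. That is a deliberate simplification here, since a single price element is expected to hold one number.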