CN102054028B - Method for implementing page rendering in a web crawler system - Google Patents
Method for implementing page rendering in a web crawler system
- Publication number
- CN102054028B (application number CN 201010590806)
- Authority
- CN
- China
- Prior art keywords
- page
- url
- label
- crawler system
- web
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
Description
Claims (2)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010590806 CN102054028B (zh) | 2010-12-10 | 2010-12-10 | Method for implementing page rendering in a web crawler system |
PCT/CN2011/078725 WO2012025040A1 (zh) | 2010-08-27 | 2011-08-22 | Visual search engine system and its implementation method and application |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010590806 CN102054028B (zh) | 2010-12-10 | 2010-12-10 | Method for implementing page rendering in a web crawler system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102054028A (zh) | 2011-05-11 |
CN102054028B (zh) | 2013-12-25 |
Family
ID=43958350
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201010590806 Active CN102054028B (zh) | Method for implementing page rendering in a web crawler system | 2010-08-27 | 2010-12-10 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102054028B (zh) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012025040A1 (zh) * | 2010-08-27 | 2012-03-01 | Huang Bin | Visual search engine system and its implementation method and application |
CN102915308B (zh) * | 2011-08-02 | 2016-03-09 | 阿里巴巴集团控股有限公司 | Page rendering method and device |
CN103164193B (zh) * | 2011-12-12 | 2016-02-17 | 阿里巴巴集团控股有限公司 | Template rendering method and device |
CN102737128B (zh) * | 2012-06-20 | 2014-12-10 | 深圳市远行科技有限公司 | Browser-based dynamic page processing device |
CN103810425B (zh) * | 2012-11-13 | 2015-09-30 | 腾讯科技(深圳)有限公司 | Method and device for detecting malicious URLs |
CN104346328A (zh) * | 2013-07-23 | 2015-02-11 | 同程网络科技股份有限公司 | Vertical intelligent crawler data collection method based on web page data scraping |
CN104462125B (zh) * | 2013-09-18 | 2019-09-17 | 腾讯科技(深圳)有限公司 | Method and device for generating web page screenshots |
CN104156421B (zh) * | 2014-08-06 | 2018-11-09 | 百度在线网络技术(北京)有限公司 | Page display method, device, and system |
US9729606B2 (en) * | 2014-09-10 | 2017-08-08 | Benefitfocus.Com, Inc. | Systems and methods for a metadata driven user interface framework |
CN110851680B (zh) * | 2015-05-15 | 2023-06-30 | 阿里巴巴集团控股有限公司 | Web crawler identification method and device |
CN106503253A (zh) * | 2016-11-11 | 2017-03-15 | 张军 | Framework for web crawler extraction, indexing, and mapping of URLs for image formats |
CN108197125B (zh) | 2016-12-08 | 2020-10-09 | 腾讯科技(深圳)有限公司 | Web page crawling method and device |
CN109711528A (zh) * | 2017-10-26 | 2019-05-03 | 北京深鉴智能科技有限公司 | Method for pruning convolutional neural networks based on feature map changes |
CN108009598A (zh) * | 2017-12-27 | 2018-05-08 | 北京诸葛找房信息技术有限公司 | Deep learning-based floor plan recognition method |
CN108549693B (zh) * | 2018-04-13 | 2022-07-08 | 上海宝尊电子商务有限公司 | CMS page generation method based on crawler technology |
CN108777687B (zh) * | 2018-06-05 | 2020-04-14 | 掌阅科技股份有限公司 | Crawler interception method based on user behavior profiling, electronic device, and storage medium |
CN109543085A (zh) * | 2018-11-15 | 2019-03-29 | 中电科嘉兴新型智慧城市科技发展有限公司 | Data extraction method, device, and computer-readable storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6271840B1 (en) * | 1998-09-24 | 2001-08-07 | James Lee Finseth | Graphical search engine visual index |
CN101404666A (zh) * | 2008-10-06 | 2009-04-08 | 赵洪宇 | Method for unlimited-level collection based on Web pages |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8108371B2 (en) * | 2006-11-30 | 2012-01-31 | Microsoft Corporation | Web engine search preview |
CN101216836B (zh) * | 2007-12-29 | 2010-06-02 | 腾讯科技(深圳)有限公司 | Web page anchor text denoising system and method |
CN101751438B (zh) * | 2008-12-17 | 2012-08-22 | 中国科学院自动化研究所 | Adaptive semantics-driven topical web page filtering system |
Non-Patent Citations (4)
Title |
---|
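zhengchao860730. Making pages a bit faster: HTML parsing principles [repost]. "http://zhengchao860730.iteye.com/blog/647842". 2010, |
刘忠. Research and implementation of a vertical search engine web crawler based on reinforcement learning. "China Master's Theses Full-text Database". 2008, pp. 23-27. |
Research and implementation of a vertical search engine web crawler based on reinforcement learning; 刘忠; "China Master's Theses Full-text Database"; 20081130; pp. 23-27 * |
Making pages a bit faster: HTML parsing principles [repost]; zhengchao860730; "http://zhengchao860730.iteye.com/blog/647842"; 20100419; pp. 1-2 * |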
Also Published As
Publication number | Publication date |
---|---|
CN102054028A (zh) | 2011-05-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
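CN102054028B (zh) | | Method for implementing page rendering in a web crawler system |
CN102930059B (zh) | | Design method for a focused crawler |
CN102831220B (zh) | | Topic-customized news intelligence extraction system |
CN103970788A (zh) | | Crawler technique based on web page crawling |
CN102270331B (zh) | | Online shopping navigation method based on visual search |
CN105243159A (zh) | | Distributed web crawler system based on a visual script editor |
CN102567407B (zh) | | Method and system for incremental collection of forum replies |
CN104516982A (zh) | | Nutch-based Web information extraction method and system |
CN102314463A (zh) | | Distributed crawler system and its method for extracting web page data |
CN107257390B (zh) | | URL address parsing method and system |
CN102880607A (zh) | | Method for crawling dynamic web content and dynamic web content crawler system |
CN102591992A (zh) | | Web page classification and recognition system and method based on vertical search and focused crawler technology |
CN102768683B (zh) | | Image information search method and search device |
Sukumar et al. | | Review on modern Data Preprocessing techniques in Web usage mining (WUM) |
CN105468737A (zh) | | Big data analysis method for network services, cloud computing platform, and mining system |
CN103258017B (zh) | | Parallel vertical cross-network data collection method and system |
CN103455600A (zh) | | Video URL crawling method, device, and server equipment |
CN106599270B (zh) | | Network data crawling method and crawler |
CN102567521B (zh) | | Web page data crawling and filtering method |
CN103902579A (zh) | | Method and device for obtaining information |
CN103177022A (zh) | | Malicious file search method and device |
CN104199893A (zh) | | System and method for rapidly publishing omnimedia content |
CN103761257A (zh) | | Web page processing method and system based on a mobile browser |
CN106326236A (zh) | | Web page content recognition method and system |
CN109446441B (zh) | | General-purpose trusted distributed collection and storage system for online communities |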
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
ASS | Succession or assignment of patent right | Owner name: BEIJING LIXINYINGQI INFORMATION TECHNOLOGY CO., LT; Free format text: FORMER OWNER: HUANG BIN; Effective date: 20150626 |
C41 | Transfer of patent application or patent right or utility model | |
TR01 | Transfer of patent right | Effective date of registration: 20150626; Address after: 100083, Beijing, Haidian District, North Fourth Ring Road, No. 257 branch building, West 509; Patentee after: Beijing Lixinyingqi Information Technology Co., Ltd.; Address before: 100083, Beijing, Haidian District, two Li village, 8 North building, 5 door, Room 501; Patentee before: Huang Bin |
C56 | Change in the name or address of the patentee | |
CP03 | Change of name, title or address | Address after: 100101 Beijing city Chaoyang District Anxiang Lane No. 11 Beijing building B block 1306; Patentee after: BEIJING LIXIN YINGQI BIG DATA TECHNOLOGY CO., LTD.; Address before: 100083, Beijing, Haidian District, North Fourth Ring Road, No. 257 branch building, West 509; Patentee before: Beijing Lixinyingqi Information Technology Co., Ltd. |
CP01 | Change in the name or title of a patent holder | Address after: 100101 Beijing city Chaoyang District Anxiang Lane No. 11 Beijing building B block 1306; Patentee after: Beijing fahe Big Data Technology Co., Ltd; Address before: 100101 Beijing city Chaoyang District Anxiang Lane No. 11 Beijing building B block 1306; Patentee before: BEIJING LIXIN YINGQI BIG DATA TECHNOLOGY Co.,Ltd. |
CP01 | Change in the name or title of a patent holder | |
CP02 | Change in the address of a patent holder | Address after: Room 1126, 11 / F, building 1, No. 11 courtyard, Anxiang Beili, Chaoyang District, Beijing 100101; Patentee after: Beijing fahe Big Data Technology Co., Ltd; Address before: 100101 Beijing city Chaoyang District Anxiang Lane No. 11 Beijing building B block 1306; Patentee before: Beijing fahe Big Data Technology Co., Ltd |
CP02 | Change in the address of a patent holder | |
CP01 | Change in the name or title of a patent holder | |
CP01 | Change in the name or title of a patent holder | Address after: Room 1126, floor 11, building 1, yard a 11, Anxiang Beili, Chaoyang District, Beijing 100101; Patentee after: Beijing fahe Digital Technology Group Co., Ltd; Address before: Room 1126, floor 11, building 1, yard a 11, Anxiang Beili, Chaoyang District, Beijing 100101; Patentee before: Beijing fahe Big Data Technology Co., Ltd |