CN100452054C - Data source discovery method for deep web data source integration - Google Patents
Data source discovery method for deep web data source integration
- Publication number
- CN100452054C (application number CNB2007100218834A / CN200710021883A)
- Authority
- CN
- China
- Prior art keywords
- page
- link
- data source
- root
- scoring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100218834A CN100452054C (zh) | 2007-05-09 | 2007-05-09 | Data source discovery method for deep web data source integration |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100218834A CN100452054C (zh) | 2007-05-09 | 2007-05-09 | Data source discovery method for deep web data source integration |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101051313A (zh) | 2007-10-10 |
CN100452054C (zh) | 2009-01-14 |
Family
ID=38782726
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100218834A (Expired - Fee Related) CN100452054C (zh) | 2007-05-09 | 2007-05-09 | Data source discovery method for deep web data source integration |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100452054C (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104346748A (zh) * | 2014-11-25 | 2015-02-11 | Sina.com Technology (China) Co., Ltd. | Information display method and device |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
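CN101261634B (zh) * | 2008-04-11 | 2012-11-21 | Harbin Institute of Technology Shenzhen Graduate School | Learning method and system based on incremental Q-Learning |
CN102117275B (zh) * | 2009-12-31 | 2012-11-07 | Peking University Founder Group Co., Ltd. | Method and device for web page data collection from targeted Internet sites |
CN101916272B (zh) * | 2010-08-10 | 2012-04-25 | Nanjing University of Information Science and Technology | Data source selection method for deep web data integration |
CN102103636B (zh) * | 2011-01-18 | 2013-08-07 | Nanjing University of Information Science and Technology | Incremental information acquisition method for deep web pages |
CN102739679A (zh) * | 2012-06-29 | 2012-10-17 | Southeast University | Phishing website detection method based on URL classification |
CN103678371B (zh) * | 2012-09-14 | 2017-10-10 | Fujitsu Limited | Lexicon updating device, data integration device and method, and electronic equipment |
CN104317845A (zh) * | 2014-10-13 | 2015-01-28 | Anhui Huazhen Information Technology Co., Ltd. | Method and system for automatic extraction of deep web data |
CN104462241A (zh) * | 2014-11-18 | 2015-03-25 | Beijing Ruian Technology Co., Ltd. | Demographic attribute classification method and device based on anchor text and surrounding text in URLs |
CN105843965B (zh) * | 2016-04-20 | 2019-06-04 | Guangdong Jingdian Data Technology Co., Ltd. | Deep web crawler form filling method and device based on URL topic classification |
CN106326447B (zh) * | 2016-08-26 | 2019-06-21 | Beijing Liangkebang Information Technology Co., Ltd. | Method and system for detecting data captured by crowdsourced web crawlers |
CN107784034B (zh) * | 2016-08-31 | 2021-05-25 | Beijing Sogou Technology Development Co., Ltd. | Page category identification method and device, and device for page category identification |
CN108090200A (zh) * | 2017-12-22 | 2018-05-29 | Central University of Finance and Economics | Method for acquiring data from ranked hidden web databases |
CN108829792A (zh) * | 2018-06-01 | 2018-11-16 | Chengdu Kangqiao Electronic Co., Ltd. | Scrapy-based distributed dark web resource mining system and method |
CN109101600A (zh) * | 2018-08-01 | 2018-12-28 | Shen Wence | Method and device for crawling dynamic data in web pages |
CN110765336B (zh) * | 2019-11-01 | 2022-04-01 | Beijing Topsec Network Security Technology Co., Ltd. | Web page information processing method and system |
CN112486989B (zh) * | 2020-11-28 | 2021-08-27 | Hebei Institute of Scientific and Technical Information (Hebei Institute of Science and Technology Innovation Strategy) | Method for granular fusion of multi-source data and classified, hierarchical processing of indicators |
CN113360798B (zh) * | 2021-06-02 | 2024-02-27 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Flooding data identification method, device, equipment, and medium |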
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1564157A (zh) * | 2004-03-23 | 2005-01-12 | Nanjing University | Method for setting up an extensible, customizable topic-focused World Wide Web crawler |
US6988100B2 (en) * | 2001-02-01 | 2006-01-17 | International Business Machines Corporation | Method and system for extending the performance of a web crawler |
US20060161564A1 (en) * | 2004-12-20 | 2006-07-20 | Samuel Pierre | Method and system for locating information in the invisible or deep world wide web |
CN1851706A (zh) * | 2006-05-30 | 2006-10-25 | Nanjing University | Method for constructing an intelligent topic-focused web crawler system based on ontology learning |
CN1851705A (zh) * | 2006-05-30 | 2006-10-25 | Nanjing University | Method for constructing an ontology-based topic-focused web crawler system |
WO2007017862A2 (en) * | 2005-08-05 | 2007-02-15 | Buzzmetrics Ltd. | Method and system for extracting web data |
- 2007-05-09: CN application CNB2007100218834A, patent CN100452054C (zh), not active (Expired - Fee Related)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6988100B2 (en) * | 2001-02-01 | 2006-01-17 | International Business Machines Corporation | Method and system for extending the performance of a web crawler |
CN1564157A (zh) * | 2004-03-23 | 2005-01-12 | Nanjing University | Method for setting up an extensible, customizable topic-focused World Wide Web crawler |
US20060161564A1 (en) * | 2004-12-20 | 2006-07-20 | Samuel Pierre | Method and system for locating information in the invisible or deep world wide web |
WO2007017862A2 (en) * | 2005-08-05 | 2007-02-15 | Buzzmetrics Ltd. | Method and system for extracting web data |
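CN1851706A (zh) * | 2006-05-30 | 2006-10-25 | Nanjing University | Method for constructing an intelligent topic-focused web crawler system based on ontology learning |
CN1851705A (zh) * | 2006-05-30 | 2006-10-25 | Nanjing University | Method for constructing an ontology-based topic-focused web crawler system |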
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
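CN104346748A (zh) * | 2014-11-25 | 2015-02-11 | Sina.com Technology (China) Co., Ltd. | Information display method and device |
CN104346748B (zh) * | 2014-11-25 | 2018-05-25 | Sina.com Technology (China) Co., Ltd. | Information display method and device |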
Also Published As
Publication number | Publication date |
---|---|
CN101051313A (zh) | 2007-10-10 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN100452054C (zh) | Data source discovery method for deep web data source integration | |
CN109543086B (zh) | Network data collection and display method for multiple data sources | |
Udapure et al. | Study of web crawler and its different types | |
Gupta et al. | Focused web crawlers and its approaches | |
CN103714149B (zh) | Adaptive incremental deep web data source discovery method | |
CN106227788A (zh) | Lucene-based database query method | |
CN104182412A (zh) | Web page crawling method and system | |
CN103914538B (zh) | Topic crawling method based on anchor text context and link analysis | |
CN103279492A (zh) | Method and device for crawling web pages | |
Kumar et al. | Design of a mobile Web crawler for hidden Web | |
CN109815388A (zh) | Intelligent focused crawler system based on a genetic algorithm | |
Shekhar et al. | An architectural framework of a crawler for retrieving highly relevant web documents by filtering replicated web collections | |
Deng | Research on the focused crawler of mineral intelligence service based on semantic similarity | |
CN108090200A (zh) | Method for acquiring data from ranked hidden web databases | |
CN107169082A (zh) | Message push method based on regional positioning | |
Mangaravite et al. | Improving the efficiency of a genre-aware approach to focused crawling based on link context | |
Ye et al. | iSurfer: A focused web crawler based on incremental learning from positive samples | |
Prasath et al. | Finding potential seeds through rank aggregation of web searches | |
Yuan et al. | Improvement of pagerank for focused crawler | |
Kaur et al. | SmartCrawler: A Three-Stage Ranking Based Web Crawler for Harvesting Hidden Web Sources. | |
Patil et al. | Implementation of enhanced web crawler for deep-web interfaces | |
Wang et al. | Focused deep web entrance crawling by form feature classification | |
Wang Hui et al. | Automatic discovery of domain-specific deep web entry points using a classifier | |
Yadav et al. | Topical web crawling using weighted anchor text and web page change detection techniques | |
Amrin et al. | Focused Web Crawling Algorithms. | |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
ASS | Succession or assignment of patent right | Free format text: FORMER OWNER: ZHAO PENGPENG, FANG WEI; Owner name: SUZHOU PUDA NEW INFORMATION TECHNOLOGY CO., LTD.; Free format text: FORMER OWNER: CUI ZHIMING; Effective date: 20100401 |
C41 | Transfer of patent application or patent right or utility model | |
COR | Change of bibliographic data | Free format text: CORRECT: ADDRESS; FROM: 215001 ROOM 403, BUILDING 115, SU'AN NEW VILLAGE, SUZHOU CITY, JIANGSU PROVINCE; TO: 215021 B502-2, INSIDE OF INTERNATIONAL SCIENCE PARK, NO.1355, JINJIHU AVENUE, SUZHOU INDUSTRIAL PARK DISTRICT, SUZHOU CITY, JIANGSU PROVINCE |
TR01 | Transfer of patent right | Effective date of registration: 20100401; Address after: 215021 International Science and Technology Park, 1355 Jinji Lake Avenue, Suzhou Industrial Park, Suzhou, Jiangsu, B502-2; Patentee after: Suzhou Production Information Technology Co., Ltd.; Address before: 215001 room 115, building 403, Su an village, Suzhou, Jiangsu; Co-patentee before: Zhao Pengpeng; Patentee before: Cui Zhiming; Co-patentee before: Fang Wei |
EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20071010; Assignee: SUZHOU SOUKE INFORMATION TECHNOLOGY CO., LTD.; Assignor: Suzhou Production Information Technology Co., Ltd.; Contract record no.: 2013320010066; Denomination of invention: Data source discovery method for deep web data source integration; Granted publication date: 20090114; License type: Exclusive License; Record date: 20130412 |
LICC | Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model | |
C41 | Transfer of patent application or patent right or utility model | |
TR01 | Transfer of patent right | Effective date of registration: 20161010; Address after: 215021 Jiangsu Suzhou City Canglang District liberation Village 5 403 room; Patentee after: Shu Lan; Address before: 215021 International Science and Technology Park, 1355 Jinji Lake Avenue, Suzhou Industrial Park, Suzhou, Jiangsu, B502-2; Patentee before: Suzhou Production Information Technology Co., Ltd. |
CF01 | Termination of patent right due to non-payment of annual fee | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20090114; Termination date: 20180509 |