CN101051313A - Integrated data source finding method for deep layer net page data source - Google Patents
- Publication number: CN101051313A
- Authority: CN (China)
- Prior art keywords: page, link, data source, query interface, root
- Prior art date
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100218834A CN100452054C (en) | 2007-05-09 | 2007-05-09 | Integrated data source finding method for deep layer net page data source |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100218834A CN100452054C (en) | 2007-05-09 | 2007-05-09 | Integrated data source finding method for deep layer net page data source |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101051313A (en) | 2007-10-10 |
CN100452054C (en) | 2009-01-14 |
Family
ID=38782726
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CNB2007100218834A CN100452054C (en) | Expired - Fee Related | 2007-05-09 | 2007-05-09 | Integrated data source finding method for deep layer net page data source |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100452054C (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104346748B (en) * | 2014-11-25 | 2018-05-25 | 新浪网技术(中国)有限公司 | Information displaying method and device |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6988100B2 (en) * | 2001-02-01 | 2006-01-17 | International Business Machines Corporation | Method and system for extending the performance of a web crawler |
CN100371932C (en) * | 2004-03-23 | 2008-02-27 | 南京大学 | Expandable and customizable theme centralized universile-web net reptile setup method |
US20060161564A1 (en) * | 2004-12-20 | 2006-07-20 | Samuel Pierre | Method and system for locating information in the invisible or deep world wide web |
US20070100779A1 (en) * | 2005-08-05 | 2007-05-03 | Ori Levy | Method and system for extracting web data |
CN100401301C (en) * | 2006-05-30 | 2008-07-09 | 南京大学 | Body learning based intelligent subject-type network reptile system configuration method |
CN100392658C (en) * | 2006-05-30 | 2008-06-04 | 南京大学 | Body-bused subject type network reptile system configuration method |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101261634B (en) * | 2008-04-11 | 2012-11-21 | 哈尔滨工业大学深圳研究生院 | Studying method and system based on increment Q-Learning |
CN102117275B (en) * | 2009-12-31 | 2012-11-07 | 北大方正集团有限公司 | Method and device for collecting webpage data of direction site based on internet |
CN101916272B (en) * | 2010-08-10 | 2012-04-25 | 南京信息工程大学 | Data source selection method for deep web data integration |
CN101916272A (en) * | 2010-08-10 | 2010-12-15 | 南京信息工程大学 | Data source selection method for deep web data integration |
CN102103636B (en) * | 2011-01-18 | 2013-08-07 | 南京信息工程大学 | Deep web-oriented incremental information acquisition method |
CN102103636A (en) * | 2011-01-18 | 2011-06-22 | 南京信息工程大学 | Deep web-oriented incremental information acquisition method |
CN102739679A (en) * | 2012-06-29 | 2012-10-17 | 东南大学 | URL(Uniform Resource Locator) classification-based phishing website detection method |
CN103678371B (en) * | 2012-09-14 | 2017-10-10 | 富士通株式会社 | Word library updating device, data integration device and method and electronic equipment |
CN104317845A (en) * | 2014-10-13 | 2015-01-28 | 安徽华贞信息科技有限公司 | Method and system for automatic extraction of deep web data |
CN104462241A (en) * | 2014-11-18 | 2015-03-25 | 北京锐安科技有限公司 | Population property classification method and device based on anchor texts and peripheral texts in URLs |
CN105843965A (en) * | 2016-04-20 | 2016-08-10 | 广州精点计算机科技有限公司 | Deep web crawler form filling method and device based on URL (uniform resource locator) subject classification |
CN105843965B (en) * | 2016-04-20 | 2019-06-04 | 广东精点数据科技股份有限公司 | A kind of Deep Web Crawler form filling method and apparatus based on URL subject classification |
CN106326447A (en) * | 2016-08-26 | 2017-01-11 | 北京量科邦信息技术有限公司 | Detection method and system of data captured by crowd sourcing network crawlers |
CN107784034A (en) * | 2016-08-31 | 2018-03-09 | 北京搜狗科技发展有限公司 | The recognition methods of page classification and device, the device for the identification of page classification |
CN107784034B (en) * | 2016-08-31 | 2021-05-25 | 北京搜狗科技发展有限公司 | Page type identification method and device for page type identification |
CN108090200A (en) * | 2017-12-22 | 2018-05-29 | 中央财经大学 | A kind of sequence type hides the acquisition methods of grid database data |
CN108829792A (en) * | 2018-06-01 | 2018-11-16 | 成都康乔电子有限责任公司 | Distributed darknet excavating resource system and method based on scrapy |
CN109101600A (en) * | 2018-08-01 | 2018-12-28 | 沈文策 | The crawling method and device of dynamic data in a kind of webpage |
CN110765336A (en) * | 2019-11-01 | 2020-02-07 | 北京天融信网络安全技术有限公司 | Webpage information processing method and system |
CN110765336B (en) * | 2019-11-01 | 2022-04-01 | 北京天融信网络安全技术有限公司 | Webpage information processing method and system |
CN112486989A (en) * | 2020-11-28 | 2021-03-12 | 河北省科学技术情报研究院(河北省科技创新战略研究院) | Multi-source data granulation fusion and index classification and layering processing method |
CN112486989B (en) * | 2020-11-28 | 2021-08-27 | 河北省科学技术情报研究院(河北省科技创新战略研究院) | Multi-source data granulation fusion and index classification and layering processing method |
CN113360798A (en) * | 2021-06-02 | 2021-09-07 | 北京百度网讯科技有限公司 | Flooding data identification method, device, equipment and medium |
CN113360798B (en) * | 2021-06-02 | 2024-02-27 | 北京百度网讯科技有限公司 | Method, device, equipment and medium for identifying flooding data |
Also Published As
Publication number | Publication date |
---|---|
CN100452054C (en) | 2009-01-14 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN100452054C (en) | Integrated data source finding method for deep layer net page data source | |
CN1240011C (en) | File classifying management system and method for operation system | |
CN1290036C (en) | Computer system and method for establishing concept knowledge according to machine readable dictionary | |
CN101079056A (en) | Retrieving method and system | |
CN103714149B (en) | Self-adaptive incremental deep web data source discovery method | |
CN101079064A (en) | Web page sequencing method and device | |
CN1755678A (en) | System and method for incorporating anchor text into ranking of search results | |
CN1750002A (en) | Method for providing research result | |
CN101055587A (en) | Search engine retrieving result reordering method based on user behavior information | |
CN1804844A (en) | Web page metadata based formalized description method for user access behaviors | |
CN111522905A (en) | Document searching method and device based on database | |
CN106227788A (en) | Database query method based on Lucene | |
Liakos et al. | Focused crawling for the hidden web | |
Barrio et al. | Sampling strategies for information extraction over the deep web | |
CN103064841A (en) | Retrieval device and retrieval method | |
Shekhar et al. | An architectural framework of a crawler for retrieving highly relevant web documents by filtering replicated web collections | |
CN108090200A (en) | A kind of sequence type hides the acquisition methods of grid database data | |
Deng | Research on the focused crawler of mineral intelligence service based on semantic similarity | |
US20040205049A1 (en) | Methods and apparatus for user-centered web crawling | |
CN110647673A (en) | Method for realizing ecological environment space big data integration and sharing | |
CN106066875A (en) | A kind of high efficient data capture method and system based on deep net reptile | |
CN1209726C (en) | Method for identifying mirror and quasi-mirror web sites over internet | |
Yadav et al. | Architecture for parallel crawling and algorithm for change detection in web pages | |
Patil et al. | Implementation of enhanced web crawler for deep-web interfaces | |
NAGAVEENA et al. | A Smart Web Crawler: An Efficient Harvesting Deep-Web Interfaces Using Site Ranker And Adoptive Learning | |
Legal Events
Code | Title | Description | Date |
---|---|---|---|
C06 | Publication | | |
PB01 | Publication | | |
C10 | Entry into substantive examination | | |
SE01 | Entry into force of request for substantive examination | | |
C14 | Grant of patent or utility model | | |
GR01 | Patent grant | | |
ASS | Succession or assignment of patent right | Free format text: FORMER OWNER: ZHAO PENGPENG; FANG WEI. Owner name: SUZHOU PUDA NEW INFORMATION TECHNOLOGY CO., LTD. Free format text: FORMER OWNER: CUI ZHIMING. Effective date: 20100401 | |
C41 | Transfer of patent application or patent right or utility model | | |
COR | Change of bibliographic data | Free format text: CORRECT: ADDRESS; FROM: 215001 ROOM 403, BUILDING 115, SU'AN NEW VILLAGE, SUZHOU CITY, JIANGSU PROVINCE TO: 215021 B502-2, INSIDE OF INTERNATIONAL SCIENCE PARK, NO.1355, JINJIHU AVENUE, SUZHOU INDUSTRIAL PARK DISTRICT, SUZHOU CITY, JIANGSU PROVINCE | |
TR01 | Transfer of patent right | Effective date of registration: 20100401. Address after: 215021 international science and Technology Park, 1355 Jinji Lake Avenue, Suzhou Industrial Park, Suzhou, Jiangsu, B502-2. Patentee after: Suzhou Production Information Technology Co., Ltd. Address before: 215001 room 115, building 403, Su an village, Suzhou, Jiangsu. Co-patentee before: Zhao Pengpeng. Patentee before: Cui Zhiming. Co-patentee before: Fang Wei | |
EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20071010. Assignee: SUZHOU SOUKE INFORMATION TECHNOLOGY CO., LTD. Assignor: Suzhou Production Information Technology Co., Ltd. Contract record no.: 2013320010066. Denomination of invention: Integrated data source finding method for deep layer net page data source. Granted publication date: 20090114. License type: Exclusive License. Record date: 20130412 | |
LICC | Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model | | |
C41 | Transfer of patent application or patent right or utility model | | |
TR01 | Transfer of patent right | Effective date of registration: 20161010. Address after: 215021 Jiangsu Suzhou City Canglang District liberation Village 5 403 room. Patentee after: Shu Lan. Address before: 215021 international science and Technology Park, 1355 Jinji Lake Avenue, Suzhou Industrial Park, Suzhou, Jiangsu, B502-2. Patentee before: Suzhou Production Information Technology Co., Ltd. | |
CF01 | Termination of patent right due to non-payment of annual fee | | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20090114. Termination date: 20180509 | |