CN101051313A - Integrated data source finding method for deep layer net page data source - Google Patents

Info

Publication number
CN101051313A
Authority
CN
China
Prior art keywords
page
link
data source
query interface
root
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200710021883
Other languages
Chinese (zh)
Other versions
CN100452054C (en)
Inventor
崔志明
赵朋朋
方巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shu Lan
Original Assignee
崔志明
赵朋朋
方巍
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 崔志明, 赵朋朋, 方巍
Priority to CNB2007100218834A
Publication of CN101051313A
Application granted
Publication of CN100452054C
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data source discovery method for Deep Web data integration. The method sets up a site-root link queue and a local link queue; takes the highest-scoring page link from the local link queue and downloads the page with a crawling module; processes the downloaded page with a form classifier and, if the page contains a form query interface, adds it to the Deep Web data sources; processes the downloaded page with a page classifier and returns to the link-taking step if the topic score is below a threshold; and extracts the link addresses in the page and places them in the local link queue. Repeating the steps from taking a page link through extracting link addresses achieves automatic crawling of Deep Web data sources.

Description

Data source discovery method for Deep Web data source integration
Technical field
The present invention relates to a method for discovering network-based data sources, and in particular to a method for discovering the data sources of Deep Web pages that are reached through web query interfaces, for use in the integration of Deep Web data sources.
Background art
With the widespread use of web databases, the web is rapidly "deepening". A large number of pages on the Internet are generated dynamically from back-end databases. This information cannot be reached through static links; it can only be obtained by filling in a form and submitting a query. Because a traditional web crawler (Crawler) cannot fill in forms, it cannot retrieve these pages. Existing search engines therefore cannot find this information, which remains hidden and invisible to users; such pages are called the Deep Web (also known as the Invisible Web or Hidden Web). The Deep Web is the counterpart of the Surface Web. The concept was first proposed by Dr. Jill Ellsworth in 1994 and refers to web pages whose content is difficult for general-purpose search engines to find. Deep Web information is generally stored in databases; compared with static pages it is usually larger in volume, more focused in topic, higher in quality, better structured, and faster growing. Studies indicate that the Deep Web holds about 500 times as much information as the Surface Web and that there are roughly 450,000 Deep Web sites. Large-scale Deep Web data integration is an effective way to make Deep Web information convenient for users.
Large-scale integrated search over the Deep Web requires solving five key problems: 1) data source discovery (Deep Web Discovery); 2) query interface extraction (Query Interface Extraction); 3) data source classification (Source Classification); 4) query translation (Query Transfer); and 5) result merging (Result Merging).
A prerequisite for classified, integrated search over the Deep Web is obtaining Deep Web query interfaces, which falls within the scope of data source discovery.
K. C.-C. Chang, B. He and Z. Zhang, in the paper "Toward Large-Scale Integration: Building a MetaQuerier over Databases on the Web" (Conference on Innovative Data Systems Research, Asilomar, 2005), disclose a method for obtaining query interfaces from the web. It first collects a list of IP addresses that provide WWW service; then, for each IP address in the list, it fetches web pages within a certain depth using a breadth-first strategy and extracts query interfaces from the downloaded pages. However, because only a very small proportion of pages on the Internet contain query interfaces, and breadth-first search is a blind strategy, this method downloads a large number of irrelevant pages and is very inefficient.
An effective way to address this problem is focused crawling (Focused Crawling). There has so far been relatively little research on applying focused crawlers to Deep Web data source discovery. Some researchers use a link classifier to give download priority to the links most likely to lead to pages containing query interfaces. When training the classifier, they use search engines such as Google to obtain all the outer-layer pages that point to an inner-layer page. The drawback is that the further out one goes, the more pages there are, and many of them are irrelevant, which leads to problems such as "topic drift". In addition, that method cannot accurately determine the depth of a page within its site, so it cannot control the crawling process well.
Summary of the invention
The object of the present invention is to provide a data source discovery method for Deep Web data source integration that, for a given topic, retrieves and downloads the query interfaces relevant to that topic, reduces the number of pages downloaded, and solves the topic drift problem.
To achieve this object, the technical solution adopted by the present invention is a data source discovery method for Deep Web data source integration, comprising the following steps:
(1) A topic for the data to be queried is given; a site-root link queue and a local link queue are constructed; at least one seed root link address is placed in the site-root link queue, and each is assigned a weight according to its relation to the topic;
(2) If the local link queue is empty, the root link address with the greatest weight is taken from the site-root link queue and placed in the local link queue; the page link with the highest score is taken from the local link queue, and the crawling module downloads that page;
(3) The page downloaded in step (2) is processed by the form classifier; if it contains a form query interface, it is added to the Deep Web data sources;
(4) The page downloaded in step (2) is processed by the page classifier, which uses a best-first strategy to judge the topic; if the topic score is below a set threshold, the method returns to step (2);
(5) The link addresses in the page are extracted, and the link classifier judges whether each link address is likely to point to a page containing a form interface and scores the link. The link classifier extracts the anchor text, the text surrounding the link, the link address, and the address of any image in the link as features, segments this information into words and counts word frequencies to obtain the feature vector X of the link, and classifies the link information with the naive Bayes method. For a link whose score is greater than a set value: if it is a local link, it is placed in the local link queue; if it is an external-site link, the site-root link queue is searched. When the corresponding site root link already exists, its weight is adjusted according to the score of this link; when it does not exist, the site root link of this link is added to the site-root link queue, and its weight is set according to the score;
(6) Steps (2) to (5) are repeated, achieving automatic crawling of Deep Web data sources.
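Steps (1) to (6) can be sketched as a single loop. The following is a minimal illustration under stated assumptions, not the patented implementation: the function name `crawl`, its parameters, the default thresholds, the `max_pages` cap, and the rule of keeping the larger of the old weight and the new link score are all hypothetical, and the three classifiers and the downloader are passed in as plain callables.

```python
import heapq
from urllib.parse import urlparse


def site_root(url):
    """Return the site root link (scheme and host) of a URL."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"


def crawl(seeds, fetch, has_query_form, topic_score, extract_links, link_score,
          topic_threshold=0.5, link_threshold=0.5, max_pages=100):
    """seeds maps seed root URLs to initial weights (step 1)."""
    root_weights = dict(seeds)   # site-root link queue: root URL -> weight
    local = []                   # local link queue: heap of (-score, url)
    seen = set()
    sources = []                 # pages found to contain a query interface
    pages = 0
    while pages < max_pages:
        if not local:                                    # step (2)
            if not root_weights:
                break
            root = max(root_weights, key=root_weights.get)
            del root_weights[root]
            if root in seen:
                continue
            heapq.heappush(local, (-1.0, root))
        _, url = heapq.heappop(local)
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        pages += 1
        if page is None:
            continue
        if has_query_form(page):                         # step (3)
            sources.append(url)
        if topic_score(page) < topic_threshold:          # step (4)
            continue
        current_root = site_root(url)
        for link in extract_links(page):                 # step (5)
            score = link_score(link)
            if score <= link_threshold or link in seen:
                continue
            root = site_root(link)
            if root == current_root:
                heapq.heappush(local, (-score, link))
            else:
                # Assumed adjustment rule: keep the best score seen so far.
                root_weights[root] = max(root_weights.get(root, 0.0), score)
    return sources
```

Only the home address of an external site is queued, so a promising external link raises the priority of its whole site rather than being fetched directly.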
In the above technical solution, a "local link" is a page link that has the same site root link as the page being processed. The "page classifier" uses a best-first strategy and judges whether a fetched page P belongs to the current topic. Only when P belongs to the current topic are the links and query interfaces in P processed further. The "link classifier" judges whether a link URL is likely to point to a page containing a form interface and scores the link. These classification methods are prior art: in general, a classifier is created automatically by learning from a set of training texts that have already been labeled with classes, and test texts are then classified through this supervised learning. The naive Bayes classifier (Naïve Bayes Classifier) assumes that the components of the feature vector are mutually independent given the decision variable. For a test sample with feature vector $X = [x_1, x_2, \ldots, x_d]^T$, the probability that it belongs to class $C_i$ is given by formula (1):
$$P(C_i \mid X) = \frac{P(C_i)}{P(X)} \prod_{j=1}^{d} P(x_j \mid C_i) \tag{1}$$
Here $P(C_i \mid X)$ is the probability that X belongs to class $C_i$. The probability in formula (1) is computed for every class, and the final classification result is the class with the largest probability.
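Formula (1) can be illustrated with a small word-count classifier. This is a sketch under assumptions rather than the classifier used in the invention: the function names `train` and `posterior` are hypothetical, add-one (Laplace) smoothing is added so that unseen words do not give a zero probability, and the product is evaluated in log space. The division by P(X) is carried out by normalizing the class scores so that they sum to 1.

```python
import math
from collections import Counter


def train(samples):
    """samples is a list of (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def posterior(tokens, model):
    """Return P(C_i | X) for every class C_i, following formula (1)."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        # Assumed add-one smoothing over the training vocabulary.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        log_p = math.log(count / total)                  # log P(C_i)
        for token in tokens:                             # sum of log P(x_j | C_i)
            log_p += math.log((word_counts[label][token] + 1) / denominator)
        log_scores[label] = log_p
    # Normalizing over all classes plays the role of dividing by P(X).
    peak = max(log_scores.values())
    unnormalized = {c: math.exp(v - peak) for c, v in log_scores.items()}
    norm = sum(unnormalized.values())
    return {c: v / norm for c, v in unnormalized.items()}
```

The predicted class is the key with the largest value in the returned dictionary, which matches the decision rule stated after formula (1).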
Using the page classifier to judge the topic effectively avoids topic drift.
In a further technical solution, in step (5), a local link whose link depth is greater than 3 is discarded and not placed in the local link queue. According to a survey, 91.6% of Deep Web query interfaces are on pages at a depth of 3 or less; links deeper than 3 are therefore not processed, which effectively reduces the processing load while preserving accuracy.
In the above technical solution, the page classifier is first trained with page instances; each new page obtained from the crawler is then analyzed and scored by the trained page classifier. The score reflects the probability that the page belongs to the current topic, and only when the score is greater than a preset threshold θ are the links and query interfaces in the page processed further.
In the above technical solution, the form classifier determines the query interface region according to heuristic rules, and a page is added to the Deep Web data sources only when a form in the page is a query-interface form. The heuristic rules are: a web form that includes a TEXTAREA control or a PASSWORD control is not a query interface; and a web form with fewer than 3 controls is not a query interface.
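The two heuristic rules can be sketched with Python's standard `html.parser` module. The class and function names below are hypothetical, and two details are assumptions not stated in the source: hidden inputs are not counted as controls, and `button` elements are counted alongside `input`, `select`, and `textarea`.

```python
from html.parser import HTMLParser


class FormScanner(HTMLParser):
    """Collect the control types of every <form> in a page."""

    CONTROL_TAGS = {"input", "select", "textarea", "button"}

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of control types per form
        self._current = None   # controls of the form currently open

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._current = []
        elif tag in self.CONTROL_TAGS and self._current is not None:
            if tag == "input":
                kind = (dict(attrs).get("type") or "text").lower()
            else:
                kind = tag
            if kind != "hidden":   # assumption: hidden fields are not controls
                self._current.append(kind)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def is_query_interface(controls, min_controls=3):
    """Apply the two heuristic rules to one form's control types."""
    if "textarea" in controls or "password" in controls:
        return False
    return len(controls) >= min_controls


def query_forms(html):
    """Return the control lists of the forms that pass both rules."""
    scanner = FormScanner()
    scanner.feed(html)
    return [form for form in scanner.forms if is_query_interface(form)]
```

A login form is rejected by the PASSWORD rule, and a site-search box with one text field and one submit button is rejected by the control-count rule.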
In a further technical solution, a query interface threshold is set; when the number of distinct query interfaces already found on a site exceeds this threshold, the links of that site are discarded directly and no longer added to the link queue.
In a preferred technical solution, the query interface threshold is an integer between 5 and 8.
By applying the above technical solution, the present invention has the following advantages over the prior art:
1. Because the invention uses a page classifier to judge the probability that a page's topic matches the queried topic, it effectively prevents topic drift, achieves focused crawling, greatly reduces the processing load, and improves the efficiency of Deep Web data source discovery;
2. Because the invention maintains two queues, the site-root link queue and the local link queue, it can effectively monitor the link depth within the site being processed and stop processing when the link depth exceeds 3. Since 91.6% of Deep Web query interfaces are on pages at a depth of 3 or less, this effectively reduces the processing load while preserving accuracy;
3. The invention takes into account both the weight of each site and the relevance of each link within the current site when adjusting the crawl order. It is a highly efficient method for acquiring Deep Web data sources, can improve work efficiency over a wide range of applications, and provides a basis for further integration of Deep Web data sources.
Description of drawings
Figure 1 is a schematic diagram of the framework of the Deep Web data source focused crawler system of Embodiment 1 of the present invention;
Figure 2 is a schematic diagram of the focused crawling algorithm of Embodiment 1.
Embodiment
The present invention is further described below with reference to the drawings and embodiments:
Embodiment 1: Referring to Figures 1 and 2, a data source discovery method for Deep Web data source integration comprises the following steps:
(1) A topic for the data to be queried is given; a site-root link queue and a local link queue are constructed; at least one seed root link address is placed in the site-root link queue, and each is assigned a weight according to its relation to the topic;
(2) If the local link queue is empty, the root link address with the greatest weight is taken from the site-root link queue and placed in the local link queue; the page link with the highest score is taken from the local link queue, and the crawling module downloads that page;
(3) The page downloaded in step (2) is processed by the form classifier; if it contains a form query interface, it is added to the Deep Web data sources;
(4) The page downloaded in step (2) is processed by the page classifier, which uses a best-first strategy to judge the topic; if the topic score is below a set threshold, the method returns to step (2);
(5) The link addresses in the page are extracted, and the link classifier judges whether each link address is likely to point to a page containing a form interface and scores the link. The link classifier extracts the anchor text, the text surrounding the link, the link address, and the address of any image in the link as features, segments this information into words and counts word frequencies to obtain the feature vector X of the link, and classifies the link information with the naive Bayes method. For a link whose score is greater than a set value: if it is a local link, it is placed in the local link queue; if it is an external-site link, the site-root link queue is searched. When the corresponding site root link already exists, its weight is adjusted according to the score of this link; when it does not exist, the site root link of this link is added to the site-root link queue, and its weight is set according to the score;
(6) Steps (2) to (5) are repeated, achieving automatic crawling of Deep Web data sources.
The system framework of the Deep Web data source focused crawler system that implements the above method is shown in Figure 1. Each module is described in detail below:
1. Link classifier
The link classifier judges whether a link URL is likely to point to a page containing a form interface and scores the link. The features it extracts are mainly the anchor text, the text surrounding the link, the URL, and the address of any image in the link. We observed that many links use an image in place of anchor text, so the image address is also taken into account. After this information is segmented into words and the word frequencies are counted, the feature vector X of the link is obtained. The naive Bayes method is then used to classify the link information.
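The feature extraction described above can be sketched as a word-frequency count over the four text sources. This is an illustrative assumption rather than the system's tokenizer: the function name `link_features` is hypothetical, and a simple regular expression over lowercase letters and digits stands in for real word segmentation, which for Chinese pages would need a dedicated segmenter.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of ASCII letters and digits, lowercased.
TOKEN = re.compile(r"[a-z0-9]+")


def link_features(anchor_text, context_text, url, image_src=""):
    """Return the word-frequency feature vector X of one link."""
    features = Counter()
    for text in (anchor_text, context_text, url, image_src):
        features.update(TOKEN.findall(text.lower()))
    return features
```

For a link whose anchor is an image, `anchor_text` would be empty and the image file name (for example a hypothetical `/img/search_btn.gif`) supplies the informative tokens, which is the reason the source gives for including the image address.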
2. Page classifier
The page classifier uses a best-first strategy and judges whether a fetched page P belongs to the current topic. Only when P belongs to the current topic are the links and query interfaces in P processed further. The page classifier is first trained with page instances, part of which are obtained from the Yahoo category directory. Then, for each new page P obtained from the crawler, the trained page classifier analyzes the content of P and assigns P a score that reflects the probability that P belongs to the current topic. Only when this score is greater than a preset threshold θ are the links and query interfaces in P processed further.
3. Form classifier
Because the goal is to collect Deep Web data sources, forms that are not Deep Web query interfaces must be removed, such as member login and mailing-list subscription forms, which are of no use to the present invention. To this end, the query interface region is determined according to heuristic rules. For example, some web forms contain a TEXTAREA control or a PASSWORD control; based on practical experience, such forms can be judged directly not to be query interfaces. In addition, a threshold can be set on the number of controls in a web form: when the number of controls is below the threshold, the form is considered not to be a query interface. For example, the site-search forms of some websites have very few elements, only a text box and a submit button; not enough information can be obtained from such forms, so they are assigned to the non-query-interface class.
4. Crawling module
The crawling module is multithreaded to increase the processing speed of the system. After crawling for some time, the number of links in the to-be-crawled queues grows geometrically, memory is consumed rapidly, and CPU utilization becomes very low. The memory occupied by the relevant data structures is therefore capped: when a structure's size exceeds a certain value, its data is written to disk using persistence (serialization).
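One way to cap the memory used by a link queue is to serialize part of it to disk once it grows past a limit. The sketch below is an assumption about how this could look, not the system's actual data structure: the class name `SpillingQueue`, the choice to spill the newer half, and the use of `pickle` files in a temporary directory are all hypothetical. It ignores score ordering and is not thread-safe, so a multithreaded crawler would need a lock around it.

```python
import os
import pickle
import tempfile


class SpillingQueue:
    """Queue that keeps at most max_in_memory items in RAM."""

    def __init__(self, max_in_memory=1000, directory=None):
        self.max_in_memory = max_in_memory
        self.directory = directory or tempfile.mkdtemp(prefix="crawl_queue_")
        self.items = []
        self.spill_files = []
        self._file_counter = 0

    def push(self, item):
        self.items.append(item)
        if len(self.items) > self.max_in_memory:
            self._spill()

    def _spill(self):
        # Keep the older half in memory and serialize the newer half.
        keep = len(self.items) // 2
        chunk, self.items = self.items[keep:], self.items[:keep]
        path = os.path.join(self.directory, f"spill_{self._file_counter}.pkl")
        self._file_counter += 1
        with open(path, "wb") as handle:
            pickle.dump(chunk, handle)
        self.spill_files.append(path)

    def pop(self):
        """Return the next item, reloading a spilled chunk when RAM is empty."""
        if not self.items and self.spill_files:
            path = self.spill_files.pop()
            with open(path, "rb") as handle:
                self.items = pickle.load(handle)
            os.remove(path)
        return self.items.pop(0) if self.items else None
```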
The crawl stop condition is based on studies showing that each Deep Web site contains only 4.2 query interfaces on average. Therefore, once the number of distinct query interfaces found on a site, or the number of pages downloaded from it, exceeds a certain threshold, the links in that site are no longer processed.
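The per-site stop condition amounts to two counters per site. In the following sketch the class name `SiteBudget` and both default limits are hypothetical; the interface limit of 6 is chosen only because it lies in the 5 to 8 range named as preferred in the summary, and the source gives no value for the page limit. Query interfaces are identified here by an arbitrary signature string so that the same form seen on several pages is counted once.

```python
from collections import defaultdict


class SiteBudget:
    """Track distinct query interfaces and downloaded pages per site."""

    def __init__(self, max_interfaces=6, max_pages=200):
        self.max_interfaces = max_interfaces   # assumed value within 5 to 8
        self.max_pages = max_pages             # hypothetical page limit
        self.interfaces = defaultdict(set)
        self.pages = defaultdict(int)

    def record_page(self, site, interface_signatures=()):
        """Call once per downloaded page with the interfaces found on it."""
        self.pages[site] += 1
        self.interfaces[site].update(interface_signatures)

    def exhausted(self, site):
        """True when further links of this site should be discarded."""
        return (len(self.interfaces[site]) > self.max_interfaces
                or self.pages[site] > self.max_pages)
```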
5. To-be-crawled link queues
The system has two main to-be-crawled queues: the "local link queue" and the "site-root link queue". According to a survey, 91.6% of Deep Web query interfaces are on pages at a depth of 3 or less, so a link whose depth is greater than 3 is not processed. The "local link queue" holds the links of the current site that are waiting to be crawled, ordered by score from high to low. For links in pages of the current site that point to external sites, the home page address of the linked site and the weight of that site are stored in the "site-root link queue". The weight of each site in the "site-root link queue" is updated continually during crawling. The update rule is: when a newly discovered link belonging to a site has a very high score, the site's weight is increased; otherwise the weight is decreased. The sites in the "site-root link queue" are ordered by weight from high to low. During crawling, when the "local link queue" is empty, the home page address of the site with the highest weight is taken from the "site-root link queue" and placed in the "local link queue", starting a new round of crawling.
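The weight update rule for the "site-root link queue" can be sketched as follows. The source only says that a very high link score raises a site's weight and anything else lowers it; the class name `SiteRootQueue`, the cut-off `high_score`, and the fixed adjustment `step` are hypothetical values chosen for illustration.

```python
class SiteRootQueue:
    """Site-root link queue: maps a site's home address to its weight."""

    def __init__(self, high_score=0.8, step=0.1):
        self.high_score = high_score   # assumed cut-off for a "very high" score
        self.step = step               # assumed size of each adjustment
        self.weights = {}

    def observe(self, root, link_score):
        """Record a newly found link that belongs to the site at root."""
        if root not in self.weights:
            # New site: the initial weight is set from the link score.
            self.weights[root] = link_score
        elif link_score >= self.high_score:
            self.weights[root] += self.step
        else:
            self.weights[root] -= self.step

    def pop_best(self):
        """Remove and return the home address with the highest weight."""
        if not self.weights:
            return None
        root = max(self.weights, key=self.weights.get)
        del self.weights[root]
        return root
```

When the "local link queue" runs empty, the address returned by `pop_best` would be placed in it to start the next round of crawling.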
Deep Web data source focused crawling algorithm
The core algorithm of Deep Web data source focused crawler as shown in Figure 2.Integrated in order later on the query interface in each field to be carried out respectively, present embodiment is creeped respectively to the website of different field (as career field, automotive field).
Link to be creeped is left in the formation of waiting to creep, and the formation of creeping is put in the link of having visited.When determining a link whether will join to wait to creep formation, consider three problems: 1. whether the degree of depth of this link is smaller or equal to 3 (because the degree of depth of 91.6% the Deep Web query interface place page is smaller or equal to 3).2. whether the content that should link the place page is relevant with current theme.If content of pages and theme are irrelevant, then do not consider link wherein.3. whether this link might point to the page that contains query interface.
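The three questions combine into one admission test. In this sketch the function name `should_enqueue` and the two score thresholds are hypothetical; only the depth limit of 3 comes from the text.

```python
def should_enqueue(link_depth, page_topic_score, link_score,
                   max_depth=3, topic_threshold=0.5, link_threshold=0.5):
    """Decide whether a link joins the to-be-crawled queue."""
    if link_depth > max_depth:                 # question 1: depth at most 3
        return False
    if page_topic_score <= topic_threshold:    # question 2: page is on topic
        return False
    return link_score > link_threshold         # question 3: likely query page
```

Checking depth first is the cheapest test, so the classifier scores matter only for links that are shallow enough to be kept.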

Claims (6)

1. A data source discovery method for Deep Web data source integration, characterized in that it comprises the following steps:
(1) A topic for the data to be queried is given; a site-root link queue and a local link queue are constructed; at least one seed root link address is placed in the site-root link queue, and each is assigned a weight according to its relation to the topic;
(2) If the local link queue is empty, the root link address with the greatest weight is taken from the site-root link queue and placed in the local link queue; the page link with the highest score is taken from the local link queue, and the crawling module downloads that page;
(3) The page downloaded in step (2) is processed by the form classifier; if it contains a form query interface, it is added to the Deep Web data sources;
(4) The page downloaded in step (2) is processed by the page classifier, which uses a best-first strategy to judge the topic; if the topic score is below a set threshold, the method returns to step (2);
(5) The link addresses in the page are extracted, and the link classifier judges whether each link address is likely to point to a page containing a form interface and scores the link. The link classifier extracts the anchor text, the text surrounding the link, the link address, and the address of any image in the link as features, segments this information into words and counts word frequencies to obtain the feature vector X of the link, and classifies the link information with the naive Bayes method. For a link whose score is greater than a set value: if it is a local link, it is placed in the local link queue; if it is an external-site link, the site-root link queue is searched. When the corresponding site root link already exists, its weight is adjusted according to the score of this link; when it does not exist, the site root link of this link is added to the site-root link queue, and its weight is set according to the score;
(6) Steps (2) to (5) are repeated, achieving automatic crawling of Deep Web data sources.
2. The data source discovery method for Deep Web data source integration according to claim 1, characterized in that: in step (5), a local link whose link depth is greater than 3 is discarded and not placed in the local link queue.
3. The data source discovery method for Deep Web data source integration according to claim 1, characterized in that: the page classifier is first trained with page instances; each new page obtained from the crawler is then analyzed and scored by the trained page classifier; the score reflects the probability that the page belongs to the current topic; and only when the score is greater than a preset threshold θ are the links and query interfaces in the page processed further.
4. The data source discovery method for Deep Web data source integration according to claim 1, characterized in that: the form classifier determines the query interface region according to heuristic rules, and a page is added to the Deep Web data sources only when a form in the page is a query-interface form; the heuristic rules are that a web form that includes a TEXTAREA control or a PASSWORD control is not a query interface, and that a web form with fewer than 3 controls is not a query interface.
5. The data source discovery method for Deep Web data source integration according to claim 1, characterized in that: a query interface threshold is set; when the number of distinct query interfaces already found on a site exceeds the query interface threshold, the links of that site are discarded directly and no longer added to the link queue.
6. The data source discovery method for Deep Web data source integration according to claim 5, characterized in that: the query interface threshold is an integer between 5 and 8.
CNB2007100218834A 2007-05-09 2007-05-09 Integrated data source finding method for deep layer net page data source Expired - Fee Related CN100452054C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100218834A CN100452054C (en) 2007-05-09 2007-05-09 Integrated data source finding method for deep layer net page data source

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007100218834A CN100452054C (en) 2007-05-09 2007-05-09 Integrated data source finding method for deep layer net page data source

Publications (2)

Publication Number Publication Date
CN101051313A true CN101051313A (en) 2007-10-10
CN100452054C CN100452054C (en) 2009-01-14

Family

ID=38782726

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100218834A Expired - Fee Related CN100452054C (en) 2007-05-09 2007-05-09 Integrated data source finding method for deep layer net page data source

Country Status (1)

Country Link
CN (1) CN100452054C (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101916272A (en) * 2010-08-10 2010-12-15 南京信息工程大学 Data source selection method for deep web data integration
CN102103636A (en) * 2011-01-18 2011-06-22 南京信息工程大学 Deep web-oriented incremental information acquisition method
CN102739679A (en) * 2012-06-29 2012-10-17 东南大学 URL(Uniform Resource Locator) classification-based phishing website detection method
CN102117275B (en) * 2009-12-31 2012-11-07 北大方正集团有限公司 Method and device for collecting webpage data of direction site based on internet
CN101261634B (en) * 2008-04-11 2012-11-21 哈尔滨工业大学深圳研究生院 Studying method and system based on increment Q-Learning
CN104317845A (en) * 2014-10-13 2015-01-28 安徽华贞信息科技有限公司 Method and system for automatic extraction of deep web data
CN104462241A (en) * 2014-11-18 2015-03-25 北京锐安科技有限公司 Population property classification method and device based on anchor texts and peripheral texts in URLs
CN105843965A (en) * 2016-04-20 2016-08-10 广州精点计算机科技有限公司 Deep web crawler form filling method and device based on URL (uniform resource locator) subject classification
CN106326447A (en) * 2016-08-26 2017-01-11 北京量科邦信息技术有限公司 Detection method and system of data captured by crowd sourcing network crawlers
CN103678371B (en) * 2012-09-14 2017-10-10 富士通株式会社 Word library updating device, data integration device and method and electronic equipment
CN107784034A (en) * 2016-08-31 2018-03-09 北京搜狗科技发展有限公司 The recognition methods of page classification and device, the device for the identification of page classification
CN108090200A (en) * 2017-12-22 2018-05-29 中央财经大学 A kind of sequence type hides the acquisition methods of grid database data
CN108829792A (en) * 2018-06-01 2018-11-16 成都康乔电子有限责任公司 Distributed darknet excavating resource system and method based on scrapy
CN109101600A (en) * 2018-08-01 2018-12-28 沈文策 The crawling method and device of dynamic data in a kind of webpage
CN110765336A (en) * 2019-11-01 2020-02-07 北京天融信网络安全技术有限公司 Webpage information processing method and system
CN112486989A (en) * 2020-11-28 2021-03-12 河北省科学技术情报研究院(河北省科技创新战略研究院) Multi-source data granulation fusion and index classification and layering processing method
CN113360798A (en) * 2021-06-02 2021-09-07 北京百度网讯科技有限公司 Flooding data identification method, device, equipment and medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346748B (en) * 2014-11-25 2018-05-25 新浪网技术(中国)有限公司 Information displaying method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6988100B2 (en) * 2001-02-01 2006-01-17 International Business Machines Corporation Method and system for extending the performance of a web crawler
CN100371932C (en) * 2004-03-23 2008-02-27 南京大学 Expandable and customizable theme centralized universile-web net reptile setup method
US20060161564A1 (en) * 2004-12-20 2006-07-20 Samuel Pierre Method and system for locating information in the invisible or deep world wide web
US20070100779A1 (en) * 2005-08-05 2007-05-03 Ori Levy Method and system for extracting web data
CN100401301C (en) * 2006-05-30 2008-07-09 南京大学 Body learning based intelligent subject-type network reptile system configuration method
CN100392658C (en) * 2006-05-30 2008-06-04 南京大学 Body-bused subject type network reptile system configuration method

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101261634B (en) * 2008-04-11 2012-11-21 哈尔滨工业大学深圳研究生院 Studying method and system based on increment Q-Learning
CN102117275B (en) * 2009-12-31 2012-11-07 北大方正集团有限公司 Method and device for collecting webpage data of direction site based on internet
CN101916272B (en) * 2010-08-10 2012-04-25 南京信息工程大学 Data source selection method for deep web data integration
CN101916272A (en) * 2010-08-10 2010-12-15 南京信息工程大学 Data source selection method for deep web data integration
CN102103636B (en) * 2011-01-18 2013-08-07 南京信息工程大学 Deep web-oriented incremental information acquisition method
CN102103636A (en) * 2011-01-18 2011-06-22 南京信息工程大学 Deep web-oriented incremental information acquisition method
CN102739679A (en) * 2012-06-29 2012-10-17 东南大学 URL(Uniform Resource Locator) classification-based phishing website detection method
CN103678371B (en) * 2012-09-14 2017-10-10 富士通株式会社 Word library updating device, data integration device and method and electronic equipment
CN104317845A (en) * 2014-10-13 2015-01-28 安徽华贞信息科技有限公司 Method and system for automatic extraction of deep web data
CN104462241A (en) * 2014-11-18 2015-03-25 北京锐安科技有限公司 Population property classification method and device based on anchor texts and peripheral texts in URLs
CN105843965A (en) * 2016-04-20 2016-08-10 广州精点计算机科技有限公司 Deep web crawler form filling method and device based on URL (uniform resource locator) subject classification
CN105843965B (en) * 2016-04-20 2019-06-04 广东精点数据科技股份有限公司 Deep web crawler form filling method and device based on URL (uniform resource locator) subject classification
CN106326447A (en) * 2016-08-26 2017-01-11 北京量科邦信息技术有限公司 Detection method and system of data captured by crowd sourcing network crawlers
CN107784034A (en) * 2016-08-31 2018-03-09 北京搜狗科技发展有限公司 Page type identification method and device for page type identification
CN107784034B (en) * 2016-08-31 2021-05-25 北京搜狗科技发展有限公司 Page type identification method and device for page type identification
CN108090200A (en) * 2017-12-22 2018-05-29 中央财经大学 Method for acquiring data from a ranked hidden-web database
CN108829792A (en) * 2018-06-01 2018-11-16 成都康乔电子有限责任公司 Distributed darknet resource mining system and method based on Scrapy
CN109101600A (en) * 2018-08-01 2018-12-28 沈文策 Method and device for crawling dynamic data in a webpage
CN110765336A (en) * 2019-11-01 2020-02-07 北京天融信网络安全技术有限公司 Webpage information processing method and system
CN110765336B (en) * 2019-11-01 2022-04-01 北京天融信网络安全技术有限公司 Webpage information processing method and system
CN112486989A (en) * 2020-11-28 2021-03-12 河北省科学技术情报研究院(河北省科技创新战略研究院) Multi-source data granulation fusion and index classification and layering processing method
CN112486989B (en) * 2020-11-28 2021-08-27 河北省科学技术情报研究院(河北省科技创新战略研究院) Multi-source data granulation fusion and index classification and layering processing method
CN113360798A (en) * 2021-06-02 2021-09-07 北京百度网讯科技有限公司 Flooding data identification method, device, equipment and medium
CN113360798B (en) * 2021-06-02 2024-02-27 北京百度网讯科技有限公司 Method, device, equipment and medium for identifying flooding data

Also Published As

Publication number Publication date
CN100452054C (en) 2009-01-14

Similar Documents

Publication Publication Date Title
CN100452054C (en) Integrated data source finding method for deep layer net page data source
CN1240011C (en) File classifying management system and method for operation system
CN1290036C (en) Computer system and method for establishing concept knowledge according to machine readable dictionary
CN101079056A (en) Retrieving method and system
CN103714149B (en) Self-adaptive incremental deep web data source discovery method
CN101079064A (en) Web page sequencing method and device
CN1755678A (en) System and method for incorporating anchor text into ranking of search results
CN1750002A (en) Method for providing research result
CN101055587A (en) Search engine retrieving result reordering method based on user behavior information
CN1804844A (en) Web page metadata based formalized description method for user access behaviors
CN111522905A (en) Document searching method and device based on database
CN106227788A (en) Database query method based on Lucene
Liakos et al. Focused crawling for the hidden web
Barrio et al. Sampling strategies for information extraction over the deep web
CN103064841A (en) Retrieval device and retrieval method
Shekhar et al. An architectural framework of a crawler for retrieving highly relevant web documents by filtering replicated web collections
CN108090200A (en) Method for acquiring data from a ranked hidden-web database
Deng Research on the focused crawler of mineral intelligence service based on semantic similarity
US20040205049A1 (en) Methods and apparatus for user-centered web crawling
CN110647673A (en) Method for realizing ecological environment space big data integration and sharing
CN106066875A (en) Efficient data capture method and system based on a deep web crawler
CN1209726C (en) Method for identifying mirror and quasi-mirror web sites over internet
Yadav et al. Architecture for parallel crawling and algorithm for change detection in web pages
Patil et al. Implementation of enhanced web crawler for deep-web interfaces
NAGAVEENA et al. A Smart Web Crawler: An Efficient Harvesting Deep-Web Interfaces Using Site Ranker And Adoptive Learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Free format text: FORMER OWNER: ZHAO PENGPENG FANG WEI

Owner name: SUZHOU PUDA NEW INFORMATION TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: CUI ZHIMING

Effective date: 20100401

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 215001 ROOM 403, BUILDING 115, SU'AN NEW VILLAGE, SUZHOU CITY, JIANGSU PROVINCE TO: 215021 B502-2, INSIDE OF INTERNATIONAL SCIENCE PARK, NO.1355, JINJIHU AVENUE, SUZHOU INDUSTRIAL PARK DISTRICT, SUZHOU CITY, JIANGSU PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20100401

Address after: 215021 international science and Technology Park, 1355 Jinji Lake Avenue, Suzhou Industrial Park, Suzhou, Jiangsu, B502-2

Patentee after: Suzhou Production Information Technology Co., Ltd.

Address before: 215001 Room 403, Building 115, Su'an New Village, Suzhou, Jiangsu

Co-patentee before: Zhao Pengpeng

Patentee before: Cui Zhiming

Co-patentee before: Fang Wei

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20071010

Assignee: SUZHOU SOUKE INFORMATION TECHNOLOGY CO., LTD.

Assignor: Suzhou Production Information Technology Co., Ltd.

Contract record no.: 2013320010066

Denomination of invention: Integrated data source finding method for deep layer net page data source

Granted publication date: 20090114

License type: Exclusive License

Record date: 20130412

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20161010

Address after: 215021 Jiangsu Suzhou City Canglang District liberation Village 5 403 room

Patentee after: Shu Lan

Address before: 215021 international science and Technology Park, 1355 Jinji Lake Avenue, Suzhou Industrial Park, Suzhou, Jiangsu, B502-2

Patentee before: Suzhou Production Information Technology Co., Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090114

Termination date: 20180509