CN101051313A

CN101051313A - Data source discovery method for Deep Web data source integration

Info

Publication number: CN101051313A
Application number: CN 200710021883
Authority: CN
Inventors: 崔志明; 赵朋朋; 方巍
Original assignee: 崔志明; 赵朋朋; 方巍
Current assignee: Shu Lan
Priority date: 2007-05-09
Filing date: 2007-05-09
Publication date: 2007-10-10
Anticipated expiration: 2027-05-09
Also published as: CN100452054C

Abstract

A data source discovery method for Deep Web data integration comprises: constructing a site root link queue and a local link queue; taking the highest-scoring page link from the local link queue and downloading the page with a crawler module; processing the downloaded page with a form classifier and, if the page contains a form query interface, adding it to the Deep Web data sources; processing the downloaded page with a page classifier and, if the topic score is below a threshold, returning to the step of taking a page link; and extracting the link addresses in the page and placing them in the local link queue. Repeating the steps from taking a page link to extracting link addresses realizes automatic crawling of Deep Web data sources.

Description

Data source discovery method for Deep Web data source integration

Technical field

The present invention relates to a method for discovering network-based data sources, and in particular to a method for discovering Deep Web data sources that are accessed through web query interfaces, for use in the integration of Deep Web data sources.

Background technology

With the widespread use of web databases, the Web is rapidly "deepening". A large number of pages on the Internet are generated dynamically by back-end databases. This information cannot be reached directly through static links; it can only be obtained by filling in a form and submitting a query. Because traditional web crawlers cannot fill in forms, they cannot retrieve these pages. Existing search engines therefore fail to find this content, which remains hidden and invisible to users. Such pages are called the Deep Web (also known as the Invisible Web or Hidden Web). The Deep Web is the counterpart of the Surface Web; the concept was first proposed by Dr. Jill Ellsworth in 1994 and refers to web pages whose content is difficult for general-purpose search engines to find. Deep Web information is generally stored in databases. Compared with static pages, it is usually larger in volume, more focused in topic, higher in quality, better structured, and faster growing. Studies show that the Deep Web holds about 500 times as much information as the Surface Web and that there are nearly 450,000 Deep Web sites. Large-scale Deep Web data integration is an effective way to make Deep Web information convenient for users.

Large-scale integrated search over the Deep Web requires solving five key problems: 1) data source discovery (Deep Web Discovery); 2) query interface extraction (Query Interface Extraction); 3) data source classification (Source Classification); 4) query translation (Query Transfer); and 5) result merging (Result Merging).

The prerequisite for classified, integrated search over the Deep Web is obtaining Deep Web query interfaces, which falls within the scope of data source discovery.

In "Toward Large-Scale Integration: Building a MetaQuerier over Databases on the Web" (Conference on Innovative Data Systems Research, Asilomar, 2005), K. C.-C. Chang, B. He, and Z. Zhang disclose a method for obtaining query interfaces from the Web. The method first collects a list of IP addresses that provide WWW service; then, for each IP address in the list, it crawls web pages within a certain depth using a breadth-first strategy and extracts query interfaces from the downloaded pages. However, pages containing query interfaces make up only a very small fraction of the Internet, and breadth-first search is a blind strategy, so this method downloads a large number of irrelevant pages and is very inefficient.

An effective way to solve this problem is focused crawling. So far there has been relatively little research on applying focused crawlers to Deep Web data source discovery. Some researchers use a link classifier to preferentially download the pages most likely to lead to pages containing query interfaces. To train the classifier, they use search engines such as Google to obtain all the outer-layer pages that point to an inner-layer page. The drawback is that the further out one goes, the more pages there are, and many of them are irrelevant, which leads to problems such as "topic drift". In addition, this method cannot obtain accurate information about a page's depth within its site and therefore cannot control the crawling process well.

Summary of the invention

The object of the present invention is to provide a data source discovery method for Deep Web data source integration that, for a given topic, retrieves and downloads topic-relevant data query interfaces, reduces the number of pages downloaded, and solves the topic drift problem.

To achieve the above object, the technical solution adopted by the present invention is a data source discovery method for Deep Web data source integration, comprising the following steps:

(1) Given the topic of the data to be queried, construct a site root link queue and a local link queue; put at least one seed root link address into the site root link queue, and assign each a weight according to its relation to the topic;

(2) If the local link queue is empty, take the root link address with the greatest weight from the site root link queue and put it into the local link queue; take the highest-scoring page link from the local link queue and download the page with the crawler module;

(3) Process the page downloaded in step (2) with the form classifier; if the page contains a form query interface, add it to the Deep Web data sources;

(4) Process the page downloaded in step (2) with the page classifier, which uses a best-first strategy to make the topic judgment; if the topic score is below the set threshold, return to step (2);

(5) Extract the link addresses in the page, use the link classifier to judge whether each link address may point to a page containing a form interface, and score the link. The link classifier extracts as features the anchor text, the text surrounding the link, the link address, and the address of any image in the link; it segments this information into words and counts word frequencies to obtain the feature vector X of the link, and classifies the link information with the naive Bayes method. For a link whose score is greater than the set value: if it is a local link, put it into the local link queue; if it is an external site link, search the site root link queue. When the corresponding site root link exists, adjust its weight according to the score of this link; when it does not exist, add the site root link of this link to the site root link queue and set the weight of the root link according to the score;

(6) Repeat steps (2) to (5) to crawl Deep Web data sources automatically.
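Steps (1) to (6) can be read as a single loop. The following Python sketch is illustrative only and is not the patented implementation: `fetch`, `has_query_form`, `topic_score`, and `score_links` are hypothetical callables standing in for the crawler module and the three classifiers, the threshold values are placeholders, and the weight adjustment for a known site root (keeping the larger value) is a simplifying assumption because the text does not give an exact rule.

```python
from urllib.parse import urlparse

def site_root(url):
    """Return scheme://host/ for a URL; used to tell local from external links."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"

def crawl(seeds, fetch, has_query_form, topic_score, score_links,
          topic_threshold=0.5, link_threshold=0.5, max_pages=100):
    """seeds maps each seed root URL to its initial weight (step 1)."""
    site_roots = dict(seeds)   # site root link queue: root URL -> weight
    local = {}                 # local link queue: URL -> score
    visited, sources = set(), []
    for _ in range(max_pages):
        if not local:                              # step 2: refill from heaviest root
            if not site_roots:
                break
            root = max(site_roots, key=site_roots.get)
            local[root] = site_roots.pop(root)
        url = max(local, key=local.get)            # step 2: highest-scoring link
        del local[url]
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if has_query_form(page):                   # step 3: form classifier
            sources.append(url)
        if topic_score(page) < topic_threshold:    # step 4: page classifier
            continue
        for link, score in score_links(page):      # step 5: link classifier
            if score <= link_threshold or link in visited:
                continue
            if site_root(link) == site_root(url):
                local[link] = max(score, local.get(link, 0.0))
            else:
                root = site_root(link)
                # Assumed update rule: keep the best score seen for this site.
                site_roots[root] = max(score, site_roots.get(root, 0.0))
    return sources
```

The two queues are plain dictionaries here to keep the sketch short; a priority queue would serve the same purpose for larger crawls.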

In the above technical solution, a "local link" is a page link that has the same site root link as the page being processed. The "page classifier" uses the best-first strategy and judges whether a fetched page P belongs to the current topic. Only when P belongs to the current topic are the links and query interfaces in P processed further. The "link classifier" judges whether a link URL may point to a page containing a form interface and scores the link. These classification methods are prior art; in general, a classifier is created automatically by learning from a set of pre-classified training texts, and test texts are then classified by supervised learning. The naive Bayes classifier assumes that the components of the feature vector are mutually independent given the decision variable. For a test sample with feature vector $X = [x_1, x_2, \ldots, x_d]^T$, the probability that it belongs to class $C_i$ is given by formula (1):

$$P(C_i \mid X) = \frac{P(C_i)}{P(X)} \prod_{j=1}^{d} P(x_j \mid C_i) \tag{1}$$

Here $P(C_i \mid X)$ is the probability that X belongs to class $C_i$. The probability above is computed for every class, and the final recognition result is the class with the greatest probability.
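As an illustration of formula (1), the sketch below picks the class with the greatest posterior. Because P(X) is the same for every class, it can be dropped when comparing classes, and the product is computed as a sum of logarithms to avoid underflow. The function names, the toy training samples, and the add-one (Laplace) smoothing are assumptions for illustration; the source does not specify how the conditional probabilities are estimated.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (tokens, label). Returns priors, per-class word counts, vocabulary."""
    class_docs = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in samples for t in tokens}
    priors = {c: n / len(samples) for c, n in class_docs.items()}
    return priors, word_counts, vocab

def classify(tokens, priors, word_counts, vocab):
    """Return the class maximising log P(C_i) + sum_j log P(x_j | C_i)."""
    best, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            # Add-one smoothing: an assumption, not stated in the source.
            score += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

samples = [
    (["search", "jobs"], "form"),
    (["advanced", "search"], "form"),
    (["about", "us"], "other"),
    (["contact", "us"], "other"),
]
model = train(samples)
print(classify(["search"], *model))  # prints "form"
```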

Using the page classifier to make the topic judgment effectively avoids topic drift.

In a further technical solution, in step (5), a local link whose link depth is greater than 3 is discarded and not put into the local link queue. According to a survey, 91.6% of the pages containing Deep Web query interfaces lie at a depth of 3 or less; therefore, not processing links deeper than 3 effectively reduces the processing load while preserving accuracy.

In the above technical solution, the page classifier is first trained with page instances. Each new page obtained from the crawler is then analyzed and scored by the trained page classifier; the score reflects the probability that the page belongs to the current topic. Only when the score is greater than a preset threshold θ are the links and query interfaces in the page processed further.

In the above technical solution, the form classifier determines the query interface region according to heuristic rules, and a page is added to the Deep Web data sources only when a form in the page is a query-interface form. The heuristic rules are: a web form containing a TEXTAREA control or a PASSWORD control is not a query interface; and a web form with fewer than 3 controls is not a query interface.
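The two heuristic rules can be sketched with Python's standard `html.parser` module. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, and what counts as a "control" (visible `input` elements plus `select`, `textarea`, and `button`, with hidden inputs excluded) is a choice made here, not something the source defines.

```python
from html.parser import HTMLParser

class FormScanner(HTMLParser):
    """Collects, for each <form>, the list of control types it contains."""
    def __init__(self):
        super().__init__()
        self.forms = []        # one list of control types per closed form
        self._current = None   # controls of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._current = []
        elif self._current is not None:
            if tag == "input":
                kind = (dict(attrs).get("type") or "text").lower()
                if kind != "hidden":   # assumption: hidden fields are not counted
                    self._current.append(kind)
            elif tag in ("select", "textarea", "button"):
                self._current.append(tag)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

def is_query_interface(controls, min_controls=3):
    """Rule 1: no TEXTAREA or PASSWORD control. Rule 2: at least 3 controls."""
    if "textarea" in controls or "password" in controls:
        return False
    return len(controls) >= min_controls

def query_forms(html):
    """Return the control lists of the forms that pass both rules."""
    scanner = FormScanner()
    scanner.feed(html)
    return [c for c in scanner.forms if is_query_interface(c)]
```

Under these assumptions a login form (text, password, submit) is rejected by the first rule, and a site-search box (text, submit) is rejected by the second.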

In a further technical solution, a query interface threshold is set; when the number of distinct query interfaces already found on a site exceeds the query interface threshold, the links of that site are discarded directly and no longer added to the link queues.

In a preferred technical solution, the query interface threshold is an integer from 5 to 8.

By applying the above technical solution, the present invention has the following advantages over the prior art:

1. Because the present invention uses the page classifier to judge the probability that a page's topic matches the topic being queried, it effectively prevents topic drift, achieves focused crawling, greatly reduces the processing load, and improves the efficiency of Deep Web data source discovery;

2. Because the present invention provides two queues, the site root link queue and the local link queue, it can effectively monitor the link depth within the site being processed and stop processing when the link depth exceeds 3. Since 91.6% of the pages containing Deep Web query interfaces lie at a depth of 3 or less, the processing load is effectively reduced while accuracy is preserved;

3. The present invention takes into account both the weight of each site and the relevance of each link within the current site when adjusting the crawl order. It is a highly efficient method for acquiring Deep Web data sources, can improve work efficiency across a wide range of uses, and lays a foundation for the further integration of Deep Web data sources.

Description of drawings

Figure 1 is a schematic diagram of the framework of the Deep Web data source focused crawler system of Embodiment 1 of the present invention;

Figure 2 is a schematic diagram of the focused crawling algorithm of Embodiment 1.

Embodiment

The present invention is further described below with reference to the drawings and an embodiment:

Embodiment 1: Referring to Figures 1 and 2, a data source discovery method for Deep Web data source integration comprises the following steps:

The system framework of the Deep Web data source focused crawler system that implements the above method is shown in Figure 1. Each module is described in detail below:

1. Link classifier

The link classifier judges whether a link URL may point to a page containing a form interface and scores the link. The features it extracts are mainly the anchor text, the text surrounding the link, the URL, and the address of any image in the link. We observed that many links use an image in place of anchor text, so the image address is taken into account as well. After this information is segmented into words and word frequencies are counted, the feature vector X of the link is obtained. The naive Bayes method is then used to classify the link information.
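A minimal sketch of the feature extraction step follows. The function names are hypothetical, and the tokenizer simply lowercases and splits on non-alphanumeric characters, which is an assumption that suits English text only; Chinese text would need a proper word segmenter, which the source does not name.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def link_features(anchor_text, context_text, url, image_src=""):
    """Word-frequency vector over the four feature sources named in the text."""
    features = Counter()
    for part in (anchor_text, context_text, url, image_src):
        features.update(tokenize(part))
    return features

x = link_features("Advanced Search", "Find used cars",
                  "http://example.com/search/cars", "/img/search_btn.gif")
print(x["search"], x["cars"])  # prints "3 2"
```

The resulting counter plays the role of the feature vector X that is passed to the naive Bayes classifier.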

2. Page classifier

The page classifier uses the best-first strategy and judges whether a fetched page P belongs to the current topic. Only when P belongs to the current topic are the links and query interfaces in P processed further. The page classifier is first trained with page instances, some of which are taken from the Yahoo category directory. Then, for each new page P obtained from the crawler, the trained page classifier analyzes the content of P and assigns P a score that reflects the probability that P belongs to the current topic. Only when this score is greater than a preset threshold θ are the links and query interfaces in P processed further.

3. Form classifier

Because the goal is to collect Deep Web data sources, forms that are not Deep Web query interfaces, such as member login and mailing-list subscription forms, are of no use to the present invention and must be removed. To this end, the query interface region is determined by a set of heuristic rules. For example, some web forms contain a TEXTAREA control or a PASSWORD control, and practical experience shows that such forms can be judged directly not to be query interfaces. In addition, a threshold can be set on the number of controls in a web form; when a form has fewer controls than this threshold, it can be regarded as not being a query interface. For example, some site-search forms have very few elements, only a text box and a submit button; too little information can be obtained from such forms, so they are classed as non-query interfaces.

4. Crawler module

The crawler module uses multithreading to increase the processing speed of the system. After crawling for some time, the number of links in the to-crawl queue grows geometrically, memory is consumed very quickly, and CPU utilization becomes very low. The memory occupied by the relevant data structures is therefore capped: when a structure grows beyond a certain size, its data is written to disk using persistence (serialization).
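One way to cap memory as described is to spill the in-memory batch to disk once it passes a size limit. The sketch below uses Python's `pickle` for serialization; the class name, the item-count limit, and the batch file naming are all hypothetical, and the source does not say which data structures are persisted or in what format.

```python
import os
import pickle
import tempfile

class SpillingBuffer:
    """Holds at most `limit` items in memory; overflowing batches are pickled to disk."""
    def __init__(self, limit, directory=None):
        self.limit = limit
        self.items = []
        self.directory = directory or tempfile.mkdtemp()
        self.files = []   # paths of spilled batches, oldest first

    def add(self, item):
        self.items.append(item)
        if len(self.items) > self.limit:
            self._spill()

    def _spill(self):
        """Serialize the whole in-memory batch to a new file and clear it."""
        path = os.path.join(self.directory, f"batch_{len(self.files)}.pkl")
        with open(path, "wb") as fh:
            pickle.dump(self.items, fh)
        self.files.append(path)
        self.items = []

    def drain(self):
        """Yield every item, reading spilled batches back from disk first."""
        for path in self.files:
            with open(path, "rb") as fh:
                yield from pickle.load(fh)
            os.remove(path)
        self.files = []
        yield from self.items
        self.items = []
```

A multithreaded crawler would also need a lock around `add` and `drain`; that is left out here to keep the sketch short.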

The crawl stop condition is based on research showing that each Deep Web site contains only 4.2 query interfaces on average. When the number of distinct query interfaces found on a site, or the number of pages downloaded from it, exceeds a certain threshold, the links in that site are no longer processed.
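The per-site stop condition can be tracked with two counters. In this sketch the class name and the interface identifiers are hypothetical, the default interface threshold of 6 is simply one value from the 5 to 8 range given elsewhere in the text, and the page limit of 200 is a placeholder because the source gives no figure.

```python
from collections import defaultdict

class SiteBudget:
    """Tracks, per site, the distinct query interfaces found and pages downloaded."""
    def __init__(self, max_interfaces=6, max_pages=200):
        self.max_interfaces = max_interfaces   # text suggests an integer from 5 to 8
        self.max_pages = max_pages             # placeholder; no value in the source
        self.interfaces = defaultdict(set)
        self.pages = defaultdict(int)

    def record_page(self, site, interface_ids=()):
        """Count one downloaded page and any query interfaces found on it."""
        self.pages[site] += 1
        self.interfaces[site].update(interface_ids)

    def exhausted(self, site):
        """True once either threshold is exceeded, so the site's links are dropped."""
        return (len(self.interfaces[site]) > self.max_interfaces
                or self.pages[site] > self.max_pages)
```

Storing interface identifiers in a set means that the same interface seen on several pages is counted once, matching the "distinct query interfaces" wording.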

5. To-crawl link queues

The system has two main to-crawl queues: the "local link queue" and the "site root link queue". According to a survey, 91.6% of the pages containing Deep Web query interfaces lie at a depth of 3 or less, so a link whose depth is greater than 3 is not processed. The "local link queue" holds the links of the current site that are waiting to be crawled, ordered by score from high to low. For links in the current site's pages that point to external sites, the home page address of the linked site and the weight of that site are stored in the "site root link queue". The weight of each site in the "site root link queue" is updated continually during crawling. The update principle is: when a newly discovered link belonging to a site has a high score, the weight of that site is increased; otherwise, the weight is decreased. The sites in the "site root link queue" are ordered by weight from high to low. During crawling, when the "local link queue" is empty, the home page address of the site with the highest weight is taken from the "site root link queue" and put into the "local link queue", starting a new round of crawling.
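The two queues might be sketched as follows. The class names are hypothetical, and the weight update (add or subtract a fixed step depending on whether the link score reaches a `high` cutoff) is only one possible reading of "increase when the score is high, otherwise decrease"; the source does not give a formula or constants.

```python
import heapq

class LocalLinkQueue:
    """Links of the current site, popped in descending score order."""
    def __init__(self):
        self._heap = []

    def push(self, url, score):
        heapq.heappush(self._heap, (-score, url))   # negate: heapq is a min-heap

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)

class SiteRootQueue:
    """Site home page addresses with weights that change during the crawl."""
    def __init__(self, high=0.7, step=0.1):
        self.weights = {}
        self.high, self.step = high, step   # hypothetical tuning constants

    def observe(self, root, link_score):
        """Register a newly found link to `root` and adjust the site's weight."""
        if root not in self.weights:
            self.weights[root] = link_score
        elif link_score >= self.high:
            self.weights[root] += self.step
        else:
            self.weights[root] -= self.step

    def pop_heaviest(self):
        """Remove and return the (root, weight) pair with the greatest weight."""
        root = max(self.weights, key=self.weights.get)
        return root, self.weights.pop(root)
```

When the local queue runs empty, the crawler would call `pop_heaviest()` and push the returned home page address into a fresh `LocalLinkQueue`.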

Deep Web data source focused crawling algorithm

The core algorithm of the Deep Web data source focused crawler is shown in Figure 2. So that the query interfaces of each domain can later be integrated separately, this embodiment crawls the sites of different domains (such as the jobs domain and the automobile domain) separately.

Links waiting to be crawled are kept in the to-crawl queue, and links already visited are put into the crawled queue. Three questions are considered when deciding whether to add a link to the to-crawl queue: 1. Is the depth of the link less than or equal to 3 (since 91.6% of the pages containing Deep Web query interfaces lie at a depth of 3 or less)? 2. Is the content of the page containing the link relevant to the current topic? If the page content is unrelated to the topic, its links are not considered. 3. Is the link likely to point to a page containing a query interface?
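The three questions can be expressed as one small predicate. The function name and the two threshold defaults are hypothetical; only the depth limit of 3 comes from the text.

```python
MAX_DEPTH = 3   # from the text: 91.6% of query-interface pages lie at depth <= 3

def should_enqueue(link_depth, page_topic_score, link_score,
                   topic_threshold=0.5, link_threshold=0.5):
    """Apply the three checks in order; threshold values are placeholders."""
    if link_depth > MAX_DEPTH:                 # 1. depth check
        return False
    if page_topic_score < topic_threshold:     # 2. containing page is on topic
        return False
    return link_score > link_threshold         # 3. link likely leads to a query form

print(should_enqueue(2, 0.8, 0.9))  # prints "True"
print(should_enqueue(4, 0.8, 0.9))  # prints "False"
```

Checking depth first is the cheapest test, so deep links are rejected before either classifier score is consulted.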

Claims

1. A data source discovery method for Deep Web data source integration, characterized in that it comprises the following steps:

2. The data source discovery method for Deep Web data source integration according to claim 1, characterized in that: in step (5), a local link whose link depth is greater than 3 is discarded and not put into the local link queue.

3. The data source discovery method for Deep Web data source integration according to claim 1, characterized in that: the page classifier is first trained with page instances; each new page obtained from the crawler is then analyzed and scored by the trained page classifier, the score reflecting the probability that the page belongs to the current topic; and only when the score is greater than a preset threshold θ are the links and query interfaces in the page processed further.

4. The data source discovery method for Deep Web data source integration according to claim 1, characterized in that: the form classifier determines the query interface region according to heuristic rules, and a page is added to the Deep Web data sources only when a form in the page is a query-interface form; the heuristic rules are that a web form containing a TEXTAREA control or a PASSWORD control is not a query interface, and that a web form with fewer than 3 controls is not a query interface.

5. The data source discovery method for Deep Web data source integration according to claim 1, characterized in that: a query interface threshold is set, and when the number of distinct query interfaces already found on a site exceeds the query interface threshold, the links of that site are discarded directly and no longer added to the link queues.

6. The data source discovery method for Deep Web data source integration according to claim 5, characterized in that: the query interface threshold is an integer from 5 to 8.