CN101515287B

CN101515287B - Automatic generating method of wrapper of complex page

Info

Publication number: CN101515287B
Application number: CN2009100295613A
Authority: CN
Inventors: 崔志明; 方巍; 赵朋朋
Original assignee: SUZHOU PRODUCTION INFORMATION TECHNOLOGY Co Ltd
Current assignee: Shu Lan
Priority date: 2009-03-24
Filing date: 2009-03-24
Publication date: 2011-01-12
Anticipated expiration: 2029-03-24
Also published as: CN101515287A

Abstract

The invention discloses a method for automatically generating a wrapper for complex pages. The method comprises the following steps: (1) acquiring two HTML page documents based on the same template and generating an HTML Tag tree for each; (2) acquiring the minimum region DS containing the data record set; (3) acquiring the initial data records (DR) from the minimum region; (4) recording the layout combination relation of the initial data records, determining the aggregation relation of the extraction items according to the similarity of their feature items, semantically annotating the entities in the same aggregation block using domain knowledge, and recombining them into new data records according to the semantic relations among the entities; (5) generating the extraction rule of each aggregation block according to the position in the HTML Tag tree of the data records generated in step (4), and then constructing the wrapper. By analysing the structural relations of the HTML Tag tree, the invention can extract the true data record rules from complex pages and thereby automatically construct a wrapper with high extraction accuracy.

Description

A method for automatically generating a wrapper for complex pages

Technical field

The present invention relates to a method for recognizing information in Web pages, and in particular to a method, applicable to complex pages, for automatically generating a wrapper that extracts data from Deep Web pages.

Background art

Most Web pages on the Internet are presented in HTML, whose characteristics allow any organization or individual to publish information of all kinds and in rich forms on the Web as they see fit. Because Web data are semi-structured or even unstructured, Web pages are suited only to human browsing; application programs cannot easily parse and use the massive amount of valuable information on the Web directly. On the other hand, with the rapid development of the Internet and e-commerce, "information explosion" has become an obstacle to obtaining information effectively. Automated extraction of Web information by computer has therefore become increasingly practical and urgent.

At present, many Web pages are generated dynamically: the website selects data from a back-end database according to the user's request and embeds them in a common template. Such sites, referred to as the Deep Web, are an important part of the Internet. Studies indicate that the Deep Web holds about 500 times as much information as the Surface Web, and that there are about 450,000 Deep Web sites. Because the data of these sites are generated dynamically on request, traditional search engines cannot index them well. Observation shows that such sites usually present the information stored in their databases to users through list pages and detail pages. Extracting data from these Web pages is therefore a prerequisite for Deep Web data integration.

In recent years, researchers have proposed a number of wrapper generation methods for general data-intensive websites, which effectively solve the data extraction problem for ordinary sites. The task of a wrapper is to apply a set of rules to extract from Web pages the useful information the user cares about. Because HTML documents differ in form, documents from different data sources usually need different extraction rules, so a wrapper is usually closely tied to the page format of a particular source. Existing wrappers have the following main shortcomings: (1) Developing and using a wrapper demands considerable skill and manual involvement, and much time is spent studying the structure of the pages to be extracted. This approach is not conducive to large-scale Web data integration. (2) Because a wrapper is closely tied to a particular source, it may fail if the page designer changes the original layout. (3) Most research is limited to data extraction from simple result pages.

Summary of the invention

The object of the invention is to provide an automatic wrapper generation method based on the HTML Tag tree, so as to improve the degree of automation, the accuracy and the efficiency of data extraction.

To achieve the above object, the invention adopts the following technical solution: a method for automatically generating a wrapper for complex pages, comprising the following steps:

(1) obtain two HTML page documents generated from the same template, and use an XML parser to parse each into a Document Object Model, i.e. an HTML Tag tree with a tree structure;

(2) compare the two HTML Tag trees obtained in step (1), remove the noise regions, and obtain the minimum region DS containing the data record set;

(3) obtain the initial data records from the minimum region, as follows: obtain the longest common substring of the DS region from the HTML Tag tree, and identify the initial data records DR by finding the repeated regions within the DS region; each data record is represented by a two-tuple (D, G), where D is the set of record attributes and G is the layout combination relation of the attributes in the HTML page;

(4) according to the layout combination relation of the initial data records DR and the similarity of the feature items, determine the aggregation relation of the extraction items (instance attributes); using domain-ontology knowledge, semantically annotate the entities in the same aggregation block; and recombine them into new data records DR2 according to the semantic relations between entities;

(5) according to the position in the HTML Tag tree of the data records DR2 generated in step (4), generate the extraction rule of each aggregation block, and then construct the wrapper.

In step (4) above, recombining the records into new data records DR2 according to the semantic relations between entities accurately reflects the relations among the data and meets the user's needs.

In the above technical solution, the feature items in step (4) comprise style features and feature words.

For ease of understanding, the technical solution is further described as follows:

On the Web, a complex list page has the following basic characteristics:

1. In terms of generation, a complex page is generated from a Web page template T.

2. In terms of content, a data record (DR) in a complex page contains not only images but also text.

3. In terms of page layout, the content of a DR in a complex page may be organized into several columns or several regions, and the DR layout under the same template may vary with conditions.

The Web pages generated from a template are described formally below.

List page template T: T = H ∪ N, where H is the data-rich region of extractable data that the user cares about and N is the noise region. H is represented by a two-tuple (S, P), where S is the data record set (DRs) and P is the distribution relation among the DRs.

Data record DR: a data record DR can be represented by a two-tuple (D, G), where D is the set of record attributes and G is the aggregation relation among the attributes.
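The two formal definitions above can be mirrored by simple data structures. The following is a minimal sketch only; the class names, the field names and the choice of plain strings for P and G are illustrative assumptions and are not prescribed by the invention.

```python
from dataclasses import dataclass, field


@dataclass
class DataRecord:
    """DR = (D, G)."""
    attributes: dict   # D: attribute name -> extracted value
    layout: str        # G: aggregation relation, here a hypothetical string encoding


@dataclass
class ListTemplate:
    """T = H ∪ N, with H = (S, P)."""
    records: list                               # S: the data record set (DRs)
    distribution: str                           # P: how the DRs are distributed (hypothetical encoding)
    noise: list = field(default_factory=list)   # N: noise regions


# A hypothetical list page with one record and two noise regions.
page = ListTemplate(
    records=[DataRecord({"title": "Book A", "price": "12.50"}, "((#)(##))")],
    distribution="rows",
    noise=["navigation bar", "advertisement bar"],
)
```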

HTML Tag tree: an HTML document can be parsed by an XML parser into a DOM (Document Object Model) model with a tree structure, in which each tag in the HTML document is mapped to a node of the DOM tree; tags mainly mark up headings, paragraphs, line breaks and the like. The DOM tree obtained by mapping the tags is called the HTML Tag tree.
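By way of illustration, a tag tree of this kind can be built as sketched below. The sketch assumes Python's standard `html.parser` module in place of the XML parser named in the text, and assumes reasonably well-formed HTML; the names `Node`, `TagTreeBuilder`, `build_tag_tree` and `outline` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag in HTML.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TagTreeBuilder(HTMLParser):
    """Builds a simple tag tree; each start tag becomes one node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag name, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node):
    """Nested outline of tag names, useful for inspecting the tree."""
    inner = ",".join(outline(child) for child in node.children)
    return f"{node.tag}({inner})" if inner else node.tag


tree = build_tag_tree("<html><body><ul><li>a</li><li>b<br></li></ul></body></html>")
print(outline(tree))
```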

Domain ontology: the set of terms or vocabulary relevant to a specific field, such as medicine or education. In general, all concepts in a field can be distinguished by the different attributes they have in the ontology knowledge base: if two concepts differ, their corresponding attribute sets must differ. The domain ontology knowledge base can be generated from the query interface schemas of websites and the integrated query interface schema of the field, using the automatic ontology generation method of Yoo Jung An, James Geller, Yi-Ta Wu and Soon Ae Chun, "Automatic Generation of Ontology from the Deep Web", in Proc. 18th Intl. Workshop on DEXA, IEEE, 2007.

In terms of data composition, the data-rich region of a complex list page is produced by iterating over data records. A local domain ontology file can be defined to describe the data instance objects of the list page and their relations; every data record of the complex list page is then an object instance described by this ontology file.

The invention assumes that the complex pages studied are generated automatically from database content by the same template and that the content of the noise regions remains unchanged; their complexity lies mainly in the complexity of the DR layout and in the fact that DR records contain both text and images. Under this assumption, automatic wrapper generation for complex pages involves the following key problems:

(1) Discovery of the Data-rich region (DS). In terms of data, the Data-rich region is the set of data records on the Web page. A list page contains not only the data record set region but also regions such as advertisement bars and navigation bars. Here, two list pages generated from the same template are compared and, after some preprocessing steps, noise regions such as advertisements and navigation bars are excluded; the minimum region containing the data record set is the Data-rich region.

(2) Identification of data records (DR). The data records carrying the information the user wants to extract are found in the Data-rich region; such a data record usually corresponds to an entity and consists of several extraction items.

(3) Discovery of aggregation blocks and annotation of extraction items. The structural relations of the HTML Tag tree are used to find the aggregation blocks in a DR, and the extraction items in each aggregation block are semantically annotated using domain-ontology knowledge.

(4) Construction of the generation rules of the wrapper.

Given a group of complex Web pages generated from a template, the goal of the invention is to produce automatically a set of specific extraction rules, i.e. a wrapper for those Web pages.

By applying the above technical solution, the invention has the following advantage over the prior art:

By analysing the structural relations of the HTML Tag tree, the invention can extract the true data record rules from complex pages and thus automatically construct a wrapper with high extraction accuracy.

Description of drawings

Fig. 1 is a flow chart of automatic wrapper generation for complex Web pages in Embodiment 1;

Fig. 2 shows the aggregate representation of the HTML Tag tree in Embodiment 1;

Fig. 3 is a schematic diagram of the decision rule chain in Embodiment 1;

Fig. 4 shows the extraction rule file in Embodiment 1.

Embodiment

The invention is further described below with reference to the drawings and embodiments:

Embodiment 1: Fig. 1 shows the basic flow of the automatic wrapper generation system. The system consists of three main parts: the Data-rich region (DS) identification submodule, the data record (DR) identification submodule and the wrapper generator submodule.

Data-rich region (DS) identification submodule. In terms of data, DS is the set of data records on the Web page. A list page contains not only the data record set region but also regions such as advertisement bars and navigation bars. The Data-rich region of interest to the user is located quickly by comparing the HTML Tag trees of two pages (here, list pages) generated from the same template. Because list pages are produced by a predefined template, DRs usually appear in the page in iterated form. Observation shows that paging navigation usually appears near the Data-rich region, so a Data-rich Finder algorithm was designed to locate the Data-rich region quickly, i.e. to find the minimum region containing the data record set.

Data record (DR) identification submodule. This submodule identifies, within the Data-rich region, the data records (DR) the user wants to extract. If a result record is associated with an ontology instance of the field, its record fields correspond to the attributes of that ontology instance. Given the characteristics of complex pages, record fields may be arranged into different aggregation blocks in the layout as required. Fields of the same type across data records are consistent in appearance and format.

Wrapper generator submodule. This is the core module; its main task is to find, within the aggregation blocks of a DR, the record fields the user wants to extract. The wrapper generator locates the attributes in an aggregation block through the annotation analyzer module, which relies mainly on an attribute rule configuration file based on the domain ontology. After the semantic information has been annotated, a regular-expression extraction rule is output for each attribute according to its structural features in the HTML Tag tree and its style features, and is stored in XML form in an XML library file. The generation process is as follows:

1. Discovery of the Data-rich region

A list page generated from a template contains not only the data record set of interest to the user but also noise such as navigation bars and advertisement bars. For a given Web page it is therefore necessary to identify which region is the data record set the user is really interested in and to extract that region.

A list result page is the union of the Data-rich region and the noise regions. The content of the noise regions does not change, whereas the data in the Data-rich region are updated continually as the user pages through the results. The Data-rich region can therefore be obtained by comparing the HTML Tag tree structures of different result pages produced from the same template. Combining the traditional DSE algorithm (J. Wang and F. Lochovsky, "Schema guided wrapper maintenance for Web-data extraction", in Proc. of ACM WIDM 2003, New York: ACM Press, 2003) with the FLCS algorithm (Chen Xiaofeng, Zhang Ling, Dong Shoubin, "Web data extraction method based on XPath comparison", Journal of Zhengzhou University (Science Edition), 2007, Vol. 39, No. 2), the invention provides a Data-rich Finder extraction algorithm designed for complex pages that have, for example, a navigation bar region in a paged list. The algorithm is outlined in Table 1.

Table 1. Data-rich Finder algorithm

Input: the URLs of two original list pages
Output: the forward longest common substring representing the Data-rich region
Algorithm steps:
(1) Input the URLs of two pages based on the same template.
(2) Traverse the HTML Tag tree of each page recursively, depth-first; if a paging navigation node is found in a subtree, mark its parent node as the start node for step (3); otherwise use the root node of the HTML Tag tree as the start node.
(3) Starting from the start node marked in step (2), compare the HTML Tag trees recursively, depth-first, and determine whether corresponding subtrees are consistent. If a path is consistent, mark that subpath as consistent, return to the parent node and select the next subpath for comparison. If all subpaths of a parent node are consistent, the path it represents is a noise branch.
(4) Output the forward longest common substring of the differing subtrees obtained, as the tree path of the Data-rich region.
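Steps (3) and (4) of Table 1 may be sketched as follows. This is a simplified illustration, not the Data-rich Finder algorithm itself: each tree node is assumed to be a tuple `(tag, text, children)`, the paging-navigation start node of step (2) is omitted, and the "forward longest common substring" is interpreted here as the longest common prefix of the paths to the differing subtrees. All function names are hypothetical.

```python
def same(a, b):
    """True if two subtrees have identical tags, text and children."""
    return (a[0] == b[0] and a[1] == b[1] and len(a[2]) == len(b[2])
            and all(same(x, y) for x, y in zip(a[2], b[2])))


def differing_paths(a, b, path):
    """Paths to the subtrees that differ between two pages of one template."""
    if same(a, b):
        return []                 # consistent subtree: a noise branch
    if a[0] != b[0] or len(a[2]) != len(b[2]) or not a[2]:
        return [path]             # children cannot be aligned: whole subtree differs
    out = []
    for i, (x, y) in enumerate(zip(a[2], b[2])):
        out.extend(differing_paths(x, y, path + (f"{x[0]}[{i}]",)))
    return out or [path]          # children all equal, so the node's own text differs


def common_prefix(paths):
    """Longest common prefix of a list of paths."""
    if not paths:
        return ()
    prefix = paths[0]
    for p in paths[1:]:
        n = 0
        while n < min(len(prefix), len(p)) and prefix[n] == p[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def data_rich_path(page_a, page_b):
    return common_prefix(differing_paths(page_a, page_b, (page_a[0],)))


# Two hypothetical list pages: same navigation and advertisement, different records.
page1 = ("html", "", [("div", "nav", []),
                      ("ul", "", [("li", "A", []), ("li", "B", [])]),
                      ("div", "ad", [])])
page2 = ("html", "", [("div", "nav", []),
                      ("ul", "", [("li", "C", []), ("li", "D", [])]),
                      ("div", "ad", [])])
print(data_rich_path(page1, page2))
```

In this example the navigation and advertisement `div` nodes are identical on both pages and are discarded as noise, leaving the `ul` node as the region containing the records.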

2. Identification of DRs

Once the longest common substring of the Data-rich region has been obtained, the initial, approximate data records the user wants to extract can be identified by finding the repeated regions in the Data-rich region. A result record is associated with an ontology object of the field, and its record fields correspond to the attributes of that ontology object. Given the characteristics of complex pages, record fields may be arranged into different aggregation blocks in the layout as required. Fields of the same type across data records are consistent in appearance and format.
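The text does not specify how repeated regions are detected. As a deliberately simple stand-in, the sketch below treats the most frequent child tag of the Data-rich node as the record delimiter. Nodes are assumed to be `(tag, children)` tuples, and `initial_records` is a hypothetical name.

```python
from collections import Counter


def initial_records(region):
    """Return the children of the region that repeat under the same tag.

    Records on complex pages may differ slightly in internal structure,
    so only the top-level tag is compared here.
    """
    _, children = region
    counts = Counter(child[0] for child in children)
    if not counts:
        return []
    tag, n = counts.most_common(1)[0]
    if n < 2:
        return []                 # nothing repeats: no candidate records
    return [child for child in children if child[0] == tag]


# A hypothetical Data-rich region: a caption followed by three rows.
region = ("table", [
    ("caption", []),
    ("tr", [("td", [])]),
    ("tr", [("td", []), ("td", [])]),
    ("tr", [("td", [])]),
])
records = initial_records(region)
```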

Observation shows that the tags in the HTML Tag tree can be divided into two classes according to their characteristics: container tags, which have a hierarchical relationship in the layout, and style-modifying tags, as shown in Table 2:

Table 2. Classification of tags

Container tags	Modifier tags
table, tr, td, div, ul, li, etc.	a, strong, font, etc.

As Table 2 shows, container tags form different aggregation relations among entity attributes, while modifier tags unify the style of the same attribute across entities and thus hint at the classification of different entity attributes. DR = (D, G), i.e. a DR is represented by a two-tuple of the entity attributes and the aggregation relation among them. Here a pair of parentheses () denotes a pair of container tags, and the nesting of parentheses denotes the hierarchy of the tags, as shown in Fig. 2, where # denotes text together with its style, or an <img> tag. A group of the form (#) is called an aggregation block.
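One possible reading of this notation is sketched below; the exact form shown in Fig. 2 may differ. Nodes are assumed to be `(tag, children)` tuples in which text nodes carry the hypothetical tag `"#text"`; the container tag set is taken from Table 2, and modifier tags are treated as transparent.

```python
# Container tags as listed in Table 2.
CONTAINER_TAGS = {"table", "tr", "td", "div", "ul", "li"}


def aggregate(node):
    """Aggregate expression of a subtree: () for containers, # for text or <img>."""
    tag, children = node
    if tag in ("#text", "img"):
        return "#"
    inner = "".join(aggregate(child) for child in children)
    # Modifier tags such as <a> or <font> add no nesting level of their own.
    return f"({inner})" if tag in CONTAINER_TAGS else inner


# A hypothetical record: one cell with an image, one cell with two styled texts.
record = ("tr", [
    ("td", [("img", [])]),
    ("td", [("a", [("#text", [])]), ("br", []), ("font", [("#text", [])])]),
])
print(aggregate(record))
```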

To distinguish correctly the text in different aggregation blocks, the following rules can be observed:

1. The same entity attribute has a consistent style.

2. Different entity attributes are generally separated by a separator tag, for example <br>.

3. For entity attributes with the structure (feature word, text), whether the feature word repeats indicates whether the block holds one entity attribute or several.
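Rules 2 and 3 can be illustrated with the following sketch, which splits the content of an aggregation block at `<br>` separators and then looks for a leading feature word. The assumption that a feature word is terminated by a colon is made purely for illustration, and `split_attributes` is a hypothetical name.

```python
import re


def split_attributes(block_html):
    """Split an aggregation block into (feature word, text) pairs.

    Segments are separated by <br>; a segment without a colon is returned
    with None as its feature word.
    """
    segments = [s.strip()
                for s in re.split(r"<br\s*/?>", block_html, flags=re.I)
                if s.strip()]
    pairs = []
    for segment in segments:
        word, sep, value = segment.partition(":")
        pairs.append((word.strip(), value.strip()) if sep else (None, segment))
    return pairs


pairs = split_attributes("Author: Cui<br>Price: 12.50<BR/>In stock")
```

Two distinct feature words in one block, as in this example, indicate that the block holds several entity attributes rather than one.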

For simple result pages, the DRs are usually structurally identical: the number of aggregation blocks is the same, and so is the number of text blocks in each aggregation block. For complex pages, the number of aggregation blocks of a DR and the number of text blocks within an aggregation block may differ slightly depending on conditions. When identifying DRs, the structural similarity of the aggregation blocks usually carries more weight than the similarity within aggregation blocks. The similarity of two DRs can therefore be judged from the ratio of their numbers of aggregation blocks, which also eliminates the interference of separators between DRs. Finally, the records are recombined into accurate data records DR2 that match the user's intent.
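The ratio test described above may be sketched as follows, taking as input the aggregate expressions of two candidate DRs. An aggregation block is counted as any group of the form (#...#); the threshold value of 0.6 and the function names are assumptions chosen for illustration only.

```python
import re


def block_count(expr):
    """Number of aggregation blocks, i.e. groups such as (#) or (##)."""
    return len(re.findall(r"\(#+\)", expr))


def record_similarity(expr_a, expr_b):
    """Ratio of the smaller to the larger aggregation block count."""
    a, b = block_count(expr_a), block_count(expr_b)
    if max(a, b) == 0:
        return 0.0                # no aggregation blocks: nothing to compare
    return min(a, b) / max(a, b)


def is_same_kind(expr_a, expr_b, threshold=0.6):
    # The threshold is a hypothetical tuning parameter.
    return record_similarity(expr_a, expr_b) >= threshold


full_record = "((#)(##))"
variant_record = "((#)(##)(#))"   # same template, one extra block
separator = "(#)"                 # e.g. a divider row between records
```

Under these assumptions the variant record is accepted as a DR of the same kind, while the separator row, with half as many blocks, is rejected.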

3. Generation of the wrapper extraction rule file

After the HTML Tag tree of a DR has been put into aggregate representation, the text blocks in the DR are annotated using the previously generated domain-ontology knowledge, giving the text content of the corresponding entity attributes.

Based on the rules for entity attributes given above, the following criteria are used to distinguish entity attributes:

(1) Text and non-text fields can be distinguished by their tags.

(2) Text attributes that have a feature word can be identified by that feature word.

(3) Attributes can be distinguished by the data type and format of the text block.

(4) Attributes can be distinguished by the length of the data content of the text block.

Extraction items of the same field are very similar in tag type, data format, style and feature words. Extraction items of the same field are clustered by computing the similarity of text nodes and image nodes with a weighted similarity function, in which w₁, w₂ and w₃ are the respective weights; SimPtag(A, B) indicates whether the parent nodes of nodes A and B have the same tag name; Simtag(A, B) is the similarity of the tag names of the two nodes; SimS(A, B) is the style similarity of text nodes A and B; and SimC(A, B) is the content similarity of the text nodes (based mainly on feature words and data-type features).
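The way the four components are combined is not reproduced in this text. The sketch below therefore assumes one plausible form, Sim(A, B) = SimPtag(A, B) · (w₁·Simtag(A, B) + w₂·SimS(A, B) + w₃·SimC(A, B)), in which SimPtag acts as a gate. This form, the weight values, the dictionary keys and the component definitions are all assumptions made for illustration.

```python
def jaccard(x, y):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def sim(a, b, w1=0.3, w2=0.3, w3=0.4):
    """Assumed combination of the four component similarities.

    Each node is a dict with the hypothetical keys 'parent_tag', 'tag',
    'style' (a set of style properties), 'feature_word' and 'dtype'.
    """
    sim_ptag = 1.0 if a["parent_tag"] == b["parent_tag"] else 0.0
    sim_tag = 1.0 if a["tag"] == b["tag"] else 0.0
    sim_s = jaccard(a["style"], b["style"])
    # Content similarity from feature word and data type, weighted equally.
    sim_c = (0.5 * (a["feature_word"] == b["feature_word"])
             + 0.5 * (a["dtype"] == b["dtype"]))
    return sim_ptag * (w1 * sim_tag + w2 * sim_s + w3 * sim_c)


price_1 = {"parent_tag": "td", "tag": "font", "style": {"bold", "red"},
           "feature_word": "Price", "dtype": "number"}
price_2 = {"parent_tag": "td", "tag": "font", "style": {"bold"},
           "feature_word": "Price", "dtype": "number"}
other = dict(price_2, parent_tag="li")
```

Nodes whose similarity exceeds a chosen threshold would then be placed in the same cluster.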

Using domain-ontology knowledge and the weights of the feature words, data format and style features, the decision rule chain shown in Fig. 3 was formulated to annotate the aggregated text and image information.
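A decision rule chain of this kind can be sketched as an ordered series of checks that follow the four criteria given earlier (tag, feature word, data type and format, content length). The actual chain of Fig. 3 is not reproduced here; the feature-word table, the regular expressions, the length threshold and the returned labels are all hypothetical.

```python
import re

# Hypothetical feature words taken from a domain ontology, mapped to attribute names.
FEATURE_WORDS = {"Price": "price", "Author": "author"}


def classify(tag, text):
    """Apply the criteria in order and return an attribute label."""
    if tag == "img":                                   # (1) tag: non-text field
        return "image"
    for word, attribute in FEATURE_WORDS.items():      # (2) feature word
        if text.startswith(word):
            return attribute
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):       # (3) data type and format
        return "date"
    if re.fullmatch(r"[$¥]?\d+(\.\d+)?", text):
        return "number"
    return "description" if len(text) > 40 else "title"  # (4) content length
```

For example, under these assumptions `classify("font", "Price: 12.50")` yields `"price"` because the feature-word rule fires before the format rules are tried.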

After the extraction items have been semantically annotated with the DeLa ontology annotation method (J. Wang and F. Lochovsky, "Data Extraction and Label Assignment for Web Databases", WWW 2003), the extraction rule of each extraction item is produced from the tag information, feature word and data type of the node containing it, as shown in Fig. 4.
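As an illustration of storing a regular-expression extraction rule in XML form, the sketch below uses Python's standard `xml.etree.ElementTree`. The element and attribute names (`rule`, `field`, `path`, `featureWord`, `regex`) are hypothetical and do not reproduce the file format of Fig. 4.

```python
import re
import xml.etree.ElementTree as ET


def build_rule(field, path, feature_word, pattern):
    """Create one extraction rule element for an annotated extraction item."""
    rule = ET.Element("rule", {"field": field})
    ET.SubElement(rule, "path").text = path            # position in the HTML Tag tree
    ET.SubElement(rule, "featureWord").text = feature_word
    ET.SubElement(rule, "regex").text = pattern        # regular-expression rule
    return rule


def apply_rule(rule, text):
    """Apply the stored regular expression; return the first group or None."""
    match = re.search(rule.findtext("regex"), text)
    return match.group(1) if match else None


rule = build_rule("price", "/html/body/table/tr/td[2]/font",
                  "Price", r"Price:\s*([\d.]+)")
xml_text = ET.tostring(rule, encoding="unicode")       # serialized form for the rule file
print(xml_text)
print(apply_rule(rule, "Price: 12.50 USD"))
```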

Claims

1. A method for automatically generating a wrapper for complex pages, characterized by comprising the following steps:

The method of obtaining the longest common substring of the DS region from the HTML Tag tree is:

1. input the URLs of two pages based on the same template;

2. traverse the HTML Tag tree of each page recursively, depth-first; if a paging navigation node is found in a subtree, mark its parent node as the start node for step 3; otherwise use the root node of the HTML Tag tree as the start node;

3. starting from the start node marked in step 2, compare the HTML Tag trees recursively, depth-first, and determine whether corresponding subtrees are consistent; if a path is consistent, mark that subpath as consistent, return to the parent node and select the next subpath for comparison; if all subpaths of a parent node are consistent, the path it represents is a noise branch;

4. output the forward longest common substring of the differing subtrees obtained;

(4) according to the layout combination relation of the initial data records DR and the similarity of the feature items, determine the aggregation relation of the extraction items; using domain-ontology knowledge, semantically annotate the entities in the same aggregation block; and recombine them into new data records DR2 according to the semantic relations between entities;

(5) according to the position in the HTML Tag tree of the data records DR2 generated in step (4), generate the extraction rule of each aggregation block, and then construct the wrapper;

the feature items in step (4) comprise style features and feature words.