CN1588370A - Maintenance method for a wrapper

Info

Publication number: CN1588370A
Application number: CN 200410074546
Authority: CN
Inventors: 孟小峰; 谷明哲; 王海燕; 胡东东; 于峻涛; 易蕾; 李宇
Original assignee: Individual
Current assignee: Individual
Priority date: 2004-09-08
Filing date: 2004-09-08
Publication date: 2005-03-02
Anticipated expiration: 2024-09-08
Also published as: CN100338609C

Abstract

The maintenance method for a wrapper includes: using the original wrapper to extract the identifying features of the data items of entities in the Web page before the change; traversing the HTML tree of the changed Web page with these features to find the data items that carry them; grouping the data items that belong to the same entity into a semantic chunk, that is, aggregating the data items by entity, and extracting the data item description pattern of the entity; matching this pattern against the data item descriptions of other entities in sibling subtrees of the HTML tree; processing the subtrees according to the degree of matching; and selecting the best-matching semantic chunk to regenerate the extraction rules, that is, to create a new wrapper. The invention provides a clear workflow for generating wrappers, and the result can be easily integrated with other application systems.

Description

Maintenance method for a wrapper

Technical field

The present invention relates to a method for maintaining wrappers for Web pages.

Background technology

The Internet has a history of more than 20 years, but the Internet boom emerged only in recent years, and the credit for this goes mainly to the Web. The Internet provides network interconnection and communication worldwide, while the Web is a global information resource base. The Web consists of countless pages (home pages) whose information covers a vast range of topics and is growing and being updated every day. Users can obtain the data they are interested in simply by opening any browser. The ease of use of the Web lets ordinary households share the huge information resources on the Internet. Today everyone is talking about the Web and home pages; open a newspaper or news magazine and the Web and home pages are mentioned almost daily. From small shops to large companies, from research institutes to schools, nearly everyone presents a home page and Web address to others. The Web and home pages have become a focus of the computing and communications fields, and indeed of society as a whole.

The overwhelming majority of data on the Web is represented in HTML. A characteristic of HTML is that any organization or individual can easily publish information of varied content and form on the Web. As a result, data on the Web is chaotic and poorly integrated, which makes building applications on the Web very difficult.

HTML and XML are briefly introduced below. HTML derives from the Standard Generalized Markup Language (SGML). SGML existed long before the Web was invented. As its name suggests, SGML is a general-purpose language that uses markup to describe document information. It includes a series of document type definitions (DTDs); a DTD defines the meaning of the markup, so the grammar of SGML is extensible. Because SGML is very large, it is hard to learn, hard to use, and difficult to implement on a computer. For these reasons, the inventors of the Web, researchers at the European nuclear physics research centre, proposed HTML (1989) in line with the computing technology of the time. HTML uses only a small subset of SGML markup; HTML 3.2, for example, defines 70 tags. To ease implementation on a computer, the set of HTML tags is fixed, that is, the HTML grammar is not extensible and HTML does not need to include a DTD. This fixed grammar makes HTML easy to learn and use, and makes browsers for HTML easy to develop. HTML is the universal method for displaying data on the Web, and it focuses on describing the display format of Web pages.

Like HTML, XML derives from SGML. XML is a simplified SGML that brings the rich functionality of SGML and the ease of use of HTML to Web applications. XML retains the extensibility of SGML, which makes it fundamentally different from HTML. XML is far more powerful than HTML: its markup is no longer fixed, an unlimited number of tags may be defined to describe the data in a document, and nested information structures are allowed. HTML is a universal method for displaying Web data, whereas XML provides a universal method for processing Web data directly. HTML focuses on the display format of a Web page, whereas XML focuses on its content. A schema file defines the structure of a valid XML file through a series of nestable elements. This gives the XML file itself some structured characteristics and allows computer programs to process it directly, which is not possible with HTML. The information in an XML page is structured, somewhat like a database, so it is more accessible and retrieval results are more targeted and more accurate. XML also drops a large body of rarely used SGML features. Just as only about 8,000 of the several hundred thousand Chinese characters are in common use, only about 20% of SGML is commonly used; XML discards the rarely used parts and thereby simplifies SGML by 80%. In short, XML uses a simple, flexible, standard format and gives Web-based applications an effective means of describing and exchanging data.

The Web wrapper arose precisely because of these characteristics of HTML. The task of a wrapper is to extract data in HTML format (with its fixed markup) and convert it into concretely structured data. The wrapper is one of the important components of a Web data integration system.

In addition, because the Web is dynamic and unstable, HTML documents on the Web change frequently, for example in their page structure. Every Web wrapper is tied to the structure of a particular class of HTML pages (the HTML documents of different websites are necessarily structured differently, so a different wrapper is needed to extract data from each). Therefore, when the page structure changes, the original wrapper fails and can no longer extract data from the changed page. Changes to a Web page fall into two main cases: changes to page content and changes to page structure. In the first case, the positions of the data items in the HTML document are unchanged and only their values change, so the wrapper can still extract them directly. In the second case, the positions of the data items in the HTML document have changed, so the position information in the extraction rules of the original wrapper no longer corresponds (partly or wholly) to the changed page, and the wrapper fails.

Once a wrapper fails, maintaining it by hand is very troublesome. Generating a wrapper already takes a great deal of time, and when a company manages many wrappers, manual maintenance (from detecting whether each wrapper works correctly to generating a new wrapper for the changed page) is almost impossible.

Several maintenance methods have been proposed for the wrapper failure problem; they are introduced briefly below.

Earlier wrapper maintenance methods all assume that the wrapper failed because of minor page changes. Wrapper verification is generally solved with machine learning, which learns the data pattern (Data Pattern) of each extracted field. When the pattern of the test examples differs significantly from that of the training examples, the system issues a notification or automatically invokes a repair procedure. The automatic repair procedure rests on the assumption that the page has undergone only minor stylistic changes or slight adjustments and that the content of the fields generally remains unchanged. The patterns previously used to verify the extraction results can then be used to relocate correct examples of each data field in the new page. Once the required fields are located, the page is automatically re-labelled, and correct extraction rules are obtained by an inductive learning process from the labelled examples. The drawbacks of this method are that it needs a large number of examples for learning and that the patterns contain character strings from the fields, so relatively few failures can be repaired.

Another method relocates the data items in the page using text-similarity techniques from information retrieval. Its structure recognition relies on some very natural heuristic rules, such as detecting repeated sequences of HTML tags and repeated patterns of similar strings. Cohen distinguishes two kinds of page change: the page format changes while the content remains unchanged, and both the page format and the content change. When the content is unchanged, the data extracted from the previous version of the page can still be used, and the wrapper is regenerated by comparing text similarity. When the content has also changed, the data extracted from the previous version can serve only as approximate examples to assist in regenerating the wrapper. This method can handle only simple list structures, and its accuracy in recognizing structure is low when the page content has changed.

The methods described above do not solve the wrapper maintenance problem well.

Summary of the invention

In view of the problems and shortcomings of existing wrapper maintenance described above, the present invention proposes a wrapper maintenance method that maintains a wrapper automatically after the Web page changes.

The present invention is implemented as a wrapper maintenance method comprising the following steps:

(1) using the original wrapper to extract the identifying features of the data items of the entities in the Web page before the change;

(2) traversing the HTML tree of the changed Web page with the data item identifying features from step (1), and finding the data items that have those features;

(3) for the data items found in step (2) that belong to the same entity, dividing them into a semantic chunk, that is, aggregating the data items by entity, and extracting the data item description pattern of the entity; matching this pattern against the data item descriptions of other entities in sibling subtrees of the HTML tree; if

a. the subtree contains multiple instances of the pattern, that is, multiple semantic chunks that match the pattern, then the subtrees rooted at each of its child nodes are processed recursively;

b. the subtree contains only part of one instance of the pattern, that is, only part of a semantic chunk that matches the pattern, then adjacent subtrees are merged automatically;

c. the subtree contains exactly one instance of the pattern, that is, exactly one semantic chunk that matches the pattern, then all such semantic chunks, that is, the relevant subtrees, are returned;

(4) from the semantic chunks obtained in step (3), all of which match the pattern, selecting one that matches the pattern well and regenerating new extraction rules from it, thereby generating the new wrapper.

Preferably, the identifying features are the semantic feature of the data item, the descriptive information (annotation) of the data item, and the link information of the data item.

Preferably, extracting data items from the changed Web page comprises: in the Web page parsed into a tree structure, starting from the root node, judging whether the root node matches the data item identifying features; if it does not, recursively processing all of its child nodes until all nodes have been traversed.

Preferably, in the process of searching for data item nodes, if a node is an ELEMENT node, that is, a non-leaf node, its child nodes are checked recursively; if a node is a TEXT node, that is, a leaf node containing a text value, it is first judged whether the node contains the annotation information of a data item; if it does, the corresponding data value is sought starting from this node; if it does not, the semantic feature and the link information are checked, and if both match, this node is also taken as a node of a possible data item.

Preferably, automatically merging adjacent subtrees comprises: merging the subtrees among the child nodes of the root node in left-to-right order; first adding the second semantic chunk to the first and judging whether the merged block over-matches; if not, continuing to add the following subtrees to the block; when the block over-matches, removing the subtree added last to obtain a semantic chunk, and keeping this semantic chunk; repeating the merge operation starting from the first subtree not yet added to a block, until all subtrees of the layer have been merged; returning the fully matched or partially matched semantic chunks; then returning to the other nodes of the root node and repeating the above steps until all nodes have been traversed.

The invention provides a feasible technique and workflow for maintaining wrappers, and a new way to obtain and use data on Web pages stably and effectively over the long term. Using only the descriptive features of the Web entities before the change, the invention extracts the information of the corresponding entities from the changed Web page, and derives from this entity information the extraction rules for querying the Web page, that is, it regenerates the wrapper.

Specifically, the present invention has the following advantages:

(1) it proposes a clear workflow for generating wrappers;

(2) it requires no manual intervention;

(3) it can achieve high data extraction accuracy;

(4) the Java-based wrapper maintenance workflow and its results can be integrated very easily with other application systems.

Description of drawings

The present invention is described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of the application workflow of the wrapper of the present invention;

Fig. 2 is a schematic diagram of the detailed workflow of the present invention;

Fig. 3 is a flowchart of finding possible data items in the changed HTML tree;

Fig. 4 is a flowchart of dividing semantic chunks.

Embodiment

To make the invention easier to understand, the related concepts are explained first.

Wrapper: a wrapper packages semi-structured or unstructured data from a data source into structured data, or into semi-structured data that has a pattern (schema). A Web wrapper extracts the pattern-less semi-structured data in an HTML page and produces semi-structured XML data that has a pattern.

Pattern (Schema): a pattern specifies the elements that may appear in a document, the attributes and data types of those elements, the hierarchy of elements within elements, and the order in which elements appear throughout the document. It constrains how a document may be marked up, enforces consistency of the markup, and lets an XML parser validate the document. In the present invention, a DTD is used to describe the data that the user wants to extract from the Web page and the format of the XML result document finally obtained from the wrapper.

HTML tree: an HTML document can be parsed by an XML parser into a DOM model with a tree structure, in which each pair of tags in the HTML document corresponds to one node of the DOM tree. This DOM tree is the HTML tree.

Extraction rule: an extraction rule locates the information to be extracted in an HTML document. Extraction rules are also the basis of the wrapper: they specify which nodes in the HTML document are to be extracted and how, for example that only part of the value of a node is needed.

XPath: a query language for XML, used to locate parts of an XML document, to manipulate strings, numbers, and Booleans, and to match sets of nodes in an XML document. The result of an XPath expression is one of four kinds of object: a node set (Node Set), a Boolean, a number, or a string. In the present invention, because the HTML document has been parsed into a DOM model, each data item that the user wants to extract has an XPath-based extraction rule that gives the position of the data item in the HTML document. A simple example of an XPath expression is /authors/author[@period="classical"], which means: in an XML document whose root node is "authors", select the author elements whose period attribute has the value classical.
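The XPath example above can be tried with a short sketch. It uses Python's standard xml.etree.ElementTree module, which supports only a subset of XPath; the sample document and the author names are hypothetical, and only the element and attribute names come from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only "authors", "author" and "period"
# are taken from the example expression in the text.
XML = """<authors>
  <author period="classical">Homer</author>
  <author period="modern">Joyce</author>
  <author period="classical">Virgil</author>
</authors>"""

root = ET.fromstring(XML)  # the root element is <authors>

# ElementTree paths are relative to the root element, so the absolute
# expression /authors/author[@period="classical"] is written as:
classical = root.findall("author[@period='classical']")

names = [a.text for a in classical]
print(names)
```

The result is a node set in XPath terms, represented here as a Python list of elements.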

XQuery: an XML document is essentially an outline in which order and hierarchy are the main structural units. XQuery is built on this structure of XML and uses it to provide query capability over data stored as XML. More precisely, XQuery is formally defined in terms of the XQuery 1.0 and XPath 2.0 Data Model [XQ-DM], which describes the parsed structure of an XML document as an ordered, labelled tree in which every node has a distinct identity and may have a simple or complex type. XQuery can query XML data that has no pattern (schema), as well as data governed by a World Wide Web Consortium (W3C) XML Schema or by a document type definition (DTD). Note that the data model used by XQuery is quite different from the classical relational model, which has no notion of hierarchy, does not treat order as significant, and does not support identity. Every XQuery query is an expression to be evaluated, and expressions can be combined very flexibly to create new expressions. In the present invention, all the data to be extracted from a page (corresponding to the pattern of the page data) is extracted by a single XQuery expression, in which each data item corresponds to one XQuery sub-expression.

Semantic chunk: a semantic chunk is a set of data items that conforms to the page data pattern and that appears in the HTML page as the information describing one entity. For example, in a page of book information on Amazon, a semantic chunk is the complete information for one book in the page, including title, author, publisher, price, and so on. The data items in a semantic chunk therefore appear exactly as the page data pattern prescribes. A semantic chunk appears in the tree structure of the HTML document as one or more adjacent subtrees, and if an HTML page contains several semantic chunks, these chunks are usually adjacent and at the same level of the HTML tree.

As shown in Fig. 1, the wrapper executor of the present invention runs the pre-maintenance wrapper on the changed Web page to extract the required data items, and the result is then verified. If verification passes, the XML document is returned and the data extraction succeeds. If it does not pass, the wrapper is no longer suitable for extracting the data and must be maintained so as to generate a wrapper that fits the changed HTML data structure; the maintained wrapper is then executed and the result is verified again. As Fig. 1 also shows, wrapper maintenance plays an important role in the whole Web data extraction system: it is the key to data extraction, and without it some essential data cannot be obtained. Wrapper maintenance extends the working capability of the wrapper: after a page changes, the original wrapper that no longer applies is repaired automatically so that the wrapper can continue to work, and such a wrapper system can therefore provide data extraction more robustly and stably.
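The execute, verify, and maintain cycle of Fig. 1 can be sketched as a small control loop. This is only an illustration under stated assumptions: the function name extract_with_maintenance, the callables execute, verify, and maintain, and the max_repairs limit are all hypothetical, and the toy wrapper below is just a dictionary key rather than a real set of extraction rules.

```python
def extract_with_maintenance(page, wrapper, execute, verify, maintain, max_repairs=1):
    """Run the wrapper; if verification fails, maintain the wrapper and retry."""
    for attempt in range(max_repairs + 1):
        result = execute(wrapper, page)
        if verify(result):
            return wrapper, result            # extraction succeeded
        if attempt < max_repairs:
            wrapper = maintain(wrapper, page)  # regenerate the wrapper
    raise RuntimeError("wrapper could not be repaired")


# Toy stand-ins (assumptions): the page is a dict and the wrapper is the
# key under which the wanted value used to be stored.
changed_page = {"cost": "$9.99"}
new_wrapper, data = extract_with_maintenance(
    changed_page,
    "price",
    execute=lambda w, p: p.get(w),
    verify=lambda r: r is not None,
    maintain=lambda w, p: next(iter(p)),
)
print(new_wrapper, data)
```

Keeping the three stages as separate callables mirrors the separation between the wrapper executor, the verification step, and the maintenance step in Fig. 1.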

Fig. 2 shows the detailed workflow of the present invention. The invention first computes the data item features from the Web page before the change, the wrapper before maintenance, and the Web page after the change.

Because the data items in the changed page still retain the features they had before the change, these features can be used in later steps to find possible data items in the changed document. The invention therefore computes these features first.

Computing the semantic feature of a data item. The semantic feature of a data item is described by a regular expression. For example, a price always has the form "{ () [0-9] { 1, } (.) [0-9] { 0,2}}"; this regular expression indicates that a price data item is generally a decimal number preceded by a leading symbol. In this way the relevant regular-expression features of the data can be extracted.
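A semantic feature of this kind might be checked as in the following minimal sketch. The exact expression in the text is not fully legible, so the pattern here is an assumption: it uses "$" as an illustrative leading symbol, followed by one or more digits and an optional decimal part of up to two digits. The name has_price_feature is hypothetical.

```python
import re

# Assumed pattern: a leading "$" (an illustrative choice, not taken from the
# text), one or more digits, and an optional decimal part of at most two digits.
PRICE = re.compile(r"\$[0-9]{1,}(\.[0-9]{0,2})?")


def has_price_feature(text):
    """Return True if the whole text, ignoring outer whitespace, looks like a price."""
    return PRICE.fullmatch(text.strip()) is not None


for sample in ["$19.99", "$5", "19.99", "Dune"]:
    print(sample, has_price_feature(sample))
```

Matching the whole text of a node, rather than searching inside it, keeps ordinary sentences that merely mention a price from being mistaken for a price data item.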

Computing the descriptive information (Annotation) of a data item. HTML paths are expressed in full conformance with the XPath standard, and the predicate [contains(...)] in a path indicates the annotation information. Because every extraction rule is an HTML path, the descriptive information can be obtained directly from the wrapper for the pre-change page during wrapper maintenance.

Computing the link information of a data item. The link information is obtained from the HTML paths in the extraction rules while the descriptive information (Annotation) is computed in the previous step.
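Reading the annotation out of an extraction rule could look like the sketch below. The rule string is hypothetical: the text states only that rules are XPath paths and that a [contains(...)] predicate carries the annotation, so the two-argument contains(text(), "...") form and the helper name annotations are assumptions.

```python
import re

# Matches predicates such as [contains(text(), "Price:")]. The first argument
# of contains() is skipped and the quoted second argument is captured.
CONTAINS = re.compile(r"""\[contains\([^,]*,\s*(['"])(.*?)\1\)\]""")


def annotations(rule):
    """Return the annotation strings found in the contains() predicates of a rule."""
    return [m.group(2) for m in CONTAINS.finditer(rule)]


# Hypothetical extraction rule from the wrapper for the pre-change page.
rule = '/html/body/table/tr/td[contains(text(), "Price:")]/following-sibling::td'
print(annotations(rule))
```

A full XPath parser would be more robust than a regular expression; the regular expression is used here only to keep the sketch short.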

The next step is to find possible data items in the changed document. Using the data item features computed above, this step traverses the changed HTML tree depth-first and finds the possible data items in it. The detailed process is shown in Fig. 3. The input to the process is the changed HTML page that caused the original wrapper to fail; note that at this point the page has already been parsed into a tree structure (DOM tree). Using the three basic features that data items retain in the HTML page, as described in the previous step (the semantic feature, the descriptive information, and the link information of the data item), the process starts from the root node and first judges whether the root node matches the data item features; if not, it recursively processes all of its child nodes. If the current node is an ELEMENT node (a non-leaf node), its child nodes are checked recursively. If the current node is a TEXT node (a leaf node containing a text value), the process first judges whether the node contains the annotation information (Annotation) of a data item; if it does, the corresponding data value is sought starting from this node (it may be in the same node or in an adjacent node). If the node does not contain annotation information, the semantic feature and the link information are checked, and if both match, the node is also taken as a node of a possible data item. All nodes in the HTML tree are processed recursively until every node has been visited.
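The traversal described above can be sketched with Python's standard xml.dom.minidom module. Everything concrete here is an assumption made for illustration: the sample page, the ANNOTATIONS and SEMANTIC tables, the "$" price pattern, and in particular LINK_PARENT, which stands in for the link information by recording the expected parent tag of a value.

```python
import re
from xml.dom import minidom, Node

# Hypothetical changed page, well-formed so that minidom can parse it.
PAGE = ("<html><body><table>"
        "<tr><td>Title:</td><td>Dune</td></tr>"
        "<tr><td>Price:</td><td>$9.99</td></tr>"
        "</table></body></html>")

ANNOTATIONS = {"Title:": "title", "Price:": "price"}          # annotation -> data item
SEMANTIC = {"price": re.compile(r"\$[0-9]+(\.[0-9]{0,2})?")}  # assumed semantic features
LINK_PARENT = {"price": "td"}  # assumed stand-in for the link information


def value_near(text_node, annotation):
    """Look for the value in the same text node, then in the next element."""
    rest = text_node.data.strip()[len(annotation):].strip()
    if rest:
        return rest
    sibling = text_node.parentNode.nextSibling
    while sibling is not None and sibling.nodeType != Node.ELEMENT_NODE:
        sibling = sibling.nextSibling
    if sibling is not None and sibling.firstChild is not None:
        first = sibling.firstChild
        if first.nodeType == Node.TEXT_NODE:
            return first.data.strip()
    return None


def find_candidates(node, found):
    """Depth-first search for possible data items, collected as (item, value)."""
    if node.nodeType == Node.ELEMENT_NODE:
        for child in node.childNodes:          # non-leaf node: recurse
            find_candidates(child, found)
    elif node.nodeType == Node.TEXT_NODE and node.data.strip():
        text = node.data.strip()
        labelled = [a for a in ANNOTATIONS if text.startswith(a)]
        entry = None
        if labelled:                           # annotation present: look for its value
            entry = (ANNOTATIONS[labelled[0]], value_near(node, labelled[0]))
        else:                                  # otherwise check the other two features
            for item, pattern in SEMANTIC.items():
                parent_ok = node.parentNode.tagName == LINK_PARENT.get(item)
                if pattern.fullmatch(text) and parent_ok:
                    entry = (item, text)
        if entry is not None and entry not in found:
            found.append(entry)
    return found


root = minidom.parseString(PAGE).documentElement
candidates = find_candidates(root, [])
print(candidates)
```

Real HTML is often not well-formed, so a tolerant HTML parser would be needed in practice; minidom is used here only to keep the sketch self-contained.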

The next important step of the invention is dividing semantic chunks. After all possible data items have been obtained, the invention divides the data into semantic chunks in order to derive the organization of the data in the changed HTML page. The invention assumes that all data lies within semantic chunks. This step derives the semantic chunks from the possible data blocks.

In the present invention, there are three possible matching relationships between a subtree of the HTML tree and the pattern:

1. Over match. The subtree contains multiple instances of the pattern, that is, multiple semantic chunks that match the pattern.

2. Partial match. The subtree contains only part of one instance of the pattern, that is, only part of a semantic chunk that matches the pattern.

3. Full match. The subtree contains exactly one instance of the pattern, that is, exactly one semantic chunk that matches the pattern.
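The three matching relationships above might be decided as in this minimal sketch. It assumes a simplified model that is not spelled out in the text: the pattern is a flat tuple of required data item names (SCHEMA, hypothetical), a subtree is represented by the list of data item labels found in it, and a repeated item is taken as the sign of more than one pattern instance.

```python
from collections import Counter

SCHEMA = ("title", "author", "price")  # hypothetical page data pattern


def classify(items, schema=SCHEMA):
    """Classify the labelled data items of one subtree against the pattern."""
    counts = Counter(i for i in items if i in schema)
    if any(counts[name] > 1 for name in schema):
        return "over"     # some item repeats: more than one instance
    if all(counts[name] == 1 for name in schema):
        return "full"     # every item exactly once: exactly one instance
    return "partial"      # some item missing: only part of an instance


print(classify(["title", "author", "price"]))
print(classify(["title"]))
print(classify(["title", "price", "title", "author", "price"]))
```

A pattern with optional or repeatable items, as a DTD allows, would need a richer test than simple counting.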

The detailed process of this step is shown in Fig. 4. The input is an HTML tree in which all possible data items have been found and marked. Starting from the root node of the HTML tree, the process first judges the matching relationship between the HTML tree and the pattern; if it is an over match, the process recursively handles the subtrees rooted at each child node of the root. For all subtrees at each level of the HTML tree, the dominant overall matching relationship with the pattern is determined (by counting the occurrences of the three kinds of match at that level and taking the most frequent one), and processing differs accordingly: for an over match, the subtrees rooted at each child node are processed recursively; for a full match, all semantic chunks (the relevant subtrees) are returned; for a partial match, adjacent subtrees are merged automatically. Subtrees are merged according to the following principle. At layer k the subtrees are merged from left to right: the second subtree is first added to the first, and the merged block is checked for an over match; if it does not over-match, the following subtrees continue to be added to the block; when it over-matches, the subtree added last is removed and the remaining block is kept. The merge operation is then repeated starting from the first subtree not yet added to a block, until all subtrees of the layer have been merged. Finally the fully matched or partially matched blocks are returned, and the process goes back to the previous layer of the HTML tree and continues in the same way. When the process ends, it outputs all the semantic chunks that have been divided.
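The dominant-relationship vote and the left-to-right merge can be sketched as follows. The model is the same simplification as before and is an assumption: each subtree of a level is represented by the list of data item labels it contains, the pattern is a flat tuple of item names, and the names classify, dominant, and merge_level are hypothetical.

```python
from collections import Counter

SCHEMA = ("title", "author", "price")  # hypothetical page data pattern


def classify(items, schema=SCHEMA):
    """Return 'over', 'full' or 'partial' for the items of one block."""
    counts = Counter(i for i in items if i in schema)
    if any(counts[name] > 1 for name in schema):
        return "over"
    if all(counts[name] == 1 for name in schema):
        return "full"
    return "partial"


def dominant(subtrees, schema=SCHEMA):
    """Most frequent matching relationship among the subtrees of one level."""
    kinds = Counter(classify(s, schema) for s in subtrees)
    return kinds.most_common(1)[0][0]


def merge_level(subtrees, schema=SCHEMA):
    """Merge adjacent subtrees left to right until adding one would over-match."""
    blocks, current = [], []
    for items in subtrees:
        if current and classify(current + list(items), schema) == "over":
            blocks.append(current)      # keep the block without the last subtree
            current = list(items)       # start a new block from that subtree
        else:
            current = current + list(items)
    if current:
        blocks.append(current)
    return blocks


level = [["title"], ["author", "price"], ["title", "author"], ["price"]]
print(dominant(level))
print(merge_level(level))
```

In this example every subtree is only a partial match, so the level is merged, and the four subtrees collapse into two blocks that each contain one complete instance of the pattern.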

The final step of the invention repairs the rules and regenerates the wrapper. The preceding steps yield a series of semantic chunks, which contain the data to be extracted from the changed page, that is, the goal of wrapper maintenance. The data in these semantic chunks has also been labelled with its different roles (through data item identification), which means that all the semantic chunks conform to the pattern. Therefore a semantic chunk that matches the pattern well can be selected, and new extraction rules can be regenerated from it. To make the extraction rules cover as many cases as possible, several semantic chunks can be chosen to generate extraction rules, and these rules are finally merged. Lastly, the wrapper is regenerated from the derived extraction rules; this wrapper can extract the data in the changed page.
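One simple way to derive a new XPath-based extraction rule from a labelled node is to record its position in the tree, as in the sketch below. This is an assumption made for illustration: the text states only that rules are XPath-based, not how they are generated, and the function xpath_of and the sample page are hypothetical.

```python
from xml.dom import minidom, Node


def xpath_of(element):
    """Build a positional XPath for an element by walking up to the root."""
    steps = []
    node = element
    while node is not None and node.nodeType == Node.ELEMENT_NODE:
        index = 1
        sibling = node.previousSibling
        while sibling is not None:  # count earlier siblings with the same tag
            if sibling.nodeType == Node.ELEMENT_NODE and sibling.tagName == node.tagName:
                index += 1
            sibling = sibling.previousSibling
        steps.append(f"{node.tagName}[{index}]")
        node = node.parentNode
    return "/" + "/".join(reversed(steps))


# Hypothetical changed page in which the second <p> holds the data item.
doc = minidom.parseString("<html><body><div><p>a</p><p>b</p></div></body></html>")
target = doc.getElementsByTagName("p")[1]
rule = xpath_of(target)
print(rule)
```

Rules generated from several semantic chunks would differ only in some positional indices, which suggests one way the rules could be merged, for example by dropping the indices that vary.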

The invention first considers the role of the pattern and of the HTML tree structure in wrapper maintenance, and proposes its solution on that basis. The invention is fully implementable. Because the pattern is supplied by the user, the user's requirements can be defined more precisely and the semantics of the data contained in the page can be described better. With a user-supplied pattern, the invention can also handle more complex forms of content. All matching relationships and the path expressions in the extraction rules use standard XPath, which gives the system good standardization, extensibility, and flexibility. The invention keeps the extraction rules separate from the final wrapper file, so that the generated wrapper, which exists as a Java class (Java code), can be included very easily in other applications, in particular Web application systems developed in Java.

Claims

1. A maintenance method for a wrapper, comprising the following steps:

2. The maintenance method for a wrapper according to claim 1, characterized in that the identifying features are the semantic feature of the data item, the descriptive information (annotation) of the data item, and the link information of the data item.

3. The maintenance method for a wrapper according to claim 2, characterized in that extracting data items from the changed Web page comprises: in the Web page parsed into a tree structure, starting from the root node, judging whether the root node matches the data item identifying features; if it does not, recursively processing all of its child nodes until all nodes have been traversed.

4. The maintenance method for a wrapper according to claim 3, characterized in that, in the process of searching for data item nodes, if a node is an ELEMENT node, that is, a non-leaf node, its child nodes are checked recursively; if a node is a TEXT node, that is, a leaf node containing a text value, it is first judged whether the node contains the annotation information of a data item; if it does, the corresponding data value is sought starting from this node; if it does not, the semantic feature and the link information are checked, and if both match, this node is also taken as a node of a possible data item.

5. The maintenance method for a wrapper according to claim 1, characterized in that automatically merging adjacent subtrees comprises: merging the subtrees among the child nodes of the root node in left-to-right order; first adding the second semantic chunk to the first and judging whether the merged block over-matches; if not, continuing to add the following subtrees to the block; when the block over-matches, removing the subtree added last to obtain a semantic chunk, and keeping this semantic chunk; repeating the merge operation starting from the first subtree not yet added to a block, until all subtrees of the layer have been merged; returning the fully matched or partially matched semantic chunks; then returning to the other nodes of the root node and repeating the above steps until all nodes have been traversed.