CN103514292A

CN103514292A - Webpage data extraction method based on semi-supervised learning of small sample

Info

Publication number: CN103514292A
Application number: CN201310465730.4A
Authority: CN
Inventors: 黄宜华; 罗雷; 施生生; 袁春风
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2013-10-09
Filing date: 2013-10-09
Publication date: 2014-01-15

Abstract

The invention discloses a webpage data extraction method based on small-sample semi-supervised learning. The method comprises the following steps: a group of sample webpages is selected from similar webpages generated from the same webpage template, and the user manually selects and labels the data items to be extracted, which are called labeled data items; the nodes on the DOM tree that correspond to the labeled data items are called labeled nodes; an initial candidate feature set is constructed from the different features of the labeled data items on the corresponding DOM trees; a semi-supervised learning method determines the minimal set of associated features within the initial candidate feature set and derives a generalized extraction rule for the data item; the same rule derivation is carried out for every data item to be extracted from the webpage, yielding a set of extraction rules for the data items on that webpage; the extraction rules are then applied to the similar webpages to extract data in batch. The method achieves both the generation of webpage data extraction rules and the automatic extraction of webpage data.

Description

A webpage data extraction method based on small-sample semi-supervised learning

Technical field

The present invention relates to a data extraction method, and in particular to a webpage data extraction method based on small-sample semi-supervised learning.

Background art

Today the Internet has become the largest source of data in the world, and many applications based on Web information need to obtain data directly from the Internet. Once valuable data has been obtained from different Web data sources, it can be integrated to provide a variety of value-added services, such as Web public-opinion analysis, price comparison systems, and vertical search.

To allow an application system to obtain the data it is interested in from the Web accurately, quickly, and conveniently, effective Web information extraction techniques and tools are needed. Web information extraction refers to the process of extracting the data that a user or application is interested in from structured or semi-structured webpages and importing it into the application system in structured form for further analysis and processing.

Over the past ten years, Web information extraction has remained an active research topic and has been studied extensively. An important problem in Web information extraction is the generation of extraction rules for webpage data, and many different rule generation techniques have been proposed.

Writing rules by hand places a heavy burden on the user and is inefficient. To improve the efficiency of rule generation and reduce the burden of hand-writing extraction rules, most researchers have concentrated on automatic rule generation techniques, for example DEPTA, ViDE, MDR, STALKER, and DEByE.

However, fully automatic methods are usually applicable only to webpages that contain a group of regular, repeating data records. They are hard to apply generally to webpages of different types, and especially to webpages containing irregular data records. As a result, fully automatic methods often fall short in extraction precision and cannot meet the needs of practical applications that require precise Web information extraction. Some studies call the extraction of data from webpages containing repeating data records "record-level data extraction", and call extraction in which each webpage contains only one non-repeating data record, so that extraction rules can be derived only from the data records on several webpages, "page-level data extraction".

For page-level data extraction, a good overall balance is needed among rule generation efficiency, applicability, and extraction precision: rule generation should remain efficient while the rule generation method stays widely applicable and precise, in particular so that the rule generation problem for webpages with irregular data records can be solved efficiently. An effective compromise is a semi-automatic rule generation method based on user interaction and labeling. User interaction and labeling here resembles the text selection a user performs in Office Word before a copy or delete operation: the user simply selects and labels the data items of interest on the webpage. This semi-automatic technique obtains more accurate location and extraction information for the data items and lets the user specify exactly which data items on the webpage are to be extracted. It therefore significantly improves extraction precision and spares the user the secondary filtering that is needed after a fully automatic method has extracted all data records and data items. At the same time, the interaction and labeling information allows extraction rules for the data items to be derived quickly and generated automatically by a semi-supervised learning method, which avoids the heavy burden of writing rules by hand.

In real Web applications, webpages containing data records are usually generated dynamically from the same template, and the data they present usually comes from a structured back-end database. The generated webpages therefore share a very similar template. Webpages generated from a template usually have similar DOM tree structures, so XPath expressions over the DOM tree can be used to locate and extract data of the same kind. However, similar pages from the same template usually differ in small ways, and a simple XPath extraction rule generally has little generalization ability, so it may fail when the structure varies. It is therefore desirable to derive, in an efficient way, extraction rules that generalize and behave stably, so that structural changes across similar pages can be handled.

How to learn and derive stable extraction rules automatically from similar webpages, however, is a technical problem that has not yet been fully solved. Myllymaki and Jackson first constructed stable extraction rules by hand and were the first to formally define the stability of an extraction rule. N. Niesh used a probabilistic tree-edit model to describe how a webpage changes over time and incorporated this probabilistic model into the derivation of extraction rules. Although this approach helps select a stable extraction rule from a set of candidate rules, the probabilistic model requires a large amount of historical page information, is inefficient, and is not generally applicable.

How to learn stable extraction rules quickly from a small sample of labeled pages therefore remains a difficult open problem.

Summary of the invention

Object of the invention: in view of the problems and shortcomings of the prior art described above, the object of the invention is to provide a webpage data extraction method based on small-sample semi-supervised learning that can both generate webpage data extraction rules and automatically extract webpage data.

Technical solution: to achieve the above object, the invention adopts a webpage data extraction method based on small-sample semi-supervised learning, comprising the following steps:

(1) for similar webpages generated from the same webpage template, select a group of sample webpages; on one to three of these sample webpages, the user manually selects and labels the same data item to be extracted, and this data item is called the labeled data item;

(2) the node on the DOM tree that corresponds to the labeled data item is called the labeled node; according to the different features of the labeled data item on the corresponding DOM tree, construct an initial candidate feature set for this labeled node;

(3) based on the sample webpages, use a semi-supervised learning method with a first algorithm to determine the minimal set of associated features within the initial candidate feature set, and derive an extraction rule with generalization ability for this data item; an extraction rule with generalization ability is one that still extracts the data item stably and correctly when the data item undergoes some structural changes across different webpages;

(4) carry out the rule derivation of step (3) for each data item to be extracted from a webpage, obtaining a group of extraction rules for the group of data items on that webpage;

(5) apply the derived group of extraction rules to a batch of similar webpages from which data is to be extracted, and finally extract a batch of data items.

Further, in step (2), the initial candidate features are of three types: structural features, attribute features, and content features; each initial candidate feature is described by a triple (feature type t, distance d, feature value v).

Further, the structural features comprise the structural features of all intermediate nodes on the path from the labeled node to the root of the DOM tree. A structural feature is described as (TAG, d, nv), where TAG indicates a DOM intermediate node element, d is the distance on the DOM tree from this intermediate node to the labeled node, and nv is the element name of the intermediate node;

The attribute features comprise the id and class attribute features of all intermediate nodes on the path from the labeled node to the root of the DOM tree. An attribute feature is described as (attribute type name, d, av), where av is the value of the attribute. The attribute type names cover the basic id and class attributes of the intermediate node, the id and class attributes of its preceding sibling nodes, and the id and class attributes of its following sibling nodes;

The content features comprise the first non-empty text nodes that occur before and after the labeled node and help identify and locate it. Such non-empty text is defined as anchor text. A content feature is described as (text feature name, d, cv), where cv is the value of the anchor text; the text feature names are left anchor text and right anchor text;

The same data item is allowed to have different feature values under the same feature on two similar pages; the different feature values are combined with an OR operation. "The same feature" here means the same feature type and the same distance.

Further, in step (3), the candidate feature set of the labeled node is processed iteratively: candidate features are repeatedly tested and merged on the sample webpages until a generalized extraction rule that correctly extracts the labeled data item is found.

Further, step (3) comprises:

Let X denote the extraction rule derived from the candidate feature set F. The distance of the extraction rule X, written $\mathrm{dist}(X)$, is defined as the maximum distance over all features in F:

$$\mathrm{dist}(X) = \max_{f \in F} d_f$$

where $f$ denotes a feature in the candidate feature set F and $d_f$ is its distance component.

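Support is defined as follows: for the unlabeled pages among the sample pages, the support of the extraction rule X, written $\mathrm{sup}(X)$, is the fraction of unlabeled pages from which X extracts correctly:

$$\mathrm{sup}(X) = \frac{\text{number of unlabeled pages from which } X \text{ extracts correctly}}{\text{total number of unlabeled pages}}$$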

In addition, the recall and precision of the generated rule are defined. Let A denote the set of labeled nodes of a data item on the labeled pages, B the set of labeled nodes that the extraction rule X correctly extracts on the labeled pages, and C the set of all nodes that X actually extracts on the labeled pages. The precision $\mathrm{prec}(X)$ and recall $\mathrm{rec}(X)$ of the extraction rule X are defined as:

$$\mathrm{prec}(X) = \frac{|B|}{|C|}, \qquad \mathrm{rec}(X) = \frac{|B|}{|A|}$$

During rule derivation, the extraction rule finally generated satisfies the following four conditions:

1) recall is 1: the extraction rule returns all labeled nodes on the labeled pages;

2) among all extraction rules satisfying condition 1), the rule with the highest precision in extracting the labeled nodes is selected;

3) among all extraction rules satisfying conditions 1) and 2), the rule with the smallest distance is selected;

4) among all extraction rules satisfying the three conditions above, the rule with the highest support is selected as the final extraction rule.

Further, step (3) comprises: during the iteration, an extraction-rule derivation method based on progressive, stepwise refinement is used; initial features are repeatedly tested and merged to narrow the range of nodes on the DOM tree that the extraction rule matches when locating the labeled node. If an initial or merged feature cannot locate the labeled node exactly, further features are added step by step to narrow the range; through this feature merging and incremental learning, a feature set that uniquely locates the labeled node is derived.

Further, the first algorithm is an Apriori-like algorithm.

Beneficial effects: the invention can be combined with many existing Web information extraction systems that are based on structured extraction rules, automatically generating data extraction rules for those systems. Because the invention generates rules and extracts data semi-automatically on the basis of simple interaction and labeling, no rules need to be written by hand; this greatly increases the degree of automation of Web information extraction and reduces the user's manual workload.

Brief description of the drawings

Fig. 1 is a flow chart of extraction rule derivation and webpage data extraction according to the invention;

Fig. 2 is an example of a sample page and a labeled data item;

Fig. 3 is an example of the DOM tree structure of a labeled data item and its labeled node;

Fig. 4 shows DOM tree fragments of two detail pages on an example website;

Fig. 5 shows DOM tree fragments of three detail pages on an example movie website;

Fig. 6 shows the average precision, recall, and F1 value of the extraction rules obtained by three methods.

Detailed description of the embodiments

The invention is further described below with reference to the drawings and specific embodiments. These embodiments serve only to illustrate the invention and do not limit its scope; after reading this disclosure, modifications of the invention in various equivalent forms made by those skilled in the art all fall within the scope defined by the claims of this application.

Fig. 1 shows the flow of extraction rule derivation and webpage data extraction. The flow has three stages: the user labeling stage, the XPath extraction rule derivation stage, and the webpage data extraction stage.

On a single website, pages generated from the same template have similar page structures, but they may still contain small structural differences, for example because HTML nodes are missing or shifted. To derive an extraction rule that generalizes well and copes with similar pages containing such small differences, a small number of sample pages must be provided on which the user labels the data items to be extracted. Our prototype system provides a user interface in which the user selects and labels the data items of interest directly on the page, so only very simple interaction and user intervention are needed to label a data item. Compared with the large volume of Web data to be extracted, the cost and burden of this simple labeling are very low.

Once a data item has been labeled on the page, its absolute XPath can be identified quickly. Because page structure varies, however, an absolute XPath usually does not apply to all similar pages. Based on the labeled sample pages, we therefore consider the various features related to the data item and use semi-supervised learning to construct a relative XPath extraction rule with stronger generalization ability.

XPath extraction rule derivation based on semi-supervised learning consists of two main steps. First, the initial feature set of the labeled data item is obtained. Then an Apriori-like algorithm combines these features by progressive, stepwise refinement until the optimal XPath extraction rule is found. The resulting extraction rule has strong generalization ability and can extract the data item from similar pages generated by the same template.

Once an XPath extraction rule with generalization ability has been generated for each data item, the rule can be applied to new pages to extract the target data items quickly.
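As a rough illustration of this final stage, the sketch below applies one rule to two pages using Python's standard `xml.etree.ElementTree`. The page sources, the `extract` helper, and the simplified rule are all hypothetical. ElementTree implements only a subset of XPath (it has no `node()` test and cannot select text nodes), so the rule here selects the enclosing element and reads its text instead.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages from the same template; the second has an extra <div>,
# which would break an absolute path but not the relative, attribute-based rule.
PAGES = [
    "<html><body><table><tr><td id='i2'><span id='i3'>"
    "<b class='priceLarge'>$15.99</b></span></td></tr></table></body></html>",
    "<html><body><div>Sale</div><table><tr><td id='i2'><span id='i3'>"
    "<b class='priceLarge'>$7.50</b></span></td></tr></table></body></html>",
]

# Simplified stand-in for a derived rule (assumption, not taken from the patent).
RULE = ".//*[@class='priceLarge']"


def extract(page_source: str, rule: str) -> list:
    """Parse one page and return the text of every element matched by the rule."""
    root = ET.fromstring(page_source)
    return [element.text for element in root.findall(rule)]


prices = [extract(page, RULE) for page in PAGES]
print(prices)
```

A full implementation would need an engine with complete XPath 1.0 support and an HTML-tolerant parser, since real pages are rarely well-formed XML.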

To carry out the XPath rule derivation above, we first describe the features used by the extraction rule derivation algorithm and their corresponding XPath query expressions. A feature is a DOM node, or an attribute of a DOM node, that is related to the labeled node on the DOM tree and is used to learn the extraction rule for the labeled node. We mainly use three types of features: structural features, attribute features, and content features.

We represent each feature as a triple (Type, Distance, Value), where Type is the feature type, Distance is the distance on the DOM tree between the feature node and the labeled node, and Value is the value of the feature. Each feature corresponds to an XPath expression, which can be used to find the range of nodes that contains the labeled node.

Fig. 2 is a fragment of an example page from a commercial website, in which the price "$15.99" is one of the target items to be extracted; the extraction rule for this price data item must be learned from its structural features on the DOM tree. Fig. 3 shows the DOM tree structure and features of the labeled price data item, where the labeled node is the text node pcadata($15.99).

For each intermediate node on the path from the labeled node to the root of the DOM tree, we add a structural feature ("TAG", d, nv) of the labeled node to the initial candidate feature set. The string constant TAG indicates that the feature is a tag node, nv is its node name, denoted t (table, tr, td, span, b, pcadata, and so on in the example above), and d is the distance from this intermediate node to the labeled node. For the DOM tree structure of the labeled node in Fig. 3, for example, the features ("TAG", 0, pcadata), ("TAG", 1, b), ("TAG", 2, span), and so on, for every node on the path up to the root, are all added to the initial candidate feature set.
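The collection of structural features can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `structural_features` is hypothetical, and the node order beyond `pcadata`, `b`, `span` is an assumption, since the text lists the node names but Fig. 3 is not reproduced here.

```python
from typing import List, Tuple

Feature = Tuple[str, int, str]  # (Type, Distance, Value)


def structural_features(path_to_root: List[str]) -> List[Feature]:
    """Build one ("TAG", d, name) triple per node on the path.

    path_to_root[0] is the labeled node itself (distance 0); each later
    entry is one step closer to the root of the DOM tree.
    """
    return [("TAG", distance, name) for distance, name in enumerate(path_to_root)]


# Assumed path for the price example; only the first three names and
# their distances are stated explicitly in the text.
path = ["pcadata", "b", "span", "td", "tr", "table"]
features = structural_features(path)
print(features[:3])
```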

Each structural feature corresponds to an XPath query expression that narrows the query range for the labeled node. In general, the XPath for the feature ("TAG", d, t) is "//t/*/.../*/node()", which states that the nodes at depth d below a t node may include the labeled node. For the labeled node in the example of Figs. 2 and 3, the XPath expressions for the three structural features above are "//pcadata", "//b/node()", and "//span/*/node()" respectively.
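The mapping from a structural feature to its XPath can be written as a small string-building function. This is a sketch inferred from the three examples in the text; the function name `tag_feature_xpath` is hypothetical.

```python
def tag_feature_xpath(tag: str, distance: int) -> str:
    """Return the XPath for the structural feature ("TAG", distance, tag)."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance == 0:
        # The feature node is the labeled node itself.
        return f"//{tag}"
    # distance - 1 wildcard steps, then node() at the labeled node's depth.
    return f"//{tag}" + "/*" * (distance - 1) + "/node()"


for tag, distance in [("pcadata", 0), ("b", 1), ("span", 2)]:
    print(tag_feature_xpath(tag, distance))
```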

DOM nodes on a webpage usually carry many attributes, and using them appropriately helps locate the target node quickly and accurately. In our observation, id and class are usually the two most important attributes of a DOM node: on most webpages, the id and class attributes of DOM nodes quickly narrow the query range on the DOM tree and so lead accurately to the target node.

1) Basic id and class attributes

For each intermediate node on the path from the labeled node to the root of the DOM tree, if the node has an id attribute, the id attribute feature ("ID", d, idVal) is added to the initial feature set. The string constant ID indicates that the feature is an id attribute feature, idVal is its value, and d is the distance from the node carrying the attribute to the labeled node. Class attribute features ("CLASS", d, classVal) are added to the initial feature set in the same way. For the DOM tree structure of the labeled item in Fig. 3, for example, the attribute features related to the labeled node are ("CLASS", 1, "priceLarge"), ("ID", 2, "i3"), ("ID", 3, "i2"), ("ID", 4, "actualPriceRow"), and so on; these are added to the initial candidate feature set.

Each basic attribute feature also corresponds to an XPath query expression. In general, the XPath for the attribute feature ("ID", d, idVal) is "//*[@id='idVal']/*/.../*/node()", which states that if the id attribute of a node has the value idVal, the nodes at depth d below it may include the labeled node. For the labeled node in the example of Figs. 2 and 3, the XPath expressions for the attribute features within distance 3 of the labeled node are "//*[@class='priceLarge']/node()", "//*[@id='i3']/*/node()", and "//*[@id='i2']/*/*/node()" respectively.
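A basic attribute feature can be turned into its XPath in the same way. The helper below is a sketch with a hypothetical name, `attribute_feature_xpath`; it assumes attribute values contain no single quotes and does no escaping.

```python
def attribute_feature_xpath(feature_type: str, distance: int, value: str) -> str:
    """Return the XPath for an ("ID" | "CLASS", distance, value) feature."""
    attribute = {"ID": "id", "CLASS": "class"}[feature_type]
    if distance < 0:
        raise ValueError("distance must be non-negative")
    # Anchor on any element carrying the attribute value (no quote escaping).
    anchor = f"//*[@{attribute}='{value}']"
    if distance == 0:
        return anchor
    # distance - 1 wildcard steps, then node() at the labeled node's depth.
    return anchor + "/*" * (distance - 1) + "/node()"


for feature in [("CLASS", 1, "priceLarge"), ("ID", 2, "i3"), ("ID", 3, "i2")]:
    print(attribute_feature_xpath(*feature))
```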

2) Preceding-sibling and following-sibling id and class attributes

When an intermediate node has no attribute distinctive enough to help locate the labeled node, the attributes of its preceding or following sibling nodes can sometimes do so. For this reason, the attribute features of the preceding and following siblings of each intermediate node are also added to the initial feature set.

For each intermediate node on the path from the labeled node to the root of the DOM tree, if the i-th preceding sibling of the intermediate node has an id attribute, the preceding-sibling id attribute feature ("preceding-sibling-id-i", d, psibIDVal) is added to the initial feature set. Its XPath query expression is "//*[preceding-sibling::*[position()=i][@id='psibIDVal']]/*/.../*/node()", which states that if the i-th preceding sibling of an intermediate node has the id value 'psibIDVal', the descendants at depth d below that intermediate node may include the labeled node. If the preceding sibling has a class attribute, the corresponding feature ("preceding-sibling-class-i", d, psibClassVal) is likewise added to the initial feature set. The id and class attribute features of the following siblings of an intermediate node are added in the same way, with feature types "following-sibling-id-i" and "following-sibling-class-i" respectively.

In Fig. 3, the preceding-sibling and following-sibling attribute features of the intermediate nodes on the path from the labeled node to the root are ("following-sibling-id-1", 2, "i4"), ("preceding-sibling-id-1", 3, "i1"), ("preceding-sibling-class-1", 3, "c1"), and so on. The XPath expression for ("following-sibling-id-1", 2, "i4") is "//*[following-sibling::*[position()=1][@id='i4']]/*/node()", which states that if the first following sibling of an intermediate node has the id value "i4", the grandchildren of that intermediate node may include the labeled target node.
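The sibling attribute features follow the same pattern, with the axis, attribute, and sibling index encoded in the feature type string. The sketch below parses that string and builds the XPath; the function name `sibling_feature_xpath` and the validation rules are assumptions made for illustration.

```python
VALID_AXES = ("preceding-sibling", "following-sibling")
VALID_ATTRIBUTES = ("id", "class")


def sibling_feature_xpath(feature_type: str, distance: int, value: str) -> str:
    """Return the XPath for a feature such as ("following-sibling-id-1", 2, "i4")."""
    # Split "following-sibling-id-1" into axis, attribute, and sibling index.
    prefix, index = feature_type.rsplit("-", 1)
    axis, attribute = prefix.rsplit("-", 1)
    if axis not in VALID_AXES or attribute not in VALID_ATTRIBUTES:
        raise ValueError(f"unsupported feature type: {feature_type}")
    if distance < 0:
        raise ValueError("distance must be non-negative")
    anchor = f"//*[{axis}::*[position()={int(index)}][@{attribute}='{value}']]"
    if distance == 0:
        return anchor
    # distance - 1 wildcard steps, then node() at the labeled node's depth.
    return anchor + "/*" * (distance - 1) + "/node()"


print(sibling_feature_xpath("following-sibling-id-1", 2, "i4"))
```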

Furthermore, a page may lack sufficiently distinctive structural and attribute features while certain content features locate the node well. The invention therefore uses distinctive text that appears before and after a data item as content features.

Data on a webpage is displayed for users to browse, and to help users understand what a data item means, many data items are preceded or followed by descriptive text. We define the first non-empty text that occurs before or after the labeled node as anchor text. For the labeled price "$15.99" in Figs. 2 and 3, the first non-empty text node preceding it is "Price:", so the content feature ("Left-text", 0, "Price:") is added to the initial feature set as a content feature of the labeled node; the string constant "Left-text" indicates that this anchor text is the first to occur on the left of the labeled item. The XPath query expression for this content feature is "//node()[preceding::text()[normalize-space(.)!=''][position()=1][.='Price:']]", which states that any node whose first preceding non-empty text has the value "Price:" may be the target node to extract. In Fig. 3, this XPath expression narrows the query range for the target node to four nodes: "pcadata($15.99)", "b", "span(id=i3)", and "td(id=i2)". Similarly, the following content feature ("Right-text", 0, #text) is added to the initial feature set; its XPath is "//node()[following::text()[normalize-space(.)!=''][position()=1][.=#text]]".

To make the extraction rule generalize better, we also consider OR operations on features. When a data item is labeled on several sample pages, the same data item may have different values under the same feature (same feature type and same distance) on two similar pages, so an OR operation on features is added to cover this multi-valued case. Fig. 4, for example, shows DOM tree fragments of two similar pages on an example website, where the labeled node is the number of users on the page. In the left fragment the following content feature of the labeled node is ("Right-text", 0, "users"), and in the right fragment it is ("Right-text", 0, "user"). The two have the same feature type and the same distance to the labeled node, but the former value is the plural "users" and the latter the singular "user". With the OR operation they are treated as one feature ("Right-text", 0, "users" or "user"), whose XPath rule is "//node()[following::text()[normalize-space(.)!=''][position()=1][.='users' or .='user']]". This expression states that any node whose first following non-empty text is "users" or "user" may be the target node to extract.
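A content feature with one or several OR-ed values can be rendered as XPath with a few lines of string building. The helper name `content_feature_xpath` is hypothetical, and the sketch assumes anchor texts contain no single quotes.

```python
from typing import Sequence


def content_feature_xpath(feature_type: str, values: Sequence[str]) -> str:
    """Return the XPath for a "Left-text" or "Right-text" feature.

    Several values are combined with `or`, as described for the
    "users" / "user" example.
    """
    axis = {"Left-text": "preceding", "Right-text": "following"}[feature_type]
    if not values:
        raise ValueError("at least one anchor text value is required")
    alternatives = " or ".join(f".='{value}'" for value in values)
    # First non-empty text node on the chosen axis must equal one of the values.
    return (
        f"//node()[{axis}::text()[normalize-space(.)!='']"
        f"[position()=1][{alternatives}]]"
    )


print(content_feature_xpath("Left-text", ["Price:"]))
print(content_feature_xpath("Right-text", ["users", "user"]))
```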

Fig. 5 shows DOM tree fragments of three similar pages on a movie website, where the release year appears immediately after the movie title. When the user labels the movie title on three similar sample pages, the following content features of the labeled node on the three pages are ("Right-text", 0, "(1994)"), ("Right-text", 0, "(2009)"), and ("Right-text", 0, "(2010)"). Their text values all differ. If these features were combined by OR into one feature ("Right-text", 0, "(1994)" or "(2009)" or "(2010)"), the corresponding XPath rule could extract only from pages with one of these three text values. Real pages to be extracted will contain many more release years, and an OR expression cannot enumerate values this varied. Therefore, when labeling on several pages, if features with the same type and distance have three or more distinct values, we regard these values as unstable and unsuitable as a content feature, and other features must be found to learn the extraction rule. In the worst case, then, only three labeled pages are needed to learn an XPath extraction rule.
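The merging and stability check described above can be sketched as follows. The function name `merge_feature_values` and the constant `UNSTABLE_THRESHOLD` are hypothetical; the sketch applies the three-value threshold to every feature type for simplicity, and does not address features that appear on only some of the labeled pages.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Feature = Tuple[str, int, str]  # (Type, Distance, Value)

# Three or more distinct values for one (type, distance) are treated as unstable.
UNSTABLE_THRESHOLD = 3


def merge_feature_values(pages: List[List[Feature]]) -> Dict[Tuple[str, int], List[str]]:
    """Group values by (type, distance) across labeled pages and drop unstable features."""
    values = defaultdict(set)
    for page_features in pages:
        for feature_type, distance, value in page_features:
            values[(feature_type, distance)].add(value)
    return {
        key: sorted(distinct)
        for key, distinct in values.items()
        if len(distinct) < UNSTABLE_THRESHOLD
    }


user_pages = [[("Right-text", 0, "users")], [("Right-text", 0, "user")]]
movie_pages = [
    [("Right-text", 0, "(1994)")],
    [("Right-text", 0, "(2009)")],
    [("Right-text", 0, "(2010)")],
]
print(merge_feature_values(user_pages))   # the two values are kept and OR-ed
print(merge_feature_values(movie_pages))  # the feature is discarded as unstable
```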

We want to induce, from a small number of sample pages, a stable XPath extraction rule that uniquely and accurately locates the labeled node. In the invention, the extraction rule is obtained from the initial candidate feature set by incremental learning with stepwise refinement. The goal is to select the XPath extraction rule with the best generalization ability, one that is stable enough to cope with small changes in page structure. The stability of an extraction rule can be defined as the probability that the rule still locates the data item in similar pages later on; computing this probability, however, requires a large amount of historical data and ignores the properties of the extraction rule itself.

The invention evaluates the stability of an extraction rule by two criteria: distance and support. Distance is the maximum distance between the features that make up the extraction rule and the labeled node. For example, the two features ("TAG", 0, pcadata) and ("TAG", 1, b) can be merged into the rule "//b/pcadata", and the distance of this rule is defined as the maximum distance in the feature set, which is 1. The distance of an extraction rule reflects how closely the feature set that makes up the rule is tied to the labeled node: the larger the distance, the looser the tie. Support is the proportion of unlabeled sample pages from which the extraction rule extracts correctly.

Let X(F) denote the XPath extraction rule X derived by merging the feature set F. The distance of the extraction rule X, written $\mathrm{dist}(X)$, is the maximum distance over all features in F:

$$\mathrm{dist}(X) = \max_{f \in F} d_f$$

where $d_f$ is the distance component of feature $f$.

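Support is defined as follows: for the unlabeled pages among the sample pages, the support sup of the extraction rule X is the fraction of unlabeled pages from which X extracts correctly, that is:

$$\mathrm{sup}(X) = \frac{\text{number of unlabeled pages from which } X \text{ extracts correctly}}{\text{total number of unlabeled pages}}$$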
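The two stability criteria are simple to compute once the feature set and the per-page outcomes are known. The sketch below uses hypothetical function names and assumes that whether a rule "extracts correctly" from an unlabeled page has already been decided elsewhere and is passed in as a boolean per page.

```python
from typing import List, Tuple

Feature = Tuple[str, int, str]  # (Type, Distance, Value)


def rule_distance(features: List[Feature]) -> int:
    """Distance of a rule: the largest distance among its features."""
    return max(distance for _, distance, _ in features)


def rule_support(extracted_correctly: List[bool]) -> float:
    """Support of a rule: fraction of unlabeled sample pages extracted correctly."""
    if not extracted_correctly:
        raise ValueError("at least one unlabeled page is required")
    return sum(extracted_correctly) / len(extracted_correctly)


# The rule "//b/pcadata" is built from these two features, so its distance is 1.
print(rule_distance([("TAG", 0, "pcadata"), ("TAG", 1, "b")]))
# Hypothetical outcome on four unlabeled pages: three correct, one not.
print(rule_support([True, True, False, True]))
```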

We also define the recall and precision of the generated rule. Let A denote the set of labeled nodes of a data item on the labeled pages, B the set of labeled nodes that the extraction rule X correctly extracts on the labeled pages, and C the set of all nodes that X actually extracts on the labeled pages. The precision and recall of X are defined as follows:

$$\mathrm{prec}(X) = \frac{|B|}{|C|}, \qquad \mathrm{rec}(X) = \frac{|B|}{|A|}$$
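Taking B as the intersection of the labeled set A and the extracted set C, precision is |B|/|C| and recall is |B|/|A|. The sketch below computes both with Python sets; the node identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the text specifies.

```python
from typing import Set, Tuple


def precision_recall(labeled: Set[str], extracted: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of a rule on the labeled pages.

    labeled   corresponds to A, the labeled nodes.
    extracted corresponds to C, every node the rule actually returns.
    """
    correct = labeled & extracted  # B: labeled nodes that were extracted
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(labeled) if labeled else 0.0
    return precision, recall


# Hypothetical node ids: the rule finds both labeled nodes plus two others.
print(precision_recall({"n1", "n2"}, {"n1", "n2", "n3", "n4"}))
```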

During rule derivation, the XPath extraction rule finally generated must satisfy the following four conditions:

1) The recall is 1: the extraction rule returns all marked nodes on the marked pages;

2) Among all extraction rules that satisfy condition 1), select those with the highest precision for the marked nodes, so that the rule extracts as few unmarked nodes on the marked pages as possible;

3) Among all extraction rules that satisfy the two conditions above, select those with the smallest distance;

4) Among all extraction rules that satisfy the three conditions above, select the one with the highest support as the final extraction rule.
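The selection conditions can be sketched as a filter followed by a lexicographic comparison. This sketch takes condition 3) to be "smallest distance", as the later algorithm description states; the `Candidate` type and all numeric values are hypothetical.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    xpath: str
    recall: float
    precision: float
    distance: int
    support: float

def select_rule(candidates):
    # Condition 1: keep only the rules that return every marked node.
    full_recall = [c for c in candidates if c.recall == 1.0]
    if not full_recall:
        return None
    # Conditions 2-4: highest precision, then smallest distance, then highest support.
    return max(full_recall, key=lambda c: (c.precision, -c.distance, c.support))

# Illustrative values only; they are not measurements from the experiments.
candidates = [
    Candidate("//pcadata", 1.0, 0.1, 0, 1.0),
    Candidate("//b/pcadata", 1.0, 1.0, 1, 1.0),
    Candidate("//pcadata[...]", 1.0, 1.0, 0, 1.0),
    Candidate("//td", 0.5, 1.0, 0, 1.0),
]
print(select_rule(candidates).xpath)  # //pcadata[...]
```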

With a single feature, the XPath query expression for the marked node may also match unmarked nodes on the page. If one feature cannot locate the marked node precisely, further features must be merged in step by step to narrow the query scope, so that incremental learning gradually derives a feature set that locates the marked node uniquely. This is the idea behind the feature-merging incremental learning and derivation that we adopt.

For example, in the sample shown in Fig. 2 and Fig. 3, suppose we want to learn the extraction rule for the marked node "pcadata ($15.99)". If only the structural feature ("TAG", 0, pcadata) of the marked node is used, the corresponding XPath expression "//pcadata" extracts all text nodes. If the content feature ("Left-text", 0, "Price") of the marked node is added, its corresponding XPath expression "//node()[preceding::text()[normalize-space(.)!=''][position()=1][.='Price:']]" locates the marked node "pcadata ($15.99)" uniquely and accurately. Combining the two features therefore yields a more precise XPath extraction rule, "//pcadata[preceding::text()[normalize-space(.)!=''][position()=1][.='Price:']]". This rule states that the node to extract is a text node and that the first non-empty text preceding it has the value "Price:". In Fig. 2 and Fig. 3, this XPath expression uniquely locates the marked price data item node.
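The merging of features into one XPath string can be sketched as follows. This is a minimal illustration, not the patented procedure: it only handles "TAG" and "Left-text" features, the function name `features_to_xpath` is hypothetical, the Left-text value is passed with its colon, and quotes inside values are not escaped.

```python
def features_to_xpath(features):
    """Merge (type, distance, value) features into a single XPath string."""
    tags = {d: v for t, d, v in features if t == "TAG"}
    max_d = max(tags, default=0)
    # Location steps run from the farthest ancestor down to the marked node
    # (distance 0); missing levels become "*" and a missing node test "node()".
    steps = [tags.get(d, "*" if d else "node()") for d in range(max_d, -1, -1)]
    xpath = "//" + "/".join(steps)
    for t, d, v in features:
        if t == "Left-text":
            # The nearest non-empty preceding text node must equal the value.
            xpath += ("[preceding::text()[normalize-space(.)!='']"
                      "[position()=1][.='%s']]" % v)
    return xpath

print(features_to_xpath([("TAG", 0, "pcadata"), ("TAG", 1, "b")]))
print(features_to_xpath([("TAG", 0, "pcadata"), ("Left-text", 0, "Price:")]))
```

The first call prints the structural rule "//b/pcadata"; the second prints the combined rule for the price node.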

To obtain the optimal XPath extraction rule, the user first marks the target data items in the sample pages, and the system then automatically collects the initial features of the marked target data items. This process is straightforward: traverse the DOM tree of the marked page from the marked node up to the root node, add the features described in Section 4 to the candidate feature set S, and then derive the extraction rule by incremental learning.
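The walk from the marked node up to the root can be sketched with the Python standard library. This is a simplified illustration under several assumptions: the marked node is an element rather than a text node, only structural ("TAG") and attribute features are collected, and the ("ATTR", d, "name=value") encoding and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET

def collect_features(root, marked):
    """Collect (type, distance, value) features from the marked node up to the root."""
    # ElementTree keeps no parent pointers, so build a child -> parent map first.
    parent = {child: p for p in root.iter() for child in p}
    features, node, d = [], marked, 0
    while node is not None:
        features.append(("TAG", d, node.tag))
        for name, value in node.attrib.items():
            features.append(("ATTR", d, "%s=%s" % (name, value)))  # hypothetical encoding
        node, d = parent.get(node), d + 1
    return features

page = ET.fromstring(
    "<html><body><div class='item'><b>Price:</b><span>$15.99</span></div></body></html>"
)
S = collect_features(page, page.find(".//span"))
for feature in S:
    print(feature)
```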

The sample pages provided by the user consist of two parts: one to three marked sample pages and several unmarked sample pages. XPath extraction rules are learned from the marked sample pages, and the rule with the highest support is then selected using the unmarked sample pages. Once the initial feature set of the marked node has been obtained, a frequent-itemset association rule method similar to the Apriori algorithm is used to mine the minimal associated features, which are then merged into an XPath extraction rule.

Let S be the initial candidate feature set described in Section 4, and let the first-round candidate collection consist of the feature sets each formed from a single feature in S. To make the algorithm below easier to follow, the initial candidate feature set S and the first-round candidate collection for the example above are as follows:

[Table: Initial candidate feature set]

In the algorithm, lines 5 to 22 form one round of iteration, which goes from the candidate collection of round k to that of round k+1. In round k, we first check each feature set F in the current candidate collection: if the recall of its XPath X(F) is not 1, F is deleted from the candidate collection, so all remaining feature sets satisfy condition 1). In lines 9 to 12 we keep track of the optimal XPath X(F) found so far with respect to conditions 1)-4): among the rules that satisfy condition 1), it has the highest precision, the smallest distance and the highest support. If the precision of X(F) is 1, it is stored in best_XPath; otherwise it is stored in max_prec_XPath.

Two points deserve attention when going from the candidate collection of one round to that of the next. First, a variable stores the feature sets whose precision is less than 1 (lines 9-10), because these may still produce a better XPath extraction rule in later iterations. Only feature sets with precision less than 1 are added to it: once the precision of X(F) has reached its maximum value 1, adding further features to F can neither reduce the distance of X nor raise its support. Second, lines 13 to 19 show that as soon as a best_XPath with precision 1 has been found, the stored feature sets whose distance and support are worse than those of best_XPath can be pruned, which quickly narrows the search. Finally, the candidate feature sets of the next round are obtained by combining the stored feature sets (line 20).
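The iteration described above can be sketched in Python as an Apriori-like loop. This is a simplified reading, not the patent's algorithm listing: each node on the marked pages is modeled as a set of features plus a marked flag, a feature set matches a node when it is a subset of the node's features, support is supplied by a caller-provided function, and pruning removes stored feature sets that are lexicographically worse than best_XPath on (distance, support). All names and the toy data are assumptions.

```python
from itertools import combinations

def evaluate(F, nodes):
    """Precision and recall of feature set F over (features, is_marked) nodes."""
    hits = [marked for feats, marked in nodes if F <= feats]
    correct = sum(hits)
    total_marked = sum(marked for _, marked in nodes)
    precision = correct / len(hits) if hits else 0.0
    recall = correct / total_marked if total_marked else 0.0
    return precision, recall

def distance(F):
    return max(d for _, d, _ in F)

def learn_rule(S, nodes, support_of):
    best = None      # best (key, F) with precision 1, i.e. best_XPath
    max_prec = None  # best (key, F) with precision < 1, i.e. max_prec_XPath
    candidates = {frozenset([f]) for f in S}
    k = 1
    while candidates:
        retained = []  # feature sets with precision < 1, kept for combination
        for F in candidates:
            precision, recall = evaluate(F, nodes)
            if recall < 1.0:
                continue  # condition 1) fails
            entry = ((precision, -distance(F), support_of(F)), F)
            if precision == 1.0:
                if best is None or entry[0] > best[0]:
                    best = entry
            else:
                retained.append(F)
                if max_prec is None or entry[0] > max_prec[0]:
                    max_prec = entry
        if best is not None:
            # Prune stored sets that cannot beat best on distance and support.
            _, neg_dist, sup = best[0]
            retained = [F for F in retained
                        if (-distance(F), support_of(F)) >= (neg_dist, sup)]
        k += 1
        candidates = {a | b for a, b in combinations(retained, 2) if len(a | b) == k}
    chosen = best or max_prec
    return chosen[1] if chosen else None

# Toy data (hypothetical): three nodes on a marked page, one of them marked.
f_tag, f_left, f_td = ("TAG", 0, "pcadata"), ("Left-text", 0, "Price:"), ("TAG", 1, "td")
nodes = [
    (frozenset({f_tag, f_left, f_td}), True),   # the marked price node
    (frozenset({f_tag, f_td}), False),          # another text node
    (frozenset({f_left}), False),               # another node after "Price:"
]
rule = learn_rule([f_tag, f_left, f_td], nodes, support_of=lambda F: 1.0)
print(sorted(rule))
```

On this toy data no single feature reaches precision 1, so the second round combines features and selects the pair with distance 0.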

Following the algorithm above, the final XPath extraction rule for the marked node in the example of Fig. 2 and Fig. 3 is derived as follows:

First, the initial candidate feature set S of the marked node is obtained (S as given above).

When the first round of learning starts, the candidate collection consists of the single-feature sets built from S. For every feature set F in it, the recall of the corresponding XPath expression X(F) is 1. Each F whose precision is less than 1 is stored for the next round, where it may be combined into a better XPath. After all feature sets in the candidate collection have been tested (lines 7-12), a best_XPath with precision 1 is obtained. The stored feature sets are then pruned: all those whose distance and support are worse than those of best_XPath are deleted, because the precision of best_XPath is already 1 and the deleted feature sets cannot be combined into an extraction rule better than best_XPath. The remaining stored feature sets are then combined pairwise to form the candidate collection of the second round.

When the second round of learning starts, all feature sets in the second-round candidate collection are tested. In this example the collection contains only one element, whose features are merged into the new XPath rule "//pcadata[preceding::text()[normalize-space(.)!=''][position()=1][.='Price:']]". This rule states that the node to extract is a text node and that the first non-empty text preceding it has the value "Price:". The precision of this XPath rule is also 1. According to conditions 3) and 4), the XPath rule with the smallest distance and the highest support is selected (line 12), and the best_XPath expression finally obtained is

"//pcadata[preceding::text()[normalize-space(.)!=''][position()=1][.='Price:']]".

At this point the stored collection is empty and the next candidate collection is also empty, so the learning process ends.

4. Experimental results

In this section, we evaluate by experiment the precision, recall and F1 value of the XPath extraction rules generated by the algorithm of the present invention.

We collected web pages from 30 real websites as experimental data. These websites cover different domains, such as product, business, financial and entertainment sites. The pages of each website are generated by scripts and share the same page template.

For each website, we selected 10 pages belonging to the same template as sample pages. From these, 3 pages were chosen for the user to mark the data items to be extracted, and the remaining 7 sample pages were used to measure the support of the generated XPath extraction rules. We then generated XPath extraction rules for all marked data items.

In the present invention, we use the metrics of the information retrieval field to measure the effectiveness of extraction rules: the generated XPath extraction rules are applied to a large number of test pages generated from the same template, and their extraction precision and recall on the test pages are computed. In addition, we evaluate F1-Value, the combined measure of precision and recall.

We generated extraction rules for the data items in all marked pages. Table 1 lists sample XPath rules generated for data items on typical pages of some of the websites. As the table shows, on ordinary pages no more than 3 features are needed to locate the marked node quickly, and these features lie very close to the marked node. This agrees with how web pages are designed: page authors usually add distinctive features (structural, attribute, content or even visual features) near important data items to draw the user's attention.

We also compared the XPath extraction rules learned by the present invention with absolute XPath extraction rules and with the extraction rules obtained by the Vertex method. For each data set, XPath rules were derived with each of the three methods and applied to 30 test pages generated from the same template, and the precision, recall and their combined measure (F1_Value) were computed for each. The average precision, recall and F1_Value of the extraction rules obtained by the three methods are shown in Fig. 6. Absolute XPath extraction rules lack generalization ability and fail when the page structure varies, so their precision and recall are lower. Compared with the Vertex method, our method takes into account both the selection of features and the logical combination of features, and therefore achieves higher precision and recall. The experimental results show that, when extracting data items from similar pages, the XPath extraction rules learned by the method of the present invention reach a precision of 97.5% and a recall close to 100%.
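The F1 value combines precision and recall as their harmonic mean. The short sketch below uses the standard F1 formula; the input 0.975 is the reported precision, while the recall of 1.0 is an approximation of "close to 100%" chosen for illustration, so the result is not a figure reported in the experiments.

```python
def f1_value(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reported precision 97.5%; recall approximated as 1.0 for illustration.
print(round(f1_value(0.975, 1.0), 4))  # 0.9873
```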

Claims

1. A webpage data extraction method based on small-sample semi-supervised learning, comprising the following steps:

2. The webpage data extraction method based on small-sample semi-supervised learning according to claim 1, characterized in that: in said step (2), the initial candidate features comprise three types: structural features, attribute features and content features; each initial candidate feature is described by a triple (feature type t, distance d, feature value v).

3. The webpage data extraction method based on small-sample semi-supervised learning according to claim 2, characterized in that: the structural features comprise the structural features of all intermediate nodes from the marked node to the root node of the DOM tree; such a structural feature is described as (TAG, d, nv), where TAG indicates that it is a DOM intermediate node element, d is the distance on the DOM tree between this intermediate node and the marked node, and nv is the element name of this intermediate node;

4. The webpage data extraction method based on small-sample semi-supervised learning according to claim 1, characterized in that: in said step (3), the candidate feature set of the marked node is processed iteratively: the candidate feature sets are repeatedly tested on the sample webpages and merged, until a generalized extraction rule that correctly extracts the marked data item is found.

5. The webpage data extraction method based on small-sample semi-supervised learning according to claim 4, characterized in that said step (3) comprises:

Let X(F) denote the extraction rule derived from candidate feature set F, and define dist(X(F)) as the distance of extraction rule X(F), whose value is the maximum distance over all features in candidate feature set F:

$$\mathrm{dist}(X(F)) = \max_{f \in F} d_f$$

In the formula, $f$ denotes a feature in candidate feature set F and $d_f$ its distance from the marked node;

Support is defined as follows: for the unmarked pages among the sample pages (that is, the pages on which no data items have been marked), the support sup(X(F)) of extraction rule X(F) is the proportion of unmarked pages from which X(F) extracts correctly:

$$\mathrm{sup}(X(F)) = \frac{\text{number of unmarked sample pages on which } X(F) \text{ extracts correctly}}{\text{number of unmarked sample pages}}$$

Meanwhile, the recall and precision with which the generated rule extracts a data item are defined: let A be the set of marked nodes of a data item on the marked pages, B the set of marked nodes that extraction rule X(F) correctly extracts on the marked pages, and C the set of all nodes that extraction rule X(F) actually extracts on the marked pages; the precision and recall of extraction rule X(F) are defined as follows:

$$\mathrm{precision}(X(F)) = \frac{|B|}{|C|}, \qquad \mathrm{recall}(X(F)) = \frac{|B|}{|A|}$$

6. The webpage data extraction method based on small-sample semi-supervised learning according to claim 4, characterized in that said step (3) comprises: in the iterative process, an extraction rule derivation method with incremental stepwise refinement is used, in which the initial features are repeatedly tested and merged to narrow the range within which the extraction rule locates the marked node on the DOM tree; if an initial or merged feature cannot locate the marked node precisely, further features are added step by step to narrow the query range, and through these feature-merging and incremental learning steps a feature set that uniquely locates the marked node is gradually derived.

7. The webpage data extraction method based on small-sample semi-supervised learning according to claim 1, characterized in that: said first algorithm is an Apriori-like algorithm.