CN103514292A - Webpage data extraction method based on semi-supervised learning of small sample - Google Patents

Info

Publication number
CN103514292A
Authority
CN
China
Prior art keywords
node
mark
extraction rule
webpage
data item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310465730.4A
Other languages
Chinese (zh)
Inventor
黄宜华
罗雷
施生生
袁春风
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201310465730.4A
Publication of CN103514292A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/16 Automatic learning of transformation rules, e.g. from examples
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a webpage data extraction method based on small-sample semi-supervised learning. The method comprises the following steps: a group of sample webpages is selected from similar webpages generated from the same webpage template, and the user manually selects and labels the data items to be extracted, which are called labeled data items; the nodes on the DOM trees that correspond to the labeled data items are called labeled nodes; an initial candidate feature set is constructed from the different features of the labeled data items on the corresponding DOM trees; a semi-supervised learning method determines the minimal set of relevant features within the initial candidate feature set and derives a generalized extraction rule for the data item; the same rule derivation is carried out for each data item to be extracted from a webpage, yielding the extraction rules for the data items on that webpage; the extraction rules are applied to the similar webpages, and a batch of data is extracted. With this method, extraction rules for webpage data can be generated and webpage data can be extracted automatically.

Description

A webpage data extraction method based on small-sample semi-supervised learning
Technical Field
The present invention relates to a data extraction method, and in particular to a webpage data extraction method based on small-sample semi-supervised learning.
Background
The Internet has become the world's largest source of data, and many Web-based applications need to obtain data directly from it. Valuable data obtained from different Web data sources can be integrated to provide value-added applications and services such as Web public opinion analysis, price comparison systems, and vertical search.
To obtain the data an application system is interested in from the Web quickly, conveniently, and accurately, effective Web information extraction techniques and tools are needed. Web information extraction (Web Information Extraction) is the process of extracting the data of interest to a user or application from structured or semi-structured webpages and importing it in structured form into an application system for further analysis and processing.
Over the past ten years, Web information extraction has been an active research topic and has been studied extensively. One important problem in Web information extraction is the generation of extraction rules for Web page data, and many different rule generation techniques have been proposed.
Writing rules by hand places a heavy burden on the user and is inefficient. To make rule generation more efficient and to reduce the burden of hand-writing extraction rules, most researchers have focused on automatic rule generation, for example in DEPTA, ViDE, MDR, STALKER, and DEByE.
However, fully automatic methods are usually applicable only to webpages that contain a group of regular, repeated data records. They are hard to apply to the many other types of webpages, especially pages containing irregular data records. As a result, fully automatic methods often fall short in extraction precision and cannot meet the needs of practical applications that require precise Web information extraction. Some studies call extraction from pages containing repeated data records "record-level (Record-Level) data extraction", and call extraction in which each page contains only one non-repeated data record, so that the extraction rule has to be derived from the data records on several pages, "page-level (Page-Level) data extraction".
For page-level data extraction, the goal is a good overall balance between rule generation efficiency, applicability, and extraction precision: rule generation should stay efficient while the method remains widely applicable and precise, and in particular it should handle rule generation for pages with irregular data records efficiently. An effective compromise is semi-automatic rule generation based on human-computer interaction and labeling. User interaction and labeling here resembles the text selection a user performs in Office Word before a copy or delete operation: the user simply selects and labels the data items of interest on the webpage. This semi-automatic technique obtains more accurate location and extraction information for a data item and lets the user specify exactly which data items on a page are to be extracted. It can therefore significantly improve extraction precision, and it avoids the secondary filtering that the user would otherwise have to carry out after a fully automatic method has extracted all data records and data items. At the same time, the interaction and labeling information allows data item extraction rules to be derived and generated quickly and automatically with a semi-supervised learning method, which avoids the heavy burden of writing rules by hand.
In real Web applications, pages containing data records are usually generated dynamically from one and the same template page, and the data on these pages usually comes from a structured underlying database. The generated Web pages therefore share a very similar template. Web pages generated from a template usually have similar DOM tree structures, so XPath expressions over the DOM tree can be used to locate and extract data of the same kind. However, similar pages from the same template usually differ in small ways, and a simple XPath extraction rule usually lacks the ability to generalize, so it may fail when the structure varies. It is therefore desirable to derive extraction rules that generalize and remain stable, so that they can cope with structural changes across similar pages.
However, how to automatically learn and derive stable extraction rules from similar webpages is a technical problem that has not yet been fully solved. Myllymaki and Jackson were the first to construct stable extraction rules by hand and the first to formally define the stability of an extraction rule. N. Niesh used a probabilistic tree edit model to describe how a Web page changes over time and incorporated this probability model into the derivation of extraction rules. Although this method helps to select a stable extraction rule from a set of candidate rules, the probability model needs a large amount of historical page information, its processing efficiency is low, and it is not generally applicable.
Therefore, how to quickly learn and derive stable extraction rules from a small sample of labeled pages remains a difficult open problem.
Summary of the Invention
Object of the invention: in view of the problems and shortcomings of the prior art described above, the object of the present invention is to provide a webpage data extraction method based on small-sample semi-supervised learning that can generate webpage data extraction rules and automatically extract webpage data.
Technical solution: to achieve the above object, the technical solution adopted by the present invention is a webpage data extraction method based on small-sample semi-supervised learning, comprising the following steps:
(1) for similar webpages generated from the same webpage template, a group of sample webpages is chosen; on one to three of the sample webpages, the user manually selects and labels the same data item to be extracted, and this data item is called the labeled data item;
(2) the node on the DOM tree that corresponds to the labeled data item is called the labeled node, and a set of initial candidate features for this labeled node is constructed from the different features of the labeled data item on the corresponding DOM tree;
(3) based on the sample webpages, a semi-supervised learning method is applied, using a first algorithm to determine the minimal set of relevant features within the initial candidate feature set and to derive an extraction rule for this data item that is able to generalize; an extraction rule that is able to generalize is one that still extracts the data item stably and correctly when the data item undergoes some structural changes across different webpages;
(4) the rule derivation of step (3) is carried out for each data item to be extracted from a webpage, yielding a group of extraction rules for the group of data items on that webpage;
(5) the derived group of extraction rules is applied to a batch of similar webpages whose data is to be extracted, and a batch of data items is finally extracted.
Further, in step (2), the initial candidate features are of three types: structural features, attribute features, and content features; each initial candidate feature is described by a triple (feature type t, distance d, feature value v).
Further, the structural features are the structural features of all intermediate nodes on the path from the labeled node to the root of the DOM tree. A structural feature is described as (TAG, d, nv), where TAG indicates a DOM intermediate element node, d is the distance on the DOM tree from this intermediate node to the labeled node, and nv is the element name of this intermediate node;
The attribute features are the id and class attribute features of all intermediate nodes on the path from the labeled node to the root of the DOM tree. An attribute feature is described as (attribute type name, d, av), where av is the value of the attribute. The attribute type names cover the basic id and class attributes of the intermediate node itself, the id and class attributes of its preceding sibling nodes, and the id and class attributes of its following sibling nodes;
The content features are the first non-empty text nodes that appear before and after the labeled node and that help to identify and locate it. Such a non-empty text is defined as an anchor text, and a content feature is described as (text feature name, d, cv), where cv is the value of the anchor text. The text feature names are left anchor text and right anchor text;
The same data item is allowed to have different feature values under the same feature on two similar pages, in which case the different values are combined with an OR operation; "the same feature" means that the feature type and the distance are the same.
Further, in step (3), the candidate feature set of the labeled node is processed iteratively: candidate features are repeatedly tested and merged on the sample webpages until a generalized extraction rule that correctly extracts the labeled data item is found.
Further, step (3) comprises:
Let X denote the extraction rule derived from the candidate feature set F. Define $\mathrm{dist}(X)$ as the distance of rule X; its value is the maximum distance over all features in F:
$$\mathrm{dist}(X) = \max_{f \in F} f.d$$
where $f$ denotes a feature in the candidate feature set F and $f.d$ is its distance component.
Support is defined as follows: for the unlabeled pages among the sample pages, the support $\mathrm{sup}(X)$ of rule X is the proportion of unlabeled pages that X extracts correctly:
$$\mathrm{sup}(X) = \frac{\text{number of unlabeled pages that } X \text{ extracts correctly}}{\text{total number of unlabeled pages}}$$
At the same time, the recall and precision with which a generated rule extracts a data item are defined. Let A denote the set of labeled nodes of a data item on the labeled pages, B the set of labeled nodes that rule X extracts correctly on the labeled pages, and C the set of all nodes that X actually extracts on the labeled pages. The precision $\mathrm{precision}(X)$ and recall $\mathrm{recall}(X)$ of rule X are defined as follows:
$$\mathrm{precision}(X) = \frac{|B|}{|C|}, \qquad \mathrm{recall}(X) = \frac{|B|}{|A|}$$
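The four measures defined here can be illustrated with a short Python sketch. It assumes that features are (type, distance, value) tuples and that DOM nodes are identified by hashable ids; the function names are hypothetical and do not come from the patent.

```python
def rule_distance(features):
    """Distance of a rule: the largest distance among its features."""
    return max(distance for _type, distance, _value in features)


def support(correct_on_page):
    """Share of unlabeled sample pages on which the rule extracted correctly.

    correct_on_page holds one boolean per unlabeled page.
    """
    if not correct_on_page:
        return 0.0  # assumption: with no unlabeled pages there is no support
    return sum(correct_on_page) / len(correct_on_page)


def precision_recall(labeled, extracted):
    """labeled is set A, extracted is set C, and B is their intersection."""
    correct = labeled & extracted
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(labeled) if labeled else 0.0
    return precision, recall


features = [("TAG", 0, "pcadata"), ("TAG", 1, "b")]
print(rule_distance(features))                                    # 1
print(support([True, True, False, True]))                         # 0.75
print(precision_recall({"n1", "n2"}, {"n1", "n2", "n7", "n9"}))   # (0.5, 1.0)
```

In the last call the rule returns both labeled nodes (recall 1) but also two unwanted nodes, so its precision is only 0.5.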
During rule derivation, the final extraction rule satisfies the following four conditions:
1) the recall is 1: the rule returns all labeled nodes on the labeled pages;
2) among all rules satisfying condition 1), the rules with the highest precision in extracting the labeled nodes are selected;
3) among all rules satisfying conditions 1) and 2), the rules with the smallest distance are selected;
4) among all rules satisfying the three conditions above, the rule with the greatest support is selected as the final extraction rule.
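The four conditions amount to a filter followed by a lexicographic ordering. The sketch below shows one way to express that in Python; the `RuleStats` record and the candidate values are made up for illustration, and the measures are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleStats:
    xpath: str
    recall: float
    precision: float
    distance: int
    support: float


def select_rule(candidates):
    """Apply conditions 1) to 4) in order and return the winning rule."""
    # Condition 1: only rules that return every labeled node.
    complete = [r for r in candidates if r.recall == 1.0]
    if not complete:
        return None
    # Conditions 2-4: highest precision, then smallest distance,
    # then highest support, compared lexicographically.
    return min(complete, key=lambda r: (-r.precision, r.distance, -r.support))


candidates = [
    RuleStats("//b/node()", 1.0, 0.5, 1, 1.0),
    RuleStats("//*[@class='priceLarge']/node()", 1.0, 1.0, 1, 0.8),
    RuleStats("//*[@id='i3']/*/node()", 1.0, 1.0, 2, 1.0),
    RuleStats("//*[@id='i9']/node()", 0.5, 1.0, 1, 1.0),
]
best = select_rule(candidates)
print(best.xpath)
```

Here the class-based rule is chosen even though the id-based rule has higher support, because distance is compared before support.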
Further, step (3) comprises: in the iterative process, a progressive, step-by-step refinement method is used to derive the rule, repeatedly testing and merging initial features to narrow the range within which the rule locates the labeled node on the DOM tree. If an initial or merged feature cannot locate the labeled node exactly, further features are added step by step to narrow the range; through this feature merging and incremental learning, a feature set that uniquely locates the labeled node is derived step by step.
Further, the first algorithm is an Apriori-like algorithm.
Beneficial effects: the present invention can be combined with many existing Web information extraction systems that are based on structured extraction rules, generating data extraction rules for those systems automatically. Because the invention generates rules and extracts data semi-automatically from simple interaction and labeling, no rules need to be written by hand, which greatly increases the degree of automation of Web information extraction and lightens the user's manual workload.
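The refinement idea can be sketched as a level-wise search over feature sets. This is a simplified illustration, not the patent's exact algorithm: it assumes each feature has already been evaluated to the set of node ids it matches on every labeled page, treats merging as set intersection, and enumerates combinations exhaustively instead of using Apriori-style pruning. All names and node ids are hypothetical.

```python
from itertools import combinations


def derive_rule(feature_matches, labeled):
    """Find the smallest feature set that pins down the labeled node.

    feature_matches maps a feature triple to a list holding one set of
    matched node ids per labeled page; labeled holds the labeled node id
    of each page.
    """
    pages = range(len(labeled))
    # Keep only features that match the labeled node on every page (recall 1).
    usable = [f for f, per_page in feature_matches.items()
              if all(labeled[p] in per_page[p] for p in pages)]

    def located(rule, p):
        # Merging features narrows the range: intersect their match sets.
        return set.intersection(*(feature_matches[f][p] for f in rule))

    for size in range(1, len(usable) + 1):
        exact = [rule for rule in combinations(usable, size)
                 if all(located(rule, p) == {labeled[p]} for p in pages)]
        if exact:
            # Among equally small rules, prefer the smallest rule distance.
            return min(exact, key=lambda rule: max(f[1] for f in rule))
    return None


matches = {
    ("TAG", 1, "b"): [{"n1", "n5"}, {"m1", "m6"}],
    ("Left-text", 0, "Price:"): [{"n1", "n2"}, {"m1", "m2"}],
    ("ID", 2, "i3"): [{"n1"}, set()],  # no match on the second page
}
rule = derive_rule(matches, ["n1", "m1"])
print(rule)
```

Neither the tag feature nor the anchor-text feature locates the labeled node on its own, but their combination does on both pages; the id feature is discarded because it misses the labeled node on the second page.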
Brief Description of the Drawings
Fig. 1 is a flow chart of extraction rule derivation and webpage data extraction according to the present invention;
Fig. 2 is an example of a sample page and a labeled data item;
Fig. 3 is an example of the DOM tree structure of a labeled data item and its labeled node;
Fig. 4 shows DOM tree fragments of two detail pages on an example website;
Fig. 5 shows DOM tree fragments of three detail pages on an example movie website;
Fig. 6 shows the average precision, recall, and F1 value of the extraction rules obtained by three methods.
Detailed Description
The present invention is further illustrated below with reference to the drawings and specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope; after reading this disclosure, modifications of the invention in various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.
Fig. 1 shows the flow of extraction rule derivation and webpage data extraction. The flow has three stages: the user labeling stage, the XPath extraction rule derivation stage, and the webpage data extraction stage.
On the same website, pages generated from the same template have similar page structures, but factors such as missing or shifted HTML nodes may introduce small structural differences between them. To derive an extraction rule that generalizes well and copes with similar pages containing small structural differences, a small number of sample pages must be provided on which the user labels the data items to be extracted. Our prototype system provides a user interface that lets the user select and label the data items of interest directly on the page, so that only very simple interaction and user intervention are needed to label a data item. Compared with the volume of Web data to be extracted, the cost and burden of this simple labeling are very low.
Labeling a data item on the page makes it possible to identify the absolute XPath of the labeled data item quickly. Because page structure varies, however, an absolute XPath usually does not apply to all similar pages. Based on the labeled sample pages, the various features related to the data item therefore need to be considered, and a relative XPath extraction rule with stronger generalization ability is constructed through semi-supervised learning.
XPath extraction rule derivation based on semi-supervised learning has two main steps. First, the initial feature set of the labeled data item is obtained. Then an Apriori-like algorithm combines these features, refining step by step until the optimal XPath extraction rule is found. The resulting rule has strong generalization ability and can extract the data item from similar pages generated by the same template.
Once an XPath extraction rule with generalization ability has been generated for each data item, the rule can be applied to new pages to extract the target data item quickly.
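As a rough illustration of the extraction stage, the sketch below applies one relative rule to a batch of similar pages with Python's standard `xml.etree.ElementTree`. This module supports only a subset of XPath and needs well-formed markup, so the rule and the two page fragments are simplified stand-ins invented for the example; real HTML would need an HTML parser with full XPath 1.0 support.

```python
import xml.etree.ElementTree as ET

# Simplified relative rule: any element whose class is "priceLarge".
RULE = ".//*[@class='priceLarge']"


def extract(pages, rule):
    """Apply one extraction rule to every page and collect the matched text."""
    results = []
    for page in pages:
        root = ET.fromstring(page)
        results.append([element.text for element in root.findall(rule)])
    return results


pages = [
    "<table><tr><td id='i2'><span id='i3'>"
    "<b class='priceLarge'>$15.99</b></span></td></tr></table>",
    # Same template, but the span wrapper is missing on this page.
    "<table><tr><td id='i2'><b class='priceLarge'>$7.50</b></td></tr></table>",
]
print(extract(pages, RULE))
```

An absolute path through `span` would fail on the second fragment, while the relative rule still finds the price on both.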
To carry out the XPath rule derivation above, we first describe the features used by the rule derivation algorithm and their corresponding XPath query expressions. A feature is a DOM node, or an attribute of one, that is related to the labeled node on the DOM tree and is used to learn the extraction rule for the labeled node. We mainly use three types of features: structural features, attribute features, and content features.
Each feature is represented as a triple (Type, Distance, Value), where Type is the feature type, Distance is the distance on the DOM tree between the feature node and the labeled node, and Value is the value of the feature. Each feature corresponds to an XPath expression, which can be used to find the range of nodes that contains the labeled node.
Fig. 2 is a page fragment from a commercial website. The price "$15.99" is one of the target data items to be extracted, and the extraction rule for this price data item has to be learned from its structural characteristics on the DOM tree. Fig. 3 shows the DOM tree structure and features corresponding to the labeled price data item; the labeled node is the text node pcadata ($15.99).
For each intermediate node on the path from the labeled node to the root of the DOM tree, we add a structural feature ("TAG", d, t) of the labeled node to the initial candidate feature set. The string constant TAG indicates that the feature is a tag node, t is the node name (table, tr, td, span, b, pcadata, and so on in the example above), and d is the distance from this intermediate node to the labeled node. For the DOM tree structure of the labeled node in Fig. 3, for example, ("TAG", 0, pcadata), ("TAG", 1, b), ("TAG", 2, span), and so on up to the root node are all added to the initial candidate feature set.
Each structural feature corresponds to an XPath query expression that narrows the query range for the labeled node. In general, the XPath for the feature ("TAG", d, t) is "//t/*/.../*/node()", meaning that the nodes at depth d below a t node may include the labeled node. For the labeled node in the example of Figs. 2 and 3, the XPath expressions for the three structural features above are "//pcadata", "//b/node()", and "//span/*/node()".
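The mapping from a structural feature to its XPath can be written down directly. The sketch below is one possible encoding: the `Feature` tuple and the function name are hypothetical, and the distance-0 case follows the "//pcadata" example in the text.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    type: str      # e.g. "TAG"
    distance: int  # tree levels between the feature node and the labeled node
    value: str     # tag name for a structural feature


def structural_xpath(feature):
    """Build the XPath expression for a ("TAG", d, t) feature."""
    if feature.type != "TAG":
        raise ValueError("expected a TAG feature")
    if feature.distance == 0:
        # The labeled node itself carries the tag name.
        return f"//{feature.value}"
    # d - 1 wildcard steps, then node() for the candidate labeled node.
    steps = ["*"] * (feature.distance - 1) + ["node()"]
    return f"//{feature.value}/" + "/".join(steps)


path = [Feature("TAG", 0, "pcadata"), Feature("TAG", 1, "b"), Feature("TAG", 2, "span")]
for feature in path:
    print(structural_xpath(feature))
# //pcadata
# //b/node()
# //span/*/node()
```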
DOM nodes on Web pages usually carry many attributes, and using these attributes sensibly helps to locate the target node quickly and accurately. In our observation, the id and class attributes of DOM nodes are usually the two most important attributes. On most Web pages, the id and class attributes of DOM nodes quickly narrow the query range on the DOM tree and so locate the target node accurately.
1) Basic id and class attributes
For each intermediate node on the path from the labeled node to the root of the DOM tree, if the node has an id attribute, the id attribute feature ("ID", d, idVal) is added to the initial feature set. The string constant ID indicates an id attribute feature, idVal is its value, and d is the distance from the node carrying the attribute to the labeled node. Similarly, the class attribute feature ("CLASS", d, classVal) is added to the initial feature set. For the DOM tree structure of the labeled item in Fig. 3, for example, the attribute features related to the labeled node are ("CLASS", 1, "priceLarge"), ("ID", 2, "i3"), ("ID", 3, "i2"), ("ID", 4, "actualPriceRow"), and so on, and these features are added to the initial candidate feature set.
Each basic attribute feature also corresponds to an XPath query expression. In general, the XPath for the attribute feature ("ID", d, idVal) is "//*[@id='idVal']/*/.../*/node()", meaning that if a node's id attribute value is idVal, the nodes at depth d below it may include the labeled node. For the labeled node in the example of Figs. 2 and 3, the XPath expressions for the attribute features within distance 3 of the labeled node are "//*[@class='priceLarge']/node()", "//*[@id='i3']/*/node()", and "//*[@id='i2']/*/*/node()".
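A small Python sketch of this mapping follows. The function name is hypothetical, and the sketch assumes that attribute values contain no single quotes and that the attribute sits on an ancestor of the labeled node (distance of at least 1).

```python
def attribute_xpath(feature_type, distance, value):
    """Build the XPath expression for an ("ID" | "CLASS", d, value) feature."""
    attribute = {"ID": "id", "CLASS": "class"}[feature_type]
    if distance < 1:
        raise ValueError("the attribute is assumed to sit on an ancestor node")
    # d - 1 wildcard steps, then node() for the candidate labeled node.
    steps = ["*"] * (distance - 1) + ["node()"]
    return f"//*[@{attribute}='{value}']/" + "/".join(steps)


for feature in [("CLASS", 1, "priceLarge"), ("ID", 2, "i3"), ("ID", 3, "i2")]:
    print(attribute_xpath(*feature))
# //*[@class='priceLarge']/node()
# //*[@id='i3']/*/node()
# //*[@id='i2']/*/*/node()
```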
2) Preceding and following sibling id and class attributes
When an intermediate node has no sufficiently distinctive attribute to help locate the labeled node, the attributes of its preceding or following sibling nodes can sometimes help. We therefore also add the attribute features of the preceding and following siblings of each intermediate node to the initial feature set.
For each intermediate node on the path from the labeled node to the root of the DOM tree, if the i-th preceding sibling of the intermediate node has an id attribute, the preceding-sibling id attribute feature ("preceding-sibling-id-i", d, psibIDVal) is added to the initial feature set. Its XPath query expression is "//*[preceding-sibling::*[position()=i][@id='psibIDVal']]/*/.../*/node()", meaning that if the i-th preceding sibling of an intermediate node has the id attribute value 'psibIDVal', the descendants at depth d below that intermediate node may include the labeled node. If the preceding sibling has a class attribute, the corresponding feature ("preceding-sibling-class-i", d, psibClassVal) is added as well. The id and class attribute features of the following siblings of an intermediate node are added in the same way, with feature types "following-sibling-id-i" and "following-sibling-class-i".
In Fig. 3, the preceding- and following-sibling attribute features of the intermediate nodes on the path from the labeled node to the root are ("following-sibling-id-1", 2, "i4"), ("preceding-sibling-id-1", 3, "i1"), ("preceding-sibling-class-1", 3, "c1"), and so on. For ("following-sibling-id-1", 2, "i4"), the XPath expression is "//*[following-sibling::*[position()=1][@id='i4']]/*/node()": if the first following sibling of an intermediate node has the id attribute value "i4", its grandchild nodes may include the labeled target node.
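The sibling features can be turned into XPath in the same way. The sketch below parses the feature type name into axis, attribute, and sibling position; the function name is hypothetical, and values are assumed to contain no single quotes.

```python
def sibling_attribute_xpath(feature_type, distance, value):
    """Build the XPath for e.g. ("following-sibling-id-1", 2, "i4")."""
    # "following-sibling-id-1" -> axis "following-sibling", attribute "id", index "1"
    axis, attribute, index = feature_type.rsplit("-", 2)
    if axis not in ("preceding-sibling", "following-sibling"):
        raise ValueError(f"unknown axis in feature type: {feature_type}")
    if attribute not in ("id", "class") or distance < 1:
        raise ValueError("expected an id or class feature on an ancestor node")
    predicate = f"{axis}::*[position()={int(index)}][@{attribute}='{value}']"
    steps = ["*"] * (distance - 1) + ["node()"]
    return f"//*[{predicate}]/" + "/".join(steps)


print(sibling_attribute_xpath("following-sibling-id-1", 2, "i4"))
# //*[following-sibling::*[position()=1][@id='i4']]/*/node()
print(sibling_attribute_xpath("preceding-sibling-class-1", 3, "c1"))
# //*[preceding-sibling::*[position()=1][@class='c1']]/*/*/node()
```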
Further, distinctive structural and attribute features may be lacking while certain content has a significant locating effect. The present invention therefore uses distinctive text that appears before or after a data item as a content feature.
Data on a Web page is displayed for users to browse and read. To help users understand what a data item means, many data items are preceded or followed by a descriptive text that explains them. We define the first non-empty text that appears before or after the labeled node as an anchor text. For the labeled price "$15.99" in Figs. 2 and 3, the first preceding non-empty text node is "Price:", and the corresponding content feature ("Left-text", 0, "Price:") is added to the initial feature set as a content feature of the labeled node; the string constant "Left-text" indicates that this anchor text is the first one to the left of the labeled item. The XPath query expression for this content feature is "//node()[preceding::text()[normalize-space(.)!=''][position()=1][.='Price:']]": for any node, if its first preceding non-empty text is "Price:", the node may be the target node to be extracted. In Fig. 3, this XPath expression narrows the query range for the target node to four nodes: "pcadata ($15.99)", "b", "span (id=i3)", and "td (id=i2)". Similarly, the following content feature ("Right-text", 0, #text) is added to the initial feature set; its XPath is "//node()[following::text()[normalize-space(.)!=''][position()=1][.=#text]]".
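The anchor-text expressions for the left and right side differ only in the XPath axis. A minimal sketch, with a hypothetical function name and the assumption that the anchor text contains no single quotes:

```python
AXES = {"Left-text": "preceding", "Right-text": "following"}


def anchor_text_xpath(feature_type, text):
    """Build the XPath for a ("Left-text" | "Right-text", 0, text) feature."""
    axis = AXES[feature_type]
    # Nearest non-empty text node on the axis must equal the anchor text.
    return (f"//node()[{axis}::text()[normalize-space(.)!='']"
            f"[position()=1][.='{text}']]")


print(anchor_text_xpath("Left-text", "Price:"))
# //node()[preceding::text()[normalize-space(.)!=''][position()=1][.='Price:']]
```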
To give extraction rules stronger generalization ability, we also consider an OR operation on features. When a data item is labeled on several sample pages, the same data item may have different values under the same feature (same feature type and distance) on two similar pages, and the OR operation covers this multi-valued case. Fig. 4, for example, shows DOM tree fragments of two similar pages on an example website, where the labeled node is the number of users on the page. In the left figure the following content feature of the labeled node is ("Right-text", 0, "users"), and in the right figure it is ("Right-text", 0, "user"). The two features have the same type and the same distance to the labeled node, but the first value is "users" (plural) and the second is "user" (singular). With the OR operation they are treated as one feature ("Right-text", 0, "users" or "user"), whose XPath rule is "//node()[following::text()[normalize-space(.)!=''][position()=1][.='users' or .='user']]": for any node, if its first following non-empty text is "users" or "user", the node may be the target node to be extracted.
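The OR operation can be sketched as grouping the features collected on each labeled page by (type, distance) and joining the distinct values into one predicate. The function names below are hypothetical.

```python
from collections import defaultdict


def merge_by_or(features_per_page):
    """Group features by (type, distance) and collect their distinct values."""
    values = defaultdict(list)
    for page_features in features_per_page:
        for feature_type, distance, value in page_features:
            if value not in values[(feature_type, distance)]:
                values[(feature_type, distance)].append(value)
    return {key: tuple(vals) for key, vals in values.items()}


def right_text_xpath(values):
    """XPath for a Right-text feature whose value is an OR of several texts."""
    test = " or ".join(f".='{v}'" for v in values)
    return (f"//node()[following::text()[normalize-space(.)!='']"
            f"[position()=1][{test}]]")


merged = merge_by_or([
    [("Right-text", 0, "users")],  # left page in Fig. 4
    [("Right-text", 0, "user")],   # right page in Fig. 4
])
print(merged)
print(right_text_xpath(merged[("Right-text", 0)]))
```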
Fig. 5 shows DOM tree fragments of three similar pages of a movie website, on which the release year appears immediately after the movie title. When the user labels the movie title on three similar sample pages, the following content features of the labeled nodes on the three pages are ("Right-text", 0, "(1994)"), ("Right-text", 0, "(2009)"), and ("Right-text", 0, "(2010)"). Their text values all differ. If these features were combined by OR into one feature ("Right-text", 0, "(1994)" or "(2009)" or "(2010)"), the corresponding XPath rule could extract only pages carrying one of these three text values. In real extraction many more release years occur, and such widely varying values cannot be enumerated with OR logic. Therefore, when several pages are labeled and the features with the same type and distance have three or more distinct values, we regard these values as unstable and unsuitable as a content feature, and other features must be found to learn the extraction rule. In the worst case, therefore, only 3 labeled pages are needed to learn an XPath extraction rule.
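The stability test reduces to counting distinct values per feature. A minimal sketch, with a hypothetical function name and the threshold of three taken from the paragraph above:

```python
UNSTABLE_FROM = 3  # three or more distinct values count as unstable


def usable_values(values_seen):
    """Return the distinct values to combine by OR, or None if unstable."""
    distinct = tuple(dict.fromkeys(values_seen))  # keeps first-seen order
    if len(distinct) >= UNSTABLE_FROM:
        return None
    return distinct


print(usable_values(["users", "user"]))               # ('users', 'user')
print(usable_values(["(1994)", "(2009)", "(2010)"]))  # None
```

A feature for which `None` is returned would simply be dropped from the candidate set, so that the rule has to be learned from the remaining features.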
We want to derive a stable XPath extraction rule from a small number of sample pages, such that the rule locates the labeled node uniquely and accurately. In the present invention, the extraction rule is obtained from the initial candidate feature set by incremental learning with stepwise refinement. Our goal is to select the XPath extraction rule with the best generalization ability, one that is stable enough to tolerate minor changes in page structure. The stability of an extraction rule could be defined as the probability that it still locates the data item correctly on similar pages in the future; however, computing this probability requires a large amount of historical data and does not take the properties of the extraction rule itself into account.
The present invention uses two criteria to assess the stability of an extraction rule: distance and support. Distance is the maximum distance between the labeled node and the features that make up the extraction rule. For example, the two features ("TAG", 0, pcadata) and ("TAG", 1, b) can be merged to derive the rule "//b/pcadata", and the distance of this rule is defined as the maximum distance in the feature set, namely 1. The distance of an extraction rule reflects how closely the features forming the rule are tied to the labeled node: the larger the distance, the looser the tie. Support is the proportion of the unlabeled sample pages on which the extraction rule extracts correctly.
Let X(F) denote the XPath extraction rule derived by merging the feature set F. We define $dist(X(F))$ as the distance of the extraction rule; its value is the maximum distance over all features in F:

$$dist(X(F)) = \max_{f \in F} d_f$$

where $d_f$ is the distance component of the feature $f$.
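Support is defined as follows. For the unlabeled pages among the sample pages, the support $sup(X(F))$ of the extraction rule X(F) is the proportion of unlabeled pages on which the rule extracts correctly, that is:

$$sup(X(F)) = \frac{|\{p \in U : X(F) \text{ extracts correctly on } p\}|}{|U|}$$

where $U$ is the set of unlabeled sample pages.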
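The two stability measures, distance and support, can be sketched in a few lines. The function names are hypothetical, and how a rule is judged correct on an unlabeled page is left outside the sketch (the caller passes one boolean per page).

```python
def rule_distance(feature_set):
    """Largest distance d among the (type, d, value) features of a rule."""
    return max(distance for _type, distance, _value in feature_set)


def rule_support(correct_on_page):
    """Fraction of unlabeled sample pages on which the rule extracted correctly.

    correct_on_page holds one boolean per unlabeled page.
    """
    return sum(correct_on_page) / len(correct_on_page)


features = [("TAG", 0, "pcadata"), ("TAG", 1, "b")]
print(rule_distance(features))               # distance of the rule //b/pcadata
print(rule_support([True] * 6 + [False]))    # correct on 6 of 7 unlabeled pages
```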
We also define the recall and precision with which a generated rule extracts a data item. Let A be the set of labeled nodes of a data item on the labeled pages, B the set of labeled nodes that the extraction rule X(F) extracts correctly on the labeled pages, and C the set of all nodes that X(F) actually extracts on the labeled pages. The precision and recall of X(F) are defined as:

$$precision(X(F)) = \frac{|B|}{|C|}$$

$$recall(X(F)) = \frac{|B|}{|A|}$$
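The set-based definitions of precision and recall translate directly into code. In this sketch nodes are represented by hypothetical string identifiers, and B is computed as the intersection of A and C.

```python
def precision_recall(labeled, extracted):
    """labeled: set A of labeled nodes; extracted: set C of nodes the rule returned."""
    correct = labeled & extracted  # set B: labeled nodes that were extracted
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(labeled) if labeled else 0.0
    return precision, recall


A = {"page1:price", "page2:price", "page3:price"}
C = A | {"page1:title"}  # the rule also returned one unlabeled node
print(precision_recall(A, C))
```

Here the rule returns every labeled node plus one extra node, so recall is 1 and precision is 3/4.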
During rule derivation, the final XPath extraction rule must satisfy the following four conditions:
1) The recall is 1, i.e. the extraction rule returns all labeled nodes on the labeled pages;
2) Among all extraction rules satisfying condition 1), select the rules with the highest precision over all labeled nodes, i.e. the rule should extract as few unlabeled nodes on the labeled pages as possible;
3) Among all extraction rules satisfying conditions 1) and 2), select the rules with the smallest distance;
4) Among all extraction rules satisfying the three conditions above, select the one with the highest support as the final extraction rule.
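The four conditions amount to a filter on recall followed by a lexicographic ordering. The sketch below illustrates this; the RuleStats record, the rule names and all numeric values are made up for the example and are not taken from the experiments.

```python
from typing import NamedTuple, Optional


class RuleStats(NamedTuple):
    xpath: str
    recall: float
    precision: float
    distance: int
    support: float


def select_rule(candidates: list[RuleStats]) -> Optional[RuleStats]:
    # Condition 1: keep only rules that return every labeled node.
    full_recall = [rule for rule in candidates if rule.recall == 1.0]
    if not full_recall:
        return None
    # Conditions 2-4 as one lexicographic key: highest precision,
    # then smallest distance, then highest support.
    return min(full_recall,
               key=lambda rule: (-rule.precision, rule.distance, -rule.support))


candidates = [
    RuleStats("rule_a", recall=1.0, precision=0.5, distance=0, support=1.0),
    RuleStats("rule_b", recall=1.0, precision=1.0, distance=1, support=1.0),
    RuleStats("rule_c", recall=1.0, precision=1.0, distance=0, support=0.9),
    RuleStats("rule_d", recall=0.5, precision=1.0, distance=0, support=1.0),
]
print(select_rule(candidates).xpath)
```

In this example rule_c is chosen over rule_b even though its support is lower, because distance is compared before support.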
The XPath query expression corresponding to a single feature of the labeled node may also locate unlabeled nodes on the page. If one feature cannot locate the labeled node exactly, we progressively merge in other features to narrow the query scope, and through this progressive learning derive a feature set that locates the labeled node uniquely. This is the idea behind the incremental learning and derivation based on feature merging that we adopt.
For example, in the sample shown in Fig. 2 and Fig. 3, suppose we want to learn the extraction rule for the labeled node "pcadata ($15.99)". If only the structural feature ("TAG", 0, pcadata) of the labeled node is used, the corresponding XPath expression "//pcadata" extracts all text nodes. If the content feature ("Left-text", 0, "Price:") of the labeled node is added, its corresponding XPath expression "//node()[preceding::text()[normalize-space(.)!=''][position()=1][.='Price:']]" locates the labeled node "pcadata ($15.99)" uniquely and accurately. We therefore merge the two features and derive a more precise XPath extraction rule "//pcadata[preceding::text()[normalize-space(.)!=''][position()=1][.='Price:']]". This rule states that the node to extract is a text node whose first preceding non-empty text value is "Price:". In Fig. 2 and Fig. 3, this XPath expression uniquely locates the labeled price data item node.
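Merging a structural feature and a content feature into one XPath can be sketched as below. The function merge_to_xpath is a hypothetical helper that only handles distance-0 features (TAG, Left-text, Right-text) and assumes values contain no single quote; it is not the patent's full rule generator.

```python
def merge_to_xpath(features):
    """Merge distance-0 features of the labeled node into one XPath (sketch)."""
    node_test = "node()"  # used when no structural feature is given
    predicates = []
    for feature_type, distance, value in features:
        if distance != 0:
            raise ValueError("this sketch only handles distance-0 features")
        if feature_type == "TAG":
            node_test = value  # structural feature: node name
        elif feature_type in ("Left-text", "Right-text"):
            axis = "preceding" if feature_type == "Left-text" else "following"
            predicates.append(
                f"[{axis}::text()[normalize-space(.)!='']"
                f"[position()=1][.='{value}']]")
        else:
            raise ValueError(f"unsupported feature type: {feature_type}")
    return "//" + node_test + "".join(predicates)


print(merge_to_xpath([("TAG", 0, "pcadata")]))
print(merge_to_xpath([("TAG", 0, "pcadata"), ("Left-text", 0, "Price:")]))
```

Adding the content feature turns the broad rule "//pcadata" into the narrower rule with the "Price:" predicate, which is the merging step described above.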
To obtain the optimal XPath extraction rule, the user first labels the target data item on the sample pages, and the system then automatically collects the initial features of the labeled target data item. This process is straightforward: we traverse the DOM tree of the labeled page from the labeled node up to the root node and add the features described in Section 4 to the candidate feature set S. The incremental learning method is then used to derive the extraction rule.
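The walk from the labeled node to the root can be sketched with Python's standard xml.etree.ElementTree module. The sample markup is hypothetical, and the sketch collects only tag, id and class features of the nodes on the path; content features and the attributes of neighbouring nodes are omitted.

```python
import xml.etree.ElementTree as ET


def candidate_features(root, labeled):
    """Collect (type, distance, value) triples from the labeled element up to the root."""
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    features, node, distance = [], labeled, 0
    while node is not None:
        features.append(("TAG", distance, node.tag))
        for attribute in ("id", "class"):
            if attribute in node.attrib:
                features.append((attribute, distance, node.attrib[attribute]))
        node = parent.get(node)  # None once the root has been processed
        distance += 1
    return features


page = ET.fromstring(
    '<html><body><div id="product"><span class="label">Price:</span>'
    '<b class="price">$15.99</b></div></body></html>')
labeled = page.find(".//b")
for feature in candidate_features(page, labeled):
    print(feature)
```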
The sample pages provided by the user consist of two parts: one to three labeled sample pages and several unlabeled sample pages. We learn and derive XPath extraction rules from the labeled sample pages, and then use the unlabeled sample pages to select the rule with the highest support. After obtaining the initial feature set of the labeled node, we use a frequent-itemset association rule mining method similar to the Apriori algorithm to mine the minimal set of associated features, which are then merged into an XPath extraction rule.
Let S be the initial candidate feature set described in Section 4, and let the first-round candidates be the feature sets each formed from a single feature in S. To make the algorithm below easier to follow, the initial candidate feature set S and the first-round candidates for the concrete example above are given as follows:

[Figures: initial candidate feature set S and first-round candidates for the example, and the learning algorithm listing]
In the algorithm, lines 5 to 22 form one round of iteration, leading from the candidate feature sets of round k to those of round k+1. In round k, we first check each candidate feature set F of this round: if the recall of its corresponding XPath X(F) is not 1, we delete it from the candidates, so that all remaining feature sets satisfy condition 1). In lines 9 to 12 we keep track of the best XPath X(F) found so far under conditions 1)-4): among those satisfying condition 1), it has the highest precision, the smallest distance and the highest support. If the precision of X(F) is 1, it is saved in best_XPath; otherwise it is saved in max_prec_XPath.

Regarding the generation of the next round's candidates from the current ones, two points should be noted. First, a variable stores the feature sets whose precision is less than 1 (lines 9-10); these retained feature sets may produce better XPath extraction rules in later iterations. Only feature sets with precision below 1 are retained because, once the precision of X(F) has reached its maximum value 1, adding further features to F can neither reduce the distance of X nor raise its support. Second, lines 13 to 19 show that as soon as a best_XPath with precision 1 has been found, the retained feature sets whose distance and support are worse than those of best_XPath can be pruned, which narrows the search scope quickly. Finally, the candidate feature sets of round k+1 are generated by combining the retained feature sets (line 20).
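The round structure described above can be sketched as follows. This is an illustrative reading, not the patent's algorithm listing, and it rests on several assumptions that are labeled in the code: a node is modelled as a frozenset of features and a feature set "matches" a node when it is a subset of it (standing in for XPath evaluation); a rule counts as correct on an unlabeled page when it matches exactly one node; pruning uses distance only, a simplification of the distance-and-support pruning in the text. All names (evaluate, learn, kept) are hypothetical.

```python
from itertools import combinations


def evaluate(fs, labeled_pages, unlabeled_pages):
    """Return (recall, precision, distance, support) of feature set fs."""
    hits = extracted = 0
    for nodes, target in labeled_pages:
        # Assumption: subset test stands in for running the merged XPath.
        matches = [i for i, node in enumerate(nodes) if fs <= node]
        extracted += len(matches)
        hits += target in matches
    recall = hits / len(labeled_pages)
    precision = hits / extracted if extracted else 0.0
    distance = max(d for _type, d, _value in fs)
    # Assumption: on an unlabeled page a rule is "correct" if it matches one node.
    ok = sum(sum(fs <= node for node in nodes) == 1 for nodes in unlabeled_pages)
    return recall, precision, distance, ok / len(unlabeled_pages)


def learn(candidates, labeled_pages, unlabeled_pages):
    level = [frozenset([f]) for f in candidates]    # round 1: single features
    best = fallback = None                          # (key, feature set) pairs
    while level:
        kept = []                                   # precision < 1, may be refined
        for fs in level:
            recall, precision, distance, support = evaluate(
                fs, labeled_pages, unlabeled_pages)
            if recall < 1.0:                        # condition 1
                continue
            key = (-precision, distance, -support)  # conditions 2-4
            if precision == 1.0:
                if best is None or key < best[0]:
                    best = (key, fs)
            else:
                kept.append((fs, distance))
                if fallback is None or key < fallback[0]:
                    fallback = (key, fs)
        if best is not None:
            # Simplified pruning: merging never lowers the distance, so sets
            # already farther away than the best rule cannot beat it.
            kept = [(fs, d) for fs, d in kept if d <= best[0][1]]
        size = len(level[0]) + 1
        level = list({a | b for (a, _), (b, _) in combinations(kept, 2)
                      if len(a | b) == size})
    chosen = best or fallback
    return chosen[1] if chosen else None


TAG = ("TAG", 0, "pcadata")
PRICE = ("Left-text", 0, "Price:")
PARENT_B = ("TAG", 1, "b")
page = [frozenset([TAG]),                       # some other text node
        frozenset([("TAG", 0, "b"), PRICE]),    # the <b> element itself
        frozenset([TAG, PRICE, PARENT_B])]      # labeled price text node
rule = learn([TAG, PRICE, PARENT_B], [(page, 2), (page, 2)], [page])
print(sorted(rule))
```

On this toy page the first round finds a precision-1 rule at distance 1 (the parent tag), and the second round replaces it with the merged distance-0 rule made of the node name and the "Price:" anchor text.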
According to the above algorithm, for the example in Fig. 2 and Fig. 3, the final XPath extraction rule for the labeled node is derived as follows.
First, the initial candidate feature set S of the labeled node is obtained (S as given above).
At the start of the first round of learning, the algorithm state is initialized [initial values given as formula images in the original]. For every first-round candidate feature set F, the recall of the corresponding XPath expression X(F) is 1. Each F whose precision is less than 1 is added to the retained feature sets (the XPaths of all retained feature sets have precision less than 1, so that they can be combined into better XPaths in the next round). After all first-round candidates have been tested (lines 7-12), a best_XPath with precision 1 is obtained [expression given as a formula image in the original]. The retained feature sets are then pruned by deleting all those whose distance and support are worse than those of best_XPath; because the precision of best_XPath is already 1, the deleted feature sets cannot be combined into an extraction rule better than best_XPath. The remaining retained feature sets [given as a formula image in the original] are then combined pairwise to form the second-round candidates.
At the start of the second round of learning, we test all second-round candidate feature sets. In this example there is only one. Merging its features yields the new XPath rule "//pcadata[preceding::text()[normalize-space(.)!=''][position()=1][.='Price:']]", which states that we need to extract a text node whose first preceding non-empty text value is "Price:". The precision of this XPath rule is also 1. According to conditions 3) and 4), the XPath rule with the smallest distance and the highest support is selected (line 12), and the best_XPath expression finally obtained is
"//pcadata[preceding::text()[normalize-space(.)!=''][position()=1][.='Price:']]". The retained feature sets are now empty, so the third-round candidates are also empty and the learning process ends.
4. Experimental results
In this section, we experimentally evaluate the precision, recall and F1 value of the XPath extraction rules generated by the algorithm of the present invention.
We collected web pages from 30 real websites as experimental data. These websites cover different domains, such as product, business, financial and entertainment websites. The pages of each website are generated by scripts and share the same page template.
For each website, we select 10 pages belonging to the same template as sample pages. From these, 3 pages are chosen for the user to label the data items to be extracted, and the remaining 7 sample pages are used to measure the support of the generated XPath extraction rules. We then generate XPath extraction rules for all labeled data items.
In the present invention, we use the metrics of the information retrieval field to measure the effectiveness of extraction rules: the generated XPath extraction rules are applied to a large number of test pages of the same template, and their extraction precision and recall on the test pages are calculated. In addition, we also evaluate the F1 value (F1-Value), a combined measure of precision and recall.
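The F1 value is the harmonic mean of precision and recall. A minimal sketch follows; the function name f1_value and the input values are illustrative and are not experimental results.

```python
def f1_value(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs only: precision 0.9 with recall 1.0.
print(round(f1_value(0.9, 1.0), 4))
```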
We generated extraction rules for the data items on all labeled pages; Table 1 lists sample XPath rules generated for data items on typical pages of some of the websites. As the table shows, on typical pages no more than 3 features are needed to locate the labeled node quickly, and these features are very close to the labeled node. This is consistent with how web pages are designed: page authors usually add special features (structural features, attribute features, content features, even visual features) near important data items to attract the user's attention.
[Table 1: sample XPath rules generated for data items on typical pages of some of the websites]
We also compare the XPath extraction rules learned by the present invention with absolute XPath extraction rules and with the extraction rules obtained by the Vertex method. For each data set, we derive XPath rules with each of the three methods, apply the rules to 30 test pages generated from the same template, and calculate the precision, recall and their combined measure (F1_Value) for each. The average precision, recall and F1_Value of the extraction rules obtained by the three methods are shown in Fig. 6. Absolute XPath extraction rules lack generalization ability and fail when page structures differ, so their precision and recall are lower. Compared with the Vertex method, our method takes feature selection and the OR logic of features into account, and therefore achieves higher precision and recall. The experimental results show that the XPath extraction rules derived and learned by the method of the present invention reach a precision of 97.5% and a recall close to 100% when extracting data items on similar pages.

Claims (7)

1. A web page data extraction method based on small-sample semi-supervised learning, comprising the following steps:
(1) for similar web pages generated from the same web page template, selecting a group of sample web pages, and having the user manually select and label, on one to three of the sample web pages, the same data item to be extracted, this data item being called the labeled data item;
(2) according to the node on the DOM tree corresponding to the labeled data item, this node being called the labeled node, constructing an initial candidate feature set for the labeled node from the different features of the labeled data item on the corresponding DOM tree;
(3) based on the sample web pages, using a semi-supervised learning method and adopting a first algorithm to determine the minimal set of associated features in the initial candidate feature set, and deriving an extraction rule with generalization ability for the data item; an extraction rule with generalization ability means that, when a data item undergoes some structural changes across different web pages, the extraction rule can still extract the data item stably and correctly;
(4) performing the rule derivation of step (3) for each data item to be extracted on a web page, to obtain a group of extraction rules for the group of data items on the web page;
(5) applying the derived group of extraction rules to a batch of similar web pages from which data are to be extracted, and finally extracting a batch of data items.
2. The web page data extraction method based on small-sample semi-supervised learning according to claim 1, characterized in that: in step (2), the initial candidate features are of three types: structural features, attribute features and content features; each initial candidate feature is described by a triple (feature type t, distance d, feature value v).
3. The web page data extraction method based on small-sample semi-supervised learning according to claim 2, characterized in that: the structural features comprise the structural features of all intermediate nodes on the path from the labeled node to the root node of the DOM tree; a structural feature is described as (TAG, d, nv), where TAG indicates a DOM intermediate node element, d is the distance on the DOM tree from the intermediate node to the labeled node, and nv is the element name of the intermediate node;
The attribute features comprise the id and class attribute features of all intermediate nodes on the path from the labeled node to the root node of the DOM tree; an attribute feature is described as (attribute type name, d, av), where av is the value of the attribute; the attribute type names comprise the basic attributes id and class of the intermediate node, the id and class attributes of the preceding node of the intermediate node, and the id and class attributes of the following node of the intermediate node;
The content features comprise the first non-empty text nodes appearing before and after the labeled node that help identify and locate the labeled node; such non-empty text is defined as anchor text, and a content feature is described as (text feature name, d, cv), where cv is the value of the anchor text; the text feature names comprise left anchor text and right anchor text;
The same data item is allowed to have different feature values under the same feature on two similar pages, the different feature values being combined by an OR operation, where "the same feature" means the same feature type and the same distance.
4. The web page data extraction method based on small-sample semi-supervised learning according to claim 1, characterized in that: in step (3), the candidate feature set of the labeled node is processed iteratively: candidate features are repeatedly tested and merged on the sample web pages until a generalized extraction rule that correctly extracts the labeled data item is found.
5. The web page data extraction method based on small-sample semi-supervised learning according to claim 4, characterized in that step (3) comprises:
Let $X(F)$ denote the extraction rule derived from the candidate feature set F, and define $dist(X(F))$ as the distance of the extraction rule $X(F)$, whose value is the maximum distance over all features in the candidate feature set F:

$$dist(X(F)) = \max_{f \in F} d_f$$

where $f$ denotes a feature in the candidate feature set F and $d_f$ its distance;
Support is defined as follows: for the unlabeled pages among the sample pages (i.e. those pages on which no data item has been labeled), the support $sup(X(F))$ of the extraction rule $X(F)$ is defined as the proportion of unlabeled pages on which $X(F)$ extracts correctly:

$$sup(X(F)) = \frac{|\{p \in U : X(F) \text{ extracts correctly on } p\}|}{|U|}$$

where $U$ is the set of unlabeled sample pages;
Meanwhile, the recall and precision with which a generated rule extracts a data item are defined: let A be the set of labeled nodes of a data item on the labeled pages, B the set of labeled nodes that the extraction rule $X(F)$ extracts correctly on the labeled pages, and C the set of all nodes that $X(F)$ actually extracts on the labeled pages; the precision $precision(X(F))$ and recall $recall(X(F))$ of the extraction rule $X(F)$ are defined as:

$$precision(X(F)) = \frac{|B|}{|C|}, \qquad recall(X(F)) = \frac{|B|}{|A|}$$

In the derivation of the extraction rule, the final extraction rule satisfies the following four conditions:
1) the recall is 1, i.e. the extraction rule returns all labeled nodes on the labeled pages;
2) among all extraction rules satisfying condition 1), the extraction rules with the highest precision over all labeled nodes are selected;
3) among all extraction rules satisfying conditions 1) and 2), the extraction rules with the smallest distance are selected;
4) among all extraction rules satisfying the three conditions above, the extraction rule with the highest support is selected as the final extraction rule.
6. The web page data extraction method based on small-sample semi-supervised learning according to claim 4, characterized in that step (3) comprises: during the iteration, using a progressive, stepwise-refinement rule derivation method that repeatedly tests and merges the initial features to narrow the query scope with which the extraction rule locates the labeled node on the DOM tree; if an initial or merged feature cannot locate the labeled node exactly, further features are added step by step to narrow the query scope, and through these feature merging and incremental learning steps a feature set that uniquely locates the labeled node is progressively derived.
7. The web page data extraction method based on small-sample semi-supervised learning according to claim 1, characterized in that: the first algorithm is an Apriori-like algorithm.
CN201310465730.4A 2013-10-09 2013-10-09 Webpage data extraction method based on semi-supervised learning of small sample Pending CN103514292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310465730.4A CN103514292A (en) 2013-10-09 2013-10-09 Webpage data extraction method based on semi-supervised learning of small sample

Publications (1)

Publication Number Publication Date
CN103514292A true CN103514292A (en) 2014-01-15

Family

ID=49897016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310465730.4A Pending CN103514292A (en) 2013-10-09 2013-10-09 Webpage data extraction method based on semi-supervised learning of small sample

Country Status (1)

Country Link
CN (1) CN103514292A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1786965A (en) * 2005-12-21 2006-06-14 北大方正集团有限公司 Method for acquiring news web page text information
CN101344889A (en) * 2008-07-31 2009-01-14 中国农业大学 Method and system for network information extraction
US20120005207A1 (en) * 2010-07-01 2012-01-05 Yahoo! Inc. Method and system for web extraction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PANKAJ GULHANE 等: "Web-scale information extraction with vertex", 《2011 IEEE 27TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE)》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063488A (en) * 2014-07-07 2014-09-24 成都安恒信息技术有限公司 Semi-automatic learning type form feature extraction method
CN104063488B (en) * 2014-07-07 2017-09-01 成都安恒信息技术有限公司 A kind of form feature extracting method of semi-automatic learning type
CN108804458A (en) * 2017-05-02 2018-11-13 阿里巴巴集团控股有限公司 A kind of reptile web retrieval method and apparatus
US10558760B2 (en) 2017-07-28 2020-02-11 International Business Machines Corporation Unsupervised template extraction
US10572601B2 (en) 2017-07-28 2020-02-25 International Business Machines Corporation Unsupervised template extraction
CN111339396A (en) * 2018-12-18 2020-06-26 富士通株式会社 Method, apparatus and computer storage medium for extracting web page content
CN111339396B (en) * 2018-12-18 2024-04-16 富士通株式会社 Method, device and computer storage medium for extracting webpage content
CN111079403A (en) * 2019-12-10 2020-04-28 深圳市兴之佳科技有限公司 Page comparison method and device
CN111079403B (en) * 2019-12-10 2023-08-08 深圳市兴之佳科技有限公司 Page comparison method and device
CN112801025A (en) * 2021-02-09 2021-05-14 北京市商汤科技开发有限公司 Target feature determination method and device, electronic equipment and storage medium
CN112801025B (en) * 2021-02-09 2023-12-19 北京市商汤科技开发有限公司 Target feature determining method and device, electronic equipment and storage medium
CN113177168A (en) * 2021-04-29 2021-07-27 上海云扩信息科技有限公司 Positioning method based on Web element attribute characteristics
CN113177168B (en) * 2021-04-29 2023-12-01 上海云扩信息科技有限公司 Positioning method based on Web element attribute characteristics
CN113836877A (en) * 2021-09-28 2021-12-24 北京百度网讯科技有限公司 Text labeling method, device, equipment and storage medium
CN113836877B (en) * 2021-09-28 2024-05-10 北京百度网讯科技有限公司 Text labeling method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103514292A (en) Webpage data extraction method based on semi-supervised learning of small sample
Liu et al. Vide: A vision-based approach for deep web data extraction
Kayed et al. FiVaTech: Page-level web data extraction from template pages
CN101515287B (en) Automatic generating method of wrapper of complex page
CN101727498A (en) Automatic extraction method of web page information based on WEB structure
CN102831121A (en) Method and system for extracting webpage information
CN103646032A (en) Database query method based on body and restricted natural language processing
CN102262658B (en) Method for extracting web data from bottom to top based on entity
Majid et al. GoThere: travel suggestions using geotagged photos
Ghobadi et al. An ontology based semantic extraction approach for B2C eCommerce
Liu et al. Automatically extracting user reviews from forum sites
CN107491524B (en) Method and device for calculating Chinese word relevance based on Wikipedia concept vector
CN104537047A (en) Garment basic sample plate retrieval system based on Lucene
Gentile et al. Self training wrapper induction with linked data
Gupta et al. A heuristic approach for web content extraction
Pandarge et al. Automatic web information extraction and alignment using CTVS technique
Liu et al. Automatically mining review records from forum Web sites
Majid et al. Mining context-aware significant travel sequences from geotagged social media
Swami et al. Web Scraping Framework based on Combining Tag and Value Similarity
Padmadas et al. Web data extracion using visual features
Mukherjee et al. AHA: Asset harvester assistant
Sellers et al. OXPath: little language, little memory, great value
Liu et al. Deola: a system for linking author entities in web document with DBLP
Rane et al. Automatic annotating SRRs from web databases using Naive Bayes approach
Nikam et al. Web Data Extraction and Alignment Tools: A survey

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140115