CN103678432B

CN103678432B - A web page body extraction method based on web page body features and the medium truth degree

Info

Publication number: CN103678432B
Application number: CN201310116907.XA
Authority: CN
Inventors: 成卫青; 于静; 洪龙; 杨庚; 黄卫东; 梁胜
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Shenzhen Wealth Farm Internet Finance Service Co ltd
Priority date: 2013-04-07
Filing date: 2013-04-07
Publication date: 2016-11-16
Anticipated expiration: 2033-04-07
Also published as: CN103678432A

Abstract

The present invention, a web page body identification and extraction method based on web page body features and the medium truth degree, is a solution for identifying the main body of a page during Web information extraction. It mainly addresses the problems that existing main-body identification is not general enough and that its accuracy needs improvement, and it belongs to the field of Web information processing. Its distinguishing features are as follows. By examining attributes of each node in the web page node tree, namely the subtree count, the average subtree branch count, the displayable character count and the subtree branch-count range, the method judges from several angles the truth degree to which each node is the main-body node, and then combines the evaluations from all angles to identify the main-body node of the page; it therefore achieves higher web page body identification accuracy. In addition, the invention automatically sets, for each web page, the boundary values of each attribute that the medium truth degree calculation requires. Multi-angle judgment together with automatically set boundary values makes the invention highly general and suitable for main-body identification and extraction on all kinds of web pages.

Description

A web page body extraction method based on web page body features and the medium truth degree

Technical field

The present invention relates to the field of Internet information processing, and in particular to a web page body extraction method based on web page body features and the medium truth degree.

Background technology

In terms of content, a web page usually consists of navigation information, the main text, advertisements, copyright information, related links and other parts. Web page body extraction means separating the main text of a page from the rest of the page. The present invention calls everything in a page other than the main text noise content. Shielding the noise and extracting the main text of a page is of great significance to technologies such as web page classification and Web information extraction.

Scholars in China and abroad have done a great deal of research in this field, using a variety of approaches. The current mainstream research methods fall into the following categories.

(1) Template-based web page body extraction

Template-based web page body extraction relies on the internal structural features of HTML. It assumes that similar web pages have similar structural features or similar DOM (Document Object Model) tree structures. The body content of similar pages can then be obtained by defining a template, and the body data in a page can be extracted by a wrapper program. A wrapper defines a template according to the layout and format of the page and implements an analyzer that locates the main text in the page; that is, it extracts from the information source the content matching a specific information pattern and presents it in some form.

The focus and the difficulty of this method lie in how to determine and maintain the templates and how to generate the wrappers. Generating and maintaining wrappers is time-consuming and laborious, and researchers are still studying how to build wrappers more efficiently. Popular examples include the wrappers in the TSIMMIs system, the Ontology system and the XWRA system.

(2) Web page body extraction based on visual features

Web page body extraction based on visual features makes full use of visual cues such as font size, background color and blank areas to segment a web page into multiple visual information blocks, and then performs information extraction. The vision-based page segmentation algorithm (Visual Based Page Segment Algorithm) proposed by Microsoft Research Asia is implemented in this way.

Internet users generally perceive the main text area of a Web page from its layout features, and the main idea of vision-based extraction algorithms is to simulate this judgment process when extracting the main body. The algorithm roughly proceeds as follows: first, the page is segmented into multiple blocks according to visual information and the HTML source; then the intermediate blocks are divided into smaller blocks; next, weights are assigned to the separators between blocks; finally, blocks are merged according to the separator weights, and the data blocks remaining after merging are extracted to obtain the web page body data.

(3) Web page body extraction based on semantic information

Semantic information here means all information in a web page other than visual information, including the HTML tag information, the text of the page and the structural information of the HTML. Methods of this kind can be divided into three types: 1) Methods based on removing HTML tags. The main idea is to remove the HTML tags first, then determine the main-text regions from the text density after tag removal, and finally merge all main-text regions to obtain the body content of the page. 2) Methods based on character-string statistics. The HTML file is first converted into a DOM tree, then the number of Chinese characters contained in each TABLE node is counted, and the node containing the most characters is taken as the node containing the main text. 3) Methods based on TABLE nodes. TABLE tags are often used for page layout, and this method exploits that fact to extract body content from TABLE tags. For example, the TVPS algorithm (Table and Vision based Page Segmentation), proposed by the software research laboratory of the Institute of Computing Technology, Chinese Academy of Sciences, uses TABLE tags and visual features to divide a page into semantic blocks.

Although each of the above methods has its advantages, Web pages have become more and more complex as the Internet has developed, and the shortcomings of these methods have become apparent. Method (1) works well on a set of pages built from the same template, but there are countless web page templates on the Internet, so the method is not general. Method (2) can complete certain information extraction tasks, but because visual features are complex and uncertain, the extraction rules usually have to be modified and adjusted manually again and again, and it is difficult to keep the rule set consistent. As for method (3), the approach based on removing HTML tags can hardly filter out all irrelevant noise, so it cannot extract exactly the main information required. The approach based on character-string statistics requires all the text of a page to be placed in the same TABLE tag, whereas real page structures are much more complicated and many pages contain no TABLE tag at all. The block-division method designed in the TVPS algorithm considers only the bottom-level TABLE tags, whereas in practice both the layout structure of pages and the nesting of TABLE tags are complicated, and the probability that all the main text of a page lies in bottom-level TABLE tags is very small.

It can be seen that no existing method is suitable for body extraction on all web pages, and that the accuracy of existing web page body extraction methods needs improvement. To further improve the accuracy and generality of web page body extraction, the present invention applies the measure of medium truth degree (MMTD) to the identification and extraction of web page body blocks and proposes a web page body extraction method based on web page body features and MMTD. The method not only greatly improves extraction accuracy but also has higher generality.

Summary of the invention

The object of the present invention is to provide a Web page body extraction method and its implementation flow, in order to further improve the accuracy and generality of web page body extraction.

The technical solution adopted by the present invention is as follows. The invention is a strategic method that can be used to identify and extract the main body of a Web page; a Web information extraction system based on web page body extraction can be developed from it. The goal of web page body extraction is to remove all peripheral content from a page and keep only the main part that expresses the topic of the page. The invention first preprocesses the Web page, which comprises web page tidying and denoising: irregular HTML tags are normalized, and easily identified noise content unrelated to the topic of the page is removed. Identification and extraction of the web page body are then carried out on the basis of web page body features and the measure of medium truth degree (MMTD). The invention aims to solve the problems that existing web page body extraction methods are not accurate enough and not general enough, and proposes a web page body identification and extraction method that suits pages of many styles and types and achieves high accuracy.

The present invention uses the following concepts and formulas:

(1) Subtree: a tree rooted at a child node of a node is called a subtree of that node.

(2) Subtree count: the number of subtrees a node has.

(3) Branch count: the number of all nodes in the node tree rooted at the node itself, minus one.

(4) Average subtree branch count: the mean of the branch counts of the subtrees of a node, i.e. the sum of the branch counts of its subtrees divided by its subtree count.

(5) Displayable character count: the number of characters that all nodes in the node tree rooted at the node itself can display on the web page.

(6) Subtree branch-count range: the difference between the largest and the smallest branch count among all subtrees of a node.
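As an illustration of definitions (1) to (6), the following Python sketch computes the attributes on a small hand-built tree. It is a sketch under stated assumptions rather than the inventors' implementation: the `Node` class and all function names are hypothetical, a leaf is assumed to have an average subtree branch count and a branch-count range of 0, and whitespace is assumed not to count as a displayable character.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: a tag name, its own visible text, and children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def node_count(n: Node) -> int:
    """Number of nodes in the tree rooted at n, n included."""
    return 1 + sum(node_count(c) for c in n.children)


def subtree_count(n: Node) -> int:
    """(2) Subtree count: one subtree per child node."""
    return len(n.children)


def branch_count(n: Node) -> int:
    """(3) Branch count: nodes in the tree rooted at n, minus one."""
    return node_count(n) - 1


def avg_subtree_branch_count(n: Node) -> float:
    """(4) Sum of the subtrees' branch counts divided by the subtree count.
    Returning 0.0 for a leaf is an assumption of this sketch."""
    if not n.children:
        return 0.0
    return sum(branch_count(c) for c in n.children) / len(n.children)


def displayable_chars(n: Node) -> int:
    """(5) Characters displayed by all nodes in the tree rooted at n.
    Whitespace is not counted, which is an assumption of this sketch."""
    own = len("".join(n.text.split()))
    return own + sum(displayable_chars(c) for c in n.children)


def branch_count_range(n: Node) -> int:
    """(6) Largest minus smallest branch count among the subtrees of n."""
    if not n.children:
        return 0
    counts = [branch_count(c) for c in n.children]
    return max(counts) - min(counts)
```

For a `div` node with two children, a `p` leaf holding "Hello world" and a `ul` with two one-character `li` leaves, these functions give a subtree count of 2, a branch count of 4, an average subtree branch count of 1.0, a displayable character count of 12 and a branch-count range of 2.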

The present invention refers to the subtree count, branch count, average subtree branch count, displayable character count and subtree branch-count range of a node collectively as the attributes of the node. Based on an analysis of how each attribute behaves for the main body of HTML pages of many styles and types, the invention proposes a web page body identification and extraction method based on web page body features and the measure of medium truth degree (MMTD). Among all nodes that remain after preprocessing, it selects the node that contains all of the main information and the least noise; the content of this node is the body content of the page. As shown in Fig. 1, the method mainly comprises the following processes: web page tidying; web page denoising; generating the node tree; calculating the number of nodes of the page and the attributes of each node; determining, for this page, the boundary values of each attribute needed to calculate node truth degrees; calculating the truth degree to which each node is the web page body node; and taking the node with the highest truth degree as the main-body node and saving the nodes and content it contains as an XML document.

Method flow:

The present invention provides a web page body extraction method based on web page body features and the medium truth degree, comprising the following steps:

Step 1) Web page tidying: normalize irregular HTML tags. Tidying includes adding missing end tags and making tags properly paired and nested, so that the source fully complies with HTML syntax.

Step 2) Web page denoising: remove content that is certain to be noise. Web page noise is the part of a page that is unrelated to the topic the page expresses. Denoising includes: outputting only the body part; not outputting font tags; not outputting attributes; not outputting the leading spaces of each line of the source; deleting script-type tags and comment tags together with the content between them; deleting empty tags and tags such as select and input together with the content between them; and deleting img tags. Two cases must be considered when deleting a tag: the tag has a matching end tag, or the tag has no separate end tag. If there is an end tag, the pair of tags and the content between them are all removed; if there is no end tag, the entire content of the tag itself is removed.

Step 3) Use HTMLParser to parse the HTML source of the page preprocessed in steps 1) and 2) and generate a hierarchical node tree; all subsequent processing operates on this node tree of the preprocessed page.

Step 4) Count the nodes of the page, denoting the count M, and calculate four attributes of each node: the subtree count, the average subtree branch count, the displayable character count and the subtree branch-count range.

Step 5) For each of the four attributes, determine the boundary values needed to calculate the medium truth degree. The values of each attribute over all nodes are sorted separately: the subtree count, the average subtree branch count and the displayable character count are sorted in ascending order, and the subtree branch-count range is sorted in descending order, giving four ordered sequences. For each attribute, the value at position 50%M (rounded to an integer) is the first boundary value, the value at position 70%M (rounded) is the second, the value at position 80%M (rounded) is the third, and the value at position 90%M (rounded) is the fourth; the four boundary values are denoted a₁, a₂, a₃, a₄. Four boundary values are determined for each attribute separately, so the four attributes have four sets of boundary values.

Step 6) For each node, calculate the truth degree of the proposition "this node is the web page body node" according to formula (1) and formula (2). Let the four attributes of node i be y_i = (y_{i1}, y_{i2}, y_{i3}, y_{i4}), where the fourth attribute is the subtree branch-count range. The truth degree h_{n-T} that node i is the web page body node is:

h_{n-T}(y_i) = \sum_{k=1}^{4} h_T(y_{ik})    (1)

where h_T is given by formula (2):

[Formula (2): definition of h_T(y) in terms of a₁, a₂, a₃, a₄]    (2)

where y is the value of an attribute and a₁, a₂, a₃, a₄ are the four boundary values of that attribute.

Step 7) Find the node with the highest truth degree and judge it to be the node containing the main information. Store this node, together with the nodes and content it contains, as an XML document for further Web information extraction; that is, the main text is extracted from the semi-structured web page and stored in structured form for convenient later use.
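The boundary-value selection, the summation of formula (1) and the choice of the highest-scoring node in steps 5) to 7) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation. Formula (2) is not reproduced in this text, so the per-attribute truth degree is passed in as a caller-supplied function `h_t`; `toy_h_t` is a stand-in that exists only so the example runs and is not formula (2). The function names, the truncating integer rounding of the 50%M, 70%M, 80%M and 90%M positions, and the 1-based positions are likewise assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Bounds = Tuple[float, ...]
HT = Callable[[float, Bounds], float]

PERCENT_POSITIONS = (50, 70, 80, 90)  # positions 50%M, 70%M, 80%M and 90%M


def boundary_values(values: Sequence[float], descending: bool = False) -> Bounds:
    """Step 5: sort one attribute over all M nodes and read off a1..a4."""
    if not values:
        raise ValueError("at least one node is required")
    ordered = sorted(values, reverse=descending)
    m = len(ordered)
    # 1-based position, truncated to an integer and clamped to 1 (assumption).
    return tuple(ordered[max(1, pct * m // 100) - 1] for pct in PERCENT_POSITIONS)


def truth_degrees(rows: Sequence[Sequence[float]], h_t: HT) -> List[float]:
    """Step 6, formula (1): per node, sum h_t over its four attributes.

    rows[i] = (subtree count, average subtree branch count,
               displayable character count, subtree branch-count range).
    """
    columns = list(zip(*rows))
    # Only the fourth attribute (index 3) is sorted in descending order.
    bounds = [boundary_values(col, descending=(k == 3))
              for k, col in enumerate(columns)]
    return [sum(h_t(y, bounds[k]) for k, y in enumerate(row)) for row in rows]


def main_body_index(degrees: Sequence[float]) -> int:
    """Step 7: index of the node with the highest truth degree (first on ties)."""
    return max(range(len(degrees)), key=lambda i: degrees[i])


def toy_h_t(y: float, bounds: Bounds) -> float:
    """Stand-in only, NOT formula (2): counts how many boundaries y reaches."""
    return float(sum(1 for a in bounds if y >= a))


if __name__ == "__main__":
    rows = [(1, 0.0, 5, 0), (2, 1.0, 40, 2), (3, 2.0, 90, 1), (1, 0.0, 3, 0)]
    degrees = truth_degrees(rows, toy_h_t)
    print(degrees, main_body_index(degrees))
```

Passing `h_t` as a parameter keeps the sketch independent of the exact form of formula (2); an implementation of that formula can be supplied in place of `toy_h_t` without changing the other functions.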

The beneficial effects of the present invention are:

1. By examining attributes such as the subtree count, average subtree branch count, displayable character count and subtree branch-count range of each node in the web page node tree, the invention judges from several angles the degree to which each node resembles the main-body node, and then combines the evaluations from all angles to identify the main-body node of the page; it therefore achieves higher web page body identification accuracy.

2. The invention automatically sets, for each web page, the boundary values required by the medium truth degree calculation. Multi-angle judgment together with automatically set boundary values makes the invention suitable for main-body identification and extraction on all kinds of web pages.

3. The method effectively solves the problems that existing web page body identification methods are not accurate enough and not general enough.

Brief description of the drawings

Fig. 1 is a flow chart of web page body identification and extraction based on the medium truth degree according to the present invention.

Detailed description of the invention

For convenience of description, assume the following application example: ten web pages are chosen at random from each of six websites, namely Dangdang.com, Suning Yigou, Joyo Amazon, the Weilan online bookstore, Jingdong Mall and ifeng.com (Phoenix Net), and main-body identification and extraction are performed on them.

The specific embodiment of the present invention is as follows.

The following operations are performed on each web page:

(1) Obtain the page source and tidy the page, i.e. normalize irregular HTML tags, including adding missing end tags and making tags properly paired and nested, so that the source fully complies with HTML syntax.

(2) Remove content in the page that is certain to be noise. Denoising includes: outputting only the body part; not outputting font tags; not outputting attributes; not outputting the leading spaces of each line of the source; deleting script-type tags and comment tags together with the content between them; deleting empty tags and tags such as select and input together with the content between them; and deleting img tags. Two cases must be considered when deleting a tag: the tag has a matching end tag, or the tag has no separate end tag. If there is an end tag, the pair of tags and the content between them are all removed; if there is no end tag, the entire content of the tag itself is removed.

(3) Use HTMLParser to parse the HTML source of the page after the first two steps and generate a hierarchical node tree; all subsequent processing operates on this node tree.

(4) Count the nodes of the page, denoting the count M, and calculate four attributes of each node: the subtree count, the average subtree branch count, the displayable character count and the subtree branch-count range.

(5) For each of the four attributes, determine the boundary values needed to calculate the medium truth degree. The values of each attribute over all nodes are sorted separately: the subtree count, the average subtree branch count and the displayable character count are sorted in ascending order, and the subtree branch-count range is sorted in descending order, giving four ordered sequences. For each attribute, the value at position 50%M (rounded to an integer) is the first boundary value, the value at position 70%M (rounded) is the second, the value at position 80%M (rounded) is the third, and the value at position 90%M (rounded) is the fourth; the four boundary values are denoted a₁, a₂, a₃, a₄. Four boundary values are determined for each attribute separately, so the four attributes have four sets of boundary values.

(6) For each node, first calculate according to formula (2), from the angle of each single attribute, the truth degree of the proposition "this node is the web page body node"; then sum the truth degrees obtained for the four attributes to obtain the overall truth degree to which the node is the web page body node. Formula (2) is:

[Formula (2): definition of h_T(y) in terms of a₁, a₂, a₃, a₄]    (2)

where y is the value of an attribute and a₁, a₂, a₃, a₄ are the four boundary values of that attribute.

(7) Find the node with the highest truth degree, judge it to be the node containing the main information, and store this node together with the nodes and content it contains as an XML document.
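The denoising and parsing of operations (2) and (3) and the XML storage of operation (7) can be sketched in Python as follows. This is a minimal sketch under several assumptions, not the implementation used by the inventors. The parser named in operation (3) is not specified further here, so the sketch substitutes Python's standard-library `html.parser.HTMLParser`. It filters noise while building the tree instead of first rewriting the source, the tag sets (including `style`) are assumptions, removal of empty tags is not covered, and the text of a node is concatenated without preserving its position relative to child elements. All class and function names are hypothetical, and the input is assumed to have been tidied as in operation (1).

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Tag sets are assumptions of this sketch, based on the noise listed in (2).
DROP_WITH_CONTENT = {"script", "style", "select"}  # paired tags removed with content
DROP_TAG_ONLY = {"img", "input", "font"}           # tag removed, enclosed text kept
VOID_TAGS = {"br", "hr"}                           # elements that never have children


class TreeNode:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.text = ""
        self.children = []


class BodyTreeBuilder(HTMLParser):
    """Builds a node tree for <body>, skipping noise; attributes are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = None      # set when <body> is seen
        self.current = None   # innermost open node, None outside <body>
        self.skip_depth = 0   # > 0 while inside a DROP_WITH_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag == "body" and self.root is None:
            self.root = self.current = TreeNode("body")
        elif self.current is None:
            return
        elif tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in DROP_TAG_ONLY:
            node = TreeNode(tag, self.current)
            self.current.children.append(node)
            if tag not in VOID_TAGS:
                self.current = node

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "body":
            self.current = None
        elif self.skip_depth == 0 and tag not in DROP_TAG_ONLY:
            node = self.current
            while node is not None and node.tag != tag:
                node = node.parent  # climb to the matching open element
            if node is not None and node.parent is not None:
                self.current = node.parent

    def handle_data(self, data):
        # Comments never reach this method, so they are dropped implicitly.
        if self.current is not None and self.skip_depth == 0:
            self.current.text += data.strip()


def build_tree(html_source):
    builder = BodyTreeBuilder()
    builder.feed(html_source)
    builder.close()
    if builder.root is None:
        raise ValueError("no <body> element found")
    return builder.root


def to_element(node):
    """Convert a TreeNode and everything below it to an ElementTree element."""
    elem = ET.Element(node.tag)
    elem.text = node.text or None
    for child in node.children:
        elem.append(to_element(child))
    return elem


def to_xml(node):
    """Operation (7): serialize the chosen node and its content as XML text."""
    return ET.tostring(to_element(node), encoding="unicode")
```

Once the node with the highest truth degree has been identified, passing it to `to_xml` yields a string that can be written to an XML file.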

Claims

1. A web page body extraction method based on web page body features and the medium truth degree, characterized by comprising the following steps:

Step 1) Web page tidying: normalize irregular HTML tags; tidying includes adding missing end tags and making tags properly paired and nested, so that the source fully complies with HTML syntax;

Step 2) Web page denoising: remove content that is certain to be noise, including: outputting only the body part; not outputting font tags; not outputting attributes; not outputting the leading spaces of each line of the source; deleting script-type tags and comment tags together with the content between them; deleting empty tags and select and input tags together with the content between them; and deleting img tags; two cases must be considered when deleting a tag: the tag has a matching end tag, or the tag has no separate end tag; if there is an end tag, the pair of tags and the content between them are all removed; if there is no end tag, the entire content of the tag itself is removed;

Step 3) Use HTMLParser to parse the HTML source of the page preprocessed in steps 1) and 2) and generate a hierarchical node tree; all subsequent processing operates on this node tree of the preprocessed page;

Step 4) Count the nodes of the page, denoting the count M, and calculate four attributes of each node: the subtree count, the average subtree branch count, the displayable character count and the subtree branch-count range;

Step 5) For each of the four attributes, determine the boundary values needed to calculate the medium truth degree; the values of each attribute over all nodes are sorted separately: the subtree count, the average subtree branch count and the displayable character count are sorted in ascending order, and the subtree branch-count range is sorted in descending order, giving four ordered sequences; for each attribute, the value at position 50%M (rounded to an integer) is the first boundary value, the value at position 70%M (rounded) is the second, the value at position 80%M (rounded) is the third, and the value at position 90%M (rounded) is the fourth; the four boundary values are denoted a₁, a₂, a₃, a₄; four boundary values are determined for each attribute separately, so the four attributes have four sets of boundary values;

Step 6) For each node, calculate the truth degree of the proposition "this node is the web page body node" according to formula (1) and formula (2); let the four attributes of node i be y_i = (y_{i1}, y_{i2}, y_{i3}, y_{i4}), where the fourth attribute is the subtree branch-count range; the truth degree h_{n-T} that node i is the web page body node is:

h_{n-T}(y_i) = \sum_{k=1}^{4} h_T(y_{ik})    (1)

where h_T is given by formula (2):

[Formula (2): definition of h_T(y) in terms of a₁, a₂, a₃, a₄]    (2)

where y is the value of an attribute and a₁, a₂, a₃, a₄ are the four boundary values of that attribute;

Step 7) Find the node with the highest truth degree, judge it to be the node containing the main information, and store this node together with the nodes and content it contains as an XML document.