CN103678432B - Web page main-body extraction method based on web page main-body features and medium truth degree - Google Patents
- Publication number
- CN103678432B (application CN201310116907.XA; earlier publication CN103678432A)
- Authority
- CN
- China
- Prior art keywords
- node
- web page
- webpage
- label
- subtree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Abstract
The invention, a web page main-body identification and extraction method based on web page main-body features and medium truth degree, is a solution for main-body identification in Web information extraction. It mainly addresses the weak generality and limited accuracy of existing main-body identification methods, and belongs to the field of Web information processing. The method examines attributes of every node in the page's node tree (subtree number, average subtree branch number, displayable character count, and subtree branch-number range), judges from several angles the truth degree to which each node is the main-body node, and then combines the per-angle results to identify the page's main-body node, which yields higher identification accuracy. The method also sets, automatically for each page, the boundary values of each attribute needed to compute the medium truth degree. Multi-angle judgment together with automatic boundary setting makes the method general: it is applicable to main-body identification and extraction for web pages of all kinds.
Description
Technical field
The present invention relates to the field of Internet information processing, and in particular to a web page main-body extraction method based on web page main-body features and medium truth degree.
Background
In terms of content, a web page usually consists of navigation information, the main text, advertisements, copyright information, related links, and similar parts. Web page main-body extraction means separating the main text of a page from the rest of the page. Everything in the page other than the main text is called noise content in this document. Shielding the noise and extracting the main text is of great importance to technologies such as web page classification and Web information extraction.
Researchers in China and abroad have done a great deal of work in this field, and the methods are classified in various ways. The current mainstream research methods fall into the following categories.
(1) Template-based web page main-body extraction
Template-based extraction relies on the internal structural features of HTML. It assumes that similar web pages have similar structural features or similar DOM (Document Object Model) tree structures. The main content of similar pages can then be obtained by defining a template, and the main-body data can be extracted by a wrapper program. A wrapper defines a template from the page's format features and implements a parser that locates the main text in the page; that is, it extracts from the information source the content that matches a specific information pattern and presents it in some form.
The focus and the difficulty of this approach are how to determine and maintain the templates and how to generate the wrappers. Generating and maintaining wrappers is time-consuming and laborious, and researchers are still studying how to build wrappers more efficiently. The best-known examples are the wrappers in the TSIMMIS system, the Ontology system, and the XWRA system.
(2) Vision-based web page main-body extraction
Vision-based extraction makes full use of visual features such as font size, background color, and blank areas to divide a page into several visual information blocks, and then performs information extraction. The vision-based page segmentation algorithm (Visual Based Page Segment Algorithm) proposed by Microsoft Research Asia is implemented this way.
Internet users generally perceive the main-text area of a page from its layout features, and the main idea of vision-based extraction is to simulate that judgment process. The algorithm has roughly three steps. First, the page is divided into several blocks according to visual information and the HTML source, and intermediate blocks are divided into smaller blocks. Second, weights are assigned to the separators between blocks. Finally, blocks are merged according to the separator weights, and the data blocks that remain after merging are extracted as the main-body data.
(3) Semantic-information-based web page main-body extraction
Semantic information here means all information in a page other than visual information, including HTML tag information, the page's text, and the structural information of the HTML. These methods can be divided into three kinds: 1) methods based on removing HTML tags, which first remove the HTML tags, then determine the main-text regions from the text density after tag removal, and finally merge all main-text regions to obtain the page's main content; 2) methods based on character-string statistics, which first build a DOM tree from the HTML file, then count the Chinese characters contained in each TABLE node and take the node containing the most characters as the node holding the main text; 3) methods based on TABLE nodes, which exploit the fact that TABLE tags are often used for page layout and extract the main content from TABLE tags. For example, the TVPS algorithm (Table and Vision based Page Segmentation) proposed by the software laboratory of the Institute of Computing Technology, Chinese Academy of Sciences, uses TABLE tags and visual features to divide a page into semantic blocks.
Each of the above methods has its advantages, but as the Internet has developed and web pages have become more and more complex, their shortcomings have also become apparent. Method (1) works well on a set of pages built from the same template, but the Internet contains countless page templates, so the method is not general. Method (2) can complete certain information extraction tasks, but because visual features are complex and uncertain, the extraction rules usually need continual manual revision, and it is difficult to keep the rule set consistent. Among the methods in (3), the one based on removing HTML tags can hardly filter out all irrelevant noise, so it cannot extract the required main information accurately. The one based on character-string statistics requires all the text of a page to sit in a single TABLE tag, whereas real page structures are far more complex and many pages contain no TABLE tag at all. The block division designed in the TVPS algorithm considers only the bottom-level TABLE tags, but in practice both the layout structure of pages and the nesting of TABLE tags are very complex, and the probability that all of a page's main text lies in bottom-level TABLE tags is very small.
It can be seen that no existing method is applicable to main-body extraction for all web pages, and the accuracy of existing methods needs improvement. To further improve the accuracy and generality of web page main-body extraction, the present invention applies the measure of medium truth degree (MMTD) to the identification and extraction of the main-body block and proposes a web page main-body extraction method based on web page main-body features and MMTD. The method both greatly improves extraction accuracy and has higher generality.
Summary of the invention
The purpose of the present invention is to provide a web page main-body extraction method and its implementation flow, in order to further improve the accuracy and generality of web page main-body extraction.
The technical solution adopted by the present invention is as follows. The invention is a strategic method that can be used to identify and extract the main body of a web page, and a Web information extraction system based on main-body extraction can be developed from it. The goal of main-body extraction is to remove all peripheral content from a page and keep only the main part that expresses the page's topic. The invention first preprocesses the page, which includes page tidying and page denoising: irregular HTML tags are normalized, and easily recognized noise content unrelated to the page's topic is removed. It then identifies and extracts the main body using web page main-body features and the measure of medium truth degree (MMTD). The aim is to solve the insufficient accuracy and weak generality of existing extraction methods by proposing an identification and extraction method that suits pages of many styles and types and has high accuracy.
The present invention uses the following concepts and formulas:
(1) Subtree: a tree rooted at a child of a node is called a subtree of that node.
(2) Subtree number: the number of subtrees a node has.
(3) Branch number: the number of all nodes in the tree rooted at the node itself, minus one.
(4) Average subtree branch number: the mean of the branch numbers of a node's subtrees, i.e. the sum of the branch numbers of the node's subtrees divided by the node's subtree number.
(5) Displayable character count: the number of characters, over all nodes in the tree rooted at the node itself, that can be displayed on the page.
(6) Subtree branch-number range: the difference between the largest and the smallest branch number among all subtrees of a node.
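The six definitions above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the dict-based node layout (`tag`, `text`, `children`) and all function names are hypothetical, the "minus one" in the branch number follows our reading of definition (3), whitespace is assumed not to count as a displayable character, and the values returned for leaf nodes (0) are an assumption the source does not state.

```python
def node_count(node):
    """Number of nodes in the tree rooted at `node`, including the root."""
    return 1 + sum(node_count(c) for c in node["children"])


def branch_number(node):
    """Definition (3): nodes in the tree rooted at `node`, minus one."""
    return node_count(node) - 1


def displayable_chars(node):
    """Definition (5): displayable characters in the whole tree under `node`.

    Assumption: whitespace is not counted.
    """
    own = len("".join(node["text"].split()))
    return own + sum(displayable_chars(c) for c in node["children"])


def node_attributes(node):
    """Return the five attributes of one node as a dict."""
    branches = [branch_number(c) for c in node["children"]]
    k = len(branches)  # definition (2): one subtree per child
    return {
        "subtree_number": k,
        "branch_number": branch_number(node),
        # Definition (4); 0.0 for a leaf is an assumption.
        "avg_subtree_branch_number": sum(branches) / k if k else 0.0,
        "displayable_chars": displayable_chars(node),
        # Definition (6); 0 for a leaf is an assumption.
        "subtree_branch_range": max(branches) - min(branches) if k else 0,
    }


# Usage: a <div> holding a paragraph and a two-item list.
page = {"tag": "div", "text": "", "children": [
    {"tag": "p", "text": "Hello world", "children": []},
    {"tag": "ul", "text": "", "children": [
        {"tag": "li", "text": "a", "children": []},
        {"tag": "li", "text": "b", "children": []},
    ]},
]}
print(node_attributes(page))
```

The recursive helpers recompute subtree sizes on every call, which keeps the sketch short; a single post-order pass that caches the counts would be the natural choice for large pages.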
The subtree number, branch number, average subtree branch number, displayable character count, and subtree branch-number range of a node are collectively called the attributes of that node. Based on an analysis of how each attribute behaves for the main body of HTML pages of many styles and types, the invention proposes a main-body identification and extraction method based on web page main-body features and the measure of medium truth degree (MMTD). From all the nodes that remain after preprocessing, it selects the node that contains all the main information and the least noise; the content of that node is the main content of the page. The method mainly comprises: tidying the page; denoising the page; generating the node tree; computing the number of nodes in the page and the attributes of each node; determining, for this page, the boundary values of each attribute needed to compute node truth values; computing the truth degree to which each node is the main-body node; and taking the node with the highest truth value as the main-body node, extracting it, and saving the nodes and content it contains as an XML document, as shown in Figure 1.
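The denoising and node-tree generation stages named above might look like the following sketch. It uses Python's standard `html.parser` module as a stand-in for the HTMLParser library the patent names, and it does not implement the tidying stage (it only tolerates stray or missing end tags). The tag sets, the dict-based node layout, and the names `TreeBuilder`, `prune_empty`, and `build_tree` are illustrative assumptions.

```python
from html.parser import HTMLParser

# Tag sets below are illustrative assumptions, not taken from the patent text.
DROP_WITH_CONTENT = {"script", "style", "select"}  # removed with everything inside
DROP_SINGLE = {"img", "input"}                     # removed; no end tag expected
TRANSPARENT = {"font"}                             # tag dropped, inner text kept
VOID = {"area", "base", "br", "col", "embed", "hr", "link", "meta", "source", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds a dict-based node tree while skipping noise content."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "text": "", "children": []}
        self.stack = [self.root]
        self.body = None
        self.skip_tag = None   # noise tag currently being skipped, if any
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag in DROP_WITH_CONTENT:
            self.skip_tag, self.skip_depth = tag, 1
            return
        if tag in DROP_SINGLE or tag in TRANSPARENT:
            return
        node = {"tag": tag, "text": "", "children": []}  # attributes are not kept
        self.stack[-1]["children"].append(node)
        if tag == "body":
            self.body = node
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()  # also drops leading whitespace of source lines
        if text and not self.skip_tag:
            node = self.stack[-1]
            node["text"] = (node["text"] + " " + text).strip()


def prune_empty(node):
    """Drop descendants with no text and no children; True if `node` is kept."""
    node["children"] = [c for c in node["children"] if prune_empty(c)]
    return bool(node["text"] or node["children"])


def build_tree(html):
    """Return the denoised node tree of the <body> (or of the whole input)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    tree = builder.body or builder.root
    prune_empty(tree)
    return tree
```

Comments are dropped because `HTMLParser` ignores them unless `handle_comment` is overridden. Skipping by tag name rather than by generic nesting depth keeps the sketch robust to unclosed `option` tags inside a `select`.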
Method flow:
The present invention provides a web page main-body extraction method based on web page main-body features and medium truth degree, comprising the following steps:
Step 1) Page tidying: normalize irregular HTML tags. Tidying includes adding missing end tags and making tags properly paired and nested, so that the source fully conforms to HTML syntax.
Step 2) Page denoising: remove content that is certainly noise. Page noise is the part of a page that is unrelated to the topic the page expresses. Denoising includes: output only the body part; do not output font tags; do not output attributes; do not output the leading whitespace of each source line; delete script tags and comment tags together with the content between them; delete empty tags and select and input tags together with the content between them; delete img tags. Deleting a tag involves two cases. If the tag has a paired end tag, both tags and everything between them are removed. If the tag has no separate end tag, the entire content of the tag is removed.
Step 3) Use HTMLParser to parse the HTML source of the page preprocessed by steps 1) and 2) and generate a hierarchical node tree. All later processing operates on this node tree of the preprocessed page.
Step 4) Count the nodes of the page and denote the count by M. Compute four attributes of each node: subtree number, average subtree branch number, displayable character count, and subtree branch-number range.
Step 5) Determine, for each of the four attributes, the boundary values needed to compute the medium truth degree. Sort the values of each attribute over all nodes: subtree number, average subtree branch number, and displayable character count in ascending order, and subtree branch-number range in descending order, giving four ordered sequences. For each attribute, the value at position 50% of M (rounded to an integer) is the first boundary value, the value at position 70% of M (rounded) is the second, the value at position 80% of M (rounded) is the third, and the value at position 90% of M (rounded) is the fourth. The four boundary values are denoted $a_1$, $a_2$, $a_3$, $a_4$. Four boundary values are determined for each attribute, so the four attributes have four groups of boundary values.
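Step 6) For each node, compute by formula (1) and formula (2) the truth degree of the proposition "this node is the web page main-body node". Let the four attributes of node $i$ be $y_i = (y_{i1}, y_{i2}, y_{i3}, y_{i4})$, where the fourth attribute is the subtree branch-number range. The truth degree $h_{n-T}$ that node $i$ is the main-body node is given by formula (1), which uses the per-attribute function of formula (2). In formula (2), $y$ is the value of an attribute and $a_1$, $a_2$, $a_3$, $a_4$ are the four boundary values of that attribute.
[Formulas (1) and (2) were images in the original publication and are not reproduced in this text.]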
Step 7) Find the node with the highest truth degree and judge it to be the node that holds the main information. Store this node, together with the nodes and content it contains, as an XML document for further Web information extraction; that is, extract the main text from the semi-structured page and store it in structured form for later use.
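Step 5 and the combination of per-attribute truth degrees can be sketched as below. Several points are assumptions: positions are taken as 1-based and "rounded" is read as rounding down; the attribute names are hypothetical; and the per-attribute function `h` is left as a parameter because formula (2) is not reproduced in this text. Summing the four per-attribute values follows the detailed description, which states that the truth degrees computed from the four attributes are summed.

```python
PERCENT_POSITIONS = (50, 70, 80, 90)  # positions of a_1..a_4, in percent of M


def boundary_values(values, descending=False):
    """Return (a_1, a_2, a_3, a_4) for one attribute over all M nodes.

    Assumptions: positions are 1-based, and pct * M / 100 is rounded down
    (integer arithmetic avoids floating-point surprises), then clamped to 1..M.
    """
    m = len(values)
    ordered = sorted(values, reverse=descending)
    positions = [min(m, max(1, (pct * m) // 100)) for pct in PERCENT_POSITIONS]
    return tuple(ordered[p - 1] for p in positions)


def comprehensive_truth(node_attrs, bounds, h):
    """Sum the per-attribute truth degrees of one node.

    `h(y, (a_1, a_2, a_3, a_4))` must be supplied by the reader from
    formula (2) of the patent; it is deliberately not defined here.
    """
    return sum(h(node_attrs[name], bounds[name]) for name in bounds)


# Usage: boundary values for ten nodes whose attribute values are 1..10.
print(boundary_values(list(range(1, 11))))                   # ascending attributes
print(boundary_values(list(range(1, 11)), descending=True))  # branch-number range
```

Because the boundary values are recomputed from each page's own distribution of attribute values, no per-site tuning is needed, which is the property the method relies on for generality.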
The beneficial effects of the present invention are:
1. By examining the subtree number, average subtree branch number, displayable character count, and subtree branch-number range of every node in the page's node tree, the invention judges from several angles the degree to which each node resembles the main-body node and then combines the per-angle results to identify the page's main-body node, which yields higher identification accuracy.
2. The invention automatically sets, for each page, the boundary values needed to compute the medium truth degree. Multi-angle judgment together with automatic boundary setting makes the invention applicable to main-body identification and extraction for web pages of all kinds.
3. The method effectively addresses the insufficient accuracy and weak generality of existing web page main-body identification methods.
Brief description of the drawings
Fig. 1 is a flowchart of web page main-body identification and extraction based on medium truth degree according to the present invention.
Detailed description of the embodiments
For ease of description, assume the following application example: ten pages are chosen at random from each of six websites (Dangdang, Suning Yigou, Joyo Amazon, Weilan Bookstore, Jingdong Mall, and ifeng.com), and main-body identification and extraction are performed on them.
In a specific embodiment of the present invention, the following operations are performed on each page:
(1) Obtain the page source and tidy the page: normalize irregular HTML tags, including adding missing end tags and making tags properly paired and nested, so that the source fully conforms to HTML syntax.
(2) Remove content in the page that is certainly noise. Denoising includes: output only the body part; do not output font tags; do not output attributes; do not output the leading whitespace of each source line; delete script tags and comment tags together with the content between them; delete empty tags and select and input tags together with the content between them; delete img tags. Deleting a tag involves two cases. If the tag has a paired end tag, both tags and everything between them are removed. If the tag has no separate end tag, the entire content of the tag is removed.
(3) Use HTMLParser to parse the HTML source of the page produced by the first two steps and generate a hierarchical node tree. All later processing operates on this node tree.
(4) Count the nodes of the page and denote the count by M. Compute four attributes of each node: subtree number, average subtree branch number, displayable character count, and subtree branch-number range.
(5) Determine, for each of the four attributes, the boundary values needed to compute the medium truth degree. Sort the values of each attribute over all nodes: subtree number, average subtree branch number, and displayable character count in ascending order, and subtree branch-number range in descending order, giving four ordered sequences. For each attribute, the value at position 50% of M (rounded to an integer) is the first boundary value, the value at position 70% of M (rounded) is the second, the value at position 80% of M (rounded) is the third, and the value at position 90% of M (rounded) is the fourth. The four boundary values are denoted $a_1$, $a_2$, $a_3$, $a_4$. Four boundary values are determined for each attribute, so the four attributes have four groups of boundary values.
(6) For each node, first compute by formula (2), from the angle of each single attribute, the truth degree of "this node is the web page main-body node", then sum the truth degrees obtained from the four attributes to obtain the comprehensive truth degree that the node is the main-body node. In formula (2), $y$ is the value of an attribute and $a_1$, $a_2$, $a_3$, $a_4$ are the four boundary values of that attribute.
[Formula (2) was an image in the original publication and is not reproduced in this text.]
(7) Find the node with the highest truth degree, judge it to be the node that holds the main information, and store this node, together with the nodes and content it contains, as an XML document.
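The final selection and storage step might be sketched as follows, using Python's standard `xml.etree.ElementTree`. The dict-based node layout, the `(truth_degree, node)` pairs, and the function names are hypothetical; the XML schema (one element per node, named after its tag) is an assumption, since the patent does not specify the document layout.

```python
import xml.etree.ElementTree as ET


def to_xml(node):
    """Convert a dict node (`tag`, `text`, `children`) to an XML element."""
    elem = ET.Element(node["tag"])
    elem.text = node["text"] or None
    for child in node["children"]:
        elem.append(to_xml(child))
    return elem


def extract_main_body(scored_nodes):
    """Pick the node with the highest truth degree and serialize it.

    `scored_nodes` is an iterable of (truth_degree, node) pairs; on a tie,
    the first node with the maximum truth degree wins.
    """
    _, best = max(scored_nodes, key=lambda pair: pair[0])
    return ET.tostring(to_xml(best), encoding="unicode")


# Usage with two candidate nodes and made-up truth degrees.
nav = {"tag": "div", "text": "Home", "children": []}
main = {"tag": "div", "text": "", "children": [
    {"tag": "p", "text": "Body text", "children": []},
]}
print(extract_main_body([(0.4, nav), (2.6, main)]))
```

To write a file instead of returning a string, `ET.ElementTree(to_xml(best)).write(path, encoding="utf-8", xml_declaration=True)` could be used in place of `ET.tostring`.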
Claims (1)
1. A web page main-body extraction method based on web page main-body features and medium truth degree, characterized by comprising the following steps:
Step 1) page tidying: normalize irregular HTML tags; tidying includes adding missing end tags and making tags properly paired and nested, so that the source fully conforms to HTML syntax;
Step 2) page denoising: remove content that is certainly noise, including: output only the body part, do not output font tags, do not output attributes, and do not output the leading whitespace of each source line; delete script tags and comment tags together with the content between them; delete empty tags and select and input tags together with the content between them; delete img tags; deleting a tag involves two cases: if the tag has a paired end tag, both tags and everything between them are removed, and if the tag has no separate end tag, the entire content of the tag is removed;
Step 3) use HTMLParser to parse the HTML source of the page preprocessed by steps 1) and 2) and generate a hierarchical node tree; all later processing operates on the node tree of the preprocessed page;
Step 4) count the nodes of the page and denote the count by M; compute four attributes of each node: subtree number, average subtree branch number, displayable character count, and subtree branch-number range;
Step 5) determine, for each of the four attributes, the boundary values needed to compute the medium truth degree; sort the values of each attribute over all nodes, with subtree number, average subtree branch number, and displayable character count in ascending order and subtree branch-number range in descending order, giving four ordered sequences; for each attribute, the value at position 50% of M (rounded to an integer) is the first boundary value, the value at position 70% of M (rounded) is the second, the value at position 80% of M (rounded) is the third, and the value at position 90% of M (rounded) is the fourth; the four boundary values are denoted $a_1$, $a_2$, $a_3$, $a_4$; four boundary values are determined for each attribute, so the four attributes have four groups of boundary values;
Step 6) for each node, compute by formula (1) and formula (2) the truth degree of "this node is the web page main-body node"; let the four attributes of node $i$ be $y_i = (y_{i1}, y_{i2}, y_{i3}, y_{i4})$, where the fourth attribute is the subtree branch-number range; the truth degree $h_{n-T}$ that node $i$ is the main-body node is given by formula (1), which uses the per-attribute function of formula (2), where $y$ is the value of an attribute and $a_1$, $a_2$, $a_3$, $a_4$ are the four boundary values of that attribute [formulas (1) and (2) were images in the original publication and are not reproduced in this text];
Step 7) find the node with the highest truth degree, judge it to be the node that holds the main information, and store this node, together with the nodes and content it contains, as an XML document.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310116907.XA CN103678432B (en) | 2013-04-07 | 2013-04-07 | Web page main-body extraction method based on web page main-body features and medium truth degree |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310116907.XA CN103678432B (en) | 2013-04-07 | 2013-04-07 | Web page main-body extraction method based on web page main-body features and medium truth degree |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103678432A (en) | 2014-03-26 |
CN103678432B (en) | 2016-11-16 |
Family
ID=50316013
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310116907.XA, Expired - Fee Related, CN103678432B (en) | Web page main-body extraction method based on web page main-body features and medium truth degree | 2013-04-07 | 2013-04-07 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103678432B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104965871A (en) * | 2015-06-09 | 2015-10-07 | 北京金山安全软件有限公司 | Page loading method and device and electronic equipment |
CN109635200B (en) * | 2018-12-18 | 2022-02-01 | 南京邮电大学 | Collaborative filtering recommendation method based on intermediary truth degree measurement and user |
CN109885743B (en) * | 2019-01-04 | 2024-01-02 | 上海七印信息科技有限公司 | Webpage data information extraction method |
Also Published As
Publication number | Publication date |
---|---|
CN103678432A (en) | 2014-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106201465A (en) | Software project personalized recommendation method towards open source community | |
Peters et al. | Content extraction using diverse feature sets | |
CN102646095B (en) | Object classifying method and system based on webpage classification information | |
CN102591612B (en) | General webpage text extraction method based on punctuation continuity and system thereof | |
CN102650999B (en) | A kind of method and system of extracting object attribute value information from webpage | |
CN101727498A (en) | Automatic extraction method of web page information based on WEB structure | |
CN103823896A (en) | Subject characteristic value algorithm and subject characteristic value algorithm-based project evaluation expert recommendation algorithm | |
CN103136358B (en) | A kind of method of Automatic Extraction forum data | |
CN102298638A (en) | Method and system for extracting news webpage contents by clustering webpage labels | |
CN103559199A (en) | Web information extraction method and web information extraction device | |
CN106021383A (en) | Method and device for computing similarity of webpages | |
CN103927397A (en) | Recognition method for Web page link blocks based on block tree | |
CN104077417A (en) | Figure tag recommendation method and system in social network | |
CN102314520A (en) | Webpage text extraction method and device based on statistical backtracking positioning | |
CN108733813A (en) | Information extracting method, system towards BBS forum Web pages contents and medium | |
CN103530316A (en) | Science subject extraction method based on multi-view learning | |
CN103678432B (en) | A kind of web page body extracting method based on web page body feature and intermediary's true value | |
CN106528068A (en) | Webpage content reconstruction method and system | |
CN106202007B (en) | A kind of appraisal procedure of MATLAB program files similarity | |
CN103064966A (en) | Method for extracting regular noise from single record web pages | |
Chu et al. | Automatic data extraction of websites using data path matching and alignment | |
CN105183730B (en) | The treating method and apparatus of webpage information | |
Kamanwar et al. | Web data extraction techniques: A review | |
CN110083760B (en) | Multi-recording dynamic webpage information extraction method based on visual block | |
CN105653567A (en) | Method for quickly looking for feature character strings in text sequential data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20171229. Address after: 518000, 21st floor, DESAY Science and Technology Building, South District, High-tech Park, Nanshan District, Shenzhen, Guangdong. Patentee after: SHENZHEN WEALTH FARM INTERNET FINANCE SERVICE Co.,Ltd. Address before: 210003, No. 66 Xinmofan Road, Gulou District, Nanjing, Jiangsu. Patentee before: NANJING University OF POSTS AND TELECOMMUNICATIONS |
TR01 | Transfer of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20161116 |
CF01 | Termination of patent right due to non-payment of annual fee |