CN103678432B - A web page body extraction method based on web page body features and medium truth degree - Google Patents

A web page body extraction method based on web page body features and medium truth degree

Info

Publication number
CN103678432B
CN103678432B CN201310116907.XA CN201310116907A CN103678432B CN 103678432 B CN103678432 B CN 103678432B CN 201310116907 A CN201310116907 A CN 201310116907A CN 103678432 B CN103678432 B CN 103678432B
Authority
CN
China
Prior art keywords
node
web page
webpage
label
subtree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310116907.XA
Other languages
Chinese (zh)
Other versions
CN103678432A (en)
Inventor
成卫青
于静
洪龙
杨庚
黄卫东
梁胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wealth Farm Internet Finance Service Co ltd
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201310116907.XA
Publication of CN103678432A
Application granted
Publication of CN103678432B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Abstract

The web page body identification and extraction method of the present invention, based on web page body features and medium truth degree, is a solution for body identification in the Web information extraction process. It is mainly used to solve the problems that existing body identification lacks generality and that its accuracy needs improvement, and it belongs to the field of Web information processing. Features of the invention: by examining attributes of each node in the page's node tree, such as subtree count, average subtree branch count, displayable character count, and subtree branch count range, it judges from several angles the truth degree that each node is the body node, and then combines the evaluations from all angles to identify the body node of the page, which gives higher body identification accuracy. The invention also automatically sets, for each page, the boundary values of each attribute needed to compute the medium truth degree. Multi-angle judgment together with automatically set boundary values gives the invention strong generality and makes it applicable to body identification and extraction for all kinds of web pages.

Description

A web page body extraction method based on web page body features and medium truth degree
Technical field
The present invention relates to the field of Internet information processing, and in particular to a web page body extraction method based on web page body features and medium truth degree.
Background
In terms of content, a web page usually consists of navigation information, the page's main text, advertising, copyright information, related links, and similar parts. Web page body extraction means separating the main text of a page from the rest of the page. The present invention calls everything in a page other than the main text noise content. Shielding this noise and extracting the main text content is of great importance to technologies such as web page classification and Web information extraction.
Researchers in China and abroad have done a great deal of work in this field, and the ways of classifying the methods vary. The current mainstream research methods fall into the following categories.
(1) Template-based web page body extraction
Template-based web page body extraction depends on the internal structural features of HTML. It assumes that similar pages on the Web share similar structural features or a similar DOM (Document Object Model) tree structure. The body content of similar pages can therefore be obtained by defining a template, and the body data in a page can be extracted by a wrapper program. A wrapper defines a template according to the page's layout features and implements a parser that locates the body text in the page; that is, it extracts from the information source the content that matches a specific information pattern and presents it in some format.
The focus and the difficulty of this method lie in how to determine and maintain the templates and how to generate the wrappers. Generating and maintaining wrappers is time-consuming and laborious, and researchers are still studying how to build wrappers more efficiently. The most popular wrappers at present include those in the TSIMMIS system, the Ontology system, and the XWRA system.
(2) Web page body extraction based on visual features
Web page body extraction based on visual features makes full use of visual features such as font size, background color, and blank areas to divide a page into multiple distinct visual information blocks, and then performs information extraction on them. The vision-based page segmentation algorithm (Visual Based Page Segment Algorithm) proposed by Microsoft Research Asia is implemented in this way.
Internet users generally perceive the content of a page's body area from the layout of the page, and the main idea of vision-based extraction algorithms is to simulate this judgment process when extracting the body. The algorithm has roughly three steps. First, the page is segmented into multiple blocks according to visual information and the HTML source, and intermediate blocks are then divided into smaller blocks. Next, weights are assigned to the separators between blocks. Finally, blocks are merged according to the separator weights, and the data blocks remaining after merging are extracted to obtain the web page body data.
(3) Web page body extraction based on semantic information
Semantic information here means all information in a page other than visual information, including the HTML tag information, the textual content of the page, and the structural information of the HTML. Such methods can be further divided into three kinds: 1) methods based on removing HTML tags, whose main idea is to strip the HTML tags first, then determine the body text regions from the text density after the tags are removed, and finally merge all body text regions to obtain the page's body content; 2) methods based on character statistics, which first build a DOM tree from the HTML document, then count the Chinese characters contained in each TABLE node, and take the node containing the most characters as the node that holds the body text; 3) methods based on TABLE nodes: the TABLE tag is often used for page layout, and these methods exploit this by extracting the body content from TABLE tags. For example, the TVPS algorithm (Table and Vision based Page Segmentation) proposed by the software research laboratory of the Institute of Computing Technology, Chinese Academy of Sciences, uses TABLE tags together with visual features to divide a page into semantic blocks.
Although each of the above methods has its advantages, their shortcomings have become apparent as the Internet has developed and web pages have grown more and more complex. Method (1) works well on a set of pages built from the same template, but the Internet contains countless page templates, so the method is not general. Method (2) can complete certain information extraction tasks, but because visual features are complex and uncertain, the extraction rules usually have to be revised and adjusted by hand again and again, and it is difficult to keep the rule set consistent. As for method (3), the approach based on removing HTML tags cannot filter out all irrelevant noise, that is, it cannot extract exactly the body information that is needed; the approach based on character statistics requires all body text in a page to be placed in the same TABLE tag, but real page structures are much more complex and many pages have no TABLE tag at all; and the block-division method designed in the TVPS algorithm considers only the bottom-level TABLE tags, whereas in practice both the layout structure of a page and the nesting of TABLE tags are complex, so the probability that all body text lies in bottom-level TABLE tags is very small.
It follows that no existing method is suitable for body extraction from all web pages, and the accuracy of existing web page body extraction methods needs improvement. To further improve the accuracy and generality of web page body extraction, the present invention applies the measure of medium truth degree (MMTD) to the identification and extraction of web page body blocks and proposes a web page body extraction method based on web page body features and MMTD. The method not only greatly improves extraction accuracy but also has higher generality.
Summary of the invention
The object of the present invention is to provide a web page body extraction method and its implementation flow, in order to address the need to further improve the accuracy and generality of web page body extraction.
The technical solution adopted by the present invention is as follows. The invention is a strategic method that can be used to identify and extract the body of a web page, and a Web information extraction system built on web page body extraction can be developed from it. The goal of web page body extraction is to remove all peripheral content from a page and leave only the main part that expresses the page's topic. The invention first preprocesses the page, which comprises web page tidying and web page denoising: irregular HTML tags are normalized, and easily identified noise content unrelated to the topic the page expresses is removed. It then identifies and extracts the web page body based on web page body features and the measure of medium truth degree (MMTD). The aim is to effectively solve the insufficient accuracy and weak generality of existing web page body extraction methods by proposing a body identification and extraction method that suits pages of many styles and types and has high accuracy.
The present invention uses the following concepts and calculation formulas:
(1) Subtree: a tree rooted at one of a node's child nodes is called a subtree of that node.
(2) Subtree count: the number of subtrees that a node has.
(3) Branch count: the number of all nodes in the tree rooted at the node itself, minus one.
(4) Average subtree branch count: the mean of the branch counts of a node's subtrees, that is, the sum of the branch counts of the node's subtrees divided by the node's subtree count.
(5) Displayable character count: the number of characters that can be displayed on the web page in all nodes of the tree rooted at the node itself.
(6) Subtree branch count range: the difference between the largest and the smallest branch count among all subtrees of a node.
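The definitions above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the `Node` class and all function names are hypothetical, the branch count of a subtree is taken to be the branch count of the child that roots it, leaf nodes are given an average and a range of zero, and whitespace is assumed not to count as a displayable character.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical node of the page's node tree."""
    tag: str
    text: str = ""                      # text held directly by this node
    children: List["Node"] = field(default_factory=list)


def node_count(node: Node) -> int:
    """Number of nodes in the tree rooted at `node`, including `node`."""
    return 1 + sum(node_count(child) for child in node.children)


def subtree_count(node: Node) -> int:
    """Definition (2): one subtree per child node."""
    return len(node.children)


def branch_count(node: Node) -> int:
    """Definition (3): nodes in the tree rooted at `node`, minus one."""
    return node_count(node) - 1


def average_subtree_branch_count(node: Node) -> float:
    """Definition (4); a leaf is given 0.0 here (assumption)."""
    if not node.children:
        return 0.0
    total = sum(branch_count(child) for child in node.children)
    return total / len(node.children)


def displayable_char_count(node: Node) -> int:
    """Definition (5); whitespace is not counted (assumption)."""
    own = len("".join(node.text.split()))
    return own + sum(displayable_char_count(c) for c in node.children)


def subtree_branch_count_range(node: Node) -> int:
    """Definition (6); a leaf is given 0 here (assumption)."""
    if not node.children:
        return 0
    counts = [branch_count(child) for child in node.children]
    return max(counts) - min(counts)


# A small tree: <div><p>Hello world</p><ul><li>a</li><li>b</li></ul></div>
example = Node("div", children=[
    Node("p", "Hello world"),
    Node("ul", children=[Node("li", "a"), Node("li", "b")]),
])
```

For the `example` tree, the `div` node has two subtrees; the `p` subtree has no branches and the `ul` subtree has two, which is where the average and the range for `div` come from.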
The present invention collectively calls a node's subtree count, branch count, average subtree branch count, displayable character count, and subtree branch count range the attributes of the node. Based on an analysis of how each attribute behaves for the bodies of HTML pages of many styles and types, the invention proposes a web page body identification and extraction method based on web page body features and the measure of medium truth degree (MMTD). From all nodes that remain after preprocessing, it selects the node that contains all of the body information with the least noise; the content of that node is the body content of the page. The method mainly comprises tidying the web page, denoising it, generating the node tree, computing the number of nodes in the page and the attributes of each node, determining for this page the boundary values of each attribute needed to compute node truth values, computing for each node the truth degree that it is the web page body node, and taking the node with the highest truth degree as the body node, extracting it together with the nodes and content it contains, and saving them as an XML document, as shown in Figure 1.
Method flow:
The present invention provides a web page body extraction method based on web page body features and medium truth degree, comprising the following steps:
Step 1) Web page tidying: normalize irregular HTML tags. Tidying includes adding missing end tags and correcting tag pairing and nesting, so that the source fully conforms to HTML syntax rules;
Step 2) Web page denoising: remove the content that is certain to be noise. Web page noise means the parts of a page that are unrelated to the topic the page expresses. Denoising includes: output only the body part; do not output font tags; do not output attributes; do not output the leading spaces of each line of the source code; delete script tags and comment tags together with the content between them; delete empty tags, and select and input tags together with the content between them; delete img tags. Deleting a tag involves two cases: the tag has a matching end tag, or the tag has no separate end tag. When there is an end tag, the pair of tags and all content between them are removed; when there is no end tag, the entire content of the tag is removed;
Step 3) Use HTMLParser to parse the HTML source of the page preprocessed by step 1) and step 2) and generate a hierarchical node tree; all subsequent processing operates on this node tree of the preprocessed page;
Step 4) Count the nodes of the page, denoted M, and compute four attributes for each node: subtree count, average subtree branch count, displayable character count, and subtree branch count range;
Step 5) Determine, for each of the four attributes, the boundary values needed to compute the medium truth degree. Sort the nodes separately by each of the four attributes: subtree count, average subtree branch count, and displayable character count are sorted in ascending order, and subtree branch count range is sorted in descending order, giving four ordered sequences. For each attribute, the value at position 50%M (rounded to an integer) is the first boundary value, the value at position 70%M (rounded) is the second, the value at position 80%M (rounded) is the third, and the value at position 90%M (rounded) is the fourth; the four boundary values are denoted $a_1$, $a_2$, $a_3$, $a_4$. Four boundary values are determined for each attribute, so the four attributes have four groups of boundary values;
Step 6) For each node, compute the truth degree of the proposition "this node is the web page body node" according to formula (1) and formula (2). Let the four attributes of node $i$ be $y_i = (y_{i1}, y_{i2}, y_{i3}, y_{i4})$, where the fourth attribute is the subtree branch count range; the truth degree $h_{n-T}$ that node $i$ is the web page body node is:
$$h_{n-T}(y_i) = \sum_{k=1}^{4} h_T(y_{ik}) \qquad (1)$$
where $h_T$ is given by formula (2):
(2) [formula not reproduced in this text]
where $y$ is the value of an attribute and $a_1$, $a_2$, $a_3$, $a_4$ are the four boundary values of that attribute;
Step 7) Find the node with the highest truth degree and take it as the node that holds the body information. Store this node, together with the nodes and content it contains, as an XML document for further Web information extraction; that is, the body text is extracted from the semi-structured page and stored in a structured form for convenient later use.
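Steps 5) to 7) can be sketched as below. This is a hedged illustration only. All function names are hypothetical. The text says positions are "rounded" without saying how, so the sketch assumes 1-based positions rounded down. Because formula (2) is not reproduced in this text, the per-attribute truth function is passed in by the caller, and `toy_single_truth` is a stand-in used purely to exercise the code; it is not the MMTD formula.

```python
from typing import Callable, List, Sequence, Tuple

Bounds = Tuple[float, float, float, float]
SingleTruth = Callable[[float, Bounds], float]


def boundary_values(values: Sequence[float], descending: bool = False) -> Bounds:
    """Step 5: values at the 50%, 70%, 80% and 90% positions of the sorted list."""
    ordered = sorted(values, reverse=descending)
    m = len(ordered)
    if m == 0:
        raise ValueError("at least one node is required")
    picked = []
    for percent in (50, 70, 80, 90):
        # 1-based position, rounded down, never below 1 (assumption).
        position = max(1, percent * m // 100)
        picked.append(ordered[position - 1])
    return (picked[0], picked[1], picked[2], picked[3])


def overall_truth_degree(attributes: Sequence[float],
                         bounds: Sequence[Bounds],
                         single_truth: SingleTruth) -> float:
    """Formula (1): sum of the per-attribute truth degrees of one node."""
    return sum(single_truth(y, b) for y, b in zip(attributes, bounds))


def best_node_index(rows: List[Sequence[float]], single_truth: SingleTruth) -> int:
    """Steps 5 to 7 for rows of (subtree count, average subtree branch count,
    displayable character count, subtree branch count range)."""
    descending_flags = (False, False, False, True)  # only the range sorts descending
    columns = list(zip(*rows))
    bounds = [boundary_values(column, flag)
              for column, flag in zip(columns, descending_flags)]
    scores = [overall_truth_degree(row, bounds, single_truth) for row in rows]
    return max(range(len(rows)), key=scores.__getitem__)


def toy_single_truth(y: float, bounds: Bounds) -> float:
    """Placeholder only, NOT formula (2): share of boundary values that y reaches."""
    return sum(y >= a for a in bounds) / len(bounds)
```

Passing the per-attribute function as a parameter keeps the sketch honest about what the text does and does not specify: the boundary selection and the summation follow steps 5) and 6), while the shape of $h_T$ has to come from formula (2) in the original patent.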
The beneficial effects of the present invention are:
1. By examining attributes of each node in the page's node tree, namely subtree count, average subtree branch count, displayable character count, and subtree branch count range, the invention judges from several angles the degree to which each node resembles the body node, and then combines the evaluations from all angles to identify the body node of the page; it therefore achieves higher body identification accuracy.
2. The invention automatically sets, for each page, the boundary values needed to compute the medium truth degree. Multi-angle judgment together with automatically set boundary values makes the invention applicable to body identification and extraction for all kinds of web pages.
3. The method effectively addresses the insufficient accuracy and weak generality of existing web page body identification methods.
Brief description of the drawings
Fig. 1 is a flowchart of web page body identification and extraction based on medium truth degree according to the present invention.
Detailed description of the embodiments
For convenience of description, assume the following application example: from each of six websites, namely Dangdang, Suning Yigou, Joyo Amazon, Azure Bookstore, Jingdong Mall, and Phoenix Net, 10 web pages are chosen at random for body identification and extraction.
A specific embodiment of the present invention is as follows.
Each web page is processed as follows:
(1) Obtain the page source and tidy the page, that is, normalize irregular HTML tags, including adding missing end tags and correcting tag pairing and nesting, so that the source fully conforms to HTML syntax rules;
(2) Remove the content in the page that is certain to be noise. Denoising includes: output only the body part; do not output font tags; do not output attributes; do not output the leading spaces of each line of the source code; delete script tags and comment tags together with the content between them; delete empty tags, and select and input tags together with the content between them; delete img tags. Deleting a tag involves two cases: the tag has a matching end tag, or the tag has no separate end tag. When there is an end tag, the pair of tags and all content between them are removed; when there is no end tag, the entire content of the tag is removed;
(3) Use HTMLParser to parse the HTML source of the page produced by the first two steps and generate a hierarchical node tree; all subsequent processing operates on this node tree;
(4) Count the nodes of the page, denoted M, and compute four attributes for each node: subtree count, average subtree branch count, displayable character count, and subtree branch count range;
(5) Determine, for each of the four attributes, the boundary values needed to compute the medium truth degree. Sort the nodes separately by each of the four attributes: subtree count, average subtree branch count, and displayable character count are sorted in ascending order, and subtree branch count range is sorted in descending order, giving four ordered sequences. For each attribute, the value at position 50%M (rounded to an integer) is the first boundary value, the value at position 70%M (rounded) is the second, the value at position 80%M (rounded) is the third, and the value at position 90%M (rounded) is the fourth; the four boundary values are denoted $a_1$, $a_2$, $a_3$, $a_4$. Four boundary values are determined for each attribute, so the four attributes have four groups of boundary values;
(6) For each node, first compute by formula (2), from the angle of each single attribute, the truth degree of "this node is the web page body node", then sum the truth degrees obtained from the four attributes to get the node's overall truth degree of being the web page body node. Formula (2) is:
(2) [formula not reproduced in this text]
where $y$ is the value of an attribute and $a_1$, $a_2$, $a_3$, $a_4$ are the four boundary values of that attribute.
(7) Find the node with the highest truth degree and take it as the node that holds the body information, and store this node, together with the nodes and content it contains, as an XML document.
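As an illustration of the denoising rules in step (2) above, the following sketch uses Python's standard-library `html.parser` module. That module is a stand-in chosen for this example and is not necessarily the HTMLParser tool named in step (3). The class and function names are hypothetical, and the sketch covers only part of the rules: it keeps the body part, drops attributes, comments, and leading whitespace, unwraps font tags, removes script and select elements with their content, and removes img and input tags. It does not remove empty tags and does not repair malformed markup.

```python
from html import escape
from html.parser import HTMLParser

DROP_WITH_CONTENT = {"script", "select"}   # removed together with everything inside
DROP_TAG_ONLY = {"img", "input"}           # tags without a separate end tag
UNWRAP = {"font"}                          # tag dropped, enclosed text kept


class Denoiser(HTMLParser):
    """Collects a simplified copy of the <body> content."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.in_body = False
        self.skip_depth = 0   # > 0 while inside a script or select element

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
            return
        if not self.in_body:
            return
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in DROP_TAG_ONLY or tag in UNWRAP:
            return
        self.parts.append(f"<{tag}>")      # attributes are deliberately not output

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
            return
        if not self.in_body:
            return
        if tag in DROP_WITH_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag in DROP_TAG_ONLY or tag in UNWRAP:
            return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()                # drops leading spaces and blank runs
        if self.in_body and not self.skip_depth and text:
            self.parts.append(escape(text, quote=False))

    # Comments are dropped: the inherited handle_comment does nothing.


def denoise(html_text: str) -> str:
    parser = Denoiser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

A depth counter rather than a boolean is used for the skipped elements so that a nested `script` or `select` inside a skipped region does not end the skipping early.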

Claims (1)

1. A web page body extraction method based on web page body features and medium truth degree, characterized by comprising the following steps:
Step 1) Web page tidying: normalize irregular HTML tags; tidying includes adding missing end tags and correcting tag pairing and nesting, so that the source fully conforms to HTML syntax rules;
Step 2) Web page denoising: remove the content that is certain to be noise, including: output only the body part; do not output font tags; do not output attributes; do not output the leading spaces of each line of the source code; delete script tags and comment tags together with the content between them; delete empty tags, and select and input tags together with the content between them; delete img tags. Deleting a tag involves two cases: the tag has a matching end tag, or the tag has no separate end tag. When there is an end tag, the pair of tags and all content between them are removed; when there is no end tag, the entire content of the tag is removed;
Step 3) Use HTMLParser to parse the HTML source of the page preprocessed by step 1) and step 2) and generate a hierarchical node tree; all subsequent processing operates on this node tree of the preprocessed page;
Step 4) Count the nodes of the page, denoted M, and compute four attributes for each node: subtree count, average subtree branch count, displayable character count, and subtree branch count range;
Step 5) Determine, for each of the four attributes, the boundary values needed to compute the medium truth degree. Sort the nodes separately by each of the four attributes: subtree count, average subtree branch count, and displayable character count are sorted in ascending order, and subtree branch count range is sorted in descending order, giving four ordered sequences. For each attribute, the value at position 50%M (rounded to an integer) is the first boundary value, the value at position 70%M (rounded) is the second, the value at position 80%M (rounded) is the third, and the value at position 90%M (rounded) is the fourth; the four boundary values are denoted $a_1$, $a_2$, $a_3$, $a_4$. Four boundary values are determined for each attribute, so the four attributes have four groups of boundary values;
Step 6) For each node, compute the truth degree of "this node is the web page body node" according to formula (1) and formula (2). Let the four attributes of node $i$ be $y_i = (y_{i1}, y_{i2}, y_{i3}, y_{i4})$, where the fourth attribute is the subtree branch count range; the truth degree $h_{n-T}$ that node $i$ is the web page body node is:
$$h_{n-T}(y_i) = \sum_{k=1}^{4} h_T(y_{ik}) \qquad (1)$$
where $h_T$ is given by formula (2) [formula not reproduced in this text],
in which $y$ is the value of an attribute and $a_1$, $a_2$, $a_3$, $a_4$ are the four boundary values of that attribute;
Step 7) Find the node with the highest truth degree and take it as the node that holds the body information, and store this node, together with the nodes and content it contains, as an XML document.
CN201310116907.XA 2013-04-07 2013-04-07 A web page body extraction method based on web page body features and medium truth degree Expired - Fee Related CN103678432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310116907.XA CN103678432B (en) 2013-04-07 2013-04-07 A web page body extraction method based on web page body features and medium truth degree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310116907.XA CN103678432B (en) 2013-04-07 2013-04-07 A web page body extraction method based on web page body features and medium truth degree

Publications (2)

Publication Number Publication Date
CN103678432A CN103678432A (en) 2014-03-26
CN103678432B true CN103678432B (en) 2016-11-16

Family

ID=50316013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310116907.XA Expired - Fee Related CN103678432B (en) 2013-04-07 2013-04-07 A web page body extraction method based on web page body features and medium truth degree

Country Status (1)

Country Link
CN (1) CN103678432B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965871A (en) * 2015-06-09 2015-10-07 北京金山安全软件有限公司 Page loading method and device and electronic equipment
CN109635200B (en) * 2018-12-18 2022-02-01 南京邮电大学 Collaborative filtering recommendation method based on intermediary truth degree measurement and user
CN109885743B (en) * 2019-01-04 2024-01-02 上海七印信息科技有限公司 Webpage data information extraction method

Also Published As

Publication number Publication date
CN103678432A (en) 2014-03-26

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20171229

Address after: 518000 21st floor, Desay Technology Building, South District, High-tech Park, Nanshan District, Shenzhen, Guangdong

Patentee after: SHENZHEN WEALTH FARM INTERNET FINANCE SERVICE Co.,Ltd.

Address before: 210003 No. 66 Xinmofan Road, Gulou District, Nanjing, Jiangsu

Patentee before: NANJING University OF POSTS AND TELECOMMUNICATIONS

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161116
