CN106557565A

CN106557565A - Text information extraction method based on website construction

Info

Publication number: CN106557565A
Application number: CN201611027102.8A
Authority: CN
Inventors: 陈星�; 王洲; 王一洲; 戴远飞
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2016-11-22
Filing date: 2016-11-22
Publication date: 2017-04-05

Abstract

The present invention relates to a text information extraction method based on website construction. The method combines the website level with the web page level: website construction at the website level smooths out the differences between web pages, and web page segmentation together with node density features is then used to locate the main text of each class of web pages and to derive the corresponding extraction rule. The invention can effectively improve the accuracy of web page main-text extraction.

Description

Text information extraction method based on website construction

Technical field

The present invention relates to the technical field of web page content extraction, and in particular to a text information extraction method based on website construction.

Background technology

With the rapid development of Web technologies, large amounts of data are now represented in HTML format, and Web pages have become the main carrier of published information and one of the main channels through which people obtain information. In recent years, with the development of big data, the importance of data has been widely recognized. Big data plays a major role in areas such as business decision-making and public opinion monitoring.

Consequently, extracting the important content of Web pages has become a research hotspot in academia. However, web page content extraction faces two difficulties. First, besides the main text that interests the user, a Web page also contains noise unrelated to the topic, such as navigation bars, advertisements, recommended links, and copyright notices. Second, the widespread use of dynamic scripts and CSS keeps increasing both the structural differences between web pages and the structural complexity of individual pages. To address these two difficulties, statistics-based and page-segmentation-based methods for extracting web page main text have been proposed; however, these methods may fail in special cases.

In fact, an HTML page is the combination of data stored in a back-end database and an HTML content template. Most pages within a website are generated from the same set of content templates, so the design of web pages can be regarded as relatively regular.

Summary of the invention

In view of this, the object of the present invention is to provide a text information extraction method based on website construction that effectively improves the accuracy of web page main-text extraction.

The present invention is realized by the following scheme: a text information extraction method based on website construction, comprising the following steps:

Step S1: Crawl the web pages corresponding to the input URL set, remove the script, comment, and style tags from each HTML page, and parse each page into a DOM tree;

Step S2: Given the DOM tree of a web page, traverse it to extract the structural features of the page; compute the similarity between web pages from these structural features; cluster the set of web pages according to the similarity; and finally produce a set of marking classes together with the web page features of each marking class;

Step S3: For web pages of the same class, segment each page into blocks, compute the node density features of each block, find the block that contains the main text and no noise information, and take the features of that block as the extraction rule for the content block of that class of pages;

Step S4: Extract the content of the web pages in each marking class according to the extraction rule; for a new web page from the same source, determine the marking class to which it belongs from the web page features of the marking classes, and then extract its content according to the extraction rule of that marking class.
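The parsing part of step S1 can be sketched in Python with the standard-library `html.parser` module. This is a minimal illustration, not the prototype described later in the text: the names `Node`, `DomBuilder`, and `parse_dom` are invented for the sketch, crawling is omitted, and malformed HTML is handled only in the simplest way.

```python
from html.parser import HTMLParser

# Tags whose whole content is discarded in step S1.
SKIPPED = {"script", "style"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical DOM node: tag name, attributes, children, direct text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects in document order
        self.text = []      # text fragments directly inside this node


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.stack = [self.root]
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, attrs, self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.stack[-1].text.append(data.strip())

    # HTML comments are dropped because handle_comment() is not overridden.


def parse_dom(html_source):
    builder = DomBuilder()
    builder.feed(html_source)
    builder.close()
    return builder.root
```

A production crawler would more likely rely on a tolerant HTML parser; the stack-based builder above is kept only to make the removal of script, comment, and style content explicit.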

Further, step S2 comprises the following steps:

Step S21: Take the path of each block node in the DOM tree as a structural feature of the web page; traverse the DOM tree T_d of the page breadth-first, extract the path set F.f, and form the two-tuple F = <w, f> that represents the structural features of web page w;

Step S22: Convert the input group of web pages W into the set of web page features D = {F₁, F₂, …, F_k};

Step S23: From the structural features F, compute the similarity between web pages using the following formula:

\mathrm{sim}(F_{i}, F_{j}) = \frac{\mathrm{card}(F_{i}.f \cap F_{j}.f)}{\min(\mathrm{card}(F_{i}.f), \mathrm{card}(F_{j}.f))}

where F_i and F_j denote the structural features of the i-th and j-th web pages respectively;

Step S24: Cluster the group of web pages with a hierarchical clustering algorithm based on the web page similarity.

Further, the computation of the node density features in step S3 comprises the following steps:

Step S31: Let n ∈ B be a block node in the DOM tree T_d. The text density of n is defined as:

p_{text}^{i} = \frac{T_{n}}{T}

where T_n is the number of plain-text characters contained in block node n, and T is the number of plain-text characters in the whole document represented by T_d; neither T_n nor T includes link text;

Step S32: Let n ∈ B be a block node in the DOM tree T_d. The link density of n is defined as:

p_{link}^{i} = \frac{lN_{n}}{lN}

where lN_n is the number of links contained in node n, and lN is the number of links contained in the whole document represented by T_d;

Step S33: Let n ∈ B be a block node in the DOM tree T_d. The node text density of n is defined as:

p_{textl}^{i} = \frac{T_{n}}{lT_{n}}

where T_n is the number of plain-text characters of block node n and lT_n is the number of text characters of block node n; T_n does not include link text, whereas lT_n does;

Step S34: Compute the combined density feature value H(p) of the block node:

H(p) = λ * p1 * p2 * p3;

where p = {p | p = p(b), b ∈ B} denotes the path corresponding to the block node, and p1, p2, p3 denote the density features of the node, with p1 = p_text, p2 = 1 - p_link, p3 = p_textl. Block nodes are evaluated using H(p).

Compared with the prior art, the present invention has the following beneficial effect: it combines the website level with the web page level. Website construction at the website level smooths out the differences between web pages, and web page segmentation together with node density features is then used to locate the main text of each class of web pages and to derive the corresponding extraction rule. The method effectively improves the accuracy of web page main-text extraction.

Brief description of the drawings

Fig. 1 is a schematic diagram of the principle of an embodiment of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawing and an embodiment.

As shown in Fig. 1, this embodiment provides a text information extraction method based on website construction, comprising steps S1 to S4 as set out above.

In this embodiment, step S2 comprises steps S21 to S24 as set out above, and the computation of the node density features in step S3 comprises steps S31 to S34 as set out above, with the combined density feature value H(p) of a block node computed as:

H(p) = λ * p1 * p2 * p3;

Preferably, in this embodiment, the following four definitions are introduced to make it convenient to represent the structural features of a web page.

Definition 1. Each Web page can be represented as a DOM tree T_d. T_d is a directed graph <V, E>, where V is the set of vertices, V = {v | v ∈ HTML tag set Tag}, and E is the set of directed edges, E = {<u, v> | u, v ∈ V, where u is called the parent vertex of v, v is called a child vertex of u, and in the HTML structure the tag corresponding to v is directly contained in the tag corresponding to u}.

Definition 2. A DOM tree T_d can be represented as a set of page blocks B = {b_i | b_i ∈ V and the HTML tag corresponding to node b_i is div or table}; such a node is called a block node.

Definition 3. Let T_d be a DOM tree rooted at v₀. For any node v ∈ V, let v₀i/v₁i/…/v_k i be the sequence of nodes in T_d from v₀ to v_k, where parent(v_j) = v_{j-1} (1 <= j <= k), i denotes the position of a node among its sibling nodes, and v_k = v. Then v₀i/v₁i/…/v_k i is called the path of node v, denoted p(v). For example, "body1/div3/div2" is a node path.

Definition 4. Given a web page w, the structural feature of the page can be represented as a two-tuple F = <w, f>, where f is a set of paths, f = {p₁, p₂, …, p_n | p_i = p(b_i), b_i ∈ B}.

To compute the similarity between web pages quickly, this embodiment takes the path of each block node in the DOM tree as a structural feature of the web page. The DOM tree T_d of the page is traversed breadth-first, the path set F.f is extracted, and the two-tuple F = <w, f> is formed to represent the structural features of web page w. Finally, the input group of web pages W is converted into the set of web page features D = {F₁, F₂, …, F_k}.
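The breadth-first extraction of block-node paths can be sketched as follows. The tree representation (nested `(tag, children)` tuples) and the function name `block_paths` are assumptions made for this sketch. The sibling position is taken here as the 1-based position among siblings with the same tag; the text only says "position of the node among its sibling nodes", so this reading is also an assumption.

```python
from collections import deque

BLOCK_TAGS = {"div", "table"}  # block nodes per Definition 2


def block_paths(root):
    """Return the path set f of all block nodes below `root`.

    `root` is a (tag, children) tuple, where children is a list of such tuples.
    """
    paths = set()
    queue = deque([(root, f"{root[0]}1")])
    while queue:  # breadth-first traversal
        (tag, children), path = queue.popleft()
        if tag in BLOCK_TAGS:
            paths.add(path)
        seen = {}  # per-tag sibling counter (assumed indexing scheme)
        for child in children:
            seen[child[0]] = seen.get(child[0], 0) + 1
            queue.append((child, f"{path}/{child[0]}{seen[child[0]]}"))
    return paths


# A tiny hypothetical page: two top-level divs, the second with nested blocks.
page = ("body", [
    ("div", [("p", [])]),
    ("p", []),
    ("div", [("table", []), ("div", [])]),
])
print(sorted(block_paths(page)))
```

Because only `div` and `table` nodes contribute paths, the feature set stays small even for large pages, which is consistent with the stated goal of computing similarity quickly.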

In this embodiment, the similarity between web pages can be computed from the structural features F. The similarity function of two pages is defined as:

\mathrm{sim}(F_{i}, F_{j}) = \frac{\mathrm{card}(F_{i}.f \cap F_{j}.f)}{\min(\mathrm{card}(F_{i}.f), \mathrm{card}(F_{j}.f))}

where F_i and F_j denote the structural features of the i-th and j-th web pages respectively. For convenience in describing the algorithm, the feature of a marking class in the clustering result is defined as follows.

Definition 5. Given a set of structurally identical web pages W = {w₁, w₂, …, w_n | sim(<w_i, f_i>, <w_j, f_j>) > 0.82, 0 < i, j <= n} and the set of their structural features f = {f | f = F_i.f and F_i.w ∈ W}, each marking class in the website construction result can be represented as a triple C = <c, f, W>, where c is the label of the class, c ∈ N (the positive integers).
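As a worked illustration of the similarity function and the 0.82 bound in Definition 5, consider two hypothetical pages. The counts are invented for the example and do not come from the experiments: page w_i has 10 block paths, page w_j has 12, and 9 paths are common to both.

```latex
\mathrm{sim}(F_i, F_j)
  = \frac{\mathrm{card}(F_i.f \cap F_j.f)}{\min(\mathrm{card}(F_i.f),\ \mathrm{card}(F_j.f))}
  = \frac{9}{\min(10, 12)}
  = 0.9 > 0.82
```

Under these assumed counts the two pages would satisfy the condition for belonging to the same marking class. With only 8 shared paths the value would be 0.8, which is below the bound.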

This embodiment clusters a group of web pages with a hierarchical clustering algorithm based on the web page similarity computation, as given in Algorithm 1.

Algorithm 1 yields the set M of web page marking classes, in which each element represents the feature of one marking class.
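Algorithm 1 is not reproduced in the text, so the following Python sketch is only one plausible reading of it. It assumes agglomerative clustering with complete linkage, chosen because Definition 5 requires every pair of pages in a class to have similarity above 0.82. The function names and the handling of empty path sets are assumptions of the sketch.

```python
from itertools import combinations

THRESHOLD = 0.82  # similarity bound quoted in Definition 5


def sim(f_i, f_j):
    """Overlap of two path sets divided by the size of the smaller one."""
    if not f_i or not f_j:
        return 0.0  # assumption: pages without block paths match nothing
    return len(f_i & f_j) / min(len(f_i), len(f_j))


def linkage(a, b, features):
    """Complete linkage: the weakest pairwise similarity between two clusters."""
    return min(sim(features[x], features[y]) for x in a for y in b)


def cluster(features):
    """Group page ids whose path sets are mutually similar.

    `features` maps a page id w to its path set f. Returns a list of
    clusters, each a list of page ids.
    """
    clusters = [[w] for w in features]
    while len(clusters) > 1:
        best, pair = THRESHOLD, None
        for i, j in combinations(range(len(clusters)), 2):
            s = linkage(clusters[i], clusters[j], features)
            if s > best:
                best, pair = s, (i, j)
        if pair is None:
            break  # no two clusters are similar enough to merge
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def marking_classes(features):
    """Build the triples C = <c, f, W> with c numbered from 1."""
    return [(c, [features[w] for w in pages], pages)
            for c, pages in enumerate(cluster(features), start=1)]
```

The pairwise search makes each merge step quadratic in the number of clusters, which is acceptable for a sketch; a library implementation of hierarchical clustering would be the natural choice for larger page sets.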

In this embodiment, once the website construction result has been obtained, a content extraction rule must still be derived for the web pages of each marking class. This embodiment locates the content of a web page by combining statistics-based node density features with web page segmentation. In a web page the main text is usually relatively concentrated, so the text density of the node containing the main text is higher than that of other nodes. In terms of how an HTML file is rendered in a browser, a page is made up of several blocks, which are delimited by HTML container tags (the <div> and <table> tags). The HTML page is therefore cut into a set of blocks B, and the content block that contains the complete main text and no noise information is then selected from B.

This embodiment uses density features to decide whether a block node is a content block. Three density definitions are given below:

Definition 6. Let n ∈ B be a block node in the DOM tree T_d. The text density of n is defined as:

p_{text}^{i} = \frac{T_{n}}{T}

where T_n is the number of plain-text characters contained in block node n (excluding link text), and T is the number of plain-text characters in the whole document represented by T_d (excluding link text).

P_text reflects how concentrated the text content of the whole page is in a given block node. Observation and experiment show that the larger P_text is, the more likely the node is to contain the content block being sought.

Definition 7. Let n ∈ B be a block node in the DOM tree T_d. The link density of n is defined as:

p_{link}^{i} = \frac{lN_{n}}{lN}

where lN_n is the number of links contained in node n, and lN is the number of links contained in the whole document represented by T_d.

P_link reflects how concentrated the links of the whole page are in a given block node. Observation and experiment show that the larger P_link is, the more likely the block node is to contain noise information.

Definition 8. Let n ∈ B be a block node in the DOM tree T_d. The node text density of n is defined as:

p_{textl}^{i} = \frac{T_{n}}{lT_{n}}

where T_n is the number of plain-text characters of block node n (excluding link text), and lT_n is the number of text characters of block node n (including link text).

P_textl reflects the density of plain text within a node. Observation and experiment show that the larger P_textl is, the more likely the node is to contain the main text block being sought.
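The three density measures reduce to simple ratios once the character and link counts of a block are known. The sketch below assumes those counts have already been collected; the `BlockCounts` class, the `densities` function, and the choice to return 0.0 for a zero denominator are assumptions of the sketch, not part of the text.

```python
from dataclasses import dataclass


@dataclass
class BlockCounts:
    plain_chars: int  # T_n: text characters in the block, excluding link text
    all_chars: int    # lT_n: text characters in the block, including link text
    links: int        # lN_n: number of links in the block


def densities(block, doc_plain_chars, doc_links):
    """Return (p_text, p_link, p_textl) for one block node.

    doc_plain_chars is T and doc_links is lN for the whole document.
    A zero denominator yields 0.0, which is an assumption of this sketch.
    """
    p_text = block.plain_chars / doc_plain_chars if doc_plain_chars else 0.0
    p_link = block.links / doc_links if doc_links else 0.0
    p_textl = block.plain_chars / block.all_chars if block.all_chars else 0.0
    return p_text, p_link, p_textl


# Hypothetical counts: an article body versus a navigation bar on one page.
article = BlockCounts(plain_chars=900, all_chars=950, links=2)
navbar = BlockCounts(plain_chars=10, all_chars=210, links=25)
print(densities(article, doc_plain_chars=1000, doc_links=40))
print(densities(navbar, doc_plain_chars=1000, doc_links=40))
```

With these invented counts the article block has high text density and low link density, while the navigation bar shows the opposite pattern, which is the contrast the three definitions are meant to capture.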

With the three density measures given, the combined density feature value H(p) of a block node is defined as:

H(p) = λ * p1 * p2 * p3

where p = {p | p = p(b), b ∈ B} denotes the path corresponding to the block node, and p1, p2, p3 denote the density features of the node, with p1 = p_text, p2 = 1 - p_link, p3 = p_textl. Block nodes are evaluated using H(p).
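A minimal sketch of scoring blocks with H(p) and picking the highest-scoring one follows. The value of λ is not given in the text, so `LAMBDA = 1.0` is a placeholder. The function names and the sample density triples are hypothetical, and the sketch ignores the fact that an outer block contains the text of its nested blocks.

```python
LAMBDA = 1.0  # placeholder: the text does not state a value for λ


def combined_density(p_text, p_link, p_textl, lam=LAMBDA):
    """H(p) = λ * p1 * p2 * p3 with p1 = p_text, p2 = 1 - p_link, p3 = p_textl."""
    return lam * p_text * (1.0 - p_link) * p_textl


def pick_content_block(blocks):
    """Return the path whose density triple gives the largest H(p).

    `blocks` maps a block path to its (p_text, p_link, p_textl) triple.
    """
    return max(blocks, key=lambda path: combined_density(*blocks[path]))


# Hypothetical density triples for three blocks of one page.
blocks = {
    "body1/div1": (0.05, 0.60, 0.20),  # navigation-like block
    "body1/div2": (0.90, 0.05, 0.95),  # article-like block
    "body1/div3": (0.05, 0.35, 0.30),  # footer-like block
}
print(pick_content_block(blocks))
```

Since λ scales every block equally, it does not change which block ranks highest within a page; it matters only if H(p) is compared against a fixed threshold.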

Algorithm 1 yields the set M of web page marking classes. This embodiment assumes that the content block is in the same position in all web pages of the same marking class. The content block is therefore selected within each class by its density features, and the features of that content block are then taken as the extraction rule for the content block of that class, as shown in Algorithm 2. In a web page, a block can be characterized in three ways: the value of the block's class attribute, the value of the block's id attribute, and the path of the block. For convenience in describing the algorithm, this embodiment defines the extraction rule of each marking class in the clustering result, that is, the features of the content block in each marking class.

Definition 9. Given the label c of a marking class in the clustering result and the content block b ∈ B of the web pages of that class, the extraction rule of the marking class is defined as the four-tuple L(b) = <c, class, id, p>, where class is the value of the class attribute of the content block of that class of pages, id is the value of the id attribute of that content block, and p is the tag path of that content block, p = {p | p = p(b) and b ∈ B}.

Applying Algorithm 2 to the clustering result yields the extraction rule L(b) for the content block of each marking class. The set of extraction rules N is used to process the Web pages of each marking class in the marking class set M and to extract data from them. L(b) records three features of the content block: the value of the block's class attribute, the value of the block's id attribute, and the path of the block. If L.class and L.id have values, the content block is extracted using the values of the class and id attributes. If L.class and L.id are empty, the content block is extracted using the block path L.p. Finally, the web page content is extracted from the content block.
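Applying a rule L(b) to a page can be sketched as below. The dictionary-based tree, the names `Rule`, `walk`, `find_content_block`, and `text_of`, and the per-tag sibling indexing are assumptions of the sketch. The text does not say what happens when only one of class and id has a value; the sketch matches on whichever attributes are present and falls back to the path only when both are empty.

```python
from collections import namedtuple

Rule = namedtuple("Rule", "c cls id p")  # L(b) = <c, class, id, p>


def walk(node, path):
    """Yield (path, node) for every node, using per-tag sibling indices."""
    yield path, node
    seen = {}
    for child in node.get("children", []):
        seen[child["tag"]] = seen.get(child["tag"], 0) + 1
        yield from walk(child, f"{path}/{child['tag']}{seen[child['tag']]}")


def find_content_block(root, rule):
    """Locate the content block by class/id if given, otherwise by path."""
    nodes = list(walk(root, f"{root['tag']}1"))
    if rule.cls or rule.id:
        wanted = {k: v for k, v in (("class", rule.cls), ("id", rule.id)) if v}
        for _, node in nodes:
            attrs = node.get("attrs", {})
            if all(attrs.get(k) == v for k, v in wanted.items()):
                return node
        return None
    for path, node in nodes:
        if path == rule.p:
            return node
    return None


def text_of(node):
    """Concatenate the text of a node and all of its descendants."""
    parts = [node.get("text", "")]
    parts += [text_of(child) for child in node.get("children", [])]
    return " ".join(part for part in parts if part)


# Hypothetical page: a navigation block followed by the article block.
page = {"tag": "body", "children": [
    {"tag": "div", "attrs": {"class": "nav"}, "text": "Home"},
    {"tag": "div", "attrs": {"class": "article", "id": "main"}, "children": [
        {"tag": "p", "text": "First paragraph."},
        {"tag": "p", "text": "Second paragraph."},
    ]},
]}
rule = Rule(c=1, cls="article", id="main", p="body1/div2")
print(text_of(find_content_block(page, rule)))
```

Preferring class and id over the path makes the rule tolerant of small layout changes that shift sibling positions, which may be why the text checks those attributes first.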

In particular, to verify the effectiveness of the above method, this embodiment implements a corresponding prototype system in Java on the Eclipse platform. The input of the prototype system is a given group of Web pages, and the output is the content of those pages. The data set used in the experiments consists of 1000 web pages from 5 websites. It was collected online from the Internet in a semi-manual way (seed URLs + crawler + manual screening) from NetEase, Sohu, Sina, People's Daily Online (People's Net), and Xinhuanet (www.xinhuanet.com); the pages come from different topic categories within each website. This data set is used in the website construction processing.

The experiments of this embodiment are of two kinds. The first is content extraction based on web page segmentation and block node density features: the block with the higher combined density in the page is chosen as the content block and its content is extracted; the results are shown in Table 1. The second is content extraction based on website construction, web page segmentation, and block node density features; the results are shown in Table 2.

Table 1. Web page content extraction results based on web page segmentation and block node density features

Data set	Total web pages	Accuracy
NetEase	200	86%
Sohu	200	98.5%
Sina	200	96%
Xinhuanet	200	99%
People's Daily Online	200	100%

Table 2. Web page content extraction results based on website construction

Data set	Total web pages	Accuracy
NetEase	200	91%
Sohu	200	100%
Sina	200	100%
Xinhuanet	200	100%
People's Daily Online	200	100%
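The figures in Tables 1 and 2 can be compared with a few lines of arithmetic. The sketch below only restates the reported accuracies; the variable and function names are invented. Because every site contributes 200 pages, the plain mean over sites equals the accuracy over all 1000 pages: 95.9% for Table 1 and 98.2% for Table 2.

```python
PAGES_PER_SITE = 200  # each of the five sites contributes 200 pages

# Accuracy in percent, copied from Table 1 and Table 2.
table1 = {"NetEase": 86.0, "Sohu": 98.5, "Sina": 96.0,
          "Xinhuanet": 99.0, "People's Daily Online": 100.0}
table2 = {"NetEase": 91.0, "Sohu": 100.0, "Sina": 100.0,
          "Xinhuanet": 100.0, "People's Daily Online": 100.0}


def mean_accuracy(table):
    """Mean accuracy over sites; equal to overall accuracy for equal page counts."""
    return sum(table.values()) / len(table)


def correct_pages(table):
    """Number of correctly extracted pages implied by each site's accuracy."""
    return {site: round(acc / 100 * PAGES_PER_SITE) for site, acc in table.items()}


# Per-site improvement in percentage points after website construction.
gains = {site: table2[site] - table1[site] for site in table1}

print(mean_accuracy(table1), mean_accuracy(table2))
print(gains)
```

The largest per-site gain is for NetEase (5 percentage points), which is also the only site that remains below 100% in Table 2.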

The results in Table 1 show that extraction based only on web page segmentation and block node density features does not work well for some websites. We examined the pages on which extraction failed and identified the following causes of error:

(1) No block in the page contains the complete content: besides <div> and <table> block nodes, page content is also found inside <center> and <text> tags. Because only nodes whose tag is <div> or <table> are defined as block nodes here, a large part of the data to be extracted is missed, which lowers the accuracy.

(2) No block in the page contains only the complete content: this error occurs mainly in the NetEase data set. In some of the 200 NetEase pages we selected there is no exact content block; instead, the content is mixed with a recommended-links block inside the same block. The extracted content block therefore contains impurities, which lowers the accuracy.

(3) The wrong content block is selected: when the main text of a page is short and the page contains a lot of noise, for example long visitor comments, the system selects the wrong block. This occurs mainly in pages from Sina, Sohu, and Xinhuanet.

The experimental results show that the accuracy of extraction based only on web page segmentation and block node density features is relatively unstable. When website construction processing is applied first and content is extracted afterwards, accuracy improves markedly and the three kinds of error above are largely eliminated.

The foregoing is only a preferred embodiment of the present invention. All equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope covered by the present invention.

Claims

1. A text information extraction method based on website construction, characterized by comprising the following steps:

Step S1: Crawl the web pages corresponding to the input URL set, remove the script, comment, and style tags from each HTML page, and parse each page into a DOM tree;

Step S2: Given the DOM tree of a web page, traverse it to extract the structural features of the page; compute the similarity between web pages from these structural features; cluster the set of web pages according to the similarity; and finally produce a set of marking classes together with the web page features of each marking class;

Step S3: For web pages of the same class, segment each page into blocks, compute the node density features of each block, find the block that contains the main text and no noise information, and take the features of that block as the extraction rule for the content block of that class of pages;

Step S4: Extract the content of the web pages in each marking class according to the extraction rule; for a new web page from the same source, determine the marking class to which it belongs from the web page features of the marking classes, and then extract its content according to the extraction rule of that marking class.

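2. The text information extraction method based on website construction according to claim 1, characterized in that step S2 comprises the following steps:

Step S21: Take the path of each block node in the DOM tree as a structural feature of the web page; traverse the DOM tree T_d of the page breadth-first, extract the path set F.f, and form the two-tuple F = <w, f> that represents the structural features of web page w;

Step S22: Convert the input group of web pages W into the set of web page features D = {F₁, F₂, …, F_k};

Step S23: From the structural features F, compute the similarity between web pages using the following formula:

\mathrm{sim}(F_{i}, F_{j}) = \frac{\mathrm{card}(F_{i}.f \cap F_{j}.f)}{\min(\mathrm{card}(F_{i}.f), \mathrm{card}(F_{j}.f))};

where F_i and F_j denote the structural features of the i-th and j-th web pages respectively;

Step S24: Cluster the group of web pages with a hierarchical clustering algorithm based on the web page similarity.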

3. The text information extraction method based on website construction according to claim 1, characterized in that the computation of the node density features in step S3 comprises the following steps:

Step S31: Let n ∈ B be a block node in the DOM tree T_d. The text density of n is defined as:

p_{text}^{i} = \frac{T_{n}}{T};

where T_n is the number of plain-text characters contained in block node n, and T is the number of plain-text characters in the whole document represented by T_d; neither T_n nor T includes link text;

Step S32: Let n ∈ B be a block node in the DOM tree T_d. The link density of n is defined as:

p_{link}^{i} = \frac{lN_{n}}{lN};

where lN_n is the number of links contained in node n, and lN is the number of links contained in the whole document represented by T_d;

Step S33: Let n ∈ B be a block node in the DOM tree T_d. The node text density of n is defined as:

p_{textl}^{i} = \frac{T_{n}}{lT_{n}};

where T_n is the number of plain-text characters of block node n and lT_n is the number of text characters of block node n; T_n does not include link text, whereas lT_n does;

Step S34: Compute the combined density feature value H(p) of the block node:

H(p) = λ * p1 * p2 * p3;

where p = {p | p = p(b), b ∈ B} denotes the path corresponding to the block node, and p1, p2, p3 denote the density features of the node, with p1 = p_text, p2 = 1 - p_link, p3 = p_textl. Block nodes are evaluated using H(p).