CN103853834A

CN103853834A - Text structure analysis-based Web document abstract generation method

Info

Publication number: CN103853834A
Application number: CN201410090200.0A
Authority: CN
Inventors: 沈怡涛; 顾君忠; 林晨
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2014-03-12
Filing date: 2014-03-12
Publication date: 2014-06-11
Anticipated expiration: 2034-03-12
Also published as: CN103853834B

Abstract

The invention discloses a method for generating Web document summaries based on text structure analysis. The method takes a URL (uniform resource locator) as input, extracts the main text of the web page by combining visual features with text features, partitions the main text into several semantic sections, and summarizes each semantic section, so that the generated summary has higher coverage. The method addresses the situation in which web page structure is complex, the main text is hard to identify, and Chinese automatic summarization is still at an exploratory stage, and it produces a better-quality text summary from a web page.

Description

Method for generating Web document summaries based on text structure analysis

Technical field

The present invention relates to the technical fields of Web page main-text extraction, natural language processing, and Chinese text summarization, and specifically to a method for generating Web document summaries based on text structure analysis.

Background art

At present, the Internet has become the main source from which people obtain information. With the rapid development of user-generated content (UGC) in recent years in particular, the information on the Internet is growing explosively. Search engines can return results that match a user's query, but the user still has to find the most suitable pages in the result list. The large amount of search engine optimization and reposted content on the Internet makes it very difficult for users to find information quickly and accurately.

An automatic summarization system uses a computer to process Web documents quickly and extracts the core content of a Web document at a given compression ratio. From this summary the user can obtain the topic of the Web document and judge its value, which improves the efficiency of information retrieval.

Web documents contain a large amount of noise, that is, information unrelated to the topic, such as advertisements, navigation bars, user function bars, related recommendations, and copyright notices. A Web document is semi-structured information: it has a certain structure, but its semantics cannot be determined from that structure. The way content is expressed in the HTML source differs greatly from the finally rendered page. With the widespread use of JS and AJAX in recent years, web page data is no longer static HTML code but is generated dynamically, and may even change in response to the user's actions. Extracting content that is relevant to the topic and structurally correct from a Web document is therefore difficult.

Research on Chinese text summarization systems has a history of nearly two decades, but it is still at an exploratory stage, and the results of automatic summarization remain far from satisfactory. Automatic summarization methods fall into two main classes: understanding-based summarization and extraction-based summarization. Because natural language processing technology has not yet achieved a major breakthrough, understanding-based methods cannot yet truly realize automatic summarization.

Research on automatic summarization of Web documents has an even shorter history: "Compared with traditional text, the text structure of a web page is loose, headings are less rigorous, a sentence may end without a terminating punctuation mark, and there is a large amount of content unrelated to the main text, all of which makes summary generation more difficult."

Summary of the invention

The object of the present invention is to provide a method for generating Web document summaries based on text structure analysis. The method combines visual feature analysis, natural language analysis, text structure analysis, and related techniques to generate a semantics-based, good-quality summary for each web page in a set of search results, as a reference for the user.

The object of the present invention is achieved as follows:

A method for generating Web document summaries based on text structure analysis, comprising the following steps:

1) input the URL of the web page to be summarized;

2) extract the main text from the web page to be summarized based on visual analysis, specifically comprising:

2.1) parse and render the Web document with a browser core;

2.2) partition the web page into blocks with the visual tree (VIPS) algorithm, obtaining the position and area of each block;

2.3) perform word segmentation on each block;

2.4) analyze the text features of each block;

2.5) score each block on whether it contains main text;

2.6) concatenate, in order, the text of the blocks whose score is above a certain threshold;

2.7) output the main text of the Web document;

3) perform automatic summarization based on text structure analysis on the extracted main text, specifically comprising:

3.1) obtain the main text of the web page from step 2);

3.2) perform word segmentation and part-of-speech tagging on the main text;

3.3) preprocess the text: identify its basic structure, including identifying the article title, completing sentences, and splitting paragraphs;

3.4) split the main text into semantic sections, using text structure analysis to identify the positions where the semantics change and taking them as the boundaries between semantic sections;

3.5) for each semantic section, use an improved TF-IDF method to measure the importance of each sentence within its semantic section, and then, according to the required summary length, extract several sentences that represent the topic of that semantic section;

3.6) concatenate the extracted sentences in order and output the summary.

The text features in step 2.4) are the number of words, the font size, the number of declarative sentences, the number of non-declarative sentences, and the number of text fragments.
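The five text features can be collected per block roughly as in the following sketch. The patent does not say how a sentence is classified, so the rule used here (a sentence ending in a full stop is declarative, one ending in a question or exclamation mark is non-declarative) is an assumption, as are the names `BlockFeatures` and `block_features` and the choice to count characters rather than segmented words.

```python
import re
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BlockFeatures:
    char_count: int        # number of non-whitespace characters
    font_size: float       # font size reported by the renderer
    declarative: int       # sentences ending in a full stop
    non_declarative: int   # sentences ending in ? or !
    fragment_count: int    # number of separate text fragments in the block


def block_features(fragments: Sequence[str], font_size: float) -> BlockFeatures:
    """Collect the text features of one block (hypothetical helper)."""
    text = "".join(fragments)
    # Assumed classification rule: the terminator decides the sentence type.
    declarative = len(re.findall(r"[。.]", text))
    non_declarative = len(re.findall(r"[！!？?]", text))
    char_count = sum(1 for ch in text if not ch.isspace())
    return BlockFeatures(char_count, font_size, declarative,
                         non_declarative, len(fragments))
```

A navigation bar made of short link labels would typically yield many fragments and few or no declarative sentences under this rule, which is the kind of contrast the block score relies on.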

The scoring in step 2.5) of whether each block contains main text uses the following formula:

V(s) = \frac{S^{2} \cdot P(x_{1}, y_{1}, x_{2}, y_{2})}{N + 1}

where S is the number of declarative sentences, N is the number of non-declarative sentences, and P is a value computed from the size and position of the block; x₁, y₁ are the coordinates of the upper-left corner of the block, and x₂, y₂ are the coordinates of its lower-right corner.

The positions where the semantics change in step 3.4) are identified as follows:

1) split document D into sentences; the position between every two adjacent sentences is a candidate cut point;

2) score each candidate cut point with the formula:

Q(p_{i}) = \sum_{i + 1 < j \leq i + a} R(s_{i}, s_{j}) - \sum_{i - a \leq j < i} R(s_{i}, s_{j})

where R(s_i, s_j) is the semantic relevance between sentence s_i and sentence s_j, and p_i is the cut point between sentences s_{i-1} and s_i. If Q(p_i) > Q(p_{i-1}) and Q(p_i) > Q(p_{i+1}), then p_i is a local maximum of the cut-point weight, so p_i is a cut point between semantic sections in the main text. a is an adjustable empirical parameter that sets the range of the semantic analysis when identifying a cut point: a sentences before and a sentences after the cut point are considered.

3) if the score of a cut point is greater than a certain threshold and is a local maximum, that is, higher than the scores of the cut points immediately before and after it, then this cut point is a boundary between semantic sections, i.e. a position where the semantics change as described in step 3.4).
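The cut-point scoring and local-maximum test can be sketched as follows. The sketch assumes the sentence relevance values are already available as a square matrix `R` (a hypothetical input), follows the summation bounds exactly as printed in the formula, clips the windows at the ends of the document, and skips the first and last cut points because they lack a neighbour on one side; the last two choices are assumptions, not part of the description.

```python
from typing import Dict, List, Sequence


def cut_point_scores(R: Sequence[Sequence[float]], a: int) -> Dict[int, float]:
    """Score cut point p_i (between sentence i-1 and sentence i) for i = 1..n-1."""
    n = len(R)
    scores = {}
    for i in range(1, n):
        # Sentences after the cut point: i + 1 < j <= i + a, clipped to the text.
        after = sum(R[i][j] for j in range(i + 2, min(i + a, n - 1) + 1))
        # Sentences before the cut point: i - a <= j < i, clipped to the text.
        before = sum(R[i][j] for j in range(max(i - a, 0), i))
        scores[i] = after - before
    return scores


def find_boundaries(scores: Dict[int, float], threshold: float) -> List[int]:
    """Keep cut points that exceed the threshold and both neighbouring scores."""
    boundaries = []
    for i, q in sorted(scores.items()):
        left, right = scores.get(i - 1), scores.get(i + 1)
        if left is None or right is None:
            continue  # assumption: edge cut points are never boundaries
        if q > threshold and q > left and q > right:
            boundaries.append(i)
    return boundaries
```

With a toy matrix in which sentences 0 to 2 are mutually relevant, sentences 3 to 5 are mutually relevant, and the two groups are unrelated, `find_boundaries(cut_point_scores(R, 3), 0.0)` selects the cut point in front of sentence 3.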

The analysis identification step 2 of the position that described semanteme changes) between sentence the calculating of semantic relevancy comprise the following steps:

1) sentence is cut into the set of word;

2) use following formula to calculate semantic relevancy between sentence

R (s_{1}, s_{2}) = \underset{w_{i} &Element; s_{1}}{Σ} \max (R (w_{i}, w_{j})) (w_{j} &Element; s_{2})

Wherein R (w _i, w _j) expression word w _iwith word w _jword between semantic relevancy.
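As a minimal sketch of the sentence relevance formula, the function below sums, for every word of the first sentence, its best match in the second sentence. The word relevance function is passed in as a parameter; `exact_match` is only a hypothetical stand-in for a real word relevance measure and is not part of the method.

```python
from typing import Callable, Sequence


def sentence_relevance(s1: Sequence[str], s2: Sequence[str],
                       word_relevance: Callable[[str, str], float]) -> float:
    """R(s1, s2): sum over words of s1 of the best relevance to any word of s2."""
    if not s2:
        return 0.0  # assumption: an empty sentence is unrelated to everything
    return sum(max(word_relevance(wi, wj) for wj in s2) for wi in s1)


def exact_match(w1: str, w2: str) -> float:
    """Toy word relevance used only for illustration: 1.0 for identical words."""
    return 1.0 if w1 == w2 else 0.0
```

Note that the measure as written is not symmetric: R(s₁, s₂) sums over the words of s₁ only, so a long first sentence can score higher than the same pair in the opposite order.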

The importance of each sentence within its semantic section in step 3.5) is computed with the following formula:

V(S_{1}) = \sum_{w \in S_{1}} TFIDF(w)

where, when TFIDF(w) is computed, each paragraph is treated as an independent document and the paragraphs of the whole article are treated as the document collection.
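The sentence scoring can be illustrated with the sketch below, which treats each paragraph as one document when computing TF-IDF. The description does not give the exact TF-IDF variant, so the common form tf × log(N / df) is assumed here, as is the choice to sum over the distinct words of a sentence; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

Sentence = List[str]          # a sentence as a list of segmented words
Paragraph = List[Sentence]    # a paragraph as a list of sentences


def tfidf_by_paragraph(paragraphs: List[Paragraph]) -> List[Dict[str, float]]:
    """TF-IDF weights per paragraph, each paragraph acting as one document."""
    docs = [[w for sent in para for w in sent] for para in paragraphs]
    df = Counter(w for doc in docs for w in set(doc))  # paragraph frequency
    n = len(docs)
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (c / len(doc)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights


def sentence_scores(paragraphs: List[Paragraph]) -> List[List[float]]:
    """V(S) for every sentence: the summed TF-IDF of its distinct words."""
    weights = tfidf_by_paragraph(paragraphs)
    return [[sum(weights[k][w] for w in set(sent)) for sent in para]
            for k, para in enumerate(paragraphs)]
```

Under this variant, a word that occurs in every paragraph gets weight zero, so sentences are ranked by the words that distinguish their paragraph from the rest of the article.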

The present invention can filter out the text, links, and other content in a web page that are unrelated to the topic and identify the article text contained in the page, with high accuracy and high robustness. The summarization flow adopts automatic summarization based on text structure analysis, so the generated summary has high coverage and reads fairly smoothly.

For a Web document, given a compression ratio specified by the user, the present invention needs only the URL of the web page to be summarized as input and, within a few seconds, produces a fairly accurate and fluent summary that covers the meaning of the original text, helping users find information on the Internet quickly and accurately.

Brief description of the drawings

Fig. 1 is the overall flow chart of the present invention;

Fig. 2 is the flow chart of web page preprocessing in the present invention;

Fig. 3 is the flow chart of automatic summarization in the present invention.

Embodiment

The invention discloses a search-engine-oriented method for generating Web document summaries, which can automatically analyze a Web page and generate a text summary that reflects the topic of the page.

The invention comprises Web page main-text extraction, which combines visual features with text features, and automatic text summarization, which divides the text into sub-topics by text structure analysis.

The invention takes a URL as input and, after the two stages of main-text extraction and automatic summarization, generates the text summary.

The specific algorithms of the two stages are described further below, using the summarization of a news web page as an example.

Fig. 1 shows the overall flow from the URL to be summarized to the generated summary, comprising the web page preprocessing flow and the automatic summarization flow.

Specifically, in this embodiment, the URL input step of the web page preprocessing flow (see Fig. 2) obtains the URL of the news web page to be summarized. By analyzing visual features, the web page preprocessing flow can locate the main text of the page more accurately and is more robust than other methods. It also takes into account other features, such as text features, text relevance analysis, HTML tag features, and semantic features, to further improve the accuracy of main-text extraction.

The web page rendering step reads the web page at the input URL; in this embodiment, the IE11 browser core is used to process the HTML tags and render the page. On the rendered page, the visual tree analysis step applies the VIPS algorithm to obtain the position and area of each block. In this embodiment, this step divides the news web page to be summarized into 6 blocks: a top block, a bottom block, a navigation block, an advertisement block, and two blocks containing main text. The word segmentation step performs word segmentation on each block. The text feature analysis step then analyzes the text features of the segmentation results. Finally, the comprehensive analysis step combines the features of each block obtained from the visual tree analysis with its text features and outputs the main text.

In this embodiment, P(x₁, y₁, x₂, y₂) is computed with the following formula:

P(x_{1}, y_{1}, x_{2}, y_{2}) = (x_{2} - x_{1})(y_{2} - y_{1}) - x_{1} y_{1}

where x₁, y₁ are the coordinates of the upper-left corner of the block and x₂, y₂ are the coordinates of its lower-right corner. The V(s) value of each block is then computed:

V(s) = \frac{S^{2} \cdot P(x_{1}, y_{1}, x_{2}, y_{2})}{N + 1}

The V(s) values of the above 6 blocks, from largest to smallest, are 3.7 × 10⁶, 2.3 × 10⁶, 7.5 × 10⁵, 5.4 × 10⁶, 3.7 × 10⁵, 1.6 × 10⁵, 1.2 × 10⁴.

In this embodiment, the threshold is 10⁶, so the blocks whose V(s) is greater than 10⁶ are selected, namely the two blocks with the largest V(s) values. In this embodiment these two blocks are exactly the two blocks containing main text, so the main text is extracted correctly.
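The block scoring and thresholding of this embodiment can be sketched as follows, using the formulas for P and V(s) given above and the threshold of 10⁶. The `Block` class, its field names, and the sample coordinates in the usage note are hypothetical; they are not the blocks of the embodiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    x1: float             # upper-left corner
    y1: float
    x2: float             # lower-right corner
    y2: float
    declarative: int      # S: number of declarative sentences
    non_declarative: int  # N: number of non-declarative sentences


def position_weight(b: Block) -> float:
    """P = (x2 - x1)(y2 - y1) - x1 * y1: block area minus a position penalty."""
    return (b.x2 - b.x1) * (b.y2 - b.y1) - b.x1 * b.y1


def block_score(b: Block) -> float:
    """V(s) = S^2 * P / (N + 1)."""
    return b.declarative ** 2 * position_weight(b) / (b.non_declarative + 1)


def select_body_blocks(blocks: List[Block], threshold: float = 1e6) -> List[Block]:
    """Keep, in page order, the blocks whose score exceeds the threshold."""
    return [b for b in blocks if block_score(b) > threshold]
```

For example, a large block near the left edge of the page with ten declarative sentences, `Block(0, 100, 800, 1100, 10, 0)`, scores 8 × 10⁷ and is kept, while a narrow block on the right with mostly non-declarative text, `Block(800, 100, 1000, 400, 1, 3)`, has a negative P and is discarded.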

After the main text is extracted, the automatic summarization flow (see Fig. 3) is carried out. It comprises the steps of text preprocessing, relevance computation between words, relevance computation between sentences, semantic section segmentation, and summary generation.

The text preprocessing step identifies the basic structure of the text: it identifies the article title, completes sentences, and splits paragraphs. In this embodiment, the main text contains 8 paragraphs and 23 sentences in total.

The word relevance computation step is based on the computational semantic knowledge provided by HowNet and obtains the relevance of two words by computing the similarity of their sememes. The formula used is:

R(w_{1}, w_{2}) = \max_{C_{i} \in w_{1}, C_{j} \in w_{2}} Rele(C_{i}, C_{j})

where R(w₁, w₂) is the semantic relevance between the two words and Rele(C_i, C_j) is the relevance between two sememes; the maximum over all sememe pairs is taken as the semantic relevance of the two words.
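The maximum over sememe pairs can be sketched as below. No real HowNet interface is used: the sememe lookup table and the sememe relevance function are passed in as parameters, and both they and the sample data in the usage note are hypothetical placeholders for whatever resource supplies sememes and their relevance.

```python
from itertools import product
from typing import Callable, Dict, Sequence


def word_relevance(w1: str, w2: str,
                   sememes: Dict[str, Sequence[str]],
                   rele: Callable[[str, str], float]) -> float:
    """R(w1, w2): the largest relevance over all sememe pairs of the two words."""
    c1 = sememes.get(w1, ())
    c2 = sememes.get(w2, ())
    # Assumption: a word with no known sememes is unrelated to every word.
    return max((rele(ci, cj) for ci, cj in product(c1, c2)), default=0.0)
```

With a toy table such as `{"医生": ["human", "medical"], "护士": ["human", "medical"], "苹果": ["fruit"]}` and a relevance function that returns 1.0 for identical sememes and 0.0 otherwise, the first two words get relevance 1.0 and either of them paired with the third gets 0.0.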

The sentence relevance computation step obtains the relevance of two sentences by analyzing the relevance between the words in the two sentences:

R(s_{1}, s_{2}) = \sum_{w_{i} \in s_{1}} \max_{w_{j} \in s_{2}} R(w_{i}, w_{j})

where R(s₁, s₂) is the relevance between the two sentences. For each word in sentence 1, the word in sentence 2 with the highest relevance to it is found, and the relevance between these two words is computed. Finally, these maxima are summed to give the relevance between the two sentences.

The semantic section segmentation step performs text structure analysis with reference to the document "Research on Text Structure Analysis Based on Content Relevance Computation". A cut point between semantic sections is characterized by the first sentence after the cut point having very low relevance to the several sentences before it and higher relevance to the several sentences after it. In this embodiment, the following formula is used to score the 22 cut points between the 23 sentences and to find the local maxima of the function Q(p_i):

Q(p_{i}) = \sum_{i + 1 < j \leq i + a} R(s_{i}, s_{j}) - \sum_{i - a \leq j < i} R(s_{i}, s_{j})

In this embodiment, Q(p_i) has 2 local maxima, and the news article is divided at these two points into 3 semantic sections. Each semantic section contains one sub-topic of the news article. In this embodiment, the first semantic section is an overview of the news event, and the last two semantic sections are the comments of two parties on the event.

The summary generation step extracts a summary from the plain-text main text at a given ratio according to the user's requirement.

In this embodiment, the summary generation step uses the sentence relevance computation step to compute the sum of the relevance between the sentences of each sub-topic and the word sequence of the article title, and thereby determines the value of each sub-topic. The number of sentences extracted from a sub-topic is proportional to the relevance between that sub-topic and the article title.

In this embodiment, the ratio specified by the user is 0.2, so 5 of the 23 sentences are extracted to form the summary. By computing the values of the 3 sub-topics, it is determined that 2, 1, and 1 sentences are extracted from the 3 semantic sections respectively. Finally, the summary generation step concatenates the 5 selected summary sentences in order to form the summary and outputs it.
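One way to turn the sub-topic values into per-section sentence counts is sketched below. The description states only that the counts are proportional to each sub-topic's relevance to the title; rounding the summary length to the nearest integer and distributing the rounding remainder by largest fractional part are assumptions of this sketch, and the sub-topic values in the usage note are made up for illustration.

```python
import math
from typing import List, Sequence


def summary_length(n_sentences: int, ratio: float) -> int:
    """Number of sentences to extract for a given compression ratio."""
    return max(1, round(n_sentences * ratio))


def allocate(total: int, values: Sequence[float]) -> List[int]:
    """Split `total` sentences across sub-topics in proportion to their values."""
    weight = sum(values)
    if weight <= 0:
        values = [1.0] * len(values)  # assumption: fall back to equal shares
        weight = float(len(values))
    quotas = [total * v / weight for v in values]
    counts = [math.floor(q) for q in quotas]
    # Hand the leftover sentences to the largest fractional parts first.
    order = sorted(range(len(values)),
                   key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in order[:total - sum(counts)]:
        counts[k] += 1
    return counts
```

For instance, `summary_length(23, 0.2)` gives 5, and `allocate(5, [4.0, 3.0, 3.0])` splits those 5 sentences across three hypothetical sub-topics so that the counts always sum to the requested length.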

Claims

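1. A method for generating Web document summaries based on text structure analysis, characterized in that the method comprises the following steps:

1) input the URL of the web page to be summarized;

2) extract the main text from the web page to be summarized based on visual analysis, specifically comprising:

2.1) parse and render the Web document with a browser core;

2.2) partition the web page into blocks with the visual tree algorithm, obtaining the position and area of each block;

2.3) perform word segmentation on each block;

2.4) analyze the text features of each block;

2.5) score each block on whether it contains main text;

2.6) concatenate, in order, the text of the blocks whose score is above a certain threshold;

2.7) output the main text of the Web document;

3) perform automatic summarization based on text structure analysis on the extracted main text, specifically comprising:

3.1) obtain the main text of the web page from step 2);

3.2) perform word segmentation and part-of-speech tagging on the main text;

3.3) preprocess the text: identify its basic structure, including identifying the article title, completing sentences, and splitting paragraphs;

3.4) split the main text into semantic sections, using text structure analysis to identify the positions where the semantics change and taking them as the boundaries between semantic sections;

3.5) for each semantic section, use an improved TF-IDF method to measure the importance of each sentence within its semantic section, and then, according to the required summary length, extract several sentences that represent the topic of that semantic section;

3.6) concatenate the extracted sentences in order and output the summary.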

2. The method according to claim 1, characterized in that the text features in step 2.4) are the number of words, the font size, the number of declarative sentences, the number of non-declarative sentences, and the number of text fragments.

3. The method according to claim 1, characterized in that the scoring in step 2.5) of whether each block contains main text uses the following formula:

V(s) = \frac{S^{2} \cdot P(x_{1}, y_{1}, x_{2}, y_{2})}{N + 1}

4. The method according to claim 1, characterized in that the positions where the semantics change in step 3.4) are identified as follows:

1) split document D into sentences; the position between every two adjacent sentences is a candidate cut point;

2) score each candidate cut point with the formula:

Q(p_{i}) = \sum_{i + 1 < j \leq i + a} R(s_{i}, s_{j}) - \sum_{i - a \leq j < i} R(s_{i}, s_{j})

where R(s_i, s_j) is the semantic relevance between sentence s_i and sentence s_j, and p_i is the cut point between sentences s_{i-1} and s_i; if Q(p_i) > Q(p_{i-1}) and Q(p_i) > Q(p_{i+1}), then p_i is a local maximum of the cut-point weight, so p_i is a cut point between semantic sections in the main text; a is an adjustable empirical parameter that sets the range of the semantic analysis when identifying a cut point: a sentences before and a sentences after the cut point are considered;

3) if the score of a cut point is greater than a certain threshold and is a local maximum, that is, higher than the scores of the cut points immediately before and after it, then this cut point is a boundary between semantic sections, i.e. a position where the semantics change as described in step 3.4).

5. The method according to claim 4, characterized in that the semantic relevance between sentences in step 2) is computed by the following steps:

1) split each sentence into a set of words;

2) compute the semantic relevance between sentences with the formula:

R(s_{1}, s_{2}) = \sum_{w_{i} \in s_{1}} \max_{w_{j} \in s_{2}} R(w_{i}, w_{j})

6. The method according to claim 1, characterized in that the importance of each sentence within its semantic section in step 3.5) is computed with the following formula:

V(S_{1}) = \sum_{w \in S_{1}} TFIDF(w)