CN104699847A - Method and device for extracting summaries from web pages

Info

Publication number: CN104699847A
Application number: CN201510153451.3A
Authority: CN
Inventors: 刘庆朝; 何文娟; 周娜; 冯艳伟
Original assignee: Liu Xiulei
Current assignee: Liu Xiulei
Priority date: 2015-02-13
Filing date: 2015-04-02
Publication date: 2015-06-10
Anticipated expiration: 2035-04-02
Also published as: CN104699847B

Abstract

The invention discloses a method for extracting summaries from web pages. The method is used for extracting the summaries from the web pages which are obtained by means of inputting query terms selected by users and are related to the query terms. The method for extracting the summaries from the web pages has the advantages that the summaries can be accurately extracted from the web pages, whether the web pages are relevant or not and whether the web pages are about to be further checked or not can be quickly determined by the users, and accordingly the problem of trouble when users need to quickly search relevant documents can be solved. The invention further provides a device for extracting the summaries from the web pages.

Description

Method and device for extracting summaries from web pages

Technical field

The present invention relates to a method and device for extracting summaries from web pages, and in particular to a keyword-based method and device for extracting summaries from web pages.

Background art

Existing methods for extracting summaries from web pages fall into the following categories.

1. Methods based on a text set. These methods apply machine learning to learn summary-related rules from a text set. They have two stages: a training stage and a test stage. The training stage extracts key features of the text from a training text set and generates rules with a learning algorithm; the test stage applies the rules learned during training to a test text set to generate summaries. The drawback is that a summary can only be generated from rules produced by training, so the method relies heavily on the training text set and the accuracy of the generated summary is hard to guarantee.

2. Methods based on a domain ontology. These methods analyze the text mainly through the ontology's application-level semantic information and the relationships within the ontology, filtering out irrelevant information so that the topic stands out and the extracted summary is more accurate. The drawback is that Chinese domain ontologies are scarce, so the domain ontology has to be built by hand, which involves a large workload.

Therefore, a method for extracting summaries from web pages is needed that both ensures the accuracy of the generated summary and reduces the workload.

Summary of the invention

In view of the above problems, the invention provides a method and a device that can accurately extract a summary from a web page.

The invention provides a method for extracting a summary from a web page, used to extract a summary from web pages that are retrieved by entering a query word selected by the user and are related to that query word, characterized in that it comprises the following steps:

(1) a query word preprocessing step: a word segmentation tool segments the query word into several subquery words, and the frequency of each subquery word and its distribution information within the query word are counted;

(2) a text preprocessing step: the preprocessed web page text is divided into several sentences, a word segmentation tool segments each sentence, the distribution information of each word in the web page text is saved, and the number of sentences and the number of words in the text are counted;

(3) a word frequency and word distribution statistics step: for each sentence, the number of subquery words it contains, the number of consecutive subquery words it contains, and the kinds of subquery words it contains are counted, and the subquery words occurring in the sentence are marked, with a record of which subquery word each one is;

(4) a scoring step: based on the statistics of step (3), a scoring tool scores each sentence;

(5) a ranking and summary generation step: the sentences are sorted by score from high to low, and the top K sentences are selected to generate the summary, where 0≤K≤3.

The invention also provides a device for extracting a summary from a web page, characterized in that it comprises: a query word preprocessing unit, which uses a word segmentation tool to segment the query word into several subquery words and counts the frequency of each subquery word and its distribution information within the query word; a text preprocessing unit, which divides the text into several sentences, uses a word segmentation tool to segment each sentence, saves the distribution information of each word in the web page text, and counts the number of sentences and the number of words in the text; a word frequency and word distribution information unit, which counts the number of sentences in which at least one subquery word occurs, counts the frequency with which a given subquery word occurs in each sentence, counts the kinds of subquery words occurring in each sentence, and marks the subquery words occurring in each sentence with a record of which subquery word each one is; a scoring unit, which uses a scoring tool to score each sentence; a sequencing unit, which sorts the sentences by score from high to low; and a summary organizing unit, which selects the top K sentences to generate the summary, where 0≤K≤3.

When responding to a user query, the invention extracts the text surrounding the positions where the query word occurs in the document and returns it to the user. Because one document can be recalled by different query words, the invention can form different summary text for the same document depending on the query word. In addition, the keywords in the given text are displayed in bold, in red, or in a similar style to distinguish them from plain text, so that the user can quickly find the relevant information.

Brief description of the drawings

Fig. 1 is a schematic diagram of the web page summary extraction method of the invention.

Fig. 2 is a schematic diagram of the web page summary extraction device of the invention.

Detailed description of the embodiments

The aim of the invention is to extract summaries accurately from web pages. It mainly addresses the following problems:

1. Summarizing the general content of a retrieved web page, so that the user can quickly decide whether the page is relevant and whether to examine it further. This relieves the user's difficulty in quickly finding relevant documents.

2. For a search system, when responding to a user query, extracting the text surrounding the positions where the query word occurs in the document and returning it to the user.

3. The accuracy of summary extraction. The invention uses factors such as word frequency, title, and word distribution as criteria for judging sentence importance, and extracts the important sentences as the summary. This effectively improves the accuracy of summary extraction.

To solve the above technical problems, the web page summary extraction method of the invention uses the following techniques:

(1) Text processing. The invention uses this technique in two places: the query word (or title) preprocessing unit and the text preprocessing unit.

(2) Result ranking. The invention first splits the given document into sentences, mainly at sentence-ending punctuation marks such as "." and "!" (in full-width or half-width form). Each sentence is then scored, mainly on three factors: 1) the number of subquery words (obtained by segmenting the query word) contained in the sentence, counting repeats; 2) the number of consecutive subquery words contained in the sentence; 3) the kinds of subquery words contained in the sentence. Finally, the sentences are ranked by score and the K highest-scoring sentences are selected.

To help those skilled in the art further understand the features and technical content of the invention, the technical solution is described in detail below with reference to the drawings and embodiments.

In general, after a user chooses a search term according to the information wanted and enters it in the search box, a search engine can find content related to the term among a massive number of web pages within a very short time. This content is not fed back to the user directly, so the user cannot easily tell which of the retrieved pages are actually needed, and browsing every page would waste a great deal of time. It is therefore essential to present the content of each web page to the user quickly and efficiently in the form of a summary, so that the user can promptly judge whether the page is relevant.

Fig. 1 is a schematic diagram of the web page summary extraction method of the invention. As shown in Fig. 1, the method comprises:

Step S1: query word preprocessing

A word segmentation tool is used to segment the query word into several subquery words.

Specifically, a word frequency statistics tool, such as the IKAnalyzer segmenter, ICTCLAS, Ansj, or Jieba, can be used to segment the query word into several subquery words. The frequency of each subquery word and its distribution information within the query word are then counted; the distribution information includes the position of each subquery word in the query word and its preceding and following neighbors. Subquery words that are repeated in the query word are also identified.
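The statistics gathered in step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the query word has already been segmented by one of the tools named above, and the function name `query_term_stats` and the dictionary layout are hypothetical.

```python
from collections import Counter


def query_term_stats(subquery_words):
    """Collect frequency and distribution information for a segmented query.

    `subquery_words` is the ordered output of a word segmenter (assumed input).
    """
    freq = Counter(subquery_words)
    distribution = []
    last = len(subquery_words) - 1
    for pos, word in enumerate(subquery_words):
        distribution.append({
            "word": word,
            "position": pos,  # index of the subquery word inside the query
            "prev": subquery_words[pos - 1] if pos > 0 else None,
            "next": subquery_words[pos + 1] if pos < last else None,
        })
    # Subquery words that occur more than once in the query.
    repeated = sorted(w for w, n in freq.items() if n > 1)
    return freq, distribution, repeated


freq, distribution, repeated = query_term_stats(["A", "B", "A", "C"])
print(freq["A"], repeated)
```

Keeping the neighbors alongside each position makes it cheap to check later whether two subquery words found in a sentence were also adjacent in the query.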

Step S2: text preprocessing

The preprocessed web page text is divided into several sentences; a word segmentation tool segments each sentence, and the distribution information of each word in the web page text is saved.

Specifically, the text is split into sentences at sentence-ending punctuation marks such as "." and "!" (in full-width or half-width form). A word segmentation tool, such as the IKAnalyzer segmenter, then segments each sentence, and the position of each word in the text is saved, including its start position and end position.
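A minimal sketch of the sentence splitting in step S2 follows. The delimiter set is an assumption (full-width and half-width period, exclamation mark, and question mark); the question mark is included because the worked example later in this description treats a question as its own sentence. The helper `token_spans` uses whitespace splitting only as a stand-in for a real segmenter, and both function names are hypothetical.

```python
import re

# Assumed delimiter set: 。！？ and their half-width forms.
_SENTENCE_RE = re.compile(r"[^。！？.!?]+[。！？.!?]*")


def split_sentences(text):
    """Return (sentence, start, end) triples; `end` is exclusive."""
    sentences = []
    for m in _SENTENCE_RE.finditer(text):
        raw = m.group()
        sentence = raw.strip()
        if not sentence:
            continue
        # Shift the start past any leading whitespace that was stripped.
        start = m.start() + len(raw) - len(raw.lstrip())
        sentences.append((sentence, start, start + len(sentence)))
    return sentences


def token_spans(sentence):
    """Stand-in for a segmenter: whitespace tokens with start/end offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]


text = "Hello world. Is it ok? Yes!"
for sentence, start, end in split_sentences(text):
    print(start, end, sentence)
```

Note that splitting on a half-width "." also breaks decimals such as "3.14"; a production splitter would need extra rules for that case.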

Distribution information affects sentence scoring: sentences are scored according to the distribution of the subquery words. For example, if the query word contains subquery words A, B, and C, a sentence of the form "...AB...C..." or "...ABC..." will score higher than a sentence of the form "...A...B...C...".

Step S3: word frequency and word distribution statistics

For each sentence, count the number of subquery words it contains, the number of consecutive subquery words it contains, and the kinds of subquery words it contains; mark the subquery words that occur in the sentence and record which subquery word each one is. Here, the kinds of subquery words means the number of distinct subquery words that occur.
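The per-sentence counts of step S3 can be sketched as below. The sketch assumes each sentence is already a list of tokens from a segmenter; the function name `count_subquery_stats` is hypothetical, and the count of consecutive subquery words is left out to keep the example short.

```python
def count_subquery_stats(sentence_tokens, subquery_words):
    """Return (kinds, occur, hits) for one segmented sentence.

    kinds: number of distinct subquery words present in the sentence
    occur: total number of subquery word occurrences, repeats included
    hits:  (token index, subquery word) pairs, i.e. the marked occurrences
    """
    targets = set(subquery_words)
    hits = [(i, tok) for i, tok in enumerate(sentence_tokens) if tok in targets]
    occur = len(hits)
    kinds = len({tok for _, tok in hits})
    return kinds, occur, hits


kinds, occur, hits = count_subquery_stats(["A", "x", "B", "A"], ["A", "B", "C"])
print(kinds, occur, hits)
```

The `hits` list plays the role of the marking described above: it records where each subquery word occurs and which subquery word it is.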

Step S4: scoring

Based on the statistics of step S3, a scoring tool scores each sentence.

Specifically, the BM25 algorithm is used to compute a BM25 contribution score for each sentence, a word distribution contribution score is also computed for each sentence, and the two are weighted and combined to obtain the score of each sentence.

The BM25 algorithm is commonly used to score search relevance. Its main idea is as follows: parse the Query into morphemes $q_i$; then, for each search result D, compute the relevance score of each morpheme $q_i$ with respect to D; finally, take the weighted sum of these relevance scores to obtain the relevance score of the Query with respect to D.

The invention uses the BM25 formula to compute the BM25 contribution score. The formula is:

$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} w_i \cdot R(q_i, d) \tag{1}$$

where $Q$ is the query word and $d$ is a sentence. $\mathrm{Score}(Q, d)$ is the score of sentence $d$ for the input query word $Q$. Segmenting the query word yields the set of subquery words $\{q_i\}$. Each term of the sum on the right is the score of one subquery word with respect to the sentence, and the sum over all subquery words is the score of the sentence. $w_i$ is the weight of subquery word $q_i$, and $R(q_i, d)$ is the relevance score of subquery word $q_i$ with respect to sentence $d$.
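The weighted sum of formula (1) can be sketched in a few lines. The relevance function is passed in as a parameter because the text leaves its concrete form open; `term_frequency` below is only a toy stand-in chosen for illustration, and both function names are hypothetical.

```python
def weighted_score(subquery_words, sentence_tokens, relevance, weights=None):
    """Formula (1): sum over i of w_i * R(q_i, d)."""
    if weights is None:
        # Default weight of 1 per subquery word, as the description later sets.
        weights = [1.0] * len(subquery_words)
    return sum(w * relevance(q, sentence_tokens)
               for w, q in zip(weights, subquery_words))


def term_frequency(q, sentence_tokens):
    """Toy relevance R(q, d): how often q occurs in the sentence (assumption)."""
    return sentence_tokens.count(q)


sentence = ["A", "x", "A", "B"]
print(weighted_score(["A", "B"], sentence, term_frequency))
print(weighted_score(["A", "B"], sentence, term_frequency, weights=[2.0, 0.5]))
```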

$R(q_i, d)$ in the formula is the score of one subquery word with respect to one sentence; its concrete definition can use different formulas depending on the application. In the invention, the weight $w_i$ of every subquery word is set to 1, and the formula actually applied is:

$$\mathrm{score} = \mathrm{kinds} + \frac{\mathrm{continuation} + \mathrm{occur}}{100.0} \tag{2}$$

where kinds is the number of kinds of subquery words contained in the sentence, continuation is the number of consecutive subquery words, and occur is the number of subquery words contained in the sentence.
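Formula (2) is simple enough to check by hand; the sketch below only restates it in code, with a hypothetical function name. Dividing by 100 means that kinds dominates the ranking and the other two counts act as tie-breakers, as long as continuation + occur stays below 100.

```python
def sentence_score(kinds, continuation, occur):
    """Formula (2): score = kinds + (continuation + occur) / 100.0"""
    return kinds + (continuation + occur) / 100.0


# A sentence with both of two subquery words, adjacent once:
print(round(sentence_score(kinds=2, continuation=1, occur=2), 2))

# One distinct subquery word repeated many times still ranks below a
# sentence that contains two distinct subquery words.
print(sentence_score(1, 40, 50) < sentence_score(2, 0, 2))
```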

When the word distribution contribution score is computed for each sentence, scoring is based mainly on the position information of the subquery words: if several subquery words appear together in order, a higher score is given. In the invention, the distribution information is in fact the continuation term of formula (2): each time two subquery words are adjacent in a sentence, the continuation value of that sentence is increased by 1. Specifically, continuation starts at 0 and is incremented by 1 for every pair of adjacent subquery words, so that "...AB...C..." has continuation = 1, "...ABC..." has continuation = 2, and "...A...B...C..." has continuation = 0.
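The three continuation values quoted above can be reproduced with a short sketch. It counts adjacent token pairs in which both tokens are subquery words; whether the pair must also follow the order of the query is not spelled out in the text, so this sketch ignores order, which is an assumption. The function name is hypothetical and "x" stands for any other token.

```python
def count_continuation(sentence_tokens, subquery_words):
    """Count adjacent token pairs in which both tokens are subquery words."""
    targets = set(subquery_words)
    return sum(
        1
        for left, right in zip(sentence_tokens, sentence_tokens[1:])
        if left in targets and right in targets
    )


query = ["A", "B", "C"]
print(count_continuation(["x", "A", "B", "x", "C", "x"], query))  # ...AB...C...
print(count_continuation(["x", "A", "B", "C", "x"], query))       # ...ABC...
print(count_continuation(["A", "x", "B", "x", "C"], query))       # ...A...B...C...
```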

Step S5: ranking and summary generation

The sentences are sorted by score from high to low, and the top K sentences are selected to generate the summary, where 0≤K≤3; in general, K is 3.

In step S5, if the summary generated from the K sentences is longer than the user-defined length, the content beyond that length is deleted to obtain a summary of suitable length. Each subquery word is also marked, for example by displaying it in bold or in red, to distinguish it from plain text.
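Step S5 can be sketched as follows. The function name `build_summary`, the direct concatenation of sentences, and the character-based length limit are assumptions made for illustration; the `<font color='red'>` markup mirrors the red marking mentioned in the text. Truncation is applied before highlighting so that a tag is never cut in half, and no HTML escaping is performed.

```python
import re


def build_summary(sentences, scores, subquery_words, k=3, max_len=None):
    """Pick the k best sentences, trim to max_len, highlight subquery words."""
    # sorted() is stable, so sentences with equal scores keep document order.
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    summary = "".join(sentences[i] for i in order[:k])
    if max_len is not None and len(summary) > max_len:
        summary = summary[:max_len]
    words = sorted(set(subquery_words), key=len, reverse=True)
    if words:
        # One regex pass, so text inside already-inserted tags is not rescanned.
        pattern = re.compile("|".join(re.escape(w) for w in words))
        summary = pattern.sub(
            lambda m: "<font color='red'>" + m.group() + "</font>", summary)
    return summary


sentences = ["aa.", "A B here.", "cc."]
scores = [0, 2.03, 0]
print(build_summary(sentences, scores, ["A", "B"], k=2))
print(build_summary(sentences, scores, ["A", "B"], k=2, max_len=5))
```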

To further illustrate the web page summary extraction method of the invention, an example is given below.

Taking the query word "the joyous New Year's Day" as an example, a web page summary is extracted according to the steps above.

First step: input the query word and the text

Query word: the joyous New Year's Day

Text:

[it is said, the Zhu Zhishan of the Ming Dynasty, once once teased such with people.When the joyous New Year's Day, wish that a family rich family is crossed in front of the door in a hill path, see on red-painted gate and against a red paper.He thinks, and just makes people bring pen and ink, has write 12 large words: this year, very good bad luck was entered completely without wealth above.That family owner out at a glance, reads: " this year, very good bad luck, entered completely without wealth." One's face was ghastly pale immediately.Does he think that this fuzzy is my bad luck tactile? so he will just start to beat Zhu Zhishan.And Zhu Zhishan laughs at him, say unhurriedly: " what I obviously write is good word, and yourself fails to understand! " finish, he just with pen put several under, just become: " this year is very good, unlucky completely without, wealth is entered." that owner seen, can't help turning anger for happiness, say "good, good" again and again.】

Second step: query word preprocessing

The query word is segmented into two words: "the new year" and "happy festival time".

Third step: text preprocessing

The text is split into sentences, with the following result, one sentence per line:

[it is said, the Zhu Zhishan of the Ming Dynasty, once once teased such with people.

When the joyous New Year's Day, wish that a family rich family is crossed in front of the door in a hill path, see on red-painted gate and against a red paper.

He thinks, and just makes people bring pen and ink, has write 12 large words: this year, very good bad luck was entered completely without wealth above.

That family owner out at a glance, reads: " this year, very good bad luck, entered completely without wealth." One's face was ghastly pale immediately.

Does he think that this fuzzy is my bad luck tactile?

So he will just start to beat Zhu Zhishan.

And Zhu Zhishan laughs at him, say unhurriedly: " what I obviously write is good word, and yourself fails to understand! " finish, he just with pen put several under, just become: " this year is very good, unlucky completely without, wealth is entered." that owner seen, can't help turning anger for happiness, say "good, good" again and again.】

Fourth step: score the sentences

The scoring formula is: score = kinds + (continuation + occur)/100.0.

Here kinds is the number of kinds of subquery words contained in the sentence, continuation is the number of consecutive subquery words, and occur is the number of subquery words contained in the sentence. The three values kinds, continuation, and occur are computed for each sentence. The first two sentences are worked through below:

1. Query word: the joyous New Year's Day

Sentence: it is said, the Zhu Zhishan of the Ming Dynasty, once once teased such with people.

kinds is initialized to 0. Check whether the sentence contains "the new year": it does not, so kinds is unchanged. Check whether the sentence contains "happy festival time": it does not, so kinds is unchanged. Thus kinds = 0.

continuation is initialized to 0. No consecutive subquery words occur in the sentence, so continuation = 0.

The number of subquery words occurring in the sentence is 0, so occur = 0.

Therefore, the score of this sentence is score = 0.

2. Query word: the joyous New Year's Day

Sentence: When the joyous New Year's Day, wish that a family rich family is crossed in front of the door in a hill path, see on red-painted gate and against a red paper.

kinds is initialized to 0. Check whether the sentence contains "the new year": it does, so kinds is increased by 1. Check whether the sentence contains "happy festival time": it does, so kinds is increased by 1. Thus kinds = 2.

continuation is initialized to 0. The two subquery words occur consecutively in the sentence, so continuation = 1.

The number of subquery words occurring in the sentence is 2, so occur = 2.

Therefore, the score of this sentence is score = 2 + (1 + 2)/100.0 = 2.03.

Following the same method, every sentence other than the second has a score of 0.

Fifth step: generate the summary

The three highest-scoring sentences are selected as the summary, and the query word is marked in red. The result is as follows:

When <font color='red'> the joyous New Year's Day </font>, wish that a family rich family is crossed in front of the door in a hill path, see on red-painted gate and against a red paper.He thinks, and just makes people bring pen and ink, has write 12 large words: this year, very good bad luck was entered completely without wealth above.That family owner out at a glance, reads: " this year, very good bad luck, entered completely without wealth.

The invention also provides a device for extracting a summary from a web page. As shown in Fig. 2, the device comprises: a query word preprocessing unit 1, which uses a word segmentation tool to segment the query word into several subquery words and counts the frequency of each subquery word and its distribution information within the query word; a text preprocessing unit 2, which divides the text into several sentences, uses a word segmentation tool to segment each sentence, saves the distribution information of each word in the web page text, and counts the number of sentences and the number of words in the text; a word frequency and word distribution information unit 3, which counts the number of sentences in which at least one subquery word occurs, counts the frequency with which a given subquery word occurs in each sentence, counts the kinds of subquery words occurring in each sentence, and marks the subquery words occurring in each sentence with a record of which subquery word each one is; a scoring unit 4, which uses a scoring tool to score each sentence; a sequencing unit 5, which sorts the sentences by score from high to low; and a summary organizing unit 6, which selects the top K sentences to generate the summary, where 0≤K≤3.

In the invention, if the summary generated from the K sentences is longer than the set summary length, sequencing unit 5 deletes the content that exceeds the summary length and marks the subquery words occurring in the summary. In addition, the distribution information of each subquery word in the query word includes its position in the query word and its preceding and following neighbors, and the distribution information of each word in the web page text includes its start position and end position in the text.

Scoring unit 4 computes a BM25 contribution score and a word distribution contribution score for each sentence, then weights and combines the two to obtain the score of each sentence.
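The six units can be pictured as six methods of one class. The sketch below is only an illustration under several assumptions: the class and method names are hypothetical, the default tokenizer splits on non-word characters as a stand-in for a real segmenter such as IKAnalyzer, the delimiter set includes question marks, and truncation is placed in the last method for simplicity.

```python
import re


class SummaryExtractor:
    """Illustrative pipeline: one method per unit of the device."""

    def __init__(self, tokenize=None, k=3, max_len=None):
        # Stand-in segmenter; only suitable for space-delimited text.
        self.tokenize = tokenize or (lambda s: re.findall(r"\w+", s))
        self.k = k
        self.max_len = max_len

    def preprocess_query(self, query):            # unit 1
        return self.tokenize(query)

    def preprocess_text(self, text):              # unit 2
        parts = re.findall(r"[^。！？.!?]+[。！？.!?]*", text)
        return [p.strip() for p in parts if p.strip()]

    def collect_stats(self, sentence, terms):     # unit 3
        targets = set(terms)
        tokens = self.tokenize(sentence)
        flags = [t in targets for t in tokens]
        occur = sum(flags)
        kinds = len({t for t in tokens if t in targets})
        continuation = sum(a and b for a, b in zip(flags, flags[1:]))
        return kinds, continuation, occur

    def score(self, stats):                       # unit 4, formula (2)
        kinds, continuation, occur = stats
        return kinds + (continuation + occur) / 100.0

    def rank(self, sentences, scores):            # unit 5
        order = sorted(range(len(sentences)), key=lambda i: -scores[i])
        return [sentences[i] for i in order]

    def organize(self, ranked, terms):            # unit 6
        summary = " ".join(ranked[: self.k])
        if self.max_len is not None:
            summary = summary[: self.max_len]
        words = sorted(set(terms), key=len, reverse=True)
        if words:
            pattern = re.compile("|".join(re.escape(w) for w in words))
            summary = pattern.sub(
                lambda m: "<font color='red'>" + m.group() + "</font>", summary)
        return summary

    def extract(self, query, text):
        terms = self.preprocess_query(query)
        sentences = self.preprocess_text(text)
        scores = [self.score(self.collect_stats(s, terms)) for s in sentences]
        return self.organize(self.rank(sentences, scores), terms)


text = ("Long ago there was a painter. At the spring festival he passed "
        "a gate. He wrote some words.")
print(SummaryExtractor(k=1).extract("spring festival", text))
```

Passing the tokenizer in through the constructor keeps the pipeline independent of any particular segmenter, so a Chinese word segmenter could be substituted without touching the other methods.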

In the invention, a word frequency statistics tool, such as the IKAnalyzer segmenter, can be used for word segmentation.

The web page summary extraction device of the invention can accurately extract summaries from web pages for the user's reference, which considerably reduces the user's search time.

Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims

1. A method for extracting a summary from a web page, used to extract a summary from web pages that are retrieved by entering a query word selected by the user and are related to that query word, characterized in that it comprises the following steps:

(1) a query word preprocessing step

(2) a text preprocessing step

The preprocessed web page text is divided into several sentences; a word segmentation tool segments each sentence, and the distribution information of each subquery word in the web page text is saved;

(3) a word frequency and word distribution statistics step

For each sentence, the number of subquery words it contains, the number of consecutive subquery words it contains, and the kinds of subquery words it contains are counted, and the subquery words occurring in the sentence are marked, with a record of which subquery word each one is;

(4) a scoring step

Based on the statistics of step (3), a scoring tool scores each sentence;

(5) a ranking and summary generation step

The sentences are sorted by score from high to low, and the top K sentences are selected to generate the summary, where 0≤K≤3.

2. The method for extracting a summary from a web page according to claim 1, characterized in that step (5) further comprises: if the summary generated from the K sentences is longer than the set summary length, deleting the content that exceeds the summary length, and marking the subquery words occurring in the summary.

3. The method for extracting a summary from a web page according to claim 1, characterized in that the distribution information of each subquery word in the query word includes the position of the subquery word in the query word and its preceding and following neighbors, and the distribution information of each subquery word in the web page text includes the start position and end position of the subquery word in the text.

4. The method for extracting a summary from a web page according to claim 1, characterized in that the scoring tool in step (4) is the BM25 algorithm, specifically: a BM25 contribution score and a word distribution contribution score are computed for each sentence, and the two are weighted and combined to obtain the score of each sentence.

5. The method for extracting a summary from a web page according to claim 4, characterized in that the formula score = kinds + (continuation + occur)/100.0 is used to compute the BM25 contribution score of each sentence, where kinds is the number of kinds of subquery words contained in the sentence, continuation is the number of consecutive subquery words, and occur is the number of subquery words contained in the sentence.

6. The method for extracting a summary from a web page according to any one of claims 1 to 4, characterized in that the word segmentation tool is the IKAnalyzer segmenter, ICTCLAS, Ansj, or Jieba.

7. A device for extracting a summary from a web page, characterized in that it comprises:

a query word preprocessing unit, which uses a word segmentation tool to segment the query word into several subquery words and counts the frequency of each subquery word and its distribution information within the query word;

a text preprocessing unit, which divides the text into several sentences, uses a word segmentation tool to segment each sentence, saves the distribution information of each word in the web page text, and counts the number of sentences and the number of words in the text;

a word frequency and word distribution information unit, which counts the number of sentences in which at least one subquery word occurs, counts the frequency with which a given subquery word occurs in each sentence, counts the kinds of subquery words occurring in each sentence, and marks the subquery words occurring in each sentence with a record of which subquery word each one is;

a scoring unit, which uses a scoring tool to score each sentence;

a sequencing unit, which sorts the sentences by score from high to low; and

a summary organizing unit, which selects the top K sentences to generate the summary, where 0≤K≤3.