CN100511214C - Method and system for abstracting batch single document for document set - Google Patents

Method and system for abstracting batch single document for document set

Info

Publication number
CN100511214C
CN100511214C CNB2006101145906A CN200610114590A
Authority
CN
China
Prior art keywords
document
sentence
cluster
document cluster
information richness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006101145906A
Other languages
Chinese (zh)
Other versions
CN101187919A (en)
Inventor
万小军
杨建武
吴於茜
陈晓鸥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIDA FANGZHENG TECHN INST Co Ltd BEIJING, Peking University, Peking University Founder Group Co Ltd filed Critical BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Priority to CNB2006101145906A priority Critical patent/CN100511214C/en
Publication of CN101187919A publication Critical patent/CN101187919A/en
Application granted granted Critical
Publication of CN100511214C publication Critical patent/CN100511214C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and system for batch single-document summarization of a document set, and belongs to the technical field of natural language processing. Almost all existing single-document summarization methods use only the information of the document itself. The method of the invention generates single-document summaries in batch for all documents in a given document set. First, the document set is clustered into several document clusters, so that documents in the same cluster share a similar topic. For each document cluster, the overall importance of all sentences in the cluster is estimated in one computation; then, within each document of the cluster, a diversity penalty is applied to the sentences; finally, the truly important and novel sentences are selected from each document to form its summary. The method of the invention improves on existing graph-ranking-based single-document summarization methods, obtains better results in actual evaluations, and raises summarization efficiency through batch processing.

Description

Method and system for batch single-document summarization of a document set
Technical field
The invention belongs to the technical fields of language text processing and information retrieval, and specifically relates to a method and system for performing batch single-document summarization on a document set.
Background technology
Single-document automatic summarization means automatically extracting the gist or main points from a given document; its purpose is to compress and distill the original text so as to provide the user with a brief and concise description of its content. Single-document summarization is one of the key problems of the natural language processing field, and is widely used in document/Web search engines, enterprise content management systems and knowledge management systems (such as Founder's enterprise content and knowledge management products).
Broadly, summarization methods can be divided into sentence-extraction-based methods (Extraction) and sentence-generation-based methods (Abstraction). Generation-based methods require deep natural-language-understanding technology: after syntactic and semantic analysis of the original document, information extraction or natural language generation techniques are used to produce new sentences that form the summary. Extraction-based methods are simpler and more practical and need no deep natural-language-understanding technology; after splitting the text into sentences, each sentence is assigned a weight reflecting its importance, and the several highest-weighted sentences are selected to form the summary. A key step of sentence extraction is assigning weights to sentences to reflect their importance, which usually requires jointly considering different sentence features, for example word frequency, sentence position, cue words and stigma words. Most current summarization methods are based on sentence extraction, and the existing literature records many methods for single-document summarization.
The article "The automated acquisition of topic signatures for text summarization" (C.-Y. Lin and E. Hovy, Proceedings of ACL 2000) describes the SUMMARIST system, which represents document topics with topic signatures; a topic signature consists of a topic concept and some related words, and sentences are extracted according to the topic signatures to form the summary. The article "Efficient text summarization using lexical chains" (H. G. Silber and K. McCoy, Proceedings of the 5th International Conference on Intelligent User Interfaces, 2000) first analyzes the document to obtain lexical chains, a lexical chain being a sequence of related words in the document; each sentence is weighted by the total chain value of the words it contains. The article "A trainable document summarizer" (J. Kupiec, J. Pedersen and F. Chen, Proceedings of SIGIR 1995) treats summarization as a binary classification problem of whether a sentence belongs to the summary, and uses a Bayesian classifier to combine various features for sentence selection. The article "The use of MMR, diversity-based reranking for reordering documents and producing summaries" (Jaime Carbonell and Jade Goldstein, Proceedings of SIGIR 1998) describes the maximal marginal relevance (MMR) technique, commonly used to extract sentences that are both relevant to the document query and novel to a certain degree. The article "Generic text summarization using relevance measure and latent semantic analysis" (Y. H. Gong and X. Liu, Proceedings of SIGIR 2001) applies latent semantic analysis (LSA) to extract sentences in a new semantic space; each time the sentence most relevant to the document according to the relevance measure is extracted, the words contained in that sentence are removed from the document, thereby guaranteeing the novelty of each extracted sentence. In addition, the articles "TextRank: bringing order into texts" (R. Mihalcea and P. Tarau, Proceedings of EMNLP 2004) and "A language independent algorithm for single and multiple document summarization" (R. Mihalcea and P. Tarau, Proceedings of IJCNLP 2005) propose graph-ranking-based methods for ranking the sentences of a document. The sentences of the document serve as the vertices of a graph and are connected according to inter-sentence similarity, and sentence importance is then computed with PageRank- or HITS-like algorithms. These methods rest on sentences "voting for" or "recommending" one another: adjacent sentences vote for each other, the more votes a sentence receives the more important it is, and the importance of a voter determines the weight of the votes it casts.
The above single-document summarization methods all use only the information of the single document itself, do not use the information of other related documents, and must run all computation steps separately for every document to obtain its summary. In practice, many applications need a single-document summary for every document in a large-scale document collection. Such a collection contains different document clusters; the documents of the same cluster are topically related and exhibit information redundancy, so important information reflected in one document is usually also reflected in several other documents of the cluster.
Summary of the invention
In view of the defects of existing single-document summarization technology, the purpose of the invention is to provide a method for batch single-document summarization of a document set. The main idea of the invention is as follows: the information redundancy among similar documents is used to better weigh the importance of the sentences in the document to be summarized, so that a better single-document summary can be generated for that document. Clustering the given document collection yields several document clusters, each reflecting one topic and containing similar documents. The method performs batch single-document summarization for all documents in a single cluster; that is, the information richness of all sentences in the cluster's documents is computed once, rather than separately for the sentences of each document. On the one hand, the method can extract the truly important sentences and form higher-quality summaries; on the other hand, batch computation saves summary-generation time.
To achieve the above purpose, the technical solution adopted by the invention is a method for batch single-document summarization of a document set, comprising the following steps:
Step 1: cluster the given document collection D to obtain k document clusters C_1, ..., C_k, where k is a positive integer;
Step 2: perform batch single-document summarization on the documents in each of the above document clusters.
Further, the method of performing batch single-document summarization on the documents in a document cluster is:
Step 2.1: read in all documents in document cluster C_i, split every document into sentences and words, obtain the cluster sentence set S = {s_1, s_2, ..., s_n}, where n is the number of all sentences in the cluster, and build the sentence relation graph G from this sentence set;
Step 2.2: iteratively compute the information richness of each sentence from the sentence relation graph G obtained above;
Step 2.3: for the sentences of every document d in cluster C_i, apply the within-document diversity penalty to obtain the final weight of each sentence in the document;
Step 2.4: according to the final weights of the sentences in document d, select the sentences with the largest weights to form the summary of the document.
Further, when clustering the given document set D to generate k document clusters, the concrete method is the k-means clustering algorithm.
Further, the concrete steps of generating k document clusters with the k-means clustering algorithm are as follows:
Step 4.1: randomly select k documents from document set D as the mean points of the k clusters, and assign each document in D to the cluster most similar to it; the similarity between a document and a cluster is measured by the cosine similarity between the document and the cluster's mean point, and word weights are computed with the TFIDF formula;
Step 4.2: recompute the mean points of the k clusters and reassign the documents in D to the most similar clusters; a cluster's mean-point vector is the mean of the document vectors in the cluster;
Step 4.3: repeat step 4.2 until no cluster changes any more.
Further, when clustering the given document set D, the clustering algorithm may also be a hierarchical agglomerative algorithm, a partitioning algorithm, a self-organizing map algorithm, or a kernel clustering algorithm.
Further, when clustering the given document set D, the number k of document clusters is provided by the user according to prior knowledge.
Further, the steps of building the sentence relation graph G from the sentence set S of the documents in document cluster C_i are as follows:
For any two different sentences s_i and s_j in S, compute their similarity with the cosine formula:

    sim(s_i, s_j) = \cos(\vec{s}_i, \vec{s}_j) = \frac{\vec{s}_i \cdot \vec{s}_j}{\|\vec{s}_i\| \cdot \|\vec{s}_j\|}    (1)

where 1 ≤ i, j ≤ n and i ≠ j. Each dimension of a sentence vector corresponds to a word in the sentence; the weight of word t is tf_t × isf_t, where tf_t is the frequency of t in the sentence and isf_t is the inverse sentence frequency of t, namely 1 + log(N/n_t), N being the number of all sentences in the background document set and n_t the number of sentences that contain t.
If sim(s_i, s_j) > 0, a connection is established between s_i and s_j, that is, an edge between s_i and s_j is added to graph G.
The adjacency matrix M = (M_{i,j})_{n×n} of the resulting graph G is defined as:

    M_{i,j} = \begin{cases} sim(s_i, s_j) & i \neq j \\ 0 & \text{otherwise} \end{cases}    (2)

Matrix M is normalized as follows so that each row sums to 1, yielding the new adjacency matrix M̃:

    \tilde{M}_{i,j} = \begin{cases} M_{i,j} / \sum_{k=1}^{n} M_{i,k} & \text{if } \sum_{k=1}^{n} M_{i,k} \neq 0 \\ 0 & \text{otherwise} \end{cases}    (3)
Further, the information richness of the sentences is computed iteratively from graph G with the following method:
After obtaining the sentence adjacency matrix M̃, the information richness InfoRich(s_i) of each sentence s_i in the sentence set S is computed iteratively with the following formula:

    InfoRich(s_i) = d \cdot \sum_{j \neq i} InfoRich(s_j) \cdot \tilde{M}_{j,i} + \frac{1-d}{n}    (4)

where InfoRich(s_j) on the right-hand side of formula (4) is the information richness of sentence s_j computed in the previous iteration, InfoRich(s_i) on the left-hand side is the new information richness of sentence s_i obtained in the current iteration, and d is a damping factor.
In matrix form, the above formula is:

    \vec{\lambda} = d \, \tilde{M}^{T} \vec{\lambda} + \frac{1-d}{n} \vec{e}    (5)

where \vec{\lambda} is an n-dimensional vector whose components are the information richness scores of the sentences, the superscript T denotes matrix transposition, and \vec{e} is an n-dimensional all-ones vector.
In each iteration, the new information richness of every sentence is computed with the above formula from the information richness values of the previous iteration, until the values obtained in two successive iterations no longer change for any sentence, or, in actual computation, until the change in information richness of every sentence is smaller than a preset threshold.
Further, the damping factor d is 0.85, and when the iteration stops once the change in sentence information richness falls below a threshold, the threshold is set to 0.0001.
Further, the concrete method of applying the within-document diversity penalty to the sentences of every document d in cluster C_i, to obtain the final weight of each sentence in the document, is as follows:
Step 10.1: let S_d be the sentence set of document d, with m sentences, m < n, and let G_d be the local sentence relation graph of the document, whose vertex set is S_d. Its adjacency matrix M_d = (M_d)_{m×m} is obtained by extracting the corresponding elements from the adjacency matrix M of the sentence relation graph G: if two sentences of document d are denoted s_i and s_j in the local graph G_d and s_{i'} and s_{j'} in the sentence relation graph G, then (M_d)_{i,j} = M_{i',j'}. M_d is then normalized to M̃_d so that each row sums to 1;
Step 10.2: for document d, initialize two sets A = ∅ and B = {s_i | i = 1, 2, ..., m}, so B contains all sentences of document d; the final weight of each sentence is initialized to its information richness, that is, ARScore(s_i) = InfoRich(s_i), i = 1, 2, ..., m;
Step 10.3: sort the sentences in B in descending order of their current final weights;
Step 10.4: suppose s_i is the highest-ranked sentence, that is, the first sentence in the order; move s_i from B to A, and apply the following diversity penalty to every sentence s_j in B adjacent to s_i, j ≠ i:

    ARScore(s_j) = ARScore(s_j) - (\tilde{M}_d)_{j,i} \cdot InfoRich(s_i)    (6)

Step 10.5: repeat steps 10.3 and 10.4 until B = ∅.
Further, when the final weights of the sentences in document d have been determined, the 2-10 sentences with the largest weights are selected to form the summary of the document.
The invention also provides a system for batch single-document summarization of a document set.
The system comprises the following devices: a document clustering device and a batch single-document summarization device.
The document clustering device is used to cluster the given document collection D into k document clusters C_1, ..., C_k, where k is a positive integer.
The batch single-document summarization device is used to perform batch single-document summarization on the documents in each document cluster, and specifically comprises:
a document reading device, used to read in all documents in document cluster C_i, split every document into sentences and words to obtain the cluster sentence set S = {s_1, s_2, ..., s_n}, where n is the number of all sentences in the cluster, and build the sentence relation graph G from this sentence set;
an information richness computing device, used to iteratively compute the information richness of each sentence from the above sentence relation graph G;
a weight computing device, used to apply the within-document diversity penalty to the sentences of every document d in cluster C_i, thereby obtaining the final weight of each sentence in the document;
a summary output device, used to screen the final weights of the sentences in document d and select the sentences with the largest weights to form and output the summary of the document.
The effect of the invention is as follows: the method of the invention overcomes the shortcoming that existing single-document summarization methods do not consider the information redundancy among similar documents, and can extract the truly important sentences from a single document. The invention achieves this effect because it uses document clustering to gather similar documents into the same document cluster; the documents of the same cluster exhibit strong information redundancy, and on this basis the "votes" or "recommendations" between the sentences of different documents in the same cluster are used to estimate sentence importance.
In addition, because the information richness of the sentences of a whole document cluster is obtained in a single computation, the efficiency of summary generation is improved, and single-document summaries can be generated in batch for all documents in the cluster.
Description of drawings
Fig. 1 is the process flow diagram of the method for the invention.
Embodiment
The method of the invention is further illustrated below with reference to an embodiment and the accompanying drawing.
The main idea of the invention is as follows: the information redundancy among similar documents is used to better weigh the importance of the sentences in the document to be summarized, so that a better single-document summary can be generated for that document. Clustering the given document collection yields several document clusters, each reflecting one topic and containing similar documents. The method performs batch single-document summarization for all documents in a single cluster; that is, the information richness of all sentences in the cluster's documents is computed once, rather than separately for the sentences of each document. On the one hand, the method can extract the truly important sentences and form higher-quality summaries; on the other hand, batch computation saves summary-generation time.
Fig. 1 is the flow diagram of the method of the invention for batch single-document summarization of a document set. As shown in the figure, the method comprises the following steps:
Step 101: cluster the given document collection D with the k-means clustering algorithm to obtain k document clusters.
The concrete method of k-means clustering of document set D is as follows:
1) Randomly select k documents from document set D as the mean points of the k clusters, and assign each document in D to the cluster most similar to it. The similarity between a document and a cluster is measured by the cosine similarity between the document and the cluster's mean point, and word weights are computed with the TFIDF formula.
2) Recompute the mean points of the k clusters, then reassign the documents in D to the most similar clusters. A cluster's mean-point vector is the mean of the document vectors in the cluster.
3) Repeat step 2) until no cluster changes any more.
k is generally provided by the user according to prior knowledge, or set to k = \sqrt{|D|}, where |D| is the number of documents in document set D. A minimal sketch of this clustering step is given below.
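The following Python sketch illustrates this clustering step; the use of scikit-learn and the function name cluster_documents are illustrative assumptions rather than part of the patent. On L2-normalized TFIDF vectors, Euclidean k-means matches the cosine-similarity assignment described above.

import math
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def cluster_documents(docs, k=None, seed=0):
    # Step 101 sketch: TFIDF document vectors, then k-means clustering.
    # If k is not given, fall back to the heuristic k = sqrt(|D|).
    if k is None:
        k = max(1, round(math.sqrt(len(docs))))
    vectors = normalize(TfidfVectorizer().fit_transform(docs))
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vectors)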
Step 102: for each document cluster C_i obtained, perform steps 103-105 to carry out batch single-document summarization of the documents in the cluster.
Step 103: read in all documents in document cluster C_i, split every document into sentences and words, and obtain the cluster sentence set S = {s_1, s_2, ..., s_n}, where n is the number of all sentences in the cluster; build the sentence relation graph G from this sentence set.
The steps of building the sentence relation graph G from the sentence set S of the documents in cluster C_i are as follows:
For any two different sentences s_i and s_j in S, compute their similarity with the cosine formula:

    sim(s_i, s_j) = \cos(\vec{s}_i, \vec{s}_j) = \frac{\vec{s}_i \cdot \vec{s}_j}{\|\vec{s}_i\| \cdot \|\vec{s}_j\|}    (1)

where each dimension of a sentence vector corresponds to a word in the sentence; the weight of word t is tf_t × isf_t, where tf_t is the frequency of t in the sentence and isf_t is the inverse sentence frequency of t, namely 1 + log(N/n_t), N being the number of all sentences in the background document set and n_t the number of sentences that contain t; the background document set is usually large.
If sim(s_i, s_j) > 0, a connection is established between s_i and s_j, that is, an edge between s_i and s_j is added to graph G.
The adjacency matrix M = (M_{i,j})_{n×n} of the resulting graph G is defined as:

    M_{i,j} = \begin{cases} sim(s_i, s_j) & i \neq j \\ 0 & \text{otherwise} \end{cases}    (2)

Matrix M is normalized as follows so that each row sums to 1, yielding the new adjacency matrix M̃:

    \tilde{M}_{i,j} = \begin{cases} M_{i,j} / \sum_{k=1}^{n} M_{i,k} & \text{if } \sum_{k=1}^{n} M_{i,k} \neq 0 \\ 0 & \text{otherwise} \end{cases}    (3)

A sketch of this construction appears below.
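The following is a minimal Python sketch of the graph construction, assuming the sentences have already been segmented into token lists and taking the cluster's own sentences as the background set for computing isf (an assumption; the patent only requires a large background document set). Only numpy is used; the function name normalized_adjacency is illustrative.

import math
from collections import Counter
import numpy as np

def normalized_adjacency(sentences):
    # sentences: list of token lists for all n sentences of the cluster.
    n = len(sentences)
    n_t = Counter(t for s in sentences for t in set(s))       # sentence frequency of each word
    isf = {t: 1.0 + math.log(n / c) for t, c in n_t.items()}  # isf_t = 1 + log(N / n_t)
    index = {t: j for j, t in enumerate(n_t)}
    vecs = np.zeros((n, len(index)))
    for i, s in enumerate(sentences):                         # tf * isf sentence vectors
        for t, tf in Counter(s).items():
            vecs[i, index[t]] = tf * isf[t]
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    M = (vecs / norms) @ (vecs / norms).T                     # cosine similarities, formula (1)
    np.fill_diagonal(M, 0.0)                                  # zero diagonal, formula (2)
    sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M), where=sums != 0)  # row normalization, formula (3)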
Step 104: iteratively compute the information richness of each sentence from the sentence relation graph G.
In this embodiment, the information richness of the sentences is computed from graph G with the following method:
The information richness of a sentence reflects how much of the topic's information the sentence contains. After obtaining the sentence adjacency matrix M̃, the information richness InfoRich(s_i) of each sentence s_i in the sentence set S is computed iteratively with the following formula:

    InfoRich(s_i) = d \cdot \sum_{j \neq i} InfoRich(s_j) \cdot \tilde{M}_{j,i} + \frac{1-d}{n}    (4)

where InfoRich(s_j) on the right-hand side of formula (4) is the information richness of sentence s_j computed in the previous iteration, InfoRich(s_i) on the left-hand side is the new information richness of sentence s_i obtained in the current iteration, and d is a damping factor, set to 0.85 in this embodiment.
In matrix form, the above formula is:

    \vec{\lambda} = d \, \tilde{M}^{T} \vec{\lambda} + \frac{1-d}{n} \vec{e}    (5)

where \vec{\lambda} is an n-dimensional vector whose components are the information richness scores of the sentences, the superscript T denotes matrix transposition, and \vec{e} is an n-dimensional all-ones vector.
In each iteration, the new information richness of every sentence is computed with the above formula from the information richness values of the previous iteration, until the values obtained in two successive iterations no longer change for any sentence, or, in actual computation, until the change in information richness of every sentence is smaller than a preset threshold. In this embodiment, the threshold is set to 0.0001. A power-iteration sketch follows.
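A minimal sketch of the iteration of formulas (4) and (5): damped power iteration over the normalized adjacency matrix, as in PageRank. Parameter defaults follow the embodiment (d = 0.85, threshold 0.0001); the max_iter safeguard and uniform initialization are illustrative assumptions.

import numpy as np

def info_richness(M_tilde, d=0.85, tol=1e-4, max_iter=1000):
    n = M_tilde.shape[0]
    scores = np.full(n, 1.0 / n)                        # initial InfoRich values
    for _ in range(max_iter):
        new = d * (M_tilde.T @ scores) + (1.0 - d) / n  # formula (5)
        if np.max(np.abs(new - scores)) < tol:          # stop when every change < threshold
            return new
        scores = new
    return scores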
Step 105: for every document d in cluster C_i, apply the within-document diversity penalty to the sentences of the document to obtain the final weight of each sentence in the document; according to those final weights, select the sentences with the largest weights to form the summary of the document.
The concrete method of applying the within-document diversity penalty to the sentences of document d is as follows:
1) Let S_d be the sentence set of document d, with m sentences (m < n), and let G_d be the local sentence relation graph of the document, whose vertex set is S_d. Its adjacency matrix M_d = (M_d)_{m×m} is obtained by extracting the corresponding elements from the adjacency matrix M of the sentence relation graph G obtained in step 103: if two sentences of document d are denoted s_i and s_j in the local graph G_d and s_{i'} and s_{j'} in the sentence relation graph G, then (M_d)_{i,j} = M_{i',j'}. M_d is then normalized to M̃_d so that each row sums to 1.
2) For document d, initialize two sets A = ∅ and B = {s_i | i = 1, 2, ..., m}, so B contains all sentences of document d. The final weight of each sentence is initialized to its information richness, that is, ARScore(s_i) = InfoRich(s_i), i = 1, 2, ..., m;
3) Sort the sentences in B in descending order of their current final weights;
4) Suppose s_i is the highest-ranked sentence, that is, the first sentence in the order; move s_i from B to A, and apply the following diversity penalty to every sentence s_j in B adjacent to s_i (j ≠ i):

    ARScore(s_j) = ARScore(s_j) - (\tilde{M}_d)_{j,i} \cdot InfoRich(s_i)    (6)

5) Repeat steps 3) and 4) until B = ∅. A sketch of this re-ranking loop is given below.
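A minimal sketch of this penalty loop together with the selection of Step 106, assuming the cluster-level matrix and scores from the previous sketches; doc_idx, the list of positions of document d's sentences within the cluster sentence set S, is an illustrative assumption. Re-normalizing the rows of the block extracted from M̃ yields the same M̃_d as extracting from M and then normalizing, since row normalization only rescales each row.

import numpy as np

def summarize_document(doc_idx, M_tilde, info_rich, num_sentences=8):
    local = np.asarray(M_tilde)[np.ix_(doc_idx, doc_idx)]
    sums = local.sum(axis=1, keepdims=True)
    M_d = np.divide(local, sums, out=np.zeros_like(local), where=sums != 0)
    base = np.array([info_rich[g] for g in doc_idx], dtype=float)
    ar = base.copy()                             # ARScore initialized to InfoRich, step 2)
    A, B = [], set(range(len(doc_idx)))
    while B:                                     # steps 3)-5)
        i = max(B, key=lambda j: ar[j])          # highest-ranked sentence remaining in B
        B.remove(i)
        A.append(i)
        for j in B:
            ar[j] -= M_d[j, i] * base[i]         # diversity penalty, formula (6)
    # A is in descending order of final weight; keep the top sentences,
    # then restore document order for readable output.
    return [doc_idx[j] for j in sorted(A[:num_sentences])]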
The final weight of each sentence of document d obtained by the above steps jointly reflects both the information richness and the information novelty of the sentence.
Step 106: according to the final weights of the sentences in document d, select the several sentences with the largest weights to form the summary.
In general, selecting 2-10 sentences to form the summary is sufficient; in this embodiment 8 sentences are selected.
The invention also provides a system for batch single-document summarization of a document set.
The system comprises the following devices: a document clustering device and a batch single-document summarization device.
The document clustering device is used to cluster the given document collection D into k document clusters C_1, ..., C_k, where k is a positive integer.
The batch single-document summarization device is used to perform batch single-document summarization on the documents in each document cluster, and specifically comprises:
a document reading device, used to read in all documents in document cluster C_i, split every document into sentences and words to obtain the cluster sentence set S = {s_1, s_2, ..., s_n}, where n is the number of all sentences in the cluster, and build the sentence relation graph G from this sentence set;
an information richness computing device, used to iteratively compute the information richness of each sentence from the above sentence relation graph G;
a weight computing device, used to apply the within-document diversity penalty to the sentences of every document d in cluster C_i, thereby obtaining the final weight of each sentence in the document;
a summary output device, used to screen the final weights of the sentences in document d and select the sentences with the largest weights to form and output the summary of the document.
The functions of the devices of this system correspond one-to-one to the steps of the above-described method.
To verify the effectiveness of the invention, the evaluation data and task of the Document Understanding Conference (DUC) are adopted. This embodiment uses the single-document summarization task of DUC 2002, that is, the first evaluation task of DUC 2002. This task provides 567 documents, drawn from TREC-9, and requires each participant to produce a summary of at most 100 words for every document. The submitted summaries are compared with manual summaries. The popular ROUGE method is used to evaluate the method of the invention, with three evaluation measures: ROUGE-1, ROUGE-2 and ROUGE-W; the larger the ROUGE value, the better the result, and ROUGE-1 is the principal measure. The invention first clusters the document set with the k-means algorithm, grouping the 567 documents into 59 document clusters, and then performs batch single-document summarization on the documents in each cluster. The method of the invention is compared with the graph-ranking method that uses only the information of the document itself; the experimental results are shown in Table 1.
Table 1: Comparison results on the DUC 2002 evaluation data [the table is rendered as an image in the original publication and its values are not reproduced here]
The experimental results show that the method of the invention outperforms the summarization method that uses only the information of the single document on all three evaluation measures.
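For illustration, the principal measure can be sketched as a minimal single-reference ROUGE-1 recall, that is, clipped unigram overlap divided by the reference length. This simplified form is an assumption for clarity; the evaluation above used the standard ROUGE method cited below.

from collections import Counter

def rouge_1_recall(candidate_tokens, reference_tokens):
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum(min(cand[t], c) for t, c in ref.items())  # clipped unigram match counts
    return overlap / max(1, sum(ref.values()))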
The effect of the invention is as follows: the method of the invention overcomes the shortcoming that existing single-document summarization methods do not consider the information redundancy among similar documents, and can extract the truly important sentences from a single document. The invention achieves this effect because it uses document clustering to gather similar documents into the same document cluster; the documents of the same cluster exhibit strong information redundancy, and on this basis the "votes" or "recommendations" between the sentences of different documents in the same cluster are used to estimate sentence importance.
In addition, because the information richness of the sentences of a whole document cluster is obtained in a single computation, the efficiency of summary generation is improved, and single-document summaries can be generated in batch for all documents in the cluster.
The ROUGE evaluation method is described in "Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics" (C.-Y. Lin and E. H. Hovy, Proceedings of the 2003 Language Technology Conference, HLT-NAACL 2003).
The method of the invention is not limited to the described embodiment. The algorithm used to cluster the document set in step (1) is not confined to the k-means algorithm; other clustering algorithms such as hierarchical agglomerative algorithms, partitioning algorithms, self-organizing map algorithms and kernel clustering algorithms may also be used. The method of computing the similarity between sentences in step (3) is not confined to the cosine formula; other similarity measures such as the Jaccard, Dice and Overlap formulas may also be used. The method of computing the information richness of each sentence in step (4) may also be replaced by other methods, such as the traditional method of scoring a sentence directly by the importance of the keywords it contains. The method of computing the final weight of each sentence in a document in step (5) may also be replaced by other methods, such as the maximal marginal relevance (MMR) technique. Other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of the technical innovation of the invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (12)

1. A method for batch single-document summarization of a document set, characterized in that it comprises the following steps:
Step 1: cluster the given document collection D to obtain k document clusters C_1, ..., C_k, where k is a positive integer;
Step 2: perform batch single-document summarization on the documents in each of the above document clusters.
2. The method of claim 1, characterized in that the method of performing batch single-document summarization on the documents in a document cluster in step 2 is:
Step 2.1: read in all documents in document cluster C_i, split every document into sentences and words, obtain the cluster sentence set S = {s_1, s_2, ..., s_n}, where n is the number of all sentences in the cluster, and build the sentence relation graph G from this sentence set;
Step 2.2: iteratively compute the information richness of each sentence from the sentence relation graph G obtained in step 2.1;
Step 2.3: for the sentences of every document d in cluster C_i, apply the within-document diversity penalty to obtain the final weight of each sentence in the document;
Step 2.4: according to the final weights of the sentences in document d, select the sentences with the largest weights to form the summary of the document.
3. The method of claim 2, characterized in that in step 1, when clustering the given document set D to generate k document clusters, the concrete method is the k-means clustering algorithm.
4. The method of claim 3, characterized in that the concrete steps of generating k document clusters with the k-means clustering algorithm are as follows:
Step 4.1: randomly select k documents from document set D as the mean points of the k clusters, and assign each document in D to the cluster most similar to it; the similarity between a document and a cluster is measured by the cosine similarity between the document and the cluster's mean point, and word weights are computed with the TFIDF formula;
Step 4.2: recompute the mean points of the k clusters and reassign the documents in D to the most similar clusters; a cluster's mean-point vector is the mean of the document vectors in the cluster;
Step 4.3: repeat step 4.2 until no cluster changes any more.
5. The method of claim 2, characterized in that when step 1 clusters the given document set D, the clustering algorithm may be a hierarchical agglomerative algorithm, a partitioning algorithm, a self-organizing map algorithm, or a kernel clustering algorithm.
6. The method of any one of claims 1-5, characterized in that when the given document set D is clustered in step 1, the number k of document clusters is provided by the user according to prior knowledge.
7. The method of claim 6, characterized in that the steps of building the sentence relation graph G from the sentence set S of the documents in document cluster C_i are as follows:
For any two different sentences s_i and s_j in S, compute their similarity with the cosine formula:

    sim(s_i, s_j) = \cos(\vec{s}_i, \vec{s}_j) = \frac{\vec{s}_i \cdot \vec{s}_j}{\|\vec{s}_i\| \cdot \|\vec{s}_j\|}    (1)

where 1 ≤ i, j ≤ n and i ≠ j; each dimension of a sentence vector corresponds to a word in the sentence; the weight of word t is tf_t × isf_t, where tf_t is the frequency of t in the sentence and isf_t is the inverse sentence frequency of t, namely 1 + log(N/n_t), N being the number of all sentences in the background document set and n_t the number of sentences that contain t;
if sim(s_i, s_j) > 0, a connection is established between s_i and s_j, that is, an edge between s_i and s_j is added to graph G;
the adjacency matrix M = (M_{i,j})_{n×n} of the resulting graph G is defined as:

    M_{i,j} = \begin{cases} sim(s_i, s_j) & i \neq j \\ 0 & \text{otherwise} \end{cases}    (2)

matrix M is normalized as follows so that each row sums to 1, yielding the new adjacency matrix M̃:

    \tilde{M}_{i,j} = \begin{cases} M_{i,j} / \sum_{k=1}^{n} M_{i,k} & \text{if } \sum_{k=1}^{n} M_{i,k} \neq 0 \\ 0 & \text{otherwise} \end{cases}    (3)
8. The method of claim 7, characterized in that the information richness of the sentences is computed iteratively from graph G with the following method:
After obtaining the sentence adjacency matrix M̃, the information richness InfoRich(s_i) of each sentence s_i in the sentence set S is computed iteratively with the following formula:

    InfoRich(s_i) = d \cdot \sum_{j \neq i} InfoRich(s_j) \cdot \tilde{M}_{j,i} + \frac{1-d}{n}    (4)

where InfoRich(s_j) on the right-hand side of formula (4) is the information richness of sentence s_j computed in the previous iteration, InfoRich(s_i) on the left-hand side is the new information richness of sentence s_i obtained in the current iteration, and d is a damping factor;
in matrix form, the above formula is:

    \vec{\lambda} = d \, \tilde{M}^{T} \vec{\lambda} + \frac{1-d}{n} \vec{e}    (5)

where \vec{\lambda} is an n-dimensional vector whose components are the information richness scores of the sentences, the superscript T denotes matrix transposition, and \vec{e} is an n-dimensional all-ones vector;
in each iteration, the new information richness of every sentence is computed with the above formula from the information richness values of the previous iteration, until the values obtained in two successive iterations no longer change for any sentence, or, in actual computation, until the change in information richness of every sentence is smaller than a preset threshold.
9. The method of claim 8, characterized in that the damping factor d is 0.85, and when the iteration stops once the change in sentence information richness falls below a threshold, the threshold is set to 0.0001.
10. The method of claim 9, characterized in that the concrete method of applying the within-document diversity penalty to the sentences of every document d in cluster C_i, to obtain the final weight of each sentence in the document, is as follows:
Step 10.1: let S_d be the sentence set of document d, with m sentences, m < n, and let G_d be the local sentence relation graph of the document, whose vertex set is S_d; its adjacency matrix M_d = (M_d)_{m×m} is obtained by extracting the corresponding elements from the adjacency matrix M of the sentence relation graph G: if two sentences of document d are denoted s_i and s_j in the local graph G_d and s_{i'} and s_{j'} in the sentence relation graph G, then (M_d)_{i,j} = M_{i',j'}; M_d is then normalized to M̃_d so that each row sums to 1;
Step 10.2: for document d, initialize two sets A = ∅ and B = {s_i | i = 1, 2, ..., m}, so B contains all sentences of document d; the final weight of each sentence is initialized to its information richness, that is, ARScore(s_i) = InfoRich(s_i), i = 1, 2, ..., m;
Step 10.3: sort the sentences in B in descending order of their current final weights;
Step 10.4: suppose s_i is the highest-ranked sentence, that is, the first sentence in the order; move s_i from B to A, and apply the following diversity penalty to every sentence s_j in B adjacent to s_i, j ≠ i:

    ARScore(s_j) = ARScore(s_j) - (\tilde{M}_d)_{j,i} \cdot InfoRich(s_i)    (6)

Step 10.5: repeat steps 10.3 and 10.4 until B = ∅.
11. The method of claim 10, characterized in that when the final weights of the sentences in document d have been determined, the 2-10 sentences with the largest weights are selected to form the summary of the document.
12. A system for batch single-document summarization of a document set, characterized in that it comprises the following devices: a document clustering device and a batch single-document summarization device;
wherein the document clustering device is used to cluster the given document collection D into k document clusters C_1, ..., C_k, where k is a positive integer;
the batch single-document summarization device is used to perform batch single-document summarization on the documents in each document cluster, and specifically comprises:
a document reading device, used to read in all documents in document cluster C_i, split every document into sentences and words to obtain the cluster sentence set S = {s_1, s_2, ..., s_n}, where n is the number of all sentences in the cluster, and build the sentence relation graph G from this sentence set;
an information richness computing device, used to iteratively compute the information richness of each sentence from the above sentence relation graph G;
a weight computing device, used to apply the within-document diversity penalty to the sentences of every document d in cluster C_i, thereby obtaining the final weight of each sentence in the document;
a summary output device, used to screen the final weights of the sentences in document d and select the sentences with the largest weights to form and output the summary of the document.
CNB2006101145906A 2006-11-16 2006-11-16 Method and system for abstracting batch single document for document set Expired - Fee Related CN100511214C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006101145906A CN100511214C (en) 2006-11-16 2006-11-16 Method and system for abstracting batch single document for document set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006101145906A CN100511214C (en) 2006-11-16 2006-11-16 Method and system for abstracting batch single document for document set

Publications (2)

Publication Number Publication Date
CN101187919A CN101187919A (en) 2008-05-28
CN100511214C true CN100511214C (en) 2009-07-08

Family

ID=39480317

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006101145906A Expired - Fee Related CN100511214C (en) 2006-11-16 2006-11-16 Method and system for abstracting batch single document for document set

Country Status (1)

Country Link
CN (1) CN100511214C (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622378A (en) * 2011-01-30 2012-08-01 北京千橡网景科技发展有限公司 Method and device for detecting events from text flow
CN102831119B (en) * 2011-06-15 2016-08-17 日电(中国)有限公司 Short text clustering Apparatus and method for
CN104794131B (en) * 2014-01-21 2019-07-05 腾讯科技(深圳)有限公司 A kind of the batch edit methods and device of file
CN105335375B (en) * 2014-06-20 2019-01-15 华为技术有限公司 Topics Crawling method and apparatus
CN104915335B (en) * 2015-06-12 2018-03-16 百度在线网络技术(北京)有限公司 The method and apparatus of the document sets that are the theme generation summary
CN105183710A (en) * 2015-06-23 2015-12-23 武汉传神信息技术有限公司 Method for automatically generating document summary
CN106407178B (en) * 2016-08-25 2019-08-13 中国科学院计算技术研究所 A kind of session abstraction generating method, device, server apparatus and terminal device
CN108090049B (en) * 2018-01-17 2021-02-05 山东工商学院 Multi-document abstract automatic extraction method and system based on sentence vectors
CN111274537B (en) * 2020-01-20 2021-12-31 山西大学 Document representation method based on punishment matrix decomposition
CN116910827B (en) * 2023-09-13 2023-11-21 北京点聚信息技术有限公司 Automatic signature management method for OFD format file based on artificial intelligence

Also Published As

Publication number Publication date
CN101187919A (en) 2008-05-28

Similar Documents

Publication Publication Date Title
CN100511214C (en) Method and system for abstracting batch single document for document set
CN101398814B (en) Method and system for simultaneously abstracting document summarization and key words
CN103970729B (en) A kind of multi-threaded extracting method based on semantic category
CN101566998B (en) Chinese question-answering system based on neural network
Li et al. Enhancing diversity, coverage and balance for summarization through structure learning
CN100416570C (en) FAQ based Chinese natural language ask and answer method
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN106997382A (en) Innovation intention label automatic marking method and system based on big data
CN102411621A (en) Chinese query-oriented multi-document automatic abstracting method based on cloud model
CN101620596A (en) Multi-document auto-abstracting method facing to inquiry
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN104008090A (en) Multi-subject extraction method based on concept vector model
CN101231634A (en) Autoabstract method for multi-document
CN101382962B (en) Superficial layer analyzing and auto document summary method based on abstraction degree of concept
CN106294863A (en) A kind of abstract method for mass text fast understanding
CN100435145C (en) Multiple file summarization method based on sentence relation graph
CN106294736A (en) Text feature based on key word frequency
CN106294733A (en) Page detection method based on text analyzing
CN106599072A (en) Text clustering method and device
CN1916904A (en) Method of abstracting single file based on expansion of file
CN102253973A (en) Chinese and English cross language news topic detection method and system
Madnani et al. Multiple alternative sentence compressions for automatic text summarization
Zhang et al. Extractive Document Summarization based on hierarchical GRU
Perkio et al. Exploring independent trends in a topic-based search engine
Park et al. Extracting search intentions from web search logs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220915

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Peking University

Patentee after: PEKING University FOUNDER R & D CENTER

Address before: 100871, Beijing, Haidian District Cheng Fu Road 298, founder building, 5 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Peking University

Patentee before: PEKING University FOUNDER R & D CENTER

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230328

Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee after: Peking University

Address before: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee before: New founder holdings development Co.,Ltd.

Patentee before: Peking University

Patentee before: PEKING University FOUNDER R & D CENTER

TR01 Transfer of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090708

CF01 Termination of patent right due to non-payment of annual fee