A method and system for batch single-document summarization of a document set
Technical field
The invention belongs to the technical fields of natural language processing and information retrieval, and specifically relates to a method and system for batch single-document summarization of a document set.
Background technology
Single-document automatic summarization refers to automatically extracting the gist or main points from a given document; its objective is to provide the user with a brief and concise description of the content by compressing and refining the original text. Single-document automatic summarization is one of the key problems in the field of natural language processing and is widely used in document/Web search engines, enterprise content management systems, knowledge management systems (such as Founder Zhisi), and the like.
In general, summarization methods can be divided into sentence-extraction-based methods (Extraction) and sentence-generation-based methods (Abstraction). Methods based on sentence generation require deep natural-language-understanding techniques: after syntactic and semantic analysis of the original document, information extraction or natural language generation techniques are used to produce new sentences that form the summary. Methods based on sentence extraction are simpler and more practical and do not require deep natural-language understanding; after splitting the text into sentences, such a method assigns each sentence a weight reflecting its importance and then selects the several sentences with the largest weights to form the summary. A key step of sentence extraction is assigning weights to sentences to reflect their importance, which usually requires considering various features of a sentence, such as word frequency, sentence position, cue words (Cue Words), stigma words (Stigma Words), and so on. Most current summarization methods are based on sentence extraction, and the existing literature records many methods for single-document automatic summarization.
The article "The automated acquisition of topic signatures for text summarization" (by C.-Y. Lin and E. Hovy, published in Proceedings of ACL 2000) describes the SUMMARIST system, which uses topic signatures (Topic Signature) to represent document topics; a topic signature consists of a topic concept and some related words, and sentences are extracted according to the topic signatures to form the summary. The article "Efficient text summarization using lexical chains" (by H.G. Silber and K. McCoy, published in Proceedings of the 5th International Conference on Intelligent User Interfaces, 2000) first analyzes the document to obtain lexical chains (Lexical Chain); a lexical chain is a sequence of related words in the document, and each sentence takes the total chain value of the words it contains as its weight. The article "A trainable document summarizer" (by J. Kupiec, J. Pedersen and F. Chen, published in Proceedings of SIGIR 1995) regards the summarization problem as a two-class classification problem of whether a sentence belongs to the summary, and uses a Bayesian classifier to combine various features for sentence selection. The article "The use of MMR, diversity-based reranking for reordering documents and producing summaries" (by Jaime Carbonell and Jade Goldstein, published in Proceedings of SIGIR 1998) describes the maximal marginal relevance (MMR) technique, which is commonly used to extract sentences that are both relevant to a document query and have a certain novelty. The article "Generic text summarization using relevance measure and latent semantic analysis" (by Y.H. Gong and X. Liu, published in Proceedings of SIGIR 2001) adopts latent semantic analysis (LSA) to extract sentences from a new semantic space; each time a sentence most relevant to the document has been extracted according to a relevance measure (Relevance Measure), the words contained in that sentence are removed from the document, thus guaranteeing the novelty of each extracted sentence. In addition, the article "TextRank: bringing order into texts" (by R. Mihalcea and P. Tarau, published in Proceedings of EMNLP 2004) and the article "A language independent algorithm for single and multiple document summarization" (by R. Mihalcea and P. Tarau, published in Proceedings of IJCNLP 2005) propose graph-ranking-based methods to rank the sentences in a document. The sentences of the document serve as vertices in a graph and are connected according to the similarity relations between sentences; on this basis an algorithm similar to PageRank or HITS is used to compute sentence importance. These methods are based on "votes" or "recommendations" between sentences: adjacent sentences "vote" for or "recommend" each other, the more "votes" or "recommendations" a sentence obtains, the more important it is, and the importance of the "voter" or "recommender" determines the weight of the "votes" or "recommendations" it casts.
The above single-document automatic summarization methods all use only the information of a single document itself, without using the information of other related documents, and every document must go through all the computation steps to obtain its summary. Many practical applications need to produce a single-document summary for every document in a large-scale document collection. Such a collection contains different document clusters; the documents belonging to the same cluster are topically related and exhibit information redundancy, and the important information reflected in one document is usually also reflected in several other documents of the cluster.
Summary of the invention
In view of the defects of existing single-document automatic summarization techniques, the purpose of the invention is to provide a method for batch single-document summarization of a document set. The main idea of the invention is as follows: the information redundancy that exists among similar documents is exploited to better measure the importance of the sentences in a document to be summarized, thereby generating a better single-document summary for that document. Document clustering is performed on the given document collection to obtain several document clusters, each reflecting one topic and containing similar documents. The method performs batch single-document summarization for all documents in a single cluster; that is, the information richness of all sentences in the documents of the cluster is obtained by a single computation, without separate computations for the sentences of each document. On the one hand, the method extracts truly important sentences and thus forms higher-quality summaries; on the other hand, the batch computation saves summary-generation time.
To achieve the above purpose, the technical solution adopted by the invention is a method for batch single-document summarization of a document set, comprising the following steps:
Step 1: perform document clustering on the given document collection D to obtain k document clusters C_1, ..., C_k, where k is a positive integer;
Step 2: perform batch single-document summarization on the documents in each of the above document clusters.
Further, the method for batch single-document summarization of the documents in a document cluster is:
Step 2.1: read in all documents in the document cluster C_i, split every document into sentences and words, obtain the cluster sentence set S = {s_1, s_2, ..., s_n}, where n is the number of all sentences in the cluster, and construct a sentence relation graph G based on this sentence set S;
Step 2.2: iteratively compute the information richness of each sentence according to the sentence relation graph G obtained above;
Step 2.3: apply a within-document diversity penalty to the sentences of every document d in the cluster C_i to obtain the final weight of each sentence in the document;
Step 2.4: according to the final weights of the sentences in document d, select the sentences with the largest weights to form the summary of that document.
Further, when document clustering is performed on the given document set D to generate k document clusters, the concrete method is the k-means clustering algorithm.
Further, the concrete steps of generating k document clusters with the k-means clustering algorithm are as follows:
Step 4.1: randomly select k documents from the document set D as the mean points of the k clusters, and assign each document in D to the cluster most similar to it; the similarity between a document and a cluster is measured by the cosine similarity between the document and the cluster mean point, and the weights of words are computed by the TFIDF formula;
Step 4.2: recompute the mean points of the k clusters and reassign each document in D to the cluster most similar to it; the cluster mean-point vector is the mean of the document vectors in the cluster;
Step 4.3: repeat step 4.2 until none of the clusters changes.
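The k-means steps 4.1-4.3 above can be sketched as follows. This is a minimal illustration, not the exact implementation of the invention: the particular TF-IDF weighting variant, the tokenization, and all function names are illustrative assumptions.

```python
# Sketch of steps 4.1-4.3: k-means over TF-IDF document vectors with
# cosine similarity. All names and the exact TF-IDF variant are
# illustrative assumptions.
import math
import random
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: tf*idf} dicts."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * math.log(n / df[t] + 1.0) for t in tf})
    return vecs

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_point(vectors):
    """Cluster mean point: average of the document vectors in the cluster."""
    acc = Counter()
    for v in vectors:
        acc.update(v)
    return {t: w / len(vectors) for t, w in acc.items()} if vectors else {}

def kmeans(vecs, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    means = [dict(v) for v in rng.sample(vecs, k)]  # step 4.1: random seeds
    assign = [-1] * len(vecs)
    for _ in range(max_iter):
        # assign each document to the most similar cluster mean
        new_assign = [max(range(k), key=lambda c: cosine(v, means[c]))
                      for v in vecs]
        if new_assign == assign:          # step 4.3: clusters unchanged
            break
        assign = new_assign
        for c in range(k):                # step 4.2: recompute mean points
            members = [vecs[i] for i, a in enumerate(assign) if a == c]
            if members:
                means[c] = mean_point(members)
    return assign
```

The assignment list returned maps each document index to its cluster index; the loop terminates either when no assignment changes or after a bounded number of iterations.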
Further, when document clustering is performed on the given document set D, the clustering algorithm may also be a hierarchical agglomerative algorithm, a partitioning algorithm, a self-organizing map network algorithm, or a kernel clustering algorithm.
Further, when document clustering is performed on the given document set D, the number k of document clusters is provided by the user according to prior knowledge.
Further, the steps of constructing the sentence relation graph G from the sentence set S of the documents in the document cluster C_i are as follows:
For any two different sentences s_i and s_j in S, compute their similarity value with the following cosine formula:
sim(s_i, s_j) = (s_i · s_j) / (||s_i|| × ||s_j||) (1)
where 1 ≤ i, j ≤ n and i ≠ j; each dimension of a sentence vector corresponds to one word of the sentence, and the weight of word t is tf_t × isf_t, where tf_t is the frequency of word t in the sentence and isf_t is the inverse sentence frequency of word t, namely 1 + log(N/n_t), where N is the number of all sentences in the background document set and n_t is the number of sentences that contain word t;
If sim(s_i, s_j) > 0, a connection is established between s_i and s_j, i.e., an edge is added between s_i and s_j in the graph G;
The adjacency matrix M = (M_(i,j))_(n×n) of the resulting graph G is defined as follows:
M_(i,j) = sim(s_i, s_j) if i ≠ j, and M_(i,j) = 0 otherwise (2)
The matrix M is normalized as follows so that the sum of the elements of each row is 1, giving the new adjacency matrix M~:
M~_(i,j) = M_(i,j) / Σ_j M_(i,j) if the row sum Σ_j M_(i,j) is not zero, and M~_(i,j) = 0 otherwise (3)
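The sentence-graph construction can be sketched as follows. The tf × isf weighting and the row normalization follow the definitions in the text; the function names and the token-list input format are illustrative assumptions.

```python
# Sketch of the sentence-graph construction: each sentence is a
# bag-of-words vector weighted by tf*isf, similarity is the cosine
# between two sentence vectors, an edge exists wherever sim > 0, and
# the adjacency matrix is row-normalized so each row sums to 1.
import math
from collections import Counter

def tf_isf_vectors(sentences):
    """sentences: list of token lists -> list of {term: tf*isf} dicts,
    with isf(t) = 1 + log(N / n_t) over the given sentence set."""
    n = len(sentences)
    nt = Counter()
    for s in sentences:
        nt.update(set(s))
    return [{t: c * (1.0 + math.log(n / nt[t]))
             for t, c in Counter(s).items()}
            for s in sentences]

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_graph(sentences):
    """Return the row-normalized adjacency matrix M~ as a list of lists."""
    vecs = tf_isf_vectors(sentences)
    n = len(vecs)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                M[i][j] = cosine(vecs[i], vecs[j])  # 0.0 means: no edge
    for row in M:                                   # row normalization
        s = sum(row)
        if s > 0:
            for j in range(n):
                row[j] /= s
    return M
```

A sentence with no similar neighbor keeps an all-zero row, matching the convention in formula (3) above.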
Further, the following method is adopted for iteratively computing the information richness of the sentences according to the graph G:
After the sentence adjacency matrix M~ is obtained, the information richness InfoRich(s_i) of each sentence s_i in the sentence set S is computed iteratively with the following formula:
InfoRich(s_i) = d · Σ_(j≠i) M~_(j,i) · InfoRich(s_j) + (1 − d)/n (4)
where InfoRich(s_j) on the right side of the equals sign of formula (4) denotes the information richness of sentence s_j computed in the previous iteration, InfoRich(s_i) on the left side of the equals sign of formula (4) denotes the new information richness currently obtained for sentence s_i, and d is a damping factor;
The above formula is expressed in matrix form as:
λ = d · M~^T λ + ((1 − d)/n) · e (5)
where λ is an n-dimensional vector whose every dimension represents the information richness of one sentence, the superscript T denotes matrix transposition, and e is the n-dimensional all-ones vector;
In each iteration, the new information richness of every sentence is computed with the above formula from the sentence information richness of the previous iteration, until the information richness obtained by two successive iterations no longer changes for any sentence, or, in actual computation, until the change of the information richness of every sentence is smaller than a preset threshold.
Further, the damping factor d is 0.85, and when the change of the information richness of every sentence is required to be smaller than a threshold, the threshold is set to 0.0001.
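The iterative computation can be sketched as follows, assuming the standard damped graph-ranking form InfoRich(s_i) = d · Σ_j M~_(j,i) · InfoRich(s_j) + (1 − d)/n with the stated d = 0.85 and stopping threshold 0.0001; the function name is an illustrative assumption.

```python
# Sketch of the information-richness iteration over the row-normalized
# sentence adjacency matrix M~, assuming the standard damped form with
# d = 0.85 and a per-sentence change threshold of 1e-4.

def info_richness(M, d=0.85, eps=1e-4, max_iter=1000):
    n = len(M)
    scores = [1.0 / n] * n                 # uniform initialization
    for _ in range(max_iter):
        new = [d * sum(M[j][i] * scores[j] for j in range(n))
               + (1.0 - d) / n
               for i in range(n)]
        # stop once no sentence's score changes by more than the threshold
        if max(abs(a - b) for a, b in zip(new, scores)) < eps:
            return new
        scores = new
    return scores
```

Because each row of M~ sums to 1 (or to 0 for isolated sentences), the total score mass is preserved across iterations, so the result is a distribution-like ranking over sentences.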
Further, the concrete method of applying the within-document diversity penalty to the sentences of every document d in the cluster C_i to obtain the final weight of each sentence in the document is as follows:
Step 10.1: let the sentence set of document d be S_d with m sentences, m < n, and let the local sentence relation graph of the document be G_d, whose vertex set is S_d; its adjacency matrix M_d = (M_d)_(m×m) is obtained by extracting the corresponding elements from the adjacency matrix M of the obtained sentence relation graph G: if two sentences of document d are denoted s_i and s_j in the local relation graph G_d and s_i′ and s_j′ in the sentence relation graph G, then (M_d)_(i,j) = M_(i′,j′); then normalize M_d to M~_d so that the sum of the elements of each row is 1;
Step 10.2: initialize two sets for document d, A = φ and B = {s_i | i = 1, 2, ..., m}, so that B contains all sentences of document d; the final weight of each sentence is initialized to its information richness, that is, ARScore(s_i) = InfoRich(s_i), i = 1, 2, ..., m;
Step 10.3: sort the sentences in B in descending order of their current final weights;
Step 10.4: suppose s_i is the highest-ranked sentence, i.e., the first sentence in the sorted sequence; move s_i from B to A, and apply the following diversity penalty to every sentence s_j in B adjacent to s_i, j ≠ i:
ARScore(s_j) = ARScore(s_j) − (M~_d)_(j,i) · InfoRich(s_i) (6)
Step 10.5: repeat steps 10.3 and 10.4 until B = φ.
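The diversity-penalty loop can be sketched as follows. The exact penalty expression is not reproduced in the text above, so the form assumed here — subtracting (M~_d)_(j,i) · InfoRich(s_i) from each neighbor — is an assumption based on common graph-based summarization practice; the function name is likewise illustrative.

```python
# Sketch of steps 10.1-10.5: greedy diversity penalty. Md is the
# row-normalized local m x m adjacency matrix of document d; info_rich
# holds the m information-richness scores. The penalty formula used
# here is an assumed common form, not a verbatim reproduction.

def diversity_rank(Md, info_rich):
    """Return the final weight ARScore of each of the m sentences."""
    m = len(Md)
    scores = list(info_rich)        # ARScore initialized to InfoRich
    B = set(range(m))               # A is implicit: sentences removed from B
    while B:
        i = max(B, key=lambda s: scores[s])   # highest-ranked sentence in B
        B.discard(i)                          # move s_i from B to A
        for j in B:
            if Md[j][i] > 0:                  # neighbors of s_i are penalized
                scores[j] -= Md[j][i] * info_rich[i]
    return scores
```

Sentences similar to an already-selected high-scoring sentence thus lose weight, so the final ranking balances richness against redundancy within the document.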
Further, when the final weight of each sentence in document d has been determined, the 2-10 sentences with the largest weights are selected to form the summary of the document.
The invention also provides a system for batch single-document summarization of a document set, used for performing batch single-document summarization on a document set.
The system comprises the following devices: a document clustering device and a batch single-document summarization device;
The document clustering device is used for performing document clustering on the given document collection D to obtain k document clusters C_1, ..., C_k, where k is a positive integer;
The batch single-document summarization device is used for performing batch single-document summarization on the documents in each document cluster; this device specifically comprises:
a document reading device, used for reading in all documents in a document cluster C_i and splitting every document into sentences and words, thereby obtaining the cluster sentence set S = {s_1, s_2, ..., s_n}, where n is the number of all sentences in the cluster, and for constructing the sentence relation graph G based on this sentence set S;
an information-richness computing device, used for iteratively computing the information richness of each sentence from the above sentence relation graph G;
a weight computing device, used for applying the within-document diversity penalty to the sentences of every document d in the cluster C_i, thereby obtaining the final weight of each sentence in the document;
a summary output device, used for screening the final weights of the sentences of document d and selecting the sentences with the largest weights to form and output the summary of the document.
The effect of the invention is as follows: the method of the invention overcomes the shortcoming that existing single-document summarization methods do not consider the information redundancy among similar documents, and can extract the truly important sentences from a single document. The invention achieves this effect because document clustering gathers similar documents into the same document cluster, the documents in the same cluster have a strong information-redundancy characteristic, and based on this characteristic the "votes" or "recommendations" between the sentences of the documents in the same cluster are used to estimate sentence importance.
In addition, because the information richness of the sentences of a document cluster can be obtained by a single computation, the efficiency of summary generation is improved, and single-document summaries can be generated in batch for all documents in the cluster.
Description of drawings
Fig. 1 is a flowchart of the method of the invention.
Embodiment
The method of the invention is further illustrated below with reference to an embodiment and the accompanying drawing.
The main idea of the invention is as follows: the information redundancy that exists among similar documents is exploited to better measure the importance of the sentences in a document to be summarized, thereby generating a better single-document summary for that document. Document clustering is performed on the given document collection to obtain several document clusters, each reflecting one topic and containing similar documents. The method performs batch single-document summarization for all documents in a single cluster; that is, the information richness of all sentences in the documents of the cluster is obtained by a single computation, without separate computations for the sentences of each document. On the one hand, the method extracts truly important sentences and thus forms higher-quality summaries; on the other hand, the batch computation saves summary-generation time.
As shown in Fig. 1, which is a flow diagram of the method of the invention for batch single-document summarization of a document set, the method comprises the following steps:
Step 101: for the given document collection D, perform document clustering on the collection with the k-means clustering algorithm to obtain k document clusters;
The concrete method of k-means clustering of the document set D is as follows:
1) randomly select k documents from the document set D as the mean points of the k clusters, and assign each document in D to the cluster most similar to it. The similarity between a document and a cluster is measured by the cosine similarity between the document and the cluster mean point, and the weights of words are computed by the TFIDF formula.
2) recompute the mean points of the k clusters, and then reassign each document in D to the cluster most similar to it. The cluster mean-point vector is the mean of the document vectors in the cluster.
3) repeat step 2) until none of the clusters changes.
k is generally provided by the user according to prior knowledge, or set as a function of |D|, where |D| denotes the number of documents in the document set D.
Step 102: for each obtained document cluster C_i, execute the following steps 103-105 to perform batch single-document summarization on the documents in the cluster;
Step 103: read in all documents in the document cluster C_i, split every document into sentences and words, obtain the cluster sentence set S = {s_1, s_2, ..., s_n}, where n is the number of all sentences in the cluster, and construct the sentence relation graph G based on this sentence set S;
The steps of constructing the sentence relation graph G from the sentence set S of the documents in the cluster C_i are as follows:
For any two different sentences s_i and s_j in S, compute their similarity value with the following cosine formula:
sim(s_i, s_j) = (s_i · s_j) / (||s_i|| × ||s_j||) (1)
where each dimension of a sentence vector corresponds to one word of the sentence, and the weight of word t is tf_t × isf_t, where tf_t is the frequency of word t in the sentence and isf_t is the inverse sentence frequency of word t, namely 1 + log(N/n_t), where N is the number of all sentences in the background document set, n_t is the number of sentences that contain word t, and the background document set is usually large;
If sim(s_i, s_j) > 0, a connection is established between s_i and s_j, i.e., an edge is added between s_i and s_j in the graph G;
The adjacency matrix M = (M_(i,j))_(n×n) of the resulting graph G is defined as follows:
M_(i,j) = sim(s_i, s_j) if i ≠ j, and M_(i,j) = 0 otherwise (2)
The matrix M is normalized as follows so that the sum of the elements of each row is 1, giving the new adjacency matrix M~:
M~_(i,j) = M_(i,j) / Σ_j M_(i,j) if the row sum Σ_j M_(i,j) is not zero, and M~_(i,j) = 0 otherwise (3)
Step 104: iteratively compute the information richness of each sentence based on the sentence relation graph G;
When the information richness of the sentences is iteratively computed according to the graph G, the present embodiment adopts the following method:
The information richness of a sentence reflects how much of the topic information the sentence contains. After the sentence adjacency matrix M~ is obtained, the information richness InfoRich(s_i) of each sentence s_i in the sentence set S is computed iteratively with the following formula:
InfoRich(s_i) = d · Σ_(j≠i) M~_(j,i) · InfoRich(s_j) + (1 − d)/n (4)
where InfoRich(s_j) on the right side of the equals sign of formula (4) denotes the information richness of sentence s_j computed in the previous iteration, and InfoRich(s_i) on the left side of the equals sign of formula (4) denotes the new information richness currently obtained for sentence s_i; d is a damping factor, set to 0.85 in the present embodiment.
The above formula is expressed in matrix form as:
λ = d · M~^T λ + ((1 − d)/n) · e (5)
where λ is an n-dimensional vector whose every dimension represents the information richness of one sentence, the superscript T denotes matrix transposition, and e is the n-dimensional all-ones vector.
In each iteration, the new information richness of every sentence is computed with the above formula from the sentence information richness of the previous iteration, until the information richness obtained by two successive iterations no longer changes for any sentence, or, in actual computation, until the change of the information richness of every sentence is smaller than a preset threshold. In the present embodiment, the threshold is set to 0.0001.
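The matrix-form update can be sketched with numpy as follows, under the same assumed damped form with d = 0.85 and threshold 0.0001; the vector lam plays the role of λ, and the function name is an illustrative assumption.

```python
# Sketch of the matrix-form iteration: lambda = d * M~^T lambda
# + ((1 - d)/n) * e, assuming the standard damped graph-ranking form
# with d = 0.85 and a per-sentence change threshold of 1e-4.
import numpy as np

def info_richness_matrix(M, d=0.85, eps=1e-4, max_iter=1000):
    M = np.asarray(M, dtype=float)    # row-normalized adjacency matrix M~
    n = M.shape[0]
    lam = np.full(n, 1.0 / n)         # uniform initialization
    teleport = np.full(n, (1.0 - d) / n)   # ((1 - d)/n) * e, e = all-ones
    for _ in range(max_iter):
        new = d * (M.T @ lam) + teleport
        if np.max(np.abs(new - lam)) < eps:    # per-sentence threshold
            return new
        lam = new
    return lam
```

The matrix form computes all n scores per iteration with one matrix-vector product, which is the computational basis of the batch (once-per-cluster) richness computation described above.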
Step 105: for every document d in the cluster C_i, apply the within-document diversity penalty to the sentences in the document to obtain the final weight of each sentence in the document; according to the final weights of the sentences in document d, select the sentences with the largest weights to form the summary of the document;
The concrete method of applying the within-document diversity penalty to the sentences of document d is as follows:
1) let the sentence set of document d be S_d with m sentences (m < n), and let the local sentence relation graph of the document be G_d, whose vertex set is S_d; its adjacency matrix M_d = (M_d)_(m×m) is obtained by extracting the corresponding elements from the adjacency matrix M of the sentence relation graph G of step 103: if two sentences of document d are denoted s_i and s_j in the local relation graph G_d and s_i′ and s_j′ in the sentence relation graph G, then (M_d)_(i,j) = M_(i′,j′); then normalize M_d to M~_d so that the sum of the elements of each row is 1.
2) initialize two sets for document d, A = φ and B = {s_i | i = 1, 2, ..., m}, so that B contains all sentences of document d. The final weight of each sentence is initialized to its information richness, that is, ARScore(s_i) = InfoRich(s_i), i = 1, 2, ..., m;
3) sort the sentences in B in descending order of their current final weights;
4) suppose s_i is the highest-ranked sentence, i.e., the first sentence in the sorted sequence; move s_i from B to A, and apply the following diversity penalty to every sentence s_j (j ≠ i) in B adjacent to s_i:
ARScore(s_j) = ARScore(s_j) − (M~_d)_(j,i) · InfoRich(s_i) (6)
5) repeat steps 3) and 4) until B = φ.
The final weight of each sentence in document d obtained by the above steps comprehensively reflects both the information richness and the information novelty of the sentence.
Step 106: according to the final weights of the sentences in document d, select the several sentences with the largest weights to form the summary.
In general, selecting 2-10 sentences to form the summary suffices; in the present embodiment, 8 sentences are selected to form the summary.
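The final selection step can be sketched as follows; restoring document order for the chosen sentences is an illustrative choice, as is the function name.

```python
# Sketch of step 106: given the final weights ARScore of a document's
# sentences, pick the top-scoring ones (8 in this embodiment) and emit
# them in their original document order.

def select_summary(sentences, scores, m=8):
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:m]
    return [sentences[i] for i in sorted(top)]   # restore document order
```

For example, with weights [0.1, 0.9, 0.4, 0.8] and m = 2, the second and fourth sentences are selected, in that order.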
The invention also provides a system for batch single-document summarization of a document set, used for performing batch single-document summarization on a document set.
The system comprises the following devices: a document clustering device and a batch single-document summarization device;
The document clustering device is used for performing document clustering on the given document collection D to obtain k document clusters C_1, ..., C_k, where k is a positive integer;
The batch single-document summarization device is used for performing batch single-document summarization on the documents in each document cluster; this device specifically comprises:
a document reading device, used for reading in all documents in a document cluster C_i and splitting every document into sentences and words, thereby obtaining the cluster sentence set S = {s_1, s_2, ..., s_n}, where n is the number of all sentences in the cluster, and for constructing the sentence relation graph G based on this sentence set S;
an information-richness computing device, used for iteratively computing the information richness of each sentence from the above sentence relation graph G;
a weight computing device, used for applying the within-document diversity penalty to the sentences of every document d in the cluster C_i, thereby obtaining the final weight of each sentence in the document;
a summary output device, used for screening the final weights of the sentences of document d and selecting the sentences with the largest weights to form and output the summary of the document.
The functions of the devices of this system correspond one-to-one with the steps of the method described above.
To verify the validity of the invention, the evaluation data and task of the Document Understanding Conference (DUC) are adopted. The present embodiment adopts the single-document summarization evaluation task of DUC 2002, namely the first evaluation task of DUC 2002. The single-document summarization task of DUC 2002 provides 567 documents, which come from TREC-9, and requires each participant to produce for every document a summary of at most 100 words in length. The summaries submitted by a participant are compared with manual summaries. The popular summary evaluation method ROUGE is adopted to evaluate the method of the invention, including the three evaluation metrics ROUGE-1, ROUGE-2 and ROUGE-W; the larger the ROUGE value, the better the effect, and ROUGE-1 is the most important metric. The invention first clusters the document set with the k-means algorithm, gathering the 567 documents into 59 document clusters, and then performs batch single-document summarization on the documents in each cluster. The method of the invention is compared with the graph-ranking method that uses only the information of a document itself; the experimental results are shown in Table 1.
Table 1: comparison results on the DUC 2002 evaluation data
The experimental results show that the method of the invention outperforms the summarization method that uses only the information of a single document on all three evaluation metrics.
The effect of the invention is as follows: the method of the invention overcomes the shortcoming that existing single-document summarization methods do not consider the information redundancy among similar documents, and can extract the truly important sentences from a single document. The invention achieves this effect because document clustering gathers similar documents into the same document cluster, the documents in the same cluster have a strong information-redundancy characteristic, and based on this characteristic the "votes" or "recommendations" between the sentences of the documents in the same cluster are used to estimate sentence importance.
In addition, because the information richness of the sentences of a document cluster can be obtained by a single computation, the efficiency of summary generation is improved, and single-document summaries can be generated in batch for all documents in the cluster.
The ROUGE evaluation method is described in the document "Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics" (by C.-Y. Lin and E.H. Hovy, published in Proceedings of the 2003 Language Technology Conference (HLT-NAACL 2003)).
The method of the invention is not limited to the embodiment described above. The algorithm for document clustering in step (1) is not limited to the k-means algorithm; other clustering algorithms such as hierarchical agglomerative algorithms, partitioning algorithms, self-organizing map network algorithms and kernel clustering algorithms may also be used. The method for computing the similarity value between sentences in step (3) is not limited to the cosine formula; other similarity computation methods such as the Jaccard formula, the Dice formula and the Overlap formula may also be used. The method for computing the information richness of each sentence in step (4) may also be replaced by other methods, such as the traditional method of scoring a sentence directly according to the importance of the keywords it contains. The method for computing the final weight of each sentence in a document in step (5) may also be replaced by other methods, such as the maximal marginal relevance (MMR) technique. Other embodiments derived by those skilled in the art from the technical solution of the invention likewise belong to the scope of the technical innovation of the invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from the spirit and scope of the invention. If these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is also intended to encompass them.