CN101446940B - Method and device of automatically generating a summary for document set - Google Patents

Method and device of automatically generating a summary for document set

Info

Publication number
CN101446940B
CN101446940B, CN2007101874807A, CN200710187480A
Authority
CN
China
Prior art keywords
document
new
sentence
document sets
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2007101874807A
Other languages
Chinese (zh)
Other versions
CN101446940A (en)
Inventor
万小军
余军
杨建武
吴於茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder E-Government Technology Co Ltd
Peking University
Peking University Founder Group Co Ltd
Original Assignee
Peking University Founder E-Government Technology Co Ltd
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder E-Government Technology Co Ltd, Peking University, Peking University Founder Group Co Ltd filed Critical Peking University Founder E-Government Technology Co Ltd
Priority to CN2007101874807A priority Critical patent/CN101446940B/en
Publication of CN101446940A publication Critical patent/CN101446940A/en
Application granted granted Critical
Publication of CN101446940B publication Critical patent/CN101446940B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Abstract

The invention discloses a method and a device for automatically generating a summary for a document set, relates to the field of language and text processing, and aims to solve the prior-art problem that summary generation for a document set is slow and inefficient because the weight of every sentence in every document of the set must be recomputed whenever a summary is generated. The method comprises the following steps: computing the weight of each sentence in a new document; updating the weights of the sentences in the existing summary of the document set; obtaining a weight ranking of all non-repeating sentences of the new document and the existing summary of the document set; and generating a new summary of the document set. The method and the device are applicable to automatic summarization of multiple documents.

Description

Method and device for automatically generating a summary for a document set
Technical field
The present invention relates to the field of language and text processing and to information retrieval, and in particular to a method and apparatus for automatically generating a summary for a document set.
Background technology
Automatically generating a summary for a document set means that a computer system automatically extracts the essence or main points of the documents in the set; its purpose is to compress and distill each document in the set so as to provide the user with a brief and concise description of the set's content. With the continued spread of computer technology and Internet technology, automatic summarization of document sets has been widely applied to text and Web content retrieval. For example, the news services provided by search engines such as Google and Baidu collect news items from the network, group them by topic and type into news topics (news document sets), and use automatic document-set summarization to generate a summary for each set, so that users can browse the news topics they are interested in more conveniently.
In general, methods for automatically generating a summary for a document set fall into two categories: sentence extraction (Extraction) and sentence generation (Abstraction). Extraction-based methods split each document of the set into sentences, assign each sentence a weight according to its importance within the set, and select the sentences with the largest weights to form the summary of the set; they achieve automatic document-set summarization without deep natural-language understanding, and are therefore simple to implement and easy to use. Generation-based methods require deep natural-language understanding: they perform syntactic and semantic analysis of each sentence in the set and use information extraction or natural-language generation to produce new sentences that form the summary; they are more complex to implement and less convenient to use.
Because extraction-based methods are simple to implement and easy to use, most current methods for automatically generating a summary for a document set are extraction-based. For example, the article "Centroid-based summarization of multiple documents" (D.R. Radev, H.Y. Jing, M. Stys and D. Tam, Information Processing and Management, 2004) discloses a centroid-based sentence extraction method: when assigning a weight to each sentence of the document set, it combines sentence-level and cross-sentence features, including the cluster centroid, the sentence position and TF/IDF (term frequency / inverse document frequency), and extracts the sentences with the larger weights as the summary of the set. The article "From Single to Multi-document Summarization: A Prototype System and its Evaluation" (C.-Y. Lin and E.H. Hovy, Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL-02), 2002) discloses a sentence extraction system named NeATS, which assigns weights to the sentences of the document set using features such as sentence position, term frequency, topic signatures and term clusters, and removes redundant sentences with the MMR (Maximal Marginal Relevance) technique to form the summary. The article "Cross-document summarization by concept classification" (H. Hardy, N. Shimizu, T. Strzalkowski, L. Ting, G.B. Wise, and X. Zhang, Proceedings of SIGIR'02) discloses a sentence extraction system named XdoX that is suited to summarizing large document sets: it first detects the most important themes in the set by paragraph clustering and then extracts the sentences reflecting those themes to form the summary. The article "Topic themes for multi-document summarization" (S. Harabagiu and F. Lacatusu, Proceedings of SIGIR'05, 2005) investigates five different ways of representing the topics of multiple documents and proposes a new topic representation.
When the summary is generated by sentence extraction, graph-based methods have also been used to rank the importance of sentences. For example, the article "Summarizing Similarities and Differences Among Related Documents" (I. Mani and E. Bloedorn, Information Retrieval, 2000) discloses a method named WebSumm, which uses a graph link model and ranks sentences under the assumption that the more vertices a vertex is connected to, the more important it is, and thus generates the summary of the document set. The article "LexPageRank: prestige in multi-document text summarization" (G. Erkan and D. Radev, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'04), 2004) discloses a method named LexPageRank, which first builds a sentence connectivity matrix, then computes sentence importance with a PageRank-like algorithm, and generates the summary of the document set according to the importance ranking of the sentences. The article "A language independent algorithm for single and multiple document summarization" (R. Mihalcea and P. Tarau, Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP'05), 2005) discloses the method of Mihalcea and Tarau, which likewise proposes PageRank- and HITS-like algorithms for computing sentence importance.
In summary, when the methods and systems of the examples above generate a summary for a document set by sentence extraction, they all first compute the weight of every sentence in the document set and then select the sentences with the larger weights as the summary; they differ only in how the weights are assigned to the sentences.
In practical Internet applications, content is updated very quickly, so the document set representing a given topic or type is also updated constantly: new related documents keep joining the current document set. For a hot news topic in particular, a large number of related documents appear on the Internet, and the summary of the corresponding document set must be updated very frequently. If an existing multi-document summarization method is applied to such a frequently updated document set, the weights of all sentences in the whole set must be recomputed every time one new document is added. The amount of computation is enormous, a new summary cannot be generated quickly for the set, and summary generation becomes inefficient, which cannot satisfy the demands of large-scale Internet applications (for example news topic detection and hot-topic analysis).
Summary of the invention
In one aspect, the present invention provides a method for automatically generating a summary for a document set. The method can generate a summary for the document set simply and quickly, and improves the efficiency of summary generation.
The technical solution adopted by the present invention is a method for automatically generating a summary for a document set which, after a new document is added to the document set, comprises the following steps:
calculating the weight of each sentence in the new document;
updating the weights of the sentences in the existing summary of the document set;
obtaining a weight ranking of all non-repeating sentences of the new document and the existing summary of the document set;
generating a new summary of the document set.
In the method for automatically generating a summary for a document set provided by the present invention, the weight of each sentence in the new document is calculated, the weights of the sentences in the existing summary of the document set are updated, and the sentences of the new document and of the existing summary are sorted and filtered to form the new summary of the document set. Compared with the prior art, the method only needs to calculate the weights of the sentences of the new document and of the existing summary of the document set; it does not need to recompute the weights of all sentences of every document in the set in order to obtain the new summary. The method therefore generates the summary for the document set simply and quickly, greatly improves the efficiency of summary generation, and can adapt to ever faster information updates.
In another aspect, the present invention also provides a device for automatically generating a summary for a document set. The device can generate a summary for the document set simply and quickly, and improves the efficiency of summary generation.
The technical solution adopted by the present invention is a device for automatically generating a summary for a document set after a new document is added to the set, characterized in that it comprises:
a weight calculation unit, which calculates the weight of each sentence in the new document and updates the weights of the sentences in the existing summary of the document set;
a selection and sorting unit, which obtains from the weight calculation unit the weights of all non-repeating sentences of the new document and the existing summary of the document set, and sorts them;
a summary generation unit, which takes the sentences with the larger weights from the selection and sorting unit and generates the new summary of the document set.
In the device for automatically generating a summary for a document set provided by the present invention, the weight of each sentence in the new document is calculated, the weights of the sentences in the existing summary of the document set are updated, and the sentences of the new document and of the existing summary are sorted and filtered to form the new summary of the document set. Compared with the prior art, the device only needs to calculate the weights of the sentences of the new document and of the existing summary of the document set; it does not need to recompute the weights of all sentences of every document in the set in order to obtain the new summary. The device therefore generates the summary for the document set simply and quickly, greatly improves the efficiency of summary generation, and can adapt to ever faster information updates.
Description of drawings
Fig. 1 is a flow chart of the method for automatically generating a summary for a document set provided by the present invention;
Fig. 2 is a schematic structural diagram of the device for automatically generating a summary for a document set provided by the present invention;
Fig. 3 is a schematic structural diagram of the vector calculation unit of the device shown in Fig. 2;
Fig. 4 is a schematic structural diagram of the document-set feature updating unit of the device shown in Fig. 2;
Fig. 5 is a schematic structural diagram of the weight calculation unit of the device shown in Fig. 2;
Fig. 6 is a schematic structural diagram of the selection and sorting unit of the device shown in Fig. 2.
Embodiment
To solve the prior-art problem that, when a summary is generated for a document set, the weight of every sentence in every document of the set must be recomputed, making summary generation slow and inefficient, the present invention provides a method for automatically generating a summary for a document set. The present invention is described in detail below with reference to the drawings and embodiments.
As shown in Fig. 1, the method for automatically generating a summary for a document set provided by the present invention comprises the following steps after a new document is added to the document set:
Step 101: compute the vector of the new document and the vector of each sentence in the new document.
The concrete steps are:
Split the new document d_new into sentences, obtaining the sentence set S_new = {s_i | 1 ≤ i ≤ n}, where the positive integer n is the number of sentences contained in d_new.
To compute the vector v(s_i) of each sentence s_i in S_new, segment s_i into words, obtaining the word set w_i = {w_ij | 1 ≤ j ≤ m} of s_i, where the positive integer m is the number of words contained in s_i. Each dimension of v(s_i) corresponds to one word of the new document, and the weight of each dimension of v(s_i) is computed as

tf_wij × idf_wij    (1-1)

where tf_wij is the frequency of occurrence of the word w_ij in the document set and idf_wij is the inverse document frequency of w_ij in the document set, which can be expressed as

idf_wij = 1 + log(N / n_wij)    (1-2)

where N is the number of documents in the document set and n_wij is the number of documents that contain the word w_ij.
Computing the weight of each dimension with formula (1-1) yields the vector v(s_i).
To compute the vector v(d_new) of the new document d_new, segment d_new into words, obtaining the word set W_new = {w_k | 1 ≤ k ≤ z}, where the positive integer z is the number of words contained in d_new.
The segmentation of the new document d_new can be done in two ways: either segment d_new directly, or take the union of the per-sentence word sets, W_new = ∪_i w_i, where 1 ≤ i ≤ n, the positive integer n is the number of sentences contained in d_new, and w_i is the word set of sentence s_i after segmentation.
Since each dimension of v(d_new) also corresponds to one word of the new document, the weight of each dimension of v(d_new) is computed as

tf_wk × idf_wk    (1-3)

where tf_wk is the frequency of occurrence of the word w_k in the document set and idf_wk is the inverse document frequency of w_k in the document set, which can be expressed as

idf_wk = 1 + log(N / n_wk)    (1-4)

where N is the number of documents in the document set and n_wk is the number of documents that contain the word w_k.
Computing the weight of each dimension with formula (1-3) yields the vector v(d_new).
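As an illustration of step 101 (not part of the patented embodiment), the sketch below builds such tf × idf vectors for the sentences and for the whole document. It is a minimal Python sketch under stated assumptions: whitespace tokenization stands in for a real word segmenter, and the helper names and toy corpus statistics are invented for the example.

```python
import math
from collections import Counter

def idf(word, doc_count, docs_containing):
    # Formulas (1-2)/(1-4): idf_w = 1 + log(N / n_w)
    return 1.0 + math.log(doc_count / docs_containing[word])

def tfidf_vector(words, term_freq, doc_count, docs_containing):
    """Build a sparse vector {word: tf * idf} as in formulas (1-1)/(1-3).

    term_freq[word]  -- frequency of the word in the document set
    doc_count        -- N, the number of documents in the set
    docs_containing  -- n_w, number of documents of the set containing the word
    """
    return {w: term_freq[w] * idf(w, doc_count, docs_containing)
            for w in set(words) if docs_containing.get(w, 0) > 0}

# Toy usage: one "new document" of three sentences, whitespace-tokenized.
sentences = ["the cat sat", "the dog barked", "the cat barked"]
doc_words = [w for s in sentences for w in s.split()]

# Corpus statistics over the whole document set; in this toy the set is just
# this one document, so N = 1 and every word occurs in one document.
term_freq = Counter(doc_words)
docs_containing = {w: 1 for w in term_freq}
N = 1

sentence_vectors = [tfidf_vector(s.split(), term_freq, N, docs_containing)
                    for s in sentences]
document_vector = tfidf_vector(doc_words, term_freq, N, docs_containing)
```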
Step 102: update the center vector and the document vector list of the document set.
This step specifically comprises:
updating the document set D to D' = D ∪ {d_new};
updating the document vector list L_D to L'_D = L_D ∪ {v(d_new)};
updating the center vector c of the document set D to c' with the formula

c' = (1 / |D'|) Σ_{d_i ∈ D'} v(d_i)    (1-5)

where |D'| is the number of documents in the document set D'.
The update of the center vector c of the document set D to c' can also be expressed as

c' = (|D| × c + v(d_new)) / (|D| + 1)    (1-6)

where |D| is the number of documents in the document set D.
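The point of formula (1-6) is that the centroid can be refreshed from the old centroid and the new document vector alone, without touching the other document vectors. A minimal Python sketch, assuming dense vectors stored as plain lists (the function names are illustrative, not from the patent):

```python
def centroid_full(doc_vectors):
    # Formula (1-5): c' = (1/|D'|) * sum of all document vectors of D'.
    dim = len(doc_vectors[0])
    return [sum(v[i] for v in doc_vectors) / len(doc_vectors) for i in range(dim)]

def centroid_incremental(old_centroid, n_docs, new_doc_vector):
    # Formula (1-6): c' = (|D| * c + d_new) / (|D| + 1); only the old
    # centroid and the new document vector are needed.
    return [(n_docs * c + d) / (n_docs + 1)
            for c, d in zip(old_centroid, new_doc_vector)]

# Toy check that both formulas agree.
D = [[1.0, 0.0], [0.0, 1.0]]          # existing document vectors
d_new = [1.0, 1.0]                    # vector of the new document
c_old = centroid_full(D)              # [0.5, 0.5]
assert centroid_incremental(c_old, len(D), d_new) == centroid_full(D + [d_new])
```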
Step 103: compute the weight of each sentence in the new document.
The concrete method is:
compute the content weight w_content(s_i) of sentence s_i:

w_content(s_i) = cos(v(s_i), c') = (v(s_i) · c') / (||v(s_i)|| × ||c'||)    (1-7)

where v(s_i) is the vector of sentence s_i and c' is the center vector of the document set D after the update.
As can be seen from formula (1-7), the content weight w_content(s_i) is determined by the cosine between the vector of s_i and the center vector of the document set: the more similar s_i is to the center vector, the larger its content weight. Since the center vector of the document set reflects the theme of the set, the content weight w_content(s_i) can be used directly as the weight of s_i.
To make the weight of sentence s_i reflect its importance in the document set more accurately (its relevance to the theme, its position in the document set, and so on), the computation of the sentence weights of the new document further comprises:
recording the position information of each sentence in the new document, for example the storage position of each sentence or its relation to the preceding and following sentences;
computing the position weight w_position(s_i) of sentence s_i:

w_position(s_i) = ((n - i + 1) / n) × max_{s ∈ d_new} w_content(s)    (1-8)

where n is the total number of sentences of the new document d_new, i (1 ≤ i ≤ n) is the sentence index, and max_{s ∈ d_new} w_content(s) is the maximum content weight over all sentences of d_new;
computing the combined weight w(s_i) of sentence s_i:

w(s_i) = α · w_content(s_i) + β · w_position(s_i)    (1-9)

where α and β are parameters with 0 ≤ α, β ≤ 1 and α + β = 1.
Computing the combined weight w(s_i) assigns a weight to each sentence of the new document more effectively.
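A minimal Python sketch of formulas (1-7) to (1-9), assuming sparse {term: weight} dictionaries for the sentence vectors and the centroid; the default α = β = 0.5 is an arbitrary choice for the example, not a value prescribed by the embodiment:

```python
import math

def cosine(u, v):
    # Cosine of two sparse vectors represented as {term: weight} dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_weights(sentence_vectors, centroid, alpha=0.5, beta=0.5):
    """Return the combined weight w(s_i) of every sentence of the new document.

    sentence_vectors is ordered by position in the document (index 0 first).
    """
    n = len(sentence_vectors)
    # Formula (1-7): content weight = similarity to the set's center vector.
    content = [cosine(v, centroid) for v in sentence_vectors]
    max_content = max(content) if content else 0.0
    weights = []
    for i, w_content in enumerate(content, start=1):
        # Formula (1-8): earlier sentences get a larger position weight.
        w_position = (n - i + 1) / n * max_content
        # Formula (1-9): combined weight, with alpha + beta = 1.
        weights.append(alpha * w_content + beta * w_position)
    return weights

# Toy usage: two sentences of a new document against a two-term centroid.
centroid = {"flood": 1.0, "rain": 1.0}
sents = [{"flood": 1.0, "rain": 1.0}, {"game": 1.0}]
print(sentence_weights(sents, centroid))  # the first sentence scores highest
```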
Step 104: update the weights of the sentences in the existing summary of the document set, that is, recompute them. The concrete method is the same as the calculation of the weight of each sentence in the new document in step 103: the content weights of the sentences in the existing summary are calculated with formula (1-7); the position weights of the sentences in the existing summary do not need to be recomputed, since the position weight values saved during the previous round of summary generation can be used directly; the combined weights of the sentences in the existing summary are then calculated with formula (1-9). This yields the weights of the sentences in the existing summary of the document set.
Step 105: sort the sentences of the new document and of the existing summary of the document set by weight. For example, if the new document d_new contains n sentences and the existing summary of the document set contains k sentences, the k + n sentences are arranged in descending order of the weights assigned to them (these may be the content weights or the combined weights, but the same type of weight must be used for all sentences).
Step 106: delete the repeated sentences after sorting.
The concrete deletion method is:
starting from the second sentence of the sequence formed by the k + n sentences above, judge the degree of repetition between this sentence s_i and every sentence s_j ranked before it (j < i), using the formula

sim(s_i, s_j) = cos(v(s_i), v(s_j)) = (v(s_i) · v(s_j)) / (||v(s_i)|| × ||v(s_j)||)    (1-10)

When the degree of repetition between s_i and s_j calculated with formula (1-10) is greater than a threshold ε (0 ≤ ε ≤ 1; ε = 0.85 in this embodiment), s_i and s_j are judged to be repeated sentences, and either of them can be deleted.
So that the summary of the document set shows the most recently updated content, the reception time of each new document can be saved when it is received, and when repeated sentences are found, the sentence with the earlier reception time can be deleted. For example, if step 106 judges s_i and s_j to be repeated sentences, the document containing s_i was received on July 15, 2007 and the document containing s_j was received on May 28, 2006, then s_j is deleted.
Step 107: according to the weight ranking, select the sentences with the larger weights and generate the new summary of the document set.
After step 106 has deleted the repeated sentences, a sequence of p sentences remains (k < p < k + n), arranged in descending order of weight. To obtain a document-set summary consisting of k sentences, the k sentences with the larger weights are selected from the p sentences as the new summary of the document set.
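Steps 105 to 107 together can be sketched as follows (Python). The data layout — one dict per candidate sentence with its sparse vector, weight and the reception time of its source document — is an assumption for the example; the repetition threshold ε = 0.85 is the value named in the embodiment.

```python
import math

def cosine(u, v):
    # Cosine similarity of two sparse {term: weight} vectors, as in (1-10).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_summary(candidates, k, eps=0.85):
    """Sort the candidate sentences, drop repetitions, keep the top k."""
    # Step 105: arrange the k + n candidates in descending order of weight.
    ranked = sorted(candidates, key=lambda s: s["weight"], reverse=True)
    kept = []
    for cand in ranked:
        # Step 106: compare with every sentence already kept ahead of it.
        dup = next((s for s in kept
                    if cosine(s["vector"], cand["vector"]) > eps), None)
        if dup is None:
            kept.append(cand)
        elif cand["received"] > dup["received"]:
            # Of a repeated pair, delete the sentence whose document arrived
            # earlier, so the summary shows the most recently updated content.
            kept = [s for s in kept if s is not dup]
            kept.append(cand)
        # otherwise cand is the older duplicate and is simply discarded
    # Step 107: the k highest-weighted surviving sentences form the new summary.
    kept.sort(key=lambda s: s["weight"], reverse=True)
    return kept[:k]
```

On each update the routine only touches the k sentences of the old summary plus the n sentences of the new document, which is what keeps the incremental method fast.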
To make the method provided by the present invention generate the summary for the document set even faster, the following step is also performed after step 101:
Step 108: judge whether the new document is a repetition, and keep only non-repeated documents.
The concrete judging steps are as follows:
when the new document d_new is the first document of the document set D, the new document is a non-repeated document and the method continues with step 102;
when the new document d_new is not the first document of the document set D = {d_i | 1 ≤ i ≤ m} (where the positive integer m is the number of documents currently in the set), compare the similarity between the new document and every document d_i in D with the formula

sim(d_new, d_i) = cos(v(d_new), v(d_i)) = (v(d_new) · v(d_i)) / (||v(d_new)|| × ||v(d_i)||)    (1-11)

where v(d_i) is the vector of document d_i, taken directly from the document vector list L_D = {v(d_i) | 1 ≤ i ≤ m} of the document set D, so it does not need to be recomputed.
When the similarity between d_new and some d_i is greater than a threshold θ (0 ≤ θ ≤ 1), the new document d_new is a repeated document; step 102 is not performed and the method waits for the next new document. When the similarity between d_new and every d_i is less than or equal to θ, the new document is a non-repeated document and the method proceeds with step 102.
Judging the similarity between the new document and the documents already in the set in step 108 allows repeated documents to be discarded without further processing; that is, no new summary is generated for a newly added repeated document, which makes automatic summary generation for the document set faster and more efficient.
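Step 108 can be sketched as follows (Python; the document vector list is kept as sparse dicts, mirroring formula (1-11), and the threshold value passed in the toy call is an arbitrary assumption):

```python
import math

def cosine(u, v):
    # Same sparse-vector cosine as in the earlier sketches.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def is_duplicate_document(new_doc_vector, doc_vector_list, theta):
    """Step 108: the first document of a set is never a duplicate; otherwise
    the new document is a duplicate if formula (1-11) exceeds theta for any
    document already in the set (vectors are read from the list L_D, so
    nothing has to be recomputed)."""
    if not doc_vector_list:
        return False
    return any(cosine(new_doc_vector, d) > theta for d in doc_vector_list)

# Toy usage: a near-identical document is rejected before step 102 runs.
L_D = [{"rain": 2.0, "flood": 1.0}]
d_new = {"rain": 2.0, "flood": 1.0, "storm": 0.1}
if is_duplicate_document(d_new, L_D, theta=0.9):
    pass  # wait for the next document; do not update the set or the summary
else:
    L_D.append(d_new)  # proceed with step 102
```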
In the method for automatically generating a summary for a document set provided by the present invention, the weight of each sentence in the new document is calculated, the weights of the sentences in the existing summary of the document set are updated, and the sentences of the new document and of the existing summary are sorted and filtered to form the new summary of the document set. Compared with the prior art, the method only needs to calculate the weights of the sentences of the new document and of the existing summary of the document set; it does not need to recompute the weights of all sentences of every document in the set in order to obtain the new summary. The method therefore generates the summary for the document set simply and quickly, greatly improves the efficiency of summary generation, and can adapt to ever faster information updates.
Corresponding to the method above, the present invention also provides a device for automatically generating a summary for a document set after a new document is added to the set. As shown in Fig. 2, the device for automatically generating a summary for a document set comprises:
a vector calculation unit, which calculates the vector of the new document and the vector of each sentence in the new document;
As shown in Fig. 3, the vector calculation unit comprises:
a sentence splitting unit, which splits the new document d_new into sentences, obtaining the sentence set S_new = {s_i | 1 ≤ i ≤ n}, where the positive integer n is the number of sentences contained in d_new;
a word segmentation unit which, when the vector v(s_i) of each sentence s_i in S_new is calculated, segments s_i into words, obtaining the word set w_i = {w_ij | 1 ≤ j ≤ m} of s_i, where the positive integer m is the number of words contained in s_i.
Each dimension of v(s_i) corresponds to one word of the new document; the weight of each dimension of v(s_i) is computed with formula (1-1), which is not repeated here, and computing the weight of every dimension with formula (1-1) yields the vector v(s_i).
When the vector v(d_new) of the new document d_new is calculated, the word segmentation unit segments d_new into words, obtaining the word set W_new = {w_k | 1 ≤ k ≤ z}, where the positive integer z is the number of words contained in d_new.
The segmentation of the new document d_new can be done in two ways: either the word segmentation unit directly segments the received new document d_new, or it obtains from the sentence splitting unit the word sets of the individual sentences of the new document and takes their union W_new = ∪_i w_i (where 1 ≤ i ≤ n, the positive integer n is the number of sentences contained in d_new, and w_i is the word set of sentence s_i after segmentation) to obtain the word set W_new = {w_k | 1 ≤ k ≤ z} of d_new.
Since each dimension of v(d_new) also corresponds to one word of the new document, the weight of each dimension of v(d_new) is computed with formula (1-3), which is not repeated here, and computing the weight of every dimension with formula (1-3) yields the vector v(d_new).
a document-set feature updating unit, which updates the center vector and the document vector list of the document set according to the result obtained by the vector calculation unit;
As shown in Fig. 4, the document-set feature updating unit comprises:
a document set updating unit, which updates the document set D to D' = D ∪ {d_new};
a document vector list updating unit, which updates the document vector list L_D to L'_D = L_D ∪ {v(d_new)} according to the vector of the new document obtained by the vector calculation unit;
a document set center vector updating unit, which updates the center vector c of the document set D to c' according to the vector of the new document obtained by the vector calculation unit; the concrete formula is (1-5) or (1-6) and is not repeated here.
a weight calculation unit, which calculates the weight of each sentence in the new document and updates the weights of the sentences in the existing summary of the document set according to the results obtained by the vector calculation unit and the document-set feature updating unit;
As shown in Fig. 5, the weight calculation unit comprises:
a content weight calculation unit, which calculates the content weight w_content(s_i) of sentence s_i according to the results obtained by the vector calculation unit and the document-set feature updating unit, see formula (1-7).
To make the weight of sentence s_i reflect its importance in the document set more accurately (its relevance to the theme, its position in the document set, and so on), the weight calculation unit, as shown in Fig. 5, further comprises:
a position information recording unit, which records the position information of each sentence in the new document;
a position weight calculation unit, which calculates the position weight w_position(s_i) of sentence s_i according to the sentence position information recorded by the position information recording unit and the content weight obtained by the content weight calculation unit, see formula (1-8);
a combined weight calculation unit, which calculates the combined weight w(s_i) of sentence s_i from the content weight and the position weight, see formula (1-9).
a selection and sorting unit, which obtains from the weight calculation unit the weights of all non-repeating sentences of the new document and the existing summary of the document set, and sorts them;
As shown in Fig. 6, the selection and sorting unit comprises:
a sorting unit, which sorts the sentences of the new document and of the existing summary of the document set by weight according to the weights calculated by the weight calculation unit. For example, if the new document d_new contains n sentences and the existing summary of the document set contains k sentences, the k + n sentences are arranged in descending order of the weights assigned to them (these may be the content weights or the combined weights, but the same type of weight must be used for all sentences);
a filtering unit, which deletes the repeated sentences after sorting.
The concrete deletion method is:
starting from the second sentence of the sequence formed by the k + n sentences above, judge the degree of repetition between this sentence s_i and every sentence s_j ranked before it (j < i) with formula (1-10).
When the degree of repetition between s_i and s_j calculated with formula (1-10) is greater than a threshold ε (0 ≤ ε ≤ 1; ε = 0.85 in this embodiment), s_i and s_j are judged to be repeated sentences, and either of them can be deleted.
So that the summary of the document set shows the most recently updated content, the filtering unit also comprises a time recording unit, which records the reception time of each new document; when the filtering unit finds repeated sentences, the sentence with the earlier reception time can be deleted. For example, if s_i and s_j are judged to be repeated sentences, the reception time of s_i is July 15, 2007 and the reception time of s_j is May 28, 2006, then s_j is deleted.
a summary generation unit, which takes the sentences with the larger weights from the selection and sorting unit, according to the preset number of summary sentences of the document set, and generates the new summary of the document set.
After the filtering unit has deleted the repeated sentences, a sequence of p sentences remains (k < p < k + n), arranged in descending order of weight. To obtain a document-set summary consisting of k sentences, the k sentences with the larger weights are selected from the p sentences as the new summary of the document set.
To make the device provided by the present invention generate the summary for the document set even faster, the device for automatically generating a summary for a document set further comprises a judging unit, which judges, according to the result obtained by the vector calculation unit, whether the new document is a repetition and keeps only non-repeated documents; only when the new document is a non-repeated document does the judging unit pass the result obtained by the vector calculation unit to the document-set feature updating unit.
The concrete judging steps are as follows:
when the new document d_new is the first document of the document set D, the new document is a non-repeated document;
when the new document d_new is not the first document of the document set D = {d_i | 1 ≤ i ≤ m} (where the positive integer m is the number of documents currently in the set), compare the similarity between the new document and every document d_i in D; the concrete comparison formula is formula (1-11);
when the similarity between d_new and some d_i is greater than a threshold θ (0 ≤ θ ≤ 1), the new document d_new is a repeated document; when the similarity between d_new and every d_i is less than or equal to θ, the new document is a non-repeated document.
Judging the similarity between the new document and the documents in the set with the judging unit allows repeated documents to be discarded without further processing; that is, no summary is generated for a newly added repeated document, which makes automatic summary generation for the document set faster and more efficient.
In the device for automatically generating a summary for a document set provided by the present invention, the weight of each sentence in the new document is calculated, the weights of the sentences in the existing summary of the document set are updated, and the sentences of the new document and of the existing summary are sorted and filtered to form the new summary of the document set. Compared with the prior art, the device only needs to calculate the weights of the sentences of the new document and of the existing summary of the document set; it does not need to recompute the weights of all sentences of every document in the set in order to obtain the new summary. The device therefore generates the summary for the document set simply and quickly, greatly improves the efficiency of summary generation, and can adapt to ever faster information updates. Application in an actual Internet public-opinion analysis system shows that, while maintaining summary quality, the device of the present invention greatly improves summarization efficiency, being more than 50 times faster than the methods provided by the prior art.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope defined by the claims.

Claims (19)

1. A method for automatically generating a summary for a document set, used for automatically generating a summary for the document set, characterized in that, after a new document is added to the document set, it comprises the steps of:
calculating the weight of each sentence in the new document;
updating the weights of the sentences in the existing summary of the document set;
obtaining a weight ranking of all non-repeating sentences of the new document and the existing summary of the document set;
selecting, according to the weight ranking, the sentences with the larger weights and generating a new summary of the document set.
2. The method for automatically generating a summary for a document set according to claim 1, characterized in that, before the step of calculating the weight of each sentence in the new document, it further comprises the steps of:
calculating the vector of the new document and the vector of each sentence in the new document;
updating the center vector and the document vector list of the document set.
3. The method for automatically generating a summary for a document set according to claim 2, characterized in that the step of calculating the vector of the new document and the vector of each sentence in the new document comprises:
splitting the new document d_new into sentences, obtaining a sentence set S_new = {s_i | 1 ≤ i ≤ n}, where the positive integer n is the number of sentences contained in d_new;
when calculating the vector v(s_i) of each sentence s_i in S_new, segmenting s_i into words, obtaining a word set w_i = {w_ij | 1 ≤ j ≤ m}, where the positive integer m is the number of words contained in s_i, the weight of each dimension of v(s_i) being computed as
tf_wij × idf_wij
where tf_wij is the frequency of occurrence of the word w_ij in the document set and idf_wij is the inverse document frequency of w_ij in the document set;
when calculating the vector v(d_new) of the new document d_new, segmenting d_new into words, obtaining a word set W_new = {w_k | 1 ≤ k ≤ z}, where the positive integer z is the number of words contained in d_new, the weight of each dimension of v(d_new) being computed as
tf_wk × idf_wk
where tf_wk is the frequency of occurrence of the word w_k in the document set and idf_wk is the inverse document frequency of w_k in the document set.
4. The method for automatically generating a summary for a document set according to claim 2, characterized in that, after calculating the vector of the new document and the vector of each sentence in the new document, it further comprises the step of:
judging whether the new document is a repetition, and keeping only non-repeated documents;
the concrete judging steps being as follows:
when the new document is the first document of the document set, the new document is a non-repeated document;
otherwise, calculating the similarity between the new document and every document in the document set; when the similarity between the two documents is greater than a threshold θ, the new document is a repeated document, where 0 ≤ θ ≤ 1; when the similarity between the two documents is less than or equal to the threshold θ, the new document is a non-repeated document.
5. The method for automatically generating a summary for a document set according to claim 4, characterized in that the similarity between the new document d_new and each document d_i of the document set is calculated with the following cosine formula:
sim(d_new, d_i) = cos(v(d_new), v(d_i)) = (v(d_new) · v(d_i)) / (||v(d_new)|| × ||v(d_i)||)
where v(d_i) is the vector of document d_i.
6. The method for automatically generating a summary for a document set according to claim 2, characterized in that the step of updating the center vector and the document vector list of the document set specifically comprises the steps of:
updating the document set D to D' = D ∪ {d_new};
updating the document vector list L_D to L'_D = L_D ∪ {v(d_new)};
updating the center vector c of the document set D to c' with the formula
c' = (1 / |D'|) Σ_{d_i ∈ D'} v(d_i)
where |D'| is the number of documents in the document set D'.
7. The method for automatically generating a summary for a document set according to claim 1, characterized in that the weight of each sentence in the new document is calculated as follows:
calculating the content weight w_content(s_i) of sentence s_i:
w_content(s_i) = cos(v(s_i), c') = (v(s_i) · c') / (||v(s_i)|| × ||c'||)
where v(s_i) is the vector of sentence s_i and c' is the center vector of the document set D after the update.
8. The method for automatically generating a summary for a document set according to claim 7, characterized in that calculating the weight of each sentence in the new document further comprises:
recording the position information of each sentence in the new document;
calculating the position weight w_position(s_i) of sentence s_i:
w_position(s_i) = ((n - i + 1) / n) × max_{s ∈ d_new} w_content(s)
where n is the total number of sentences of the new document d_new, i is the sentence index with 1 ≤ i ≤ n, and max_{s ∈ d_new} w_content(s) is the maximum content weight over all sentences of d_new;
calculating the combined weight w(s_i) of sentence s_i:
w(s_i) = α · w_content(s_i) + β · w_position(s_i)
where α and β are parameters with 0 ≤ α, β ≤ 1 and α + β = 1.
9. The method for automatically generating a summary for a document set according to claim 1, 7 or 8, characterized in that the weights of the sentences in the existing summary of the document set are updated with the same method as that used to calculate the weight of each sentence in the new document, wherein the position weight of a sentence in the existing summary of the document set is the position weight saved during the previous round of summary generation.
10. The method for automatically generating a summary for a document set according to claim 1, characterized in that obtaining the weight ranking of all non-repeating sentences of the new document and the existing summary of the document set comprises:
sorting the sentences of the new document and of the existing summary of the document set by weight;
deleting the repeated sentences after sorting.
11. The method for automatically generating a summary for a document set according to claim 10, characterized in that the step of deleting the repeated sentences after sorting specifically comprises the steps of:
recording the reception time of the new document;
starting from the second sentence of the sequence sorted by weight, judging the degree of repetition between this sentence s_i and every sentence s_j ranked before it, where j < i;
when the degree of repetition is greater than a threshold ε, deleting the sentence with the earlier reception time, where 0 ≤ ε ≤ 1.
12. A device for automatically generating a summary for a document set, used for automatically generating a summary for the document set after a new document is added to the document set, characterized in that it comprises:
a weight calculation unit, which calculates the weight of each sentence in the new document and updates the weights of the sentences in the existing summary of the document set;
a selection and sorting unit, which obtains from the weight calculation unit the weights of all non-repeating sentences of the new document and the existing summary of the document set, and sorts them;
a summary generation unit, which takes the sentences with the larger weights from the selection and sorting unit and generates the new summary of the document set.
13. The device for automatically generating a summary for a document set according to claim 12, characterized in that it further comprises:
a vector calculation unit, which calculates the vector of the new document and the vector of each sentence in the new document;
a document-set feature updating unit, which updates the center vector and the document vector list of the document set according to the result obtained by the vector calculation unit.
14. The device for automatically generating a summary for a document set according to claim 13, characterized in that the vector calculation unit comprises:
a sentence splitting unit, which splits the new document d_new into sentences, obtaining a sentence set S_new = {s_i | 1 ≤ i ≤ n}, where the positive integer n is the number of sentences contained in d_new;
a word segmentation unit which, when the vector v(s_i) of each sentence s_i in S_new is calculated, segments s_i into words, obtaining a word set w_i = {w_ij | 1 ≤ j ≤ m}, where the positive integer m is the number of words contained in s_i, the weight of each dimension of v(s_i) being computed as
tf_wij × idf_wij
where tf_wij is the frequency of occurrence of the word w_ij in the document set and idf_wij is the inverse document frequency of w_ij in the document set;
and which, when the vector v(d_new) of the new document d_new is calculated, segments d_new into words, obtaining a word set W_new = {w_k | 1 ≤ k ≤ z}, where the positive integer z is the number of words contained in d_new, the weight of each dimension of v(d_new) being computed as
tf_wk × idf_wk
where tf_wk is the frequency of occurrence of the word w_k in the document set and idf_wk is the inverse document frequency of w_k in the document set.
15. The device for automatically generating a summary for a document set according to claim 13, characterized in that it further comprises:
a judging unit, which judges, according to the result obtained by the vector calculation unit, whether the new document is a repetition and keeps only non-repeated documents.
16. The device for automatically generating a summary for a document set according to claim 13, characterized in that the document-set feature updating unit comprises:
a document set updating unit, which updates the document set D to D' = D ∪ {d_new};
a document vector list updating unit, which updates the document vector list L_D to L'_D = L_D ∪ {v(d_new)} according to the vector of the new document obtained by the vector calculation unit;
a document set center vector updating unit, which updates the center vector c of the document set D to c' according to the vector of the new document obtained by the vector calculation unit, with the concrete formula
c' = (1 / |D'|) Σ_{d_i ∈ D'} v(d_i)
where |D'| is the number of documents in the document set D'.
17. The device for automatically generating a summary for a document set according to claim 13, characterized in that the weight calculation unit comprises:
a content weight calculation unit, which calculates the content weight w_content(s_i) of sentence s_i according to the results obtained by the vector calculation unit and the document-set feature updating unit:
w_content(s_i) = cos(v(s_i), c') = (v(s_i) · c') / (||v(s_i)|| × ||c'||)
where v(s_i) is the vector of sentence s_i and c' is the center vector of the document set D after the update.
18. The device for automatically generating a summary for a document set according to claim 17, characterized in that the weight calculation unit further comprises:
a position information recording unit, which records the position information of each sentence in the new document;
a position weight calculation unit, which calculates the position weight w_position(s_i) of sentence s_i according to the sentence position information recorded by the position information recording unit and the content weight obtained by the content weight calculation unit:
w_position(s_i) = ((n - i + 1) / n) × max_{s ∈ d_new} w_content(s)
where n is the total number of sentences of the new document d_new, i is the sentence index with 1 ≤ i ≤ n, and max_{s ∈ d_new} w_content(s) is the maximum content weight over all sentences of d_new;
a combined weight calculation unit, which calculates the combined weight w(s_i) of sentence s_i from the content weight and the position weight:
w(s_i) = α · w_content(s_i) + β · w_position(s_i)
where α and β are parameters with 0 ≤ α, β ≤ 1 and α + β = 1.
19. The device for automatically generating a summary for a document set according to claim 12, characterized in that the selection and sorting unit comprises:
a sorting unit, which sorts the sentences of the new document and of the existing summary of the document set by weight according to the weights calculated by the weight calculation unit;
a filtering unit, which deletes the repeated sentences after sorting;
the filtering unit comprising a time recording unit, which records the reception time of the new document.
CN2007101874807A 2007-11-27 2007-11-27 Method and device of automatically generating a summary for document set Expired - Fee Related CN101446940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101874807A CN101446940B (en) 2007-11-27 2007-11-27 Method and device of automatically generating a summary for document set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101874807A CN101446940B (en) 2007-11-27 2007-11-27 Method and device of automatically generating a summary for document set

Publications (2)

Publication Number Publication Date
CN101446940A CN101446940A (en) 2009-06-03
CN101446940B true CN101446940B (en) 2011-09-28

Family

ID=40742624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101874807A Expired - Fee Related CN101446940B (en) 2007-11-27 2007-11-27 Method and device of automatically generating a summary for document set

Country Status (1)

Country Link
CN (1) CN101446940B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9916375B2 (en) 2014-08-15 2018-03-13 International Business Machines Corporation Extraction of concept-based summaries from documents

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699525B (en) * 2014-01-03 2016-08-31 江苏金智教育信息股份有限公司 A kind of method and apparatus automatically generating summary based on text various dimensions feature
CN105989058A (en) * 2015-02-06 2016-10-05 北京中搜网络技术股份有限公司 Chinese news brief generating system and method
CN104778204B (en) * 2015-03-02 2018-03-02 华南理工大学 More document subject matters based on two layers of cluster find method
KR101656245B1 (en) * 2015-09-09 2016-09-09 주식회사 위버플 Method and system for extracting sentences
CN106599148A (en) * 2016-12-02 2017-04-26 东软集团股份有限公司 Method and device for generating abstract
CN107590193A (en) * 2017-08-14 2018-01-16 安徽晶奇网络科技股份有限公司 A kind of government affairs public sentiment management system for monitoring
CN107688652B (en) * 2017-08-31 2020-12-29 苏州大学 Evolution type abstract generation method facing internet news events
CN109815328B (en) * 2018-12-28 2021-05-25 东软集团股份有限公司 Abstract generation method and device
CN110377808A (en) * 2019-06-14 2019-10-25 北京达佳互联信息技术有限公司 Document processing method, device, electronic equipment and storage medium
CN110222344B (en) * 2019-06-17 2022-09-23 上海元趣信息技术有限公司 Composition element analysis algorithm for composition tutoring of pupils
CN110287309B (en) * 2019-06-21 2022-04-22 深圳大学 Method for quickly extracting text abstract
CN110287289A (en) * 2019-06-25 2019-09-27 北京金海群英网络信息技术有限公司 A kind of document keyword extraction and the method based on document matches commodity
CN110728143A (en) * 2019-09-23 2020-01-24 上海蜜度信息技术有限公司 Method and equipment for identifying document key sentences
CN110750976A (en) * 2019-09-26 2020-02-04 平安科技(深圳)有限公司 Language model construction method, system, computer device and readable storage medium
CN110705287B (en) * 2019-09-27 2023-06-30 北京妙笔智能科技有限公司 Method and system for generating text abstract
CN113204956B (en) * 2021-07-06 2021-10-08 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN115098667B (en) * 2022-08-25 2023-01-03 北京聆心智能科技有限公司 Abstract generation method, device and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1341899A (en) * 2000-09-07 2002-03-27 国际商业机器公司 Method for automatic generating abstract from word or file
CN1609845A (en) * 2003-10-22 2005-04-27 国际商业机器公司 Method and apparatus for improving readability of automatic generated abstract by machine
CN101008941A (en) * 2007-01-10 2007-08-01 复旦大学 Successive principal axes filter method of multi-document automatic summarization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1341899A (en) * 2000-09-07 2002-03-27 国际商业机器公司 Method for automatic generating abstract from word or file
CN1609845A (en) * 2003-10-22 2005-04-27 国际商业机器公司 Method and apparatus for improving readability of automatic generated abstract by machine
CN101008941A (en) * 2007-01-10 2007-08-01 复旦大学 Successive principal axes filter method of multi-document automatic summarization

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9916375B2 (en) 2014-08-15 2018-03-13 International Business Machines Corporation Extraction of concept-based summaries from documents

Also Published As

Publication number Publication date
CN101446940A (en) 2009-06-03

Similar Documents

Publication Publication Date Title
CN101446940B (en) Method and device of automatically generating a summary for document set
CN106055538B (en) The automatic abstracting method of the text label that topic model and semantic analysis combine
CN100405371C (en) Method and system for abstracting new word
CN102033955B (en) Method for expanding user search results and server
Cano Basave et al. Automatic labelling of topic models learned from twitter by summarisation
CN101377777A (en) Automatic inquiring and answering method and system
CN103365924A (en) Method, device and terminal for searching information
CN102999625A (en) Method for realizing semantic extension on retrieval request
CN102737021B (en) Search engine and realization method thereof
CN100511214C (en) Method and system for abstracting batch single document for document set
CN101706812B (en) Method and device for searching documents
CN102722498A (en) Search engine and implementation method thereof
CN101315623A (en) Text subject recommending method and device
CN100435145C (en) Multiple file summarization method based on sentence relation graph
CN102722501A (en) Search engine and realization method thereof
CN103678412A (en) Document retrieval method and device
CN102722499A (en) Search engine and implementation method thereof
CN109948154A (en) A kind of personage's acquisition and relationship recommender system and method based on name
Liu et al. The research of Web mining
CN101599075B (en) Chinese abbreviation processing method and device therefor
CN111651675B (en) UCL-based user interest topic mining method and device
Chakraborty Domain keyword extraction technique: Anew weighting method
Viveros-Jiménez et al. Improving the boilerpipe algorithm for boilerplate removal in news articles using html tree structure
CN103984731A (en) Self-adaption topic tracing method and device under microblog environment
Lin et al. Combining a segmentation-like approach and a density-based approach in content extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110928