CN101446940B - Method and device of automatically generating a summary for document set - Google Patents

Method and device of automatically generating a summary for document set

Info

Publication number
CN101446940B
CN101446940B, CN2007101874807A, CN200710187480A
Authority
CN
China
Prior art keywords
document
new
sentence
document sets
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2007101874807A
Other languages
Chinese (zh)
Other versions
CN101446940A (en)
Inventor
万小军
余军
杨建武
吴於茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder E-Government Technology Co Ltd
Peking University
Peking University Founder Group Co Ltd
Original Assignee
Peking University Founder E-Government Technology Co Ltd
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder E-Government Technology Co Ltd, Peking University, Peking University Founder Group Co Ltd filed Critical Peking University Founder E-Government Technology Co Ltd
Priority to CN2007101874807A priority Critical patent/CN101446940B/en
Publication of CN101446940A publication Critical patent/CN101446940A/en
Application granted granted Critical
Publication of CN101446940B publication Critical patent/CN101446940B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Abstract

The invention discloses a method and a device for automatically generating a summary for a document set, relates to the field of language and text processing, and aims to solve the prior-art problem that summary generation for a document set is slow and inefficient because the weight of every sentence in every document of the set must be recomputed whenever a summary is generated. The method comprises the following steps: computing the weight of each sentence in a new document; updating the weights of the sentences in the existing summary of the document set; obtaining a weight ranking of all non-repeating sentences of the new document and the existing summary of the document set; and generating a new summary of the document set. The method and the device are applicable to automatic summarization of multiple documents.

Description

Method and device for automatically generating a summary for a document set
Technical field
The present invention relates to the field of language and text processing and to information retrieval, and in particular to a method and apparatus for automatically generating a summary for a document set.
Background technology
Automatically generating a summary for a document set means that a computer system automatically extracts the essence or main points of the documents in the set; its purpose is to compress and distill each document in the set so as to provide the user with a brief and concise description of the set's content. With the continued spread of computer technology and Internet technology, automatic summarization of document sets has been widely applied to text and Web content retrieval. For example, the news services provided by search engines such as Google and Baidu collect news items from the network, group them by topic and type into news topics (news document sets), and use automatic document-set summarization to generate a summary for each set, so that users can browse the news topics they are interested in more conveniently.
In general, methods for automatically generating a summary for a document set fall into two categories: sentence extraction (Extraction) and sentence generation (Abstraction). Extraction-based methods split each document of the set into sentences, assign each sentence a weight according to its importance within the set, and select the sentences with the largest weights to form the summary of the set; they achieve automatic document-set summarization without deep natural-language understanding, and are therefore simple to implement and easy to use. Generation-based methods require deep natural-language understanding: they perform syntactic and semantic analysis of each sentence in the set and use information extraction or natural-language generation to produce new sentences that form the summary; they are more complex to implement and less convenient to use.
Because extraction-based methods are simple to implement and easy to use, most current methods for automatically generating a summary for a document set are extraction-based. For example, the article "Centroid-based summarization of multiple documents" (D.R. Radev, H.Y. Jing, M. Stys and D. Tam, Information Processing and Management, 2004) discloses a centroid-based sentence extraction method: when assigning a weight to each sentence of the document set, it combines sentence-level and cross-sentence features, including the cluster centroid, the sentence position and TF/IDF (term frequency / inverse document frequency), and extracts the sentences with the larger weights as the summary of the set. The article "From Single to Multi-document Summarization: A Prototype System and its Evaluation" (C.-Y. Lin and E.H. Hovy, Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL-02), 2002) discloses a sentence extraction system named NeATS, which assigns weights to the sentences of the document set using features such as sentence position, term frequency, topic signatures and term clusters, and removes redundant sentences with the MMR (Maximal Marginal Relevance) technique to form the summary. The article "Cross-document summarization by concept classification" (H. Hardy, N. Shimizu, T. Strzalkowski, L. Ting, G.B. Wise, and X. Zhang, Proceedings of SIGIR'02) discloses a sentence extraction system named XdoX that is suited to summarizing large document sets: it first detects the most important themes in the set by paragraph clustering and then extracts the sentences reflecting those themes to form the summary. The article "Topic themes for multi-document summarization" (S. Harabagiu and F. Lacatusu, Proceedings of SIGIR'05, 2005) investigates five different ways of representing the topics of multiple documents and proposes a new topic representation.
When the summary is generated by sentence extraction, graph-based methods have also been used to rank the importance of sentences. For example, the article "Summarizing Similarities and Differences Among Related Documents" (I. Mani and E. Bloedorn, Information Retrieval, 2000) discloses a method named WebSumm, which uses a graph link model and ranks sentences under the assumption that the more vertices a vertex is connected to, the more important it is, and thus generates the summary of the document set. The article "LexPageRank: prestige in multi-document text summarization" (G. Erkan and D. Radev, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'04), 2004) discloses a method named LexPageRank, which first builds a sentence connectivity matrix, then computes sentence importance with a PageRank-like algorithm, and generates the summary of the document set according to the importance ranking of the sentences. The article "A language independent algorithm for single and multiple document summarization" (R. Mihalcea and P. Tarau, Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP'05), 2005) discloses the method of Mihalcea and Tarau, which likewise proposes PageRank- and HITS-like algorithms for computing sentence importance.
In summary, when the methods and systems of the examples above generate a summary for a document set by sentence extraction, they all first compute the weight of every sentence in the document set and then select the sentences with the larger weights as the summary; they differ only in how the weights are assigned to the sentences.
In practical Internet applications, content is updated very quickly, so the document set representing a given topic or type is also updated constantly: new related documents keep joining the current document set. For a hot news topic in particular, a large number of related documents appear on the Internet, and the summary of the corresponding document set must be updated very frequently. If an existing multi-document summarization method is applied to such a frequently updated document set, the weights of all sentences in the whole set must be recomputed every time one new document is added. The amount of computation is enormous, a new summary cannot be generated quickly for the set, and summary generation becomes inefficient, which cannot satisfy the demands of large-scale Internet applications (for example news topic detection and hot-topic analysis).
Summary of the invention
In one aspect, the present invention provides a method for automatically generating a summary for a document set. The method can generate a summary for the document set simply and quickly, and improves the efficiency of summary generation.
The technical solution adopted by the present invention is a method for automatically generating a summary for a document set which, after a new document is added to the document set, comprises the following steps:
calculating the weight of each sentence in the new document;
updating the weights of the sentences in the existing summary of the document set;
obtaining a weight ranking of all non-repeating sentences of the new document and the existing summary of the document set;
generating a new summary of the document set.
In the method for automatically generating a summary for a document set provided by the present invention, the weight of each sentence in the new document is calculated, the weights of the sentences in the existing summary of the document set are updated, and the sentences of the new document and of the existing summary are sorted and filtered to form the new summary of the document set. Compared with the prior art, the method only needs to calculate the weights of the sentences of the new document and of the existing summary of the document set; it does not need to recompute the weights of all sentences of every document in the set in order to obtain the new summary. The method therefore generates the summary for the document set simply and quickly, greatly improves the efficiency of summary generation, and can adapt to ever faster information updates.
In another aspect, the present invention also provides a device for automatically generating a summary for a document set. The device can generate a summary for the document set simply and quickly, and improves the efficiency of summary generation.
The technical solution adopted by the present invention is a device for automatically generating a summary for a document set after a new document is added to the set, characterized in that it comprises:
a weight calculation unit, which calculates the weight of each sentence in the new document and updates the weights of the sentences in the existing summary of the document set;
a selection and sorting unit, which obtains from the weight calculation unit the weights of all non-repeating sentences of the new document and the existing summary of the document set, and sorts them;
a summary generation unit, which takes the sentences with the larger weights from the selection and sorting unit and generates the new summary of the document set.
In the device for automatically generating a summary for a document set provided by the present invention, the weight of each sentence in the new document is calculated, the weights of the sentences in the existing summary of the document set are updated, and the sentences of the new document and of the existing summary are sorted and filtered to form the new summary of the document set. Compared with the prior art, the device only needs to calculate the weights of the sentences of the new document and of the existing summary of the document set; it does not need to recompute the weights of all sentences of every document in the set in order to obtain the new summary. The device therefore generates the summary for the document set simply and quickly, greatly improves the efficiency of summary generation, and can adapt to ever faster information updates.
Description of drawings
Fig. 1 is a flow chart of the method for automatically generating a summary for a document set provided by the present invention;
Fig. 2 is a schematic structural diagram of the device for automatically generating a summary for a document set provided by the present invention;
Fig. 3 is a schematic structural diagram of the vector calculation unit of the device shown in Fig. 2;
Fig. 4 is a schematic structural diagram of the document-set feature updating unit of the device shown in Fig. 2;
Fig. 5 is a schematic structural diagram of the weight calculation unit of the device shown in Fig. 2;
Fig. 6 is a schematic structural diagram of the selection and sorting unit of the device shown in Fig. 2.
Embodiment
To solve the prior-art problem that, when a summary is generated for a document set, the weight of every sentence in every document of the set must be recomputed, making summary generation slow and inefficient, the present invention provides a method for automatically generating a summary for a document set. The present invention is described in detail below with reference to the drawings and embodiments.
As shown in Fig. 1, the method for automatically generating a summary for a document set provided by the present invention comprises the following steps after a new document is added to the document set:
Step 101: compute the vector of the new document and the vector of each sentence in the new document.
The concrete steps are:
Split the new document d_new into sentences, obtaining the sentence set S_new = {s_i | 1 ≤ i ≤ n}, where the positive integer n is the number of sentences contained in d_new.
To compute the vector v(s_i) of each sentence s_i in S_new, segment s_i into words, obtaining the word set w_i = {w_ij | 1 ≤ j ≤ m} of s_i, where the positive integer m is the number of words contained in s_i. Each dimension of v(s_i) corresponds to one word of the new document, and the weight of each dimension of v(s_i) is computed as

tf_wij × idf_wij    (1-1)

where tf_wij is the frequency of occurrence of the word w_ij in the document set and idf_wij is the inverse document frequency of w_ij in the document set, which can be expressed as

idf_wij = 1 + log(N / n_wij)    (1-2)

where N is the number of documents in the document set and n_wij is the number of documents that contain the word w_ij.
Computing the weight of each dimension with formula (1-1) yields the vector v(s_i).
To compute the vector v(d_new) of the new document d_new, segment d_new into words, obtaining the word set W_new = {w_k | 1 ≤ k ≤ z}, where the positive integer z is the number of words contained in d_new.
The segmentation of the new document d_new can be done in two ways: either segment d_new directly, or take the union of the per-sentence word sets, W_new = ∪_i w_i, where 1 ≤ i ≤ n, the positive integer n is the number of sentences contained in d_new, and w_i is the word set of sentence s_i after segmentation.
Since each dimension of v(d_new) also corresponds to one word of the new document, the weight of each dimension of v(d_new) is computed as

tf_wk × idf_wk    (1-3)

where tf_wk is the frequency of occurrence of the word w_k in the document set and idf_wk is the inverse document frequency of w_k in the document set, which can be expressed as

idf_wk = 1 + log(N / n_wk)    (1-4)

where N is the number of documents in the document set and n_wk is the number of documents that contain the word w_k.
Computing the weight of each dimension with formula (1-3) yields the vector v(d_new).
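As an illustration of step 101 (not part of the patented embodiment), the sketch below builds such tf × idf vectors for the sentences and for the whole document. It is a minimal Python sketch under stated assumptions: whitespace tokenization stands in for a real word segmenter, and the helper names and toy corpus statistics are invented for the example.

```python
import math
from collections import Counter

def idf(word, doc_count, docs_containing):
    # Formulas (1-2)/(1-4): idf_w = 1 + log(N / n_w)
    return 1.0 + math.log(doc_count / docs_containing[word])

def tfidf_vector(words, term_freq, doc_count, docs_containing):
    """Build a sparse vector {word: tf * idf} as in formulas (1-1)/(1-3).

    term_freq[word]  -- frequency of the word in the document set
    doc_count        -- N, the number of documents in the set
    docs_containing  -- n_w, number of documents of the set containing the word
    """
    return {w: term_freq[w] * idf(w, doc_count, docs_containing)
            for w in set(words) if docs_containing.get(w, 0) > 0}

# Toy usage: one "new document" of three sentences, whitespace-tokenized.
sentences = ["the cat sat", "the dog barked", "the cat barked"]
doc_words = [w for s in sentences for w in s.split()]

# Corpus statistics over the whole document set; in this toy the set is just
# this one document, so N = 1 and every word occurs in one document.
term_freq = Counter(doc_words)
docs_containing = {w: 1 for w in term_freq}
N = 1

sentence_vectors = [tfidf_vector(s.split(), term_freq, N, docs_containing)
                    for s in sentences]
document_vector = tfidf_vector(doc_words, term_freq, N, docs_containing)
```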
Step 102: update the center vector and the document vector list of the document set.
This step specifically comprises:
updating the document set D to D' = D ∪ {d_new};
updating the document vector list L_D to L'_D = L_D ∪ {v(d_new)};
updating the center vector c of the document set D to c' with the formula

c' = (1 / |D'|) Σ_{d_i ∈ D'} v(d_i)    (1-5)

where |D'| is the number of documents in the document set D'.
The update of the center vector c of the document set D to c' can also be expressed as

c' = (|D| × c + v(d_new)) / (|D| + 1)    (1-6)

where |D| is the number of documents in the document set D.
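The point of formula (1-6) is that the centroid can be refreshed from the old centroid and the new document vector alone, without touching the other document vectors. A minimal Python sketch, assuming dense vectors stored as plain lists (the function names are illustrative, not from the patent):

```python
def centroid_full(doc_vectors):
    # Formula (1-5): c' = (1/|D'|) * sum of all document vectors of D'.
    dim = len(doc_vectors[0])
    return [sum(v[i] for v in doc_vectors) / len(doc_vectors) for i in range(dim)]

def centroid_incremental(old_centroid, n_docs, new_doc_vector):
    # Formula (1-6): c' = (|D| * c + d_new) / (|D| + 1); only the old
    # centroid and the new document vector are needed.
    return [(n_docs * c + d) / (n_docs + 1)
            for c, d in zip(old_centroid, new_doc_vector)]

# Toy check that both formulas agree.
D = [[1.0, 0.0], [0.0, 1.0]]          # existing document vectors
d_new = [1.0, 1.0]                    # vector of the new document
c_old = centroid_full(D)              # [0.5, 0.5]
assert centroid_incremental(c_old, len(D), d_new) == centroid_full(D + [d_new])
```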
Step 103: compute the weight of each sentence in the new document.
The concrete method is:
compute the content weight w_content(s_i) of sentence s_i:

w_content(s_i) = cos(v(s_i), c') = (v(s_i) · c') / (||v(s_i)|| × ||c'||)    (1-7)

where v(s_i) is the vector of sentence s_i and c' is the center vector of the document set D after the update.
As can be seen from formula (1-7), the content weight w_content(s_i) is determined by the cosine between the vector of s_i and the center vector of the document set: the more similar s_i is to the center vector, the larger its content weight. Since the center vector of the document set reflects the theme of the set, the content weight w_content(s_i) can be used directly as the weight of s_i.
To make the weight of sentence s_i reflect its importance in the document set more accurately (its relevance to the theme, its position in the document set, and so on), the computation of the sentence weights of the new document further comprises:
recording the position information of each sentence in the new document, for example the storage position of each sentence or its relation to the preceding and following sentences;
computing the position weight w_position(s_i) of sentence s_i:

w_position(s_i) = ((n - i + 1) / n) × max_{s ∈ d_new} w_content(s)    (1-8)

where n is the total number of sentences of the new document d_new, i (1 ≤ i ≤ n) is the sentence index, and max_{s ∈ d_new} w_content(s) is the maximum content weight over all sentences of d_new;
computing the combined weight w(s_i) of sentence s_i:

w(s_i) = α · w_content(s_i) + β · w_position(s_i)    (1-9)

where α and β are parameters with 0 ≤ α, β ≤ 1 and α + β = 1.
Computing the combined weight w(s_i) assigns a weight to each sentence of the new document more effectively.
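A minimal Python sketch of formulas (1-7) to (1-9), assuming sparse {term: weight} dictionaries for the sentence vectors and the centroid; the default α = β = 0.5 is an arbitrary choice for the example, not a value prescribed by the embodiment:

```python
import math

def cosine(u, v):
    # Cosine of two sparse vectors represented as {term: weight} dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_weights(sentence_vectors, centroid, alpha=0.5, beta=0.5):
    """Return the combined weight w(s_i) of every sentence of the new document.

    sentence_vectors is ordered by position in the document (index 0 first).
    """
    n = len(sentence_vectors)
    # Formula (1-7): content weight = similarity to the set's center vector.
    content = [cosine(v, centroid) for v in sentence_vectors]
    max_content = max(content) if content else 0.0
    weights = []
    for i, w_content in enumerate(content, start=1):
        # Formula (1-8): earlier sentences get a larger position weight.
        w_position = (n - i + 1) / n * max_content
        # Formula (1-9): combined weight, with alpha + beta = 1.
        weights.append(alpha * w_content + beta * w_position)
    return weights

# Toy usage: two sentences of a new document against a two-term centroid.
centroid = {"flood": 1.0, "rain": 1.0}
sents = [{"flood": 1.0, "rain": 1.0}, {"game": 1.0}]
print(sentence_weights(sents, centroid))  # the first sentence scores highest
```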
Step 104: update the weights of the sentences in the existing summary of the document set, that is, recompute them. The concrete method is the same as the calculation of the weight of each sentence in the new document in step 103: the content weights of the sentences in the existing summary are calculated with formula (1-7); the position weights of the sentences in the existing summary do not need to be recomputed, since the position weight values saved during the previous round of summary generation can be used directly; the combined weights of the sentences in the existing summary are then calculated with formula (1-9). This yields the weights of the sentences in the existing summary of the document set.
Step 105: sort the sentences of the new document and of the existing summary of the document set by weight. For example, if the new document d_new contains n sentences and the existing summary of the document set contains k sentences, the k + n sentences are arranged in descending order of the weights assigned to them (these may be the content weights or the combined weights, but the same type of weight must be used for all sentences).
Step 106: delete the repeated sentences after sorting.
The concrete deletion method is:
starting from the second sentence of the sequence formed by the k + n sentences above, judge the degree of repetition between this sentence s_i and every sentence s_j ranked before it (j < i), using the formula

sim(s_i, s_j) = cos(v(s_i), v(s_j)) = (v(s_i) · v(s_j)) / (||v(s_i)|| × ||v(s_j)||)    (1-10)

When the degree of repetition between s_i and s_j calculated with formula (1-10) is greater than a threshold ε (0 ≤ ε ≤ 1; ε = 0.85 in this embodiment), s_i and s_j are judged to be repeated sentences, and either of them can be deleted.
So that the summary of the document set shows the most recently updated content, the reception time of each new document can be saved when it is received, and when repeated sentences are found, the sentence with the earlier reception time can be deleted. For example, if step 106 judges s_i and s_j to be repeated sentences, the document containing s_i was received on July 15, 2007 and the document containing s_j was received on May 28, 2006, then s_j is deleted.
Step 107: according to the weight ranking, select the sentences with the larger weights and generate the new summary of the document set.
After step 106 has deleted the repeated sentences, a sequence of p sentences remains (k < p < k + n), arranged in descending order of weight. To obtain a document-set summary consisting of k sentences, the k sentences with the larger weights are selected from the p sentences as the new summary of the document set.
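Steps 105 to 107 together can be sketched as follows (Python). The data layout — one dict per candidate sentence with its sparse vector, weight and the reception time of its source document — is an assumption for the example; the repetition threshold ε = 0.85 is the value named in the embodiment.

```python
import math

def cosine(u, v):
    # Cosine similarity of two sparse {term: weight} vectors, as in (1-10).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_summary(candidates, k, eps=0.85):
    """Sort the candidate sentences, drop repetitions, keep the top k."""
    # Step 105: arrange the k + n candidates in descending order of weight.
    ranked = sorted(candidates, key=lambda s: s["weight"], reverse=True)
    kept = []
    for cand in ranked:
        # Step 106: compare with every sentence already kept ahead of it.
        dup = next((s for s in kept
                    if cosine(s["vector"], cand["vector"]) > eps), None)
        if dup is None:
            kept.append(cand)
        elif cand["received"] > dup["received"]:
            # Of a repeated pair, delete the sentence whose document arrived
            # earlier, so the summary shows the most recently updated content.
            kept = [s for s in kept if s is not dup]
            kept.append(cand)
        # otherwise cand is the older duplicate and is simply discarded
    # Step 107: the k highest-weighted surviving sentences form the new summary.
    kept.sort(key=lambda s: s["weight"], reverse=True)
    return kept[:k]
```

On each update the routine only touches the k sentences of the old summary plus the n sentences of the new document, which is what keeps the incremental method fast.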
To make the method provided by the present invention generate the summary for the document set even faster, the following step is also performed after step 101:
Step 108: judge whether the new document is a repetition, and keep only non-repeated documents.
The concrete judging steps are as follows:
when the new document d_new is the first document of the document set D, the new document is a non-repeated document and the method continues with step 102;
when the new document d_new is not the first document of the document set D = {d_i | 1 ≤ i ≤ m} (where the positive integer m is the number of documents currently in the set), compare the similarity between the new document and every document d_i in D with the formula

sim(d_new, d_i) = cos(v(d_new), v(d_i)) = (v(d_new) · v(d_i)) / (||v(d_new)|| × ||v(d_i)||)    (1-11)

where v(d_i) is the vector of document d_i, taken directly from the document vector list L_D = {v(d_i) | 1 ≤ i ≤ m} of the document set D, so it does not need to be recomputed.
When the similarity between d_new and some d_i is greater than a threshold θ (0 ≤ θ ≤ 1), the new document d_new is a repeated document; step 102 is not performed and the method waits for the next new document. When the similarity between d_new and every d_i is less than or equal to θ, the new document is a non-repeated document and the method proceeds with step 102.
Judging the similarity between the new document and the documents already in the set in step 108 allows repeated documents to be discarded without further processing; that is, no new summary is generated for a newly added repeated document, which makes automatic summary generation for the document set faster and more efficient.
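Step 108 can be sketched as follows (Python; the document vector list is kept as sparse dicts, mirroring formula (1-11), and the threshold value passed in the toy call is an arbitrary assumption):

```python
import math

def cosine(u, v):
    # Same sparse-vector cosine as in the earlier sketches.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def is_duplicate_document(new_doc_vector, doc_vector_list, theta):
    """Step 108: the first document of a set is never a duplicate; otherwise
    the new document is a duplicate if formula (1-11) exceeds theta for any
    document already in the set (vectors are read from the list L_D, so
    nothing has to be recomputed)."""
    if not doc_vector_list:
        return False
    return any(cosine(new_doc_vector, d) > theta for d in doc_vector_list)

# Toy usage: a near-identical document is rejected before step 102 runs.
L_D = [{"rain": 2.0, "flood": 1.0}]
d_new = {"rain": 2.0, "flood": 1.0, "storm": 0.1}
if is_duplicate_document(d_new, L_D, theta=0.9):
    pass  # wait for the next document; do not update the set or the summary
else:
    L_D.append(d_new)  # proceed with step 102
```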
In the method for automatically generating a summary for a document set provided by the present invention, the weight of each sentence in the new document is calculated, the weights of the sentences in the existing summary of the document set are updated, and the sentences of the new document and of the existing summary are sorted and filtered to form the new summary of the document set. Compared with the prior art, the method only needs to calculate the weights of the sentences of the new document and of the existing summary of the document set; it does not need to recompute the weights of all sentences of every document in the set in order to obtain the new summary. The method therefore generates the summary for the document set simply and quickly, greatly improves the efficiency of summary generation, and can adapt to ever faster information updates.
Corresponding to the method above, the present invention also provides a device for automatically generating a summary for a document set after a new document is added to the set. As shown in Fig. 2, the device for automatically generating a summary for a document set comprises:
a vector calculation unit, which calculates the vector of the new document and the vector of each sentence in the new document;
As shown in Fig. 3, the vector calculation unit comprises:
a sentence splitting unit, which splits the new document d_new into sentences, obtaining the sentence set S_new = {s_i | 1 ≤ i ≤ n}, where the positive integer n is the number of sentences contained in d_new;
a word segmentation unit which, when the vector v(s_i) of each sentence s_i in S_new is calculated, segments s_i into words, obtaining the word set w_i = {w_ij | 1 ≤ j ≤ m} of s_i, where the positive integer m is the number of words contained in s_i.
Each dimension of v(s_i) corresponds to one word of the new document; the weight of each dimension of v(s_i) is computed with formula (1-1), which is not repeated here, and computing the weight of every dimension with formula (1-1) yields the vector v(s_i).
When the vector v(d_new) of the new document d_new is calculated, the word segmentation unit segments d_new into words, obtaining the word set W_new = {w_k | 1 ≤ k ≤ z}, where the positive integer z is the number of words contained in d_new.
The segmentation of the new document d_new can be done in two ways: either the word segmentation unit directly segments the received new document d_new, or it obtains from the sentence splitting unit the word sets of the individual sentences of the new document and takes their union W_new = ∪_i w_i (where 1 ≤ i ≤ n, the positive integer n is the number of sentences contained in d_new, and w_i is the word set of sentence s_i after segmentation) to obtain the word set W_new = {w_k | 1 ≤ k ≤ z} of d_new.
Since each dimension of v(d_new) also corresponds to one word of the new document, the weight of each dimension of v(d_new) is computed with formula (1-3), which is not repeated here, and computing the weight of every dimension with formula (1-3) yields the vector v(d_new).
a document-set feature updating unit, which updates the center vector and the document vector list of the document set according to the result obtained by the vector calculation unit;
As shown in Fig. 4, the document-set feature updating unit comprises:
a document set updating unit, which updates the document set D to D' = D ∪ {d_new};
a document vector list updating unit, which updates the document vector list L_D to L'_D = L_D ∪ {v(d_new)} according to the vector of the new document obtained by the vector calculation unit;
a document set center vector updating unit, which updates the center vector c of the document set D to c' according to the vector of the new document obtained by the vector calculation unit; the concrete formula is (1-5) or (1-6) and is not repeated here.
a weight calculation unit, which calculates the weight of each sentence in the new document and updates the weights of the sentences in the existing summary of the document set according to the results obtained by the vector calculation unit and the document-set feature updating unit;
As shown in Fig. 5, the weight calculation unit comprises:
a content weight calculation unit, which calculates the content weight w_content(s_i) of sentence s_i according to the results obtained by the vector calculation unit and the document-set feature updating unit, see formula (1-7).
To make the weight of sentence s_i reflect its importance in the document set more accurately (its relevance to the theme, its position in the document set, and so on), the weight calculation unit, as shown in Fig. 5, further comprises:
a position information recording unit, which records the position information of each sentence in the new document;
a position weight calculation unit, which calculates the position weight w_position(s_i) of sentence s_i according to the sentence position information recorded by the position information recording unit and the content weight obtained by the content weight calculation unit, see formula (1-8);
a combined weight calculation unit, which calculates the combined weight w(s_i) of sentence s_i from the content weight and the position weight, see formula (1-9).
a selection and sorting unit, which obtains from the weight calculation unit the weights of all non-repeating sentences of the new document and the existing summary of the document set, and sorts them;
As shown in Fig. 6, the selection and sorting unit comprises:
a sorting unit, which sorts the sentences of the new document and of the existing summary of the document set by weight according to the weights calculated by the weight calculation unit. For example, if the new document d_new contains n sentences and the existing summary of the document set contains k sentences, the k + n sentences are arranged in descending order of the weights assigned to them (these may be the content weights or the combined weights, but the same type of weight must be used for all sentences);
a filtering unit, which deletes the repeated sentences after sorting.
The concrete deletion method is:
starting from the second sentence of the sequence formed by the k + n sentences above, judge the degree of repetition between this sentence s_i and every sentence s_j ranked before it (j < i) with formula (1-10).
When the degree of repetition between s_i and s_j calculated with formula (1-10) is greater than a threshold ε (0 ≤ ε ≤ 1; ε = 0.85 in this embodiment), s_i and s_j are judged to be repeated sentences, and either of them can be deleted.
So that the summary of the document set shows the most recently updated content, the filtering unit also comprises a time recording unit, which records the reception time of each new document; when the filtering unit finds repeated sentences, the sentence with the earlier reception time can be deleted. For example, if s_i and s_j are judged to be repeated sentences, the reception time of s_i is July 15, 2007 and the reception time of s_j is May 28, 2006, then s_j is deleted.
a summary generation unit, which takes the sentences with the larger weights from the selection and sorting unit, according to the preset number of summary sentences of the document set, and generates the new summary of the document set.
After the filtering unit has deleted the repeated sentences, a sequence of p sentences remains (k < p < k + n), arranged in descending order of weight. To obtain a document-set summary consisting of k sentences, the k sentences with the larger weights are selected from the p sentences as the new summary of the document set.
To make the device provided by the present invention generate the summary for the document set even faster, the device for automatically generating a summary for a document set further comprises a judging unit, which judges, according to the result obtained by the vector calculation unit, whether the new document is a repetition and keeps only non-repeated documents; only when the new document is a non-repeated document does the judging unit pass the result obtained by the vector calculation unit to the document-set feature updating unit.
The concrete judging steps are as follows:
when the new document d_new is the first document of the document set D, the new document is a non-repeated document;
when the new document d_new is not the first document of the document set D = {d_i | 1 ≤ i ≤ m} (where the positive integer m is the number of documents currently in the set), compare the similarity between the new document and every document d_i in D; the concrete comparison formula is formula (1-11);
when the similarity between d_new and some d_i is greater than a threshold θ (0 ≤ θ ≤ 1), the new document d_new is a repeated document; when the similarity between d_new and every d_i is less than or equal to θ, the new document is a non-repeated document.
Judging the similarity between the new document and the documents in the set with the judging unit allows repeated documents to be discarded without further processing; that is, no summary is generated for a newly added repeated document, which makes automatic summary generation for the document set faster and more efficient.
In the device for automatically generating a summary for a document set provided by the present invention, the weight of each sentence in the new document is calculated, the weights of the sentences in the existing summary of the document set are updated, and the sentences of the new document and of the existing summary are sorted and filtered to form the new summary of the document set. Compared with the prior art, the device only needs to calculate the weights of the sentences of the new document and of the existing summary of the document set; it does not need to recompute the weights of all sentences of every document in the set in order to obtain the new summary. The device therefore generates the summary for the document set simply and quickly, greatly improves the efficiency of summary generation, and can adapt to ever faster information updates. Application in an actual Internet public-opinion analysis system shows that, while maintaining summary quality, the device of the present invention greatly improves summarization efficiency, being more than 50 times faster than the methods provided by the prior art.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope defined by the claims.

Claims (19)

1. A method for automatically generating a summary for a document set, used for automatically generating a summary for the document set, characterized in that, after a new document is added to the document set, it comprises the steps of:
calculating the weight of each sentence in the new document;
updating the weights of the sentences in the existing summary of the document set;
obtaining a weight ranking of all non-repeating sentences of the new document and the existing summary of the document set;
selecting, according to the weight ranking, the sentences with the larger weights and generating a new summary of the document set.
2. The method for automatically generating a summary for a document set according to claim 1, characterized in that, before the step of calculating the weight of each sentence in the new document, it further comprises the steps of:
calculating the vector of the new document and the vector of each sentence in the new document;
updating the center vector and the document vector list of the document set.
3. The method for automatically generating a summary for a document set according to claim 2, characterized in that the step of calculating the vector of the new document and the vector of each sentence in the new document comprises:
splitting the new document d_new into sentences, obtaining a sentence set S_new = {s_i | 1 ≤ i ≤ n}, where the positive integer n is the number of sentences contained in d_new;
when calculating the vector v(s_i) of each sentence s_i in S_new, segmenting s_i into words, obtaining a word set w_i = {w_ij | 1 ≤ j ≤ m}, where the positive integer m is the number of words contained in s_i, the weight of each dimension of v(s_i) being computed as
tf_wij × idf_wij
where tf_wij is the frequency of occurrence of the word w_ij in the document set and idf_wij is the inverse document frequency of w_ij in the document set;
when calculating the vector v(d_new) of the new document d_new, segmenting d_new into words, obtaining a word set W_new = {w_k | 1 ≤ k ≤ z}, where the positive integer z is the number of words contained in d_new, the weight of each dimension of v(d_new) being computed as
tf_wk × idf_wk
where tf_wk is the frequency of occurrence of the word w_k in the document set and idf_wk is the inverse document frequency of w_k in the document set.
4. The method for automatically generating a summary for a document set according to claim 2, characterized in that, after calculating the vector of the new document and the vector of each sentence in the new document, it further comprises the step of:
judging whether the new document is a repetition, and keeping only non-repeated documents;
the concrete judging steps being as follows:
when the new document is the first document of the document set, the new document is a non-repeated document;
otherwise, calculating the similarity between the new document and every document in the document set; when the similarity between the two documents is greater than a threshold θ, the new document is a repeated document, where 0 ≤ θ ≤ 1; when the similarity between the two documents is less than or equal to the threshold θ, the new document is a non-repeated document.
5. The method for automatically generating a summary for a document set according to claim 4, characterized in that the similarity between the new document d_new and each document d_i of the document set is calculated with the following cosine formula:
sim(d_new, d_i) = cos(v(d_new), v(d_i)) = (v(d_new) · v(d_i)) / (||v(d_new)|| × ||v(d_i)||)
where v(d_i) is the vector of document d_i.
6. The method for automatically generating a summary for a document set according to claim 2, characterized in that the step of updating the center vector and the document vector list of the document set specifically comprises the steps of:
updating the document set D to D' = D ∪ {d_new};
updating the document vector list L_D to L'_D = L_D ∪ {v(d_new)};
updating the center vector c of the document set D to c' with the formula
c' = (1 / |D'|) Σ_{d_i ∈ D'} v(d_i)
where |D'| is the number of documents in the document set D'.
7. The method for automatically generating a summary for a document set according to claim 1, characterized in that the weight of each sentence in the new document is calculated as follows:
calculating the content weight w_content(s_i) of sentence s_i:
w_content(s_i) = cos(v(s_i), c') = (v(s_i) · c') / (||v(s_i)|| × ||c'||)
where v(s_i) is the vector of sentence s_i and c' is the center vector of the document set D after the update.
8. The method for automatically generating a summary for a document set according to claim 7, characterized in that calculating the weight of each sentence in the new document further comprises:
recording the position information of each sentence in the new document;
calculating the position weight w_position(s_i) of sentence s_i:
w_position(s_i) = ((n - i + 1) / n) × max_{s ∈ d_new} w_content(s)
where n is the total number of sentences of the new document d_new, i is the sentence index with 1 ≤ i ≤ n, and max_{s ∈ d_new} w_content(s) is the maximum content weight over all sentences of d_new;
calculating the combined weight w(s_i) of sentence s_i:
w(s_i) = α · w_content(s_i) + β · w_position(s_i)
where α and β are parameters with 0 ≤ α, β ≤ 1 and α + β = 1.
9. The method for automatically generating a summary for a document set according to claim 1, 7 or 8, characterized in that the weights of the sentences in the existing summary of the document set are updated with the same method as that used to calculate the weight of each sentence in the new document, wherein the position weight of a sentence in the existing summary of the document set is the position weight saved during the previous round of summary generation.
10. The method for automatically generating a summary for a document set according to claim 1, characterized in that obtaining the weight ranking of all non-repeating sentences of the new document and the existing summary of the document set comprises:
sorting the sentences of the new document and of the existing summary of the document set by weight;
deleting the repeated sentences after sorting.
11. The method for automatically generating a summary for a document set according to claim 10, characterized in that the step of deleting the repeated sentences after sorting specifically comprises the steps of:
recording the reception time of the new document;
starting from the second sentence of the sequence sorted by weight, judging the degree of repetition between this sentence s_i and every sentence s_j ranked before it, where j < i;
when the degree of repetition is greater than a threshold ε, deleting the sentence with the earlier reception time, where 0 ≤ ε ≤ 1.
12. A device for automatically generating a summary for a document set, used for automatically generating a summary for the document set after a new document is added to the document set, characterized in that it comprises:
a weight calculation unit, which calculates the weight of each sentence in the new document and updates the weights of the sentences in the existing summary of the document set;
a selection and sorting unit, which obtains from the weight calculation unit the weights of all non-repeating sentences of the new document and the existing summary of the document set, and sorts them;
a summary generation unit, which takes the sentences with the larger weights from the selection and sorting unit and generates the new summary of the document set.
13. The device for automatically generating a summary for a document set according to claim 12, characterized in that it further comprises:
a vector calculation unit, which calculates the vector of the new document and the vector of each sentence in the new document;
a document-set feature updating unit, which updates the center vector and the document vector list of the document set according to the result obtained by the vector calculation unit.
14. The device for automatically generating a summary for a document set according to claim 13, characterized in that the vector calculation unit comprises:
a sentence splitting unit, which splits the new document d_new into sentences, obtaining a sentence set S_new = {s_i | 1 ≤ i ≤ n}, where the positive integer n is the number of sentences contained in d_new;
a word segmentation unit which, when the vector v(s_i) of each sentence s_i in S_new is calculated, segments s_i into words, obtaining a word set w_i = {w_ij | 1 ≤ j ≤ m}, where the positive integer m is the number of words contained in s_i, the weight of each dimension of v(s_i) being computed as
tf_wij × idf_wij
where tf_wij is the frequency of occurrence of the word w_ij in the document set and idf_wij is the inverse document frequency of w_ij in the document set;
and which, when the vector v(d_new) of the new document d_new is calculated, segments d_new into words, obtaining a word set W_new = {w_k | 1 ≤ k ≤ z}, where the positive integer z is the number of words contained in d_new, the weight of each dimension of v(d_new) being computed as
tf_wk × idf_wk
where tf_wk is the frequency of occurrence of the word w_k in the document set and idf_wk is the inverse document frequency of w_k in the document set.
15. The device for automatically generating a summary for a document set according to claim 13, characterized in that it further comprises:
a judging unit, which judges, according to the result obtained by the vector calculation unit, whether the new document is a repetition and keeps only non-repeated documents.
16. The device for automatically generating a summary for a document set according to claim 13, characterized in that the document-set feature updating unit comprises:
a document set updating unit, which updates the document set D to D' = D ∪ {d_new};
a document vector list updating unit, which updates the document vector list L_D to L'_D = L_D ∪ {v(d_new)} according to the vector of the new document obtained by the vector calculation unit;
a document set center vector updating unit, which updates the center vector c of the document set D to c' according to the vector of the new document obtained by the vector calculation unit, with the concrete formula
c' = (1 / |D'|) Σ_{d_i ∈ D'} v(d_i)
where |D'| is the number of documents in the document set D'.
17. The device for automatically generating a summary for a document set according to claim 13, characterized in that the weight calculation unit comprises:
a content weight calculation unit, which calculates the content weight w_content(s_i) of sentence s_i according to the results obtained by the vector calculation unit and the document-set feature updating unit:
w_content(s_i) = cos(v(s_i), c') = (v(s_i) · c') / (||v(s_i)|| × ||c'||)
where v(s_i) is the vector of sentence s_i and c' is the center vector of the document set D after the update.
18. The device for automatically generating a summary for a document set according to claim 17, characterized in that the weight calculation unit further comprises:
a position information recording unit, which records the position information of each sentence in the new document;
a position weight calculation unit, which calculates the position weight w_position(s_i) of sentence s_i according to the sentence position information recorded by the position information recording unit and the content weight obtained by the content weight calculation unit:
w_position(s_i) = ((n - i + 1) / n) × max_{s ∈ d_new} w_content(s)
where n is the total number of sentences of the new document d_new, i is the sentence index with 1 ≤ i ≤ n, and max_{s ∈ d_new} w_content(s) is the maximum content weight over all sentences of d_new;
a combined weight calculation unit, which calculates the combined weight w(s_i) of sentence s_i from the content weight and the position weight:
w(s_i) = α · w_content(s_i) + β · w_position(s_i)
where α and β are parameters with 0 ≤ α, β ≤ 1 and α + β = 1.
19. The device for automatically generating a summary for a document set according to claim 12, characterized in that the selection and sorting unit comprises:
a sorting unit, which sorts the sentences of the new document and of the existing summary of the document set by weight according to the weights calculated by the weight calculation unit;
a filtering unit, which deletes the repeated sentences after sorting;
the filtering unit comprising a time recording unit, which records the reception time of the new document.
CN2007101874807A 2007-11-27 2007-11-27 Method and device of automatically generating a summary for document set Expired - Fee Related CN101446940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101874807A CN101446940B (en) 2007-11-27 2007-11-27 Method and device of automatically generating a summary for document set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101874807A CN101446940B (en) 2007-11-27 2007-11-27 Method and device of automatically generating a summary for document set

Publications (2)

Publication Number Publication Date
CN101446940A CN101446940A (en) 2009-06-03
CN101446940B true CN101446940B (en) 2011-09-28

Family

ID=40742624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101874807A Expired - Fee Related CN101446940B (en) 2007-11-27 2007-11-27 Method and device of automatically generating a summary for document set

Country Status (1)

Country Link
CN (1) CN101446940B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9916375B2 (en) 2014-08-15 2018-03-13 International Business Machines Corporation Extraction of concept-based summaries from documents

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699525B (en) * 2014-01-03 2016-08-31 江苏金智教育信息股份有限公司 A kind of method and apparatus automatically generating summary based on text various dimensions feature
CN105989058A (en) * 2015-02-06 2016-10-05 北京中搜网络技术股份有限公司 Chinese news brief generating system and method
CN104778204B (en) * 2015-03-02 2018-03-02 华南理工大学 More document subject matters based on two layers of cluster find method
KR101656245B1 (en) * 2015-09-09 2016-09-09 주식회사 위버플 Method and system for extracting sentences
CN106599148A (en) * 2016-12-02 2017-04-26 东软集团股份有限公司 Method and device for generating abstract
CN107590193A (en) * 2017-08-14 2018-01-16 安徽晶奇网络科技股份有限公司 A kind of government affairs public sentiment management system for monitoring
CN107688652B (en) * 2017-08-31 2020-12-29 苏州大学 Evolution type abstract generation method facing internet news events
CN109815328B (en) * 2018-12-28 2021-05-25 东软集团股份有限公司 Abstract generation method and device
CN110377808A (en) * 2019-06-14 2019-10-25 北京达佳互联信息技术有限公司 Document processing method, device, electronic equipment and storage medium
CN110222344B (en) * 2019-06-17 2022-09-23 上海元趣信息技术有限公司 Composition element analysis algorithm for composition tutoring of pupils
CN110287309B (en) * 2019-06-21 2022-04-22 深圳大学 Method for quickly extracting text abstract
CN110287289A (en) * 2019-06-25 2019-09-27 北京金海群英网络信息技术有限公司 A kind of document keyword extraction and the method based on document matches commodity
CN110728143A (en) * 2019-09-23 2020-01-24 上海蜜度信息技术有限公司 Method and equipment for identifying document key sentences
CN110750976A (en) * 2019-09-26 2020-02-04 平安科技(深圳)有限公司 Language model construction method, system, computer device and readable storage medium
CN110705287B (en) * 2019-09-27 2023-06-30 北京妙笔智能科技有限公司 Method and system for generating text abstract
CN113204956B (en) * 2021-07-06 2021-10-08 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN115098667B (en) * 2022-08-25 2023-01-03 北京聆心智能科技有限公司 Abstract generation method, device and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1341899A (en) * 2000-09-07 2002-03-27 国际商业机器公司 Method for automatic generating abstract from word or file
CN1609845A (en) * 2003-10-22 2005-04-27 国际商业机器公司 Method and apparatus for improving readability of automatic generated abstract by machine
CN101008941A (en) * 2007-01-10 2007-08-01 复旦大学 Successive principal axes filter method of multi-document automatic summarization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1341899A (en) * 2000-09-07 2002-03-27 国际商业机器公司 Method for automatic generating abstract from word or file
CN1609845A (en) * 2003-10-22 2005-04-27 国际商业机器公司 Method and apparatus for improving readability of automatic generated abstract by machine
CN101008941A (en) * 2007-01-10 2007-08-01 复旦大学 Successive principal axes filter method of multi-document automatic summarization

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9916375B2 (en) 2014-08-15 2018-03-13 International Business Machines Corporation Extraction of concept-based summaries from documents

Also Published As

Publication number Publication date
CN101446940A (en) 2009-06-03

Similar Documents

Publication Publication Date Title
CN101446940B (en) Method and device of automatically generating a summary for document set
CN106055538B (en) The automatic abstracting method of the text label that topic model and semantic analysis combine
CN100405371C (en) Method and system for abstracting new word
CN102033955B (en) Method for expanding user search results and server
Cano Basave et al. Automatic labelling of topic models learned from twitter by summarisation
CN101377777A (en) Automatic inquiring and answering method and system
CN103365924A (en) Method, device and terminal for searching information
CN102999625A (en) Method for realizing semantic extension on retrieval request
CN102737021B (en) Search engine and realization method thereof
CN100511214C (en) Method and system for abstracting batch single document for document set
CN101706812B (en) Method and device for searching documents
CN102722498A (en) Search engine and implementation method thereof
CN101315623A (en) Text subject recommending method and device
CN100435145C (en) Multiple file summarization method based on sentence relation graph
CN102722501A (en) Search engine and realization method thereof
CN103678412A (en) Document retrieval method and device
CN102722499A (en) Search engine and implementation method thereof
CN109948154A (en) A kind of personage's acquisition and relationship recommender system and method based on name
Liu et al. The research of Web mining
CN101599075B (en) Chinese abbreviation processing method and device therefor
CN111651675B (en) UCL-based user interest topic mining method and device
Chakraborty Domain keyword extraction technique: Anew weighting method
Viveros-Jiménez et al. Improving the boilerpipe algorithm for boilerplate removal in news articles using html tree structure
CN103984731A (en) Self-adaption topic tracing method and device under microblog environment
Lin et al. Combining a segmentation-like approach and a density-based approach in content extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110928