CN102411621B

CN102411621B - Query-oriented Chinese multi-document automatic summarization method based on the cloud model

Info

Publication number: CN102411621B
Application number: CN201110373752.9A
Authority: CN
Inventors: 陈劲光; 何婷婷; 胡珀; 赵军民; 李芳�
Original assignee: Huazhong Normal University
Current assignee: Huazhong Normal University
Priority date: 2011-11-22
Filing date: 2011-11-22
Publication date: 2014-01-08
Anticipated expiration: 2031-11-22
Also published as: CN102411621A

Abstract

The invention discloses a query-oriented Chinese multi-document automatic summarization method based on the cloud model. The method comprises the following steps: splitting sentences, segmenting words, and removing stop words for the query and the document collection; representing the query and the documents as vectors; processing the resulting vectors with the cloud model; building a Chinese corpus and modifying the source code of the English automatic summary evaluation tool ROUGE so that Chinese summaries can be evaluated automatically and parameters can be trained; finding the sentences relevant to the query and calculating the importance of each sentence in the document collection; scoring each sentence on both aspects; and removing redundancy to generate the initial summary. Given a query, the technical scheme can automatically acquire the relevant document collection through a search engine and then automatically generate the summary the user needs. It returns the important content the user needs directly, so the user does not have to waste time searching web pages for the desired result. The invention is the first complete system suitable for query-oriented Chinese multi-document automatic summarization. Experiments on large-scale Chinese and English corpora show that the system performs well.

Description

A query-oriented Chinese multi-document automatic summarization method based on the cloud model

Technical field

The present invention relates to the technical field of information processing, and specifically to a query-oriented multi-document automatic summarization method based on the cloud model.

Background art

With the spread of the Internet, the amount of information online is massive and growing constantly. For a simple query entered by a user, a search engine generally returns a ranked list of web pages the user may need, many of which contain irrelevant or duplicated material, so the user must spend considerable effort finding useful results. Query-oriented multi-document automatic summarization distills the content of a large number of query-relevant documents and reassembles it into a short summary of a given length, speeding up the user's access to information. It reduces the difficulty of obtaining information from massive data and improves the speed of acquiring and understanding information, which in turn improves the efficiency with which users obtain and use information and their competitiveness in the information society.

Query-oriented multi-document automatic summarization is related to, yet distinct from, technologies such as information retrieval and automatic question answering. The main task of information retrieval is to find the documents that satisfy specific search conditions; the user must still work to find the needed information in a returned document list that contains a great deal of redundant information. The main task of automatic question answering is to find the answer to a specific question; at present it is limited to particular domains and question types, and the answers it provides are sometimes too terse to be understood easily. Research on open-domain question answering also faces substantial difficulties, and its results are not yet satisfactory. Query-oriented multi-document automatic summarization combines the advantages of existing technologies such as multi-document automatic summarization, information retrieval, and automatic question answering, while avoiding their shortcomings to some extent. It has important research significance and broad application prospects in fields such as personalized information recommendation and customization, massive information acquisition, digital libraries, business intelligence analysis, e-government, and mobile computing.

By summarization approach, query-oriented multi-document automatic summarization can be divided into the information-extraction type and the extractive type. The main difference is that the former extracts useful information from sentences and rewrites and combines it into a summary, while the latter selects the most important sentences by some method to form the summary. At present, extractive summarization is the mainstream research direction. By research object, research on query-oriented multi-document automatic summarization can be divided into summarization for specific domains and summarization for the open domain. Although open-domain summarization systems are generally less readable than domain-specific ones, they are widely applicable and highly portable, and are the current mainstream direction. The method of the invention is extractive and targets the open domain.

The cloud model, proposed by Academician Li Deyi, is a qualitative-quantitative transformation model that handles the fuzziness and randomness of uncertain concepts and the relation between them. The cloud model starts from the uncertainty of natural language concepts and extends to research on artificial intelligence with uncertainty. Although the cloud model originates from concepts in natural language, the literature collected so far regrettably shows that work applying the cloud model directly to natural language processing itself is still rare. The method of the invention is a typical application of the cloud model in natural language processing and can be extended to other areas of natural language processing.

A query-oriented multi-document automatic summarization system generally consists of three stages: internal text representation, text analysis, and summary extraction and generation. The internal representation stage converts the input text into an internal representation. The text analysis stage analyzes the text at different levels to determine the importance of each basic text unit (sentence, paragraph, section, and so on). The summary extraction and generation stage ranks the summary units to generate a summary that is coherent and reflects the theme of the source text. At present, summarization systems differ mainly in the latter two stages.

In the text analysis stage, extraction-based methods mainly include high-frequency-word-based, graph-based, topic-based, and semantics-based methods. These existing methods can be summarized as follows: find some random distribution over summary units, analyze the distribution with statistics, graph methods, or more complex language models, and estimate the importance of the summary units accordingly. After text analysis, a summary can be generated directly by choosing the most important sentences, but because the sentences are simply quoted and stacked together, the resulting summary is highly redundant, poorly coherent and readable, and hard for readers to understand.

The summary extraction and generation stage adjusts and revises the selected sentences on the basis of the previous stage. The main current techniques are redundancy removal, sentence pruning, and sentence ordering. Redundancy removal generally uses the MMR method, which considers not only the importance of a sentence when choosing summary sentences but also its relevance to the summary sentences already selected, choosing sentences that are important yet unrelated to the selected summary sentences.

Sentence pruning removes content in a sentence that carries little or no useful information and expresses the core content of the sentence in a relatively simple, grammatical form. It can effectively raise the useful information content of a summary, so that more is expressed in limited space. In recent years, mobile Internet access has gradually become a major way of obtaining information, and one marked difference between mobile phones and computers is screen size. Short, concise summaries help mobile users get the information they need faster, so sentence pruning is likely to receive more attention. At present, research on Chinese sentence pruning is still extremely rare.

Sentence ordering rearranges the sentences in a summary so that the ordered summary is more coherent and easier for readers to understand; it is also one of the key technologies of automatic summarization. At present there are three main sentence ordering methods: chronological ordering, majority ordering, and probabilistic ordering. Chronological ordering sorts sentences by the publication or release date of the source documents; its limitations are that actual time information is often hard to obtain and that it ignores topical factors. The basic idea of majority ordering is to determine the order of summary sentences from the order of the topics they belong to, where the order of topics is determined by the positions of most of the sentences in each topic. Its limitation is that the generated summary is readable only when the relative positions of the topics within the documents are fairly stable; when the relative positions change frequently, the summary structure easily becomes confused. Probabilistic ordering decomposes summary sentences into features, learns the ordering of these features from a training corpus, and then uses the feature order to determine the order of the summary sentences; its limitation is its dependence on the training corpus, since the quality of the manually selected corpus strongly affects the ordering. Liu Dexi of Wuhan University proposed a hybrid model for multi-document summary sentence ordering that linearly combines positional, temporal, dependency, and topical relations. Jiang Xiaoyu of Beijing Institute of Technology proposed a sentence ordering method that combines the cohesion between local topics with majority ordering. Ma Liang of Central China Normal University proposed a summary sentence ordering strategy based on the fusion of single templates, which selects a representative template according to the summaries of the documents and uses the template to order the summary sentences, ensuring their logical coherence. Xu Yongdong et al. of Harbin Institute of Technology proposed a sentence ordering method based on processing textual temporal information, with algorithms for extracting Chinese temporal information, computing its semantics, and temporal inference.

In 2011 the inventors published in a journal a query-oriented multi-document automatic summarization method based on the cloud model. The published method is limited to English corpora and to innovations in the second stage described above, namely the text analysis stage.

Summary of the invention

To solve the above technical problems, the invention provides a query-oriented Chinese multi-document automatic summarization method based on the cloud model. It takes the latest research results of the cloud model, a development in the study of uncertainty, as theoretical guidance, and applies the ideas and methods of the cloud flexibly in every stage of building the system. It fully considers the uncertain factors in summary generation and exploits them to improve system performance. For a given Chinese document collection and query condition, the system automatically generates a concise, coherent summary of a specified length that meets the query need. The method is applicable to Chinese corpora; the summaries it generates agree closely with manual summaries and are highly readable, which reduces the time users spend looking for information.

To achieve the above object, the invention provides a query-oriented Chinese multi-document automatic summarization method based on the cloud model, comprising the following steps:

1) perform sentence splitting, word segmentation, and stop-word removal on the query and the document collection, and represent the query and the documents as vectors;

2) process the resulting vectors with the cloud model: first calculate the relevance between each summary unit and the query condition by treating the relevance of the summary unit to each query word as a cloud drop and, by computing the uncertainty of the cloud, finding the summary units relevant to the true meaning of the query condition; then revise the query-relevance result with the importance within the document collection by treating the similarity between a summary sentence and each other summary sentence as a cloud drop and using the numerical characteristics of the cloud to calculate sentence importance; finally combine the two factors to score each sentence;

3) train the parameters, remove redundancy, and generate the initial summary;

4) formulate sentence pruning rules and prune the sentences of the initial summary to produce multiple candidate sentences, then use the multi-dimensional cloud to choose a pruned sentence to replace the original summary sentence, generating the refined summary;

5) cluster the document collection of the generated refined summary to find the sub-topics, each containing one or more summary sentences; treat every document in the document collection as a template, so that the set of templates forms a cloud, namely the cloud template; use the cloud template to order first the sub-topics and then the summary sentences within each sub-topic, finally generating the required summary.

Furthermore, the specific method in step 2) for processing the vectors with the cloud model and finally scoring the sentences is as follows:

(1) Calculate the query-relevance score of each sentence, comprising the following steps:

a. use the HAL method to calculate the degree of association between the words in the document collection and the query words;

b. treat the relevance between a word and each word in the query condition as a cloud drop, and use the backward cloud generator to obtain the numerical characteristics of the cloud (Figure 2011103737529100002DEST_PATH_IMAGE002);

c. linearly combine the numerical characteristics of the cloud obtained in step b to obtain the query-relevance score of the word;

d. treat the query-relevance score (given by step c) of each word contained in a sentence as a cloud drop, and use the backward cloud generator to obtain the numerical characteristics of the cloud (Figure 2011103737529100002DEST_PATH_IMAGE004);

e. linearly combine the numerical characteristics of the cloud obtained in step d to obtain the query-relevance score of the sentence.

(2) Calculate the importance score of each sentence, comprising the following steps:

a. calculate the cosine similarity between sentences;

b. treat the similarity between a sentence and each sentence in the document collection as a cloud drop, and use the backward cloud generator to obtain the numerical characteristics of the cloud (Figure 2011103737529100002DEST_PATH_IMAGE006);

c. linearly combine the numerical characteristics of the cloud obtained in step b to obtain the importance score of the sentence.

(3) Calculate the overall score of each sentence: linearly combine the query-relevance score and the importance score obtained in (1) and (2) to obtain the overall score of the sentence.
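The backward cloud generator used throughout the scoring steps can be illustrated with a minimal Python sketch. It assumes the common certainty-free variant of the algorithm (sample mean for Ex, first-order absolute central moment for En, variance residual for He); the patent does not state which variant is used. The function names, the sample values, and the combination weights are hypothetical.

```python
import math
from statistics import mean


def backward_cloud(drops):
    """Estimate the numerical characteristics (Ex, En, He) from cloud drops."""
    n = len(drops)
    if n < 2:
        raise ValueError("need at least two cloud drops")
    ex = mean(drops)  # expectation: sample mean
    # entropy: scaled first-order absolute central moment
    en = math.sqrt(math.pi / 2) * mean(abs(x - ex) for x in drops)
    var = sum((x - ex) ** 2 for x in drops) / (n - 1)  # sample variance
    # hyper-entropy; abs() guards against a negative difference on small samples
    he = math.sqrt(abs(var - en ** 2))
    return ex, en, he


def linear_score(features, weights):
    """Linear combination of (Ex, En, He) with caller-supplied weights."""
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical relevance values of one word to four query words.
ex, en, he = backward_cloud([0.2, 0.4, 0.6, 0.8])
score = linear_score((ex, en, he), (1.0, -0.5, -0.5))  # illustrative weights only
```

The same two functions could serve each place where cloud drops are turned into a score: word-to-query relevance, sentence-level relevance, and sentence importance. Only the drops and the trained weights would differ.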

Furthermore, in the sentence pruning step, using the multi-dimensional cloud to choose a pruned sentence to replace the original summary sentence specifically means: treat three aspects of a word, namely its distribution across the documents of the collection, its distribution across all sentences, and its relevance to all query words, as cloud drops respectively; obtain the numerical characteristics of the three clouds with the backward cloud generator to form the word's multi-dimensional cloud; obtain the word's one-dimensional cloud by comprehensive cloud computation; let the one-dimensional clouds of the words form the sentence's multi-dimensional cloud; calculate the importance score of each candidate sentence; then calculate the information density of the candidate sentence from that score together with the candidate sentence length; and replace the original summary sentence with the candidate sentence of highest information density.

Furthermore, in the sentence pruning step, the importance score of a candidate sentence is obtained by calculating the similarity between the candidate sentence's multi-dimensional cloud and the original sentence's multi-dimensional cloud. The similarity between the two multi-dimensional clouds is calculated as:

Figure 2011103737529100002DEST_PATH_IMAGE007

where C1 and C2 are two multi-dimensional clouds; Ex_1k, Ex_2k, En_1k, En_2k, He_1k, He_2k are respectively the expectation, entropy, and hyper-entropy of the k-th attribute value of concepts C1 and C2; and V_k is the weight of attribute k, with a value between 0 and 1 that depends on the specific attribute and its relations.

Furthermore, in the sentence pruning step, the information density of a candidate sentence is calculated as:

Figure 2011103737529100002DEST_PATH_IMAGE008

where C and O denote the candidate sentence and the original sentence respectively, and the function Length computes sentence length in words.

Furthermore, in step 4), using the cloud template to order the sub-topics specifically means: the one-dimensional clouds of the summary sentences contained in a topic form the topic's relative-position multi-dimensional cloud; comprehensive cloud computation yields the topic's relative-position one-dimensional cloud; and the expectation Ex gives the topic's relative-position score, by which the topics are ordered.

Furthermore, in step 4), using the cloud template to order the summary sentences within a sub-topic specifically means: in every document, find the sentence most similar to the summary sentence obtained in the preceding step and take its position as the relative position of the summary sentence in that document; treat each relative position as a cloud drop and run the backward cloud computation to obtain the numerical characteristics of the sentence's relative-position cloud; the expectation Ex gives the relative-position score of each sentence within the topic, by which the sentences within the topic are ordered.
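The within-topic ordering described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: sentences are lists of tokens, documents are non-empty lists of sentences, similarity is plain cosine over token counts, and relative position is the sentence index scaled to the range 0 to 1. Only the expectation Ex (the mean of the cloud drops) is used for the score, as in the text. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean


def cosine(a, b):
    """Cosine similarity between two token lists, using raw counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def relative_position_ex(sentence, documents):
    """Ex of the relative positions of the best-matching sentence per document."""
    drops = []
    for doc in documents:  # each document acts as one template
        best = max(range(len(doc)), key=lambda i: cosine(sentence, doc[i]))
        drops.append(best / max(len(doc) - 1, 1))  # scale index to [0, 1]
    return mean(drops)


def order_sentences(summary, documents):
    """Sort summary sentences by their relative-position score Ex."""
    return sorted(summary, key=lambda s: relative_position_ex(s, documents))


docs = [
    [["a", "b"], ["c", "d"], ["e", "f"]],
    [["c", "d"], ["e", "f"]],
]
ordered = order_sentences([["e", "f"], ["a", "b"], ["c", "d"]], docs)
```

In this toy input the sentence `["a", "b"]` tends to appear first in the templates and `["e", "f"]` last, so the sort places them accordingly.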

Compared with the prior art, the method of the invention has the following effects: because the cloud model is adopted, uncertainty is fully taken into account, which ensures better performance in every stage of summary generation; sentence pruning makes the summary more concise and better suited to fields such as mobile search that demand conciseness; sentence ordering reduces abrupt jumps in content and makes the summary more coherent; and experiments on large-scale corpora demonstrate the validity of the proposed method.

The technology of the invention can, for a given query, automatically acquire the relevant document collection through a search engine and then automatically generate the summary the user needs. The invention can return the important content the user needs directly, so the user does not have to spend a great deal of time finding the needed results in web pages. The invention is the first complete system currently known that is suitable for query-oriented Chinese multi-document automatic summarization, and experiments on large-scale Chinese and English corpora show that the system performs well.

Brief description of the drawings

Fig. 1 is the overall flow chart of the query-oriented Chinese multi-document automatic summarization method based on the cloud model according to the invention.

Fig. 2 is the flow chart of the sentence selection process.

Fig. 3 is the flow chart of the redundancy removal process.

Fig. 4 is the flow chart of the sentence pruning process.

Fig. 5 is the flow chart of the sentence ordering process.

Fig. 6 shows, on TAC 2010 evaluation data set A (whose task is similar to query-oriented multi-document automatic summarization), the results of the cloud summarization system described in step (1) of the embodiment, numbered 23: on (a) ROUGE-2, (b) ROUGE-SU4, (c) Basic Elements, and (d) the manual metric Average Overall Responsiveness, it ranked 3rd, 2nd, 2nd, and 3rd respectively among the 43 participating systems. A to H are manual summaries and 1 to 43 are machine summaries (only the top ten are listed).

Fig. 7 shows the ROUGE evaluation results and 95% confidence intervals of the system described in step (1) of the embodiment and the baseline system SumFocus.

Fig. 8 shows the manual evaluation results comparing the method of step (2) of the embodiment with manually pruned sentences.

Fig. 9 shows the effect of sentence pruning on the ROUGE evaluation results.

Fig. 10 shows the manual evaluation of summary readability, as the percentage of each category of results.

Embodiment

Embodiment 1

(1) Generate the initial summary directly after redundancy removal, comprising the following steps:

1. Perform sentence splitting, word segmentation, and stop-word removal on the query condition and the document collection, using the sentence splitting module (SplitSentence) and the word segmentation module (CRFWordSeg) of LTP v2.01, developed by Harbin Institute of Technology. The latest LTP word segmentation module is built on a CRF model and reaches a segmentation F1 of 97.4%. After segmentation, stop words are removed with a self-built stop-word list.

2. Represent the sentences of the document collection and the query condition obtained from the above processing in vector form: the number of rows is the number of sentences, the number of columns is the number of distinct words, and each element is the number of times a given word occurs in a given sentence.
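As an illustration of this representation, the following Python sketch builds the sentence-by-word count matrix from sentences that have already been segmented and stripped of stop words. The function name and the sample tokens are hypothetical; segmentation itself (done with LTP in the embodiment) is outside the sketch.

```python
def build_count_matrix(sentences):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in sentence i."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in sentences]
    for i, sentence in enumerate(sentences):
        for w in sentence:
            matrix[i][index[w]] += 1
    return vocab, matrix


# Hypothetical pre-segmented sentences.
sentences = [["广州", "亚运会", "开幕"], ["亚运会", "闭幕", "亚运会"]]
vocab, matrix = build_count_matrix(sentences)
```

Each row corresponds to one sentence and each column to one distinct word, matching the row and column definitions given in step 2.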

3. Cloud-model-based sentence selection (Fig. 2): score all sentences in the document collection, in four steps:

(1) Calculate the query-relevance score of each sentence, comprising the following steps:

a. Use the HAL method to calculate the degree of association between the words in the document collection and the query words. The HAL method can be pictured as a window co-occurrence method: it uses the co-occurrence of a word and a query word within a window of a certain length to calculate the relevance score between them, and so captures the semantic association between the word and the query word. Within a window of length K, observe the co-occurrence of a word (w) in the document collection and a query word (w'); then slide the window over the whole document collection, moving forward one word at a time. Count the co-occurrences of the word and the query word at each distance: the smaller the distance and the more co-occurrences, the more relevant the word is to the query word.

Let (Figure 2011103737529100002DEST_PATH_IMAGE009) denote the number of times w and w' co-occur at distance k, and let W(k) = K - k + 1 denote the co-occurrence strength of w and w'. The relevance of a word to a query word can then be expressed as:

Figure 2011103737529100002DEST_PATH_IMAGE011
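A minimal Python sketch of the windowed co-occurrence computation follows. It uses the weight W(k) = K - k + 1 from the text and assumes the relevance is the weighted sum of co-occurrence counts over distances k = 1..K; the exact expression appears only in the formula image, so this form, the symmetric window, and the function name are assumptions.

```python
from collections import defaultdict


def hal_relevance(tokens, query_words, K=5):
    """rel[w][q] = sum over k of W(k) * n(w, k, q), with W(k) = K - k + 1."""
    query = set(query_words)
    rel = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for k in range(1, K + 1):
            for j in (i - k, i + k):  # look both ways (assumed symmetric window)
                if 0 <= j < len(tokens) and tokens[j] in query:
                    rel[w][tokens[j]] += K - k + 1  # closer pairs weigh more
    return rel


# Hypothetical token stream with a single query word "q".
rel = hal_relevance(["a", "q", "b", "a"], ["q"], K=2)
```

With K = 2, a word adjacent to the query word contributes weight 2 and a word two positions away contributes weight 1, so nearer and more frequent co-occurrences give a larger relevance value.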

c. Linearly combine (linear combination 1) the numerical characteristics of the cloud obtained in step b to obtain the query-relevance score of the word:

Figure 2011103737529100002DEST_PATH_IMAGE012

d. Treat the query-relevance score (given by step c) of each word contained in a sentence as a cloud drop, and use the backward cloud generator to obtain the numerical characteristics of the cloud.

e. Linearly combine (linear combination 2) the numerical characteristics of the cloud obtained in step d to obtain the query-relevance score of the sentence:

Figure 2011103737529100002DEST_PATH_IMAGE014

(2) Calculate the importance score of each sentence, comprising the following steps:

a. Calculate the cosine similarity between sentences.

The vector space model is used to calculate the similarity between sentences. For a given document collection, each sentence is represented as an m-dimensional vector (w_i1, w_i2, ..., w_im), where m is the number of distinct words in the document collection and each dimension of the vector space corresponds to one word in the vocabulary. The weight of each element of the vector is the TF-ISF score of the word corresponding to that element's dimension, that is:

Figure 2011103737529100002DEST_PATH_IMAGE015

where TF is the term frequency of word w in sentence S, and ISF is the inverse sentence frequency, calculated by the following formula:

Figure 2011103737529100002DEST_PATH_IMAGE016

where N is the total number of sentences in the document collection and n is the number of sentences containing word w.

The similarity between sentences can be calculated as the cosine similarity between their vectors:

Figure 2011103737529100002DEST_PATH_IMAGE018
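The TF-ISF weighting and cosine similarity can be sketched in Python as below. The ISF form log(N / n) is an assumption, since the exact expression is only available as a formula image; the cosine formula is the standard one. Function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def tf_isf_vectors(sentences):
    """Sparse TF-ISF vector per sentence, assuming ISF = log(N / n)."""
    N = len(sentences)
    sf = Counter(w for s in sentences for w in set(s))  # n: sentences containing w
    vectors = []
    for s in sentences:
        tf = Counter(s)
        vectors.append({w: tf[w] * math.log(N / sf[w]) for w in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


v0, v1, v2 = tf_isf_vectors([["a", "b"], ["a", "c"], ["d"]])
sim01 = cosine(v0, v1)  # share the word "a"
sim02 = cosine(v0, v2)  # share no words
```

The similarities of one sentence to every other sentence in the collection would then serve as the cloud drops for the importance score.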

c. Linearly combine (linear combination 3) the numerical characteristics of the cloud obtained in step b to obtain the importance score of the sentence:

Figure 2011103737529100002DEST_PATH_IMAGE019

(3) Calculate the overall score of each sentence.

Linearly combine (linear combination 4) the query-relevance score and the importance score obtained in (1) and (2) to obtain the overall score of the sentence.

(4) Parameter training.

For Chinese corpora, build a query-oriented Chinese multi-document automatic summarization corpus and a Chinese summary automatic evaluation tool, and determine the parameters in (1), (2), and (3):

Figure 2011103737529100002DEST_PATH_IMAGE021

,

Figure 2011103737529100002DEST_PATH_IMAGE022

,

Figure 2011103737529100002DEST_PATH_IMAGE023

, and δ. This comprises the following steps:

A. Build a query-oriented Chinese multi-document automatic summarization corpus.

No public query-oriented Chinese multi-document automatic summarization corpus exists at present, so the invention builds one. First, 100 hot-event topics from 2009-2010 are selected manually, and each event title (for example "the Guangzhou Asian Games") is entered into a search engine as the query. The search results are automatically extracted and converted into relevant documents, using a relevant-document extraction method based on tag density. Once the relevant document collections are fixed, experts write the query conditions and manual summaries. The final corpus contains 100 document collections, 1,000 relevant documents, and 400 manual summaries.

B. Build a Chinese summary evaluation tool to score automatically the summaries generated under different parameters.

The tool is obtained by modifying the English automatic evaluation tool ROUGE. The specific steps for modifying the source program to obtain ROUGE-CN are as follows:

Step 1: use the CRFWordSeg module of the LTP v2.01 platform to segment the manual summaries and automatic summaries into words, separating the words with spaces.

Step 2: replace the entire content of "smart_common_words.txt" in the "data" folder of the ROUGE installation package with a Chinese stop-word list.

Step 3: find and delete the statements in the source program that filter out Chinese characters.

C. Parameter training.

According to the constraint conditions, the parameters

Figure 2011103737529100002DEST_PATH_IMAGE024

,

Figure 2011103737529100002DEST_PATH_IMAGE025

,

Figure 2011103737529100002DEST_PATH_IMAGE026

should be one group from the following candidate parameter set:

Figure 2011103737529100002DEST_PATH_IMAGE027

In the actual training process, the locally optimal solutions of the three parameters above and of δ are determined in turn; the 3 best solutions for each are then combined exhaustively, generating 3^4 = 81 summaries, and the best parameter combination is selected by automatic evaluation.
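The exhaustive combination step can be sketched in Python with `itertools.product`. Keeping the 3 best locally optimal values for each of four parameters gives 3 × 3 × 3 × 3 = 81 combinations to evaluate. The parameter names `p1` to `p3`, the candidate values, and the toy scoring function are hypothetical stand-ins; in the embodiment each combination would be scored by generating a summary and running the automatic evaluation tool.

```python
from itertools import product


def best_parameters(candidates, evaluate):
    """Try every combination of candidate values; return (best combination, count)."""
    names = list(candidates)
    grid = [dict(zip(names, values))
            for values in product(*(candidates[n] for n in names))]
    return max(grid, key=evaluate), len(grid)


# Three locally best values per parameter (placeholder numbers).
candidates = {
    "p1": [0.3, 0.5, 0.7],
    "p2": [0.3, 0.5, 0.7],
    "p3": [0.3, 0.5, 0.7],
    "delta": [0.3, 0.5, 0.7],
}


def toy_score(params):
    """Stand-in for the automatic evaluation score of one generated summary."""
    return -sum((v - 0.5) ** 2 for v in params.values())


best, n_combinations = best_parameters(candidates, toy_score)
```

Restricting the grid to a few locally optimal values per parameter keeps the number of summaries that must be generated and scored small compared with a full grid over all candidate values.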

The parameter training process fixes the parameters and therefore also fixes the overall sentence score described in step 3.

4. Redundancy removal (Fig. 3)

(1) Sort the sentences by score from high to low and choose the highest-scoring sentence as the first summary sentence.

(2) Adjust the scores of all remaining sentences: the more similar a sentence is to an already selected summary sentence, the more its score is reduced:

Figure 2011103737529100002DEST_PATH_IMAGE028

where R is the set of all sentences, F is the set of summary sentences already chosen, S_i denotes a candidate summary sentence, and S_l denotes the most recently chosen summary sentence.

(3) Check whether the summary length requirement has been reached. If so, stop; if not, return to step (1).
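The greedy loop of this redundancy removal step can be sketched in Python as follows. The score adjustment used here, reducing each remaining score in proportion to its similarity to the most recently chosen sentence, is an assumed stand-in for the formula that is only available as an image. The `penalty` parameter, the Jaccard similarity, and measuring length in words are likewise assumptions.

```python
def jaccard(a, b):
    """Token-set overlap between two sentences (assumed similarity measure)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def select_summary(sentences, scores, similarity, max_len, penalty=1.0):
    """Greedily pick sentences, lowering scores of those similar to the last pick."""
    scores = list(scores)  # work on a copy
    remaining = set(range(len(sentences)))
    chosen, length = [], 0
    while remaining and length < max_len:
        last = max(sorted(remaining), key=lambda i: scores[i])  # current best
        remaining.remove(last)
        chosen.append(last)
        length += len(sentences[last])
        for i in remaining:  # the more similar to the pick, the larger the cut
            scores[i] -= penalty * similarity(sentences[i], sentences[last]) * scores[i]
    return [sentences[i] for i in chosen]


sents = [["a", "b", "c"], ["a", "b", "d"], ["x", "y", "z"]]
summary = select_summary(sents, [0.9, 0.8, 0.5], jaccard, max_len=6)
```

In the toy input the second sentence starts with a higher score than the third, but its overlap with the first pick halves that score, so the unrelated third sentence is selected instead.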

5. Generate the initial summary.

If the total length of the chosen sentences exceeds the summary length, remove the part of the last sentence that exceeds the length to generate the final summary.

6. Results

Although this method targets Chinese corpora, it also suits English corpora. Because Chinese has no public large-scale evaluation corpus, experimental results on English corpora are presented first as supporting evidence, followed by experimental results on the Chinese corpus.

(1) In 2010 we entered the cloud summarization system formed by the five steps above in the TAC 2010 guided summarization international evaluation. To relate the guided summarization task to the query-oriented automatic summarization task, we used only the category information given by the organizers as the query condition; everything else was consistent with the previous experiments.

The corpus of the Text Analysis Conference TAC 2008 is used as the training corpus, and the parameter training process described in step 3(4) yields the parameters:

Figure 2011103737529100002DEST_PATH_IMAGE029

We submitted two systems, with IDs 6 and 23; the cloud summarization system described here is No. 23. On TAC 2010 evaluation data set A (whose task is similar to query-oriented multi-document automatic summarization), Fig. 6 shows each evaluation result for the cloud-model-based summaries. On the four evaluation metrics ROUGE-2 (a), ROUGE-SU4 (b), Basic Elements (c), and the manual overall metric Average Overall Responsiveness (d), the system ranked 3rd, 2nd, 2nd, and 3rd respectively among the 43 participating systems.

(2) First, the 100 document collections of the Chinese query-oriented multi-document automatic summarization corpus described in step 3(4)a are randomly divided into two parts of 50 document collections each, used as the training corpus and the test corpus respectively. The training corpus is mainly used to train the parameters of the cloud summarization system, and the test corpus is used to verify the experimental results.

The parameter training process described in step 3(4) yields parameters suitable for Chinese automatic summarization:

Figure 2011103737529100002DEST_PATH_IMAGE030

Fig. 7 shows the average evaluation results obtained by the cloud summarization system described here and by the baseline system SumFocus on the test corpus of 50 document collections, with 95% confidence intervals. SumFocus is a summarization system developed by Vanderwende et al. at Microsoft Research. It took part in the DUC 2006 evaluation and was one of the best-performing summarization systems, ranking first among 22 systems in the pyramid evaluation. We rebuilt this system and used it to generate summaries on the Chinese corpus. As Fig. 7 shows, the method described here improves significantly on SumFocus in every score.

These results show that the method described here agrees well with human-written summaries in content. Put simply, measured in words, on average about 1/3 of the content of the summaries generated by the method of the present invention is the same as that of the human-written summaries.

(II) After redundancy removal has generated the initial summary, a sentence pruning step is carried out: less important or irrelevant sentence elements are deleted from the summary sentences, further increasing the information content of the summary.

The sentence pruning process corresponds to Fig. 4 and comprises the following steps:

1. Build a hand-crafted rule base for sentence pruning.

The following table gives brief descriptions and examples of the 10 hand-crafted rules adopted here. The content that each rule deletes, underlined in the original, is enclosed in square brackets.

Rule	Description	Example
1	Parenthetical words	Xinhua News Agency, Bonn, Dec 30 [(reporter Lv Hong)] German Foreign Minister Jin Keer told the press on the 30th that in 1997 Europe made achievements in the economic, political, and diplomatic fields.
2	Independent construction	[It is reported that] the Harbin Municipal Youth League Committee starts by helping laid-off young workers improve their employability, providing genuine service for the re-employment of young people.
3	Independent adverbial at the start of the sentence	[Early this morning,] to the resounding national anthem, a flag-raising ceremony was solemnly held in Lhasa.
4	"XX says," at the start of the sentence	[This newspaper, Paris, May 26, reporters Liu Zhengxue and Guo Yongyi report:] Li Ruihuan, Chairman of the Chinese People's Political Consultative Conference, met members of the France-China friendship group of the French Senate in Paris on the 26th.
5	Independent conjunction at the start of the sentence	[So,] the unification of the country can be achieved.
6	Prepositional phrase serving as an adverbial	Carrying out this work, [for establishing the socialist market economy system and promoting the sustained, rapid, and sound development of the national economy,] is of great significance.
7	"的" structure (attributive modifier)	It is reported that the Harbin Municipal Youth League Committee starts by helping laid-off young workers improve their employability, providing [genuine] service for the re-employment of young people.
8	"地" structure (adverbial modifier)	China has achieved the fastest growth among large economies, and it is [successfully] moving towards a more open, market-oriented economy.
9	Adverb	If we [further] emancipate our minds, seek truth from facts, seize opportunities, and forge ahead, the road of building socialism with Chinese characteristics will grow ever broader.
10	Adjective	Knowledge industrialization was [newly] proposed by the famous American professor Mark Lu Pu in 1962 in the book "Knowledge Production and Distribution".

2. Apply the hand-crafted rules to the sentence in turn, producing multiple candidate sentences.

(1) For the sentence to be pruned, match each rule in the rule base one by one, in order, to see whether the sentence satisfies the rule.

(2) If a rule is satisfied, prune the sentence as the rule requires to obtain a candidate sentence, and use this candidate sentence as the input to the next rule.

(3) Once the last rule has been matched, output all the candidate sentences obtained so far as the set of candidate sentences.
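The cascade in steps (1) to (3) can be sketched as below. The two rules are simplified, English-language stand-ins written as regular expressions purely for illustration; the actual rule base operates on Chinese sentences and is not specified at this level of detail in the source. All function names are hypothetical.

```python
import re
from typing import Callable, List, Optional

# A rule returns the pruned sentence, or None when it does not apply.
Rule = Callable[[str], Optional[str]]


def drop_parenthetical(sentence: str) -> Optional[str]:
    """Illustrative stand-in for rule 1: delete text in parentheses."""
    pruned = re.sub(r"\s*\([^()]*\)", "", sentence)
    return pruned if pruned != sentence else None


def drop_leading_report_phrase(sentence: str) -> Optional[str]:
    """Illustrative stand-in for rule 2: delete a leading 'It is reported that'."""
    pruned = re.sub(r"^It is reported that\s+", "", sentence)
    return pruned if pruned != sentence else None


def generate_candidates(sentence: str, rules: List[Rule]) -> List[str]:
    """Apply the rules in order; each match yields a candidate that feeds the next rule."""
    candidates: List[str] = []
    current = sentence
    for rule in rules:
        pruned = rule(current)
        if pruned is not None:
            candidates.append(pruned)
            current = pruned
    return candidates
```

Because each candidate is the input to the next rule, later candidates are never longer than earlier ones, and a sentence that matches no rule produces no candidates.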

3. Score the importance of the words in the candidate sentences from three different aspects to obtain the word multidimensional cloud:

(1) Treat the frequencies with which the word occurs in each document as cloud drops, and use the backward cloud generator to obtain the numerical characteristics of the cloud

Figure 2011103737529100002DEST_PATH_IMAGE031

.

(2) Treat the frequencies with which the word occurs in each sentence of the document collection as cloud drops, and use the backward cloud generator to obtain the numerical characteristics of the cloud

Figure 2011103737529100002DEST_PATH_IMAGE032

.

(3) Treat the degrees of correlation between the word and each query word as cloud drops, and use the backward cloud generator to obtain the numerical characteristics of the cloud

Figure 2011103737529100002DEST_PATH_IMAGE033

. This step is the same as step 3 of part (I), and the cloud obtained is identical to the cloud obtained in sub-step d of step 3 of part (I).

(4) Combine the clouds obtained in the three preceding steps to obtain the multidimensional cloud:

WMC = {(Ex₁, En₁, He₁), (Ex₂, En₂, He₂), (Ex₃, En₃, He₃)}
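This section does not spell out the backward cloud generator itself. A minimal sketch, assuming the commonly used variant that needs no certainty degrees (Ex from the sample mean, En from the first-order absolute central moment, He from the gap between the sample variance and En²), might look like this. Clipping He at zero when the variance is smaller than En² is a further assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Cloud = Tuple[float, float, float]  # (Ex, En, He)


def backward_cloud(drops: Sequence[float]) -> Cloud:
    """Estimate (Ex, En, He) from cloud drops without certainty degrees."""
    n = len(drops)
    if n == 0:
        raise ValueError("at least one cloud drop is required")
    ex = sum(drops) / n
    # En from the first-order absolute central moment.
    en = math.sqrt(math.pi / 2) * sum(abs(x - ex) for x in drops) / n
    # He from the gap between sample variance and En^2, clipped at zero (assumed).
    variance = sum((x - ex) ** 2 for x in drops) / (n - 1) if n > 1 else 0.0
    he = math.sqrt(max(variance - en ** 2, 0.0))
    return ex, en, he


def word_multidimensional_cloud(
    doc_freqs: Sequence[float],
    sentence_freqs: Sequence[float],
    query_relevances: Sequence[float],
) -> List[Cloud]:
    """WMC: one cloud per aspect, in the order of sub-steps (1) to (3)."""
    return [
        backward_cloud(doc_freqs),
        backward_cloud(sentence_freqs),
        backward_cloud(query_relevances),
    ]
```

Each of the three input sequences holds the cloud drops for one aspect of a single word, so the result is the three-dimensional WMC of that word.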

4. Use the comprehensive cloud operation to convert the word multidimensional cloud into a word one-dimensional cloud.

The comprehensive cloud operation is defined as:

Figure 2011103737529100002DEST_PATH_IMAGE034

where the weighting coefficients are the weights of each dimension.

In the formula above, let

Figure 2011103737529100002DEST_PATH_IMAGE036

This yields the word one-dimensional cloud

Figure 2011103737529100002DEST_PATH_IMAGE037

.
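The defining formula of the comprehensive cloud operation survives only as a figure in the source. The sketch below therefore uses one commonly cited weighted form, in which each dimension contributes in proportion to its weight times its entropy; this form is an assumption and may differ from the figure. The fallback for the degenerate case and the name `comprehensive_cloud` are likewise assumptions.

```python
from typing import Sequence, Tuple

Cloud = Tuple[float, float, float]  # (Ex, En, He)


def comprehensive_cloud(clouds: Sequence[Cloud], weights: Sequence[float]) -> Cloud:
    """Merge several one-dimensional clouds into one (assumed weighted form)."""
    if len(clouds) != len(weights):
        raise ValueError("one weight per cloud is required")
    total = sum(w * en for (_, en, _), w in zip(clouds, weights))
    if total == 0:
        # Degenerate case (all weighted entropies are zero): fall back to a
        # plain weighted mean of Ex. This fallback is also an assumption.
        w_sum = sum(weights)
        ex = sum(w * c[0] for c, w in zip(clouds, weights)) / w_sum if w_sum else 0.0
        return ex, 0.0, 0.0
    ex = sum(w * en * e for (e, en, _), w in zip(clouds, weights)) / total
    he = sum(w * en * h for (_, en, h), w in zip(clouds, weights)) / total
    return ex, total, he
```

Applied to the three clouds of a WMC with one weight per dimension, this returns a single (Ex, En, He) triple that can serve as the word one-dimensional cloud.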

4. Represent each sentence as a sentence multidimensional cloud, each dimension of which is a word one-dimensional cloud.

If the original sentence contains m words, each sentence in the candidate sentence set can be represented as a vector of the

Figure 2011103737529100002DEST_PATH_IMAGE038

form. A word that appears more than once in the same sentence is treated as separate components, and the positions in the word vector correspond one-to-one to the word positions in the original sentence.

In a candidate sentence, if a word has been deleted from the original sentence, that word's position is represented by the zero vector

Figure 2011103737529100002DEST_PATH_IMAGE039

, that is:

Figure 2011103737529100002DEST_PATH_IMAGE040

The sentence multidimensional cloud (Sentence Multi-dimension Cloud, SMC) can be expressed as:

Note that although the candidate sentences derived from the same sentence differ in length, their SMCs all have the same dimension.
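A minimal sketch of this representation follows. It assumes that each candidate sentence is described by a boolean mask over the words of the original sentence (`kept`), which is a hypothetical encoding chosen for illustration; the source does not say how deletions are tracked.

```python
from typing import List, Sequence, Tuple

Cloud = Tuple[float, float, float]  # (Ex, En, He)
ZERO_CLOUD: Cloud = (0.0, 0.0, 0.0)


def sentence_multidimensional_cloud(
    word_clouds: Sequence[Cloud], kept: Sequence[bool]
) -> List[Cloud]:
    """Build an SMC with one dimension per word of the original sentence.

    word_clouds[i] is the one-dimensional cloud of the i-th original word;
    kept[i] is False when the candidate sentence deleted that word.
    """
    if len(word_clouds) != len(kept):
        raise ValueError("word_clouds and kept must have the same length")
    return [c if k else ZERO_CLOUD for c, k in zip(word_clouds, kept)]
```

Because deleted words keep their slot as a zero vector, every candidate of an m-word sentence yields an m-dimensional SMC, which is what makes the candidates directly comparable with the original.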

5. Compute the similarity between the candidate's sentence multidimensional cloud and the original sentence's multidimensional cloud to obtain the information importance score of the candidate sentence.

Because the three numerical characteristics of the cloud model play different roles, the present invention proposes an improved multidimensional cloud similarity calculation method that assigns a different weight to each numerical characteristic when computing the similarity. The more similar a pruned sentence is to the original sentence, the more important information it retains.

The similarity between two multidimensional clouds C1 and C2 is defined as:

where ex_1k, ex_2k, en_1k, en_2k, he_1k, he_2k are the expectation, entropy, and hyper-entropy of the k-th attribute value of concepts C1 and C2 respectively; v_k is the weight of attribute k, with a value between 0 and 1 that depends on the specific attributes and the relations among them. Computing the similarity between sentences requires determining the values of the weight vector

Figure 2011103737529100002DEST_PATH_IMAGE042

and

Figure 2011103737529100002DEST_PATH_IMAGE043

. Here, words related to events, namely nouns and verbs, are given higher weights, with nouns weighted higher than verbs. When converting the sentence multidimensional cloud to a sentence one-dimensional cloud, the weight of each dimension is determined by the following formula:

Figure 2011103737529100002DEST_PATH_IMAGE044

where pos denotes the part of speech (POS) of the word.

With the weight of each dimension defined, let

Figure 2011103737529100002DEST_PATH_IMAGE045

; the similarity between a candidate pruned sentence and the original sentence can then be computed and used as the information importance score of the pruned sentence.

6. Compute the information density of each candidate sentence.

The present invention proposes an improved information density calculation method suited to selecting candidate sentences:

where C and O denote the candidate sentence and the original sentence respectively, and the similarity term is the importance score of the pruned sentence obtained in step 5. The function Length computes sentence length, measured in words.

7. Replace the original sentence with the candidate sentence that has the highest information density as the summary sentence.
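The selection in step 7 can be sketched as follows. The improved information density formula itself is missing from the extracted text, so `ratio_density` below (importance divided by relative length) is only a hypothetical placeholder, and the density function is passed in as a parameter so that the real formula could be substituted. Returning the original sentence when there are no usable candidates is also an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Density = Callable[[float, int, int], float]


def ratio_density(importance: float, cand_len: int, orig_len: int) -> float:
    """Assumed placeholder: importance retained per unit of relative length."""
    return importance / (cand_len / orig_len)


def best_candidate(
    original: Sequence[str],
    candidates: List[Tuple[Sequence[str], float]],
    density: Density = ratio_density,
) -> Sequence[str]:
    """Return the candidate with the highest information density.

    Each candidate is (words, importance), where importance is the step 5
    similarity between the candidate and the original sentence.
    """
    usable = [(words, imp) for words, imp in candidates if len(words) > 0]
    if not usable:
        return original  # nothing to replace the original with (assumed behavior)
    return max(usable, key=lambda p: density(p[1], len(p[0]), len(original)))[0]
```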

8. Fill the space freed by the deleted content with the part of the last summary sentence that was previously cut for exceeding the length limit, or with a new summary sentence, and regenerate the summary.

9. Results:

(1) Manual evaluation of pruned sentence quality

Using the 50 document collections that served as the test corpus in step 6(2) of part (I) above, the first two sentences for each document collection, 100 sentences in total, were selected for manual evaluation.

Four evaluators scored the pruned sentences generated by 3 methods on a 5-point scale from 1 to 5 in two respects, grammatical correctness and information importance; a higher score means the sentence is more grammatical or contains more important information. During the evaluation, the evaluators knew the content of the original sentence, the sentences generated by the 3 methods were randomly mixed, and the evaluators did not know in advance which method had produced the sentence being evaluated.

Fig. 8 shows the results. They indicate that this method preserves important information well: while shortening sentences by 32.2%, it loses only 18% of the important information.

(2) Automatic evaluation of summary quality

Fig. 9 shows the effect on summary quality of using or not using the Chinese sentence pruning method.

In the evaluation, the ROUGE-1 evaluation tool was used to obtain the average results on the test corpus of 50 document collections, with 95% confidence intervals.

The experimental results show that the ROUGE-1 score of part (II) is 4.7% higher than that of part (I).
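For orientation, the core of ROUGE-1 is unigram overlap with a reference summary. The sketch below computes ROUGE-1 recall against a single reference for pre-tokenized text. It is a simplified illustration, not the modified ROUGE tool used in the experiments, which also handles multiple references and confidence intervals; the function name `rouge_1_recall` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def rouge_1_recall(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """Clipped unigram overlap divided by the number of reference unigrams."""
    if not reference:
        return 0.0
    cand, ref = Counter(candidate), Counter(reference)
    # Each reference word is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / len(reference)
```

For Chinese text the inputs would be the word sequences produced by word segmentation.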

(III) After the sentence pruning step, a sentence ordering step is added.

The flow chart of sentence ordering is shown in Fig. 5; it comprises the following steps:

1. Cluster the sentences in the document collection to obtain sub-topics. We use an adaptive clustering algorithm based on community detection.

2. Among the sub-topics obtained in step 1, find those that contain one or more summary sentences and discard those that contain none.

3. Obtain the sentence relative position cloud of each summary sentence:

a. Find the sentence in each document that is most similar to the summary sentence, and take its position as the relative position of the summary sentence in that document;

For a sentence S, its relative position rp in document D is defined as the relative position of the sentence in D most similar to S, computed as:

Figure 2011103737529100002DEST_PATH_IMAGE048

where the function Position returns the absolute position of a sentence in its document and N is the number of sentences in document D. Sentence similarity is computed in the same way as in step 3(2)a of part (I).

b. Treat the relative positions of the summary sentence in each document as cloud drops, and use the backward cloud generator to compute the numerical characteristics of the cloud, obtaining the sentence relative position cloud

Figure 2011103737529100002DEST_PATH_IMAGE049

.
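Sub-steps a and b can be sketched as follows. The exact formula for rp is given only as a figure, so the sketch assumes rp is the 1-based position of the most similar sentence divided by N, and it uses cosine similarity over bag-of-words counts as the sentence similarity. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(a: Sequence[str], b: Sequence[str]) -> float:
    """Cosine similarity between bag-of-words vectors of two sentences."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def relative_position(summary_sentence: Sequence[str], document: List[Sequence[str]]) -> float:
    """rp = Position(most similar sentence) / N, with 1-based positions (assumed)."""
    if not document:
        raise ValueError("document must contain at least one sentence")
    best = max(range(len(document)), key=lambda i: cosine(summary_sentence, document[i]))
    return (best + 1) / len(document)


def relative_position_drops(
    summary_sentence: Sequence[str], documents: List[List[Sequence[str]]]
) -> List[float]:
    """One cloud drop per document, ready for the backward cloud generator."""
    return [relative_position(summary_sentence, doc) for doc in documents]
```

The list returned by `relative_position_drops` is the set of cloud drops from which the sentence relative position cloud would be estimated.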

4. Obtain the topic relative position multidimensional cloud.

For a topic T, suppose the summary sentences

Figure 2011103737529100002DEST_PATH_IMAGE050

come from T. Then T can be expressed as a k-dimensional multidimensional cloud, that is:

Figure 2011103737529100002DEST_PATH_IMAGE051

5. Use the comprehensive cloud operation to obtain the topic relative position one-dimensional cloud.

6. Compute the topic relative position scores and sort the topics.

For a topic

Figure 2011103737529100002DEST_PATH_IMAGE053

, its relative position score is directly determined by its expectation:

Figure 2011103737529100002DEST_PATH_IMAGE054

Finally, the topics are sorted by relative position score from low to high: the topic with the lowest score is placed in the first part of the summary, the topic with the next lowest score in the second part, and so on until all topics have been ordered.

7. Sort the sentences within each topic and generate the final summary.

For ordering the sentences within a topic, likewise only the expectation of SRPCloud is considered. For a sentence, the relative position score is:

Figure 2011103737529100002DEST_PATH_IMAGE057

The order of the sentences is determined by this score: the smaller the score, the earlier the sentence appears within the topic.
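Steps 6 and 7 amount to a two-level sort on expectations. A minimal sketch, assuming the Ex values of the topic and sentence relative position clouds have already been computed (the data layout and the name `order_summary` are hypothetical):

```python
from typing import Dict, List, Tuple


def order_summary(
    topic_ex: Dict[str, float],
    sentences: List[Tuple[str, str, float]],
) -> List[str]:
    """Order topics by ascending Ex, then sentences within each topic by ascending Ex.

    topic_ex maps a topic id to the Ex of its relative position cloud;
    each sentence is (text, topic id, Ex of its SRPCloud).
    """
    # The topic id is part of the key so that topics with equal Ex stay contiguous.
    ranked = sorted(sentences, key=lambda s: (topic_ex[s[1]], s[1], s[2]))
    return [text for text, _, _ in ranked]
```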

8. Results

Fig. 10 gives the manual readability evaluation results for the summaries before and after using the sentence ordering module, and for the human-written summaries. Here, Perfect means a summary that could not be improved however the sentence order were changed; Acceptable means a summary that can be understood and, although reordering might improve it, does not need to be adjusted; Poor means a summary that is incoherent in places but could reach the Acceptable level with minor adjustments; Unacceptable means a summary that needs adjustment in too many places and must be reordered entirely.

The evaluation results show that sentence ordering substantially improves the readability of the summaries: 30% are rated Perfect and a further 47.5% are rated Acceptable, meaning that 77.5% of the summaries need no further modification. In addition, because sentence pruning is used, part (III) retains the advantage of part (II): the ROUGE-1 score in automatic evaluation is 4.7% higher than that of part (I).

Claims

1. A Chinese query-oriented multi-document automatic summarization method based on the cloud model, characterized by comprising the following steps:

(1) performing sentence segmentation, word segmentation, and stop-word removal on the query and the multi-document collection, and representing the query and the documents as vectors;

(2) processing the vectors with the cloud model to compute each sentence's query relevance score, importance score, and integrated score, step (2) specifically comprising:

1) computing the query relevance score of the sentence, comprising the following sub-steps:

a1. using the HAL method to compute the degree of association between the words in the document collection and the query words;

b1. treating the degrees of correlation between a word and each word in the query condition as cloud drops, and using the backward cloud generator to obtain the numerical characteristics of the cloud;

c1. linearly combining the numerical characteristics of the cloud obtained in step b1 to obtain the query relevance score of the word;

d1. treating the query relevance scores of the words contained in the sentence as cloud drops, and using the backward cloud generator to obtain the numerical characteristics of the cloud;

e1. linearly combining the numerical characteristics of the cloud obtained in step d1 to obtain the query relevance score of the sentence;

2) computing the importance score of the sentence, comprising the following sub-steps:

a2. computing the cosine similarity between sentences;

b2. treating the similarities between a sentence and each sentence in the document collection as cloud drops, and using the backward cloud generator to obtain the numerical characteristics of the cloud;

c2. linearly combining the numerical characteristics of the cloud obtained in step b2 to obtain the importance score of the sentence;

3) computing the integrated score of the sentence: linearly combining the query relevance score and the importance score of the sentence obtained in 1) and 2) to obtain the integrated score of the sentence;

(3) performing parameter training and redundancy removal, and generating an initial summary;

(4) formulating sentence pruning rules and pruning the sentences of the initial summary to produce multiple candidate sentences, using the multidimensional cloud to select a pruned sentence to replace the original summary sentence, and generating a refined summary;

(5) clustering the document collection from which the refined summary was generated, finding the sub-topics that contain one or more summary sentences, treating every document in the document collection as a template so that the set of templates forms a cloud, namely the cloud template, and using the cloud template to order first the sub-topics and then the summary sentences within each sub-topic, finally generating the required summary.

2. The Chinese query-oriented multi-document automatic summarization method based on the cloud model according to claim 1, characterized in that using the multidimensional cloud to select a pruned sentence to replace the original summary sentence specifically means: treating as cloud drops, respectively, three aspects of a word, namely its distribution across the document collection, its distribution across all sentences, and its degrees of correlation with all query words; obtaining the numerical characteristics of the three clouds with the backward cloud generator to form the word multidimensional cloud; obtaining the word one-dimensional cloud through the comprehensive cloud operation; forming the sentence multidimensional cloud from the word one-dimensional clouds; computing the importance score of each candidate sentence; computing the information density of each candidate sentence from the importance score together with the candidate sentence length; and replacing the original summary sentence with the candidate sentence of highest information density.

3. The Chinese query-oriented multi-document automatic summarization method based on the cloud model according to claim 2, characterized in that computing the candidate sentence importance score means computing the similarity between the candidate's sentence multidimensional cloud and the original sentence's multidimensional cloud, thereby obtaining the importance score of the candidate sentence, the similarity between the two multidimensional clouds being computed as:

where C1 and C2 are two multidimensional clouds; Ex_1k, Ex_2k, En_1k, En_2k, He_1k, He_2k are the expectation, entropy, and hyper-entropy of the k-th attribute value of concepts C1 and C2 respectively; V_k is the weight of attribute k, with a value between 0 and 1; and the remaining coefficients are the weighting parameters of the linear combination.

4. The Chinese query-oriented multi-document automatic summarization method based on the cloud model according to claim 3, characterized in that the information density of a candidate sentence is computed as:

5. The Chinese query-oriented multi-document automatic summarization method based on the cloud model according to claim 1, characterized in that using the cloud template to order the sub-topics specifically means: forming the topic relative position multidimensional cloud from the one-dimensional clouds of the summary sentences that the topic contains, obtaining the topic relative position one-dimensional cloud with the comprehensive cloud operation, obtaining the topic relative position score from the expectation Ex, and ordering the topics by this score.

6. The Chinese query-oriented multi-document automatic summarization method based on the cloud model according to claim 1, characterized in that using the cloud template to order the summary sentences within a sub-topic specifically means: finding, in each document, the sentence most similar to the summary sentence obtained earlier and taking its position as the relative position of the summary sentence in that document; treating each relative position as a cloud drop and applying the backward cloud generator to obtain the numerical characteristics of the sentence relative position cloud; obtaining the sentence relative position score of each sentence within the topic from the expectation Ex; and ordering the sentences within the topic by this score.