CN102411621A

CN102411621A - Cloud-model-based, query-oriented multi-document automatic summarization method for Chinese

Info

Publication number: CN102411621A
Application number: CN2011103737529A
Authority: CN
Inventors: 陈劲光; 何婷婷; 胡珀; 赵军民; 李芳�
Original assignee: Huazhong Normal University
Current assignee: Huazhong Normal University
Priority date: 2011-11-22
Filing date: 2011-11-22
Publication date: 2012-04-11
Anticipated expiration: 2031-11-22
Also published as: CN102411621B

Abstract

The invention discloses a cloud-model-based, query-oriented multi-document automatic summarization method for Chinese, comprising the following steps: performing sentence segmentation, word segmentation, and stop-word removal on the query and the multi-document collection; representing the query and the documents as vectors; processing the resulting vectors with the cloud model; building a Chinese corpus and modifying the source code of ROUGE, the English automatic summarization evaluation tool, to enable automatic evaluation of Chinese summaries and parameter training; finding the sentences relevant to the query and computing the importance of each sentence within the document collection; scoring each sentence by combining these two factors; and removing redundancy to generate the initial summary. With this technical scheme, the relevant document collection for a given query can be obtained automatically through a search engine, and the summary the user needs can then be generated automatically. The important content the user needs is returned directly, sparing the user the time otherwise spent searching web pages for the desired result. The invention is the first complete system suitable for generating query-oriented multi-document summaries in Chinese. Experiments on large-scale Chinese and English corpora show that the system performs well.

Description

A cloud-model-based, query-oriented multi-document automatic summarization method for Chinese

Technical field

The present invention relates to the technical field of information processing, and more specifically to a cloud-model-based, query-oriented multi-document automatic summarization method.

Background technology

With the spread of the Internet, the amount of information online is enormous and grows by the moment. For a simple query entered by a user, a search engine generally returns a ranked list of web pages that the user may need. This list contains a large amount of irrelevant and duplicated data, and the user must spend considerable effort finding the useful results. Query-oriented multi-document automatic summarization refines the content of the many query-relevant documents and reorganizes it into a short summary of a given length, speeding up the user's access to information. The technology lowers the difficulty of extracting information from massive data and increases the speed at which information is obtained and understood, thereby improving the efficiency with which users acquire and use information and their competitiveness in the information society.

Query-oriented multi-document automatic summarization is related to, yet distinct from, technologies such as information retrieval and automatic question answering. The main task of information retrieval is to find documents that satisfy given search conditions; the user must then work to locate the needed information in a returned document list that contains a great deal of redundant information. The main task of automatic question answering is to find the answer to a particular question; at present it is limited to certain domains and question types, and the answers it gives are sometimes too terse to be understood easily. Research on open-domain question answering also faces substantial difficulties, and its results remain unsatisfactory. Query-oriented multi-document summarization combines the strengths of existing multi-document summarization, information retrieval, and automatic question answering while avoiding, to some extent, their weaknesses. It has significant research value and broad application prospects in fields such as personalized information recommendation and customization, acquisition of massive information, digital libraries, business intelligence analysis, e-government, and mobile computing.

By summarization approach, query-oriented multi-document automatic summarization can be divided into the information-extraction type and the sentence-extraction (extractive) type. The main difference is that the former extracts the useful information within sentences and combines it into a summary through rewriting, while the latter selects the most important sentences by some method and assembles them into a summary. At present, extractive summarization is the mainstream research direction. By research object, research on query-oriented multi-document summarization can be divided into summarization for specific domains and summarization for the open domain. Although open-domain summarization systems are generally less readable than domain-specific ones, they are more widely applicable and more portable, and they are the current mainstream. The method of the invention is extractive and targets the open domain.

The cloud model, proposed by Academician Li Deyi, is a model for qualitative-quantitative transformation that handles the fuzziness and randomness of uncertain concepts and the association between the two. The cloud model starts from the uncertainty of natural-language concepts and extends to research on artificial intelligence with uncertainty. Although the cloud model originates from concepts in natural language, regrettably, judging from the papers collected so far, work that applies it directly to natural language processing is still rare. The method of the invention is a typical application of the cloud model in natural language processing and can be extended to other areas of the field.

A query-oriented multi-document automatic summarization system generally consists of three stages: internal text representation, text analysis, and summary extraction and generation. The internal representation stage converts the input text into an internal representation. The text analysis stage analyzes the text at different levels to determine the importance of each basic text unit (sentence, paragraph, section, etc.). The summary extraction and generation stage orders the extracted units to produce a coherent summary that reflects the topic of the original text. At present, the differences between summarization systems lie mainly in the latter two stages.

In the text analysis stage, extraction-based methods mainly include methods based on high-frequency words, graph-based methods, topic-based methods, and semantics-based methods. These existing methods can essentially be summarized as follows: find some random distribution of summary units, analyze the distribution using statistics, graph methods, or more complex language models, and estimate the importance of the summary units accordingly. After the text analysis stage, the most important sentences can be selected to form a summary directly; but because they are merely quoted and stacked together, the resulting summary is highly redundant, has poor coherence and readability, and is hard for readers to understand.

The summary extraction and generation stage adjusts and refines the selected sentences on the basis of the previous stage. The main techniques at present include redundancy removal, sentence pruning, and sentence ordering. Redundancy removal generally uses the MMR (Maximal Marginal Relevance) method: when selecting summary sentences, it considers not only the importance of a sentence but also its similarity to the summary sentences already selected, choosing sentences that are important yet dissimilar to those already selected.

Sentence pruning removes content in a sentence that carries little or no useful information and expresses the core content of the sentence in a relatively simple, grammatical form. It effectively raises the useful information content of the summary so that more can be expressed in limited space. In recent years, mobile Internet access has gradually become a mainstream way of obtaining information. A marked difference between mobile phones and computers is screen size; short, concise summaries help mobile users obtain the information they need more quickly, so sentence pruning is likely to receive more attention. At present, research on pruning Chinese sentences is still extremely rare.

Sentence ordering rearranges the sentences of a summary so that the ordered summary is more coherent and easier for readers to understand; it is also one of the key technologies of automatic summarization. At present there are three main approaches: chronological ordering, majority ordering, and probabilistic ordering. Chronological ordering sorts sentences by the publication or release date of the source documents; its limitations are that accurate time information is often very difficult to obtain and that it ignores topic. Majority ordering determines the order of summary sentences from the order of the topics they belong to, and the order of the topics is in turn determined by the positions of most of the sentences in each topic. Its limitation is that the generated summary is readable only when the relative positions of the topics in the documents are fairly stable; when the relative positions change frequently, the summary structure easily becomes confused. Probabilistic ordering decomposes summary sentences into features, learns the order of these features from a training corpus, and uses the feature order to determine the sentence order; its limitation is its dependence on the training corpus, since the quality of the manually selected corpus strongly affects the ordering. Liu Dexi of Wuhan University proposed a hybrid model for ordering multi-document summary sentences that integrates positional, temporal, dependency, and topical relations by linear combination. Jiang Xiaoyu of Beijing Institute of Technology proposed a sentence ordering method that combines cohesion between local topics with majority ordering. Ma Liang of Central China Normal University proposed a summary sentence ordering strategy based on single-template fusion: a representative template is selected according to the summary of the documents and used to order the summary sentences, ensuring their logical coherence. Xu Yongdong et al. of Harbin Institute of Technology proposed a sentence ordering method based on processing the temporal information of text, with algorithms for extraction, semantic computation, and temporal reasoning over the time information in Chinese text.

In 2011 the inventors published, in a journal, a cloud-model-based, query-oriented multi-document automatic summarization method. That published method is limited to English corpora, and its innovation is confined to the second stage described above, i.e., the text analysis stage.

Summary of the invention

To solve the above technical problems, the invention provides a cloud-model-based, query-oriented multi-document automatic summarization method for Chinese. It takes the latest research results of the cloud model, a field of uncertainty research, as its theoretical guidance and flexibly applies cloud ideas and methods in each component of the system. It fully considers the uncertain factors in the summary generation process and uses them to improve system performance. For a given Chinese document collection and query, the system automatically generates a concise, coherent summary of specified length that satisfies the query. The method suits Chinese corpora; the generated summaries agree closely with manual summaries and are highly readable, reducing the time users spend looking for information.

To achieve the above object, the invention provides a cloud-model-based, query-oriented multi-document automatic summarization method for Chinese, comprising the following steps:

1) perform sentence segmentation, word segmentation, and stop-word removal on the query and the multi-document collection, and represent the query and the documents as vectors;

2) process the resulting vectors with the cloud model; build a Chinese corpus and modify the source code of the English automatic summarization evaluation tool ROUGE to enable automatic evaluation of Chinese summaries and parameter training; find the sentences relevant to the query, compute the importance of each sentence in the document collection, and score each sentence by combining both factors;

3) remove redundancy and generate the initial summary.

Further, a sentence pruning step follows step 3): sentence pruning rules are formulated and applied to the initial summary sentences, producing multiple candidate sentences; a multidimensional cloud is then used to select a pruned sentence to replace the original summary sentence, generating a refined summary.

Further, the method ends with a sentence ordering step: the document collection is clustered to find the subtopics that contain one or more summary sentences; each document in the collection is regarded as a template, and the set of templates constitutes a cloud, i.e., the cloud template. The cloud template is used to order first the subtopics and then the summary sentences within each subtopic, finally generating a concise, coherent summary that satisfies the query.

Further, in the sentence pruning step, the pruning rules are 10 manually written rules based on dependency parsing.

Further, in the sentence pruning step, using the multidimensional cloud to select a pruned sentence to replace the original summary sentence specifically means: for each word, its distribution across the documents of the collection, its distribution across all sentences, and its relevance to all query words are each regarded as cloud drops; a backward cloud generator yields the numerical characteristics of the three clouds, which form the word's multidimensional cloud; a comprehensive cloud computation reduces this to a one-dimensional word cloud; the one-dimensional word clouds form the sentence's multidimensional cloud; the importance score of each candidate sentence is computed; the information density of each candidate is then computed from its length; and the original summary sentence is replaced by the candidate with the highest information density.

Further, in the sentence pruning step, the importance score of a candidate sentence is obtained by computing the similarity between the candidate's multidimensional cloud and the original sentence's multidimensional cloud. The similarity between the two multidimensional clouds is computed as:

Figure 2011103737529100002DEST_PATH_IMAGE001

where C1 and C2 are two multidimensional clouds; Ex_1k, Ex_2k, En_1k, En_2k, He_1k, and He_2k are the expectation, entropy, and hyper-entropy of the k-th attribute of concepts C1 and C2, respectively; and V_k is the weight of attribute k, which lies between 0 and 1 and depends on the specific attributes and their relationships.

Further, in the sentence pruning step, the information density of a candidate sentence is computed as:

Figure 2011103737529100002DEST_PATH_IMAGE002

where C and O denote the candidate sentence and the original sentence, respectively, and the function Length gives the sentence length in words.

Further, using the cloud template to order the subtopics in step 4) specifically means: the one-dimensional clouds of the summary sentences contained in each topic form the topic's relative-position multidimensional cloud; a comprehensive cloud computation yields the topic's relative-position one-dimensional cloud; the expectation Ex gives the topic's relative-position score, by which the topics are ordered.

Further, using the cloud template to order the summary sentences within a subtopic in step 4) specifically means: in each document, find the sentence most similar to the summary sentence and take its position as the relative position of the summary sentence in that document; regard each relative position as a cloud drop and apply the backward cloud computation to obtain the numerical characteristics of the sentence's relative-position cloud; the expectation Ex gives the relative-position score of each sentence within the topic, by which the sentences within the topic are ordered.
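The within-topic ordering described above can be sketched as follows. This is only an illustration under stated assumptions: sentences are token lists, the similarity is a Jaccard overlap (the text does not name the measure used at this step), and all function names are hypothetical. Because the expectation Ex returned by a backward cloud generator is the sample mean of the cloud drops, the relative-position score reduces here to a mean.

```python
def jaccard(a, b):
    """Token-set overlap; an assumed stand-in for the similarity measure."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def relative_position_score(sentence, documents, similarity=jaccard):
    """Mean relative position (Ex) of the best-matching sentence in each document."""
    drops = []  # one cloud drop per non-empty document
    for doc in documents:
        if not doc:
            continue
        best = max(range(len(doc)), key=lambda i: similarity(sentence, doc[i]))
        drops.append(best / max(len(doc) - 1, 1))  # 0.0 = first sentence, 1.0 = last
    return sum(drops) / len(drops) if drops else 0.0

def order_sentences(summary, documents, similarity=jaccard):
    """Sort summary sentences by ascending relative-position score."""
    return sorted(summary, key=lambda s: relative_position_score(s, documents, similarity))

# Two toy documents, each a list of tokenized sentences.
docs = [[["a"], ["b"], ["c"]], [["a"], ["c"]]]
ordered = order_sentences([["c"], ["a"]], docs)
```

In this toy input the sentence matching the start of both documents is placed before the one matching their end.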

Compared with the prior art, the method of the invention has the following effects: by adopting the cloud model it takes uncertainty fully into account, ensuring good performance in every component of summary generation; sentence pruning makes the summary more concise, so the method is more readily extended to fields with high demands on conciseness, such as mobile search; sentence ordering reduces abrupt jumps in content, making the summary more coherent; and experiments on large-scale corpora demonstrate the effectiveness of the proposed method.

For a given query, the technology of the invention can obtain the relevant document collection automatically through a search engine and then automatically generate the summary the user needs. The invention returns the important content the user needs directly, sparing the user the considerable time otherwise spent searching web pages for the desired result. As far as is currently known, the invention is the first complete system suitable for generating query-oriented multi-document summaries in Chinese, and experiments on large-scale Chinese and English corpora show that the system performs well.

Description of drawings

Fig. 1 is the overall flowchart of the cloud-model-based, query-oriented multi-document automatic summarization method for Chinese of the invention.

Fig. 2 is the flowchart of the sentence selection process.

Fig. 3 is the flowchart of the redundancy removal process.

Fig. 4 is the flowchart of the sentence pruning process.

Fig. 5 is the flowchart of the sentence ordering process.

Fig. 6 shows results on TAC 2010 evaluation data set A (whose task is similar to query-oriented multi-document automatic summarization). The cloud summarization system of Embodiment 1, numbered 23, ranked 3rd, 2nd, 2nd, and 3rd among the 43 participating systems on ROUGE-2 (a), ROUGE-SU4 (b), Basic Elements (c), and the manual evaluation metric Average Overall Responsiveness (d), respectively. A to H are manual summaries and 1 to 43 are machine summaries (only the top ten are listed).

Fig. 7 shows the ROUGE evaluation results and 95% confidence intervals of the system of Embodiment 1 and the baseline system SumFocus.

Fig. 8 shows the manual evaluation results comparing the method of Embodiment 2 with manually pruned sentences.

Fig. 9 shows the effect of sentence pruning on the ROUGE evaluation results.

Fig. 10 shows the manual readability evaluation of the summaries, as the percentage of each result category.

Embodiment

Embodiment 1

Embodiment 1 corresponds to the leftmost dashed path in Fig. 1, in which the summary is generated directly after redundancy removal. It comprises the following steps:

1. Perform sentence segmentation, word segmentation, and stop-word removal on the query and the document collection. The sentence segmentation module (SplitSentence) and the word segmentation module (CRFWordSeg) of LTP v2.01, developed by Harbin Institute of Technology, are used. The latest LTP word segmentation module is built on a CRF model and reaches a segmentation F1 of 97.4%. After word segmentation, a self-built stop-word list is used to remove stop words.

2. Represent the sentences of the document collection and the query, after the above processing, in vector form: the number of rows is the number of sentences, the number of columns is the number of distinct words, and each element is the number of times the corresponding word occurs in the corresponding sentence.
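A minimal sketch of this representation, assuming the sentences have already been segmented into words and stripped of stop words. The function name and the toy data are hypothetical and not part of the patent.

```python
from collections import Counter

def sentence_term_matrix(sentences):
    """Build a sentence-by-word count matrix from tokenized sentences."""
    vocab = sorted({w for s in sentences for w in s})  # one column per distinct word
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in sentences]     # one row per sentence
    for i, sent in enumerate(sentences):
        for word, count in Counter(sent).items():
            matrix[i][index[word]] = count
    return vocab, matrix

# Toy input: two tokenized sentences.
sents = [["亚运会", "广州", "开幕"], ["广州", "广州", "天气"]]
vocab, m = sentence_term_matrix(sents)
```

Each row of `m` is the count vector of one sentence, and `vocab` gives the word for each column.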

3. Sentence selection based on the cloud model (Fig. 2): score all sentences in the document collection, in four steps:

(1) Compute the query relevance score of each sentence, as follows:

a. Use the HAL (Hyperspace Analogue to Language) method to compute the degree of association between the words in the document collection and the query words. HAL uses what is figuratively called the window co-occurrence method: the co-occurrence of a word and a query word within a window of a certain length is used to compute a relevance score between them, capturing the semantic association between the word and the query word. Within a window of length K, the co-occurrence of a word w in the document collection and a query word w' is observed; the window is then moved across the whole document collection, advancing one word at a time. Co-occurrences of the word and the query word at each distance are counted: the smaller the distance and the larger the number of co-occurrences, the more relevant the word is to the query word.

Let

Figure 2011103737529100002DEST_PATH_IMAGE003

denote the number of co-occurrences of w and w' at distance k, and let W(k) = K - k + 1 denote the co-occurrence strength of w and w' at that distance. The relevance of a word to a query word can then be expressed as:

Figure 2011103737529100002DEST_PATH_IMAGE004
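A minimal sketch of the window co-occurrence computation described in step a. The exact relevance formula is in a formula image that is not reproduced in the text, so this sketch assumes the unnormalized sum of W(k) over all co-occurrences within distance K and a window that looks both left and right; both choices, and the function name, are assumptions.

```python
from collections import defaultdict

def hal_relevance(tokens, query_words, K=5):
    """Weighted window co-occurrence between each word and each query word.

    A co-occurrence at distance k (1 <= k <= K) contributes W(k) = K - k + 1.
    Normalization, if any, is not specified in the text and is omitted here.
    """
    query_words = set(query_words)
    rel = defaultdict(float)  # (word, query_word) -> accumulated strength
    for i, w in enumerate(tokens):
        for k in range(1, K + 1):
            for j in (i - k, i + k):  # assumed symmetric window
                if 0 <= j < len(tokens) and tokens[j] in query_words:
                    rel[(w, tokens[j])] += K - k + 1
    return dict(rel)

rel = hal_relevance(["a", "q", "b", "c"], ["q"], K=2)
```

Words adjacent to the query word receive the full weight K, and the weight falls off linearly with distance.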

b. Regard the relevance of a word to each word in the query as a cloud drop and use the backward cloud generator to obtain the numerical characteristics of the cloud:

Figure 2011103737529100002DEST_PATH_IMAGE005

c. Linearly combine the numerical characteristics of the cloud obtained in step b (linear combination 1) to obtain the query relevance score of the word:

Figure 2011103737529100002DEST_PATH_IMAGE006
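Steps b and c can be sketched as below. The backward cloud generator shown is the commonly used variant without certainty degrees (sample mean for Ex, first-order absolute central moment for En, and the sample variance for He); whether the patent uses exactly this variant is an assumption. The combination weights are hypothetical placeholders, since the trained values are not reproduced in the text.

```python
import math

def backward_cloud(drops):
    """Estimate the numerical characteristics (Ex, En, He) from cloud drops."""
    n = len(drops)
    ex = sum(drops) / n                                   # expectation
    en = math.sqrt(math.pi / 2) * sum(abs(x - ex) for x in drops) / n  # entropy
    var = sum((x - ex) ** 2 for x in drops) / (n - 1) if n > 1 else 0.0
    # abs() guards against var < En^2 on small samples (a common workaround).
    he = math.sqrt(abs(var - en ** 2))                    # hyper-entropy
    return ex, en, he

def combine(features, weights):
    """Linear combination of cloud characteristics with the given weights."""
    return sum(w * f for w, f in zip(weights, features))

# Relevance of one word to three query words, treated as cloud drops.
word_query_relevance = [0.2, 0.4, 0.9]
ex, en, he = backward_cloud(word_query_relevance)
score = combine((ex, en, he), weights=(0.6, 0.3, 0.1))    # hypothetical weights
```

The same two functions apply unchanged to the later steps that treat word scores or sentence similarities as cloud drops.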

d. Regard the query relevance score (from step c) of each word in a sentence as a cloud drop and use the backward cloud generator to obtain the numerical characteristics of the cloud:

Figure 2011103737529100002DEST_PATH_IMAGE007

e. Linearly combine the numerical characteristics of the cloud obtained in step d (linear combination 2) to obtain the query relevance score of the sentence:

Figure 2011103737529100002DEST_PATH_IMAGE008

(2) Compute the importance score of each sentence, as follows:

a. Compute the cosine similarity between sentences.

A vector space model is used to compute the similarity between sentences. For a given document collection, each sentence is represented as an m-dimensional vector (w_i1, w_i2, ..., w_im), where m is the number of distinct words in the document collection and each dimension of the vector space corresponds to one word in the vocabulary. The weight of each element is the TF-ISF score of the word corresponding to that element's dimension, i.e., the product of TF and ISF,

where TF is the frequency of word w in sentence S and ISF is the inverse sentence frequency, computed as:

Figure 2011103737529100002DEST_PATH_IMAGE010

where N is the total number of sentences in the document collection and n is the number of sentences containing word w.

The similarity between two sentences can be computed as the cosine similarity between their vectors:

$$\mathrm{sim}(S_i, S_j) = \frac{\sum_{k=1}^{m} w_{ik}\, w_{jk}}{\sqrt{\sum_{k=1}^{m} w_{ik}^{2}}\;\sqrt{\sum_{k=1}^{m} w_{jk}^{2}}}$$
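A small sketch of the TF-ISF weighting and cosine similarity described in this step. The ISF formula itself is in an image that is not reproduced in the text, so the form log(N / n) used here is an assumption, as are the function names.

```python
import math
from collections import Counter

def tf_isf_vectors(sentences):
    """Sparse TF-ISF vectors (dicts) for a list of tokenized sentences."""
    n_sent = len(sentences)
    # Number of sentences containing each word.
    sent_freq = Counter(w for s in sentences for w in set(s))
    vectors = []
    for s in sentences:
        tf = Counter(s)
        # Assumed ISF form: log(N / n).
        vectors.append({w: tf[w] * math.log(n_sent / sent_freq[w]) for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vecs = tf_isf_vectors([["a", "b"], ["a", "c"], ["d"]])
sim_01 = cosine(vecs[0], vecs[1])
```

With this ISF form, a word that occurs in every sentence gets weight zero and so contributes nothing to the similarity.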

b. Regard the similarity of a sentence to each sentence in the document collection as a cloud drop and use the backward cloud generator to obtain the numerical characteristics of the cloud.

c. Linearly combine the numerical characteristics of the cloud obtained in step b (linear combination 3) to obtain the importance score of the sentence:

Figure 2011103737529100002DEST_PATH_IMAGE013

(3) Compute the combined score of each sentence.

Linearly combine the query relevance score and the importance score obtained in (1) and (2) (linear combination 4) to obtain the combined score of the sentence:

Figure 2011103737529100002DEST_PATH_IMAGE014

(4) Parameter training.

For Chinese corpora, a Chinese query-oriented multi-document automatic summarization corpus and an automatic evaluation tool for Chinese summaries are built in order to determine the parameters in (1), (2), and (3):

Figure 2011103737529100002DEST_PATH_IMAGE016

;

Figure 2011103737529100002DEST_PATH_IMAGE017

, δ. The process comprises the following steps:

a. Build the Chinese query-oriented multi-document automatic summarization corpus.

No public Chinese query-oriented multi-document automatic summarization corpus currently exists, so the invention builds one. First, 100 hot-event topics from 2009-2010 are selected manually, and each event title (for example, "Guangzhou Asian Games") is entered into a search engine as the query; the search results are converted into relevant documents through automatic extraction, using a relevant-document extraction method based on tag density. Once the relevant document collection is determined, experts write the query and the manual summaries. The resulting corpus contains 100 document collections, 1000 relevant documents, and 400 manual summaries.

b. Build the Chinese summarization evaluation tool, which automatically scores the summaries generated under different parameters.

The tool is obtained by modifying the English automatic evaluation tool ROUGE. The steps for modifying the source program to obtain ROUGE-CN are:

Step 1: Segment the manual and automatic summaries with the CRFWordSeg module of the LTP V2.01 platform, separating words with spaces.

Step 2: Replace the entire content of "smart_common_words.txt" in the "data" folder of the ROUGE installation package with the Chinese stop-word list.

Step 3: Find and delete the statements in the source program that filter out Chinese characters.

c. Parameter training.

Subject to the constraints, the parameters

Figure 2011103737529100002DEST_PATH_IMAGE019

,

Figure 2011103737529100002DEST_PATH_IMAGE020

must be one group from the following candidate parameter sets:

In the actual training process, the locally optimal solutions of the three parameter groups and of δ are determined in turn; the 3 best solutions of each group are then combined by exhaustive enumeration, generating 3⁴ = 81 summaries, and the optimal parameter combination is selected by automatic evaluation.
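The exhaustive combination step can be sketched as follows, assuming three retained candidates for each of four parameter groups, which gives 3^4 = 81 combinations. The `evaluate` callback is a hypothetical stand-in for generating a summary with the given parameters and scoring it with the modified ROUGE tool; the candidate values are placeholders, not trained values from the patent.

```python
from itertools import product

def best_combination(candidates_per_group, evaluate):
    """Try every combination of per-group candidates and keep the best one."""
    combos = list(product(*candidates_per_group))
    return max(combos, key=evaluate), len(combos)

# Four groups with three placeholder candidates each.
groups = [[0.1, 0.5, 0.9]] * 4

def toy_score(params):
    """Toy objective standing in for an automatic evaluation score."""
    return -sum((p - 0.5) ** 2 for p in params)

best, n_tried = best_combination(groups, toy_score)
```

Restricting each group to its three locally best values keeps the number of summaries that must be generated and scored small enough for a full enumeration.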

The parameter training process fixes the parameters and thereby the combined sentence score of step 3.

4. Redundancy removal (Fig. 3)

(1) Sort the sentences by score in descending order and select the highest-scoring sentence as the first summary sentence.

(2) Adjust the scores of all remaining sentences: the more similar a sentence is to the summary sentences already selected, the more its score is reduced:

Figure 2011103737529100002DEST_PATH_IMAGE022

where R is the set of all sentences and F is the set of summary sentences already selected, so that S_i denotes a candidate summary sentence; S_L denotes the most recently selected summary sentence.

(3) Check whether the summary length requirement has been met. If it has, stop; otherwise, return to step (1).
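The selection loop of steps (1) to (3) can be sketched as below. The actual score adjustment is in a formula image that is not reproduced in the text; the multiplicative penalty used here, scaling each remaining score by (1 - similarity to the most recently selected sentence), is an assumption, as are the function names and the use of token counts for length.

```python
def jaccard(a, b):
    """Token-set overlap; an assumed stand-in for the sentence similarity."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def select_summary(sentences, scores, max_len, similarity=jaccard):
    """Greedily pick sentences, penalizing those similar to the last pick."""
    scores = list(scores)                  # work on a copy
    remaining = set(range(len(sentences)))
    chosen, total = [], 0
    while remaining and total < max_len:
        best = max(remaining, key=lambda i: scores[i])  # step (1)
        remaining.remove(best)
        chosen.append(best)
        total += len(sentences[best])      # length counted in tokens
        for i in remaining:                # step (2), assumed penalty form
            scores[i] *= 1.0 - similarity(sentences[i], sentences[best])
    return chosen                          # step (3): loop ends at max_len

sents = [["a", "b", "c"], ["a", "b", "c"], ["x", "y"]]
picked = select_summary(sents, [1.0, 0.9, 0.5], max_len=5)
```

In this toy input the duplicate of the first sentence is pushed down to a score of zero, so the lower-scored but novel third sentence is selected instead.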

5. Generate the summary.

If the total length of the selected sentences exceeds the summary length, the part of the last sentence that exceeds the limit is removed, and the final summary is generated.

6. Results

Although this embodiment suits Chinese corpora, it also applies to English corpora. Because no large-scale public evaluation corpus exists for Chinese, experimental results on an English corpus are given first as supporting evidence, followed by results on a Chinese corpus.

(1) In 2010 we entered the cloud summarization system formed by the five steps above in the TAC 2010 guided summarization evaluation. To relate the guided summarization task to the query-oriented automatic summarization task, we used only the category information given by the organizers as the query; everything else was kept the same as in the earlier experiments.

The corpus of the Text Analysis Conference TAC 2008 was used as the training corpus, and the parameter training process of step 3 (4) yielded the parameters:

We submitted two systems, with IDs 6 and 23; the cloud summarization system of Embodiment 1 is number 23. Fig. 6 shows the evaluation results on TAC 2010 evaluation data set A (whose task is similar to query-oriented multi-document automatic summarization). The cloud-model-based summarizer ranked 3rd, 2nd, 2nd, and 3rd among the 43 participating systems on ROUGE-2 (a), ROUGE-SU4 (b), Basic Elements (c), and the manual overall evaluation metric Average Overall Responsiveness (d), respectively.

(2) The 100 document collections of the Chinese query-oriented multi-document automatic summarization corpus described in step 3 (4) a are first divided at random into two parts of 50 collections each, used as the training corpus and the test corpus, respectively. The training corpus is used to train the parameters of the cloud summarization system, and the test corpus is used to verify the results.

The parameter training process of step 3 (4) yields the parameters suited to Chinese automatic summarization:

Figure 2011103737529100002DEST_PATH_IMAGE024

Fig. 7 shows the average evaluation results obtained by the cloud summarization system of this embodiment and by the baseline system SumFocus on the test corpus of 50 document sets, together with 95% confidence intervals. SumFocus is a summarization system developed by Vanderwende et al. at Microsoft Research. It took part in the DUC 2006 evaluation and was one of the best-performing systems, ranking first among 22 systems in the pyramid evaluation. We reimplemented this system and used it to generate summaries on the Chinese corpus. As Fig. 7 shows, the method of this embodiment improves significantly on SumFocus in every score. This result indicates that the method described here agrees well with the manual summaries in content. Put plainly, counted in words, on average about one third of the content of a summary generated by the method of the present invention coincides with the content of the manual summaries.

Embodiment 2

Embodiment 2 differs from Embodiment 1 in that a sentence pruning step is added between step 4 and step 5 of Embodiment 1. By deleting secondary or irrelevant constituents from the summary sentences, it further increases the information content of the summary.

Embodiment 2 corresponds to taking the rightmost dashed path in Fig. 1.

The sentence pruning process corresponds to Fig. 4 and comprises the following steps:

1. Formulate a manual rule base for pruning sentences.

The following table gives a brief description and an example of each of the 10 manual rules adopted here. Square brackets mark the content that the rule deletes (underlined in the original).

<tables num="0001">
<table>
<tgroup cols="3">
<colspec colname="c001" colwidth="2%" />
<colspec colname="c002" colwidth="12%" />
<colspec colname="c003" colwidth="84%" />
<tbody>
<row><entry morerows="1">Rule</entry><entry morerows="1">Description</entry><entry morerows="1">Example</entry></row>
<row><entry morerows="1">1</entry><entry morerows="1">Text in parentheses</entry><entry morerows="1">Xinhua News Agency, Bonn, Dec 30 [(reporter Lv Hong)] German Foreign Minister Kinkel told the press on the 30th that Europe made achievements in 1997 in the economic, political, and diplomatic fields.</entry></row>
<row><entry morerows="1">2</entry><entry morerows="1">Independent construction</entry><entry morerows="1">[It is reported that] the Harbin Municipal Committee of the Communist Youth League starts by helping laid-off young workers improve their employability, providing solid service for youth re-employment.</entry></row>
<row><entry morerows="1">3</entry><entry morerows="1">Independent adverbial at the beginning of a sentence</entry><entry morerows="1">[Early this morning,] amid the resounding national anthem, a flag-raising ceremony was solemnly held in Lhasa.</entry></row>
<row><entry morerows="1">4</entry><entry morerows="1">"XX says," at the beginning of a sentence</entry><entry morerows="1">[This newspaper, Paris, May 26, reporters Liu Zhengxue and Guo Yongyi report:] Li Ruihuan, Chairman of the Chinese People's Political Consultative Conference, met in Paris on the 26th with members of the France-China friendship group of the French Senate.</entry></row>
<row><entry morerows="1">5</entry><entry morerows="1">Independent conjunction at the beginning of a sentence</entry><entry morerows="1">[So,] national unification can be achieved.</entry></row>
<row><entry morerows="1">6</entry><entry morerows="1">Prepositional phrase serving as an adverbial</entry><entry morerows="1">Carrying out this work, [for establishing the socialist market economy system and promoting the sustained, rapid, and sound development of the national economy,] is of great significance.</entry></row>
<row><entry morerows="1">7</entry><entry morerows="1">"的" (de) structure, attributive modifier</entry><entry morerows="1">It is reported that the Harbin Municipal Committee of the Communist Youth League starts by helping laid-off young workers improve their employability, providing [solid] service for youth re-employment.</entry></row>
<row><entry morerows="1">8</entry><entry morerows="1">"地" (de) structure, adverbial modifier</entry><entry morerows="1">China has achieved the fastest growth among large economies, and it is [successfully] moving toward a more open, market-oriented economy.</entry></row>
<row><entry morerows="1">9</entry><entry morerows="1">Adverb</entry><entry morerows="1">If we [further] emancipate our minds, seek truth from facts, seize the opportunity, and forge ahead, the road of building socialism with Chinese characteristics will grow ever wider.</entry></row>
<row><entry morerows="1">10</entry><entry morerows="1">Adjective</entry><entry morerows="1">Brain industrialization was [newly] proposed by the famous American professor Machlup in 1962 in the book "The Production and Distribution of Knowledge".</entry></row>
</tbody>
</tgroup>
</table>
</tables>

2. Apply the manual rules to the sentence in turn, producing multiple candidate sentences.

(1) For a sentence to be pruned, match the rules in the rule base one by one, in order, to see whether the sentence satisfies each rule.

(2) If a rule is satisfied, prune the sentence as the rule requires to obtain a candidate sentence, and use this candidate sentence as the input to the next rule.

(3) Once the last rule has been matched, output all the candidate sentences obtained so far as the set of candidate sentences.
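The cascade in step 2 can be sketched as follows. This is only an illustration: the three regular-expression rules and the name `generate_candidates` are hypothetical stand-ins, whereas the rules in the table operate on Chinese text (claim 4 describes them as based on dependency analysis).

```python
import re
from typing import Callable, List

Rule = Callable[[str], str]

# Hypothetical English stand-ins for three of the ten manual rules.
RULES: List[Rule] = [
    lambda s: re.sub(r"\s*\([^)]*\)", "", s),             # rule 1: text in parentheses
    lambda s: re.sub(r"^It is reported that\s+", "", s),  # rule 2: independent construction
    lambda s: re.sub(r"^(So|Therefore),\s*", "", s),      # rule 5: sentence-initial conjunction
]

def generate_candidates(sentence: str, rules: List[Rule] = RULES) -> List[str]:
    """Apply the rules in order. Each rule that fires yields a candidate,
    and that candidate becomes the input to the next rule."""
    candidates: List[str] = []
    current = sentence
    for rule in rules:
        pruned = rule(current)
        if pruned != current:  # the rule matched and deleted something
            candidates.append(pruned)
            current = pruned
    return candidates

cands = generate_candidates(
    "It is reported that the league (Harbin branch) helps young workers."
)
```

Because each candidate feeds the next rule, later candidates are never longer than earlier ones, which is what gives the later selection step a range of lengths to choose from.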

3. Score the importance of each word in the candidate sentences from three different aspects to obtain the word multi-dimensional cloud:

(1) Treat the frequency with which the word occurs in each document as cloud drops; the backward cloud generator then gives the numerical characteristics $(Ex_1, En_1, He_1)$ of the cloud.

(2) Treat the frequency with which the word occurs in each sentence of the document set as cloud drops; the backward cloud generator then gives the numerical characteristics $(Ex_2, En_2, He_2)$ of the cloud.

(3) Treat the relevance of the word to each query word as cloud drops; the backward cloud generator then gives the numerical characteristics $(Ex_3, En_3, He_3)$ of the cloud. This step is consistent with step 3 of Embodiment 1, and the resulting cloud is identical to the cloud obtained in sub-step d of step 3 of Embodiment 1.

(4) Combine the clouds obtained in the previous three steps to obtain the word multi-dimensional cloud:

$$WMC = \{(Ex_1, En_1, He_1),\ (Ex_2, En_2, He_2),\ (Ex_3, En_3, He_3)\}$$
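A minimal sketch of step 3, assuming the widely used backward cloud generator without certainty degrees (sample mean for Ex, first-order absolute central moment for En, and the gap between sample variance and En squared for He). The text here does not state which variant of the generator is used, and the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Cloud = Tuple[float, float, float]  # (Ex, En, He)

def backward_cloud(drops: Sequence[float]) -> Cloud:
    """Estimate (Ex, En, He) from a non-empty sequence of cloud drops."""
    n = len(drops)
    ex = sum(drops) / n
    # En from the first-order absolute central moment
    en = math.sqrt(math.pi / 2) * sum(abs(x - ex) for x in drops) / n
    # Sample variance; taken as 0 when there is a single drop
    var = sum((x - ex) ** 2 for x in drops) / (n - 1) if n > 1 else 0.0
    # abs() guards against a small negative difference on short samples
    he = math.sqrt(abs(var - en ** 2))
    return ex, en, he

def word_multidim_cloud(doc_freqs: Sequence[float],
                        sent_freqs: Sequence[float],
                        query_rel: Sequence[float]) -> List[Cloud]:
    """WMC: one cloud per aspect (document, sentence, query relevance)."""
    return [backward_cloud(doc_freqs),
            backward_cloud(sent_freqs),
            backward_cloud(query_rel)]

wmc = word_multidim_cloud([2, 2, 2], [1, 3], [0.5, 0.5])
```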

4. Use the comprehensive cloud operation to convert the word multi-dimensional cloud into a word one-dimensional cloud.

The comprehensive cloud operation is defined as:

Figure 2011103737529100002DEST_PATH_IMAGE028

where each dimension carries its own weight.

Fixing the parameters in the formula above gives the word one-dimensional cloud

Figure 2011103737529100002DEST_PATH_IMAGE031

.
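The definition of the comprehensive cloud operation is given in a figure that is not reproduced in this text. The sketch below therefore assumes one commonly cited weighted form, in which the combined entropy is the weighted sum of the entropies and the combined Ex and He are entropy-weighted averages. Both the formula and the name `comprehensive_cloud` are assumptions, not the patent's confirmed definition.

```python
from typing import Sequence, Tuple

Cloud = Tuple[float, float, float]  # (Ex, En, He)

def comprehensive_cloud(clouds: Sequence[Cloud],
                        weights: Sequence[float]) -> Cloud:
    """Collapse a multi-dimensional cloud into one cloud (assumed form):
         En = sum(w_i * En_i)
         Ex = sum(w_i * Ex_i * En_i) / En
         He = sum(w_i * He_i * En_i) / En
    """
    if len(clouds) != len(weights):
        raise ValueError("one weight is needed per dimension")
    en = sum(w * en_i for (_, en_i, _), w in zip(clouds, weights))
    if en == 0:
        raise ValueError("all weighted entropies are zero; Ex and He are undefined")
    ex = sum(w * ex_i * en_i for (ex_i, en_i, _), w in zip(clouds, weights)) / en
    he = sum(w * he_i * en_i for (_, en_i, he_i), w in zip(clouds, weights)) / en
    return ex, en, he

one_dim = comprehensive_cloud([(1.0, 1.0, 0.1), (3.0, 1.0, 0.3)], [0.5, 0.5])
```

With equal entropies and equal weights, the combined expectation is simply the mean of the per-dimension expectations.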

4. Represent each sentence as a sentence multi-dimensional cloud, each dimension of which is a word one-dimensional cloud.

If the original sentence contains m words, each sentence in the candidate sentence set can be represented as a vector of the form

Figure 2011103737529100002DEST_PATH_IMAGE032

A word that occurs several times in the same sentence is treated as a separate component each time, and the position of each word vector corresponds one-to-one with the position of the word in the original sentence.

In a candidate sentence, if a word has been deleted from the original sentence, the position of that word is filled with the zero vector

Figure 2011103737529100002DEST_PATH_IMAGE033

that is:

Figure 2011103737529100002DEST_PATH_IMAGE034

The sentence multi-dimensional cloud (Sentence Multi-dimension Cloud, abbreviated SMC) can then be expressed as:

Figure 2011103737529100002DEST_PATH_IMAGE035

Note that although the candidate sentences derived from the same sentence differ in length, their SMCs all have the same dimension.
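The fixed-dimension property can be illustrated with a short sketch. The helper `sentence_multidim_cloud`, the `kept` mask, and the example cloud values are hypothetical; the only behavior taken from the text is that a deleted word leaves a zero vector at its original position.

```python
from typing import Dict, List, Sequence, Tuple

Cloud = Tuple[float, float, float]  # (Ex, En, He)
ZERO: Cloud = (0.0, 0.0, 0.0)

def sentence_multidim_cloud(original_words: Sequence[str],
                            kept: Sequence[bool],
                            word_clouds: Dict[str, Cloud]) -> List[Cloud]:
    """Build the SMC of a candidate sentence.

    original_words: the m words of the original sentence, in order.
    kept[i]: True if the i-th word survives in the candidate.
    Deleted positions hold the zero vector, so every candidate of the
    same original sentence has exactly m dimensions.
    """
    return [word_clouds[w] if keep else ZERO
            for w, keep in zip(original_words, kept)]

words = ["so", "unification", "can", "succeed"]
clouds = {"so": (0.1, 0.1, 0.0), "unification": (0.9, 0.2, 0.1),
          "can": (0.2, 0.1, 0.0), "succeed": (0.6, 0.2, 0.1)}
smc_original = sentence_multidim_cloud(words, [True] * 4, clouds)
smc_pruned = sentence_multidim_cloud(words, [False, True, True, True], clouds)
```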

5. Compute the similarity between the multi-dimensional cloud of each candidate sentence and that of the original sentence to obtain the information importance score of the candidate sentence.

Because the three numerical characteristics of the cloud model play different roles, the present invention proposes an improved multi-dimensional cloud similarity calculation method that assigns a different weight to each numerical characteristic of the cloud when computing the similarity. The more similar a pruned sentence is to the original sentence, the more important information it retains.

The similarity between two multi-dimensional clouds C1 and C2 is defined as:

where $Ex_{1k}, Ex_{2k}, En_{1k}, En_{2k}, He_{1k}, He_{2k}$ are, respectively, the expectation, entropy, and hyper-entropy of the k-th attribute value of concepts C1 and C2; $V_k$ is the weight of attribute k, a value between 0 and 1 that depends on the specific attribute and its relationships. Computing the similarity between sentences requires determining the values of the weight vectors

Figure 2011103737529100002DEST_PATH_IMAGE036

and

Figure 2011103737529100002DEST_PATH_IMAGE037

Here, words related to events, namely nouns and verbs, are given higher weights, and nouns are weighted higher than verbs. When the sentence multi-dimensional cloud is converted to a sentence one-dimensional cloud, the weight of each dimension is determined by the following formula:

where pos denotes the part of speech (POS) of the word.

Once the weight of each dimension has been defined, the similarity between each candidate pruned sentence and the original sentence can be computed and used as the information importance score of the pruned sentence.

6. Compute the information density of each candidate sentence.

The present invention proposes an improved information density calculation method suited to candidate sentence selection:

Figure 2011103737529100002DEST_PATH_IMAGE040

where C and O denote the candidate sentence and the original sentence respectively, and the score term in the formula is the importance score of the pruned sentence obtained in step 5. The function Length returns the sentence length, measured in words.

7. Replace the original sentence with the candidate sentence of highest information density as the summary sentence.

8. Fill the space freed by the deleted content either with the part of the last pruned summary sentence that had been cut for exceeding the length limit or with new summary sentences, and recompose the summary.

9. Implementation results:

(1) Manual evaluation of pruned sentence quality

The 50 document sets used in 6 (2) of Embodiment 1 serve as the test corpus. The first two sentences of each document set, 100 sentences in total, were selected for manual evaluation.

Four evaluators rated the pruned sentences generated by 3 methods on a 5-level scale from 1 to 5 in two respects, grammatical correctness and information importance. A higher score means better grammaticality or more important information retained. During the evaluation the evaluators knew the content of the original sentence, and the sentences generated by the 3 methods were randomly mixed, so the evaluators did not know in advance which method had produced the sentence being rated.

Fig. 8 shows the results. They indicate that the method preserves important information well: with sentences shortened by 32.2%, only 18% of the important information is lost.

(2) Automatic evaluation of summary quality

Fig. 9 shows how using or not using the Chinese sentence pruning method affects summary quality. The evaluation uses the ROUGE-1 tool and reports the average results on the test corpus of 50 document sets, together with 95% confidence intervals.

The experimental results show that the ROUGE-1 score of Embodiment 2 is 4.7% higher than that of Embodiment 1.

Embodiment 3

Embodiment 3 is the preferred embodiment of the present invention and corresponds to the solid-line main flow in the middle of Fig. 1. It differs from Embodiment 2 in that a sentence ordering step is added after the sentence pruning step of Embodiment 2.

Embodiment 3 differs from Embodiment 1 in that a sentence pruning step and a sentence ordering step are added between step 4 and step 5 of Embodiment 1.

The sentence ordering flowchart is shown in Fig. 5 and comprises the following steps:

1. Cluster the sentences in the document set to obtain subtopics. We use an adaptive clustering algorithm based on community detection.

2. From the subtopics obtained in step 1, find those that contain one or more summary sentences, and discard the subtopics that contain no summary sentence.

3. Obtain the sentence relative position cloud of each summary sentence:

a. Find which sentence in a document is most similar to the summary sentence, and take its position as the relative position of the summary sentence in that document;

For a sentence S, its relative position rp in document D is defined as the relative position of the sentence in D that is most similar to S. It is computed as:

Figure 2011103737529100002DEST_PATH_IMAGE042

where the function Position returns the absolute position of a sentence in its document, and N is the number of sentences in document D. Sentence similarity is computed as in step 3 (2) a of Embodiment 1.

b. Treat the relative position of the summary sentence in each document as a cloud drop; the backward cloud generator then computes the numerical characteristics of the cloud, giving the sentence relative position cloud (SRPCloud).
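Step 3 can be sketched as below. The exact formula is in a figure not reproduced here, so the sketch assumes rp is the 1-based position of the most similar sentence divided by N. The Jaccard word overlap is only a stand-in for the sentence similarity of Embodiment 1, and all function names are illustrative.

```python
from typing import List, Sequence

def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Stand-in similarity: word overlap between two tokenized sentences."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def relative_position(summary_sentence: Sequence[str],
                      document: Sequence[Sequence[str]]) -> float:
    """rp of a summary sentence in one document (assumed form).

    Position is taken as 1-based and N is the sentence count of the
    document, so rp lies in (0, 1]. Ties go to the earliest sentence.
    """
    best = max(range(len(document)),
               key=lambda i: jaccard(summary_sentence, document[i]))
    return (best + 1) / len(document)

def position_drops(summary_sentence: Sequence[str],
                   documents: Sequence[Sequence[Sequence[str]]]) -> List[float]:
    """One cloud drop per non-empty document, ready for the backward cloud generator."""
    return [relative_position(summary_sentence, d) for d in documents if d]

doc = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
drops = position_drops(["c", "d", "x"], [doc, doc[:2]])
```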

4. Obtain the topic relative position multi-dimensional cloud.

For a topic T, suppose the summary sentences

Figure 2011103737529100002DEST_PATH_IMAGE044

come from T; then T can be expressed as a k-dimensional multi-dimensional cloud, that is:

5. Use the comprehensive cloud operation to obtain the topic relative position one-dimensional cloud.

Figure 2011103737529100002DEST_PATH_IMAGE046

6. Compute the topic relative position score and sort the topics.

For a topic, the relative position score is determined directly by its expectation:

Figure 2011103737529100002DEST_PATH_IMAGE048

Finally, the topics are sorted in ascending order of this score: the topic with the lowest score forms the first part of the summary, the topic with the next-lowest score forms the second part, and so on until all topics have been ordered.

7. Sort the sentences within each topic.

Ordering the sentences within a topic likewise considers only the expectation of the SRPCloud: the relative position score of a sentence is the expectation of its SRPCloud. Sentences are ordered by this score; the smaller the score, the earlier the sentence appears within its topic.
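Steps 6 and 7 amount to a two-level ascending sort on expectations. The sketch below assumes the topic and sentence expectations have already been computed; the data layout and the name `order_summary` are illustrative only.

```python
from typing import List, Sequence, Tuple

Sentence = Tuple[str, float]             # (text, Ex of its SRPCloud)
Topic = Tuple[float, Sequence[Sentence]]  # (Ex of topic cloud, its sentences)

def order_summary(topics: Sequence[Topic]) -> List[str]:
    """Order topics by ascending expectation, then order the sentences
    inside each topic by ascending SRPCloud expectation."""
    ordered: List[str] = []
    for _, sentences in sorted(topics, key=lambda t: t[0]):
        ordered.extend(text for text, _ in sorted(sentences, key=lambda s: s[1]))
    return ordered

summary = order_summary([
    (0.7, [("C", 0.9), ("B", 0.6)]),
    (0.2, [("A", 0.1)]),
])
```

A low expectation means the material tends to appear early in the source documents, so sorting in ascending order mirrors the order in which the sources present it.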

8. Results

Fig. 10 gives the manual evaluation of summary readability before and after the sentence ordering module is used. The ratings are defined as follows. Perfect: a summary that cannot be improved however the sentence order is changed. Acceptable: a summary that is understandable and, although adjustment might improve it, is fine without adjustment. Poor: a summary that is incoherent in places but can reach the Acceptable level with slight adjustment. Unacceptable: a summary that needs adjustment in too many places and has to be reordered.

The evaluation shows that sentence ordering improves the readability of the summaries considerably: 30% are rated Perfect and a further 47.5% are rated Acceptable, which means that 77.5% of the summaries need no further modification. In addition, because sentence pruning is used, Embodiment 3 retains the advantage of Embodiment 2: its automatic ROUGE-1 score is 4.7% higher than that of Embodiment 1.

Claims

1. A cloud-model-based Chinese query-oriented multi-document automatic summarization method, characterized by comprising the following steps:

3) Remove redundancy and generate the initial summary.

2. The cloud-model-based Chinese query-oriented multi-document automatic summarization method according to claim 1, characterized in that a sentence pruning step follows said step 3): sentence pruning rules are formulated and applied to the initial summary sentences to produce multiple candidate sentences, and the multi-dimensional cloud is used to select a pruned sentence to replace the original summary sentence, generating a refined summary.

3. The cloud-model-based Chinese query-oriented multi-document automatic summarization method according to claim 2, further comprising a final sentence ordering step: the document set is clustered; the subtopics containing one or more summary sentences are found; every document in the document set is regarded as a template, and the set of templates constitutes a cloud, i.e. the cloud template; the cloud template is used to order first the subtopics and then the summary sentences within each subtopic, finally generating the required summary.

4. The cloud-model-based Chinese query-oriented multi-document automatic summarization method according to claim 2, characterized in that said sentence pruning rules are 10 manual rules based on dependency analysis.

5. The cloud-model-based Chinese query-oriented multi-document automatic summarization method according to claim 2, characterized in that said using the multi-dimensional cloud to select a pruned sentence to replace the original summary sentence specifically means: the distribution of a word across the document set, its distribution across all sentences, and its relevance to all query words are each regarded as cloud drops; the numerical characteristics of the three clouds are obtained through the backward cloud generator, giving the word multi-dimensional cloud; the word one-dimensional cloud is obtained through the comprehensive cloud operation; the word one-dimensional clouds form the sentence multi-dimensional cloud; the importance score of each candidate sentence is computed; the information density of each candidate sentence is then computed from that score and the candidate sentence length; and the original summary sentence is replaced with the candidate sentence of highest information density.

6. The cloud-model-based Chinese query-oriented multi-document automatic summarization method according to claim 5, characterized in that said computing the candidate sentence importance score means computing the similarity between the multi-dimensional cloud of the candidate sentence and that of the original sentence to obtain the importance score of the candidate sentence, the similarity between the two multi-dimensional clouds being computed as:

where C1 and C2 are two multi-dimensional clouds; $Ex_{1k}, Ex_{2k}, En_{1k}, En_{2k}, He_{1k}, He_{2k}$ are, respectively, the expectation, entropy, and hyper-entropy of the k-th attribute value of concepts C1 and C2; $V_k$ is the weight of attribute k, with a value between 0 and 1.

7. The cloud-model-based Chinese query-oriented multi-document automatic summarization method according to claim 5, characterized in that the information density of a candidate sentence is computed as:

8. The cloud-model-based Chinese query-oriented multi-document automatic summarization method according to claim 3, characterized in that said using the cloud template to order the subtopics specifically means: the one-dimensional clouds of the summary sentences contained in a topic constitute the topic relative position multi-dimensional cloud; the topic relative position one-dimensional cloud is obtained with the comprehensive cloud operation; the topic relative position score is obtained from the expectation Ex; and the topics are ordered by this score.

9. The cloud-model-based Chinese query-oriented multi-document automatic summarization method according to claim 3, characterized in that said using the cloud template to order the summary sentences within a subtopic specifically means: in every document, the sentence most similar to the summary sentence obtained later is found, and its position is taken as the relative position of the summary sentence in that document; each relative position is regarded as a cloud drop, and the backward cloud operation is applied to obtain the numerical characteristics of the sentence relative position cloud; the relative position score of each sentence within the topic is obtained from the expectation Ex; and the sentences within the topic are ordered by this score.