CN108664598A - Extractive summarization method with comprehensive advantages based on integer linear programming - Google Patents

Extractive summarization method with comprehensive advantages based on integer linear programming

Info

Publication number
CN108664598A
Authority
CN
China
Prior art keywords
sentence
summary
vector
word
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810435232.8A
Other languages
Chinese (zh)
Other versions
CN108664598B (en)
Inventor
高扬
黄河燕
魏林静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810435232.8A priority Critical patent/CN108664598B/en
Publication of CN108664598A publication Critical patent/CN108664598A/en
Application granted granted Critical
Publication of CN108664598B publication Critical patent/CN108664598B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed herein is an extractive summarization method with comprehensive advantages based on integer linear programming, belonging to the field of natural language processing. The method divides extractive summarization into document content learning and summary sentence extraction, and further divides document content learning into three parts: similarity, salience, and coherence. Summary sentence extraction takes both the learned content and redundancy into account, and selects summary sentences within an integer linear programming framework. The method learns semantic representations of sentences automatically from a corpus, computes the similarity between sentences with simple mathematical operations, and deeply mines the salience, similarity, coherence, and redundancy of the extractive summarization task so as to construct a high-quality summarization system.

Description

Extractive summarization method with comprehensive advantages based on integer linear programming
Technical field
The present invention relates to an extractive summarization method with comprehensive advantages based on integer linear programming, and belongs to the field of natural language processing.
Background technology
With the rapid growth of new-media information, people can obtain and share information from a wide range of sources, so the number of documents on the network grows exponentially. We therefore face an unavoidable and challenging information-overload problem. To alleviate it, we need systems that can digest all kinds of data in a timely manner. Search engines solve this problem to some extent: given a specified query, a search engine returns a ranked list of documents or web pages. However, even search engines built on the most advanced information-retrieval techniques lack the ability to integrate information from multiple sources, and thus cannot give users a concise yet informative response. To mitigate the information overload people face, a tool is needed that can integrate information and respond promptly; these open problems have stimulated interest in automatic summarization systems.
An automatic summarization system takes a single document collection or multiple document collections as input and produces a concise, fluent text summary that retains the most important information of the source documents. Automatic summarization can essentially be viewed as information compression: the input single document or multiple documents are condensed into brief, succinct sentences. Some information loss is inevitable in this process, so the summary must retain as much of the salient information as possible.
In multi-document summarization, the quality of a summary is mainly assessed along four dimensions: relevance, salience, coherence, and redundancy. Relevance means that the content matches what the user is interested in; salience means that the content occurs frequently in the source documents; coherence means that the content is expressed logically, which keeps the summary readable; redundancy means that no duplicated information appears in the summary. Relevance and salience are the key problems in automatic summarization, while coherence and redundancy are auxiliary indicators for constructing a high-quality summary.
Existing automatic summarization methods have mainly studied similarity and salience. For similarity, traditional methods score sentences with features such as term frequency, topic words, and part of speech; such methods are simple and easy to understand but lack deep semantic understanding. Later methods learn deep semantics with vector representations, but they do not jointly consider similarity, salience, coherence, and redundancy. For salience, existing methods are mostly statistical, determining the importance of a sentence from information such as term frequency, sentence position, and concepts.
Summary of the invention
The purpose of the present invention is to solve the problem of how to jointly consider similarity, salience, coherence, and redundancy when constructing a high-quality summary. To this end, an extractive summarization method with comprehensive advantages based on integer linear programming is proposed. The method learns sentence vectors automatically from a corpus, computes similarity mathematically, and measures topic salience and inter-sentence coherence statistically, thereby building a high-quality summarization system.
The core idea of the invention is: compute similarity by combining vector similarity with feature similarity; compute salience from topic-level information; compute sentence coherence from word-pair mutual information; and finally take redundancy into account by solving a global optimization with integer linear programming, so that the summary constructed from similarity, salience, coherence, and redundancy together is more accurate.
To achieve the above purpose, the present invention adopts the following technical scheme.
Related definitions are given first, as follows:
Definition 1: query, i.e., a query term. Each query term is called a query; each query is a sentence and typically represents the content the user cares about.
Definition 2: document collection. Automatic summarization includes extractive summarization and abstractive summarization, and extractive summarization is further divided into query-based and content-based extractive summarization. Both kinds of summarization involve multiple document collections, and each document collection corresponds to one query. The document collection corresponding to a query is a topic set, denoted D, with D = {d_i | 1 ≤ i ≤ N}, where N is the number of documents in D.
Definition 3: summary sentence set and candidate sentence set. In query-based extractive summarization, each query corresponds to one document collection, and the summary sentences extracted from the collection must be relevant to the query content. The extracted summary sentences form the summary sentence set, denoted S, with S = {s_j | 1 ≤ j ≤ M}, where M is the number of sentences in the set and s_j is one summary sentence. Since the word count of an extractive summary is limited, the condition \sum_j l(s_j) \le L must hold, where l(s_j) is the length of sentence s_j and L is the length limit of the summary sentence set. The candidate sentence set consists of all sentences in D; each sentence in D is called a candidate summary sentence, and its distributed vector representation is also called a sentence vector. A candidate sentence is composed of words, whose distributed vector representations are called word vectors.
Definition 4: similar-word set, a set in which all words are synonyms.
Definition 5: similarity. The semantic overlap and the feature overlap between a candidate sentence and the query are collectively called similarity. Semantic overlap is also called vector similarity; feature overlap is the coverage of noun phrases and verb phrases, also called feature similarity.
Definition 6: salience, i.e., topic salience, the proportion of each topic over all candidate sentences: the more sentences a topic contains, the more salient the topic.
Definition 7: coherence. In extractive summarization the extracted sentences must be rearranged; coherence means that the finally arranged summary sentences are semantically and logically fluent and readable.
An extractive summarization method with comprehensive advantages based on integer linear programming comprises the following steps:
Step 1: compute the similarity between each candidate sentence and the query. First learn sentence vectors and compute the vector similarity, then compute the feature similarity from features, and add the two.
The vector similarity is computed with sentence vectors learned by the PV algorithm; the feature similarity uses noun phrases and verb phrases as features.
Here, PV is short for paragraph vector. The PV algorithm is an unsupervised framework that learns distributed vectors for pieces of text; a text piece can be a sentence, a paragraph, or a document, and its length is variable.
During training, the PV algorithm keeps adjusting the sentence vectors and word vectors to predict words until it converges; the sentence vectors and word vectors are obtained by stochastic gradient descent and backpropagation.
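As an illustration only, the PV step can be sketched with gensim's Doc2Vec, an implementation of the paragraph-vector model; the corpus file name and most hyperparameters below are assumptions rather than values prescribed by the patent (only the 256-dimensional vectors follow the embodiments):

```python
# Minimal paragraph-vector (PV) sketch using gensim's Doc2Vec.
# Assumptions: corpus.txt holds one candidate sentence per line (see step 1.1);
# window/min_count/epochs are illustrative; vector_size=256 follows the embodiments.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

with open("corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

tagged = [TaggedDocument(words=s, tags=[i]) for i, s in enumerate(sentences)]

# Sentence and word vectors are trained jointly by SGD with backpropagation.
model = Doc2Vec(tagged, vector_size=256, window=5, min_count=2, epochs=40)

sent_vec = model.dv[0]                                         # a candidate sentence vector
query_vec = model.infer_vector("example query words".split())  # a query vector
```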
The feature similarity is computed with a syntactic parse tree and the K-means algorithm.
The computation of the vector similarity and the feature similarity comprises the following sub-steps:
Step 1.1: arrange the corpus one sentence per line and feed it into the PV algorithm to learn sentence vectors; the vector similarity is then the cosine similarity of formula (1):

R(s_j, q) = \frac{vec(s_j) \cdot vec(q)}{\|vec(s_j)\| \, \|vec(q)\|}   (1)

where s_j is any candidate sentence, vec(s_j) is the sentence vector of s_j, q is the query, vec(q) is the sentence vector of the query, and R(s_j, q) is the vector similarity between s_j and the query; a minimal numeric sketch follows.
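Formula (1) is plain cosine similarity; a minimal sketch, reusing sent_vec and query_vec from the snippet above:

```python
# Vector similarity of formula (1): cosine of a candidate sentence vector and the query vector.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

R_sj_q = cosine(sent_vec, query_vec)  # e.g. 0.15 in embodiment 1
```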
Step 1.2: segment the corpus into words, learn word vectors, cluster them with K-means, and compute the feature similarity, through the following sub-steps:
Step 1.2.1: segment the corpus into words.
Step 1.2.2: learn word vectors from the segmented corpus with the word2vec algorithm.
Step 1.2.3: cluster the word vectors produced by step 1.2.2 with the K-means algorithm to obtain similar-word sets.
The clustering rule of the K-means algorithm is that word vectors that are close in the semantic space belong to the same set, as sketched below.
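A minimal sketch of sub-steps 1.2.2 and 1.2.3, assuming gensim's Word2Vec and scikit-learn's KMeans as concrete implementations; the cluster count of 50 follows embodiment 1, the remaining settings are assumptions:

```python
# Step 1.2.2: learn word vectors; step 1.2.3: K-means puts word vectors that are
# close in the semantic space into the same similar-word set.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

w2v = Word2Vec(sentences, vector_size=256, window=5, min_count=2)
vocab = list(w2v.wv.index_to_key)
X = np.stack([w2v.wv[w] for w in vocab])

labels = KMeans(n_clusters=50, n_init=10).fit_predict(X)

cluster_of = dict(zip(vocab, labels))  # word -> similar-word set id
similar_sets = {}                      # set id -> words in that set
for word, lab in cluster_of.items():
    similar_sets.setdefault(lab, set()).add(word)
```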
Step 1.2.4: compute the feature similarity from noun phrases and verb phrases by formula (2):

Fe_j = \sum_{np \in Q} tf(np) + \sum_{vp \in Q} tf(vp)   (2)

where Fe_j is the feature similarity of the j-th sentence; feature similarity specifically refers to the number of synonym co-occurrences of noun phrases and verb phrases between the query and the candidate sentence. Q is the set of clusters to which the query words belong, np is a noun phrase in s_j, and vp is a verb phrase in s_j; tf(np) is the overlapping frequency of s_j's noun phrases with the query, and tf(vp) is the overlapping frequency of s_j's verb phrases with the query. A counting sketch follows.
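A hedged sketch of formula (2): noun and verb phrases are treated here as single tokens and matched to the query through shared K-means clusters; the helper name and the example inputs are illustrative assumptions:

```python
# Feature similarity Fe_j: count phrase occurrences of sentence s_j whose cluster
# coincides with a cluster of the query words (the synonym co-occurrence of formula (2)).
from collections import Counter

def feature_similarity(sentence_phrases, query_words, cluster_of):
    query_clusters = {cluster_of[w] for w in query_words if w in cluster_of}
    tf = Counter(sentence_phrases)  # tf(np) and tf(vp) merged into one counter
    return sum(cnt for phrase, cnt in tf.items()
               if cluster_of.get(phrase) in query_clusters)

# Hypothetical phrases/words, for illustration only:
Fe_j = feature_similarity(["oil", "price", "rise"], ["petroleum", "cost"], cluster_of)
```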
Step 1.3: compute the similarity by adding the vector similarity and the feature similarity, as in formula (3):

Rele_j = R(s_j, q) + Fe_j   (3)

where Rele_j denotes the similarity of candidate sentence s_j.
Step 2: compute the salience of the candidate sentences with the LDA algorithm.
LDA is used because it is among the most mature topic models developed to date: it overcomes the defects of traditional topic models and, resting on probability theory and Bayesian theory, has been widely applied in fields such as text retrieval, text classification, image recognition, and social networks.
Step 2 comprises the following sub-steps:
Step 2.1: compute the topic distribution of each candidate sentence, denoted θ.
Step 2.2: take the dimension with the largest probability in θ as the topic of the sentence, obtaining the topics of all candidate sentences.
Step 2.3: count the number of candidate sentences under each topic and normalize to obtain the topic salience, where the salience of the i-th topic is denoted t_i. A sketch of this step is given below.
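A sketch of step 2 using gensim's LdaModel; the topic count of 5 follows the worked examples, and normalizing by the number of candidate sentences is one plausible reading of step 2.3, not a detail confirmed by the patent:

```python
# Step 2.1: per-sentence topic distribution theta; step 2.2: argmax topic;
# step 2.3: normalized per-topic sentence counts give the salience t_i.
from collections import Counter
from gensim import corpora
from gensim.models import LdaModel

dictionary = corpora.Dictionary(sentences)
bows = [dictionary.doc2bow(s) for s in sentences]
lda = LdaModel(bows, num_topics=5, id2word=dictionary, passes=10)

topic_of = [max(lda.get_document_topics(b, minimum_probability=0.0),
                key=lambda p: p[1])[0] for b in bows]
counts = Counter(topic_of)
t = [counts[i] / len(sentences) for i in range(5)]  # salience t_i
```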
Step 3: compute the coherence between candidate sentences with mutual information, through the following sub-steps:
Step 3.1: for any two candidate sentences s_j and s_k in the candidate sentence set, compute the mutual information of their word pairs and similar-word pairs. Specifically, for a word pair <u, v> with u ∈ s_j and v ∈ s_k, use the similar-word sets obtained in step 1.2.3 to compute the word-pair mutual information P_jk<u, v> by formula (4), where U is the similar-word set of word u, V is the similar-word set of word v, cnt(U, V) is the number of times words from U and V occur in two adjacent sentences, freq(U) is the word frequency of the U set, and freq(V) is the word frequency of the V set.
Step 3.2: add the mutual information of the word pairs of s_j and s_k to obtain the coherence, as in formula (5):

c\langle s_j, s_k \rangle = \sum_{u \in s_j} \sum_{v \in s_k} P_{jk}\langle u, v \rangle   (5)

A sketch of both sub-steps follows.
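Since formula (4) itself is not reproduced in this text, the sketch below assumes a PMI-style ratio of adjacent-sentence co-occurrence counts to set frequencies; cnt_adjacent and freq are assumed precomputed corpus statistics, so treat this as one plausible realization rather than the patent's exact formula:

```python
# Step 3.1: score a cross-sentence word pair <u, v> through its similar-word sets
# (assumed PMI-style ratio); step 3.2: formula (5) sums the pair scores.
def pair_mutual_info(u, v, cluster_of, cnt_adjacent, freq):
    U, V = cluster_of.get(u), cluster_of.get(v)
    if U is None or V is None:
        return 0.0
    return cnt_adjacent.get((U, V), 0) / (freq.get(U, 0) * freq.get(V, 0) + 1e-9)

def coherence(sent_j, sent_k, cluster_of, cnt_adjacent, freq):
    return sum(pair_mutual_info(u, v, cluster_of, cnt_adjacent, freq)
               for u in sent_j for v in sent_k)  # formula (5)
```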
Step 4: based on the sentence vectors learned in step 1, compute the similarity between candidate sentences by formula (6):

R\langle s_j, s_k \rangle = \frac{vec(s_j) \cdot vec(s_k)}{\|vec(s_j)\| \, \|vec(s_k)\|}   (6)

where s_j and s_k are any two sentences in the candidate sentence set, and the similarity R<s_j, s_k> is computed with cosine similarity.
Step 5: perform global optimization over the comprehensive advantage composed of similarity, salience, coherence, and redundancy with integer linear programming, and extract summary sentences to obtain the summary sentence set, specifically by maximizing objective function (7):

max { \sum_i \alpha_i t_i + \sum_j \beta_j Rele_j + \sum_{j<k} \beta_{jk} c\langle s_j, s_k \rangle - \sum_{j<k} \beta_{jk} R\langle s_j, s_k \rangle }   (7)

where the similarity Rele_j is obtained in step 1.3, the salience t_i in step 2.3, the coherence c<s_j, s_k> in step 3.2, and R<s_j, s_k> in step 4; the lower the similarity between summary sentences, the lower the redundancy.
Integer linear programming is abbreviated ILP. α_i and β_j are binary variables indicating whether topic i and candidate sentence j are selected into the summary, t_i is the topic salience, Rele_j is the similarity of the candidate sentence, and β_jk is a binary variable indicating whether the candidate sentence pair <s_j, s_k> appears in the summary sentence set simultaneously. While maximizing objective (7), the following five constraints (8) to (12) must hold:

β_j Asso_ij ≤ α_i   (8)
\sum_j β_j Asso_ij ≥ α_i   (9)
β_jk − β_j ≤ 0;  β_jk − β_k ≤ 0;  β_j + β_k − β_jk ≤ 1   (10)
\sum_j β_j l(s_j) ≤ L   (11)
α_i, β_j, β_jk ∈ {0, 1}   (12)

where Asso_ij is a binary variable indicating whether the topic of sentence j coincides with topic i. Inequalities (8) and (9) guarantee that if a candidate sentence is selected then the topic it belongs to is also selected into the summary, and conversely that if a topic is selected then at least one candidate sentence under that topic is selected. β_k is the binary variable indicating whether candidate sentence k is selected into the summary, and inequality (11) states that the length of the summary sentence set does not exceed L. A solver sketch follows.
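A sketch of step 5 with the PuLP solver, encoding objective (7) and constraints (8) to (12); the scores Rele, c, R, the sentence lengths, and topic_of are assumed to come from the preceding steps, and L = 250 follows embodiment 1:

```python
# ILP of step 5: select topics (alpha), sentences (beta), and sentence pairs (betajk)
# to maximize objective (7) under constraints (8)-(11); Binary variables give (12).
import pulp

M, T = len(sentences), 5
prob = pulp.LpProblem("summary_ilp", pulp.LpMaximize)
alpha = [pulp.LpVariable(f"a{i}", cat="Binary") for i in range(T)]
beta = [pulp.LpVariable(f"b{j}", cat="Binary") for j in range(M)]
betajk = {(j, k): pulp.LpVariable(f"p{j}_{k}", cat="Binary")
          for j in range(M) for k in range(j + 1, M)}

prob += (pulp.lpSum(alpha[i] * t[i] for i in range(T))
         + pulp.lpSum(beta[j] * Rele[j] for j in range(M))
         + pulp.lpSum(v * (c[j][k] - R[j][k]) for (j, k), v in betajk.items()))  # (7)

for j in range(M):
    prob += beta[j] <= alpha[topic_of[j]]                    # (8): Asso_ij = 1 for j's topic
for i in range(T):
    prob += pulp.lpSum(beta[j] for j in range(M)
                       if topic_of[j] == i) >= alpha[i]      # (9)
for (j, k), v in betajk.items():                             # (10): v = 1 iff both selected
    prob += v <= beta[j]
    prob += v <= beta[k]
    prob += beta[j] + beta[k] - v <= 1
prob += pulp.lpSum(beta[j] * length[j] for j in range(M)) <= 250  # (11), L = 250 words

prob.solve()
summary = [j for j in range(M) if beta[j].value() == 1]
```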
Through steps 1 to 5, summary sentences that are semantically similar to the query, topically salient, coherent, and free of redundancy have been selected, completing the extractive summarization method with comprehensive advantages based on integer linear programming.
Advantageous effects
Compared with the prior art, the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention has the following advantages:
1. Vector similarity and feature similarity jointly capture deep semantics and effective features, improving the similarity between the summary sentence set and the query;
2. The computation of topic salience improves the accuracy of extracting important information, so that the summary sentence set better covers the salient topics of the topic space;
3. Computing the coherence between candidate sentences from word-pair mutual information improves the readability of the final summary sentences, so that the extracted sentences better express the content of the document collection;
4. The ILP framework jointly considers similarity, salience, coherence, and redundancy and finds a globally optimal solution, improving the quality of the summary sentence set.
Description of the drawings
Fig. 1 is the flow chart of the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention;
Fig. 2 is the clustering result diagram obtained after the K-means clustering algorithm in step B of embodiment 1.
Specific embodiments
To make the purpose, technical scheme, and advantages of the present invention clearer, the summarization method of the present invention is further described below with reference to the drawings and embodiments.
Embodiment 1
This embodiment describes a specific implementation of the present invention, as shown in Fig. 1.
As can be seen from Fig. 1, the flow of the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention is as follows:
Step A, preprocessing. In this embodiment, the corpus is split into sentences and stop words are removed. The standard dataset DUC2005 is selected; it contains 50 document collections, each with 25-50 documents. Extractive summarization here means extracting a summary sentence set of no more than 250 words from the document collection under each query. The DUC2005 data are in XML format; the query and the document collection are extracted from the <narr></narr> and <TEXT></TEXT> tags respectively, and the nltk toolkit is used to split the document collection into sentences, yielding a new one-sentence-per-line document D_new1. Stop-word removal is applied to D_new1 to obtain a new document D_new2.
Step B, compute similarity with the PV and K-means algorithms, compute salience with the LDA algorithm, and compute coherence with mutual information; in this embodiment the three are computed in parallel, specifically:
Compute similarity with the PV and K-means algorithms, i.e., learn sentence vectors with the PV algorithm. Document D_new2 is input into the PV algorithm to obtain the 256-dimensional sentence vector of each candidate sentence. The sentence vector of one candidate sentence is [0.00150049 0.08735332 -0.10565963 0.04739858 0.18809512 0.280207 ... -0.19442209 0.17960664 0.30010329 0.06458669 0.12353758]; the sentence vector of the query is [0.16279337 0.00488725 -0.30741466 0.83172139 0.25234198 0.00017076 ... 0.30811236 -0.2949384 0.03353651 0.18530557 0.94691929]. According to the cosine similarity of formula (1), the vector similarity of this candidate sentence is 0.15.
Learn word vectors with the word2vec algorithm: document D_new2 is input into the word2vec algorithm to obtain word vectors, with objective function (13):

\frac{1}{T} \sum_{i=k}^{T-k} \log p(w_i \mid w_{i-k}, \dots, w_{i+k})   (13)

where k is the window size, i is the current word, and T is the number of words in the corpus; 256-dimensional word vectors are learned by gradient descent.
Computing the feature similarity requires counting noun phrases and verb phrases. Syntactic analysis is performed with the Stanford Parser, and the words labeled np or vp are extracted; the extracted noun phrases and verb phrases are then clustered by their word vectors with the K-means algorithm into 50 clusters. The feature similarity between a candidate sentence and the query is computed by counting noun phrases and verb phrases falling into the same cluster; with query = [q_1, q_2 ... q_q] and words w = [w_1, w_2 ... w_w], the clustering result obtained by K-means is shown in Fig. 2. From the clustering result of Fig. 2, the overlapping word frequency is 3 and the total word count is 8, so the feature similarity is 3/8.
The similarity result is the vector similarity plus the feature similarity: 0.15 + 3/8 = 0.525.
In this embodiment, salience is computed with the LDA algorithm as follows: document D_new2 is first input into the LDA algorithm to obtain the topic distribution of each candidate sentence. For three example sentences, with the number of topics set to 5, the topic distributions are [0.1, 0.01, 0.5, 0.09, 0.3], [0.9, 0.01, 0.02, 0, 0.7], and [0.09, 0.02, 0.1, 0.8, 0]; the three sentences therefore belong to topic 3, topic 1, and topic 4 respectively, so the salience of the five topics is computed as 1/5, 0, 1/5, 1/5, 0.
In this embodiment, coherence is computed with mutual information as follows: compute the mutual information of the word pairs in the candidate sentences, and obtain the coherence between candidate sentences from the word-pair mutual information. For candidate sentences s_1 = [w_11, w_12, ..., w_1i, ..., w_1n] and s_2 = [w_21, w_22, ..., w_2i, ..., w_2m], first pair up the words of s_1 and s_2 and compute the mutual information of each pair; the pair <w_11, w_21> is computed by formula (14), where U and V are the similar-word sets of w_11 and w_21 respectively. The frequency of w_11 appearing in adjacent sentences with V is 3, the frequency of U appearing in adjacent sentences with w_21 is 2, and the corpus frequency of U and V is 100, so the mutual information of <w_11, w_21> is 5.001/101. The mutual information of the remaining word pairs is obtained in the same way, and the coherence of s_1 and s_2 is obtained with formula (15).
Step C, compute the similarity between sentences: using the sentence vectors of the candidate sentences learned by the PV algorithm in step B, compute the cosine similarity of every two candidate sentences with formula (1).
Step D, jointly consider similarity, salience, coherence, and redundancy, and solve optimization formula (7) within the ILP framework.
The resulting β_j values are [0, 1, 1, 0, 0, ..., 1]; the sentences whose entries are 1 are chosen as summary sentences, yielding the summary sentence set.
Embodiment 2
Step B of embodiment 1, which computes similarity with the PV and K-means algorithms, salience with the LDA algorithm, and coherence with mutual information, is here split into three sequentially executed steps, as follows:
Step A, preprocessing. In this embodiment, the corpus is split into sentences and stop words are removed. The DailyMail dataset is selected; it contains 1000 document collections, each averaging 802 words. The query and the document collection are extracted with a two-tuple method, and the nltk toolkit is used to split the document collection into sentences, yielding a new one-sentence-per-line document D_new1; stop-word removal on D_new1 gives a new document D_new2.
Step B, compute similarity with the PV and K-means algorithms, i.e., learn sentence vectors with the PV algorithm. Document D_new2 is input into the PV algorithm to obtain the 256-dimensional sentence vector of each candidate sentence. The sentence vector of one candidate sentence is [0.30150011 -0.60735332 0.00165963 0.31739858 0.11809512 0.080117 ... -0.04042209 0.12560614 0.17610322 0.06183569 0.38161758]; the sentence vector of the query is [0.01539337 0.18238734 -0.30741466 0.03572199 0.45234808 0.60017210 ... 0.80311120 -0.1038382 0.02234642 0.17560577 0.91691008]. According to the cosine similarity of formula (1), the vector similarity of this candidate sentence is 0.12.
Learn word vectors with the word2vec algorithm: document D_new2 is input into the word2vec algorithm to obtain word vectors, with objective function (13) as in embodiment 1, where k is the window size, i is the current word, and T is the number of words in the corpus; 256-dimensional word vectors are learned by gradient descent.
Computing the feature similarity requires counting noun phrases and verb phrases. Syntactic analysis is performed with the Stanford Parser, the words labeled np or vp are extracted, and the extracted noun phrases and verb phrases are clustered by their word vectors with the K-means algorithm into 70 clusters. The feature similarity between a candidate sentence and the query is computed by counting noun phrases and verb phrases falling into the same cluster; with query = [q_1, q_2 ... q_q] and words w = [w_1, w_2 ... w_w], the clustering result obtained by K-means is shown in Fig. 2. From the clustering result of Fig. 2, the overlapping word frequency is 3 and the total word count is 8, so the feature similarity is 3/8.
The similarity result is the vector similarity plus the feature similarity: 0.12 + 3/8 = 0.495.
Step C, compute coherence with mutual information as follows: compute the mutual information of the word pairs in the candidate sentences, and obtain the coherence between candidate sentences from the word-pair mutual information. For candidate sentences s_1 = [w_11, w_12, ..., w_1i, ..., w_1n] and s_2 = [w_21, w_22, ..., w_2i, ..., w_2m], first pair up the words of s_1 and s_2 and compute the mutual information of each pair; the pair <w_11, w_21> is computed by formula (14), where U and V are the similar-word sets of w_11 and w_21 respectively. The frequency of w_11 appearing in adjacent sentences with V is 3, the frequency of U appearing in adjacent sentences with w_21 is 2, and the corpus frequency of U and V is 100, so the mutual information of <w_11, w_21> is 5.001/101. The mutual information of the remaining word pairs is obtained in the same way, and the coherence of s_1 and s_2 is obtained with formula (15).
Step D, compute salience with the LDA algorithm as follows: document D_new2 is first input into the LDA algorithm to obtain the topic distribution of each candidate sentence. For three example sentences, with the number of topics set to 5, the topic distributions are [0.1, 0.01, 0.5, 0.09, 0.3], [0.9, 0.01, 0.02, 0, 0.7], and [0.09, 0.02, 0.1, 0.8, 0]; the three sentences therefore belong to topic 3, topic 1, and topic 4 respectively, so the salience of the five topics is 1/5, 0, 1/5, 1/5, 0.
Step E, compute the similarity between sentences: using the sentence vectors of the candidate sentences learned by the PV algorithm in step B, compute the cosine similarity of every two candidate sentences with formula (1).
Step F, jointly consider similarity, salience, coherence, and redundancy, and solve optimization formula (7) within the ILP framework. The resulting β_j values are [0, 0, 1, 1, 0, ..., 1]; the sentences whose entries are 1 are chosen as summary sentences, yielding the summary sentence set.
The extractive summarization method with comprehensive advantages based on integer linear programming of the present invention has been described in detail above, but the specific implementation forms of the present invention are not limited thereto. The embodiments merely help to understand the method of the present invention and its core idea; for those of ordinary skill in the art, changes may be made to the specific embodiments and the scope of application according to the idea of the present invention. In conclusion, the content of this specification should not be construed as limiting the present invention.
Any obvious change made to the method of the invention without departing from its spirit and the scope of the claims falls within the protection scope of the present invention.

Claims (6)

1. An extractive summarization method with comprehensive advantages based on integer linear programming, characterized in that: vector representations of sentences are learned automatically from a corpus, similarity is computed mathematically, and topic salience and inter-sentence coherence are computed statistically, so as to construct a high-quality summarization system; the core idea is to compute similarity by combining vector similarity with feature similarity, compute salience from topic-level information, compute sentence coherence from word-pair mutual information, and finally take redundancy into account by solving a global optimization with integer linear programming, so that the summary constructed from similarity, salience, coherence, and redundancy together is more accurate;
related definitions are given first, as follows:
definition 1: query, i.e., a query term; each query term is called a query; each query is a sentence and typically represents the content the user cares about;
definition 2: document collection; automatic summarization includes extractive summarization and abstractive summarization, and extractive summarization is further divided into query-based and content-based extractive summarization; both kinds of summarization involve multiple document collections, and each document collection corresponds to one query; the document collection corresponding to a query is a topic set, denoted D, with D = {d_i | 1 ≤ i ≤ N}, where N is the number of documents in D;
definition 3: summary sentence set and candidate sentence set; in query-based extractive summarization, each query corresponds to one document collection, and the summary sentences extracted from the collection must be relevant to the query content; the extracted summary sentences form the summary sentence set, denoted S, with S = {s_i | 1 ≤ i ≤ M}, where M is the number of sentences in the set and s_i is one summary sentence; since the word count of an extractive summary is limited, the condition \sum_i l(s_i) \le L must hold, where l(s_i) is the length of sentence s_i and L is the length limit of the summary sentence set; the candidate sentence set consists of all sentences in D; each sentence in D is called a candidate summary sentence, whose distributed vector representation is also called a sentence vector; a candidate sentence is composed of words, whose distributed vector representations are called word vectors;
definition 4: similar-word set, a set in which all words are synonyms;
definition 5: similarity; the semantic overlap and the feature overlap between a candidate sentence and the query are collectively called similarity; semantic overlap is also called vector similarity, and feature overlap is the coverage of noun phrases and verb phrases, also called feature similarity;
definition 6: salience, i.e., topic salience, the proportion of each topic over all candidate sentences: the more sentences a topic contains, the more salient the topic;
definition 7: coherence; in extractive summarization the extracted sentences must be rearranged, and coherence means that the finally arranged summary sentences are semantically and logically fluent and readable;
the method comprises the following steps:
step 1: compute the similarity between each candidate sentence and the query, specifically by computing the vector similarity and the feature similarity separately and adding the two;
wherein the vector similarity is computed with sentence vectors learned by the PV algorithm, and the feature similarity uses noun phrases and verb phrases as features;
wherein PV is short for paragraph vector; the PV algorithm is an unsupervised framework that learns distributed vectors for pieces of text, where a text piece can be a sentence, a paragraph, or a document, and its length is variable;
during training, the PV algorithm keeps adjusting the sentence vectors and word vectors to predict words until it converges; the sentence vectors and word vectors are obtained by stochastic gradient descent and backpropagation;
the feature similarity is computed with a syntactic parse tree and the K-means algorithm;
step 2: compute the salience of the candidate sentences with the LDA algorithm;
wherein LDA is used because it is among the most mature topic models developed to date: it overcomes the defects of traditional topic models and, resting on probability theory and Bayesian theory, has been widely applied in fields such as text retrieval, text classification, image recognition, and social networks;
step 3: compute the coherence between the candidate sentences with mutual information;
step 4: based on the sentence vectors learned in step 1, compute the similarity between candidate sentences;
step 5: perform global optimization over the comprehensive advantage composed of similarity, salience, coherence, and redundancy with integer linear programming, and extract summary sentences to obtain the summary sentence set;
through steps 1 to 5, summary sentences that are semantically similar to the query, topically salient, coherent, and free of redundancy have been selected.
2. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that the computation of the vector similarity and the feature similarity in step 1 comprises the following sub-steps:
step 1.1: arrange the corpus one sentence per line and feed it into the PV algorithm to learn sentence vectors; the vector similarity is the cosine similarity of formula (1):

R(s_j, q) = \frac{vec(s_j) \cdot vec(q)}{\|vec(s_j)\| \, \|vec(q)\|}   (1)

where s_j is any candidate sentence, vec(s_j) is the sentence vector of s_j, q is the query, vec(q) is the sentence vector of the query, and R(s_j, q) is the vector similarity between s_j and the query;
step 1.2: segment the corpus into words, learn word vectors, cluster them with K-means, and compute the feature similarity, through the following sub-steps:
step 1.2.1: segment the corpus into words;
step 1.2.2: learn word vectors from the segmented corpus with the word2vec algorithm;
step 1.2.3: cluster the word vectors produced by step 1.2.2 with the K-means algorithm to obtain similar-word sets;
wherein the clustering rule of the K-means algorithm is that word vectors that are close in the semantic space belong to the same set;
step 1.2.4: compute the feature similarity from noun phrases and verb phrases by formula (2):

Fe_j = \sum_{np \in Q} tf(np) + \sum_{vp \in Q} tf(vp)   (2)

where Fe_j is the feature similarity of the j-th sentence; feature similarity specifically refers to the number of synonym co-occurrences of noun phrases and verb phrases between the query and the candidate sentence; Q is the set of clusters to which the query words belong, np is a noun phrase in s_j, and vp is a verb phrase in s_j; tf(np) is the overlapping frequency of s_j's noun phrases with the query, and tf(vp) is the overlapping frequency of s_j's verb phrases with the query;
step 1.3: compute the similarity by adding the vector similarity and the feature similarity, as in formula (3):

Rele_j = R(s_j, q) + Fe_j   (3)

where Rele_j denotes the similarity of candidate sentence s_j.
3. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 2 comprises the following sub-steps:
step 2.1: compute the topic distribution of each candidate sentence, denoted θ;
step 2.2: take the dimension with the largest probability in θ as the topic of the sentence, obtaining the topics of all candidate sentences;
step 2.3: count the number of candidate sentences under each topic and normalize to obtain the topic salience, where the salience of the i-th topic is denoted t_i.
4. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 3 comprises the following sub-steps:
step 3.1: for any two candidate sentences s_j and s_k in the candidate sentence set, compute the mutual information of their word pairs and similar-word pairs; specifically, for a word pair <u, v> with u ∈ s_j and v ∈ s_k, use the similar-word sets obtained in step 1.2.3 to compute the word-pair mutual information P_jk<u, v> by formula (4), where U is the similar-word set of word u, V is the similar-word set of word v, cnt(U, V) is the number of times words from U and V occur in two adjacent sentences, freq(U) is the word frequency of the U set, and freq(V) is the word frequency of the V set;
step 3.2: add the mutual information of the word pairs of s_j and s_k to obtain the coherence, as in formula (5):

c\langle s_j, s_k \rangle = \sum_{u \in s_j} \sum_{v \in s_k} P_{jk}\langle u, v \rangle   (5)
5. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 4 is computed by formula (6):

R\langle s_j, s_k \rangle = \frac{vec(s_j) \cdot vec(s_k)}{\|vec(s_j)\| \, \|vec(s_k)\|}   (6)

where s_j and s_k are any two sentences in the candidate sentence set, and the similarity R<s_j, s_k> is computed with cosine similarity.
6. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 5 is realized by maximizing objective function (7):

max { \sum_i \alpha_i t_i + \sum_j \beta_j Rele_j + \sum_{j<k} \beta_{jk} c\langle s_j, s_k \rangle - \sum_{j<k} \beta_{jk} R\langle s_j, s_k \rangle }   (7)

where the similarity Rele_j is obtained in step 1.3, the salience t_i in step 2.3, the coherence c<s_j, s_k> in step 3.2, and R<s_j, s_k> in step 4; the lower the similarity between summary sentences, the lower the redundancy;
integer linear programming is abbreviated ILP; α_i and β_j are binary variables indicating whether topic i and candidate sentence j are selected into the summary, t_i is the topic salience, Rele_j is the similarity of the candidate sentence, and β_jk is a binary variable indicating whether the candidate sentence pair <s_j, s_k> appears in the summary sentence set simultaneously; while maximizing objective (7), the following five constraints (8) to (12) must hold:

β_j Asso_ij ≤ α_i   (8)
\sum_j β_j Asso_ij ≥ α_i   (9)
β_jk − β_j ≤ 0;  β_jk − β_k ≤ 0;  β_j + β_k − β_jk ≤ 1   (10)
\sum_j β_j l(s_j) ≤ L   (11)
α_i, β_j, β_jk ∈ {0, 1}   (12)

where Asso_ij is a binary variable indicating whether the topic of sentence j coincides with topic i; inequalities (8) and (9) guarantee that if a candidate sentence is selected then its topic is also selected into the summary, and conversely that if a topic is selected then at least one candidate sentence under that topic is selected; β_k is the binary variable indicating whether candidate sentence k is selected into the summary, and inequality (11) states that the length of the summary sentence set does not exceed L.
CN201810435232.8A 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming Active CN108664598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810435232.8A CN108664598B (en) 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810435232.8A CN108664598B (en) 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming

Publications (2)

Publication Number Publication Date
CN108664598A true CN108664598A (en) 2018-10-16
CN108664598B CN108664598B (en) 2019-04-02

Family

ID=63778925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810435232.8A Active CN108664598B (en) Extractive summarization method with comprehensive advantages based on integer linear programming

Country Status (1)

Country Link
CN (1) CN108664598B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110892400A (en) * 2019-09-23 2020-03-17 香港应用科技研究院有限公司 Method for summarizing text using sentence extraction
CN111159393A (en) * 2019-12-30 2020-05-15 电子科技大学 Text generation method for abstracting abstract based on LDA and D2V
CN112860881A (en) * 2019-11-27 2021-05-28 北大方正集团有限公司 Abstract generation method and device, electronic equipment and storage medium
CN113626581A (en) * 2020-05-07 2021-11-09 北京沃东天骏信息技术有限公司 Abstract generation method and device, computer readable storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908042A (en) * 2010-08-09 2010-12-08 中国科学院自动化研究所 Tagging method of bilingual combination semantic role
US20150152474A1 (en) * 2012-03-09 2015-06-04 Caris Life Sciences Switzerland Holdings Gmbh Biomarker compositions and methods
CN105320642A (en) * 2014-06-30 2016-02-10 中国科学院声学研究所 Automatic abstract generation method based on concept semantic unit
CN106874362A (en) * 2016-12-30 2017-06-20 中国科学院自动化研究所 Multilingual automaticabstracting

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908042A (en) * 2010-08-09 2010-12-08 中国科学院自动化研究所 Tagging method of bilingual combination semantic role
US20150152474A1 (en) * 2012-03-09 2015-06-04 Caris Life Sciences Switzerland Holdings Gmbh Biomarker compositions and methods
CN105320642A (en) * 2014-06-30 2016-02-10 中国科学院声学研究所 Automatic abstract generation method based on concept semantic unit
CN106874362A (en) * 2016-12-30 2017-06-20 中国科学院自动化研究所 Multilingual automaticabstracting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐惠婷 (Xu Huiting): "Research on Multi-Document Automatic Summarization Technology Based on Information Extraction and Semantic Similarity", Wanfang Data Knowledge Service Platform *
龚书 (Gong Shu): "Research on Text Representation for Extractive Multi-Document Summarization", Wanfang Data Knowledge Service Platform *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110892400A (en) * 2019-09-23 2020-03-17 香港应用科技研究院有限公司 Method for summarizing text using sentence extraction
CN110892400B (en) * 2019-09-23 2023-05-09 香港应用科技研究院有限公司 Method for summarizing text using sentence extraction
CN112860881A (en) * 2019-11-27 2021-05-28 北大方正集团有限公司 Abstract generation method and device, electronic equipment and storage medium
CN111159393A (en) * 2019-12-30 2020-05-15 电子科技大学 Text generation method for abstracting abstract based on LDA and D2V
CN111159393B (en) * 2019-12-30 2023-10-10 电子科技大学 Text generation method for abstract extraction based on LDA and D2V
CN113626581A (en) * 2020-05-07 2021-11-09 北京沃东天骏信息技术有限公司 Abstract generation method and device, computer readable storage medium and electronic equipment

Also Published As

Publication number Publication date
CN108664598B (en) 2019-04-02

Similar Documents

Publication Publication Date Title
US11580415B2 (en) Hierarchical multi-task term embedding learning for synonym prediction
CN108664598B (en) Extractive summarization method with comprehensive advantages based on integer linear programming
CN110298032A (en) Text classification corpus labeling training system
Wu et al. Exploring syntactic and semantic features for authorship attribution
Mukhtar et al. Effective use of evaluation measures for the validation of best classifier in Urdu sentiment analysis
Aumiller et al. Structural text segmentation of legal documents
Fei et al. Hierarchical multi-task word embedding learning for synonym prediction
Wang et al. Neural related work summarization with a joint context-driven attention mechanism
CN114997288A (en) Design resource association method
Liu et al. Chinese judicial summarising based on short sentence extraction and GPT-2
Ou et al. Unsupervised citation sentence identification based on similarity measurement
Rachman et al. Word Embedding for Rhetorical Sentence Categorization on Scientific Articles.
Zhou et al. Exploiting chunk-level features to improve phrase chunking
Zhao et al. Web text data mining method based on Bayesian network with fuzzy algorithms
CN114265936A (en) Method for realizing text mining of science and technology project
Kong et al. Construction of microblog-specific chinese sentiment lexicon based on representation learning
Tran et al. A named entity recognition approach for tweet streams using active learning
Wu et al. Facet annotation by extending CNN with a matching strategy
CN112270185A (en) Text representation method based on topic model
Hao Naive Bayesian Prediction of Japanese Annotated Corpus for Textual Semantic Word Formation Classification
Ding et al. Graph structure-aware bi-directional graph convolution model for semantic role labeling
Zenasni et al. Discovering types of spatial relations with a text mining approach
Li et al. Nominal compound chain extraction: a new task for semantic-enriched lexical chain
Cheng et al. Improved Deep Bi-directional Transformer Keyword Extraction based on Semantic Understanding of News
SILVA Extracting structured information from text to augment knowledge bases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant