CN108664598B - Extractive summarization method with comprehensive advantages based on integer linear programming - Google Patents

Extractive summarization method with comprehensive advantages based on integer linear programming

Info

Publication number
CN108664598B
Authority
CN
China
Prior art keywords
sentence
digest
vector
word
query
Prior art date
Legal status
Active
Application number
CN201810435232.8A
Other languages
Chinese (zh)
Other versions
CN108664598A (en)
Inventor
高扬
黄河燕
魏林静
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN201810435232.8A
Publication of CN108664598A
Application granted
Publication of CN108664598B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

Disclosed herein is an extractive summarization method with comprehensive advantages based on integer linear programming, belonging to the field of natural language processing. The method first divides extractive summarization into document content learning and summary sentence extraction, and further divides document content learning into three parts: similarity, salience, and coherence. Summary sentence extraction jointly considers the learned document content and redundancy, and extracts summary sentences within an integer linear programming framework. The method learns semantic representations of sentences automatically from a corpus, computes inter-sentence similarity with simple mathematical operations, and mines similarity, salience, coherence, and redundancy in the extractive summarization task in depth, so as to build a high-quality summarization system.

Description

Extractive summarization method with comprehensive advantages based on integer linear programming
Technical field
The present invention relates to an extractive summarization method with comprehensive advantages based on integer linear programming, and belongs to the field of natural language processing.
Background art
With the rapid growth of new media, people can obtain and share information from a wide range of sources, so the number of documents on the network grows exponentially. We therefore face an unavoidable and challenging information overload problem. To alleviate it, systems are needed that can process large amounts of data in time. Search engines solve this problem to some extent: given a user query, a search engine returns a ranked list of documents or web pages. However, even search engines built on the most advanced information retrieval techniques lack the ability to integrate information from multiple sources, and therefore cannot give users a concise yet informative response. To mitigate the information overload people face, a tool is needed that can both integrate information and respond in time; these open problems have stimulated interest in automatic summarization systems.
An automatic summarization system takes a single document collection or multiple document collections as input and generates a concise, fluent text summary that retains the most important information of the source documents. Automatic summarization can essentially be viewed as an information compression process: the input single document or multiple documents are condensed into brief, concise sentences. Some information loss is inevitable in this process, so the summary must retain as much of the salient information as possible.
In multi-document summarization, the quality of a summary is mainly assessed along four dimensions: relevance, salience, coherence, and redundancy. Relevance means the content is consistent with what the user is interested in; salience means the content appears frequently in the source documents; coherence means the content is expressed logically, keeping the summary readable; redundancy means the summary contains no duplicate information. Relevance and salience are the key problems in automatic summarization, while coherence and redundancy are auxiliary criteria for building a high-quality summary.
Current automatic summarization systems have been studied intensively mainly with respect to similarity and salience. For similarity, traditional methods score sentences with surface features such as word frequency, topic words, and part of speech; such methods are simple and easy to understand but lack deep semantic understanding. Later methods learn deep semantics with vector representations, but they do not jointly consider similarity, salience, coherence, and redundancy. For salience, existing methods are mostly statistical, determining the importance of a sentence from information such as word frequency, sentence position, and concepts.
Summary of the invention
The purpose of the present invention is to solve the problem of how to jointly consider similarity, salience, coherence, and redundancy when building a high-quality summary. To this end, an extractive summarization method with comprehensive advantages based on integer linear programming is proposed. The method learns sentence vectors automatically from a corpus, uses mathematical similarity computation, and measures topic salience and inter-sentence coherence statistically, so as to build a high-quality summarization system.
The core idea of the invention is: compute similarity by combining vector similarity with feature similarity; compute salience from topic-level information; compute sentence coherence from word-pair mutual information; and finally optimize with integer linear programming while accounting for redundancy. A summary built by jointly considering similarity, salience, coherence, and redundancy is more accurate.
To achieve the above object, the present invention adopts the following technical scheme:
Related definitions are given first, as follows:
Definition 1: query, i.e., a query term; each query term is called a query, and each query is a sentence that typically represents the content the user cares about;
Definition 2: document collection; automatic summarization covers extractive summarization and abstractive summarization, and extractive summarization is further divided into query-based extractive summarization and content-based extractive summarization; both kinds of summarization involve multiple document collections; each document collection corresponds to one query; the document collection corresponding to each query is a topic set, denoted D, where D={d_i | 1≤i≤N} and N is the number of documents in D;
Definition 3: summary sentence set and candidate sentence set; in query-based extractive summarization, each query corresponds to a document collection, and the summary sentences extracted from that collection must be relevant to the query content; the extracted summary sentences together form the summary sentence set, denoted S, where S={s_j | 1≤j≤M}, M is the number of sentences in the set, and s_j is one summary sentence; since the word count of an extractive summary is limited, the condition Σ_j l(s_j) ≤ L must hold, where l(s_j) is the length of sentence s_j and L is the length limit of the summary sentence set; the candidate sentence set consists of all sentences in D, each sentence in D being one candidate summary sentence; the distributed vector representation of a sentence is called a sentence vector; a candidate sentence is composed of words, and the distributed vector representation of a word is called a word vector;
Definition 4: similar-word set, a set in which all the words are synonyms;
Definition 5: similarity; the semantic overlap and feature overlap between a sentence in the candidate sentence set and the query are together called similarity; semantic overlap is also called vector similarity, and feature overlap is the degree of coverage of noun phrases and verb phrases, also called feature similarity;
Definition 6: salience, i.e., topic salience, the proportion each topic occupies among all sentences in the candidate sentence set: the more sentences a topic has, the more salient it is;
Definition 7: coherence; the summary sentences extracted in extractive summarization must be rearranged, and coherence means the final arrangement of summary sentences is semantically and logically fluent and readable;
An extractive summarization method with comprehensive advantages based on integer linear programming comprises the following steps:
Step 1: compute the similarity between each candidate sentence and the query; first learn sentence vectors and compute vector similarity, then compute feature similarity from features, and add the two;
Here, the vector similarity computation learns sentence vectors with the PV algorithm; the feature similarity computation uses noun phrases and verb phrases as features;
Here, PV is the abbreviation of paragraph vector; the PV algorithm is an unsupervised framework that learns distributed vector representations of text fragments;
A text fragment may be a sentence, a paragraph, or a document, and its length is variable;
During training, the PV algorithm predicts words by continually adjusting the sentence vectors and word vectors until convergence; the sentence vectors and word vectors are obtained by training with stochastic gradient descent and backpropagation;
Feature similarity is computed with a syntactic parse tree and the K-means algorithm;
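As a concrete illustration of the PV training just described, the sketch below uses gensim's Doc2Vec, an implementation of the paragraph-vector algorithm; the corpus and parameters are illustrative, not taken from the patent.

```python
# A minimal sketch of PV sentence-vector learning, assuming gensim's Doc2Vec
# implementation of the paragraph-vector algorithm; data are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["the market rallied on strong earnings",
          "heavy rain caused flooding in the region"]
docs = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(corpus)]

# Sentence vectors and word vectors are adjusted jointly by SGD until the
# model converges, as described above (256-dim, as in the embodiments).
model = Doc2Vec(docs, vector_size=256, min_count=1, epochs=50)

sentence_vec = model.dv[0]                                 # learned sentence vector
query_vec = model.infer_vector("market earnings report".split())
```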
The computation of vector similarity and feature similarity comprises the following sub-steps:
Step 1.1: feed the corpus, one sentence per line, into the PV algorithm to learn sentence vectors; vector similarity is then the cosine similarity, computed by formula (1);
Here s_j denotes any candidate sentence, vec(s_j) the sentence vector of s_j, q the query, vec(q) the sentence vector of the query, and R(s_j, q) the vector similarity between s_j and the query;
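The image of formula (1) is not reproduced in this text; from the surrounding description it is the standard cosine similarity between the sentence vector of s_j and that of the query:

```latex
R(s_j, q) \;=\; \frac{vec(s_j) \cdot vec(q)}{\lVert vec(s_j) \rVert \, \lVert vec(q) \rVert} \tag{1}
```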
Step 1.2: segment the corpus into words, learn word vectors, cluster them with K-means, and compute feature similarity; this comprises the following sub-steps:
Step 1.2.1: segment the corpus into words;
Step 1.2.2: learn word vectors from the segmented corpus with the word2vec algorithm;
Step 1.2.3: cluster the word vectors output by step 1.2.2 with the K-means algorithm to obtain similar-word sets;
Here, the clustering rule of the K-means algorithm is that word vectors close in the semantic space belong to the same set;
Step 1.2.4: compute feature similarity from the noun phrases and verb phrases, by formula (2):

Fe_j = Σ_{np∈Q} tf(np) + Σ_{vp∈Q} tf(vp)   (2)

Here Fe_j denotes the feature similarity of the j-th sentence; feature similarity specifically refers to the number of synonymous co-occurrences of noun phrases and verb phrases between the query and the candidate sentence;
Q denotes the set of clusters to which the query words belong, np denotes a noun phrase in s_j, and vp denotes a verb phrase in s_j; tf(np) denotes the frequency of noun-phrase overlap between s_j and the query; tf(vp) denotes the frequency of verb-phrase overlap between s_j and the query;
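A minimal sketch of steps 1.2.2 to 1.2.4, assuming gensim and scikit-learn; phrase extraction is simplified to single tokens here (the patent uses Stanford Parser noun/verb phrases), the toy sentences and query are illustrative, and the ratio-style normalization mirrors the 3/8 worked example in embodiment 1 rather than the raw sum of formula (2).

```python
# A minimal sketch of steps 1.2.2-1.2.4, assuming gensim and scikit-learn are
# available; phrase extraction is simplified to single tokens, and all data
# are illustrative.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [["economic", "growth", "slowed"], ["gdp", "expansion", "weakened"]]
query = ["economic", "growth"]

# Step 1.2.2: learn word vectors (256-dim, as in the embodiments).
w2v = Word2Vec(sentences, vector_size=256, min_count=1, window=5)

# Step 1.2.3: cluster word vectors with K-means to get similar-word sets.
vocab = list(w2v.wv.index_to_key)
kmeans = KMeans(n_clusters=2, n_init=10).fit([w2v.wv[w] for w in vocab])
cluster_of = dict(zip(vocab, kmeans.labels_))

# Step 1.2.4: count overlap of a candidate sentence with the query's clusters,
# normalized as in the 3/8 example of embodiment 1.
def feature_similarity(candidate, query):
    query_clusters = {cluster_of[w] for w in query if w in cluster_of}
    hits = sum(1 for w in candidate if cluster_of.get(w) in query_clusters)
    return hits / max(len(candidate), 1)

print(feature_similarity(["gdp", "expansion", "weakened"], query))
```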
Step 1.3: compute the similarity as the sum of vector similarity and feature similarity, by formula (3):

Rele_j = R(s_j, q) + Fe_j   (3)

Here Rele_j denotes the similarity of candidate sentence s_j;
Step 2: compute the salience of the candidate sentences with the LDA algorithm;
The reason for using LDA is as follows: LDA is the most mature topic model developed to date; it overcomes the defects of traditional topic models and, by virtue of its probability-theoretic and Bayesian foundations, is widely used in fields such as text retrieval, text classification, image recognition, and social networks;
Step 2 comprises the following sub-steps:
Step 2.1: compute the topic distribution of each candidate sentence, denoted θ;
Step 2.2: choose the highest-probability dimension of the distribution θ as the topic of the sentence, obtaining the topics of all candidate sentences;
Step 2.3: count the number of candidate sentences under each topic and normalize to obtain the topic saliences;
The salience of the i-th topic is denoted t_i;
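A minimal sketch of step 2, assuming gensim's LdaModel; the sentences and topic count are illustrative.

```python
# A minimal sketch of step 2 (topic salience), assuming gensim's LdaModel;
# corpus contents are illustrative.
from collections import Counter
from gensim.corpora import Dictionary
from gensim.models import LdaModel

sentences = [["market", "stocks", "rise"], ["rain", "storm", "flood"],
             ["stocks", "fall", "market"], ["flood", "damage", "storm"]]
dictionary = Dictionary(sentences)
bow = [dictionary.doc2bow(s) for s in sentences]

# Step 2.1: topic distribution theta of each candidate sentence (5 topics here).
lda = LdaModel(bow, num_topics=5, id2word=dictionary, passes=10)

# Step 2.2: the argmax dimension of theta is the sentence's topic.
topic_of = [max(lda.get_document_topics(d), key=lambda p: p[1])[0] for d in bow]

# Step 2.3: normalized topic counts give the salience t_i of each topic.
counts = Counter(topic_of)
salience = {i: counts.get(i, 0) / len(sentences) for i in range(5)}
print(salience)
```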
Step 3: compute coherence; the coherence between candidate sentences is computed with mutual information, comprising the following sub-steps:
Step 3.1: for any two candidate sentences s_j and s_k in the candidate sentence set, compute the mutual information of the word pairs in the two sentences and of their similar-word pairs, specifically:
For a word pair <u, v> with u ∈ s_j and v ∈ s_k, the similar-word sets obtained in step 1.2.3 are used to compute the mutual information of the pair; the mutual information P_jk<u, v> of the pair is computed by formula (4):
Here U denotes the similar-word set of word u, V the similar-word set of word v, cnt(U, V) the number of times words from the sets U and V occur in two adjacent sentences, freq(U) the frequency of the words in the set U, and freq(V) the frequency of the words in the set V;
Step 3.2: sum the mutual information of the word pairs in s_j and s_k to obtain the coherence, computed by formula (5):
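The image of formula (5) is not reproduced in this text; from the description (the pairwise mutual-information values are summed), it has the form below, possibly up to a normalization over the number of pairs, which the source does not show:

```latex
c\langle s_j, s_k\rangle \;=\; \sum_{u \in s_j} \sum_{v \in s_k} P_{jk}\langle u, v\rangle \tag{5}
```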
Step 4: based on the sentence vectors learned in step 1, compute the similarity between candidate sentences by formula (6):
Here s_j and s_k are any two sentences in the candidate sentence set, and the similarity R<s_j, s_k> is computed with cosine similarity;
Step 5: solve a global optimization over the comprehensive advantage composed of similarity, salience, coherence, and redundancy with integer linear programming, extract the summary sentences, and obtain the summary sentence set; this is realized by maximizing objective function (7):

max { Σ_i α_i t_i + Σ_j β_j Rele_j + Σ_{j<k} β_{jk} c<s_j,s_k> − Σ_{j<k} β_{jk} R<s_j,s_k> }   (7)

Here the similarity Rele_j is obtained in step 1.3; the salience t_i in step 2.3; the coherence c<s_j,s_k> in step 3.2; and R<s_j,s_k> in step 4; the lower the similarity between summary sentences, the lower the redundancy;
Integer linear programming is abbreviated ILP; α_i and β_j are binary variables indicating whether topic i and candidate sentence j, respectively, are selected into the summary; t_i denotes topic salience, Rele_j the similarity of a candidate sentence, and β_{jk} the binary variable indicating whether the sentence pair <s_j,s_k> appears in the summary sentence set; while maximizing objective function (7), the following five constraints (8) to (12) must also be satisfied:

β_j Asso_ij ≤ α_i   (8)
Σ_j β_j Asso_ij ≥ α_i   (9)
β_{jk} − β_j ≤ 0;  β_{jk} − β_k ≤ 0;  β_j + β_k − β_{jk} ≤ 1   (10)
Σ_j β_j l(s_j) ≤ L   (11)

Here Asso_ij is a binary variable indicating whether the topic of sentence j is topic i; inequalities (8) and (9) guarantee that if a candidate sentence is selected, the topic containing it must also be selected into the summary sentence set, and conversely that if a topic is selected, at least one candidate sentence under it must be selected; β_k is the binary variable indicating whether candidate sentence k is selected into the summary; inequality (11) states that the length of the summary sentence set does not exceed L; constraint (12) is not reproduced in the source and is presumably the integrality constraint α_i, β_j, β_{jk} ∈ {0, 1};
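A minimal sketch of the step-5 optimization, assuming the PuLP solver library; all inputs (saliences, similarities, coherences, lengths, and the topic-association matrix Asso) are hypothetical toy values.

```python
# A minimal sketch of the ILP in step 5, using PuLP; all inputs are toy values.
import pulp

T, M, L = 2, 3, 12                       # topics, candidate sentences, length cap
t = [0.6, 0.4]                           # topic saliences t_i (step 2.3)
Rele = [0.9, 0.5, 0.7]                   # query similarities Rele_j (step 1.3)
c = {(0, 1): 0.3, (0, 2): 0.1, (1, 2): 0.4}   # coherences c<s_j,s_k> (step 3.2)
R = {(0, 1): 0.8, (0, 2): 0.2, (1, 2): 0.1}   # sentence similarities (step 4)
length = [6, 5, 4]                       # sentence lengths l(s_j)
Asso = [[1, 0], [1, 0], [0, 1]]          # Asso[j][i]: sentence j has topic i

prob = pulp.LpProblem("summary", pulp.LpMaximize)
alpha = [pulp.LpVariable(f"a{i}", cat="Binary") for i in range(T)]
beta = [pulp.LpVariable(f"b{j}", cat="Binary") for j in range(M)]
betajk = {(j, k): pulp.LpVariable(f"b_{j}_{k}", cat="Binary") for (j, k) in c}

# Objective (7): salience + similarity + coherence - redundancy.
prob += (pulp.lpSum(alpha[i] * t[i] for i in range(T))
         + pulp.lpSum(beta[j] * Rele[j] for j in range(M))
         + pulp.lpSum(betajk[jk] * (c[jk] - R[jk]) for jk in c))

for i in range(T):
    for j in range(M):
        prob += beta[j] * Asso[j][i] <= alpha[i]                        # (8)
    prob += pulp.lpSum(beta[j] * Asso[j][i] for j in range(M)) >= alpha[i]  # (9)
for (j, k), v in betajk.items():
    prob += v - beta[j] <= 0                                            # (10)
    prob += v - beta[k] <= 0
    prob += beta[j] + beta[k] - v <= 1
prob += pulp.lpSum(beta[j] * length[j] for j in range(M)) <= L          # (11)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(b.value()) for b in beta])    # beta_j selection vector, as in step D
```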
Thus far, through steps 1 to 5, semantically similar, topically salient, coherent, and non-redundant high-quality summary sentences have been selected, completing the extractive summarization method with comprehensive advantages based on integer linear programming.
Beneficial effects
Compared with the prior art, the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention has the following beneficial effects:
1. Vector similarity and feature similarity jointly consider deep semantics and effective features, improving the similarity between the summary sentence set and the query;
2. The topic-salience computation improves the accuracy of important-information extraction, so that the summary sentence set has better topic-space salience;
3. Computing the coherence between candidate sentences with word-pair mutual information improves the readability of the final summary sentences, so that the extracted sentences better express the content of the document collection;
4. The ILP framework jointly considers similarity, salience, coherence, and redundancy and obtains a globally optimal solution, improving the quality of the summary sentence set.
Description of the drawings
Fig. 1 is a flow chart of the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention;
Fig. 2 is a diagram of the classification results obtained with the K-means clustering algorithm in step B of embodiment 1.
Specific embodiments
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the summarization method of the present invention is further described below with reference to the accompanying drawings and embodiments.
Embodiment 1
This embodiment describes a specific implementation process of the invention, as shown in Fig. 1.
As can be seen from Fig. 1, the process of the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention is as follows:
Step A: preprocessing; in this embodiment, the corpus is split into sentences and stop words are removed. The standard data set DUC2005 is selected; it contains 50 document collections, each comprising 25-50 documents. Extractive summarization here extracts a summary sentence set of no more than 250 words from the document collection under each query. DUC2005 is in XML format; the query and the document collection are extracted from the tags <narr></narr> and <TEXT></TEXT>, respectively, and the nltk toolkit is then used to split the document collection into sentences, yielding a new one-sentence-per-line document D_new1. Stop-word removal is applied to D_new1 to obtain a new document D_new2.
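A minimal sketch of the step-A preprocessing just described, assuming nltk; the XML tag names follow the DUC2005 format described above, and the file path is illustrative.

```python
# A minimal sketch of step A, assuming nltk; the tag names follow the DUC2005
# format described above, and the file path is illustrative.
import xml.etree.ElementTree as ET
import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

root = ET.parse("duc2005_topic.xml").getroot()
query = root.findtext(".//narr").strip()
text = " ".join(t.text or "" for t in root.iter("TEXT"))

# D_new1: one sentence per line.
d_new1 = nltk.sent_tokenize(text)

# D_new2: the same sentences with stop words removed.
stops = set(stopwords.words("english"))
d_new2 = [[w for w in nltk.word_tokenize(s) if w.lower() not in stops]
          for s in d_new1]
```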
Step B: compute the similarity with the PV and K-means algorithms, the salience with the LDA algorithm, and the coherence with mutual information;
In this embodiment, the three computations (similarity with PV and K-means, salience with LDA, coherence with mutual information) run in parallel, specifically:
Similarity is computed with the PV and K-means algorithms, i.e., sentence vectors are learned with the PV algorithm. The document D_new2 is fed into the PV algorithm to obtain a 256-dimensional sentence vector for each candidate sentence; the sentence vector of one candidate sentence is [0.00150049 0.08735332 -0.10565963 0.04739858 0.18809512 0.280207 ... -0.19442209 0.17960664 0.30010329 0.06458669 0.12353758]; the sentence vector of the query is [0.16279337 0.00488725 -0.30741466 0.83172139 0.25234198 0.00017076 ... 0.30811236 -0.2949384 0.03353651 0.18530557 0.94691929]; by the cosine similarity formula (1), the vector similarity of this candidate sentence is 0.15;
Word vectors are learned with the word2vec algorithm: the document D_new2 is fed into word2vec, whose objective function is formula (13):
Here k is the window size, i the current word, and T the number of words in the corpus; 256-dimensional word vectors are learned with gradient descent;
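The image of formula (13) is not reproduced in this text; given the description (window k, current word i, corpus size T), it matches the standard word2vec skip-gram objective:

```latex
J \;=\; \frac{1}{T} \sum_{i=1}^{T} \;\sum_{-k \le j \le k,\; j \ne 0} \log p(w_{i+j} \mid w_i) \tag{13}
```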
Feature similarity requires counting noun phrases and verb phrases. Stanford Parser is used for syntactic analysis, and the words labeled np and vp are extracted; the extracted noun phrases and verb phrases are then clustered by their word vectors with the K-means algorithm into 50 categories. The feature similarity between a candidate sentence and the query is computed by counting noun phrases and verb phrases in the same category: with query = [q_1, q_2 ... q_q] and words w = [w_1, w_2 ... w_w], the classification results obtained with the K-means algorithm are shown in Fig. 2. From the classification results of Fig. 2, the overlap word frequency is 3 and the total word count is 8, so the feature similarity is 3/8;
The similarity result is the vector similarity plus the feature similarity: 0.15 + 3/8 = 0.525;
In this embodiment, salience is computed with the LDA algorithm as follows: the document D_new2 is first fed into LDA to obtain the topic distribution of each candidate sentence. For three example sentences with 5 candidate topics, the topic distributions are [0.1, 0.01, 0.5, 0.09, 0.3], [0.9, 0.01, 0.02, 0, 0.07], and [0.09, 0.02, 0.1, 0.8, 0], so the three sentences belong to topic 3, topic 1, and topic 4, respectively; the salience results of the five topics are therefore 1/5, 0, 1/5, 1/5, 0;
In this embodiment, coherence is computed with mutual information as follows: the mutual information of the word pairs in the candidate sentences is computed, and word counts yield the coherence between candidate sentences. For candidate sentences s_1 = [w_11, w_12, ..., w_1i, ..., w_1n] and s_2 = [w_21, w_22, ..., w_2i, ..., w_2m], the words of s_1 and s_2 are first paired and the mutual information of each pair is computed; for the pair <w_11, w_21>, the computation follows formula (14):
Here U and V denote the similar-word sets of w_11 and w_21, respectively; the frequency with which w_11 and V appear in adjacent sentences is 3, the frequency with which U and w_21 appear in adjacent sentences is 2, and the corpus frequency of U and V is 100, so the mutual information of the pair <w_11, w_21> is 5.001/101; the mutual information of the remaining pairs is obtained in the same way, and the coherence of s_1 and s_2 is obtained by formula (15).
Step C: compute the similarity between sentences; with the sentence vectors of the candidate sentences learned by the PV algorithm in step B, the cosine similarity of every pair of candidate sentences is computed by formula (1);
Step D: jointly consider similarity, salience, coherence, and redundancy, and solve optimization formula (7) with the ILP framework;
The resulting β_j values are [0, 1, 1, 0, 0, ..., 1]; the sentences whose dimension is 1 are chosen as summary sentences, yielding the summary sentence set.
Embodiment 2
In this embodiment, step B of embodiment 1 (computing similarity with the PV and K-means algorithms, salience with the LDA algorithm, and coherence with mutual information) is split into three sequentially executed steps, as follows:
Step A: preprocessing; in this embodiment, the corpus is split into sentences and stop words are removed. The data set DailyMail is selected; it contains 1000 document collections, each containing 802 words on average. The query and the document collection are extracted with a binary-tuple method, and the nltk toolkit is used to split the document collection into sentences, yielding a new one-sentence-per-line document D_new1; stop-word removal applied to D_new1 yields a new document D_new2;
Step B: compute similarity with the PV and K-means algorithms, i.e., learn sentence vectors with the PV algorithm. The document D_new2 is fed into the PV algorithm to obtain a 256-dimensional sentence vector for each candidate sentence; the sentence vector of one candidate sentence is [0.30150011 -0.60735332 0.00165963 0.31739858 0.11809512 0.080117 ... -0.04042209 0.12560614 0.17610322 0.06183569 0.38161758]; the sentence vector of the query is [0.01539337 0.18238734 -0.30741466 0.03572199 0.45234808 0.60017210 ... 0.80311120 -0.1038382 0.02234642 0.17560577 0.91691008]; by the cosine similarity formula (1), the vector similarity of this candidate sentence is 0.12;
Word vectors are learned with the word2vec algorithm: the document D_new2 is fed into word2vec, whose objective function is formula (13); here k is the window size, i the current word, and T the number of words in the corpus, and 256-dimensional word vectors are learned with gradient descent;
Feature similarity requires counting noun phrases and verb phrases. Stanford Parser is used for syntactic analysis, and the words labeled np and vp are extracted; the extracted noun phrases and verb phrases are then clustered by their word vectors with the K-means algorithm into 70 categories. The feature similarity between a candidate sentence and the query is computed by counting noun phrases and verb phrases in the same category: with query = [q_1, q_2 ... q_q] and words w = [w_1, w_2 ... w_w], the classification results obtained with the K-means algorithm are shown in Fig. 2; from the classification results of Fig. 2, the overlap word frequency is 3 and the total word count is 8, so the feature similarity is 3/8;
The similarity result is the vector similarity plus the feature similarity: 0.12 + 3/8 = 0.495;
Step C: compute coherence with mutual information as follows: the mutual information of the word pairs in the candidate sentences is computed, and word counts yield the coherence between candidate sentences. For candidate sentences s_1 = [w_11, w_12, ..., w_1i, ..., w_1n] and s_2 = [w_21, w_22, ..., w_2i, ..., w_2m], the words of s_1 and s_2 are first paired and the mutual information of each pair is computed; for the pair <w_11, w_21>, the computation follows formula (14);
here U and V denote the similar-word sets of w_11 and w_21, respectively; the frequency with which w_11 and V appear in adjacent sentences is 3, the frequency with which U and w_21 appear in adjacent sentences is 2, and the corpus frequency of U and V is 100, so the mutual information of the pair <w_11, w_21> is 5.001/101; the mutual information of the remaining pairs is obtained in the same way, and the coherence of s_1 and s_2 is obtained by formula (15);
Step D: compute salience with the LDA algorithm as follows: the document D_new2 is first fed into LDA to obtain the topic distribution of each candidate sentence. For three example sentences with 5 candidate topics, the topic distributions are [0.1, 0.01, 0.5, 0.09, 0.3], [0.9, 0.01, 0.02, 0, 0.07], and [0.09, 0.02, 0.1, 0.8, 0], so the three sentences belong to topic 3, topic 1, and topic 4, respectively, and the salience results of the five topics are 1/5, 0, 1/5, 1/5, 0;
Step E: compute the similarity between sentences; with the sentence vectors of the candidate sentences learned by the PV algorithm in step B, the cosine similarity of every pair of candidate sentences is computed by formula (1);
Step F: jointly consider similarity, salience, coherence, and redundancy, and solve optimization formula (7) with the ILP framework; the resulting β_j values are [0, 0, 1, 1, 0, ..., 1]; the sentences whose dimension is 1 are chosen as summary sentences, yielding the summary sentence set.
" a kind of extraction-type abstract method based on integral linear programming with comprehensive advantage " of the invention is carried out above Detailed description, but specific implementation form of the invention is not limited thereto.Embodiment explanation is merely used to help understand this The method and its core concept of invention;At the same time, for those skilled in the art, according to the thought of the present invention, specific There will be changes in embodiment and application range, in conclusion the content of the present specification should not be construed as to of the invention Limitation.
The spirit without departing substantially from the method for the invention and in the case where scope of the claims to its carry out various aobvious and The change being clear to is all within protection scope of the present invention.

Claims (6)

1. An extractive summarization method with comprehensive advantages based on integer linear programming, characterized in that:
Related definitions are given first, as follows:
Definition 1: query, i.e., a query term; each query term is called a query, and each query is a sentence that typically represents the content the user cares about;
Definition 2: document collection; automatic summarization covers extractive summarization and abstractive summarization, and extractive summarization is further divided into query-based extractive summarization and content-based extractive summarization; both kinds of summarization involve multiple document collections; each document collection corresponds to one query; the document collection corresponding to each query is a topic set, denoted D, where D={d_i | 1≤i≤N} and N is the number of documents in D;
Definition 3: summary sentence set and candidate sentence set; in query-based extractive summarization, each query corresponds to a document collection, and the summary sentences extracted from that collection must be relevant to the query content; the extracted summary sentences together form the summary sentence set, denoted S, where S={s_j | 1≤j≤M}, M is the number of sentences in the set, and s_j is one summary sentence; since the word count of an extractive summary is limited, the condition Σ_j l(s_j) ≤ L must hold, where l(s_j) is the length of sentence s_j and L is the length limit of the summary sentence set; the candidate sentence set consists of all sentences in D, each sentence in D being one candidate summary sentence; the distributed vector representation of a sentence is called a sentence vector; a candidate sentence is composed of words, and the distributed vector representation of a word is called a word vector;
Definition 4: similar-word set, a set in which all the words are synonyms;
Definition 5: similarity; the semantic overlap and feature overlap between a sentence in the candidate sentence set and the query are together called similarity; semantic overlap is also called vector similarity, and feature overlap is the degree of coverage of noun phrases and verb phrases, also called feature similarity;
Definition 6: salience, i.e., topic salience, the proportion each topic occupies among all sentences in the candidate sentence set: the more sentences a topic has, the more salient it is;
Definition 7: coherence; the summary sentences extracted in extractive summarization must be rearranged, and coherence means the final arrangement of summary sentences is semantically and logically fluent and readable;
The extractive summarization method with comprehensive advantages based on integer linear programming comprises the following steps:
Step 1: compute the similarity between each candidate sentence and the query, specifically by computing vector similarity and feature similarity separately and adding the two;
Here, the vector similarity computation learns sentence vectors with the PV algorithm; the feature similarity computation uses noun phrases and verb phrases as features;
Here, PV is the abbreviation of paragraph vector; the PV algorithm is an unsupervised framework that learns distributed vector representations of text fragments;
Here, a text fragment may be a sentence, a paragraph, or a document, and its length is variable;
During training, the PV algorithm predicts words by continually adjusting the sentence vectors and word vectors until convergence; the sentence vectors and word vectors are obtained by training with stochastic gradient descent and backpropagation;
Feature similarity is computed with a syntactic parse tree and the K-means algorithm;
Step 2: compute the salience of the candidate sentences with the LDA algorithm;
Step 3: compute coherence; the coherence between candidate sentences is computed with mutual information;
Step 4: based on the sentence vectors learned in step 1, compute the similarity between candidate sentences;
Step 5: solve a global optimization over the comprehensive advantage composed of similarity, salience, coherence, and redundancy with integer linear programming, extract the summary sentences, and obtain the summary sentence set.
2. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that the computation of vector similarity and feature similarity in step 1 comprises the following sub-steps:
Step 1.1: feed the corpus, one sentence per line, into the PV algorithm to learn sentence vectors; vector similarity is the cosine similarity, computed by formula (1);
Here s_j denotes any candidate sentence, vec(s_j) the sentence vector of s_j, q the query, vec(q) the sentence vector of the query, and R(s_j, q) the vector similarity between s_j and the query;
Step 1.2: segment the corpus into words, learn word vectors, cluster them with K-means, and compute feature similarity; this comprises the following sub-steps:
Step 1.2.1: segment the corpus into words;
Step 1.2.2: learn word vectors from the segmented corpus with the word2vec algorithm;
Step 1.2.3: cluster the word vectors output by step 1.2.2 with the K-means algorithm to obtain similar-word sets;
Here, the clustering rule of the K-means algorithm is that word vectors close in the semantic space belong to the same set;
Step 1.2.4: compute feature similarity from the noun phrases and verb phrases, by formula (2):

Fe_j = Σ_{np∈Q} tf(np) + Σ_{vp∈Q} tf(vp)   (2)

Here Fe_j denotes the feature similarity of the j-th sentence; feature similarity specifically refers to the number of synonymous co-occurrences of noun phrases and verb phrases between the query and the candidate sentence;
Q denotes the set of clusters to which the query words belong, np denotes a noun phrase in s_j, and vp denotes a verb phrase in s_j; tf(np) denotes the frequency of noun-phrase overlap between s_j and the query; tf(vp) denotes the frequency of verb-phrase overlap between s_j and the query;
Step 1.3: compute the similarity as the sum of vector similarity and feature similarity, by formula (3):

Rele_j = R(s_j, q) + Fe_j   (3)

Here Rele_j denotes the similarity of candidate sentence s_j.
3. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 2 comprises the following sub-steps:
Step 2.1: compute the topic distribution of each candidate sentence, denoted θ;
Step 2.2: choose the highest-probability dimension of the distribution θ as the topic of the sentence, obtaining the topics of all candidate sentences;
Step 2.3: count the number of candidate sentences under each topic and normalize to obtain the topic saliences;
The salience of the i-th topic is denoted t_i.
4. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 3 comprises the following sub-steps:
Step 3.1: for any two candidate sentences s_j and s_k in the candidate sentence set, compute the mutual information of the word pairs in the two sentences and of their similar-word pairs, specifically:
For a word pair <u, v> with u ∈ s_j and v ∈ s_k, the similar-word sets obtained in step 1.2.3 are used to compute the mutual information of the pair; the mutual information P_jk<u, v> of the pair is computed by formula (4):
Here U denotes the similar-word set of word u, V the similar-word set of word v, cnt(U, V) the number of times words from the sets U and V occur in two adjacent sentences, freq(U) the frequency of the words in the set U, and freq(V) the frequency of the words in the set V;
Step 3.2: sum the mutual information of the word pairs in s_j and s_k to obtain the coherence, computed by formula (5).
5. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 4 is computed by formula (6):
Here s_j and s_k are any two sentences in the candidate sentence set, and the similarity R<s_j, s_k> is computed with cosine similarity.
6. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 5 is realized by maximizing objective function (7):

max { Σ_i α_i t_i + Σ_j β_j Rele_j + Σ_{j<k} β_{jk} c<s_j,s_k> − Σ_{j<k} β_{jk} R<s_j,s_k> }   (7)

Here the similarity Rele_j is obtained in step 1.3; the salience t_i in step 2.3; the coherence c<s_j,s_k> in step 3.2; and R<s_j,s_k> in step 4; the lower the similarity between summary sentences, the lower the redundancy;
Integer linear programming is abbreviated ILP; α_i and β_j are binary variables indicating whether topic i and candidate sentence j, respectively, are selected into the summary; t_i denotes topic salience, Rele_j the similarity of a candidate sentence, and β_{jk} the binary variable indicating whether the sentence pair <s_j,s_k> appears in the summary sentence set; while maximizing objective function (7), the following five constraints (8) to (12) must also be satisfied:

β_j Asso_ij ≤ α_i   (8)
Σ_j β_j Asso_ij ≥ α_i   (9)
β_{jk} − β_j ≤ 0;  β_{jk} − β_k ≤ 0;  β_j + β_k − β_{jk} ≤ 1   (10)
Σ_j β_j l(s_j) ≤ L   (11)

Here Asso_ij is a binary variable indicating whether the topic of sentence j is topic i; inequalities (8) and (9) guarantee that if a candidate sentence is selected, the topic containing it must also be selected into the summary sentence set, and conversely that if a topic is selected, at least one candidate sentence under it must be selected; β_k is the binary variable indicating whether candidate sentence k is selected into the summary; inequality (11) states that the length of the summary sentence set does not exceed L.
CN201810435232.8A 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming Active CN108664598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810435232.8A CN108664598B (en) 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810435232.8A CN108664598B (en) 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming

Publications (2)

Publication Number Publication Date
CN108664598A CN108664598A (en) 2018-10-16
CN108664598B true CN108664598B (en) 2019-04-02

Family

ID=63778925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810435232.8A Active CN108664598B (en) 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming

Country Status (1)

Country Link
CN (1) CN108664598B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110892400B (en) * 2019-09-23 2023-05-09 香港应用科技研究院有限公司 Method for summarizing text using sentence extraction
CN112860881A (en) * 2019-11-27 2021-05-28 北大方正集团有限公司 Abstract generation method and device, electronic equipment and storage medium
CN111159393B (en) * 2019-12-30 2023-10-10 电子科技大学 Text generation method for abstract extraction based on LDA and D2V


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150152474A1 (en) * 2012-03-09 2015-06-04 Caris Life Sciences Switzerland Holdings Gmbh Biomarker compositions and methods

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908042A (en) * 2010-08-09 2010-12-08 中国科学院自动化研究所 Tagging method of bilingual combination semantic role
CN105320642A (en) * 2014-06-30 2016-02-10 中国科学院声学研究所 Automatic abstract generation method based on concept semantic unit
CN106874362A (en) * 2016-12-30 2017-06-20 中国科学院自动化研究所 Multilingual automatic abstracting

Also Published As

Publication number Publication date
CN108664598A (en) 2018-10-16

Similar Documents

Publication Publication Date Title
US11580415B2 (en) Hierarchical multi-task term embedding learning for synonym prediction
Ling et al. Fine-grained entity recognition
CN108664598B (en) Extractive summarization method with comprehensive advantages based on integer linear programming
Chen et al. Automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features
Mukhtar et al. Effective use of evaluation measures for the validation of best classifier in Urdu sentiment analysis
Aumiller et al. Structural text segmentation of legal documents
Foxcroft et al. Name2vec: Personal names embeddings
Ma et al. Author name disambiguation in heterogeneous academic networks
Günther et al. Pre-trained web table embeddings for table discovery
Pham et al. The approach of using ontology as a pre-knowledge source for semi-supervised labelled topic model by applying text dependency graph
Franciscus et al. Word mover’s distance for agglomerative short text clustering
Wadawadagi et al. A multi-layer approach to opinion polarity classification using augmented semantic tree kernels
Mendoza et al. Benchmark for research theme classification of scholarly documents
Kong et al. Construction of microblog-specific chinese sentiment lexicon based on representation learning
Tran et al. A named entity recognition approach for tweet streams using active learning
Wu et al. Facet annotation by extending CNN with a matching strategy
Ding et al. Graph structure-aware bi-directional graph convolution model for semantic role labeling
Golubev et al. Use of augmentation and distant supervision for sentiment analysis in Russian
Hao Naive Bayesian Prediction of Japanese Annotated Corpus for Textual Semantic Word Formation Classification
Suryamukhi et al. Mining tag relationships in cqa sites
Štihec et al. Simplified hybrid approach for detection of semantic orientations in economic texts
Parkar et al. A survey paper on the latest techniques for sarcasm detection using BG method
Li et al. Nominal compound chain extraction: a new task for semantic-enriched lexical chain
Cheng et al. Improved Deep Bi-directional Transformer Keyword Extraction based on Semantic Understanding of News
Giancaterino NLP and Insurance–Workshop Results at SwissText 2022

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant