CN108664598B - Extractive summarization method with comprehensive advantages based on integer linear programming - Google Patents

Extractive summarization method with comprehensive advantages based on integer linear programming

Info

Publication number
CN108664598B
Authority
CN
China
Prior art keywords
sentence
digest
vector
word
query
Prior art date
Legal status
Active
Application number
CN201810435232.8A
Other languages
Chinese (zh)
Other versions
CN108664598A (en)
Inventor
高扬
黄河燕
魏林静
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN201810435232.8A
Publication of CN108664598A
Application granted
Publication of CN108664598B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

Disclosed herein is an extractive summarization method with comprehensive advantages based on integer linear programming, belonging to the field of natural language processing. The method first divides extractive summarization into document content learning and summary sentence extraction, and further divides document content learning into three parts: similarity, salience, and coherence. Summary sentence extraction jointly considers the learned document content and redundancy, and extracts summary sentences within an integer linear programming framework. The method learns semantic representations of sentences automatically from a corpus, computes inter-sentence similarity with simple mathematical operations, and mines similarity, salience, coherence, and redundancy in the extractive summarization task in depth, so as to build a high-quality summarization system.

Description

Extractive summarization method with comprehensive advantages based on integer linear programming
Technical field
The present invention relates to an extractive summarization method with comprehensive advantages based on integer linear programming, and belongs to the field of natural language processing.
Background art
With the rapid growth of new media, people can obtain and share information from a wide range of sources, so the number of documents on the network grows exponentially. We therefore face an unavoidable and challenging information overload problem. To alleviate it, systems are needed that can process large amounts of data in time. Search engines solve this problem to some extent: given a user query, a search engine returns a ranked list of documents or web pages. However, even search engines built on the most advanced information retrieval techniques lack the ability to integrate information from multiple sources, and therefore cannot give users a concise yet informative response. To mitigate the information overload people face, a tool is needed that can both integrate information and respond in time; these open problems have stimulated interest in automatic summarization systems.
An automatic summarization system takes a single document collection or multiple document collections as input and generates a concise, fluent text summary that retains the most important information of the source documents. Automatic summarization can essentially be viewed as an information compression process: the input single document or multiple documents are condensed into brief, concise sentences. Some information loss is inevitable in this process, so the summary must retain as much of the salient information as possible.
In multi-document summarization, the quality of a summary is mainly assessed along four dimensions: relevance, salience, coherence, and redundancy. Relevance means the content is consistent with what the user is interested in; salience means the content appears frequently in the source documents; coherence means the content is expressed logically, keeping the summary readable; redundancy means the summary contains no duplicate information. Relevance and salience are the key problems in automatic summarization, while coherence and redundancy are auxiliary criteria for building a high-quality summary.
Current automatic summarization systems have been studied intensively mainly with respect to similarity and salience. For similarity, traditional methods score sentences with surface features such as word frequency, topic words, and part of speech; such methods are simple and easy to understand but lack deep semantic understanding. Later methods learn deep semantics with vector representations, but they do not jointly consider similarity, salience, coherence, and redundancy. For salience, existing methods are mostly statistical, determining the importance of a sentence from information such as word frequency, sentence position, and concepts.
Summary of the invention
The purpose of the present invention is to solve the problem of how to jointly consider similarity, salience, coherence, and redundancy when building a high-quality summary. To this end, an extractive summarization method with comprehensive advantages based on integer linear programming is proposed. The method learns sentence vectors automatically from a corpus, uses mathematical similarity computation, and measures topic salience and inter-sentence coherence statistically, so as to build a high-quality summarization system.
The core idea of the invention is: compute similarity by combining vector similarity with feature similarity; compute salience from topic-level information; compute sentence coherence from word-pair mutual information; and finally optimize with integer linear programming while accounting for redundancy. A summary built by jointly considering similarity, salience, coherence, and redundancy is more accurate.
To achieve the above object, the present invention adopts the following technical scheme:
Related definitions are given first, as follows:
Definition 1: query, i.e., a query term; each query term is called a query, and each query is a sentence that typically represents the content the user cares about;
Definition 2: document collection; automatic summarization covers extractive summarization and abstractive summarization, and extractive summarization is further divided into query-based extractive summarization and content-based extractive summarization; both kinds of summarization involve multiple document collections; each document collection corresponds to one query; the document collection corresponding to each query is a topic set, denoted D, where D={d_i | 1≤i≤N} and N is the number of documents in D;
Definition 3: summary sentence set and candidate sentence set; in query-based extractive summarization, each query corresponds to a document collection, and the summary sentences extracted from that collection must be relevant to the query content; the extracted summary sentences together form the summary sentence set, denoted S, where S={s_j | 1≤j≤M}, M is the number of sentences in the set, and s_j is one summary sentence; since the word count of an extractive summary is limited, the condition Σ_j l(s_j) ≤ L must hold, where l(s_j) is the length of sentence s_j and L is the length limit of the summary sentence set; the candidate sentence set consists of all sentences in D, each sentence in D being one candidate summary sentence; the distributed vector representation of a sentence is called a sentence vector; a candidate sentence is composed of words, and the distributed vector representation of a word is called a word vector;
Definition 4: similar-word set, a set in which all the words are synonyms;
Definition 5: similarity; the semantic overlap and feature overlap between a sentence in the candidate sentence set and the query are together called similarity; semantic overlap is also called vector similarity, and feature overlap is the degree of coverage of noun phrases and verb phrases, also called feature similarity;
Definition 6: salience, i.e., topic salience, the proportion each topic occupies among all sentences in the candidate sentence set: the more sentences a topic has, the more salient it is;
Definition 7: coherence; the summary sentences extracted in extractive summarization must be rearranged, and coherence means the final arrangement of summary sentences is semantically and logically fluent and readable;
An extractive summarization method with comprehensive advantages based on integer linear programming comprises the following steps:
Step 1: compute the similarity between each candidate sentence and the query; first learn sentence vectors and compute vector similarity, then compute feature similarity from features, and add the two;
Here, the vector similarity computation learns sentence vectors with the PV algorithm; the feature similarity computation uses noun phrases and verb phrases as features;
Here, PV is the abbreviation of paragraph vector; the PV algorithm is an unsupervised framework that learns distributed vector representations of text fragments;
A text fragment may be a sentence, a paragraph, or a document, and its length is variable;
During training, the PV algorithm predicts words by continually adjusting the sentence vectors and word vectors until convergence; the sentence vectors and word vectors are obtained by training with stochastic gradient descent and backpropagation;
Feature similarity is computed with a syntactic parse tree and the K-means algorithm;
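As a concrete illustration of the PV training just described, the sketch below uses gensim's Doc2Vec, an implementation of the paragraph-vector algorithm; the corpus and parameters are illustrative, not taken from the patent.

```python
# A minimal sketch of PV sentence-vector learning, assuming gensim's Doc2Vec
# implementation of the paragraph-vector algorithm; data are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["the market rallied on strong earnings",
          "heavy rain caused flooding in the region"]
docs = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(corpus)]

# Sentence vectors and word vectors are adjusted jointly by SGD until the
# model converges, as described above (256-dim, as in the embodiments).
model = Doc2Vec(docs, vector_size=256, min_count=1, epochs=50)

sentence_vec = model.dv[0]                                 # learned sentence vector
query_vec = model.infer_vector("market earnings report".split())
```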
The computation of vector similarity and feature similarity comprises the following sub-steps:
Step 1.1: feed the corpus, one sentence per line, into the PV algorithm to learn sentence vectors; vector similarity is then the cosine similarity, computed by formula (1);
Here s_j denotes any candidate sentence, vec(s_j) the sentence vector of s_j, q the query, vec(q) the sentence vector of the query, and R(s_j, q) the vector similarity between s_j and the query;
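The image of formula (1) is not reproduced in this text; from the surrounding description it is the standard cosine similarity between the sentence vector of s_j and that of the query:

```latex
R(s_j, q) \;=\; \frac{vec(s_j) \cdot vec(q)}{\lVert vec(s_j) \rVert \, \lVert vec(q) \rVert} \tag{1}
```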
Step 1.2: segment the corpus into words, learn word vectors, cluster them with K-means, and compute feature similarity; this comprises the following sub-steps:
Step 1.2.1: segment the corpus into words;
Step 1.2.2: learn word vectors from the segmented corpus with the word2vec algorithm;
Step 1.2.3: cluster the word vectors output by step 1.2.2 with the K-means algorithm to obtain similar-word sets;
Here, the clustering rule of the K-means algorithm is that word vectors close in the semantic space belong to the same set;
Step 1.2.4: compute feature similarity from the noun phrases and verb phrases, by formula (2):

Fe_j = Σ_{np∈Q} tf(np) + Σ_{vp∈Q} tf(vp)   (2)

Here Fe_j denotes the feature similarity of the j-th sentence; feature similarity specifically refers to the number of synonymous co-occurrences of noun phrases and verb phrases between the query and the candidate sentence;
Q denotes the set of clusters to which the query words belong, np denotes a noun phrase in s_j, and vp denotes a verb phrase in s_j; tf(np) denotes the frequency of noun-phrase overlap between s_j and the query; tf(vp) denotes the frequency of verb-phrase overlap between s_j and the query;
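A minimal sketch of steps 1.2.2 to 1.2.4, assuming gensim and scikit-learn; phrase extraction is simplified to single tokens here (the patent uses Stanford Parser noun/verb phrases), the toy sentences and query are illustrative, and the ratio-style normalization mirrors the 3/8 worked example in embodiment 1 rather than the raw sum of formula (2).

```python
# A minimal sketch of steps 1.2.2-1.2.4, assuming gensim and scikit-learn are
# available; phrase extraction is simplified to single tokens, and all data
# are illustrative.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [["economic", "growth", "slowed"], ["gdp", "expansion", "weakened"]]
query = ["economic", "growth"]

# Step 1.2.2: learn word vectors (256-dim, as in the embodiments).
w2v = Word2Vec(sentences, vector_size=256, min_count=1, window=5)

# Step 1.2.3: cluster word vectors with K-means to get similar-word sets.
vocab = list(w2v.wv.index_to_key)
kmeans = KMeans(n_clusters=2, n_init=10).fit([w2v.wv[w] for w in vocab])
cluster_of = dict(zip(vocab, kmeans.labels_))

# Step 1.2.4: count overlap of a candidate sentence with the query's clusters,
# normalized as in the 3/8 example of embodiment 1.
def feature_similarity(candidate, query):
    query_clusters = {cluster_of[w] for w in query if w in cluster_of}
    hits = sum(1 for w in candidate if cluster_of.get(w) in query_clusters)
    return hits / max(len(candidate), 1)

print(feature_similarity(["gdp", "expansion", "weakened"], query))
```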
Step 1.3: compute the similarity as the sum of vector similarity and feature similarity, by formula (3):

Rele_j = R(s_j, q) + Fe_j   (3)

Here Rele_j denotes the similarity of candidate sentence s_j;
Step 2: compute the salience of the candidate sentences with the LDA algorithm;
The reason for using LDA is as follows: LDA is the most mature topic model developed to date; it overcomes the defects of traditional topic models and, by virtue of its probability-theoretic and Bayesian foundations, is widely used in fields such as text retrieval, text classification, image recognition, and social networks;
Step 2 comprises the following sub-steps:
Step 2.1: compute the topic distribution of each candidate sentence, denoted θ;
Step 2.2: choose the highest-probability dimension of the distribution θ as the topic of the sentence, obtaining the topics of all candidate sentences;
Step 2.3: count the number of candidate sentences under each topic and normalize to obtain the topic saliences;
The salience of the i-th topic is denoted t_i;
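A minimal sketch of step 2, assuming gensim's LdaModel; the sentences and topic count are illustrative.

```python
# A minimal sketch of step 2 (topic salience), assuming gensim's LdaModel;
# corpus contents are illustrative.
from collections import Counter
from gensim.corpora import Dictionary
from gensim.models import LdaModel

sentences = [["market", "stocks", "rise"], ["rain", "storm", "flood"],
             ["stocks", "fall", "market"], ["flood", "damage", "storm"]]
dictionary = Dictionary(sentences)
bow = [dictionary.doc2bow(s) for s in sentences]

# Step 2.1: topic distribution theta of each candidate sentence (5 topics here).
lda = LdaModel(bow, num_topics=5, id2word=dictionary, passes=10)

# Step 2.2: the argmax dimension of theta is the sentence's topic.
topic_of = [max(lda.get_document_topics(d), key=lambda p: p[1])[0] for d in bow]

# Step 2.3: normalized topic counts give the salience t_i of each topic.
counts = Counter(topic_of)
salience = {i: counts.get(i, 0) / len(sentences) for i in range(5)}
print(salience)
```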
Step 3: compute coherence; the coherence between candidate sentences is computed with mutual information, comprising the following sub-steps:
Step 3.1: for any two candidate sentences s_j and s_k in the candidate sentence set, compute the mutual information of the word pairs in the two sentences and of their similar-word pairs, specifically:
For a word pair <u, v> with u ∈ s_j and v ∈ s_k, the similar-word sets obtained in step 1.2.3 are used to compute the mutual information of the pair; the mutual information P_jk<u, v> of the pair is computed by formula (4):
Here U denotes the similar-word set of word u, V the similar-word set of word v, cnt(U, V) the number of times words from the sets U and V occur in two adjacent sentences, freq(U) the frequency of the words in the set U, and freq(V) the frequency of the words in the set V;
Step 3.2: sum the mutual information of the word pairs in s_j and s_k to obtain the coherence, computed by formula (5):
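The image of formula (5) is not reproduced in this text; from the description (the pairwise mutual-information values are summed), it has the form below, possibly up to a normalization over the number of pairs, which the source does not show:

```latex
c\langle s_j, s_k\rangle \;=\; \sum_{u \in s_j} \sum_{v \in s_k} P_{jk}\langle u, v\rangle \tag{5}
```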
Step 4: based on the sentence vectors learned in step 1, compute the similarity between candidate sentences by formula (6):
Here s_j and s_k are any two sentences in the candidate sentence set, and the similarity R<s_j, s_k> is computed with cosine similarity;
Step 5: solve a global optimization over the comprehensive advantage composed of similarity, salience, coherence, and redundancy with integer linear programming, extract the summary sentences, and obtain the summary sentence set; this is realized by maximizing objective function (7):

max { Σ_i α_i t_i + Σ_j β_j Rele_j + Σ_{j<k} β_{jk} c<s_j,s_k> − Σ_{j<k} β_{jk} R<s_j,s_k> }   (7)

Here the similarity Rele_j is obtained in step 1.3; the salience t_i in step 2.3; the coherence c<s_j,s_k> in step 3.2; and R<s_j,s_k> in step 4; the lower the similarity between summary sentences, the lower the redundancy;
Integer linear programming is abbreviated ILP; α_i and β_j are binary variables indicating whether topic i and candidate sentence j, respectively, are selected into the summary; t_i denotes topic salience, Rele_j the similarity of a candidate sentence, and β_{jk} the binary variable indicating whether the sentence pair <s_j,s_k> appears in the summary sentence set; while maximizing objective function (7), the following five constraints (8) to (12) must also be satisfied:

β_j Asso_ij ≤ α_i   (8)
Σ_j β_j Asso_ij ≥ α_i   (9)
β_{jk} − β_j ≤ 0;  β_{jk} − β_k ≤ 0;  β_j + β_k − β_{jk} ≤ 1   (10)
Σ_j β_j l(s_j) ≤ L   (11)

Here Asso_ij is a binary variable indicating whether the topic of sentence j is topic i; inequalities (8) and (9) guarantee that if a candidate sentence is selected, the topic containing it must also be selected into the summary sentence set, and conversely that if a topic is selected, at least one candidate sentence under it must be selected; β_k is the binary variable indicating whether candidate sentence k is selected into the summary; inequality (11) states that the length of the summary sentence set does not exceed L; constraint (12) is not reproduced in the source and is presumably the integrality constraint α_i, β_j, β_{jk} ∈ {0, 1};
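A minimal sketch of the step-5 optimization, assuming the PuLP solver library; all inputs (saliences, similarities, coherences, lengths, and the topic-association matrix Asso) are hypothetical toy values.

```python
# A minimal sketch of the ILP in step 5, using PuLP; all inputs are toy values.
import pulp

T, M, L = 2, 3, 12                       # topics, candidate sentences, length cap
t = [0.6, 0.4]                           # topic saliences t_i (step 2.3)
Rele = [0.9, 0.5, 0.7]                   # query similarities Rele_j (step 1.3)
c = {(0, 1): 0.3, (0, 2): 0.1, (1, 2): 0.4}   # coherences c<s_j,s_k> (step 3.2)
R = {(0, 1): 0.8, (0, 2): 0.2, (1, 2): 0.1}   # sentence similarities (step 4)
length = [6, 5, 4]                       # sentence lengths l(s_j)
Asso = [[1, 0], [1, 0], [0, 1]]          # Asso[j][i]: sentence j has topic i

prob = pulp.LpProblem("summary", pulp.LpMaximize)
alpha = [pulp.LpVariable(f"a{i}", cat="Binary") for i in range(T)]
beta = [pulp.LpVariable(f"b{j}", cat="Binary") for j in range(M)]
betajk = {(j, k): pulp.LpVariable(f"b_{j}_{k}", cat="Binary") for (j, k) in c}

# Objective (7): salience + similarity + coherence - redundancy.
prob += (pulp.lpSum(alpha[i] * t[i] for i in range(T))
         + pulp.lpSum(beta[j] * Rele[j] for j in range(M))
         + pulp.lpSum(betajk[jk] * (c[jk] - R[jk]) for jk in c))

for i in range(T):
    for j in range(M):
        prob += beta[j] * Asso[j][i] <= alpha[i]                        # (8)
    prob += pulp.lpSum(beta[j] * Asso[j][i] for j in range(M)) >= alpha[i]  # (9)
for (j, k), v in betajk.items():
    prob += v - beta[j] <= 0                                            # (10)
    prob += v - beta[k] <= 0
    prob += beta[j] + beta[k] - v <= 1
prob += pulp.lpSum(beta[j] * length[j] for j in range(M)) <= L          # (11)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(b.value()) for b in beta])    # beta_j selection vector, as in step D
```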
Thus far, through steps 1 to 5, semantically similar, topically salient, coherent, and non-redundant high-quality summary sentences have been selected, completing the extractive summarization method with comprehensive advantages based on integer linear programming.
Beneficial effects
Compared with the prior art, the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention has the following beneficial effects:
1. Vector similarity and feature similarity jointly consider deep semantics and effective features, improving the similarity between the summary sentence set and the query;
2. The topic-salience computation improves the accuracy of important-information extraction, so that the summary sentence set has better topic-space salience;
3. Computing the coherence between candidate sentences with word-pair mutual information improves the readability of the final summary sentences, so that the extracted sentences better express the content of the document collection;
4. The ILP framework jointly considers similarity, salience, coherence, and redundancy and obtains a globally optimal solution, improving the quality of the summary sentence set.
Description of the drawings
Fig. 1 is a flow chart of the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention;
Fig. 2 is a diagram of the classification results obtained with the K-means clustering algorithm in step B of embodiment 1.
Specific embodiments
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the summarization method of the present invention is further described below with reference to the accompanying drawings and embodiments.
Embodiment 1
This embodiment describes a specific implementation process of the invention, as shown in Fig. 1.
As can be seen from Fig. 1, the process of the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention is as follows:
Step A: preprocessing; in this embodiment, the corpus is split into sentences and stop words are removed. The standard data set DUC2005 is selected; it contains 50 document collections, each comprising 25-50 documents. Extractive summarization here extracts a summary sentence set of no more than 250 words from the document collection under each query. DUC2005 is in XML format; the query and the document collection are extracted from the tags <narr></narr> and <TEXT></TEXT>, respectively, and the nltk toolkit is then used to split the document collection into sentences, yielding a new one-sentence-per-line document D_new1. Stop-word removal is applied to D_new1 to obtain a new document D_new2.
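A minimal sketch of the step-A preprocessing just described, assuming nltk; the XML tag names follow the DUC2005 format described above, and the file path is illustrative.

```python
# A minimal sketch of step A, assuming nltk; the tag names follow the DUC2005
# format described above, and the file path is illustrative.
import xml.etree.ElementTree as ET
import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

root = ET.parse("duc2005_topic.xml").getroot()
query = root.findtext(".//narr").strip()
text = " ".join(t.text or "" for t in root.iter("TEXT"))

# D_new1: one sentence per line.
d_new1 = nltk.sent_tokenize(text)

# D_new2: the same sentences with stop words removed.
stops = set(stopwords.words("english"))
d_new2 = [[w for w in nltk.word_tokenize(s) if w.lower() not in stops]
          for s in d_new1]
```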
Step B: compute the similarity with the PV and K-means algorithms, the salience with the LDA algorithm, and the coherence with mutual information;
In this embodiment, the three computations (similarity with PV and K-means, salience with LDA, coherence with mutual information) run in parallel, specifically:
Similarity is computed with the PV and K-means algorithms, i.e., sentence vectors are learned with the PV algorithm. The document D_new2 is fed into the PV algorithm to obtain a 256-dimensional sentence vector for each candidate sentence; the sentence vector of one candidate sentence is [0.00150049 0.08735332 -0.10565963 0.04739858 0.18809512 0.280207 ... -0.19442209 0.17960664 0.30010329 0.06458669 0.12353758]; the sentence vector of the query is [0.16279337 0.00488725 -0.30741466 0.83172139 0.25234198 0.00017076 ... 0.30811236 -0.2949384 0.03353651 0.18530557 0.94691929]; by the cosine similarity formula (1), the vector similarity of this candidate sentence is 0.15;
Word vectors are learned with the word2vec algorithm: the document D_new2 is fed into word2vec, whose objective function is formula (13):
Here k is the window size, i the current word, and T the number of words in the corpus; 256-dimensional word vectors are learned with gradient descent;
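The image of formula (13) is not reproduced in this text; given the description (window k, current word i, corpus size T), it matches the standard word2vec skip-gram objective:

```latex
J \;=\; \frac{1}{T} \sum_{i=1}^{T} \;\sum_{-k \le j \le k,\; j \ne 0} \log p(w_{i+j} \mid w_i) \tag{13}
```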
Feature similarity requires counting noun phrases and verb phrases. Stanford Parser is used for syntactic analysis, and the words labeled np and vp are extracted; the extracted noun phrases and verb phrases are then clustered by their word vectors with the K-means algorithm into 50 categories. The feature similarity between a candidate sentence and the query is computed by counting noun phrases and verb phrases in the same category: with query = [q_1, q_2 ... q_q] and words w = [w_1, w_2 ... w_w], the classification results obtained with the K-means algorithm are shown in Fig. 2. From the classification results of Fig. 2, the overlap word frequency is 3 and the total word count is 8, so the feature similarity is 3/8;
The similarity result is the vector similarity plus the feature similarity: 0.15 + 3/8 = 0.525;
In this embodiment, salience is computed with the LDA algorithm as follows: the document D_new2 is first fed into LDA to obtain the topic distribution of each candidate sentence. For three example sentences with 5 candidate topics, the topic distributions are [0.1, 0.01, 0.5, 0.09, 0.3], [0.9, 0.01, 0.02, 0, 0.07], and [0.09, 0.02, 0.1, 0.8, 0], so the three sentences belong to topic 3, topic 1, and topic 4, respectively; the salience results of the five topics are therefore 1/5, 0, 1/5, 1/5, 0;
In this embodiment, coherence is computed with mutual information as follows: the mutual information of the word pairs in the candidate sentences is computed, and word counts yield the coherence between candidate sentences. For candidate sentences s_1 = [w_11, w_12, ..., w_1i, ..., w_1n] and s_2 = [w_21, w_22, ..., w_2i, ..., w_2m], the words of s_1 and s_2 are first paired and the mutual information of each pair is computed; for the pair <w_11, w_21>, the computation follows formula (14):
Here U and V denote the similar-word sets of w_11 and w_21, respectively; the frequency with which w_11 and V appear in adjacent sentences is 3, the frequency with which U and w_21 appear in adjacent sentences is 2, and the corpus frequency of U and V is 100, so the mutual information of the pair <w_11, w_21> is 5.001/101; the mutual information of the remaining pairs is obtained in the same way, and the coherence of s_1 and s_2 is obtained by formula (15).
Step C: compute the similarity between sentences; with the sentence vectors of the candidate sentences learned by the PV algorithm in step B, the cosine similarity of every pair of candidate sentences is computed by formula (1);
Step D: jointly consider similarity, salience, coherence, and redundancy, and solve optimization formula (7) with the ILP framework;
The resulting β_j values are [0, 1, 1, 0, 0, ..., 1]; the sentences whose dimension is 1 are chosen as summary sentences, yielding the summary sentence set.
Embodiment 2
In this embodiment, step B of embodiment 1 (computing similarity with the PV and K-means algorithms, salience with the LDA algorithm, and coherence with mutual information) is split into three sequentially executed steps, as follows:
Step A: preprocessing; in this embodiment, the corpus is split into sentences and stop words are removed. The data set DailyMail is selected; it contains 1000 document collections, each containing 802 words on average. The query and the document collection are extracted with a binary-tuple method, and the nltk toolkit is used to split the document collection into sentences, yielding a new one-sentence-per-line document D_new1; stop-word removal applied to D_new1 yields a new document D_new2;
Step B: compute similarity with the PV and K-means algorithms, i.e., learn sentence vectors with the PV algorithm. The document D_new2 is fed into the PV algorithm to obtain a 256-dimensional sentence vector for each candidate sentence; the sentence vector of one candidate sentence is [0.30150011 -0.60735332 0.00165963 0.31739858 0.11809512 0.080117 ... -0.04042209 0.12560614 0.17610322 0.06183569 0.38161758]; the sentence vector of the query is [0.01539337 0.18238734 -0.30741466 0.03572199 0.45234808 0.60017210 ... 0.80311120 -0.1038382 0.02234642 0.17560577 0.91691008]; by the cosine similarity formula (1), the vector similarity of this candidate sentence is 0.12;
Word vectors are learned with the word2vec algorithm: the document D_new2 is fed into word2vec, whose objective function is formula (13); here k is the window size, i the current word, and T the number of words in the corpus, and 256-dimensional word vectors are learned with gradient descent;
Feature similarity requires counting noun phrases and verb phrases. Stanford Parser is used for syntactic analysis, and the words labeled np and vp are extracted; the extracted noun phrases and verb phrases are then clustered by their word vectors with the K-means algorithm into 70 categories. The feature similarity between a candidate sentence and the query is computed by counting noun phrases and verb phrases in the same category: with query = [q_1, q_2 ... q_q] and words w = [w_1, w_2 ... w_w], the classification results obtained with the K-means algorithm are shown in Fig. 2; from the classification results of Fig. 2, the overlap word frequency is 3 and the total word count is 8, so the feature similarity is 3/8;
The similarity result is the vector similarity plus the feature similarity: 0.12 + 3/8 = 0.495;
Step C: compute coherence with mutual information as follows: the mutual information of the word pairs in the candidate sentences is computed, and word counts yield the coherence between candidate sentences. For candidate sentences s_1 = [w_11, w_12, ..., w_1i, ..., w_1n] and s_2 = [w_21, w_22, ..., w_2i, ..., w_2m], the words of s_1 and s_2 are first paired and the mutual information of each pair is computed; for the pair <w_11, w_21>, the computation follows formula (14);
here U and V denote the similar-word sets of w_11 and w_21, respectively; the frequency with which w_11 and V appear in adjacent sentences is 3, the frequency with which U and w_21 appear in adjacent sentences is 2, and the corpus frequency of U and V is 100, so the mutual information of the pair <w_11, w_21> is 5.001/101; the mutual information of the remaining pairs is obtained in the same way, and the coherence of s_1 and s_2 is obtained by formula (15);
Step D: compute salience with the LDA algorithm as follows: the document D_new2 is first fed into LDA to obtain the topic distribution of each candidate sentence. For three example sentences with 5 candidate topics, the topic distributions are [0.1, 0.01, 0.5, 0.09, 0.3], [0.9, 0.01, 0.02, 0, 0.07], and [0.09, 0.02, 0.1, 0.8, 0], so the three sentences belong to topic 3, topic 1, and topic 4, respectively, and the salience results of the five topics are 1/5, 0, 1/5, 1/5, 0;
Step E: compute the similarity between sentences; with the sentence vectors of the candidate sentences learned by the PV algorithm in step B, the cosine similarity of every pair of candidate sentences is computed by formula (1);
Step F: jointly consider similarity, salience, coherence, and redundancy, and solve optimization formula (7) with the ILP framework; the resulting β_j values are [0, 0, 1, 1, 0, ..., 1]; the sentences whose dimension is 1 are chosen as summary sentences, yielding the summary sentence set.
" a kind of extraction-type abstract method based on integral linear programming with comprehensive advantage " of the invention is carried out above Detailed description, but specific implementation form of the invention is not limited thereto.Embodiment explanation is merely used to help understand this The method and its core concept of invention;At the same time, for those skilled in the art, according to the thought of the present invention, specific There will be changes in embodiment and application range, in conclusion the content of the present specification should not be construed as to of the invention Limitation.
The spirit without departing substantially from the method for the invention and in the case where scope of the claims to its carry out various aobvious and The change being clear to is all within protection scope of the present invention.

Claims (6)

1. An extractive summarization method with comprehensive advantages based on integer linear programming, characterized in that:
Related definitions are given first, as follows:
Definition 1: query, i.e., a query term; each query term is called a query, and each query is a sentence that typically represents the content the user cares about;
Definition 2: document collection; automatic summarization covers extractive summarization and abstractive summarization, and extractive summarization is further divided into query-based extractive summarization and content-based extractive summarization; both kinds of summarization involve multiple document collections; each document collection corresponds to one query; the document collection corresponding to each query is a topic set, denoted D, where D={d_i | 1≤i≤N} and N is the number of documents in D;
Definition 3: summary sentence set and candidate sentence set; in query-based extractive summarization, each query corresponds to a document collection, and the summary sentences extracted from that collection must be relevant to the query content; the extracted summary sentences together form the summary sentence set, denoted S, where S={s_j | 1≤j≤M}, M is the number of sentences in the set, and s_j is one summary sentence; since the word count of an extractive summary is limited, the condition Σ_j l(s_j) ≤ L must hold, where l(s_j) is the length of sentence s_j and L is the length limit of the summary sentence set; the candidate sentence set consists of all sentences in D, each sentence in D being one candidate summary sentence; the distributed vector representation of a sentence is called a sentence vector; a candidate sentence is composed of words, and the distributed vector representation of a word is called a word vector;
Definition 4: similar-word set, a set in which all the words are synonyms;
Definition 5: similarity; the semantic overlap and feature overlap between a sentence in the candidate sentence set and the query are together called similarity; semantic overlap is also called vector similarity, and feature overlap is the degree of coverage of noun phrases and verb phrases, also called feature similarity;
Definition 6: salience, i.e., topic salience, the proportion each topic occupies among all sentences in the candidate sentence set: the more sentences a topic has, the more salient it is;
Definition 7: coherence; the summary sentences extracted in extractive summarization must be rearranged, and coherence means the final arrangement of summary sentences is semantically and logically fluent and readable;
The extractive summarization method with comprehensive advantages based on integer linear programming comprises the following steps:
Step 1: compute the similarity between each candidate sentence and the query, specifically by computing vector similarity and feature similarity separately and adding the two;
Here, the vector similarity computation learns sentence vectors with the PV algorithm; the feature similarity computation uses noun phrases and verb phrases as features;
Here, PV is the abbreviation of paragraph vector; the PV algorithm is an unsupervised framework that learns distributed vector representations of text fragments;
Here, a text fragment may be a sentence, a paragraph, or a document, and its length is variable;
During training, the PV algorithm predicts words by continually adjusting the sentence vectors and word vectors until convergence; the sentence vectors and word vectors are obtained by training with stochastic gradient descent and backpropagation;
Feature similarity is computed with a syntactic parse tree and the K-means algorithm;
Step 2: compute the salience of the candidate sentences with the LDA algorithm;
Step 3: compute coherence; the coherence between candidate sentences is computed with mutual information;
Step 4: based on the sentence vectors learned in step 1, compute the similarity between candidate sentences;
Step 5: solve a global optimization over the comprehensive advantage composed of similarity, salience, coherence, and redundancy with integer linear programming, extract the summary sentences, and obtain the summary sentence set.
2. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that the computation of vector similarity and feature similarity in step 1 comprises the following sub-steps:
Step 1.1: feed the corpus, one sentence per line, into the PV algorithm to learn sentence vectors; vector similarity is the cosine similarity, computed by formula (1);
Here s_j denotes any candidate sentence, vec(s_j) the sentence vector of s_j, q the query, vec(q) the sentence vector of the query, and R(s_j, q) the vector similarity between s_j and the query;
Step 1.2: segment the corpus into words, learn word vectors, cluster them with K-means, and compute feature similarity; this comprises the following sub-steps:
Step 1.2.1: segment the corpus into words;
Step 1.2.2: learn word vectors from the segmented corpus with the word2vec algorithm;
Step 1.2.3: cluster the word vectors output by step 1.2.2 with the K-means algorithm to obtain similar-word sets;
Here, the clustering rule of the K-means algorithm is that word vectors close in the semantic space belong to the same set;
Step 1.2.4: compute feature similarity from the noun phrases and verb phrases, by formula (2):

Fe_j = Σ_{np∈Q} tf(np) + Σ_{vp∈Q} tf(vp)   (2)

Here Fe_j denotes the feature similarity of the j-th sentence; feature similarity specifically refers to the number of synonymous co-occurrences of noun phrases and verb phrases between the query and the candidate sentence;
Q denotes the set of clusters to which the query words belong, np denotes a noun phrase in s_j, and vp denotes a verb phrase in s_j; tf(np) denotes the frequency of noun-phrase overlap between s_j and the query; tf(vp) denotes the frequency of verb-phrase overlap between s_j and the query;
Step 1.3: compute the similarity as the sum of vector similarity and feature similarity, by formula (3):

Rele_j = R(s_j, q) + Fe_j   (3)

Here Rele_j denotes the similarity of candidate sentence s_j.
3. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 2 comprises the following sub-steps:
Step 2.1: compute the topic distribution of each candidate sentence, denoted θ;
Step 2.2: choose the highest-probability dimension of the distribution θ as the topic of the sentence, obtaining the topics of all candidate sentences;
Step 2.3: count the number of candidate sentences under each topic and normalize to obtain the topic saliences;
The salience of the i-th topic is denoted t_i.
4. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 3 comprises the following sub-steps:
Step 3.1: for any two candidate sentences s_j and s_k in the candidate sentence set, compute the mutual information of the word pairs in the two sentences and of their similar-word pairs, specifically:
For a word pair <u, v> with u ∈ s_j and v ∈ s_k, the similar-word sets obtained in step 1.2.3 are used to compute the mutual information of the pair; the mutual information P_jk<u, v> of the pair is computed by formula (4):
Here U denotes the similar-word set of word u, V the similar-word set of word v, cnt(U, V) the number of times words from the sets U and V occur in two adjacent sentences, freq(U) the frequency of the words in the set U, and freq(V) the frequency of the words in the set V;
Step 3.2: sum the mutual information of the word pairs in s_j and s_k to obtain the coherence, computed by formula (5).
5. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 4 is computed by formula (6):
Here s_j and s_k are any two sentences in the candidate sentence set, and the similarity R<s_j, s_k> is computed with cosine similarity.
6. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 5 is realized by maximizing objective function (7):

max { Σ_i α_i t_i + Σ_j β_j Rele_j + Σ_{j<k} β_{jk} c<s_j,s_k> − Σ_{j<k} β_{jk} R<s_j,s_k> }   (7)

Here the similarity Rele_j is obtained in step 1.3; the salience t_i in step 2.3; the coherence c<s_j,s_k> in step 3.2; and R<s_j,s_k> in step 4; the lower the similarity between summary sentences, the lower the redundancy;
Integer linear programming is abbreviated ILP; α_i and β_j are binary variables indicating whether topic i and candidate sentence j, respectively, are selected into the summary; t_i denotes topic salience, Rele_j the similarity of a candidate sentence, and β_{jk} the binary variable indicating whether the sentence pair <s_j,s_k> appears in the summary sentence set; while maximizing objective function (7), the following five constraints (8) to (12) must also be satisfied:

β_j Asso_ij ≤ α_i   (8)
Σ_j β_j Asso_ij ≥ α_i   (9)
β_{jk} − β_j ≤ 0;  β_{jk} − β_k ≤ 0;  β_j + β_k − β_{jk} ≤ 1   (10)
Σ_j β_j l(s_j) ≤ L   (11)

Here Asso_ij is a binary variable indicating whether the topic of sentence j is topic i; inequalities (8) and (9) guarantee that if a candidate sentence is selected, the topic containing it must also be selected into the summary sentence set, and conversely that if a topic is selected, at least one candidate sentence under it must be selected; β_k is the binary variable indicating whether candidate sentence k is selected into the summary; inequality (11) states that the length of the summary sentence set does not exceed L.
CN201810435232.8A 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming Active CN108664598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810435232.8A CN108664598B (en) 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810435232.8A CN108664598B (en) 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming

Publications (2)

Publication Number Publication Date
CN108664598A CN108664598A (en) 2018-10-16
CN108664598B true CN108664598B (en) 2019-04-02

Family

ID=63778925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810435232.8A Active CN108664598B (en) 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming

Country Status (1)

Country Link
CN (1) CN108664598B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110892400B (en) * 2019-09-23 2023-05-09 香港应用科技研究院有限公司 Method for summarizing text using sentence extraction
CN112860881A (en) * 2019-11-27 2021-05-28 北大方正集团有限公司 Abstract generation method and device, electronic equipment and storage medium
CN111159393B (en) * 2019-12-30 2023-10-10 电子科技大学 Text generation method for abstract extraction based on LDA and D2V


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150152474A1 (en) * 2012-03-09 2015-06-04 Caris Life Sciences Switzerland Holdings Gmbh Biomarker compositions and methods

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908042A (en) * 2010-08-09 2010-12-08 中国科学院自动化研究所 Tagging method of bilingual combination semantic role
CN105320642A (en) * 2014-06-30 2016-02-10 中国科学院声学研究所 Automatic abstract generation method based on concept semantic unit
CN106874362A (en) * 2016-12-30 2017-06-20 中国科学院自动化研究所 Multilingual automatic abstracting

Also Published As

Publication number Publication date
CN108664598A (en) 2018-10-16

Similar Documents

Publication Publication Date Title
US11580415B2 (en) Hierarchical multi-task term embedding learning for synonym prediction
Ling et al. Fine-grained entity recognition
CN108664598B (en) Extractive summarization method with comprehensive advantages based on integer linear programming
Chen et al. Automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features
Mukhtar et al. Effective use of evaluation measures for the validation of best classifier in Urdu sentiment analysis
Aumiller et al. Structural text segmentation of legal documents
Foxcroft et al. Name2vec: Personal names embeddings
Ma et al. Author name disambiguation in heterogeneous academic networks
Günther et al. Pre-trained web table embeddings for table discovery
Pham et al. The approach of using ontology as a pre-knowledge source for semi-supervised labelled topic model by applying text dependency graph
Franciscus et al. Word mover’s distance for agglomerative short text clustering
Wadawadagi et al. A multi-layer approach to opinion polarity classification using augmented semantic tree kernels
Mendoza et al. Benchmark for research theme classification of scholarly documents
Kong et al. Construction of microblog-specific chinese sentiment lexicon based on representation learning
Tran et al. A named entity recognition approach for tweet streams using active learning
Wu et al. Facet annotation by extending CNN with a matching strategy
Ding et al. Graph structure-aware bi-directional graph convolution model for semantic role labeling
Golubev et al. Use of augmentation and distant supervision for sentiment analysis in Russian
Hao Naive Bayesian Prediction of Japanese Annotated Corpus for Textual Semantic Word Formation Classification
Suryamukhi et al. Mining tag relationships in cqa sites
Štihec et al. Simplified hybrid approach for detection of semantic orientations in economic texts
Parkar et al. A survey paper on the latest techniques for sarcasm detection using BG method
Li et al. Nominal compound chain extraction: a new task for semantic-enriched lexical chain
Cheng et al. Improved Deep Bi-directional Transformer Keyword Extraction based on Semantic Understanding of News
Giancaterino NLP and Insurance–Workshop Results at SwissText 2022

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant