CN108664598A - Extractive summarization method with comprehensive advantages based on integer linear programming - Google Patents

Extractive summarization method with comprehensive advantages based on integer linear programming

Info

Publication number
CN108664598A
Authority
CN
China
Prior art keywords
sentence
summary
vector
word
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810435232.8A
Other languages
Chinese (zh)
Other versions
CN108664598B (en)
Inventor
高扬
黄河燕
魏林静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810435232.8A priority Critical patent/CN108664598B/en
Publication of CN108664598A publication Critical patent/CN108664598A/en
Application granted granted Critical
Publication of CN108664598B publication Critical patent/CN108664598B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed herein is an extractive summarization method with comprehensive advantages based on integer linear programming, belonging to the field of natural language processing. The method divides extractive summarization into document content learning and summary sentence extraction, and further divides document content learning into three parts: similarity, salience, and coherence. Summary sentence extraction takes both the learned content and redundancy into account, and selects summary sentences within an integer linear programming framework. The method learns semantic representations of sentences automatically from a corpus, computes the similarity between sentences with simple mathematical operations, and deeply mines the salience, similarity, coherence, and redundancy of the extractive summarization task so as to construct a high-quality summarization system.

Description

Extractive summarization method with comprehensive advantages based on integer linear programming
Technical field
The present invention relates to an extractive summarization method with comprehensive advantages based on integer linear programming, and belongs to the field of natural language processing.
Background technology
With the rapid growth of new-media information, people can obtain and share information from a wide range of sources, so the number of documents on the network grows exponentially. We therefore face an unavoidable and challenging information-overload problem. To alleviate it, we need systems that can digest all kinds of data in a timely manner. Search engines solve this problem to some extent: given a specified query, a search engine returns a ranked list of documents or web pages. However, even search engines built on the most advanced information-retrieval techniques lack the ability to integrate information from multiple sources, and thus cannot give users a concise yet informative response. To mitigate the information overload people face, a tool is needed that can integrate information and respond promptly; these open problems have stimulated interest in automatic summarization systems.
An automatic summarization system takes a single document collection or multiple document collections as input and produces a concise, fluent text summary that retains the most important information of the source documents. Automatic summarization can essentially be viewed as information compression: the input single document or multiple documents are condensed into brief, succinct sentences. Some information loss is inevitable in this process, so the summary must retain as much of the salient information as possible.
In multi-document summarization, the quality of a summary is mainly assessed along four dimensions: relevance, salience, coherence, and redundancy. Relevance means that the content matches what the user is interested in; salience means that the content occurs frequently in the source documents; coherence means that the content is expressed logically, which keeps the summary readable; redundancy means that no duplicated information appears in the summary. Relevance and salience are the key problems in automatic summarization, while coherence and redundancy are auxiliary indicators for constructing a high-quality summary.
Existing automatic summarization methods have mainly studied similarity and salience. For similarity, traditional methods score sentences with features such as term frequency, topic words, and part of speech; such methods are simple and easy to understand but lack deep semantic understanding. Later methods learn deep semantics with vector representations, but they do not jointly consider similarity, salience, coherence, and redundancy. For salience, existing methods are mostly statistical, determining the importance of a sentence from information such as term frequency, sentence position, and concepts.
Summary of the invention
The purpose of the present invention is to solve the problem of how to jointly consider similarity, salience, coherence, and redundancy when constructing a high-quality summary. To this end, an extractive summarization method with comprehensive advantages based on integer linear programming is proposed. The method learns sentence vectors automatically from a corpus, computes similarity mathematically, and measures topic salience and inter-sentence coherence statistically, thereby building a high-quality summarization system.
The core idea of the invention is: compute similarity by combining vector similarity with feature similarity; compute salience from topic-level information; compute sentence coherence from word-pair mutual information; and finally take redundancy into account by solving a global optimization with integer linear programming, so that the summary constructed from similarity, salience, coherence, and redundancy together is more accurate.
To achieve the above purpose, the present invention adopts the following technical scheme.
Related definitions are given first, as follows:
Definition 1: query, i.e., a query term. Each query term is called a query; each query is a sentence and typically represents the content the user cares about.
Definition 2: document collection. Automatic summarization includes extractive summarization and abstractive summarization, and extractive summarization is further divided into query-based and content-based extractive summarization. Both kinds of summarization involve multiple document collections, and each document collection corresponds to one query. The document collection corresponding to a query is a topic set, denoted D, with D = {d_i | 1 ≤ i ≤ N}, where N is the number of documents in D.
Definition 3: summary sentence set and candidate sentence set. In query-based extractive summarization, each query corresponds to one document collection, and the summary sentences extracted from the collection must be relevant to the query content. The extracted summary sentences form the summary sentence set, denoted S, with S = {s_j | 1 ≤ j ≤ M}, where M is the number of sentences in the set and s_j is one summary sentence. Since the word count of an extractive summary is limited, the condition \sum_j l(s_j) \le L must hold, where l(s_j) is the length of sentence s_j and L is the length limit of the summary sentence set. The candidate sentence set consists of all sentences in D; each sentence in D is called a candidate summary sentence, and its distributed vector representation is also called a sentence vector. A candidate sentence is composed of words, whose distributed vector representations are called word vectors.
Definition 4: similar-word set, a set in which all words are synonyms.
Definition 5: similarity. The semantic overlap and the feature overlap between a candidate sentence and the query are collectively called similarity. Semantic overlap is also called vector similarity; feature overlap is the coverage of noun phrases and verb phrases, also called feature similarity.
Definition 6: salience, i.e., topic salience, the proportion of each topic over all candidate sentences: the more sentences a topic contains, the more salient the topic.
Definition 7: coherence. In extractive summarization the extracted sentences must be rearranged; coherence means that the finally arranged summary sentences are semantically and logically fluent and readable.
An extractive summarization method with comprehensive advantages based on integer linear programming comprises the following steps:
Step 1: compute the similarity between each candidate sentence and the query. First learn sentence vectors and compute the vector similarity, then compute the feature similarity from features, and add the two.
The vector similarity is computed with sentence vectors learned by the PV algorithm; the feature similarity uses noun phrases and verb phrases as features.
Here, PV is short for paragraph vector. The PV algorithm is an unsupervised framework that learns distributed vectors for pieces of text; a text piece can be a sentence, a paragraph, or a document, and its length is variable.
During training, the PV algorithm keeps adjusting the sentence vectors and word vectors to predict words until it converges; the sentence vectors and word vectors are obtained by stochastic gradient descent and backpropagation.
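As an illustration only, the PV step can be sketched with gensim's Doc2Vec, an implementation of the paragraph-vector model; the corpus file name and most hyperparameters below are assumptions rather than values prescribed by the patent (only the 256-dimensional vectors follow the embodiments):

```python
# Minimal paragraph-vector (PV) sketch using gensim's Doc2Vec.
# Assumptions: corpus.txt holds one candidate sentence per line (see step 1.1);
# window/min_count/epochs are illustrative; vector_size=256 follows the embodiments.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

with open("corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

tagged = [TaggedDocument(words=s, tags=[i]) for i, s in enumerate(sentences)]

# Sentence and word vectors are trained jointly by SGD with backpropagation.
model = Doc2Vec(tagged, vector_size=256, window=5, min_count=2, epochs=40)

sent_vec = model.dv[0]                                         # a candidate sentence vector
query_vec = model.infer_vector("example query words".split())  # a query vector
```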
The feature similarity is computed with a syntactic parse tree and the K-means algorithm.
The computation of the vector similarity and the feature similarity comprises the following sub-steps:
Step 1.1: arrange the corpus one sentence per line and feed it into the PV algorithm to learn sentence vectors; the vector similarity is then the cosine similarity of formula (1):

R(s_j, q) = \frac{vec(s_j) \cdot vec(q)}{\|vec(s_j)\| \, \|vec(q)\|}   (1)

where s_j is any candidate sentence, vec(s_j) is the sentence vector of s_j, q is the query, vec(q) is the sentence vector of the query, and R(s_j, q) is the vector similarity between s_j and the query; a minimal numeric sketch follows.
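Formula (1) is plain cosine similarity; a minimal sketch, reusing sent_vec and query_vec from the snippet above:

```python
# Vector similarity of formula (1): cosine of a candidate sentence vector and the query vector.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

R_sj_q = cosine(sent_vec, query_vec)  # e.g. 0.15 in embodiment 1
```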
Step 1.2: segment the corpus into words, learn word vectors, cluster them with K-means, and compute the feature similarity, through the following sub-steps:
Step 1.2.1: segment the corpus into words.
Step 1.2.2: learn word vectors from the segmented corpus with the word2vec algorithm.
Step 1.2.3: cluster the word vectors produced by step 1.2.2 with the K-means algorithm to obtain similar-word sets.
The clustering rule of the K-means algorithm is that word vectors that are close in the semantic space belong to the same set, as sketched below.
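A minimal sketch of sub-steps 1.2.2 and 1.2.3, assuming gensim's Word2Vec and scikit-learn's KMeans as concrete implementations; the cluster count of 50 follows embodiment 1, the remaining settings are assumptions:

```python
# Step 1.2.2: learn word vectors; step 1.2.3: K-means puts word vectors that are
# close in the semantic space into the same similar-word set.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

w2v = Word2Vec(sentences, vector_size=256, window=5, min_count=2)
vocab = list(w2v.wv.index_to_key)
X = np.stack([w2v.wv[w] for w in vocab])

labels = KMeans(n_clusters=50, n_init=10).fit_predict(X)

cluster_of = dict(zip(vocab, labels))  # word -> similar-word set id
similar_sets = {}                      # set id -> words in that set
for word, lab in cluster_of.items():
    similar_sets.setdefault(lab, set()).add(word)
```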
Step 1.2.4: compute the feature similarity from noun phrases and verb phrases by formula (2):

Fe_j = \sum_{np \in Q} tf(np) + \sum_{vp \in Q} tf(vp)   (2)

where Fe_j is the feature similarity of the j-th sentence; feature similarity specifically refers to the number of synonym co-occurrences of noun phrases and verb phrases between the query and the candidate sentence. Q is the set of clusters to which the query words belong, np is a noun phrase in s_j, and vp is a verb phrase in s_j; tf(np) is the overlapping frequency of s_j's noun phrases with the query, and tf(vp) is the overlapping frequency of s_j's verb phrases with the query. A counting sketch follows.
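A hedged sketch of formula (2): noun and verb phrases are treated here as single tokens and matched to the query through shared K-means clusters; the helper name and the example inputs are illustrative assumptions:

```python
# Feature similarity Fe_j: count phrase occurrences of sentence s_j whose cluster
# coincides with a cluster of the query words (the synonym co-occurrence of formula (2)).
from collections import Counter

def feature_similarity(sentence_phrases, query_words, cluster_of):
    query_clusters = {cluster_of[w] for w in query_words if w in cluster_of}
    tf = Counter(sentence_phrases)  # tf(np) and tf(vp) merged into one counter
    return sum(cnt for phrase, cnt in tf.items()
               if cluster_of.get(phrase) in query_clusters)

# Hypothetical phrases/words, for illustration only:
Fe_j = feature_similarity(["oil", "price", "rise"], ["petroleum", "cost"], cluster_of)
```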
Step 1.3: compute the similarity by adding the vector similarity and the feature similarity, as in formula (3):

Rele_j = R(s_j, q) + Fe_j   (3)

where Rele_j denotes the similarity of candidate sentence s_j.
Step 2: compute the salience of the candidate sentences with the LDA algorithm.
LDA is used because it is among the most mature topic models developed to date: it overcomes the defects of traditional topic models and, resting on probability theory and Bayesian theory, has been widely applied in fields such as text retrieval, text classification, image recognition, and social networks.
Step 2 comprises the following sub-steps:
Step 2.1: compute the topic distribution of each candidate sentence, denoted θ.
Step 2.2: take the dimension with the largest probability in θ as the topic of the sentence, obtaining the topics of all candidate sentences.
Step 2.3: count the number of candidate sentences under each topic and normalize to obtain the topic salience, where the salience of the i-th topic is denoted t_i. A sketch of this step is given below.
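A sketch of step 2 using gensim's LdaModel; the topic count of 5 follows the worked examples, and normalizing by the number of candidate sentences is one plausible reading of step 2.3, not a detail confirmed by the patent:

```python
# Step 2.1: per-sentence topic distribution theta; step 2.2: argmax topic;
# step 2.3: normalized per-topic sentence counts give the salience t_i.
from collections import Counter
from gensim import corpora
from gensim.models import LdaModel

dictionary = corpora.Dictionary(sentences)
bows = [dictionary.doc2bow(s) for s in sentences]
lda = LdaModel(bows, num_topics=5, id2word=dictionary, passes=10)

topic_of = [max(lda.get_document_topics(b, minimum_probability=0.0),
                key=lambda p: p[1])[0] for b in bows]
counts = Counter(topic_of)
t = [counts[i] / len(sentences) for i in range(5)]  # salience t_i
```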
Step 3: compute the coherence between candidate sentences with mutual information, through the following sub-steps:
Step 3.1: for any two candidate sentences s_j and s_k in the candidate sentence set, compute the mutual information of their word pairs and similar-word pairs. Specifically, for a word pair <u, v> with u ∈ s_j and v ∈ s_k, use the similar-word sets obtained in step 1.2.3 to compute the word-pair mutual information P_jk<u, v> by formula (4), where U is the similar-word set of word u, V is the similar-word set of word v, cnt(U, V) is the number of times words from U and V occur in two adjacent sentences, freq(U) is the word frequency of the U set, and freq(V) is the word frequency of the V set.
Step 3.2: add the mutual information of the word pairs of s_j and s_k to obtain the coherence, as in formula (5):

c\langle s_j, s_k \rangle = \sum_{u \in s_j} \sum_{v \in s_k} P_{jk}\langle u, v \rangle   (5)

A sketch of both sub-steps follows.
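Since formula (4) itself is not reproduced in this text, the sketch below assumes a PMI-style ratio of adjacent-sentence co-occurrence counts to set frequencies; cnt_adjacent and freq are assumed precomputed corpus statistics, so treat this as one plausible realization rather than the patent's exact formula:

```python
# Step 3.1: score a cross-sentence word pair <u, v> through its similar-word sets
# (assumed PMI-style ratio); step 3.2: formula (5) sums the pair scores.
def pair_mutual_info(u, v, cluster_of, cnt_adjacent, freq):
    U, V = cluster_of.get(u), cluster_of.get(v)
    if U is None or V is None:
        return 0.0
    return cnt_adjacent.get((U, V), 0) / (freq.get(U, 0) * freq.get(V, 0) + 1e-9)

def coherence(sent_j, sent_k, cluster_of, cnt_adjacent, freq):
    return sum(pair_mutual_info(u, v, cluster_of, cnt_adjacent, freq)
               for u in sent_j for v in sent_k)  # formula (5)
```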
Step 4: based on the sentence vectors learned in step 1, compute the similarity between candidate sentences by formula (6):

R\langle s_j, s_k \rangle = \frac{vec(s_j) \cdot vec(s_k)}{\|vec(s_j)\| \, \|vec(s_k)\|}   (6)

where s_j and s_k are any two sentences in the candidate sentence set, and the similarity R<s_j, s_k> is computed with cosine similarity.
Step 5: perform global optimization over the comprehensive advantage composed of similarity, salience, coherence, and redundancy with integer linear programming, and extract summary sentences to obtain the summary sentence set, specifically by maximizing objective function (7):

max { \sum_i \alpha_i t_i + \sum_j \beta_j Rele_j + \sum_{j<k} \beta_{jk} c\langle s_j, s_k \rangle - \sum_{j<k} \beta_{jk} R\langle s_j, s_k \rangle }   (7)

where the similarity Rele_j is obtained in step 1.3, the salience t_i in step 2.3, the coherence c<s_j, s_k> in step 3.2, and R<s_j, s_k> in step 4; the lower the similarity between summary sentences, the lower the redundancy.
Integer linear programming is abbreviated ILP. α_i and β_j are binary variables indicating whether topic i and candidate sentence j are selected into the summary, t_i is the topic salience, Rele_j is the similarity of the candidate sentence, and β_jk is a binary variable indicating whether the candidate sentence pair <s_j, s_k> appears in the summary sentence set simultaneously. While maximizing objective (7), the following five constraints (8) to (12) must hold:

β_j Asso_ij ≤ α_i   (8)
\sum_j β_j Asso_ij ≥ α_i   (9)
β_jk − β_j ≤ 0;  β_jk − β_k ≤ 0;  β_j + β_k − β_jk ≤ 1   (10)
\sum_j β_j l(s_j) ≤ L   (11)
α_i, β_j, β_jk ∈ {0, 1}   (12)

where Asso_ij is a binary variable indicating whether the topic of sentence j coincides with topic i. Inequalities (8) and (9) guarantee that if a candidate sentence is selected then the topic it belongs to is also selected into the summary, and conversely that if a topic is selected then at least one candidate sentence under that topic is selected. β_k is the binary variable indicating whether candidate sentence k is selected into the summary, and inequality (11) states that the length of the summary sentence set does not exceed L. A solver sketch follows.
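A sketch of step 5 with the PuLP solver, encoding objective (7) and constraints (8) to (12); the scores Rele, c, R, the sentence lengths, and topic_of are assumed to come from the preceding steps, and L = 250 follows embodiment 1:

```python
# ILP of step 5: select topics (alpha), sentences (beta), and sentence pairs (betajk)
# to maximize objective (7) under constraints (8)-(11); Binary variables give (12).
import pulp

M, T = len(sentences), 5
prob = pulp.LpProblem("summary_ilp", pulp.LpMaximize)
alpha = [pulp.LpVariable(f"a{i}", cat="Binary") for i in range(T)]
beta = [pulp.LpVariable(f"b{j}", cat="Binary") for j in range(M)]
betajk = {(j, k): pulp.LpVariable(f"p{j}_{k}", cat="Binary")
          for j in range(M) for k in range(j + 1, M)}

prob += (pulp.lpSum(alpha[i] * t[i] for i in range(T))
         + pulp.lpSum(beta[j] * Rele[j] for j in range(M))
         + pulp.lpSum(v * (c[j][k] - R[j][k]) for (j, k), v in betajk.items()))  # (7)

for j in range(M):
    prob += beta[j] <= alpha[topic_of[j]]                    # (8): Asso_ij = 1 for j's topic
for i in range(T):
    prob += pulp.lpSum(beta[j] for j in range(M)
                       if topic_of[j] == i) >= alpha[i]      # (9)
for (j, k), v in betajk.items():                             # (10): v = 1 iff both selected
    prob += v <= beta[j]
    prob += v <= beta[k]
    prob += beta[j] + beta[k] - v <= 1
prob += pulp.lpSum(beta[j] * length[j] for j in range(M)) <= 250  # (11), L = 250 words

prob.solve()
summary = [j for j in range(M) if beta[j].value() == 1]
```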
Through steps 1 to 5, summary sentences that are semantically similar to the query, topically salient, coherent, and free of redundancy have been selected, completing the extractive summarization method with comprehensive advantages based on integer linear programming.
Advantageous effects
Compared with the prior art, the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention has the following advantages:
1. Vector similarity and feature similarity jointly capture deep semantics and effective features, improving the similarity between the summary sentence set and the query;
2. The computation of topic salience improves the accuracy of extracting important information, so that the summary sentence set better covers the salient topics of the topic space;
3. Computing the coherence between candidate sentences from word-pair mutual information improves the readability of the final summary sentences, so that the extracted sentences better express the content of the document collection;
4. The ILP framework jointly considers similarity, salience, coherence, and redundancy and finds a globally optimal solution, improving the quality of the summary sentence set.
Description of the drawings
Fig. 1 is the flow chart of the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention;
Fig. 2 is the clustering result diagram obtained after the K-means clustering algorithm in step B of embodiment 1.
Specific embodiments
To make the purpose, technical scheme, and advantages of the present invention clearer, the summarization method of the present invention is further described below with reference to the drawings and embodiments.
Embodiment 1
This embodiment describes a specific implementation of the present invention, as shown in Fig. 1.
As can be seen from Fig. 1, the flow of the extractive summarization method with comprehensive advantages based on integer linear programming of the present invention is as follows:
Step A, preprocessing. In this embodiment, the corpus is split into sentences and stop words are removed. The standard dataset DUC2005 is selected; it contains 50 document collections, each with 25-50 documents. Extractive summarization here means extracting a summary sentence set of no more than 250 words from the document collection under each query. The DUC2005 data are in XML format; the query and the document collection are extracted from the <narr></narr> and <TEXT></TEXT> tags respectively, and the nltk toolkit is used to split the document collection into sentences, yielding a new one-sentence-per-line document D_new1. Stop-word removal is applied to D_new1 to obtain a new document D_new2.
Step B, compute similarity with the PV and K-means algorithms, compute salience with the LDA algorithm, and compute coherence with mutual information; in this embodiment the three are computed in parallel, specifically:
Compute similarity with the PV and K-means algorithms, i.e., learn sentence vectors with the PV algorithm. Document D_new2 is input into the PV algorithm to obtain the 256-dimensional sentence vector of each candidate sentence. The sentence vector of one candidate sentence is [0.00150049 0.08735332 -0.10565963 0.04739858 0.18809512 0.280207 ... -0.19442209 0.17960664 0.30010329 0.06458669 0.12353758]; the sentence vector of the query is [0.16279337 0.00488725 -0.30741466 0.83172139 0.25234198 0.00017076 ... 0.30811236 -0.2949384 0.03353651 0.18530557 0.94691929]. According to the cosine similarity of formula (1), the vector similarity of this candidate sentence is 0.15.
Learn word vectors with the word2vec algorithm: document D_new2 is input into the word2vec algorithm to obtain word vectors, with objective function (13):

\frac{1}{T} \sum_{i=k}^{T-k} \log p(w_i \mid w_{i-k}, \dots, w_{i+k})   (13)

where k is the window size, i is the current word, and T is the number of words in the corpus; 256-dimensional word vectors are learned by gradient descent.
Computing the feature similarity requires counting noun phrases and verb phrases. Syntactic analysis is performed with the Stanford Parser, and the words labeled np or vp are extracted; the extracted noun phrases and verb phrases are then clustered by their word vectors with the K-means algorithm into 50 clusters. The feature similarity between a candidate sentence and the query is computed by counting noun phrases and verb phrases falling into the same cluster; with query = [q_1, q_2 ... q_q] and words w = [w_1, w_2 ... w_w], the clustering result obtained by K-means is shown in Fig. 2. From the clustering result of Fig. 2, the overlapping word frequency is 3 and the total word count is 8, so the feature similarity is 3/8.
The similarity result is the vector similarity plus the feature similarity: 0.15 + 3/8 = 0.525.
In this embodiment, salience is computed with the LDA algorithm as follows: document D_new2 is first input into the LDA algorithm to obtain the topic distribution of each candidate sentence. For three example sentences, with the number of topics set to 5, the topic distributions are [0.1, 0.01, 0.5, 0.09, 0.3], [0.9, 0.01, 0.02, 0, 0.7], and [0.09, 0.02, 0.1, 0.8, 0]; the three sentences therefore belong to topic 3, topic 1, and topic 4 respectively, so the salience of the five topics is computed as 1/5, 0, 1/5, 1/5, 0.
In this embodiment, coherence is computed with mutual information as follows: compute the mutual information of the word pairs in the candidate sentences, and obtain the coherence between candidate sentences from the word-pair mutual information. For candidate sentences s_1 = [w_11, w_12, ..., w_1i, ..., w_1n] and s_2 = [w_21, w_22, ..., w_2i, ..., w_2m], first pair up the words of s_1 and s_2 and compute the mutual information of each pair; the pair <w_11, w_21> is computed by formula (14), where U and V are the similar-word sets of w_11 and w_21 respectively. The frequency of w_11 appearing in adjacent sentences with V is 3, the frequency of U appearing in adjacent sentences with w_21 is 2, and the corpus frequency of U and V is 100, so the mutual information of <w_11, w_21> is 5.001/101. The mutual information of the remaining word pairs is obtained in the same way, and the coherence of s_1 and s_2 is obtained with formula (15).
Step C, compute the similarity between sentences: using the sentence vectors of the candidate sentences learned by the PV algorithm in step B, compute the cosine similarity of every two candidate sentences with formula (1).
Step D, jointly consider similarity, salience, coherence, and redundancy, and solve optimization formula (7) within the ILP framework.
The resulting β_j values are [0, 1, 1, 0, 0, ..., 1]; the sentences whose entries are 1 are chosen as summary sentences, yielding the summary sentence set.
Embodiment 2
Step B of embodiment 1, which computes similarity with the PV and K-means algorithms, salience with the LDA algorithm, and coherence with mutual information, is here split into three sequentially executed steps, as follows:
Step A, preprocessing. In this embodiment, the corpus is split into sentences and stop words are removed. The DailyMail dataset is selected; it contains 1000 document collections, each averaging 802 words. The query and the document collection are extracted with a two-tuple method, and the nltk toolkit is used to split the document collection into sentences, yielding a new one-sentence-per-line document D_new1; stop-word removal on D_new1 gives a new document D_new2.
Step B, compute similarity with the PV and K-means algorithms, i.e., learn sentence vectors with the PV algorithm. Document D_new2 is input into the PV algorithm to obtain the 256-dimensional sentence vector of each candidate sentence. The sentence vector of one candidate sentence is [0.30150011 -0.60735332 0.00165963 0.31739858 0.11809512 0.080117 ... -0.04042209 0.12560614 0.17610322 0.06183569 0.38161758]; the sentence vector of the query is [0.01539337 0.18238734 -0.30741466 0.03572199 0.45234808 0.60017210 ... 0.80311120 -0.1038382 0.02234642 0.17560577 0.91691008]. According to the cosine similarity of formula (1), the vector similarity of this candidate sentence is 0.12.
Learn word vectors with the word2vec algorithm: document D_new2 is input into the word2vec algorithm to obtain word vectors, with objective function (13) as in embodiment 1, where k is the window size, i is the current word, and T is the number of words in the corpus; 256-dimensional word vectors are learned by gradient descent.
Computing the feature similarity requires counting noun phrases and verb phrases. Syntactic analysis is performed with the Stanford Parser, the words labeled np or vp are extracted, and the extracted noun phrases and verb phrases are clustered by their word vectors with the K-means algorithm into 70 clusters. The feature similarity between a candidate sentence and the query is computed by counting noun phrases and verb phrases falling into the same cluster; with query = [q_1, q_2 ... q_q] and words w = [w_1, w_2 ... w_w], the clustering result obtained by K-means is shown in Fig. 2. From the clustering result of Fig. 2, the overlapping word frequency is 3 and the total word count is 8, so the feature similarity is 3/8.
The similarity result is the vector similarity plus the feature similarity: 0.12 + 3/8 = 0.495.
Step C, compute coherence with mutual information as follows: compute the mutual information of the word pairs in the candidate sentences, and obtain the coherence between candidate sentences from the word-pair mutual information. For candidate sentences s_1 = [w_11, w_12, ..., w_1i, ..., w_1n] and s_2 = [w_21, w_22, ..., w_2i, ..., w_2m], first pair up the words of s_1 and s_2 and compute the mutual information of each pair; the pair <w_11, w_21> is computed by formula (14), where U and V are the similar-word sets of w_11 and w_21 respectively. The frequency of w_11 appearing in adjacent sentences with V is 3, the frequency of U appearing in adjacent sentences with w_21 is 2, and the corpus frequency of U and V is 100, so the mutual information of <w_11, w_21> is 5.001/101. The mutual information of the remaining word pairs is obtained in the same way, and the coherence of s_1 and s_2 is obtained with formula (15).
Step D, compute salience with the LDA algorithm as follows: document D_new2 is first input into the LDA algorithm to obtain the topic distribution of each candidate sentence. For three example sentences, with the number of topics set to 5, the topic distributions are [0.1, 0.01, 0.5, 0.09, 0.3], [0.9, 0.01, 0.02, 0, 0.7], and [0.09, 0.02, 0.1, 0.8, 0]; the three sentences therefore belong to topic 3, topic 1, and topic 4 respectively, so the salience of the five topics is 1/5, 0, 1/5, 1/5, 0.
Step E, compute the similarity between sentences: using the sentence vectors of the candidate sentences learned by the PV algorithm in step B, compute the cosine similarity of every two candidate sentences with formula (1).
Step F, jointly consider similarity, salience, coherence, and redundancy, and solve optimization formula (7) within the ILP framework. The resulting β_j values are [0, 0, 1, 1, 0, ..., 1]; the sentences whose entries are 1 are chosen as summary sentences, yielding the summary sentence set.
The extractive summarization method with comprehensive advantages based on integer linear programming of the present invention has been described in detail above, but the specific implementation forms of the present invention are not limited thereto. The embodiments merely help to understand the method of the present invention and its core idea; for those of ordinary skill in the art, changes may be made to the specific embodiments and the scope of application according to the idea of the present invention. In conclusion, the content of this specification should not be construed as limiting the present invention.
Any obvious change made to the method of the invention without departing from its spirit and the scope of the claims falls within the protection scope of the present invention.

Claims (6)

1. An extractive summarization method with comprehensive advantages based on integer linear programming, characterized in that: vector representations of sentences are learned automatically from a corpus, similarity is computed mathematically, and topic salience and inter-sentence coherence are computed statistically, so as to construct a high-quality summarization system; the core idea is to compute similarity by combining vector similarity with feature similarity, compute salience from topic-level information, compute sentence coherence from word-pair mutual information, and finally take redundancy into account by solving a global optimization with integer linear programming, so that the summary constructed from similarity, salience, coherence, and redundancy together is more accurate;
related definitions are given first, as follows:
definition 1: query, i.e., a query term; each query term is called a query; each query is a sentence and typically represents the content the user cares about;
definition 2: document collection; automatic summarization includes extractive summarization and abstractive summarization, and extractive summarization is further divided into query-based and content-based extractive summarization; both kinds of summarization involve multiple document collections, and each document collection corresponds to one query; the document collection corresponding to a query is a topic set, denoted D, with D = {d_i | 1 ≤ i ≤ N}, where N is the number of documents in D;
definition 3: summary sentence set and candidate sentence set; in query-based extractive summarization, each query corresponds to one document collection, and the summary sentences extracted from the collection must be relevant to the query content; the extracted summary sentences form the summary sentence set, denoted S, with S = {s_i | 1 ≤ i ≤ M}, where M is the number of sentences in the set and s_i is one summary sentence; since the word count of an extractive summary is limited, the condition \sum_i l(s_i) \le L must hold, where l(s_i) is the length of sentence s_i and L is the length limit of the summary sentence set; the candidate sentence set consists of all sentences in D; each sentence in D is called a candidate summary sentence, whose distributed vector representation is also called a sentence vector; a candidate sentence is composed of words, whose distributed vector representations are called word vectors;
definition 4: similar-word set, a set in which all words are synonyms;
definition 5: similarity; the semantic overlap and the feature overlap between a candidate sentence and the query are collectively called similarity; semantic overlap is also called vector similarity, and feature overlap is the coverage of noun phrases and verb phrases, also called feature similarity;
definition 6: salience, i.e., topic salience, the proportion of each topic over all candidate sentences: the more sentences a topic contains, the more salient the topic;
definition 7: coherence; in extractive summarization the extracted sentences must be rearranged, and coherence means that the finally arranged summary sentences are semantically and logically fluent and readable;
the method comprises the following steps:
step 1: compute the similarity between each candidate sentence and the query, specifically by computing the vector similarity and the feature similarity separately and adding the two;
wherein the vector similarity is computed with sentence vectors learned by the PV algorithm, and the feature similarity uses noun phrases and verb phrases as features;
wherein PV is short for paragraph vector; the PV algorithm is an unsupervised framework that learns distributed vectors for pieces of text, where a text piece can be a sentence, a paragraph, or a document, and its length is variable;
during training, the PV algorithm keeps adjusting the sentence vectors and word vectors to predict words until it converges; the sentence vectors and word vectors are obtained by stochastic gradient descent and backpropagation;
the feature similarity is computed with a syntactic parse tree and the K-means algorithm;
step 2: compute the salience of the candidate sentences with the LDA algorithm;
wherein LDA is used because it is among the most mature topic models developed to date: it overcomes the defects of traditional topic models and, resting on probability theory and Bayesian theory, has been widely applied in fields such as text retrieval, text classification, image recognition, and social networks;
step 3: compute the coherence between the candidate sentences with mutual information;
step 4: based on the sentence vectors learned in step 1, compute the similarity between candidate sentences;
step 5: perform global optimization over the comprehensive advantage composed of similarity, salience, coherence, and redundancy with integer linear programming, and extract summary sentences to obtain the summary sentence set;
through steps 1 to 5, summary sentences that are semantically similar to the query, topically salient, coherent, and free of redundancy have been selected.
2. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that the computation of the vector similarity and the feature similarity in step 1 comprises the following sub-steps:
step 1.1: arrange the corpus one sentence per line and feed it into the PV algorithm to learn sentence vectors; the vector similarity is the cosine similarity of formula (1):

R(s_j, q) = \frac{vec(s_j) \cdot vec(q)}{\|vec(s_j)\| \, \|vec(q)\|}   (1)

where s_j is any candidate sentence, vec(s_j) is the sentence vector of s_j, q is the query, vec(q) is the sentence vector of the query, and R(s_j, q) is the vector similarity between s_j and the query;
step 1.2: segment the corpus into words, learn word vectors, cluster them with K-means, and compute the feature similarity, through the following sub-steps:
step 1.2.1: segment the corpus into words;
step 1.2.2: learn word vectors from the segmented corpus with the word2vec algorithm;
step 1.2.3: cluster the word vectors produced by step 1.2.2 with the K-means algorithm to obtain similar-word sets;
wherein the clustering rule of the K-means algorithm is that word vectors that are close in the semantic space belong to the same set;
step 1.2.4: compute the feature similarity from noun phrases and verb phrases by formula (2):

Fe_j = \sum_{np \in Q} tf(np) + \sum_{vp \in Q} tf(vp)   (2)

where Fe_j is the feature similarity of the j-th sentence; feature similarity specifically refers to the number of synonym co-occurrences of noun phrases and verb phrases between the query and the candidate sentence; Q is the set of clusters to which the query words belong, np is a noun phrase in s_j, and vp is a verb phrase in s_j; tf(np) is the overlapping frequency of s_j's noun phrases with the query, and tf(vp) is the overlapping frequency of s_j's verb phrases with the query;
step 1.3: compute the similarity by adding the vector similarity and the feature similarity, as in formula (3):

Rele_j = R(s_j, q) + Fe_j   (3)

where Rele_j denotes the similarity of candidate sentence s_j.
3. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 2 comprises the following sub-steps:
step 2.1: compute the topic distribution of each candidate sentence, denoted θ;
step 2.2: take the dimension with the largest probability in θ as the topic of the sentence, obtaining the topics of all candidate sentences;
step 2.3: count the number of candidate sentences under each topic and normalize to obtain the topic salience, where the salience of the i-th topic is denoted t_i.
4. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 3 comprises the following sub-steps:
step 3.1: for any two candidate sentences s_j and s_k in the candidate sentence set, compute the mutual information of their word pairs and similar-word pairs; specifically, for a word pair <u, v> with u ∈ s_j and v ∈ s_k, use the similar-word sets obtained in step 1.2.3 to compute the word-pair mutual information P_jk<u, v> by formula (4), where U is the similar-word set of word u, V is the similar-word set of word v, cnt(U, V) is the number of times words from U and V occur in two adjacent sentences, freq(U) is the word frequency of the U set, and freq(V) is the word frequency of the V set;
step 3.2: add the mutual information of the word pairs of s_j and s_k to obtain the coherence, as in formula (5):

c\langle s_j, s_k \rangle = \sum_{u \in s_j} \sum_{v \in s_k} P_{jk}\langle u, v \rangle   (5)
5. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 4 is computed by formula (6):

R\langle s_j, s_k \rangle = \frac{vec(s_j) \cdot vec(s_k)}{\|vec(s_j)\| \, \|vec(s_k)\|}   (6)

where s_j and s_k are any two sentences in the candidate sentence set, and the similarity R<s_j, s_k> is computed with cosine similarity.
6. The extractive summarization method with comprehensive advantages based on integer linear programming according to claim 1, characterized in that step 5 is realized by maximizing objective function (7):

max { \sum_i \alpha_i t_i + \sum_j \beta_j Rele_j + \sum_{j<k} \beta_{jk} c\langle s_j, s_k \rangle - \sum_{j<k} \beta_{jk} R\langle s_j, s_k \rangle }   (7)

where the similarity Rele_j is obtained in step 1.3, the salience t_i in step 2.3, the coherence c<s_j, s_k> in step 3.2, and R<s_j, s_k> in step 4; the lower the similarity between summary sentences, the lower the redundancy;
integer linear programming is abbreviated ILP; α_i and β_j are binary variables indicating whether topic i and candidate sentence j are selected into the summary, t_i is the topic salience, Rele_j is the similarity of the candidate sentence, and β_jk is a binary variable indicating whether the candidate sentence pair <s_j, s_k> appears in the summary sentence set simultaneously; while maximizing objective (7), the following five constraints (8) to (12) must hold:

β_j Asso_ij ≤ α_i   (8)
\sum_j β_j Asso_ij ≥ α_i   (9)
β_jk − β_j ≤ 0;  β_jk − β_k ≤ 0;  β_j + β_k − β_jk ≤ 1   (10)
\sum_j β_j l(s_j) ≤ L   (11)
α_i, β_j, β_jk ∈ {0, 1}   (12)

where Asso_ij is a binary variable indicating whether the topic of sentence j coincides with topic i; inequalities (8) and (9) guarantee that if a candidate sentence is selected then its topic is also selected into the summary, and conversely that if a topic is selected then at least one candidate sentence under that topic is selected; β_k is the binary variable indicating whether candidate sentence k is selected into the summary, and inequality (11) states that the length of the summary sentence set does not exceed L.
CN201810435232.8A 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming Active CN108664598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810435232.8A CN108664598B (en) 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810435232.8A CN108664598B (en) 2018-05-09 2018-05-09 Extractive summarization method with comprehensive advantages based on integer linear programming

Publications (2)

Publication Number Publication Date
CN108664598A true CN108664598A (en) 2018-10-16
CN108664598B CN108664598B (en) 2019-04-02

Family

ID=63778925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810435232.8A Active CN108664598B (en) Extractive summarization method with comprehensive advantages based on integer linear programming

Country Status (1)

Country Link
CN (1) CN108664598B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110892400A (en) * 2019-09-23 2020-03-17 香港应用科技研究院有限公司 Method for summarizing text using sentence extraction
CN111159393A (en) * 2019-12-30 2020-05-15 电子科技大学 Text generation method for abstracting abstract based on LDA and D2V
CN112860881A (en) * 2019-11-27 2021-05-28 北大方正集团有限公司 Abstract generation method and device, electronic equipment and storage medium
CN113626581A (en) * 2020-05-07 2021-11-09 北京沃东天骏信息技术有限公司 Abstract generation method and device, computer readable storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908042A (en) * 2010-08-09 2010-12-08 中国科学院自动化研究所 Tagging method of bilingual combination semantic role
US20150152474A1 (en) * 2012-03-09 2015-06-04 Caris Life Sciences Switzerland Holdings Gmbh Biomarker compositions and methods
CN105320642A (en) * 2014-06-30 2016-02-10 中国科学院声学研究所 Automatic abstract generation method based on concept semantic unit
CN106874362A (en) * 2016-12-30 2017-06-20 中国科学院自动化研究所 Multilingual automaticabstracting

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908042A (en) * 2010-08-09 2010-12-08 中国科学院自动化研究所 Tagging method of bilingual combination semantic role
US20150152474A1 (en) * 2012-03-09 2015-06-04 Caris Life Sciences Switzerland Holdings Gmbh Biomarker compositions and methods
CN105320642A (en) * 2014-06-30 2016-02-10 中国科学院声学研究所 Automatic abstract generation method based on concept semantic unit
CN106874362A (en) * 2016-12-30 2017-06-20 中国科学院自动化研究所 Multilingual automaticabstracting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐惠婷 (Xu Huiting): "Research on Multi-Document Automatic Summarization Technology Based on Information Extraction and Semantic Similarity", Wanfang Data Knowledge Service Platform *
龚书 (Gong Shu): "Research on Text Representation for Extractive Multi-Document Summarization", Wanfang Data Knowledge Service Platform *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110892400A (en) * 2019-09-23 2020-03-17 香港应用科技研究院有限公司 Method for summarizing text using sentence extraction
CN110892400B (en) * 2019-09-23 2023-05-09 香港应用科技研究院有限公司 Method for summarizing text using sentence extraction
CN112860881A (en) * 2019-11-27 2021-05-28 北大方正集团有限公司 Abstract generation method and device, electronic equipment and storage medium
CN111159393A (en) * 2019-12-30 2020-05-15 电子科技大学 Text generation method for abstracting abstract based on LDA and D2V
CN111159393B (en) * 2019-12-30 2023-10-10 电子科技大学 Text generation method for abstract extraction based on LDA and D2V
CN113626581A (en) * 2020-05-07 2021-11-09 北京沃东天骏信息技术有限公司 Abstract generation method and device, computer readable storage medium and electronic equipment

Also Published As

Publication number Publication date
CN108664598B (en) 2019-04-02

Similar Documents

Publication Publication Date Title
US11580415B2 (en) Hierarchical multi-task term embedding learning for synonym prediction
CN108664598B (en) Extractive summarization method with comprehensive advantages based on integer linear programming
CN110298032A (en) Text classification corpus labeling training system
Wu et al. Exploring syntactic and semantic features for authorship attribution
Mukhtar et al. Effective use of evaluation measures for the validation of best classifier in Urdu sentiment analysis
Aumiller et al. Structural text segmentation of legal documents
Fei et al. Hierarchical multi-task word embedding learning for synonym prediction
Wang et al. Neural related work summarization with a joint context-driven attention mechanism
CN114997288A (en) Design resource association method
Liu et al. Chinese judicial summarising based on short sentence extraction and GPT-2
Ou et al. Unsupervised citation sentence identification based on similarity measurement
Rachman et al. Word Embedding for Rhetorical Sentence Categorization on Scientific Articles.
Zhou et al. Exploiting chunk-level features to improve phrase chunking
Zhao et al. Web text data mining method based on Bayesian network with fuzzy algorithms
CN114265936A (en) Method for realizing text mining of science and technology project
Kong et al. Construction of microblog-specific chinese sentiment lexicon based on representation learning
Tran et al. A named entity recognition approach for tweet streams using active learning
Wu et al. Facet annotation by extending CNN with a matching strategy
CN112270185A (en) Text representation method based on topic model
Hao Naive Bayesian Prediction of Japanese Annotated Corpus for Textual Semantic Word Formation Classification
Ding et al. Graph structure-aware bi-directional graph convolution model for semantic role labeling
Zenasni et al. Discovering types of spatial relations with a text mining approach
Li et al. Nominal compound chain extraction: a new task for semantic-enriched lexical chain
Cheng et al. Improved Deep Bi-directional Transformer Keyword Extraction based on Semantic Understanding of News
SILVA Extracting structured information from text to augment knowledge bases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant