CN108090049A - Multi-document summary extraction method and system based on sentence vector


Info

Publication number
CN108090049A
Authority
CN
China
Prior art keywords
sentence
document
vector
sub
topics
Prior art date
Legal status
Granted
Application number
CN201810045090.4A
Other languages
Chinese (zh)
Other versions
CN108090049B (en)
Inventor
窦全胜 (Dou Quansheng)
朱翔 (Zhu Xiang)
Current Assignee
Shandong Technology and Business University
Original Assignee
Shandong Technology and Business University
Priority date
Filing date
Publication date
Application filed by Shandong Technology and Business University filed Critical Shandong Technology and Business University
Priority to CN201810045090.4A priority Critical patent/CN108090049B/en
Publication of CN108090049A publication Critical patent/CN108090049A/en
Application granted granted Critical
Publication of CN108090049B publication Critical patent/CN108090049B/en
Expired - Fee Related
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering

Abstract

The invention discloses a multi-document summary extraction method and system based on sentence vectors, comprising the following steps: S1, preprocess the document set; S2, generate sentence vectors by training a doc2vec model; S3, cluster the sentences into sub-topic documents; S4, establish a sentence relation graph model within each sub-topic document; S5, calculate sentence weights; S6, extract and order sentences to form the summary. The present invention trains a doc2vec model on a large corpus so that every sentence in the target document set is represented as a vector; spectral clustering gathers the sentences into sub-topics, and one sentence is extracted from each sub-topic, thereby avoiding sentence redundancy; the extracted sentences are arranged into the summary according to their positions in the original documents, improving the coherence of the summary sentences.

Description

Multi-document summary extraction method and system based on sentence vector
Technical field
The present invention relates to the field of computer text mining, and more particularly to a multi-document summary extraction method and system based on sentence vectors.
Background art
Automatic document summarization condenses and refines a text by computer, providing the user with an overview of its content. By briefly reading the summary, a user can grasp the key content of the full text, greatly improving the efficiency with which information is acquired and understood. Single-document automatic summarization generates, by algorithm, a summary of the main content of one document. Since Luhn proposed a method for automatically generating document abstracts in 1958, research on single-document summarization has developed rapidly, and its results have by now reached a generally accepted level. Multi-document automatic summarization, in contrast, generates a comprehensive summary of the main content of several different documents. To date, multi-document summarization technology has been closely combined with related algorithms from artificial intelligence, and in recent years increasingly with evolutionary algorithms and deep learning.
Yan et al. first applied deep learning to text summarization: the input layer is a word-frequency vector, the hidden layers are composed of restricted Boltzmann machines, and important sentences are finally selected by dynamic programming to form the summary. Rush performed abstractive summarization of source documents with deep learning, encoding the source document with a convolutional network and generating the summary with a context-attention feed-forward neural network. In 2016 Google open-sourced Textsum, an automatic summarization module based on deep learning, within its deep learning framework TensorFlow. Multi-document automatic summarization can be divided into extractive summarization and abstractive summarization, according to whether the sentences forming the summary come from the original text. Extractive summarization mainly assesses the importance of the sentences of the original documents and then selects key sentences to form the summary. Abstractive summarization mainly extracts word-level information from the original documents and then organizes the words into sentences to form the summary.
At present the implementation of abstractive summarization is overly complex: machines understand natural language insufficiently, considerable manual involvement is required, and the approach is still at an early stage and developing slowly. Extractive summarization is the commonly used method. Among graph-model-based text classification approaches, maximum common subgraph methods and edge-weight analogy methods are the more common similarity measures. There are also similarity measures based on the eigenvectors corresponding to the left singular values of the text-graph matrix, which in essence assume a PCA dimensionality reduction with zero sample mean. The main problems of existing extractive summarization methods are sentence redundancy and poor sentence cohesion.
Summary of the invention
To address the deficiencies of the prior art, such as redundancy among the extracted sentences and chaotic sentence order, the present invention proposes a multi-document summary extraction method based on sentence vectors, so as to provide an accurate and more readable document summary.
The technical solution adopted by the present invention is:
A multi-document summary extraction method based on sentence vectors comprises the following steps:
S1: preprocess the document set from which the summary is to be extracted;
S2: generate sentence vectors by training a doc2vec model;
S3: cluster the sentence vectors and save the corresponding sentences as sub-topic documents;
S4: establish a sentence relation graph model within each sub-topic document;
S5: calculate the sentence weights within each sub-topic document according to the relation graph model established in step S4;
S6: extract and order sentences to form the summary.
Further, S1 comprises the following steps:
Step S101: split every document of the document set into sentences according to sentence-end marks, recording the split sentences one sentence per line;
Step S102: record the position of each sentence;
Step S103: copy the sentence-split content of every document in the document set into one document, so that the collection is merged and each sentence of the merged document occupies one line;
Step S104: segment every sentence line of the merged document into words and remove stop words.
Further, the sentence position recorded in step S102 of step S1 is denoted h_{n,i}, where h_{n,i} represents the position of the i-th sentence within the n-th document, text_n represents the n-th document, and len(text_n) represents the number of sentences contained in the n-th document.
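As a concrete illustration of steps S101-S104, the following is a minimal Python sketch, assuming Chinese input text, the jieba tokenizer, and a caller-supplied stop-word list (none of which are mandated by the patent); the function name and return layout are our own.

```python
# Hedged sketch of preprocessing steps S101-S104.
import re
import jieba

SENT_END = re.compile(r'(?<=[。！？!?])')  # split after sentence-end marks (S101)

def preprocess(documents, stopwords):
    """documents: list of raw document strings; returns the merged sentence
    list, each sentence's position (n, i, len(text_n)), and the tokenized
    sentences with stop words removed."""
    sentences, positions, tokenized = [], [], []
    for n, text in enumerate(documents):
        sents = [s.strip() for s in SENT_END.split(text) if s.strip()]   # S101
        for i, s in enumerate(sents):
            positions.append((n, i, len(sents)))                         # S102: record position
            sentences.append(s)                                          # S103: one sentence per line
            tokenized.append([w for w in jieba.cut(s) if w not in stopwords])  # S104
    return sentences, positions, tokenized
```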
Further, step S2 comprises the following steps:
Step S201: preprocess all documents of the large corpus by steps S101 to S104 of step S1, input the preprocessed large-corpus documents into the distributed memory model of sentence vectors (PV-DM) in doc2vec, and train the PV-DM model;
Step S202: input the target documents, preprocessed by steps S101 to S104 of step S1, into the trained sentence-vector PV-DM model to obtain the sentence vectors.
Wherein, training the distributed memory model PV-DM of the sentence vectors in step S201 comprises the following steps:
(2011) in the preprocessed large-corpus documents, initialize every sentence line and all words as k-dimensional vectors, and input the word vectors corresponding to the context of a word w, together with the sentence vector of the sentence containing w, into the deep neural network model;
(2012) in the hidden layer of the deep neural network model, sum the input vectors; the accumulated vector serves as the input of the output layer;
(2013) the output layer of the deep neural network model corresponds to a binary tree whose leaf nodes are the words of the large corpus: a Huffman tree is constructed with the number of occurrences of each word in the large corpus as its weight, each word corresponds to a leaf node of the tree, and each branch of the tree is regarded as a binary classification; on the path from the root node to the leaf node of word w, the label of each tree node is 1-p_j, where p_j is the code of the j-th node on the path; every tree node other than the root node and the leaf nodes corresponds to an auxiliary vector of the same length as the sentence vector, used to assist in training the model;
(2014) continually update the sentence vectors, word vectors and auxiliary vectors by gradient ascent, finally obtaining the trained distributed memory model PV-DM of the sentence vectors.
The context of the word w consists of the C words before and after w.
The objective function of the neural network training is

$$\mathcal{L}=\sum_{doc}\;\sum_{sentence\in doc}\;\sum_{w\in sentence}\log p\bigl(w\mid Context(sentence,w)\bigr)$$

where sentence is a sentence, doc is a preprocessed document, w is a word, and Context(sentence, w) consists of the context words of w together with the sentence in which w occurs.
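Given the Huffman-tree output layer of step (2013), the conditional probability above factorizes as a standard hierarchical softmax; the following is our reconstruction under assumed notation, with x the accumulated hidden-layer vector, θ_{j} the auxiliary vector of the j-th internal node on the path to w, l^w the path length, and σ the sigmoid function:

$$p\bigl(w\mid Context(sentence,w)\bigr)=\prod_{j=2}^{l^w}\sigma\bigl(x^\top\theta_{j-1}\bigr)^{1-p_j}\,\bigl(1-\sigma(x^\top\theta_{j-1})\bigr)^{p_j}$$

In practice this training need not be implemented by hand; below is a minimal sketch using gensim's Doc2Vec, whose dm=1 mode is PV-DM. The function name and parameter values are illustrative, and corpus_tokens / target_tokens are assumed to be the tokenized sentence lines produced by the preprocessing sketch above.

```python
# Hedged sketch of step S2 with gensim; dm=1 selects the PV-DM architecture,
# hs=1 selects the hierarchical-softmax (Huffman tree) output layer.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_sentence_vectors(corpus_tokens, target_tokens, k=100, context=5):
    tagged = [TaggedDocument(words, [i]) for i, words in enumerate(corpus_tokens)]
    model = Doc2Vec(tagged, vector_size=k, window=context, dm=1, hs=1,
                    min_count=2, epochs=20)
    # Infer a k-dimensional sentence vector for each target sentence.
    return [model.infer_vector(words) for words in target_tokens]
```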
Further, the sentence vectors in step S3 are clustered by spectral clustering;
Further, step S3 comprises the following steps:
Step S301: build the similarity matrix W between all sentence vectors, the kernel function being the Gaussian kernel

$$W_{i,j}=\exp\left(-\frac{\lVert x_i-x_j\rVert^2}{2\sigma^2}\right)$$

where W_{i,j} is the similarity between sentences x_i and x_j, and σ is the Gaussian radius;
Step S302: compute the Laplacian matrix

L = D - W

where D is the diagonal matrix whose n-th diagonal element is the sum of the elements of the n-th row of W;
Step S303: build the normalized Laplacian matrix D^{-1/2} L D^{-1/2};
Step S304: compute the k smallest eigenvalues of D^{-1/2} L D^{-1/2} and the corresponding eigenvectors V;
Step S305: arrange the eigenvectors by column into an eigenmatrix and normalize each of its rows to unit length to form the matrix F, i.e. every row vector of F has modulus 1;
Step S306: treat every row of F as a k-dimensional sample and cluster the samples into C classes with the K-means algorithm;
Step S307: save the sentences corresponding to the vectors in the C classes as C sub-topic documents.
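A minimal NumPy/scikit-learn sketch of steps S301-S306 follows; sigma, k and C are illustrative parameters, and the helper name is our own. Grouping the sentence indices by the returned labels then yields the C sub-topic documents of step S307.

```python
# Hedged sketch of steps S301-S306: Gaussian affinity, normalized Laplacian,
# k smallest eigenvectors, row normalization, then k-means into C classes.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_cluster(X, C, k=None, sigma=1.0):
    """X: (n_sentences, dim) array of sentence vectors; returns cluster labels."""
    k = k or C
    W = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))      # S301
    d = W.sum(axis=1)                                               # row sums of W
    L = np.diag(d) - W                                              # S302: L = D - W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = D_inv_sqrt @ L @ D_inv_sqrt                            # S303
    _, vecs = np.linalg.eigh(L_norm)                                # S304: ascending order
    F = vecs[:, :k]                                                 # k smallest eigenvectors
    F = F / np.linalg.norm(F, axis=1, keepdims=True)                # S305: unit-length rows
    return KMeans(n_clusters=C, n_init=10).fit_predict(F)           # S306
```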
Further, step S4 specifically comprises:
within each sub-topic document, establishing a sentence relation graph model with the sentences as nodes and the similarities between sentences as edges;
Further, the similarity between sentences is computed as the cosine of the angle between their sentence vectors:

$$sim(x_i,x_j)=\frac{x_i\cdot x_j}{\lVert x_i\rVert\,\lVert x_j\rVert}$$

where x_i and x_j are two sentence vectors.
Further, step S5 specifically comprises:
initializing the weight of each sentence and iteratively updating the sentence weights according to the relation graph model established in step S4:

$$S(i)=(1-d)+d\sum_{j\in\delta(i)}\frac{S(j)}{|\delta(j)|}$$

where S(i) is the weight of sentence i, δ(i) is the set of all sentences in the same sub-topic document whose similarity with sentence i exceeds the set threshold, |δ(j)| is the number of sentences in the same sub-topic document whose similarity with sentence j exceeds the threshold, S(j) is the weight of sentence j, and d is the damping coefficient, set to 0.85.
Further, in step S5 the sentence weights are initialized to 1 and the similarity threshold used for δ(i) is set to 0.05.
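The update above is a TextRank-style iteration; a minimal sketch for one sub-topic document follows, under the stated settings (threshold 0.05, d = 0.85) and with an assumed fixed iteration count, since the patent does not state a stopping rule.

```python
# Hedged sketch of steps S4-S5: cosine-similarity graph plus the iterative
# update S(i) = (1 - d) + d * sum_{j in delta(i)} S(j) / |delta(j)|.
import numpy as np

def sentence_weights(X, threshold=0.05, d=0.85, n_iter=100):
    """X: (n, dim) array of sentence vectors of one sub-topic document."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    sim = (X @ X.T) / (norms * norms.T)          # pairwise cosine similarities (S4)
    np.fill_diagonal(sim, 0.0)
    adj = (sim > threshold).astype(float)        # delta(i): neighbours above threshold
    deg = adj.sum(axis=1)                        # |delta(j)| for every sentence j
    inv_deg = np.where(deg > 0, 1.0 / deg, 0.0)  # avoid division by zero for isolated nodes
    S = np.ones(len(X))                          # weights initialized to 1
    for _ in range(n_iter):                      # S5: iterate the update
        S = (1 - d) + d * adj @ (S * inv_deg)
    return S
```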
Further, step S6 specifically comprises: extracting the sentence of maximum weight from each sub-topic document and combining the extracted sentences into the summary according to the document positions recorded in step S102 of step S1.
A multi-document summary automatic extraction system based on sentence vectors comprises a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of any of the above methods are completed.
A computer-readable storage medium has a computer program running thereon; when the computer program is run by a processor, the steps of any of the above methods are completed.
Description of the drawings
The accompanying drawings, which form a part of this application, are intended to provide a further understanding of the application; the schematic embodiments of the application and their explanations serve to explain the application and do not constitute an improper limitation of it.
Fig. 1 is the flow chart of the present invention.
Fig. 2 is the flow chart of the preprocessing step of the present invention.
Specific embodiment
It should be noted that the following detailed description is illustrative and is intended to provide a further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the application belongs.
Fig. 1 is the flow chart of the present invention, which comprises the following steps:
S1: preprocess the document set;
S2: generate sentence vectors by training a doc2vec model;
S3: cluster the sentence vectors and save the corresponding sentences as sub-topic documents;
S4: establish a sentence relation graph model within each sub-topic document;
S5: calculate the sentence weights within each sub-topic document;
S6: extract and order sentences to form the summary.
Specifically, the implementation of step S1 is shown in Fig. 2 and comprises the following steps:
Step S101: split every document of the document set into sentences at sentence-end marks, one sentence per line;
Step S102: record the position of each sentence;
Step S103: copy the sentence-split content of every document in the document set into one document, so that the collection is merged and each sentence of the merged document occupies one line;
Step S104: segment every sentence line of the merged document into words and remove stop words.
Further, the sentence position described in step S102 is denoted h_{n,i}, where h_{n,i} represents the position of the i-th sentence within the n-th document, text_n represents the n-th document, and len(text_n) represents the number of sentences contained in the n-th document.
Step S2 specifically comprises the following steps:
Step S201: preprocess all documents of the large corpus by steps S101 to S104 of step S1, and train the PV-DM (distributed memory model of sentence vectors) in doc2vec with the preprocessed large-corpus documents;
Step S202: import the target documents, preprocessed by steps S101 to S104 of step S1, into the trained model to obtain the sentence vectors.
Wherein, training the PV-DM model in step S201 specifically comprises the following steps:
(1) in the preprocessed large-corpus documents, initialize every sentence line and all words as k-dimensional vectors, and input the word vectors corresponding to the context of a word w (the C words before and after w), together with the sentence vector of the sentence containing w, into the deep neural network model;
(2) sum these input vectors in the hidden layer; the accumulated vector serves as the input of the output layer;
(3) the output layer corresponds to a binary tree whose leaf nodes are the words appearing in the large corpus: a Huffman tree is constructed with the number of occurrences of each word in the large corpus as its weight, each word corresponds to a leaf node of the tree, and each branch of the tree can be regarded as a binary classification; on the path from the root node to the leaf node of word w, the label of each tree node is 1-p_j, where p_j is the code of the j-th node on the path; every tree node other than the root node and the leaf nodes corresponds to a vector of the same length as the sentence vector, called an auxiliary vector, used to assist in training the model;
(4) train the model by gradient ascent, continually updating the sentence vectors, word vectors and auxiliary vectors, finally obtaining the trained distributed memory model of the sentence vectors.
The objective function of the neural network training is

$$\mathcal{L}=\sum_{doc}\;\sum_{sentence\in doc}\;\sum_{w\in sentence}\log p\bigl(w\mid Context(sentence,w)\bigr)$$

where sentence is a sentence, doc is a preprocessed document, w is a word, and Context(sentence, w) consists of the context words of w together with the sentence in which w occurs.
In step S3 the sub-topic documents are generated by spectral clustering, which specifically comprises the following steps:
Step S301: build the similarity matrix W between all sentences, the kernel function being the Gaussian kernel

$$W_{i,j}=\exp\left(-\frac{\lVert x_i-x_j\rVert^2}{2\sigma^2}\right)$$

where W_{i,j} is the similarity between sentences x_i and x_j, and σ is the Gaussian radius;
Step S302: compute the Laplacian matrix

L = D - W

where D is the diagonal matrix whose n-th diagonal element is the sum of the elements of the n-th row of W;
Step S303: build the normalized Laplacian matrix D^{-1/2} L D^{-1/2};
Step S304: compute the k smallest eigenvalues of D^{-1/2} L D^{-1/2} and the corresponding eigenvectors V;
Step S305: arrange the eigenvectors by column into an eigenmatrix and normalize each of its rows to unit length to form the matrix F, i.e. every row vector of F has modulus 1;
Step S306: treat every row of F as a k-dimensional sample and cluster the samples into C classes with the K-means algorithm;
Step S307: save the sentences corresponding to the vectors in the C classes as C sub-topic documents.
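For reference, scikit-learn's SpectralClustering bundles the same affinity-matrix / Laplacian / k-means pipeline into a single call; a brief, hedged usage sketch (mapping the Gaussian radius σ to gamma = 1/(2σ²)) is:

```python
# One-call alternative to the manual steps S301-S307 using scikit-learn;
# X is the (n_sentences, dim) sentence-vector matrix, C the class count,
# and sigma the Gaussian radius (all assumed to be defined by the caller).
from sklearn.cluster import SpectralClustering

labels = SpectralClustering(n_clusters=C, affinity='rbf',
                            gamma=1.0 / (2 * sigma ** 2)).fit_predict(X)
```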
Step S4 specifically comprises:
within each sub-topic document, establishing a sentence relation graph model with the sentences as nodes and the similarities between sentences as edges.
Further, the cosine similarity sim(x_i, x_j) is computed as

$$sim(x_i,x_j)=\frac{x_i\cdot x_j}{\lVert x_i\rVert\,\lVert x_j\rVert}$$

where x_i and x_j are two sentence vectors.
Step S5 specifically comprises:
initializing the weight of each sentence and iteratively updating the sentence weights according to the relation graph model established in step S4 with the following formula:

$$S(i)=(1-d)+d\sum_{j\in\delta(i)}\frac{S(j)}{|\delta(j)|}$$

where S(i) is the weight of sentence i, δ(i) is the set of all sentences in the same sub-topic document whose similarity with sentence i exceeds the set threshold, |δ(j)| is the number of sentences in the same sub-topic document whose similarity with sentence j exceeds the threshold, S(j) is the weight of sentence j, and d is the damping coefficient, set to 0.85.
Further, in step S5 the sentence weights are initialized to 1 and the similarity threshold used for δ(i) is set to 0.05.
Further, step S6 specifically comprises:
extracting the sentence of maximum weight from each sub-topic document and, following the positions recorded in step S102 of step S1, combining the sentences into the summary in the order in which they appear in the original documents.
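Putting steps S1-S6 together, a hypothetical end-to-end driver reusing the sketch functions introduced above (preprocess, train_sentence_vectors, spectral_cluster, sentence_weights are our illustrative names, not the patent's) might look as follows:

```python
# Hedged end-to-end sketch of the summarization pipeline (steps S1-S6).
import numpy as np

def summarize(documents, corpus_tokens, stopwords, C=5):
    sentences, positions, tokens = preprocess(documents, stopwords)       # S1
    X = np.array(train_sentence_vectors(corpus_tokens, tokens))          # S2
    labels = spectral_cluster(X, C)                                      # S3
    picked = []
    for c in range(C):                                                   # S4-S5 per sub-topic
        idx = np.where(labels == c)[0]
        picked.append(idx[np.argmax(sentence_weights(X[idx]))])          # S6: top sentence
    picked.sort(key=lambda i: positions[i])      # restore original document order
    return ''.join(sentences[i] for i in picked)
```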
To further illustrate the multi-document summary extraction method of the present invention, below is the summary generated from three documents related to the death of Wu Qingyuan:
Sina.com sports news: Wu Qingyuan, who rose meteorically and revolutionized the game, has passed away at the age of one hundred; the eternal legend of the Go world has quietly departed. Wu Qingyuan was born on June 12, 1914 in Fujian, China, the third son of his family. The following year he was awarded third dan by the Japanese Go Institute, and in 1950 he attained ninth dan. Mr. Wu Qingyuan devoted himself to developing the way of Go and to guiding and supporting the younger generation. As the foremost player of the golden age of Japanese Go, he was known in Japan as the "Showa Go Sage". In 1961 Wu Qingyuan was injured in a traffic accident, gradually faded from the front line, and formally retired in 1984. Before Wu Qingyuan, no player in the Go world had ever reached his height. In 2014 the Go community held a grand celebration of Mr. Wu's hundredth birthday; among Go players the world over, only Wu Qingyuan has possessed such honors. Wu Qingyuan's family prepared to hold a farewell ceremony for him on December 3. Wu Qingyuan once said: "After one hundred I will still play Go, and after death I will still play Go in the universe." Pursuing the way of Go, Mr. Wu had long since seen through life and death.
The foregoing descriptions are merely preferred embodiments of the application and are not intended to limit the application; for those skilled in the art, the application may have various modifications and variations. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the application shall be included within the protection scope of the application.

Claims (10)

1. A multi-document summary extraction method based on sentence vectors, characterized by comprising the following steps:
S1: preprocessing the document set from which the summary is to be extracted;
S2: generating sentence vectors by training a doc2vec model;
S3: clustering the sentence vectors and saving the corresponding sentences as sub-topic documents;
S4: establishing a sentence relation graph model within each sub-topic document;
S5: calculating the sentence weights within each sub-topic document according to the relation graph model established in step S4;
S6: extracting and ordering sentences to form the summary.
2. The multi-document summary extraction method based on sentence vectors according to claim 1, characterized in that S1 comprises the following steps:
Step S101: splitting every document of the document set into sentences according to sentence-end marks, recording the split sentences one sentence per line;
Step S102: recording the position of each sentence;
Step S103: copying the sentence-split content of every document in the document set into one document, so that the collection is merged and each sentence of the merged document occupies one line;
Step S104: segmenting every sentence line of the merged document into words and removing stop words.
3. The multi-document summary extraction method based on sentence vectors according to claim 1, characterized in that step S2 comprises the following steps:
Step S201: preprocessing all documents of the large corpus by steps S101 to S104 of step S1, inputting the preprocessed large-corpus documents into the distributed memory model of sentence vectors (PV-DM) in doc2vec, and training the PV-DM model;
Step S202: inputting the target documents, preprocessed by steps S101 to S104 of step S1, into the trained sentence-vector PV-DM model to obtain the sentence vectors.
4. The multi-document summary extraction method based on sentence vectors according to claim 3, characterized in that training the distributed memory model PV-DM of the sentence vectors in step S201 comprises the following steps:
(2011) in the preprocessed large-corpus documents, initializing every sentence line and all words as k-dimensional vectors, and inputting the word vectors corresponding to the context of a word w, together with the sentence vector of the sentence containing w, into the deep neural network model;
(2012) summing the input vectors in the hidden layer of the deep neural network model, the accumulated vector serving as the input of the output layer;
(2013) the output layer of the deep neural network model corresponding to a binary tree whose leaf nodes are the words of the large corpus: a Huffman tree is constructed with the number of occurrences of each word in the large corpus as its weight, each word corresponds to a leaf node of the tree, and each branch of the tree is regarded as a binary classification; on the path from the root node to the leaf node of word w, the label of each tree node is 1-p_j, where p_j is the code of the j-th node on the path; every tree node other than the root node and the leaf nodes corresponds to an auxiliary vector of the same length as the sentence vector, used to assist in training the model;
(2014) continually updating the sentence vectors, word vectors and auxiliary vectors by gradient ascent, finally obtaining the trained distributed memory model PV-DM of the sentence vectors.
5. The multi-document summary extraction method based on sentence vectors according to claim 1, characterized in that the sentence vectors in step S3 are clustered by spectral clustering.
6. The multi-document summary extraction method based on sentence vectors according to claim 5, characterized in that step S3 comprises the following steps:
Step S301: building the similarity matrix W between all sentence vectors, the kernel function being the Gaussian kernel function;
Step S302: computing the Laplacian matrix L;
Step S303: building the normalized Laplacian matrix;
Step S304: computing the k smallest eigenvalues of the normalized Laplacian matrix and the corresponding eigenvectors V;
Step S305: arranging the eigenvectors by column into an eigenmatrix and normalizing each of its rows to unit length to form the matrix F, i.e. every row vector of F has modulus 1;
Step S306: treating every row of F as a k-dimensional sample and clustering the samples into C classes with the K-means algorithm;
Step S307: saving the sentences corresponding to the vectors in the C classes as C sub-topic documents.
7. The multi-document summary extraction method based on sentence vectors according to claim 1, characterized in that step S4 specifically comprises:
within each sub-topic document, establishing a sentence relation graph model with the sentences as nodes and the similarities between sentences as edges.
8. The multi-document summary extraction method based on sentence vectors according to claim 2, characterized in that step S6 specifically comprises: extracting the sentence of maximum weight from each sub-topic document and combining the extracted sentences into the summary according to the document positions obtained in step S102 of step S1.
9. A multi-document summary automatic extraction system based on sentence vectors, characterized by comprising: a memory, a processor, and computer instructions stored in the memory and run on the processor, the computer instructions, when run by the processor, completing the steps of the method of any one of claims 1-8.
10. A computer-readable storage medium, characterized in that a computer program runs thereon, the computer program, when run by a processor, completing the steps of the method of any one of claims 1-8.
CN201810045090.4A 2018-01-17 2018-01-17 Multi-document abstract automatic extraction method and system based on sentence vectors Expired - Fee Related CN108090049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810045090.4A CN108090049B (en) 2018-01-17 2018-01-17 Multi-document abstract automatic extraction method and system based on sentence vectors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810045090.4A CN108090049B (en) 2018-01-17 2018-01-17 Multi-document abstract automatic extraction method and system based on sentence vectors

Publications (2)

Publication Number Publication Date
CN108090049A (en) 2018-05-29
CN108090049B CN108090049B (en) 2021-02-05

Family

ID=62181661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810045090.4A Expired - Fee Related CN108090049B (en) 2018-01-17 2018-01-17 Multi-document abstract automatic extraction method and system based on sentence vectors

Country Status (1)

Country Link
CN (1) CN108090049B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108897896A (en) * 2018-07-13 2018-11-27 深圳追科技有限公司 Keyword abstraction method based on intensified learning
CN108959269A (en) * 2018-07-27 2018-12-07 首都师范大学 A kind of sentence auto ordering method and device
CN109325109A (en) * 2018-08-27 2019-02-12 中国人民解放军国防科技大学 Attention encoder-based extraction type news abstract generating device
CN109582967A (en) * 2018-12-03 2019-04-05 深圳前海微众银行股份有限公司 Public sentiment abstract extraction method, apparatus, equipment and computer readable storage medium
CN109829161A (en) * 2019-01-30 2019-05-31 延边大学 A kind of method of multilingual autoabstract
CN109885683A (en) * 2019-01-29 2019-06-14 桂林远望智能通信科技有限公司 A method of the generation text snippet based on K-means model and neural network model
CN109902284A (en) * 2018-12-30 2019-06-18 中国科学院软件研究所 A kind of unsupervised argument extracting method excavated based on debate
CN109902168A (en) * 2019-01-25 2019-06-18 北京创新者信息技术有限公司 A kind of valuation of patent method and system
CN109977196A (en) * 2019-03-29 2019-07-05 云南电网有限责任公司电力科学研究院 A kind of detection method and device of magnanimity document similarity
CN110162778A (en) * 2019-04-02 2019-08-23 阿里巴巴集团控股有限公司 The generation method and device of text snippet
CN110362823A (en) * 2019-06-21 2019-10-22 北京百度网讯科技有限公司 The training method and device of text generation model are described
CN110399606A (en) * 2018-12-06 2019-11-01 国网信息通信产业集团有限公司 A kind of unsupervised electric power document subject matter generation method and system
CN110737768A (en) * 2019-10-16 2020-01-31 信雅达系统工程股份有限公司 Text abstract automatic generation method and device based on deep learning and storage medium
CN111813925A (en) * 2020-07-14 2020-10-23 混沌时代(北京)教育科技有限公司 Semantic-based unsupervised automatic summarization method and system
CN111914083A (en) * 2019-05-10 2020-11-10 腾讯科技(深圳)有限公司 Statement processing method, device and storage medium
US10902191B1 (en) * 2019-08-05 2021-01-26 International Business Machines Corporation Natural language processing techniques for generating a document summary
WO2021042529A1 (en) * 2019-09-02 2021-03-11 平安科技(深圳)有限公司 Article abstract automatic generation method, device, and computer-readable storage medium
CN112784043A (en) * 2021-01-18 2021-05-11 辽宁工程技术大学 Aspect-level emotion classification method based on gated convolutional neural network
CN112949299A (en) * 2021-02-26 2021-06-11 深圳市北科瑞讯信息技术有限公司 Method and device for generating news manuscript, storage medium and electronic device
CN113220853A (en) * 2021-05-12 2021-08-06 燕山大学 Automatic generation method and system for legal questions

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101187919A (en) * 2006-11-16 2008-05-28 北大方正集团有限公司 Method and system for abstracting batch single document for document set
US7398196B1 (en) * 2000-09-07 2008-07-08 Intel Corporation Method and apparatus for summarizing multiple documents using a subsumption model
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN104778157A (en) * 2015-03-02 2015-07-15 华南理工大学 Multi-document abstract sentence generating method
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors
CN107357899A (en) * 2017-07-14 2017-11-17 吉林大学 Based on the short text sentiment analysis method with product network depth autocoder

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7398196B1 (en) * 2000-09-07 2008-07-08 Intel Corporation Method and apparatus for summarizing multiple documents using a subsumption model
CN101187919A (en) * 2006-11-16 2008-05-28 北大方正集团有限公司 Method and system for abstracting batch single document for document set
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN104778157A (en) * 2015-03-02 2015-07-15 华南理工大学 Multi-document abstract sentence generating method
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors
CN107357899A (en) * 2017-07-14 2017-11-17 吉林大学 Based on the short text sentiment analysis method with product network depth autocoder

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘欣等 (Liu Xin et al.): "基于PV_DM模型的多文档摘要方法" (Multi-document summarization method based on the PV-DM model), 《计算机应用与软件》 (Computer Applications and Software) *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108897896A (en) * 2018-07-13 2018-11-27 深圳追科技有限公司 Keyword abstraction method based on intensified learning
CN108959269A (en) * 2018-07-27 2018-12-07 首都师范大学 A kind of sentence auto ordering method and device
CN108959269B (en) * 2018-07-27 2019-07-05 首都师范大学 A kind of sentence auto ordering method and device
CN109325109A (en) * 2018-08-27 2019-02-12 中国人民解放军国防科技大学 Attention encoder-based extraction type news abstract generating device
CN109325109B (en) * 2018-08-27 2021-11-19 中国人民解放军国防科技大学 Attention encoder-based extraction type news abstract generating device
CN109582967A (en) * 2018-12-03 2019-04-05 深圳前海微众银行股份有限公司 Public sentiment abstract extraction method, apparatus, equipment and computer readable storage medium
CN109582967B (en) * 2018-12-03 2023-08-18 深圳前海微众银行股份有限公司 Public opinion abstract extraction method, device, equipment and computer readable storage medium
CN110399606A (en) * 2018-12-06 2019-11-01 国网信息通信产业集团有限公司 A kind of unsupervised electric power document subject matter generation method and system
CN110399606B (en) * 2018-12-06 2023-04-07 国网信息通信产业集团有限公司 Unsupervised electric power document theme generation method and system
CN109902284A (en) * 2018-12-30 2019-06-18 中国科学院软件研究所 A kind of unsupervised argument extracting method excavated based on debate
CN109902168A (en) * 2019-01-25 2019-06-18 北京创新者信息技术有限公司 A kind of valuation of patent method and system
US11847152B2 (en) 2019-01-25 2023-12-19 Beijing Innovator Information Technology Co., Ltd. Patent evaluation method and system that aggregate patents based on technical clustering
CN109885683A (en) * 2019-01-29 2019-06-14 桂林远望智能通信科技有限公司 A method of the generation text snippet based on K-means model and neural network model
CN109885683B (en) * 2019-01-29 2022-12-02 桂林远望智能通信科技有限公司 Method for generating text abstract based on K-means model and neural network model
CN109829161B (en) * 2019-01-30 2023-08-04 延边大学 Method for automatically abstracting multiple languages
CN109829161A (en) * 2019-01-30 2019-05-31 延边大学 A kind of method of multilingual autoabstract
CN109977196A (en) * 2019-03-29 2019-07-05 云南电网有限责任公司电力科学研究院 A kind of detection method and device of magnanimity document similarity
CN110162778A (en) * 2019-04-02 2019-08-23 阿里巴巴集团控股有限公司 The generation method and device of text snippet
CN111914083A (en) * 2019-05-10 2020-11-10 腾讯科技(深圳)有限公司 Statement processing method, device and storage medium
CN110362823A (en) * 2019-06-21 2019-10-22 北京百度网讯科技有限公司 The training method and device of text generation model are described
CN110362823B (en) * 2019-06-21 2023-07-28 北京百度网讯科技有限公司 Training method and device for descriptive text generation model
US10902191B1 (en) * 2019-08-05 2021-01-26 International Business Machines Corporation Natural language processing techniques for generating a document summary
WO2021042529A1 (en) * 2019-09-02 2021-03-11 平安科技(深圳)有限公司 Article abstract automatic generation method, device, and computer-readable storage medium
CN110737768B (en) * 2019-10-16 2022-04-08 信雅达科技股份有限公司 Text abstract automatic generation method and device based on deep learning and storage medium
CN110737768A (en) * 2019-10-16 2020-01-31 信雅达系统工程股份有限公司 Text abstract automatic generation method and device based on deep learning and storage medium
CN111813925A (en) * 2020-07-14 2020-10-23 混沌时代(北京)教育科技有限公司 Semantic-based unsupervised automatic summarization method and system
CN112784043A (en) * 2021-01-18 2021-05-11 辽宁工程技术大学 Aspect-level emotion classification method based on gated convolutional neural network
CN112949299A (en) * 2021-02-26 2021-06-11 深圳市北科瑞讯信息技术有限公司 Method and device for generating news manuscript, storage medium and electronic device
CN113220853B (en) * 2021-05-12 2022-10-04 燕山大学 Automatic generation method and system for legal questions
CN113220853A (en) * 2021-05-12 2021-08-06 燕山大学 Automatic generation method and system for legal questions

Also Published As

Publication number Publication date
CN108090049B (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN108090049A (en) Multi-document summary extraction method and system based on sentence vector
Abdullah et al. SEDAT: sentiment and emotion detection in Arabic text using CNN-LSTM deep learning
Zhao et al. Disease named entity recognition from biomedical literature using a novel convolutional neural network
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
Quan et al. An efficient framework for sentence similarity modeling
CN109635280A (en) A kind of event extraction method based on mark
CN106202010A (en) The method and apparatus building Law Text syntax tree based on deep neural network
Yan et al. Named entity recognition by using XLNet-BiLSTM-CRF
CN110083710A (en) It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
CN108875809A (en) The biomedical entity relationship classification method of joint attention mechanism and neural network
CN111222318B (en) Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
Tang et al. Deep sequential fusion LSTM network for image description
CN110347796A (en) Short text similarity calculating method under vector semantic tensor space
Ma et al. Data augmentation for chinese text classification using back-translation
CN114925205B (en) GCN-GRU text classification method based on contrast learning
Thattinaphanich et al. Thai named entity recognition using Bi-LSTM-CRF with word and character representation
CN113535897A (en) Fine-grained emotion analysis method based on syntactic relation and opinion word distribution
Jia et al. Attention in character-based BiLSTM-CRF for Chinese named entity recognition
Yan et al. MoGCN: Mixture of gated convolutional neural network for named entity recognition of chinese historical texts
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
Jin et al. WordTransABSA: Enhancing Aspect-based Sentiment Analysis with masked language modeling for affective token prediction
CN115858736A (en) Emotion text generation method based on emotion prompt fine adjustment
Sun et al. Chinese microblog sentiment classification based on convolution neural network with content extension method
Wang Research on the art value and application of art creation based on the emotion analysis of art
Shuang et al. Combining word order and cnn-lstm for sentence sentiment classification

Legal Events

Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
CF01: Termination of patent right due to non-payment of annual fee (granted publication date: 20210205)