CN104834735B - A word-vector-based document summary extraction method - Google Patents
A word-vector-based document summary extraction method
- Publication number
- CN104834735B CN104834735B CN201510254719.2A CN201510254719A CN104834735B CN 104834735 B CN104834735 B CN 104834735B CN 201510254719 A CN201510254719 A CN 201510254719A CN 104834735 B CN104834735 B CN 104834735B
- Authority
- CN
- China
- Prior art keywords
- sentence
- word vector
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A word-vector-based document summary extraction method comprises the following steps: S1, train a corpus with a deep neural network model to obtain word-vector representations of feature words; S2, build a sentence graph model; S3, compute sentence weights; S4, generate the summary with the maximal marginal relevance algorithm. The invention collects a corpus and pre-processes it to obtain a training feature corpus, which is then trained with a deep neural network model to obtain the word vectors of the feature words. A candidate document set and a candidate sentence set are retrieved from the corpus with a preset query word, and the semantic similarity between sentences is computed from the word vectors of the feature words. This captures the semantic relation between two sentences and avoids the errors that traditional co-occurrence-based methods make when different words share the same meaning, improving both the accuracy of the similarity computation and the quality of the summary.
Description
Technical field
The present invention relates to the fields of computer information retrieval and text mining, and in particular to a word-vector-based document summary extraction method.
Background art
Text summarization is an important branch of text mining. It finds the most important information in a document or document set and presents it as a concise, coherent short text. With the progress of science and the development of network technology, a massive amount of information is available online; faced with such volumes of data, summarization helps users quickly grasp the information they need, saves reading time, and improves work efficiency.
Current text summarization is mainly extractive: the most important sentences are extracted from the original text to form the summary. Generation proceeds in three steps: sentence representation, sentence scoring, and summary generation. Concretely, a sentence is first represented in some form, such as a vector of feature-word frequencies, TF*IDF values, or topic words. Once the representation is fixed, a sentence score is computed with a conventional ranking method such as BM25 or PageRank to reflect the sentence's importance, and finally the higher-scoring sentences are added to the summary with a redundancy-removal method. Text summarization now has a research history of some fifty years, and with the rapid development of information retrieval technology it has grown increasingly mature. From the early methods based on word frequency and TF*IDF, through the introduction of machine learning, to pattern-based representations, summarization performance has improved greatly.
Statistical methods based on word frequency or TF*IDF assume that the more high-frequency or high-TF*IDF words a sentence contains, the more important it is, and hence the more likely it is to be added to the final summary. Concretely, the candidate corpus is first pre-processed (stop-word removal, stemming, etc.), and the frequency or TF*IDF of each feature word in the corpus is counted. For every sentence of the candidate document set, an importance score is computed; the simplest practical choice is the average feature-word probability, i.e. the sum of the probabilities of the feature words in the sentence divided by the sentence length. Finally the sentences are ranked and the highest-scoring ones are added to the generated summary. Because such methods are easy to compute and implement, they are widely used as baselines, but they are biased toward high-frequency words, the generated summary tends to cover only the dominant topics of the candidate set, and they lack semantic understanding, so summary quality is limited.
In recent years, with the spread and improvement of machine learning, more and more researchers have applied learning methods in their experiments, and summarization is no exception. One approach uses supervised learning, casting summarization as binary classification: each candidate sentence either is or is not added to the final summary. A classifier such as logistic regression, naive Bayes, or an SVM is trained on a labeled training set to obtain an optimal weight vector, which is then used to classify the test set. Another approach represents each sentence with features such as sentence position, word frequency, and cue words, trains a learning-to-rank model on the training set to obtain an optimal feature weight vector, and scores the candidate sentences of the test set with it. A third approach treats summarization as clustering: the sentences of the candidate documents are clustered, the sentences within each cluster are ranked with the statistical or ranking methods described above, and the top n sentences of each cluster form the summary. Many more machine learning approaches to automatic summarization exist beyond those listed here. Although learning methods keep improving in the summarization field, in the general multi-document news summarization setting they do not outperform unsupervised methods and are better suited to specialized domains or particular summary types. Moreover, machine learning usually means supervised models, which require labeled data; labeling is generally done manually, is very time-consuming, and is subjective, so machine learning methods still need further refinement.
Graph-based summarization has attracted wide attention because it is unsupervised, considers the document globally, needs neither domain knowledge nor syntactic or semantic analysis, and performs well. It treats each sentence as a node of a graph and the similarity between two sentences as the weight of the edge linking them, iteratively computes node weights with methods such as PageRank or HITS, and finally adds the highest-weighted sentences to the summary. The values of the sentence similarity matrix represent the probability of jumping from one sentence to another, so computing the node weights accurately is critical. Traditional graph methods, however, mostly measure inter-sentence similarity by the co-occurrence of feature words, ignoring semantic similarity between sentences; this reduces the accuracy of the node weight computation and hurts summary quality.
Summary of the invention
The object of the present invention is to provide a word-vector-based document summary extraction method that avoids the errors traditional co-occurrence-based methods make when computing sentence similarity, and extracts more accurate and more readable document summaries for the user.
The technical scheme adopted by the present invention to solve the problems of the prior art is a word-vector-based document summary extraction method comprising the following steps:
S1, train a corpus with a deep neural network model to obtain the word-vector representations of the feature words: collect a corpus of documents from a database and pre-process it; the pre-processing splits each text into sentences and, sentence by sentence, removes stop words (against a stop-word list), special characters, and punctuation, yielding the training feature corpus. Training parameters are set, the training feature corpus is used as training data, and a deep neural network model is trained; each word of the training feature corpus is treated as a feature word and output in word-vector form by the Skip-gram model, giving the word-vector representation of each feature word;
S2, build the sentence graph model, comprising the following steps:
A1, pre-processing: retrieve the corpus collected in step S1 with a preset query word; the retrieved documents form the candidate document set. Split the candidate documents into sentences and remove duplicated sentences, giving the candidate sentence set of the summary;
A2, model construction: take every sentence of the candidate sentence set as a node of the graph model and assign it the average initial weight

Weight(S_i) = 1/N

where S_i is any sentence of the candidate sentence set S and N is the total number of sentences. Using the word vectors of the feature words obtained in step S1, compute the semantic similarity between sentences as the weight of the edges of the graph, forming the sentence graph model;
For any two sentences S_i and S_j of the candidate sentence set, containing feature words t_i and t_j with word vectors \vec{t}_i and \vec{t}_j, the semantic similarity Similarity(S_i, S_j) between S_i and S_j is:

Similarity(S_i, S_j) = ( Σ_{t_i ∈ S_i} Sim_m(\vec{t}_i, S_j) + Σ_{t_j ∈ S_j} Sim_m(\vec{t}_j, S_i) ) / ( |S_i| + |S_j| )

where, for the word vector \vec{t}_i of feature word t_i in sentence S_i, Sim_m(\vec{t}_i, S_j) denotes the maximum similarity between \vec{t}_i and the word vectors of all feature words of S_j that have the same part of speech as t_i; |S_i| and |S_j| denote the lengths of S_i and S_j.
The similarity between the word vectors of two feature words is obtained by the formula:

cos(\vec{t}_1, \vec{t}_2) = (\vec{t}_1 · \vec{t}_2) / (||\vec{t}_1|| · ||\vec{t}_2||)

where \vec{t}_1 and \vec{t}_2 are the word vectors of feature words t_1 and t_2 obtained by training the deep neural network model of step S1.
S3, compute sentence weights: for the graph model of step S2, iteratively update the weight of each node from the average initial weights and the inter-sentence semantic similarities of step S2 with the following formula, until convergence:

Weight(S_i) = (1 - d) + d · Σ_{S_j ∈ Connection(S_i)} ( Similarity(S_i, S_j) / ||Connection(S_j)|| ) · Weight(S_j)

where d is the damping factor with value range 0-1, Connection(S_i) is the set of sentences whose similarity to S_i is greater than 0, and ||Connection(S_i)|| is the number of sentences in that set;
S4, generate the summary with the maximal marginal relevance algorithm: select the highest-weighted, non-redundant sentences with the maximal marginal relevance algorithm to form the summary, with the concrete steps:
b1), create an empty summary sentence set; take the sentences corresponding to the nodes of the graph model as the initial candidate summary sentence set;
b2), sort the candidate summary sentences in descending order of node weight, giving the candidate summary sentence sequence;
b3), move the first sentence of the candidate summary sentence sequence into the summary sentence set, and update the weight of every remaining candidate sentence with:

Weight(S_j) = Weight(S_j) - ω × Similarity(S_i, S_j)

where i ≠ j, ω is the penalty factor, and Similarity(S_i, S_j) is the sentence semantic similarity obtained in step S2;
b4), repeat steps b2) and b3) until the sentences of the summary sentence set reach the preset summary length.
When the sentence whose weight is being updated has similarity with a sentence of the summary sentence set, the penalty factor ω is 1.0.
The deep neural network model is the Skip-gram model, trained with the hierarchical softmax method.
The damping factor d of step S3 is 0.85.
The preset summary length is 150 words.
Beneficial effects of the present invention: the invention collects a corpus and pre-processes it to obtain the training feature corpus; the training feature corpus is trained with a deep neural network model to obtain the word vectors of the feature words. A candidate document set and a candidate sentence set are retrieved from the corpus with a preset query word, and the semantic similarity between sentences is computed from the word vectors of the feature words, capturing the semantic relation between two sentences. This avoids the errors that traditional co-occurrence-based methods make when different words share the same meaning, and thus improves the accuracy of the similarity computation and the quality of the summary. On a stand-alone machine (single-core 3.0 GHz CPU, 4 GB memory, likewise below), the training corpus for the word vectors of the feature words is 1.2 GB, and the resulting trained model occupies 148,420 KB.
Brief description of the drawings
Fig. 1 is the logic diagram of the present invention.
Fig. 2 is the result obtained after steps S1-S3 of the embodiment of the present invention.
Fig. 3 is the final result of the embodiment of the present invention.
Detailed description of the embodiments
The present invention is described below with reference to the drawings and a specific embodiment.
Fig. 1 is the logic diagram of the word-vector-based document summary extraction method of the present invention. The method comprises the following steps:
S1, train a corpus with a deep neural network model to obtain the word-vector representations of the feature words: collect a corpus of documents from a database and pre-process it; the pre-processing splits each text into sentences and, sentence by sentence, removes stop words (against a stop-word list), special characters, punctuation, etc., yielding the training feature corpus. Training parameters are set, the training feature corpus is used as training data, and the deep neural network model Skip-gram is trained with the hierarchical softmax method; each word of the training feature corpus is treated as a feature word and output in word-vector form by the Skip-gram model, giving the word-vector representation of each feature word.
Specifically, to train word-vector representations of feature words from large amounts of unstructured text data, the present invention mainly adopts the Skip-gram model. Compared with other methods based on neural network architectures, the model involves no large matrix multiplications and is therefore very efficient. The Skip-gram model uses the word vector of the current word to predict the word vectors of the context within a specified window. Given the feature corpus w_1, w_2, w_3, …, w_T as training data, the Skip-gram objective function is

(1/T) Σ_{t=1}^{T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

where c is the parameter determining the context window size; the larger c is, the more training data and training time are required, but the higher the achievable accuracy.
The basic Skip-gram model defines p(w_O | w_I) as:

p(w_O | w_I) = exp(v'_{w_O}ᵀ v_{w_I}) / Σ_{w=1}^{W} exp(v'_wᵀ v_{w_I})

where v_w and v'_w are the "input" and "output" vector representations of w, and W is the number of words in the vocabulary. The cost of computing the gradient ∇ log p(w_O | w_I) is proportional to W, which is usually very large (10^5 to 10^7), so other, approximate formulas are generally used instead.
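The cost just described can be made concrete: a naive evaluation of p(w_O | w_I) performs one dot product per vocabulary word. The following sketch (with toy vocabulary size and dimensionality, not the embodiment's settings) illustrates the full-softmax computation:

```python
import numpy as np

rng = np.random.default_rng(0)
W, dim = 1000, 8                     # toy vocabulary size and vector dimension
v_in = rng.normal(size=(W, dim))     # "input" vectors v_w
v_out = rng.normal(size=(W, dim))    # "output" vectors v'_w

def p_full_softmax(w_o, w_i):
    """p(w_O | w_I) with the full softmax: O(W) dot products per query."""
    logits = v_out @ v_in[w_i]       # one dot product per vocabulary word
    logits -= logits.max()           # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return probs[w_o]

# Evaluating the distribution over all W output words touches v_out entirely.
probs = np.array([p_full_softmax(w, 3) for w in range(W)])
```

Hierarchical softmax replaces this O(W) normalization with a walk down a Huffman tree of depth about log W, which is the speedup the next paragraph describes.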
The present invention trains the deep neural network model Skip-gram with the hierarchical softmax algorithm. The algorithm uses a binary Huffman tree representation, with the W words of the output layer as leaves; shorter paths are assigned to high-frequency words, which speeds up training. Each feature word w can be reached from the root of the tree along a unique path. Let n(w, j) be the j-th node on the path from the root to w, and L(w) the length of that path, so that n(w, 1) = root and n(w, L(w)) = w. For any inner node n, let ch(n) be a fixed child of n. Hierarchical softmax then defines p(w_O | w_I) as:

p(w_O | w_I) = Π_{j=1}^{L(w_O)-1} σ( [[n(w_O, j+1) = ch(n(w_O, j))]] · v'_{n(w_O,j)}ᵀ v_{w_I} )

where σ(x) = 1/(1 + e^{-x}) and [[x]] is 1 if x is true and -1 otherwise. The cost of computing p(w_O | w_I) and its gradient is proportional to L(w_O), which on average is no greater than log W.
With the objective so defined, stochastic gradient descent is used to optimize it, finally producing the word-vector representation of each word.
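As a minimal, self-contained sketch of Skip-gram training with stochastic gradient descent (for brevity it optimizes the full-softmax objective rather than hierarchical softmax, and the corpus, dimensionality, and learning rate are toy assumptions; a production run would use the embodiment's settings of 200 dimensions, window 5, and a frequency threshold of 3):

```python
import numpy as np

rng = np.random.default_rng(1)
corpus = "hiv therapy reduces hiv transmission hiv therapy works".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
W, dim, c, lr = len(vocab), 4, 2, 0.05   # vocab size, dims, window, step size

v_in = rng.normal(0, 0.1, (W, dim))      # "input" vectors v_w
v_out = rng.normal(0, 0.1, (W, dim))     # "output" vectors v'_w

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pairs():
    """(center, context) index pairs within the window of size c."""
    for t in range(len(corpus)):
        for j in range(max(0, t - c), min(len(corpus), t + c + 1)):
            if j != t:
                yield idx[corpus[t]], idx[corpus[j]]

def avg_log_prob():
    """The Skip-gram objective: mean of log p(w_{t+j} | w_t)."""
    lps = [np.log(softmax(v_out @ v_in[i])[o]) for i, o in pairs()]
    return sum(lps) / len(lps)

before = avg_log_prob()
for _ in range(200):                     # SGD on -log p(w_O | w_I)
    for i, o in pairs():
        grad = softmax(v_out @ v_in[i])
        grad[o] -= 1.0                   # d(-log p)/d(logits) = p - onehot(o)
        g_in = v_out.T @ grad            # gradient w.r.t. v_in[i]
        v_out -= lr * np.outer(grad, v_in[i])
        v_in[i] -= lr * g_in
after = avg_log_prob()
```

Training raises the average log probability of observed (center, context) pairs, which is exactly what the objective function above maximizes.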
S2, build the sentence graph model, comprising the following steps:
A1, pre-processing: retrieve the corpus collected in step S1 with a preset query word; the retrieved documents form the candidate document set. Split the candidate documents into sentences and remove duplicated sentences, giving the candidate sentence set of the summary;
A2, model construction: take every sentence of the candidate sentence set as a node of the graph model and assign it the average initial weight

Weight(S_i) = 1/N

where S_i is any sentence of the candidate sentence set S and N is the total number of sentences. Using the word vectors of the feature words obtained in step S1, compute the semantic similarity between sentences as the weight of the edges of the graph, forming the sentence graph model.
For any two sentences S_i and S_j of the candidate sentence set, containing feature words t_i and t_j with word vectors \vec{t}_i and \vec{t}_j, the semantic similarity Similarity(S_i, S_j) between S_i and S_j is:

Similarity(S_i, S_j) = ( Σ_{t_i ∈ S_i} Sim_m(\vec{t}_i, S_j) + Σ_{t_j ∈ S_j} Sim_m(\vec{t}_j, S_i) ) / ( |S_i| + |S_j| )

where, for the word vector \vec{t}_i of feature word t_i in sentence S_i, Sim_m(\vec{t}_i, S_j) denotes the maximum similarity between \vec{t}_i and the word vectors of all feature words of S_j that have the same part of speech as t_i; |S_i| and |S_j| denote the lengths of S_i and S_j.
The similarity value between the word vectors of two feature words is obtained by the formula:

cos(\vec{t}_1, \vec{t}_2) = (\vec{t}_1 · \vec{t}_2) / (||\vec{t}_1|| · ||\vec{t}_2||)

where \vec{t}_1 and \vec{t}_2 are the word vectors of feature words t_1 and t_2 obtained by training the deep neural network model of step S1.
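A sketch of this similarity measure follows; the part-of-speech tags, toy two-dimensional vectors, and the `vectors` lookup are hypothetical stand-ins for the trained model's 200-dimensional output:

```python
import numpy as np

def cos(t1, t2):
    """Cosine similarity between two word vectors."""
    return float(np.dot(t1, t2) / (np.linalg.norm(t1) * np.linalg.norm(t2)))

def sim_m(t_vec, pos, sent, vectors):
    """Max similarity of t_vec to same-part-of-speech feature words of sent."""
    sims = [cos(t_vec, vectors[w]) for w, p in sent if p == pos]
    return max(sims, default=0.0)

def sentence_similarity(si, sj, vectors):
    """Similarity(S_i, S_j): sentences are lists of (word, pos) pairs."""
    fwd = sum(sim_m(vectors[w], p, sj, vectors) for w, p in si)
    bwd = sum(sim_m(vectors[w], p, si, vectors) for w, p in sj)
    return (fwd + bwd) / (len(si) + len(sj))

# Toy vectors standing in for the trained model's feature-word vectors.
vectors = {"hiv": np.array([1.0, 0.0]), "aids": np.array([0.9, 0.1]),
           "spreads": np.array([0.0, 1.0]), "transmits": np.array([0.1, 1.0])}
s1 = [("hiv", "NN"), ("spreads", "VB")]
s2 = [("aids", "NN"), ("transmits", "VB")]
sim = sentence_similarity(s1, s2, vectors)
```

The two sentences share no word, so a co-occurrence measure would score them 0, yet the word vectors give them a similarity close to 1; this is the error the invention avoids.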
S3, compute sentence weights: for the graph model of step S2, iteratively update the weight of each node from the average initial weights and the inter-sentence semantic similarities of step S2 with the improved PageRank formula below, until convergence, obtaining a score that reflects the importance of each sentence.
Because differences in inter-sentence similarity produce differences in edge weight between nodes, and because similarity is symmetric, an improved PageRank formula is used here. The original PageRank formula is based on the random-surfer idea and measures the importance of a web page from the links between pages: the weight of a link target is determined by the quality and out-link count of the link sources, with the formula

PR(S_i) = (1 - d) + d · Σ_{S_j ∈ In(S_i)} PR(S_j) / ||Out(S_j)||

where d is the damping factor, usually 0.85, In(S_i) is the set of pages linking to S_i, and ||Out(S_j)|| is the number of out-links of page S_j.
The present invention applies the PageRank idea to the sentence graph model to obtain the final sentence weights. To better handle the differing edge weights induced by the inter-sentence similarities, and the symmetry of similarity, the invention improves the existing PageRank formula into the following form:

Weight(S_i) = (1 - d) + d · Σ_{S_j ∈ Connection(S_i)} ( Similarity(S_i, S_j) / ||Connection(S_j)|| ) · Weight(S_j)

where d is still set to 0.85, Connection(S_i) is the set of sentences connected to S_i, i.e. the sentences whose similarity to S_i is greater than 0, and ||Connection(S_i)|| is the number of sentences in that set.
Starting from the average initial node weights of step S2 and the similarity matrix formed by the inter-sentence semantic similarities, the weight of each node, i.e. each sentence weight, is iterated with the improved PageRank formula until convergence. Each node finally obtains a score reflecting its importance, ready for the summary generation of the next step.
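The iteration can be sketched as a simple fixed-point loop over a toy similarity matrix (the normalization by the size of Connection(S_j) reflects one plausible reading of the improved formula; convergence tolerance and matrix values are illustrative assumptions):

```python
import numpy as np

def sentence_weights(sim, d=0.85, tol=1e-8, max_iter=200):
    """Iterate the similarity-weighted PageRank update until convergence."""
    n = len(sim)
    weight = np.full(n, 1.0 / n)   # average initial weight 1/N
    # Connection(S_j): sentences (other than S_j) with similarity > 0.
    conn = [np.nonzero((sim[j] > 0) & (np.arange(n) != j))[0]
            for j in range(n)]
    for _ in range(max_iter):
        new = np.empty(n)
        for i in range(n):
            acc = sum(sim[i, j] / len(conn[j]) * weight[j] for j in conn[i])
            new[i] = (1 - d) + d * acc
        done = np.abs(new - weight).max() < tol
        weight = new
        if done:
            break
    return weight

sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
weights = sentence_weights(sim)
```

The two mutually similar sentences reinforce each other and end up with higher weights than the outlier, which is the intended "important sentences vote for each other" behavior.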
Because the sentences are similar to one another, directly adding the K highest-weighted sentences to the summary would introduce considerable redundancy. To reduce the redundancy rate of the summary, the present invention uses the maximal marginal relevance algorithm, whose basic idea is: if a sentence is highly similar to a sentence already in the summary, that sentence is penalized. Hence the following step S4:
S4, generate the summary with the maximal marginal relevance algorithm: select the highest-weighted, non-redundant sentences to form the summary, with the concrete steps:
b1), create an empty summary sentence set as the initial summary sentence set; take the sentences corresponding to the nodes of the graph model as the initial candidate summary sentence set;
b2), sort the candidate summary sentences in descending order of the node weights of step S3, giving the candidate summary sentence sequence;
b3), move the first sentence of the candidate summary sentence sequence into the summary sentence set, and update the weight of every remaining candidate sentence with:

Weight(S_j) = Weight(S_j) - ω × Similarity(S_i, S_j)

where i ≠ j and ω is the penalty factor; when the sentence whose weight is being updated has similarity with a sentence of the summary sentence set, ω is 1.0. Similarity(S_i, S_j) is the sentence semantic similarity obtained in step S2;
b4), repeat steps b2) and b3) until the sentences of the summary sentence set reach the preset summary length.
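Steps b1)-b4) can be sketched as the following greedy loop; the word-budget check, toy sentences, and similarity function are simplified assumptions, not the patent's data:

```python
def mmr_select(sentences, weights, similarity, max_words=150, omega=1.0):
    """Greedy maximal-marginal-relevance selection (steps b1-b4)."""
    candidates = list(range(len(sentences)))          # b1: candidate set
    weights = dict(zip(candidates, weights))
    summary, used = [], 0
    while candidates and used < max_words:
        candidates.sort(key=lambda j: weights[j], reverse=True)  # b2
        i = candidates.pop(0)                         # b3: take the top sentence
        summary.append(sentences[i])
        used += len(sentences[i].split())
        for j in candidates:                          # b3: penalize the rest
            weights[j] -= omega * similarity(i, j)
    return summary

sents = ["hiv infection rates fell sharply",
         "infection rates for hiv fell",
         "new vaccine trials began"]
w = [0.9, 0.8, 0.5]
sim = lambda i, j: 0.9 if {i, j} == {0, 1} else 0.1
chosen = mmr_select(sents, w, sim, max_words=9)
```

Although the second sentence initially outranks the third, its near-duplicate overlap with the first sentence pushes it below the third after the penalty, so the summary covers two distinct topics instead of one topic twice.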
Embodiment:
To make the purpose, technical scheme, and beneficial effects of the present invention clearer and easier to implement, the invention is described in further detail below with reference to a specific embodiment and to the drawings. This embodiment presets the length of the generated summary to 150 words.
S1, train a corpus with a deep neural network model to obtain the word-vector representations of the feature words:
To obtain the vector representations of the feature words, the embodiment collects its experimental corpus from MEDLINE, the biomedical bibliographic database maintained by the United States National Library of Medicine. Specifically, all citations of 2011-2012 on MEDLINE are queried and taken as the corpus, and the sentences of the citations are pre-processed, i.e. stop words, special characters, punctuation, etc. are removed against a stop-word list, finally yielding a 1.2 GB training corpus.
In the training process of this embodiment, the word-vector dimensionality of the feature words is set to 200, the Skip-gram model is trained with hierarchical softmax, only feature words with frequency greater than 3 are considered, and the window size is set to 5.
S2, build the sentence graph model, comprising the following steps:
A1: the embodiment sets "HIV Infection" as the query keyword and retrieves all related citations on MEDLINE, obtaining the candidate documents for the query. The candidate documents are split into sentences and duplicated sentences are removed, yielding the candidate sentence set of the summary, which finally contains 4,581 sentences.
A2, model construction: every sentence of the set is taken as a node of the graph model, and each node is assigned the average initial weight Weight(S_i) = 1/N, i.e. 1/4581. Using the word vectors of the feature words trained in step S1, the inter-sentence semantic similarities are computed with the similarity and cosine formulas of step S2 and taken as the weights of the edges of the graph, generating the sentence graph model.
S3, compute sentence weights:
For the above graph model, the weight of each node is iterated with the improved PageRank formula until convergence.
Fig. 2 shows the result of applying the above three steps and sorting the candidate sentences by weight in descending order, with the top K sentences forming a summary of the specified length for the disease "HIV Infection". The procedure continues as follows:
S4, generate the summary with the maximal marginal relevance algorithm:
The sentences weighted in the previous step are sorted in descending order; to eliminate redundancy in the summary, the maximal marginal relevance algorithm penalizes sentences that are similar to sentences already in the summary, and the K highest-weighted sentences are selected to form the summary. The concrete steps are:
b1), create an empty summary sentence set as the initial summary sentence set; take the sentences corresponding to the nodes of the graph model as the initial candidate summary sentence set;
b2), sort the candidate summary sentences in descending order of node weight, giving the candidate summary sentence sequence;
b3), move the first sentence of the candidate summary sentence sequence into the summary sentence set, and update the weight of every remaining candidate sentence with:

Weight(S_j) = Weight(S_j) - ω × Similarity(S_i, S_j)

where i ≠ j and ω is the penalty factor; when the sentence whose weight is being updated has similarity with a sentence of the summary sentence set, ω is 1.0. Similarity(S_i, S_j) is the sentence semantic similarity obtained in step S2;
b4), repeat steps b2) and b3) until the sentences of the summary sentence set reach the preset summary length.
Fig. 3 shows the result of applying the above four steps, ranking the candidate sentence set and removing redundancy, finally generating the "HIV Infection" disease summary of the specified length.
Comparing the summary results of Fig. 2 and Fig. 3, the summary before redundancy removal consists mostly of short sentences, contains many repeated words, and includes sentences that are semantically similar to one another. The summary after redundancy removal retains the important information while covering more semantic aspects and carrying more information, so its overall structure is better.
The above embodiment describes and illustrates the method of the present invention. The method trains the word vectors of the feature words with a deep neural network algorithm and thereby computes accurate inter-sentence similarities; sentence weights are iteratively updated with the PageRank idea; and the maximal marginal relevance algorithm eliminates information redundancy in the summary. This improves the quality of the system-generated summary and better meets the user's information needs.
The above content is a further detailed description of the present invention in combination with a specific preferred technical scheme, and the specific implementation of the invention should not be regarded as confined to this description. For ordinary technical personnel in the technical field of the invention, simple deductions or substitutions made without departing from the concept of the invention should all be regarded as falling within the protection scope of the present invention.
Claims (5)
1. A word-vector-based document summary extraction method, characterized by comprising the following steps:
S1, train a corpus with a deep neural network model to obtain the word-vector representations of the feature words: collect a corpus of documents from a database and pre-process it; the pre-processing splits each text into sentences and, sentence by sentence, removes stop words (against a stop-word list), special characters, and punctuation, yielding the training feature corpus; set the training parameters, use the training feature corpus as training data, and train a deep neural network model; each word of the training feature corpus is treated as a feature word and output in word-vector form by the Skip-gram model, giving the word-vector representation of each feature word;
S2, build the sentence graph model, comprising the following steps:
A1, pre-processing: retrieve the corpus collected in step S1 with a preset query word; the retrieved documents form the candidate document set; split the candidate documents into sentences and remove duplicated sentences, giving the candidate sentence set of the summary;
A2, model construction: take every sentence of the candidate sentence set as a node of the graph model and assign it the average initial weight:
Weight(S_i) = 1/N
where S_i is any sentence of the candidate sentence set S and N is the total number of sentences; using the word vectors of the feature words obtained in step S1, compute the semantic similarity between sentences as the weight of the edges of the graph, forming the sentence graph model;
for any two sentences S_i and S_j of the candidate sentence set, containing feature words t_i and t_j with word vectors \vec{t}_i and \vec{t}_j, the semantic similarity Similarity(S_i, S_j) between S_i and S_j is:
Similarity(S_i, S_j) = ( Σ_{t_i ∈ S_i} Sim_m(\vec{t}_i, S_j) + Σ_{t_j ∈ S_j} Sim_m(\vec{t}_j, S_i) ) / ( |S_i| + |S_j| )
where, for the word vector t⃗i of feature word ti in sentence Si, Sim_m(t⃗i, Sj) denotes the maximum similarity value between t⃗i and the word vectors of all feature words in sentence Sj that have the same part of speech as ti; |Si| and |Sj| denote the lengths of Si and Sj, respectively. The similarity value between the word vectors of feature words is obtained by the following formula:
cos(t⃗1, t⃗2) = ( t⃗1 · t⃗2 ) / ( ||t⃗1|| × ||t⃗2|| )
where t⃗1 and t⃗2 are the feature word vectors of the two feature words t1 and t2, obtained by training the deep neural network model of step S1;
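A minimal sketch of this sentence similarity, under simplifying assumptions: word vectors are plain Python lists, and the part-of-speech restriction on Sim_m is omitted (here Sim_m takes the maximum cosine over all word vectors of the other sentence).

```python
import math

def cosine(t1, t2):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(t1, t2))
    norm1 = math.sqrt(sum(a * a for a in t1))
    norm2 = math.sqrt(sum(b * b for b in t2))
    return dot / (norm1 * norm2)

def sim_m(vec, sentence_vecs):
    # Maximum similarity between one word vector and the word vectors
    # of the other sentence (POS matching omitted in this sketch).
    return max(cosine(vec, v) for v in sentence_vecs)

def sentence_similarity(s_i, s_j):
    # Similarity(Si,Sj) = ( sum_{ti in Si} Sim_m(ti,Sj)
    #                     + sum_{tj in Sj} Sim_m(tj,Si) ) / (|Si| + |Sj|)
    total = sum(sim_m(v, s_j) for v in s_i) + sum(sim_m(v, s_i) for v in s_j)
    return total / (len(s_i) + len(s_j))

s1 = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-d "word vectors"
s2 = [[1.0, 0.0]]
print(round(sentence_similarity(s1, s2), 4))  # -> 0.6667
```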
S3, calculate sentence weights: for the graph model obtained in step S2, iteratively update the weight of each node using the following formula, based on the average initial weights and the inter-sentence semantic similarities from step S2, until convergence:
Weight(Si) = (1 - d) + d × Σ_{Sj∈Connection(Si)} [ Similarity(Si, Sj) / ||Connection(Sj)|| ] × Weight(Sj)
where d is a damping coefficient with a value in the range 0 to 1, Connection(Si) is the set of sentences whose similarity to sentence Si is greater than 0, and ||Connection(Si)|| is the total number of sentences in that set;
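The step-S3 iteration can be sketched as follows, assuming a precomputed symmetric similarity matrix; d = 0.85 follows claim 4, while the tolerance and iteration cap are illustrative choices not fixed by the patent.

```python
def rank_sentences(sim, d=0.85, tol=1e-6, max_iter=100):
    """Iteratively update node weights until convergence, starting
    from the average initial weight 1/N."""
    n = len(sim)
    weight = [1.0 / n] * n
    # Connection(Si): sentences with similarity > 0 to Si
    conn = [[j for j in range(n) if j != i and sim[i][j] > 0] for i in range(n)]
    for _ in range(max_iter):
        new = []
        for i in range(n):
            s = sum(sim[i][j] / len(conn[j]) * weight[j] for j in conn[i])
            new.append((1 - d) + d * s)
        converged = max(abs(a - b) for a, b in zip(new, weight)) < tol
        weight = new
        if converged:
            break
    return weight

sim = [[0.0, 0.5, 0.2],
       [0.5, 0.0, 0.4],
       [0.2, 0.4, 0.0]]
w = rank_sentences(sim)
print([round(x, 3) for x in w])  # the best-connected sentence ranks highest
```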
S4, generate the summary using the maximal marginal relevance algorithm: select the sentences with maximum weight and without redundancy to compose the summary, in the following concrete steps:
b1), establish an empty summary sentence set, and take the sentences corresponding to the nodes of the graph model as the initial candidate summary sentence set;
b2), sort the sentences in the candidate summary sentence set in descending order of the weights of their graph model nodes; the sorted sentences form the candidate summary sentence sequence;
b3), according to the candidate summary sentence sequence, move the first-ranked sentence into the summary sentence set, and update the weights of the remaining sentences in the candidate summary sentence set using the following formula:
Weight(Sj)=Weight (Sj)-ω×Similarity(Si,Sj)
where i ≠ j, ω is a penalty factor, and Similarity(Si, Sj) is the sentence semantic similarity obtained in step S2;
b4), repeat steps b2) and b3) until the sentences in the summary sentence set reach the preset summary length.
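Steps b1) to b4) can be sketched as follows. For brevity this sketch applies the penalty ω = 1.0 uniformly (claim 2 applies it when the updated sentence is similar to one already in the summary) and counts the summary length in sentences rather than words; sentence weights and the similarity matrix are assumed to come from steps S2 and S3.

```python
def mmr_select(sentences, weights, sim, max_sentences, omega=1.0):
    """Maximal-marginal-relevance selection: repeatedly take the
    highest-weight candidate and penalise the remaining candidates
    by their similarity to it."""
    weights = dict(enumerate(weights))   # mutable copy of node weights
    candidates = set(weights)
    summary = []
    while candidates and len(summary) < max_sentences:
        best = max(candidates, key=lambda i: weights[i])  # b2) top-ranked
        summary.append(sentences[best])                   # b3) move to summary
        candidates.remove(best)
        for j in candidates:
            # Weight(Sj) = Weight(Sj) - omega * Similarity(Si, Sj)
            weights[j] -= omega * sim[best][j]
    return summary

sents = ["s0", "s1", "s2"]
w = [0.9, 0.8, 0.3]
sim = [[0.0, 0.9, 0.0],
       [0.9, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
# s1 is heavily penalised for being redundant with s0, so s2 wins.
print(mmr_select(sents, w, sim, 2))  # -> ['s0', 's2']
```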
2. The word-vector-based document summary extraction method according to claim 1, characterized in that when the sentence whose weight is being updated is similar to a sentence in the summary sentence set, the penalty factor ω is 1.0.
3. The word-vector-based document summary extraction method according to claim 1, characterized in that the deep neural network model is the Skip-gram model, and the Skip-gram model is trained using the hierarchical softmax method.
4. The word-vector-based document summary extraction method according to claim 1, characterized in that the damping coefficient d in step S3 is 0.85.
5. The word-vector-based document summary extraction method according to claim 1, characterized in that the preset summary length is 150 words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510254719.2A CN104834735B (en) | 2015-05-18 | 2015-05-18 | A kind of documentation summary extraction method based on term vector |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104834735A CN104834735A (en) | 2015-08-12 |
CN104834735B true CN104834735B (en) | 2018-01-23 |
Family
ID=53812621
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510254719.2A Active CN104834735B (en) | 2015-05-18 | 2015-05-18 | A kind of documentation summary extraction method based on term vector |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104834735B (en) |
Families Citing this family (52)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105630767B (en) * | 2015-12-22 | 2018-06-15 | 北京奇虎科技有限公司 | The comparative approach and device of a kind of text similarity |
CN105631018B (en) * | 2015-12-29 | 2018-12-18 | 上海交通大学 | Article Feature Extraction Method based on topic model |
CN105653704B (en) * | 2015-12-31 | 2018-10-12 | 南京财经大学 | Autoabstract generation method and device |
CN106021272B (en) * | 2016-04-04 | 2019-11-19 | 上海大学 | The keyword extraction method calculated based on distributed expression term vector |
CN105930314B (en) * | 2016-04-14 | 2019-02-05 | 清华大学 | System and method is generated based on coding-decoding deep neural network text snippet |
US11210324B2 (en) | 2016-06-03 | 2021-12-28 | Microsoft Technology Licensing, Llc | Relation extraction across sentence boundaries |
CN106202042B (en) * | 2016-07-06 | 2019-07-02 | 中央民族大学 | A kind of keyword abstraction method based on figure |
CN106227722B (en) * | 2016-09-12 | 2019-07-05 | 中山大学 | A kind of extraction method based on listed company's bulletin abstract |
CN106407182A (en) * | 2016-09-19 | 2017-02-15 | 国网福建省电力有限公司 | A method for automatic abstracting for electronic official documents of enterprises |
CN106502985B (en) * | 2016-10-20 | 2020-01-31 | 清华大学 | neural network modeling method and device for generating titles |
CN108509408B (en) * | 2017-02-27 | 2019-11-22 | 芋头科技(杭州)有限公司 | A kind of sentence similarity judgment method |
CN108287858B (en) * | 2017-03-02 | 2021-08-10 | 腾讯科技(深圳)有限公司 | Semantic extraction method and device for natural language |
CN108733682B (en) * | 2017-04-14 | 2021-06-22 | 华为技术有限公司 | Method and device for generating multi-document abstract |
CN107169049B (en) * | 2017-04-25 | 2023-04-28 | 腾讯科技(深圳)有限公司 | Application tag information generation method and device |
CN108959312B (en) | 2017-05-23 | 2021-01-29 | 华为技术有限公司 | Method, device and terminal for generating multi-document abstract |
CN107274077B (en) * | 2017-05-31 | 2020-07-31 | 清华大学 | Course first-order and last-order computing method and equipment |
CN107291836B (en) * | 2017-05-31 | 2020-06-02 | 北京大学 | Chinese text abstract obtaining method based on semantic relevancy model |
CN107291895B (en) * | 2017-06-21 | 2020-05-26 | 浙江大学 | Quick hierarchical document query method |
CN107562718B (en) * | 2017-07-24 | 2020-12-22 | 科大讯飞股份有限公司 | Text normalization method and device, storage medium and electronic equipment |
CN107463658B (en) * | 2017-07-31 | 2020-03-31 | 广州市香港科大霍英东研究院 | Text classification method and device |
CN107766419B (en) * | 2017-09-08 | 2021-08-31 | 广州汪汪信息技术有限公司 | Threshold denoising-based TextRank document summarization method and device |
CN108182621A (en) * | 2017-12-07 | 2018-06-19 | 合肥美的智能科技有限公司 | The Method of Commodity Recommendation and device for recommending the commodity, equipment and storage medium |
CN108304445B (en) * | 2017-12-07 | 2021-08-03 | 新华网股份有限公司 | Text abstract generation method and device |
CN108182247A (en) * | 2017-12-28 | 2018-06-19 | 东软集团股份有限公司 | Text summarization method and apparatus |
CN108090049B (en) * | 2018-01-17 | 2021-02-05 | 山东工商学院 | Multi-document abstract automatic extraction method and system based on sentence vectors |
CN110609997B (en) * | 2018-06-15 | 2023-05-23 | 北京百度网讯科技有限公司 | Method and device for generating abstract of text |
CN110891074A (en) * | 2018-08-06 | 2020-03-17 | 珠海格力电器股份有限公司 | Information pushing method and device |
CN109408802A (en) * | 2018-08-28 | 2019-03-01 | 厦门快商通信息技术有限公司 | A kind of method, system and storage medium promoting sentence vector semanteme |
CN109522403B (en) * | 2018-11-05 | 2023-04-21 | 中山大学 | Abstract text generation method based on fusion coding |
CN109657051A (en) * | 2018-11-30 | 2019-04-19 | 平安科技(深圳)有限公司 | Text snippet generation method, device, computer equipment and storage medium |
CN109684642B (en) * | 2018-12-26 | 2023-01-13 | 重庆电信系统集成有限公司 | Abstract extraction method combining page parsing rule and NLP text vectorization |
CN109902284A (en) * | 2018-12-30 | 2019-06-18 | 中国科学院软件研究所 | A kind of unsupervised argument extracting method excavated based on debate |
CN110083828A (en) * | 2019-03-29 | 2019-08-02 | 珠海远光移动互联科技有限公司 | A kind of Text Clustering Method and device |
CN110096705B (en) * | 2019-04-29 | 2023-09-08 | 扬州大学 | Unsupervised English sentence automatic simplification algorithm |
CN110032741B (en) * | 2019-05-06 | 2020-02-04 | 重庆理工大学 | Pseudo text generation method based on semantic extension and maximum edge correlation |
CN112036165A (en) * | 2019-05-14 | 2020-12-04 | 西交利物浦大学 | Method for constructing news characteristic vector and application |
CN110232109A (en) * | 2019-05-17 | 2019-09-13 | 深圳市兴海物联科技有限公司 | A kind of Internet public opinion analysis method and system |
CN110287309B (en) * | 2019-06-21 | 2022-04-22 | 深圳大学 | Method for quickly extracting text abstract |
CN110362674B (en) * | 2019-07-18 | 2020-08-04 | 中国搜索信息科技股份有限公司 | Microblog news abstract extraction type generation method based on convolutional neural network |
CN110737768B (en) * | 2019-10-16 | 2022-04-08 | 信雅达科技股份有限公司 | Text abstract automatic generation method and device based on deep learning and storage medium |
CN111125349A (en) * | 2019-12-17 | 2020-05-08 | 辽宁大学 | Graph model text abstract generation method based on word frequency and semantics |
CN111090731A (en) * | 2019-12-20 | 2020-05-01 | 山大地纬软件股份有限公司 | Electric power public opinion abstract extraction optimization method and system based on topic clustering |
US11263388B2 (en) | 2020-02-17 | 2022-03-01 | Wipro Limited | Method and system for dynamically generating summarised content for visual and contextual text data |
CN111339754B (en) * | 2020-03-04 | 2022-06-21 | 昆明理工大学 | Case public opinion abstract generation method based on case element sentence association graph convolution |
CN111460117B (en) * | 2020-03-20 | 2024-03-08 | 平安科技(深圳)有限公司 | Method and device for generating intent corpus of conversation robot, medium and electronic equipment |
CN111625621B (en) * | 2020-04-27 | 2023-05-09 | 中国铁道科学研究院集团有限公司电子计算技术研究所 | Document retrieval method and device, electronic equipment and storage medium |
CN111651562B (en) * | 2020-06-05 | 2023-03-21 | 东北电力大学 | Scientific and technological literature content deep revealing method based on content map |
CN111897925B (en) * | 2020-08-04 | 2022-08-26 | 广西财经学院 | Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning |
CN112347241A (en) * | 2020-11-10 | 2021-02-09 | 华夏幸福产业投资有限公司 | Abstract extraction method, device, equipment and storage medium |
CN112560496B (en) * | 2020-12-09 | 2024-02-02 | 北京百度网讯科技有限公司 | Training method and device of semantic analysis model, electronic equipment and storage medium |
CN113157914B (en) * | 2021-02-04 | 2022-06-14 | 福州大学 | Document abstract extraction method and system based on multilayer recurrent neural network |
CN112711662A (en) * | 2021-03-29 | 2021-04-27 | 贝壳找房(北京)科技有限公司 | Text acquisition method and device, readable storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622411A (en) * | 2012-02-17 | 2012-08-01 | 清华大学 | Structured abstract generating method |
CN103605702A (en) * | 2013-11-08 | 2014-02-26 | 北京邮电大学 | Word similarity based network text classification method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7480893B2 (en) * | 2002-10-04 | 2009-01-20 | Siemens Corporate Research, Inc. | Rule-based system and method for checking compliance of architectural analysis and design models |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622411A (en) * | 2012-02-17 | 2012-08-01 | 清华大学 | Structured abstract generating method |
CN103605702A (en) * | 2013-11-08 | 2014-02-26 | 北京邮电大学 | Word similarity based network text classification method |
Non-Patent Citations (1)
Title |
---|
Algorithm for generating biomedical abstracts using semantic relation extraction; Shang Yue et al.; Journal of Frontiers of Computer Science and Technology; 2011-12-31; Vol. 5, No. 11; pp. 1027-1036 *
Also Published As
Publication number | Publication date |
---|---|
CN104834735A (en) | 2015-08-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104834735B (en) | A kind of documentation summary extraction method based on term vector | |
Bollacker et al. | CiteSeer: An autonomous web agent for automatic retrieval and identification of interesting publications | |
Sambasivam et al. | Advanced data clustering methods of mining Web documents. | |
KR20060122276A (en) | Relation extraction from documents for the automatic construction of ontologies | |
Biancalana et al. | Social tagging in query expansion: A new way for personalized web search | |
Odeh et al. | Arabic text categorization algorithm using vector evaluation method | |
CN109597995A (en) | A kind of document representation method based on BM25 weighted combination term vector | |
CN104765779A (en) | Patent document inquiry extension method based on YAGO2s | |
CN110851593A (en) | Complex value word vector construction method based on position and semantics | |
Mao et al. | Automatic keywords extraction based on co-occurrence and semantic relationships between words | |
Ghanem et al. | Stemming effectiveness in clustering of Arabic documents | |
Xu et al. | Improving pseudo-relevance feedback with neural network-based word representations | |
Gamal et al. | Hybrid Algorithm Based on Chicken Swarm Optimization and Genetic Algorithm for Text Summarization. | |
Khotimah et al. | Indonesian News Articles Summarization Using Genetic Algorithm. | |
KR102280494B1 (en) | Method for providing internet search service sorted by correlation based priority specialized in professional areas | |
Nakashole | Automatic extraction of facts, relations, and entities for web-scale knowledge base population | |
Asa et al. | A comprehensive survey on extractive text summarization techniques | |
Pita et al. | Strategies for short text representation in the word vector space | |
Heidary et al. | Automatic text summarization using genetic algorithm and repetitive patterns | |
Heidary et al. | Automatic Persian text summarization using linguistic features from text structure analysis | |
Kanwal et al. | Adaptively intelligent meta-search engine with minimum edit distance | |
CN114580557A (en) | Document similarity determination method and device based on semantic analysis | |
KR102198780B1 (en) | Method for providing correlation based internet search service specialized in professional areas | |
Maria et al. | A new model for Arabic multi-document text summarization | |
Munirsyah et al. | Development synonym set for the English wordnet using the method of comutative and agglomerative clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||