CN104834735B - A word-vector-based automatic document summarization method - Google Patents

A word-vector-based automatic document summarization method

Info

Publication number
CN104834735B
CN104834735B CN201510254719.2A
Authority
CN
China
Prior art keywords
sentence
term vector
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510254719.2A
Other languages
Chinese (zh)
Other versions
CN104834735A (en)
Inventor
林鸿飞
郝辉辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN201510254719.2A priority Critical patent/CN104834735B/en
Publication of CN104834735A publication Critical patent/CN104834735A/en
Application granted granted Critical
Publication of CN104834735B publication Critical patent/CN104834735B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A word-vector-based automatic document summarization method comprises the following steps: S1, training a corpus with a deep neural network model to obtain word-vector representations of feature words; S2, building a sentence graph model; S3, computing sentence weights; S4, generating the summary with the maximal marginal relevance algorithm. The invention collects a corpus and pre-processes it to obtain a training feature corpus; the training feature corpus is then trained with a deep neural network model to obtain the word vectors of the feature words. A candidate document set and a candidate sentence set are retrieved from the corpus according to preset query words, and the semantic similarity between sentences is computed from the word vectors of the feature words, so that the semantic relation between any two sentences is obtained. This avoids the computation errors that traditional word-co-occurrence methods make when different words express the same meaning, and thereby improves both the accuracy of the similarity computation and the quality of the summary.

Description

A word-vector-based automatic document summarization method
Technical field
The present invention relates to the fields of computer information retrieval and text mining, and in particular to a word-vector-based automatic document summarization method.
Background art
Text summarization is an important part of text mining research: it finds the most important information in a document or document set and presents it as a concise, coherent short text. With scientific progress and the development of network technology, a massive amount of information is available on the Internet. Faced with such large volumes of data, text summarization helps users quickly grasp the information they need, saves reading time, and improves work efficiency.
Current text summarization techniques are mainly extractive: the most important sentences are extracted from the original text to form the summary. The generation process has three steps: sentence representation, sentence scoring, and summary generation. Specifically, a sentence is first represented in one of several forms, for example as a vector of feature words built from term frequency, TF*IDF, or topic words. Once the representation is fixed, a sentence score is computed with a conventional ranking method such as BM25 or PageRank to reflect the importance of the sentence; finally, the higher-scoring sentences are added to the summary after a redundancy-removal step. Text summarization now has a research history of some fifty years. With the rapid development of information retrieval, summarization techniques have also matured: from the earliest methods based on term frequency and TF*IDF, through the introduction of machine learning and pattern-based representations, summarization performance has improved greatly.
Statistical methods based on term frequency or TF*IDF assume that the more high-frequency or high-TF*IDF words a sentence contains, the more important it is, and hence the more likely it is to be added to the final summary. Specifically, the candidate corpus is first pre-processed (stop-word removal, stemming, etc.) and the term frequency or TF*IDF of each feature word is counted; then, for every sentence in the candidate document set, the importance of the sentence is computed. The simplest practical approach is to take the average probability of the feature words in the sentence, i.e. to sum the probabilities of the feature words and divide by the sentence length. Finally, the sentences are ranked and the highest-scoring ones are added to the generated summary. Because such methods are easy to compute and implement, they are widely used as baselines, but they are biased toward high-frequency words, the generated summary often covers only the dominant topic of the candidate set, and semantic understanding is lacking, so the summarization performance is not remarkable.
In recent years, with the continued spread and improvement of machine learning, more and more researchers have introduced machine learning methods into their experiments, and text summarization is no exception. One approach uses supervised learning and treats summarization as a binary classification problem: each candidate sentence either can or cannot be added to the final summary. On a training set, a classification model such as logistic regression, naive Bayes or SVM is trained to obtain an optimal weight vector, which is then used to predict the class of the test sentences. Another approach represents sentences with various features such as sentence position, term frequency and cue words, trains a learning-to-rank algorithm on the training set to obtain an optimal feature weight vector, and uses it to score the candidate sentences of the test set. A further approach treats summarization as a clustering problem: the sentences of the candidate documents are clustered, the sentences within each cluster are ranked with the statistical or ranking methods described above, and the top n sentences of each cluster form the summary. There are many more machine learning methods for automatic summarization beyond those described here. Although machine learning methods keep appearing in text summarization, in the general multi-document news summarization setting their performance is not better than that of unsupervised methods, and they are more suitable for special domains or particular types of summaries. Moreover, machine learning usually means supervised models, which require labeled data; labeling is generally done manually, which is time-consuming and subjective, so machine learning methods still need further improvement.
Graph-based text summarization has attracted many researchers because it is unsupervised, considers the document globally, needs no domain knowledge or syntactic and semantic analysis, and achieves good summarization performance. In this approach, sentences are the nodes of a graph and the similarity between sentences is the weight of the edge connecting them; node weights are computed iteratively with methods such as PageRank or HITS, and the sentences with the largest weights are finally added to the summary. The values of the sentence similarity matrix represent the probability of jumping from one sentence to another, so computing the node weights accurately is essential. However, when traditional graph methods compute the similarity between sentences, they mostly rely on the co-occurrence of feature words and ignore the semantic similarity between sentences, which reduces the accuracy of the node weights and harms the quality of the summary.
Summary of the invention
The object of the present invention is to provide a word-vector-based automatic document summarization method that effectively avoids the errors introduced by traditional word-co-occurrence sentence similarity computation and extracts accurate and highly readable document summaries for the user.
The technical solution adopted by the present invention to solve the problems of the prior art is a word-vector-based automatic document summarization method comprising the following steps:
S1, training a corpus with a deep neural network model to obtain word-vector representations of feature words: a corpus is collected from a document database and pre-processed; the pre-processing includes splitting the material of the corpus into sentences and, sentence by sentence, removing stop words, special characters and punctuation marks against a stop-word list, yielding a training feature corpus. Training parameters are set and the training feature corpus is used as training data for the deep neural network model; each word of the training feature corpus is treated as a feature word, trained with the Skip-gram model and output in word-vector form, so that the word-vector representations of the feature words are obtained;
S2, building the sentence graph model:
This comprises the following steps:
A1, pre-processing: the corpus collected in step S1 is searched with preset query words; the retrieved documents are taken as candidate documents, the candidate documents are split into sentences and duplicate sentences in the candidate document set are removed, yielding the candidate sentence set of the summary;
A2, building the model: every sentence in the candidate sentence set is taken as a node of the graph model and assigned the average initial weight:

$$\mathrm{Weight}(S_i) = \frac{1}{N}$$

where S_i is any sentence of the candidate sentence set S and N is the total number of sentences. Using the word vectors of the feature words obtained in step S1, the semantic similarity between sentences is computed and used as the weight of the edges of the graph, forming the sentence graph model;
For any two sentences S_i and S_j of the candidate sentence set, containing feature words t_i and t_j with word vectors \vec{t}_i and \vec{t}_j respectively, the semantic similarity Similarity(S_i, S_j) between S_i and S_j is:

$$\mathrm{Similarity}(S_i, S_j) = \frac{\sum_{t_i \in S_i} \mathrm{Sim}_m(\vec{t}_i, S_j) + \sum_{t_j \in S_j} \mathrm{Sim}_m(\vec{t}_j, S_i)}{|S_i| + |S_j|}$$

where, for the word vector \vec{t}_i of a feature word t_i in sentence S_i, Sim_m(\vec{t}_i, S_j) denotes the maximum similarity between \vec{t}_i and the word vectors of all feature words of sentence S_j that have the same part of speech as t_i; |S_i| and |S_j| denote the lengths of S_i and S_j;
The similarity between the word vectors of two feature words is obtained with the following formula:

$$\cos(\vec{t}_1, \vec{t}_2) = \frac{\vec{t}_1 \cdot \vec{t}_2}{\|\vec{t}_1\| \, \|\vec{t}_2\|}$$

where \vec{t}_1 and \vec{t}_2 are the feature word vectors of the two feature words t_1 and t_2 obtained by training the deep neural network model of step S1.
S3, computing the sentence weights: for the graph model obtained in step S2, the weight of every node is updated iteratively from the average initial weights and the inter-sentence semantic similarities of step S2 with the following formula, until convergence:

$$\mathrm{Weight}(S_i) = (1-d) + d \times \sum_{S_j \in \mathrm{Connection}(S_i)} \frac{\mathrm{Similarity}(S_i, S_j)}{\|\mathrm{Connection}(S_j)\|} \times \mathrm{Weight}(S_j)$$

where d is a damping coefficient with a value range of 0–1, Connection(S_i) is the set of sentences whose similarity to sentence S_i is greater than 0, and ||Connection(S_i)|| is the number of sentences in that set;
S4, generating the summary with the maximal marginal relevance algorithm: the maximal marginal relevance algorithm selects the sentences with the largest weights and without redundancy to form the summary, with the following concrete steps:
b1), an empty summary sentence set is created; the sentences corresponding to the nodes of the graph model form the initial candidate summary sentence set;
b2), the sentences of the candidate summary sentence set are sorted by the weight of their graph nodes in descending order; the sorted sentences form the candidate summary sentence sequence;
b3), according to the candidate summary sentence sequence, the sentence ranked first is moved into the summary sentence set, and the weights of the remaining sentences of the candidate summary sentence set are updated with the following formula:
Weight(S_j) = Weight(S_j) − ω × Similarity(S_i, S_j)
where i ≠ j, ω is a penalty factor, and Similarity(S_i, S_j) is the sentence semantic similarity obtained in step S2;
b4), steps b2) and b3) are repeated until the sentences of the summary sentence set reach the preset summary length.
When the sentence whose weight is being updated is similar to a sentence already in the summary sentence set, the penalty factor ω is 1.0.
The deep neural network model is the Skip-gram model, and the Skip-gram model is trained with the hierarchical softmax method.
The damping coefficient d in step S3 is 0.85.
The preset summary length is 150 words.
The beneficial effects of the present invention are: the invention collects a corpus and pre-processes it to obtain a training feature corpus; the training feature corpus is trained with a deep neural network model to obtain the word vectors of the feature words. A candidate document set and a candidate sentence set are retrieved from the corpus according to preset query words, and the semantic similarity between sentences is obtained from the word vectors of the feature words, giving the semantic relation between any two sentences. This avoids the computation errors that traditional word-co-occurrence methods make when different words express the same meaning, and thereby improves the accuracy of the similarity computation and the summarization performance. In a stand-alone environment (single-core 3.0 GHz CPU, 4 GB of memory, likewise below), the training corpus for the feature word vectors is 1.2 GB, and the resulting trained model occupies 148,420 KB.
Brief description of the drawings
Fig. 1 is the logic diagram of the present invention.
Fig. 2 shows the result obtained after steps S1–S3 of an embodiment of the present invention.
Fig. 3 is the final result of the embodiment of the present invention.
Detailed description of the embodiments
The present invention is described below with reference to the drawings and specific embodiments:
Fig. 1 is the logic diagram of the word-vector-based automatic document summarization method of the present invention. The method comprises the following steps:
S1, training a corpus with a deep neural network model to obtain word-vector representations of feature words: a corpus is collected from a document database and pre-processed; the pre-processing includes splitting the material of the corpus into sentences and, sentence by sentence, removing stop words, special characters, punctuation marks, etc. against a stop-word list, yielding a training feature corpus. Training parameters are set, the training feature corpus is taken as training data, and the deep neural network model Skip-gram is trained with the hierarchical softmax method; each word of the training feature corpus is treated as a feature word, trained with the Skip-gram model and output in word-vector form, so that the word-vector representations of the feature words are obtained;
Specifically, to train the word-vector representations of feature words from large amounts of unstructured text data, the present invention mainly uses the Skip-gram model. Compared with other methods built on neural network structures, this model involves no large matrix multiplications and is therefore very efficient. The Skip-gram model uses the word vector of the current word to predict the word vectors of the context within a specified window. Given the feature corpus w_1, w_2, w_3, …, w_T as training data, the objective function of Skip-gram is

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where c is the parameter determining the size of the context window; the larger c is, the more training data and training time are needed, but the higher the achievable accuracy.
The basic Skip-gram model defines p(w_O | w_I) as:

$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_w}^{\top} v_{w_I}\right)}$$

where v_w and v'_w are the "input" and "output" vector representations of w, and W is the number of words in the vocabulary. Because the cost of computing ∇log p(w_O | w_I) is proportional to W, which is usually very large (on the order of 10^5–10^7), other formulas are often used for approximate computation.
The present invention trains the deep neural network model Skip-gram with the hierarchical softmax algorithm. The algorithm uses a binary Huffman tree representation, takes the W words of the output layer as leaf nodes and assigns shorter paths to high-frequency words, which speeds up training. Every feature word w can be reached from the root of the tree along a unique path. Let n(w, j) be the j-th node on the path from the root to w and let L(w) be the length of this path, so that n(w, 1) = root and n(w, L(w)) = w. For any inner node n, let ch(n) be a fixed child of n. Hierarchical softmax then defines p(w_O | w_I) as follows:

$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\left( [\![\, n(w, j+1) = \mathrm{ch}(n(w, j)) \,]\!] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right)$$

where [[x]] is 1 if x is true and −1 otherwise, and σ(x) = 1/(1 + e^{−x}). With this formula, the cost of computing log p(w_O | w_I) and ∇log p(w_O | w_I) is proportional to L(w_O), which is generally no greater than log W.
After the objective function is defined with the formula above, it is solved by stochastic gradient descent, ultimately producing the word-vector representation of every word.
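For illustration only (this sketch is not part of the original disclosure; the path representation, vector names and toy dimensions are assumptions), the following Python code evaluates the hierarchical softmax probability p(w_O | w_I) along a Huffman path as defined above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(v_in, path_node_vectors, codes):
    """Probability p(w_O | w_I) under hierarchical softmax.

    v_in              -- input vector of w_I
    path_node_vectors -- vectors v'_{n(w_O, j)} of the inner nodes on the
                         root-to-w_O path (length L(w_O) - 1)
    codes             -- +1 / -1 signs encoding whether the path turns to
                         ch(n(w_O, j)) at each inner node
    """
    p = 1.0
    for v_node, sign in zip(path_node_vectors, codes):
        p *= sigmoid(sign * np.dot(v_node, v_in))
    return p

# Toy usage with random 200-dimensional vectors (dimension as in the embodiment).
rng = np.random.default_rng(0)
v_in = rng.normal(size=200)
path = [rng.normal(size=200) for _ in range(3)]   # 3 inner nodes on the path
print(hs_probability(v_in, path, codes=[+1, -1, +1]))
```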
S2, building the sentence graph model: this comprises the following steps:
A1, pre-processing: the corpus collected in step S1 is searched with preset query words; the retrieved documents are taken as candidate documents, the candidate documents are split into sentences and duplicate sentences in the candidate document set are removed, yielding the candidate sentence set of the summary;
A2, building the model: every sentence in the candidate sentence set is taken as a node of the graph model and assigned the average initial weight:

$$\mathrm{Weight}(S_i) = \frac{1}{N}$$

where S_i is any sentence of the candidate sentence set S and N is the total number of sentences. Using the word vectors of the feature words obtained in step S1, the semantic similarity between sentences is computed and used as the weight of the edges of the graph, forming the sentence graph model;
For any two sentences S_i and S_j of the candidate sentence set, containing feature words t_i and t_j with word vectors \vec{t}_i and \vec{t}_j respectively, the semantic similarity Similarity(S_i, S_j) between S_i and S_j is:

$$\mathrm{Similarity}(S_i, S_j) = \frac{\sum_{t_i \in S_i} \mathrm{Sim}_m(\vec{t}_i, S_j) + \sum_{t_j \in S_j} \mathrm{Sim}_m(\vec{t}_j, S_i)}{|S_i| + |S_j|}$$

where, for the word vector \vec{t}_i of a feature word t_i in sentence S_i, Sim_m(\vec{t}_i, S_j) denotes the maximum similarity between \vec{t}_i and the word vectors of all feature words of sentence S_j that have the same part of speech as t_i; |S_i| and |S_j| denote the lengths of S_i and S_j;
The similarity value between the word vectors of two feature words is obtained with the following formula:

$$\cos(\vec{t}_1, \vec{t}_2) = \frac{\vec{t}_1 \cdot \vec{t}_2}{\|\vec{t}_1\| \, \|\vec{t}_2\|}$$

where \vec{t}_1 and \vec{t}_2 are the feature word vectors of the two feature words t_1 and t_2 obtained by training the deep neural network model of step S1.
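A minimal Python sketch of this similarity computation follows; it is illustrative only, and the lookup tables word_vectors (word → vector) and word_pos (word → part-of-speech tag) are assumed to have been built from the trained model and a POS tagger, which the patent text does not specify:

```python
import numpy as np

def cos(t1, t2):
    """Cosine similarity between two word vectors."""
    return float(np.dot(t1, t2) / (np.linalg.norm(t1) * np.linalg.norm(t2)))

def sim_m(t_vec, t_pos, sentence, word_vectors, word_pos):
    """Maximum similarity between one word vector and the vectors of all
    words of the same part of speech in the other sentence (0 if none)."""
    scores = [cos(t_vec, word_vectors[w])
              for w in sentence
              if w in word_vectors and word_pos.get(w) == t_pos]
    return max(scores, default=0.0)

def sentence_similarity(s_i, s_j, word_vectors, word_pos):
    """Semantic similarity between two sentences given as lists of feature words."""
    total = sum(sim_m(word_vectors[t], word_pos.get(t), s_j, word_vectors, word_pos)
                for t in s_i if t in word_vectors)
    total += sum(sim_m(word_vectors[t], word_pos.get(t), s_i, word_vectors, word_pos)
                 for t in s_j if t in word_vectors)
    return total / (len(s_i) + len(s_j))
```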
S3, computing the sentence weights: for the graph model obtained in step S2, the weight of every node is updated iteratively from the average initial weights of step S2 and the inter-sentence semantic similarities with the improved PageRank formula given below, until convergence, so as to obtain a score value that reflects the importance of each sentence.
Because differences in the similarity between sentences produce different edge weights between nodes, and because the similarity is symmetric, an improved PageRank formula is used here. d is the damping coefficient with a value range of 0–1, 0.85 usually being preferred; Connection(S_i) denotes the set of sentences connected to S_i, i.e. the sentences whose similarity to sentence S_i is greater than 0, and ||Connection(S_i)|| is the number of sentences in that set;
The PageRank formula is based on the random surfer idea: the importance of a web page is measured through the links pointing to it; concretely, the quality and number of the link sources determine the weight of the link target. Its formula is:

$$\mathrm{Weight}(S_i) = (1-d) + d \times \sum_{S_j \in \mathrm{In}(S_i)} \frac{\mathrm{Weight}(S_j)}{\|\mathrm{Out}(S_j)\|}$$

where d is usually taken as 0.85, In(S_i) denotes the set of pages linking to S_i, and ||Out(S_j)|| is the total number of outgoing links of page S_j.
In the present invention, the PageRank idea is applied to the sentence graph model to obtain the final sentence weights. To better handle the differences in edge weight caused by the differences in similarity between sentences, as well as the symmetry of the similarity, the present invention improves the existing PageRank formula into the following form:

$$\mathrm{Weight}(S_i) = (1-d) + d \times \sum_{S_j \in \mathrm{Connection}(S_i)} \frac{\mathrm{Similarity}(S_i, S_j)}{\|\mathrm{Connection}(S_j)\|} \times \mathrm{Weight}(S_j)$$

Here d is again set to 0.85. Connection(S_i) denotes the set of sentences connected to S_i, i.e. the sentences whose similarity to S_i is greater than 0, and ||Connection(S_i)|| is the number of sentences in that set.
Using the similarity matrix formed by the average initial node weights and the inter-sentence semantic similarities of step S2, the weight of every node of the graph model, i.e. the sentence weight, is iterated with the improved PageRank formula until convergence. Each node finally obtains a score reflecting its importance, in preparation for generating the summary in the next step.
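The iteration can be sketched in Python as follows; this is an illustrative implementation of the improved formula above, not the patent's own code, and it assumes the similarity matrix has already been computed in step S2:

```python
import numpy as np

def iterate_weights(similarity, d=0.85, tol=1e-6, max_iter=100):
    """Iteratively update sentence weights with the improved PageRank formula.

    similarity -- symmetric N x N matrix of inter-sentence semantic similarities
    Returns the converged vector of sentence weights.
    """
    sim = np.asarray(similarity, dtype=float).copy()
    np.fill_diagonal(sim, 0.0)                    # a sentence is not its own neighbour
    n = sim.shape[0]
    weights = np.full(n, 1.0 / n)                 # average initial weight 1/N
    degree = (sim > 0).sum(axis=0).astype(float)  # ||Connection(S_j)||
    degree[degree == 0] = 1.0                     # avoid division by zero
    for _ in range(max_iter):
        new = (1 - d) + d * (sim / degree) @ weights   # sim[i,j] / ||Connection(S_j)||
        if np.max(np.abs(new - weights)) < tol:        # stop at convergence
            return new
        weights = new
    return weights
```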
Because the sentences are similar to one another, directly adding the K sentences with the largest weights to the summary would introduce considerable redundancy. To reduce the redundancy rate of the summary, the present invention uses the maximal marginal relevance algorithm, whose basic idea is: if a sentence is highly similar to a sentence already in the summary, the sentence is penalized. Hence the following step S4:
S4, generating the summary with the maximal marginal relevance algorithm: the maximal marginal relevance algorithm selects the sentences with the largest weights and without redundancy to form the summary, with the following concrete steps:
b1), an empty summary sentence set is created as the initial summary sentence set; the sentences corresponding to the nodes of the graph model form the initial candidate summary sentence set;
b2), the sentences of the candidate summary sentence set are sorted in descending order of the weights computed in step S3 for their graph nodes; the sorted sentences form the candidate summary sentence sequence;
b3), according to the candidate summary sentence sequence, the sentence ranked first is moved into the summary sentence set, and the weights of the remaining sentences of the candidate summary sentence set are updated with the following formula:
Weight(S_j) = Weight(S_j) − ω × Similarity(S_i, S_j)
where i ≠ j and ω is the penalty factor; when the sentence whose weight is being updated is similar to a sentence in the summary sentence set, ω is 1.0. Similarity(S_i, S_j) is the sentence semantic similarity obtained in step S2;
b4), steps b2) and b3) are repeated until the sentences of the summary sentence set reach the preset summary length.
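A greedy sketch of this selection loop in Python is given below; it is illustrative only, and the sentence strings, weights and similarity matrix are assumed to come from steps S2 and S3:

```python
def mmr_select(sentences, weights, similarity, max_words=150, omega=1.0):
    """Greedy maximal-marginal-relevance selection (illustrative sketch).

    sentences  -- list of candidate sentence strings
    weights    -- converged node weights from step S3
    similarity -- N x N sentence similarity matrix from step S2
    max_words  -- preset summary length (150 words in the embodiment)
    omega      -- penalty factor
    """
    candidates = list(range(len(sentences)))
    weights = list(weights)
    summary, length = [], 0
    while candidates and length < max_words:
        # b2) sort remaining candidates by (updated) weight, descending
        candidates.sort(key=lambda i: weights[i], reverse=True)
        # b3) move the top-ranked sentence into the summary ...
        best = candidates.pop(0)
        summary.append(sentences[best])
        length += len(sentences[best].split())
        # ... and penalise the remaining candidates by their similarity to it
        for j in candidates:
            weights[j] -= omega * similarity[best][j]
    return summary
```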
Embodiment:
To make the purpose, technical solution and beneficial effects of the present invention clearer and easier to implement, the present invention is described in further detail below with reference to the following specific embodiment and the drawings. In this embodiment the preset length of the generated summary is 150 words.
S1, training the corpus with the deep neural network model to obtain the word-vector representations of the feature words:
To obtain the vector representations of the feature words, the embodiment collects the experimental corpus from MEDLINE, the biomedical bibliographic database maintained by the U.S. National Library of Medicine. Concretely, all literature citations of 2011–2012 are queried on MEDLINE and used as the corpus; the sentences of the citations are pre-processed, i.e. stop words, special characters, punctuation marks, etc. are removed against a stop-word list, finally yielding a 1.2 GB training corpus.
In the training process of this embodiment, the dimension of the feature word vectors is set to 200, the Skip-gram model is trained with hierarchical softmax, only feature words with a frequency greater than 3 are considered, and the window size is set to 5.
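The patent does not name a particular implementation; as an assumption, the open-source gensim library (version 4 or later) can reproduce these training settings roughly as follows, here with a toy stand-in corpus so the snippet runs on its own:

```python
from gensim.models import Word2Vec

# Toy stand-in for the tokenized, stop-word-filtered MEDLINE sentences;
# in the embodiment this iterable would cover the full 1.2 GB corpus.
corpus_sentences = [
    ["hiv", "infection", "antiretroviral", "therapy"],
    ["hiv", "infection", "treatment", "patients"],
    ["antiretroviral", "therapy", "reduces", "hiv", "viral", "load"],
]

model = Word2Vec(
    sentences=corpus_sentences,
    vector_size=200,    # 200-dimensional word vectors, as in the embodiment
    sg=1,               # Skip-gram architecture
    hs=1, negative=0,   # train with hierarchical softmax, no negative sampling
    min_count=1,        # the embodiment keeps only words with frequency > 3
    window=5,           # context window size
)
vector = model.wv["hiv"]   # 200-dimensional vector of one feature word
```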
S2, building the sentence graph model:
This comprises the following steps: the embodiment sets "HIV Infection" as the search keyword, retrieves all related citations on MEDLINE and obtains the candidate documents corresponding to the query; the candidate documents are split into sentences and duplicate sentences of the candidate document set are removed, yielding the candidate sentence set of the summary, which finally contains 4581 sentences.
A2, building the model:
Every sentence of the set is taken as a node of the graph model, and according to the average initial weight formula Weight(S_i) = 1/N each node of the graph model is assigned the average initial weight, i.e. 1/4581;
using the word vectors of the feature words trained in step S1 together with the inter-sentence semantic similarity formula and the cosine formula given above, the similarity between sentences is computed as the weight of the edges of the graph, generating the sentence graph model.
S3, computing the sentence weights:
For the above graph model, the weight of every node is iterated with the improved PageRank formula until convergence.
Fig. 2 shows the summary of the disease "HIV Infection" of the specified length formed by the top K sentences after the above three steps of the present invention have been applied and the candidate sentence weights have been sorted in descending order; the details are as follows:
S4, generating the summary with the maximal marginal relevance algorithm:
The sentences weighted in the previous step are sorted in descending order; to eliminate the redundancy of the summary, the maximal marginal relevance algorithm penalizes the sentences that are similar to sentences already in the summary, and the top K sentences with the largest weights are selected to form the summary. The concrete steps are:
b1), an empty summary sentence set is created as the initial summary sentence set; the sentences corresponding to the nodes of the graph model form the initial candidate summary sentence set;
b2), the sentences of the candidate summary sentence set are sorted by the weight of their graph nodes in descending order; the sorted sentences form the candidate summary sentence sequence;
b3), according to the candidate summary sentence sequence, the sentence ranked first is moved into the summary sentence set,
and the weights of the remaining sentences of the candidate summary sentence set are updated with the following formula:
Weight(S_j) = Weight(S_j) − ω × Similarity(S_i, S_j)
where i ≠ j and ω is the penalty factor; when the sentence whose weight is being updated is similar to a sentence in the summary sentence set, ω is 1.0. Similarity(S_i, S_j) is the sentence semantic similarity obtained in step S2;
b4), steps b2) and b3) are repeated until the sentences of the summary sentence set reach the preset summary length.
Fig. 3 shows the summary of the disease "HIV Infection" of the specified length that is finally generated after the candidate sentence set has been ranked and de-redundancy has been applied in the above four steps of the present invention.
Comparing the summary results of Fig. 2 and Fig. 3, the summary before redundancy removal consists mostly of short sentences, contains many repeated words, and includes sentences that are semantically similar to each other. The summary after redundancy removal not only retains the important information but also covers more semantic aspects and carries more information, so its overall structure is better.
The above embodiment describes and illustrates the method of the present invention. The method trains the word vectors of the feature words with a deep neural network algorithm and thereby computes the similarity between sentences accurately, iteratively updates the sentence weights following the PageRank idea, and eliminates the information redundancy of the summary with the maximal marginal relevance algorithm, improving the performance of the summaries generated by the system and better meeting the information needs of users.
The above content is a further detailed description of the present invention in combination with specific preferred technical solutions, and the specific implementation of the present invention cannot be regarded as being limited to these descriptions. For an ordinary person skilled in the technical field of the present invention, several simple deductions or substitutions may be made without departing from the concept of the present invention, and these should all be regarded as falling within the scope of protection of the present invention.

Claims (5)

1. A word-vector-based automatic document summarization method, characterized in that it comprises the following steps:
S1, training a corpus with a deep neural network model to obtain word-vector representations of feature words: a corpus is collected from a document database and pre-processed; the pre-processing includes splitting the material of the corpus into sentences and, sentence by sentence, removing stop words, special characters and punctuation marks against a stop-word list, yielding a training feature corpus; training parameters are set, the training feature corpus is taken as training data and trained with the deep neural network model; each word of the training feature corpus is treated as a feature word, trained with the Skip-gram model and output in word-vector form, so that the word-vector representations of the feature words are obtained;
S2, building the sentence graph model:
This comprises the following steps:
A1, pre-processing: the corpus collected in step S1 is searched with preset query words; the retrieved documents are taken as candidate documents, the candidate documents are split into sentences and duplicate sentences in the candidate document set are removed, yielding the candidate sentence set of the summary;
A2, building the model: every sentence in the candidate sentence set is taken as a node of the graph model and assigned the average initial weight:
$$\mathrm{Weight}(S_i) = \frac{1}{N}$$
where S_i is any sentence of the candidate sentence set S and N is the total number of sentences; using the word vectors of the feature words obtained in step S1, the semantic similarity between sentences is computed and used as the weight of the edges of the graph, forming the sentence graph model;
for any two sentences S_i and S_j of the candidate sentence set, containing feature words t_i and t_j with word vectors \vec{t}_i and \vec{t}_j respectively, the semantic similarity Similarity(S_i, S_j) between sentences S_i and S_j is:
$$\mathrm{Similarity}(S_i, S_j) = \frac{\sum_{t_i \in S_i} \mathrm{Sim}_m(\vec{t}_i, S_j) + \sum_{t_j \in S_j} \mathrm{Sim}_m(\vec{t}_j, S_i)}{|S_i| + |S_j|}$$
where, for the word vector \vec{t}_i of a feature word t_i in sentence S_i, Sim_m(\vec{t}_i, S_j) denotes the maximum similarity between \vec{t}_i and the word vectors of all feature words of sentence S_j that have the same part of speech as t_i; |S_i| and |S_j| denote the lengths of S_i and S_j;
the similarity value between the word vectors of two feature words is obtained with the following formula:
$$\cos(\vec{t}_1, \vec{t}_2) = \frac{\vec{t}_1 \cdot \vec{t}_2}{\|\vec{t}_1\| \cdot \|\vec{t}_2\|}$$
where \vec{t}_1 and \vec{t}_2 are the feature word vectors of the two feature words t_1 and t_2 obtained by training the deep neural network model of step S1;
S3, computing the sentence weights: for the graph model obtained in step S2, the weight of every node is updated iteratively from the average initial weights and the inter-sentence semantic similarities of step S2 with the following formula, until convergence:
$$\mathrm{Weight}(S_i) = (1-d) + d \times \sum_{S_j \in \mathrm{Connection}(S_i)} \frac{\mathrm{Similarity}(S_i, S_j)}{\|\mathrm{Connection}(S_j)\|} \times \mathrm{Weight}(S_j)$$
where d is a damping coefficient with a value range of 0–1, Connection(S_i) is the set of sentences whose similarity to sentence S_i is greater than 0, and ||Connection(S_i)|| is the number of sentences in that set;
S4, generating the summary with the maximal marginal relevance algorithm: the maximal marginal relevance algorithm selects the sentences with the largest weights and without redundancy to form the summary, with the following concrete steps:
b1), an empty summary sentence set is created; the sentences corresponding to the nodes of the graph model form the initial candidate summary sentence set;
b2), the sentences of the candidate summary sentence set are sorted by the weight of their graph nodes in descending order; the sorted sentences form the candidate summary sentence sequence;
b3), according to the candidate summary sentence sequence, the sentence ranked first is moved into the summary sentence set, and the weights of the remaining sentences of the candidate summary sentence set are updated with the following formula:
Weight(S_j) = Weight(S_j) − ω × Similarity(S_i, S_j)
where i ≠ j, ω is a penalty factor, and Similarity(S_i, S_j) is the sentence semantic similarity obtained in step S2;
b4), steps b2) and b3) are repeated until the sentences of the summary sentence set reach the preset summary length.
2. The word-vector-based automatic document summarization method according to claim 1, characterized in that when the sentence whose weight is being updated is similar to a sentence in the summary sentence set, the penalty factor ω is 1.0.
3. The word-vector-based automatic document summarization method according to claim 1, characterized in that the deep neural network model is the Skip-gram model, and the Skip-gram model is trained with the hierarchical softmax method.
4. The word-vector-based automatic document summarization method according to claim 1, characterized in that the damping coefficient d in step S3 is 0.85.
5. The word-vector-based automatic document summarization method according to claim 1, characterized in that the preset summary length is 150 words.
CN201510254719.2A 2015-05-18 2015-05-18 A word-vector-based automatic document summarization method Active CN104834735B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510254719.2A CN104834735B (en) 2015-05-18 2015-05-18 A word-vector-based automatic document summarization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510254719.2A CN104834735B (en) 2015-05-18 2015-05-18 A word-vector-based automatic document summarization method

Publications (2)

Publication Number Publication Date
CN104834735A CN104834735A (en) 2015-08-12
CN104834735B true CN104834735B (en) 2018-01-23

Family

ID=53812621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510254719.2A Active CN104834735B (en) A word-vector-based automatic document summarization method

Country Status (1)

Country Link
CN (1) CN104834735B (en)

Families Citing this family (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105630767B (en) * 2015-12-22 2018-06-15 北京奇虎科技有限公司 The comparative approach and device of a kind of text similarity
CN105631018B (en) * 2015-12-29 2018-12-18 上海交通大学 Article Feature Extraction Method based on topic model
CN105653704B (en) * 2015-12-31 2018-10-12 南京财经大学 Autoabstract generation method and device
CN106021272B (en) * 2016-04-04 2019-11-19 上海大学 The keyword extraction method calculated based on distributed expression term vector
CN105930314B (en) * 2016-04-14 2019-02-05 清华大学 System and method is generated based on coding-decoding deep neural network text snippet
US11210324B2 (en) 2016-06-03 2021-12-28 Microsoft Technology Licensing, Llc Relation extraction across sentence boundaries
CN106202042B (en) * 2016-07-06 2019-07-02 中央民族大学 A kind of keyword abstraction method based on figure
CN106227722B (en) * 2016-09-12 2019-07-05 中山大学 A kind of extraction method based on listed company's bulletin abstract
CN106407182A (en) * 2016-09-19 2017-02-15 国网福建省电力有限公司 A method for automatic abstracting for electronic official documents of enterprises
CN106502985B (en) * 2016-10-20 2020-01-31 清华大学 neural network modeling method and device for generating titles
CN108509408B (en) * 2017-02-27 2019-11-22 芋头科技(杭州)有限公司 A kind of sentence similarity judgment method
CN108287858B (en) * 2017-03-02 2021-08-10 腾讯科技(深圳)有限公司 Semantic extraction method and device for natural language
CN108733682B (en) * 2017-04-14 2021-06-22 华为技术有限公司 Method and device for generating multi-document abstract
CN107169049B (en) * 2017-04-25 2023-04-28 腾讯科技(深圳)有限公司 Application tag information generation method and device
CN108959312B (en) 2017-05-23 2021-01-29 华为技术有限公司 Method, device and terminal for generating multi-document abstract
CN107274077B (en) * 2017-05-31 2020-07-31 清华大学 Course first-order and last-order computing method and equipment
CN107291836B (en) * 2017-05-31 2020-06-02 北京大学 Chinese text abstract obtaining method based on semantic relevancy model
CN107291895B (en) * 2017-06-21 2020-05-26 浙江大学 Quick hierarchical document query method
CN107562718B (en) * 2017-07-24 2020-12-22 科大讯飞股份有限公司 Text normalization method and device, storage medium and electronic equipment
CN107463658B (en) * 2017-07-31 2020-03-31 广州市香港科大霍英东研究院 Text classification method and device
CN107766419B (en) * 2017-09-08 2021-08-31 广州汪汪信息技术有限公司 Threshold denoising-based TextRank document summarization method and device
CN108182621A (en) * 2017-12-07 2018-06-19 合肥美的智能科技有限公司 The Method of Commodity Recommendation and device for recommending the commodity, equipment and storage medium
CN108304445B (en) * 2017-12-07 2021-08-03 新华网股份有限公司 Text abstract generation method and device
CN108182247A (en) * 2017-12-28 2018-06-19 东软集团股份有限公司 Text summarization method and apparatus
CN108090049B (en) * 2018-01-17 2021-02-05 山东工商学院 Multi-document abstract automatic extraction method and system based on sentence vectors
CN110609997B (en) * 2018-06-15 2023-05-23 北京百度网讯科技有限公司 Method and device for generating abstract of text
CN110891074A (en) * 2018-08-06 2020-03-17 珠海格力电器股份有限公司 Information pushing method and device
CN109408802A (en) * 2018-08-28 2019-03-01 厦门快商通信息技术有限公司 A kind of method, system and storage medium promoting sentence vector semanteme
CN109522403B (en) * 2018-11-05 2023-04-21 中山大学 Abstract text generation method based on fusion coding
CN109657051A (en) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 Text snippet generation method, device, computer equipment and storage medium
CN109684642B (en) * 2018-12-26 2023-01-13 重庆电信系统集成有限公司 Abstract extraction method combining page parsing rule and NLP text vectorization
CN109902284A (en) * 2018-12-30 2019-06-18 中国科学院软件研究所 A kind of unsupervised argument extracting method excavated based on debate
CN110083828A (en) * 2019-03-29 2019-08-02 珠海远光移动互联科技有限公司 A kind of Text Clustering Method and device
CN110096705B (en) * 2019-04-29 2023-09-08 扬州大学 Unsupervised English sentence automatic simplification algorithm
CN110032741B (en) * 2019-05-06 2020-02-04 重庆理工大学 Pseudo text generation method based on semantic extension and maximum edge correlation
CN112036165A (en) * 2019-05-14 2020-12-04 西交利物浦大学 Method for constructing news characteristic vector and application
CN110232109A (en) * 2019-05-17 2019-09-13 深圳市兴海物联科技有限公司 A kind of Internet public opinion analysis method and system
CN110287309B (en) * 2019-06-21 2022-04-22 深圳大学 Method for quickly extracting text abstract
CN110362674B (en) * 2019-07-18 2020-08-04 中国搜索信息科技股份有限公司 Microblog news abstract extraction type generation method based on convolutional neural network
CN110737768B (en) * 2019-10-16 2022-04-08 信雅达科技股份有限公司 Text abstract automatic generation method and device based on deep learning and storage medium
CN111125349A (en) * 2019-12-17 2020-05-08 辽宁大学 Graph model text abstract generation method based on word frequency and semantics
CN111090731A (en) * 2019-12-20 2020-05-01 山大地纬软件股份有限公司 Electric power public opinion abstract extraction optimization method and system based on topic clustering
US11263388B2 (en) 2020-02-17 2022-03-01 Wipro Limited Method and system for dynamically generating summarised content for visual and contextual text data
CN111339754B (en) * 2020-03-04 2022-06-21 昆明理工大学 Case public opinion abstract generation method based on case element sentence association graph convolution
CN111460117B (en) * 2020-03-20 2024-03-08 平安科技(深圳)有限公司 Method and device for generating intent corpus of conversation robot, medium and electronic equipment
CN111625621B (en) * 2020-04-27 2023-05-09 中国铁道科学研究院集团有限公司电子计算技术研究所 Document retrieval method and device, electronic equipment and storage medium
CN111651562B (en) * 2020-06-05 2023-03-21 东北电力大学 Scientific and technological literature content deep revealing method based on content map
CN111897925B (en) * 2020-08-04 2022-08-26 广西财经学院 Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning
CN112347241A (en) * 2020-11-10 2021-02-09 华夏幸福产业投资有限公司 Abstract extraction method, device, equipment and storage medium
CN112560496B (en) * 2020-12-09 2024-02-02 北京百度网讯科技有限公司 Training method and device of semantic analysis model, electronic equipment and storage medium
CN113157914B (en) * 2021-02-04 2022-06-14 福州大学 Document abstract extraction method and system based on multilayer recurrent neural network
CN112711662A (en) * 2021-03-29 2021-04-27 贝壳找房(北京)科技有限公司 Text acquisition method and device, readable storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622411A (en) * 2012-02-17 2012-08-01 清华大学 Structured abstract generating method
CN103605702A (en) * 2013-11-08 2014-02-26 北京邮电大学 Word similarity based network text classification method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7480893B2 (en) * 2002-10-04 2009-01-20 Siemens Corporate Research, Inc. Rule-based system and method for checking compliance of architectural analysis and design models

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622411A (en) * 2012-02-17 2012-08-01 清华大学 Structured abstract generating method
CN103605702A (en) * 2013-11-08 2014-02-26 北京邮电大学 Word similarity based network text classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Algorithm for generating biomedical abstracts using semantic relation extraction; Shang Yue et al.; Journal of Frontiers of Computer Science and Technology; 2011-12-31; Vol. 5, No. 11; pp. 1027–1036 *

Also Published As

Publication number Publication date
CN104834735A (en) 2015-08-12

Similar Documents

Publication Publication Date Title
CN104834735B (en) A word-vector-based automatic document summarization method
Bollacker et al. CiteSeer: An autonomous web agent for automatic retrieval and identification of interesting publications
Sambasivam et al. Advanced data clustering methods of mining Web documents.
KR20060122276A (en) Relation extraction from documents for the automatic construction of ontologies
Biancalana et al. Social tagging in query expansion: A new way for personalized web search
Odeh et al. Arabic text categorization algorithm using vector evaluation method
CN109597995A (en) A kind of document representation method based on BM25 weighted combination term vector
CN104765779A (en) Patent document inquiry extension method based on YAGO2s
CN110851593A (en) Complex value word vector construction method based on position and semantics
Mao et al. Automatic keywords extraction based on co-occurrence and semantic relationships between words
Ghanem et al. Stemming effectiveness in clustering of Arabic documents
Xu et al. Improving pseudo-relevance feedback with neural network-based word representations
Gamal et al. Hybrid Algorithm Based on Chicken Swarm Optimization and Genetic Algorithm for Text Summarization.
Khotimah et al. Indonesian News Articles Summarization Using Genetic Algorithm.
KR102280494B1 (en) Method for providing internet search service sorted by correlation based priority specialized in professional areas
Nakashole Automatic extraction of facts, relations, and entities for web-scale knowledge base population
Asa et al. A comprehensive survey on extractive text summarization techniques
Pita et al. Strategies for short text representation in the word vector space
Heidary et al. Automatic text summarization using genetic algorithm and repetitive patterns
Heidary et al. Automatic Persian text summarization using linguistic features from text structure analysis
Kanwal et al. Adaptively intelligent meta-search engine with minimum edit distance
CN114580557A (en) Document similarity determination method and device based on semantic analysis
KR102198780B1 (en) Method for providing correlation based internet search service specialized in professional areas
Maria et al. A new model for Arabic multi-document text summarization
Munirsyah et al. Development synonym set for the English wordnet using the method of comutative and agglomerative clustering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant