CN104834735B - A word-vector-based document summary extraction method - Google Patents
A word-vector-based document summary extraction method
- Publication number
- CN104834735B CN104834735B CN201510254719.2A CN201510254719A CN104834735B CN 104834735 B CN104834735 B CN 104834735B CN 201510254719 A CN201510254719 A CN 201510254719A CN 104834735 B CN104834735 B CN 104834735B
- Authority
- CN
- China
- Prior art keywords
- sentence
- word vector
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A word-vector-based document summary extraction method comprises the following steps: S1, train a corpus with a deep neural network model to obtain word-vector representations of feature words; S2, build a sentence graph model; S3, compute sentence weights; S4, generate the summary with the maximal marginal relevance algorithm. The invention collects a corpus and pre-processes it to obtain a training feature corpus, which is then trained with a deep neural network model to obtain the word vectors of the feature words. A candidate document set and a candidate sentence set are retrieved from the corpus with a preset query word, and the semantic similarity between sentences is computed from the word vectors of the feature words. This captures the semantic relation between two sentences and avoids the errors that traditional co-occurrence-based methods make when different words share the same meaning, improving both the accuracy of the similarity computation and the quality of the summary.
Description
Technical field
The present invention relates to the fields of computer information retrieval and text mining, and in particular to a word-vector-based document summary extraction method.
Background art
Text summarization is an important branch of text mining. It finds the most important information in a document or document set and presents it as a concise, coherent short text. With the progress of science and the development of network technology, a massive amount of information is available online; faced with such volumes of data, summarization helps users quickly grasp the information they need, saves reading time, and improves work efficiency.
Current text summarization is mainly extractive: the most important sentences are extracted from the original text to form the summary. Generation proceeds in three steps: sentence representation, sentence scoring, and summary generation. Concretely, a sentence is first represented in some form, such as a vector of feature-word frequencies, TF*IDF values, or topic words. Once the representation is fixed, a sentence score is computed with a conventional ranking method such as BM25 or PageRank to reflect the sentence's importance, and finally the higher-scoring sentences are added to the summary with a redundancy-removal method. Text summarization now has a research history of some fifty years, and with the rapid development of information retrieval technology it has grown increasingly mature. From the early methods based on word frequency and TF*IDF, through the introduction of machine learning, to pattern-based representations, summarization performance has improved greatly.
Statistical methods based on word frequency or TF*IDF assume that the more high-frequency or high-TF*IDF words a sentence contains, the more important it is, and hence the more likely it is to be added to the final summary. Concretely, the candidate corpus is first pre-processed (stop-word removal, stemming, etc.), and the frequency or TF*IDF of each feature word in the corpus is counted. For every sentence of the candidate document set, an importance score is computed; the simplest practical choice is the average feature-word probability, i.e. the sum of the probabilities of the feature words in the sentence divided by the sentence length. Finally the sentences are ranked and the highest-scoring ones are added to the generated summary. Because such methods are easy to compute and implement, they are widely used as baselines, but they are biased toward high-frequency words, the generated summary tends to cover only the dominant topics of the candidate set, and they lack semantic understanding, so summary quality is limited.
In recent years, with the spread and improvement of machine learning, more and more researchers have applied learning methods in their experiments, and summarization is no exception. One approach uses supervised learning, casting summarization as binary classification: each candidate sentence either is or is not added to the final summary. A classifier such as logistic regression, naive Bayes, or an SVM is trained on a labeled training set to obtain an optimal weight vector, which is then used to classify the test set. Another approach represents each sentence with features such as sentence position, word frequency, and cue words, trains a learning-to-rank model on the training set to obtain an optimal feature weight vector, and scores the candidate sentences of the test set with it. A third approach treats summarization as clustering: the sentences of the candidate documents are clustered, the sentences within each cluster are ranked with the statistical or ranking methods described above, and the top n sentences of each cluster form the summary. Many more machine learning approaches to automatic summarization exist beyond those listed here. Although learning methods keep improving in the summarization field, in the general multi-document news summarization setting they do not outperform unsupervised methods and are better suited to specialized domains or particular summary types. Moreover, machine learning usually means supervised models, which require labeled data; labeling is generally done manually, is very time-consuming, and is subjective, so machine learning methods still need further refinement.
Graph-based summarization has attracted wide attention because it is unsupervised, considers the document globally, needs neither domain knowledge nor syntactic or semantic analysis, and performs well. It treats each sentence as a node of a graph and the similarity between two sentences as the weight of the edge linking them, iteratively computes node weights with methods such as PageRank or HITS, and finally adds the highest-weighted sentences to the summary. The values of the sentence similarity matrix represent the probability of jumping from one sentence to another, so computing the node weights accurately is critical. Traditional graph methods, however, mostly measure inter-sentence similarity by the co-occurrence of feature words, ignoring semantic similarity between sentences; this reduces the accuracy of the node weight computation and hurts summary quality.
Summary of the invention
The object of the present invention is to provide a word-vector-based document summary extraction method that avoids the errors traditional co-occurrence-based methods make when computing sentence similarity, and extracts more accurate and more readable document summaries for the user.
The technical scheme adopted by the present invention to solve the problems of the prior art is a word-vector-based document summary extraction method comprising the following steps:
S1, train a corpus with a deep neural network model to obtain the word-vector representations of the feature words: collect a corpus of documents from a database and pre-process it; the pre-processing splits each text into sentences and, sentence by sentence, removes stop words (against a stop-word list), special characters, and punctuation, yielding the training feature corpus. Training parameters are set, the training feature corpus is used as training data, and a deep neural network model is trained; each word of the training feature corpus is treated as a feature word and output in word-vector form by the Skip-gram model, giving the word-vector representation of each feature word;
S2, build the sentence graph model, comprising the following steps:
A1, pre-processing: retrieve the corpus collected in step S1 with a preset query word; the retrieved documents form the candidate document set. Split the candidate documents into sentences and remove duplicated sentences, giving the candidate sentence set of the summary;
A2, model construction: take every sentence of the candidate sentence set as a node of the graph model and assign it the average initial weight

Weight(S_i) = 1/N

where S_i is any sentence of the candidate sentence set S and N is the total number of sentences. Using the word vectors of the feature words obtained in step S1, compute the semantic similarity between sentences as the weight of the edges of the graph, forming the sentence graph model;
For any two sentences S_i and S_j of the candidate sentence set, containing feature words t_i and t_j with word vectors \vec{t}_i and \vec{t}_j, the semantic similarity Similarity(S_i, S_j) between S_i and S_j is:

Similarity(S_i, S_j) = ( Σ_{t_i ∈ S_i} Sim_m(\vec{t}_i, S_j) + Σ_{t_j ∈ S_j} Sim_m(\vec{t}_j, S_i) ) / ( |S_i| + |S_j| )

where, for the word vector \vec{t}_i of feature word t_i in sentence S_i, Sim_m(\vec{t}_i, S_j) denotes the maximum similarity between \vec{t}_i and the word vectors of all feature words of S_j that have the same part of speech as t_i; |S_i| and |S_j| denote the lengths of S_i and S_j.
The similarity between the word vectors of two feature words is obtained by the formula:

cos(\vec{t}_1, \vec{t}_2) = (\vec{t}_1 · \vec{t}_2) / (||\vec{t}_1|| · ||\vec{t}_2||)

where \vec{t}_1 and \vec{t}_2 are the word vectors of feature words t_1 and t_2 obtained by training the deep neural network model of step S1.
S3, compute sentence weights: for the graph model of step S2, iteratively update the weight of each node from the average initial weights and the inter-sentence semantic similarities of step S2 with the following formula, until convergence:

Weight(S_i) = (1 - d) + d · Σ_{S_j ∈ Connection(S_i)} ( Similarity(S_i, S_j) / ||Connection(S_j)|| ) · Weight(S_j)

where d is the damping factor with value range 0-1, Connection(S_i) is the set of sentences whose similarity to S_i is greater than 0, and ||Connection(S_i)|| is the number of sentences in that set;
S4, generate the summary with the maximal marginal relevance algorithm: select the highest-weighted, non-redundant sentences with the maximal marginal relevance algorithm to form the summary, with the concrete steps:
b1), create an empty summary sentence set; take the sentences corresponding to the nodes of the graph model as the initial candidate summary sentence set;
b2), sort the candidate summary sentences in descending order of node weight, giving the candidate summary sentence sequence;
b3), move the first sentence of the candidate summary sentence sequence into the summary sentence set, and update the weight of every remaining candidate sentence with:

Weight(S_j) = Weight(S_j) - ω × Similarity(S_i, S_j)

where i ≠ j, ω is the penalty factor, and Similarity(S_i, S_j) is the sentence semantic similarity obtained in step S2;
b4), repeat steps b2) and b3) until the sentences of the summary sentence set reach the preset summary length.
When the sentence whose weight is being updated has similarity with a sentence of the summary sentence set, the penalty factor ω is 1.0.
The deep neural network model is the Skip-gram model, trained with the hierarchical softmax method.
The damping factor d of step S3 is 0.85.
The preset summary length is 150 words.
Beneficial effects of the present invention: the invention collects a corpus and pre-processes it to obtain the training feature corpus; the training feature corpus is trained with a deep neural network model to obtain the word vectors of the feature words. A candidate document set and a candidate sentence set are retrieved from the corpus with a preset query word, and the semantic similarity between sentences is computed from the word vectors of the feature words, capturing the semantic relation between two sentences. This avoids the errors that traditional co-occurrence-based methods make when different words share the same meaning, and thus improves the accuracy of the similarity computation and the quality of the summary. On a stand-alone machine (single-core 3.0 GHz CPU, 4 GB memory, likewise below), the training corpus for the word vectors of the feature words is 1.2 GB, and the resulting trained model occupies 148,420 KB.
Brief description of the drawings
Fig. 1 is the logic diagram of the present invention.
Fig. 2 is the result obtained after steps S1-S3 of the embodiment of the present invention.
Fig. 3 is the final result of the embodiment of the present invention.
Detailed description of the embodiments
The present invention is described below with reference to the drawings and a specific embodiment.
Fig. 1 is the logic diagram of the word-vector-based document summary extraction method of the present invention. The method comprises the following steps:
S1, train a corpus with a deep neural network model to obtain the word-vector representations of the feature words: collect a corpus of documents from a database and pre-process it; the pre-processing splits each text into sentences and, sentence by sentence, removes stop words (against a stop-word list), special characters, punctuation, etc., yielding the training feature corpus. Training parameters are set, the training feature corpus is used as training data, and the deep neural network model Skip-gram is trained with the hierarchical softmax method; each word of the training feature corpus is treated as a feature word and output in word-vector form by the Skip-gram model, giving the word-vector representation of each feature word.
Specifically, to train word-vector representations of feature words from large amounts of unstructured text data, the present invention mainly adopts the Skip-gram model. Compared with other methods based on neural network architectures, the model involves no large matrix multiplications and is therefore very efficient. The Skip-gram model uses the word vector of the current word to predict the word vectors of the context within a specified window. Given the feature corpus w_1, w_2, w_3, …, w_T as training data, the Skip-gram objective function is

(1/T) Σ_{t=1}^{T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

where c is the parameter determining the context window size; the larger c is, the more training data and training time are required, but the higher the achievable accuracy.
The basic Skip-gram model defines p(w_O | w_I) as:

p(w_O | w_I) = exp(v'_{w_O}ᵀ v_{w_I}) / Σ_{w=1}^{W} exp(v'_wᵀ v_{w_I})

where v_w and v'_w are the "input" and "output" vector representations of w, and W is the number of words in the vocabulary. The cost of computing the gradient ∇ log p(w_O | w_I) is proportional to W, which is usually very large (10^5 to 10^7), so other, approximate formulas are generally used instead.
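The cost just described can be made concrete: a naive evaluation of p(w_O | w_I) performs one dot product per vocabulary word. The following sketch (with toy vocabulary size and dimensionality, not the embodiment's settings) illustrates the full-softmax computation:

```python
import numpy as np

rng = np.random.default_rng(0)
W, dim = 1000, 8                     # toy vocabulary size and vector dimension
v_in = rng.normal(size=(W, dim))     # "input" vectors v_w
v_out = rng.normal(size=(W, dim))    # "output" vectors v'_w

def p_full_softmax(w_o, w_i):
    """p(w_O | w_I) with the full softmax: O(W) dot products per query."""
    logits = v_out @ v_in[w_i]       # one dot product per vocabulary word
    logits -= logits.max()           # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return probs[w_o]

# Evaluating the distribution over all W output words touches v_out entirely.
probs = np.array([p_full_softmax(w, 3) for w in range(W)])
```

Hierarchical softmax replaces this O(W) normalization with a walk down a Huffman tree of depth about log W, which is the speedup the next paragraph describes.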
The present invention trains the deep neural network model Skip-gram with the hierarchical softmax algorithm. The algorithm uses a binary Huffman tree representation, with the W words of the output layer as leaves; shorter paths are assigned to high-frequency words, which speeds up training. Each feature word w can be reached from the root of the tree along a unique path. Let n(w, j) be the j-th node on the path from the root to w, and L(w) the length of that path, so that n(w, 1) = root and n(w, L(w)) = w. For any inner node n, let ch(n) be a fixed child of n. Hierarchical softmax then defines p(w_O | w_I) as:

p(w_O | w_I) = Π_{j=1}^{L(w_O)-1} σ( [[n(w_O, j+1) = ch(n(w_O, j))]] · v'_{n(w_O,j)}ᵀ v_{w_I} )

where σ(x) = 1/(1 + e^{-x}) and [[x]] is 1 if x is true and -1 otherwise. The cost of computing p(w_O | w_I) and its gradient is proportional to L(w_O), which on average is no greater than log W.
With the objective so defined, stochastic gradient descent is used to optimize it, finally producing the word-vector representation of each word.
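As a minimal, self-contained sketch of Skip-gram training with stochastic gradient descent (for brevity it optimizes the full-softmax objective rather than hierarchical softmax, and the corpus, dimensionality, and learning rate are toy assumptions; a production run would use the embodiment's settings of 200 dimensions, window 5, and a frequency threshold of 3):

```python
import numpy as np

rng = np.random.default_rng(1)
corpus = "hiv therapy reduces hiv transmission hiv therapy works".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
W, dim, c, lr = len(vocab), 4, 2, 0.05   # vocab size, dims, window, step size

v_in = rng.normal(0, 0.1, (W, dim))      # "input" vectors v_w
v_out = rng.normal(0, 0.1, (W, dim))     # "output" vectors v'_w

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pairs():
    """(center, context) index pairs within the window of size c."""
    for t in range(len(corpus)):
        for j in range(max(0, t - c), min(len(corpus), t + c + 1)):
            if j != t:
                yield idx[corpus[t]], idx[corpus[j]]

def avg_log_prob():
    """The Skip-gram objective: mean of log p(w_{t+j} | w_t)."""
    lps = [np.log(softmax(v_out @ v_in[i])[o]) for i, o in pairs()]
    return sum(lps) / len(lps)

before = avg_log_prob()
for _ in range(200):                     # SGD on -log p(w_O | w_I)
    for i, o in pairs():
        grad = softmax(v_out @ v_in[i])
        grad[o] -= 1.0                   # d(-log p)/d(logits) = p - onehot(o)
        g_in = v_out.T @ grad            # gradient w.r.t. v_in[i]
        v_out -= lr * np.outer(grad, v_in[i])
        v_in[i] -= lr * g_in
after = avg_log_prob()
```

Training raises the average log probability of observed (center, context) pairs, which is exactly what the objective function above maximizes.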
S2, build the sentence graph model, comprising the following steps:
A1, pre-processing: retrieve the corpus collected in step S1 with a preset query word; the retrieved documents form the candidate document set. Split the candidate documents into sentences and remove duplicated sentences, giving the candidate sentence set of the summary;
A2, model construction: take every sentence of the candidate sentence set as a node of the graph model and assign it the average initial weight

Weight(S_i) = 1/N

where S_i is any sentence of the candidate sentence set S and N is the total number of sentences. Using the word vectors of the feature words obtained in step S1, compute the semantic similarity between sentences as the weight of the edges of the graph, forming the sentence graph model.
For any two sentences S_i and S_j of the candidate sentence set, containing feature words t_i and t_j with word vectors \vec{t}_i and \vec{t}_j, the semantic similarity Similarity(S_i, S_j) between S_i and S_j is:

Similarity(S_i, S_j) = ( Σ_{t_i ∈ S_i} Sim_m(\vec{t}_i, S_j) + Σ_{t_j ∈ S_j} Sim_m(\vec{t}_j, S_i) ) / ( |S_i| + |S_j| )

where, for the word vector \vec{t}_i of feature word t_i in sentence S_i, Sim_m(\vec{t}_i, S_j) denotes the maximum similarity between \vec{t}_i and the word vectors of all feature words of S_j that have the same part of speech as t_i; |S_i| and |S_j| denote the lengths of S_i and S_j.
The similarity value between the word vectors of two feature words is obtained by the formula:

cos(\vec{t}_1, \vec{t}_2) = (\vec{t}_1 · \vec{t}_2) / (||\vec{t}_1|| · ||\vec{t}_2||)

where \vec{t}_1 and \vec{t}_2 are the word vectors of feature words t_1 and t_2 obtained by training the deep neural network model of step S1.
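A sketch of this similarity measure follows; the part-of-speech tags, toy two-dimensional vectors, and the `vectors` lookup are hypothetical stand-ins for the trained model's 200-dimensional output:

```python
import numpy as np

def cos(t1, t2):
    """Cosine similarity between two word vectors."""
    return float(np.dot(t1, t2) / (np.linalg.norm(t1) * np.linalg.norm(t2)))

def sim_m(t_vec, pos, sent, vectors):
    """Max similarity of t_vec to same-part-of-speech feature words of sent."""
    sims = [cos(t_vec, vectors[w]) for w, p in sent if p == pos]
    return max(sims, default=0.0)

def sentence_similarity(si, sj, vectors):
    """Similarity(S_i, S_j): sentences are lists of (word, pos) pairs."""
    fwd = sum(sim_m(vectors[w], p, sj, vectors) for w, p in si)
    bwd = sum(sim_m(vectors[w], p, si, vectors) for w, p in sj)
    return (fwd + bwd) / (len(si) + len(sj))

# Toy vectors standing in for the trained model's feature-word vectors.
vectors = {"hiv": np.array([1.0, 0.0]), "aids": np.array([0.9, 0.1]),
           "spreads": np.array([0.0, 1.0]), "transmits": np.array([0.1, 1.0])}
s1 = [("hiv", "NN"), ("spreads", "VB")]
s2 = [("aids", "NN"), ("transmits", "VB")]
sim = sentence_similarity(s1, s2, vectors)
```

The two sentences share no word, so a co-occurrence measure would score them 0, yet the word vectors give them a similarity close to 1; this is the error the invention avoids.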
S3, compute sentence weights: for the graph model of step S2, iteratively update the weight of each node from the average initial weights and the inter-sentence semantic similarities of step S2 with the improved PageRank formula below, until convergence, obtaining a score that reflects the importance of each sentence.
Because differences in inter-sentence similarity produce differences in edge weight between nodes, and because similarity is symmetric, an improved PageRank formula is used here. The original PageRank formula is based on the random-surfer idea and measures the importance of a web page from the links between pages: the weight of a link target is determined by the quality and out-link count of the link sources, with the formula

PR(S_i) = (1 - d) + d · Σ_{S_j ∈ In(S_i)} PR(S_j) / ||Out(S_j)||

where d is the damping factor, usually 0.85, In(S_i) is the set of pages linking to S_i, and ||Out(S_j)|| is the number of out-links of page S_j.
The present invention applies the PageRank idea to the sentence graph model to obtain the final sentence weights. To better handle the differing edge weights induced by the inter-sentence similarities, and the symmetry of similarity, the invention improves the existing PageRank formula into the following form:

Weight(S_i) = (1 - d) + d · Σ_{S_j ∈ Connection(S_i)} ( Similarity(S_i, S_j) / ||Connection(S_j)|| ) · Weight(S_j)

where d is still set to 0.85, Connection(S_i) is the set of sentences connected to S_i, i.e. the sentences whose similarity to S_i is greater than 0, and ||Connection(S_i)|| is the number of sentences in that set.
Starting from the average initial node weights of step S2 and the similarity matrix formed by the inter-sentence semantic similarities, the weight of each node, i.e. each sentence weight, is iterated with the improved PageRank formula until convergence. Each node finally obtains a score reflecting its importance, ready for the summary generation of the next step.
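The iteration can be sketched as a simple fixed-point loop over a toy similarity matrix (the normalization by the size of Connection(S_j) reflects one plausible reading of the improved formula; convergence tolerance and matrix values are illustrative assumptions):

```python
import numpy as np

def sentence_weights(sim, d=0.85, tol=1e-8, max_iter=200):
    """Iterate the similarity-weighted PageRank update until convergence."""
    n = len(sim)
    weight = np.full(n, 1.0 / n)   # average initial weight 1/N
    # Connection(S_j): sentences (other than S_j) with similarity > 0.
    conn = [np.nonzero((sim[j] > 0) & (np.arange(n) != j))[0]
            for j in range(n)]
    for _ in range(max_iter):
        new = np.empty(n)
        for i in range(n):
            acc = sum(sim[i, j] / len(conn[j]) * weight[j] for j in conn[i])
            new[i] = (1 - d) + d * acc
        done = np.abs(new - weight).max() < tol
        weight = new
        if done:
            break
    return weight

sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
weights = sentence_weights(sim)
```

The two mutually similar sentences reinforce each other and end up with higher weights than the outlier, which is the intended "important sentences vote for each other" behavior.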
Because the sentences are similar to one another, directly adding the K highest-weighted sentences to the summary would introduce considerable redundancy. To reduce the redundancy rate of the summary, the present invention uses the maximal marginal relevance algorithm, whose basic idea is: if a sentence is highly similar to a sentence already in the summary, that sentence is penalized. Hence the following step S4:
S4, generate the summary with the maximal marginal relevance algorithm: select the highest-weighted, non-redundant sentences to form the summary, with the concrete steps:
b1), create an empty summary sentence set as the initial summary sentence set; take the sentences corresponding to the nodes of the graph model as the initial candidate summary sentence set;
b2), sort the candidate summary sentences in descending order of the node weights of step S3, giving the candidate summary sentence sequence;
b3), move the first sentence of the candidate summary sentence sequence into the summary sentence set, and update the weight of every remaining candidate sentence with:

Weight(S_j) = Weight(S_j) - ω × Similarity(S_i, S_j)

where i ≠ j and ω is the penalty factor; when the sentence whose weight is being updated has similarity with a sentence of the summary sentence set, ω is 1.0. Similarity(S_i, S_j) is the sentence semantic similarity obtained in step S2;
b4), repeat steps b2) and b3) until the sentences of the summary sentence set reach the preset summary length.
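Steps b1)-b4) can be sketched as the following greedy loop; the word-budget check, toy sentences, and similarity function are simplified assumptions, not the patent's data:

```python
def mmr_select(sentences, weights, similarity, max_words=150, omega=1.0):
    """Greedy maximal-marginal-relevance selection (steps b1-b4)."""
    candidates = list(range(len(sentences)))          # b1: candidate set
    weights = dict(zip(candidates, weights))
    summary, used = [], 0
    while candidates and used < max_words:
        candidates.sort(key=lambda j: weights[j], reverse=True)  # b2
        i = candidates.pop(0)                         # b3: take the top sentence
        summary.append(sentences[i])
        used += len(sentences[i].split())
        for j in candidates:                          # b3: penalize the rest
            weights[j] -= omega * similarity(i, j)
    return summary

sents = ["hiv infection rates fell sharply",
         "infection rates for hiv fell",
         "new vaccine trials began"]
w = [0.9, 0.8, 0.5]
sim = lambda i, j: 0.9 if {i, j} == {0, 1} else 0.1
chosen = mmr_select(sents, w, sim, max_words=9)
```

Although the second sentence initially outranks the third, its near-duplicate overlap with the first sentence pushes it below the third after the penalty, so the summary covers two distinct topics instead of one topic twice.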
Embodiment:
To make the purpose, technical scheme, and beneficial effects of the present invention clearer and easier to implement, the invention is described in further detail below with reference to a specific embodiment and to the drawings. This embodiment presets the length of the generated summary to 150 words.
S1, train a corpus with a deep neural network model to obtain the word-vector representations of the feature words:
To obtain the vector representations of the feature words, the embodiment collects its experimental corpus from MEDLINE, the biomedical bibliographic database maintained by the United States National Library of Medicine. Specifically, all citations of 2011-2012 on MEDLINE are queried and taken as the corpus, and the sentences of the citations are pre-processed, i.e. stop words, special characters, punctuation, etc. are removed against a stop-word list, finally yielding a 1.2 GB training corpus.
In the training process of this embodiment, the word-vector dimensionality of the feature words is set to 200, the Skip-gram model is trained with hierarchical softmax, only feature words with frequency greater than 3 are considered, and the window size is set to 5.
S2, build the sentence graph model, comprising the following steps:
A1: the embodiment sets "HIV Infection" as the query keyword and retrieves all related citations on MEDLINE, obtaining the candidate documents for the query. The candidate documents are split into sentences and duplicated sentences are removed, yielding the candidate sentence set of the summary, which finally contains 4,581 sentences.
A2, model construction: every sentence of the set is taken as a node of the graph model, and each node is assigned the average initial weight Weight(S_i) = 1/N, i.e. 1/4581. Using the word vectors of the feature words trained in step S1, the inter-sentence semantic similarities are computed with the similarity and cosine formulas of step S2 and taken as the weights of the edges of the graph, generating the sentence graph model.
S3, compute sentence weights:
For the above graph model, the weight of each node is iterated with the improved PageRank formula until convergence.
Fig. 2 shows the result of applying the above three steps and sorting the candidate sentences by weight in descending order, with the top K sentences forming a summary of the specified length for the disease "HIV Infection". The procedure continues as follows:
S4, generate the summary with the maximal marginal relevance algorithm:
The sentences weighted in the previous step are sorted in descending order; to eliminate redundancy in the summary, the maximal marginal relevance algorithm penalizes sentences that are similar to sentences already in the summary, and the K highest-weighted sentences are selected to form the summary. The concrete steps are:
b1), create an empty summary sentence set as the initial summary sentence set; take the sentences corresponding to the nodes of the graph model as the initial candidate summary sentence set;
b2), sort the candidate summary sentences in descending order of node weight, giving the candidate summary sentence sequence;
b3), move the first sentence of the candidate summary sentence sequence into the summary sentence set, and update the weight of every remaining candidate sentence with:

Weight(S_j) = Weight(S_j) - ω × Similarity(S_i, S_j)

where i ≠ j and ω is the penalty factor; when the sentence whose weight is being updated has similarity with a sentence of the summary sentence set, ω is 1.0. Similarity(S_i, S_j) is the sentence semantic similarity obtained in step S2;
b4), repeat steps b2) and b3) until the sentences of the summary sentence set reach the preset summary length.
Fig. 3 shows the result of applying the above four steps, ranking the candidate sentence set and removing redundancy, finally generating the "HIV Infection" disease summary of the specified length.
Comparing the summary results of Fig. 2 and Fig. 3, the summary before redundancy removal consists mostly of short sentences, contains many repeated words, and includes sentences that are semantically similar to one another. The summary after redundancy removal retains the important information while covering more semantic aspects and carrying more information, so its overall structure is better.
The above embodiment describes and illustrates the method of the present invention. The method trains the word vectors of the feature words with a deep neural network algorithm and thereby computes accurate inter-sentence similarities; sentence weights are iteratively updated with the PageRank idea; and the maximal marginal relevance algorithm eliminates information redundancy in the summary. This improves the quality of the system-generated summary and better meets the user's information needs.
The above content is a further detailed description of the present invention in combination with a specific preferred technical scheme, and the specific implementation of the invention should not be regarded as confined to this description. For ordinary technical personnel in the technical field of the invention, simple deductions or substitutions made without departing from the concept of the invention should all be regarded as falling within the protection scope of the present invention.
Claims (5)
1. A word-vector-based document summary extraction method, characterized by comprising the following steps:
S1, train a corpus with a deep neural network model to obtain the word-vector representations of the feature words: collect a corpus of documents from a database and pre-process it; the pre-processing splits each text into sentences and, sentence by sentence, removes stop words (against a stop-word list), special characters, and punctuation, yielding the training feature corpus; set the training parameters, use the training feature corpus as training data, and train a deep neural network model; each word of the training feature corpus is treated as a feature word and output in word-vector form by the Skip-gram model, giving the word-vector representation of each feature word;
S2, build the sentence graph model, comprising the following steps:
A1, pre-processing: retrieve the corpus collected in step S1 with a preset query word; the retrieved documents form the candidate document set; split the candidate documents into sentences and remove duplicated sentences, giving the candidate sentence set of the summary;
A2, model construction: take every sentence of the candidate sentence set as a node of the graph model and assign it the average initial weight:
Weight(S_i) = 1/N
where S_i is any sentence of the candidate sentence set S and N is the total number of sentences; using the word vectors of the feature words obtained in step S1, compute the semantic similarity between sentences as the weight of the edges of the graph, forming the sentence graph model;
for any two sentences S_i and S_j of the candidate sentence set, containing feature words t_i and t_j with word vectors \vec{t}_i and \vec{t}_j, the semantic similarity Similarity(S_i, S_j) between S_i and S_j is:
Similarity(S_i, S_j) = ( Σ_{t_i ∈ S_i} Sim_m(\vec{t}_i, S_j) + Σ_{t_j ∈ S_j} Sim_m(\vec{t}_j, S_i) ) / ( |S_i| + |S_j| )
where, for the word vector t⃗i of feature word ti in sentence Si, Sim_m(t⃗i, Sj) denotes the maximum similarity value between t⃗i and the word vectors of all feature words in sentence Sj that have the same part of speech as ti; |Si| and |Sj| denote the lengths of Si and Sj, respectively. The similarity value between the word vectors of feature words is obtained by the following formula:
cos(t⃗1, t⃗2) = ( t⃗1 · t⃗2 ) / ( ||t⃗1|| × ||t⃗2|| )
where t⃗1 and t⃗2 are the feature word vectors of the two feature words t1 and t2, obtained by training the deep neural network model of step S1;
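A minimal sketch of this sentence similarity, under simplifying assumptions: word vectors are plain Python lists, and the part-of-speech restriction on Sim_m is omitted (here Sim_m takes the maximum cosine over all word vectors of the other sentence).

```python
import math

def cosine(t1, t2):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(t1, t2))
    norm1 = math.sqrt(sum(a * a for a in t1))
    norm2 = math.sqrt(sum(b * b for b in t2))
    return dot / (norm1 * norm2)

def sim_m(vec, sentence_vecs):
    # Maximum similarity between one word vector and the word vectors
    # of the other sentence (POS matching omitted in this sketch).
    return max(cosine(vec, v) for v in sentence_vecs)

def sentence_similarity(s_i, s_j):
    # Similarity(Si,Sj) = ( sum_{ti in Si} Sim_m(ti,Sj)
    #                     + sum_{tj in Sj} Sim_m(tj,Si) ) / (|Si| + |Sj|)
    total = sum(sim_m(v, s_j) for v in s_i) + sum(sim_m(v, s_i) for v in s_j)
    return total / (len(s_i) + len(s_j))

s1 = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-d "word vectors"
s2 = [[1.0, 0.0]]
print(round(sentence_similarity(s1, s2), 4))  # -> 0.6667
```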
S3, calculate sentence weights: for the graph model obtained in step S2, iteratively update the weight of each node using the following formula, based on the average initial weights and the inter-sentence semantic similarities from step S2, until convergence:
Weight(Si) = (1 - d) + d × Σ_{Sj∈Connection(Si)} [ Similarity(Si, Sj) / ||Connection(Sj)|| ] × Weight(Sj)
where d is a damping coefficient with a value in the range 0 to 1, Connection(Si) is the set of sentences whose similarity to sentence Si is greater than 0, and ||Connection(Si)|| is the total number of sentences in that set;
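The step-S3 iteration can be sketched as follows, assuming a precomputed symmetric similarity matrix; d = 0.85 follows claim 4, while the tolerance and iteration cap are illustrative choices not fixed by the patent.

```python
def rank_sentences(sim, d=0.85, tol=1e-6, max_iter=100):
    """Iteratively update node weights until convergence, starting
    from the average initial weight 1/N."""
    n = len(sim)
    weight = [1.0 / n] * n
    # Connection(Si): sentences with similarity > 0 to Si
    conn = [[j for j in range(n) if j != i and sim[i][j] > 0] for i in range(n)]
    for _ in range(max_iter):
        new = []
        for i in range(n):
            s = sum(sim[i][j] / len(conn[j]) * weight[j] for j in conn[i])
            new.append((1 - d) + d * s)
        converged = max(abs(a - b) for a, b in zip(new, weight)) < tol
        weight = new
        if converged:
            break
    return weight

sim = [[0.0, 0.5, 0.2],
       [0.5, 0.0, 0.4],
       [0.2, 0.4, 0.0]]
w = rank_sentences(sim)
print([round(x, 3) for x in w])  # the best-connected sentence ranks highest
```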
S4, generate the summary using the maximal marginal relevance algorithm: select the sentences with maximum weight and without redundancy to compose the summary, in the following concrete steps:
b1), establish an empty summary sentence set, and take the sentences corresponding to the nodes of the graph model as the initial candidate summary sentence set;
b2), sort the sentences in the candidate summary sentence set in descending order of the weights of their graph model nodes; the sorted sentences form the candidate summary sentence sequence;
b3), according to the candidate summary sentence sequence, move the first-ranked sentence into the summary sentence set, and update the weights of the remaining sentences in the candidate summary sentence set using the following formula:
Weight(Sj)=Weight (Sj)-ω×Similarity(Si,Sj)
where i ≠ j, ω is a penalty factor, and Similarity(Si, Sj) is the sentence semantic similarity obtained in step S2;
b4), repeat steps b2) and b3) until the sentences in the summary sentence set reach the preset summary length.
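Steps b1) to b4) can be sketched as follows. For brevity this sketch applies the penalty ω = 1.0 uniformly (claim 2 applies it when the updated sentence is similar to one already in the summary) and counts the summary length in sentences rather than words; sentence weights and the similarity matrix are assumed to come from steps S2 and S3.

```python
def mmr_select(sentences, weights, sim, max_sentences, omega=1.0):
    """Maximal-marginal-relevance selection: repeatedly take the
    highest-weight candidate and penalise the remaining candidates
    by their similarity to it."""
    weights = dict(enumerate(weights))   # mutable copy of node weights
    candidates = set(weights)
    summary = []
    while candidates and len(summary) < max_sentences:
        best = max(candidates, key=lambda i: weights[i])  # b2) top-ranked
        summary.append(sentences[best])                   # b3) move to summary
        candidates.remove(best)
        for j in candidates:
            # Weight(Sj) = Weight(Sj) - omega * Similarity(Si, Sj)
            weights[j] -= omega * sim[best][j]
    return summary

sents = ["s0", "s1", "s2"]
w = [0.9, 0.8, 0.3]
sim = [[0.0, 0.9, 0.0],
       [0.9, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
# s1 is heavily penalised for being redundant with s0, so s2 wins.
print(mmr_select(sents, w, sim, 2))  # -> ['s0', 's2']
```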
2. The word-vector-based document summary extraction method according to claim 1, characterized in that when the sentence whose weight is being updated is similar to a sentence in the summary sentence set, the penalty factor ω is 1.0.
3. The word-vector-based document summary extraction method according to claim 1, characterized in that the deep neural network model is the Skip-gram model, and the Skip-gram model is trained using the hierarchical softmax method.
4. The word-vector-based document summary extraction method according to claim 1, characterized in that the damping coefficient d in step S3 is 0.85.
5. The word-vector-based document summary extraction method according to claim 1, characterized in that the preset summary length is 150 words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510254719.2A CN104834735B (en) | 2015-05-18 | 2015-05-18 | A kind of documentation summary extraction method based on term vector |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104834735A CN104834735A (en) | 2015-08-12 |
CN104834735B true CN104834735B (en) | 2018-01-23 |
Family
ID=53812621
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510254719.2A Active CN104834735B (en) | 2015-05-18 | 2015-05-18 | A kind of documentation summary extraction method based on term vector |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104834735B (en) |
Families Citing this family (52)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105630767B (en) * | 2015-12-22 | 2018-06-15 | 北京奇虎科技有限公司 | The comparative approach and device of a kind of text similarity |
CN105631018B (en) * | 2015-12-29 | 2018-12-18 | 上海交通大学 | Article Feature Extraction Method based on topic model |
CN105653704B (en) * | 2015-12-31 | 2018-10-12 | 南京财经大学 | Autoabstract generation method and device |
CN106021272B (en) * | 2016-04-04 | 2019-11-19 | 上海大学 | The keyword extraction method calculated based on distributed expression term vector |
CN105930314B (en) * | 2016-04-14 | 2019-02-05 | 清华大学 | System and method is generated based on coding-decoding deep neural network text snippet |
US11210324B2 (en) | 2016-06-03 | 2021-12-28 | Microsoft Technology Licensing, Llc | Relation extraction across sentence boundaries |
CN106202042B (en) * | 2016-07-06 | 2019-07-02 | 中央民族大学 | A kind of keyword abstraction method based on figure |
CN106227722B (en) * | 2016-09-12 | 2019-07-05 | 中山大学 | A kind of extraction method based on listed company's bulletin abstract |
CN106407182A (en) * | 2016-09-19 | 2017-02-15 | 国网福建省电力有限公司 | A method for automatic abstracting for electronic official documents of enterprises |
CN106502985B (en) * | 2016-10-20 | 2020-01-31 | 清华大学 | neural network modeling method and device for generating titles |
CN108509408B (en) * | 2017-02-27 | 2019-11-22 | 芋头科技(杭州)有限公司 | A kind of sentence similarity judgment method |
CN108287858B (en) * | 2017-03-02 | 2021-08-10 | 腾讯科技(深圳)有限公司 | Semantic extraction method and device for natural language |
CN108733682B (en) * | 2017-04-14 | 2021-06-22 | 华为技术有限公司 | Method and device for generating multi-document abstract |
CN107169049B (en) * | 2017-04-25 | 2023-04-28 | 腾讯科技(深圳)有限公司 | Application tag information generation method and device |
CN108959312B (en) | 2017-05-23 | 2021-01-29 | 华为技术有限公司 | Method, device and terminal for generating multi-document abstract |
CN107274077B (en) * | 2017-05-31 | 2020-07-31 | 清华大学 | Course first-order and last-order computing method and equipment |
CN107291836B (en) * | 2017-05-31 | 2020-06-02 | 北京大学 | Chinese text abstract obtaining method based on semantic relevancy model |
CN107291895B (en) * | 2017-06-21 | 2020-05-26 | 浙江大学 | Quick hierarchical document query method |
CN107562718B (en) * | 2017-07-24 | 2020-12-22 | 科大讯飞股份有限公司 | Text normalization method and device, storage medium and electronic equipment |
CN107463658B (en) * | 2017-07-31 | 2020-03-31 | 广州市香港科大霍英东研究院 | Text classification method and device |
CN107766419B (en) * | 2017-09-08 | 2021-08-31 | 广州汪汪信息技术有限公司 | Threshold denoising-based TextRank document summarization method and device |
CN108182621A (en) * | 2017-12-07 | 2018-06-19 | 合肥美的智能科技有限公司 | The Method of Commodity Recommendation and device for recommending the commodity, equipment and storage medium |
CN108304445B (en) * | 2017-12-07 | 2021-08-03 | 新华网股份有限公司 | Text abstract generation method and device |
CN108182247A (en) * | 2017-12-28 | 2018-06-19 | 东软集团股份有限公司 | Text summarization method and apparatus |
CN108090049B (en) * | 2018-01-17 | 2021-02-05 | 山东工商学院 | Multi-document abstract automatic extraction method and system based on sentence vectors |
CN110609997B (en) * | 2018-06-15 | 2023-05-23 | 北京百度网讯科技有限公司 | Method and device for generating abstract of text |
CN110891074A (en) * | 2018-08-06 | 2020-03-17 | 珠海格力电器股份有限公司 | Information pushing method and device |
CN109408802A (en) * | 2018-08-28 | 2019-03-01 | 厦门快商通信息技术有限公司 | A kind of method, system and storage medium promoting sentence vector semanteme |
CN109522403B (en) * | 2018-11-05 | 2023-04-21 | 中山大学 | Abstract text generation method based on fusion coding |
CN109657051A (en) * | 2018-11-30 | 2019-04-19 | 平安科技(深圳)有限公司 | Text snippet generation method, device, computer equipment and storage medium |
CN109684642B (en) * | 2018-12-26 | 2023-01-13 | 重庆电信系统集成有限公司 | Abstract extraction method combining page parsing rule and NLP text vectorization |
CN109902284A (en) * | 2018-12-30 | 2019-06-18 | 中国科学院软件研究所 | A kind of unsupervised argument extracting method excavated based on debate |
CN110083828A (en) * | 2019-03-29 | 2019-08-02 | 珠海远光移动互联科技有限公司 | A kind of Text Clustering Method and device |
CN110096705B (en) * | 2019-04-29 | 2023-09-08 | 扬州大学 | Unsupervised English sentence automatic simplification algorithm |
CN110032741B (en) * | 2019-05-06 | 2020-02-04 | 重庆理工大学 | Pseudo text generation method based on semantic extension and maximum edge correlation |
CN112036165A (en) * | 2019-05-14 | 2020-12-04 | 西交利物浦大学 | Method for constructing news characteristic vector and application |
CN110232109A (en) * | 2019-05-17 | 2019-09-13 | 深圳市兴海物联科技有限公司 | A kind of Internet public opinion analysis method and system |
CN110287309B (en) * | 2019-06-21 | 2022-04-22 | 深圳大学 | Method for quickly extracting text abstract |
CN110362674B (en) * | 2019-07-18 | 2020-08-04 | 中国搜索信息科技股份有限公司 | Microblog news abstract extraction type generation method based on convolutional neural network |
CN110737768B (en) * | 2019-10-16 | 2022-04-08 | 信雅达科技股份有限公司 | Text abstract automatic generation method and device based on deep learning and storage medium |
CN111125349A (en) * | 2019-12-17 | 2020-05-08 | 辽宁大学 | Graph model text abstract generation method based on word frequency and semantics |
CN111090731A (en) * | 2019-12-20 | 2020-05-01 | 山大地纬软件股份有限公司 | Electric power public opinion abstract extraction optimization method and system based on topic clustering |
US11263388B2 (en) | 2020-02-17 | 2022-03-01 | Wipro Limited | Method and system for dynamically generating summarised content for visual and contextual text data |
CN111339754B (en) * | 2020-03-04 | 2022-06-21 | 昆明理工大学 | Case public opinion abstract generation method based on case element sentence association graph convolution |
CN111460117B (en) * | 2020-03-20 | 2024-03-08 | 平安科技(深圳)有限公司 | Method and device for generating intent corpus of conversation robot, medium and electronic equipment |
CN111625621B (en) * | 2020-04-27 | 2023-05-09 | 中国铁道科学研究院集团有限公司电子计算技术研究所 | Document retrieval method and device, electronic equipment and storage medium |
CN111651562B (en) * | 2020-06-05 | 2023-03-21 | 东北电力大学 | Scientific and technological literature content deep revealing method based on content map |
CN111897925B (en) * | 2020-08-04 | 2022-08-26 | 广西财经学院 | Pseudo-correlation feedback expansion method integrating correlation mode mining and word vector learning |
CN112347241A (en) * | 2020-11-10 | 2021-02-09 | 华夏幸福产业投资有限公司 | Abstract extraction method, device, equipment and storage medium |
CN112560496B (en) * | 2020-12-09 | 2024-02-02 | 北京百度网讯科技有限公司 | Training method and device of semantic analysis model, electronic equipment and storage medium |
CN113157914B (en) * | 2021-02-04 | 2022-06-14 | 福州大学 | Document abstract extraction method and system based on multilayer recurrent neural network |
CN112711662A (en) * | 2021-03-29 | 2021-04-27 | 贝壳找房(北京)科技有限公司 | Text acquisition method and device, readable storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622411A (en) * | 2012-02-17 | 2012-08-01 | 清华大学 | Structured abstract generating method |
CN103605702A (en) * | 2013-11-08 | 2014-02-26 | 北京邮电大学 | Word similarity based network text classification method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7480893B2 (en) * | 2002-10-04 | 2009-01-20 | Siemens Corporate Research, Inc. | Rule-based system and method for checking compliance of architectural analysis and design models |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622411A (en) * | 2012-02-17 | 2012-08-01 | 清华大学 | Structured abstract generating method |
CN103605702A (en) * | 2013-11-08 | 2014-02-26 | 北京邮电大学 | Word similarity based network text classification method |
Non-Patent Citations (1)
Title |
---|
Algorithm for generating biomedical abstracts using semantic relation extraction; Shang Yue et al.; Journal of Frontiers of Computer Science and Technology; 2011-12-31; Vol. 5, No. 11; pp. 1027-1036 *
Also Published As
Publication number | Publication date |
---|---|
CN104834735A (en) | 2015-08-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104834735B (en) | A kind of documentation summary extraction method based on term vector | |
Bollacker et al. | CiteSeer: An autonomous web agent for automatic retrieval and identification of interesting publications | |
Sambasivam et al. | Advanced data clustering methods of mining Web documents. | |
KR20060122276A (en) | Relation extraction from documents for the automatic construction of ontologies | |
Biancalana et al. | Social tagging in query expansion: A new way for personalized web search | |
Odeh et al. | Arabic text categorization algorithm using vector evaluation method | |
CN109597995A (en) | A kind of document representation method based on BM25 weighted combination term vector | |
CN104765779A (en) | Patent document inquiry extension method based on YAGO2s | |
CN110851593A (en) | Complex value word vector construction method based on position and semantics | |
Mao et al. | Automatic keywords extraction based on co-occurrence and semantic relationships between words | |
Ghanem et al. | Stemming effectiveness in clustering of Arabic documents | |
Xu et al. | Improving pseudo-relevance feedback with neural network-based word representations | |
Gamal et al. | Hybrid Algorithm Based on Chicken Swarm Optimization and Genetic Algorithm for Text Summarization. | |
Khotimah et al. | Indonesian News Articles Summarization Using Genetic Algorithm. | |
KR102280494B1 (en) | Method for providing internet search service sorted by correlation based priority specialized in professional areas | |
Nakashole | Automatic extraction of facts, relations, and entities for web-scale knowledge base population | |
Asa et al. | A comprehensive survey on extractive text summarization techniques | |
Pita et al. | Strategies for short text representation in the word vector space | |
Heidary et al. | Automatic text summarization using genetic algorithm and repetitive patterns | |
Heidary et al. | Automatic Persian text summarization using linguistic features from text structure analysis | |
Kanwal et al. | Adaptively intelligent meta-search engine with minimum edit distance | |
CN114580557A (en) | Document similarity determination method and device based on semantic analysis | |
KR102198780B1 (en) | Method for providing correlation based internet search service specialized in professional areas | |
Maria et al. | A new model for Arabic multi-document text summarization | |
Munirsyah et al. | Development synonym set for the English wordnet using the method of comutative and agglomerative clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||