CN108182247A - Text summarization method and apparatus - Google Patents
- Publication number
- CN108182247A (application CN201711463868.5A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- digest
- scoring
- candidate
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Abstract
The present invention proposes a text summarization method and apparatus. The method includes: obtaining the sentence vector of each sentence; selecting candidate digest sentences from all sentences according to the sentence vectors; obtaining a first scoring of each candidate digest sentence; selecting digest sentences from the candidate digest sentences according to the first scoring; and generating the digest of the article from the digest sentences. Because sentence vectors retain sentence information well, and the first scoring represents the sentence quality per unit length of a candidate digest sentence, performing this double selection over all sentences in the article by sentence vector and first scoring improves the accuracy and coverage with which the digest summarizes the central content of the article.
Description
Technical field
The present invention relates to technical field of information processing more particularly to a kind of Text summarization method and apparatus.
Background technology
A digest is a concise, coherent short text that accurately and comprehensively reflects the central content of a document. At present, related digest extraction techniques typically compute a score for each word based on term frequency-inverse document frequency (TF-IDF) and then score each sentence through the words it contains.
Because a TF-IDF-based digest extraction technique computes TF-IDF for the words in a sentence and takes the top-N words as the score of the digest sentence, it cannot consider the whole word-frequency vector, especially when sentences are long. If a sentence contains many low-frequency words and all of those words are discarded, the information covered by the sentence is affected. A digest extracted in this way therefore tends not to cover the sentence information, which reduces the accuracy and coverage with which the digest summarizes the central content of the document.
Invention content
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present invention is to propose a text summarization method that selects candidate digest sentences according to sentence vectors, then further selects digest sentences from the candidate digest sentences according to their first scoring, and generates the digest from the digest sentences. Because sentence vectors retain sentence information well and the first scoring represents the sentence quality per unit length of a candidate digest sentence, the digest generated in this way better reflects the central content of the article and improves the coverage of the digest.
A second object of the present invention is to propose a text summarization device.
A third object of the present invention is to propose a computer device.
A fourth object of the present invention is to propose a non-transitory computer-readable storage medium.
A fifth object of the present invention is to propose a computer program product.
To achieve the above objects, an embodiment of the first aspect of the present invention proposes a text summarization method, including:
obtaining the sentence vector of each sentence in an article;
selecting candidate digest sentences from all sentences according to the sentence vectors;
obtaining a first scoring of each candidate digest sentence, where the first scoring represents the sentence quality per unit length of the candidate digest sentence;
selecting digest sentences from the candidate digest sentences according to the first scoring; and
generating the digest of the article from the digest sentences.
In a possible implementation of the embodiment of the first aspect, obtaining the sentence vector of each sentence in the article includes:
identifying the sentence from the article;
obtaining an original sentence vector of the sentence; and
performing dimension reduction on the original sentence vector through a restricted Boltzmann machine neural network to obtain the sentence vector.
In a possible implementation of the embodiment of the first aspect, obtaining the original sentence vector of the sentence includes:
identifying the position of the sentence in the article according to clue words, and calculating a position score of the sentence;
calculating a first similarity between the sentence and the topic of the article;
obtaining a feature space of the sentence represented by a multidimensional dictionary; and
forming the original sentence vector of the sentence from the position score, the first similarity, and the feature space.
In a possible implementation of the embodiment of the first aspect, calculating the first similarity between the sentence and the topic of the article includes:
converting the sentence into a set of words;
filtering stop words out of the set of words and extracting the stems; and
calculating the first similarity according to the extracted stems.
In a possible implementation of the embodiment of the first aspect, selecting candidate digest sentences from all sentences according to the sentence vectors includes:
starting from the first sentence of the article, scoring the sentence vector of each sentence one by one based on a support vector machine (SVM) model to obtain a second scoring of the sentence;
comparing the second scoring of the sentence with a preset first threshold; if the second scoring exceeds the first threshold, taking the sentence as a candidate digest sentence and updating the first threshold to the second scoring; and
if the second scoring of the sentence does not exceed the first threshold, discarding the sentence and keeping the first threshold unchanged.
In a possible implementation of the embodiment of the first aspect, obtaining the first scoring of the candidate digest sentence includes:
adding the candidate digest sentence with the highest second scoring to a digest sentence set, where the digest sentence set contains the digest sentences;
choosing one candidate digest sentence at a time from the remaining candidate digest sentences as the current candidate digest sentence, and calculating a redundancy scoring between the current candidate digest sentence and the digest sentence set; and
calculating the first scoring from the redundancy scoring, the second scoring of the current candidate digest sentence, and the sentence length of the current candidate digest sentence.
In a possible implementation of the embodiment of the first aspect, calculating the redundancy scoring between the current candidate digest sentence and the digest sentence set includes:
obtaining a second similarity between the current candidate digest sentence and each digest sentence already in the digest sentence set; and
summing all the second similarities and dividing the sum by the number of digest sentences already in the digest sentence set to obtain the redundancy scoring.
In a possible implementation of the embodiment of the first aspect, calculating the first scoring from the redundancy scoring, the second scoring of the current candidate digest sentence, and the sentence length of the current candidate digest sentence includes:
obtaining the difference between the second scoring of the current candidate digest sentence and the redundancy scoring; and
dividing the difference by the sentence length of the current candidate digest sentence to obtain the first scoring.
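The first scoring defined above reduces to one line, included here to make the formula concrete:

```python
def first_score(second_score, redundancy, sentence_length):
    """First scoring: sentence quality per unit length, computed as
    (second scoring - redundancy scoring) / sentence length."""
    return (second_score - redundancy) / sentence_length
```

Dividing by the sentence length penalizes long sentences, since the word count of a digest is limited.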
In a possible implementation of the embodiment of the first aspect, before calculating the first scoring from the redundancy scoring, the second scoring of the current candidate digest sentence, and the sentence length of the current candidate digest sentence, the method further includes:
determining that the redundancy scoring of the current candidate digest sentence is less than a redundancy threshold; and
discarding the current candidate digest sentence when its redundancy scoring is greater than the redundancy threshold.
In a possible implementation of the embodiment of the first aspect, selecting digest sentences from the candidate digest sentences according to the first scoring includes:
if the first scoring is greater than or equal to a preset second threshold, adding the current candidate digest sentence to the digest sentence set and updating the second threshold to the first scoring.
In a possible implementation of the embodiment of the first aspect, adding the current candidate digest sentence to the digest sentence set includes:
obtaining the total sentence length of the digest sentences currently in the digest sentence set;
calculating the sum of the sentence length of the current candidate digest sentence and the total sentence length, and comparing the sum with a preset digest length;
if the sum does not exceed the digest length, adding the current candidate digest sentence to the digest sentence set; and
if the sum exceeds the digest length, no longer adding the current candidate digest sentence to the digest sentence set and stopping the update of the digest sentence set.
In a possible implementation of the embodiment of the first aspect, after selecting the sentences whose second scoring exceeds the preset first threshold as candidate digest sentences, the method further includes:
forming a candidate digest sentence set according to the order of the candidate digest sentences; and
after the current candidate digest sentence is added to the digest sentence set, deleting the current candidate digest sentence from the candidate digest sentence set.
In the text summarization method of the embodiment of the present invention, the sentence vector of each sentence is obtained; candidate digest sentences are selected from all sentences according to the sentence vectors; the first scoring of each candidate digest sentence is obtained; digest sentences are selected from the candidate digest sentences according to the first scoring; and the digest of the article is generated from the digest sentences. In this embodiment, because sentence vectors retain sentence information well, and the first scoring represents the sentence quality per unit length of a candidate digest sentence, performing this double selection over all sentences in the article by sentence vector and first scoring improves the accuracy and coverage with which the digest summarizes the central content of the article. This solves the problem in related digest extraction techniques that representing sentence information by the words with the highest TF-IDF ignores the relationships between words and loses the information of many words, narrowing the coverage of the digest.
To achieve the above objects, an embodiment of the second aspect of the present invention proposes a text summarization device, including:
a first acquisition module, configured to obtain the sentence vector of each sentence in an article;
a first selection module, configured to select candidate digest sentences from all sentences according to the sentence vectors;
a second acquisition module, configured to obtain a first scoring of each candidate digest sentence, where the first scoring represents the sentence quality per unit length of the candidate digest sentence;
a second selection module, configured to select digest sentences from the candidate digest sentences according to the first scoring; and
a generation module, configured to generate the digest of the article from the digest sentences.
In the text summarization device of the embodiment of the present invention, the sentence vector of each sentence is obtained; candidate digest sentences are selected from all sentences according to the sentence vectors; the first scoring of each candidate digest sentence is obtained; digest sentences are selected from the candidate digest sentences according to the first scoring; and the digest of the article is generated from the digest sentences. In this embodiment, because sentence vectors retain sentence information well, and the first scoring represents the sentence quality per unit length of a candidate digest sentence, performing this double selection over all sentences in the article by sentence vector and first scoring improves the accuracy and coverage with which the digest summarizes the central content of the article. This solves the problem in related digest extraction techniques that representing sentence information by the words with the highest TF-IDF ignores the relationships between words and loses the information of many words, narrowing the coverage of the digest.
To achieve the above objects, an embodiment of the third aspect of the present invention proposes a computer device, including a processor and a memory, where the processor runs a program corresponding to executable program code stored in the memory by reading the executable program code, so as to implement the text summarization method described in the embodiment of the first aspect of the present invention.
To achieve the above objects, an embodiment of the fourth aspect of the present invention proposes a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the text summarization method described in the embodiment of the first aspect of the present invention.
To achieve the above objects, an embodiment of the fifth aspect of the present invention proposes a computer program product, where the text summarization method described in the embodiment of the first aspect of the present invention is implemented when instructions in the computer program product are executed by a processor.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and will in part become apparent from the description or be learned through practice of the present invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flow diagram of a text summarization method provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of another text summarization method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the process of obtaining an original sentence vector provided by an embodiment of the present invention;
Fig. 4 is a flow diagram of another text summarization method provided by an embodiment of the present invention;
Fig. 5 is a text summarization device proposed by an embodiment of the present invention;
Fig. 6 is a block diagram of an exemplary computer device suitable for implementing an embodiment of the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and are not to be construed as limiting the present invention.
The text summarization method and apparatus of the embodiments of the present invention are described below with reference to the accompanying drawings.
A digest is a concise, coherent short text that accurately and comprehensively reflects the central content of a document. At present, related digest extraction techniques typically score sentences based on TF-IDF.
Because a TF-IDF-based digest extraction technique computes TF-IDF for the words in a sentence and takes the top-N words as the score of the digest sentence, it cannot consider the whole word-frequency vector, especially when sentences are long. If a sentence contains many low-frequency words and all of those words are discarded, the information covered by the sentence is affected. A digest extracted in this way therefore tends not to cover the sentence information, which reduces the accuracy and coverage with which the digest summarizes the central content of the document.
To address this problem, an embodiment of the present invention proposes a text summarization method that selects candidate digest sentences according to sentence vectors, then further selects digest sentences from the candidate digest sentences according to their first scoring, and generates the digest from the digest sentences. Because sentence vectors retain sentence information well and the first scoring represents the sentence quality per unit length of a candidate digest sentence, the digest generated in this way better reflects the central content of the article and improves the coverage of the digest.
Fig. 1 is a flow diagram of a text summarization method provided by an embodiment of the present invention.
As shown in Fig. 1, the digest generation method includes:
Step 101: obtain the sentence vector of each sentence in the article.
In this embodiment, for an article whose digest is to be extracted, each sentence in the article can be obtained according to punctuation marks. Specifically, all the words of the article are first segmented by stop words and punctuation marks to obtain the sentence set of the article.
After the sentences are obtained, the sentence vector of each sentence is calculated: the words in each sentence are encoded according to the words in a corpus, and the encoded results replace the corresponding words in the original sentence, yielding a coding vector that a machine can recognize. This coding vector is the sentence vector.
Step 102: select candidate digest sentences from all sentences according to the sentence vectors.
In this embodiment, for each sentence, the encoded sentence vector is input into a support vector machine (SVM) model to obtain the SVM's scoring of the sentence vector, and candidate digest sentences are selected from all sentences based on these scorings. Specifically, the sentences whose sentence-vector scoring exceeds a preset threshold are chosen as candidate digest sentences.
Because sentence vectors retain more sentence information, the candidate digest sentences selected with sentence vectors reflect the central content of the article better than words extracted with TF-IDF.
Step 103: obtain the first scoring of each candidate digest sentence, where the first scoring represents the sentence quality per unit length of the candidate digest sentence.
After the candidate digest sentences are extracted, the first scoring of each candidate digest sentence is obtained. Because the word count of a digest is typically limited, the length of a sentence affects the digest, so in this embodiment the sentence quality per unit length is used as the first scoring. The sentence quality per unit length is obtained by dividing the overall scoring of the sentence by the sentence length. After the first scoring is obtained, digest sentences can be filtered out of the candidate digest sentences according to the first scoring.
Step 104: select digest sentences from the candidate digest sentences according to the first scoring.
In this embodiment, as one possible implementation, the candidate digest sentences can be sorted in descending order of first scoring, and the top preset number of candidate digest sentences are taken as digest sentences.
As another possible implementation, the candidate digest sentences whose first scoring exceeds a preset threshold can be taken as digest sentences.
Because the first scoring represents the sentence quality per unit length, a higher first scoring indicates a higher per-unit-length quality of the candidate digest sentence, so the digest sentences selected according to the first scoring can comprehensively reflect the central content of the article.
Step 105: generate the digest of the article from the digest sentences.
So that the logical order of the digest corresponds to the article, after the digest sentences are extracted, they can be sorted according to their order of appearance in the article to obtain the digest of the article.
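Steps 101-105 can be sketched as a minimal pipeline. The `embed`, `svm_score`, and `first_scoring` callables are placeholders for the sentence-vector encoder, the SVM scorer, and the first-scoring computation described above; the names and the `top_n` cut-off are illustrative assumptions, not part of the patent:

```python
def generate_digest(article_sentences, embed, svm_score, first_scoring, top_n=3):
    """Double selection: pick candidates by sentence-vector scoring,
    then pick digest sentences by first scoring, and output them
    in their original article order (step 105)."""
    # Step 101: sentence vectors
    vecs = [embed(s) for s in article_sentences]
    # Step 102: candidate digest sentences by SVM scoring of sentence vectors
    scored = [(i, s, svm_score(v))
              for i, (s, v) in enumerate(zip(article_sentences, vecs))]
    candidates = [(i, s, sc) for i, s, sc in scored if sc > 0]
    # Steps 103-104: rank candidates by first scoring, keep the top few
    ranked = sorted(candidates, key=lambda t: first_scoring(t[2], t[1]),
                    reverse=True)[:top_n]
    # Step 105: restore article order before joining into the digest
    ranked.sort(key=lambda t: t[0])
    return [s for _, s, _ in ranked]
```

This is only a skeleton showing the control flow of the double selection; the redundancy check and rising thresholds of the second embodiment are omitted here.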
To increase the speed of digest extraction, a support vector machine (Support Vector Machine, SVM for short) is used to quickly calculate the second scoring of each sentence vector, and candidate digest sentences are chosen according to the second scoring. The text summarization method proposed by the present invention is described below through another embodiment. Fig. 2 is a flow diagram of another text summarization method provided by an embodiment of the present invention.
As shown in Fig. 2, the digest generation method includes:
Step 201: obtain the original sentence vector of each sentence.
In this embodiment, the original sentence vector can be formed from the position score of the sentence, the first similarity between the sentence and the topic of the article, and the multidimensional feature space of the sentence.
Fig. 3 is a schematic diagram of the process of obtaining an original sentence vector provided by an embodiment of the present invention.
As shown in Fig. 3, the process of obtaining the original sentence vector includes:
Step 301: identify the sentences in the article.
In this embodiment, all sentences in the article can be identified according to punctuation marks such as full stops, question marks, exclamation marks, and semicolons. For articles with few or no punctuation marks, the sentences can be identified by combining the semantics of the text with sentence structure to divide out individual sentences. Sentence structures can include subject-predicate-object and attributive-adverbial-complement structures. Starting from the beginning of the article, a passage of text is built up word by word; when an identified passage contains a specific sentence structure and is semantically complete, the passage can be recognized as a sentence. The word count of such a passage can be limited to a preset number, for example, at most 30 words.
Step 302: identify the position of the sentence in the article according to clue words, and calculate the position score of the sentence.
Because the beginning and end of a paragraph typically contain summarizing sentences, the position of a sentence can be identified through clue words; then, according to the position of the sentence in the article, such as the beginning, middle, or end of a paragraph, together with the cosine similarity calculated between the sentence and the title of the article, the position score of the sentence is determined. Alternatively, different position scores can be set empirically by professionals for different positions.
Steps 303-305 are the process of calculating the first similarity between a sentence and the title of the article.
Step 303: convert the sentence into a set of words.
For each sentence, word segmentation can be performed to convert the sentence into a set of words.
Step 304: filter stop words out of the set of words, and extract the stems.
Because a sentence may contain stop words (such as function words and pronouns), the stop words in the set of words can be filtered out according to a stop-word list in order to reduce the computation of the first similarity, and a stem extraction algorithm, such as the one in Lucene, can be used to extract the stems. Lucene is an open-source full-text search engine toolkit that can segment the original sentences in an article and remove punctuation marks and stop words to obtain the sentence trunk, from which the stems of the words it contains are formed.
Step 305: calculate the first similarity between the sentence and the topic according to the extracted stems.
Because the topic is a refined summary of the whole article, the first similarity between a sentence and the topic of the article can be calculated. Specifically, the extracted stems can be compared for word similarity with the words in the topic to obtain a similarity for each stem. After the similarities of all stems are obtained, they can be weighted and averaged to obtain the first similarity between the sentence and the topic, which can serve as one element of the original sentence vector.
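Under the assumptions that `word_sim` is some word-level similarity function (the patent does not fix one), that each stem is matched against its most similar topic word, and that weights default to uniform, the weighted average of step 305 could be sketched as:

```python
def first_similarity(stems, topic_words, word_sim, weights=None):
    """Weighted average of the best per-stem similarity against the
    topic words; uniform weights are used if none are given."""
    if not stems:
        return 0.0
    if weights is None:
        weights = [1.0] * len(stems)
    sims = [max(word_sim(s, t) for t in topic_words) for s in stems]
    return sum(w * x for w, x in zip(weights, sims)) / sum(weights)
```

The result is a single scalar, matching its role as one element of the original sentence vector.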
Steps 306-307 are the process of obtaining the feature space of the sentence represented by a multidimensional dictionary.
Step 306: calculate the weight of each stem according to TF-IDF.
For each extracted stem, its weight in the sentence can be calculated based on TF-IDF.
Step 307: obtain the feature space of the sentence represented by the multidimensional dictionary.
In this embodiment, the multidimensional dictionary can be assembled from the stems of the word sets of all sentences, with one dimension per stem. If 5000 stems are set, the feature space of a sentence is represented by a 5000-dimensional dictionary. After the feature space of the sentence marked by the multidimensional dictionary is obtained, the feature space can be recognized by a machine.
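A minimal sketch of steps 306-307, building the stem dictionary over all sentences and filling each sentence's feature space with TF-IDF weights. The plain `tf * log(N/df)` form is an assumption, since the patent does not specify the exact TF-IDF variant:

```python
import math
from collections import Counter

def tfidf_feature_spaces(sentence_stems):
    """sentence_stems: list of stem lists, one per sentence.
    Returns (dictionary, vectors): the multidimensional dictionary
    (stem -> dimension index) and one TF-IDF vector per sentence."""
    vocab = sorted({s for stems in sentence_stems for s in stems})
    dictionary = {s: i for i, s in enumerate(vocab)}
    n = len(sentence_stems)
    df = Counter(s for stems in sentence_stems for s in set(stems))
    vectors = []
    for stems in sentence_stems:
        tf = Counter(stems)
        vec = [0.0] * len(dictionary)
        for stem, count in tf.items():
            vec[dictionary[stem]] = (count / len(stems)) * math.log(n / df[stem])
        vectors.append(vec)
    return dictionary, vectors
```

In the embodiment the dictionary would be capped (for example at 5000 stems); no cap is applied in this sketch.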
Step 308: form the original sentence vector of the sentence from the position score, the first similarity, and the feature space.
In this embodiment, the position score of the sentence, the first similarity between the sentence and the topic, and the feature space of the sentence each serve as elements in constructing the original sentence vector. Once the position score, the first similarity, and the feature space are obtained, the original sentence vector of the sentence can be formed from them. For the acquisition processes of the position score, the first similarity, and the feature space, refer to the relevant descriptions in the above embodiments, which are not repeated here.
In this embodiment, because the original sentence vector is composed of the position score, the first similarity, the feature space, and so on, it retains more sentence information than representing the sentence by words chosen with TF-IDF.
Step 202: perform dimension reduction on the original sentence vector through a restricted Boltzmann machine neural network to obtain the sentence vector.
Because of the inherent high dimensionality of language, related digest extraction techniques usually require hand-constructed features for the classifier, which is labor-intensive.
In the embodiment of the present invention, dimension reduction is performed on the original sentence vector through a restricted Boltzmann machine (Restricted Boltzmann Machine, RBM for short) neural network to obtain the sentence vector.
An RBM has only an input layer and a hidden layer: the hidden layer is obtained from the input layer through the adjustment of weights, and the hidden layer can in turn reconstruct the input-layer data. This process is somewhat similar to an autoencoder, which has an input layer, a hidden layer, and an output layer and is trained with the error back-propagation (Back Propagation, BP for short) principle of neural networks, with the objective function of minimizing the sum of squared errors. The RBM network is trained with the contrastive divergence (CD) fast algorithm, using Gibbs sampling, so that it satisfies the Boltzmann distribution and rapidly converges to a stable state. For example, an original sentence vector preprocessed by stop-word removal and the like is input into the RBM network; an output is obtained through the processing of the hidden layer; the error between this output and the input original sentence vector is calculated; and the weights are readjusted and training continues. Once the Boltzmann distribution is satisfied, the RBM network reaches a stable state. The latent-space vector extracted from the hidden-layer nodes in the stable state is the dimension-reduced sentence vector.
Performing feature selection through the RBM reduces labor cost, makes features easier to obtain, and makes the sentence information contained in the features more complete.
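As an illustrative sketch (not the patent's implementation), scikit-learn's `BernoulliRBM`, which is trained with contrastive divergence, can reduce original sentence vectors to a lower-dimensional hidden-layer representation; the scaling step and hyperparameters here are assumptions:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

def reduce_sentence_vectors(original_vectors, n_dims=16, seed=0):
    """Train an RBM with contrastive divergence on the original sentence
    vectors (values scaled into [0, 1], as BernoulliRBM expects) and
    return the hidden-layer activations as the reduced sentence vectors."""
    X = np.asarray(original_vectors, dtype=float)
    X = (X - X.min()) / (X.max() - X.min() + 1e-12)  # scale to [0, 1]
    rbm = BernoulliRBM(n_components=n_dims, learning_rate=0.05,
                       n_iter=20, random_state=seed)
    return rbm.fit_transform(X)
```

The hidden activations returned by `transform` are conditional probabilities, so each component lies in [0, 1].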
Step 203: starting from the first sentence of the article, score the sentence vector of each sentence one by one based on the support vector machine (SVM) model, and obtain the second scoring of each sentence.
Because the SVM model is suited to high-dimensional vectors, the score of a sentence can be calculated quickly through the support vectors. In this embodiment, starting from the first sentence of the article, the sentence vector of each sentence is scored one by one based on the SVM to obtain the second scoring of the sentence.
Specifically, the SVM model is built from an annotated expert sample set: a support-vector hyperplane is established by maximizing the margin of the decision function, and the support vectors divide digest sentences onto the two sides of the hyperplane. The sentence vectors extracted by the restricted Boltzmann machine are fed into the SVM model, the inner product between each sentence vector and the support vectors is calculated, and the vector is classified. The SVM model can also calculate the distance between the sentence vector and the hyperplane: the larger the distance, the greater the certainty of the SVM classification, that is, the more likely the sentence is a digest sentence.
Step 204: compare the second score of the sentence with a preset first threshold.
Step 205: if the second score of the sentence does not exceed the first threshold, discard the sentence, keep the first threshold unchanged, and return to step 203 until the last sentence has been processed, yielding all candidate digest sentences.
Step 206: if the second score exceeds the first threshold, take the sentence as a candidate digest sentence and update the first threshold to the second score.
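Steps 204 to 206 can be sketched as a single pass with a rising threshold; the scores below are invented for illustration.

```python
def select_candidates(second_scores, first_threshold):
    # Keep a sentence when its second score exceeds the current first
    # threshold, then raise the threshold to that score, so each later
    # candidate must score higher than the previous one.
    candidates = []
    for index, score in enumerate(second_scores):
        if score > first_threshold:
            candidates.append((index, score))
            first_threshold = score
    return candidates

scores = [0.4, 0.6, 0.5, 0.9]          # hypothetical second scores
print(select_candidates(scores, 0.3))  # [(0, 0.4), (1, 0.6), (3, 0.9)]
```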
In this embodiment, as soon as the second score of a sentence is obtained, it is compared with the first threshold; if the second score exceeds the first threshold, the sentence is taken as a candidate digest sentence. Further, when the second score exceeds the first threshold, the first threshold is updated to that second score. It should be noted that for the first sentence of the article the first threshold is an initial value; thereafter, while the following sentences are screened, whenever the second score of the current sentence is higher than the first threshold, the first threshold is updated with that score, which guarantees that the scores of the candidate digest sentences selected later rise step by step.
Step 207: add the sentence with the highest second score among the candidate digest sentences to the digest sentence set.
After all candidate digest sentences of the article are obtained, the candidates are ranked by their second scores, and the candidate with the highest second score is added to the digest sentence set that stores digest sentences.
Step 208: take one candidate digest sentence at a time from the remaining candidates as the current candidate digest sentence, and compute the redundancy score between the current candidate digest sentence and the digest sentence set.
In related digest extraction techniques, the highest-TF-IDF words represent the sentence information, so similar words are easily selected; redundancy between digest sentences therefore increases and the coverage of the digest narrows.
In this embodiment, after the candidate digest sentence with the highest second score has been added to the digest sentence set as a digest sentence, in order to choose further digest sentences from the remaining candidates, the redundancy score between each remaining candidate and the digest sentence set can be computed and used as the basis for extracting digest sentences, according to the redundancy between a candidate and the digest sentence set.
In a specific implementation, the second similarity between the current candidate digest sentence and each digest sentence already in the set is obtained; all second similarities are summed, and the sum is divided by the number of digest sentences already in the set to obtain the redundancy score, as shown in formula (1):

r(x, S*) = ( Σ_{s ∈ S*} similar(x, s) ) / size(S*)    (1)

where x denotes the current candidate digest sentence, S* the digest sentence set, r(x, S*) the redundancy score between candidate x and S*, s ∈ S* a digest sentence in the set, and similar(x, s) the second similarity between x and digest sentence s; the numerator Σ_{s ∈ S*} similar(x, s) is the sum of the second similarities between the current candidate and each digest sentence already in the set, and size(S*) is the number of digest sentences already in the set.
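Formula (1) can be sketched directly in code. The word-overlap (Jaccard) measure below is only a hypothetical stand-in for whatever second similarity the embodiment uses.

```python
def jaccard(a, b):
    # Hypothetical second similarity: word-set overlap between two sentences.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def redundancy_score(candidate, digest_set, similar=jaccard):
    # r(x, S*) = sum of similar(x, s) over s in S*, divided by size(S*).
    return sum(similar(candidate, s) for s in digest_set) / len(digest_set)

digest_set = ["cats purr softly", "dogs bark loudly"]
print(redundancy_score("cats purr softly", digest_set))  # (1.0 + 0.0) / 2 = 0.5
```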
Step 209: compute the first score from the redundancy score, the second score of the current candidate digest sentence, and the sentence length of the current candidate digest sentence.
As one possible implementation, the difference between the second score of the current candidate digest sentence and its redundancy score is computed, and the difference is divided by the sentence length of the current candidate, giving the first score. Specifically, the first score of the current candidate digest sentence can be computed according to formula (2):

fx(score) = ( f(x) - r(x, S*) ) / len(x)    (2)

where x denotes the current candidate digest sentence, fx(score) its first score, r(x, S*) the redundancy score between candidate x and the digest sentence set S*, f(x) the second score of the current candidate, and len(x) its sentence length, that is, the number of characters in x.
In this embodiment, the redundancy score between the current candidate digest sentence and the digest sentence set serves as a basis for computing the first score; it follows from formula (2) that, with the sentence length and the second score fixed, the larger the redundancy score, the smaller the first score.
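Formula (2) reduces to a single line; the numbers below are illustrative only.

```python
def first_score(second_score, redundancy, sentence):
    # fx(score) = (f(x) - r(x, S*)) / len(x); len(x) counts characters,
    # so with the second score fixed, higher redundancy or a longer
    # sentence lowers the first score.
    return (second_score - redundancy) / len(sentence)

print(first_score(0.8, 0.2, "abcdefghij"))  # (0.8 - 0.2) / 10, about 0.06
```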
It should be noted that, to prevent large redundancy between the current candidate digest sentence and the existing digest sentences, after the redundancy score is obtained it can be compared with a preset redundancy threshold. If the redundancy score of the current candidate is below the redundancy threshold, the candidate repeats little of the content of the existing digest sentences, and the computation of its first score can proceed. If the redundancy score of the current candidate is not below the redundancy threshold, the candidate repeats much of the content of the existing digest sentences; were it still selected into the digest, the digest would contain many duplicated contents, so in that case its first score need not be computed and the current candidate digest sentence is deleted directly. By computing the redundancy between the current candidate and the existing digest sentences, this embodiment avoids duplicated content in the digest as far as possible, so that a limited number of words conveys as much of the article's own information as possible.
Step 210: if the first score is greater than or equal to a preset second threshold, add the current candidate digest sentence to the digest sentence set, and update the second threshold to the first score.
In this embodiment, the first score of the current candidate digest sentence is compared with the preset second threshold; when the first score is equal to or greater than the second threshold, the current candidate is taken as a digest sentence and added to the digest sentence set. Further, when the first score exceeds the second threshold, the second threshold is updated with the first score, that is, set to the first score, so that the value of the second threshold grows; the first score of each newly added digest sentence is thereby consistently higher than that of the previous one, ensuring that the extracted digest sentences have higher per-unit-length quality.
Step 211: generate the digest of the article from the digest sentences in the digest sentence set.
After digest sentences have been extracted from the remaining candidates by the first score, the digest sentences in the set are used, in the order in which they appear in the text, to generate the digest of the article.
In the digest generation method of this embodiment of the invention, the SVM scores sentence vectors quickly to obtain the second scores, which improves the speed of digest generation; when the first score is computed, the redundancy score between a candidate digest sentence and the digest sentence set serves as part of the calculation basis, so that the sentences in the set are maximally differentiated and useless repeated digest sentences are avoided, and the digest generated thereby covers more ground.
Since a digest is a summary of the article's central idea, its length generally cannot be too great, so a digest length can be preset when the digest is generated. Because the lengths of the digest sentences directly affect the length of the digest, when the first score in step 210 is greater than or equal to the second threshold and the current candidate is to be added to the digest sentence set, the preset digest length is further taken into account.
Specifically, the characters of the digest sentences already in the set are first summed to obtain the total length. Then the sum of the current candidate's sentence length and the total length is computed and compared with the preset digest length. If the sum does not exceed the preset digest length, that is, if the total length stays within the preset length once the current candidate has been added to the digest sentence set, the current candidate can be added to the set. If the sum exceeds the preset digest length, the current candidate digest sentence is not added to the set.
In this embodiment, when the first score of a candidate digest sentence is equal to or greater than the second threshold, the preset digest length is further used to decide whether the candidate is added to the digest sentence set, preventing the digest from becoming too long and insufficiently terse.
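The preset-digest-length check described above can be sketched as follows; the sentences and limit are invented for illustration.

```python
def fits_digest_length(digest_set, candidate, preset_length):
    # Sum the characters of the digest sentences already in the set, add the
    # candidate's length, and admit the candidate only if the total does not
    # exceed the preset digest length.
    total = sum(len(s) for s in digest_set)
    return total + len(candidate) <= preset_length

digest_set = ["cats purr", "dogs bark"]  # 9 + 9 = 18 characters
print(fits_digest_length(digest_set, "birds sing", 30))  # True: 18 + 10 = 28 <= 30
print(fits_digest_length(digest_set, "birds sing", 25))  # False: 28 > 25
```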
Further, after all candidate digest sentences have been extracted according to the second score, they can be ordered by their positions in the article to form the candidate digest sentence set; to avoid reprocessing, once a candidate has been added to the digest sentence set, the current candidate digest sentence can be deleted from the candidate digest sentence set.
To illustrate the above embodiment more clearly, the digest generation method proposed by the present invention is explained with reference to Fig. 4, which is a flow diagram of another digest generation method provided by an embodiment of the present invention.
Step 401: obtain the sentence set A of the article, of size N, each sentence in A denoted si, with the initial value of i being 0.
In this embodiment of the invention, the sentences of the article are first identified from end punctuation such as full stops and semicolons, giving the sentence set A of the article; A contains N sentences in total, each represented as si, and i is an integer initialized to 0.
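Identifying sentences from end punctuation can be sketched with a regular expression; the exact punctuation set below is an assumption.

```python
import re

def sentence_set(article):
    # Split on sentence-ending punctuation (Chinese and ASCII full stops,
    # semicolons, and similar marks) and drop empty fragments.
    parts = re.split(r"[。；;.!?]+", article)
    return [p.strip() for p in parts if p.strip()]

print(sentence_set("First point. Second point; third point."))
# ['First point', 'Second point', 'third point']
```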
Steps 402 to 406 extract the candidate digest sentence set C from the sentence set A. Specifically:
Step 402: i < N. Judge whether i is less than N. If it is, execute step 403; otherwise, execute step 407.
Step 403: obtain the original sentence vector of the sentence and compute the second score of the sentence with the SVM.
In this embodiment, the method of obtaining the original sentence vector of a sentence is described in the above embodiment and is not repeated here.
After the original sentence vector of the sentence is obtained, dimensionality reduction is applied to it, yielding the sentence vector; the sentence vector is then input into the SVM to score the sentence, giving its second score f(x). The description of how the SVM scores a sentence is likewise given in the above embodiment and is not repeated here.
Step 404: f(x) > c. Judge whether the second score f(x) of the sentence exceeds c, where c is a preset lower bound on the score.
If f(x) exceeds c, execute step 405; otherwise, execute step 406.
Step 405: compare f(x) with max; if f(x) is greater than max, add the sentence to the candidate digest sentence set C and update max to f(x).
In this embodiment, the sentences of the article are traversed in order, and candidate digest sentences are chosen from all sentences. Taking the current sentence as an example, its f(x) is compared with max; if it is greater than max, the sentence is added to the candidate digest sentence set C and max is updated to f(x). After step 405 is executed, step 406 is executed. It should be noted that a max value is preset at the start; the f(x) of the first sentence is compared with this max value, and if it is greater, the first sentence is added to the candidate digest sentence set C. Further, max is then updated to the f(x) of the first sentence.
Step 406: i = i + 1, that is, i is incremented by 1.
In this embodiment, after i is incremented, step 402 is executed to judge whether i is less than N, that is, whether the sentences in set A have all been traversed. When i reaches N, step 407 is executed.
Step 407: add the candidate digest sentence with the highest f(x) in the candidate digest sentence set C to the digest sentence set S*, and delete that sentence from C.
It should be noted that finding the highest f(x) value in C amounts to reading the current value stored in the max variable; that is, the candidate digest sentence currently corresponding to max is the sentence with the highest f(x).
Through the above steps 401 to 406, the sentences of set A whose second scores exceed the first threshold are added to the candidate digest sentence set C; the candidate corresponding to the current max, that is, the candidate with the highest second score, is then added to the digest sentence set S* and deleted from the candidate digest sentence set C.
Step 408: len(x) < L. Judge whether the total length len(x) of the digest sentences already in S* is less than the preset overall digest length L. If len(x) < L, execute step 409; otherwise, execute step 414.
Step 409: take the next candidate digest sentence sj, 0 ≤ j < m, from the candidate set C, and compute the redundancy between the candidate digest sentence and the digest sentences already in the digest set.
It should be appreciated that when step 409 is executed, the value of m must be at least 1; that is, the candidate digest sentence set C contains at least one candidate digest sentence.
Step 410: j < m. Judge whether j is less than m. When j < m, that is, when some candidate digest sentences in C have not yet been traversed, execute step 411.
Step 411: if the redundancy between the candidate digest sentence and the digest sentences already in the digest set is below the redundancy threshold, compute the first score of the candidate and compare it with maxScore, where maxScore holds the highest first score.
In this embodiment, the first score is compared with maxScore so that maxScore always holds the highest first score: if a sentence's first score exceeds the current value of maxScore, that first score replaces the current value of maxScore; if a sentence's first score is below the current value of maxScore, maxScore stays unchanged. It should be noted that maxScore is the second threshold of the above embodiment: its value is preset at the very start and then updated with the first scores of the digest sentences selected.
After step 411 is executed, step 412 is executed: j = j + 1, that is, j is incremented by 1. Then step 409 is executed, the redundancy computation continues, and whether j is less than m is judged again.
Step 413: add the candidate digest sentence corresponding to maxScore to S*, and delete that sentence from C.
In this embodiment, after the candidate with the highest first score has been extracted from the candidate digest sentence set C, the candidate corresponding to maxScore is added to S* and deleted from C. Afterwards, step 408 is executed to judge whether the total length of the digest sentences already in the current digest sentence set S* is less than the preset digest length L; if len(x) ≥ L, step 414 is executed, otherwise step 409 continues.
Through steps 409 to 413, sentences with higher second scores are first extracted from all sentences as candidate digest sentences; then, using the redundancy between each candidate in C and the existing digest sentences as the basis, the candidates whose redundancy with the existing digest sentences is low and whose per-unit-length quality (first score) is high are extracted from the candidates and added to S*, yielding the digest sentences from which the digest is built.
Step 414: generate the digest of the article from all digest sentences in S*.
After the traversal of the candidate digest sentences is complete, all digest sentences in S* are spliced, according to their positions in the article, to form the digest of the article.
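Under simplifying assumptions (second scores precomputed, Jaccard word overlap standing in for the second similarity, and the length check applied only at the top of the loop), the Fig. 4 flow can be sketched end to end; all sentences, scores, and thresholds below are invented.

```python
def jaccard(a, b):
    # Hypothetical stand-in for the second similarity.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def generate_digest(sentences, second_scores, c, redundancy_threshold, max_length):
    # Steps 402-406: candidates whose second score beats a rising threshold.
    threshold, candidates = c, []
    for sentence, score in zip(sentences, second_scores):
        if score > threshold:
            candidates.append((sentence, score))
            threshold = score
    if not candidates:
        return ""
    # Step 407: seed S* with the highest-scoring candidate.
    best = max(candidates, key=lambda pair: pair[1])
    digest = [best]
    candidates.remove(best)
    # Steps 408-413: repeatedly add the remaining candidate with the highest
    # first score, skipping candidates that are too redundant.
    while candidates and sum(len(s) for s, _ in digest) < max_length:
        scored = []
        for sentence, f in candidates:
            r = sum(jaccard(sentence, d) for d, _ in digest) / len(digest)
            if r < redundancy_threshold:
                scored.append(((f - r) / len(sentence), sentence, f))
        if not scored:
            break
        _, sentence, f = max(scored)
        digest.append((sentence, f))
        candidates.remove((sentence, f))
    # Step 414: splice the digest sentences in their order of appearance.
    position = {s: i for i, s in enumerate(sentences)}
    return " ".join(s for s, _ in sorted(digest, key=lambda pair: position[pair[0]]))

sentences = ["cats purr", "cats purr loud", "dogs bark", "birds sing"]
digest = generate_digest(sentences, [0.5, 0.6, 0.7, 0.8], 0.4, 0.5, 25)
print(digest)  # 'cats purr dogs bark birds sing'
```

This is a sketch of the control flow only; the embodiment's own second scores come from the SVM and its similarity measure may differ.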
In the digest generation method of this embodiment of the invention, the sentence vector of each sentence is obtained; candidate digest sentences are selected from all sentences according to the sentence vectors; the first scores of the candidates are then obtained, and digest sentences are selected from the candidates according to the first scores; the digest of the article is then generated from the digest sentences. In this embodiment, because the sentence vector preserves sentence information well and the first score represents the per-unit-length sentence quality of a candidate digest sentence, the double selection over all sentences of the article, by sentence vector and then by first score, yields the digest sentences and improves the accuracy and coverage with which the digest summarizes the article's central idea; it thereby solves the problem in related digest extraction techniques that representing sentence information by the highest-TF-IDF words ignores the relationships between words, loses the information of many words, and narrows the coverage of the digest.
To realize the above embodiments, the present invention also proposes a digest generation apparatus. Fig. 5 shows a digest generation apparatus proposed by an embodiment of the present invention.
As shown in Fig. 5, the apparatus includes: a first acquisition module 510, a first selection module 520, a second acquisition module 530, a second selection module 540, and a generation module 550.
The first acquisition module 510 is configured to obtain the sentence vector of each sentence in the article.
The first selection module 520 is configured to select candidate digest sentences from all sentences according to the sentence vectors.
The second acquisition module 530 is configured to obtain the first scores of the candidate digest sentences, where the first score represents the per-unit-length sentence quality of a candidate digest sentence.
The second selection module 540 is configured to select digest sentences from the candidate digest sentences according to the first scores.
The generation module 550 is configured to generate the digest of the article from the digest sentences.
In one embodiment of the invention, the first acquisition module 510 includes:
a recognition unit, configured to identify sentences from the article;
an acquiring unit, configured to obtain the original sentence vector of a sentence; and
a dimensionality-reduction unit, configured to apply dimensionality reduction to the original sentence vector through a restricted Boltzmann machine neural network, obtaining the sentence vector.
In one embodiment of the invention, the acquiring unit is further configured to:
identify the position of the sentence in the article from clue words and compute the position score of the sentence;
compute the first similarity between the sentence and the topic of the article;
obtain the feature space of the sentence represented by a multidimensional dictionary; and
form the original sentence vector of the sentence from the position score, the first similarity, and the feature space.
In one embodiment of the invention, the acquiring unit is further configured to:
convert the sentence into the set of words composing it;
filter stop words out of the word set and extract word stems; and
compute the first similarity from the extracted stems.
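The stop-word filtering, stemming, and first-similarity computation can be sketched as follows. The tiny stop-word list and the crude suffix stripper are assumptions standing in for a full stop list and a real stemming algorithm such as Porter's, and the overlap measure is likewise only illustrative.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "is", "are", "and", "in"}  # assumed list

def crude_stem(word):
    # Rough suffix stripping standing in for a real stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def first_similarity(sentence, topic):
    # Reduce both texts to stemmed content-word sets and measure overlap.
    def stems(text):
        words = re.findall(r"[a-z]+", text.lower())
        return {crude_stem(w) for w in words if w not in STOP_WORDS}
    a, b = stems(sentence), stems(topic)
    return len(a & b) / len(a | b) if a | b else 0.0

print(first_similarity("The cats are running", "running cat"))  # 1.0
```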
In one embodiment of the invention, the first selection module 520 may include:
a scoring unit, configured to score the sentence vector of each sentence one by one, starting from the first sentence of the article, based on a support vector machine (SVM) model, obtaining the second score of each sentence; and
a selection unit, configured to compare the second score of a sentence with a preset first threshold; if the second score exceeds the first threshold, the sentence is taken as a candidate digest sentence and the first threshold is updated to the second score; if the second score does not exceed the first threshold, the sentence is discarded and the first threshold stays unchanged.
In one embodiment of the invention, the second acquisition module 530 may include:
a first computing unit, configured to add the sentence with the highest second score among the candidate digest sentences to the digest sentence set, the digest sentence set containing digest sentences; and to take one candidate at a time from the remaining candidates as the current candidate digest sentence and compute the redundancy score between the current candidate digest sentence and the digest sentence set; and
a second computing unit, configured to compute the first score from the redundancy score, the second score of the current candidate digest sentence, and the sentence length of the current candidate digest sentence.
In one embodiment of the invention, the first computing unit is further configured to:
obtain the second similarity between the current candidate digest sentence and each digest sentence already in the digest sentence set; and
sum all second similarities and divide the sum by the number of digest sentences already in the set, obtaining the redundancy score.
In one embodiment of the invention, the second computing unit is further configured to:
obtain the difference between the second score of the current candidate digest sentence and the redundancy score; and
divide the difference by the sentence length of the current candidate digest sentence, obtaining the first score.
Further, the first computing unit is further configured to:
before the first score is computed from the redundancy score, the second score of the current candidate digest sentence, and its sentence length, determine that the redundancy score of the current candidate is below the redundancy threshold; when the redundancy score is greater than the redundancy threshold, discard the current candidate digest sentence.
In one embodiment of the invention, the second selection module 540 is further configured to:
if the first score is greater than or equal to a preset second threshold, add the current candidate digest sentence to the digest sentence set and update the second threshold to the first score.
In one embodiment of the invention, the second selection module 540 is further configured to:
obtain the total length of the digest sentences already in the current digest sentence set;
compute the sum of the sentence length of the current candidate digest sentence and the total length, and compare the sum with the preset digest length;
if the sum does not exceed the preset digest length, add the current candidate digest sentence to the digest sentence set; and
if the sum exceeds the preset digest length, no longer add the current candidate digest sentence to the digest sentence set, and stop updating the digest sentence set.
In one embodiment of the invention, the first selection module 520 may further include:
a forming unit, configured to form the candidate digest sentence set according to the order of the candidate digest sentences, after the sentences whose second scores exceed the preset first threshold have been chosen as candidate digest sentences.
In one embodiment of the invention, the second selection module 540 is further configured to:
after the current candidate digest sentence has been added to the digest sentence set, delete the current candidate digest sentence from the candidate digest sentence set.
It should be noted that the foregoing explanation of the digest generation method embodiments also applies to the digest generation apparatus of this embodiment and is not repeated here.
In the digest generation apparatus of this embodiment of the invention, the sentence vector of each sentence is obtained; candidate digest sentences are selected from all sentences according to the sentence vectors; the first scores of the candidates are then obtained, and digest sentences are selected from the candidates according to the first scores; the digest of the article is then generated from the digest sentences. In this embodiment, because the sentence vector preserves sentence information well and the first score represents the per-unit-length sentence quality of a candidate digest sentence, the double selection over all sentences of the article, by sentence vector and then by first score, yields the digest sentences and improves the accuracy and coverage with which the digest summarizes the article's central idea; it thereby solves the problem in related digest extraction techniques that representing sentence information by the highest-TF-IDF words ignores the relationships between words, loses the information of many words, and narrows the coverage of the digest.
To realize the above embodiments, the invention also provides a computer device including a processor and a memory, where the processor runs a program corresponding to executable program code by reading the executable program code stored in the memory, so as to realize the digest generation method described in any of the foregoing embodiments.
Fig. 6 shows a block diagram of an exemplary computer device 30 suitable for realizing embodiments of the present application. The computer device 30 shown in Fig. 6 is merely an example and should place no restriction on the functions or scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer device 30 takes the form of a general-purpose computing device. The components of the computer device 30 may include, but are not limited to: one or more processors or processing units 31, a system memory 32, and a bus 33 connecting the different system components (including the system memory 32 and the processing unit 31).
The bus 33 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnection (PCI) bus.
The computer device 30 typically comprises a variety of computer-system-readable media. These media can be any usable media accessible by the computer device 30, including volatile and non-volatile media, and removable and non-removable media.
The system memory 32 can include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 40 and/or a cache memory 41. The computer device 30 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. Merely by way of example, the storage system 42 can be used to read and write a non-removable, non-volatile magnetic medium (not shown in Fig. 6, commonly called a "hard disk drive"). Although not shown in Fig. 6, a magnetic disk drive for reading and writing a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disk drive for reading and writing a removable non-volatile optical disk (such as a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media), can be provided. In these cases, each drive can be connected to the bus 33 through one or more data media interfaces. The memory 32 can include at least one program product having a set of (for example, at least one) program modules, these program modules being configured to perform the functions of the embodiments of the present application.
A program/utility 50 having a set of (at least one) program modules 51 can be stored, for example, in the memory 32; such program modules 51 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, and each of these examples, or some combination of them, may include a realization of a network environment. The program modules 51 usually perform the functions and/or methods in the embodiments described herein.
The computer device 30 can also communicate with one or more external devices 60 (such as a keyboard, a pointing device, a display 70, and the like), with one or more devices that enable a user to interact with the computer device 30, and/or with any device (such as a network card, a modem, and the like) that enables the computer device 30 to communicate with one or more other computing devices. Such communication can proceed through an input/output (I/O) interface 34. Moreover, the computer device 30 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 35. As shown in the figure, the network adapter 35 communicates with the other modules of the computer device 30 through the bus 33. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in combination with the computer device 30, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 31 runs the programs stored in the system memory 32, thereby performing various functional applications and data processing, for example realizing the digest generation methods shown in Figs. 1 to 4.
To implement the above embodiments, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the text summarization method described in any of the foregoing embodiments is implemented.
To implement the above embodiments, the present invention further provides a computer program product; when the instructions in the computer program product are executed by a processor, the text summarization method described in any of the foregoing embodiments is implemented.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine the different embodiments or examples described in this specification, as well as the features of those different embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless otherwise specifically defined.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that comprises one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions considered to implement logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (an electronic device) having one or more wires, a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any of the following technologies known in the art, or a combination thereof, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods of the above embodiments may be completed by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist physically on its own, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Claims (10)
- 1. A text summarization method, characterized by comprising: obtaining a sentence vector of each sentence in an article; selecting candidate digest sentences from all the sentences according to the sentence vectors; obtaining a first score of each candidate digest sentence, wherein the first score represents the sentence quality per unit length of the candidate digest sentence; selecting digest sentences from the candidate digest sentences according to the first scores; and generating a digest of the article using the digest sentences.
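The flow of claim 1 can be sketched as a small pipeline. The embedding, candidate test, and scoring functions below are placeholders (assumptions), since the claim leaves their implementations open:

```python
def summarize(sentences, embed, is_candidate, first_score, top_k=2):
    """Sketch of claim 1: sentence vectors -> candidate digest sentences
    -> first score (quality per unit length) -> digest."""
    vectors = {s: embed(s) for s in sentences}             # sentence vector of each sentence
    candidates = [s for s in sentences if is_candidate(vectors[s])]
    ranked = sorted(candidates, key=first_score, reverse=True)
    return " ".join(ranked[:top_k])                        # digest built from digest sentences
```

The double selection of the abstract (vector-based candidate filter, then score-based final pick) corresponds to the two `candidates`/`ranked` stages.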
- 2. The method according to claim 1, characterized in that obtaining the sentence vector of each sentence in the article comprises: identifying the sentences in the article; obtaining an original sentence vector of each sentence; and performing dimensionality reduction on the original sentence vector through a restricted Boltzmann machine neural network to obtain the sentence vector.
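Claim 2 reduces the original sentence vector with a restricted Boltzmann machine. A minimal sketch of the deterministic up-pass of an already-trained RBM follows; the weights `W` and hidden biases are assumed to be given, and RBM training itself is out of scope here:

```python
import math

def rbm_reduce(v, W, b_hidden):
    """One up-pass of a trained restricted Boltzmann machine: the
    hidden-unit activations serve as the reduced sentence vector."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return [sigmoid(sum(w * x for w, x in zip(row, v)) + b)
            for row, b in zip(W, b_hidden)]
```

Because the hidden layer has fewer units than the visible layer, the activation vector is the lower-dimensional sentence vector.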
- 3. The method according to claim 2, characterized in that obtaining the original sentence vector of a sentence comprises: identifying the position of the sentence in the article according to cue words and calculating a location score of the sentence; calculating a first similarity between the sentence and the topic of the article; obtaining a feature space of the sentence represented by a multidimensional dictionary; and forming the original sentence vector of the sentence from the location score, the first similarity, and the feature space.
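The original sentence vector of claim 3 is composed of three parts; a literal sketch (how the dictionary feature space is extracted is not specified by the claim, so it is passed in as-is):

```python
def original_sentence_vector(location_score, first_similarity, feature_space):
    """Claim 3: position score + sentence-topic similarity + the
    multidimensional-dictionary feature space, concatenated."""
    return [location_score, first_similarity] + list(feature_space)
```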
- 4. The method according to claim 3, characterized in that calculating the first similarity between the sentence and the topic of the article comprises: converting the sentence into a set of its constituent words; filtering stop words out of the word set and extracting stems; and calculating the first similarity according to the extracted stems.
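A sketch of claim 4 follows. The stop-word list, the crude suffix-stripping stemmer, and the Jaccard overlap used as the similarity measure are all illustrative assumptions; the claim only fixes the word-set / stop-word / stem pipeline:

```python
STOP = {"the", "a", "of", "is"}           # illustrative stop-word list

def stem(word):
    # crude suffix stripping as a stand-in for a real stemmer (assumption)
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def first_similarity(sentence, topic):
    """Claim 4: word set -> drop stop words -> stems -> similarity."""
    def stems(text):
        return {stem(w) for w in text.lower().split() if w not in STOP}
    s, t = stems(sentence), stems(topic)
    return len(s & t) / len(s | t) if s | t else 0.0
```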
- 5. The method according to any one of claims 1-4, characterized in that selecting candidate digest sentences from all the sentences according to the sentence vectors comprises: starting from the first sentence of the article, scoring the sentence vector of each sentence one by one based on a support vector machine (SVM) model to obtain a second score of the sentence; comparing the second score of the sentence with a preset first threshold; if the second score exceeds the first threshold, taking the sentence as a candidate digest sentence and updating the first threshold to the second score; and if the second score of the sentence does not exceed the first threshold, discarding the sentence and keeping the first threshold unchanged.
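The adaptive selection of claim 5 can be sketched as follows; `second_score` is a placeholder for the SVM model's score, which the claim obtains per sentence vector:

```python
def select_candidates(sentences, second_score, first_threshold):
    """Claim 5: walk the article in order; a sentence whose score
    exceeds the current threshold becomes a candidate and raises the
    threshold to its score; otherwise it is discarded unchanged."""
    candidates, threshold = [], first_threshold
    for s in sentences:
        score = second_score(s)          # stand-in for the SVM score
        if score > threshold:
            candidates.append((s, score))
            threshold = score            # threshold updated to this second score
    return candidates
```

Because the threshold ratchets upward, each new candidate must outscore every candidate accepted before it.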
- 6. The method according to claim 5, characterized in that obtaining the first score of a candidate digest sentence comprises: adding the candidate digest sentence with the highest second score to a digest sentence set, the digest sentence set containing the digest sentences; successively choosing one of the remaining candidate digest sentences as the current candidate digest sentence and calculating a redundancy score between the current candidate digest sentence and the digest sentence set; and calculating the first score according to the redundancy score, the second score of the current candidate digest sentence, and the sentence length of the current candidate digest sentence.
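Claim 6 seeds the digest sentence set and scores the remaining candidates against it; a combined sketch, where the similarity function is a placeholder and sentence length is taken as character count (an assumption, since the claim does not fix the length unit):

```python
def score_candidates(candidates, similarity):
    """Claim 6: seed the digest sentence set with the highest
    second-scoring candidate, then score the rest against that set."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    digest_set = [ranked[0][0]]                       # highest second score
    first_scores = {}
    for sent, second in ranked[1:]:
        redundancy = sum(similarity(sent, d) for d in digest_set) / len(digest_set)
        first_scores[sent] = (second - redundancy) / len(sent)   # per unit length
    return digest_set, first_scores
```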
- 7. The method according to claim 6, characterized in that calculating the redundancy score between the current candidate digest sentence and the digest sentence set comprises: obtaining a second similarity between the current candidate digest sentence and each digest sentence already in the digest sentence set; and summing all the second similarities and taking the ratio of the sum to the number of digest sentences already in the digest sentence set, to obtain the redundancy score.
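Claim 7 defines the redundancy score exactly as the mean similarity to the already-selected digest sentences; the similarity function itself is left open by the claim:

```python
def redundancy_score(candidate, digest_set, similarity):
    """Claim 7: sum of second similarities divided by the number of
    digest sentences already in the digest sentence set."""
    if not digest_set:
        return 0.0
    return sum(similarity(candidate, d) for d in digest_set) / len(digest_set)
```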
- 8. The method according to claim 6, characterized in that calculating the first score according to the redundancy score, the second score of the current candidate digest sentence, and the sentence length of the current candidate digest sentence comprises: obtaining the difference between the second score of the current candidate digest sentence and the redundancy score; and taking the ratio of the difference to the sentence length of the current candidate digest sentence, to obtain the first score.
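Claim 8 fixes the first score as a formula; a direct sketch (whether sentence length is counted in words or characters is left open by the claim):

```python
def first_score(second_score, redundancy, sentence_length):
    """Claim 8: quality per unit length, i.e. (second score minus
    redundancy score) divided by the sentence length."""
    return (second_score - redundancy) / sentence_length
```

Dividing by length favors sentences that pack informative, non-redundant content into fewer words, which is what "sentence quality per unit length" in claim 1 refers to.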
- 9. The method according to claim 8, characterized in that before calculating the first score according to the redundancy score, the second score of the current candidate digest sentence, and the sentence length of the current candidate digest sentence, the method further comprises: determining that the redundancy score of the current candidate digest sentence is below a redundancy threshold; and discarding the current candidate digest sentence when its redundancy score is greater than the redundancy threshold.
- 10. A text summarization device, characterized by comprising: a first acquisition module for obtaining the sentence vector of each sentence in an article; a first selection module for selecting candidate digest sentences from all the sentences according to the sentence vectors; a second acquisition module for obtaining a first score of each candidate digest sentence, wherein the first score represents the sentence quality per unit length of the candidate digest sentence; a second selection module for selecting digest sentences from the candidate digest sentences according to the first scores; and a generation module for generating a digest of the article using the digest sentences.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711463868.5A CN108182247A (en) | 2017-12-28 | 2017-12-28 | Text summarization method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711463868.5A CN108182247A (en) | 2017-12-28 | 2017-12-28 | Text summarization method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108182247A true CN108182247A (en) | 2018-06-19 |
Family
ID=62548639
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711463868.5A Pending CN108182247A (en) | 2017-12-28 | 2017-12-28 | Text summarization method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108182247A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109388804A (en) * | 2018-10-22 | 2019-02-26 | 平安科技(深圳)有限公司 | Report core views extracting method and device are ground using the security of deep learning model |
CN110287489A (en) * | 2019-06-24 | 2019-09-27 | 北京大米科技有限公司 | Document creation method, device, storage medium and electronic equipment |
CN110781291A (en) * | 2019-10-25 | 2020-02-11 | 北京市计算中心 | Text abstract extraction method, device, server and readable storage medium |
CN111666402A (en) * | 2020-04-30 | 2020-09-15 | 平安科技(深圳)有限公司 | Text abstract generation method and device, computer equipment and readable storage medium |
CN112732901A (en) * | 2021-01-15 | 2021-04-30 | 联想(北京)有限公司 | Abstract generation method and device, computer readable storage medium and electronic equipment |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101398814A (en) * | 2007-09-26 | 2009-04-01 | 北京大学 | Method and system for simultaneously abstracting document summarization and key words |
US20110087671A1 (en) * | 2009-10-14 | 2011-04-14 | National Chiao Tung University | Document Processing System and Method Thereof |
CN102841940A (en) * | 2012-08-17 | 2012-12-26 | 浙江大学 | Document summary extracting method based on data reconstruction |
CN104503958A (en) * | 2014-11-19 | 2015-04-08 | 百度在线网络技术(北京)有限公司 | Method and device for generating document summarization |
CN104834735A (en) * | 2015-05-18 | 2015-08-12 | 大连理工大学 | Automatic document summarization extraction method based on term vectors |
CN105243152A (en) * | 2015-10-26 | 2016-01-13 | 同济大学 | Graph model-based automatic abstracting method |
US20160078038A1 (en) * | 2014-09-11 | 2016-03-17 | Sameep Navin Solanki | Extraction of snippet descriptions using classification taxonomies |
CN106227722A (en) * | 2016-09-12 | 2016-12-14 | 中山大学 | A kind of extraction method based on listed company's bulletin summary |
CN106599148A (en) * | 2016-12-02 | 2017-04-26 | 东软集团股份有限公司 | Method and device for generating abstract |
CN107273474A (en) * | 2017-06-08 | 2017-10-20 | 成都数联铭品科技有限公司 | Autoabstract abstracting method and system based on latent semantic analysis |
- 2017-12-28: CN application CN201711463868.5A, patent/CN108182247A/en, status: Pending
Non-Patent Citations (1)
Title |
---|
Wang Jiasong: "Research on Multi-Document Automatic Summarization Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108182247A (en) | Text summarization method and apparatus | |
Downey et al. | Locating complex named entities in web text. | |
CN103914548B (en) | Information search method and device | |
CN108280061A (en) | Text handling method based on ambiguity entity word and device | |
JP5447862B2 (en) | Word classification system, method and program | |
JP4595692B2 (en) | Time-series document aggregation method and apparatus, program, and storage medium storing program | |
KR20170055970A (en) | Computer-implemented identification of related items | |
US11645447B2 (en) | Encoding textual information for text analysis | |
CN111694927B (en) | Automatic document review method based on improved word shift distance algorithm | |
CN108388660A (en) | A kind of improved electric business product pain spot analysis method | |
CN108460098A (en) | Information recommendation method, device and computer equipment | |
JP7281905B2 (en) | Document evaluation device, document evaluation method and program | |
JP2011085986A (en) | Text summarization method, its device, and program | |
CN108038108A (en) | Participle model training method and device and storage medium | |
CN108304377A (en) | A kind of extracting method and relevant apparatus of long-tail word | |
KR20180131146A (en) | Apparatus and Method for Identifying Core Issues of Each Evaluation Criteria from User Reviews | |
CN110020163A (en) | Searching method, device, computer equipment and storage medium based on human-computer interaction | |
JP6830971B2 (en) | Systems and methods for generating data for sentence generation | |
Paripremkul et al. | Segmenting words in Thai language using Minimum text units and conditional random Field | |
CN107704549A (en) | Voice search method, device and computer equipment | |
CN107797986A (en) | A kind of mixing language material segmenting method based on LSTM CNN | |
CN110929518A (en) | Text sequence labeling algorithm using overlapping splitting rule | |
CN111339778B (en) | Text processing method, device, storage medium and processor | |
CN103914447B (en) | Information processing device and information processing method | |
Ou et al. | Unsupervised citation sentence identification based on similarity measurement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180619 |