CN108182247A - Text summarization method and apparatus - Google Patents

Text summarization method and apparatus

Info

Publication number
CN108182247A
CN108182247A
Authority
CN
China
Prior art keywords
sentence
digest
scoring
candidate
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711463868.5A
Other languages
Chinese (zh)
Inventor
杜森 (Du Sen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201711463868.5A
Publication of CN108182247A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Abstract

The present invention proposes a text summarization method and apparatus. The method includes: obtaining a sentence vector for each sentence in an article; selecting candidate digest sentences from all sentences according to the sentence vectors; obtaining a first score for each candidate digest sentence; selecting digest sentences from the candidate digest sentences according to the first score; and generating the digest of the article from the selected digest sentences. Because sentence vectors retain sentence information well, and the first score represents the sentence quality per unit length of a candidate digest sentence, this double selection over all sentences in the article (first by sentence vector, then by first score) improves the accuracy with which the digest summarizes the central idea of the article, as well as its coverage.

Description

Text summarization method and apparatus
Technical field
The present invention relates to the technical field of information processing, and more particularly to a text summarization method and apparatus.
Background technology
A digest is a concise, coherent short text that accurately and comprehensively reflects the central content of a document. At present, related digest extraction techniques typically calculate word scores based on term frequency-inverse document frequency (TF-IDF), and then score each sentence through the words it contains.
Because a TF-IDF-based digest extraction technique calculates TF-IDF for the words in a sentence and then takes the top-N word scores as the score of the digest sentence, it cannot take the whole word-frequency vector into account, especially when the sentence is long. If a sentence contains many low-frequency words and all of those words are discarded, the selection of the sentence is adversely affected. A digest extracted in this way therefore tends not to cover the sentence information, which reduces the accuracy and coverage with which the digest summarizes the central content of the document.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present invention is to propose a text summarization method, which selects candidate digest sentences according to sentence vectors, then further selects digest sentences from the candidate digest sentences according to a first score of each candidate, and generates the digest from the selected digest sentences. Because sentence vectors retain sentence information well and the first score represents the sentence quality per unit length of a candidate digest sentence, the digest generated in this way better reflects the central idea of the article and improves the coverage of the digest.
A second object of the present invention is to propose a text summarization apparatus.
A third object of the present invention is to propose a computer device.
A fourth object of the present invention is to propose a non-transitory computer-readable storage medium.
A fifth object of the present invention is to propose a computer program product.
To achieve the above objects, an embodiment of a first aspect of the present invention proposes a text summarization method, including:
obtaining a sentence vector of each sentence in an article;
selecting candidate digest sentences from all the sentences according to the sentence vectors;
obtaining a first score of each candidate digest sentence, where the first score represents the sentence quality per unit length of the candidate digest sentence;
selecting digest sentences from the candidate digest sentences according to the first score;
generating the digest of the article from the digest sentences.
In a possible implementation of the embodiment of the first aspect of the present invention, obtaining the sentence vector of each sentence in the article includes:
identifying the sentences in the article;
obtaining an original sentence vector of each sentence;
performing dimension reduction on the original sentence vector through a restricted Boltzmann machine (RBM) neural network to obtain the sentence vector.
In another possible implementation of the embodiment of the first aspect of the present invention, obtaining the original sentence vector of the sentence includes:
identifying the position of the sentence in the article according to cue words, and calculating the location score of the sentence;
calculating a first similarity between the sentence and the title of the article;
obtaining the feature space of the sentence represented by a multidimensional dictionary;
composing the original sentence vector of the sentence from the location score, the first similarity, and the feature space.
In another possible implementation of the embodiment of the first aspect of the present invention, calculating the first similarity between the sentence and the title of the article includes:
converting the sentence into a set of words;
filtering stop words out of the word set, and extracting stems;
calculating the first similarity according to the extracted stems.
In another possible implementation of the embodiment of the first aspect of the present invention, selecting the candidate digest sentences from all the sentences according to the sentence vectors includes:
starting from the first sentence of the article, scoring the sentence vector of each sentence one by one with a support vector machine (SVM) model to obtain a second score of the sentence;
comparing the second score of the sentence with a preset first threshold; if the second score exceeds the first threshold, taking the sentence as a candidate digest sentence and updating the first threshold to the second score;
if the second score of the sentence does not exceed the first threshold, discarding the sentence and keeping the first threshold unchanged.
In another possible implementation of the embodiment of the first aspect of the present invention, obtaining the first score of the candidate digest sentence includes:
adding the candidate digest sentence with the highest second score to a digest sentence set, the digest sentence set containing the digest sentences;
choosing one candidate digest sentence at a time from the remaining candidate digest sentences as the current candidate digest sentence, and calculating a redundancy score between the current candidate digest sentence and the digest sentence set;
calculating the first score according to the redundancy score, the second score of the current candidate digest sentence, and the sentence length of the current candidate digest sentence.
In another possible implementation of the embodiment of the first aspect of the present invention, calculating the redundancy score between the current candidate digest sentence and the digest sentence set includes:
obtaining a second similarity between the current candidate digest sentence and each existing digest sentence in the digest sentence set;
summing all the second similarities, and dividing the sum by the number of existing digest sentences in the digest sentence set to obtain the redundancy score.
In another possible implementation of the embodiment of the first aspect of the present invention, calculating the first score according to the redundancy score, the second score of the current candidate digest sentence, and the sentence length of the current candidate digest sentence includes:
obtaining the difference between the second score of the current candidate digest sentence and the redundancy score;
dividing the difference by the sentence length of the current candidate digest sentence to obtain the first score.
In another possible implementation of the embodiment of the first aspect of the present invention, before calculating the first score according to the redundancy score, the second score of the current candidate digest sentence, and the sentence length of the current candidate digest sentence, the method further includes:
determining that the redundancy score of the current candidate digest sentence is lower than a redundancy threshold;
when the redundancy score is greater than the redundancy threshold, discarding the current candidate digest sentence.
In another possible implementation of the embodiment of the first aspect of the present invention, selecting the digest sentences from the candidate digest sentences according to the first score includes:
if the first score is greater than or equal to a preset second threshold, adding the current candidate digest sentence to the digest sentence set, and updating the second threshold to the first score.
In another possible implementation of the embodiment of the first aspect of the present invention, adding the current candidate digest sentence to the digest sentence set includes:
obtaining the total sentence length of the digest sentences already in the digest sentence set;
calculating the sum of the sentence length of the current candidate digest sentence and the total sentence length, and comparing the sum with a preset digest length;
if the sum does not exceed the digest length, adding the current candidate digest sentence to the digest sentence set;
if the sum exceeds the digest length, no longer adding the current candidate digest sentence to the digest sentence set, and stopping the update of the digest sentence set.
In another possible implementation of the embodiment of the first aspect of the present invention, after selecting the sentences whose second scores exceed the preset first threshold as the candidate digest sentences, the method further includes:
forming a candidate digest sentence set according to the order of the candidate digest sentences;
and after adding the current candidate digest sentence to the digest sentence set, the method further includes:
deleting the current candidate digest sentence from the candidate digest sentence set.
In the text summarization method of this embodiment of the present invention, the sentence vector of each sentence is obtained; candidate digest sentences are selected from all the sentences according to the sentence vectors; a first score of each candidate digest sentence is then obtained; digest sentences are selected from the candidate digest sentences according to the first score; and the digest of the article is generated from the digest sentences. In this embodiment, because sentence vectors retain sentence information well and the first score represents the sentence quality per unit length of a candidate digest sentence, performing this double selection over all sentences in the article (by sentence vector and then by first score) improves the accuracy and coverage with which the digest summarizes the central idea of the article. This solves the problem in related digest extraction techniques that representing sentence information by the highest-TF-IDF words ignores the relationships between words and loses the information of many words, which narrows the coverage of the digest.
To achieve the above objects, an embodiment of a second aspect of the present invention proposes a text summarization apparatus, including:
a first acquisition module, configured to obtain a sentence vector of each sentence in an article;
a first selection module, configured to select candidate digest sentences from all the sentences according to the sentence vectors;
a second acquisition module, configured to obtain a first score of each candidate digest sentence, where the first score represents the sentence quality per unit length of the candidate digest sentence;
a second selection module, configured to select digest sentences from the candidate digest sentences according to the first score;
a generation module, configured to generate the digest of the article from the digest sentences.
In the text summarization apparatus of this embodiment of the present invention, the sentence vector of each sentence is obtained; candidate digest sentences are selected from all the sentences according to the sentence vectors; a first score of each candidate digest sentence is then obtained; digest sentences are selected from the candidate digest sentences according to the first score; and the digest of the article is generated from the digest sentences. In this embodiment, because sentence vectors retain sentence information well and the first score represents the sentence quality per unit length of a candidate digest sentence, performing this double selection over all sentences in the article improves the accuracy and coverage with which the digest summarizes the central idea of the article, and solves the problem in related digest extraction techniques that representing sentence information by the highest-TF-IDF words ignores the relationships between words and loses the information of many words, narrowing the coverage of the digest.
To achieve the above objects, an embodiment of a third aspect of the present invention proposes a computer device, including a processor and a memory;
wherein the processor runs a program corresponding to executable program code stored in the memory, by reading the executable program code, to implement the text summarization method described in the embodiment of the first aspect of the present invention.
To achieve the above objects, an embodiment of a fourth aspect of the present invention proposes a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the text summarization method described in the embodiment of the first aspect of the present invention is implemented.
To achieve the above objects, an embodiment of a fifth aspect of the present invention proposes a computer program product; when instructions in the computer program product are executed by a processor, the text summarization method described in the embodiment of the first aspect of the present invention is implemented.
Additional aspects and advantages of the present invention will be set forth in part in the following description; some will become apparent from the following description, or be learned through practice of the present invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flow diagram of a text summarization method provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of another text summarization method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the process of obtaining the original sentence vector provided by an embodiment of the present invention;
Fig. 4 is a flow diagram of yet another text summarization method provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of a text summarization apparatus proposed by an embodiment of the present invention;
Fig. 6 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements, or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and are not to be construed as limiting the present invention.
The text summarization method and apparatus of the embodiments of the present invention are described below with reference to the accompanying drawings.
A digest is a concise, coherent short text that accurately and comprehensively reflects the central content of a document. At present, related digest extraction techniques typically score sentences based on TF-IDF.
Because a TF-IDF-based digest extraction technique calculates TF-IDF for the words in a sentence and then takes the top-N word scores as the score of the digest sentence, it cannot take the whole word-frequency vector into account, especially when the sentence is long. If a sentence contains many low-frequency words and all of those words are discarded, the selection of the sentence is adversely affected. A digest extracted in this way therefore tends not to cover the sentence information, which reduces the accuracy and coverage with which the digest summarizes the central content of the document.
To address this problem, an embodiment of the present invention proposes a text summarization method that selects candidate digest sentences according to sentence vectors, further selects digest sentences from the candidate digest sentences according to a first score of each candidate, and then generates the digest from the selected digest sentences. Because sentence vectors retain sentence information well and the first score represents the sentence quality per unit length of a candidate digest sentence, the digest generated in this way better reflects the central idea of the article and improves the coverage of the digest.
Fig. 1 is a flow diagram of a text summarization method provided by an embodiment of the present invention.
As shown in Fig. 1, the digest generation method includes:
Step 101, the sentence vector of each sentence in the article is obtained.
In this embodiment, for an article from which a digest is to be extracted, each sentence in the article can be obtained according to punctuation marks. Specifically, the article is first segmented at stop words and punctuation marks to obtain the sentence set of the article.
After the sentences are obtained, the sentence vector of each sentence is calculated: the words in each sentence are encoded against the words in a corpus, and the equivalent words in the original sentence are replaced with the encoded results, yielding an encoding vector that a machine can recognize. This encoding vector is the sentence vector.
Step 102, candidate digest sentences are selected from all the sentences according to the sentence vectors.
In this embodiment, the encoded sentence vector of each sentence is input to a support vector machine (SVM) model, which scores the sentence vector; candidate digest sentences are then selected from all the sentences based on these sentence-vector scores. Specifically, the sentences whose SVM sentence-vector scores exceed a preset threshold are chosen as the candidate digest sentences.
Because a sentence vector retains more of the sentence's information, the candidate digest sentences selected by sentence vector reflect the central idea of the article better than words extracted by TF-IDF.
Step 103, the first score of each candidate digest sentence is obtained, where the first score represents the sentence quality per unit length of the candidate digest sentence.
After the candidate digest sentences are extracted, the first score of each candidate digest sentence is obtained. Because the number of words in a digest is typically limited, the length of a sentence affects the digest; in this embodiment, the sentence quality per unit length is therefore used as the first score. The sentence quality per unit length is obtained by dividing the overall score of the sentence by the sentence length. After the first scores are obtained, digest sentences can be selected from the candidate digest sentences according to the first score.
Step 104, digest sentences are selected from the candidate digest sentences according to the first score.
In this embodiment, as one possible implementation, the candidate digest sentences can be sorted in descending order of first score, and a predetermined number of the top-ranked candidate digest sentences taken as the digest sentences.
As another possible implementation, the candidate digest sentences whose first scores exceed a predetermined threshold can be taken as the digest sentences.
Because the first score represents the sentence quality per unit length, a higher first score indicates higher quality per unit length of the candidate digest sentence; the digest sentences selected according to the first score can therefore reflect the central content of the article comprehensively.
Step 105, the digest of the article is generated from the digest sentences.
To make the logical order of the digest correspond to that of the article, after the digest sentences are extracted, they can be sorted by their order of appearance in the article to obtain the digest of the article.
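For illustration only (not part of the claimed method), the following Python sketch outlines the flow of Fig. 1 under stated assumptions: the helpers sentence_vector and svm_score stand in for the components detailed in the later embodiments, and the redundancy handling of the second embodiment is omitted.

    def generate_digest(sentences, first_threshold, second_threshold):
        # Step 101: obtain the sentence vector of each sentence.
        vectors = [sentence_vector(s) for s in sentences]

        # Step 102: select candidate digest sentences whose SVM score
        # (the "second score") exceeds the first threshold.
        candidates = [(s, svm_score(v)) for s, v in zip(sentences, vectors)]
        candidates = [(s, f) for s, f in candidates if f > first_threshold]

        # Steps 103-104: the "first score" is sentence quality per unit
        # length; keep candidates whose first score passes the threshold.
        digest = [s for s, f in candidates
                  if f / max(len(s), 1) >= second_threshold]

        # Step 105: restore the original article order.
        digest.sort(key=sentences.index)
        return "".join(digest)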
To increase the speed of digest extraction, a support vector machine (SVM) is used to quickly calculate the second score of each sentence vector, and candidate digest sentences are chosen according to the second score. The text summarization method proposed by the present invention is described below through another embodiment. Fig. 2 is a flow diagram of another text summarization method provided by an embodiment of the present invention.
As shown in Fig. 2, the digest generation method includes:
Step 201, the original sentence vector of each sentence is obtained.
In this embodiment, the original sentence vector can be composed of the location score of the sentence, the first similarity between the sentence and the title of the article, and the multidimensional feature space of the sentence.
Fig. 3 is a schematic diagram of the process of obtaining the original sentence vector provided by an embodiment of the present invention.
As shown in Fig. 3, the process of obtaining the original sentence vector includes:
Step 301, sentences are identified from the article.
In this embodiment, all sentences in the article can be identified according to punctuation marks such as full stops, question marks, exclamation marks, and semicolons. It should be noted that, for articles with few or no punctuation marks, sentences can be identified through semantics combined with sentence structure, dividing the text into individual sentences; sentence structures include subject-predicate-object as well as attributive, adverbial, and complement structures. Starting from the beginning of the article, a short span of text is built up word by word; when the span contains a specific sentence structure and is semantically complete, it can be recognized as a sentence. The number of characters in such a span can be capped at a preset number, for example at most 30 characters.
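For illustration, a minimal Python sketch of this segmentation; the regular expression and the fallback cap of 30 characters for unpunctuated spans are assumptions, since the semantic segmentation itself is not specified as code.

    import re

    # Sentence-final punctuation: full stops, question marks, exclamation
    # marks and semicolons (Chinese and ASCII forms).
    SENTENCE_END = re.compile(r'[。？！；.?!;]+')

    def split_sentences(article, max_len=30):
        spans = [s.strip() for s in SENTENCE_END.split(article) if s.strip()]
        sentences = []
        for span in spans:
            # Crude stand-in for the semantic segmentation described above:
            # cap unpunctuated spans at max_len characters.
            sentences.extend(span[i:i + max_len]
                             for i in range(0, len(span), max_len))
        return sentences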
Step 302, the position of the sentence in the article is identified according to cue words, and the location score of the sentence is calculated.
Because the beginning and end of a paragraph typically contain summarizing sentences, the position of a sentence can be identified through cue words. Then, according to the position of the sentence in the article, such as the beginning, middle, or end of a paragraph, the location score of the sentence is determined, for example from the cosine similarity calculated between each sentence and the title of the article. Alternatively, professionals can empirically set different location scores for different positions.
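A small illustrative sketch of the location score; the cue words and the per-position values are assumptions, set empirically as the paragraph allows.

    # Illustrative cue words marking summarising sentences (assumption).
    SUMMARY_CUES = ('综上', '总之', 'in summary', 'in conclusion')

    # Empirically set scores per position (illustrative values).
    POSITION_SCORES = {'paragraph_start': 1.0,
                       'paragraph_middle': 0.5,
                       'paragraph_end': 0.8}

    def position_score(sentence, index, paragraph_len):
        # A cue word marks a summarising sentence regardless of position.
        if any(cue in sentence for cue in SUMMARY_CUES):
            return 1.0
        if index == 0:
            return POSITION_SCORES['paragraph_start']
        if index == paragraph_len - 1:
            return POSITION_SCORES['paragraph_end']
        return POSITION_SCORES['paragraph_middle']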
Steps 303 to 305 are the process of calculating the first similarity between the sentence and the title of the article.
Step 303, the sentence is converted into a set of words.
For each sentence, word segmentation can be performed to convert the sentence into a set of words.
Step 304, stop words are filtered out of the word set, and stems are extracted.
Because a sentence may contain stop words such as "我" ("I"), and in order to reduce the computation of the first similarity, in this embodiment the stop words in the word set can be filtered out according to a stop-word list, and the stem extraction algorithm in Lucene can be used to extract stems. Lucene is an open-source full-text search engine toolkit: it can segment the original sentences of the article, remove punctuation marks and stop words to obtain the sentence trunk, and form stems from the words contained in the sentence trunk.
Step 305, the first similarity between the sentence and the title is calculated according to the extracted stems.
Because the title is a refined summary of the entire article, the first similarity between the sentence and the title of the article can be calculated. Specifically, the extracted stems can be compared with the words in the title by word-level similarity, yielding a similarity for each stem. After the similarities of all stems are obtained, they can be averaged with weights to obtain the first similarity between the sentence and the title, which serves as one element of the original sentence vector.
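The three steps can be sketched as follows; NLTK's stop-word list and Porter stemmer are assumed stand-ins for the Chinese stop-word table and the Lucene stemmer mentioned above, and the per-stem similarity is reduced to exact stem overlap averaged with uniform weights, since the weighting scheme is not specified.

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STOP = set(stopwords.words('english'))  # stand-in for the Chinese stop list
    STEMMER = PorterStemmer()               # stand-in for the Lucene stemmer

    def first_similarity(sentence_words, title_words):
        # Steps 303-304: filter stop words and extract stems.
        stems = [STEMMER.stem(w) for w in sentence_words if w not in STOP]
        title_stems = {STEMMER.stem(w) for w in title_words if w not in STOP}
        if not stems:
            return 0.0
        # Step 305: per-stem similarity against the title, here reduced to
        # exact overlap, averaged with uniform weights.
        return sum(1.0 for s in stems if s in title_stems) / len(stems)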
Steps 306 to 307 are the process of obtaining the feature space of the sentence represented by the multidimensional dictionary.
Step 306, the weight of each stem is calculated according to TF-IDF.
For the extracted stems, the weight of each stem in the sentence can be calculated based on TF-IDF.
Step 307, the feature space of the sentence represented by the multidimensional dictionary is obtained.
In this embodiment, the multidimensional dictionary can be assembled from the stems of the word sets of all sentences, with one dimension per stem. If the number of stems is set to 5000, the feature space of a sentence is represented by a 5000-dimensional dictionary; once the feature space of the sentence has been marked through the multidimensional dictionary, it can be recognized by a machine.
Step 308, the original sentence vector of the sentence is composed from the location score, the first similarity, and the feature space.
In this embodiment, the location score of the sentence, the first similarity between the sentence and the title, and the feature space of the sentence each serve as elements for constructing the original sentence vector. Once the location score, the first similarity, and the feature space have been obtained, the original sentence vector of the sentence can be composed from them. For the processes of obtaining the location score, the first similarity, and the feature space, refer to the relevant descriptions in the embodiments above, which are not repeated here.
In this embodiment, because the original sentence vector is composed of the location score, the first similarity, the feature space, and so on, it retains more sentence information than representing the sentence by words chosen with TF-IDF.
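A sketch of composing the original sentence vector; scikit-learn's TfidfVectorizer is an assumed stand-in for the TF-IDF weighting over the stem dictionary (5000 dimensions as in the example above), and position_score and first_similarity are the illustrative helpers sketched earlier.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def original_sentence_vectors(sentences, position_scores, title_sims,
                                  n_dims=5000):
        # Steps 306-307: TF-IDF weights over the multidimensional stem
        # dictionary (5000 dimensions as in the example above).
        tfidf = TfidfVectorizer(max_features=n_dims)
        feature_space = tfidf.fit_transform(sentences).toarray()
        # Step 308: location score + first similarity + feature space.
        return np.column_stack((position_scores, title_sims, feature_space))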
Step 202, dimension reduction is performed on the original sentence vector through a restricted Boltzmann machine neural network to obtain the sentence vector.
Because of the inherently high dimensionality of language, related digest extraction techniques usually require hand-crafted features for the classifier, at a high labor cost.
In this embodiment of the present invention, dimension reduction is performed on the original sentence vector through a restricted Boltzmann machine (RBM) neural network to obtain the sentence vector.
An RBM has only an input (visible) layer and a hidden layer. The hidden layer is obtained from the input through weight adjustment, and the hidden layer can in turn reconstruct the input data. This process is somewhat similar to an autoencoder; an autoencoder has an input layer, a hidden layer, and an output layer, uses the error back-propagation (BP) principle of neural networks, and its training objective is to minimize the sum of squared errors. The RBM network is trained with the fast contrastive divergence (CD) algorithm, using Gibbs sampling and satisfying the Boltzmann distribution, so that the network rapidly converges to a stable state. For example, an original sentence vector preprocessed by stop-word removal and the like is input to the RBM network; an output is obtained through the hidden layer, the error between the reconstruction and the input original sentence vector is calculated, and the weights are readjusted for further training. Once the Boltzmann distribution is satisfied, the RBM network reaches a stable state. The latent vector extracted from the hidden-layer nodes in the stable state is the dimension-reduced sentence vector.
Performing feature selection through an RBM reduces labor cost, makes features easier to obtain, and makes the sentence information contained in the features more complete.
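A minimal sketch of this step, assuming scikit-learn's BernoulliRBM (trained with a contrastive-divergence-style update) as the restricted Boltzmann machine; the hidden-layer width and hyperparameters are illustrative.

    from sklearn.neural_network import BernoulliRBM
    from sklearn.preprocessing import minmax_scale

    def reduce_dimension(original_vectors, n_hidden=128):
        # The RBM's visible units are Bernoulli, so scale inputs to [0, 1].
        X = minmax_scale(original_vectors)
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                           n_iter=20, random_state=0)
        rbm.fit(X)
        # The hidden-layer activations are the dimension-reduced
        # sentence vectors.
        return rbm.transform(X)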
Step 203, starting from the first sentence of the article, the sentence vector of each sentence is scored one by one based on a support vector machine (SVM) model to obtain the second score of the sentence.
Because the SVM model handles high-dimensional vectors well, the score of a sentence can be calculated quickly through the support vectors. In this embodiment, starting from the first sentence of the article, the sentence vectors are scored one by one based on the SVM to obtain the second score of each sentence.
Specifically, the SVM model is built from an expert-annotated sample set; a support-vector hyperplane is established by maximizing the margin of the decision function, and the support vectors divide digest sentences from non-digest sentences on the two sides of the plane. The sentence vectors extracted by the restricted Boltzmann machine are fed into the SVM model, the inner product between each sentence vector and the support vectors is calculated, and the vector is classified. The SVM can also calculate the distance between the sentence vector and the hyperplane; a larger distance represents greater confidence in the SVM's classification, that is, the sentence is more likely to be a digest sentence.
Step 204, the second score of the sentence is compared with the preset first threshold.
Step 205, if the second score of the sentence does not exceed the first threshold, the sentence is discarded and the first threshold is kept unchanged; the process returns to step 203 and continues until the last sentence has been processed, yielding all candidate digest sentences.
Step 206, if the second score exceeds the first threshold, the sentence is taken as a candidate digest sentence and the first threshold is updated to the second score.
In this embodiment, as soon as the second score of a sentence is obtained, it is compared with the first threshold; if the second score of the sentence exceeds the first threshold, the sentence is taken as a candidate digest sentence. Furthermore, when the second score exceeds the first threshold, the first threshold is updated to that second score. It should be noted that for the first sentence of the article the first threshold is an initial value; in the subsequent screening, whenever the second score of the current sentence is higher than the first threshold, the first threshold is updated with that score, which ensures that the scores of the candidate digest sentences selected later keep increasing.
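Steps 203 to 206 can be sketched as follows, assuming a scikit-learn LinearSVC trained on the expert-annotated samples; its decision_function returns the signed distance to the hyperplane, used here as the second score, and the first threshold is raised to each accepted score as described.

    from sklearn.svm import LinearSVC

    def select_candidates(sentences, sentence_vectors, svm: LinearSVC,
                          first_threshold=0.0):
        # Step 203: distance to the separating hyperplane = second score;
        # a larger distance means higher classification confidence.
        second_scores = svm.decision_function(sentence_vectors)
        candidates = []
        for sentence, f in zip(sentences, second_scores):
            if f > first_threshold:        # step 206: accept and raise the bar
                candidates.append((sentence, float(f)))
                first_threshold = f
            # step 205: otherwise discard, threshold unchanged
        return candidates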
Step 207, the sentence with the highest second score among the candidate digest sentences is added to the digest sentence set.
After all candidate digest sentences of the article have been obtained, the candidate digest sentences are sorted by their second scores, and the candidate digest sentence with the highest second score is added to the digest sentence set that stores the digest sentences.
Step 208, one candidate digest sentence at a time is chosen from the remaining candidate digest sentences as the current candidate digest sentence, and the redundancy score between the current candidate digest sentence and the digest sentence set is calculated.
In related digest extraction techniques, sentence information is represented by the highest-TF-IDF words, so similar words are easily selected; this increases the redundancy between digest sentences and narrows the coverage of the digest.
In this embodiment, after the candidate digest sentence with the highest second score has been taken as a digest sentence and added to the digest sentence set, in order to choose digest sentences from the remaining candidates, the redundancy score between each remaining candidate digest sentence and the digest sentence set can be calculated; the redundancy between a candidate and the digest sentence set then serves as a basis for extracting digest sentences.
In a specific implementation, the second similarity between the current candidate digest sentence and each existing digest sentence in the digest sentence set can be obtained; all the second similarities are summed, and the sum is divided by the number of existing digest sentences in the digest sentence set to obtain the redundancy score, as shown in formula (1):

r(x, S*) = ( Σ_{s ∈ S*} similar(x, s) ) / size(S*)    (1)

where x denotes the current candidate digest sentence, S* denotes the digest sentence set, r(x, S*) denotes the redundancy score between the candidate digest sentence x and the digest sentence set S*, s ∈ S* denotes a digest sentence in the digest sentence set, similar(x, s) denotes the second similarity between the current candidate digest sentence x and the digest sentence s, the numerator Σ_{s ∈ S*} similar(x, s) is the sum of the second similarities between the current candidate digest sentence and each existing digest sentence in the set, and size(S*) denotes the number of digest sentences in the digest sentence set.
Step 209, the first score is calculated according to the redundancy score, the second score of the current candidate digest sentence, and the sentence length of the current candidate digest sentence.
As one possible implementation, the difference between the second score of the current candidate digest sentence and the redundancy score can be calculated, and the difference divided by the sentence length of the current candidate digest sentence to obtain the first score. Specifically, the first score of the current candidate digest sentence can be calculated according to formula (2):

f_x(score) = ( f(x) - r(x, S*) ) / len(x)    (2)

where x denotes the current candidate digest sentence, f_x(score) denotes the first score of the current candidate digest sentence x, r(x, S*) denotes the redundancy score between the candidate digest sentence x and the digest sentence set S*, f(x) denotes the second score of the current candidate digest sentence, and len(x) denotes the sentence length of the current candidate digest sentence x, that is, the number of characters in x.
In this embodiment, the redundancy score between the current candidate digest sentence and the digest sentence set is a basis for calculating the first score; as formula (2) shows, when the sentence length and the second score are fixed, a larger redundancy score yields a smaller first score.
It should be noted that, to prevent the current candidate digest sentence from being highly redundant with the existing digest sentences, after the redundancy score has been obtained it can be compared with a preset redundancy threshold. If the redundancy score of the current candidate digest sentence is below the redundancy threshold, the current candidate repeats little of the existing digest sentences' content, and the calculation of its first score can proceed. If the redundancy score of the current candidate digest sentence is above the redundancy threshold, the current candidate repeats much of the existing digest sentences' content; if it were still chosen into the digest, the digest would contain more duplicated content. In that case the first score of the current candidate digest sentence need not be calculated, and the current candidate digest sentence is deleted directly. By calculating the redundancy between the current candidate digest sentence and the existing digest sentences, this embodiment avoids duplicated content in the digest as much as possible, so that more of the article's own information can be conveyed within a limited number of words.
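Formulas (1) and (2) and the redundancy-threshold check translate directly into code; cosine similarity is an assumed choice for similar(x, s), and the threshold value is illustrative.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def redundancy(x_vec, digest_vecs):
        # Formula (1): summed second similarities over the digest set,
        # divided by the number of digest sentences in the set.
        return sum(cosine(x_vec, s) for s in digest_vecs) / len(digest_vecs)

    def first_score(second_score, x_vec, digest_vecs, sentence,
                    redundancy_threshold=0.8):
        r = redundancy(x_vec, digest_vecs)
        if r > redundancy_threshold:
            return None          # discard: too redundant with the digest set
        # Formula (2): (second score - redundancy) per unit sentence length.
        return (second_score - r) / len(sentence)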
Step 210, if the first score is greater than or equal to the preset second threshold, the current candidate digest sentence is added to the digest sentence set, and the second threshold is updated to the first score.
In this embodiment, the first score of the current candidate digest sentence is compared with the preset second threshold; when the first score of the candidate digest sentence is equal to or greater than the second threshold, the current candidate digest sentence is taken as a digest sentence and added to the digest sentence set. Furthermore, when the first score of the current candidate digest sentence exceeds the second threshold, the second threshold is updated to the first score, so that the value of the second threshold increases; the first score of each subsequent digest sentence is therefore consistently higher than that of the previous one, ensuring that the extracted digest sentences have higher quality per unit length.
Step 211, the digest of the article is generated from the digest sentences in the digest sentence set.
After digest sentences have been extracted from the remaining candidate digest sentences by the first score, the digest of the article is generated from the digest sentences in the digest sentence set, in the order in which the digest sentences appear in the text.
In the text summarization method of this embodiment of the present invention, the sentence vectors are scored quickly by the SVM to obtain the second scores, which increases the speed of text summarization. When the first score is calculated, the redundancy score between the candidate digest sentence and the digest sentence set is used as a basis, which maximally differentiates the sentences in the digest sentence set and avoids repeated, useless digest sentences; the digest generated in this way has wider coverage.
Because a digest is a summary of the central idea of an article, its length generally cannot be too long, so a digest length can be preset when the digest is generated. Because the sentence lengths of the digest sentences directly affect the length of the digest, when the first score is greater than or equal to the second threshold in step 210 and the current candidate digest sentence is to be added to the digest sentence set, the preset digest length is further taken into account.
Specifically, the characters of the digest sentences already in the digest sentence set are first added up to obtain the total sentence length. Then, the sum of the sentence length of the current candidate digest sentence and the total sentence length is calculated and compared with the preset digest length. If the sum does not exceed the digest length, that is, if adding the current candidate digest sentence to the digest sentence set does not push the total length past the preset length, the current candidate digest sentence can be added to the digest sentence set. If the sum exceeds the digest length, the current candidate digest sentence is not added to the digest sentence set.
In this embodiment, when the first score of a candidate digest sentence is equal to or greater than the second threshold, the preset digest length further determines whether the candidate digest sentence is added to the digest sentence set, which prevents the digest from becoming too long and insufficiently concise.
Further, after all candidate digest sentences have been extracted by the second score, they can be sorted by their order in the article to form the candidate digest sentence set. To avoid reprocessing, after a candidate digest sentence has been added to the digest sentence set, the current candidate digest sentence can be deleted from the candidate digest sentence set.
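Tying the second embodiment together, an illustrative sketch of the selection loop with the preset digest length L; it reuses the first_score helper above and tracks the highest first score (the maxScore of Fig. 4, described next).

    def grow_digest(candidates, vectors, second_scores, L):
        # Step 207: seed with the candidate whose second score is highest.
        best = max(range(len(candidates)), key=second_scores.__getitem__)
        digest = [best]
        remaining = [i for i in range(len(candidates)) if i != best]
        total_len = len(candidates[best])

        while remaining and total_len < L:
            digest_vecs = [vectors[i] for i in digest]
            # Steps 208-209: first score of every remaining candidate.
            scored = [(i, first_score(second_scores[i], vectors[i],
                                      digest_vecs, candidates[i]))
                      for i in remaining]
            scored = [(i, f) for i, f in scored if f is not None]
            if not scored:
                break
            i, _ = max(scored, key=lambda t: t[1])   # maxScore candidate
            if total_len + len(candidates[i]) > L:   # budget exceeded: stop
                break
            digest.append(i)                         # step 210: add to S*
            total_len += len(candidates[i])
            remaining.remove(i)                      # delete from candidate set
        # Step 211: digest sentences in original article order.
        return [candidates[i] for i in sorted(digest)]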
To illustrate the above embodiments more clearly, the text summarization method proposed by the present invention is explained below with reference to Fig. 4. Fig. 4 is a flow diagram of another text summarization method provided by an embodiment of the present invention.
Step 401, the sentence set A of the article is obtained; A has size N, each sentence in A is denoted si, and the initial value of i is 0.
In this embodiment of the present invention, the sentences in the article are first identified according to punctuation such as full stops and semicolons, yielding the sentence set A of the article; A contains N sentences in total, each sentence in A is denoted si, the initial value of i is 0, and i is an integer index.
Steps 402 to 406 extract the candidate digest sentence set C from the sentence set A. Specifically:
Step 402, i < N: judge whether i is less than N. If i is less than N, step 403 is performed; otherwise, step 407 is performed.
Step 403, the original sentence vector of the sentence is obtained, and the second score of the sentence is calculated with the SVM.
In this embodiment, for the method of obtaining the original sentence vector of a sentence, refer to the embodiments above; it is not repeated here.
After the original sentence vector of the sentence has been obtained, dimension reduction is performed on it to obtain the sentence vector; the sentence vector is then input to the SVM to score the sentence, yielding the second score f(x) of the sentence. For the description of scoring sentences with the SVM, refer to the embodiments above; it is not repeated here.
Step 404, f(x) > c: judge whether the second score f(x) of the sentence is greater than c, where c is a preset lower bound on the score.
If f(x) is greater than c, step 405 is performed; otherwise, step 406 is performed.
Step 405, f(x) is compared with max; if f(x) is greater than max, the sentence is added to the candidate digest sentence set C, and max is updated to f(x).
In this embodiment, the sentences of the article are traversed in order, and candidate digest sentences are chosen from all the sentences. Taking the current sentence as an example, its f(x) is compared with max; if f(x) is greater than max, the sentence is added to the candidate digest sentence set C and max is updated to f(x). After step 405 is performed, step 406 is performed. It should be noted that a max value is set in advance; the f(x) of the first sentence is compared with this initial max, and if it is greater, the first sentence is added to the candidate digest sentence set C. Further, max is then updated to the f(x) of the first sentence.
Step 406, i = i + 1, that is, i is incremented by 1.
In this embodiment, after i is incremented, step 402 is performed to judge whether i is less than N, that is, whether all sentences in the sentence set A have been traversed. When i ≥ N, step 407 is performed.
Step 407, the candidate digest sentence with the highest f(x) in the candidate digest sentence set C is added to the digest sentence set S*, and that sentence is deleted from the candidate digest sentence set C.
It should be noted that the highest f(x) value found over the candidate digest sentence set C is the current value stored in the max variable; that is, the candidate digest sentence currently corresponding to max is the sentence with the highest f(x).
Through steps 401 to 406, the sentences of the sentence set A whose second scores exceed the first threshold are added to the candidate digest sentence set C; the candidate digest sentence corresponding to the current max, that is, the candidate with the highest second score, is then added to the digest sentence set S* and deleted from the candidate digest sentence set C.
Step 408, len(x) < L: judge whether the total sentence length len(x) of the digest sentences already in the digest sentence set S* is less than the preset total digest length L. If len(x) < L, step 409 is performed; otherwise, step 414 is performed.
Step 409, the next candidate digest sentence sj, 0 ≤ j < m, is taken from the candidate digest sentence set C, and the redundancy between the candidate digest sentence and the digest sentences in the existing digest set is calculated.
It should be understood that when step 409 is performed, the value of m must be greater than or equal to 1; that is, there is at least one candidate digest sentence in the candidate digest sentence set C.
Step 410, j < m: judge whether j is less than m. When j < m, that is, when the candidate digest sentences in C have not yet all been traversed, step 411 is performed.
Step 411, if the redundancy between the candidate digest sentence and the existing digest sentences in the digest set is less than the redundancy threshold, the first score of the candidate digest sentence is calculated and compared with maxScore, where maxScore holds the highest first score.
In this embodiment, the first score is compared with maxScore so that maxScore holds the highest first score: if the first score of a sentence is greater than the current value of maxScore, the current value of maxScore is replaced with that first score; if the first score of a sentence is less than the current value of maxScore, maxScore keeps its current value. It should be noted that maxScore is the second threshold of the embodiments above; its value is preset at the start and then updated with the first scores of the digest sentences that are selected.
After step 411 is performed, step 412 is performed: j = j + 1, that is, j is incremented by 1. Then step 409 is performed again, continuing to calculate redundancy and to judge whether j is less than m.
Step 413, the candidate digest sentence corresponding to maxScore is added to S*, and that sentence is deleted from C.
In this embodiment, after the candidate digest sentence with the highest first score has been extracted from the candidate digest sentence set C, the candidate digest sentence corresponding to maxScore is added to S* and deleted from C. Afterwards, the process returns to step 408 to judge whether the total sentence length of the digest sentences already in the digest sentence set S* is less than the preset digest length L; if len(x) ≥ L, step 414 is performed, otherwise step 409 continues.
Through steps 409 to 413, the sentences with higher second scores are first extracted from all sentences as candidate digest sentences; then, using the redundancy between each candidate digest sentence in the candidate digest sentence set C and the existing digest sentences as a basis, the candidate digest sentences with low redundancy against the existing digest sentences and high quality per unit sentence length (high first score) are extracted from the candidate set and added to S*, yielding the digest sentences for building the digest.
Step 414, the digest of the article is generated from all the digest sentences in S*.
After the traversal of the candidate digest sentences is complete, all the digest sentences in S* are spliced together according to their positions in the article to form the digest of the article.
In the text summarization method of this embodiment of the present invention, the sentence vector of each sentence is obtained; candidate digest sentences are selected from all the sentences according to the sentence vectors; a first score of each candidate digest sentence is then obtained; digest sentences are selected from the candidate digest sentences according to the first score; and the digest of the article is generated from the digest sentences. In this embodiment, because sentence vectors retain sentence information well and the first score represents the sentence quality per unit length of a candidate digest sentence, performing this double selection over all sentences in the article improves the accuracy and coverage with which the digest summarizes the central idea of the article, and solves the problem in related digest extraction techniques that representing sentence information by the highest-TF-IDF words ignores the relationships between words and loses the information of many words, narrowing the coverage of the digest.
To realize the above embodiments, the present invention further proposes a text summarization apparatus. Fig. 5 shows a text summarization apparatus proposed by an embodiment of the present invention.
As shown in Fig. 5, the apparatus includes: a first acquisition module 510, a first selection module 520, a second acquisition module 530, a second selection module 540, and a generation module 550.
The first acquisition module 510 is configured to obtain the sentence vector of each sentence in the article.
The first selection module 520 is configured to select candidate digest sentences from all the sentences according to the sentence vectors.
The second acquisition module 530 is configured to obtain the first score of each candidate digest sentence, where the first score represents the sentence quality per unit length of the candidate digest sentence.
The second selection module 540 is configured to select digest sentences from the candidate digest sentences according to the first score.
The generation module 550 is configured to generate the digest of the article from the digest sentences.
In an embodiment of the present invention, the first acquisition module 510 includes:
a recognition unit, configured to identify the sentences in the article;
an acquisition unit, configured to obtain the original sentence vector of each sentence;
a dimension-reduction unit, configured to perform dimension reduction on the original sentence vector through a restricted Boltzmann machine neural network to obtain the sentence vector.
In an embodiment of the present invention, the acquisition unit is further configured to:
identify the position of the sentence in the article according to cue words, and calculate the location score of the sentence;
calculate the first similarity between the sentence and the title of the article;
obtain the feature space of the sentence represented by the multidimensional dictionary;
compose the original sentence vector of the sentence from the location score, the first similarity, and the feature space.
In an embodiment of the present invention, the acquisition unit is further configured to:
convert the sentence into a set of words;
filter stop words out of the word set, and extract stems;
calculate the first similarity according to the extracted stems.
In an embodiment of the present invention, the first selection module 520 may include:
a scoring unit, configured to, starting from the first sentence of the article, score the sentence vector of each sentence one by one based on a support vector machine (SVM) model to obtain the second score of the sentence;
a selection unit, configured to compare the second score of the sentence with the preset first threshold; if the second score exceeds the first threshold, take the sentence as a candidate digest sentence and update the first threshold to the second score; if the second score of the sentence does not exceed the first threshold, discard the sentence and keep the first threshold unchanged.
In an embodiment of the present invention, the second acquisition module 530 may include:
a first calculation unit, configured to add the sentence with the highest second score among the candidate digest sentences to the digest sentence set, the digest sentence set containing the digest sentences; and to choose one candidate digest sentence at a time from the remaining candidate digest sentences as the current candidate digest sentence and calculate the redundancy score between the current candidate digest sentence and the digest sentence set;
a second calculation unit, configured to calculate the first score according to the redundancy score, the second score of the current candidate digest sentence, and the sentence length of the current candidate digest sentence.
In an embodiment of the present invention, the first calculation unit is further configured to:
obtain the second similarity between the current candidate digest sentence and each existing digest sentence in the digest sentence set;
sum all the second similarities, and divide the sum by the number of existing digest sentences in the digest sentence set to obtain the redundancy score.
In an embodiment of the present invention, the second calculation unit is further configured to:
obtain the difference between the second score of the current candidate digest sentence and the redundancy score;
divide the difference by the sentence length of the current candidate digest sentence to obtain the first score.
Further, the first calculation unit is further configured to:
before the first score is calculated according to the redundancy score, the second score of the current candidate digest sentence, and the sentence length of the current candidate digest sentence, determine that the redundancy score of the current candidate digest sentence is lower than the redundancy threshold; when the redundancy score is greater than the redundancy threshold, discard the current candidate digest sentence.
In an embodiment of the present invention, the second selection module 540 is further configured to:
if the first score is greater than or equal to the preset second threshold, add the current candidate digest sentence to the digest sentence set, and update the second threshold to the first score.
In one embodiment of the invention, the second selection module 540 is further configured to:
obtain the total sentence length of the digest sentences already in the current digest sentence set;
calculate the sum of the sentence length of the current candidate digest sentence and the total sentence length, and compare the sum with a preset digest length;
if the sum does not exceed the digest length, add the current candidate digest sentence to the digest sentence set;
if the sum exceeds the digest length, no longer add the current candidate digest sentence to the digest sentence set, and stop updating the digest sentence set.
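The following sketch combines the second-threshold test and the length budget just described; returning the updated threshold and measuring length in words are assumptions of the example:

```python
def update_digest_set(candidate, score1, digest_set,
                      second_threshold, max_digest_length):
    """Add the candidate when its first scoring reaches the second
    threshold and the total digest length stays within the preset
    budget; on success the second threshold is raised to that scoring."""
    if score1 is None or score1 < second_threshold:
        return second_threshold, False       # scoring too low: rejected
    total_length = sum(len(s.split()) for s in digest_set)
    if total_length + len(candidate.split()) > max_digest_length:
        return second_threshold, False       # budget exceeded: stop updating
    digest_set.append(candidate)
    return score1, True                      # threshold updated to first scoring
```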
In one embodiment of the invention, the first selection module 520 may further include:
a forming unit, configured to, after the sentences whose second scoring exceeds the preset first threshold are chosen as candidate digest sentences, form a candidate digest sentence set according to the order of the candidate digest sentences.
In one embodiment of the invention, the second selection module 540 is further configured to:
after the current candidate digest sentence is added to the digest sentence set, delete the current candidate digest sentence from the candidate digest sentence set.
It should be noted that the foregoing description of the text summarization method embodiments also applies to the text summarization apparatus of this embodiment, and details are not repeated here.
With the text summarization apparatus of the embodiment of the present invention, the sentence vector of each sentence is obtained, candidate digest sentences are selected from all the sentences according to the sentence vectors, the first scoring of each candidate digest sentence is then obtained, digest sentences are selected from the candidate digest sentences according to the first scoring, and the digest of the article is generated from the digest sentences. In this embodiment, because the sentence vector preserves sentence information well and the first scoring represents the sentence quality per unit length of a candidate digest sentence, the double selection over all sentences in the article by sentence vector and first scoring yields digest sentences that improve the accuracy and coverage with which the digest summarizes the central idea of the article. This addresses the problem in related digest extraction techniques where the highest-TF-IDF words are taken to represent the sentence information, the relationships between words are ignored, and the information of many words is lost, narrowing the coverage of the digest.
To implement the above embodiments, the present invention further provides a computer device including a processor and a memory;
the processor reads the executable program code stored in the memory and runs the program corresponding to the executable program code, so as to implement the text summarization method described in any of the foregoing embodiments.
Fig. 6 is a block diagram of an exemplary computer device 30 suitable for implementing the embodiments of the present application. The computer device 30 shown in Fig. 6 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer device 30 takes the form of a general-purpose computing device. Components of the computer device 30 may include, but are not limited to: one or more processors or processing units 31, a system memory 32, and a bus 33 connecting the different system components (including the system memory 32 and the processing unit 31).
The bus 33 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 30 typically includes a variety of computer-system-readable media. These media may be any available media accessible by the computer device 30, including volatile and non-volatile media, and removable and non-removable media.
The system memory 32 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 40 and/or a cache memory 41. The computer device 30 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, a storage system 42 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 6, commonly called a "hard disk drive"). Although not shown in Fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media) may be provided. In such cases, each drive may be connected to the bus 33 through one or more data media interfaces. The memory 32 may include at least one program product having a set of (e.g., at least one) program modules configured to carry out the functions of the embodiments of the present application.
A program/utility 50 having a set of (at least one) program modules 51 may be stored, for example, in the memory 32. Such program modules 51 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a networking environment. The program modules 51 generally perform the functions and/or methods of the embodiments described herein.
The computer device 30 may also communicate with one or more external devices 60 (such as a keyboard, a pointing device, a display 70, etc.), with one or more devices that enable a user to interact with the computer device 30, and/or with any device (such as a network card, a modem, etc.) that enables the computer device 30 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interfaces 34. Moreover, the computer device 30 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 35. As shown in Fig. 6, the network adapter 35 communicates with the other modules of the computer device 30 through the bus 33. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer device 30, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 31 runs the programs stored in the system memory 32 to execute various functional applications and data processing, for example to implement the text summarization method shown in Figs. 1-4.
To implement the above embodiments, an embodiment of the present invention further provides a non-transitory computer-readable storage medium having a computer program stored thereon, where the program, when executed by a processor, implements the text summarization method described in any of the foregoing embodiments.
To implement the above embodiments, the present invention further provides a computer program product, where instructions in the computer program product, when executed by a processor, implement the text summarization method described in any of the foregoing embodiments.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples", or the like means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may join and combine different embodiments or examples, and features of different embodiments or examples, described in this specification, provided they do not contradict each other.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless specifically defined otherwise.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process, and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts, or otherwise described herein, may be considered, for example, an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wirings, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, as the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps carried by the methods of the above embodiments may be completed by instructing relevant hardware through a program, the program may be stored in a computer-readable storage medium, and the program, when executed, includes one of the steps of the method embodiment or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention, and those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims (10)

  1. A text summarization method, characterized by comprising:
    obtaining a sentence vector of each sentence in an article;
    selecting candidate digest sentences from all the sentences according to the sentence vectors;
    obtaining a first scoring of the candidate digest sentence, wherein the first scoring represents the sentence quality per unit length of the candidate digest sentence;
    selecting a digest sentence from the candidate digest sentences according to the first scoring;
    generating a digest of the article using the digest sentence.
  2. The method according to claim 1, characterized in that obtaining the sentence vector of each sentence in the article comprises:
    identifying the sentence from the article;
    obtaining an original sentence vector of the sentence;
    performing dimension reduction on the original sentence vector through a restricted Boltzmann machine neural network to obtain the sentence vector.
  3. The method according to claim 2, characterized in that obtaining the original sentence vector of the sentence comprises:
    identifying the position of the sentence in the article according to cue words, and calculating a location score of the sentence;
    calculating a first similarity between the sentence and the topic of the article;
    obtaining a feature space of the sentence represented by a multidimensional dictionary;
    forming the original sentence vector of the sentence from the location score, the first similarity, and the feature space.
  4. The method according to claim 3, characterized in that calculating the first similarity between the sentence and the topic of the article comprises:
    converting the sentence into a set of words;
    filtering stop words out of the word set, and extracting the word stems;
    calculating the first similarity from the extracted stems.
  5. The method according to any one of claims 1-4, characterized in that selecting candidate digest sentences from all the sentences according to the sentence vectors comprises:
    starting from the first sentence of the article, scoring the sentence vector of each sentence one by one based on a support vector machine (SVM) model, to obtain a second scoring of the sentence;
    comparing the second scoring of the sentence with a preset first threshold; if the second scoring exceeds the first threshold, taking the sentence as a candidate digest sentence, and updating the first threshold to the second scoring;
    if the second scoring of the sentence does not exceed the first threshold, discarding the sentence, and keeping the first threshold unchanged.
  6. The method according to claim 5, characterized in that obtaining the first scoring of the candidate digest sentence comprises:
    adding the sentence with the highest second scoring among the candidate digest sentences to a digest sentence set, the digest sentence set containing the digest sentences;
    taking one candidate digest sentence at a time from the remaining candidate digest sentences as the current candidate digest sentence, and calculating a redundancy scoring between the current candidate digest sentence and the digest sentence set;
    calculating the first scoring from the redundancy scoring, the second scoring of the current candidate digest sentence, and the sentence length of the current candidate digest sentence.
  7. The method according to claim 6, characterized in that calculating the redundancy scoring between the current candidate digest sentence and the digest sentence set comprises:
    obtaining a second similarity between the current candidate digest sentence and each digest sentence already in the digest sentence set;
    adding up all the second similarities, and dividing the sum by the number of digest sentences already in the digest sentence set to obtain the redundancy scoring.
  8. The method according to claim 6, characterized in that calculating the first scoring from the redundancy scoring, the second scoring of the current candidate digest sentence, and the sentence length of the current candidate digest sentence comprises:
    obtaining the difference between the second scoring of the current candidate digest sentence and the redundancy scoring;
    dividing the difference by the sentence length of the current candidate digest sentence to obtain the first scoring.
  9. The method according to claim 8, characterized by further comprising, before calculating the first scoring from the redundancy scoring, the second scoring of the current candidate digest sentence, and the sentence length of the current candidate digest sentence:
    determining that the redundancy scoring of the current candidate digest sentence is less than a redundancy threshold;
    discarding the current candidate digest sentence when the redundancy scoring is greater than the redundancy threshold.
  10. A text summarization apparatus, characterized by comprising:
    a first acquisition module, configured to obtain a sentence vector of each sentence in an article;
    a first selection module, configured to select candidate digest sentences from all the sentences according to the sentence vectors;
    a second acquisition module, configured to obtain a first scoring of the candidate digest sentence, wherein the first scoring represents the sentence quality per unit length of the candidate digest sentence;
    a second selection module, configured to select a digest sentence from the candidate digest sentences according to the first scoring;
    a generation module, configured to generate a digest of the article using the digest sentence.
CN201711463868.5A 2017-12-28 2017-12-28 Text summarization method and apparatus Pending CN108182247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711463868.5A CN108182247A (en) 2017-12-28 2017-12-28 Text summarization method and apparatus

Publications (1)

Publication Number Publication Date
CN108182247A true CN108182247A (en) 2018-06-19

Family

ID=62548639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711463868.5A Pending CN108182247A (en) 2017-12-28 2017-12-28 Text summarization method and apparatus

Country Status (1)

Country Link
CN (1) CN108182247A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388804A (en) * 2018-10-22 2019-02-26 平安科技(深圳)有限公司 Report core views extracting method and device are ground using the security of deep learning model
CN110287489A (en) * 2019-06-24 2019-09-27 北京大米科技有限公司 Document creation method, device, storage medium and electronic equipment
CN110781291A (en) * 2019-10-25 2020-02-11 北京市计算中心 Text abstract extraction method, device, server and readable storage medium
CN111666402A (en) * 2020-04-30 2020-09-15 平安科技(深圳)有限公司 Text abstract generation method and device, computer equipment and readable storage medium
CN112732901A (en) * 2021-01-15 2021-04-30 联想(北京)有限公司 Abstract generation method and device, computer readable storage medium and electronic equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
US20110087671A1 (en) * 2009-10-14 2011-04-14 National Chiao Tung University Document Processing System and Method Thereof
CN102841940A (en) * 2012-08-17 2012-12-26 浙江大学 Document summary extracting method based on data reconstruction
US20160078038A1 (en) * 2014-09-11 2016-03-17 Sameep Navin Solanki Extraction of snippet descriptions using classification taxonomies
CN104503958A (en) * 2014-11-19 2015-04-08 百度在线网络技术(北京)有限公司 Method and device for generating document summarization
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors
CN105243152A (en) * 2015-10-26 2016-01-13 同济大学 Graph model-based automatic abstracting method
CN106227722A (en) * 2016-09-12 2016-12-14 中山大学 A kind of extraction method based on listed company's bulletin summary
CN106599148A (en) * 2016-12-02 2017-04-26 东软集团股份有限公司 Method and device for generating abstract
CN107273474A (en) * 2017-06-08 2017-10-20 成都数联铭品科技有限公司 Autoabstract abstracting method and system based on latent semantic analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王佳松 (Wang Jiasong): "Research on Multi-document Automatic Summarization Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology Series *

Similar Documents

Publication Publication Date Title
CN108182247A (en) Text summarization method and apparatus
Downey et al. Locating complex named entities in web text.
CN103914548B (en) Information search method and device
CN108280061A (en) Text handling method based on ambiguity entity word and device
JP5447862B2 (en) Word classification system, method and program
JP4595692B2 (en) Time-series document aggregation method and apparatus, program, and storage medium storing program
KR20170055970A (en) Computer-implemented identification of related items
US11645447B2 (en) Encoding textual information for text analysis
CN111694927B (en) Automatic document review method based on improved word shift distance algorithm
CN108388660A (en) A kind of improved electric business product pain spot analysis method
CN108460098A (en) Information recommendation method, device and computer equipment
JP7281905B2 (en) Document evaluation device, document evaluation method and program
JP2011085986A (en) Text summarization method, its device, and program
CN108038108A (en) Participle model training method and device and storage medium
CN108304377A (en) A kind of extracting method and relevant apparatus of long-tail word
KR20180131146A (en) Apparatus and Method for Identifying Core Issues of Each Evaluation Criteria from User Reviews
CN110020163A (en) Searching method, device, computer equipment and storage medium based on human-computer interaction
JP6830971B2 (en) Systems and methods for generating data for sentence generation
Paripremkul et al. Segmenting words in Thai language using Minimum text units and conditional random Field
CN107704549A (en) Voice search method, device and computer equipment
CN107797986A (en) A kind of mixing language material segmenting method based on LSTM CNN
CN110929518A (en) Text sequence labeling algorithm using overlapping splitting rule
CN111339778B (en) Text processing method, device, storage medium and processor
CN103914447B (en) Information processing device and information processing method
Ou et al. Unsupervised citation sentence identification based on similarity measurement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20180619