CN110532554A - Chinese abstract generation method, system and storage medium - Google Patents
Chinese abstract generation method, system and storage medium
- Publication number
- CN110532554A CN110532554A CN201910787889.5A CN201910787889A CN110532554A CN 110532554 A CN110532554 A CN 110532554A CN 201910787889 A CN201910787889 A CN 201910787889A CN 110532554 A CN110532554 A CN 110532554A
- Authority
- CN
- China
- Prior art keywords
- Chinese
- word
- vector
- sequence
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a Chinese abstract generation method, system and storage medium. The method includes the steps of: obtaining a target text and determining its Chinese word-vector sequence; inputting the Chinese word-vector sequence into a pre-trained encoder to generate semantic vectors; recombining, from the semantic vectors, the full-text semantics best suited to the current time step, and sending the recombined intermediate semantics summarizing the full text to a pre-trained decoder; and having the decoder infer the distribution of the next word from the word predicted at the previous time step and the intermediate semantics summarizing the full text, the word sequence finally generated being the abstract of the target text. The invention improves the generation quality and readability of Chinese text abstracts.
Description
Technical field
The present invention relates to a Chinese abstract generation method, system and storage medium, and belongs to the technical field of text information processing.
Background technique
Automatic summarization is the technology of using computers to perform automatic text analysis, content summarization and abstract generation. As a supplementary means of tackling today's information-overload problem, it helps people understand natural-language text and obtain key information more quickly, accurately and comprehensively, and has important practical significance in both industry and business.
Abstracts produced by commonly used abstract generation methods generally suffer from low generation quality and poor readability.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a Chinese abstract generation method, system and storage medium capable of improving the generation quality and readability of Chinese text abstracts.
To achieve the above object, the present invention adopts the following technical solutions:
In a first aspect, the present invention provides a Chinese abstract generation method including the following steps:
obtaining a target text, and splitting the Chinese characters in the target text into stroke sequences;
determining the Chinese word-vector sequence of the target text from the stroke sequences;
inputting the Chinese word-vector sequence into a pre-trained encoder to generate semantic vectors;
recombining, from the semantic vectors, the full-text semantics best suited to the current time step, and sending the recombined intermediate semantics summarizing the full text to a pre-trained decoder;
having the decoder infer the distribution of the next word from the word predicted at the previous time step and the intermediate semantics summarizing the full text, the word sequence finally generated being the abstract of the target text.
With reference to the first aspect, further, the method of determining the Chinese word-vector sequence of the target text includes:
performing n-gram segmentation on the stroke sequences to obtain the n-gram information of the Chinese-character strokes;
predicting the context of the center word from the n-grams with a Skip-Gram model to obtain the corresponding Chinese word-vector sequence.
With reference to the first aspect, further, the method of obtaining the n-gram information of the Chinese-character strokes includes:
splitting words into characters, and finding the stroke sequence corresponding to each character;
converting the stroke sequences into IDs;
taking n-grams over the ID-converted stroke sequences to obtain the n-gram information.
With reference to the first aspect, further, the encoder uses a bidirectional long short-term memory neural network.
With reference to first aspect, further, the method for generative semantics vector includes:
By Chinese word sequence vector, forward and reverse is input to two-way length in short-term in Memory Neural Networks respectively, obtains two kinds
Corresponding two hidden states of each word under sequence;
Two hidden state head and the tail splicings are generated into the semantic vector.
With reference to the first aspect, further, the method of recombining, from the semantic vectors, the full-text semantics best suited to the current time step includes:
adding an attention mechanism so that, when the encoder generates the sentence semantic vector, the influence weight of each input word on the decoder side is computed;
recombining the full-text semantic information best suited to the current time step from the influence weights of the input words on the decoder side, combined with the hidden state fed back by the decoder.
With reference to the first aspect, further, the method also includes optimizing the generated word sequence using a beam search algorithm.
With reference to the first aspect, further, the method also includes preprocessing the target text, comprising:
removing special characters, the special characters including punctuation marks, stop-word modal particles and adversative conjunctions;
replacing all dates with the label TAG_DATE;
replacing hyperlink URLs with the label TAG_URL;
replacing numbers with the label TAG_NUMBER;
replacing English words with the label TAG_NAME_EN.
In a second aspect, the present invention provides a Chinese abstract generation system including a processor and a memory, the memory storing a program that can be loaded by the processor to execute the steps of any one of the aforementioned methods.
In a third aspect, the present invention provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the steps of any one of the aforementioned methods.
Compared with the prior art, the beneficial effects of the invention are as follows: the Chinese word-vector sequence of the target text is input into a pre-trained encoder to generate semantic vectors; the full-text semantics best suited to the current time step are recombined from the semantic vectors, and the recombined intermediate semantics summarizing the full text are sent to a pre-trained decoder; the decoder infers the distribution of the next word from the word predicted at the previous time step and the intermediate semantics summarizing the full text, the word sequence finally generated being the abstract of the target text. The invention enriches the features needed for phrase understanding and captures components that are semantically close to the original characters, helping to improve the generation quality and readability of Chinese text abstracts.
Brief description of the drawings
Fig. 1 is a flowchart of a Chinese abstract generation method provided according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the training stage and the test stage of the language model provided according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of splitting the word "大学" ("university") into stroke n-grams according to an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of the bidirectional long short-term memory neural network provided according to an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of the Seq2Seq model provided according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of choosing the best text sequence using the beam search algorithm according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of computing semantic vectors using the attention mechanism according to an embodiment of the present invention.
Specific embodiments
The invention is further described below in conjunction with the accompanying drawings. The following embodiments are only used to clearly illustrate the technical solution of the present invention and are not intended to limit its protection scope.
As shown in Fig. 1, a Chinese abstract generation method provided according to an embodiment of the present invention specifically includes the following steps:
a) Text preprocessing: after the target text is segmented into words, the words are vectorized and a corresponding vocabulary is constructed; the resulting word-vector sequence serves as the input of the next stage.
b) Semantic understanding: using the memory capability of recurrent neural networks, the word-vector sequence from the first stage is input into the encoder (a bidirectional long short-term memory neural network, Bi-LSTM for short); the encoder generates the semantic vector of each text segment and passes it to the next stage.
c) Word computation: the application adds an attention mechanism when the encoder generates the sentence semantic vector; the attention mechanism recombines the full-text semantic information best suited to the current time step according to the hidden state fed back by the decoder, and sends the recombined intermediate semantic information to the decoder for the word prediction of the current time step.
d) Abstract generation: at this stage the decoder (an RNN, Recurrent Neural Network) infers the distribution of the next word from the word predicted at the previous time step and the intermediate semantic vector summarizing the full text; the word sequence finally generated is the abstract sentence.
In the above four stages, stroke-vector encoding and the training of the automatic-summarization language model are introduced for the characteristics of Chinese. To make the structure of the embodiment model clearer, the model is divided in Fig. 2 into two parts, the training stage and the test stage.
The left side of Fig. 2 is the training stage of the model of the embodiment of the present invention; the arrows show the flow of data and the backpropagation of parameters, and the underlined part is the stroke-based Chinese encoding used by the embodiment. The remaining parts include the encoder composed of Bi-LSTM, the decoder composed of RNN, and the optimization of the attention mechanism. The right side is the test stage of the model, which mainly tests the trained decoder: a test text is input and, after the stroke-based Chinese character encoding, the trained decoder generates the automatic abstract. At this stage, in order to maximize the probability of the generated sentence, Beam Search is added to widen the range of candidate sentences and optimize the fluency of the generated abstract.
The embodiment of the present invention splits each word into a stroke sequence and uses stroke-based n-gram information to capture components that are semantically close to the original character and can only be captured at the finer stroke granularity, such as the component 知 ("know") inside 智 ("wisdom"); this brings a clear improvement to word-vector representations of Chinese.
In the method for term vector coding, the embodiment of the present invention uses the Skip-Gram training method of Word2vec, passes through
Centre word predicts context.In Word2vec, each character is initialized using simple one-hot coding, and uses for reference
Each word is converted to corresponding strokes sequence by the thought of Fasttext, such as: " university " then will become " ノ Dian Dian Dian ノ a Dian
Off Off Shu mono- " does n-gram cutting for such sequence, and the stroke initialization after cutting can thus be captured as input
It is associated with to deep layer existing between Chinese character.The meaning of this method is, Chinese character is splitted into stroke or radical, mainly passes through stroke
Or the semantic information between radical, the feature of text itself is enriched, so that the modular construction similitude between similar character is captured,
Increase the feature needed for phrase understands.For in terms of philological, it can allow the words of similar meaning that there is class on Chinese characters word-formation
As structure, this is the theoretical basis of the method for Chinese character coding of the embodiment of the present invention.
Chinese strokes have more than 30 fine-grained varieties; here the strokes are divided into five major classes: horizontal, vertical, left-falling, right-falling and fold. As shown in Table 1, these five stroke classes are each given a digital number so that each can be conveniently mapped to its corresponding vector in the dictionary.
Table 1. Chinese stroke encoding
Stroke name | Horizontal (横) | Vertical (竖) | Left-falling (撇) | Right-falling (捺) | Fold (折) |
Shape | 一 | 丨 (亅) | ノ | 丶 | 乛 |
ID | 1 | 2 | 3 | 4 | 5 |
As shown in Fig. 3, the process of splitting a word into strokes and taking n-gram values is broadly divided into four steps:
a) first splitting the word into characters;
b) finding the stroke sequence corresponding to each character;
c) converting the stroke sequence obtained in the previous step into IDs;
d) taking n-grams over the ID-converted stroke sequence. Each stroke n-gram corresponds to a vector, and the dimension of a stroke vector is consistent with the dimension of the word vectors of the context words. Throughout the full text, the same word or stroke appearing in different places shares the same vector.
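The four steps above can be sketched in code as follows. This is a minimal illustration, not the patent's implementation; in particular the per-character stroke sequences below are illustrative assumptions, whereas a real system would look them up in a complete Chinese stroke dictionary.

```python
STROKE_IDS = {"横": 1, "竖": 2, "撇": 3, "捺": 4, "折": 5}  # Table 1

# Hypothetical stroke dictionary covering only the example word 大学
# ("university"); the class labels here are assumptions for illustration.
CHAR_STROKES = {
    "大": ["横", "撇", "捺"],
    "学": ["捺", "捺", "撇", "捺", "折", "折", "竖", "横"],
}

def stroke_ngrams(word, n_values=(3, 4, 5)):
    """Split a word into characters, ID-ize its strokes, take n-grams."""
    ids = []
    for ch in word:                                   # step a: word -> characters
        strokes = CHAR_STROKES[ch]                    # step b: character -> strokes
        ids.extend(STROKE_IDS[s] for s in strokes)    # step c: strokes -> IDs
    grams = []
    for n in n_values:                                # step d: n-grams, n in {3,4,5}
        grams.extend(tuple(ids[i:i + n]) for i in range(len(ids) - n + 1))
    return ids, grams

ids, grams = stroke_ngrams("大学")
print(ids)         # full stroke-ID sequence of the word
print(len(grams))  # number of n-grams with n in {3, 4, 5}
```

Each distinct tuple in `grams` would then be assigned its own embedding vector, shared wherever the same stroke n-gram occurs in the corpus.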
A word has relatively high similarity with its context words. The similarity of the current word w and its context word c is expressed by the inner product of their vectors \vec{w} \cdot \vec{c}, where \vec{w} and \vec{c} respectively denote the vector representations of w and c. A vector is assigned to each stroke n-gram and to each context word, and the similarity is computed by summing the inner products of the stroke n-gram vectors composing the current word w with the word vector of the context word. The stroke n-gram vectors of all words in the corpus are stored in a dictionary S, where S(w) denotes the stroke n-gram set of the word w, and the similarity function is:
sim(w, c) = \sum_{q \in S(w)} \vec{z}_q \cdot \vec{c}   (1)
where q is an element of S(w) and \vec{z}_q is the embedding vector of q.
A language model aims to predict the probability of a sentence. The Skip-Gram model adopted by the embodiment of the present invention uses the center word w to predict the context c, i.e. computes the probability p(c | w); here this probability is computed with a softmax function:
p(c \mid w) = \frac{\exp(sim(w, c))}{\sum_{c' \in V} \exp(sim(w, c'))}   (2)
where c' ranges over the words of the corpus vocabulary V. It can be seen that the cost of the denominator is |V|, so its computation is accelerated with negative sampling. The idea of negative sampling is to reduce the number of weights the model needs to update by reducing the number of negative samples. For example, for the input "大学" ("university") with one-hot encoding, the output layer is expected to output 1 at the neuron nodes of context words such as "全国" ("nationwide") and "排名" ("ranking"); if |V| is 10000, the remaining 9999 nodes are expected to output 0. These 9999 samples are the negative samples; negative sampling draws 5-20 of them and updates only their weights, leaving the remaining weights unchanged, thereby reducing the amount of computation. The loss function based on negative sampling is:
L = \sum_{w \in D} \sum_{c \in T(w)} \log \sigma(sim(w, c)) + \lambda \, \mathbb{E}_{c' \sim p}[\log \sigma(-sim(w, c'))]   (3)
In the above formula, D is the set of all training words in the corpus, T(w) is the set of context words of the current word within the window, σ is the sigmoid activation function σ(x) = (1 + exp(-x))^{-1}, λ is the number of negative samples, and \mathbb{E}_{c' \sim p} is the expectation taken so that the negative samples follow the distribution p. To describe this algorithm in more detail, Table 2 gives the stroke-encoding procedure:
Table 2. Stroke_Embedding (stroke embedding) algorithm procedure
An important parameter of the Stroke_Embedding algorithm is n_gram; considering the number of strokes in the basic components of Chinese characters, it is set here to 3, 4 and 5 so as to capture most of the component information contained in Chinese characters.
In long text sequences it is essential to capture the semantics of the text as completely as possible; the embodiment of the present invention captures bidirectional semantic dependencies through a Bi-LSTM network. Fig. 4 is a schematic diagram of the Bi-LSTM network structure. When used as the encoder or decoder of a Sequence-to-Sequence model (hereafter abbreviated Seq2Seq model), although classification tasks such as sentiment analysis are not required, Bi-LSTM locates the semantic vector more accurately and brings a gain in summarization quality. Bi-LSTM corresponds to the encoding part: it feeds the input sentence sequence into LSTM networks in forward and reverse order respectively, obtains the two hidden states h_t and h'_t corresponding to each word under the two orders, and then splices the hidden states of the two orders:
h_{new} = concatenate(h_t, h'_t)   (4)
that is, end to end, so that a more complete semantic vector can be obtained in the encoding part. The structure of the entire model is shown in Fig. 5.
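The splicing in formula (4) can be sketched as follows. The hidden-state values are toy numbers assumed for illustration; in the real model they would come from the forward and backward LSTM passes.

```python
def bi_concat(forward_states, backward_states):
    """Formula (4): splice forward and backward hidden states end to end,
    giving one vector per word of twice the hidden size."""
    assert len(forward_states) == len(backward_states)
    return [f + b for f, b in zip(forward_states, backward_states)]

# Toy hidden states (size 3) for a 2-word sentence; a real encoder such as
# the 256-unit Bi-LSTM described later would produce size-256 states.
h_fwd = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
h_bwd = [[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]]
h_new = bi_concat(h_fwd, h_bwd)
print(h_new[0])       # [0.1, 0.2, 0.3, 0.9, 0.8, 0.7]
print(len(h_new[0]))  # 6 = 2 x hidden size
```

The doubled dimension is why the decoder and attention layers must be sized for 2 × hidden units when fed Bi-LSTM states.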
Bi-LSTM serves as the main structure of the encoder part in the left half of the model diagram. After the semantic vector C is obtained, it is passed to the decoder in the right half, which gradually generates the word at each time point of the sequence; the training-stage decoder is composed of RNN units. Intuitively, the encoder feeds an input text of length n into the bidirectional recurrent neural network, where x_0, x_1, ..., x_t, ..., x_n denote the word vectors corresponding to the words of the text. After the semantic vector C is obtained, the decoding process takes it as the input of each time step and outputs one word y_1, y_2, ..., y_m at each corresponding time step, thereby decoding a text of length m; in general there is no strict size relation between n and m. Since the task of the embodiment of the present invention is text summarization and the input text is longer than the output text, n > m. The above is the training process of the model; during testing, the trained decoder model is mainly used, and with generation-time optimization the experiment is finally completed on the test set.
After the Seq2Seq model is trained, new abstract sentences need to be generated with it, and sentence generation is a sequence problem in which words are produced one by one. Beam Search keeps the K highest-probability words to form candidate sequences, where K is the beam width. Fig. 6 is a schematic diagram of choosing the best text sequence with Beam Search. Assuming the dictionary size is 6, the detailed steps are as follows:
(1) after generating the probability distribution [0.1, 0.1, 0.4, 0.2, 0.1, 0.1] of the first word y_1, the two highest-probability words are chosen; as shown in Fig. 6, "我" ("I") and "的" ("of") are the most probable choices for the first word;
(2) in the second step, the probability of the second word is computed only for these two words: "我" and "的" are fed into the decoder, and the two highest-probability sequences are selected again, e.g. "我" followed by "看" ("watch"), and so on, terminating when the end mark <s> is encountered;
(3) finally two sequences are obtained, "我在看电影" ("I am watching a movie") and "看电影我" ("watching a movie I"); the former clearly has the larger probability sum, so it is selected as the final result.
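The steps above can be sketched as a small beam search over per-step word distributions. The distributions below are toy assumptions, not the output of the patent's decoder; scoring uses summed log-probabilities, which is equivalent to comparing products of probabilities.

```python
import math

def beam_search(step_probs, k=2, end_token="<s>"):
    """Keep the k highest-scoring partial sequences at each step
    (k is the beam width; the worked example above uses k = 2)."""
    beams = [([], 0.0)]  # (sequence, summed log-probability)
    for probs in step_probs:  # probs: dict word -> p(word | sequence)
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == end_token:
                candidates.append((seq, score))  # finished beams stay as-is
                continue
            for word, p in probs.items():
                candidates.append((seq + [word], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams[0][0]

# Toy per-step distributions over a tiny vocabulary (assumed values).
steps = [
    {"我": 0.4, "的": 0.2, "看": 0.1},
    {"看": 0.5, "我": 0.2, "的": 0.1},
    {"电影": 0.6, "<s>": 0.2},
]
result = "".join(beam_search(steps))
print(result)  # 我看电影 ("I watch a movie")
```

With k = 1 this degenerates to greedy decoding; the text's recommendation of K between 3 and 7 trades decoding speed against fluency.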
The algorithm flow of Beam Search is shown in Table 3:
Table 3. Beam Search algorithm flow
In the Beam Search algorithm, the size of the parameter K strongly affects the decoding speed of the test-stage decoder; its value is generally between 3 and 7 words.
The semantic vector C serves as the semantic compression of the input sequence and the input of the decoder; the limitation of its length prevents it from containing enough useful information. Especially in the automatic summarization task, the sequence input to the encoder is often a long text, and a single C cannot summarize all the information, causing the model accuracy to decline. The embodiment of the present invention uses an attention mechanism so that the semantic vector C retains more semantic information. The attention mechanism is introduced below using machine translation:
In machine translation, each word output by the decoder is influenced differently by the input words; the attention mechanism therefore feeds a different semantic vector C_i at each time step to solve this problem, where C_i is the sum of the products of the encoder hidden-layer vectors h_j and their assigned weights a_ij. In an LSTM sequence, the hidden-layer vector h_i output at each moment is determined by the current input and the current memory cell; the semantic vector is output at the last unit of the encoder, and the attention mechanism takes a weighted sum of the hidden-layer vectors of all moments. Fig. 7 shows the calculation method of the attention mechanism.
When computing the semantic vector C_1 for the word "我" ("I"), the semantic information of "我" in the input sequence should influence it most, so the weight assigned to it should be the largest. Similarly C_2 is most related to "看" ("watch"), so the corresponding a_22 is the largest. How to compute a_ij then becomes an important problem.
In fact, the size of a_ij is learned by the model; it is related to the hidden state of the decoder at stage i−1 and the hidden state of the encoder at stage j. For example, when computing the semantic vector C_2 corresponding to the word "看" ("watch"), the similarity between the first word "我" and the three hidden vectors h_1, h_2, h_3 of the encoder is computed first; this also exploits the linguistic regularity that adjacent words are semantically close. To better introduce how a_ij is computed, the hidden state of the encoder is denoted h_j and the hidden-layer state of the decoder h_i; the formula for C_i is:
C_i = \sum_{j=1}^{V} a_{ij} h_j
where V denotes the total length of the input sequence and h_j is known; the formula for a_ij is as follows:
a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{V} \exp(e_{ik})}
a_ij is the probability-normalized output of a softmax, and e_ij denotes an alignment model measuring the degree of alignment (degree of influence) of the word at position j on the encoder side with respect to the word at position i on the decoder side; in other words, when the decoder generates the word at position i, to what degree it is influenced by the word at position j on the encoder side. There are many ways to compute the alignment model e_ij, and different calculations correspond to different attention models. The simplest and most common alignment model is the dot product: the hidden state h_t output on the decoder side is matrix-multiplied with the hidden state h_s output on the encoder side, the computation being completed through the similarity between the last moment of the word to be predicted and the hidden-state matrix of the input sequence. Common alignment calculations are as follows:
score(h_t, h_s) = e_ij denotes the alignment of the source and target words; common choices include the dot product above, a weight-network (general) mapping, and a concat mapping.
When computing the alignment, every word output on the decoder side must compute its alignment with the input sequence, and all hidden layers of the encoder are used to compute the similarity, so the length of the resulting weight vector a_ij equals the length of the input sequence. The embodiment of the present invention uses the more efficient soft attention mechanism and computes the alignment between words by dot product to determine the strength of the relevance between words.
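The dot-product alignment and softmax normalization described above can be sketched as follows. The encoder states and decoder state are assumed toy values; the sketch shows that the weight vector has one entry per input word and that the context vector is their weighted sum.

```python
import math

def attention_context(dec_state, enc_states):
    """Dot-product alignment: e_j = <h_dec, h_enc_j>; a_j = softmax(e);
    C = sum_j a_j * h_enc_j (the weighted sum described above)."""
    scores = [sum(d * e for d, e in zip(dec_state, h)) for h in enc_states]
    m = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]     # one weight per input word
    dim = len(enc_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, enc_states))
               for k in range(dim)]
    return weights, context

# Toy encoder hidden states for 3 input words and one decoder state
# (assumed values, standing in for Bi-LSTM outputs).
enc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dec = [1.0, 0.0]
w, c = attention_context(dec, enc)
print(w.index(max(w)))  # 0: the most similar input word gets the largest weight
```

Because the decoder state here points in the same direction as the first encoder state, the first input word dominates the context vector, mirroring the C_1/"我" example above.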
To further verify the beneficial effects of the Chinese abstract generation method provided by the embodiment of the present invention, the embodiment is further described below in combination with experimental data. Likewise, the following embodiments are only used to clearly illustrate the technical solution of the present invention and are not intended to limit its protection scope.
0.1 Dataset and preprocessing
The experiment uses LCSTS, a large-scale Chinese short-text summarization dataset from Harbin Institute of Technology built from Sina Weibo. The dataset contains 2 million real Chinese short texts with the abstracts provided by their authors; it is currently the largest Chinese short-text summarization dataset and provides a standard dataset split. Table 4 shows the sizes of the three parts of the dataset.
Table 4. Composition of the LCSTS dataset
The dataset includes three parts:
a) The first part is the main body of the dataset and contains 2,400,591 (short text, abstract) pairs; this part is used to train the abstract-generation model.
b) The second part includes 10,666 manually annotated (short text, abstract) pairs. Each sample is scored from 1 to 5, the score judging the relevance between the short text and the abstract, with 1 the least relevant and 5 the most relevant. This part is randomly sampled from the first part and is used to analyse the distribution of the first part. The samples scored 3, 4 or 5 show much better relevance between the original text and the abstract; it can also be seen that many abstracts contain words that do not appear in the original text, which shows that the task differs from sentence compression. Samples scored 1 or 2 are weakly relevant and read more like headlines or comments than abstracts. Statistics show that the data scored 1 or 2 amount to less than twenty percent and can be filtered out with supervised learning methods.
c) The third part includes 1,106 (short text, abstract) pairs; three annotators judged 2,000 abstract pairs in total, and these data are independent of the first and second parts. Pairs scored 3 or above are selected as the test set of the short-text summarization task.
The data preprocessing stage is particularly important, because the format and normalization of the encoder-side data strongly affect the whole experiment. The PART I portion of LCSTS above is the training data; after the short-text inputs and the summarizing abstracts of the training data are extracted, some of the information needs to be replaced and processed:
Special characters: special characters are removed, mainly punctuation marks and common stop-word modal particles and adversative conjunctions, e.g. "、", ",", "$", "啊", "呃", "和";
Bracketed content, such as [开心] ("[happy]"): because the data come from Weibo, many emoticon tags exist in this form and are removed during preprocessing;
Date label replacement: all dates are replaced with TAG_DATE, e.g. ****年**月**日, ****年**月, etc.;
Hyperlink URLs: replaced with the label TAG_URL;
Number replacement: numbers are replaced with TAG_NUMBER;
English label replacement: English words are replaced with the label TAG_NAME_EN.
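The replacement rules above can be sketched as a single-pass substitution. The regular expressions below are simplified assumptions (real Weibo text needs fuller date and punctuation patterns); a single alternation with named groups is used so that inserted labels such as TAG_URL are not themselves re-matched by the English-word rule.

```python
import re

PATTERN = re.compile(
    r"(?P<url>https?://\S+)"            # hyperlink URLs
    r"|(?P<date>\d{4}年\d{1,2}月\d{1,2}日)"  # dates like ****年**月**日
    r"|(?P<num>\d+)"                    # remaining numbers
    r"|(?P<en>[A-Za-z]+)"               # English words
    r"|(?P<emo>\[[^\]]*\])"             # emoticon tags like [开心]
    r"|(?P<punct>[，。！？、])"          # common Chinese punctuation
)
TAGS = {"url": "TAG_URL", "date": "TAG_DATE", "num": "TAG_NUMBER",
        "en": "TAG_NAME_EN", "emo": "", "punct": ""}

def preprocess(text):
    """Replace each match with the label of the group that matched it."""
    return PATTERN.sub(lambda m: TAGS[m.lastgroup], text)

sample = "2019年8月25日，ABC发布http://t.cn/x [开心] 共3条"
print(preprocess(sample))
```

Doing all replacements in one pass also avoids ordering bugs, e.g. a date rule mangling the digits inside an already-inserted URL label.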
0.2 Evaluation method
The evaluation methods used by the embodiment of the present invention include Rouge-1, Rouge-2 and Rouge-L, where the L in Rouge-L is the initial of LCS (longest common subsequence).
The calculation formula of Rouge-N is as follows:
Rouge\text{-}N = \frac{\sum_{S \in \{RefSummaries\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{RefSummaries\}} \sum_{gram_n \in S} Count(gram_n)}
where gram_n denotes a word n-gram, {Ref Summaries} denotes the reference abstracts, i.e. the standard abstracts obtained in advance, Count_match(gram_n) denotes the number of n-grams occurring simultaneously in the system abstract and the reference abstract, and Count(gram_n) denotes the number of n-grams occurring in the reference abstract.
The calculation formula of Rouge-L is as follows:
R_{lcs} = \frac{LCS(X, Y)}{m}, \quad P_{lcs} = \frac{LCS(X, Y)}{n}, \quad F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}
where LCS(X, Y) is the longest common subsequence length of X and Y, m and n respectively denote the lengths (generally the number of words) of the reference abstract and the automatic abstract, and R_lcs and P_lcs respectively denote recall and precision. The final F_lcs is the Rouge-L score.
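The two formulas above can be sketched as follows. This is a simplified illustration (the n-gram matching uses a plain membership count rather than full clipped counting, and a single reference is assumed), not an official ROUGE implementation.

```python
def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
    ngrams = lambda toks: [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    ref, cand = ngrams(reference), ngrams(candidate)
    match = sum(1 for g in ref if g in cand)  # simplified match count
    return match / len(ref)

def lcs_len(x, y):
    """Longest common subsequence length, for ROUGE-L."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if xi == yj else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.0):
    """F_lcs with R_lcs = LCS/m (reference) and P_lcs = LCS/n (candidate)."""
    lcs = lcs_len(candidate, reference)
    r, p = lcs / len(reference), lcs / len(candidate)
    return (1 + beta**2) * r * p / (r + beta**2 * p) if r + p else 0.0

ref = "今天 天气 很 好".split()     # toy reference abstract (assumed)
cand = "今天 天气 不错".split()     # toy system abstract (assumed)
print(round(rouge_n(cand, ref, 1), 2))  # 0.5: 2 of 4 reference unigrams matched
print(round(rouge_l(cand, ref), 2))
```

Note that ROUGE-N as defined above is recall-oriented (the denominator counts reference n-grams), while ROUGE-L balances recall and precision through the F-measure.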
0.3 Experimental design and analysis of results
After segmenting the first part of the LCSTS dataset with the jieba segmentation package, the 50,000 most frequent words are selected as the encoder vocabulary, and words not appearing in the vocabulary are represented by "UNK". When configuring the decoder, an important parameter is the size of the decoder dictionary; a comparative experiment was run on this parameter, setting the five sizes 2000, 5000, 8000, 11000 and 14000 and choosing the best dictionary size through experiment. The encoder uses a 4-layer bidirectional LSTM with 256 nodes per layer and a batch_size of 64, and a Bucket mechanism is defined: buckets = [(120, 30), ...]. Input sequences are assigned to buckets of different fixed lengths according to their length; sequences shorter than the bucket length are padded with PAD characters, and the length of the headline is limited to 30 words.
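The bucketing described above can be sketched as follows. Only the (120, 30) bucket is taken from the text; the smaller buckets and the PAD token name are illustrative assumptions:

```python
# Bucket list: only (120, 30) is given in the text; the smaller buckets are assumed.
BUCKETS = [(30, 10), (60, 20), (120, 30)]   # (source_length, headline_length)
PAD = "PAD"

def bucket_and_pad(src_tokens, tgt_tokens):
    """Assign a (source, headline) pair to the smallest bucket that fits, padding to fixed lengths."""
    for src_len, tgt_len in BUCKETS:
        if len(src_tokens) <= src_len and len(tgt_tokens) <= tgt_len:
            src = src_tokens + [PAD] * (src_len - len(src_tokens))
            tgt = tgt_tokens + [PAD] * (tgt_len - len(tgt_tokens))
            return (src_len, tgt_len), src, tgt
    raise ValueError("sequence longer than the largest bucket")
```

Grouping sequences of similar length this way limits the amount of PAD filler inside each batch, which is the motivation for buckets in Seq2Seq training.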
The methods compared are as follows:
a) Tf-idf: the baseline method for extractive summarization.
b) ABS system: the baseline method for abstractive summarization.
c) Our+att: ordinary word-vector input + attention mechanism.
d) Our(S)+att: stroke-encoded word vectors + attention mechanism.
Table 5 and Table 6 present the experimental results and Rouge scores of the four groups of experiments with the decoder dictionary size set to 8000 (the latter two groups are experimental results of the embodiment of the present invention).
Table 5 Abstract comparison of the different methods
Table 6 Rouge evaluation results
Model | Rouge-1 (%) | Rouge-2 (%) | Rouge-L (%) |
---|---|---|---|
Tf-idf | 27.30 | 24.30 | 26.76 |
ABS system | 24.26 | 15.22 | 24.11 |
Our+att | 24.83 | 15.61 | 22.19 |
Our(S)+att | 25.08 | 17.05 | 22.77 |
In Table 5, the method provided by the embodiment of the present invention is compared with the extractive baseline Tf-idf and the abstractive baseline ABS system on the most representative example extracted from the test set. Comparing the target sentences, it can be found that all three models summarize this passage reasonably well. The sentences output by the model of the embodiment of the present invention are better than those of Tf-idf and the ABS system and are semantically more complete; moreover, the model generates "North China", a new word that does not appear in the original text. This enables it to highly summarize the region "central and northern Shanxi, central Hebei, the Beijing-Tianjin area, northeastern Inner Mongolia, western and southern Liaoning, central Jilin, north-central Heilongjiang and other areas"; compared with the two baseline methods, the description given by the model of the embodiment of the present invention is more complete.
In Table 6, comparing the Rouge-1, Rouge-2 and Rouge-L scores of the four methods shows that the abstractive methods score lower than the extractive method under Rouge. This is mainly because the Rouge criterion is based on word-level similarity, so an abstractive method may well receive a poor Rouge score even when its actual output is good. For this reason, the experimental section of the embodiment of the present invention uses both sample display and Rouge scoring to characterize model performance. Table 7 compares the influence of the decoder dictionary size on the model of the embodiment of the present invention.
Table 7 Rouge scores of different decoder dictionary sizes
In Table 7, Ours(S) denotes the Seq2Seq model trained with the stroke encoding of the embodiment of the present invention. When the decoder dictionary size is 2k, training the model with the stroke-based encoding brings a huge improvement over training with ordinary word vectors, while as the dictionary size grows, the Rouge score rises by only 4-5 points on average. This shows that when the dictionary is incomplete, the stroke-based encoding combines n-gram information over strokes, the smallest units of Chinese characters, to encode "rare words" absent from the dictionary; after stroke encoding is added, the model's dependence on the dictionary is reduced and its performance is clearly improved.
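The stroke n-gram encoding discussed here can be sketched as follows. The five-way stroke-to-ID mapping (1 horizontal, 2 vertical, 3 left-falling, 4 right-falling, 5 turning) follows the common cw2vec-style convention; the sample stroke dictionary and the n-gram range are illustrative assumptions:

```python
# Stroke IDs: 1 horizontal, 2 vertical, 3 left-falling, 4 right-falling, 5 turning
# (a cw2vec-style convention; the two-entry dictionary is purely illustrative).
STROKE_DICT = {
    "大": [1, 3, 4],   # horizontal, left-falling, right-falling
    "人": [3, 4],      # left-falling, right-falling
}

def stroke_ngrams(word, n_min=3, n_max=12):
    """Split a word into characters, concatenate their stroke IDs, and emit all n-grams."""
    ids = []
    for ch in word:
        ids.extend(STROKE_DICT[ch])      # stroke sequence IDization
    grams = []
    for n in range(n_min, min(n_max, len(ids)) + 1):
        for i in range(len(ids) - n + 1):
            grams.append(tuple(ids[i:i + n]))
    return grams

print(stroke_ngrams("大人"))
```

Each n-gram acts as a sub-character feature, so a word absent from the dictionary can still be composed from the stroke n-grams it shares with known words.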
In summary, the embodiment of the present invention implements a more concise and accurate Chinese abstract generation method through a series of natural language processing techniques. First, aimed at the structural characteristics of Chinese, a stroke-based text vector encoding is proposed: a stroke dictionary is constructed and text vectors are built through the Skip-Gram model, accomplishing a finer-grained representation of Chinese character component information. Second, text generation is optimized with a Seq2Seq model: a Bi-LSTM is used in the encoder, which to a certain extent alleviates the loss of information in long sequences and the back-to-front information problem; an attention mechanism captures the strength of association between input and output words; and Beam Search is used in the decoder at the test stage to optimize the generated output. Finally, the model is trained on the LCSTS dataset, and both Rouge scoring and human judgment confirm that the encoding method and model of the embodiment of the present invention clearly improve the readability of text summaries.
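The Beam Search decoding mentioned above can be sketched as follows. The toy next-word distribution, the beam width and the special tokens are illustrative assumptions, not part of the embodiment:

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=30):
    """Beam Search: keep the beam_width best partial sequences by log-probability at each step.

    step_fn(sequence) returns the candidate next tokens as (token, probability) pairs.
    """
    beams = [([start_token], 0.0)]        # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, prob in step_fn(seq):
                cand = (seq + [tok], score + math.log(prob))
                if tok == end_token:
                    finished.append(cand)  # completed hypothesis
                else:
                    candidates.append(cand)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    finished.extend(beams)                # include unfinished beams as a fallback
    return max(finished, key=lambda c: c[1])[0]

# Toy next-word distribution (purely illustrative).
def toy_step(seq):
    table = {
        "<s>": [("a", 0.6), ("b", 0.4)],
        "a": [("</s>", 1.0)],
        "b": [("c", 1.0)],
        "c": [("</s>", 1.0)],
    }
    return table[seq[-1]]

print(beam_search(toy_step, "<s>", "</s>"))
# → ['<s>', 'a', '</s>']
```

Unlike greedy decoding, which commits to the single most probable word at each step, the beam keeps several partial hypotheses alive and so can recover a globally better word sequence.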
The embodiment of the invention also provides a Chinese abstract generation system comprising a processor and a memory, the memory storing a program that can be loaded and executed by the processor to perform the steps of the foregoing method.
The embodiment of the invention provides a computer-readable storage medium on which a computer program is stored, the program implementing the steps of the foregoing method when executed by a processor.
Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical memory) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make several improvements and modifications without departing from the technical principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A Chinese abstract generation method, characterized in that the method comprises the following steps:
obtaining a target text and separating the Chinese characters in the target text into stroke sequences;
determining the Chinese word vector sequence of the target text according to the stroke sequences;
inputting the Chinese word vector sequence into a pre-trained encoder to generate a semantic vector;
recombining, from the semantic vector, the full-text semantics best suited to the current moment, and sending the recombined intermediate semantics summarizing the full text to a pre-trained decoder;
the decoder inferring the distribution of the word at the next moment from the word predicted at the previous moment and the intermediate semantics summarizing the full text, the finally generated word sequence being the abstract of the target text.
2. The Chinese abstract generation method according to claim 1, characterized in that the method for determining the Chinese word vector sequence of the target text comprises:
performing n-gram cutting on the stroke sequences to obtain the n-gram information in the Chinese character strokes;
predicting the context of the center word with a Skip-Gram model according to the n-grams to obtain the corresponding Chinese word vector sequence.
3. The Chinese abstract generation method according to claim 2, characterized in that the method for obtaining the n-gram information in the Chinese character strokes comprises:
splitting words into characters and finding the stroke sequence corresponding to each character;
converting the stroke sequences into IDs;
performing n-gram cutting on the ID-converted stroke sequences to obtain the n-gram information.
4. The Chinese abstract generation method according to claim 1, characterized in that the encoder uses a bidirectional long short-term memory neural network.
5. The Chinese abstract generation method according to claim 4, characterized in that the method for generating the semantic vector comprises:
inputting the Chinese word vector sequence into the bidirectional long short-term memory neural network in the forward and reverse directions respectively, obtaining, for each word, the two hidden states corresponding to the two orderings;
concatenating the two hidden states end-to-end to generate the semantic vector.
6. The Chinese abstract generation method according to claim 1, characterized in that the method for recombining from the semantic vector the full-text semantics best suited to the current moment comprises:
adding an attention mechanism to calculate the influence weights of the different input words on the decoder side when the encoder generates the sentence semantic vector;
recombining the full-text semantic information best suited to the current moment according to the influence weights of the input words on the decoder side, in combination with the hidden state fed back by the decoder.
7. The Chinese abstract generation method according to claim 1, characterized in that the method further comprises optimizing the generated word sequence with a beam search algorithm.
8. The Chinese abstract generation method according to claim 1, characterized in that the method further comprises preprocessing the target text, comprising:
removing special characters, the special characters including punctuation marks, stop modal particles and adversative words;
replacing all dates with TAG_DATE;
replacing hyperlink URLs with the tag TAG_URL;
replacing numbers with TAG_NUMBER;
replacing English words with TAG_NAME_EN.
9. A Chinese abstract generation system, characterized by comprising a processor and a memory, the memory storing a program that can be loaded and executed by the processor to perform the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910787889.5A CN110532554B (en) | 2019-08-26 | 2019-08-26 | Chinese abstract generation method, system and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110532554A true CN110532554A (en) | 2019-12-03 |
CN110532554B CN110532554B (en) | 2023-05-05 |
Family
ID=68664157
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061861A (en) * | 2019-12-12 | 2020-04-24 | 西安艾尔洛曼数字科技有限公司 | XLNET-based automatic text abstract generation method |
CN111078865A (en) * | 2019-12-24 | 2020-04-28 | 北京百度网讯科技有限公司 | Text title generation method and device |
CN111191451A (en) * | 2019-12-30 | 2020-05-22 | 苏州思必驰信息科技有限公司 | Chinese sentence simplification method and device |
CN111639174A (en) * | 2020-05-15 | 2020-09-08 | 民生科技有限责任公司 | Text abstract generation system, method and device and computer readable storage medium |
CN111666759A (en) * | 2020-04-17 | 2020-09-15 | 北京百度网讯科技有限公司 | Method and device for extracting key information of text, electronic equipment and storage medium |
CN111723196A (en) * | 2020-05-21 | 2020-09-29 | 西北工业大学 | Single document abstract generation model construction method and device based on multi-task learning |
CN111930940A (en) * | 2020-07-30 | 2020-11-13 | 腾讯科技(深圳)有限公司 | Text emotion classification method and device, electronic equipment and storage medium |
CN112115256A (en) * | 2020-09-15 | 2020-12-22 | 大连大学 | Method and device for generating news text abstract integrated with Chinese stroke information |
CN112364225A (en) * | 2020-09-30 | 2021-02-12 | 昆明理工大学 | Judicial public opinion text summarization method combining user comments |
CN112560456A (en) * | 2020-11-03 | 2021-03-26 | 重庆安石泽太科技有限公司 | Generation type abstract generation method and system based on improved neural network |
CN112765976A (en) * | 2020-12-30 | 2021-05-07 | 北京知因智慧科技有限公司 | Text similarity calculation method, device and equipment and storage medium |
CN113254573A (en) * | 2020-02-12 | 2021-08-13 | 北京嘀嘀无限科技发展有限公司 | Text abstract generation method and device, electronic equipment and readable storage medium |
CN113449105A (en) * | 2021-06-25 | 2021-09-28 | 上海明略人工智能(集团)有限公司 | Work summary generation method, system, electronic device and medium |
CN113609863A (en) * | 2021-02-04 | 2021-11-05 | 腾讯科技(深圳)有限公司 | Method, device and computer equipment for training and using data conversion model |
CN114553803A (en) * | 2022-01-21 | 2022-05-27 | 上海鱼尔网络科技有限公司 | Quick reply method, device and system for instant messaging |
WO2022142121A1 (en) * | 2020-12-31 | 2022-07-07 | 平安科技(深圳)有限公司 | Abstract sentence extraction method and apparatus, and server and computer-readable storage medium |
CN116049385A (en) * | 2023-04-03 | 2023-05-02 | 北京太极信息系统技术有限公司 | Method, device, equipment and platform for generating information and create industry research report |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804495A (en) * | 2018-04-02 | 2018-11-13 | 华南理工大学 | A kind of Method for Automatic Text Summarization semantic based on enhancing |
CN109885673A (en) * | 2019-02-13 | 2019-06-14 | 北京航空航天大学 | A kind of Method for Automatic Text Summarization based on pre-training language model |
Non-Patent Citations (1)
Title |
---|
Shen Huadong et al.: "AM-BRNN: An automatic text summarization extraction model based on deep learning", Journal of Chinese Computer Systems (《小型微型计算机系统》) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||