CN110472238A - Text summarization method based on hierarchical interaction attention - Google Patents
Text summarization method based on hierarchical interaction attention Download PDF Info
- Publication number
- CN110472238A CN110472238A CN201910677195.6A CN201910677195A CN110472238A CN 110472238 A CN110472238 A CN 110472238A CN 201910677195 A CN201910677195 A CN 201910677195A CN 110472238 A CN110472238 A CN 110472238A
- Authority
- CN
- China
- Prior art keywords
- vector
- layer
- lstm
- output
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 31
- 230000003993 interaction Effects 0.000 title claims abstract description 25
- 238000012549 training Methods 0.000 claims description 15
- 238000012360 testing method Methods 0.000 claims description 11
- 238000012512 characterization method Methods 0.000 claims description 10
- 238000011161 development Methods 0.000 claims description 8
- 230000006870 function Effects 0.000 claims description 4
- 230000004927 fusion Effects 0.000 claims description 4
- 230000007787 long-term memory Effects 0.000 claims description 2
- 238000012216 screening Methods 0.000 claims description 2
- 238000004519 manufacturing process Methods 0.000 abstract description 6
- 239000000284 extract Substances 0.000 abstract description 3
- 238000003058 natural language processing Methods 0.000 abstract description 2
- 238000011156 evaluation Methods 0.000 description 7
- 238000013461 design Methods 0.000 description 6
- 238000013527 convolutional neural network Methods 0.000 description 3
- 238000002474 experimental method Methods 0.000 description 3
- 238000000605 extraction Methods 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000008451 emotion Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000003780 insertion Methods 0.000 description 1
- 230000037431 insertion Effects 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 230000015654 memory Effects 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 238000000746 purification Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The present invention relates to a text summarization method based on hierarchical interaction attention, belonging to the field of natural language processing. The invention extracts feature information from different levels of the encoder through hierarchical interaction attention to guide summary generation, and uses a variational information bottleneck to compress noise in the data, avoiding the information redundancy introduced by fusing features from different levels. For abstractive text summarization, under an attention-based encoder-decoder framework, multi-layer contextual information from the encoder is extracted through the attention mechanism to guide the decoding process, while a variational information bottleneck is introduced to constrain the information, improving the quality of abstractive summaries. Experimental results show that the method significantly improves the performance of the encoder-decoder framework on abstractive summarization tasks.
Description
Technical field
The present invention relates to a text summarization method based on hierarchical interaction attention, belonging to the field of natural language processing.
Background art
With the development of deep learning, abstractive text summarization has become a popular research topic. Traditional attention-based encoder-decoder models usually consider only the high-level semantic information of the encoder as the semantic representation of the context, ignoring fine-grained details such as word-level structure captured by the lower layers of the network. The present invention proposes a multi-layer feature extraction and fusion method based on a hierarchical interaction attention mechanism that obtains features from different levels of the encoder, while introducing a variational information bottleneck at the decoder to compress and denoise the fused information, thereby generating higher-quality summaries.
Summary of the invention
The present invention provides a text summarization method based on hierarchical interaction attention. It obtains features from different levels of the encoder while introducing a variational information bottleneck at the decoder to compress and denoise the fused information, thereby generating higher-quality summaries: when generating a summary, the model attends not only to the high-level abstract features of the encoder but also extracts low-level detail, improving the quality of the generated summary.
The technical scheme of the invention is a text summarization method based on hierarchical interaction attention, whose specific steps are as follows:
Step 1: Use the English Gigaword dataset from the text summarization field as the training corpus. Preprocess the dataset with a preprocessing script, obtaining 3.8 million training pairs and 189,000 development pairs; each training sample consists of an input text and a summary sentence.
As a preferred embodiment of the invention, the specific steps of Step 1 are as follows: standardize the data, including converting all words of the dataset to lowercase, replacing all numbers with #, and replacing words that occur fewer than 5 times in the corpus with a UNK token; after removal and filtering, a portion of the development set is selected as the test set.
Step 2: Encode the training set with a bidirectional LSTM, with the number of layers set to three. The encoder uses a bidirectional long short-term memory network (Bi-Directional LSTM, BiLSTM). A BiLSTM consists of a forward and a backward LSTM: the forward LSTM reads the input sequence from left to right to obtain the forward encoding vectors, the backward LSTM reads the sequence from right to left to obtain the backward encoding vectors, and finally the forward and backward encoding vectors are concatenated to form the vector representation of the input sequence.
Step 3: The decoder uses a unidirectional LSTM network, reads the sentence to be decoded, and computes the context vector for each layer. The decoder is initialized with the final state vector of the encoder and then generates the summary sequence word by word from the input context representation, where the length of the generated summary must be less than or equal to the length of the input sequence. At each decoding step, the decoder reads the word embedding of the previous target word, the previous hidden state vector, and the context vector of the current step, and produces the hidden state vector of the current step. An attention mechanism is introduced: the context vector of the current step is computed from the previous decoder hidden state and the encoding vectors; the output vector of the current step is then computed from the current context vector and hidden state vector, and from that output vector the output probability over the predefined target vocabulary is computed.
Step 4: For the multi-layer encoder-decoder model, the encoder and decoder each contain multiple LSTM layers. Within each LSTM layer, the hidden state representation between the upper layer and the current layer is computed, so that the context vector of the upper layer is fused into the current layer.
As a preferred embodiment of the invention, the specific steps of Step 4 are as follows:
Step 4.1: Fuse the context vector and hidden state vector of the upper layer as the input of the current layer;
Step 4.2: Feed the input of the current layer into the LSTM to obtain the output of the current-layer network;
Step 4.3: Compute the output vector of the last layer of the multi-layer decoder network, and from it the probability distribution of the target output over the vocabulary.
Step 5: Concatenate the context vectors of each layer, which carry layer-specific feature information, with the output of the current layer to obtain the decoder hidden state of the current layer.
As a preferred embodiment of the invention, the specific steps of Step 5 are as follows:
Step 5.1: At the current layer of the network, concatenate the context vectors obtained from each layer to obtain the cross-layer fused context vector and decoder hidden state, which contain feature information from different levels of the encoder;
Step 5.2: Compute the output vector from the decoder hidden state and the context vector, and from it the output probability over the vocabulary.
Step 6: Because fusing contextual information from different levels introduces redundancy and noise, compress and denoise the data with a variational information bottleneck.
As a preferred embodiment of the invention, the specific steps of Step 6 are as follows:
Step 6.1: Given an input sequence, the encoder-decoder model generates the summary sequence by computing its probability;
Step 6.2: Learn the model parameters by maximizing the log-likelihood of the generated summary;
Step 6.3: Introduce the information bottleneck as an intermediate representation of the encoding, and construct the loss from the intermediate representation to the output sequence as the cross-entropy loss of classification;
Step 6.4: Add a constraint requiring the KL divergence (Kullback-Leibler divergence) between the probability distribution and the standard normal distribution to be as small as possible.
The beneficial effects of the present invention are:
1. The present invention proposes an encoder-decoder model based on a hierarchical interaction attention mechanism that obtains semantic information from different levels through attention, improving the quality of generated summaries.
2. The present invention is the first to apply the variational information bottleneck to the summarization task, compressing and denoising the data, which helps reduce the redundancy and noise introduced by fusing contextual information from different levels.
3. The present invention proposes a hierarchical interaction attention mechanism that extracts features from different levels of the encoder, so that summary generation attends not only to the high-level abstract features of the encoder but also extracts low-level detail, improving summary quality.
Description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the attention-based encoder-decoder framework proposed by the present invention;
Fig. 3 is the inner-layer fusion mechanism proposed by the present invention;
Fig. 4 is the cross-layer fusion mechanism proposed by the present invention.
Specific embodiment
Embodiment 1: As shown in Figs. 1-4, the specific steps of the text summarization method based on hierarchical interaction attention are as follows:
Step 1: Use the English Gigaword dataset as the training corpus. Preprocess the dataset with a preprocessing script, obtaining 3.8 million training pairs and 189,000 development pairs; each training sample consists of an input text and a summary sentence;
Step 2: The encoder encodes the training set with a bidirectional LSTM, with the number of layers set to three;
Step 3: The decoder uses a unidirectional LSTM network, reads the sentence to be decoded, and computes the context vector for each layer;
Step 4: For the multi-layer encoder-decoder model, the encoder and decoder each contain multiple LSTM layers; within each LSTM layer, the hidden state representation between the upper layer and the current layer is computed, so that the context vector of the upper layer is fused into the current layer;
Step 5: Concatenate the context vectors of each layer, which carry layer-specific feature information, with the output of the current layer to obtain the decoder hidden state of the current layer;
Step 6: Because fusing contextual information from different levels introduces redundancy and noise, compress and denoise the data with a variational information bottleneck (Variational Information Bottleneck, VIB).
As a preferred embodiment of the invention, the specific steps of Step 1 are as follows: standardize the data, including converting all words of the dataset to lowercase, replacing all numbers with #, and replacing words that occur fewer than 5 times in the corpus with a UNK token. From the 189,000 development samples, 8,000 are randomly selected as the development set and 2,000 as the test set. Sentences whose original text is shorter than 5 words are removed from the test set, leaving 1,951 samples after filtering. To verify the generalization ability of the model, the present invention also selects DUC2004 as a test set. The DUC2004 dataset contains only 500 texts, each input text having 4 reference summary sentences.
The design of this preferred embodiment is an important component of the invention; it mainly describes the corpus collection process and provides the data support for the invention.
As a preferred embodiment of the invention, the specific steps of Step 2 are as follows:
The encoder of the invention uses a bidirectional long short-term memory network (Bi-Directional LSTM, BiLSTM). Compared with a single LSTM, a BiLSTM consists of a forward and a backward LSTM: the forward LSTM reads the input sequence from left to right to obtain the forward encoding vectors hf_i, and the backward LSTM reads the sequence from right to left to obtain the backward encoding vectors hb_i, as shown in formulas (1), (2):
hf_i = LSTMf(x_i, hf_{i-1}) (1)
hb_i = LSTMb(x_i, hb_{i+1}) (2)
where LSTMf and LSTMb denote the forward and backward LSTM networks respectively. Finally, the forward and backward encoding vectors are concatenated to obtain the vector representation of the input sequence, h_i = [hf_i; hb_i].
The design of this preferred embodiment is an important component of the invention, mainly the encoding process. Modeling a sentence with a single LSTM has the problem that information cannot be encoded from back to front. In fine-grained classification, such as the five-class task of strong praise, weak praise, neutral, weak criticism, and strong criticism, attention must be paid to the interaction between sentiment words, degree words, and negation words; the BiLSTM captures such bidirectional semantic dependencies better.
As a preferred embodiment of the invention, the specific steps of Step 3 are as follows:
Step 3.1: The decoder uses a unidirectional LSTM network, where s denotes the start of the sequence. At time t0, the decoder reads s and the final state vector of the encoder to predict the output probability of y1; it then generates the summary sequence word by word from the input context representation, where the length of the generated summary must be less than or equal to the length of the input sequence;
Step 3.2: At decoding step t, the decoder reads the word embedding vector w_{t-1} of the target word at time t-1, the hidden state vector s_{t-1}, and the context vector c_t to generate the hidden state vector s_t at time t, as shown in formula (3):
s_t = LSTM(w_{t-1}, s_{t-1}, c_t) (3)
Step 3.3: As shown in Fig. 2, the decoder introduces an attention mechanism: the context vector c_t at time t is computed from the decoder hidden state s_{t-1} at time t-1 and the encoding vectors h_i. The detailed process is shown in formulas (4), (5), (6):
e_{t,i} = v^T tanh(W_h h_i + W_s s_{t-1}) (4)
a_{t,i} = exp(e_{t,i}) / Σ_j exp(e_{t,j}) (5)
c_t = Σ_i a_{t,i} h_i (6)
Step 3.4: The output vector p_t at time t is then computed from the context vector c_t and hidden state vector s_t, and from p_t the output probability P_vocab,t over the predefined target vocabulary, as shown in formulas (7), (8):
p_t = tanh(W_m([s_t; c_t]) + b_m) (7)
P_vocab,t = softmax(W_p p_t + b_p) (8)
The design of this preferred embodiment is an important component of the invention, mainly the decoding process. The LSTM avoids the long-range dependency problem: for an LSTM, remembering information over long spans is the default behavior rather than something difficult to learn.
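The attention step of Step 3.3 can be illustrated as follows. This is a hedged sketch: scalar toy states and a plain dot-product score replace the learned scoring function of formula (4); only the softmax normalization of formula (5) and the weighted sum of formula (6) follow the text.

```python
import math

def attention_context(s_prev, encoder_states):
    """One attention step: score each encoder state against the previous
    decoder state, softmax the scores into weights a_{t,i}, and return the
    weighted sum of encoder states as the context vector c_t."""
    scores = [s_prev * h for h in encoder_states]             # stand-in for e_{t,i}
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]                  # numerically stable softmax
    z = sum(exps)
    alphas = [e / z for e in exps]                            # a_{t,i}, sums to 1
    c_t = sum(a * h for a, h in zip(alphas, encoder_states))  # weighted sum of h_i
    return c_t, alphas

c, a = attention_context(1.0, [0.2, 0.8, -0.1])
```

The weights form a distribution over input positions, so the context vector is always a convex combination of the encoder states, with the best-matching state weighted most heavily.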
As a preferred embodiment of the invention, the specific steps of Step 4 are as follows:
Inner-layer fusion mechanism:
The inner-layer fusion mechanism (Inner-Layer Merge) fuses the upper-layer context vector into the encoding of the current layer, realizing the fusion of multi-level encoder information.
Step 4.1: Fuse the context vector and hidden state vector of layer k-1 as the input of layer k, as shown in formulas (9), (10), (11), where c_t^{k-1} is the context vector obtained at layer k-1 and s_t^{k-1} is the hidden state vector of layer k-1; the input vector of layer k is obtained by this computation;
Step 4.2: The input is then fed into the layer-k LSTM to obtain the output of the layer-k network;
Step 4.3: Compute the output vector p_t of the last layer of the multi-layer decoder network, and finally the probability distribution P_vocab of the target output over the vocabulary.
The design of this preferred embodiment is an important component of the invention. The multi-layer feature extraction and fusion method based on the hierarchical interaction attention mechanism obtains features from different levels of the encoder, solving the problem that traditional attention-based encoder-decoder models usually consider only the high-level semantic information of the encoder as the context representation and ignore fine-grained details such as word-level structure captured by the lower layers, thereby generating higher-quality summaries.
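A minimal sketch of the inner-layer fusion idea follows. The exact fusion of formulas (9)-(11) is not reproduced here; vector concatenation of the lower layer's context and hidden vectors into the next layer's input is used as an illustrative stand-in, and the toy per-layer transformation is invented purely so the example runs.

```python
def inner_layer_merge(ctx_prev, hid_prev):
    """Step 4.1 stand-in: fuse the context vector and hidden state vector of
    layer k-1 into the input of layer k by concatenation, [c^{k-1} ; s^{k-1}]."""
    return ctx_prev + hid_prev  # list concatenation

def run_layers(x0, num_layers):
    """Pass a toy context/hidden pair up through num_layers layers, each layer
    consuming the fused input of the layer below (Steps 4.1-4.2)."""
    ctx, hid = x0, [0.0] * len(x0)
    for _ in range(num_layers):
        fused = inner_layer_merge(ctx, hid)
        # Toy 'LSTM layer': halve the fused input for the new context,
        # negate it for the new hidden state (invented transformation).
        ctx = [v * 0.5 for v in fused]
        hid = [-v for v in fused]
    return ctx, hid

ctx, hid = run_layers([1.0, 2.0], 3)
```

In a real implementation the fused vector would be projected back to the layer width rather than growing by concatenation; the doubling here just makes the per-layer fusion visible.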
As a preferred embodiment of the invention, the specific steps of Step 5 are as follows:
Cross-layer fusion mechanism:
The cross-layer fusion mechanism (Cross-Layer Merge) fuses the obtained multi-layer context vectors at the last layer, as shown in Fig. 4.
Step 5.1: At layer r of the network, concatenate the context vectors obtained from each layer to obtain the cross-layer fused context vector c_t and decoder hidden state s_t, which contain feature information from different levels of the encoder;
Step 5.2: Finally, compute the output vector p_t from s_t and c_t, as shown in formulas (12), (13), (14):
p_t = tanh(W_m([s_t; c_t]) + b_m) (14)
Finally, the output probability P_t,vocab of p_t over the vocabulary is computed.
The design of this preferred embodiment is an important component of the invention. The multi-layer feature extraction and fusion method based on the hierarchical interaction attention mechanism obtains features from different levels of the encoder, solving the problem that traditional attention-based encoder-decoder models usually consider only the high-level semantic information of the encoder as the context representation and ignore fine-grained details such as word-level structure captured by the lower layers, thereby generating higher-quality summaries.
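The cross-layer fusion of Step 5.1 can be sketched as follows. Concatenation of toy scalar-list vectors stands in for formulas (12)-(13): every layer contributes its context vector and hidden state, and the last layer stitches them together.

```python
def cross_layer_merge(layer_contexts, layer_hiddens):
    """Step 5.1 stand-in: at the last layer, concatenate the context vectors
    obtained from every layer into one fused context vector c_t, and the
    per-layer hidden states into one decoder hidden state s_t, so that c_t
    and s_t carry feature information from all encoder levels."""
    c_t = [v for ctx in layer_contexts for v in ctx]
    s_t = [v for hid in layer_hiddens for v in hid]
    return c_t, s_t

# Three layers, each producing a 2-dimensional context and hidden state.
contexts = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
hiddens  = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
c_t, s_t = cross_layer_merge(contexts, hiddens)
```

Unlike the inner-layer mechanism, which fuses adjacent layers step by step, here all layers' contexts survive intact into the final representation, which is why the fused vector grows with the number of layers.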
As a preferred embodiment of the invention, the specific steps of Step 6 are as follows:
The variational information bottleneck introduces Z as an intermediate representation of the source input X in the classification task from X to Y, constructing the information bottleneck R_IB(θ) from X → Z → Y, as shown in formulas (15), (16):
R_IB(θ) = I(Z, Y; θ) - βI(Z, X; θ) (15)
where I(Z, Y; θ) denotes the mutual information between Y and Z. The goal is to use mutual information as a measure of information content and learn the distribution of the encoding Z so that as little information as possible flows from X to Y, forcing the model to pass only the most important information through the bottleneck and ignore task-irrelevant information, thereby achieving de-redundancy and denoising.
For the summarization task, given an input sequence x, the encoder-decoder model generates the summary sequence y by computing the probability P_θ(y|x), where θ denotes the model parameters such as the weight matrices W and biases b, as shown in formula (17):
P_θ(y|x) = Π_t P_θ(y_t | y_<t, x) (17)
where y_<t = (y_1, y_2, … y_{t-1}) denotes all words decoded before time t. As shown in formula (18), the model learns the parameters θ by maximizing the log-likelihood of the generated summary:
Loss = -logP_θ(y|x) (18)
Therefore, in the traditional encoder-decoder model, the information bottleneck z = f(x, y_<t) is introduced as the intermediate representation of the encoding, and the loss from the intermediate representation z to the output sequence y is constructed as the cross-entropy loss of classification, as shown in formula (19):
Loss = -Σ_t logP_θ(y_t | z_t) (19)
At the same time, a constraint is added requiring the KL divergence (Kullback-Leibler divergence) between the distribution P_θ(z|x) and the standard normal distribution Q(z) to be as small as possible. After adding the VIB, the training loss function is as shown in formula (20):
Loss = -Σ_t logP_θ(y_t | z_t) + λ KL(P_θ(z_t|x) || Q(z_t)) (20)
where λ is a hyperparameter set to 1e-3.
The design of this preferred embodiment introduces a variational information bottleneck to compress and denoise the data, helping to reduce the redundancy and noise introduced by fusing contextual information from different levels.
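The VIB training loss of formula (20) can be worked through numerically. This is a sketch under stated assumptions: per-word log-probabilities are supplied directly, and the bottleneck posterior is taken to be a diagonal Gaussian, for which KL(N(mu, sigma^2) || N(0, 1)) has the well-known closed form used below.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over dimensions --
    the constraint in formula (20) pushing the bottleneck posterior toward
    the standard normal prior Q(z)."""
    return sum(0.5 * (m * m + s * s - 1.0 - 2.0 * math.log(s))
               for m, s in zip(mu, sigma))

def vib_loss(target_log_probs, mu, sigma, lam=1e-3):
    """Training loss with the variational information bottleneck: the
    cross-entropy of the generated words (formula (19)) plus lam times the
    KL term, with lam = 1e-3 as stated in the text."""
    nll = -sum(target_log_probs)  # -sum_t log P(y_t | z_t)
    return nll + lam * kl_to_standard_normal(mu, sigma)

# Two decoded words with probabilities 0.5 and 0.25, posterior exactly N(0, I)
# so the KL penalty vanishes and the loss reduces to the cross-entropy.
loss = vib_loss([math.log(0.5), math.log(0.25)], mu=[0.0, 0.0], sigma=[1.0, 1.0])
```

Because the KL term is zero only when the posterior equals the prior, any information the model keeps in z costs λ-weighted nats, which is how the bottleneck forces task-irrelevant detail to be discarded.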
Step 7: To verify the effect of the invention, the experimental datasets, evaluation metrics, detailed parameter settings, and benchmark models for comparison are introduced below, and the experimental results are analyzed and discussed.
The experiments use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score as the evaluation metric of the model. ROUGE, an automatic summarization evaluation metric proposed by Lin et al., evaluates the quality of a summary using the n-gram co-occurrence information between the generated summary and the reference summary:
ROUGE-N = Σ_{S∈{Gold}} Σ_{n-gram∈S} Count_match(n-gram) / Σ_{S∈{Gold}} Σ_{n-gram∈S} Count(n-gram)
where n-gram denotes a word n-gram, {Gold} denotes the reference summaries, Count_match(n-gram) denotes the number of n-grams co-occurring in the generated summary and the reference summary, and Count(n-gram) denotes the number of n-grams appearing in the reference summary. The present invention computes ROUGE scores with the pyrouge script, and finally selects the Rouge-1 (unigram), Rouge-2 (bigram), and Rouge-L (longest common subsequence) scores as the evaluation metrics of model performance.
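The ROUGE-N recall described above can be computed directly. This is a minimal sketch of the formula, not the pyrouge script the experiments use: candidate n-gram counts are clipped against each reference, and the total reference n-gram count is the denominator.

```python
from collections import Counter

def rouge_n_recall(candidate, references, n=1):
    """ROUGE-N recall: clipped count of candidate n-grams co-occurring in the
    references, divided by the total number of n-grams in the references."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    match = total = 0
    for ref in references:
        r = ngrams(ref)
        match += sum(min(cnt, cand[g]) for g, cnt in r.items())  # Count_match
        total += sum(r.values())                                  # Count
    return match / total if total else 0.0

score = rouge_n_recall("the cat sat".split(),
                       ["the cat sat on the mat".split()])
```

Here the candidate covers 3 of the reference's 6 unigrams ("the" only once, clipped), giving a ROUGE-1 recall of 0.5.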
Both the encoder and decoder use 3 LSTM layers; the encoder is a bidirectional LSTM and the decoder a unidirectional LSTM. The hidden state size of both encoder and decoder is set to 512. To reduce the number of model parameters, the encoder and decoder share the word embedding layer. The word embedding dimension is set to 512; the present invention does not use pre-trained word vectors such as Word2vec, GloVe, or BERT, but randomly initializes the embedding layer. The vocabulary size of the encoder and decoder is set to 50k, with out-of-vocabulary words replaced by UNK. Other parameter settings are as follows: dropout is 0.3, the optimizer is Adam, and the batch size is set to 64. To improve the quality of generated summaries, the invention uses a beam search strategy at inference time, with the beam size set to 12.
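The beam search strategy used at inference can be sketched as follows. This is an illustrative toy, not the system's decoder: the step function, vocabulary, and beam size here are invented (the experiments use beam size 12), but the core loop — expand every unfinished hypothesis, then keep only the highest log-probability prefixes — follows the standard strategy named in the text.

```python
import math

def beam_search(step_fn, start, beam_size, max_len):
    """Keep the beam_size best (prefix, log-prob) hypotheses at every step.
    step_fn(prefix) returns candidate (token, prob) continuations; a prefix
    ending in '</s>' is finished and carried forward unchanged."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            if prefix[-1] == "</s>":              # finished hypothesis
                candidates.append((prefix, lp))
                continue
            for tok, p in step_fn(prefix):
                candidates.append((prefix + [tok], lp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy distribution: always prefers "b" over "a", then emits end-of-sequence.
def toy_step(prefix):
    if len(prefix) >= 3:
        return [("</s>", 1.0)]
    return [("a", 0.3), ("b", 0.7)]

best = beam_search(toy_step, "<s>", beam_size=2, max_len=4)
```

With beam size 1 this degenerates to the greedy search the experiments also report; a wider beam trades computation for a better chance of finding the globally most probable summary.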
The present invention chooses the following 6 models as benchmarks; the training and test data of all benchmark models are identical to those of the present invention.
ABS: generates text summaries with a convolutional neural network (CNN) encoder and an NNLM decoder.
ABS+: based on the ABS model, fine-tuned on the DUC2003 dataset, further improving performance on the DUC2004 dataset.
RFs2s: both encoder and decoder use GRUs, and linguistic features such as part-of-speech tags and named-entity labels are fused into the encoder input.
CAs2s: both encoder and decoder are implemented with convolutional neural networks, with optimization strategies such as gated linear units (Gated Linear Unit, GLU) and multi-step attention added to the convolution process.
SEASS: in the traditional attention-based encoder-decoder model, a selective gate network is added on the encoder side to control the flow of information from encoder to decoder, purifying the encoded information.
CGU: similar to SEASS, it optimizes the encoder with self-attention and Inception convolutional networks to construct a global representation of the source input.
Our_s2s: the encoder-decoder model implemented by the present invention.
Table 1 lists the Rouge-1, Rouge-2, and Rouge-L F1 scores of the present model and the benchmark models on the Gigaword test set. Our_s2s is the attention-based encoder-decoder model implemented by the present invention; Inner_s2s and Cross_s2s respectively add the inner-layer fusion mechanism and the cross-layer fusion mechanism on top of Our_s2s; Beam and Greedy indicate whether beam search or greedy search is used at test time.
Table 1: Comparison of experimental results on the Gigaword test set
Model | RG-1(F1) | RG-2(F1) | RG-L(F1) |
ABS(Beam) | 29.55 | 11.2 | 26.42 |
ABS+(Beam) | 29.76 | 11.88 | 26.96 |
RFs2s(Beam) | 32.67 | 15.59 | 30.64 |
CAs2s(Beam) | 33.78 | 15.97 | 31.15 |
SEASS(Greedy) | 35.48 | 16.50 | 32.93 |
SEASS(Beam) | 36.15 | 17.54 | 33.63 |
CGU(Beam) | 36.31 | 18.00 | 33.82 |
Our_s2s(Beam) | 33.62 | 16.35 | 31.34 |
Inner_s2s(Greedy) | 36.05 | 17.18 | 33.47 |
Inner_s2s(Beam) | 36.52 | 17.75 | 33.81 |
Cross_s2s(Greedy) | 36.23 | 17.19 | 33.71 |
Cross_s2s(Beam) | 36.97 | 18.36 | 34.35 |
As can be seen from Table 1, the proposed Inner_s2s and Cross_s2s improve on the benchmark models to a certain degree; in particular, the Cross_s2s model with beam search achieves the best performance on all three metrics RG-1, RG-2, and RG-L. It can also be seen that Cross_s2s outperforms Inner_s2s under both the greedy and beam search strategies.
To further verify the generalization ability of the model, the present invention runs experiments on the DUC2004 dataset; the results are shown in Table 2. The DUC2004 dataset requires generating summaries of fixed length (75 bytes); consistent with prior work [1, 7, 18], the present invention fixes the length of generated summaries at 18 words to meet the minimum-length requirement. The DUC2004 dataset conventionally uses recall rather than F1 as the metric of model performance. Each source sentence in DUC2004 has four human-written summaries as references, so the present invention evaluates against each of the four reference summaries and reports the average of the four results as the evaluation result.
Table 2: Comparison of experimental results on the DUC2004 test set
Model | RG-1(R) | RG-2(R) | RG-L(R) |
ABS | 26.55 | 7.06 | 22.05 |
ABS+ | 28.18 | 8.49 | 23.81 |
RFs2s | 28.35 | 9.46 | 24.59 |
CAs2s | 28.97 | 8.26 | 24.06 |
SEASS | 29.21 | 9.56 | 25.51 |
Inner_s2s | 30.29 | 13.24 | 27.94 |
Cross_s2s | 30.14 | 13.05 | 27.85 |
As can be seen from Table 2, the proposed Inner_s2s and Cross_s2s perform similarly, but both exceed the benchmark models in the recall values of all three metrics RG-1, RG-2, and RG-L. In particular, compared with ABS+, although that model is fine-tuned on the DUC2003 dataset, the proposed Inner_s2s still improves RG-1, RG-2, and RG-L by 2.11, 4.75, and 4.13 respectively. Compared with the current best model SEASS, the RG-2 metric improves by nearly 3 percentage points.
The present invention targets abstractive text summarization. Under the attention-based encoder-decoder framework, it proposes a hierarchical interaction attention mechanism: multi-layer contextual information from the encoder is extracted through the attention mechanism to guide the decoding process, while a variational information bottleneck is introduced to constrain the information, thereby improving the quality of abstractive text summarization.
The embodiments of the present invention have been explained in detail above in conjunction with the attached drawings, but the present invention is not limited to the above embodiments; various changes may be made within the knowledge possessed by a person skilled in the art without departing from the inventive concept.
Claims (7)
1. A text summarization method based on hierarchical interaction attention, characterized in that the specific steps of the text summarization method based on hierarchical interaction attention are as follows:
Step1: use the English Gigaword data set as the training set; the data set is preprocessed with a preprocessing script to obtain the training set and the development set, where each training sample consists of a pair of input text and abstract sentence;
Step2: the encoder encodes the training set using a bidirectional LSTM, with the number of layers set to three;
Step3: the decoder uses a unidirectional LSTM network, reads the sentence to be decoded, and calculates the context vector of each layer;
Step4: for the multi-layer encoding/decoding model, the codec consists of multiple LSTM layers; within each LSTM layer, the hidden state representation between the upper layer and the current layer is calculated, so that the context vector of the upper layer is fused into the current layer;
Step5: splice each layer's context vector, which carries characteristic information, with the output of the current layer to obtain the hidden state of the decoder at the current layer;
Step6: incorporating contextual information from different levels brings redundancy and noise, so a variational information bottleneck is used to compress and denoise the data.
2. The text summarization method based on hierarchical interaction attention according to claim 1, characterized in that the specific steps of step Step1 are as follows: the data are standardized, including converting all words of the data set to lower case, replacing all digits with #, and replacing words that occur fewer than 5 times in the corpus with the UNK mark; a part of the data is selected from the development set and used as the test set after removal and screening.
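The normalization described in claim 2 can be sketched as follows; the sample sentences and the lowered frequency threshold (2 instead of 5, so the toy corpus produces a UNK) are illustrative assumptions, not data from the patent:

```python
import re
from collections import Counter

def preprocess(sentences, min_freq=5):
    """Standardize: lowercase, digits -> '#', rare words (< min_freq) -> 'UNK'."""
    tokenized = [re.sub(r"\d", "#", s.lower()).split() for s in sentences]
    freq = Counter(w for toks in tokenized for w in toks)
    return [[w if freq[w] >= min_freq else "UNK" for w in toks] for toks in tokenized]

corpus = ["The year 2019 was good", "the year 2020 was better"]
print(preprocess(corpus, min_freq=2)[0])  # ['the', 'year', '####', 'was', 'UNK']
```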
3. The text summarization method based on hierarchical interaction attention according to claim 1, characterized in that the specific steps of step Step2 are as follows: the encoder uses a bidirectional long short-term memory network (Bi-Directional LSTM, BiLSTM); the BiLSTM consists of a forward LSTM and a backward LSTM, where the forward LSTM reads the input sequence from left to right to obtain the forward encoding vector and the backward LSTM reads the sequence from right to left to obtain the backward encoding vector; finally, the forward and backward encoding vectors are spliced to obtain the vector characterization of the input sequence.
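The bidirectional encoding of claim 3 can be sketched in numpy; for brevity a simplified tanh recurrence stands in for the LSTM cell, and all sizes, weights and inputs are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 3, 5                       # toy input size, hidden size, sequence length
W_x = 0.1 * rng.normal(size=(d_h, d_in))
W_h = 0.1 * rng.normal(size=(d_h, d_h))

def run_direction(xs):
    """Simplified tanh recurrence standing in for one LSTM direction."""
    h, states = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

xs = [rng.normal(size=d_in) for _ in range(T)]
forward = run_direction(xs)                  # read left to right
backward = run_direction(xs[::-1])[::-1]     # read right to left, then realign
encoding = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(encoding[0].shape)  # (6,) -- forward and backward vectors spliced per position
```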
4. The text summarization method based on hierarchical interaction attention according to claim 1, characterized in that the specific steps of step Step3 are as follows:
Step3.1: the decoder uses a unidirectional LSTM network, initialized with the state vector of the encoder at the last moment; it then generates the abstract sequence word by word according to the vector characterization of the input context, where the length of the generated abstract must be less than or equal to the length of the input sequence;
Step3.2: during decoding, the decoder reads the word embedding vector of the target word at the previous moment, together with the hidden state vector of the previous moment and the context vector of the current moment, to generate the hidden state vector of the current moment;
Step3.3: an attention mechanism is introduced, and the context vector of the current moment is calculated from the hidden state of the decoder at the previous moment and the encoding vectors;
Step3.4: the output vector of the current moment is then calculated from the context vector and the hidden state vector of the current moment, and from the output vector of the current moment the output probability over the preset target vocabulary is calculated.
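Steps Step3.2 to Step3.4 describe one attention decoding step. A minimal numpy sketch of Step3.3, using dot-product scoring as a stand-in for the patent's unspecified attention score function (all sizes and states are toy assumptions):

```python
import numpy as np

def attention_context(dec_state, enc_states):
    """Context vector: softmax over dot-product scores, then a weighted sum."""
    scores = enc_states @ dec_state              # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # attention distribution (sums to 1)
    return weights @ enc_states                  # weighted sum of encoder states

rng = np.random.default_rng(1)
enc = rng.normal(size=(6, 4))    # 6 source positions, hidden size 4 (toy sizes)
prev_state = rng.normal(size=4)  # decoder hidden state of the previous moment
ctx = attention_context(prev_state, enc)
print(ctx.shape)  # (4,)
```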
5. The text summarization method based on hierarchical interaction attention according to claim 1, characterized in that the specific steps of step Step4 are as follows:
Step4.1: the context vector and the hidden state vector of the upper layer are fused as the input of the current layer;
Step4.2: the input of the current layer is fed into the LSTM to obtain the output of the current layer network;
Step4.3: the output vector of the last layer of the multi-layer decoder network is calculated, and the probability distribution of the target output over the vocabulary is computed.
6. The text summarization method based on hierarchical interaction attention according to claim 1, characterized in that the specific steps of step Step5 are as follows:
Step5.1: at the current layer of the network, the context vectors obtained at each layer are spliced respectively, yielding the cross-layer fused context vector and the decoder hidden state, which contain the characteristic information of the different levels of the encoder;
Step5.2: the output vector is calculated from the decoder hidden state and the context vector, from which the output probability of the output vector over the vocabulary can be calculated.
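Step5.2's mapping from the decoder hidden state to an output probability over the vocabulary can be sketched as a linear projection followed by a softmax; the projection matrix and the sizes here are toy assumptions:

```python
import numpy as np

def output_distribution(state, W_out):
    """Project the hidden state to vocabulary logits, then apply softmax."""
    logits = W_out @ state
    p = np.exp(logits - logits.max())            # shift logits for numerical stability
    return p / p.sum()

rng = np.random.default_rng(2)
vocab_size, d_h = 10, 8                          # toy vocabulary and hidden sizes
W_out = rng.normal(size=(vocab_size, d_h))
p = output_distribution(rng.normal(size=d_h), W_out)
print(round(float(p.sum()), 6))  # 1.0
```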
7. The text summarization method based on hierarchical interaction attention according to claim 1, characterized in that the specific steps of step Step6 are as follows:
Step6.1: given an input sequence, the encoding/decoding model generates the abstract sequence by calculating probabilities;
Step6.2: the model parameters are learned by maximizing the log-likelihood function of the probability of generating the abstract;
Step6.3: an information bottleneck is introduced as the intermediate characterization of the encoding, and the loss from the intermediate characterization to the output sequence is constructed as the cross-entropy loss of classification;
Step6.4: a constraint is added requiring that the KL divergence (Kullback-Leibler divergence) between the probability distribution and the standard normal distribution be as small as possible.
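Step6.4's constraint penalizes the KL divergence between the learned distribution and the standard normal. For a diagonal Gaussian N(mu, diag(sigma^2)), this divergence has the closed form 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1), sketched below; parametrizing via log sigma^2 is a common implementation convention, not something specified by the patent:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * float(np.sum(mu**2 + np.exp(log_var) - log_var - 1.0))

# The divergence vanishes when the two distributions coincide.
print(kl_to_standard_normal(np.zeros(3), np.zeros(3)))  # 0.0
```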
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910677195.6A CN110472238B (en) | 2019-07-25 | 2019-07-25 | Text summarization method based on hierarchical interaction attention |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910677195.6A CN110472238B (en) | 2019-07-25 | 2019-07-25 | Text summarization method based on hierarchical interaction attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110472238A true CN110472238A (en) | 2019-11-19 |
CN110472238B CN110472238B (en) | 2022-11-18 |
Family
ID=68509298
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910677195.6A Active CN110472238B (en) | 2019-07-25 | 2019-07-25 | Text summarization method based on hierarchical interaction attention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110472238B (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061862A (en) * | 2019-12-16 | 2020-04-24 | 湖南大学 | Method for generating abstract based on attention mechanism |
CN111488440A (en) * | 2020-03-30 | 2020-08-04 | 华南理工大学 | Problem generation method based on multi-task combination |
CN111680151A (en) * | 2020-05-06 | 2020-09-18 | 华东师范大学 | Personalized commodity comment abstract generation method based on hierarchical transformer |
CN111723196A (en) * | 2020-05-21 | 2020-09-29 | 西北工业大学 | Single document abstract generation model construction method and device based on multi-task learning |
CN111782810A (en) * | 2020-06-30 | 2020-10-16 | 湖南大学 | Text abstract generation method based on theme enhancement |
CN111931518A (en) * | 2020-10-15 | 2020-11-13 | 北京金山数字娱乐科技有限公司 | Translation model training method and device |
CN111966820A (en) * | 2020-07-21 | 2020-11-20 | 西北工业大学 | Method and system for constructing and extracting generative abstract model |
CN112528598A (en) * | 2020-12-07 | 2021-03-19 | 上海交通大学 | Automatic text abstract evaluation method based on pre-training language model and information theory |
CN112632228A (en) * | 2020-12-30 | 2021-04-09 | 深圳供电局有限公司 | Text mining-based auxiliary bid evaluation method and system |
CN111538829B (en) * | 2020-04-27 | 2021-04-20 | 众能联合数字技术有限公司 | Novel extraction method for webpage text key content of engineering machinery rental scene |
CN112765345A (en) * | 2021-01-22 | 2021-05-07 | 重庆邮电大学 | Text abstract automatic generation method and system fusing pre-training model |
CN112836040A (en) * | 2021-01-31 | 2021-05-25 | 云知声智能科技股份有限公司 | Multi-language abstract generation method and device, electronic equipment and computer readable medium |
CN113139468A (en) * | 2021-04-24 | 2021-07-20 | 西安交通大学 | Video abstract generation method fusing local target features and global features |
CN113434683A (en) * | 2021-06-30 | 2021-09-24 | 平安科技(深圳)有限公司 | Text classification method, device, medium and electronic equipment |
CN114154493A (en) * | 2022-01-28 | 2022-03-08 | 北京芯盾时代科技有限公司 | Short message category identification method and device |
CN118069833A (en) * | 2024-04-17 | 2024-05-24 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Hierarchical abstract generation method, device, equipment and readable storage medium |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180261214A1 (en) * | 2017-02-06 | 2018-09-13 | Facebook, Inc. | Sequence-to-sequence convolutional architecture |
CN108628823A (en) * | 2018-03-14 | 2018-10-09 | 中山大学 | In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training |
CN108647214A (en) * | 2018-03-29 | 2018-10-12 | 中国科学院自动化研究所 | Coding/decoding method based on deep-neural-network translation model |
CN108804677A (en) * | 2018-06-12 | 2018-11-13 | 合肥工业大学 | In conjunction with the deep learning question classification method and system of multi-layer attention mechanism |
CN108897740A (en) * | 2018-05-07 | 2018-11-27 | 内蒙古工业大学 | A kind of illiteracy Chinese machine translation method based on confrontation neural network |
CN108959246A (en) * | 2018-06-12 | 2018-12-07 | 北京慧闻科技发展有限公司 | Answer selection method, device and electronic equipment based on improved attention mechanism |
CN109145105A (en) * | 2018-07-26 | 2019-01-04 | 福州大学 | A kind of text snippet model generation algorithm of fuse information selection and semantic association |
CN109241536A (en) * | 2018-09-21 | 2019-01-18 | 浙江大学 | It is a kind of based on deep learning from the sentence sort method of attention mechanism |
WO2019028269A2 (en) * | 2017-08-02 | 2019-02-07 | Strong Force Iot Portfolio 2016, Llc | Methods and systems for detection in an industrial internet of things data collection environment with large data sets |
WO2019025601A1 (en) * | 2017-08-03 | 2019-02-07 | Koninklijke Philips N.V. | Hierarchical neural networks with granularized attention |
CN109408633A (en) * | 2018-09-17 | 2019-03-01 | 中山大学 | A kind of construction method of the Recognition with Recurrent Neural Network model of multilayer attention mechanism |
CN109858032A (en) * | 2019-02-14 | 2019-06-07 | 程淑玉 | Merge more granularity sentences interaction natural language inference model of Attention mechanism |
CN109918510A (en) * | 2019-03-26 | 2019-06-21 | 中国科学技术大学 | Cross-cutting keyword extracting method |
CN109948166A (en) * | 2019-03-25 | 2019-06-28 | 腾讯科技(深圳)有限公司 | Text interpretation method, device, storage medium and computer equipment |
CN110032638A (en) * | 2019-04-19 | 2019-07-19 | 中山大学 | A kind of production abstract extraction method based on coder-decoder |
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180261214A1 (en) * | 2017-02-06 | 2018-09-13 | Facebook, Inc. | Sequence-to-sequence convolutional architecture |
WO2019028269A2 (en) * | 2017-08-02 | 2019-02-07 | Strong Force Iot Portfolio 2016, Llc | Methods and systems for detection in an industrial internet of things data collection environment with large data sets |
WO2019025601A1 (en) * | 2017-08-03 | 2019-02-07 | Koninklijke Philips N.V. | Hierarchical neural networks with granularized attention |
CN108628823A (en) * | 2018-03-14 | 2018-10-09 | 中山大学 | In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training |
CN108647214A (en) * | 2018-03-29 | 2018-10-12 | 中国科学院自动化研究所 | Coding/decoding method based on deep-neural-network translation model |
CN108897740A (en) * | 2018-05-07 | 2018-11-27 | 内蒙古工业大学 | A kind of illiteracy Chinese machine translation method based on confrontation neural network |
CN108804677A (en) * | 2018-06-12 | 2018-11-13 | 合肥工业大学 | In conjunction with the deep learning question classification method and system of multi-layer attention mechanism |
CN108959246A (en) * | 2018-06-12 | 2018-12-07 | 北京慧闻科技发展有限公司 | Answer selection method, device and electronic equipment based on improved attention mechanism |
CN109145105A (en) * | 2018-07-26 | 2019-01-04 | 福州大学 | A kind of text snippet model generation algorithm of fuse information selection and semantic association |
CN109408633A (en) * | 2018-09-17 | 2019-03-01 | 中山大学 | A kind of construction method of the Recognition with Recurrent Neural Network model of multilayer attention mechanism |
CN109241536A (en) * | 2018-09-21 | 2019-01-18 | 浙江大学 | It is a kind of based on deep learning from the sentence sort method of attention mechanism |
CN109858032A (en) * | 2019-02-14 | 2019-06-07 | 程淑玉 | Merge more granularity sentences interaction natural language inference model of Attention mechanism |
CN109948166A (en) * | 2019-03-25 | 2019-06-28 | 腾讯科技(深圳)有限公司 | Text interpretation method, device, storage medium and computer equipment |
CN109918510A (en) * | 2019-03-26 | 2019-06-21 | 中国科学技术大学 | Cross-cutting keyword extracting method |
CN110032638A (en) * | 2019-04-19 | 2019-07-19 | 中山大学 | A kind of production abstract extraction method based on coder-decoder |
Non-Patent Citations (3)
Title |
---|
SINA AHMADI: "Attention-based Encoder-Decoder Networks for Spelling and Grammatical Error Correction", internet retrieval: ARXIV.ORG/PDF/1810.00660.PDF * |
WANG Qi et al.: "Neural Machine Translation Based on Attention Convolution", Computer Science * |
CHEN Longjie et al.: "Image Caption Generation Algorithm Based on Multi-Attention Scale Feature Fusion", Journal of Computer Applications * |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061862A (en) * | 2019-12-16 | 2020-04-24 | 湖南大学 | Method for generating abstract based on attention mechanism |
CN111488440A (en) * | 2020-03-30 | 2020-08-04 | 华南理工大学 | Problem generation method based on multi-task combination |
CN111488440B (en) * | 2020-03-30 | 2024-02-13 | 华南理工大学 | Problem generation method based on multi-task combination |
CN111538829B (en) * | 2020-04-27 | 2021-04-20 | 众能联合数字技术有限公司 | Novel extraction method for webpage text key content of engineering machinery rental scene |
CN111680151A (en) * | 2020-05-06 | 2020-09-18 | 华东师范大学 | Personalized commodity comment abstract generation method based on hierarchical transformer |
CN111723196A (en) * | 2020-05-21 | 2020-09-29 | 西北工业大学 | Single document abstract generation model construction method and device based on multi-task learning |
CN111782810A (en) * | 2020-06-30 | 2020-10-16 | 湖南大学 | Text abstract generation method based on theme enhancement |
CN111966820A (en) * | 2020-07-21 | 2020-11-20 | 西北工业大学 | Method and system for constructing and extracting generative abstract model |
CN111931518A (en) * | 2020-10-15 | 2020-11-13 | 北京金山数字娱乐科技有限公司 | Translation model training method and device |
CN112528598B (en) * | 2020-12-07 | 2022-04-05 | 上海交通大学 | Automatic text abstract evaluation method based on pre-training language model and information theory |
CN112528598A (en) * | 2020-12-07 | 2021-03-19 | 上海交通大学 | Automatic text abstract evaluation method based on pre-training language model and information theory |
CN112632228A (en) * | 2020-12-30 | 2021-04-09 | 深圳供电局有限公司 | Text mining-based auxiliary bid evaluation method and system |
CN112765345A (en) * | 2021-01-22 | 2021-05-07 | 重庆邮电大学 | Text abstract automatic generation method and system fusing pre-training model |
CN112836040A (en) * | 2021-01-31 | 2021-05-25 | 云知声智能科技股份有限公司 | Multi-language abstract generation method and device, electronic equipment and computer readable medium |
CN112836040B (en) * | 2021-01-31 | 2022-09-23 | 云知声智能科技股份有限公司 | Method and device for generating multilingual abstract, electronic equipment and computer readable medium |
CN113139468A (en) * | 2021-04-24 | 2021-07-20 | 西安交通大学 | Video abstract generation method fusing local target features and global features |
CN113434683A (en) * | 2021-06-30 | 2021-09-24 | 平安科技(深圳)有限公司 | Text classification method, device, medium and electronic equipment |
CN113434683B (en) * | 2021-06-30 | 2023-08-29 | 平安科技(深圳)有限公司 | Text classification method, device, medium and electronic equipment |
CN114154493A (en) * | 2022-01-28 | 2022-03-08 | 北京芯盾时代科技有限公司 | Short message category identification method and device |
CN114154493B (en) * | 2022-01-28 | 2022-06-28 | 北京芯盾时代科技有限公司 | Short message category identification method and device |
CN118069833A (en) * | 2024-04-17 | 2024-05-24 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Hierarchical abstract generation method, device, equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110472238B (en) | 2022-11-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110472238A (en) | Text snippet method based on level interaction attention | |
CN111241294B (en) | Relationship extraction method of graph convolution network based on dependency analysis and keywords | |
CN109522411A (en) | A kind of writing householder method neural network based | |
CN108519890A (en) | A kind of robustness code abstraction generating method based on from attention mechanism | |
CN110929030A (en) | Text abstract and emotion classification combined training method | |
CN110427616B (en) | Text emotion analysis method based on deep learning | |
Konstas et al. | Inducing document plans for concept-to-text generation | |
CN112183094B (en) | Chinese grammar debugging method and system based on multiple text features | |
CN110717843A (en) | Reusable law strip recommendation framework | |
CN111125333B (en) | Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism | |
CN113435211B (en) | Text implicit emotion analysis method combined with external knowledge | |
CN111078866A (en) | Chinese text abstract generation method based on sequence-to-sequence model | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN110083824A (en) | A kind of Laotian segmenting method based on Multi-Model Combination neural network | |
CN115310448A (en) | Chinese named entity recognition method based on combining bert and word vector | |
CN115481219A (en) | Electricity selling company evaluation emotion classification method based on grammar sequence embedded model | |
CN115545033A (en) | Chinese field text named entity recognition method fusing vocabulary category representation | |
CN114117041B (en) | Attribute-level emotion analysis method based on specific attribute word context modeling | |
CN112287687B (en) | Case tendency extraction type summarization method based on case attribute perception | |
Wu et al. | Context-aware style learning and content recovery networks for neural style transfer | |
CN117877460A (en) | Speech synthesis method, device, speech synthesis model training method and device | |
Vaessen et al. | The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning | |
CN114238649A (en) | Common sense concept enhanced language model pre-training method | |
CN113468366A (en) | Music automatic labeling method | |
CN113935308A (en) | Method and system for automatically generating text abstract facing field of geoscience |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||