CN109508457B - Transfer learning method based on machine reading to sequence model - Google Patents

Transfer learning method based on machine reading to sequence model

Info

Publication number
CN109508457B
CN109508457B
Authority
CN
China
Prior art keywords
model
sequence
vector
layer
machine reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811284309.2A
Other languages
Chinese (zh)
Other versions
CN109508457A (en)
Inventor
潘博远
蔡登
李昊
陈哲乾
赵洲
何晓飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201811284309.2A
Publication of CN109508457A
Application granted
Publication of CN109508457B
Active legal-status Current
Anticipated expiration legal-status


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a transfer learning method based on machine reading to a sequence model, which comprises the following steps: (1) pre-training a machine reading model, wherein the machine reading model comprises a coding layer and a model layer based on a recurrent neural network; (2) establishing a sequence model, wherein the sequence model comprises an encoder and a decoder based on a recurrent neural network; (3) extracting the parameters of the coding layer and the model layer of the trained machine reading model and transferring them into the sequence model to be trained, where they serve as part of the initialization parameters for training the sequence model; (4) training the sequence model until it converges; (5) performing text sequence prediction tasks with the trained sequence model. With this method, the information contained in a text can be mined more deeply and the quality of the generated text sequence is improved.

Description

Transfer learning method based on machine reading to sequence model
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a transfer learning method based on machine reading to a sequence model.
Background
Machine reading is one of the most popular and most challenging problems in natural language processing: it requires a model to understand natural language and to exploit existing knowledge. In the most common task setting, given an article and a question, the model must find the answer to the question within the article. With the recent release of several high-quality datasets, neural-network-based models have performed better and better on machine reading, even surpassing humans on some datasets. An effective machine reading model can be widely applied in many fields that depend on semantic understanding, such as dialogue robots, question-answering systems and search engines.
A sequence model with an attention mechanism mainly consists of an encoder and a decoder: the encoder encodes the input sequence, and the decoder then generates the output sequence token by token. Such structures have enjoyed tremendous success in natural language generation tasks such as machine translation, text summarization and dialogue systems. However, when training such an encoder-decoder, the output can only be optimized against a fixed reference sample, and it is difficult to deeply understand the latent semantic information contained in the text.
Transfer learning refers to combining knowledge or features from different domains to build a new model or probability distribution, and it is widely applied in natural language processing. For example, "Natural Language Processing (Almost) from Scratch", published in 2011 in the top international machine learning journal, the Journal of Machine Learning Research, discloses a unified neural network structure that applies unsupervised learning to several natural language processing tasks such as part-of-speech tagging and named entity recognition; "Learned in Translation: Contextualized Word Vectors", published in 2017 at the top international conference on neural information processing, the Conference on Neural Information Processing Systems, discloses a method that migrates the encoder of a pre-trained machine translation model into text classification tasks and question-answering systems as new word vectors to enrich the original word vectors; and "Discourse Marker Augmented Network with Reinforcement Learning for Natural Language Inference", published in 2018 in the Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, a top international natural language processing conference, discloses a training method based on discourse connectives.
However, existing transfer learning methods in natural language processing rarely transfer multi-layer neural networks to other tasks, and migrating only the coding layer loses a large amount of information from the original pre-trained model.
Disclosure of Invention
The invention provides a transfer learning method based on machine reading to sequence models, which mines the information contained in a text more deeply and improves the quality of the generated text sequence.
The technical scheme adopted by the invention is as follows:
A transfer learning method based on machine reading to sequence model comprises the following steps:
(1) pre-training a machine reading model, wherein the machine reading model comprises a coding layer and a model layer based on a recurrent neural network;
(2) establishing a sequence model, wherein the sequence model comprises an encoder, a decoder and an attention mechanism based on a recurrent neural network;
(3) extracting the parameters of the coding layer and the model layer of the trained machine reading model, transferring them into the sequence model to be trained, and using them as part of the initialization parameters for training the sequence model;
(4) training the sequence model until the model converges;
(5) and performing a text sequence prediction task by using the trained sequence model.
The method pre-trains a machine reading model comprising a coding layer and a model layer as the migration source, embeds the coding layer and the model layer into a sequence model where they are fused with the existing encoding results, and finally outputs the probability distribution over labels. This helps the sequence model understand the meaning of the text more deeply and generate more natural text.
In the step (1), the recurrent neural network in the coding layer is a bidirectional long short-term memory network, and the recurrent neural network in the model layer is a unidirectional long short-term memory network.
In the step (1), pre-training the machine reading model comprises the following specific steps (a code sketch follows the list):
(1-1) selecting training data, embedding the input text with pre-trained GloVe word vectors, and feeding the embeddings into the bidirectional long short-term memory network of the coding layer;
(1-2) concatenating the hidden units side by side to form the representation of the whole sentence in each direction, and combining the representations of the two directions as the final representation of the input sequence;
(1-3) feeding the final representation of the article sequence and the final representation of the question sequence into the attention mechanism of the model, which outputs an attention matrix;
(1-4) inputting the attention matrix into the unidirectional long short-term memory network of the model layer, normalizing with the hidden units of the network, and outputting the predicted probability distribution;
(1-5) repeating the above steps until the machine reading model converges.
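A minimal PyTorch sketch of this pre-training architecture is given below. Only the two components that are later migrated, the coding layer (a bidirectional LSTM over GloVe embeddings) and the model layer (a unidirectional LSTM over the attention matrix), follow the description closely; the class name, hidden sizes, the simplified bilinear attention and the span-prediction head are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class MachineReadingModel(nn.Module):
    """Sketch of the pre-trained machine reading model (steps 1-1 to 1-4).

    The coding layer (bidirectional LSTM) and the model layer (unidirectional
    LSTM) mirror the description; the attention step is reduced to a simple
    bilinear interaction, and all sizes are illustrative assumptions.
    """

    def __init__(self, glove_weights: torch.Tensor, hidden_size: int = 128):
        super().__init__()
        _, embed_dim = glove_weights.shape
        self.embedding = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        # (1-1) coding layer: bidirectional LSTM over GloVe embeddings
        self.encoding_layer = nn.LSTM(embed_dim, hidden_size,
                                      batch_first=True, bidirectional=True)
        # (1-3) simplified attention: bilinear article-question interaction
        self.att_bilinear = nn.Linear(2 * hidden_size, 2 * hidden_size, bias=False)
        # (1-4) model layer: unidirectional LSTM over the attention matrix
        self.model_layer = nn.LSTM(4 * hidden_size, hidden_size, batch_first=True)
        self.span_head = nn.Linear(hidden_size, 2)  # start/end logits per article word

    def forward(self, article_ids, question_ids):
        # (1-1)/(1-2) encode article and question; nn.LSTM already concatenates
        # the hidden states of the two directions.
        c, _ = self.encoding_layer(self.embedding(article_ids))      # [B, Lc, 2H]
        q, _ = self.encoding_layer(self.embedding(question_ids))     # [B, Lq, 2H]
        # (1-3) attention matrix: one attended question summary per article word
        scores = torch.bmm(self.att_bilinear(c), q.transpose(1, 2))  # [B, Lc, Lq]
        attended = torch.bmm(torch.softmax(scores, dim=-1), q)       # [B, Lc, 2H]
        g = torch.cat([c, attended], dim=-1)                         # [B, Lc, 4H]
        # (1-4) model layer and softmax-normalized span prediction
        m, _ = self.model_layer(g)                                   # [B, Lc, H]
        return torch.softmax(self.span_head(m), dim=1)               # start/end distributions
```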
In the step (2), the sequence model mainly consists of an encoder and a decoder; to remain consistent with the parameters of the migration source, long short-term memory networks are likewise adopted as the main parameterized components of the sequence model, and the recurrent neural network in the encoder is a bidirectional long short-term memory network.
In the step (3), the extracted parameters of the coding layer and the model layer are those of the recurrent neural networks in the coding layer and the model layer. The network of the coding layer and the network of the model layer are extracted separately and transferred into the sequence model to be trained, serving as part of the initialization parameters for training the sequence model, as in the sketch below.
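In code, the migration reduces to copying the weights of the two LSTMs into the corresponding modules of the sequence model before training begins. The sketch below assumes the module names `encoding_layer`, `model_layer`, `migrated_encoder` and `migrated_model_layer`; these names, like the surrounding model classes, are assumptions for illustration.

```python
import torch.nn as nn

def migrate_parameters(reading_model: nn.Module, seq_model: nn.Module) -> None:
    """Copy the coding-layer and model-layer LSTM weights of a trained machine
    reading model into a sequence model as part of its initialization. All
    other sequence-model parameters keep their random initialization.
    Module names are illustrative assumptions."""
    # coding layer -> migrated encoder branch of the sequence model
    seq_model.migrated_encoder.load_state_dict(
        reading_model.encoding_layer.state_dict())
    # model layer -> migrated model layer applied on top of the decoder attention
    seq_model.migrated_model_layer.load_state_dict(
        reading_model.model_layer.state_dict())
```

For the copy to succeed, the migrated modules of the sequence model must be constructed with the same hyperparameters (input and hidden sizes) as the corresponding layers of the machine reading model.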
The specific steps of the step (4) are as follows (a code sketch of one decoding step follows the list):
(4-1) feeding the input word sequence simultaneously into the encoder of the sequence model and the coding layer of the migrated machine reading model to obtain a merged encoding vector;
(4-2) feeding the merged vector into a unidirectional long short-term memory network for integration, obtaining an encoding vector that integrates the input text sequence;
(4-3) taking the integrated encoding vector as the initialization vector of the decoder, and performing attention interaction between the hidden units of the decoder and the units of the integrated vector to obtain an attention vector a_t, where t denotes the t-th decoded word;
(4-4) inputting the attention vector a_t into the model layer of the migrated machine reading model, then integrating the output vector r_t of the model layer with the attention vector a_t using a linear function and feeding the result into a softmax function to obtain the probability distribution of the predicted sequence; the softmax function is formulated as:
P(y_t | y_<t, x) = softmax(W_p·a_t + W_q·r_t + b_p)
where W_p, W_q and b_p are parameters to be trained, and y_t is the t-th word output by the decoder;
(4-5) repeating the above steps until the model converges.
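The decisive difference from an ordinary attention sequence model is the extra term W_q·r_t in the output distribution. The PyTorch sketch below shows one decoding step of step (4): computing the attention vector a_t, passing it through the migrated model layer to obtain r_t, and combining the two with softmax(W_p·a_t + W_q·r_t + b_p). The dot-product attention, the dimension-bridging projection and all module names are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TransferDecoder(nn.Module):
    """One decoding step of the sequence model with the migrated model layer.

    Computes P(y_t | y_<t, x) = softmax(W_p*a_t + W_q*r_t + b_p), where a_t is
    the attention vector and r_t the output of the migrated model layer.
    Attention form, sizes and names are illustrative assumptions.
    """

    def __init__(self, vocab_size: int, hidden_size: int, migrated_model_layer: nn.LSTM):
        super().__init__()
        # assumes the target-word embedding size equals hidden_size
        self.decoder_cell = nn.LSTMCell(hidden_size, hidden_size)
        # unidirectional LSTM taken from the reading model (built with batch_first=True)
        self.migrated_model_layer = migrated_model_layer
        # adapter so a_t matches the migrated LSTM's expected input size (assumption)
        self.bridge = nn.Linear(hidden_size, migrated_model_layer.input_size)
        self.W_p = nn.Linear(hidden_size, vocab_size, bias=True)                        # W_p, b_p
        self.W_q = nn.Linear(migrated_model_layer.hidden_size, vocab_size, bias=False)  # W_q

    def step(self, y_emb, dec_state, enc_states, model_state):
        # (4-3) decoder step, then dot-product attention over the integrated
        # encoder states enc_states of shape [B, L, hidden_size]
        h, c = self.decoder_cell(y_emb, dec_state)                   # h: [B, H]
        scores = torch.bmm(enc_states, h.unsqueeze(-1)).squeeze(-1)  # [B, L]
        alpha = torch.softmax(scores, dim=-1)
        a_t = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)   # attention vector a_t
        # (4-4) pass a_t through the migrated model layer to obtain r_t
        r_t, model_state = self.migrated_model_layer(
            self.bridge(a_t).unsqueeze(1), model_state)
        r_t = r_t.squeeze(1)
        # linear combination of a_t and r_t, normalized by softmax
        p_t = torch.softmax(self.W_p(a_t) + self.W_q(r_t), dim=-1)   # P(y_t | y_<t, x)
        return p_t, (h, c), model_state
```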
The invention has the following beneficial effects:
1. The invention uses transfer learning to transfer knowledge learned in question-answering systems to text generation tasks, improving the accuracy of the encoder-decoder structure while keeping the whole model simple and intuitive.
2. The method makes full use of the high performance of existing machine reading models. The migrated parameters comprise multi-layer neural networks, and the trained machine reading model parameters replace the random initialization of part of the sequence model parameters, which helps the sequence model mine the information contained in the text more deeply, produce richer content, and improve the quality of the generated text sequence.
Drawings
FIG. 1 is a flow chart of a transfer learning method based on machine reading to sequence model according to the present invention;
FIG. 2 is a schematic diagram of the overall structure of the machine reading model and the sequence model of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantageous effects of the present invention more clearly apparent, the technical contents and specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1, a transfer learning method based on machine reading to sequence model includes the following steps:
s01, pre-training a machine reading model.
We use the Stanford question-answering dataset SQuAD, a large-scale, high-quality corpus, as the training set. The task is: given an article and a question, predict the answer, which is a continuous span in the article.
Referring to fig. 2, the input text is embedded with pre-trained GloVe word vectors and then fed into the bidirectional long short-term memory network (BiLSTM) of the Encoding Layer. We concatenate the hidden units side by side to form the representation of the whole sentence in each direction, and merge the representations of the two directions as the final representation of the input sequence. Subsequently, we feed the representation of the article sequence and the representation of the question sequence into the Attention Mechanism. The attention mechanism is a function composed of a series of normalized linear and logical operations; for details see pages 3 to 4 of "Bi-Directional Attention Flow for Machine Comprehension", published at the International Conference on Learning Representations in 2017 (a simplified sketch follows). The output of the attention mechanism is a matrix with one attention vector per article word. Finally, we input the attention matrix into the unidirectional long short-term memory network (LSTM) of the Modeling Layer, normalize with the hidden units of the network, and output the predicted probability distribution through a softmax function.
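The bidirectional attention of this layer can be sketched roughly as in the cited paper: a similarity score between every article word and every question word, an article-to-question attended vector per article word, a question-to-article summary vector, and their concatenation as the attention matrix. The following simplified reimplementation is an assumption-level sketch (a fuller version of the attention that was reduced to a bilinear form in the earlier sketch), not the patented code.

```python
import torch
import torch.nn as nn

class BiAttention(nn.Module):
    """BiDAF-style bidirectional attention, sketched after the cited paper.

    Given encoded article states h [B, T, D] and question states u [B, J, D]
    (D = 2H for a bidirectional encoder), returns one 4D-dimensional attended
    vector per article word - the attention matrix fed to the model layer.
    """

    def __init__(self, dim: int):
        super().__init__()
        # similarity(h, u) = w^T [h; u; h * u]
        self.w = nn.Linear(3 * dim, 1, bias=False)

    def forward(self, h, u):
        B, T, D = h.shape
        J = u.size(1)
        h_exp = h.unsqueeze(2).expand(B, T, J, D)
        u_exp = u.unsqueeze(1).expand(B, T, J, D)
        s = self.w(torch.cat([h_exp, u_exp, h_exp * u_exp], dim=-1)).squeeze(-1)  # [B, T, J]
        # article-to-question: attended question vector for each article word
        u_tilde = torch.bmm(torch.softmax(s, dim=-1), u)                          # [B, T, D]
        # question-to-article: one attended article vector, tiled over positions
        b = torch.softmax(s.max(dim=-1).values, dim=-1)                           # [B, T]
        h_tilde = torch.bmm(b.unsqueeze(1), h).expand(B, T, D)                    # [B, T, D]
        return torch.cat([h, u_tilde, h * u_tilde, h * h_tilde], dim=-1)          # [B, T, 4D]
```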
S02, extracting the coding layer and model layer parameters of the machine reading model. The long short-term memory networks mentioned in step S01 are a type of recurrent neural network and are exactly the parameters we extract. The network of the coding layer and the network of the model layer are extracted separately, ready to be used as initialization parameters for the next task.
S03, embedding the parameters extracted in step S02 into the sequence model as the initialization of part of its parameters.
The structure of the sequence model is shown in fig. 2. The sequence model mainly consists of an Encoder and a Decoder; to remain consistent with the parameters of the migration source, we likewise use long short-term memory networks as its main parameterized components. We first feed the input word sequence simultaneously into the encoder of the sequence model and the coding layer of the migrated machine reading model to obtain a merged encoding vector; the merged vector is then fed into a unidirectional long short-term memory network for integration, obtaining an encoding vector that integrates the input text sequence from the two encoders of different origin. The integrated encoding vector is used as the initialization vector of the decoder, and attention interaction is performed between the hidden units of the decoder and the units of the integrated vector to obtain an attention vector a_t, where t denotes the t-th decoded word. In an ordinary sequence model, the attention vector is finally sent to a softmax function to be normalized into the predicted probability distribution:
P(y_t | y_<t, x) = softmax(W_p·a_t + b_p)
where W_p and b_p are parameters to be trained, and y_t is the t-th word output by the decoder. In the method of the present invention, however, we first input the attention vector into the migrated model layer of the machine reading model, then integrate the output vector r_t of the model layer with the original attention vector using a linear function and feed the result into a softmax function to obtain the probability distribution of the predicted sequence:
P(y_t | y_<t, x) = softmax(W_p·a_t + W_q·r_t + b_p)
where W_q is an additional parameter to be trained.
S04, training the sequence model, with the migrated parameters used as initialization and the remaining parameters randomly initialized, until convergence.
S05, performing text sequence prediction tasks such as machine translation and text summarization with the trained model.
To demonstrate the effectiveness of the method, comparative experiments were carried out on two tasks: neural machine translation and abstractive text summarization. For machine translation, the WMT2014 and WMT2015 English-to-German corpora were used; for text summarization, the CNN/Daily Mail and Gigaword datasets were used. After preprocessing, CNN/Daily Mail contains 287k training pairs and Gigaword contains 3.8M training pairs.
The results of the comparative experiments on the machine translation task are shown in table 1. In table 1, the first column is the base model, the middle columns add the components of the method one by one, and the last column is the full method. On the machine translation task, the method of the invention (MacNet) is clearly improved over the base model (Baseline), and the comparisons over the individual components confirm their effectiveness.
TABLE 1
(Table 1 is provided as an image in the original publication.)
The results of the comparative experiments on the text summarization task are shown in table 2. On the text summarization test sets, the method was compared with the best published methods available at the time. Overall, the method of the invention (Pointer-Generator + MacNet) achieves higher accuracy than the other methods and obtains the best results to date on most metrics on both datasets.
TABLE 2
(Table 2 is provided as an image in the original publication.)
In addition, several examples are shown that demonstrate the qualitative effect on generated text summaries before and after incorporating the method of the invention, as shown in table 3.
TABLE 3
(Table 3 is provided as an image in the original publication.)
In the table, PG is the abbreviation of the base model, the pointer-generator; Reference is the reference answer given in the dataset; and PG + MacNet is the model with the method of the invention added. It can be seen that when uncommon words appear in the source text, the original base model struggles to produce a well-formed subject-verb-object summary, and when the source text is long and its structure complex, the original base model even produces ungrammatical sentences. After adding the method of the invention, however, the generated summaries are fluent and natural, and the main idea they express is essentially accurate.
The embodiments described in this specification are for illustrative purposes only and are not intended to limit the invention; the scope of the invention should not be limited to the specific embodiments described here, and any modifications, substitutions, changes, etc. within the spirit and principle of the invention are included in the scope of the invention.

Claims (5)

1. A transfer learning method based on machine reading to sequence model, characterized by comprising the following steps:
(1) pre-training a machine reading model, wherein the machine reading model comprises a coding layer and a model layer based on a recurrent neural network;
(2) establishing a sequence model, wherein the sequence model comprises an encoder, a decoder and an attention mechanism based on a recurrent neural network;
(3) extracting parameters of a coding layer and a model layer in a trained machine reading model, transferring the parameters into a sequence model to be trained, and using the parameters as part of initialization parameters when the sequence model is trained;
(4) training a sequence model, specifically comprising the following steps:
(4-1) simultaneously sending the input word sequence into an encoder of the sequence model and a coding layer of the migrated machine reading model to obtain a coded merging vector;
(4-2) feeding the merged vector into a unidirectional long short-term memory network for integration to obtain an encoding vector that integrates the input text sequence;
(4-3) taking the integrated encoding vector as the initialization vector of the decoder, and performing attention interaction between the hidden units of the decoder and the units of the integrated vector to obtain an attention vector a_t, where t denotes the t-th decoded word;
(4-4) inputting the attention vector a_t into the model layer of the migrated machine reading model, then integrating the output vector r_t of the model layer with the attention vector a_t using a linear function and feeding the result into a softmax function to obtain the probability distribution of the predicted sequence;
(4-5) repeating the above steps until the model converges;
(5) and performing a text sequence prediction task by using the trained sequence model.
2. The method according to claim 1, wherein in step (1), the recurrent neural network in the coding layer is a bidirectional long short-term memory network, and the recurrent neural network in the model layer is a unidirectional long short-term memory network.
3. The transfer learning method based on machine reading to sequence model according to claim 2, wherein in the step (1), the pre-training comprises the following specific steps:
(1-1) selecting training data, embedding the input text with pre-trained GloVe word vectors, and feeding the embeddings into the bidirectional long short-term memory network of the coding layer;
(1-2) concatenating the hidden units side by side to form the representation of the whole sentence in each direction, and combining the representations of the two directions as the final representation of the input sequence;
(1-3) feeding the final representation of the article sequence and the final representation of the question sequence into the attention mechanism of the model, which outputs an attention matrix;
(1-4) inputting the attention matrix into the unidirectional long short-term memory network of the model layer, normalizing with the hidden units of the network, and outputting the predicted probability distribution;
(1-5) repeating the above steps until the machine reading model converges.
4. The method according to claim 1, wherein in step (2), the recurrent neural network in the encoder is a bidirectional long short-term memory network.
5. The transfer learning method based on machine reading to sequence model according to claim 1, wherein in step (4-4), the formula of the softmax function is:
P(y_t | y_<t, x) = softmax(W_p·a_t + W_q·r_t + b_p)
where W_p, W_q and b_p are parameters to be trained, and y_t is the t-th word output by the decoder.
CN201811284309.2A 2018-10-31 2018-10-31 Transfer learning method based on machine reading to sequence model Active CN109508457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811284309.2A CN109508457B (en) 2018-10-31 2018-10-31 Transfer learning method based on machine reading to sequence model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811284309.2A CN109508457B (en) 2018-10-31 2018-10-31 Transfer learning method based on machine reading to sequence model

Publications (2)

Publication Number Publication Date
CN109508457A CN109508457A (en) 2019-03-22
CN109508457B true CN109508457B (en) 2020-05-29

Family

ID=65747209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811284309.2A Active CN109508457B (en) 2018-10-31 2018-10-31 Transfer learning method based on machine reading to sequence model

Country Status (1)

Country Link
CN (1) CN109508457B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200364303A1 (en) * 2019-05-15 2020-11-19 Nvidia Corporation Grammar transfer using one or more neural networks
CN110188182B (en) * 2019-05-31 2023-10-27 中国科学院深圳先进技术研究院 Model training method, dialogue generating method, device, equipment and medium
CN110188331B (en) * 2019-06-03 2023-05-26 腾讯科技(深圳)有限公司 Model training method, dialogue system evaluation method, device, equipment and storage medium
CN110415702A (en) * 2019-07-04 2019-11-05 北京搜狗科技发展有限公司 Training method and device, conversion method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228571A (en) * 2018-02-01 2018-06-29 北京百度网讯科技有限公司 Generation method, device, storage medium and the terminal device of distich

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521656B (en) * 2011-12-29 2014-02-26 北京工商大学 Integrated transfer learning method for classification of unbalance samples
US20160350653A1 (en) * 2015-06-01 2016-12-01 Salesforce.Com, Inc. Dynamic Memory Network
US10776707B2 (en) * 2016-03-08 2020-09-15 Shutterstock, Inc. Language translation based on search results and user interaction data
CN105787560B (en) * 2016-03-18 2018-04-03 北京光年无限科技有限公司 Dialogue data interaction processing method and device based on Recognition with Recurrent Neural Network
US20180260474A1 (en) * 2017-03-13 2018-09-13 Arizona Board Of Regents On Behalf Of The University Of Arizona Methods for extracting and assessing information from literature documents
CN107341146B (en) * 2017-06-23 2020-08-04 上海交大知识产权管理有限公司 Migratable spoken language semantic analysis system based on semantic groove internal structure and implementation method thereof
CN107590138B (en) * 2017-08-18 2020-01-31 浙江大学 neural machine translation method based on part-of-speech attention mechanism

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228571A (en) * 2018-02-01 2018-06-29 北京百度网讯科技有限公司 Generation method, device, storage medium and the terminal device of distich

Also Published As

Publication number Publication date
CN109508457A (en) 2019-03-22

Similar Documents

Publication Publication Date Title
CN109508457B (en) Transfer learning method based on machine reading to sequence model
CN107357789B (en) Neural machine translation method fusing multi-language coding information
CN108717574B (en) Natural language reasoning method based on word connection marking and reinforcement learning
CN111783462A (en) Chinese named entity recognition model and method based on dual neural network fusion
WO2021022816A1 (en) Intent identification method based on deep learning network
CN109992669B (en) Keyword question-answering method based on language model and reinforcement learning
CN111078866B (en) Chinese text abstract generation method based on sequence-to-sequence model
CN111723547A (en) Text automatic summarization method based on pre-training language model
CN111581962B (en) Text representation method based on subject word vector and hybrid neural network
CN108549644A (en) Omission pronominal translation method towards neural machine translation
CN110765264A (en) Text abstract generation method for enhancing semantic relevance
CN110874411A (en) Cross-domain emotion classification system based on attention mechanism fusion
CN116306652A (en) Chinese naming entity recognition model based on attention mechanism and BiLSTM
CN114881042B (en) Chinese emotion analysis method based on graph-convolution network fusion of syntactic dependency and part of speech
CN113407663B (en) Image-text content quality identification method and device based on artificial intelligence
Li et al. Cm-gen: A neural framework for chinese metaphor generation with explicit context modelling
KR20210058059A (en) Unsupervised text summarization method based on sentence embedding and unsupervised text summarization device using the same
CN113887251A (en) Mongolian Chinese machine translation method combining Meta-KD framework and fine-grained compression
CN113743095A (en) Chinese problem generation unified pre-training method based on word lattice and relative position embedding
CN117932066A (en) Pre-training-based 'extraction-generation' answer generation model and method
CN114997143B (en) Text generation model training method and system, text generation method and storage medium
CN114519353B (en) Model training method, emotion message generation method and device, equipment and medium
CN113377908B (en) Method for extracting aspect-level emotion triple based on learnable multi-word pair scorer
Cho Introduction to neural machine translation with GPUs (part 3)
Wang Text emotion detection based on Bi-LSTM network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant