CN108197701A

CN108197701A - A multi-task learning method based on RNN

Info

Publication number: CN108197701A
Application number: CN201810112482.8A
Authority: CN
Inventors: 王磊; 翟荣安; 王纯配; 顾仓; 王毓; 刘晶晶; 王飞; 于振中; 李文兴
Original assignee: HRG International Institute for Research and Innovation
Current assignee: HRG International Institute for Research and Innovation
Priority date: 2018-02-05
Filing date: 2018-02-05
Publication date: 2018-06-22

Abstract

The present invention provides a multi-task learning method based on RNN. The method comprises the following steps: Step S1: initialize the system parameters $\theta = (W, U, B, V)$; Step S2: input samples $x_{1,i}, \dots, x_{R,i}$, learn the public information $X^{co}$, and feed the public information as compensation into the training of each individual task; Step S3: compute the predicted label vector $\hat{y}_{r,i}$ output by each neural network and compute the loss $L_{r,i}$ of task $r$; Step S4: solve for the gradient of $\theta = (W, U, B, V)$ using gradient descent and the BPTT algorithm, and determine the gradient of task $r$ with respect to the public information $X^{co}$; Step S5: determine the learning rate $\eta$ and update each weight, $W = W - \eta \delta_W$; Step S6: judge whether the neural network has become stable; if so, perform step S7; if not, return to step S2 and iteratively update the model parameters; Step S7: output the optimized model. The present invention can effectively use an RNN to learn the public features shared among multiple tasks and feed them into the learning of each individual task, thereby realizing information sharing. Moreover, by introducing the GRU structure into the RNN, the vanishing gradient problem can be effectively addressed.

Description

A multi-task learning method based on RNN

Technical field

The present invention relates to the field of neural network multi-task learning, and in particular to a multi-task learning method based on RNN.

Background

In practical applications, different tasks can be linked to one another in different ways, and multi-task learning is often more advantageous than single-task learning. For example, when each task has only a small amount of available data, multi-task learning can learn jointly from the data sets of several related tasks. Tasks may also be linked because they share some latent common representation. For example, in object recognition, the early stages of the human visual system represent all objects through a learned common feature set. Most earlier multi-task learning methods link the relationship between tasks through a functional concept.

For modeling sequence data, models based on neural networks (Neural Network) have achieved excellent results in speech recognition, language modeling, and video classification. These models mostly fall into two classes of neural network: feedforward neural networks (Feedforward Neural Network) and recurrent neural networks (Recurrent Neural Network). In theory, the traditional RNN structure can learn sequence information of arbitrary length. In practice, however, the larger the time interval, the weaker this learning ability of the RNN becomes. This recurrent structure is also difficult to train, because vanishing and exploding gradients occur easily. Many structures have been proposed to solve the vanishing gradient problem, including the long short-term memory (LSTM) recurrent neural network. However, the LSTM network structure is relatively complex, which costs more training time, and vanishing gradients still occur easily during backpropagation. To solve this problem, the GRU, which has a simpler structure, was proposed; compared with LSTM it is easier to implement and simpler to train. Its structure is shown in Fig. 1.
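For orientation, a GRU is commonly written with the update equations below. This is the usual textbook formulation and is not taken from Fig. 1; the symbols $z_t$ (update gate), $r_t$ (reset gate), $\tilde{h}_t$ (candidate state) and the gate-specific weights and biases are assumptions introduced here for illustration.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Because $h_t$ is a gated blend of $h_{t-1}$ and $\tilde{h}_t$, the previous state can pass through largely unchanged when $z_t$ is close to zero, which is the usual explanation for why gated units ease the vanishing gradient problem.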

Most existing multi-task learning methods link the relationship between tasks through a functional concept. For example, Baxter determines the relatedness between tasks through a single model selection criterion, namely that a set of optimal hypothesis classes exists among the tasks. Most existing multi-task learning methods have a relatively complex structure, such as the LSTM model, and are prone to vanishing gradients during backpropagation. On this basis, the present invention provides a multi-task learning method based on RNN. The RNN has a GRU structure, which can effectively prevent the vanishing gradient problem and is simpler than the LSTM structure, so that the obtained features are more accurate.

Summary of the invention

In view of the defects of the prior art, the present invention provides a multi-task learning method based on RNN. It relies on two properties: an RNN is able to learn contextual information, and in multi-task learning the participation of public information can improve the learning ability of each individual task. An RNN model is used to learn the public features shared among tasks, these features are fed as input compensation into the learning of each individual task, and each individual task is finally learned through a feedforward compensation layer (Feed Forward Layer, FF). The RNN with GRU structure can effectively prevent the vanishing gradient problem and is simpler than the LSTM structure, and the obtained features are more accurate.

To achieve the above object, the present invention provides a multi-task learning method based on RNN, the method comprising the following steps:

Step S1: Initialize the system parameters $\theta = (W, U, B, V)$, where W denotes the weight matrices connecting the layers of the neural network; U denotes the weight matrices applied to the data as it is input to the neural network; B denotes the bias matrices between the layers of the neural network; and V denotes the weight matrices from the hidden layer of the neural network to the softmax layer;

Step S2: Input samples $x_{1,i}, \dots, x_{R,i}$, learn the public information $X^{co}$, and feed the public information as compensation into the training of each individual task;

Step S3: Compute the predicted label vector $\hat{y}_{r,i}$ output by each neural network, and compute the loss $L_{r,i}$ of task $r$;

Step S4: Solve for the gradient of $\theta = (W, U, B, V)$ using gradient descent and the BPTT algorithm, and determine the gradient of task $r$ with respect to the public information $X^{co}$;

Step S5: Determine the learning rate $\eta$ and update each weight, $W = W - \eta \delta_W$, where $\delta_W$ denotes the partial derivative with respect to the weight matrix obtained by gradient descent during backpropagation in the neural network;

Step S6: Judge whether the neural network has become stable; if so, perform step S7; if not, return to step S2 and iteratively update the model parameters;

Step S7: Output the optimized model.

Wherein, step S2 further comprises: drawing one sample from each task and inputting it into the RNN, learning a context vector as the shared information, and thereby obtaining the public information $X^{co}$.

Wherein, step S3 further comprises: inputting the public information $X^{co}$ as input compensation into the learning of each individual task, learning each individual task through a feedforward compensation layer (Feed Forward Layer, FF), generating the predicted label vector $\hat{y}_{r,i}$, and computing the loss $L_{r,i}$ of task $r$ from the generated predicted label vector $\hat{y}_{r,i}$.

Wherein, step S3 further comprises: inputting the hidden layer output $h^{(r)}$ into the output layer and outputting the predicted label vector $\hat{y}_{r,i} = \mathrm{softmax}(z_{r,i})$ through the softmax function layer, where $z_{r,i} = V^{(r)} \cdot h^{(r)}$ and $h^{(r)} = g(U^{(r)} \cdot x_{r,i} + W^{(r)} \cdot X^{co} + b^{(r)})$, where $g(\cdot)$ denotes the sigmoid activation function, $U^{(r)}$ and $W^{(r)}$ are weight matrices, and $b^{(r)}$ is a bias vector. The loss of task $r$ is denoted $L_{r,i}$.

Wherein, step S5 further comprises determining the learning rate by the following method: $\eta = A e^{-\lambda n}$, where $n$ is the number of iterations during network training, $1 \le A \le 50$ and $0.0001 \le \lambda \le 0.001$; or $\eta(k) = e^{-\lambda (k-1)}$, where $0.0001 \le \lambda \le 0.001$ and $k$ is the iteration number.

Wherein, step S7 further comprises: determining the objective function and minimizing it, where $\lambda$ is the regularization coefficient.

The present invention can effectively use an RNN to learn the public features shared among multiple tasks and feed them into the learning of each individual task, thereby realizing information sharing. Moreover, by introducing the GRU structure into the RNN, the vanishing gradient problem can be effectively addressed.

The features and advantages of the present invention will become apparent from the following drawings and the detailed description of specific embodiments of the invention.

Description of the drawings

Fig. 1 is a structural diagram of the gated recurrent unit (GRU) of the prior art;

Fig. 2 is a schematic diagram of how the predicted label vector is generated in the RNN multi-task learning of the present invention based on public feature compensation;

Fig. 3 is a flow diagram of the iterative parameter update in the RNN-based multi-task learning method of the present invention.

Detailed description of the embodiments

In order to make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.

Fig. 2 is a schematic diagram of how the predicted label vector is generated in the RNN multi-task learning of the present invention based on public feature compensation. The specific method is as follows:

Define the sample set of each task, where $N_r$ denotes the number of samples of task $r$ and $M_r$ denotes the dimension of each sample. We assume that every task has the same number of samples, so $N_r$ is written as $N$. Therefore, the samples of the different views in each task are expressed as follows:

We divide the samples into two parts: $N_l$ labelled samples are used for training, and the other $N_u$ unlabelled samples are used for testing, with $N_l + N_u = N$. Our objective function combines the per-sample losses with a regularization term, where $L_{r,i}$ denotes the loss of the $i$-th sample of task $r$, $\lambda$ is the regularization coefficient, and $\theta$ denotes the weight matrices.

We draw one sample from each task, input it into the RNN, and learn a context vector as the shared information. After obtaining the public information $X^{co}$, we use $R$ feedforward compensation neural networks to learn each task separately. In this way, both public information and private information participate in the learning of each task, so the dependencies between tasks can be better exploited.

The input layer of the neural network of task $r$ comprises the sample $x_{r,i}$ and the public information $X^{co}$, which are input to the hidden layer:

$h^{(r)} = g(U^{(r)} \cdot x_{r,i} + W^{(r)} \cdot X^{co} + b^{(r)})$

where $g(\cdot)$ denotes the sigmoid activation function, $U^{(r)}$ and $W^{(r)}$ are weight matrices, and $b^{(r)}$ is a bias vector. Since each task uses a different amount of public information, the degree to which the public information $X^{co}$ participates in the training of task $r$ is determined by the weight matrix $W^{(r)}$. Next, the hidden layer output $h^{(r)}$ is input to the output layer, and the predicted label vector is output through the softmax function layer:

$\hat{y}_{r,i} = \mathrm{softmax}(z_{r,i})$

where $z_{r,i} = V^{(r)} \cdot h^{(r)}$, $V^{(r)}$ is a weight matrix, and $\hat{y}_{r,i}$ is the predicted output. We define the loss of task $r$ as $L_{r,i}$.
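The forward pass of one task network described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names (`matvec`, `task_forward`), the toy dimensions and all numeric weights are hypothetical.

```python
import math

def sigmoid(v):
    """Element-wise logistic sigmoid, the activation g(.) in the text."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def task_forward(x, x_co, U, W, b, V):
    """Forward pass of one task network with public-information compensation.

    h = g(U x + W x_co + b), z = V h, y_hat = softmax(z)
    """
    pre = [u + w + b_k for u, w, b_k in zip(matvec(U, x), matvec(W, x_co), b)]
    h = sigmoid(pre)
    z = matvec(V, h)
    return h, softmax(z)

# Toy sizes (assumed): 2-dim sample, 2-dim public information,
# 3 hidden units, 2 classes. All values are illustrative only.
U = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
W = [[0.2, 0.1], [-0.1, 0.4], [0.3, -0.3]]
b = [0.0, 0.1, -0.1]
V = [[0.3, -0.2, 0.1], [-0.1, 0.2, 0.4]]
h, y_hat = task_forward([1.0, 2.0], [0.5, -0.5], U, W, b, V)
print(y_hat)  # two class probabilities
```

Setting every entry of `W` to zero removes the public information from the task, which makes the role of $W^{(r)}$ as the "degree of participation" easy to inspect.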

As shown in Fig. 3, the present invention provides a multi-task learning method based on RNN. The method includes a parameter iteration update flow, which specifically comprises the following steps:

In step S1, $\theta = (W, U, B, V)$ means that the parameter set $\theta$ comprises W, U, B and V. Initializing the system parameters $\theta = (W, U, B, V)$ means supplying initial values for W, U, B and V, which can be set in advance according to the actual situation.

In step S2, $x_{1,i}, \dots, x_{R,i}$ are the samples drawn from each task respectively. Each sample is then input into the RNN for learning, a context vector is learned as the shared information, and the public information $X^{co}$ is obtained.
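One way to picture this step is to treat the R drawn samples as a short sequence, run a GRU over it, and take the final hidden state as the context vector. The sketch below does exactly that in plain Python. It is an assumption-laden illustration: the text does not say how the context vector is read out of the RNN, the gate equations are the standard GRU form with biases omitted, and every name (`gru_step`, `learn_public_information`, `init_params`) and value is hypothetical.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gru_step(x, h, p):
    """One standard GRU update (biases omitted); p holds six weight matrices."""
    z = [sigmoid(a + c) for a, c in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + c) for a, c in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [r_k * h_k for r_k, h_k in zip(r, h)]
    cand = [math.tanh(a + c)
            for a, c in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    # Blend the previous state and the candidate state with the update gate.
    return [(1.0 - z_k) * h_k + z_k * c_k for z_k, h_k, c_k in zip(z, h, cand)]

def learn_public_information(samples, p, hidden_size):
    """Feed one sample per task through the GRU.

    Assumption: the last hidden state is used as the context vector X^co.
    """
    h = [0.0] * hidden_size
    for x in samples:
        h = gru_step(x, h, p)
    return h

def init_params(input_size, hidden_size, seed=0):
    """Small random weights; a stand-in for the initialization in step S1."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for gate in ("z", "r", "h"):
        p["W" + gate] = mat(hidden_size, input_size)
        p["U" + gate] = mat(hidden_size, hidden_size)
    return p

# Three tasks (R = 3), one 3-dimensional sample drawn from each.
samples = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.5, 0.5, 0.1]]
params = init_params(input_size=3, hidden_size=4)
x_co = learn_public_information(samples, params, hidden_size=4)
print(x_co)  # a 4-dimensional context vector
```

Using the last hidden state keeps the sketch short; other read-outs, such as averaging the hidden states, would fit the description in the text equally well.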

In step S3, the public information $X^{co}$ is input as input compensation into the learning of each individual task, each individual task is learned through the feedforward compensation layer (Feed Forward Layer, FF), the predicted label vector $\hat{y}_{r,i}$ is generated, and the loss $L_{r,i}$ of task $r$ is computed from the generated predicted label vector $\hat{y}_{r,i}$;

The hidden layer output $h^{(r)}$ is input to the output layer, and the predicted label vector $\hat{y}_{r,i} = \mathrm{softmax}(z_{r,i})$ is output through the softmax function layer, where $z_{r,i} = V^{(r)} \cdot h^{(r)}$ and $h^{(r)} = g(U^{(r)} \cdot x_{r,i} + W^{(r)} \cdot X^{co} + b^{(r)})$, where $g(\cdot)$ denotes the sigmoid activation function, $U^{(r)}$ and $W^{(r)}$ are weight matrices, and $b^{(r)}$ is a bias vector. The loss of task $r$ is denoted $L_{r,i}$.

In step S4, gradient descent (Gradient Descent) is one of the most commonly used methods for solving the model parameters of a machine learning algorithm, i.e. for solving an unconstrained optimization problem. When the loss function of a machine learning algorithm is minimized, gradient descent can be used to iterate step by step toward the minimized loss function. The algorithm can be expressed in algebraic form or in matrix form (also called vector form); the algebraic form is easier to understand, while the matrix form is more concise. Gradient descent includes batch gradient descent (Batch Gradient Descent), stochastic gradient descent (Stochastic Gradient Descent) and mini-batch gradient descent (Mini-batch Gradient Descent). Batch gradient descent is the most common form: all samples are used for every parameter update, and the usual gradient descent algorithm for linear regression is batch gradient descent. In terms of training speed, stochastic gradient descent uses only one sample per iteration, so training is fast; but because a single sample determines the gradient direction, the solution may well not be optimal. In terms of convergence speed, because stochastic gradient descent iterates on one sample at a time, the iteration direction varies greatly and it cannot converge quickly to a local optimum. The update formula of stochastic gradient descent is: $\theta \leftarrow \theta - \eta \nabla_{\theta} L_{r,i}(\theta)$. Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent. The present invention can solve for the gradient of $\theta = (W, U, B, V)$ using any of the gradient descent methods described above.
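The three variants differ only in how many samples contribute to each update. The following sketch makes that concrete on a one-dimensional linear regression, which is chosen here purely for brevity and is not the network of the invention; the function name `gradient_descent`, the learning rate and the toy data are all assumptions.

```python
import random

def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=500, seed=0):
    """Fit y = w*x + b by minimizing the mean squared error.

    batch_size == len(xs): batch gradient descent
    batch_size == 1:       stochastic gradient descent
    otherwise:             mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the samples in a new order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the mean squared error over the current batch.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Same form as step S5: parameter = parameter - eta * gradient.
            w -= lr * gw
            b -= lr * gb
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1, no noise

print(gradient_descent(xs, ys, batch_size=len(xs)))  # batch
print(gradient_descent(xs, ys, batch_size=1))        # stochastic
print(gradient_descent(xs, ys, batch_size=2))        # mini-batch
```

On this noise-free toy problem all three variants approach the same solution; the differences in cost per update and in the variability of the update direction described above show up on larger, noisier data.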

BPTT (backpropagation through time) is an algorithm commonly used in neural networks; an exemplary algorithm is as follows:

The gradient of $\theta = (W, U, B, V)$ is solved using the gradient descent method described above together with the BPTT algorithm, and the gradient of task $r$ with respect to the public information $X^{co}$ is then determined.
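To make the last quantity concrete, the derivation below works out the gradient of one task's loss with respect to $X^{co}$ for the feedforward compensation network $h^{(r)} = g(U^{(r)} x_{r,i} + W^{(r)} X^{co} + b^{(r)})$, $z_{r,i} = V^{(r)} h^{(r)}$. It is a sketch under an explicit assumption: the loss is taken to be the cross-entropy between the softmax output $\hat{y}_{r,i}$ and a one-hot label $y_{r,i}$, which this text does not state. The symbols $y_{r,i}$, $\delta_z$ and $\delta_h$ are introduced here for illustration.

```latex
\begin{aligned}
\delta_z^{(r)} &= \frac{\partial L_{r,i}}{\partial z_{r,i}} = \hat{y}_{r,i} - y_{r,i} \\
\delta_h^{(r)} &= \left( V^{(r)\top} \delta_z^{(r)} \right) \odot h^{(r)} \odot \left( 1 - h^{(r)} \right) \\
\frac{\partial L_{r,i}}{\partial W^{(r)}} &= \delta_h^{(r)} \, X^{co\,\top} \\
\frac{\partial L_{r,i}}{\partial X^{co}} &= W^{(r)\top} \delta_h^{(r)} \\
\frac{\partial}{\partial X^{co}} \sum_{r=1}^{R} L_{r,i} &= \sum_{r=1}^{R} W^{(r)\top} \delta_h^{(r)}
\end{aligned}
```

The last line is the error signal that flows from all R task networks back into the shared RNN, where BPTT propagates it through the recurrent steps that produced $X^{co}$.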

In step S5, the learning rate $\eta$ is generally set manually and adjusted according to how well the neural network learns, usually according to the learning error of the network: as the error gradually decreases, the learning rate is reduced accordingly; for example, the next learning rate can be one tenth of the previous one.

Specifically, the learning rate can also be determined by the following formula: $\eta = A e^{-\lambda n}$, where $n$ is the number of iterations during network training, $1 \le A \le 50$ and $0.0001 \le \lambda \le 0.001$;

Alternatively, $\eta(k) = e^{-\lambda (k-1)}$, where $0.0001 \le \lambda \le 0.001$ and $k$ is the iteration number.
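Both exponential schedules are one-liners in code. The sketch below implements them directly from the formulas and enforces the stated ranges for $A$ and $\lambda$; the function names and default values are hypothetical.

```python
import math

def lr_scaled_exponential(n, A=1.0, lam=0.001):
    """eta = A * exp(-lam * n), with 1 <= A <= 50 and 0.0001 <= lam <= 0.001."""
    if not (1 <= A <= 50 and 0.0001 <= lam <= 0.001):
        raise ValueError("A or lam is outside the range given in the text")
    return A * math.exp(-lam * n)

def lr_exponential(k, lam=0.001):
    """eta(k) = exp(-lam * (k - 1)); k is the 1-based iteration number."""
    if not (0.0001 <= lam <= 0.001):
        raise ValueError("lam is outside the range given in the text")
    return math.exp(-lam * (k - 1))

# Learning rate at a few iteration counts for each schedule.
for step in (1, 100, 1000, 10000):
    print(step, lr_scaled_exponential(step, A=2.0), lr_exponential(step))
```

With $\lambda$ at its upper bound of 0.001 the rate falls by a factor of $e$ every 1000 iterations, so both schedules decay slowly compared with the one-tenth step reduction mentioned above.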

In step S6, whether the neural network has become stable is generally determined from the overall error between the output of the neural network and the true labels. As the number of training epochs increases (one epoch is one complete pass over the training data set), the neural network is considered to have learned effectively once the error curve finally levels off and the error value is below a given threshold.
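A stopping test of this kind can be sketched as follows. The text gives no concrete definition of "levels off", so the window length and flatness tolerance below are assumptions, as is the function name `has_converged`.

```python
def has_converged(errors, threshold, window=5, tolerance=1e-3):
    """Return True when the error curve has flattened and is below threshold.

    errors:    overall error per epoch, oldest first
    threshold: the given error threshold from the text
    window:    number of most recent epochs inspected (assumed)
    tolerance: largest spread within the window still counted as flat (assumed)
    """
    if len(errors) < window:
        return False
    recent = errors[-window:]
    flat = max(recent) - min(recent) <= tolerance
    return flat and recent[-1] < threshold

history = [0.9, 0.5, 0.2, 0.1001, 0.1, 0.1, 0.1, 0.1]
print(has_converged(history, threshold=0.15))
```

Requiring both conditions matters: a flat curve above the threshold indicates a stalled network rather than a trained one, and a low error that is still falling suggests further epochs are worthwhile.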

Step S7: Output the optimized model;

In this step, the objective function is determined and minimized, where $\lambda$ is the regularization coefficient. The model that minimizes the objective function is the optimal model.
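The exact form of the objective function is not reproduced in this text. As an illustration only, the sketch below assumes a common choice: the sum of the per-sample losses $L_{r,i}$ over all tasks plus $\lambda$ times the squared L2 norm of the weights. Both that form and the function name `objective` are assumptions.

```python
def objective(losses, params, lam):
    """Assumed objective: sum of L_{r,i} plus lam * squared L2 norm of weights.

    losses: losses[r][i] is the loss of sample i of task r
    params: list of weight matrices, each a list of rows
    lam:    regularization coefficient
    """
    data_term = sum(sum(task_losses) for task_losses in losses)
    l2_term = sum(w * w for matrix in params for row in matrix for w in row)
    return data_term + lam * l2_term

# Two tasks with per-sample losses, one 1x2 weight matrix.
value = objective([[1.0, 2.0], [0.5]], [[[1.0, -2.0]]], lam=0.1)
print(value)
```

A larger $\lambda$ penalizes large weights more strongly, trading a higher training loss for a simpler model.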

The method provided by the present invention can effectively use an RNN to learn the public features shared among multiple tasks and feed them into the learning of each individual task, thereby realizing information sharing. Moreover, by introducing the GRU structure into the RNN, the vanishing gradient problem can be effectively addressed.

The foregoing is merely a preferred embodiment of the present invention and does not limit the scope of the invention. Any equivalent structural transformation made under the concept of the present invention using the content of the description and drawings, and any direct or indirect application in other related technical fields, is included in the scope of patent protection of the present invention.

Claims

1. A multi-task learning method based on RNN, characterized in that the method comprises the following steps:

Step S1: Initialize the system parameters $\theta = (W, U, B, V)$, where W denotes the weight matrices connecting the layers of the neural network; U denotes the weight matrices applied to the data as it is input to the neural network; B denotes the bias matrices between the layers of the neural network; and V denotes the weight matrices from the hidden layer of the neural network to the softmax layer;

Step S2: Input samples $x_{1,i}, \dots, x_{R,i}$, learn the public information $X^{co}$, and feed the public information as compensation into the training of each individual task;

Step S3: Compute the predicted label vector $\hat{y}_{r,i}$ output by each neural network, and compute the loss $L_{r,i}$ of task $r$;

Step S4: Solve for the gradient of $\theta = (W, U, B, V)$ using gradient descent and the BPTT algorithm, and determine the gradient of task $r$ with respect to the public information $X^{co}$;

Step S5: Determine the learning rate $\eta$ and update each weight, $W = W - \eta \delta_W$, where $\delta_W$ denotes the partial derivative with respect to the weight matrix obtained by gradient descent during backpropagation in the neural network;

Step S6: Judge whether the neural network has become stable; if so, perform step S7; if not, return to step S2 and iteratively update the model parameters;

Step S7: Output the optimized model.
2. The method according to claim 1, characterized in that step S2 further comprises: drawing one sample from each task and inputting it into the RNN, learning a context vector as the shared information, and thereby obtaining the public information $X^{co}$.
3. The method according to claim 1, characterized in that step S3 further comprises: inputting the public information $X^{co}$ as input compensation into the learning of each individual task, learning each individual task through a feedforward compensation layer (Feed Forward Layer, FF), generating the predicted label vector $\hat{y}_{r,i}$, and computing the loss $L_{r,i}$ of task $r$ from the generated predicted label vector $\hat{y}_{r,i}$.
4. The method according to claim 3, characterized in that step S3 further comprises: inputting the hidden layer output $h^{(r)}$ into the output layer and outputting the predicted label vector $\hat{y}_{r,i} = \mathrm{softmax}(z_{r,i})$ through the softmax function layer, where $z_{r,i} = V^{(r)} \cdot h^{(r)}$ and $h^{(r)} = g(U^{(r)} \cdot x_{r,i} + W^{(r)} \cdot X^{co} + b^{(r)})$, where $g(\cdot)$ denotes the sigmoid activation function, $U^{(r)}$ and $W^{(r)}$ are weight matrices, and $b^{(r)}$ is a bias vector; the loss of task $r$ is denoted $L_{r,i}$.
5. The method according to claim 1, characterized in that step S5 further comprises determining the learning rate by the following method: $\eta = A e^{-\lambda n}$, where $n$ is the number of iterations during network training, $1 \le A \le 50$ and $0.0001 \le \lambda \le 0.001$; or $\eta(k) = e^{-\lambda (k-1)}$, where $0.0001 \le \lambda \le 0.001$ and $k$ is the iteration number.
6. The method according to claim 1, characterized in that step S7 further comprises: determining the objective function and minimizing it, where $\lambda$ is the regularization coefficient.