CN111553821A - Automatic problem solving method for application problems based on teacher-student network and multi-head decoder - Google Patents
- Publication number: CN111553821A (application CN202010402148.3A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06Q50/205 — Education administration or guidance
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24 — Classification techniques
- G06N3/045 — Combinations of networks
Abstract
The invention discloses an automatic problem solving method for application problems (math word problems) based on a teacher-student network and a multi-head decoder. First, a sequence-to-tree encoder-decoder model with a single tree-structured decoder is constructed and used as the teacher network. Then, another sequence-to-tree encoder-decoder network is constructed and extended with several tree-structured decoders, yielding a student network with a multi-head tree-structured decoder. The soft label vectors output by the teacher network and the 0-1 distributed label vectors provided in the original training samples, i.e. the hard label vectors, are then used together as the supervision signal to train the student network. At test time, the solution with the highest confidence among those generated by the multi-head decoder is selected as the output of the model. By exploiting the teacher model's ability to generate problem-solving equations that differ from the labels, assisted by the multi-head decoder structure, the invention achieves a better problem-solving effect.
Description
Technical Field
The invention relates to the technical field of computational linguistics, in particular to an automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder.
Background
Solving math problems posed as text descriptions, i.e. automatically answering mathematical questions, has attracted researchers' attention since the 1960s and is an important natural language understanding task. A typical math application problem gives a short textual description of a situation and asks for one or more unknown quantities. Early research designed automatic solvers with statistical machine learning and semantic analysis methods, but these methods generalized poorly because they required a great deal of effort to design appropriate features and expression templates.
In recent years, automatic solvers based on deep learning have emerged. These methods learn feature representations automatically, can generate solution expressions that do not appear in the training data set, and achieve high performance on large and complex data sets. The Deep Neural Solver (DNS), proposed in 2017, was the first of these methods; alongside the model, a large-scale math word problem (MWP) data set was collected to evaluate solver performance. Since then, much research has focused on improving deep-learning-based automatic solvers. On the one hand, representative improvements such as the group attention model (GROUPATT) and the expression normalization method (Math-EN) focus on improving the intermediate attention process and the model inputs, respectively. On the other hand, improving how quantity representations are obtained and generated is another potential route to better solution expressions. However, a method that takes advantage of the multi-solution nature of math problems to enhance model performance is still needed: existing data provide only one specific solution per problem, so if a solver generates a correct solution that differs from the annotation, the model is wrongly penalized, which lowers the accuracy of the generated results. Since the correctness of the answer matters more than the exact form of the problem-solving equation, solving models can be improved from this angle.
Disclosure of Invention
Aiming at the above defects in the prior art, the automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder solves the problem that existing deep learning models cannot account for correct solutions that differ from the labels.
In order to achieve the above purpose, the invention adopts the following technical scheme: an automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder, comprising the following steps:
S1, constructing a sequence-to-tree encoder-decoder model with only one decoder, and using it as the teacher network;
S2, training the teacher network on the training samples, taking the labels of the training samples as hard label vectors, and taking the class vectors output by the trained teacher network for the training samples as soft label vectors;
S3, constructing a sequence-to-tree encoder-decoder model with a multi-head tree-structured decoder, and using it as the student network;
S4, constructing a supervision signal from the hard label vectors and the soft label vectors jointly, and training the student network on the training samples with the constructed supervision signal;
S5, inputting the application problem to be solved into the trained student network, generating a plurality of problem-solving equations with the multi-head tree-structured decoder of the student network, and determining the corresponding confidences;
S6, selecting the problem-solving equation with the highest confidence, and computing the corresponding answer from that equation to complete the automatic solving.
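As an illustration of steps S5-S6, the following sketch (not part of the patent; all names and numbers are hypothetical) shows how one solution can be selected among the candidates produced by a multi-head decoder, scoring each candidate by the sum of its per-token log-probabilities:

```python
import math

def select_by_confidence(candidates):
    """Pick the equation whose decoder assigned the highest confidence.

    `candidates` is a list of (equation_tokens, token_probabilities) pairs,
    one per decoder head; confidence is taken here as the sum of per-token
    log-probabilities (the log of the sequence probability).
    """
    best_eq, best_conf = None, -math.inf
    for tokens, probs in candidates:
        conf = sum(math.log(p) for p in probs)  # log-confidence of the sequence
        if conf > best_conf:
            best_eq, best_conf = tokens, conf
    return best_eq, best_conf

# Three hypothetical decoder heads proposing prefix equations for one problem:
heads = [
    (["+", "3", "5"], [0.9, 0.8, 0.7]),
    (["+", "5", "3"], [0.95, 0.9, 0.9]),
    (["*", "2", "4"], [0.5, 0.6, 0.4]),
]
best, conf = select_by_confidence(heads)
print(best)  # the second head's equation wins
```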
Further, in step S1, the teacher network is a function f(x, θ_T) mapping the training sample x to its label y, wherein θ_T are the parameters of the teacher network;
the training sample is an application-problem stem and its corresponding solution; the label y is a 0-1 distributed label vector.
Further, in step S2, the method for training the teacher network on the training samples specifically comprises:
A1, obtaining the word-level hidden state representation H of the word set X in the application-problem stem text through the encoder structure in the teacher network;
A2, inputting the word-level hidden state representation H into the tree-structured decoder in the teacher network, which outputs a class vector at each step;
A3, determining the loss function of the teacher network based on the output class vectors and the label y of the training sample;
A4, training the teacher network on the training samples, using the labels of the training samples as the supervision signal, based on the loss function of the teacher network.
Further, the encoder structure in step A1 is a bidirectional recurrent neural network.
Further, in the step A3, the loss function L_NLL(θ_T) of the teacher network is:

L_NLL(θ_T) = -∑_{k1=1}^{V1} 1{y = k1} · log p1(y = k1 | x; θ_T)

wherein 1{·} is the indicator function, V1 is the number of numbers and operators in the problem-solving equations generated by the decoder in the teacher network, k1 indexes a particular number or operator in the loss function of the teacher network, and p1(·) is the distribution corresponding to the class vector output by the teacher network.
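A minimal sketch of the negative log-likelihood loss described above, written in plain Python over toy distributions (the vocabulary size, distributions, and step count are illustrative assumptions, not values from the patent):

```python
import math

def nll_loss(pred_dists, gold_indices):
    """Negative log-likelihood over the decoding steps of one equation.

    `pred_dists[t]` is the class-vector distribution the decoder emits at
    step t (length V1); `gold_indices[t]` is the index k1 of the labeled
    number/operator at step t (the position of the 1 in the hard label).
    """
    return -sum(math.log(dist[k]) for dist, k in zip(pred_dists, gold_indices))

# Two decoding steps over a toy vocabulary of size 3:
dists = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
gold = [0, 1]
loss = nll_loss(dists, gold)  # -(log 0.7 + log 0.8)
```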
Further, the encoder structure in the student network constructed in step S3 is the same as that in the teacher network. The student network comprises several tree-structured decoders with independent parameters; in the training process of the student network, a diversity regularization term is added at the output of each decoder, and different noise is added at the input of each decoder.
Further, in step S4, the method for determining the loss function in the student network training process includes:
B1, determining the loss L_NLL(θ_S) of the student network based on the hard label vector:

L_NLL(θ_S) = -∑_{k2=1}^{V2} 1{y = k2} · log p2(y = k2 | x; θ_S)

In the formula, θ_S are the parameters of the student network, 1{·} is the indicator function, V2 is the number of numbers and operators in the problem-solving equations generated by the decoders in the student network, k2 indexes a particular number or operator in the loss function of the student network, and p2(·) is the distribution corresponding to the class vector output by the student network;
B2, calculating the cross-entropy loss L_KD(θ_S; θ_T) between the student network output and the teacher network output:

L_KD(θ_S; θ_T) = -∑_{k=1}^{V} q(y = k | x; θ_T) · log p(y = k | x; θ_S)

wherein q(y = k | x; θ_T) takes the value at the k-th position of the soft label vector output by the teacher network, p(y = k | x; θ_S) is the k-th position of the class-vector distribution output by the student network, V is the number of numbers and operators in the problem-solving equations generated by both the teacher network and the student network, and k indexes a particular number or operator;
B3, based on the cross-entropy loss L_KD(θ_S; θ_T) and the loss L_NLL(θ_S), determining the teacher-to-student loss L_TS,i(θ_S, θ_T) corresponding to the i-th decoder in the student network:

L_TS,i(θ_S, θ_T) = (1 - α)·L_NLL(θ_S) + α·L_KD(θ_S; θ_T)

In the formula, α is an interpolation parameter;
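The interpolated loss L_TS,i = (1-α)·L_NLL + α·L_KD for a single decoding step can be sketched as follows; the distributions p and q and the value α = 0.5 are illustrative assumptions:

```python
import math

def kd_loss(student_dist, teacher_soft):
    """Cross-entropy between the teacher's soft labels q and the student's p."""
    return -sum(q * math.log(p) for q, p in zip(teacher_soft, student_dist))

def nll_loss(student_dist, gold_index):
    """Hard-label negative log-likelihood for one decoding step."""
    return -math.log(student_dist[gold_index])

def teacher_student_loss(student_dist, teacher_soft, gold_index, alpha):
    """L_TS,i = (1 - alpha) * L_NLL + alpha * L_KD for one step of decoder i."""
    return ((1 - alpha) * nll_loss(student_dist, gold_index)
            + alpha * kd_loss(student_dist, teacher_soft))

p = [0.6, 0.3, 0.1]  # student class vector at one step (toy values)
q = [0.7, 0.2, 0.1]  # teacher soft-label vector at the same step
loss = teacher_student_loss(p, q, gold_index=0, alpha=0.5)
```

Setting α = 0 recovers the pure hard-label loss; α = 1 trains purely against the teacher's soft labels.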
B4, obtaining the word-level hidden state representation H of the word set x in the application-problem stem text of a training sample through the encoder structure in the student network;
B5, masking the word-level hidden state representation H to generate the set of hidden-layer vector groups {H1, H2, ..., Hi, ..., HN} corresponding to H, and inputting the hidden-layer vector groups into the decoders of the student network in turn;
wherein i is the index of a decoder in the student network and N is the total number of decoders in the student network;
B6, introducing a diversity regularization term L_div and combining it with the losses L_TS,i(θ_S, θ_T) to obtain the loss function L of the student network training process.
Further, the step B5 specifically comprises:
B5-1, defining a mask rate P_mask;
B5-2, sampling a fraction P_mask of the positions in the word-level hidden state representation H using a Gaussian distribution, and generating a zero matrix Mask_zero of the same shape as H;
B5-3, according to the sampled zero matrix Mask_zero, assigning 1 to the sampled positions to generate the mask matrix Mask_p;
B5-4, determining the set of hidden vector groups {H1, H2, ..., Hi, ..., HN} corresponding to the word-level hidden state representation H by H_i = Mask_p ⊙ H;
wherein ⊙ is the element-wise matrix multiplication operator;
B5-5, inputting each hidden vector group H_i in {H1, H2, ..., Hi, ..., HN} into the corresponding decoder.
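A sketch of the masking procedure B5-1 to B5-5, with two simplifications that are assumptions of this example rather than the patent's exact procedure: positions are sampled uniformly instead of via a Gaussian, and the mask is read as zeroing a fraction P_mask of the positions (keeping the rest), which matches the stated goal of perturbing each decoder's input differently:

```python
import random

def make_masked_inputs(H, n_decoders, p_mask, seed=0):
    """Build one perturbed copy of H per decoder head (sketch of step B5).

    H is a list of hidden-state vectors, one per word. For each decoder a
    0/1 mask is sampled that zeroes a fraction p_mask of the word positions,
    and H_i = Mask_p (element-wise product) H.
    """
    rng = random.Random(seed)
    masked = []
    for _ in range(n_decoders):
        n_drop = int(round(p_mask * len(H)))
        drop = set(rng.sample(range(len(H)), n_drop))  # positions to zero
        masked.append([[0.0] * len(h) if m in drop else list(h)
                       for m, h in enumerate(H)])
    return masked

H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
groups = make_masked_inputs(H, n_decoders=3, p_mask=0.25)
# with p_mask = 0.25 each group zeroes exactly one of the four word positions
```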
Further, in the step B6, the introduced diversity regularization term L_div is:

L_div = ∑_{i ≠ i1} ∑_{t=1}^{T} L_div,t

where i and i1 are the indices of two different decoders, t indexes a symbol or number in the solution equation, T is the length of the generated sequence, L_div,t is the loss function measuring the similarity of the solution equations, L_div,t = 1 + S_COS(y_{i,t}, y_{i1,t}), S_COS(·,·) is the cosine similarity of two decoder outputs, and y_{i,t} and y_{i1,t} are the outputs of the i-th and i1-th decoders of the student network at position t.
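The diversity term can be sketched as below; the decoder outputs are toy 2-dimensional vectors (an illustrative assumption), and each decoder pair is summed once (i < i1) rather than over ordered pairs:

```python
import math

def cosine(u, v):
    """Cosine similarity S_COS of two output vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def diversity_loss(decoder_outputs):
    """L_div: sum of 1 + S_COS(y_{i,t}, y_{i1,t}) over decoder pairs and steps.

    `decoder_outputs[i][t]` is the output vector of decoder i at step t.
    Identical outputs contribute 1 + 1 = 2 per term (maximally penalized);
    opposite outputs contribute 1 - 1 = 0.
    """
    loss = 0.0
    n = len(decoder_outputs)
    for i in range(n):
        for i1 in range(i + 1, n):
            for y_it, y_i1t in zip(decoder_outputs[i], decoder_outputs[i1]):
                loss += 1.0 + cosine(y_it, y_i1t)
    return loss

# Two decoders, one step each: identical outputs are penalized, opposed are not
same = diversity_loss([[[1.0, 0.0]], [[1.0, 0.0]]])
diff = diversity_loss([[[1.0, 0.0]], [[-1.0, 0.0]]])
```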
Further, the loss function L of the student network training process in step B6 is:

L = L_TS,N(θ_S, θ_T) + β·L_div, with L_TS,N(θ_S, θ_T) = ∑_{i=1}^{N} L_TS,i(θ_S, θ_T)

where β is the weight of the regularization term and L_TS,N is the teacher-to-student loss summed over all decoders in the student network.
The invention has the beneficial effects that:
(1) The method first addresses a defect of the training objective in automatic application-problem solving systems: many application problems have multiple solutions, and even the same solution written in different mathematical forms (e.g. via the commutative law, associative law, or distributive law) carries a different label. Existing automatic solving systems, guided by a single solution, penalize all other solutions during training, which harms solving accuracy; what such labels measure is the exactness of one solution form rather than the correctness of the answer.
(2) The method corrects the training objective using a teacher-student network: since existing labels lack multiple solutions, the training objective must be changed by other means. Based on the observation that existing automatic solving systems already generate some solutions different from the labels, the class vectors these systems produce when predicting the number or symbol at each position actually contain partial multi-solution information. The teacher-student network structure exploits this information to help improve system performance.
(3) The method further enhances the diversity of the model's predictions with a multi-head decoder structure. The multi-head decoder can generate multiple solutions, and the system further increases the diversity of the generated results by perturbing the initialization vectors and applying a diversity regularization term, allowing the model to explore more possibilities. Notably, the final output contains only one solution: the system selects the best solution according to the confidence of the outputs.
Drawings
FIG. 1 is a flow chart of the method for automatically solving the problem of the application problem based on the teacher student network and the multi-head decoder.
FIG. 2 is a schematic diagram of teacher student network problem solving provided by the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, various changes are possible within the spirit and scope of the invention as defined by the appended claims, and all matter produced using the inventive concept falls under protection.
Example 1:
As shown in FIGS. 1-2, the automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder comprises the following steps:
S1, constructing a sequence-to-tree encoder-decoder model with only one decoder, and using it as the teacher network;
S2, training the teacher network on the training samples, taking the labels of the training samples as hard label vectors, and taking the class vectors output by the trained teacher network for the training samples as soft label vectors;
S3, constructing a sequence-to-tree encoder-decoder model with a multi-head tree-structured decoder, and using it as the student network;
S4, constructing a supervision signal from the hard label vectors and the soft label vectors jointly, and training the student network on the training samples with the constructed supervision signal;
S5, inputting the application problem to be solved into the trained student network, generating a plurality of problem-solving equations with the multi-head tree-structured decoder of the student network, and determining the corresponding confidences;
S6, selecting the problem-solving equation with the highest confidence, and computing the corresponding answer from that equation to complete the automatic solving.
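Once a problem-solving equation has been selected in step S6, its answer can be computed by evaluating the prefix-ordered token sequence produced by the tree-structured decoder. A minimal evaluator (the operator set is an illustrative assumption):

```python
def eval_prefix(tokens):
    """Evaluate a solution equation given in prefix (pre-order) form.

    The tree-structured decoder emits the equation tree in prefix order
    (operator before its operands), so the answer can be recovered with a
    single right-to-left pass over the tokens using a stack.
    """
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b}
    stack = []
    for tok in reversed(tokens):
        if tok in ops:
            a = stack.pop()  # left operand (was pushed later)
            b = stack.pop()  # right operand
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    return stack.pop()

# "3 + 5 * 2" written in prefix order: + 3 * 5 2
answer = eval_prefix(["+", "3", "*", "5", "2"])  # 13.0
```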
In step S1 of this embodiment, the teacher network is regarded as a function mapping the training sample x to the label y; the parameters of the teacher network are denoted θ_T, so the teacher network is written f(x, θ_T). The training samples of the teacher network are application-problem stems and their corresponding solutions; the supervision signal of the training process comes from the labels of the training samples, and the label y in this embodiment is a 0-1 distributed label vector, i.e. a hard label vector.
In step S2 of this embodiment, the method for training the teacher network on the training samples specifically comprises:
A1, obtaining the word-level hidden state representation H of the word set X in the application-problem stem text through the encoder structure in the teacher network;
wherein the word set X = {x1, ..., xm, ..., xM} and the word-level hidden state representation H = {h1, ..., hm, ..., hM}; the word-level hidden state representation H is the real-valued vector sequence corresponding to the stem text, xm is the m-th word in the stem text, and hm is the m-th element of the word-level hidden state representation;
a2, inputting the word-level hidden state representation H into a tree-structure-based decoder in a teacher network, and outputting a category vector at each moment;
When the word-level hidden state representation H is input into the tree-structured decoder of the teacher network, the decoder outputs a class vector (i.e. an output probability distribution) at each step, following the order of the prefix expression; the value at each position of the vector expresses the probability of generating a particular number or operator at that step;
A3, determining the loss function of the teacher network based on the output class vectors and the label y of the training sample (a 0-1 distributed label vector: the number or operator labeled at each step is set to 1 and all others to 0);
A4, training the teacher network on the training samples, using the labels of the training samples as the supervision signal, based on the loss function of the teacher network.
The encoder structure in step A1 is a bidirectional recurrent neural network or a unidirectional recurrent neural network; this embodiment uses a bidirectional recurrent neural network, which extracts the features of each word one by one in the order the words appear in the stem description, so that the influence of both preceding and following context on the semantics can be taken into account.
In the step A3, the loss function L_NLL(θ_T) of the teacher network is:

L_NLL(θ_T) = -∑_{k1=1}^{V1} 1{y = k1} · log p1(y = k1 | x; θ_T)

wherein 1{·} is the indicator function, V1 is the number of numbers and operators in the problem-solving equations generated by the decoder in the teacher network, k1 indexes a particular number or operator in the loss function of the teacher network, and p1(·) is the distribution corresponding to the class vector output by the teacher network.
The encoder of the student network constructed in step S3 of this embodiment has the same structure as the encoder of the teacher network; the biggest difference from the teacher network is that the student network comprises several tree-structured decoders with independent parameters. To make the outputs of different decoders as different as possible, a diversity regularization term is added at the output of each decoder during training, and different noise is added to the input of each decoder, so that the diversity of the multi-head decoder outputs is enhanced by varying the inputs.
In step S4 of this embodiment, the input and output when training the student network are x and y respectively, and the aim is to train the student network parameters θ_S such that f(x, θ_S): x → y. Based on this, in step S4, the method for determining the loss function in the student network training process comprises:
B1, determining the loss L_NLL(θ_S) of the student network based on the hard label vector:

L_NLL(θ_S) = -∑_{k2=1}^{V2} 1{y = k2} · log p2(y = k2 | x; θ_S)

In the formula, θ_S are the parameters of the student network, 1{·} is the indicator function, V2 is the number of numbers and operators in the problem-solving equations generated by the decoders in the student network, k2 indexes a particular number or operator in the loss function of the student network, and p2(·) is the distribution corresponding to the class vector output by the student network;
B2, calculating the cross-entropy loss L_KD(θ_S; θ_T) between the student network output and the teacher network output:

L_KD(θ_S; θ_T) = -∑_{k=1}^{V} q(y = k | x; θ_T) · log p(y = k | x; θ_S)

wherein q(y = k | x; θ_T) takes the value at the k-th position of the soft label vector output by the teacher network, p(y = k | x; θ_S) is the k-th position of the class-vector distribution output by the student network, V is the number of numbers and operators in the problem-solving equations generated by both the teacher network and the student network, and k indexes a particular number or operator;
B3, based on the cross-entropy loss L_KD(θ_S; θ_T) and the loss L_NLL(θ_S), determining the teacher-to-student loss L_TS,i(θ_S, θ_T) corresponding to the i-th decoder in the student network:

L_TS,i(θ_S, θ_T) = (1 - α)·L_NLL(θ_S) + α·L_KD(θ_S; θ_T)

In the formula, α is an interpolation parameter;
On the basis of the loss L_TS,i(θ_S, θ_T), the generated results are expected to be more diversified by feeding different hidden vectors to different decoders in the student network. Specifically, besides passing the word-level hidden state representation H obtained directly from the encoder to one decoder head, perturbations are added to the inputs of the other decoders; the perturbed inputs are obtained through the following steps B4-B5:
B4, obtaining the word-level hidden state representation H of the word set x in the application-problem stem text of a training sample through the encoder structure in the student network;
B5, masking the word-level hidden state representation H to generate the set of hidden-layer vector groups {H1, H2, ..., Hi, ..., HN} corresponding to H, and inputting the hidden-layer vector groups into the decoders of the student network in turn;
wherein i is the index of a decoder in the student network and N is the total number of decoders in the student network;
B6, introducing a diversity regularization term L_div and combining it with the losses L_TS,i(θ_S, θ_T) to obtain the loss function L of the student network training process.
The step B5 specifically comprises:
B5-1, defining a mask rate P_mask;
B5-2, sampling a fraction P_mask of the positions in the word-level hidden state representation H using a Gaussian distribution, and generating a zero matrix Mask_zero of the same shape as H;
B5-3, according to the sampled zero matrix Mask_zero, assigning 1 to the sampled positions to generate the mask matrix Mask_p;
B5-4, determining the set of hidden vector groups {H1, H2, ..., Hi, ..., HN} corresponding to the word-level hidden state representation H by H_i = Mask_p ⊙ H;
wherein ⊙ is the element-wise matrix multiplication operator;
B5-5, inputting each hidden vector group H_i in {H1, H2, ..., Hi, ..., HN} into the corresponding decoder.
In this embodiment, step B6, to encourage different decoders to generate different results, we introduce a diversification regularization term LdivIntroduced diversified regularization term LdivComprises the following steps:
where i, i1 is the number of two different decoders, T is the sign or value number in the solution equation, T is the length of the generated sequence, Ldiv,tTo calculate the loss function of the similarity of the solution equations, and Ldiv,t=1+SCOS(yi,t,yi1,t),SCOS(. is the cosine similarity of any two decoder outputs, yi,t,yi1,tThe output of the ith decoder and the ith 1 decoder in the student network at the position t respectively.
More specifically, we use cosine similarity to measure the difference between the outputs of different decoders, and we aim to promote the diversity of the solution equations generated, and we do not have to perform a bundle search for any two decoder outputs, if the difference between them is too large, so we can get the loss function L of the student network training process in step B6 as:
where β is the weight of the regularization term, LTS,NLoss of the teacher network to the student network corresponding to all decoders in the student network.
Example 2:
This embodiment provides a comparison of the proposed method with existing solving methods on two commonly used data sets:
The two data sets are MAWPS, with 2373 problems, and Math23K, with 23162 problems. For the Math23K data set, some methods were evaluated on the publicly available train/test split (denoted "Math23K"), while others were evaluated by 5-fold cross-validation (denoted "Math23K*"). For the MAWPS data set, models were evaluated by 5-fold cross-validation. Following previous work, solution accuracy is used as the evaluation metric. As shown in Table 1 (each entry is the accuracy of a model on the test set; larger is better), the method achieves better results than the existing GROUPATT, Math-EN, and DNS methods.
Table 1: Comparison of the proposed method with existing methods (solution accuracy, %)

|                 | MAWPS | Math23K | Math23K* |
| DNS             | 59.5  | -       | 58.1     |
| Math-EN         | 69.2  | 66.9    | -        |
| GROUPATT        | 76.1  | 69.5    | 66.9     |
| Proposed method | 84.4  | 77.4    | 75.1     |
The invention has the beneficial effects that:
(1) the method firstly considers the defect of the training target in the automatic solving application problem system, a plurality of application problems can have a plurality of solutions, and even if the same writing method is adopted and different mathematical forms (such as an exchange law, a combination law, a distribution rate and the like) are used, the labels of the solutions are different. Therefore, the existing automatic problem solving system takes a single solution as guidance, penalizes training targets of all other solutions and is harmful to improving the accuracy of the problem solving, and therefore, the measured label is more the accuracy of the solution rather than the accuracy of the answer.
(2) The method corrects the training target by using a teacher-student network: in the case of the existing label lacking multiple solutions, we need to change the training target by other methods. Based on the observation that the existing automatic problem solving systems generate a part of solutions different from labels, the category vectors of numbers or symbols for predicting each position generated by the systems in the generation process actually contain the information of partial multi-solutions. Through the teacher student network structure, the teacher student network structure utilizes the information to help improve the performance of the system.
(3) The method further enhances the diversity of model predictions with a multi-head decoder structure. The multi-head decoder can generate multiple solutions, and the system further diversifies the generated results by perturbing the initialization vectors and applying a diversification regularization term, allowing the model to explore more possibilities. Notably, the final output contains only one solution: the system selects the best solution according to the confidence of each output.
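As a loose illustration (not part of the claims), the selection step described above can be sketched as follows. The `candidates` structure and the use of mean token log-probability as the confidence score are assumptions; the patent only states that the best solution is chosen by the confidence of the output:

```python
def select_best_solution(candidates):
    """Pick the solving equation with the highest confidence.

    `candidates` is a list of (equation_tokens, token_log_probs) pairs
    produced by the decoder heads; the confidence is taken here as the
    mean token log-probability -- an assumption, since the patent does
    not fix how the confidence is computed.
    """
    def confidence(pair):
        _, log_probs = pair
        return sum(log_probs) / len(log_probs)
    return max(candidates, key=confidence)[0]

# Hypothetical outputs of three decoder heads for one word problem.
candidates = [
    (["x", "=", "3", "+", "5"], [-0.2, -0.1, -0.4, -0.3, -0.2]),
    (["x", "=", "5", "+", "3"], [-0.1, -0.1, -0.2, -0.1, -0.1]),
    (["x", "=", "3", "*", "5"], [-0.9, -0.8, -1.2, -1.5, -1.0]),
]
best = select_best_solution(candidates)  # the second head has the highest mean
```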
Claims (10)
1. An automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder is characterized by comprising the following steps of:
s1, constructing a sequence-to-tree encoder-decoder model with a single tree-structured decoder, and using it as the teacher network;
s2, training the teacher network on the training samples, taking the labels of the training samples as hard label vectors, and taking the class vectors output by the trained teacher network for the training samples as soft label vectors;
s3, constructing a sequence-to-tree encoder-decoder model with a multi-head tree-structured decoder, and using it as the student network;
s4, simultaneously constructing a supervision signal based on the hard label vector and the soft label vector, and training the student network by using a training sample based on the constructed supervision signal;
s5, inputting the application questions to be solved into the trained student network, generating a plurality of problem solving equations by using a multi-head tree structure decoder of the student network, and determining corresponding confidence coefficients;
s6, selecting the solving equation corresponding to the highest confidence coefficient, and solving the corresponding answer according to the solving equation to complete the automatic solving.
2. The method of claim 1, wherein in step S1, the teacher network is modeled as a function f(x, θ_T) for a training sample x with label y, where θ_T denotes the parameters of the teacher network;
the training sample consists of an application question stem and its corresponding solving scheme; the label y is a label vector with values distributed in {0, 1}.
3. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 2, wherein in said step S2, the teacher network is trained on the training samples specifically as follows:
a1, obtaining a word-level hidden state representation H of a character word set X in an application topic stem text through an encoder structure in a teacher network;
a2, inputting the word-level hidden state representation H into a tree-structure-based decoder in a teacher network, and outputting a category vector at each moment;
a3, determining a loss function of the teacher network based on the output class vector and the label y of the training sample;
A4, based on the loss function of the teacher network, training the teacher network with the training samples, using the labels of the training samples as the supervision signal of the training process.
4. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 3, wherein said encoder structure in step A1 is a bidirectional recurrent neural network.
5. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 3, wherein in said step A3, the loss function L_NLL(θ_T) of the teacher network is:

L_NLL(θ_T) = −Σ_t Σ_{k1∈V1} 1{y_t = k1} log p1(y_t = k1 | x; θ_T)

wherein 1{·} is the indicator function, V1 is the vocabulary of numbers and operators in the problem-solving equations generated by the decoder in the teacher network, k1 is a particular number or operator in V1, and p1(·) is the distribution corresponding to the class vector output by the teacher network.
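For illustration only (not part of the claims), the negative log-likelihood loss of claim 5 can be sketched in numpy; the array shapes are assumptions, since the patent does not specify them:

```python
import numpy as np

def teacher_nll_loss(probs, labels):
    """Negative log-likelihood loss of the teacher network (claim 5 sketch).

    probs  -- (T, V1) array: the class-vector distribution p1 output by the
              tree-structured decoder at each of T time steps over the V1
              candidate numbers/operators.
    labels -- length-T integer array of gold indices; the indicator
              1{y_t = k1} selects probs[t, labels[t]] from each row.
    """
    t_idx = np.arange(len(labels))
    return -np.sum(np.log(probs[t_idx, labels]))

# Toy check: a uniform distribution over 4 symbols for 2 time steps.
probs = np.full((2, 4), 0.25)
loss = teacher_nll_loss(probs, np.array([0, 1]))  # = 2 * log(4)
```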
6. The method of claim 3, wherein the encoder of the student network constructed in step S3 has the same structure as the encoder of the teacher network; the student network comprises a plurality of tree-structured decoders with independent parameters; a diversification regularization term is added to the outputs of the decoders; and different noise is added to the input of each decoder during training of the student network.
7. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 6, wherein in said step S4, the loss function of the student network training process is determined as follows:
b1, determining the hard-label loss L_NLL(θ_S) of the student network:

L_NLL(θ_S) = −Σ_t Σ_{k2∈V2} 1{y_t = k2} log p2(y_t = k2 | x; θ_S)

in the formula, θ_S denotes the parameters of the student network, 1{·} is the indicator function, V2 is the vocabulary of numbers and operators in the problem-solving equations generated by the decoders in the student network, k2 is a particular number or operator in V2, and p2(·) is the distribution corresponding to the class vector output by the student network;
b2, calculating the cross-entropy loss L_KD(θ_S; θ_T) between the student network output and the teacher network output:

L_KD(θ_S; θ_T) = −Σ_t Σ_{k∈V} q(y_t = k | x; θ_T) log p(y_t = k | x; θ_S)

wherein q(y = k | x; θ_T) takes out the value at the k-th position of the soft label vector output by the teacher network, p(y = k | x; θ_S) is the distribution at the k-th position of the class vector output by the student network, V is the shared vocabulary of numbers and operators in the problem-solving equations generated by the teacher network and the student network, and k is a particular number or operator;
b3, based on the cross-entropy loss L_KD(θ_S; θ_T) and the loss L_NLL(θ_S), determining the teacher-to-student loss L_TS,i(θ_S, θ_T) corresponding to the i-th decoder in the student network:

L_TS,i(θ_S, θ_T) = (1 − α) L_NLL(θ_S) + α L_KD(θ_S; θ_T)

in the formula, α is an interpolation parameter;
b4, obtaining a word-level hidden state representation H of a character word set x in an application question stem text in a training sample through an encoder structure in a student network;
b5, generating a set of hidden vector groups {H_1, H_2, ..., H_i, ..., H_N} corresponding to the word-level hidden state representation H by masking H, and sequentially inputting each hidden vector group into the corresponding decoder of the student network;
wherein i is the index of a decoder in the student network, and N is the total number of decoders in the student network;
b6, introducing a diversification regularization term L_div and combining it with the losses L_TS,i(θ_S, θ_T) to obtain the loss function L of the student network training process.
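Steps B1 through B3 above can be sketched together in numpy for illustration (not part of the claims); the array shapes are assumptions:

```python
import numpy as np

def student_head_loss(student_probs, teacher_probs, labels, alpha=0.5):
    """Per-head loss L_TS,i of steps B1-B3 in claim 7 (a numpy sketch).

    student_probs -- (T, V) distributions p2 output by one student decoder;
    teacher_probs -- (T, V) soft labels q output by the trained teacher;
    labels        -- length-T integer gold indices (the hard label);
    alpha         -- the interpolation parameter of step B3.
    """
    t_idx = np.arange(len(labels))
    # B1: hard-label negative log-likelihood L_NLL(theta_S).
    l_nll = -np.sum(np.log(student_probs[t_idx, labels]))
    # B2: cross entropy L_KD between teacher soft labels and student output.
    l_kd = -np.sum(teacher_probs * np.log(student_probs))
    # B3: interpolate the two supervision signals.
    return (1.0 - alpha) * l_nll + alpha * l_kd

# Toy check: one time step, two symbols, one-hot teacher, uniform student.
loss = student_head_loss(np.array([[0.5, 0.5]]),
                         np.array([[1.0, 0.0]]),
                         np.array([0]), alpha=0.3)
```

In this toy case both terms equal log(2), so the interpolated loss is log(2) regardless of alpha.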
8. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 7, wherein said step B5 specifically comprises:
b5-1, defining a mask rate P_mask;
b5-2, sampling, by using a Gaussian distribution, the positions accounting for a proportion P_mask of the word-level hidden state representation H, and generating a zero matrix Mask_zero of the same shape as H;
b5-3, assigning 1 to the sampled positions of the zero matrix Mask_zero generated by sampling, so as to generate the mask matrix Mask_p;
b5-4, determining, by H_i = Mask_p ⊙ H, the set of hidden vector groups {H_1, H_2, ..., H_i, ..., H_N} corresponding to the word-level hidden state representation H;
wherein ⊙ is the element-wise (Hadamard) matrix multiplication operator;
b5-5, inputting each hidden vector group H_i in {H_1, H_2, ..., H_i, ..., H_N} into its corresponding decoder.
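The masking procedure of claim 8 can be sketched in numpy for illustration (not part of the claims). The shapes, the Gaussian-score selection rule, and the choice to zero out a P_mask fraction of positions per head are all assumptions layered on the claim's Mask_zero/Mask_p construction:

```python
import numpy as np

def masked_hidden_groups(H, n_decoders, p_mask, seed=None):
    """Steps B5-1..B5-5 as a numpy sketch (shapes and details are assumed).

    For each of the n_decoders heads, a fraction p_mask of the positions of
    the word-level hidden state H (shape: seq_len x dim) is selected via
    Gaussian scores and zeroed out, so every head receives a differently
    perturbed view H_i = Mask_p * H (element-wise product).
    """
    rng = np.random.default_rng(seed)
    seq_len = H.shape[0]
    n_masked = int(round(p_mask * seq_len))
    groups = []
    for _ in range(n_decoders):
        # B5-2: score each position with Gaussian noise and take the
        # p_mask fraction with the lowest scores as the masked positions.
        scores = rng.standard_normal(seq_len)
        masked_pos = np.argsort(scores)[:n_masked]
        # B5-3: build the 0/1 mask matrix Mask_p.
        mask_p = np.ones_like(H)
        mask_p[masked_pos] = 0.0
        # B5-4: element-wise (Hadamard) product Mask_p ⊙ H.
        groups.append(mask_p * H)
    return groups

H = np.ones((10, 4))
groups = masked_hidden_groups(H, n_decoders=3, p_mask=0.2, seed=0)
```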
9. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 7, wherein in said step B6, the introduced diversification regularization term L_div is:

L_div = Σ_{i≠i1} Σ_{t=1}^{T} L_div,t

where i and i1 are the indices of two different decoders, t is the position of a symbol or number in the solving equation, T is the length of the generated sequence, L_div,t is the loss function measuring the similarity of the solving equations, and L_div,t = 1 + S_COS(y_{i,t}, y_{i1,t}), where S_COS(·,·) is the cosine similarity of any two decoder outputs, and y_{i,t}, y_{i1,t} are respectively the outputs of the i-th decoder and the i1-th decoder in the student network at position t.
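The regularization term of claim 9 can be sketched in numpy for illustration (not part of the claims); the array shape and the counting of each unordered decoder pair once are assumptions:

```python
import numpy as np

def diversity_regularizer(head_outputs):
    """Diversification term L_div of claim 9 (numpy sketch).

    head_outputs -- (N, T, d) array: output vector y_{i,t} of each of the
    N decoder heads at each of the T positions of the generated sequence.
    For every pair of distinct heads (i, i1) and every position t the term
    L_div,t = 1 + cos(y_{i,t}, y_{i1,t}) is accumulated (each unordered
    pair counted once here), so identical heads incur the largest penalty
    and diametrically opposed heads incur none.
    """
    N, T, _ = head_outputs.shape
    total = 0.0
    for i in range(N):
        for i1 in range(i + 1, N):
            for t in range(T):
                a, b = head_outputs[i, t], head_outputs[i1, t]
                cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
                total += 1.0 + cos
    return total

# Two identical heads are maximally penalized; opposed heads are not.
same = diversity_regularizer(np.array([[[1.0, 0.0]], [[1.0, 0.0]]]))
opposite = diversity_regularizer(np.array([[[1.0, 0.0]], [[-1.0, 0.0]]]))
```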
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010402148.3A CN111553821B (en) | 2020-05-13 | 2020-05-13 | Automatic problem solving method for application problems based on teacher-student network and multi-head decoder |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111553821A true CN111553821A (en) | 2020-08-18 |
CN111553821B CN111553821B (en) | 2021-04-27 |
Family
ID=72004626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010402148.3A Active CN111553821B (en) | 2020-05-13 | 2020-05-13 | Automatic problem solving method for application problems based on teacher-student network and multi-head decoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111553821B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112836801A (en) * | 2021-02-03 | 2021-05-25 | 上海商汤智能科技有限公司 | Deep learning network determination method and device, electronic equipment and storage medium |
CN117521812A (en) * | 2023-11-20 | 2024-02-06 | 华中师范大学 | Automatic arithmetic text question solving method and system based on variational knowledge distillation |
CN117521812B (en) * | 2023-11-20 | 2024-06-07 | 华中师范大学 | Automatic arithmetic text question solving method and system based on variational knowledge distillation |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180336465A1 (en) * | 2017-05-18 | 2018-11-22 | Samsung Electronics Co., Ltd. | Apparatus and method for student-teacher transfer learning network using knowledge bridge |
US20180365564A1 (en) * | 2017-06-15 | 2018-12-20 | TuSimple | Method and device for training neural network |
CN110428010A (en) * | 2019-08-05 | 2019-11-08 | 中国科学技术大学 | Knowledge method for tracing |
CN110739003A (en) * | 2019-10-23 | 2020-01-31 | 北京计算机技术及应用研究所 | Voice enhancement method based on multi-head self-attention mechanism |
Non-Patent Citations (1)
Title |
---|
QIAN GUO ET AL.: ""MS-Pointer Network: Abstractive Text Summary"", 《IEEE ACCESS》 * |
Also Published As
Publication number | Publication date |
---|---|
CN111553821B (en) | 2021-04-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition | |
CN107967318A (en) | A kind of Chinese short text subjective item automatic scoring method and system using LSTM neutral nets | |
CN110134946B (en) | Machine reading understanding method for complex data | |
Xue et al. | A hierarchical BERT-based transfer learning approach for multi-dimensional essay scoring | |
CN113673254B (en) | Knowledge distillation position detection method based on similarity maintenance | |
CN110781681B (en) | Automatic first-class mathematic application problem solving method and system based on translation model | |
CN113486645A (en) | Text similarity detection method based on deep learning | |
Shakeel et al. | A multi-cascaded deep model for bilingual sms classification | |
Lin et al. | Automated prediction of item difficulty in reading comprehension using long short-term memory | |
CN114153942B (en) | Event time sequence relation extraction method based on dynamic attention mechanism | |
CN111553821B (en) | Automatic problem solving method for application problems based on teacher-student network and multi-head decoder | |
Ye et al. | Machine learning techniques to automate scoring of constructed-response type assessments | |
Zhong | [Retracted] Evaluation of Traditional Culture Teaching Efficiency by Course Ideological and Political Integration Lightweight Deep Learning | |
CN116521872B (en) | Combined recognition method and system for cognition and emotion and electronic equipment | |
Song | [Retracted] An Evaluation Method of English Teaching Ability Based on Deep Learning | |
CN110969010A (en) | Problem generation method based on relationship guidance and dual-channel interaction mechanism | |
CN114579706B (en) | Automatic subjective question review method based on BERT neural network and multi-task learning | |
CN115935969A (en) | Heterogeneous data feature extraction method based on multi-mode information fusion | |
CN114692615A (en) | Small sample semantic graph recognition method for small languages | |
CN114970557A (en) | Knowledge enhancement-based cross-language structured emotion analysis method | |
CN114840679A (en) | Robot intelligent learning guiding method based on music theory knowledge graph reasoning and application | |
CN115617959A (en) | Question answering method and device | |
Wang et al. | Teacher Talk Moves in K12 Mathematics Lessons: Automatic Identification, Prediction Explanation, and Characteristic Exploration | |
Li et al. | A Multimodal Machine Learning Framework for Teacher Vocal Delivery Evaluation | |
Li et al. | Automated essay scoring incorporating multi-level semantic features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||