CN111553821A - Automatic problem solving method for application problems based on teacher-student network and multi-head decoder - Google Patents
- Publication number: CN111553821A (application CN202010402148.3A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06Q50/205 — Education administration or guidance
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24 — Classification techniques
- G06N3/045 — Combinations of networks
Abstract
The invention discloses an automatic problem solving method for application problems (math word problems) based on a teacher-student network and a multi-head decoder. First, a sequence-to-tree encoder-decoder model with a single tree-structured decoder is constructed and used as the teacher network. Then, another sequence-to-tree encoder-decoder network is constructed and extended with several tree-structured decoders, yielding a student network with a multi-head tree-structured decoder. The soft label vectors output by the teacher network and the 0-1 distributed label vectors provided in the original training samples, i.e. the hard label vectors, are then used together as the supervision signal to train the student network. At test time, the solution with the highest confidence among those generated by the multi-head decoder is selected as the output of the model. By exploiting the teacher model's ability to generate problem-solving equations that differ from the labels, assisted by the multi-head decoder structure, the invention achieves a better problem-solving effect.
Description
Technical Field
The invention relates to the technical field of computational linguistics, in particular to an automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder.
Background
Solving math problems posed as text descriptions, i.e. automatically answering mathematical questions, has attracted researchers' attention since the 1960s and is an important natural language understanding task. A typical math application problem gives a short textual description of a situation and asks for one or more unknown quantities. Early research designed automatic solvers with statistical machine learning and semantic analysis methods, but these methods generalized poorly because they required a great deal of effort to design appropriate features and expression templates.
In recent years, automatic solvers based on deep learning have emerged. These methods learn feature representations automatically, can generate solution expressions that do not appear in the training data set, and achieve high performance on large and complex data sets. The Deep Neural Solver (DNS), proposed in 2017, was the first of these methods; alongside the model, a large-scale math word problem (MWP) data set was collected to evaluate solver performance. Since then, much research has focused on improving deep-learning-based automatic solvers. On the one hand, representative improvements such as the group attention model (GROUPATT) and the expression normalization method (Math-EN) focus on improving the intermediate attention process and the model inputs, respectively. On the other hand, improving how quantity representations are obtained and generated is another potential route to better solution expressions. However, a method that takes advantage of the multi-solution nature of math problems to enhance model performance is still needed: existing data provide only one specific solution per problem, so if a solver generates a correct solution that differs from the annotation, the model is wrongly penalized, which lowers the accuracy of the generated results. Since the correctness of the answer matters more than the exact form of the problem-solving equation, solving models can be improved from this angle.
Disclosure of Invention
Aiming at the above defects in the prior art, the automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder solves the problem that existing deep learning models cannot account for correct solutions that differ from the labels.
In order to achieve the above purpose, the invention adopts the following technical scheme: an automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder, comprising the following steps:
S1, constructing a sequence-to-tree encoder-decoder model with only one decoder, and using it as the teacher network;
S2, training the teacher network on the training samples, taking the labels of the training samples as hard label vectors, and taking the class vectors output by the trained teacher network for the training samples as soft label vectors;
S3, constructing a sequence-to-tree encoder-decoder model with a multi-head tree-structured decoder, and using it as the student network;
S4, constructing a supervision signal from the hard label vectors and the soft label vectors jointly, and training the student network on the training samples with the constructed supervision signal;
S5, inputting the application problem to be solved into the trained student network, generating a plurality of problem-solving equations with the multi-head tree-structured decoder of the student network, and determining the corresponding confidences;
S6, selecting the problem-solving equation with the highest confidence, and computing the corresponding answer from that equation to complete the automatic solving.
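As an illustration of steps S5-S6, the following sketch (not part of the patent; all names and numbers are hypothetical) shows how one solution can be selected among the candidates produced by a multi-head decoder, scoring each candidate by the sum of its per-token log-probabilities:

```python
import math

def select_by_confidence(candidates):
    """Pick the equation whose decoder assigned the highest confidence.

    `candidates` is a list of (equation_tokens, token_probabilities) pairs,
    one per decoder head; confidence is taken here as the sum of per-token
    log-probabilities (the log of the sequence probability).
    """
    best_eq, best_conf = None, -math.inf
    for tokens, probs in candidates:
        conf = sum(math.log(p) for p in probs)  # log-confidence of the sequence
        if conf > best_conf:
            best_eq, best_conf = tokens, conf
    return best_eq, best_conf

# Three hypothetical decoder heads proposing prefix equations for one problem:
heads = [
    (["+", "3", "5"], [0.9, 0.8, 0.7]),
    (["+", "5", "3"], [0.95, 0.9, 0.9]),
    (["*", "2", "4"], [0.5, 0.6, 0.4]),
]
best, conf = select_by_confidence(heads)
print(best)  # the second head's equation wins
```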
Further, in step S1, the teacher network is a function f(x, θ_T) mapping the training sample x to its label y, wherein θ_T are the parameters of the teacher network;
the training sample is an application-problem stem and its corresponding solution; the label y is a 0-1 distributed label vector.
Further, in step S2, the method for training the teacher network on the training samples specifically comprises:
A1, obtaining the word-level hidden state representation H of the word set X in the application-problem stem text through the encoder structure in the teacher network;
A2, inputting the word-level hidden state representation H into the tree-structured decoder in the teacher network, which outputs a class vector at each step;
A3, determining the loss function of the teacher network based on the output class vectors and the label y of the training sample;
A4, training the teacher network on the training samples, using the labels of the training samples as the supervision signal, based on the loss function of the teacher network.
Further, the encoder structure in step A1 is a bidirectional recurrent neural network.
Further, in the step A3, the loss function L_NLL(θ_T) of the teacher network is:

L_NLL(θ_T) = -∑_{k1=1}^{V1} 1{y = k1} · log p1(y = k1 | x; θ_T)

wherein 1{·} is the indicator function, V1 is the number of numbers and operators in the problem-solving equations generated by the decoder in the teacher network, k1 indexes a particular number or operator in the loss function of the teacher network, and p1(·) is the distribution corresponding to the class vector output by the teacher network.
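A minimal sketch of the negative log-likelihood loss described above, written in plain Python over toy distributions (the vocabulary size, distributions, and step count are illustrative assumptions, not values from the patent):

```python
import math

def nll_loss(pred_dists, gold_indices):
    """Negative log-likelihood over the decoding steps of one equation.

    `pred_dists[t]` is the class-vector distribution the decoder emits at
    step t (length V1); `gold_indices[t]` is the index k1 of the labeled
    number/operator at step t (the position of the 1 in the hard label).
    """
    return -sum(math.log(dist[k]) for dist, k in zip(pred_dists, gold_indices))

# Two decoding steps over a toy vocabulary of size 3:
dists = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
gold = [0, 1]
loss = nll_loss(dists, gold)  # -(log 0.7 + log 0.8)
```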
Further, the encoder structure in the student network constructed in step S3 is the same as that in the teacher network. The student network comprises several tree-structured decoders with independent parameters; in the training process of the student network, a diversity regularization term is added at the output of each decoder, and different noise is added at the input of each decoder.
Further, in step S4, the method for determining the loss function in the student network training process includes:
B1, determining the loss L_NLL(θ_S) of the student network based on the hard label vector:

L_NLL(θ_S) = -∑_{k2=1}^{V2} 1{y = k2} · log p2(y = k2 | x; θ_S)

In the formula, θ_S are the parameters of the student network, 1{·} is the indicator function, V2 is the number of numbers and operators in the problem-solving equations generated by the decoders in the student network, k2 indexes a particular number or operator in the loss function of the student network, and p2(·) is the distribution corresponding to the class vector output by the student network;
B2, calculating the cross-entropy loss L_KD(θ_S; θ_T) between the student network output and the teacher network output:

L_KD(θ_S; θ_T) = -∑_{k=1}^{V} q(y = k | x; θ_T) · log p(y = k | x; θ_S)

wherein q(y = k | x; θ_T) takes the value at the k-th position of the soft label vector output by the teacher network, p(y = k | x; θ_S) is the k-th position of the class-vector distribution output by the student network, V is the number of numbers and operators in the problem-solving equations generated by both the teacher network and the student network, and k indexes a particular number or operator;
B3, based on the cross-entropy loss L_KD(θ_S; θ_T) and the loss L_NLL(θ_S), determining the teacher-to-student loss L_TS,i(θ_S, θ_T) corresponding to the i-th decoder in the student network:

L_TS,i(θ_S, θ_T) = (1 - α)·L_NLL(θ_S) + α·L_KD(θ_S; θ_T)

In the formula, α is an interpolation parameter;
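The interpolated loss L_TS,i = (1-α)·L_NLL + α·L_KD for a single decoding step can be sketched as follows; the distributions p and q and the value α = 0.5 are illustrative assumptions:

```python
import math

def kd_loss(student_dist, teacher_soft):
    """Cross-entropy between the teacher's soft labels q and the student's p."""
    return -sum(q * math.log(p) for q, p in zip(teacher_soft, student_dist))

def nll_loss(student_dist, gold_index):
    """Hard-label negative log-likelihood for one decoding step."""
    return -math.log(student_dist[gold_index])

def teacher_student_loss(student_dist, teacher_soft, gold_index, alpha):
    """L_TS,i = (1 - alpha) * L_NLL + alpha * L_KD for one step of decoder i."""
    return ((1 - alpha) * nll_loss(student_dist, gold_index)
            + alpha * kd_loss(student_dist, teacher_soft))

p = [0.6, 0.3, 0.1]  # student class vector at one step (toy values)
q = [0.7, 0.2, 0.1]  # teacher soft-label vector at the same step
loss = teacher_student_loss(p, q, gold_index=0, alpha=0.5)
```

Setting α = 0 recovers the pure hard-label loss; α = 1 trains purely against the teacher's soft labels.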
B4, obtaining the word-level hidden state representation H of the word set x in the application-problem stem text of a training sample through the encoder structure in the student network;
B5, masking the word-level hidden state representation H to generate the set of hidden-layer vector groups {H1, H2, ..., Hi, ..., HN} corresponding to H, and inputting the hidden-layer vector groups into the decoders of the student network in turn;
wherein i is the index of a decoder in the student network and N is the total number of decoders in the student network;
B6, introducing a diversity regularization term L_div and combining it with the losses L_TS,i(θ_S, θ_T) to obtain the loss function L of the student network training process.
Further, the step B5 specifically comprises:
B5-1, defining a mask rate P_mask;
B5-2, sampling a fraction P_mask of the positions in the word-level hidden state representation H using a Gaussian distribution, and generating a zero matrix Mask_zero of the same shape as H;
B5-3, according to the sampled zero matrix Mask_zero, assigning 1 to the sampled positions to generate the mask matrix Mask_p;
B5-4, determining the set of hidden vector groups {H1, H2, ..., Hi, ..., HN} corresponding to the word-level hidden state representation H by H_i = Mask_p ⊙ H;
wherein ⊙ is the element-wise matrix multiplication operator;
B5-5, inputting each hidden vector group H_i in {H1, H2, ..., Hi, ..., HN} into the corresponding decoder.
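A sketch of the masking procedure B5-1 to B5-5, with two simplifications that are assumptions of this example rather than the patent's exact procedure: positions are sampled uniformly instead of via a Gaussian, and the mask is read as zeroing a fraction P_mask of the positions (keeping the rest), which matches the stated goal of perturbing each decoder's input differently:

```python
import random

def make_masked_inputs(H, n_decoders, p_mask, seed=0):
    """Build one perturbed copy of H per decoder head (sketch of step B5).

    H is a list of hidden-state vectors, one per word. For each decoder a
    0/1 mask is sampled that zeroes a fraction p_mask of the word positions,
    and H_i = Mask_p (element-wise product) H.
    """
    rng = random.Random(seed)
    masked = []
    for _ in range(n_decoders):
        n_drop = int(round(p_mask * len(H)))
        drop = set(rng.sample(range(len(H)), n_drop))  # positions to zero
        masked.append([[0.0] * len(h) if m in drop else list(h)
                       for m, h in enumerate(H)])
    return masked

H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
groups = make_masked_inputs(H, n_decoders=3, p_mask=0.25)
# with p_mask = 0.25 each group zeroes exactly one of the four word positions
```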
Further, in the step B6, the introduced diversity regularization term L_div is:

L_div = ∑_{i ≠ i1} ∑_{t=1}^{T} L_div,t

where i and i1 are the indices of two different decoders, t indexes a symbol or number in the solution equation, T is the length of the generated sequence, L_div,t is the loss function measuring the similarity of the solution equations, L_div,t = 1 + S_COS(y_{i,t}, y_{i1,t}), S_COS(·,·) is the cosine similarity of two decoder outputs, and y_{i,t} and y_{i1,t} are the outputs of the i-th and i1-th decoders of the student network at position t.
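The diversity term can be sketched as below; the decoder outputs are toy 2-dimensional vectors (an illustrative assumption), and each decoder pair is summed once (i < i1) rather than over ordered pairs:

```python
import math

def cosine(u, v):
    """Cosine similarity S_COS of two output vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def diversity_loss(decoder_outputs):
    """L_div: sum of 1 + S_COS(y_{i,t}, y_{i1,t}) over decoder pairs and steps.

    `decoder_outputs[i][t]` is the output vector of decoder i at step t.
    Identical outputs contribute 1 + 1 = 2 per term (maximally penalized);
    opposite outputs contribute 1 - 1 = 0.
    """
    loss = 0.0
    n = len(decoder_outputs)
    for i in range(n):
        for i1 in range(i + 1, n):
            for y_it, y_i1t in zip(decoder_outputs[i], decoder_outputs[i1]):
                loss += 1.0 + cosine(y_it, y_i1t)
    return loss

# Two decoders, one step each: identical outputs are penalized, opposed are not
same = diversity_loss([[[1.0, 0.0]], [[1.0, 0.0]]])
diff = diversity_loss([[[1.0, 0.0]], [[-1.0, 0.0]]])
```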
Further, the loss function L of the student network training process in step B6 is:

L = L_TS,N(θ_S, θ_T) + β·L_div, with L_TS,N(θ_S, θ_T) = ∑_{i=1}^{N} L_TS,i(θ_S, θ_T)

where β is the weight of the regularization term and L_TS,N is the teacher-to-student loss summed over all decoders in the student network.
The invention has the beneficial effects that:
(1) The method first addresses a defect of the training objective in automatic application-problem solving systems: many application problems have multiple solutions, and even the same solution written in different mathematical forms (e.g. via the commutative law, associative law, or distributive law) carries a different label. Existing automatic solving systems, guided by a single solution, penalize all other solutions during training, which harms solving accuracy; what such labels measure is the exactness of one solution form rather than the correctness of the answer.
(2) The method corrects the training objective using a teacher-student network: since existing labels lack multiple solutions, the training objective must be changed by other means. Based on the observation that existing automatic solving systems already generate some solutions different from the labels, the class vectors these systems produce when predicting the number or symbol at each position actually contain partial multi-solution information. The teacher-student network structure exploits this information to help improve system performance.
(3) The method further enhances the diversity of the model's predictions with a multi-head decoder structure. The multi-head decoder can generate multiple solutions, and the system further increases the diversity of the generated results by perturbing the initialization vectors and applying a diversity regularization term, allowing the model to explore more possibilities. Notably, the final output contains only one solution: the system selects the best solution according to the confidence of the outputs.
Drawings
FIG. 1 is a flow chart of the method for automatically solving the problem of the application problem based on the teacher student network and the multi-head decoder.
FIG. 2 is a schematic diagram of teacher student network problem solving provided by the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, various changes are possible within the spirit and scope of the invention as defined by the appended claims, and all matter produced using the inventive concept falls under protection.
Example 1:
As shown in FIGS. 1-2, the automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder comprises the following steps:
S1, constructing a sequence-to-tree encoder-decoder model with only one decoder, and using it as the teacher network;
S2, training the teacher network on the training samples, taking the labels of the training samples as hard label vectors, and taking the class vectors output by the trained teacher network for the training samples as soft label vectors;
S3, constructing a sequence-to-tree encoder-decoder model with a multi-head tree-structured decoder, and using it as the student network;
S4, constructing a supervision signal from the hard label vectors and the soft label vectors jointly, and training the student network on the training samples with the constructed supervision signal;
S5, inputting the application problem to be solved into the trained student network, generating a plurality of problem-solving equations with the multi-head tree-structured decoder of the student network, and determining the corresponding confidences;
S6, selecting the problem-solving equation with the highest confidence, and computing the corresponding answer from that equation to complete the automatic solving.
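Once a problem-solving equation has been selected in step S6, its answer can be computed by evaluating the prefix-ordered token sequence produced by the tree-structured decoder. A minimal evaluator (the operator set is an illustrative assumption):

```python
def eval_prefix(tokens):
    """Evaluate a solution equation given in prefix (pre-order) form.

    The tree-structured decoder emits the equation tree in prefix order
    (operator before its operands), so the answer can be recovered with a
    single right-to-left pass over the tokens using a stack.
    """
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b}
    stack = []
    for tok in reversed(tokens):
        if tok in ops:
            a = stack.pop()  # left operand (was pushed later)
            b = stack.pop()  # right operand
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    return stack.pop()

# "3 + 5 * 2" written in prefix order: + 3 * 5 2
answer = eval_prefix(["+", "3", "*", "5", "2"])  # 13.0
```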
In step S1 of this embodiment, the teacher network is regarded as a function mapping the training sample x to the label y; the parameters of the teacher network are denoted θ_T, so the teacher network is written f(x, θ_T). The training samples of the teacher network are application-problem stems and their corresponding solutions; the supervision signal of the training process comes from the labels of the training samples, and the label y in this embodiment is a 0-1 distributed label vector, i.e. a hard label vector.
In step S2 of this embodiment, the method for training the teacher network on the training samples specifically comprises:
A1, obtaining the word-level hidden state representation H of the word set X in the application-problem stem text through the encoder structure in the teacher network;
wherein the word set X = {x1, ..., xm, ..., xM} and the word-level hidden state representation H = {h1, ..., hm, ..., hM}; the word-level hidden state representation H is the real-valued vector sequence corresponding to the stem text, xm is the m-th word in the stem text, and hm is the m-th element of the word-level hidden state representation;
a2, inputting the word-level hidden state representation H into a tree-structure-based decoder in a teacher network, and outputting a category vector at each moment;
When the word-level hidden state representation H is input into the tree-structured decoder of the teacher network, the decoder outputs a class vector (i.e. an output probability distribution) at each step, following the order of the prefix expression; the value at each position of the vector expresses the probability of generating a particular number or operator at that step;
A3, determining the loss function of the teacher network based on the output class vectors and the label y of the training sample (a 0-1 distributed label vector: the number or operator labeled at each step is set to 1 and all others to 0);
A4, training the teacher network on the training samples, using the labels of the training samples as the supervision signal, based on the loss function of the teacher network.
The encoder structure in step A1 is a bidirectional recurrent neural network or a unidirectional recurrent neural network; this embodiment uses a bidirectional recurrent neural network, which extracts the features of each word one by one in the order the words appear in the stem description, so that the influence of both preceding and following context on the semantics can be taken into account.
In the step A3, the loss function L_NLL(θ_T) of the teacher network is:

L_NLL(θ_T) = -∑_{k1=1}^{V1} 1{y = k1} · log p1(y = k1 | x; θ_T)

wherein 1{·} is the indicator function, V1 is the number of numbers and operators in the problem-solving equations generated by the decoder in the teacher network, k1 indexes a particular number or operator in the loss function of the teacher network, and p1(·) is the distribution corresponding to the class vector output by the teacher network.
The encoder of the student network constructed in step S3 of this embodiment has the same structure as the encoder of the teacher network; the biggest difference from the teacher network is that the student network comprises several tree-structured decoders with independent parameters. To make the outputs of different decoders as different as possible, a diversity regularization term is added at the output of each decoder during training, and different noise is added to the input of each decoder, so that the diversity of the multi-head decoder outputs is enhanced by varying the inputs.
In step S4 of this embodiment, the input and output when training the student network are x and y respectively, and the aim is to train the student network parameters θ_S such that f(x, θ_S): x → y. Based on this, in step S4, the method for determining the loss function in the student network training process comprises:
B1, determining the loss L_NLL(θ_S) of the student network based on the hard label vector:

L_NLL(θ_S) = -∑_{k2=1}^{V2} 1{y = k2} · log p2(y = k2 | x; θ_S)

In the formula, θ_S are the parameters of the student network, 1{·} is the indicator function, V2 is the number of numbers and operators in the problem-solving equations generated by the decoders in the student network, k2 indexes a particular number or operator in the loss function of the student network, and p2(·) is the distribution corresponding to the class vector output by the student network;
B2, calculating the cross-entropy loss L_KD(θ_S; θ_T) between the student network output and the teacher network output:

L_KD(θ_S; θ_T) = -∑_{k=1}^{V} q(y = k | x; θ_T) · log p(y = k | x; θ_S)

wherein q(y = k | x; θ_T) takes the value at the k-th position of the soft label vector output by the teacher network, p(y = k | x; θ_S) is the k-th position of the class-vector distribution output by the student network, V is the number of numbers and operators in the problem-solving equations generated by both the teacher network and the student network, and k indexes a particular number or operator;
B3, based on the cross-entropy loss L_KD(θ_S; θ_T) and the loss L_NLL(θ_S), determining the teacher-to-student loss L_TS,i(θ_S, θ_T) corresponding to the i-th decoder in the student network:

L_TS,i(θ_S, θ_T) = (1 - α)·L_NLL(θ_S) + α·L_KD(θ_S; θ_T)

In the formula, α is an interpolation parameter;
On the basis of the loss L_TS,i(θ_S, θ_T), the generated results are expected to be more diversified by feeding different hidden vectors to different decoders in the student network. Specifically, besides passing the word-level hidden state representation H obtained directly from the encoder to one decoder head, perturbations are added to the inputs of the other decoders; the perturbed inputs are obtained through the following steps B4-B5:
B4, obtaining the word-level hidden state representation H of the word set x in the application-problem stem text of a training sample through the encoder structure in the student network;
B5, masking the word-level hidden state representation H to generate the set of hidden-layer vector groups {H1, H2, ..., Hi, ..., HN} corresponding to H, and inputting the hidden-layer vector groups into the decoders of the student network in turn;
wherein i is the index of a decoder in the student network and N is the total number of decoders in the student network;
B6, introducing a diversity regularization term L_div and combining it with the losses L_TS,i(θ_S, θ_T) to obtain the loss function L of the student network training process.
The step B5 specifically comprises:
B5-1, defining a mask rate P_mask;
B5-2, sampling a fraction P_mask of the positions in the word-level hidden state representation H using a Gaussian distribution, and generating a zero matrix Mask_zero of the same shape as H;
B5-3, according to the sampled zero matrix Mask_zero, assigning 1 to the sampled positions to generate the mask matrix Mask_p;
B5-4, determining the set of hidden vector groups {H1, H2, ..., Hi, ..., HN} corresponding to the word-level hidden state representation H by H_i = Mask_p ⊙ H;
wherein ⊙ is the element-wise matrix multiplication operator;
B5-5, inputting each hidden vector group H_i in {H1, H2, ..., Hi, ..., HN} into the corresponding decoder.
In this embodiment, step B6, to encourage different decoders to generate different results, we introduce a diversification regularization term LdivIntroduced diversified regularization term LdivComprises the following steps:
where i, i1 is the number of two different decoders, T is the sign or value number in the solution equation, T is the length of the generated sequence, Ldiv,tTo calculate the loss function of the similarity of the solution equations, and Ldiv,t=1+SCOS(yi,t,yi1,t),SCOS(. is the cosine similarity of any two decoder outputs, yi,t,yi1,tThe output of the ith decoder and the ith 1 decoder in the student network at the position t respectively.
More specifically, we use cosine similarity to measure the difference between the outputs of different decoders, and we aim to promote the diversity of the solution equations generated, and we do not have to perform a bundle search for any two decoder outputs, if the difference between them is too large, so we can get the loss function L of the student network training process in step B6 as:
where β is the weight of the regularization term, LTS,NLoss of the teacher network to the student network corresponding to all decoders in the student network.
Example 2:
This embodiment provides a comparison of the proposed method with existing solving methods on two commonly used data sets:
The two data sets are MAWPS, with 2373 problems, and Math23K, with 23162 problems. For the Math23K data set, some methods were evaluated on the publicly available train/test split (denoted "Math23K"), while others were evaluated by 5-fold cross-validation (denoted "Math23K*"). For the MAWPS data set, models were evaluated by 5-fold cross-validation. Following previous work, solution accuracy is used as the evaluation metric. As shown in Table 1 (each entry is the accuracy of a model on the test set; larger is better), the method achieves better results than the existing GROUPATT, Math-EN, and DNS methods.
Table 1: Comparison of the proposed method with existing methods (solution accuracy, %)

|                 | MAWPS | Math23K | Math23K* |
| DNS             | 59.5  | -       | 58.1     |
| Math-EN         | 69.2  | 66.9    | -        |
| GROUPATT        | 76.1  | 69.5    | 66.9     |
| Proposed method | 84.4  | 77.4    | 75.1     |
The invention has the beneficial effects that:
(1) the method firstly considers the defect of the training target in the automatic solving application problem system, a plurality of application problems can have a plurality of solutions, and even if the same writing method is adopted and different mathematical forms (such as an exchange law, a combination law, a distribution rate and the like) are used, the labels of the solutions are different. Therefore, the existing automatic problem solving system takes a single solution as guidance, penalizes training targets of all other solutions and is harmful to improving the accuracy of the problem solving, and therefore, the measured label is more the accuracy of the solution rather than the accuracy of the answer.
(2) The method corrects the training target by using a teacher-student network: in the case of the existing label lacking multiple solutions, we need to change the training target by other methods. Based on the observation that the existing automatic problem solving systems generate a part of solutions different from labels, the category vectors of numbers or symbols for predicting each position generated by the systems in the generation process actually contain the information of partial multi-solutions. Through the teacher student network structure, the teacher student network structure utilizes the information to help improve the performance of the system.
(3) The method further enhances the diversity of model predictions with a multi-head decoder structure. The multi-head decoder can generate multiple solutions, and the system further diversifies the generated results by perturbing the initialization vectors and applying a diversification regularization term, allowing the model to explore more possibilities. Notably, the final output contains only one solution: the system selects the best solution according to the confidence of each output.
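As a loose illustration (not part of the claims), the selection step described above can be sketched as follows. The `candidates` structure and the use of mean token log-probability as the confidence score are assumptions; the patent only states that the best solution is chosen by the confidence of the output:

```python
def select_best_solution(candidates):
    """Pick the solving equation with the highest confidence.

    `candidates` is a list of (equation_tokens, token_log_probs) pairs
    produced by the decoder heads; the confidence is taken here as the
    mean token log-probability -- an assumption, since the patent does
    not fix how the confidence is computed.
    """
    def confidence(pair):
        _, log_probs = pair
        return sum(log_probs) / len(log_probs)
    return max(candidates, key=confidence)[0]

# Hypothetical outputs of three decoder heads for one word problem.
candidates = [
    (["x", "=", "3", "+", "5"], [-0.2, -0.1, -0.4, -0.3, -0.2]),
    (["x", "=", "5", "+", "3"], [-0.1, -0.1, -0.2, -0.1, -0.1]),
    (["x", "=", "3", "*", "5"], [-0.9, -0.8, -1.2, -1.5, -1.0]),
]
best = select_best_solution(candidates)  # the second head has the highest mean
```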
Claims (10)
1. An automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder is characterized by comprising the following steps of:
s1, constructing a sequence-to-tree encoder-decoder model with a single tree-structured decoder, and using it as the teacher network;
s2, training the teacher network on the training samples, taking the labels of the training samples as hard label vectors, and taking the class vectors output by the trained teacher network for the training samples as soft label vectors;
s3, constructing a sequence-to-tree encoder-decoder model with a multi-head tree-structured decoder, and using it as the student network;
s4, simultaneously constructing a supervision signal based on the hard label vector and the soft label vector, and training the student network by using a training sample based on the constructed supervision signal;
s5, inputting the application questions to be solved into the trained student network, generating a plurality of problem solving equations by using a multi-head tree structure decoder of the student network, and determining corresponding confidence coefficients;
s6, selecting the solving equation corresponding to the highest confidence coefficient, and solving the corresponding answer according to the solving equation to complete the automatic solving.
2. The method of claim 1, wherein in step S1, the teacher network is modeled as a function f(x, θ_T) for a training sample x with label y, where θ_T denotes the parameters of the teacher network;
the training sample consists of an application question stem and its corresponding solving scheme; the label y is a label vector with values distributed in {0, 1}.
3. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 2, wherein in said step S2, the teacher network is trained on the training samples specifically as follows:
a1, obtaining a word-level hidden state representation H of a character word set X in an application topic stem text through an encoder structure in a teacher network;
a2, inputting the word-level hidden state representation H into a tree-structure-based decoder in a teacher network, and outputting a category vector at each moment;
a3, determining a loss function of the teacher network based on the output class vector and the label y of the training sample;
A4, based on the loss function of the teacher network, training the teacher network with the training samples, using the labels of the training samples as the supervision signal of the training process.
4. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 3, wherein said encoder structure in step A1 is a bidirectional recurrent neural network.
5. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 3, wherein in said step A3, the loss function L_NLL(θ_T) of the teacher network is:

L_NLL(θ_T) = −Σ_t Σ_{k1∈V1} 1{y_t = k1} log p1(y_t = k1 | x; θ_T)

wherein 1{·} is the indicator function, V1 is the vocabulary of numbers and operators in the problem-solving equations generated by the decoder in the teacher network, k1 is a particular number or operator in V1, and p1(·) is the distribution corresponding to the class vector output by the teacher network.
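For illustration only (not part of the claims), the negative log-likelihood loss of claim 5 can be sketched in numpy; the array shapes are assumptions, since the patent does not specify them:

```python
import numpy as np

def teacher_nll_loss(probs, labels):
    """Negative log-likelihood loss of the teacher network (claim 5 sketch).

    probs  -- (T, V1) array: the class-vector distribution p1 output by the
              tree-structured decoder at each of T time steps over the V1
              candidate numbers/operators.
    labels -- length-T integer array of gold indices; the indicator
              1{y_t = k1} selects probs[t, labels[t]] from each row.
    """
    t_idx = np.arange(len(labels))
    return -np.sum(np.log(probs[t_idx, labels]))

# Toy check: a uniform distribution over 4 symbols for 2 time steps.
probs = np.full((2, 4), 0.25)
loss = teacher_nll_loss(probs, np.array([0, 1]))  # = 2 * log(4)
```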
6. The method of claim 3, wherein the encoder of the student network constructed in step S3 has the same structure as the encoder of the teacher network; the student network comprises a plurality of tree-structured decoders with independent parameters; a diversification regularization term is added to the outputs of the decoders; and different noise is added to the input of each decoder during training of the student network.
7. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 6, wherein in said step S4, the loss function of the student network training process is determined as follows:
b1, determining the hard-label loss L_NLL(θ_S) of the student network:

L_NLL(θ_S) = −Σ_t Σ_{k2∈V2} 1{y_t = k2} log p2(y_t = k2 | x; θ_S)

in the formula, θ_S denotes the parameters of the student network, 1{·} is the indicator function, V2 is the vocabulary of numbers and operators in the problem-solving equations generated by the decoders in the student network, k2 is a particular number or operator in V2, and p2(·) is the distribution corresponding to the class vector output by the student network;
b2, calculating the cross-entropy loss L_KD(θ_S; θ_T) between the student network output and the teacher network output:

L_KD(θ_S; θ_T) = −Σ_t Σ_{k∈V} q(y_t = k | x; θ_T) log p(y_t = k | x; θ_S)

wherein q(y = k | x; θ_T) takes out the value at the k-th position of the soft label vector output by the teacher network, p(y = k | x; θ_S) is the distribution at the k-th position of the class vector output by the student network, V is the shared vocabulary of numbers and operators in the problem-solving equations generated by the teacher network and the student network, and k is a particular number or operator;
b3, based on the cross-entropy loss L_KD(θ_S; θ_T) and the loss L_NLL(θ_S), determining the teacher-to-student loss L_TS,i(θ_S, θ_T) corresponding to the i-th decoder in the student network:

L_TS,i(θ_S, θ_T) = (1 − α) L_NLL(θ_S) + α L_KD(θ_S; θ_T)

in the formula, α is an interpolation parameter;
b4, obtaining a word-level hidden state representation H of a character word set x in an application question stem text in a training sample through an encoder structure in a student network;
b5, generating a set of hidden vector groups {H_1, H_2, ..., H_i, ..., H_N} corresponding to the word-level hidden state representation H by masking H, and sequentially inputting each hidden vector group into the corresponding decoder of the student network;
wherein i is the index of a decoder in the student network, and N is the total number of decoders in the student network;
b6, introducing a diversification regularization term L_div and combining it with the losses L_TS,i(θ_S, θ_T) to obtain the loss function L of the student network training process.
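Steps B1 through B3 above can be sketched together in numpy for illustration (not part of the claims); the array shapes are assumptions:

```python
import numpy as np

def student_head_loss(student_probs, teacher_probs, labels, alpha=0.5):
    """Per-head loss L_TS,i of steps B1-B3 in claim 7 (a numpy sketch).

    student_probs -- (T, V) distributions p2 output by one student decoder;
    teacher_probs -- (T, V) soft labels q output by the trained teacher;
    labels        -- length-T integer gold indices (the hard label);
    alpha         -- the interpolation parameter of step B3.
    """
    t_idx = np.arange(len(labels))
    # B1: hard-label negative log-likelihood L_NLL(theta_S).
    l_nll = -np.sum(np.log(student_probs[t_idx, labels]))
    # B2: cross entropy L_KD between teacher soft labels and student output.
    l_kd = -np.sum(teacher_probs * np.log(student_probs))
    # B3: interpolate the two supervision signals.
    return (1.0 - alpha) * l_nll + alpha * l_kd

# Toy check: one time step, two symbols, one-hot teacher, uniform student.
loss = student_head_loss(np.array([[0.5, 0.5]]),
                         np.array([[1.0, 0.0]]),
                         np.array([0]), alpha=0.3)
```

In this toy case both terms equal log(2), so the interpolated loss is log(2) regardless of alpha.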
8. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 7, wherein said step B5 specifically comprises:
b5-1, defining a mask rate P_mask;
b5-2, sampling, by using a Gaussian distribution, the positions accounting for a proportion P_mask of the word-level hidden state representation H, and generating a zero matrix Mask_zero of the same shape as H;
b5-3, assigning 1 to the sampled positions of the zero matrix Mask_zero generated by sampling, so as to generate the mask matrix Mask_p;
b5-4, determining, by H_i = Mask_p ⊙ H, the set of hidden vector groups {H_1, H_2, ..., H_i, ..., H_N} corresponding to the word-level hidden state representation H;
wherein ⊙ is the element-wise (Hadamard) matrix multiplication operator;
b5-5, inputting each hidden vector group H_i in {H_1, H_2, ..., H_i, ..., H_N} into its corresponding decoder.
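The masking procedure of claim 8 can be sketched in numpy for illustration (not part of the claims). The shapes, the Gaussian-score selection rule, and the choice to zero out a P_mask fraction of positions per head are all assumptions layered on the claim's Mask_zero/Mask_p construction:

```python
import numpy as np

def masked_hidden_groups(H, n_decoders, p_mask, seed=None):
    """Steps B5-1..B5-5 as a numpy sketch (shapes and details are assumed).

    For each of the n_decoders heads, a fraction p_mask of the positions of
    the word-level hidden state H (shape: seq_len x dim) is selected via
    Gaussian scores and zeroed out, so every head receives a differently
    perturbed view H_i = Mask_p * H (element-wise product).
    """
    rng = np.random.default_rng(seed)
    seq_len = H.shape[0]
    n_masked = int(round(p_mask * seq_len))
    groups = []
    for _ in range(n_decoders):
        # B5-2: score each position with Gaussian noise and take the
        # p_mask fraction with the lowest scores as the masked positions.
        scores = rng.standard_normal(seq_len)
        masked_pos = np.argsort(scores)[:n_masked]
        # B5-3: build the 0/1 mask matrix Mask_p.
        mask_p = np.ones_like(H)
        mask_p[masked_pos] = 0.0
        # B5-4: element-wise (Hadamard) product Mask_p ⊙ H.
        groups.append(mask_p * H)
    return groups

H = np.ones((10, 4))
groups = masked_hidden_groups(H, n_decoders=3, p_mask=0.2, seed=0)
```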
9. The method for automatically solving application problems based on a teacher-student network and multi-head decoder as claimed in claim 7, wherein in said step B6, the introduced diversification regularization term L_div is:

L_div = Σ_{i≠i1} Σ_{t=1}^{T} L_div,t

where i and i1 are the indices of two different decoders, t is the position of a symbol or number in the solving equation, T is the length of the generated sequence, L_div,t is the loss function measuring the similarity of the solving equations, and L_div,t = 1 + S_COS(y_{i,t}, y_{i1,t}), where S_COS(·,·) is the cosine similarity of any two decoder outputs, and y_{i,t}, y_{i1,t} are respectively the outputs of the i-th decoder and the i1-th decoder in the student network at position t.
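The regularization term of claim 9 can be sketched in numpy for illustration (not part of the claims); the array shape and the counting of each unordered decoder pair once are assumptions:

```python
import numpy as np

def diversity_regularizer(head_outputs):
    """Diversification term L_div of claim 9 (numpy sketch).

    head_outputs -- (N, T, d) array: output vector y_{i,t} of each of the
    N decoder heads at each of the T positions of the generated sequence.
    For every pair of distinct heads (i, i1) and every position t the term
    L_div,t = 1 + cos(y_{i,t}, y_{i1,t}) is accumulated (each unordered
    pair counted once here), so identical heads incur the largest penalty
    and diametrically opposed heads incur none.
    """
    N, T, _ = head_outputs.shape
    total = 0.0
    for i in range(N):
        for i1 in range(i + 1, N):
            for t in range(T):
                a, b = head_outputs[i, t], head_outputs[i1, t]
                cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
                total += 1.0 + cos
    return total

# Two identical heads are maximally penalized; opposed heads are not.
same = diversity_regularizer(np.array([[[1.0, 0.0]], [[1.0, 0.0]]]))
opposite = diversity_regularizer(np.array([[[1.0, 0.0]], [[-1.0, 0.0]]]))
```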
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010402148.3A CN111553821B (en) | 2020-05-13 | 2020-05-13 | Automatic problem solving method for application problems based on teacher-student network and multi-head decoder |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111553821A true CN111553821A (en) | 2020-08-18 |
CN111553821B CN111553821B (en) | 2021-04-27 |
Family
ID=72004626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010402148.3A Active CN111553821B (en) | 2020-05-13 | 2020-05-13 | Automatic problem solving method for application problems based on teacher-student network and multi-head decoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111553821B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112836801A (en) * | 2021-02-03 | 2021-05-25 | 上海商汤智能科技有限公司 | Deep learning network determination method and device, electronic equipment and storage medium |
CN117521812A (en) * | 2023-11-20 | 2024-02-06 | 华中师范大学 | Automatic arithmetic text question solving method and system based on variational knowledge distillation |
CN117521812B (en) * | 2023-11-20 | 2024-06-07 | 华中师范大学 | Automatic arithmetic text question solving method and system based on variational knowledge distillation |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180336465A1 (en) * | 2017-05-18 | 2018-11-22 | Samsung Electronics Co., Ltd. | Apparatus and method for student-teacher transfer learning network using knowledge bridge |
US20180365564A1 (en) * | 2017-06-15 | 2018-12-20 | TuSimple | Method and device for training neural network |
CN110428010A (en) * | 2019-08-05 | 2019-11-08 | 中国科学技术大学 | Knowledge method for tracing |
CN110739003A (en) * | 2019-10-23 | 2020-01-31 | 北京计算机技术及应用研究所 | Voice enhancement method based on multi-head self-attention mechanism |
Non-Patent Citations (1)
Title |
---|
QIAN GUO ET AL.: ""MS-Pointer Network: Abstractive Text Summary"", 《IEEE ACCESS》 * |
Also Published As
Publication number | Publication date |
---|---|
CN111553821B (en) | 2021-04-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition | |
CN107967318A (en) | A kind of Chinese short text subjective item automatic scoring method and system using LSTM neutral nets | |
CN110134946B (en) | Machine reading understanding method for complex data | |
Xue et al. | A hierarchical BERT-based transfer learning approach for multi-dimensional essay scoring | |
CN113673254B (en) | Knowledge distillation position detection method based on similarity maintenance | |
CN110781681B (en) | Automatic first-class mathematic application problem solving method and system based on translation model | |
CN113486645A (en) | Text similarity detection method based on deep learning | |
Shakeel et al. | A multi-cascaded deep model for bilingual sms classification | |
Lin et al. | Automated prediction of item difficulty in reading comprehension using long short-term memory | |
CN114153942B (en) | Event time sequence relation extraction method based on dynamic attention mechanism | |
CN111553821B (en) | Automatic problem solving method for application problems based on teacher-student network and multi-head decoder | |
Ye et al. | Machine learning techniques to automate scoring of constructed-response type assessments | |
Zhong | [Retracted] Evaluation of Traditional Culture Teaching Efficiency by Course Ideological and Political Integration Lightweight Deep Learning | |
CN116521872B (en) | Combined recognition method and system for cognition and emotion and electronic equipment | |
Song | [Retracted] An Evaluation Method of English Teaching Ability Based on Deep Learning | |
CN110969010A (en) | Problem generation method based on relationship guidance and dual-channel interaction mechanism | |
CN114579706B (en) | Automatic subjective question review method based on BERT neural network and multi-task learning | |
CN115935969A (en) | Heterogeneous data feature extraction method based on multi-mode information fusion | |
CN114692615A (en) | Small sample semantic graph recognition method for small languages | |
CN114970557A (en) | Knowledge enhancement-based cross-language structured emotion analysis method | |
CN114840679A (en) | Robot intelligent learning guiding method based on music theory knowledge graph reasoning and application | |
CN115617959A (en) | Question answering method and device | |
Wang et al. | Teacher Talk Moves in K12 Mathematics Lessons: Automatic Identification, Prediction Explanation, and Characteristic Exploration | |
Li et al. | A Multimodal Machine Learning Framework for Teacher Vocal Delivery Evaluation | |
Li et al. | Automated essay scoring incorporating multi-level semantic features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||