CN111553821A - Automatic problem solving method for application problems based on teacher-student network and multi-head decoder - Google Patents

Automatic problem solving method for application problems based on teacher-student network and multi-head decoder

Info

Publication number
CN111553821A
CN111553821A
Authority
CN
China
Prior art keywords
network
teacher
student
decoder
student network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010402148.3A
Other languages
Chinese (zh)
Other versions
CN111553821B (en)
Inventor
张骥鹏
邵杰
王磊
徐行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010402148.3A priority Critical patent/CN111553821B/en
Publication of CN111553821A publication Critical patent/CN111553821A/en
Application granted granted Critical
Publication of CN111553821B publication Critical patent/CN111553821B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Educational Technology (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses an automatic problem solving method for application problems (math word problems) based on a teacher-student network and a multi-head decoder. First, a sequence-to-tree encoder-decoder model with a single decoder is constructed and trained as a teacher network. Then, an encoder-decoder network, likewise based on the sequence-to-tree structure but augmented with several tree-structured decoders, is constructed, yielding a student network with a multi-head tree-structured decoder. The soft label vectors output by the teacher and the 0-1 distributed label vectors provided in the original training samples, i.e., the hard label vectors, are then used together as the supervision signal to train the student network. During testing, the solution with the highest confidence among the multiple solutions generated by the multi-head decoder is selected as the output of the model. By exploiting the teacher model's capability of generating problem-solving equations that differ from the labels, assisted by the multi-head decoder structure, the invention achieves a better problem-solving effect.

Description

Automatic problem solving method for application problems based on teacher-student network and multi-head decoder
Technical Field
The invention relates to the technical field of computational linguistics, in particular to an automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder.
Background
Math word problem solving, i.e., automatically answering a mathematical question posed as a text description, has attracted researchers' attention since the 1960s and is an important natural language understanding task. A typical math word problem gives a short textual description of a situation and asks for one or more unknown quantities. Earlier research designed automatic solvers using statistical machine learning and semantic parsing methods, but these methods generalized poorly because they required a great deal of manual effort to design appropriate features and expression templates.
In recent years, automatic solvers based on deep learning have appeared. These methods learn feature representations automatically, can generate new solution expressions that do not exist in the training data set, and achieve high performance on large-scale, complex data sets. The Deep Neural Solver (DNS), proposed in 2017, was the first of these methods; alongside the model, its authors collected a large-scale math word problem (MWP) data set for evaluating automatic solvers. Since then, much research has focused on improving deep-learning-based automatic solvers. On the one hand, the more representative improvements are the group attention model (GROUPATT) and the expression normalization method (Math-EN), which focus on improving the intermediate attention process and the normalization of the model's output expressions, respectively. On the other hand, improving how quantity representations are obtained and generated is another promising route to better solution expressions. What is still missing, however, is a method that exploits the multi-solution nature of mathematical problems to enhance model performance: existing data sets provide only one specific solution per problem, so a solver that generates a correct but unannotated solution is wrongly penalized, which lowers the accuracy of the model's generated results. Since the measured answer accuracy is higher than the equation accuracy (some generated equations are correct even though they do not match the label), problem-solving models can be improved from this angle.
Disclosure of Invention
Aiming at the above defects in the prior art, the present invention provides an automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder, which addresses the inability of existing deep learning models to account for correct solutions that differ from the labels.
In order to achieve the purpose of the invention, the following technical scheme is adopted: an automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder, comprising the following steps:
S1, constructing a sequence-to-tree encoder-decoder model with only one decoder, and using it as the teacher network;
S2, training the teacher network on the training samples, taking the labels of the training samples as hard label vectors, and taking the class vectors that the trained teacher network outputs for the training samples as soft label vectors;
S3, constructing a sequence-to-tree encoder-decoder model with a multi-head tree-structured decoder, and using it as the student network;
S4, constructing a supervision signal based on both the hard label vectors and the soft label vectors, and training the student network on the training samples with the constructed supervision signal;
S5, inputting the application problem to be solved into the trained student network, generating several problem-solving equations with the multi-head tree-structured decoder of the student network, and determining the corresponding confidences;
S6, selecting the problem-solving equation with the highest confidence, and computing the corresponding answer from that equation, thereby completing the automatic solving.
Further, in step S1, the teacher network is regarded as a function f(x, θ_T) that maps a training sample x to its label y, where θ_T denotes the parameters of the teacher network;
the training samples are application problem stems together with their corresponding solutions; the label y is a 0-1 distributed label vector.
Further, in step S2, the method for training the teacher network through the training samples specifically includes:
A1, obtaining the word-level hidden state representation H of the word set X of the application problem stem text through the encoder structure of the teacher network;
A2, inputting the word-level hidden state representation H into the tree-structured decoder of the teacher network, which outputs a class vector at each time step;
A3, determining the loss function of the teacher network based on the output class vectors and the label y of the training sample;
A4, training the teacher network on the training samples based on this loss function, using the labels of the training samples as the supervision signal of the training process.
Further, the encoder structure in step A1 is a bidirectional recurrent neural network.
Further, in step A3, the loss function L_NLL(θ_T) of the teacher network is:

L_NLL(θ_T) = - Σ_t Σ_{k1=1..V1} 1{y_t = k1} log p1(y_t = k1 | x; θ_T)

where 1{·} is the indicator function, V1 is the number of numbers and operators in the problem-solving equations generated by the decoder of the teacher network, k1 indexes a particular number or operator in the loss function of the teacher network, and p1(·) is the distribution corresponding to the class vector output by the teacher network.
Further, the encoder structure of the student network constructed in step S3 is the same as that of the teacher network. The student network contains several tree-structured decoders with independent parameters; during the training of the student network, a diversification regularization term is added at the output of each decoder, and different noise is added at the input of each decoder.
Further, in step S4, the method for determining the loss function in the student network training process includes:
B1, determining the loss L_NLL(θ_S) of the student network based on the hard label vector:

L_NLL(θ_S) = - Σ_t Σ_{k2=1..V2} 1{y_t = k2} log p2(y_t = k2 | x; θ_S)

where θ_S denotes the parameters of the student network, 1{·} is the indicator function, V2 is the number of numbers and operators in the problem-solving equations generated by the decoders of the student network, k2 indexes a particular number or operator in the loss function of the student network, and p2(·) is the distribution corresponding to the class vector output by the student network;
B2, calculating the cross-entropy loss L_KD(θ_S; θ_T) between the student network output and the teacher network output:

L_KD(θ_S; θ_T) = - Σ_{k=1..V} q(y = k | x; θ_T) log p(y = k | x; θ_S)

where q(y = k | x; θ_T) takes out the value at the k-th position of the soft label vector output by the teacher network, p(y = k | x; θ_S) is the value at the k-th position of the class vector output by the student network, V is the number of numbers and operators in the shared output vocabulary of the teacher and student networks (both generate the same kind of problem-solving equation), and k indexes a particular number or operator;
B3, based on the cross-entropy loss L_KD(θ_S; θ_T) and the loss L_NLL(θ_S), determining the teacher-to-student loss L_TS,i(θ_S, θ_T) corresponding to the i-th decoder of the student network:

L_TS,i(θ_S, θ_T) = (1 - α) L_NLL(θ_S) + α L_KD(θ_S; θ_T)

where α is an interpolation parameter;
B4, obtaining the word-level hidden state representation H of the word set x of the application problem stem text in a training sample through the encoder structure of the student network;
B5, masking the word-level hidden state representation H to generate the set of hidden vector groups {H_1, H_2, ..., H_i, ..., H_N} corresponding to H, and inputting the hidden vector groups into the respective decoders of the student network;
where i is the index of a decoder in the student network, and N is the total number of decoders in the student network;
B6, introducing a diversification regularization term L_div and combining it with the losses L_TS,i(θ_S, θ_T) to obtain the loss function L of the student network training process.
Further, step B5 is specifically:
B5-1, defining the mask rate P_mask;
B5-2, sampling a proportion P_mask of the positions in the word-level hidden state representation H using a Gaussian distribution, and generating a zero matrix Mask_zero of the same shape as H;
B5-3, assigning 1 to the sampled positions of the zero matrix Mask_zero, generating the mask matrix Mask_p;
B5-4, determining the set of hidden vector groups {H_1, H_2, ..., H_i, ..., H_N} corresponding to the word-level hidden state representation H by H_i = Mask_p ⊙ H;
where ⊙ is the element-wise (Hadamard) product;
B5-5, inputting each hidden vector group H_i of {H_1, H_2, ..., H_i, ..., H_N} into its corresponding decoder.
Further, in step B6, the introduced diversification regularization term L_div is:

L_div = Σ_{i ≠ i1} Σ_{t=1..T} L_div,t

where i and i1 are the indices of two different decoders, t indexes a symbol or number position in the problem-solving equation, T is the length of the generated sequence, and L_div,t is the loss measuring the similarity of the generated equations, with L_div,t = 1 + S_COS(y_i,t, y_i1,t), where S_COS(·,·) is the cosine similarity of any two decoder outputs and y_i,t and y_i1,t are the outputs of the i-th and i1-th decoders of the student network at position t.
Further, the loss function L of the student network training process in step B6 is:

L = L_TS,N(θ_S, θ_T) + β L_div

where β is the weight of the regularization term and L_TS,N(θ_S, θ_T) = Σ_{i=1..N} L_TS,i(θ_S, θ_T) is the teacher-to-student loss summed over all decoders of the student network.
The invention has the beneficial effects that:
(1) The method first addresses a defect of the training objective in automatic application-problem solving systems: many application problems admit multiple solutions, and even solutions written the same way but rearranged through different mathematical identities (such as the commutative, associative, or distributive laws) have different labels. Existing automatic solving systems are guided by a single solution and penalize all other solutions, which harms solving accuracy; what such a label measures is agreement with one particular equation rather than the correctness of the answer.
(2) The method corrects the training objective by using a teacher-student network: since the existing labels lack multiple solutions, the training objective must be changed by other means. Based on the observation that existing automatic solving systems already generate some solutions that differ from the labels, the class vectors these systems produce when predicting the number or symbol at each position actually contain partial multi-solution information. Through the teacher-student network structure, the method exploits this information to help improve system performance.
(3) The method further enhances the diversity of the model's predictions by using a multi-head decoder structure. The multi-head decoder can generate a variety of solutions, and the system further enhances the diversity of the generated results by perturbing the initialization vectors and by using a diversification regularization term, allowing the model to explore more possibilities. Notably, the final output contains only one solution: the system selects the best solution according to the confidence of the outputs.
Drawings
FIG. 1 is a flow chart of the method for automatically solving application problems based on a teacher-student network and a multi-head decoder provided by the present invention.
FIG. 2 is a schematic diagram of the teacher-student network problem solving provided by the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of the embodiments; to those of ordinary skill in the art, various changes are possible without departing from the spirit and scope of the invention as defined by the appended claims, and everything produced using the inventive concept falls under the protection of the invention.
Example 1:
As shown in FIGS. 1-2, the method for automatically solving application problems based on a teacher-student network and a multi-head decoder comprises the following steps:
S1, constructing a sequence-to-tree encoder-decoder model with only one decoder, and using it as the teacher network;
S2, training the teacher network on the training samples, taking the labels of the training samples as hard label vectors, and taking the class vectors that the trained teacher network outputs for the training samples as soft label vectors;
S3, constructing a sequence-to-tree encoder-decoder model with a multi-head tree-structured decoder, and using it as the student network;
S4, constructing a supervision signal based on both the hard label vectors and the soft label vectors, and training the student network on the training samples with the constructed supervision signal;
S5, inputting the application problem to be solved into the trained student network, generating several problem-solving equations with the multi-head tree-structured decoder of the student network, and determining the corresponding confidences;
S6, selecting the problem-solving equation with the highest confidence, and computing the corresponding answer from that equation, thereby completing the automatic solving (steps S5-S6 are illustrated by the sketch following this list).
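For illustration only (not part of the claimed method), the following minimal Python sketch shows one way steps S5-S6 could be realized. The names student_encoder and decoders, and the use of the summed token log-probabilities as the confidence score, are assumptions of this sketch rather than details fixed by the description.

```python
# A minimal sketch of inference (steps S5-S6), assuming each decoder head
# returns a candidate prefix equation together with the sum of the
# log-probabilities of its tokens, used here as the confidence score.
def solve(problem_tokens, student_encoder, decoders):
    H = student_encoder(problem_tokens)           # word-level hidden states
    candidates = []
    for decoder in decoders:                      # the N tree-structured heads
        equation, log_prob = decoder.generate(H)  # prefix equation + confidence
        candidates.append((log_prob, equation))
    # S6: keep only the single highest-confidence equation
    _, best_equation = max(candidates, key=lambda c: c[0])
    return best_equation
```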
In step S1 of the present embodiment, the teacher network is regarded as a function that maps the training sample x to the label y; the parameters of the teacher network are denoted θ_T, hence the teacher network is written f(x, θ_T). The training samples of the teacher network are application problem stems and their corresponding solutions; the supervision signal of the training process comes from the labels of the training samples, and the label y in this embodiment is a 0-1 distributed label vector, i.e., a hard label vector.
In step S2 of this embodiment, the method for training the teacher network by using the training samples specifically includes:
A1, obtaining the word-level hidden state representation H of the word set X of the application problem stem text through the encoder structure of the teacher network;
where the word set X = {x_1, ..., x_m, ..., x_M} and the word-level hidden state representation H = {h_1, ..., h_m, ..., h_M}; H is the set of real-valued vectors corresponding to the stem text, x_m is the m-th word of the stem text, and h_m is the m-th element of the word-level hidden state representation;
A2, inputting the word-level hidden state representation H into the tree-structured decoder of the teacher network, which outputs a class vector at each time step;
when a group of word-level hidden state representations H is input into the tree-structured decoder of the teacher network, the decoder outputs a class vector (i.e., an output probability distribution) at each time step, following the order of the prefix expression, and the value at each position of this vector expresses the probability of generating the corresponding number or operator at that step (see the prefix-expression example below);
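To make the prefix (pre-order) output format concrete, the following small Python sketch evaluates such a sequence once decoding is finished; the four-operator vocabulary is an illustrative assumption.

```python
# Evaluate a problem-solving equation emitted in prefix (pre-order) form,
# e.g. ['+', '3', '*', '4', '2'] stands for 3 + 4 * 2 = 11.
def eval_prefix(tokens):
    def helper(it):
        tok = next(it)
        if tok in {'+', '-', '*', '/'}:           # operator: recurse twice
            left, right = helper(it), helper(it)
            return {'+': left + right, '-': left - right,
                    '*': left * right, '/': left / right}[tok]
        return float(tok)                         # leaf: a number
    return helper(iter(tokens))

assert eval_prefix(['+', '3', '*', '4', '2']) == 11.0
```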
A3, determining the loss function of the teacher network based on the output class vectors and the label y of the training sample (a 0-1 distributed label vector, i.e., the number or operator labeled at that step is set to 1 and all other positions to 0);
A4, training the teacher network on the training samples based on this loss function, using the labels of the training samples as the supervision signal of the training process.
The encoder structure in step A1 is a bidirectional recurrent neural network or a unidirectional recurrent neural network; this embodiment uses a bidirectional recurrent neural network, which extracts the features corresponding to each word one by one in the order in which the words occur in the stem description, so that the influence of the context on both sides of a word on its semantics can be taken into account. A minimal sketch of such an encoder is given below.
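The sketch below is one possible realization of this encoder in PyTorch; the choice of a GRU cell and the embedding and hidden sizes are illustrative assumptions, not prescribed by the text.

```python
import torch
import torch.nn as nn

# A minimal bidirectional recurrent encoder: token ids in, the word-level
# hidden state representation H out (one vector per word, both directions).
class BiRNNEncoder(nn.Module):
    def __init__(self, vocab_size, emb_size=128, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.GRU(emb_size, hidden_size,
                          batch_first=True, bidirectional=True)

    def forward(self, token_ids):          # token_ids: (batch, M)
        emb = self.embedding(token_ids)    # (batch, M, emb_size)
        H, _ = self.rnn(emb)               # (batch, M, 2 * hidden_size)
        return H                           # word-level hidden states
```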
In step A3, the loss function L_NLL(θ_T) of the teacher network is:

L_NLL(θ_T) = - Σ_t Σ_{k1=1..V1} 1{y_t = k1} log p1(y_t = k1 | x; θ_T)

where 1{·} is the indicator function, V1 is the number of numbers and operators in the problem-solving equations generated by the decoder of the teacher network, k1 indexes a particular number or operator in the loss function of the teacher network, and p1(·) is the distribution corresponding to the class vector output by the teacher network.
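A per-step version of this negative log-likelihood can be written compactly; the sketch below assumes the decoder's class scores for all T decoding steps are stacked into a single tensor (an implementation convenience, not part of the method's definition).

```python
import torch
import torch.nn.functional as F

# Teacher loss L_NLL(theta_T): cross entropy between the class vector at
# each decoding step and the 0-1 (hard) label of that step.
def teacher_nll_loss(logits, hard_labels):
    # logits: (T, V1) class scores per decoding step
    # hard_labels: (T,) index k1 of the labeled number/operator per step
    log_p1 = F.log_softmax(logits, dim=-1)
    return F.nll_loss(log_p1, hard_labels)   # mean over decoding steps
```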
The encoder of the student network constructed in step S3 has the same structure as the encoder of the teacher network; the largest difference from the teacher network is that the student network contains several tree-structured decoders with independent parameters. To make the outputs of the different decoders as different as possible, during the training of the student network a diversification regularization term is added at the output of each decoder, and different noise is added at the input of each decoder, so that the diversity of the multi-head decoder outputs is enhanced through variations of the inputs.
In step S4 of this embodiment, the input and output when training the student network are x and y, respectively, and our aim is to learn the student network parameters θ_S such that f(x, θ_S): x → y. On this basis, in step S4, the method for determining the loss function of the student network training process includes:

B1, determining the loss L_NLL(θ_S) of the student network based on the hard label vector:

L_NLL(θ_S) = - Σ_t Σ_{k2=1..V2} 1{y_t = k2} log p2(y_t = k2 | x; θ_S)

where θ_S denotes the parameters of the student network, 1{·} is the indicator function, V2 is the number of numbers and operators in the problem-solving equations generated by the decoders of the student network, k2 indexes a particular number or operator in the loss function of the student network, and p2(·) is the distribution corresponding to the class vector output by the student network;
B2, calculating the cross-entropy loss L_KD(θ_S; θ_T) between the student network output and the teacher network output:

L_KD(θ_S; θ_T) = - Σ_{k=1..V} q(y = k | x; θ_T) log p(y = k | x; θ_S)

where q(y = k | x; θ_T) takes out the value at the k-th position of the soft label vector output by the teacher network, p(y = k | x; θ_S) is the value at the k-th position of the class vector output by the student network, V is the number of numbers and operators in the shared output vocabulary of the teacher and student networks (both generate the same kind of problem-solving equation), and k indexes a particular number or operator;
B3, based on the cross-entropy loss L_KD(θ_S; θ_T) and the loss L_NLL(θ_S), determining the teacher-to-student loss L_TS,i(θ_S, θ_T) corresponding to the i-th decoder of the student network:

L_TS,i(θ_S, θ_T) = (1 - α) L_NLL(θ_S) + α L_KD(θ_S; θ_T)

where α is an interpolation parameter;
On the basis of the loss L_TS,i(θ_S, θ_T), we hope to make the generated results more diverse by feeding different hidden vectors to the different decoders of the student network. Specifically, besides feeding the word-level hidden state representation H obtained directly from the encoder to one head decoder, perturbations are added to the inputs of the other decoders; the perturbed inputs of the other decoders are obtained through the following steps B4-B5:
B4, obtaining the word-level hidden state representation H of the word set x of the application problem stem text in a training sample through the encoder structure of the student network;
B5, masking the word-level hidden state representation H to generate the set of hidden vector groups {H_1, H_2, ..., H_i, ..., H_N} corresponding to H, and inputting the hidden vector groups into the respective decoders of the student network;
where i is the index of a decoder in the student network, and N is the total number of decoders in the student network;
B6, introducing a diversification regularization term L_div and combining it with the losses L_TS,i(θ_S, θ_T) to obtain the loss function L of the student network training process.
Step B5 is specifically:
B5-1, defining the mask rate P_mask;
B5-2, sampling a proportion P_mask of the positions in the word-level hidden state representation H using a Gaussian distribution, and generating a zero matrix Mask_zero of the same shape as H;
B5-3, assigning 1 to the sampled positions of the zero matrix Mask_zero, generating the mask matrix Mask_p;
B5-4, determining the set of hidden vector groups {H_1, H_2, ..., H_i, ..., H_N} corresponding to the word-level hidden state representation H by H_i = Mask_p ⊙ H;
where ⊙ is the element-wise (Hadamard) product;
B5-5, inputting each hidden vector group H_i of {H_1, H_2, ..., H_i, ..., H_N} into its corresponding decoder (see the illustrative sketch below).
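The following Python sketch follows steps B5-1 to B5-5 literally: Gaussian scores pick a proportion P_mask of the positions, Mask_zero is a zero matrix shaped like H, the sampled positions are set to 1 to form Mask_p, and H_i = Mask_p ⊙ H. Whether the sampled positions are the ones kept (as written here) or the ones dropped is ambiguous in the text; inverting mask_p gives the other reading.

```python
import torch

# Generate one perturbed copy of H per decoder head (steps B5-1..B5-5).
def masked_inputs(H, num_decoders, p_mask):
    M = H.size(0)                         # H: (M, d) word-level hidden states
    inputs = [H]                          # one head sees the unperturbed H
    for _ in range(num_decoders - 1):
        scores = torch.randn(M)                   # B5-2: Gaussian sampling
        k = max(1, int(p_mask * M))
        sampled = scores.topk(k).indices          # sampled positions
        mask_p = torch.zeros_like(H)              # B5-2: Mask_zero
        mask_p[sampled] = 1.0                     # B5-3: Mask_p
        inputs.append(mask_p * H)                 # B5-4: H_i = Mask_p ⊙ H
    return inputs                         # B5-5: {H_1, ..., H_N}, one per head
```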
In step B6 of this embodiment, to encourage different decoders to generate different results, we introduce a diversification regularization term L_div:

L_div = Σ_{i ≠ i1} Σ_{t=1..T} L_div,t

where i and i1 are the indices of two different decoders, t indexes a symbol or number position in the problem-solving equation, T is the length of the generated sequence, and L_div,t is the loss measuring the similarity of the generated equations, with L_div,t = 1 + S_COS(y_i,t, y_i1,t), where S_COS(·,·) is the cosine similarity of any two decoder outputs and y_i,t and y_i1,t are the outputs of the i-th and i1-th decoders of the student network at position t.
More specifically, we use cosine similarity to measure the difference between the outputs of any two decoders; the aim is to promote the diversity of the generated problem-solving equations, and no beam search over the decoder outputs is required. The loss function L of the student network training process in step B6 is thus:

L = L_TS,N(θ_S, θ_T) + β L_div

where β is the weight of the regularization term and L_TS,N(θ_S, θ_T) = Σ_{i=1..N} L_TS,i(θ_S, θ_T) is the teacher-to-student loss summed over all decoders of the student network.
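A sketch of the diversification term and the total loss follows; representing each decoder's outputs as a (T, V) tensor of per-step class vectors, and summing unordered decoder pairs once each, are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

# L_div: for every pair of decoder heads, sum 1 + cosine similarity of
# their per-position output vectors (lower similarity means lower loss).
def diversity_loss(outputs):              # outputs[i]: (T, V) for decoder i
    l_div = outputs[0].new_zeros(())
    n = len(outputs)
    for i in range(n):
        for i1 in range(i + 1, n):
            cos = F.cosine_similarity(outputs[i], outputs[i1], dim=-1)  # (T,)
            l_div = l_div + (1.0 + cos).sum()
    return l_div

# Total loss L = sum_i L_TS,i + beta * L_div.
def total_loss(ts_losses, outputs, beta):
    return sum(ts_losses) + beta * diversity_loss(outputs)
```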
Example 2:
This embodiment compares the proposed method with existing solving methods on two commonly used data sets:
Of the two data sets, MAWPS contains 2373 problems and Math23K contains 23162 problems. For the Math23K data set, some methods are evaluated on the publicly available training/test split (denoted "Math23K"), while others are evaluated by 5-fold cross-validation (denoted "Math23K*"). For the MAWPS data set, models are evaluated by 5-fold cross-validation. Following prior work, answer accuracy is used as the evaluation metric. As shown in Table 1 (the figures give the accuracy of each model on the test set; larger is better), the proposed method performs better than the existing GROUPATT, Math-EN, and DNS methods.
Table 1: Comparison of the proposed method with existing methods

Method             MAWPS    Math23K    Math23K*
DNS                59.5     -          58.1
Math-EN            69.2     66.9       -
GROUPATT           76.1     69.5       66.9
Proposed method    84.4     77.4       75.1

Claims (10)

1. An automatic problem solving method for application problems based on a teacher-student network and a multi-head decoder is characterized by comprising the following steps of:
S1, constructing a sequence-to-tree encoder-decoder model with only one decoder, and using it as the teacher network;
S2, training the teacher network on the training samples, taking the labels of the training samples as hard label vectors, and taking the class vectors that the trained teacher network outputs for the training samples as soft label vectors;
S3, constructing a sequence-to-tree encoder-decoder model with a multi-head tree-structured decoder, and using it as the student network;
S4, constructing a supervision signal based on both the hard label vectors and the soft label vectors, and training the student network on the training samples with the constructed supervision signal;
S5, inputting the application problem to be solved into the trained student network, generating several problem-solving equations with the multi-head tree-structured decoder of the student network, and determining the corresponding confidences;
S6, selecting the problem-solving equation with the highest confidence, and computing the corresponding answer from that equation, thereby completing the automatic solving.
2. The method of claim 1, wherein in step S1 the teacher network is regarded as a function f(x, θ_T) that maps a training sample x to its label y, where θ_T denotes the parameters of the teacher network;
the training samples are application problem stems together with their corresponding solutions; the label y is a 0-1 distributed label vector.
3. The method for automatically solving application problems based on a teacher-student network and a multi-head decoder as claimed in claim 2, wherein in step S2 the method for training the teacher network on the training samples is specifically:
A1, obtaining the word-level hidden state representation H of the word set X of the application problem stem text through the encoder structure of the teacher network;
A2, inputting the word-level hidden state representation H into the tree-structured decoder of the teacher network, which outputs a class vector at each time step;
A3, determining the loss function of the teacher network based on the output class vectors and the label y of the training sample;
A4, training the teacher network on the training samples based on this loss function, using the labels of the training samples as the supervision signal of the training process.
4. The method for automatically solving application problems based on a teacher-student network and a multi-head decoder as claimed in claim 3, wherein the encoder structure in step A1 is a bidirectional recurrent neural network.
5. The method for automatically solving application problems based on a teacher-student network and a multi-head decoder as claimed in claim 3, wherein in step A3 the loss function L_NLL(θ_T) of the teacher network is:

L_NLL(θ_T) = - Σ_t Σ_{k1=1..V1} 1{y_t = k1} log p1(y_t = k1 | x; θ_T)

where 1{·} is the indicator function, V1 is the number of numbers and operators in the problem-solving equations generated by the decoder of the teacher network, k1 indexes a particular number or operator in the loss function of the teacher network, and p1(·) is the distribution corresponding to the class vector output by the teacher network.
6. The method of claim 3, wherein the encoder structure of the student network constructed in step S3 is the same as that of the teacher network, the student network comprises several tree-structured decoders with independent parameters, and during the training of the student network a diversification regularization term is added at the output of each decoder and different noise is added at the input of each decoder.
7. The method for automatically solving application problems based on a teacher-student network and a multi-head decoder as claimed in claim 6, wherein in step S4 the loss function of the student network training process is determined by:
B1, determining the loss L_NLL(θ_S) of the student network based on the hard label vector:

L_NLL(θ_S) = - Σ_t Σ_{k2=1..V2} 1{y_t = k2} log p2(y_t = k2 | x; θ_S)

where θ_S denotes the parameters of the student network, 1{·} is the indicator function, V2 is the number of numbers and operators in the problem-solving equations generated by the decoders of the student network, k2 indexes a particular number or operator in the loss function of the student network, and p2(·) is the distribution corresponding to the class vector output by the student network;
B2, calculating the cross-entropy loss L_KD(θ_S; θ_T) between the student network output and the teacher network output:

L_KD(θ_S; θ_T) = - Σ_{k=1..V} q(y = k | x; θ_T) log p(y = k | x; θ_S)

where q(y = k | x; θ_T) takes out the value at the k-th position of the soft label vector output by the teacher network, p(y = k | x; θ_S) is the value at the k-th position of the class vector output by the student network, V is the number of numbers and operators in the shared output vocabulary of the teacher and student networks (both generate the same kind of problem-solving equation), and k indexes a particular number or operator;
B3, based on the cross-entropy loss L_KD(θ_S; θ_T) and the loss L_NLL(θ_S), determining the teacher-to-student loss L_TS,i(θ_S, θ_T) corresponding to the i-th decoder of the student network:

L_TS,i(θ_S, θ_T) = (1 - α) L_NLL(θ_S) + α L_KD(θ_S; θ_T)

where α is an interpolation parameter;
B4, obtaining the word-level hidden state representation H of the word set x of the application problem stem text in a training sample through the encoder structure of the student network;
B5, masking the word-level hidden state representation H to generate the set of hidden vector groups {H_1, H_2, ..., H_i, ..., H_N} corresponding to H, and inputting the hidden vector groups into the respective decoders of the student network;
where i is the index of a decoder in the student network, and N is the total number of decoders in the student network;
B6, introducing a diversification regularization term L_div and combining it with the losses L_TS,i(θ_S, θ_T) to obtain the loss function L of the student network training process.
8. The method for automatically solving application problems based on a teacher-student network and a multi-head decoder as claimed in claim 7, wherein step B5 is specifically:
B5-1, defining the mask rate P_mask;
B5-2, sampling a proportion P_mask of the positions in the word-level hidden state representation H using a Gaussian distribution, and generating a zero matrix Mask_zero of the same shape as H;
B5-3, assigning 1 to the sampled positions of the zero matrix Mask_zero, generating the mask matrix Mask_p;
B5-4, determining the set of hidden vector groups {H_1, H_2, ..., H_i, ..., H_N} corresponding to the word-level hidden state representation H by H_i = Mask_p ⊙ H;
where ⊙ is the element-wise (Hadamard) product;
B5-5, inputting each hidden vector group H_i of {H_1, H_2, ..., H_i, ..., H_N} into its corresponding decoder.
9. The method for automatically solving application problems based on a teacher-student network and a multi-head decoder as claimed in claim 7, wherein in step B6 the introduced diversification regularization term L_div is:

L_div = Σ_{i ≠ i1} Σ_{t=1..T} L_div,t

where i and i1 are the indices of two different decoders, t indexes a symbol or number position in the problem-solving equation, T is the length of the generated sequence, and L_div,t is the loss measuring the similarity of the generated equations, with L_div,t = 1 + S_COS(y_i,t, y_i1,t), where S_COS(·,·) is the cosine similarity of any two decoder outputs and y_i,t and y_i1,t are the outputs of the i-th and i1-th decoders of the student network at position t.
10. The method of claim 9, wherein the loss function L of the student network training process in step B6 is:

L = L_TS,N(θ_S, θ_T) + β L_div

where β is the weight of the regularization term and L_TS,N(θ_S, θ_T) = Σ_{i=1..N} L_TS,i(θ_S, θ_T) is the teacher-to-student loss summed over all decoders of the student network.
CN202010402148.3A 2020-05-13 2020-05-13 Automatic problem solving method for application problems based on teacher-student network and multi-head decoder Active CN111553821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010402148.3A CN111553821B (en) 2020-05-13 2020-05-13 Automatic problem solving method for application problems based on teacher-student network and multi-head decoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010402148.3A CN111553821B (en) 2020-05-13 2020-05-13 Automatic problem solving method for application problems based on teacher-student network and multi-head decoder

Publications (2)

Publication Number Publication Date
CN111553821A true CN111553821A (en) 2020-08-18
CN111553821B CN111553821B (en) 2021-04-27

Family

ID=72004626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010402148.3A Active CN111553821B (en) 2020-05-13 2020-05-13 Automatic problem solving method for application problems based on teacher-student network and multi-head decoder

Country Status (1)

Country Link
CN (1) CN111553821B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836801A (en) * 2021-02-03 2021-05-25 上海商汤智能科技有限公司 Deep learning network determination method and device, electronic equipment and storage medium
CN117521812A (en) * 2023-11-20 2024-02-06 华中师范大学 Automatic arithmetic text question solving method and system based on variational knowledge distillation
CN117521812B (en) * 2023-11-20 2024-06-07 华中师范大学 Automatic arithmetic text question solving method and system based on variational knowledge distillation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180336465A1 (en) * 2017-05-18 2018-11-22 Samsung Electronics Co., Ltd. Apparatus and method for student-teacher transfer learning network using knowledge bridge
US20180365564A1 (en) * 2017-06-15 2018-12-20 TuSimple Method and device for training neural network
CN110428010A (en) * 2019-08-05 2019-11-08 中国科学技术大学 Knowledge method for tracing
CN110739003A (en) * 2019-10-23 2020-01-31 北京计算机技术及应用研究所 Voice enhancement method based on multi-head self-attention mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180336465A1 (en) * 2017-05-18 2018-11-22 Samsung Electronics Co., Ltd. Apparatus and method for student-teacher transfer learning network using knowledge bridge
US20180365564A1 (en) * 2017-06-15 2018-12-20 TuSimple Method and device for training neural network
CN110428010A (en) * 2019-08-05 2019-11-08 中国科学技术大学 Knowledge method for tracing
CN110739003A (en) * 2019-10-23 2020-01-31 北京计算机技术及应用研究所 Voice enhancement method based on multi-head self-attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Qian Guo et al., "MS-Pointer Network: Abstractive Text Summary", IEEE Access *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836801A (en) * 2021-02-03 2021-05-25 上海商汤智能科技有限公司 Deep learning network determination method and device, electronic equipment and storage medium
CN117521812A (en) * 2023-11-20 2024-02-06 华中师范大学 Automatic arithmetic text question solving method and system based on variational knowledge distillation
CN117521812B (en) * 2023-11-20 2024-06-07 华中师范大学 Automatic arithmetic text question solving method and system based on variational knowledge distillation

Also Published As

Publication number Publication date
CN111553821B (en) 2021-04-27

Similar Documents

Publication Publication Date Title
Zhang et al. Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition
CN107967318A (en) A kind of Chinese short text subjective item automatic scoring method and system using LSTM neutral nets
CN110134946B (en) Machine reading understanding method for complex data
Xue et al. A hierarchical BERT-based transfer learning approach for multi-dimensional essay scoring
CN113673254B (en) Knowledge distillation position detection method based on similarity maintenance
CN110781681B (en) Automatic first-class mathematic application problem solving method and system based on translation model
CN113486645A (en) Text similarity detection method based on deep learning
Shakeel et al. A multi-cascaded deep model for bilingual sms classification
Lin et al. Automated prediction of item difficulty in reading comprehension using long short-term memory
CN114153942B (en) Event time sequence relation extraction method based on dynamic attention mechanism
CN111553821B (en) Automatic problem solving method for application problems based on teacher-student network and multi-head decoder
Ye et al. Machine learning techniques to automate scoring of constructed-response type assessments
Zhong [Retracted] Evaluation of Traditional Culture Teaching Efficiency by Course Ideological and Political Integration Lightweight Deep Learning
CN116521872B (en) Combined recognition method and system for cognition and emotion and electronic equipment
Song [Retracted] An Evaluation Method of English Teaching Ability Based on Deep Learning
CN110969010A (en) Problem generation method based on relationship guidance and dual-channel interaction mechanism
CN114579706B (en) Automatic subjective question review method based on BERT neural network and multi-task learning
CN115935969A (en) Heterogeneous data feature extraction method based on multi-mode information fusion
CN114692615A (en) Small sample semantic graph recognition method for small languages
CN114970557A (en) Knowledge enhancement-based cross-language structured emotion analysis method
CN114840679A (en) Robot intelligent learning guiding method based on music theory knowledge graph reasoning and application
CN115617959A (en) Question answering method and device
Wang et al. Teacher Talk Moves in K12 Mathematics Lessons: Automatic Identification, Prediction Explanation, and Characteristic Exploration
Li et al. A Multimodal Machine Learning Framework for Teacher Vocal Delivery Evaluation
Li et al. Automated essay scoring incorporating multi-level semantic features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant