CN111563392A - Method and device for evaluating importance degree of model parameters and electronic equipment - Google Patents


Info

Publication number
CN111563392A
CN111563392A (application CN202010394212.8A)
Authority
CN
China
Prior art keywords
training
model
model parameter
parameter
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010394212.8A
Other languages
Chinese (zh)
Inventor
朱聪慧
刘乐茂
李冠林
史树明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010394212.8A
Publication of CN111563392A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the application provides a method and a device for evaluating the importance degree of model parameters, and electronic equipment, relating to the technical field of computers. The method comprises the following steps: training a neural machine translation (NMT) model based on a training data set to obtain the parameter value variation of each model parameter before and after each training; sampling the training data set to obtain a sampled data set; determining, based on the sampled data set, the gradient of the loss function of the NMT model corresponding to each training with respect to each model parameter; and determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter. In this technical scheme, the gradient of each model parameter is determined based on the sampled data set, which reduces the amount of computation; the importance degree of each model parameter is determined based on its gradient and its parameter value variation before and after training, so that the contribution of each model parameter to the convergence of the loss function can be determined.

Description

Method and device for evaluating importance degree of model parameters and electronic equipment
Technical Field
The application relates to the technical field of computers, and in particular to a method and a device for evaluating the importance degree of model parameters, and to electronic equipment.
Background
Since its introduction, the Neural Machine Translation (NMT) model has rapidly become the focus of machine translation research. Not only does the NMT model produce impressive translation quality, it also has structural advantages: compared with traditional statistical machine translation, it models the language model, the translation model, and the alignment model in a unified manner rather than as a pipeline, which reduces the side effects caused by error accumulation.
As a complex neural network, the NMT model can have as many as 108 million parameters across the whole network, all of which are trained iteratively until convergence. However, the NMT model is currently trained as a black box: the effect of each model parameter on the reduction of the loss function during training cannot be observed, which increases the difficulty of understanding the model training mechanism and is not conducive to further improving model performance in a targeted manner.
Disclosure of Invention
The application provides a method and a device for evaluating the importance degree of a model parameter, and electronic equipment, so as to solve at least one problem in the prior art.
The embodiment of the application provides the following specific technical scheme:
in a first aspect, an embodiment of the present application provides a method for evaluating importance of a model parameter, where the method includes:
training a neural machine translation NMT model based on a training data set to obtain parameter value variation of each model parameter before and after each training;
sampling the training data set to obtain a sampling data set;
determining the gradient of a loss function of the NMT model corresponding to each training relative to each model parameter based on the sampling data set;
and determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter, as sketched below.
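By way of illustration only, the four steps above can be sketched in code. The following minimal sketch assumes a PyTorch-style model and a plain SGD update; model, loss_fn, train_batches, sampled_batch, and lr are illustrative stand-ins rather than elements of the claimed method:

    import torch

    def evaluate_parameter_importance(model, loss_fn, train_batches, sampled_batch, lr=0.1):
        # Accumulated importance per named parameter.
        importance = {n: 0.0 for n, _ in model.named_parameters()}
        for batch in train_batches:  # one "training" = one parameter update step
            # Gradient of the loss on the *sampled* data set at the pre-update parameters.
            model.zero_grad()
            loss_fn(model, sampled_batch).backward()
            grads = {n: p.grad.detach().clone() for n, p in model.named_parameters()}
            before = {n: p.detach().clone() for n, p in model.named_parameters()}
            # One ordinary training step (plain SGD on a training batch, for illustration).
            model.zero_grad()
            loss_fn(model, batch).backward()
            with torch.no_grad():
                for p in model.parameters():
                    p -= lr * p.grad
            # Importance: gradient times parameter-value variation, accumulated.
            with torch.no_grad():
                for n, p in model.named_parameters():
                    importance[n] += (grads[n] * (p - before[n])).sum().item()
        return importance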
In a second aspect, there is provided an importance level evaluation apparatus for model parameters, the apparatus comprising:
the obtaining module is used for training the NMT model based on the training data set and obtaining the parameter value variation of each model parameter before and after each training;
the sampling module is used for sampling the training data set to obtain a sampling data set;
a first determining module, configured to determine, based on the sampled data set, a gradient of a loss function of the NMT model corresponding to each training with respect to each of the model parameters;
and the second determining module is used for determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method for evaluating the importance degree of model parameters according to the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium for storing a computer program which, when run on a computer, causes the computer to execute the importance degree evaluation method for model parameters shown in the first aspect of the present application or any possible implementation manner of the first aspect.
The beneficial effects brought by the technical solution provided by the application are as follows:
the application provides a method and a device for evaluating the importance degree of model parameters and electronic equipment, wherein the gradient of a loss function of an NMT (non-uniform matrix test) model corresponding to each training relative to each model parameter is determined based on a sampling data set obtained by sampling a training data set, so that the calculated amount of data is reduced; the parameter value variation of each model parameter before and after each training of the NMT model is obtained, the importance degree of each model parameter is determined based on the gradient and the parameter value variation corresponding to each training of each model parameter, the contribution of each model parameter in the process of the convergence of the loss function can be determined, and the performance of the model can be improved in a targeted manner.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a flow chart of Transformer-based neural machine translation provided by an embodiment of the present application;
fig. 2 is a flowchart of a method for evaluating importance of model parameters according to an embodiment of the present disclosure;
FIG. 3a is a distribution diagram of LCA contribution weights of modules of an NMT model in a German-English translation task according to an embodiment of the present application;
FIG. 3b is a distribution diagram of LCA contribution weights of modules of the NMT model in an English-German translation task according to an embodiment of the present application;
fig. 4a is a distribution diagram of LCA contribution weights of each layer in the encoder and decoder of the NMT model in the German-English translation task according to an embodiment of the present application;
fig. 4b is a distribution diagram of LCA contribution weights of each layer in the encoder and decoder of the NMT model in the English-German translation task according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a comparison of LCA calculation based on a sample data set and a training data corpus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an importance degree evaluation apparatus for model parameters according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
The execution subject of the technical scheme of the application is computer equipment, including but not limited to a server, a personal computer, a notebook computer, a tablet computer, a smart phone, and the like. Computer equipment comprises user equipment and network equipment. User equipment includes but is not limited to computers, smart phones, tablets, etc.; network equipment includes but is not limited to a single network server, a server group consisting of a plurality of network servers, or a cloud consisting of a large number of computers or network servers for cloud computing, where cloud computing is a kind of distributed computing in which a super virtual computer is composed of a group of loosely coupled computers. The computer equipment can run independently to realize the application, or can access a network and realize the application through interaction with other computer equipment in the network. The network in which the computer equipment is located includes but is not limited to the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, etc.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision technology, voice processing technology, natural language processing technology, machine learning/deep learning, and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
For better understanding and description of the solutions provided by the embodiments of the present application, a brief description of the related art related to the embodiments of the present application will be provided below.
Loss Change Allocation (LCA): a method for measuring the contribution of each parameter to the reduction of the loss function during training.
Transformer: a neural network framework.
Attention (attention mechanism): a probabilistic model for selecting a context.
Encoder: the encoder in the NMT model.
Decoder: the decoder in the NMT model.
The NMT model is a complex neural network. Taking the currently mainstream Transformer framework as an example, it mainly includes functional components such as the encoder word vector layer, the encoder layers, the decoder word vector layer, the decoder layers, and the decoder Softmax layer (the decoder output layer; the Softmax layer shown in the figure). Each encoder layer/decoder layer is composed of basic encoding units/decoding units. All components are organically combined to form one layer of the network and then stacked layer by layer to form the whole network, where the encoder layers convert the input sentence (source language) into a semantic vector and the decoder layers convert the semantic vector into the output sentence (target language).
Fig. 1 shows a flowchart of Transformer-based neural machine translation. Taking an example in which the encoder and the decoder each have 6 layers, the translation process is mainly divided into two parts: an encoder neural network (comprising the encoder word vector layer and encoder layers 1-6) and a decoder neural network (comprising the decoder word vector layer and decoder layers 1-6). The encoder neural network encodes an input sentence X into a representation vector H, the decoder neural network decodes the representation vector H of X, and the translation result Y of X is finally output.
Specifically, given an input source-language sentence X, it is first encoded layer by layer into a representation vector H using the encoder layers, each of which uses a self-attention mechanism. In the decoding phase, one new word is generated at a time, and the process loops until an end symbol is generated. In particular, each decoder layer uses not only the self-attention mechanism but also an encoder-attention mechanism (indicated by the dotted line in Fig. 1): the output of each decoder layer is decoded again, together with the representation vector H of X, through the encoder-attention mechanism, and the translation result Y is finally output.
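As an illustration of the flow just described, the sketch below shows only the loop structure; encoder, decoder, and the token ids are assumed placeholders rather than the components of Fig. 1:

    def translate(encoder, decoder, src_ids, bos_id, eos_id, max_len=128):
        h = encoder(src_ids)          # encode X layer by layer into the representation H
        out = [bos_id]
        for _ in range(max_len):
            # each decoder layer uses self-attention over `out` and
            # encoder attention over H (the dotted line in Fig. 1)
            logits = decoder(out, h)
            next_id = int(logits[-1].argmax())
            out.append(next_id)
            if next_id == eos_id:     # loop until the end symbol is generated
                break
        return out[1:]                # the translation result Y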
The NMT model has a complex network structure and is currently trained as a black box: it is not possible to know which modules play a greater role in the convergence process, or which modules' model parameters converge and stabilize first. This increases the difficulty of understanding the model training mechanism and is not conducive to improving model performance.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The embodiment of the application provides a method for evaluating the importance degree of a model parameter, as shown in fig. 2, the method includes:
step S101, training an NMT model based on a training data set, and acquiring parameter value variable quantity of each model parameter before and after each training;
when the NMT model is trained based on the training data set, the parameter values of the model parameters of the NMT model are updated by each training, and the model parameters are trained iteratively until the loss function of the model converges. Because the parameter values of the model parameters change before and after each training, the variation of each parameter value is obtained and used to calculate the degree of loss function reduction.
It should be noted that training the NMT model requires iterating over multiple training steps until the loss function converges. "Each training" in this application refers to each of the multiple training steps in the NMT model training process, not to the entire training procedure from the start of training to the convergence of the loss function of the model.
Step S102, sampling a training data set to obtain a sampling data set;
step S103, determining the gradient of the loss function of the NMT model corresponding to each training relative to each model parameter based on the sampling data set;
the NMT model has a complex structure, the scale of model parameters in the whole network model is huge, for example, 1.08 hundred million model parameters can be reached, and all model parameters are trained iteratively until the loss function of the model converges. If the gradients are calculated over the entire training data set, the computational cost of calculating gradients over the entire training data at each model parameter update is unacceptable for the enormous training data scale of NMT. Calculating the gradient based on the entire training data set would increase the computational consumption in proportion to the training data size, with a large computational resource consumption.
The method samples the training data set to obtain a sampled data set smaller than the training data set, and after each training, takes partial derivatives of the loss function with respect to the model parameters over the sampled data set to obtain the gradient of the loss function with respect to each model parameter. For example, to obtain the gradient of a model parameter a, each sample and its label in the sampled data set are substituted into the loss function, together with the current values of all model parameters other than a, so that a is the only unknown in the loss function; the partial derivative of the loss function with respect to a is then the gradient of a. Because the gradient of each model parameter is calculated on the sampled data set, the amount of computation is greatly reduced and the computational efficiency is greatly improved.
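The partial-derivative step for a single parameter can be illustrated with a toy example; the model (a linear map), the parameter a, and all numbers below are made up for illustration and are not from the patent:

    import torch

    a = torch.tensor(0.5, requires_grad=True)   # the model parameter of interest
    b = torch.tensor(1.2)                       # another parameter, held at its current value
    x = torch.tensor([1.0, 2.0, 3.0])           # sampled inputs
    y = torch.tensor([2.0, 3.5, 5.0])           # sample label data
    loss = ((a * x + b - y) ** 2).mean()        # a is the only unknown in the loss
    loss.backward()                             # partial derivative of the loss w.r.t. a
    print(a.grad)                               # the gradient of model parameter a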
And step S104, determining the importance degree of each model parameter based on the gradient and parameter value variation corresponding to each training of each model parameter.
After each training of the NMT model, the parameter value of each model parameter and the loss function are obtained, and an evaluation index for the importance degree of a model parameter is determined based on the loss functions obtained from two trainings, for example by subtracting the two loss values obtained from two consecutive trainings.
When calculating the evaluation index of the importance degree of the model parameters, the LCA value corresponding to each model parameter can be calculated based on the gradient and the variation of the model parameter corresponding to each training, with the LCA value serving as the evaluation index of the importance degree of the model parameter: the larger the LCA value, the larger the contribution of that model parameter to the reduction of the loss function. From the LCA values, the contribution of each model parameter during the convergence of the loss function can be determined, thereby determining the role each model parameter plays in training, capturing and visualizing more information about the training process, and deepening the understanding of the training process.
In one example, the LCA values of the NMT model are calculated according to a first-order Taylor expansion as follows:

$$\mathcal{L}(\theta_{t+1}; D) - \mathcal{L}(\theta_t; D) \approx \nabla_\theta \mathcal{L}(\theta_t; D)^\top (\theta_{t+1} - \theta_t) = \sum_{i=1}^{K} A_{\mathrm{lca}}[t][i]$$

where $\mathcal{L}(\theta_{t+1}; D)$ denotes the loss function obtained after the t-th training of the NMT model, $\theta$ denotes the model parameters, $D$ denotes the sampled data set, $\mathcal{L}(\theta_t; D)$ denotes the loss function obtained before the t-th training (i.e., corresponding to the (t-1)-th training), $\nabla_\theta \mathcal{L}(\theta_t; D)$ denotes the gradient of the loss function with respect to the model parameters $\theta$, $\theta_{t+1}$ denotes the parameter values of the model parameters after the t-th training, $\theta_t$ denotes the parameter values before the t-th training, $i$ denotes the i-th model parameter, and $K$ denotes the number of model parameters; $\theta_{t+1} - \theta_t$ denotes the variation of the model parameters before and after the t-th training; and $\sum_{i=1}^{K} A_{\mathrm{lca}}[t][i]$ denotes the sum of the LCA values of the model parameters before and after the t-th training.

$$A_{\mathrm{lca}}[t][i] = \frac{\partial \mathcal{L}(\theta_t; D)}{\partial \theta_i} \, (\theta_{t+1,i} - \theta_{t,i})$$

where $A_{\mathrm{lca}}[t][i]$ denotes the LCA value of the i-th parameter at the t-th training, $\partial \mathcal{L}(\theta_t; D) / \partial \theta_i$ denotes the gradient of the loss function with respect to the model parameter $\theta_i$, and $\theta_{t+1,i} - \theta_{t,i}$ denotes the variation of $\theta_i$ before and after the t-th training.
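A direct transcription of these two formulas, assuming the gradients and parameter snapshots are available as dictionaries of torch tensors keyed by parameter name, might look as follows (a sketch, not the patented implementation):

    def lca_values(grads, theta_after, theta_before):
        # A_lca[t][i] = (dL/dtheta_i) * (theta_{t+1,i} - theta_{t,i}), per parameter
        return {n: grads[n] * (theta_after[n] - theta_before[n]) for n in grads}

    def loss_change_estimate(lca):
        # first-order Taylor estimate of L(theta_{t+1}; D) - L(theta_t; D): sum over all i
        return sum(v.sum().item() for v in lca.values())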
When sampling the training data set, sampling may be performed only once: the gradient of each model parameter is calculated on the sampled data set obtained from that single sampling, and the importance degree of the model parameter is determined from the gradient and the parameter value variation. Alternatively, sampling may be performed multiple times: the gradient of each model parameter is calculated on the sampled data set obtained from each sampling, and the importance degree is determined in the same way.
In one possible implementation, sampling a training data set to obtain a sampled data set includes:
sampling the training data set separately for each training, to obtain a sampling data set corresponding to each training;
determining the gradient of the loss function of the NMT model corresponding to each training relative to each model parameter based on the sampling data set, wherein the gradient comprises the following steps:
based on the sample data set corresponding to each training, the gradient of the loss function corresponding to each training with respect to each model parameter is determined.
In practical applications, the objectivity of the data can be enhanced by sampling multiple times: the training data set can be sampled for each training, the model obtained from that training is run on the corresponding sampled data set to obtain the model output for each sample, the loss function for that training is computed from the model outputs and the sample labels, and the partial derivative of the loss with respect to each model parameter yields its gradient. In this embodiment, re-sampling the training set for each training makes the data selection more objective, and computing gradients on the per-training sampled data sets enhances the reliability of the results.
For determining the importance of each model parameter, the present application includes determining the importance of the model parameter in each training process and the cumulative importance in multiple training processes.
In a possible implementation manner, the determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter includes at least one of the following:
for each model parameter, determining the importance degree of the model parameter in each training process based on the gradient of the model parameter corresponding to each training and the parameter value variation;
and for each model parameter, determining the accumulated importance degree of the model parameter in the multiple training processes based on the gradient of the model parameter corresponding to the multiple training and the parameter value variation.
In practical applications, an importance degree evaluation index for each training of each model parameter can be calculated from the gradient computed on the sampled data set for that training and the variation of the model parameter before and after that training. This index evaluates the importance degree of each model parameter in each training and thus measures the contribution of each model parameter to the degree of loss function reduction in each training.
In addition, an importance degree evaluation index over multiple trainings can be calculated for each model parameter from the gradients and parameter value variations corresponding to those trainings. This index evaluates the accumulated importance degree of the model parameter over the multiple trainings and measures its accumulated contribution to the degree of loss function reduction.
The specific implementation manner for determining the accumulated importance degree of the model parameters in the multiple training processes is as follows:
in a possible implementation manner, determining an accumulated importance degree of a model parameter in a plurality of training processes based on a gradient of the model parameter and a parameter value variation corresponding to the plurality of training processes includes:
determining, based on the gradients of the model parameter and the parameter value variations corresponding to a first preset number of consecutive trainings, the accumulated importance degree of the model parameter over every first preset number of trainings, and storing the accumulated importance degree.
In practical applications, the NMT model is trained multiple times until the loss function converges, and the gradient and parameter value variation of each model parameter are obtained for each training. The LCA value of each model parameter for each training can be calculated from that gradient and variation, and by accumulating the LCA values of a given model parameter over a first preset number of consecutive trainings, the accumulated importance degree of that model parameter over every first preset number of trainings can be determined.
When storing the LCA values, because the number of model parameters is large, storing the LCA value of every model parameter for every training would occupy a large amount of storage space. Instead, the sum of the LCA values of each parameter over multiple trainings can be stored, i.e., one LCA value for every first preset number of trainings, thereby saving storage space.
In addition, the sums of the LCA values of each parameter over multiple trainings can be accumulated again to obtain the cumulative sum of the LCA values of each model parameter over the multiple trainings, so as to obtain an overall picture of the loss function reduction of the NMT model over those trainings.
In one example, the cumulative LCA sum of each model parameter of the NMT model from the $t_1$-th to the $t_2$-th training can be expressed as:

$$\mathcal{L}(\theta_{t_2}; D) - \mathcal{L}(\theta_{t_1}; D) \approx \sum_{t=t_1}^{t_2-1} \sum_{i=1}^{K} A_{\mathrm{lca}}[t][i]$$

where $\mathcal{L}(\theta_{t_2}; D)$ denotes the loss function of the NMT model before the $t_2$-th training (i.e., corresponding to the $(t_2-1)$-th training), $\mathcal{L}(\theta_{t_1}; D)$ denotes the loss function before the $t_1$-th training (i.e., corresponding to the $(t_1-1)$-th training), and the right-hand side denotes the accumulation of the LCA values of the model parameters over the trainings from $t_1$ to $t_2$.
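Accumulating per-training LCA records over a range of trainings, as in this formula, could be sketched as follows (per_step_lca is an assumed list, indexed by training step, of {parameter name: LCA value} dictionaries):

    def cumulative_lca(per_step_lca, t1, t2):
        # sum of A_lca[t][i] over the trainings from t1 (inclusive) to t2 (exclusive)
        total = {n: 0.0 for n in per_step_lca[0]}
        for step in per_step_lca[t1:t2]:
            for n, v in step.items():
                total[n] += v
        return total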
In addition, for each model parameter, the LCA values corresponding to individual trainings can be fused. Specifically, when storing the LCA values, the LCA values of each parameter over multiple trainings can be accumulated and averaged into one LCA value, thereby saving storage space.
In one example, one fused value is stored for every 15 consecutive LCA values; that is, with the accumulation window length defined as 15, the LCA values of a model parameter at each of 15 trainings are fused into an average LCA value as follows:

$$A_{\mathrm{lca}}[t][i] = \frac{1}{15} \sum_{t'=15k+1}^{15(k+1)} A_{\mathrm{lca}}[t'][i]$$

where $k$ has an initial value of 0 and $A_{\mathrm{lca}}[t'][i]$ denotes the LCA value of the i-th model parameter at each of the 15 trainings.
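The storage-saving windowed fusion could be sketched as below; the window length of 15 follows the example above, and per_step_lca is the same assumed per-training record:

    def fuse_lca(per_step_lca, window=15):
        # keep one averaged LCA record per window of consecutive trainings
        fused = []
        for start in range(0, len(per_step_lca), window):
            chunk = per_step_lca[start:start + window]
            avg = {n: sum(step[n] for step in chunk) / len(chunk) for n in chunk[0]}
            fused.append(avg)
        return fused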
Since the NMT model is a complex network, it has a huge number of model parameters, and what is of interest is often not a specific model parameter but a functional module delineated from the NMT network structure, i.e., the role a functional module (a combination of parameters) plays in the overall convergence process. In addition to learning the contribution of each model parameter to the degree of loss function reduction from its LCA value, the contribution of each module to the degree of loss function reduction can also be determined, as described in the following embodiments.
In one possible implementation, the method further comprises at least one of:
for each functional module of the NMT model, determining the importance degree of the functional module in each training process based on the gradient and parameter value variation corresponding to each training of model parameters of the functional module;
for each functional module of the NMT model, the accumulated importance degree of the functional module in the multiple training process is determined based on the gradient and parameter value variation corresponding to the multiple training of each model parameter of the functional module.
In practical applications, the NMT model includes a plurality of functional modules. An importance degree evaluation index for each training of the model parameters of each functional module can be calculated from the gradients and variations of the module's model parameters; this index evaluates the importance degree of the module's model parameters in each training and thus measures the contribution of the module's model parameters to the degree of loss function reduction in each training.
In addition, an importance degree evaluation index over multiple trainings can be calculated for the model parameters of each functional module from the gradients and parameter value variations corresponding to those trainings; this index evaluates the accumulated importance degree of the module's model parameters over the multiple trainings and measures the accumulated contribution of each functional module to the degree of loss function reduction.
In a possible implementation manner, determining an accumulated importance degree of the functional module in a plurality of training processes based on gradients and parameter value variations corresponding to a plurality of times of training of model parameters of the functional module includes:
determining, based on the gradients and parameter value variations of the model parameters of the functional module over a second preset number of consecutive trainings, the accumulated importance degree of the functional module over every second preset number of trainings, and storing the accumulated importance degree.
In practical applications, the NMT model is trained multiple times until the loss function converges, and the gradients and parameter value variations of the model parameters of each functional module are obtained for each training. The LCA values of the model parameters of each functional module for each training can be calculated from these gradients and variations, and by accumulating the LCA values of a functional module's model parameters over a second preset number of consecutive trainings, the accumulated importance degree of that functional module over every second preset number of trainings can be determined.
When storing the LCA values, the LCA value corresponding to each training can be stored per functional module rather than per model parameter, thereby saving storage space.
One example of the structure of the NMT model is shown in fig. 1, but the NMT model can also have other structures. Taking the structure in fig. 1 as an example, the model parameters of the NMT model are divided into the model parameters of 15 functional modules: functional module 1: en_emb, the encoder word vector layer; functional modules 2 to 7: encoder layers 1-6; functional modules 8 to 13: decoder layers 1-6; functional module 14: de_emb, the decoder word vector layer; functional module 15: de_softmax, the decoder target word generator.
The LCA value corresponding to each functional module is calculated based on the following formula:

$$A_{\mathrm{lca}}[t][g] = \frac{\sum_{i \in g} A_{\mathrm{lca}}[t][i]}{|g|}$$

where $A_{\mathrm{lca}}[t][g]$ denotes the LCA value corresponding to functional module $g$, $\sum_{i \in g} A_{\mathrm{lca}}[t][i]$ denotes the sum of the LCA values of the model parameters of functional module $g$, and $|g|$ denotes the number of model parameters of functional module $g$.
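A sketch of this module-level aggregation follows; the mapping from parameter names to the 15 functional modules (module_of) is an assumed input, since the patent does not fix a naming scheme:

    def module_lca(param_lca, module_of):
        # A_lca[t][g] = sum of A_lca[t][i] over parameters i in module g, divided by |g|
        sums, counts = {}, {}
        for name, value in param_lca.items():
            g = module_of(name)
            sums[g] = sums.get(g, 0.0) + value
            counts[g] = counts.get(g, 0) + 1
        return {g: sums[g] / counts[g] for g in sums}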
The specific way of sampling the training data set is as follows:
in one possible implementation, sampling a training data set includes:
the training data set is subjected to monte carlo sampling.
In practical applications, as an optional approach, the training data set is sampled by Monte Carlo sampling to obtain the sampled data set; the gradients of the model parameters are calculated on this sampled data set, and the LCA values calculated from them are essentially consistent with the LCA values calculated on the entire training data set, while the consumption of computing resources is greatly reduced, making the application to complex networks such as NMT feasible.
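Monte Carlo sampling here amounts to drawing a uniform random subsample of the training sentences; a minimal sketch follows (the sample size is an assumed hyperparameter, not specified by the patent):

    import random

    def monte_carlo_sample(training_set, sample_size, seed=None):
        rng = random.Random(seed)
        return rng.sample(training_set, min(sample_size, len(training_set)))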
According to the method for evaluating the importance degree of model parameters provided by the embodiment of the application, the gradient of the loss function of the NMT model corresponding to each training with respect to each model parameter is determined based on the sampled data set obtained by sampling the training data set, which reduces the amount of computation; the parameter value variation of each model parameter before and after each training of the NMT model is obtained, and the importance degree of each model parameter is determined based on the gradient and the parameter value variation corresponding to each training of each model parameter, so that the contribution of each model parameter during the convergence of the loss function can be determined and the performance of the model can be improved in a targeted manner.
The technical scheme of the application can capture more detailed information about the training process than prior-art methods. In the German-English (fig. 3a) and English-German (fig. 3b) translation tasks, the distributions of the LCA contribution weights of the encoder word vector layer, the encoder, the decoder word vector layer, the decoder, and the decoder output layer of the NMT model at the convergence of the loss function (at the end of training) are as shown in the figures. The abscissa represents the modules of the NMT model (the encoder word vector layer, encoder, decoder word vector layer, decoder, and decoder output layer shown in the figures), and the ordinate represents the accumulated LCA values of the model parameters of each module calculated on the training set and on the test set, respectively; unfilled bars represent values calculated on the training set and filled bars represent values calculated on the test set. For the encoder word vector layer in fig. 3a, -0.32 is the accumulated LCA value of the module's model parameters calculated on the training set and -0.52 is the accumulated LCA value calculated on the test set. The numerical meanings for the other modules are the same and are not repeated here.
The model parameters of the encoder and the decoder in the NMT network structure contribute most of the loss function reduction, whether on the training data or on the test data (the larger the absolute value of the accumulated LCA value, the greater the contribution), whereas the word vector layers, both encoding and decoding, and the decoder output layer are not sufficiently learned, their share being so small as to be negligible. This visualization shows that in the training process of current neural machine translation systems, the word vectors and the parameters of the intermediate layers are not sufficiently learned and their potential is not exploited. This result points out a problem of current neural machine translation and provides a strong indication for further improving neural machine translation models.
In the German-English (fig. 4a) and English-German (fig. 4b) translation tasks, the distributions of the LCA contribution weights of each layer of the encoder and the decoder of the NMT model at the convergence of the loss function (at the end of training) are as shown in the figures. The abscissa represents each layer of the encoder and decoder, and the ordinate represents the accumulated LCA values of each layer calculated on the training set and on the test set, respectively. For encoder layer 1 in fig. 4a, the unfilled bars represent the accumulated LCA values calculated on the training set and the filled bars represent those calculated on the test set. The numerical meanings for the other layers are the same and are not repeated here.
Further examining the insides of the encoder and the decoder, which account for the largest share, it can be seen that the layer with the largest LCA value inside the decoder is the last layer, which matches intuition. However, both the first and the last layer of the encoder are found to have higher LCA values than the middle layers, as shown in figs. 4a and 4b. This marks a clear difference between the encoder and the decoder, and it means that the first layer also plays a large role in converting the encoder-side word vectors into the semantic representation vector H.
In addition, to verify that the LCA estimates obtained by the sampling mechanism of this application are highly correlated with the exact values that would have to be computed on the full training corpus, a training data set of 10,000 sentences was constructed; the exact LCA values were computed on this data set and compared, via LCA proportions, with the LCA estimates obtained on the same data set by the sampling mechanism. The Kendall rank correlation coefficient between the two was found to be 0.905, indicating a very strong correlation and demonstrating the reasonableness of introducing the sampling mechanism. Specifically, as shown in fig. 5, the abscissa represents the modules of the NMT model (the encoder word vector layer, encoder layers 1-6, the decoder word vector layer, decoder layers 1-6, and the decoder output layer shown in the figure), and the ordinate represents the percentage of each module's accumulated LCA value in the total of the per-layer LCA values shown in fig. 5. Unfilled bars represent results computed on the sampled data set and filled bars represent results computed on the full training corpus. For the encoder word vector layer in fig. 5, the 14 above the unfilled bar is the rank (Kendall rank) of that module's accumulated LCA value computed on the sampled data set among the per-layer values of fig. 5, and the 14 above the filled bar is the corresponding rank computed on the full training corpus.
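The comparison described above can be reproduced in outline with SciPy's Kendall rank correlation; the module-level LCA numbers below are placeholders, not the patent's data:

    from scipy.stats import kendalltau

    lca_sampled = [0.14, 0.09, 0.21, 0.05, 0.33]   # estimates from the sampled data set
    lca_exact = [0.15, 0.08, 0.20, 0.06, 0.35]     # exact values on the full training corpus
    tau, p_value = kendalltau(lca_sampled, lca_exact)
    print(tau)   # the patent reports a Kendall rank correlation of 0.905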
Based on the same principle as the method shown in fig. 2, the embodiment of the present disclosure also provides an importance level evaluation device 30 for a model parameter, as shown in fig. 6, where the importance level evaluation device 30 for a model parameter includes:
an obtaining module 31, configured to train an NMT model based on a training data set, and obtain parameter value variation of each model parameter before and after each training;
a sampling module 32, configured to sample the training data set to obtain a sampled data set;
a first determining module 33, configured to determine, based on the sample data set, a gradient of a loss function of the NMT model corresponding to each training with respect to each model parameter;
and the second determining module 34 is configured to determine the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter.
In one possible implementation, the sampling module 32 is configured to:
respectively sampling a training data set aiming at each training to obtain a sampling data set corresponding to each training;
a first determining module 33 configured to:
based on the sample data set corresponding to each training, the gradient of the loss function corresponding to each training with respect to each model parameter is determined.
In one possible implementation, the second determining module 34 is configured to at least one of:
for each model parameter, determining the importance degree of the model parameter in each training process based on the gradient of the model parameter corresponding to each training and the parameter value variation;
and for each model parameter, determining the accumulated importance degree of the model parameter in the multiple training processes based on the gradient of the model parameter corresponding to the multiple training and the parameter value variation.
In one possible implementation, the second determining module 34, when determining the accumulated importance of the model parameter during multiple training processes based on the gradient of the model parameter and the parameter value variation corresponding to the multiple training processes, is configured to:
determine, based on the gradients of the model parameter and the parameter value variations corresponding to a first preset number of consecutive trainings, the accumulated importance degree of the model parameter over every first preset number of trainings, and store the accumulated importance degree.
In a possible implementation, the importance level evaluation device 30 of the model parameters further comprises at least one of the following:
the third determining module is used for determining the importance degree of each functional module in each training process of the functional module based on the gradient and parameter value variable quantity corresponding to each training of model parameters of the functional module for each functional module of the NMT model;
and the fourth determining module is used for determining, for each functional module of the NMT model, the accumulated importance degree of the functional module in the multiple-training process based on the gradients and parameter value variations corresponding to the multiple trainings of each model parameter of the functional module.
In a possible implementation manner, the fourth determining module, when determining the accumulated importance degree of the functional module in the multiple training processes based on the gradient and parameter value variation corresponding to the multiple trainings of each model parameter of the functional module, is configured to:
and determining the accumulated importance degree of the functional module in the training process of every second training time and storing the accumulated importance degree based on the gradient of the model parameter and the parameter value variation of the model parameter corresponding to each model parameter of the functional module in the continuous second preset times of training.
In one possible implementation, the sampling module 32 is configured to perform monte carlo sampling on the training data set.
The importance degree evaluation device for model parameters of the embodiment of the present disclosure can execute the importance degree evaluation method for model parameters provided by the embodiment of the present disclosure, and its implementation principle is similar: the actions executed by each module of the device correspond to the steps of the method, and the detailed functional description of each module can be found in the description of the corresponding method above, which is not repeated here.
The importance degree evaluation device for model parameters provided by the embodiment of the application determines the gradient of the loss function of the NMT model corresponding to each training with respect to each model parameter based on the sampled data set obtained by sampling the training data set, which reduces the amount of computation; it obtains the parameter value variation of each model parameter before and after each training of the NMT model and determines the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training, so that the contribution of each model parameter during the convergence of the loss function can be determined and the performance of the model can be improved in a targeted manner.
The above embodiment introduces the importance degree evaluation apparatus of the model parameter from the perspective of the virtual module, and the following introduces an electronic device from the perspective of the physical module, as follows:
an embodiment of the present application provides an electronic device, as shown in fig. 7, an electronic device 5000 shown in fig. 7 includes: a processor 5001 and a memory 5003. The processor 5001 and the memory 5003 are coupled, such as via a bus 5002. Optionally, the electronic device 5000 may also include a transceiver 5004. It should be noted that the transceiver 5004 is not limited to one in practical application, and the structure of the electronic device 5000 is not limited to the embodiment of the present application.
The processor 5001 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 5001 may also be a combination of processors implementing computing functionality, e.g., a combination comprising one or more microprocessors, a combination of DSPs and microprocessors, or the like.
Bus 5002 can include a path that conveys information between the aforementioned components. The bus 5002 may be a PCI bus or EISA bus, etc. The bus 5002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
The memory 5003 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 5003 is used for storing application program codes for executing the present solution, and the execution is controlled by the processor 5001. The processor 5001 is configured to execute application program code stored in the memory 5003 to implement the teachings of any of the foregoing method embodiments.
An embodiment of the present application provides an electronic device, where the electronic device includes: a memory and a processor; and at least one program stored in the memory which, when executed by the processor, trains a neural machine translation (NMT) model based on a training data set to obtain the parameter value variation of each model parameter before and after each training; samples the training data set to obtain a sampled data set; determines, based on the sampled data set, the gradient of the loss function of the NMT model corresponding to each training with respect to each model parameter; and determines the importance degree of each model parameter based on the gradient and parameter value variation corresponding to each training of each model parameter.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.
It should be understood that although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (10)

1. A method for assessing the importance of a model parameter, the method comprising:
training a neural machine translation (NMT) model based on a training data set to obtain the parameter value variation of each model parameter before and after each training;
sampling the training data set to obtain a sampled data set;
determining a gradient of a loss function of the NMT model corresponding to each training with respect to each of the model parameters based on the sampled data set;
and determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter.
2. The method of claim 1, wherein sampling the training data set to obtain a sampled data set comprises:
sampling the training data set separately for each training to obtain a sampled data set corresponding to each training;
and wherein determining a gradient of a loss function of the NMT model corresponding to each training with respect to each of the model parameters based on the sampled data set comprises:
determining the gradient of the loss function corresponding to each training with respect to each of the model parameters based on the sampled data set corresponding to that training.
3. The method of claim 1, wherein determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter comprises at least one of:
for each model parameter, determining the importance degree of the model parameter in each training process based on the gradient and the parameter value variation of the model parameter corresponding to each training;
and for each model parameter, determining the accumulated importance degree of the model parameter over multiple trainings based on the gradient and the parameter value variation of the model parameter corresponding to those trainings.
4. The method of claim 3, wherein determining the accumulated importance degree of the model parameter over multiple trainings based on the gradient and the parameter value variation of the model parameter corresponding to those trainings comprises:
determining, based on the gradient and the parameter value variation of the model parameter corresponding to each consecutive first preset number of trainings, the accumulated importance degree of the model parameter over that first preset number of trainings, and storing the accumulated importance degree.
5. The method of claim 1, further comprising at least one of:
for each functional module of the NMT model, determining the importance degree of the functional module in each training process based on the gradient and the parameter value variation corresponding to each training of the model parameters of the functional module;
and for each functional module of the NMT model, determining the accumulated importance degree of the functional module over multiple trainings based on the gradient and the parameter value variation corresponding to multiple trainings of each model parameter of the functional module.
6. The method of claim 5, wherein determining the accumulated importance degree of the functional module over multiple trainings based on the gradient and the parameter value variation corresponding to multiple trainings of each model parameter of the functional module comprises:
determining, based on the gradient and the parameter value variation of each model parameter of the functional module corresponding to each consecutive second preset number of trainings, the accumulated importance degree of the functional module over that second preset number of trainings, and storing the accumulated importance degree.
7. The method of claim 1, wherein sampling the training data set comprises:
performing Monte Carlo sampling on the training data set.
8. An apparatus for evaluating the degree of importance of a model parameter, the apparatus comprising:
an acquisition module for training a neural machine translation (NMT) model based on a training data set and obtaining the parameter value variation of each model parameter before and after each training;
a sampling module for sampling the training data set to obtain a sampled data set;
a first determining module for determining, based on the sampled data set, a gradient of a loss function of the NMT model corresponding to each training with respect to each of the model parameters;
and a second determining module for determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs being configured to perform the method of assessing the importance of a model parameter according to any one of claims 1 to 7.
10. A computer-readable medium for storing computer instructions which, when run on a computer, cause the computer to perform the method of assessing the importance of a model parameter according to any one of claims 1 to 7.
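To illustrate the bookkeeping described in claims 3 to 6 (accumulating per-parameter and per-module scores and storing a snapshot every preset number of trainings), the following hedged sketch consumes the per-step importance dictionaries produced by the earlier example. The module_of helper, which maps a parameter name to its functional module, and the save_every setting standing in for the "preset number of times" are both hypothetical.

```python
from collections import defaultdict

def accumulate_importance(step_importances, module_of, save_every=100):
    """Accumulate importance over trainings and store it periodically.

    `step_importances` yields one {param_name: importance_tensor} dict
    per training; `save_every` plays the role of the first/second
    preset number of trainings in claims 4 and 6 (an assumed setting).
    """
    param_acc = defaultdict(float)   # claim 3: per-parameter accumulation
    module_acc = defaultdict(float)  # claim 5: per-module accumulation
    snapshots = []
    for step, importance in enumerate(step_importances, start=1):
        for name, score in importance.items():
            total = score.sum().item()        # scalar score for this tensor
            param_acc[name] += total
            module_acc[module_of(name)] += total
        if step % save_every == 0:            # claims 4 and 6: periodic save
            snapshots.append((dict(param_acc), dict(module_acc)))
            param_acc.clear()
            module_acc.clear()
    return snapshots
```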
CN202010394212.8A 2020-05-11 2020-05-11 Method and device for evaluating importance degree of model parameters and electronic equipment Withdrawn CN111563392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010394212.8A CN111563392A (en) 2020-05-11 2020-05-11 Method and device for evaluating importance degree of model parameters and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010394212.8A CN111563392A (en) 2020-05-11 2020-05-11 Method and device for evaluating importance degree of model parameters and electronic equipment

Publications (1)

Publication Number Publication Date
CN111563392A true CN111563392A (en) 2020-08-21

Family

ID=72072107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010394212.8A Withdrawn CN111563392A (en) 2020-05-11 2020-05-11 Method and device for evaluating importance degree of model parameters and electronic equipment

Country Status (1)

Country Link
CN (1) CN111563392A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460029A * 2018-04-12 2018-08-28 Soochow University Data reduction method for neural machine translation
CN109978141A * 2019-03-28 2019-07-05 Tencent Technology Shenzhen Co Ltd Neural network model training method and device, natural language processing method and apparatus
CN110472255A * 2019-08-20 2019-11-19 Tencent Technology Shenzhen Co Ltd Neural network machine translation method, model, electronic terminal and storage medium
CN110598224A * 2019-09-23 2019-12-20 Tencent Technology Shenzhen Co Ltd Translation model training method, text processing method and device, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CONGHUI ZHU et al.: "Understanding Learning Dynamics for Neural Machine Translation", https://arxiv.org/abs/2004.02199 *
JANICE LAN et al.: "LCA: Loss Change Allocation for Neural Network Training", arXiv:1909.01440v2 [cs.LG], 3 Mar 2020 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342700A * 2021-08-04 2021-09-03 Tencent Technology Shenzhen Co Ltd Model evaluation method, electronic device and computer-readable storage medium
CN113342700B * 2021-08-04 2021-11-19 Tencent Technology Shenzhen Co Ltd Model evaluation method, electronic device and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN110366734B (en) Optimizing neural network architecture
CN109960810B (en) Entity alignment method and device
CN110442878B (en) Translation method, training method and device of machine translation model and storage medium
CN106557563B (en) Query statement recommendation method and device based on artificial intelligence
WO2022199504A1 (en) Content identification method and apparatus, computer device and storage medium
CN113535984A (en) Attention mechanism-based knowledge graph relation prediction method and device
CN113254785B (en) Recommendation model training method, recommendation method and related equipment
CN111063398A (en) Molecular discovery method based on graph Bayesian optimization
WO2019006541A1 (en) System and method for automatic building of learning machines using learning machines
CN115238855A (en) Completion method of time sequence knowledge graph based on graph neural network and related equipment
Huai et al. Latency-constrained DNN architecture learning for edge systems using zerorized batch normalization
Schwier et al. Zero knowledge hidden markov model inference
CN111563392A (en) Method and device for evaluating importance degree of model parameters and electronic equipment
CN116383741A (en) Model training method and cross-domain analysis method based on multi-source domain data
CN111723186A (en) Knowledge graph generation method based on artificial intelligence for dialog system and electronic equipment
CN111161238A (en) Image quality evaluation method and device, electronic device, and storage medium
CN111753078A (en) Image paragraph description generation method, device, medium and electronic equipment
Sui et al. Self-supervised representation learning from random data projectors
CN115329146A (en) Link prediction method in time series network, electronic device and storage medium
CN114792097A (en) Method and device for determining prompt vector of pre-training model and electronic equipment
CN115482353A (en) Training method, reconstruction method, device, equipment and medium for reconstructing network
CN114372618A (en) Student score prediction method and system, computer equipment and storage medium
CN110209878B (en) Video processing method and device, computer readable medium and electronic equipment
CN115512693A (en) Audio recognition method, acoustic model training method, device and storage medium
CN111626472A (en) Scene trend judgment index computing system and method based on deep hybrid cloud model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20200821)