CN111563392A - Method and device for evaluating importance degree of model parameters and electronic equipment - Google Patents


Info

Publication number
CN111563392A
CN111563392A (application CN202010394212.8A)
Authority
CN
China
Prior art keywords
training
model
model parameter
parameter
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010394212.8A
Other languages
Chinese (zh)
Inventor
朱聪慧
刘乐茂
李冠林
史树明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010394212.8A
Publication of CN111563392A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the application provides a method and a device for evaluating the importance degree of model parameters, and electronic equipment, relating to the technical field of computers. The method comprises the following steps: training a neural machine translation (NMT) model based on a training data set to obtain the parameter value variation of each model parameter before and after each training; sampling the training data set to obtain a sampled data set; determining, based on the sampled data set, the gradient of the loss function of the NMT model corresponding to each training with respect to each model parameter; and determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter. In this technical scheme, the gradient of each model parameter is determined based on the sampled data set, which reduces the amount of computation; the importance degree of each model parameter is determined based on its gradient and its parameter value variation before and after training, so that the contribution of each model parameter to the convergence of the loss function can be determined.

Description

Method and device for evaluating importance degree of model parameters and electronic equipment
Technical Field
The application relates to the technical field of computers, and in particular to a method and a device for evaluating the importance degree of model parameters, and to electronic equipment.
Background
Since its introduction, the Neural Machine Translation (NMT) model has rapidly become the focus of machine translation research. Not only does the NMT model produce impressive translation quality, it also has structural advantages: compared with traditional statistical machine translation, it models the language model, the translation model, and the alignment model in a unified manner rather than as a pipeline, which reduces the side effects caused by error accumulation.
As a complex neural network, the NMT model can have as many as 108 million parameters across the whole network, all of which are trained iteratively until convergence. However, the NMT model is currently trained as a black box: the effect of each model parameter on the reduction of the loss function during training cannot be observed, which increases the difficulty of understanding the model training mechanism and is not conducive to further improving model performance in a targeted manner.
Disclosure of Invention
The application provides a method and a device for evaluating the importance degree of a model parameter, and electronic equipment, so as to solve at least one problem in the prior art.
The embodiment of the application provides the following specific technical scheme:
in a first aspect, an embodiment of the present application provides a method for evaluating importance of a model parameter, where the method includes:
training a neural machine translation NMT model based on a training data set to obtain parameter value variation of each model parameter before and after each training;
sampling the training data set to obtain a sampling data set;
determining the gradient of a loss function of the NMT model corresponding to each training relative to each model parameter based on the sampling data set;
and determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter, as sketched below.
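By way of illustration only, the four steps above can be sketched in code. The following minimal sketch assumes a PyTorch-style model and a plain SGD update; model, loss_fn, train_batches, sampled_batch, and lr are illustrative stand-ins rather than elements of the claimed method:

    import torch

    def evaluate_parameter_importance(model, loss_fn, train_batches, sampled_batch, lr=0.1):
        # Accumulated importance per named parameter.
        importance = {n: 0.0 for n, _ in model.named_parameters()}
        for batch in train_batches:  # one "training" = one parameter update step
            # Gradient of the loss on the *sampled* data set at the pre-update parameters.
            model.zero_grad()
            loss_fn(model, sampled_batch).backward()
            grads = {n: p.grad.detach().clone() for n, p in model.named_parameters()}
            before = {n: p.detach().clone() for n, p in model.named_parameters()}
            # One ordinary training step (plain SGD on a training batch, for illustration).
            model.zero_grad()
            loss_fn(model, batch).backward()
            with torch.no_grad():
                for p in model.parameters():
                    p -= lr * p.grad
            # Importance: gradient times parameter-value variation, accumulated.
            with torch.no_grad():
                for n, p in model.named_parameters():
                    importance[n] += (grads[n] * (p - before[n])).sum().item()
        return importance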
In a second aspect, there is provided an importance level evaluation apparatus for model parameters, the apparatus comprising:
the obtaining module is used for training the NMT model based on the training data set and obtaining the parameter value variation of each model parameter before and after each training;
the sampling module is used for sampling the training data set to obtain a sampling data set;
a first determining module, configured to determine, based on the sampled data set, a gradient of a loss function of the NMT model corresponding to each training with respect to each of the model parameters;
and the second determining module is used for determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method for evaluating the importance degree of model parameters according to the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium for storing a computer program which, when run on a computer, causes the computer to execute the importance degree evaluation method for model parameters shown in the first aspect of the present application or any possible implementation manner of the first aspect.
The beneficial effects brought by the technical solution provided by the application are as follows:
the application provides a method and a device for evaluating the importance degree of model parameters and electronic equipment, wherein the gradient of a loss function of an NMT (non-uniform matrix test) model corresponding to each training relative to each model parameter is determined based on a sampling data set obtained by sampling a training data set, so that the calculated amount of data is reduced; the parameter value variation of each model parameter before and after each training of the NMT model is obtained, the importance degree of each model parameter is determined based on the gradient and the parameter value variation corresponding to each training of each model parameter, the contribution of each model parameter in the process of the convergence of the loss function can be determined, and the performance of the model can be improved in a targeted manner.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a flow chart of Transformer-based neural machine translation provided by an embodiment of the present application;
fig. 2 is a flowchart of a method for evaluating importance of model parameters according to an embodiment of the present disclosure;
FIG. 3a is a distribution diagram of LCA contribution weights of modules of an NMT model in a German-English translation task according to an embodiment of the present application;
FIG. 3b is a distribution diagram of LCA contribution weights of modules of the NMT model in an English-German translation task according to an embodiment of the present application;
fig. 4a is a distribution diagram of LCA contribution weights of each layer in the encoder and decoder of the NMT model in the German-English translation task according to an embodiment of the present application;
fig. 4b is a distribution diagram of LCA contribution weights of each layer in the encoder and decoder of the NMT model in the English-German translation task according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a comparison of LCA calculation based on a sample data set and a training data corpus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an importance degree evaluation apparatus for model parameters according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
The execution subject of the technical scheme of the application is computer equipment, including but not limited to a server, a personal computer, a notebook computer, a tablet computer, a smart phone, and the like. Computer equipment comprises user equipment and network equipment. User equipment includes but is not limited to computers, smart phones, tablets, etc.; network equipment includes but is not limited to a single network server, a server group consisting of a plurality of network servers, or a cloud consisting of a large number of computers or network servers for cloud computing, where cloud computing is a kind of distributed computing in which a super virtual computer is composed of a group of loosely coupled computers. The computer equipment can run independently to realize the application, or can access a network and realize the application through interaction with other computer equipment in the network. The network in which the computer equipment is located includes but is not limited to the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, etc.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision technology, voice processing technology, natural language processing technology, machine learning/deep learning, and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
For better understanding and description of the solutions provided by the embodiments of the present application, a brief description of the related art related to the embodiments of the present application will be provided below.
Loss Change Allocation (LCA): a method for measuring the contribution of each parameter to the reduction of the loss function during training.
Transformer: a neural network framework.
Attention (attention mechanism): a probabilistic model for selecting a context.
Encoder: the encoder in the NMT model.
Decoder: the decoder in the NMT model.
The NMT model is a complex neural network. Taking the currently mainstream Transformer framework as an example, it mainly includes functional components such as the encoder word vector layer, the encoder layers, the decoder word vector layer, the decoder layers, and the decoder Softmax layer (the decoder output layer; the Softmax layer shown in the figure). Each encoder layer/decoder layer is composed of basic encoding units/decoding units. All components are organically combined to form one layer of the network and then stacked layer by layer to form the whole network, where the encoder layers convert the input sentence (source language) into a semantic vector and the decoder layers convert the semantic vector into the output sentence (target language).
Fig. 1 shows a flowchart of Transformer-based neural machine translation. Taking an example in which the encoder and the decoder each have 6 layers, the translation process is mainly divided into two parts: an encoder neural network (comprising the encoder word vector layer and encoder layers 1-6) and a decoder neural network (comprising the decoder word vector layer and decoder layers 1-6). The encoder neural network encodes an input sentence X into a representation vector H, the decoder neural network decodes the representation vector H of X, and the translation result Y of X is finally output.
Specifically, given an input source-language sentence X, it is first encoded layer by layer into a representation vector H using the encoder layers, each of which uses a self-attention mechanism. In the decoding phase, one new word is generated at a time, and the process loops until an end symbol is generated. In particular, each decoder layer uses not only the self-attention mechanism but also an encoder-attention mechanism (indicated by the dotted line in Fig. 1): the output of each decoder layer is decoded again, together with the representation vector H of X, through the encoder-attention mechanism, and the translation result Y is finally output.
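As an illustration of the flow just described, the sketch below shows only the loop structure; encoder, decoder, and the token ids are assumed placeholders rather than the components of Fig. 1:

    def translate(encoder, decoder, src_ids, bos_id, eos_id, max_len=128):
        h = encoder(src_ids)          # encode X layer by layer into the representation H
        out = [bos_id]
        for _ in range(max_len):
            # each decoder layer uses self-attention over `out` and
            # encoder attention over H (the dotted line in Fig. 1)
            logits = decoder(out, h)
            next_id = int(logits[-1].argmax())
            out.append(next_id)
            if next_id == eos_id:     # loop until the end symbol is generated
                break
        return out[1:]                # the translation result Y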
The NMT model has a complex network structure and is currently trained as a black box: it is not possible to know which modules play a greater role in the convergence process, or which modules' model parameters converge and stabilize first. This increases the difficulty of understanding the model training mechanism and is not conducive to improving model performance.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The embodiment of the application provides a method for evaluating the importance degree of a model parameter, as shown in fig. 2, the method includes:
step S101, training an NMT model based on a training data set, and acquiring parameter value variable quantity of each model parameter before and after each training;
when the NMT model is trained based on the training data set, the parameter values of the model parameters of the NMT model are updated by each training, and the model parameters are trained iteratively until the loss function of the model converges. Because the parameter values of the model parameters change before and after each training, the variation of each parameter value is obtained and used to calculate the degree of loss function reduction.
It should be noted that training the NMT model requires iterating over multiple training steps until the loss function converges. "Each training" in this application refers to each of the multiple training steps in the NMT model training process, not to the entire training procedure from the start of training to the convergence of the loss function of the model.
Step S102, sampling a training data set to obtain a sampling data set;
step S103, determining the gradient of the loss function of the NMT model corresponding to each training relative to each model parameter based on the sampling data set;
the NMT model has a complex structure, the scale of model parameters in the whole network model is huge, for example, 1.08 hundred million model parameters can be reached, and all model parameters are trained iteratively until the loss function of the model converges. If the gradients are calculated over the entire training data set, the computational cost of calculating gradients over the entire training data at each model parameter update is unacceptable for the enormous training data scale of NMT. Calculating the gradient based on the entire training data set would increase the computational consumption in proportion to the training data size, with a large computational resource consumption.
The method samples the training data set to obtain a sampled data set smaller than the training data set, and after each training, takes partial derivatives of the loss function with respect to the model parameters over the sampled data set to obtain the gradient of the loss function with respect to each model parameter. For example, to obtain the gradient of a model parameter a, each sample and its label in the sampled data set are substituted into the loss function, together with the current values of all model parameters other than a, so that a is the only unknown in the loss function; the partial derivative of the loss function with respect to a is then the gradient of a. Because the gradient of each model parameter is calculated on the sampled data set, the amount of computation is greatly reduced and the computational efficiency is greatly improved.
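The partial-derivative step for a single parameter can be illustrated with a toy example; the model (a linear map), the parameter a, and all numbers below are made up for illustration and are not from the patent:

    import torch

    a = torch.tensor(0.5, requires_grad=True)   # the model parameter of interest
    b = torch.tensor(1.2)                       # another parameter, held at its current value
    x = torch.tensor([1.0, 2.0, 3.0])           # sampled inputs
    y = torch.tensor([2.0, 3.5, 5.0])           # sample label data
    loss = ((a * x + b - y) ** 2).mean()        # a is the only unknown in the loss
    loss.backward()                             # partial derivative of the loss w.r.t. a
    print(a.grad)                               # the gradient of model parameter a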
And step S104, determining the importance degree of each model parameter based on the gradient and parameter value variation corresponding to each training of each model parameter.
After each training of the NMT model, the parameter value of each model parameter and the loss function are obtained, and an evaluation index for the importance degree of a model parameter is determined based on the loss functions obtained from two trainings, for example by subtracting the two loss values obtained from two consecutive trainings.
When calculating the evaluation index of the importance degree of the model parameters, the LCA value corresponding to each model parameter can be calculated based on the gradient and the variation of the model parameter corresponding to each training, with the LCA value serving as the evaluation index of the importance degree of the model parameter: the larger the LCA value, the larger the contribution of that model parameter to the reduction of the loss function. From the LCA values, the contribution of each model parameter during the convergence of the loss function can be determined, thereby determining the role each model parameter plays in training, capturing and visualizing more information about the training process, and deepening the understanding of the training process.
In one example, the LCA values of the NMT model are calculated according to a first-order Taylor expansion as follows:

$$\mathcal{L}(\theta_{t+1}; D) - \mathcal{L}(\theta_t; D) \approx \nabla_\theta \mathcal{L}(\theta_t; D)^\top (\theta_{t+1} - \theta_t) = \sum_{i=1}^{K} A_{\mathrm{lca}}[t][i]$$

where $\mathcal{L}(\theta_{t+1}; D)$ denotes the loss function obtained after the t-th training of the NMT model, $\theta$ denotes the model parameters, $D$ denotes the sampled data set, $\mathcal{L}(\theta_t; D)$ denotes the loss function obtained before the t-th training (i.e., corresponding to the (t-1)-th training), $\nabla_\theta \mathcal{L}(\theta_t; D)$ denotes the gradient of the loss function with respect to the model parameters $\theta$, $\theta_{t+1}$ denotes the parameter values of the model parameters after the t-th training, $\theta_t$ denotes the parameter values before the t-th training, $i$ denotes the i-th model parameter, and $K$ denotes the number of model parameters; $\theta_{t+1} - \theta_t$ denotes the variation of the model parameters before and after the t-th training; and $\sum_{i=1}^{K} A_{\mathrm{lca}}[t][i]$ denotes the sum of the LCA values of the model parameters before and after the t-th training.

$$A_{\mathrm{lca}}[t][i] = \frac{\partial \mathcal{L}(\theta_t; D)}{\partial \theta_i} \, (\theta_{t+1,i} - \theta_{t,i})$$

where $A_{\mathrm{lca}}[t][i]$ denotes the LCA value of the i-th parameter at the t-th training, $\partial \mathcal{L}(\theta_t; D) / \partial \theta_i$ denotes the gradient of the loss function with respect to the model parameter $\theta_i$, and $\theta_{t+1,i} - \theta_{t,i}$ denotes the variation of $\theta_i$ before and after the t-th training.
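A direct transcription of these two formulas, assuming the gradients and parameter snapshots are available as dictionaries of torch tensors keyed by parameter name, might look as follows (a sketch, not the patented implementation):

    def lca_values(grads, theta_after, theta_before):
        # A_lca[t][i] = (dL/dtheta_i) * (theta_{t+1,i} - theta_{t,i}), per parameter
        return {n: grads[n] * (theta_after[n] - theta_before[n]) for n in grads}

    def loss_change_estimate(lca):
        # first-order Taylor estimate of L(theta_{t+1}; D) - L(theta_t; D): sum over all i
        return sum(v.sum().item() for v in lca.values())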
When sampling the training data set, sampling may be performed only once: the gradient of each model parameter is calculated on the sampled data set obtained from that single sampling, and the importance degree of the model parameter is determined from the gradient and the parameter value variation. Alternatively, sampling may be performed multiple times: the gradient of each model parameter is calculated on the sampled data set obtained from each sampling, and the importance degree is determined in the same way.
In one possible implementation, sampling a training data set to obtain a sampled data set includes:
sampling the training data set separately for each training, to obtain a sampling data set corresponding to each training;
determining the gradient of the loss function of the NMT model corresponding to each training relative to each model parameter based on the sampling data set, wherein the gradient comprises the following steps:
based on the sample data set corresponding to each training, the gradient of the loss function corresponding to each training with respect to each model parameter is determined.
In practical applications, the objectivity of the data can be enhanced by sampling multiple times: the training data set can be sampled for each training, the model obtained from that training is run on the corresponding sampled data set to obtain the model output for each sample, the loss function for that training is computed from the model outputs and the sample labels, and the partial derivative of the loss with respect to each model parameter yields its gradient. In this embodiment, re-sampling the training set for each training makes the data selection more objective, and computing gradients on the per-training sampled data sets enhances the reliability of the results.
For determining the importance of each model parameter, the present application includes determining the importance of the model parameter in each training process and the cumulative importance in multiple training processes.
In a possible implementation manner, the determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter includes at least one of the following:
for each model parameter, determining the importance degree of the model parameter in each training process based on the gradient of the model parameter corresponding to each training and the parameter value variation;
and for each model parameter, determining the accumulated importance degree of the model parameter in the multiple training processes based on the gradient of the model parameter corresponding to the multiple training and the parameter value variation.
In practical applications, an importance degree evaluation index for each training of each model parameter can be calculated from the gradient computed on the sampled data set for that training and the variation of the model parameter before and after that training. This index evaluates the importance degree of each model parameter in each training and thus measures the contribution of each model parameter to the degree of loss function reduction in each training.
In addition, an importance degree evaluation index over multiple trainings can be calculated for each model parameter from the gradients and parameter value variations corresponding to those trainings. This index evaluates the accumulated importance degree of the model parameter over the multiple trainings and measures its accumulated contribution to the degree of loss function reduction.
The specific implementation manner for determining the accumulated importance degree of the model parameters in the multiple training processes is as follows:
in a possible implementation manner, determining an accumulated importance degree of a model parameter in a plurality of training processes based on a gradient of the model parameter and a parameter value variation corresponding to the plurality of training processes includes:
determining, based on the gradients of the model parameter and the parameter value variations corresponding to a first preset number of consecutive trainings, the accumulated importance degree of the model parameter over every first preset number of trainings, and storing the accumulated importance degree.
In practical applications, the NMT model is trained multiple times until the loss function converges, and the gradient and parameter value variation of each model parameter are obtained for each training. The LCA value of each model parameter for each training can be calculated from that gradient and variation, and by accumulating the LCA values of a given model parameter over a first preset number of consecutive trainings, the accumulated importance degree of that model parameter over every first preset number of trainings can be determined.
When storing the LCA values, because the number of model parameters is large, storing the LCA value of every model parameter for every training would occupy a large amount of storage space. Instead, the sum of the LCA values of each parameter over multiple trainings can be stored, i.e., one LCA value for every first preset number of trainings, thereby saving storage space.
In addition, the sums of the LCA values of each parameter over multiple trainings can be accumulated again to obtain the cumulative sum of the LCA values of each model parameter over the multiple trainings, so as to obtain an overall picture of the loss function reduction of the NMT model over those trainings.
In one example, the cumulative LCA sum of each model parameter of the NMT model from the $t_1$-th to the $t_2$-th training can be expressed as:

$$\mathcal{L}(\theta_{t_2}; D) - \mathcal{L}(\theta_{t_1}; D) \approx \sum_{t=t_1}^{t_2-1} \sum_{i=1}^{K} A_{\mathrm{lca}}[t][i]$$

where $\mathcal{L}(\theta_{t_2}; D)$ denotes the loss function of the NMT model before the $t_2$-th training (i.e., corresponding to the $(t_2-1)$-th training), $\mathcal{L}(\theta_{t_1}; D)$ denotes the loss function before the $t_1$-th training (i.e., corresponding to the $(t_1-1)$-th training), and the right-hand side denotes the accumulation of the LCA values of the model parameters over the trainings from $t_1$ to $t_2$.
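Accumulating per-training LCA records over a range of trainings, as in this formula, could be sketched as follows (per_step_lca is an assumed list, indexed by training step, of {parameter name: LCA value} dictionaries):

    def cumulative_lca(per_step_lca, t1, t2):
        # sum of A_lca[t][i] over the trainings from t1 (inclusive) to t2 (exclusive)
        total = {n: 0.0 for n in per_step_lca[0]}
        for step in per_step_lca[t1:t2]:
            for n, v in step.items():
                total[n] += v
        return total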
In addition, for each model parameter, the LCA values corresponding to individual trainings can be fused. Specifically, when storing the LCA values, the LCA values of each parameter over multiple trainings can be accumulated and averaged into one LCA value, thereby saving storage space.
In one example, one fused value is stored for every 15 consecutive LCA values; that is, with the accumulation window length defined as 15, the LCA values of a model parameter at each of 15 trainings are fused into an average LCA value as follows:

$$A_{\mathrm{lca}}[t][i] = \frac{1}{15} \sum_{t'=15k+1}^{15(k+1)} A_{\mathrm{lca}}[t'][i]$$

where $k$ has an initial value of 0 and $A_{\mathrm{lca}}[t'][i]$ denotes the LCA value of the i-th model parameter at each of the 15 trainings.
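The storage-saving windowed fusion could be sketched as below; the window length of 15 follows the example above, and per_step_lca is the same assumed per-training record:

    def fuse_lca(per_step_lca, window=15):
        # keep one averaged LCA record per window of consecutive trainings
        fused = []
        for start in range(0, len(per_step_lca), window):
            chunk = per_step_lca[start:start + window]
            avg = {n: sum(step[n] for step in chunk) / len(chunk) for n in chunk[0]}
            fused.append(avg)
        return fused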
Since the NMT model is a complex network, it has a huge number of model parameters, and what is of interest is often not a specific model parameter but a functional module delineated from the NMT network structure, i.e., the role a functional module (a combination of parameters) plays in the overall convergence process. In addition to learning the contribution of each model parameter to the degree of loss function reduction from its LCA value, the contribution of each module to the degree of loss function reduction can also be determined, as described in the following embodiments.
In one possible implementation, the method further comprises at least one of:
for each functional module of the NMT model, determining the importance degree of the functional module in each training process based on the gradient and parameter value variation corresponding to each training of model parameters of the functional module;
for each functional module of the NMT model, the accumulated importance degree of the functional module in the multiple training process is determined based on the gradient and parameter value variation corresponding to the multiple training of each model parameter of the functional module.
In practical applications, the NMT model includes a plurality of functional modules. An importance degree evaluation index for each training of the model parameters of each functional module can be calculated from the gradients and variations of the module's model parameters; this index evaluates the importance degree of the module's model parameters in each training and thus measures the contribution of the module's model parameters to the degree of loss function reduction in each training.
In addition, an importance degree evaluation index over multiple trainings can be calculated for the model parameters of each functional module from the gradients and parameter value variations corresponding to those trainings; this index evaluates the accumulated importance degree of the module's model parameters over the multiple trainings and measures the accumulated contribution of each functional module to the degree of loss function reduction.
In a possible implementation manner, determining an accumulated importance degree of the functional module in a plurality of training processes based on gradients and parameter value variations corresponding to a plurality of times of training of model parameters of the functional module includes:
determining, based on the gradients and parameter value variations of the model parameters of the functional module over a second preset number of consecutive trainings, the accumulated importance degree of the functional module over every second preset number of trainings, and storing the accumulated importance degree.
In practical applications, the NMT model is trained multiple times until the loss function converges, and the gradients and parameter value variations of the model parameters of each functional module are obtained for each training. The LCA values of the model parameters of each functional module for each training can be calculated from these gradients and variations, and by accumulating the LCA values of a functional module's model parameters over a second preset number of consecutive trainings, the accumulated importance degree of that functional module over every second preset number of trainings can be determined.
When storing the LCA values, the LCA value corresponding to each training can be stored per functional module rather than per model parameter, thereby saving storage space.
One example of the structure of the NMT model is shown in fig. 1, but the NMT model can also have other structures. Taking the structure in fig. 1 as an example, the model parameters of the NMT model are divided into the model parameters of 15 functional modules: functional module 1: en_emb, the encoder word vector layer; functional modules 2 to 7: encoder layers 1-6; functional modules 8 to 13: decoder layers 1-6; functional module 14: de_emb, the decoder word vector layer; functional module 15: de_softmax, the decoder target word generator.
The LCA value corresponding to each functional module is calculated based on the following formula:

$$A_{\mathrm{lca}}[t][g] = \frac{\sum_{i \in g} A_{\mathrm{lca}}[t][i]}{|g|}$$

where $A_{\mathrm{lca}}[t][g]$ denotes the LCA value corresponding to functional module $g$, $\sum_{i \in g} A_{\mathrm{lca}}[t][i]$ denotes the sum of the LCA values of the model parameters of functional module $g$, and $|g|$ denotes the number of model parameters of functional module $g$.
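A sketch of this module-level aggregation follows; the mapping from parameter names to the 15 functional modules (module_of) is an assumed input, since the patent does not fix a naming scheme:

    def module_lca(param_lca, module_of):
        # A_lca[t][g] = sum of A_lca[t][i] over parameters i in module g, divided by |g|
        sums, counts = {}, {}
        for name, value in param_lca.items():
            g = module_of(name)
            sums[g] = sums.get(g, 0.0) + value
            counts[g] = counts.get(g, 0) + 1
        return {g: sums[g] / counts[g] for g in sums}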
The specific way of sampling the training data set is as follows:
in one possible implementation, sampling a training data set includes:
the training data set is subjected to monte carlo sampling.
In practical applications, as an optional approach, the training data set is sampled by Monte Carlo sampling to obtain the sampled data set; the gradients of the model parameters are calculated on this sampled data set, and the LCA values calculated from them are essentially consistent with the LCA values calculated on the entire training data set, while the consumption of computing resources is greatly reduced, making the application to complex networks such as NMT feasible.
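Monte Carlo sampling here amounts to drawing a uniform random subsample of the training sentences; a minimal sketch follows (the sample size is an assumed hyperparameter, not specified by the patent):

    import random

    def monte_carlo_sample(training_set, sample_size, seed=None):
        rng = random.Random(seed)
        return rng.sample(training_set, min(sample_size, len(training_set)))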
According to the method for evaluating the importance degree of model parameters provided by the embodiment of the application, the gradient of the loss function of the NMT model corresponding to each training with respect to each model parameter is determined based on the sampled data set obtained by sampling the training data set, which reduces the amount of computation; the parameter value variation of each model parameter before and after each training of the NMT model is obtained, and the importance degree of each model parameter is determined based on the gradient and the parameter value variation corresponding to each training of each model parameter, so that the contribution of each model parameter during the convergence of the loss function can be determined and the performance of the model can be improved in a targeted manner.
The technical scheme of the application can capture more detailed information about the training process than prior-art methods. In the German-English (fig. 3a) and English-German (fig. 3b) translation tasks, the distributions of the LCA contribution weights of the encoder word vector layer, the encoder, the decoder word vector layer, the decoder, and the decoder output layer of the NMT model at the convergence of the loss function (at the end of training) are as shown in the figures. The abscissa represents the modules of the NMT model (the encoder word vector layer, encoder, decoder word vector layer, decoder, and decoder output layer shown in the figures), and the ordinate represents the accumulated LCA values of the model parameters of each module calculated on the training set and on the test set, respectively; unfilled bars represent values calculated on the training set and filled bars represent values calculated on the test set. For the encoder word vector layer in fig. 3a, -0.32 is the accumulated LCA value of the module's model parameters calculated on the training set and -0.52 is the accumulated LCA value calculated on the test set. The numerical meanings for the other modules are the same and are not repeated here.
The model parameters of the encoder and the decoder in the NMT network structure contribute most of the loss function reduction, whether on the training data or on the test data (the larger the absolute value of the accumulated LCA value, the greater the contribution), whereas the word vector layers, both encoding and decoding, and the decoder output layer are not sufficiently learned, their share being so small as to be negligible. This visualization shows that in the training process of current neural machine translation systems, the word vectors and the parameters of the intermediate layers are not sufficiently learned and their potential is not exploited. This result points out a problem of current neural machine translation and provides a strong indication for further improving neural machine translation models.
In the German-English (fig. 4a) and English-German (fig. 4b) translation tasks, the distributions of the LCA contribution weights of each layer of the encoder and the decoder of the NMT model at the convergence of the loss function (at the end of training) are as shown in the figures. The abscissa represents each layer of the encoder and decoder, and the ordinate represents the accumulated LCA values of each layer calculated on the training set and on the test set, respectively. For encoder layer 1 in fig. 4a, the unfilled bars represent the accumulated LCA values calculated on the training set and the filled bars represent those calculated on the test set. The numerical meanings for the other layers are the same and are not repeated here.
Further examining the insides of the encoder and the decoder, which account for the largest share, it can be seen that the layer with the largest LCA value inside the decoder is the last layer, which matches intuition. However, both the first and the last layer of the encoder are found to have higher LCA values than the middle layers, as shown in figs. 4a and 4b. This marks a clear difference between the encoder and the decoder, and it means that the first layer also plays a large role in converting the encoder-side word vectors into the semantic representation vector H.
In addition, to verify that the LCA estimates obtained by the sampling mechanism of this application are highly correlated with the exact values that would have to be computed on the full training corpus, a training data set of 10,000 sentences was constructed; the exact LCA values were computed on this data set and compared, via LCA proportions, with the LCA estimates obtained on the same data set by the sampling mechanism. The Kendall rank correlation coefficient between the two was found to be 0.905, indicating a very strong correlation and demonstrating the reasonableness of introducing the sampling mechanism. Specifically, as shown in fig. 5, the abscissa represents the modules of the NMT model (the encoder word vector layer, encoder layers 1-6, the decoder word vector layer, decoder layers 1-6, and the decoder output layer shown in the figure), and the ordinate represents the percentage of each module's accumulated LCA value in the total of the per-layer LCA values shown in fig. 5. Unfilled bars represent results computed on the sampled data set and filled bars represent results computed on the full training corpus. For the encoder word vector layer in fig. 5, the 14 above the unfilled bar is the rank (Kendall rank) of that module's accumulated LCA value computed on the sampled data set among the per-layer values of fig. 5, and the 14 above the filled bar is the corresponding rank computed on the full training corpus.
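The comparison described above can be reproduced in outline with SciPy's Kendall rank correlation; the module-level LCA numbers below are placeholders, not the patent's data:

    from scipy.stats import kendalltau

    lca_sampled = [0.14, 0.09, 0.21, 0.05, 0.33]   # estimates from the sampled data set
    lca_exact = [0.15, 0.08, 0.20, 0.06, 0.35]     # exact values on the full training corpus
    tau, p_value = kendalltau(lca_sampled, lca_exact)
    print(tau)   # the patent reports a Kendall rank correlation of 0.905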
Based on the same principle as the method shown in fig. 2, the embodiment of the present disclosure also provides an importance level evaluation device 30 for a model parameter, as shown in fig. 6, where the importance level evaluation device 30 for a model parameter includes:
an obtaining module 31, configured to train an NMT model based on a training data set, and obtain parameter value variation of each model parameter before and after each training;
a sampling module 32, configured to sample the training data set to obtain a sampled data set;
a first determining module 33, configured to determine, based on the sample data set, a gradient of a loss function of the NMT model corresponding to each training with respect to each model parameter;
and the second determining module 34 is configured to determine the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter.
In one possible implementation, the sampling module 32 is configured to:
respectively sampling a training data set aiming at each training to obtain a sampling data set corresponding to each training;
a first determining module 33 configured to:
based on the sample data set corresponding to each training, the gradient of the loss function corresponding to each training with respect to each model parameter is determined.
In one possible implementation, the second determining module 34 is configured to at least one of:
for each model parameter, determining the importance degree of the model parameter in each training process based on the gradient of the model parameter corresponding to each training and the parameter value variation;
and for each model parameter, determining the accumulated importance degree of the model parameter in the multiple training processes based on the gradient of the model parameter corresponding to the multiple training and the parameter value variation.
In one possible implementation, the second determining module 34, when determining the accumulated importance of the model parameter during multiple training processes based on the gradient of the model parameter and the parameter value variation corresponding to the multiple training processes, is configured to:
determine, based on the gradients of the model parameter and the parameter value variations corresponding to a first preset number of consecutive trainings, the accumulated importance degree of the model parameter over every first preset number of trainings, and store the accumulated importance degree.
In a possible implementation, the importance level evaluation device 30 of the model parameters further comprises at least one of the following:
the third determining module is used for determining the importance degree of each functional module in each training process of the functional module based on the gradient and parameter value variable quantity corresponding to each training of model parameters of the functional module for each functional module of the NMT model;
and the fourth determining module is used for determining, for each functional module of the NMT model, the accumulated importance degree of the functional module in the multiple-training process based on the gradients and parameter value variations corresponding to the multiple trainings of each model parameter of the functional module.
In a possible implementation manner, the fourth determining module, when determining the accumulated importance degree of the functional module in the multiple training processes based on the gradient and parameter value variation corresponding to the multiple trainings of each model parameter of the functional module, is configured to:
and determining the accumulated importance degree of the functional module in the training process of every second training time and storing the accumulated importance degree based on the gradient of the model parameter and the parameter value variation of the model parameter corresponding to each model parameter of the functional module in the continuous second preset times of training.
In one possible implementation, the sampling module 32 is configured to perform monte carlo sampling on the training data set.
The importance degree evaluation device for model parameters of the embodiment of the present disclosure can execute the importance degree evaluation method for model parameters provided by the embodiment of the present disclosure, and its implementation principle is similar: the actions executed by each module of the device correspond to the steps of the method, and the detailed functional description of each module can be found in the description of the corresponding method above, which is not repeated here.
The importance degree evaluation device for model parameters provided by the embodiment of the application determines the gradient of the loss function of the NMT model corresponding to each training with respect to each model parameter based on the sampled data set obtained by sampling the training data set, which reduces the amount of computation; it obtains the parameter value variation of each model parameter before and after each training of the NMT model and determines the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training, so that the contribution of each model parameter during the convergence of the loss function can be determined and the performance of the model can be improved in a targeted manner.
The above embodiment introduces the importance degree evaluation apparatus of the model parameter from the perspective of the virtual module, and the following introduces an electronic device from the perspective of the physical module, as follows:
an embodiment of the present application provides an electronic device, as shown in fig. 7, an electronic device 5000 shown in fig. 7 includes: a processor 5001 and a memory 5003. The processor 5001 and the memory 5003 are coupled, such as via a bus 5002. Optionally, the electronic device 5000 may also include a transceiver 5004. It should be noted that the transceiver 5004 is not limited to one in practical application, and the structure of the electronic device 5000 is not limited to the embodiment of the present application.
The processor 5001 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 5001 may also be a combination of processors implementing computing functionality, e.g., a combination comprising one or more microprocessors, a combination of DSPs and microprocessors, or the like.
Bus 5002 can include a path that conveys information between the aforementioned components. The bus 5002 may be a PCI bus or EISA bus, etc. The bus 5002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
The memory 5003 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 5003 is used for storing application program codes for executing the present solution, and the execution is controlled by the processor 5001. The processor 5001 is configured to execute application program code stored in the memory 5003 to implement the teachings of any of the foregoing method embodiments.
An embodiment of the present application provides an electronic device, where the electronic device includes: a memory and a processor; and at least one program stored in the memory which, when executed by the processor, trains a neural machine translation (NMT) model based on a training data set to obtain the parameter value variation of each model parameter before and after each training; samples the training data set to obtain a sampled data set; determines, based on the sampled data set, the gradient of the loss function of the NMT model corresponding to each training with respect to each model parameter; and determines the importance degree of each model parameter based on the gradient and parameter value variation corresponding to each training of each model parameter.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.
It should be understood that although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (10)

1. A method for assessing the importance of a model parameter, the method comprising:
training a neural machine translation (NMT) model based on a training data set to obtain the parameter value variation of each model parameter before and after each training;
sampling the training data set to obtain a sampled data set;
determining a gradient of a loss function of the NMT model corresponding to each training with respect to each of the model parameters based on the sampled data set;
and determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter.
2. The method of claim 1, wherein sampling the training data set to obtain a sampled data set comprises:
sampling the training data set separately for each training to obtain a sampled data set corresponding to each training;
and wherein determining a gradient of a loss function of the NMT model corresponding to each training with respect to each of the model parameters based on the sampled data set comprises:
determining the gradient of the loss function corresponding to each training with respect to each of the model parameters based on the sampled data set corresponding to that training.
3. The method of claim 1, wherein determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter comprises at least one of:
for each model parameter, determining the importance degree of the model parameter in each training process based on the gradient and the parameter value variation of the model parameter corresponding to each training;
and for each model parameter, determining the accumulated importance degree of the model parameter over multiple trainings based on the gradient and the parameter value variation of the model parameter corresponding to those trainings.
4. The method of claim 3, wherein determining the accumulated importance degree of the model parameter over multiple trainings based on the gradient and the parameter value variation of the model parameter corresponding to those trainings comprises:
determining, based on the gradient and the parameter value variation of the model parameter corresponding to each consecutive first preset number of trainings, the accumulated importance degree of the model parameter over that first preset number of trainings, and storing the accumulated importance degree.
5. The method of claim 1, further comprising at least one of:
for each functional module of the NMT model, determining the importance degree of the functional module in each training process based on the gradient and the parameter value variation corresponding to each training of the model parameters of the functional module;
and for each functional module of the NMT model, determining the accumulated importance degree of the functional module over multiple trainings based on the gradient and the parameter value variation corresponding to multiple trainings of each model parameter of the functional module.
6. The method of claim 5, wherein determining the accumulated importance degree of the functional module over multiple trainings based on the gradient and the parameter value variation corresponding to multiple trainings of each model parameter of the functional module comprises:
determining, based on the gradient and the parameter value variation of each model parameter of the functional module corresponding to each consecutive second preset number of trainings, the accumulated importance degree of the functional module over that second preset number of trainings, and storing the accumulated importance degree.
7. The method of claim 1, wherein sampling the training data set comprises:
performing Monte Carlo sampling on the training data set.
8. An apparatus for evaluating the degree of importance of a model parameter, the apparatus comprising:
an acquisition module for training a neural machine translation (NMT) model based on a training data set and obtaining the parameter value variation of each model parameter before and after each training;
a sampling module for sampling the training data set to obtain a sampled data set;
a first determining module for determining, based on the sampled data set, a gradient of a loss function of the NMT model corresponding to each training with respect to each of the model parameters;
and a second determining module for determining the importance degree of each model parameter based on the gradient and the parameter value variation corresponding to each training of each model parameter.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs being configured to perform the method of assessing the importance of a model parameter according to any one of claims 1 to 7.
10. A computer-readable medium for storing computer instructions which, when run on a computer, cause the computer to perform the method of assessing the importance of a model parameter according to any one of claims 1 to 7.
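To illustrate the bookkeeping described in claims 3 to 6 (accumulating per-parameter and per-module scores and storing a snapshot every preset number of trainings), the following hedged sketch consumes the per-step importance dictionaries produced by the earlier example. The module_of helper, which maps a parameter name to its functional module, and the save_every setting standing in for the "preset number of times" are both hypothetical.

```python
from collections import defaultdict

def accumulate_importance(step_importances, module_of, save_every=100):
    """Accumulate importance over trainings and store it periodically.

    `step_importances` yields one {param_name: importance_tensor} dict
    per training; `save_every` plays the role of the first/second
    preset number of trainings in claims 4 and 6 (an assumed setting).
    """
    param_acc = defaultdict(float)   # claim 3: per-parameter accumulation
    module_acc = defaultdict(float)  # claim 5: per-module accumulation
    snapshots = []
    for step, importance in enumerate(step_importances, start=1):
        for name, score in importance.items():
            total = score.sum().item()        # scalar score for this tensor
            param_acc[name] += total
            module_acc[module_of(name)] += total
        if step % save_every == 0:            # claims 4 and 6: periodic save
            snapshots.append((dict(param_acc), dict(module_acc)))
            param_acc.clear()
            module_acc.clear()
    return snapshots
```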
CN202010394212.8A 2020-05-11 2020-05-11 Method and device for evaluating importance degree of model parameters and electronic equipment Withdrawn CN111563392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010394212.8A CN111563392A (en) 2020-05-11 2020-05-11 Method and device for evaluating importance degree of model parameters and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010394212.8A CN111563392A (en) 2020-05-11 2020-05-11 Method and device for evaluating importance degree of model parameters and electronic equipment

Publications (1)

Publication Number Publication Date
CN111563392A true CN111563392A (en) 2020-08-21

Family

ID=72072107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010394212.8A Withdrawn CN111563392A (en) 2020-05-11 2020-05-11 Method and device for evaluating importance degree of model parameters and electronic equipment

Country Status (1)

Country Link
CN (1) CN111563392A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460029A * 2018-04-12 2018-08-28 Soochow University Data reduction method for neural machine translation
CN109978141A * 2019-03-28 2019-07-05 Tencent Technology Shenzhen Co Ltd Neural network model training method and device, natural language processing method and apparatus
CN110472255A * 2019-08-20 2019-11-19 Tencent Technology Shenzhen Co Ltd Neural network machine translation method, model, electronic terminal and storage medium
CN110598224A * 2019-09-23 2019-12-20 Tencent Technology Shenzhen Co Ltd Translation model training method, text processing method and device, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CONGHUI ZHU et al.: "Understanding Learning Dynamics for Neural Machine Translation", https://arxiv.org/abs/2004.02199 *
JANICE LAN et al.: "LCA: Loss Change Allocation for Neural Network Training", arXiv:1909.01440v2 [cs.LG], 3 Mar 2020 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342700A * 2021-08-04 2021-09-03 Tencent Technology Shenzhen Co Ltd Model evaluation method, electronic device and computer-readable storage medium
CN113342700B * 2021-08-04 2021-11-19 Tencent Technology Shenzhen Co Ltd Model evaluation method, electronic device and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN110366734B (en) Optimizing neural network architecture
CN109960810B (en) Entity alignment method and device
CN110442878B (en) Translation method, training method and device of machine translation model and storage medium
CN106557563B (en) Query statement recommendation method and device based on artificial intelligence
WO2022199504A1 (en) Content identification method and apparatus, computer device and storage medium
CN113535984A (en) Attention mechanism-based knowledge graph relation prediction method and device
CN113254785B (en) Recommendation model training method, recommendation method and related equipment
CN111063398A (en) Molecular discovery method based on graph Bayesian optimization
WO2019006541A1 (en) System and method for automatic building of learning machines using learning machines
CN115238855A (en) Completion method of time sequence knowledge graph based on graph neural network and related equipment
Huai et al. Latency-constrained DNN architecture learning for edge systems using zerorized batch normalization
Schwier et al. Zero knowledge hidden markov model inference
CN111563392A (en) Method and device for evaluating importance degree of model parameters and electronic equipment
CN116383741A (en) Model training method and cross-domain analysis method based on multi-source domain data
CN111723186A (en) Knowledge graph generation method based on artificial intelligence for dialog system and electronic equipment
CN111161238A (en) Image quality evaluation method and device, electronic device, and storage medium
CN111753078A (en) Image paragraph description generation method, device, medium and electronic equipment
Sui et al. Self-supervised representation learning from random data projectors
CN115329146A (en) Link prediction method in time series network, electronic device and storage medium
CN114792097A (en) Method and device for determining prompt vector of pre-training model and electronic equipment
CN115482353A (en) Training method, reconstruction method, device, equipment and medium for reconstructing network
CN114372618A (en) Student score prediction method and system, computer equipment and storage medium
CN110209878B (en) Video processing method and device, computer readable medium and electronic equipment
CN115512693A (en) Audio recognition method, acoustic model training method, device and storage medium
CN111626472A (en) Scene trend judgment index computing system and method based on deep hybrid cloud model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20200821)