CN115062611A - Training method, device, equipment and storage medium of grammar error correction model - Google Patents

Training method, device, equipment and storage medium of grammar error correction model

Info

Publication number
CN115062611A
CN115062611A
Authority
CN
China
Prior art keywords
model
training
original
error correction
sentence
Prior art date
Legal status
Granted
Application number
CN202210560454.9A
Other languages
Chinese (zh)
Other versions
CN115062611B (en)
Inventor
蒋盛益
林楠铠
林晓钿
武洪艳
Current Assignee
Guangdong University of Foreign Studies
Original Assignee
Guangdong University of Foreign Studies
Priority date
Filing date
Publication date
Application filed by Guangdong University of Foreign Studies filed Critical Guangdong University of Foreign Studies
Priority to CN202210560454.9A priority Critical patent/CN115062611B/en
Publication of CN115062611A publication Critical patent/CN115062611A/en
Application granted granted Critical
Publication of CN115062611B publication Critical patent/CN115062611B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, a device, equipment and a storage medium for training a grammar error correction model, wherein an original model is constructed based on a Transformer; in each round of training, a training set acquired in advance is input into the original model, and the parameters in the original model are adjusted in combination with a moving average strategy; the training set comprises a plurality of training samples, each training sample comprises an original sentence consisting of a plurality of words and a target sentence corresponding to the original sentence and consisting of a plurality of labels, and the target sentence is obtained by performing grammatical error correction on the original sentence in advance; when the number of training rounds reaches a preset threshold, the algorithm ends and the original model obtained in the last training round is taken as the optimal grammar error correction model. According to the embodiment of the invention, the original model is constructed by using the Transformer and model training is performed with the pre-obtained training set in combination with the moving average strategy to obtain the optimal grammar error correction model, so that overfitting is avoided and the generalization capability of the model is improved.

Description

Training method, device, equipment and storage medium of grammar error correction model
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method, a device, equipment and a storage medium for training a grammar error correction model.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language, and grammar error correction is one of its research directions. In the prior art, deep learning is used to correct grammar errors, but because the constructed deep models have millions or even billions of parameters, overfitting occurs easily; in particular, end-to-end models have low stability and large performance fluctuations during training. A model training method is therefore needed that avoids overfitting and improves the generalization capability of deep models.
Disclosure of Invention
The embodiment of the invention aims to provide a method, a device, equipment and a storage medium for training a grammar error correction model.
In order to achieve the above object, an embodiment of the present invention provides a method for training a syntax error correction model, including:
constructing an original model based on a Transformer;
in each round of training, inputting a training set acquired in advance into the original model, and adjusting parameters in the original model by combining a moving average strategy; the training set comprises a plurality of training samples, the training samples comprise an original sentence consisting of a plurality of words and phrases and a target sentence corresponding to the original sentence and consisting of a plurality of labels, and the target sentence is obtained by carrying out grammatical error correction on the original sentence in advance;
and when the number of training rounds reaches a preset threshold, ending the algorithm and taking the original model obtained in the last training round as the optimal grammar error correction model.
As an improvement of the above scheme, in each round of training, inputting a training set obtained in advance into the original model, and adjusting parameters in the original model by combining a moving average strategy specifically includes:
in each round of training, randomly discarding a hidden unit when each word in an original sentence of each training sample acquired in advance passes through the original model, and sequentially transmitting forward twice to output a first label probability distribution and a second label probability distribution;
based on a sliding average strategy, calculating an average parameter according to the parameter of the original model obtained after the first forward transmission and the parameter of the original model obtained after the second forward transmission to be used as the parameter of the parameter average model;
calculating a first cross entropy of the original model obtained after the first forward transmission and a second cross entropy of the original model obtained after the second forward transmission;
calculating a model loss value corresponding to the word according to the first label probability distribution, the second label probability distribution, a third label probability distribution obtained by forward transmission when the word passes through the parameter average model, the first cross entropy and the second cross entropy;
calculating a loss function according to model loss values corresponding to all words in the training set;
and adjusting parameters of the original model according to the loss function.
As an improvement of the above solution, the average parameter is calculated by the following formula:
v_t′ = (δ·v_{t-1} + (1 − δ)·v_t) / (1 − δ^t)
wherein v_t′ denotes the average parameter, δ denotes a variable that takes historical parameter values into account, v_{t-1} denotes the parameters of the original model obtained after the first forward pass, and v_t denotes the parameters of the original model obtained after the second forward pass.
As an improvement of the above scheme, the model loss value is calculated by:
calculating KL divergence between the first label probability distribution, the second label probability distribution and the third label probability distribution respectively;
calculating cross entropy loss;
and calculating a model loss value according to the KL divergence and the cross entropy loss based on a preset weight rule.
As an improvement of the above solution, the model loss value is calculated by the following formula:
L_i = α′·(L_i^KLM + L_i^KLT1 + L_i^KLT2) + β′·(L_i^CE1 + L_i^CE2)
where L_i denotes the model loss value of the i-th word, α′ denotes the KL divergence weight, β′ denotes the cross entropy loss weight, α′ + 2β′ = 1, L_i^KLM denotes the KL divergence of the first and second label probability distributions of the i-th word, L_i^KLT1 denotes the KL divergence of the second and third label probability distributions of the i-th word, L_i^KLT2 denotes the KL divergence of the first and third label probability distributions of the i-th word, L_i^CE1 denotes the cross entropy loss of the first label probability distribution of the i-th word and the answer label, and L_i^CE2 denotes the cross entropy loss of the second label probability distribution of the i-th word and the answer label.
As an improvement of the above solution, the loss function is calculated by the following formula:
L = (1/N)·Σ_{j=1}^{N} [ (1/M)·Σ_{i=1}^{M} L_i ]
where L represents a loss function, N represents the number of sentences of the training set, and M represents the number of words in a sentence.
As an improvement of the above, the original model comprises an encoder and a decoder;
the encoder consists of Z independent coding layers and (K-Z) redundant coding layers, and the independent coding layers and the redundant coding layers adopt a parameter sharing strategy for model compression;
the decoder consists of Z independent decoding layers and (K-Z) redundant decoding layers, and the independent decoding layers and the redundant decoding layers adopt a parameter sharing strategy for model compression.
In order to achieve the above object, an embodiment of the present invention provides a training apparatus for a syntax error correction model, including:
the model building module is used for building an original model based on a Transformer;
the model training module is used for inputting a training set acquired in advance into the original model in each round of training and adjusting parameters in the original model by combining a moving average strategy; the training set comprises a plurality of training samples, the training samples comprise an original sentence consisting of a plurality of words and phrases and a target sentence corresponding to the original sentence and consisting of a plurality of labels, and the target sentence is obtained by carrying out grammatical error correction on the original sentence in advance;
and the model determining module is used for ending the algorithm and taking the original model obtained in the last round of training as the optimal grammar error correction model when the number of training rounds reaches a preset threshold.
To achieve the above object, an embodiment of the present invention provides a training apparatus for a syntax error correction model, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements a training method for the syntax error correction model according to any one of the above embodiments when executing the computer program.
To achieve the above object, an embodiment of the present invention provides a computer-readable storage medium, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the processor implements the training method of the syntax error correction model according to any one of the above embodiments.
Compared with the prior art, the training method, device, equipment and computer-readable storage medium of the grammar error correction model disclosed by the embodiments of the invention construct an original model based on a Transformer; in each round of training, a training set acquired in advance is input into the original model and the parameters in the original model are adjusted in combination with a moving average strategy; the training set comprises a plurality of training samples, each training sample comprises an original sentence consisting of a plurality of words and a target sentence corresponding to the original sentence and consisting of a plurality of labels, and the target sentence is obtained by performing grammatical error correction on the original sentence in advance; when the number of training rounds reaches a preset threshold, the algorithm ends and the original model obtained in the last training round is taken as the optimal grammar error correction model. According to the embodiment of the invention, the original model is constructed by using the Transformer and model training is performed with the pre-obtained training set in combination with the moving average strategy to obtain the optimal grammar error correction model, so that overfitting is avoided and the generalization capability of the model is improved.
Drawings
FIG. 1 is a flowchart of a method for training a grammar error correction model according to an embodiment of the present invention;
FIG. 2 is a diagram of a model framework provided in accordance with an embodiment of the present invention;
fig. 3 is a schematic diagram of a CYCLE policy according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, which is a flowchart of a training method of a syntax error correction model according to an embodiment of the present invention, the training method of the syntax error correction model includes steps S1 to S3:
s1, constructing an original model based on a Transformer;
s2, inputting a training set acquired in advance into the original model in each training round, and adjusting parameters in the original model by combining a moving average strategy; the training set comprises a plurality of training samples, the training samples comprise an original sentence consisting of a plurality of words and phrases and a target sentence corresponding to the original sentence and consisting of a plurality of labels, and the target sentence is obtained by carrying out grammatical error correction on the original sentence in advance;
and S3, when the training round reaches a preset number threshold, ending the algorithm and taking the original model obtained by the last training round as the optimal grammar error correction model.
Referring to the model framework diagram shown in fig. 2, in the embodiment of the present invention, the sequence learning model in fig. 2 is an end-to-end model composed of self-attention mechanisms and capable of modeling sequence data. The model framework consists of an encoder and a decoder: the encoder is responsible for encoding and learning the input data, and the decoder is responsible for integrating the output of the encoder with the output of the previous time step and then decoding. In addition to the basic sequence learning model, the model framework comprises a grammar generalization module, which is used for adjusting the parameters of the basic model. Specifically, an original model is constructed based on a Transformer; a training set is acquired, the training set consisting of a plurality of training samples, each training sample comprising an original sentence consisting of a plurality of words and a target sentence corresponding to the original sentence and consisting of a plurality of labels, the target sentence being obtained by grammatical error correction of the original sentence; the original model is trained for a plurality of rounds using the training set, the total number of training rounds is set by the user according to actual requirements, and in each round of training the parameters of the original model are adjusted using a moving average strategy; when the number of training rounds reaches a preset threshold, the model training ends and the original model obtained in the last training round is taken as the optimal grammar error correction model for grammar error correction. On the basis of the Transformer model, a grammar generalization strategy based on the moving average strategy is provided, thereby avoiding overfitting and improving the generalization capability of the model.
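For illustration only, the overall flow of steps S1 to S3 can be sketched in PyTorch-style code as follows. This is not the patented implementation: the encoder-only labelling model, the names OriginalModel and train, the hyperparameters and the placeholder cross-entropy objective are assumptions, and the per-round procedure of the patent (two dropout forward passes, the moving average strategy and the KL terms) is sketched separately further below.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OriginalModel(nn.Module):
    """Simplified stand-in for the Transformer-based original model of step S1.
    The patent's model is an encoder-decoder; an encoder-only labeller is used here
    purely to keep the sketch short, since each word receives one label."""
    def __init__(self, vocab_size: int, num_labels: int, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embed(token_ids))
        return self.classifier(hidden)          # label logits for every word position

def train(train_loader, vocab_size, num_labels, num_rounds=30):
    model = OriginalModel(vocab_size, num_labels)                 # S1: build the original model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(num_rounds):                                   # S2: one pass over the training set per round
        for token_ids, label_ids in train_loader:                 # original sentence / target label sequence
            logits = model(token_ids)
            # Placeholder objective; the patented per-round procedure (two dropout
            # forward passes, moving-average model, KL terms) is sketched further below.
            loss = F.cross_entropy(logits.transpose(1, 2), label_ids)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model                                                  # S3: keep the last round's model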
In an embodiment, in each round of training, inputting a training set obtained in advance into the original model, and adjusting parameters in the original model in combination with a moving average strategy specifically includes:
in each round of training, randomly discarding a hidden unit when each word in an original sentence of each training sample acquired in advance passes through the original model, and sequentially transmitting forward twice to output a first label probability distribution and a second label probability distribution;
based on a sliding average strategy, calculating an average parameter according to the parameter of the original model obtained after the first forward transfer and the parameter of the original model obtained after the second forward transfer to be used as the parameter of the parameter average model;
calculating a first cross entropy of the original model obtained after the first forward transmission and a second cross entropy of the original model obtained after the second forward transmission;
calculating a model loss value corresponding to the word according to the first label probability distribution, the second label probability distribution, a third label probability distribution obtained by forward transmission when the word passes through the parameter average model, the first cross entropy and the second cross entropy;
calculating a loss function according to model loss values corresponding to all words in the training set;
and adjusting parameters of the original model according to the loss function.
Specifically, in combination with the model framework diagram shown in fig. 2, this embodiment proposes a grammar generalization strategy: the regularized dropout method is improved, and a multidimensional regularized dropout method with training consistency and model consistency is proposed and applied to the grammar error correction task.
A moving average is used to obtain a local mean of the model variables during training: assuming that the variable of the original model at time t is v_t, a moving average is performed with the variable v_{t-1} of the original model at the previous moment to obtain the average parameter at time t, and the model corresponding to this average parameter is the parameter average model at time t.
Formally, in the multidimensional regularized dropout, given an input sentence S_i consisting of a series of words w_1, w_2, w_3, …, w_m and a series of labels y_1, y_2, y_3, …, y_m, each word w_i randomly discards some hidden units when passing through the original model and is forwarded twice to obtain two different label probability distributions P_1(y_i|w_i) (the first label probability distribution) and P_2(y_i|w_i) (the second label probability distribution), and a label probability distribution P_3(y_i|w_i′) (the third label probability distribution) is obtained by forward propagation through the parameter average model. In order to force the two distributions P_1(y_i|w_i) and P_2(y_i|w_i) to be consistent with each other so as to achieve model consistency, the two-way Kullback-Leibler (KL) divergence between the two distributions is minimized while the cross entropy is taken as a loss function, where P_1(y_i|w_i) and P_2(y_i|w_i) are the probability distributions of the output values of the word w_i obtained by two forward propagations through the Transformer model (the original model). The model loss value of the word w_i is calculated according to P_1(y_i|w_i), P_2(y_i|w_i), P_3(y_i|w_i′) and the cross entropies, the loss function is then calculated from the model loss values of all the words, and finally the parameters of the original model obtained by this round of training are adjusted according to the loss function. This embodiment adopts a grammar generalization module to improve the generalization and stability of the model; in the grammar generalization module, a multidimensional regularized dropout method with training consistency and model consistency is proposed, and on this basis a grammar generalization strategy is proposed to improve the grammar generalization capability of the model, so that errors based on grammar rules can be corrected well without depending on a large amount of data, overfitting is avoided, and the generalization capability of the model is improved.
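A minimal sketch of the two stochastic forward passes and the two-way KL divergence described above is given below. The helper names forward_twice and bidirectional_kl and the use of a separate avg_model copy standing in for the parameter average model are assumptions; PyTorch's F.kl_div is used only as one possible way to compute the KL terms.

import torch
import torch.nn.functional as F

def bidirectional_kl(log_p: torch.Tensor, log_q: torch.Tensor) -> torch.Tensor:
    """Two-way KL divergence between two per-word label distributions given as
    log-probabilities. Returns one value per word position."""
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1)   # KL(P || Q)
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="none").sum(-1)   # KL(Q || P)
    return 0.5 * (kl_pq + kl_qp)

def forward_twice(model, avg_model, token_ids):
    """Pass the same words through the original model twice with dropout active, so that
    some hidden units are randomly discarded and the two label distributions differ,
    then pass them once through the parameter average model."""
    model.train()                                                  # keep dropout enabled
    log_p1 = F.log_softmax(model(token_ids), dim=-1)               # first label probability distribution P1
    log_p2 = F.log_softmax(model(token_ids), dim=-1)               # second label probability distribution P2
    avg_model.eval()
    with torch.no_grad():
        log_p3 = F.log_softmax(avg_model(token_ids), dim=-1)       # third distribution P3 (parameter average model)
    return log_p1, log_p2, log_p3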
In one embodiment, the average parameter is calculated by the following formula:
v_t′ = (δ·v_{t-1} + (1 − δ)·v_t) / (1 − δ^t)
wherein v_t′ denotes the average parameter, δ denotes a variable that takes historical parameter values into account, v_{t-1} denotes the parameters of the original model obtained after the first forward pass, and v_t denotes the parameters of the original model obtained after the second forward pass.
Specifically, a moving average is used to obtain a local mean of the model variables during training. Assuming that the variable of the original model at time t is v_t, after the moving average is applied the variable of the model is updated as:
v_t1 = δ·v_{t-1} + (1 − δ)·v_t
where δ is a variable that takes historical parameter values into account; the larger δ is, the more the moving-average value v_t1 reflects the historical values. In particular, the averaged parameters still fluctuate considerably in the first few rounds after the moving average starts, so v_t1 is corrected by dividing it by (1 − δ^t), which corrects the estimate of the mean:
v_t′ = v_t1 / (1 − δ^t)
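The moving-average update and the bias correction given by the two formulas above can be sketched as follows; the dictionary-based bookkeeping and the function name update_parameter_average are assumptions, while the arithmetic follows the formulas.

import torch

def update_parameter_average(running, current_params, delta: float, step: int):
    """Apply v_t1 = delta * v_{t-1} + (1 - delta) * v_t to every model parameter, then
    return the bias-corrected averages v_t' = v_t1 / (1 - delta ** step) used by the
    parameter average model. `running` holds the uncorrected values v_t1 and is
    updated in place; step starts at 1."""
    corrected = {}
    for name, v_t in current_params.items():
        v_prev = running.get(name, torch.zeros_like(v_t))
        running[name] = delta * v_prev + (1.0 - delta) * v_t.detach()
        corrected[name] = running[name] / (1.0 - delta ** step)     # correction for early-round fluctuation
    return corrected

# Illustrative usage: refresh the parameter average model after each update step t.
# running_avg = {}
# corrected = update_parameter_average(running_avg, dict(model.named_parameters()), delta=0.99, step=t)
# avg_model.load_state_dict(corrected, strict=False)   # buffers are left untouched, hence strict=False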
in one embodiment, the model loss value is calculated by:
respectively calculating KL divergence degrees between every two of the first label probability distribution, the second label probability distribution and the third label probability distribution;
calculating cross entropy loss;
and calculating a model loss value according to the KL divergence and the cross entropy loss based on a preset weight rule.
Specifically, in order to achieve training consistency, the model also forces P_1(y_i|w_i) and P_3(y_i|w_i′), as well as P_2(y_i|w_i) and P_3(y_i|w_i′), to be consistent with each other, where L_i^KLM is the KL divergence of P_1(y_i|w_i) and P_2(y_i|w_i), L_i^KLT1 is the KL divergence of P_2(y_i|w_i) and P_3(y_i|w_i′), and L_i^KLT2 is the KL divergence of P_1(y_i|w_i) and P_3(y_i|w_i′).
After the KL divergences are obtained, they are further weighted together with the cross entropy losses L_i^CE1 and L_i^CE2 to obtain the final loss L_i of the word w_i in the sentence S_i.
In one embodiment, the model loss value is calculated by the following formula:
L_i = α′·(L_i^KLM + L_i^KLT1 + L_i^KLT2) + β′·(L_i^CE1 + L_i^CE2)
where L_i denotes the model loss value of the i-th word, α′ denotes the KL divergence weight, β′ denotes the cross entropy loss weight, α′ + 2β′ = 1, L_i^KLM denotes the KL divergence of the first and second label probability distributions of the i-th word, L_i^KLT1 denotes the KL divergence of the second and third label probability distributions of the i-th word, L_i^KLT2 denotes the KL divergence of the first and third label probability distributions of the i-th word, and L_i^CE1 and L_i^CE2 denote the cross entropy losses of the i-th word, namely the cross entropy losses between the answer label and the two probability distributions generated by the two forward propagations: L_i^CE1 is the cross entropy loss of the first label probability distribution and the answer label, and L_i^CE2 is the cross entropy loss of the second label probability distribution and the answer label.
Specifically, the final loss L_i (the model loss value) of the word w_i in the sentence S_i is obtained as:
L_i = α·(L_i^KLM + L_i^KLT1 + L_i^KLT2) + β·(L_i^CE1 + L_i^CE2)
where α and β are the loss weights and β is set to 0.5 by default. After the loss weights are normalized, the model loss value is:
L_i = α′·(L_i^KLM + L_i^KLT1 + L_i^KLT2) + β′·(L_i^CE1 + L_i^CE2)
with α′ + 2·β′ = 1.
in one embodiment, the loss function is calculated by the following equation:
L = (1/N)·Σ_{j=1}^{N} [ (1/M)·Σ_{i=1}^{M} L_i ]
where L represents a loss function, N represents the number of sentences of the training set, and M represents the number of words in a sentence.
Specifically, the loss function for the entire training set is calculated as follows:
L = (1/N)·Σ_{j=1}^{N} [ (1/M)·Σ_{i=1}^{M} L_i ]
wherein L represents a loss function, N represents the number of sentences of the training set, and M represents the number of words in the sentences. It can be understood that the model loss values of all words in each sentence are averaged to obtain the loss value of each sentence, and then the loss values of all sentences are averaged to obtain the loss function of the whole training set.
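Putting the pieces together, the per-word loss and the training-set loss can be sketched as follows. The grouping of the terms (α′ on the summed KL terms and β′ on each cross entropy, so that α′ + 2β′ = 1) is an assumed reading of the description rather than a reproduction of the formula images, and the sketch reuses the bidirectional_kl helper defined in the earlier example.

import torch
import torch.nn.functional as F

def word_losses(log_p1, log_p2, log_p3, labels, alpha: float = 1.0, beta: float = 0.5):
    """Per-word loss L_i: KL terms between the three label distributions plus the two
    cross entropies against the answer labels, with weights normalized so that
    alpha' + 2 * beta' = 1. The grouping of terms is an assumption; bidirectional_kl
    is the helper defined in the earlier sketch."""
    alpha_n = alpha / (alpha + 2.0 * beta)          # alpha'
    beta_n = beta / (alpha + 2.0 * beta)            # beta'

    kl_m = bidirectional_kl(log_p1, log_p2)         # first vs. second distribution
    kl_t1 = bidirectional_kl(log_p2, log_p3)        # second vs. third distribution
    kl_t2 = bidirectional_kl(log_p1, log_p3)        # first vs. third distribution

    ce1 = F.nll_loss(log_p1.transpose(1, 2), labels, reduction="none")   # first pass vs. answer label
    ce2 = F.nll_loss(log_p2.transpose(1, 2), labels, reduction="none")   # second pass vs. answer label

    return alpha_n * (kl_m + kl_t1 + kl_t2) + beta_n * (ce1 + ce2)       # shape: (sentences, words)

def training_set_loss(per_word_loss: torch.Tensor, word_mask: torch.Tensor) -> torch.Tensor:
    """Average the word losses within each sentence, then average over the sentences."""
    per_sentence = (per_word_loss * word_mask).sum(-1) / word_mask.sum(-1).clamp(min=1)
    return per_sentence.mean()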
In one embodiment, the original model includes an encoder and a decoder;
the encoder consists of Z independent coding layers and (K-Z) redundant coding layers, and the independent coding layers and the redundant coding layers adopt a parameter sharing strategy for model compression;
the decoder consists of Z independent decoding layers and (K-Z) redundant decoding layers, and the independent decoding layers and the redundant decoding layers adopt a parameter sharing strategy for model compression.
Specifically, in conjunction with fig. 2, the model framework further includes a parameter sharing module that uses the parameters of Z layers when constructing a K-layer Transformer-based encoder-decoder. Two strategies are provided for parameter allocation in the syntax error correction task: CYCLE and CYCLE-REV. Illustratively, when Z is set to 3 and K to 6, the two parameter allocation strategies are applied to the encoder module, and the parameters are allocated to the decoder module in the same way as for the encoder. The parameter sharing strategy enables the model to share the same parameters across different coding layers or decoding layers, so that the number of parameters of the model is reduced by one third compared with a Transformer, and the difficulty of the training process is reduced while the performance of the model is ensured.
In the CYCLE strategy, Z layers whose parameters are independent of each other (the independent coding layers) are stacked first. Then Z layers (the redundant coding layers) are stacked repeatedly in the same order as the preceding Z layers until the total number of layers reaches K. When Z = 3 and K = 6, the parameter sharing strategy stacks the initial 3 layers twice. Since higher layers require more degrees of freedom for expression than lower layers (in other words, lower layers may have redundant parameters compared to higher layers), this embodiment also proposes the CYCLE-REV strategy to reuse the parameters of lower layers in higher layers; see the CYCLE strategy diagram shown in fig. 3. Similar to the CYCLE strategy, the Z layers are first stacked repeatedly in the same manner; for the remaining layers, the Z layers are then stacked in reverse order, see the pseudo code for the parameter sharing strategy shown in the table below.
[Table: pseudo code of the CYCLE and CYCLE-REV parameter sharing strategies]
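Because the pseudo code of the table is only reproduced as images in the published text, the following sketch shows one plausible reading of how CYCLE and CYCLE-REV map each of the K layers to one of the Z independent parameter sets; the assignments, in particular the CYCLE-REV example, are assumptions rather than the patented pseudo code.

from typing import List

def cycle_assignment(Z: int, K: int) -> List[int]:
    """CYCLE: the first Z layers have independent parameters; the remaining layers
    reuse them in the same order. Example: Z=3, K=6 -> [0, 1, 2, 0, 1, 2]."""
    return [i % Z for i in range(K)]

def cycle_rev_assignment(Z: int, K: int) -> List[int]:
    """CYCLE-REV (assumed reading): same as CYCLE except that the last Z layers reuse
    the parameter sets in reverse order, so lower-layer parameters reappear in the
    highest layers. Example: Z=3, K=6 -> [0, 1, 2, 2, 1, 0]. Assumes Z <= K."""
    head = [i % Z for i in range(K - Z)]
    tail = list(range(Z - 1, -1, -1))
    return head + tail

# The shared stacks can then be materialized from one ModuleList of Z layers, e.g.
# shared = torch.nn.ModuleList(torch.nn.TransformerEncoderLayer(512, 8) for _ in range(Z))
# encoder_layers = [shared[j] for j in cycle_assignment(Z, K)]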
Compared with the prior art, the training method of the grammar error correction model disclosed by the embodiment of the invention can be used for constructing the original model by using the Transformer, combining the sliding average strategy and carrying out model training by using the pre-obtained training set so as to obtain the optimal grammar error correction model, thereby avoiding overfitting and improving the generalization capability of the model; in addition, the model is compressed by adding a parameter sharing strategy, so that the training difficulty of the model is reduced.
The embodiment of the present invention further provides a training device for a grammar error correction model, including:
the model building module is used for building an original model based on a Transformer;
the model training module is used for inputting a training set acquired in advance into the original model in each round of training and adjusting parameters in the original model by combining a moving average strategy; the training set comprises a plurality of training samples, the training samples comprise an original sentence consisting of a plurality of words and phrases and a target sentence corresponding to the original sentence and consisting of a plurality of labels, and the target sentence is obtained by carrying out grammatical error correction on the original sentence in advance;
and the model determining module is used for ending the algorithm and taking the original model obtained in the last round of training as the optimal grammar error correction model when the number of training rounds reaches a preset threshold.
It should be noted that, for a specific working process of the training apparatus for the syntax error correction model, reference may be made to the working process of the training method for the syntax error correction model in the foregoing embodiment, and details are not repeated here.
The training device for the grammar error correction model provided by the embodiment of the invention can be used for constructing the original model by using the Transformer, combining the moving average strategy and carrying out model training by using the pre-obtained training set so as to obtain the optimal grammar error correction model, thereby avoiding overfitting and improving the generalization capability of the model.
Embodiments of the present invention further provide a training apparatus for a syntax error correction model, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor executes the computer program to implement the steps in the above embodiments of the training method for the syntax error correction model, such as steps S1 to S3 shown in fig. 1; alternatively, the processor, when executing the computer program, implements the functions of the modules in the above-described device embodiments, such as a model building module.
Illustratively, the computer program may be partitioned into one or more modules that are stored in the memory and executed by the processor to implement the invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the training device of the grammar error correction model. For example, the computer program may be divided into a plurality of modules, each module having the following specific functions:
the model building module is used for building an original model based on a Transformer;
the model training module is used for inputting a training set acquired in advance into the original model in each round of training and adjusting parameters in the original model by combining a moving average strategy; the training set comprises a plurality of training samples, the training samples comprise an original sentence consisting of a plurality of words and phrases and a target sentence corresponding to the original sentence and consisting of a plurality of labels, and the target sentence is obtained by carrying out grammatical error correction on the original sentence in advance;
and the model determining module is used for ending the algorithm and taking the original model obtained in the last round of training as the optimal grammar error correction model when the number of training rounds reaches a preset threshold.
The specific working process of each module may refer to the working process of the training apparatus for a syntax error correction model described in the above embodiment, and is not described herein again.
The training device of the grammar error correction model can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing devices. The training device of the grammar error correction model may include, but is not limited to, a processor, a memory. It will be understood by those skilled in the art that the schematic diagram is merely an example of a training device for the syntax error correction model, and does not constitute a limitation of the training device for the syntax error correction model, and may include more or less components than those shown, or combine some components, or different components, for example, the training device for the syntax error correction model may further include an input-output device, a network access device, a bus, etc.
The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, the processor being the control center of the training apparatus of the syntax error correction model, the various parts of the training apparatus of the entire syntax error correction model being connected by various interfaces and lines.
The memory may be used to store the computer program and/or module, and the processor may implement various functions of the training apparatus of the syntax error correction model by executing or executing the computer program and/or module stored in the memory and calling data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the mobile phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
Wherein, the module integrated by the training device of the grammar error correction model can be stored in a computer readable storage medium if the module is realized in the form of a software functional unit and sold or used as an independent product. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments described above may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method for training a grammar error correction model, comprising:
constructing an original model based on a Transformer;
in each round of training, inputting a training set acquired in advance into the original model, and adjusting parameters in the original model by combining a moving average strategy; the training set comprises a plurality of training samples, the training samples comprise an original sentence consisting of a plurality of words and phrases and a target sentence corresponding to the original sentence and consisting of a plurality of labels, and the target sentence is obtained by carrying out grammatical error correction on the original sentence in advance;
and when the number of training rounds reaches a preset threshold, ending the algorithm and taking the original model obtained in the last training round as the optimal grammar error correction model.
2. The method for training the grammar error correction model according to claim 1, wherein in each training round, a training set obtained in advance is input into the original model, and parameters in the original model are adjusted by combining a moving average strategy, which specifically includes:
in each round of training, randomly discarding a hidden unit when each word in an original sentence of each training sample acquired in advance passes through the original model, and sequentially transmitting forward twice to output a first label probability distribution and a second label probability distribution;
based on a sliding average strategy, calculating an average parameter according to the parameter of the original model obtained after the first forward transfer and the parameter of the original model obtained after the second forward transfer to be used as the parameter of the parameter average model;
calculating a first cross entropy of the original model obtained after the first forward transmission and a second cross entropy of the original model obtained after the second forward transmission;
calculating a model loss value corresponding to the word according to the first label probability distribution, the second label probability distribution, a third label probability distribution obtained by forward transmission when the word passes through the parameter average model, the first cross entropy and the second cross entropy;
calculating a loss function according to model loss values corresponding to all words in the training set;
and adjusting parameters of the original model according to the loss function.
3. The method for training the syntax error correction model according to claim 2, wherein the average parameter is calculated by the following formula:
v_t′ = (δ·v_{t-1} + (1 − δ)·v_t) / (1 − δ^t)
wherein v_t′ denotes the average parameter, δ denotes a variable that takes historical parameter values into account, v_{t-1} denotes the parameters of the original model obtained after the first forward pass, and v_t denotes the parameters of the original model obtained after the second forward pass.
4. The method for training the syntax error correction model according to claim 2, wherein the model loss value is calculated by:
calculating KL divergence between the first label probability distribution, the second label probability distribution and the third label probability distribution respectively;
calculating cross entropy loss;
and calculating a model loss value according to the KL divergence and the cross entropy loss based on a preset weight rule.
5. The method for training the syntax error correction model according to claim 4, wherein the model loss value is calculated by the following formula:
L_i = α′·(L_i^KLM + L_i^KLT1 + L_i^KLT2) + β′·(L_i^CE1 + L_i^CE2)
where L_i denotes the model loss value of the i-th word, α′ denotes the KL divergence weight, β′ denotes the cross entropy loss weight, α′ + 2β′ = 1, L_i^KLM denotes the KL divergence of the first and second label probability distributions of the i-th word, L_i^KLT1 denotes the KL divergence of the second and third label probability distributions of the i-th word, L_i^KLT2 denotes the KL divergence of the first and third label probability distributions of the i-th word, L_i^CE1 denotes the cross entropy loss of the first label probability distribution of the i-th word and the answer label, and L_i^CE2 denotes the cross entropy loss of the second label probability distribution of the i-th word and the answer label.
6. The method for training the syntactic error correction model of claim 5, wherein the loss function is calculated by the following formula:
L = (1/N)·Σ_{j=1}^{N} [ (1/M)·Σ_{i=1}^{M} L_i ]
where L represents a loss function, N represents the number of sentences of the training set, and M represents the number of words in a sentence.
7. The method for training the syntax error correction model of claim 1, wherein the original model comprises an encoder and a decoder;
the encoder consists of a Z layer independent encoding layer and a (K-Z) layer redundant encoding layer, and the independent encoding layer and the redundant encoding layer adopt a parameter sharing strategy to carry out model compression;
the decoder consists of a Z-layer independent decoding layer and a (K-Z) -layer redundant decoding layer, wherein the independent decoding layer and the redundant decoding layer adopt a parameter sharing strategy to perform model compression.
8. An apparatus for training a grammar error correction model, comprising:
the model building module is used for building an original model based on a Transformer;
the model training module is used for inputting a training set acquired in advance into the original model in each round of training and adjusting parameters in the original model by combining a moving average strategy; the training set comprises a plurality of training samples, the training samples comprise an original sentence consisting of a plurality of words and phrases and a target sentence corresponding to the original sentence and consisting of a plurality of labels, and the target sentence is obtained by carrying out grammatical error correction on the original sentence in advance;
and the model determining module is used for ending the algorithm and taking the original model obtained in the last round of training as the optimal grammar error correction model when the number of training rounds reaches a preset threshold.
9. Training device of a syntactic error correction model, comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the training method of a syntactic error correction model according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the training method of the syntax error correction model according to any one of claims 1 to 7 when executing the computer program.
CN202210560454.9A 2022-05-23 2022-05-23 Training method, device, equipment and storage medium of grammar error correction model Active CN115062611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210560454.9A CN115062611B (en) 2022-05-23 2022-05-23 Training method, device, equipment and storage medium of grammar error correction model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210560454.9A CN115062611B (en) 2022-05-23 2022-05-23 Training method, device, equipment and storage medium of grammar error correction model

Publications (2)

Publication Number Publication Date
CN115062611A true CN115062611A (en) 2022-09-16
CN115062611B CN115062611B (en) 2023-05-05

Family

ID=83198468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210560454.9A Active CN115062611B (en) 2022-05-23 2022-05-23 Training method, device, equipment and storage medium of grammar error correction model

Country Status (1)

Country Link
CN (1) CN115062611B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767731A (en) * 2020-07-09 2020-10-13 北京猿力未来科技有限公司 Training method and device of grammar error correction model and grammar error correction method and device
CN111767717A (en) * 2020-05-13 2020-10-13 广东外语外贸大学 Indonesia grammar error correction method, device, equipment and storage medium
CN112597753A (en) * 2020-12-22 2021-04-02 北京百度网讯科技有限公司 Text error correction processing method and device, electronic equipment and storage medium
CN113204645A (en) * 2021-04-01 2021-08-03 武汉大学 Knowledge-guided aspect-level emotion analysis model training method
CN113345418A (en) * 2021-06-09 2021-09-03 中国科学技术大学 Multilingual model training method based on cross-language self-training
CN114154518A (en) * 2021-12-02 2022-03-08 泰康保险集团股份有限公司 Data enhancement model training method and device, electronic equipment and storage medium
WO2022077891A1 (en) * 2020-10-13 2022-04-21 苏州大学 Multi-labeled data-based dependency and syntactic parsing model training method and apparatus
US20220147715A1 (en) * 2019-05-16 2022-05-12 Huawei Technologies Co., Ltd. Text processing method, model training method, and apparatus

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220147715A1 (en) * 2019-05-16 2022-05-12 Huawei Technologies Co., Ltd. Text processing method, model training method, and apparatus
CN111767717A (en) * 2020-05-13 2020-10-13 广东外语外贸大学 Indonesia grammar error correction method, device, equipment and storage medium
CN111767731A (en) * 2020-07-09 2020-10-13 北京猿力未来科技有限公司 Training method and device of grammar error correction model and grammar error correction method and device
WO2022077891A1 (en) * 2020-10-13 2022-04-21 苏州大学 Multi-labeled data-based dependency and syntactic parsing model training method and apparatus
CN112597753A (en) * 2020-12-22 2021-04-02 北京百度网讯科技有限公司 Text error correction processing method and device, electronic equipment and storage medium
CN113204645A (en) * 2021-04-01 2021-08-03 武汉大学 Knowledge-guided aspect-level emotion analysis model training method
CN113345418A (en) * 2021-06-09 2021-09-03 中国科学技术大学 Multilingual model training method based on cross-language self-training
CN114154518A (en) * 2021-12-02 2022-03-08 泰康保险集团股份有限公司 Data enhancement model training method and device, electronic equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
NANKAI LIN et al.: "Unsupervised Character Embedding Correction and Candidate Word Denoising", IEEE/ACM Transactions on Audio, Speech, and Language Processing
NEO WU et al.: "Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case", arXiv:2001.08317v1
佚名 (Anonymous): "Detailed Explanation of the Principle of the Moving Average Algorithm in Deep Learning", HTTPS://BLOG.CSDN.NET/SINAT_36618660/ARTICLE/DETAILS/99896539
王辰成 et al.: "Chinese Grammatical Error Correction Method Based on Transformer Enhanced Architecture", Journal of Chinese Information Processing (《中文信息学报》)
蒋盛益 et al.: "A Survey of Natural Language Processing Research on Indonesian and Malay", Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》)

Also Published As

Publication number Publication date
CN115062611B (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN107516129B (en) Dimension self-adaptive Tucker decomposition-based deep network compression method
Strouse et al. The deterministic information bottleneck
CN108052512B (en) Image description generation method based on depth attention mechanism
EP3748545A1 (en) Sparsity constraints and knowledge distillation based learning of sparser and compressed neural networks
Bordes et al. SGD-QN: Careful quasi-Newton stochastic gradient descent
Setiono Feedforward neural network construction using cross validation
Seo et al. Semantics-native communication with contextual reasoning
Prakash et al. IoT device friendly and communication-efficient federated learning via joint model pruning and quantization
Kang et al. Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems
US11610124B2 (en) Learning compressible features
CN111177348B (en) Training method and device for problem generation model, electronic equipment and storage medium
CN111767697B (en) Text processing method and device, computer equipment and storage medium
CN109992785B (en) Content calculation method, device and equipment based on machine learning
CN112560456A (en) Generation type abstract generation method and system based on improved neural network
Ollivier Auto-encoders: reconstruction versus compression
CN111737406B (en) Text retrieval method, device and equipment and training method of text retrieval model
Su et al. Nonlinear statistical learning with truncated gaussian graphical models
CN117574429A (en) Federal deep learning method for privacy enhancement in edge computing network
Nagy et al. Privacy-preserving Federated Learning and its application to natural language processing
Parada-Mayorga et al. Convolutional filters and neural networks with non commutative algebras
Dekel From online to batch learning with cutoff-averaging
Deng et al. Adaptive federated learning with negative inner product aggregation
CN115062611A (en) Training method, device, equipment and storage medium of grammar error correction model
Huang et al. Flow of renyi information in deep neural networks
CN112330361B (en) Intelligent big data analysis design method oriented to online shopping user consumption habit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant