CN116341640B - Text processing model training method and device

Publication number: CN116341640B (grant); other version: CN116341640A (application publication)
Application number: CN202310614594.4A
Authority: CN (China); original language: Chinese (zh)
Inventors: 吴亚军, 暴宇健, 汪骞
Assignee (current and original): Shenzhen Xumi Yuntu Space Technology Co Ltd
Legal status: Active (granted)
Prior art keywords: text processing, layer, processing model, network, training
Filing: application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd with priority to CN202310614594.4A; publication of CN116341640A followed by grant and publication of CN116341640B
Classifications

    • G06N3/08 — Computing arrangements based on specific computational models; computing arrangements based on biological models; neural networks; learning methods
    • G06N3/044, G06N3/0442 — Neural network architecture, e.g. interconnection topology; recurrent networks, e.g. Hopfield networks, characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 — Neural network architecture; combinations of networks
    • G06N3/0464 — Neural network architecture; convolutional networks [CNN, ConvNet]
    • G06N3/048 — Neural network architecture; activation functions
    • Y02D10/00 — Climate change mitigation technologies in information and communication technologies; energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure relates to the technical field of machine learning, and provides a text processing model training method and device. The method comprises: connecting N long short-term memory (LSTM) networks in series and sequentially inserting an adaptation layer and a residual layer between every two LSTM networks to obtain a text processing model, wherein each residual layer adds the input and the output of the adaptation layer connected to it and feeds the sum to the LSTM network connected to it; setting the learning rate of the first adaptation layer in the text processing model, and calculating and setting the learning rates of the other adaptation layers from the learning rate of the first adaptation layer by an exponential decay formula or a linear decay formula; and acquiring a training data set and training the text processing model on a text processing task using the training data set. This technical means solves the problem in the prior art that the accuracy and generalization capability of text processing models cannot meet the requirements and need further improvement.

Description

Text processing model training method and device
Technical Field
The disclosure relates to the technical field of machine learning, in particular to a text processing model training method and device.
Background
With the development of machine learning technology, many network models with different architectures and different training methods have emerged in the text processing field. However, no matter which network architecture is adopted and which training method is used, the accuracy and generalization capability of the resulting text processing model still fall short of increasingly demanding requirements.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a text processing model training method and apparatus, an electronic device, and a computer-readable storage medium, so as to solve the problem in the prior art that the accuracy and generalization capability of a text processing model cannot meet the requirements and need to be further improved.
In a first aspect of the embodiments of the present disclosure, a text processing model training method is provided, including: connecting N long short-term memory (LSTM) networks in series and sequentially inserting an adaptation layer and a residual layer between every two LSTM networks to obtain a text processing model, wherein each residual layer adds the input and the output of the adaptation layer connected to it and feeds the sum to the LSTM network connected to it, N-1 adaptation layers and N-1 residual layers being inserted in total; setting the learning rate of the first adaptation layer in the text processing model, and calculating and setting the learning rates of the other adaptation layers from the learning rate of the first adaptation layer by an exponential decay formula or a linear decay formula; and acquiring a training data set and training the text processing model on a text processing task using the training data set.
In a second aspect of the embodiments of the present disclosure, a text processing model training apparatus is provided, including: a construction module configured to connect N long short-term memory (LSTM) networks in series and sequentially insert an adaptation layer and a residual layer between every two LSTM networks to obtain a text processing model, wherein each residual layer adds the input and the output of the adaptation layer connected to it and feeds the sum to the LSTM network connected to it, N-1 adaptation layers and N-1 residual layers being inserted in total; a calculation module configured to set the learning rate of the first adaptation layer in the text processing model, and to calculate and set the learning rates of the other adaptation layers from the learning rate of the first adaptation layer by an exponential decay formula or a linear decay formula; and a training module configured to acquire a training data set and train the text processing model on a text processing task using the training data set.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, a computer-readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the steps of the above-described method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects: by connecting N long short-term memory (LSTM) networks in series and sequentially inserting an adaptation layer and a residual layer between every two LSTM networks to obtain a text processing model, wherein each residual layer adds the input and the output of the adaptation layer connected to it and feeds the sum to the LSTM network connected to it, N-1 adaptation layers and N-1 residual layers being inserted in total; by setting the learning rate of the first adaptation layer in the text processing model and calculating and setting the learning rates of the other adaptation layers from the learning rate of the first adaptation layer by an exponential decay formula or a linear decay formula; and by acquiring a training data set and training the text processing model on a text processing task using the training data set, the problem in the prior art that the accuracy and generalization capability of the text processing model cannot meet the requirements can be solved, and the accuracy and generalization capability of the text processing model are further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the embodiments or the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present disclosure, and that other drawings may be obtained from them by a person of ordinary skill in the art without inventive effort.
FIG. 1 is a flow chart of a text processing model training method provided in an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a text processing model provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a text processing network provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a text processing model training apparatus according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details such as particular system configurations and techniques are set forth in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
FIG. 1 is a schematic flow chart of a text processing model training method according to an embodiment of the disclosure. The text processing model training method of FIG. 1 may be performed by a computer or a server, or by software on a computer or a server. As shown in FIG. 1, the text processing model training method includes:
s101, connecting N long-period memory networks in series, sequentially inserting an adaptation layer and a residual layer between every two long-period memory networks to obtain a text processing model, wherein each residual layer is used for adding the input and the output of the adaptation layer connected with the residual layer, inputting the added result into the long-period memory network connected with the residual layer, and inserting N-1 adaptation layers and N-1 residual layers in total;
s102, setting the learning rate of a first adaptation layer in a text processing model, and calculating and setting the learning rate of other adaptation layers according to the learning rate of the first adaptation layer through an exponential decay formula or a linear decay formula;
s103, acquiring a training data set, and training the text processing model based on the text processing task by utilizing the training data set.
The long short-term memory network is also referred to as LSTM. The adaptation layer is an adapter layer whose interior consists, in sequence, of a feedforward down-projection matrix, a nonlinear layer, and a feedforward up-projection matrix. The first, down-projection matrix multiplication is used for dimensionality reduction, and the second, up-projection layer is used for dimensionality restoration.
The adapter layer may instead comprise several Transformer layers, mapping layers, and residual layers; for example, the interior of the adapter layer may be, in sequence, a Transformer layer, a mapping layer, a residual layer, and a Transformer layer. The adapter layer may also be composed of fully connected layers and a Transformer encoder layer; for example, the interior of the adapter layer may be, in sequence, a fully connected layer, a Transformer encoder layer, and a fully connected layer. The adapter layer may also be composed of a plurality of feedforward layers in series, each feedforward layer consisting of linear layers and nonlinear activation functions. The adapter layer may also adopt any other commonly used structure.
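By way of illustration, the following is a minimal PyTorch sketch of the bottleneck adapter structure described above (down-projection, nonlinearity, up-projection). The hidden and bottleneck dimensions, the ReLU nonlinearity, and the class name Adapter are illustrative assumptions, not values fixed by this disclosure.

    import torch
    from torch import nn

    class Adapter(nn.Module):
        """Bottleneck adapter: feedforward down-projection -> nonlinearity -> feedforward up-projection."""
        def __init__(self, hidden_dim: int = 256, bottleneck_dim: int = 64):
            super().__init__()
            self.down_proj = nn.Linear(hidden_dim, bottleneck_dim)  # dimensionality reduction
            self.activation = nn.ReLU()                             # nonlinear layer (assumed)
            self.up_proj = nn.Linear(bottleneck_dim, hidden_dim)    # dimensionality restoration

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.up_proj(self.activation(self.down_proj(x)))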
The text processing task may be text translation, text order prediction, text next sentence prediction, question and answer tasks, in-text entity recognition tasks, text classification, and the like.
According to the technical solution provided by the embodiments of the present disclosure, N long short-term memory (LSTM) networks are connected in series and an adaptation layer and a residual layer are sequentially inserted between every two LSTM networks to obtain a text processing model, wherein each residual layer adds the input and the output of the adaptation layer connected to it and feeds the sum to the LSTM network connected to it, N-1 adaptation layers and N-1 residual layers being inserted in total; the learning rate of the first adaptation layer in the text processing model is set, and the learning rates of the other adaptation layers are calculated and set from the learning rate of the first adaptation layer by an exponential decay formula or a linear decay formula; and a training data set is acquired and used to train the text processing model on a text processing task. This solves the problem in the prior art that the accuracy and generalization capability of the text processing model cannot meet the requirements, and further improves the accuracy and generalization capability of the text processing model.
Each residual layer adds the input and the output of the adaptation layer immediately preceding it, and the sum is fed to the LSTM network immediately following it.
For example, when N is 3, the internal structure of the text processing model is, in sequence: an LSTM network, an adaptation layer, a residual layer, an LSTM network, an adaptation layer, a residual layer, and an LSTM network.
The first residual layer adds the input and the output of the first adaptation layer and feeds the sum to the second LSTM network; the second residual layer adds the input and the output of the second adaptation layer and feeds the sum to the third LSTM network.
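The series construction of step S101 could, for example, take the following shape (a sketch under assumed layer sizes, reusing the Adapter module sketched above); the residual layer is realized simply as the addition of the adapter's input and output, whose sum feeds the next LSTM network.

    from torch import nn

    class LSTMAdapterModel(nn.Module):
        """N LSTM networks in series, with an adaptation layer and a residual
        addition inserted between every two LSTMs (N-1 adapters in total).
        Adapter is the module sketched earlier."""
        def __init__(self, input_dim: int = 256, hidden_dim: int = 256, n_lstm: int = 3):
            super().__init__()
            self.lstms = nn.ModuleList(
                nn.LSTM(input_dim if i == 0 else hidden_dim, hidden_dim, batch_first=True)
                for i in range(n_lstm)
            )
            self.adapters = nn.ModuleList(Adapter(hidden_dim) for _ in range(n_lstm - 1))

        def forward(self, x):
            for i, lstm in enumerate(self.lstms):
                x, _ = lstm(x)
                if i < len(self.adapters):
                    # residual layer: add the adapter's input and output,
                    # then feed the sum to the next LSTM network
                    x = x + self.adapters[i](x)
            return x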
Further, the learning rates of the other adaptation layers are calculated from the learning rate of the first adaptation layer by a linear decay formula, wherein i is the index of the adaptation layer in the text processing model, 1 ≤ i ≤ N-1, L1 is the learning rate of the first adaptation layer in the text processing model, Li is the learning rate of the i-th adaptation layer in the text processing model, and K is a preset constant.
Further, the learning rates of the other adaptation layers may instead be calculated and set by an exponential decay formula, wherein i is the index of the adaptation layer in the text processing model, 1 ≤ i ≤ N-1, L1 is the learning rate of the first adaptation layer in the text processing model, Li is the learning rate of the i-th adaptation layer in the text processing model, and e is the natural constant.
The learning rate of the first adaptation layer in the text processing model may be set according to common experience; once the learning rate of the first adaptation layer is obtained, the learning rates of the other adaptation layers are calculated from it. Practice shows that the smaller the parameter fluctuation of the later (deeper) layers, the better the generalization performance of the final model and the more stable the training. Therefore, the embodiments of the present application use different learning rates for the adaptation layers: the deeper (higher) an adaptation layer lies in the text processing model, the smaller its learning rate. For example, if there are three adaptation layers in total, the learning rate of the first adaptation layer is greater than that of the second adaptation layer, which in turn is greater than that of the third adaptation layer.
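The decay formulas themselves are not reproduced in the text above, so the concrete forms used below, a linear decay Li = L1 - (i - 1)·K and an exponential decay Li = L1·e^(-(i - 1)), are assumptions that merely match the stated behaviour (deeper adaptation layers receive smaller learning rates); they are illustrative, not the claimed formulas.

    import math

    def linear_decay_rates(l1: float, n_adapters: int, k: float):
        """Assumed linear decay: Li = L1 - (i - 1) * K, for i = 1 .. N-1."""
        return [l1 - (i - 1) * k for i in range(1, n_adapters + 1)]

    def exponential_decay_rates(l1: float, n_adapters: int):
        """Assumed exponential decay: Li = L1 * e^(-(i - 1)), for i = 1 .. N-1."""
        return [l1 * math.exp(-(i - 1)) for i in range(1, n_adapters + 1)]

    # Example with three adaptation layers and L1 = 1e-3:
    print(linear_decay_rates(1e-3, 3, 2e-4))   # approximately [0.001, 0.0008, 0.0006]
    print(exponential_decay_rates(1e-3, 3))    # approximately [0.001, 0.000368, 0.000135]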
Further, training the text processing model on the text processing task using the training data set includes: freezing the network parameters of all LSTM networks in the text processing model, and optimizing the network parameters of all adaptation layers with the training data set on the text processing task.
The network parameters of the LSTM networks are frozen, but the LSTM networks still take part in training: they participate in the forward computation with their original parameters, and the result of this computation is used only to update the network parameters of the adaptation layers, while the network parameters of the LSTM networks themselves are not updated.
Practice shows that optimizing only the adaptation layers in the text processing model can already achieve a certain effect. Therefore, when training the text processing model, only the network parameters of the adaptation layers are optimized, which increases training speed and reduces the number of samples and the hardware required for training. This approach suits small-sample scenarios, i.e. cases with few training samples, and scenarios with limited computing power, i.e. cases where the training device has little video memory or weak computing capability.
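The following sketch shows how the freezing described above might be combined with per-adapter learning rates using PyTorch optimizer parameter groups, reusing the model and helper sketched earlier; the choice of Adam and the concrete rate values are assumptions.

    from torch import optim

    model = LSTMAdapterModel(n_lstm=3)          # model sketched earlier

    # Freeze all LSTM parameters: the LSTMs still take part in the forward
    # computation with their original weights, but they are not updated.
    for lstm in model.lstms:
        for p in lstm.parameters():
            p.requires_grad_(False)

    # One optimizer parameter group per adaptation layer, each with its own
    # (decayed) learning rate.
    adapter_rates = linear_decay_rates(1e-3, len(model.adapters), 2e-4)
    optimizer = optim.Adam(
        [{"params": a.parameters(), "lr": lr}
         for a, lr in zip(model.adapters, adapter_rates)]
    )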
In an alternative embodiment, the network parameters of all adaptation layers in the text processing model are frozen, and the network parameters of all LSTM networks are optimized with the training data set on the text processing task.
Training the text processing model on the text processing task using the training data set may include dividing the training data set into a first training data set, a second training data set, and a third training data set according to a preset ratio, and performing multi-stage training on the text processing model. First-stage training: freeze the network parameters of all LSTM networks and optimize the network parameters of all adaptation layers with the first training data set on the text processing task. Second-stage training: freeze the network parameters of all adaptation layers and optimize the network parameters of all LSTM networks with the second training data set on the text processing task. Third-stage training: optimize the network parameters of all adaptation layers and of all LSTM networks with the third training data set.
In the first stage, the network parameters of the LSTM networks are frozen and only the network parameters of the adaptation layers are optimized with the first training data set; after the first stage, the network parameters of the LSTM networks are thawed. In the second stage, the network parameters of the adaptation layers are frozen and only the network parameters of the LSTM networks are optimized with the second training data set; after the second stage, the network parameters of the adaptation layers are thawed. In the third stage, the network parameters of the adaptation layers and of the LSTM networks are optimized with the third training data set, so the third stage trains the whole text processing model.
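One possible shape for the three-stage schedule described above is sketched below; the helpers set_requires_grad and train_fn are placeholders introduced for illustration and are not part of the disclosure.

    def set_requires_grad(modules, flag: bool):
        """Freeze (flag=False) or thaw (flag=True) the parameters of the given modules."""
        for m in modules:
            for p in m.parameters():
                p.requires_grad_(flag)

    def three_stage_training(model, first_set, second_set, third_set, train_fn):
        """train_fn(model, data_set) is a placeholder that runs one stage of optimization."""
        # Stage 1: freeze the LSTM networks, optimize only the adaptation layers.
        set_requires_grad(model.lstms, False)
        set_requires_grad(model.adapters, True)
        train_fn(model, first_set)
        set_requires_grad(model.lstms, True)       # thaw after stage 1

        # Stage 2: freeze the adaptation layers, optimize only the LSTM networks.
        set_requires_grad(model.adapters, False)
        train_fn(model, second_set)
        set_requires_grad(model.adapters, True)    # thaw after stage 2

        # Stage 3: optimize adaptation layers and LSTM networks together.
        train_fn(model, third_set)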
Alternatively, training the text processing model on the text processing task using the training data set includes: dividing the training data set into a first training data set and a second training data set according to a preset ratio, and performing multi-stage training on the text processing model. First-stage training: freeze the network parameters of all LSTM networks and optimize the network parameters of all adaptation layers with the first training data set on the text processing task. Second-stage training: optimize the network parameters of all adaptation layers and of all LSTM networks with the second training data set on the text processing task.
Alternatively, training the text processing model on the text processing task using the training data set includes: dividing the training data set into a first training data set and a second training data set according to a preset ratio, and performing multi-stage training on the text processing model. First-stage training: freeze the network parameters of all adaptation layers and optimize the network parameters of all LSTM networks with the first training data set on the text processing task. Second-stage training: optimize the network parameters of all adaptation layers and of all LSTM networks with the second training data set on the text processing task.
Training the text processing model on the text processing task using the training data set may also include, when training the text processing model in multiple batches: determining the unused target words in the samples of the current batch based on the samples that have already been used in all batches preceding the current batch; freezing the network parameters of all embedding layers in the text processing model except the embedding layer corresponding to the target words, and optimizing the network parameters of the embedding layer corresponding to the target words only according to the target words.
For example, suppose a sample in the current batch is "u1 u2 China's women's team wins the title again", where "China's women's team wins the title again" has already been used in previous batches and "u1 u2" are tokens the text processing model has not yet been trained on, i.e. the unused target words. Therefore the network parameters of the embedding layers other than the embedding layer corresponding to the target words (i.e. the network parameters of the embedding layer corresponding to "China's women's team wins the title again") are frozen, which effectively also freezes the network parameters of the network layers connected to those other embedding layers. Optimizing the network parameters of the embedding layer corresponding to the target words only according to the target words in fact also requires optimizing the network parameters of the network layers connected to that embedding layer. The text processing model includes an embedding layer, which may be placed before the first LSTM network.
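One way such per-word freezing of the embedding layer could be approximated in PyTorch is to mask the embedding gradient so that only the rows of the target words are updated; the gradient-hook mechanism, vocabulary size, and token ids below are implementation assumptions, not the mechanism stated by the disclosure.

    import torch
    from torch import nn

    vocab_size, embed_dim = 10000, 256               # illustrative sizes
    embedding = nn.Embedding(vocab_size, embed_dim)  # embedding layer placed before the first LSTM

    # Ids of the unused target words in the current batch (illustrative values,
    # standing in for the tokens "u1" and "u2").
    target_ids = torch.tensor([8701, 8702])

    # Zero the gradient of every embedding row except the target-word rows, so
    # that the embeddings of already-seen words are effectively frozen.
    row_mask = torch.zeros(vocab_size, 1)
    row_mask[target_ids] = 1.0
    embedding.weight.register_hook(lambda grad: grad * row_mask)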
In an alternative embodiment, the network parameters of the embedding layers other than the embedding layer corresponding to the target words, and the network parameters of the network layers connected to those other embedding layers, are frozen in the text processing model, and the network parameters of the embedding layer corresponding to the target words and of the network layers connected to that embedding layer are optimized only according to the target words; here, a connection relationship means a serial connection between an upper network layer and a lower network layer.
For example, denote the embedding layer corresponding to the target words as network a; the network connected to network a in the large network layer above the one containing network a is network b (a large network layer contains several networks connected in parallel, and the large network layers are connected in series to form the text processing model), and the network connected to network a in the large network layer below is network c. Denote one of the other embedding layers as network d (networks a and d belong to the same large network layer and are connected in parallel); the network connected to network d in the large network layer above is network e, and the network connected to network d in the large network layer below is network f. Suppose the text processing model consists only of networks a, b, c, d, e, and f; then the network parameters of networks d, e, and f are frozen, and the network parameters of networks a, b, and c are optimized only according to the target words.
For example, suppose again that a sample in the current batch is "u1 u2 China's women's team wins the title again", where "China's women's team wins the title again" has already been used in previous batches and "u1 u2" are tokens the text processing model has not yet been trained on. This can be understood as training the text processing model on "u1 u2" alone and on the connection between "u1 u2" and "China's women's team wins the title again", but not training it on "China's women's team wins the title again" alone. That is, the network parameters of the embedding layers corresponding to words other than the target words are frozen, and the network parameters of the embedding layer corresponding to the target words and of the network layers connected to it are optimized according to the target words and according to the connection between the target words and the other words, respectively.
In an alternative embodiment, the network parameters of the embedding layers other than the embedding layer corresponding to the target words, and the network parameters of the network layers connected to those other embedding layers, are frozen in the text processing model, and the network parameters of the embedding layer corresponding to the target words and of the network layers connected to that embedding layer are optimized according to the target words and according to the connection between the target words and the other words, respectively; here, a connection relationship means a serial connection between an upper network layer and a lower network layer.
For example, denote the embedding layer corresponding to the target words as network a, the network connected to it in the large network layer above as network b, and the network connected to it in the large network layer below as network c; denote one of the other embedding layers as network d, the network connected to it in the large network layer above as network e, and the network connected to it in the large network layer below as network f. Then the network parameters of networks d, e, and f are frozen; the network parameters of network a are optimized according to the target words; and the network parameters of networks b and c are optimized according to the connection between the target words and the other words. The connection between the target words and the other words is the context information of the target words; computing context information is prior art and is not described again here.
In an alternative embodiment, the training data set is divided into a first training data set, a second training data set, and a third training data set according to a preset ratio, and the text processing model is trained in multiple stages. First-stage training: freeze the network parameters of all LSTM networks and optimize the network parameters of all adaptation layers with the first training data set on the text processing task. Second-stage training: freeze the network parameters of the embedding layers other than the embedding layer corresponding to the target words and of the network layers connected to those other embedding layers, and optimize the network parameters of the embedding layer corresponding to the target words using the target words in the second training data set on the text processing task. Third-stage training: optimize the network parameters of all LSTM networks with the third training data set on the text processing task.
In this alternative embodiment, the second-stage training may instead be: freeze the network parameters of the embedding layers other than the embedding layer corresponding to the target words and of the network layers connected to those other embedding layers, and optimize the network parameters of the embedding layer corresponding to the target words and of the network layers connected to that embedding layer, respectively, using the target words in the second training data set and the connections between the target words and the other words, on the text processing task; here, a connection relationship means a serial connection between an upper network layer and a lower network layer.
In an alternative embodiment, the method further comprises: inserting an adaptation layer and a residual layer into each memory network of the long short-term memory network to obtain the text processing model; setting the learning rate of the first adaptation layer in the text processing model, and calculating and setting the learning rates of the other adaptation layers from the learning rate of the first adaptation layer by an exponential decay formula or a linear decay formula; and acquiring a training data set and training the text processing model on a text processing task using the training data set.
In this embodiment, a single long short-term memory network is used to construct the text processing model. The long short-term memory network is a special recurrent neural network: like a standard recurrent neural network, it has a chain of repeated neural network modules. That is, the long short-term memory network can be regarded as a series of repeated memory networks (for example, if a training text in the training data contains M words, the number of memory networks in the long short-term memory network is M, one memory network per word). Each memory network contains, in sequence, a forget gate, an input gate, a memory unit, and an output gate: the forget gate determines which information is discarded from the memory unit, the input gate determines which new information is stored in the memory unit, and the output gate determines what value is output. In this embodiment, an adaptation layer and a residual layer are inserted in sequence between the memory unit and the output gate of each memory network, and the residual layer following each adaptation layer computes the sum of that adaptation layer's input and output.
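A sketch of a single memory network with the adapter and residual addition inserted between the memory unit and the output gate, as described above, is given below. The gate equations are the standard LSTM ones, Adapter is the module sketched earlier, and propagating the original (rather than adapted) cell state to the next time step is an assumption.

    import torch
    from torch import nn

    class AdapterLSTMCell(nn.Module):
        """One memory network of the LSTM with an adapter and residual addition
        inserted between the memory unit (cell state) and the output gate."""
        def __init__(self, input_dim: int, hidden_dim: int):
            super().__init__()
            self.gates = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)
            self.adapter = Adapter(hidden_dim)      # adapter sketched earlier

        def forward(self, x, h, c):
            f, i, g, o = self.gates(torch.cat([x, h], dim=-1)).chunk(4, dim=-1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)   # forget gate, input gate, memory unit
            c_adapted = c + self.adapter(c)                # residual: adapter input plus adapter output
            h = torch.sigmoid(o) * torch.tanh(c_adapted)   # output gate acts on the adapted memory
            return h, c                                    # original cell state propagated (assumption)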
In an alternative embodiment, the method further comprises: connecting N long short-term memory networks in series, sequentially inserting an adaptation layer and a residual layer between every two LSTM networks, and additionally inserting an adaptation layer and a residual layer into each memory network of each LSTM network, to obtain the text processing model; setting the learning rate of the first adaptation layer in the text processing model, and calculating and setting the learning rates of the other adaptation layers from the learning rate of the first adaptation layer by an exponential decay formula or a linear decay formula; and acquiring a training data set and training the text processing model on a text processing task using the training data set.
In this embodiment, N long short-term memory networks are used to construct the text processing model, and an adaptation layer and a residual layer are inserted in sequence between the memory unit and the output gate of each memory network, the residual layer following each adaptation layer computing the sum of that adaptation layer's input and output. The text processing model of this embodiment has the most complex structure, so its effect is the best.
FIG. 2 is a schematic structural diagram of a text processing model according to an embodiment of the present disclosure. As shown in FIG. 2, the text processing model comprises, in sequence: an embedding layer, an LSTM network, an adaptation layer, a residual layer, and an LSTM network.
The residual layer adds the input and the output of the adaptation layer and feeds the sum to the second LSTM network.
FIG. 3 is a schematic structural diagram of a text processing network according to an embodiment of the present disclosure. As shown in FIG. 3, the text processing network comprises, in sequence: an embedding layer, a multi-head attention network, a feedforward layer, an adaptation layer, a residual layer, a normalization layer, a fully connected layer, and a classification layer.
The feedforward layer is obtained by connecting several linear layers and nonlinear activation functions in sequence. A plurality of such text processing networks are connected in series to obtain a text processing model.
The first residual layer adds the input and the output of the first adaptation layer and feeds the sum to the first normalization layer; the second residual layer adds the input and the output of the second adaptation layer and feeds the sum to the second normalization layer.
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 4 is a schematic diagram of a text processing model training apparatus according to an embodiment of the present disclosure. As shown in fig. 4, the text processing model training apparatus includes:
the construction module 401 is configured to connect N long short-term memory (LSTM) networks in series and to sequentially insert an adaptation layer and a residual layer between every two LSTM networks to obtain a text processing model, wherein each residual layer adds the input and the output of the adaptation layer connected to it and feeds the sum to the LSTM network connected to it, N-1 adaptation layers and N-1 residual layers being inserted in total;
the calculation module 402 is configured to set the learning rate of the first adaptation layer in the text processing model, and to calculate and set the learning rates of the other adaptation layers from the learning rate of the first adaptation layer by an exponential decay formula or a linear decay formula;
the training module 403 is configured to acquire a training data set and to train the text processing model on a text processing task using the training data set.
The long short-term memory network is also referred to as LSTM. The adaptation layer is an adapter layer whose interior consists, in sequence, of a feedforward down-projection matrix, a nonlinear layer, and a feedforward up-projection matrix. The first, down-projection matrix multiplication is used for dimensionality reduction, and the second, up-projection layer is used for dimensionality restoration.
The adapter layer may instead comprise several Transformer layers, mapping layers, and residual layers; for example, the interior of the adapter layer may be, in sequence, a Transformer layer, a mapping layer, a residual layer, and a Transformer layer. The adapter layer may also be composed of fully connected layers and a Transformer encoder layer; for example, the interior of the adapter layer may be, in sequence, a fully connected layer, a Transformer encoder layer, and a fully connected layer. The adapter layer may also be composed of a plurality of feedforward layers in series, each feedforward layer consisting of linear layers and nonlinear activation functions. The adapter layer may also adopt any other commonly used structure.
The text processing task may be text translation, text order prediction, text next sentence prediction, question and answer tasks, in-text entity recognition tasks, text classification, and the like.
According to the technical solution provided by the embodiments of the present disclosure, N long short-term memory (LSTM) networks are connected in series and an adaptation layer and a residual layer are sequentially inserted between every two LSTM networks to obtain a text processing model, wherein each residual layer adds the input and the output of the adaptation layer connected to it and feeds the sum to the LSTM network connected to it, N-1 adaptation layers and N-1 residual layers being inserted in total; the learning rate of the first adaptation layer in the text processing model is set, and the learning rates of the other adaptation layers are calculated and set from the learning rate of the first adaptation layer by an exponential decay formula or a linear decay formula; and a training data set is acquired and used to train the text processing model on a text processing task. This solves the problem in the prior art that the accuracy and generalization capability of the text processing model cannot meet the requirements, and further improves the accuracy and generalization capability of the text processing model.
Each residual layer adds the input and the output of the adaptation layer immediately preceding it, and the sum is fed to the LSTM network immediately following it.
For example, when N is 3, the internal structure of the text processing model is, in sequence: an LSTM network, an adaptation layer, a residual layer, an LSTM network, an adaptation layer, a residual layer, and an LSTM network.
The second residual layer adds the input and the output of the second adaptation layer and feeds the sum to the third LSTM network.
Optionally, the calculation module 402 is further configured to calculate the learning rates of the other adaptation layers from the learning rate of the first adaptation layer by a linear decay formula, wherein i is the index of the adaptation layer in the text processing model, 1 ≤ i ≤ N-1, L1 is the learning rate of the first adaptation layer in the text processing model, Li is the learning rate of the i-th adaptation layer in the text processing model, and K is a preset constant.
Further, the learning rates of the other adaptation layers may instead be calculated and set by an exponential decay formula, wherein i is the index of the adaptation layer in the text processing model, 1 ≤ i ≤ N-1, L1 is the learning rate of the first adaptation layer in the text processing model, Li is the learning rate of the i-th adaptation layer in the text processing model, and e is the natural constant.
The learning rate of the first adaptation layer in the text processing model may be set according to common experience; once the learning rate of the first adaptation layer is obtained, the learning rates of the other adaptation layers are calculated from it. Practice shows that the smaller the parameter fluctuation of the later (deeper) layers, the better the generalization performance of the final model and the more stable the training. Therefore, the embodiments of the present application use different learning rates for the adaptation layers: the deeper (higher) an adaptation layer lies in the text processing model, the smaller its learning rate. For example, if there are three adaptation layers in total, the learning rate of the first adaptation layer is greater than that of the second adaptation layer, which in turn is greater than that of the third adaptation layer.
Optionally, the training module 403 is further configured to freeze the network parameters of all LSTM networks in the text processing model and to optimize the network parameters of all adaptation layers with the training data set on the text processing task.
The network parameters of the LSTM networks are frozen, but the LSTM networks still take part in training: they participate in the forward computation with their original parameters, and the result of this computation is used only to update the network parameters of the adaptation layers, while the network parameters of the LSTM networks themselves are not updated.
Practice shows that optimizing only the adaptation layers in the text processing model can already achieve a certain effect. Therefore, when training the text processing model, only the network parameters of the adaptation layers are optimized, which increases training speed and reduces the number of samples and the hardware required for training. This approach suits small-sample scenarios, i.e. cases with few training samples, and scenarios with limited computing power, i.e. cases where the training device has little video memory or weak computing capability.
Optionally, the training module 403 is further configured to freeze the network parameters of all adaptation layers in the text processing model and to optimize the network parameters of all LSTM networks with the training data set on the text processing task.
Optionally, the training module 403 is further configured to divide the training data set into a first training data set, a second training data set, and a third training data set according to a preset ratio and to train the text processing model in multiple stages. First-stage training: freeze the network parameters of all LSTM networks and optimize the network parameters of all adaptation layers with the first training data set on the text processing task. Second-stage training: freeze the network parameters of all adaptation layers and optimize the network parameters of all LSTM networks with the second training data set on the text processing task. Third-stage training: optimize the network parameters of all adaptation layers and of all LSTM networks with the third training data set.
In the first stage, the network parameters of the LSTM networks are frozen and only the network parameters of the adaptation layers are optimized with the first training data set; after the first stage, the network parameters of the LSTM networks are thawed. In the second stage, the network parameters of the adaptation layers are frozen and only the network parameters of the LSTM networks are optimized with the second training data set; after the second stage, the network parameters of the adaptation layers are thawed. In the third stage, the network parameters of the adaptation layers and of the LSTM networks are optimized with the third training data set, so the third stage trains the whole text processing model.
Optionally, the training module 403 is further configured to divide the training data set into a first training data set and a second training data set according to a preset ratio and to train the text processing model in multiple stages. First-stage training: freeze the network parameters of all LSTM networks and optimize the network parameters of all adaptation layers with the first training data set on the text processing task. Second-stage training: optimize the network parameters of all adaptation layers and of all LSTM networks with the second training data set on the text processing task.
Optionally, the training module 403 is further configured to divide the training data set into a first training data set and a second training data set according to a preset ratio and to train the text processing model in multiple stages. First-stage training: freeze the network parameters of all adaptation layers and optimize the network parameters of all LSTM networks with the first training data set on the text processing task. Second-stage training: optimize the network parameters of all adaptation layers and of all LSTM networks with the second training data set on the text processing task.
Optionally, the training module 403 is further configured to, when training the text processing model in multiple batches: determine the unused target words in the samples of the current batch based on the samples that have already been used in all batches preceding the current batch; freeze the network parameters of all embedding layers in the text processing model except the embedding layer corresponding to the target words, and optimize the network parameters of the embedding layer corresponding to the target words only according to the target words.
For example, suppose a sample in the current batch is "u1 u2 China's women's team wins the title again", where "China's women's team wins the title again" has already been used in previous batches and "u1 u2" are tokens the text processing model has not yet been trained on, i.e. the unused target words. Therefore the network parameters of the embedding layers other than the embedding layer corresponding to the target words (i.e. the network parameters of the embedding layer corresponding to "China's women's team wins the title again") are frozen, which effectively also freezes the network parameters of the network layers connected to those other embedding layers. Optimizing the network parameters of the embedding layer corresponding to the target words only according to the target words in fact also requires optimizing the network parameters of the network layers connected to that embedding layer. The text processing model includes an embedding layer, which may be placed before the first LSTM network.
Optionally, the training module 403 is further configured to freeze the network parameters of the embedding layers other than the embedding layer corresponding to the target words, and the network parameters of the network layers connected to those other embedding layers, and to optimize the network parameters of the embedding layer corresponding to the target words and of the network layers connected to that embedding layer only according to the target words; here, a connection relationship means a serial connection between an upper network layer and a lower network layer.
For example, denote the embedding layer corresponding to the target words as network a; the network connected to network a in the large network layer above the one containing network a is network b (a large network layer contains several networks connected in parallel), and the network connected to network a in the large network layer below is network c. Denote one of the other embedding layers as network d (networks a and d belong to the same large network layer and are connected in parallel); the network connected to network d in the large network layer above is network e, and the network connected to network d in the large network layer below is network f. Suppose the text processing model consists only of networks a, b, c, d, e, and f; then the network parameters of networks d, e, and f are frozen, and the network parameters of networks a, b, and c are optimized only according to the target words.
For example, suppose again that a sample in the current batch is "u1 u2 China's women's team wins the title again", where "China's women's team wins the title again" has already been used in previous batches and "u1 u2" are tokens the text processing model has not yet been trained on. This can be understood as training the text processing model on "u1 u2" alone and on the connection between "u1 u2" and "China's women's team wins the title again", but not training it on "China's women's team wins the title again" alone. That is, the network parameters of the embedding layers corresponding to words other than the target words are frozen, and the network parameters of the embedding layer corresponding to the target words and of the network layers connected to it are optimized according to the target words and according to the connection between the target words and the other words, respectively.
Optionally, the training module 403 is further configured to freeze the network parameters of the embedding layers other than the embedding layer corresponding to the target words and the network parameters of the network layers connected to those other embedding layers, and to optimize the network parameters of the embedding layer corresponding to the target words and of the network layers connected to that embedding layer according to the target words and according to the connection between the target words and the other words, respectively; here, a connection relationship means a serial connection between an upper network layer and a lower network layer.
For example, denote the embedding layer corresponding to the target words as network a, the network connected to it in the large network layer above as network b, and the network connected to it in the large network layer below as network c; denote one of the other embedding layers as network d, the network connected to it in the large network layer above as network e, and the network connected to it in the large network layer below as network f. Then the network parameters of networks d, e, and f are frozen; the network parameters of network a are optimized according to the target words; and the network parameters of networks b and c are optimized according to the connection between the target words and the other words.
Optionally, the training module 403 is further configured to divide the training data set into a first training data set, a second training data set, and a third training data set according to a preset ratio, and to perform multi-stage training on the text processing model. First-stage training of the text processing model: freeze the network parameters of all long short-term memory networks, and optimize the network parameters of all adapter layers by using the first training data set based on the text processing task. Second-stage training of the text processing model: freeze the network parameters of the other embedded layers except the embedded layer corresponding to the target word in the text processing model and the network parameters of the network layers having a connection relationship with the other embedded layers, and optimize the network parameters of the embedded layer corresponding to the target word in the text processing model by using the target words in the second training data set based on the text processing task. Third-stage training of the text processing model: optimize the network parameters of all long short-term memory networks by using the third training data set based on the text processing task.
In this alternative embodiment, the second-stage training of the text processing model may also be: freeze the network parameters of the other embedded layers except the embedded layer corresponding to the target word and the network parameters of the network layers having a connection relationship with the other embedded layers in the text processing model, and, based on the text processing task, respectively optimize the network parameters of the embedded layer corresponding to the target word and the network parameters of the network layers having a connection relationship with that embedded layer by using the target words in the second training data set and the connection relationships between the target words and the other words; here, the connection relationship refers to a serial connection relationship between an upper network layer and a lower network layer.
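The staged schedule can be pictured as toggling which parameter groups are trainable before each stage, as in the hedged sketch below; the module names and splits are hypothetical, and the second stage is simplified to train the whole embedding table rather than only the target-word rows (the earlier sketch shows how to restrict it further).
import torch
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Hypothetical containers standing in for the LSTM layers, adapter layers,
# and embedded layers of the text processing model.
lstms = nn.ModuleList([nn.LSTM(64, 64, batch_first=True) for _ in range(3)])
adapters = nn.ModuleList([nn.Linear(64, 64) for _ in range(2)])
embeddings = nn.Embedding(1000, 64)
model = nn.ModuleDict({"lstms": lstms, "adapters": adapters, "embeddings": embeddings})

def run_stage(trainable, frozen, dataset) -> None:
    for m in frozen:
        set_trainable(m, False)
    for m in trainable:
        set_trainable(m, True)
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3
    )
    # ... iterate over `dataset`, compute the task loss, and call optimizer.step()

# Stage 1: freeze all LSTMs, optimize the adapter layers on the first split.
run_stage([adapters], [lstms, embeddings], dataset="first_split")
# Stage 2: freeze LSTMs and adapters, optimize the embedding parameters on the second split.
run_stage([embeddings], [lstms, adapters], dataset="second_split")
# Stage 3: optimize the LSTM parameters on the third split.
run_stage([lstms], [adapters, embeddings], dataset="third_split")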
Optionally, the training module 403 is further configured to insert an adapter layer and a residual layer into each memory network of the long short-term memory network to obtain the text processing model; set the learning rate of the first adapter layer in the text processing model, and calculate and set the learning rates of the other adapter layers according to the learning rate of the first adapter layer through an exponential decay formula or a linear decay formula; and acquire a training data set, and train the text processing model based on the text processing task by using the training data set.
In this embodiment, a long short-term memory network is used to construct the text processing model. A long short-term memory network is a special recurrent neural network; like a standard recurrent neural network, it has a chain of repeating neural network modules. That is, a long short-term memory network can be regarded as memory networks connected repeatedly in series (for example, if a training text in the training data contains M words, the long short-term memory network contains M memory networks, one memory network per word). Each memory network contains, in order, a forget gate, an input gate, a memory unit, and an output gate: the forget gate determines which information is discarded from the memory unit, the input gate determines which new information is stored in the memory unit, and the output gate determines what value is output. In this embodiment, an adapter layer and a residual layer are inserted in sequence between the memory unit and the output gate of each memory network, and the residual layer following each adapter layer is used to compute the sum of that adapter layer's input and output.
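To make the insertion point concrete, the sketch below implements a single memory network (cell) with an adapter layer and a residual connection placed between the memory unit and the output gate; the bottleneck size, gate layout, and the choice to carry the un-adapted cell state forward to the next step are illustrative assumptions rather than requirements of the embodiment.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Small bottleneck adapter; the bottleneck width is a hypothetical choice.
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))

class AdapterLSTMCell(nn.Module):
    # LSTM memory network with an adapter layer and a residual connection
    # inserted between the memory unit (cell state) and the output gate.
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.gates = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)
        self.adapter = Adapter(hidden_dim)

    def forward(self, x, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([x, h_prev], dim=-1))
        i, f, g, o = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c_prev + i * torch.tanh(g)   # memory unit
        c_adapted = c + self.adapter(c)      # residual layer: adapter input plus adapter output
        h = o * torch.tanh(c_adapted)        # output gate acts on the adapted cell state
        return h, (h, c)

# Example: one step over a batch of two word vectors (dimensions are hypothetical).
cell = AdapterLSTMCell(64, 64)
x = torch.randn(2, 64)
state = (torch.zeros(2, 64), torch.zeros(2, 64))
out, state = cell(x, state)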
Optionally, the training module 403 is further configured to connect N long short-term memory networks in series, sequentially insert an adapter layer and a residual layer between every two long short-term memory networks, and insert an adapter layer and a residual layer into each memory network of each long short-term memory network, to obtain the text processing model; set the learning rate of the first adapter layer in the text processing model, and calculate and set the learning rates of the other adapter layers according to the learning rate of the first adapter layer through an exponential decay formula or a linear decay formula; and acquire a training data set, and train the text processing model based on the text processing task by using the training data set.
In this embodiment, the text processing model is constructed using N long short-term memory networks, and an adapter layer and a residual layer are also inserted in sequence between the memory unit and the output gate of each memory network, with the residual layer following each adapter layer used to compute the sum of that adapter layer's input and output. Because adapter and residual layers are inserted both between the long short-term memory networks and within each memory network, the text processing model of this embodiment has the most complex structure and therefore achieves the best effect among the described embodiments.
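For the learning-rate schedule mentioned in these embodiments, assuming for illustration that the linear decay takes the form Li = L1 - (i - 1) × K and the exponential decay the form Li = L1 × e^(-(i - 1)), the per-adapter rates can be mapped to optimizer parameter groups as in the following sketch; the layer count and numeric values are hypothetical.
import math
import torch
import torch.nn as nn

def adapter_learning_rates(l1: float, n_adapters: int, k: float, mode: str = "linear"):
    # Learning rate of the i-th adapter layer, i = 1..n_adapters, under the assumed
    # forms Li = L1 - (i - 1) * K (linear) or Li = L1 * e^(-(i - 1)) (exponential).
    if mode == "linear":
        return [l1 - (i - 1) * k for i in range(1, n_adapters + 1)]
    return [l1 * math.exp(-(i - 1)) for i in range(1, n_adapters + 1)]

# One optimizer parameter group per adapter layer, each with its own learning rate.
adapters = nn.ModuleList([nn.Linear(64, 64) for _ in range(4)])   # stand-ins for the N-1 adapters
rates = adapter_learning_rates(l1=1e-3, n_adapters=len(adapters), k=2e-4)
optimizer = torch.optim.Adam(
    [{"params": a.parameters(), "lr": lr} for a, lr in zip(adapters, rates)]
)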
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present disclosure.
Fig. 5 is a schematic diagram of an electronic device 5 provided by an embodiment of the present disclosure. As shown in fig. 5, the electronic device 5 of this embodiment includes: a processor 501, a memory 502, and a computer program 503 stored in the memory 502 and executable on the processor 501. The processor 501 implements the steps of the various method embodiments described above when executing the computer program 503. Alternatively, the processor 501 performs the functions of the modules/units in the above-described apparatus embodiments when executing the computer program 503.
The electronic device 5 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or the like. The electronic device 5 may include, but is not limited to, the processor 501 and the memory 502. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the electronic device 5 and does not limit it; the electronic device 5 may include more or fewer components than shown, or different components.
The processor 501 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The memory 502 may be an internal storage unit of the electronic device 5, for example, a hard disk or a memory of the electronic device 5. The memory 502 may also be an external storage device of the electronic device 5, for example, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, or the like provided on the electronic device 5. The memory 502 may also include both an internal storage unit and an external storage device of the electronic device 5. The memory 502 is used to store the computer program and other programs and data required by the electronic device 5.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the methods of the above-described embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer readable storage medium, and the computer program, when executed by a processor, may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so forth. It should be noted that the content included in the computer readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims (8)

1. A method for training a text processing model, comprising:
connecting N long short-term memory networks in series, and sequentially inserting an adapter layer and a residual layer between every two long short-term memory networks to obtain a text processing model, wherein each residual layer is used for adding the input and the output of the adapter layer connected to it and inputting the summed result into the long short-term memory network connected to it, and N-1 adapter layers and N-1 residual layers are inserted in total;
setting the learning rate of a first adapter layer in the text processing model, and calculating and setting the learning rate of other adapter layers according to the learning rate of the first adapter layer through an exponential decay formula or a linear decay formula;
acquiring a training data set, and training the text processing model based on a text processing task by utilizing the training data set;
according to the learning rate of the first adapter layer, the learning rate of the other adapter layers is calculated through the linear decay formula: Li = L1 - (i - 1) × K; wherein i is the serial number of the adapter layer in the text processing model, 1 ≤ i ≤ N-1, L1 is the learning rate of the first adapter layer in the text processing model, Li is the learning rate of the i-th adapter layer in the text processing model, and K is a preset constant;
according to the learning rate of the first adapter layer, the learning rate of the other adapter layers is calculated and set through the exponential decay formula: Li = L1 × e^(-(i - 1)); wherein i is the serial number of the adapter layer in the text processing model, 1 ≤ i ≤ N-1, L1 is the learning rate of the first adapter layer in the text processing model, Li is the learning rate of the i-th adapter layer in the text processing model, and e is a natural constant.
2. The method of claim 1, wherein training the text processing model based on text processing tasks using the training dataset comprises:
freezing network parameters of all long short-term memory networks in the text processing model, and optimizing the network parameters of all adapter layers by using the training data set based on the text processing task.
3. The method of claim 1, wherein training the text processing model based on text processing tasks using the training dataset comprises:
dividing the training data set into a first training data set, a second training data set and a third training data set according to a preset proportion, and carrying out multi-stage training on the text processing model:
first-stage training of the text processing model: freezing the network parameters of all long short-term memory networks, and optimizing the network parameters of all adapter layers by using the first training data set based on the text processing task;
second-stage training of the text processing model: freezing the network parameters of all adapter layers, and optimizing the network parameters of all long short-term memory networks by using the second training data set based on the text processing task;
third-stage training of the text processing model: optimizing the network parameters of all adapter layers and the network parameters of all long short-term memory networks by using the third training data set.
4. The method of claim 1, wherein training the text processing model based on text processing tasks using the training dataset comprises:
When training the text processing model in multiple batches:
determining unused target words in the samples of the current batch based on the samples that have been used in all batches preceding the current batch;
freezing network parameters of the other embedded layers except the embedded layer corresponding to the target word in the text processing model, and optimizing the network parameters of the embedded layer corresponding to the target word in the text processing model only according to the target word.
5. The method according to claim 1, wherein the method further comprises:
inserting an adapter layer and a residual layer into each memory network of the long short-term memory network to obtain the text processing model;
setting the learning rate of a first adapter layer in the text processing model, and calculating and setting the learning rate of other adapter layers according to the learning rate of the first adapter layer through an exponential decay formula or a linear decay formula;
and acquiring a training data set, and training the text processing model based on the text processing task by utilizing the training data set.
6. The method according to claim 1, wherein the method further comprises:
connecting N long short-term memory networks in series, sequentially inserting an adapter layer and a residual layer between every two long short-term memory networks, and inserting an adapter layer and a residual layer into each memory network of each long short-term memory network, to obtain the text processing model;
setting the learning rate of a first adapter layer in the text processing model, and calculating and setting the learning rate of other adapter layers according to the learning rate of the first adapter layer through an exponential decay formula or a linear decay formula;
and acquiring a training data set, and training the text processing model based on the text processing task by utilizing the training data set.
7. A text processing model training apparatus, comprising:
the construction module is configured to connect N long short-term memory networks in series, and sequentially insert an adapter layer and a residual layer between every two long short-term memory networks to obtain a text processing model, wherein each residual layer is used for adding the input and the output of the adapter layer connected to it and inputting the summed result into the long short-term memory network connected to it, and N-1 adapter layers and N-1 residual layers are inserted in total;
The calculation module is configured to set the learning rate of a first adapter layer in the text processing model, and calculate and set the learning rate of other adapter layers through an exponential decay formula or a linear decay formula according to the learning rate of the first adapter layer;
the training module is configured to acquire a training data set, and train the text processing model based on the text processing task by utilizing the training data set;
the calculation module is further configured to calculate the learning rate of the other adapter layers through the linear decay formula according to the learning rate of the first adapter layer: Li = L1 - (i - 1) × K; wherein i is the serial number of the adapter layer in the text processing model, 1 ≤ i ≤ N-1, L1 is the learning rate of the first adapter layer in the text processing model, Li is the learning rate of the i-th adapter layer in the text processing model, and K is a preset constant;
the calculation module is further configured to calculate and set the learning rate of the other adapter layers through the exponential decay formula according to the learning rate of the first adapter layer: Li = L1 × e^(-(i - 1)); wherein i is the serial number of the adapter layer in the text processing model, 1 ≤ i ≤ N-1, L1 is the learning rate of the first adapter layer in the text processing model, Li is the learning rate of the i-th adapter layer in the text processing model, and e is a natural constant.
8. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.
CN202310614594.4A 2023-05-29 2023-05-29 Text processing model training method and device Active CN116341640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310614594.4A CN116341640B (en) 2023-05-29 2023-05-29 Text processing model training method and device

Publications (2)

Publication Number Publication Date
CN116341640A (en) 2023-06-27
CN116341640B (en) 2023-08-11

Family

ID=86880731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310614594.4A Active CN116341640B (en) 2023-05-29 2023-05-29 Text processing model training method and device

Country Status (1)

Country Link
CN (1) CN116341640B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321926A (en) * 2019-05-24 2019-10-11 北京理工大学 A kind of moving method and system based on depth residual GM network
US11132988B1 (en) * 2020-10-22 2021-09-28 PolyAI Limited Dialogue system, a dialogue method, and a method of training
WO2022072659A1 (en) * 2020-10-01 2022-04-07 Beijing Dajia Internet Information Technology Co., Ltd. Video coding with neural network based in-loop filtering
CN114861533A (en) * 2022-04-26 2022-08-05 东南大学 Wind power ultra-short-term prediction method based on time convolution network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304761A (en) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 Method for text detection, device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN116341640A (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN116362351B (en) Method and device for training pre-training language model by using noise disturbance
KR102250728B1 (en) Sample processing method and device, related apparatus and storage medium
CN110689045A (en) Distributed training method and device for deep learning model
CN110046344B (en) Method for adding separator and terminal equipment
CN113743650B (en) Power load prediction method, device, equipment and storage medium
CN116127925B (en) Text data enhancement method and device based on destruction processing of text
CN116341640B (en) Text processing model training method and device
CN116912635A (en) Target tracking method and device
US11068784B2 (en) Generic quantization of artificial neural networks
CN116385328A (en) Image data enhancement method and device based on noise addition to image
CN115640523A (en) Text similarity measurement method, device, equipment, storage medium and program product
CN115034225A (en) Word processing method and device applied to medical field, electronic equipment and medium
CN114155388A (en) Image recognition method and device, computer equipment and storage medium
CN116502640B (en) Text characterization model training method and device based on context
CN116523028B (en) Image characterization model training method and device based on image space position
CN116610788A (en) Method and device for training pre-training language model based on data volume of training data
CN117436550B (en) Recommendation model training method and device
CN113283229B (en) Text similarity calculation method and device, electronic equipment and storage medium
CN117292024B (en) Voice-based image generation method and device, medium and electronic equipment
CN116306791A (en) Text processing method and device for improving self-attention model
CN114969386B (en) Disambiguation method, apparatus, electronic device, and medium applied to medical field
WO2024060727A1 (en) Method and apparatus for training neural network model, and device and system
CN117933199A (en) Comment generation method and system based on information chain and dynamic kernel sampling
CN118038517A (en) Training method and device of expression recognition model based on fine granularity enhancement features
CN116012914A (en) Class self-adaption-based face recognition model training method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant