CN115146657A - Model training method, device, storage medium, client, server and system


Publication number
CN115146657A
Authority
CN
China
Prior art keywords
training
local
language model
pseudo
sample
Prior art date
Legal status
Pending
Application number
CN202210917581.XA
Other languages
Chinese (zh)
Inventor
吴双志
董威龙
边超
Current Assignee
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202210917581.XA
Publication of CN115146657A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention provides a model training method, a device, a storage medium, a client, a server and a system. First, at a client, a global pre-training language model sent by a server is used to generate a pseudo training sample set corresponding to the client's local training sample set. The local pre-training language model is then updated based on the local training sample subsets and the pseudo training sample set, and the client's current local pre-training language model is finally sent to the server so that the server can perform weighted aggregation to update the global pre-training language model. As a result, the labeled data of different users can be utilized, the utilization efficiency of labeled data is improved, and the generalization capability of the multi-task pre-training language model is enhanced.

Description

Model training method, device, storage medium, client, server and system
Technical Field
The embodiment of the disclosure relates to the technical field of machine learning, in particular to a model training method, a device, a storage medium, a client, a server and a system.
Background
In the field of natural language processing, various different tasks can be cast as a unified generative, text-to-text task, so that a single text-to-text pre-training language model can be used to handle all of the tasks.
Disclosure of Invention
The embodiment of the disclosure provides a model training method, a model training device, a storage medium, a client, a server and a system.
In a first aspect, an embodiment of the present disclosure provides a global model training method, applied to a server, where the method includes: the following global model update operations are performed: sending the global pre-training language model to each client; receiving local pre-training language models returned by the clients after responding to the local training of the global pre-training language models; determining a weight coefficient of each local pre-training language model; and carrying out weighted aggregation on each local pre-training language model according to the corresponding weight coefficient to obtain an updated global pre-training language model.
In some optional embodiments, the global model update operation further includes: determining whether to end the global model update operation; in response to determining not to end, continuing to perform the global model update operation.
In some optional embodiments, the receiving the local pre-training language model returned by each client after responding to the local training of the global pre-training language model includes: receiving a local pre-training language model returned by each client after responding to the local training of the global pre-training language model and a loss disturbance gradient of each local training sample in a corresponding local training sample set; and the determining the weight coefficient of each local pre-training language model includes: for each local pre-training language model, determining the sum of the loss disturbance gradients of the local training samples corresponding to the local pre-training language model as the disturbance gradient sum of the local pre-training language model; for each local pre-training language model, determining the ratio of the disturbance gradient sum of the local pre-training language model to the total of the disturbance gradient sums of all the local pre-training language models as the disturbance gradient sum proportion of the local pre-training language model; and for each local pre-training language model, determining a weight coefficient of the local pre-training language model according to the disturbance gradient sum proportion of the local pre-training language model.
In some optional embodiments, the determining, according to the disturbance gradient sum proportion of the local pre-training language model, the weight coefficient of the local pre-training language model includes: determining the greater of the disturbance gradient sum proportion of the local pre-training language model and a preset small non-negative weight coefficient as the weight coefficient of the local pre-training language model.
In some optional embodiments, the determining whether to end the global model update operation includes: determining whether the number of times of executing the global model updating operation is less than a first preset number of times; in response to determining yes, determining not to end the global model update operation; in response to a determination of no, determining to end the global model update operation.
In a second aspect, an embodiment of the present disclosure provides a local model training method, applied to a client, where the method includes: in response to receiving a global pre-training language model sent by a server, respectively inputting local sample data in a local training sample set into the global pre-training language model to obtain a pseudo label corresponding to the corresponding local sample data, wherein the local training sample comprises the local sample data and the corresponding local label; generating a pseudo training sample in a pseudo training sample set by using local sample data in the local training sample set and the corresponding pseudo label; determining the global pre-training language model as a current local pre-training language model, and dividing the local training sample set and the corresponding pseudo training sample set into at least two local training sample subsets and corresponding pseudo training sample subsets; for each batch of local training sample subsets and corresponding pseudo training sample subsets, updating the current local pre-training language model based on the local training sample subsets and the corresponding pseudo training sample subsets; and sending the current local pre-training language model to the server.
In some optional embodiments, the updating the current local pre-training language model based on the local training sample subset and the corresponding pseudo training sample subset includes: calculating a first loss of the current local pre-training language model to the pseudo training sample subset according to a first loss function, and optimizing the current local pre-training language model by using the first loss to obtain an optimized local pre-training language model; calculating a second loss of the optimized local pre-training language model to the local training sample subset according to a second loss function, wherein the second loss function adds corresponding loss disturbance to the loss of the local training sample, and the second loss is reversely propagated in the optimized local pre-training language model to obtain a loss disturbance gradient of each local training sample and the corresponding pseudo training sample in the local training sample subset; for each pseudo training sample in the pseudo training sample subset, determining the sample weight of the pseudo training sample to the current local pre-training language model according to the loss disturbance gradient of the pseudo training sample; calculating a third loss of the current local pre-training language model for the local training sample subset and the corresponding pseudo training sample subset according to a third loss function, wherein the third loss function is the sum of a fourth loss function and a first loss function weighted by a sample weight, the first loss function weighted by the sample weight is obtained by weighting the first loss function according to the sample weight of the corresponding pseudo training sample, and the fourth loss function is the loss of the current local pre-training language model for the local training samples in the local training sample subset; and updating the model parameters of the current local pre-training language model according to the third loss.
In some optional embodiments, the sending the current local pre-trained language model to the server includes: and sending the loss disturbance gradient of the current local pre-training language model and each pseudo training sample in the pseudo training sample set to the server.
In some optional embodiments, the determining, for each pseudo training sample in the pseudo training sample subset, a sample weight of the pseudo training sample for the current local pre-training language model according to a loss disturbance gradient of the pseudo training sample includes: for each pseudo training sample in the pseudo training sample subset, determining the greater of a preset small non-negative sample weight and the negative of the loss disturbance gradient of the pseudo training sample as the sample weight of the pseudo training sample for the current pre-training language model.
In some optional embodiments, for each pseudo training sample in the pseudo training sample subset, determining a sample weight of the pseudo training sample to the current local pre-training language model according to a loss perturbation gradient of the pseudo training sample, further includes: and carrying out normalization processing on the sample weight of the current pre-training language model for each pseudo training sample in the pseudo training sample subset.
In a third aspect, an embodiment of the present disclosure provides a local model training method, which is applied to a client, and the method includes: and executing the target natural language processing task by using a local pre-training language model, wherein the local pre-training language model is obtained by training through the local model training method according to any one of the second aspect.
In a fourth aspect, an embodiment of the present disclosure provides a global model training apparatus, applied to a server, including: a global model update unit configured to perform the following global model update operations: sending the global pre-training language model to each client; receiving a local pre-training language model returned by each client after responding to the local training of the global pre-training language model; determining a weight coefficient of each local pre-training language model; and performing weighted aggregation on each local pre-training language model according to the corresponding weight coefficient to obtain an updated global pre-training language model.
In some optional embodiments, the global model update operation further includes: determining whether to end the global model update operation; in response to determining not to end, continuing to perform the global model update operation.
In some optional embodiments, the receiving a local pre-training language model returned by each client in response to the local training of the global pre-training language model includes: receiving a local pre-training language model returned by each client after responding to the local training of the global pre-training language model and a loss disturbance gradient of each local training sample in a corresponding local training sample set; and the determining the weight coefficient of each local pre-training language model includes: for each local pre-training language model, determining the sum of the loss disturbance gradients of the local training samples corresponding to the local pre-training language model as the disturbance gradient sum of the local pre-training language model; for each local pre-training language model, determining the ratio of the disturbance gradient sum of the local pre-training language model to the total of the disturbance gradient sums of all the local pre-training language models as the disturbance gradient sum proportion of the local pre-training language model; and for each local pre-training language model, determining a weight coefficient of the local pre-training language model according to the disturbance gradient sum proportion of the local pre-training language model.
In some optional embodiments, the determining the weight coefficient of the local pre-training language model according to the disturbance gradient sum proportion of the local pre-training language model includes: determining the greater of the disturbance gradient sum proportion of the local pre-training language model and a preset small non-negative weight coefficient as the weight coefficient of the local pre-training language model.
In some optional embodiments, the determining whether to end the global model update operation includes: determining whether the number of times of executing the global model updating operation is less than a first preset number of times; in response to determining yes, determining not to end the global model update operation; in response to a determination of no, determining to end the global model update operation.
In a fifth aspect, an embodiment of the present disclosure provides a local model training apparatus, applied to a client, where the apparatus includes: the pseudo label generating unit is configured to respond to a received global pre-training language model sent by the server, input local sample data in a local training sample set into the global pre-training language model respectively, and obtain a pseudo label corresponding to corresponding local sample data, wherein the local training sample comprises the local sample data and the corresponding local label; a pseudo sample generating unit configured to generate a pseudo training sample in a pseudo training sample set by using local sample data in the local training sample set and a corresponding pseudo label; a sample batching unit configured to determine the global pre-training language model as a current local pre-training language model and to divide the local training sample set and the corresponding pseudo-training sample set into at least two batches of local training sample subsets and corresponding pseudo-training sample subsets; a local model updating unit configured to update, for each batch of a local training sample subset and a corresponding pseudo training sample subset, the current local pre-training language model based on the local training sample subset and the corresponding pseudo training sample subset; a local model sending unit configured to send the current local pre-trained language model to the server.
In some optional embodiments, the local model updating unit is further configured to: calculating a first loss of the current local pre-training language model to the pseudo training sample subset according to a first loss function, and optimizing the current local pre-training language model by using the first loss to obtain an optimized local pre-training language model; calculating a second loss of the optimized local pre-training language model to the local training sample subset according to a second loss function, wherein the second loss function adds corresponding loss disturbance to the loss of the local training sample, and the second loss is reversely propagated in the optimized local pre-training language model to obtain a loss disturbance gradient of each local training sample and the corresponding pseudo training sample in the local training sample subset; for each pseudo training sample in the pseudo training sample subset, determining the sample weight of the pseudo training sample to the current local pre-training language model according to the loss disturbance gradient of the pseudo training sample; calculating a third loss of the current local pre-training language model for the local training sample subset and the corresponding pseudo training sample subset according to a third loss function, wherein the third loss function is the sum of a fourth loss function and a first loss function weighted by a sample weight, the first loss function weighted by the sample weight is obtained by weighting the first loss function according to the sample weight of the corresponding pseudo training sample, and the fourth loss function is the loss of the current local pre-training language model for the local training samples in the local training sample subset; and updating the model parameters of the current local pre-training language model according to the third loss.
In some optional embodiments, the local model sending unit is further configured to: and sending the loss disturbance gradient of each pseudo training sample in the current local pre-training language model and the pseudo training sample set to the server.
In some optional embodiments, the determining, for each pseudo training sample in the pseudo training sample subset, a sample weight of the pseudo training sample for the current local pre-training language model according to a loss disturbance gradient of the pseudo training sample includes: for each pseudo training sample in the pseudo training sample subset, determining the greater of a preset small non-negative sample weight and the negative of the loss disturbance gradient of the pseudo training sample as the sample weight of the pseudo training sample for the current pre-training language model.
In some optional embodiments, for each pseudo training sample in the pseudo training sample subset, determining a sample weight of the pseudo training sample to the current local pre-training language model according to a loss perturbation gradient of the pseudo training sample, further includes: and carrying out normalization processing on the sample weight of the current pre-training language model for each pseudo training sample in the pseudo training sample subset.
In a sixth aspect, an embodiment of the present disclosure provides a task processing apparatus, applied to a client, where the apparatus includes: a task processing unit configured to execute a target natural language processing task by using a local pre-training language model, wherein the local pre-training language model is obtained by training through a local model training method as described in any implementation manner of the second aspect.
In a seventh aspect, an embodiment of the present disclosure provides a server, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation manner of the first aspect.
In an eighth aspect, an embodiment of the present disclosure provides a client, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation manner of the second aspect.
In a ninth aspect, embodiments of the present disclosure provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method as described in any of the implementations of the first aspect and/or the method as described in any of the implementations of the second aspect.
In a tenth aspect, an embodiment of the present disclosure provides a model training system, including a server as described in any implementation manner of the seventh aspect and a client as described in any implementation manner of the eighth aspect.
At present, large-scale pre-training models achieve good results in multi-task scenarios; in general, the more tasks and the more data, the greater and more obvious the improvement. However, in practical scenarios, different tasks are often distributed across the computing devices of different users or collective users, and because of data-compliance requirements, data cannot be shared directly between different users/tenants, which results in a "data island" phenomenon: different users each hold their own data and task types and cannot directly exchange them. In addition, since data labeling is expensive, each task usually has only a small amount of labeled training data, resulting in weak model capability and poor generalization.
According to the model training method, device, storage medium, client, server and system of the present disclosure, a federated learning approach is adopted: the local pre-training language models from the clients are weighted and aggregated at the server side to obtain the global pre-training language model. At the client, the global pre-training language model sent by the server is used to generate a pseudo training sample set corresponding to the client's local training sample set. The local pre-training language model is updated based on the local training sample subsets and the pseudo training sample set, and the client's current local pre-training language model is finally sent to the server so that the server can perform weighted aggregation to update the global pre-training language model. As a result, the labeled data of different users can be utilized, the utilization efficiency of labeled data is improved, and the generalization capability of the multi-task pre-training language model is enhanced.
Drawings
Other features, objects, and advantages of the disclosure will become apparent from a reading of the following detailed description of non-limiting embodiments which proceeds with reference to the accompanying drawings. The drawings are only for purposes of illustrating the particular embodiments and are not to be construed as limiting the invention. In the drawings:
FIG. 1 is a system architecture diagram of one embodiment of a model training system according to the present disclosure;
FIG. 2A is a timing diagram for one embodiment of a model training system according to the present disclosure;
FIG. 2B is an exploded flow diagram for one embodiment of step 205, according to the present disclosure;
FIG. 2C is a schematic diagram of an application scenario of one embodiment of a model training system according to the present disclosure;
FIG. 3 is a flow diagram for one embodiment of a global model training method applied to a server, according to the present disclosure;
FIG. 4 is a flow diagram of one embodiment of a local model training method applied to a client, according to the present disclosure;
FIG. 5 is a schematic structural diagram illustrating one embodiment of a global model training apparatus applied to a server according to the present disclosure;
FIG. 6 is a schematic block diagram illustrating one embodiment of a local model training apparatus applied to a client according to the present disclosure;
FIG. 7 is a schematic block diagram of a computer system suitable for use with a server or client that implements embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 illustrates an exemplary system architecture 100 to which one embodiment of the model training system of the present disclosure may be applied.
As shown in FIG. 1, model training system 100 may include clients 101, 102, 103, network 104, and server 105. Network 104 is the medium used to provide communication links between clients 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use clients 101, 102, 103 to interact with server 105 over network 104 to receive or send messages, etc. Various communication client applications, such as a model training application, a natural language processing application, an audio and video conference application, a voice recognition application, a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the clients 101, 102, and 103.
The clients 101, 102, 103 may be hardware or software. When the clients 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting sound capture and/or video capture, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the clients 101, 102, 103 are software, they may be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not specifically limited here.
The server 105 may be a server providing various services, such as a background server providing support for natural language processing type applications displayed on the clients 101, 102, 103. The background server may analyze and process the received data such as the local pre-training language model, and feed back a processing result (e.g., the global pre-training language model) to the client.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the local model training method applied to the client provided by the present disclosure is generally executed by the client 101, 102, 103, and accordingly, the local model training apparatus applied to the client is generally disposed in the client 101, 102, 103.
It should be noted that the global model training method applied to the server provided by the present disclosure is generally executed by the server 105, and accordingly, the global model training device applied to the server is generally disposed in the server 105.
It should be understood that the number of clients, networks, and servers in fig. 1 is merely illustrative. There may be any number of clients, networks, and servers, as desired for an implementation.
With continued reference to FIG. 2A, a timing sequence 200 of one embodiment of a model training system according to the present disclosure is illustrated. The model training system in the embodiment of the disclosure may include a client and a server. The sequence 200 includes the following steps:
step 201, the server sends the global pre-training language model to each client.
In this embodiment, the server may send the global pre-trained language model to each client. Here, each of the clients may be a client having a model training service binding relationship with the server. And each client is used for training based on the respective local training sample set and the global pre-training language model received from the server so as to obtain the local pre-training language model of each client. The global pre-training language model may have the same model structure and model parameter settings as the local pre-training language model of each client, but the values of the model parameters may be the same or different.
Step 202, in response to receiving the global pre-training language model sent by the server, the client inputs the local sample data in the local training sample set into the global pre-training language model respectively, and obtains a pseudo label corresponding to the corresponding local sample data.
Here, the client may store a local training sample set, where a local training sample may include local sample data and a corresponding local tag. Here, the local tag corresponding to the local sample data may be obtained by manual labeling. The local sample data in the local training sample set are respectively input into the global pre-training language model to obtain the pseudo label corresponding to the respective local sample data. Here, the pseudo label reflects the aggregated training results of the clients, since it is produced by the global pre-training language model that the server obtains after aggregating the local pre-training language models of the clients; such a label does not necessarily achieve a particularly good effect on the local training sample set of a specific client.
Step 203, the client generates a pseudo training sample in the pseudo training sample set by using the local sample data in the local training sample set and the corresponding pseudo label.
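As an illustration of steps 202 and 203, the sketch below shows one way a client could turn its local sample data into pseudo training samples using the received global model. It is a minimal sketch assuming a text-to-text model that exposes a hypothetical generate method; the function and variable names are illustrative and not part of the disclosure.

```python
def build_pseudo_training_set(global_model, local_training_set):
    """Steps 202-203 (sketch): use the global model's output on each item of local
    sample data as that item's pseudo label.

    local_training_set: iterable of (local_sample_data, local_label) pairs.
    """
    pseudo_training_set = []
    for sample_data, _local_label in local_training_set:
        # `generate` stands in for whatever text-to-text decoding call the
        # concrete pre-training language model provides (an assumption).
        pseudo_label = global_model.generate(sample_data)
        pseudo_training_set.append((sample_data, pseudo_label))
    return pseudo_training_set
```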
And 204, the client determines the global pre-training language model as a current local pre-training language model, and divides the local training sample set and the corresponding pseudo training sample set into at least two local training sample subsets and corresponding pseudo training sample subsets.
Here, the local training sample set and the pseudo training sample set are divided into at least two batches (Batch) of local training sample subsets and corresponding pseudo training sample subsets.
In step 205, for each batch of local training sample subsets and corresponding pseudo training sample subsets, the client updates the current local pre-training language model based on the local training sample subsets and corresponding pseudo training sample subsets.
Here, the client may update the current local pre-training language model based on each batch of the local training sample subset and the corresponding pseudo training sample subset by using various implementations, and after at least two batches of the local training sample subset and the corresponding pseudo training sample subset are updated, the model parameters of the local pre-training language model of the client are optimized.
In step 206, the client sends the current local pre-trained language model to the server.
After step 205, the model parameters of the local pre-training language model of the client are optimized, and therefore, the client may send the current local pre-training language model to the server, so that the server may aggregate the current local pre-training language model to form the global pre-training language model.
And step 207, the server receives the local pre-training language models returned by the clients after responding to the local training of the global pre-training language models.
In step 208, the server determines the weighting coefficients for each local pre-trained language model.
Here, the server may determine the weight coefficients of the local pre-training language models returned by each client in various implementations, so as to perform weighted aggregation on different local pre-training language models in subsequent steps.
As an example, for the local pre-training language model returned by each client, the server may determine, as the weight coefficient of the local pre-training language model returned by the client, a ratio of the training sample number corresponding to the client to the sum of the training sample numbers corresponding to all clients.
Assume there are K clients in total, where $u_i$ is the i-th client and i is a positive integer between 1 and K. Specifically, this can be formulated as follows:

$$\delta_i = \frac{n_i}{\sum_{k=1}^{K} n_k}$$

where $n_i$ is the number of training samples of the i-th client, and $\delta_i$ is the calculated weight coefficient of the local pre-training language model returned by the i-th client.
And 209, performing weighted aggregation on each local pre-training language model by the server according to the corresponding weight coefficient to obtain an updated global pre-training language model.
Continuing with the above assumptions, and letting $F_i$ denote the local pre-training language model returned by client $u_i$, step 209 can be formulated as follows:

$$F_{global} = \sum_{i=1}^{K} \delta_i F_i$$

where $F_{global}$ is the computed global pre-training language model.
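As a sketch of steps 208 and 209 under the assumption that each client returns its model as a PyTorch-style state dict with identical keys, the sample-count weighting and the weighted aggregation could look as follows; the helper names are illustrative.

```python
import torch

def sample_count_weights(sample_counts):
    """Step 208 (example weighting): delta_i = n_i / sum_k n_k."""
    total = float(sum(sample_counts))
    return [n / total for n in sample_counts]

def aggregate(local_state_dicts, weights):
    """Step 209: F_global = sum_i delta_i * F_i, computed parameter by parameter."""
    global_state = {}
    for key in local_state_dicts[0]:
        global_state[key] = sum(
            w * sd[key].float() for w, sd in zip(weights, local_state_dicts)
        )
    return global_state
```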
Through the steps 201 to 209, both the global pre-training language model stored in the server side and the local pre-training language model stored in the client side are updated and optimized.
In some optional embodiments, the above flow 200 may further include the following step 210:
at step 210, the server determines whether to end the global model update operation.
Here the global model update operation may include steps 201 through 209. The server may determine whether to end the global model update operation using various implementations according to actual needs. If it is determined that the global model updating operation is not finished, the process may proceed to step 201 to continue.
As an example, the server may determine to end the global model updating operation when the overall difference between the global pre-training language model and the local pre-training language model sent by each client is smaller than a preset difference threshold, and otherwise determine not to end the global model updating operation when the difference is not smaller than the preset difference threshold, and go to step 201 to continue execution. Here, the difference between the models may be calculated in various ways. For example, assume that model M1 and model M2 each have N identical parameters, but the model parameters may take different values. By using N parameter values of the model M1 to form the N-dimensional vector V1 and N parameter values of the model M2 to form the N-dimensional vector V2, the difference between the model M1 and the model M2 can be converted into a distance problem between the vector V1 and the vector V2, for example, the difference between the model M1 and the model M2 can be calculated using euclidean distance.
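The Euclidean-distance example above can be sketched as follows, assuming both models are available as PyTorch-style state dicts with the same parameter layout; this is only an illustration of the difference measure, not a required implementation.

```python
import torch

def model_distance(state_a, state_b):
    """Flatten each model's parameters into a single vector and return the
    Euclidean (L2) distance between the two vectors."""
    vec_a = torch.cat([p.flatten().float() for p in state_a.values()])
    vec_b = torch.cat([p.flatten().float() for p in state_b.values()])
    return torch.dist(vec_a, vec_b, p=2).item()
```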
Optionally, step 210 may also be performed as follows:
it is determined whether the number of times the global model update operation is performed is less than a first preset number of times. If the determination result is less than the predetermined threshold, it is determined not to end the global model updating operation, and the process goes to step 201 to continue the execution. Otherwise, if the global model updating operation is determined to be not smaller than the preset threshold, the global model updating operation is determined to be ended. In this way, it is ensured that steps 201 to 209 are performed a first preset number of times, and that a sufficient number of times of training on the global pre-training language model and each local pre-training language model is achieved.
In some alternative embodiments, step 205 may include steps 2051 through 2056, as shown in fig. 2B:
step 2051, according to the first loss function, calculating a first loss of the current local pre-training language model to the pseudo-training sample subset, and optimizing the current local pre-training language model by using the first loss to obtain an optimized local pre-training language model.
Here, it can be assumed that there are K clients, where $u_i$ is the i-th client and i is a positive integer between 1 and K. $(X_i, Y_i)$ is the local training sample set of client $u_i$, where $X_i$ is the local sample data set of client $u_i$ and $Y_i$ is the local label set of client $u_i$. $(X_i, \hat{Y}_i)$ is the pseudo training sample set of client $u_i$, where $\hat{Y}_i$ is the pseudo label set of client $u_i$. $(X_i^j, Y_i^j)$ is the j-th local training sample subset of client $u_i$, and $(X_i^j, \hat{Y}_i^j)$ is the j-th pseudo training sample subset of client $u_i$ corresponding to the local training sample subset $(X_i^j, Y_i^j)$. $F_i$ is the current local pre-training language model of client $u_i$, and $x_{i,k}^j$ is the k-th item of local sample data in $X_i^j$, shared by the local training sample subset and the pseudo training sample subset.

Here, step 2051 can be formulated as follows:

$$\hat{y}_{i,k}^{j,\mathrm{pred}} = F_i\big(x_{i,k}^j\big)$$

$$\mathcal{L}_1 = \sum_{k=1}^{b_i} \ell_1\big(\hat{y}_{i,k}^{j,\mathrm{pred}},\ \hat{y}_{i,k}^j\big)$$

$$F_i' = F_i - \alpha\, \nabla_{F_i}\,\mathcal{L}_1$$

Here, $\hat{y}_{i,k}^{j,\mathrm{pred}}$ is the actual output obtained by inputting the local sample data $x_{i,k}^j$ of the pseudo sample subset $(X_i^j, \hat{Y}_i^j)$ into the current local pre-training language model $F_i$ of client $u_i$. For each item of sample data $x_{i,k}^j$ in the pseudo sample subset, the first loss function $\ell_1$ is calculated between this actual output and the corresponding pseudo label $\hat{y}_{i,k}^j$ in the pseudo sample subset, and the first losses of the individual samples are then summed; $b_i$ is the number of samples in $X_i^j$ and $\hat{Y}_i^j$. Here, $\mathcal{L}_1$ is the first loss of the current local pre-training language model $F_i$ of client $u_i$ for the pseudo training sample subset $(X_i^j, \hat{Y}_i^j)$.

The last formula means that the first loss $\mathcal{L}_1$ obtained by the above calculation is used to optimize the current local pre-training language model $F_i$ of client $u_i$ (here written as one gradient step with learning rate $\alpha$), obtaining the optimized local pre-training language model $F_i'$. It should be noted that the current local pre-training language model $F_i$ of client $u_i$ is itself not updated in this step.
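A minimal sketch of step 2051, assuming a PyTorch-style model whose forward pass returns logits; the cross-entropy first loss and the inner learning rate alpha are illustrative choices rather than anything prescribed by the disclosure.

```python
import copy
import torch
import torch.nn.functional as F

def virtual_update(current_model, pseudo_batch, alpha=1e-3):
    """One gradient step on the first loss over the pseudo batch yields the
    optimized model F'_i, while the current local model F_i is left unchanged."""
    inputs, pseudo_labels = pseudo_batch
    optimized = copy.deepcopy(current_model)       # F_i itself is not updated
    first_loss = F.cross_entropy(optimized(inputs), pseudo_labels)
    grads = torch.autograd.grad(first_loss, list(optimized.parameters()))
    with torch.no_grad():
        for param, grad in zip(optimized.parameters(), grads):
            param -= alpha * grad                  # F'_i = F_i - alpha * grad(L1)
    return optimized
```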
And step 2052, calculating, according to a second loss function, a second loss of the optimized local pre-training language model for the local training sample subset.
Here, the second loss function adds a corresponding loss perturbation to the loss of the local training samples.
Continuing with the above assumptions, step 2052 can be formulated as follows:

$$y_{i,k}^{j,\mathrm{pred}} = F_i'\big(x_{i,k}^j\big)$$

$$\mathcal{L}_2 = \sum_{k=1}^{b_i} \epsilon_k\, \ell_2\big(y_{i,k}^{j,\mathrm{pred}},\ y_{i,k}^j\big)$$

Here, $y_{i,k}^{j,\mathrm{pred}}$ is the actual output obtained by inputting the local sample data $x_{i,k}^j$ of the local sample subset $(X_i^j, Y_i^j)$ into the optimized local pre-training language model $F_i'$ of client $u_i$. For each item of sample data $x_{i,k}^j$ in the local sample subset, the second loss function $\ell_2$ is calculated between this actual output and the corresponding local label $y_{i,k}^j$ in the local sample subset, and the per-sample second losses are summed with each one weighted by its loss perturbation $\epsilon_k$, giving $\mathcal{L}_2$. Here, $\mathcal{L}_2$ is the second loss of the optimized local pre-training language model $F_i'$ of client $u_i$ for the local training sample subset $(X_i^j, Y_i^j)$. In practice, the loss perturbations $\epsilon_k$ corresponding to the items of sample data in the local sample subset may all be the same constant, for example 1; they may also take different values, for example values chosen randomly between 0 and 1.
It should be noted that the first loss function and the second loss function may be various now known or later developed loss functions, and the present disclosure is not limited thereto.
And step 2053, reversely propagating the second loss in the optimized local pre-training language model to obtain the loss disturbance gradient of each local training sample and the corresponding pseudo training sample in the local training sample subset.
Step 2053 can be formulated as follows:

$$g_{i,k}^j = \frac{\partial \mathcal{L}_2}{\partial \epsilon_k}, \qquad k = 1, \dots, b_i$$

where $g_{i,k}^j$ is the loss disturbance gradient of the k-th local training sample, and of the corresponding pseudo training sample, in the local training sample subset $(X_i^j, Y_i^j)$, obtained by back-propagating the second loss $\mathcal{L}_2$ in the optimized local pre-training language model $F_i'$. That is to say, there are $b_i$ loss disturbance gradients for the batch.
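A literal sketch of steps 2052 and 2053 under the same assumptions: each local sample's second loss is weighted by a loss perturbation (here the constant 1, one of the options mentioned above), and back-propagation yields one loss disturbance gradient per sample. The cross-entropy second loss is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def loss_disturbance_gradients(optimized_model, local_batch):
    """Returns g_k = dL2/d(eps_k), i.e. one loss disturbance gradient per sample
    in the local (and corresponding pseudo) training sample subset."""
    inputs, local_labels = local_batch
    # Per-sample loss perturbations eps_k, here initialised to the constant 1.
    eps = torch.ones(local_labels.shape[0], requires_grad=True)
    per_sample_loss = F.cross_entropy(optimized_model(inputs), local_labels,
                                      reduction="none")
    second_loss = torch.sum(eps * per_sample_loss)   # L2 with loss perturbations
    grads = torch.autograd.grad(second_loss, eps)[0]
    return grads.detach()
```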
And step 2054, for each pseudo training sample in the pseudo training sample subset, determining the sample weight of the pseudo training sample on the current local pre-training language model according to the loss disturbance gradient of the pseudo training sample.
In step 2051, the current local pre-training language model $F_i$ of client $u_i$ learns the pseudo labels $\hat{y}_{i,k}^j$ in the pseudo sample subset $(X_i^j, \hat{Y}_i^j)$, and the optimized local pre-training language model $F_i'$ of client $u_i$ is obtained.

In steps 2052 and 2053, the optimized local pre-training language model $F_i'$ of client $u_i$ is evaluated against the local labels $y_{i,k}^j$ in the local training sample subset $(X_i^j, Y_i^j)$: the calculated second loss measures the difference between the output of $F_i'$ and the local labels $y_{i,k}^j$, and the corresponding loss disturbance gradients $g_{i,k}^j$ are calculated.

Here, the loss disturbance gradient $g_{i,k}^j$ is calculated to measure the similarity between the local label $y_{i,k}^j$ in the local training sample subset $(X_i^j, Y_i^j)$ and the pseudo label $\hat{y}_{i,k}^j$ in the pseudo sample subset $(X_i^j, \hat{Y}_i^j)$. The larger the loss disturbance gradient $g_{i,k}^j$, the less similar the pseudo label and the local label; that is, the smaller the gain the pseudo training sample brings to client $u_i$, and the smaller the sample weight the corresponding pseudo training sample should have. Thus, here, the sample weight of a pseudo training sample for the current local pre-training language model is inversely related to the loss disturbance gradient of the pseudo training sample. In practice, various negative-correlation schemes may be used to determine, for each pseudo training sample in the pseudo training sample subset, the sample weight of the pseudo training sample for the current local pre-training language model according to the loss disturbance gradient of the pseudo training sample.

Alternatively, step 2054 may be performed as follows: for each pseudo training sample $(x_{i,k}^j, \hat{y}_{i,k}^j)$ in the pseudo training sample subset $(X_i^j, \hat{Y}_i^j)$, the greater of a preset small non-negative sample weight $c$ and the negative of the loss disturbance gradient of the pseudo training sample, $-g_{i,k}^j$, is determined as the sample weight $w_{i,k}^j$ of the pseudo training sample for the current pre-training language model $F_i$. Here, the preset small non-negative sample weight may be a preset non-negative value, for example 0.

Specifically, this alternative can be formulated as follows:

$$w_{i,k}^j = \max\big(-g_{i,k}^j,\ c\big)$$

where $c$ is the preset small non-negative sample weight. From the above formula it can be seen that when $g_{i,k}^j$ is non-negative, the pseudo label is not similar to the local label; that is, the pseudo training sample $(x_{i,k}^j, \hat{y}_{i,k}^j)$ brings no gain, or only a small gain, to client $u_i$, and the corresponding pseudo training sample should have the small non-negative weight $c$. When $c$ is 0, the weight of the pseudo training sample for client $u_i$ is 0, i.e., there is no gain. When $g_{i,k}^j$ is negative, the pseudo label is similar to the local label; that is, the pseudo training sample $(x_{i,k}^j, \hat{y}_{i,k}^j)$ brings a larger gain to client $u_i$, and the smaller $g_{i,k}^j$ is, the greater the gain the pseudo training sample brings to client $u_i$.

By adopting the above alternative, a non-negative sample weight $w_{i,k}^j$ is obtained for each pseudo training sample $(x_{i,k}^j, \hat{y}_{i,k}^j)$ in the pseudo training sample subset $(X_i^j, \hat{Y}_i^j)$.
Optionally, after the non-negative sample weight $w_{i,k}^j$ of each pseudo training sample $(x_{i,k}^j, \hat{y}_{i,k}^j)$ in the pseudo training sample subset $(X_i^j, \hat{Y}_i^j)$ is obtained, the sample weights of the pseudo training samples in the pseudo training sample subset for the current pre-training language model $F_i$ may also be normalized. Specifically, this can be formulated as follows:

$$\tilde{w}_{i,k}^j = \frac{w_{i,k}^j}{\sum_{m=1}^{b_i} w_{i,m}^j}$$

where $\tilde{w}_{i,k}^j$ is the finally calculated sample weight of the pseudo training sample $(x_{i,k}^j, \hat{y}_{i,k}^j)$ in the pseudo training sample subset $(X_i^j, \hat{Y}_i^j)$ for the current pre-training language model $F_i$.
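The weight rule of step 2054 and the optional normalization follow directly from the formulas above; in this sketch c defaults to 0, as in the example.

```python
import torch

def pseudo_sample_weights(disturbance_grads, c=0.0):
    """w_k = max(-g_k, c), then normalise so the weights of the batch sum to 1."""
    raw = torch.clamp(-disturbance_grads, min=c)
    total = raw.sum()
    if total > 0:
        return raw / total
    return raw  # every weight equals c = 0: no pseudo sample contributes this batch
```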
And step 2055, calculating a third loss of the current local pre-training language model for the batch of local training sample subsets and corresponding pseudo training sample subsets according to a third loss function.
The third loss function is the sum of a fourth loss function and a first loss function weighted by the sample weight, the first loss function weighted by the sample weight is obtained by weighting the first loss function according to the sample weight of the corresponding pseudo training sample, and the fourth loss function is the loss of the current local pre-training language model on the local training samples in the local training sample subset.
Specifically, this can be formulated as follows:

$$\mathcal{L}_4 = \sum_{k=1}^{b_i} \ell_4\big(F_i(x_{i,k}^j),\ y_{i,k}^j\big)$$

$$\mathcal{L}_3 = \mathcal{L}_4 + \sum_{k=1}^{b_i} w_{i,k}^j\, \ell_1\big(F_i(x_{i,k}^j),\ \hat{y}_{i,k}^j\big)$$

where $\mathcal{L}_3$ is the third loss and $\ell_4$ is the fourth loss function, so that $\mathcal{L}_4$ is the loss of the current local pre-training language model $F_i$ on the local training samples in the local training sample subset; $w_{i,k}^j$ is the sample weight of the corresponding pseudo training sample (or the normalized weight $\tilde{w}_{i,k}^j$ when normalization is applied). $\mathcal{L}_3$ is the third loss of the current local pre-training language model $F_i$ for the local training sample subset $(X_i^j, Y_i^j)$ and the corresponding pseudo training sample subset $(X_i^j, \hat{Y}_i^j)$.
And step 2056, updating the model parameters of the current local pre-training language model according to the third loss.
Specifically, this can be formulated as follows:

$$F_i \leftarrow F_i - \beta\, \nabla_{F_i}\,\mathcal{L}_3$$

That is, the model parameters of the current local pre-training language model $F_i$ are updated according to the third loss $\mathcal{L}_3$ (here written as a gradient step with learning rate $\beta$).

Through the above alternative embodiment, in addition to updating the model parameters of the current local pre-training language model $F_i$ of client $u_i$, the loss disturbance gradient $g_{i,k}^j$ corresponding to each pseudo training sample in the pseudo training sample set $(X_i, \hat{Y}_i)$ of client $u_i$ is also obtained.
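A sketch of steps 2055 and 2056 under the same PyTorch-style assumptions: the third loss adds the sample-weighted first loss on the pseudo batch to the fourth loss on the local batch, and one optimizer step updates the current local model F_i. The loss functions and the optimizer are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def update_local_model(current_model, local_batch, pseudo_batch, sample_weights,
                       optimizer):
    """Third loss = fourth loss (local batch) + weighted first loss (pseudo batch),
    followed by one parameter update of the current local model F_i."""
    x_local, y_local = local_batch
    x_pseudo, y_pseudo = pseudo_batch
    fourth_loss = F.cross_entropy(current_model(x_local), y_local)
    pseudo_losses = F.cross_entropy(current_model(x_pseudo), y_pseudo,
                                    reduction="none")
    third_loss = fourth_loss + torch.sum(sample_weights * pseudo_losses)
    optimizer.zero_grad()
    third_loss.backward()
    optimizer.step()
    return third_loss.item()
```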
Based on the above optional embodiment, in step 206, the client sends the current local pre-trained language model to the server, which may be performed as follows:
and the client side sends the loss disturbance gradient of each pseudo training sample in the current local pre-training language model and the pseudo training sample set to the server.
So that the server can receive the client u i Sent current local pre-training language model F of the client i Besides, the pseudo training sample set of the client can be received
Figure BDA0003776287460000183
Loss perturbation gradient of each pseudo training sample
Figure BDA0003776287460000184
Accordingly, optionally, the server determines the weighting factor of each local pre-trained language model in step 208, and may also perform as follows:
firstly, for each local pre-training language model, determining the sum of the loss disturbance gradients of each local training sample corresponding to the local pre-training language model as the disturbance gradient sum of the local pre-training language model.
Then, for each local pre-training language model, the ratio of the disturbance gradient sum of that local pre-training language model to the total of the disturbance gradient sums of all the local pre-training language models is determined as the disturbance gradient sum proportion of that local pre-training language model.
Continuing with the above example, the disturbance gradient sum proportion of the local pre-training language model $F_i$ returned by client $u_i$ can specifically be formulated as follows:

$$p_i = \frac{\sum_{k} g_{i,k}}{\sum_{m=1}^{K} \sum_{k} g_{m,k}}$$

where $p_i$ is the calculated disturbance gradient sum proportion of the local pre-training language model $F_i$ returned by client $u_i$, $g_{i,k}$ is the loss disturbance gradient of the k-th training sample of client $u_i$, and the sums over k run over all training samples of the respective clients.
And finally, for each local pre-training language model, a weight coefficient of the local pre-training language model is determined according to the disturbance gradient sum proportion of the local pre-training language model.

Here, the weight coefficient of the local pre-training language model may be determined from the disturbance gradient sum proportion in various ways, provided that the weight coefficient is positively correlated with the disturbance gradient sum proportion of the local pre-training language model.

Optionally, to ensure that the weight coefficient of the local pre-training language model is a non-negative value, so as to facilitate the subsequent weighted aggregation into the global pre-training language model, the greater of the disturbance gradient sum proportion of the local pre-training language model and a preset small non-negative weight coefficient may be determined as the weight coefficient of the local pre-training language model.
Specifically, this can be formulated as follows:

$$\delta_i = \max\big(p_i,\ w_c\big)$$

where $w_c$ is the preset small non-negative weight coefficient. For example, the preset small non-negative weight coefficient may be zero.
In this way, the weight coefficient of the local pre-training language model is ensured to be a non-negative value, with the preset small non-negative weight coefficient as its minimum value.
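On the server side, the disturbance-gradient-based weighting described above can be sketched as follows; each client is assumed to report the sum of its loss disturbance gradients, and w_c defaults to 0.

```python
def disturbance_based_weights(per_client_gradient_sums, w_c=0.0):
    """p_i = client i's disturbance gradient sum / total over all clients;
    delta_i = max(p_i, w_c)."""
    total = float(sum(per_client_gradient_sums))
    return [max(g / total, w_c) for g in per_client_gradient_sums]
```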
By adopting the above optional implementation of step 205, pseudo training samples are generated using the global pre-training language model sent by the server, which performs data augmentation, reduces the amount of labeled training data required, and thus reduces the labeling cost of training samples. In addition, the generated pseudo training samples are re-weighted using a meta-learning method, and the server uses the clients' loss disturbance gradients to set the weights of the different client models when performing weighted aggregation into the global model, so that the local pre-training language model of each client can be weighted according to the contribution of that client's training samples to the global pre-training language model.
In some optional embodiments, the process 200 may further include the following step 211:
in step 211, the client executes the target natural language processing task by using the local pre-training language model.
After the client is trained to obtain the local pre-training language model of the client, the trained local pre-training language model can be used for executing the target natural language processing task. Here, the target natural language processing task is a natural language processing task provided by the client corresponding to the client's own pre-trained language model.
It should be noted that the natural language processing tasks corresponding to different clients may be the same or different.
In the model training system provided by the above embodiment of the present disclosure, a federated learning approach is used to weight and aggregate the local pre-training language models from the clients at the server side, thereby obtaining the global pre-training language model. At the client, the global pre-training language model sent by the server is used to generate a pseudo training sample set corresponding to the client's local training sample set. The local pre-training language model is updated based on the local training sample subsets and the pseudo training sample set, and the client's current local pre-training language model is finally sent to the server so that the server can perform weighted aggregation to update the global pre-training language model. As a result, the labeled data of different users can be utilized, the utilization efficiency of labeled data is improved, and the generalization capability of the multi-task pre-training language model is enhanced.
Referring now to FIG. 2C, FIG. 2C illustrates an application scenario diagram of one embodiment of a model training system according to the present disclosure. As shown in fig. 2C, the system as a whole includes a server 220 and K clients 231, \ 8230;, 23K. Wherein, each client can correspond to different users (individual users or collective users), and data between different clients cannot be shared.
After receiving the local pre-training language models F_1, F_2, …, F_K returned by the K clients, the server 220 determines the weight coefficient of each local pre-training language model and obtains the global pre-training language model F_global by weighted aggregation.
Each client uses the global pre-training language model F_global sent by the server 220, together with its local training sample set, to generate a pseudo training sample set. Model training is then performed based on the local training sample set, the global pre-training language model, and the pseudo training sample set to obtain the client's local pre-training language model, which compensates for the shortage of labeled training samples. After the local pre-training language model is updated, the client may send the updated local pre-training language model to the server 220 for weighted aggregation into the global pre-training language model.
Optionally, when determining the weight coefficients of the local pre-training language models, the server 220 may use the loss disturbance gradients sent by the clients. In this way, the global pre-training language model F_global can pay more attention to local pre-training language models whose performance is weak.
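A minimal sketch of the weighted aggregation performed by the server 220 is given below. Representing each local model by its PyTorch state_dict and normalising the weight coefficients are implementation assumptions of the sketch, not requirements of the disclosure.

```python
import torch

def weighted_aggregate(local_state_dicts, weight_coeffs):
    """Aggregate the local pre-training language models F_1..F_K into F_global
    as a weight-coefficient-weighted average of their parameters."""
    total = sum(weight_coeffs)
    norm = ([w / total for w in weight_coeffs] if total
            else [1.0 / len(weight_coeffs)] * len(weight_coeffs))
    global_state = {}
    for key in local_state_dicts[0]:
        global_state[key] = sum(w * sd[key].float()
                                for w, sd in zip(norm, local_state_dicts))
    return global_state

# Toy usage with two "clients":
clients = [torch.nn.Linear(4, 2) for _ in range(2)]
f_global = weighted_aggregate([m.state_dict() for m in clients], [0.7, 0.3])
```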
With continued reference to FIG. 3, a flow 300 of one embodiment of a global model training method according to the present disclosure is shown. The model training method is applied to a server and comprises the following steps:
step 301, sending the global pre-training language model to each client.
Step 302, receiving the local pre-training language model returned by each client after responding to the local training of the global pre-training language model.
Step 303, determining the weight coefficient of each local pre-training language model.
Step 304, performing weighted aggregation on each local pre-training language model according to the corresponding weight coefficient to obtain an updated global pre-training language model.
In this embodiment, the specific operations of step 301, step 302, step 303, and step 304 and the technical effects thereof are substantially the same as the operations and effects of step 201, step 207, step 208, and step 209 in the embodiment shown in fig. 2, and are not described herein again.
In some optional embodiments, the above process 300 may further include the following step 305:
step 305, it is determined whether to end the global model update operation.
Here, the detailed operations of the alternative implementation of step 305 and the technical effects thereof are substantially the same as those described in step 210 in the embodiment shown in fig. 2, and are not repeated herein.
Here the global model update operation may comprise step 301, step 302, step 303 and step 304. The server may determine whether to end the global model update operation using various implementations according to actual needs. If it is determined that the global model update operation is not finished, the process may proceed to step 301 to continue.
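The loop structure of flow 300, including the end test of step 305, can be sketched as follows. The communication callables (send_to_clients, receive_from_clients) and the assumption that clients also return their loss disturbance gradients are placeholders introduced for this sketch; the disclosure does not prescribe a transport mechanism.

```python
def run_global_training(global_model, send_to_clients, receive_from_clients,
                        compute_weight_coefficients, weighted_aggregate,
                        first_preset_number=10):
    """One call performs repeated global model update operations (steps 301-305)."""
    rounds_done = 0
    while rounds_done < first_preset_number:                    # step 305: end test
        send_to_clients(global_model)                           # step 301
        local_models, perturb_grads = receive_from_clients()    # step 302
        coeffs = compute_weight_coefficients(perturb_grads)     # step 303
        global_model = weighted_aggregate(local_models, coeffs) # step 304
        rounds_done += 1
    return global_model
```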
In the method provided by this embodiment of the disclosure, the server sends the global pre-training language model to the clients and performs weighted aggregation of the local pre-training language models returned by the clients, thereby obtaining the updated global pre-training language model. The global pre-training language model can thus be trained with the labeled data of different clients, and through the global pre-training language model the local pre-training language model of each client can also benefit from the labeled data of other clients, which improves the generalization capability of both the global pre-training language model and the clients' local pre-training language models.
With continued reference to FIG. 4, a flow 400 of one embodiment of a local model training method according to the present disclosure is shown. The model training method is applied to a client and comprises the following steps:
step 401, in response to receiving the global pre-training language model sent by the server, respectively inputting the local sample data in the local training sample set into the global pre-training language model, and obtaining a pseudo label corresponding to the corresponding local sample data.
Step 402, generating a pseudo training sample in the pseudo training sample set by using the local sample data in the local training sample set and the corresponding pseudo label.
Step 403, determining the global pre-training language model as the current local pre-training language model.
Step 404, the local training sample set and the corresponding pseudo training sample set are divided into at least two local training sample subsets and corresponding pseudo training sample subsets.
Step 405, for each batch of local training sample subsets and corresponding pseudo training sample subsets, updating the current local pre-training language model based on the local training sample subsets and corresponding pseudo training sample subsets.
Step 406, the current local pre-trained language model is sent to the server.
In this embodiment, the specific operations of step 401, step 402, step 403, step 404, step 405, and step 406 and the technical effects thereof are substantially the same as the operations and effects of step 202, step 203, step 204, step 205, step 206, and step 207 in the embodiment shown in fig. 2A, and are not repeated herein.
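For concreteness, a simplified sketch of steps 401 to 406 is given below. It assumes a classification-style model, so the pseudo label is taken as the arg-max prediction; for the text-to-text setting of this disclosure the pseudo label would instead be the generated output sequence. The helper update_on_batch stands for the per-batch update of step 405 (a sketch of that update is given later, with the local model updating unit).

```python
import copy
import torch
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def make_pseudo_labels(global_model, local_inputs):
    # Steps 401-402: feed the local sample data through the global model and
    # keep its predictions as pseudo labels.
    return global_model(local_inputs).argmax(dim=-1)

def local_training_round(global_model, local_inputs, local_labels,
                         update_on_batch, batch_size=32):
    pseudo_labels = make_pseudo_labels(global_model, local_inputs)
    local_model = copy.deepcopy(global_model)                  # step 403
    loader = DataLoader(                                       # step 404: batch the
        TensorDataset(local_inputs, local_labels, pseudo_labels),  # paired sets
        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(local_model.parameters(), lr=1e-4)
    for x, y, y_pseudo in loader:                              # step 405
        update_on_batch(local_model, optimizer, x, y, x, y_pseudo)
    return local_model                                         # step 406: send to server
```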
In the local model training method provided by the above embodiment of the present disclosure, the client first uses the global pre-training language model sent by the server to generate a pseudo training sample set corresponding to its local training sample set. The local pre-training language model is then updated based on the local training sample subsets and the pseudo training sample set, and the client's current local pre-training language model is finally sent to the server so that the server can perform weighted aggregation into the global pre-training language model. By training the local pre-training language model of the client in this way, the labeled data of different clients can be utilized, the utilization efficiency of the labeled data is improved, and the generalization capability of the multi-task pre-training language model is enhanced.
The disclosure also provides an embodiment of a task processing method. The task processing method is applied to a client, and comprises the following steps:
Executing the target natural language processing task by using the local pre-training language model.
Here, the local pre-trained language model of the client is pre-trained by the local model training method as shown in the embodiment shown in fig. 4 and its optional implementation.
Here, the target natural language processing task is a natural language processing task provided by the client corresponding to the local pre-trained language model of the client. For example, the target natural language processing task may be a machine translation task.
It should be noted that the natural language processing tasks corresponding to different clients may be the same or different.
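As a purely illustrative usage sketch, suppose the client's trained local pre-training language model is a text-to-text model saved in Hugging Face transformers format (an assumption; the disclosure does not mandate any particular toolkit, and the checkpoint path below is hypothetical). A machine translation request could then be served as follows:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "./local_pretrained_lm"        # hypothetical local checkpoint path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("translate English to German: The weather is nice today.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```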
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a global model training apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 3, and the apparatus may be specifically applied to various servers.
As shown in fig. 5, the global model training apparatus 500 of the present embodiment includes: a global model update unit 501 configured to perform the following global model update operations: sending the global pre-training language model to each client; receiving a local pre-training language model returned by each client after responding to the local training of the global pre-training language model; determining a weight coefficient of each local pre-training language model; and carrying out weighted aggregation on each local pre-training language model according to the corresponding weight coefficient to obtain an updated global pre-training language model.
In this embodiment, the detailed processing of the global model updating unit 501 of the global model training apparatus 500 and the technical effects thereof can refer to the related descriptions of step 301, step 302, step 303 and step 304 in the corresponding embodiment of fig. 3, which are not described herein again.
In some optional embodiments, the global model updating operation may further include: determining whether to end the global model update operation; in response to determining not to end, continuing to perform the global model update operation.
In some optional embodiments, the receiving, by each client, a local pre-trained language model returned after the local training of the global pre-trained language model may include: receiving, from each client, the local pre-training language model returned after the local training of the global pre-training language model together with the loss disturbance gradient of each local training sample in the corresponding local training sample set; and the determining the weight coefficient of each local pre-training language model may include: for each local pre-training language model, determining the sum of the loss disturbance gradients of the local training samples corresponding to the local pre-training language model as the sum of the disturbance gradients of the local pre-training language model; for each local pre-training language model, determining the ratio of the sum of the disturbance gradients of the local pre-training language model to the total of the sums of the disturbance gradients of all local pre-training language models as the ratio of the sum of the disturbance gradients of the local pre-training language model; and for each local pre-training language model, determining the weight coefficient of the local pre-training language model according to the ratio of the sum of the disturbance gradients of the local pre-training language model.
In some optional embodiments, determining the weight coefficient of the local pre-training language model according to the ratio of the sum of the disturbance gradients of the local pre-training language model may include: for each local pre-training language model, performing the following weight coefficient determination operation: determining whether the ratio of the sum of the disturbance gradients of the local pre-training language model is a positive number; in response to determining yes, determining the ratio of the sum of the disturbance gradients of the local pre-training language model as its weight coefficient; in response to determining no, determining a preset non-negative small weight coefficient as the weight coefficient of the local pre-training language model.
In some optional embodiments, the determining whether to end the global model update operation may include: determining whether the number of times of executing the global model updating operation is less than a first preset number of times; in response to determining yes, determining not to end the global model update operation; in response to a determination of no, determining to end the global model update operation.
It should be noted that, for details of implementation and technical effects of each unit in the global model training device provided in the embodiments of the present disclosure, reference may be made to descriptions of other embodiments in the present disclosure, and details are not described herein again.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a local model training apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 4, and the apparatus may be specifically applied to various clients.
As shown in fig. 6, the local model training apparatus 600 of the present embodiment includes: a pseudo label generating unit 601, a pseudo sample generating unit 602, a sample batching unit 603, a local model updating unit 604, and a local model transmitting unit 605. The pseudo label generating unit 601 is configured to respond to a received global pre-training language model sent by a server, and input local sample data in a local training sample set into the global pre-training language model respectively to obtain pseudo labels corresponding to corresponding local sample data, where the local training sample includes the local sample data and corresponding local labels; a pseudo sample generating unit 602 configured to generate a pseudo training sample in a pseudo training sample set by using local sample data in the local training sample set and a corresponding pseudo label; a sample batching unit 603 configured to determine the global pre-training language model as a current local pre-training language model and to divide the local training sample set and the corresponding pseudo training sample set into at least two local training sample subsets and corresponding pseudo training sample subsets; a local model updating unit 604 configured to, for each batch of the local training sample subset and the corresponding pseudo training sample subset, update the current local pre-training language model based on the local training sample subset and the corresponding pseudo training sample subset; a local model sending unit 605 configured to send the current local pre-training language model to the server.
In this embodiment, the detailed processing and the technical effects thereof of the pseudo label generating unit 601, the pseudo sample generating unit 602, the sample batching unit 603, the local model updating unit 604 and the local model sending unit 605 of the model training apparatus 600 can respectively refer to the related descriptions of step 401, step 402, step 403, step 404 and step 405 in the corresponding embodiment of fig. 4, and are not repeated herein.
In some optional embodiments, the local model updating unit 604 may be further configured to: calculating a first loss of the current local pre-training language model to the pseudo training sample subset according to a first loss function, and optimizing the current local pre-training language model by using the first loss to obtain an optimized local pre-training language model; calculating a second loss of the optimized local pre-training language model to the local training sample subset according to a second loss function, wherein the second loss function adds corresponding loss disturbance to the loss of the local training sample, and the second loss is reversely propagated in the optimized local pre-training language model to obtain a loss disturbance gradient of each local training sample and the corresponding pseudo training sample in the local training sample subset; for each pseudo training sample in the pseudo training sample subset, determining the sample weight of the pseudo training sample to the current local pre-training language model according to the loss disturbance gradient of the pseudo training sample; calculating a third loss of the current local pre-training language model for the batch of local training sample subsets and the corresponding pseudo training sample subsets according to a third loss function, wherein the third loss function is the sum of a fourth loss function and a first loss function weighted by sample weights, the first loss function weighted by the sample weights is a loss function obtained by weighting the first loss function according to the sample weights of the corresponding pseudo training samples, and the fourth loss function is the loss of the current local pre-training language model for the local training samples in the batch of local training sample subsets; and updating the model parameters of the current local pre-training language model according to the third loss.
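The per-batch update performed by the local model updating unit 604 can be sketched as follows. This is a reconstruction that follows the common "learning to reweight examples" meta-learning construction: per-sample perturbations are attached to the pseudo-sample losses, one differentiable inner step is taken, and the gradient of the loss on the real local samples with respect to the perturbations is used as the loss disturbance gradient. The exact point at which the disclosure introduces the loss disturbance may differ; PyTorch 2.x (for torch.func.functional_call) and a classification-style loss are further assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def update_on_batch(model, optimizer, local_x, local_y, pseudo_x, pseudo_y,
                    inner_lr=1e-3, w_min=0.0):
    # Per-sample perturbations (the "loss disturbance"); where they enter the
    # losses is an assumption of this sketch.
    eps = torch.zeros(pseudo_x.size(0), device=pseudo_x.device, requires_grad=True)

    # "First loss" on the pseudo batch and one differentiable inner step, giving
    # the optimized local pre-training language model as fast weights.
    pseudo_losses = F.cross_entropy(model(pseudo_x), pseudo_y, reduction="none")
    first_loss = (eps * pseudo_losses).sum()
    grads = torch.autograd.grad(first_loss, tuple(model.parameters()),
                                create_graph=True)
    fast_params = {name: p - inner_lr * g
                   for (name, p), g in zip(model.named_parameters(), grads)}

    # "Second loss" of the optimized model on the real local batch; back-propagate
    # to obtain the loss disturbance gradient of each sample.
    second_loss = F.cross_entropy(
        torch.func.functional_call(model, fast_params, (local_x,)), local_y)
    grad_eps = torch.autograd.grad(second_loss, eps)[0]

    # Sample weights of the pseudo samples: max(w_min, -gradient), then normalised.
    weights = torch.clamp(-grad_eps, min=w_min)
    weights = weights / weights.sum().clamp_min(1e-12)

    # "Third loss" = loss on the local samples ("fourth loss") plus the
    # sample-weighted pseudo-sample loss; ordinary optimizer step.
    third_loss = F.cross_entropy(model(local_x), local_y) + (
        weights.detach()
        * F.cross_entropy(model(pseudo_x), pseudo_y, reduction="none")).sum()
    optimizer.zero_grad()
    third_loss.backward()
    optimizer.step()
    return grad_eps.detach()
```

The returned grad_eps is the per-sample loss disturbance gradient that, as described next for the local model sending unit, can be reported to the server together with the current local pre-training language model.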
In some optional embodiments, the local model transmitting unit 605 may be further configured to: and sending the loss disturbance gradient of each pseudo training sample in the current local pre-training language model and the pseudo training sample set to the server.
In some optional embodiments, the determining, for each pseudo training sample in the pseudo training sample subset, a sample weight of the pseudo training sample on the current local pre-training language model according to a loss disturbance gradient of the pseudo training sample may include: and for each pseudo training sample in the pseudo training sample subset, determining the greater one of the preset non-negative smaller sample weight and the opposite number of the loss disturbance gradient of the pseudo training sample as the sample weight of the pseudo training sample to the current pre-training language model.
In some optional embodiments, for each pseudo training sample in the pseudo training sample subset, determining a sample weight of the pseudo training sample to the current local pre-training language model according to a loss perturbation gradient of the pseudo training sample, may further include: and carrying out normalization processing on the sample weight of the current pre-training language model for each pseudo training sample in the pseudo training sample subset.
It should be noted that, for details of implementation and technical effects of each unit in the model training device provided in the embodiments of the present disclosure, reference may be made to descriptions of other embodiments in the present disclosure, and details are not described herein again.
The present disclosure also provides an embodiment of a task processing device. The task processing device is applied to a client and comprises: and the task processing unit is configured to execute the target natural language processing task by using a local pre-training language model, wherein the local pre-training language model is obtained by training through a local model training method described in the embodiment shown in fig. 4 and the optional implementation manner thereof.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use as a client or server for implementing embodiments of the present disclosure. The computer system 700 shown in fig. 7 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 7, computer system 700 may include a processing device (e.g., central processing unit, graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage device 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communications device 709 may allow the computer system 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates a computer system 700 having various means, it is to be understood that it is not required that all of the illustrated means be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, the processes described above with reference to the flow diagrams may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement a global model training method as shown in the embodiment shown in fig. 3 and its optional embodiments, and/or a local model training method as shown in the embodiment shown in fig. 4 and its optional embodiments, and/or a task processing method as shown in the embodiments of the present disclosure and its optional embodiments.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the local model sending unit may also be described as a "unit sending the current local pre-trained language model to the server".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the above features and (but not limited to) the features disclosed in this disclosure having similar functions are replaced with each other to form the technical solution.

Claims (18)

1. A global model training method, comprising:
the following global model update operations are performed: sending the global pre-training language model to each client; receiving a local pre-training language model returned by each client after responding to the local training of the global pre-training language model; determining a weight coefficient of each local pre-training language model; and carrying out weighted aggregation on each local pre-training language model according to the corresponding weight coefficient to obtain an updated global pre-training language model.
2. The method of claim 1, wherein the global model update operation further comprises:
determining whether to end the global model update operation;
in response to determining not to end, continuing to perform the global model update operation.
3. The method of claim 2, wherein the receiving the local pre-trained language model returned by each client in response to the local training of the global pre-trained language model comprises:
receiving a local pre-training language model returned by each client after responding to the local training of the global pre-training language model and a loss disturbance gradient of each local training sample in a corresponding local training sample set; and
the determining the weight coefficient of each local pre-training language model includes:
for each local pre-training language model, determining the sum of the loss disturbance gradients of the local training samples corresponding to the local pre-training language model as the sum of the disturbance gradients of the local pre-training language model;
for each local pre-training language model, determining the ratio of the sum of the disturbance gradients of the local pre-training language model to the total of the sums of the disturbance gradients of all local pre-training language models as the ratio of the sum of the disturbance gradients of the local pre-training language model;
and for each local pre-training language model, determining the weight coefficient of the local pre-training language model according to the ratio of the sum of the disturbance gradients of the local pre-training language model.
4. The method of claim 3, wherein the determining the weight coefficient of the local pre-training language model according to the ratio of the sum of the disturbance gradients of the local pre-training language model comprises:
determining the larger of the ratio of the sum of the disturbance gradients of the local pre-training language model and a preset non-negative small weight coefficient as the weight coefficient of the local pre-training language model.
5. The method of claim 2, wherein the determining whether to end the global model update operation comprises:
determining whether the number of times of executing the global model updating operation is less than a first preset number of times;
in response to determining yes, determining not to end the global model update operation;
in response to a determination of no, determining to end the global model update operation.
6. A local model training method, comprising:
in response to receiving a global pre-training language model sent by a server, respectively inputting local sample data in a local training sample set into the global pre-training language model to obtain pseudo labels corresponding to corresponding local sample data, wherein the local training sample comprises the local sample data and corresponding local labels;
generating a pseudo training sample in a pseudo training sample set by using local sample data in the local training sample set and the corresponding pseudo label;
determining the global pre-training language model as a current local pre-training language model, and dividing the local training sample set and the corresponding pseudo training sample set into at least two local training sample subsets and corresponding pseudo training sample subsets;
for each batch of local training sample subsets and corresponding pseudo training sample subsets, updating the current local pre-training language model based on the local training sample subsets and the corresponding pseudo training sample subsets;
and sending the current local pre-training language model to the server.
7. The method of claim 6, wherein said updating the current local pre-training language model based on the local subset of training samples and the corresponding subset of pseudo-training samples comprises:
calculating a first loss of the current local pre-training language model to the pseudo training sample subset according to a first loss function, and optimizing the current local pre-training language model by using the first loss to obtain an optimized local pre-training language model;
calculating a second loss of the optimized local pre-training language model to the local training sample subset according to a second loss function, wherein the second loss function adds corresponding loss disturbance to the loss of the local training sample, and the second loss is reversely propagated in the optimized local pre-training language model to obtain a loss disturbance gradient of each local training sample and the corresponding pseudo training sample in the local training sample subset;
for each pseudo training sample in the pseudo training sample subset, determining the sample weight of the pseudo training sample to the current local pre-training language model according to the loss disturbance gradient of the pseudo training sample;
calculating a third loss of the current local pre-training language model for the batch of local training sample subsets and the corresponding pseudo training sample subsets according to a third loss function, wherein the third loss function is the sum of a fourth loss function and a first loss function weighted by sample weights, the first loss function weighted by the sample weights is a loss function obtained by weighting the first loss function according to the sample weights of the corresponding pseudo training samples, and the fourth loss function is the loss of the current local pre-training language model for the local training samples in the batch of local training sample subsets;
and updating the model parameters of the current local pre-training language model according to the third loss.
8. The method of claim 7, wherein the sending the current local pre-trained language model to the server comprises:
and sending the loss disturbance gradient of each pseudo training sample in the current local pre-training language model and the pseudo training sample set to the server.
9. The method of claim 7, wherein the determining, for each pseudo training sample in the subset of pseudo training samples, the sample weight of the pseudo training sample to the current local pre-training language model according to the loss perturbation gradient of the pseudo training sample comprises:
and for each pseudo training sample in the pseudo training sample subset, determining the greater of the preset non-negative smaller sample weight and the opposite number of the loss disturbance gradient of the pseudo training sample as the sample weight of the pseudo training sample to the current pre-training language model.
10. The method of claim 7, wherein the determining, for each pseudo training sample in the subset of pseudo training samples, a sample weight of the pseudo training sample to the current local pre-training language model according to a loss perturbation gradient of the pseudo training sample, further comprises:
and carrying out normalization processing on the sample weight of the current pre-training language model for each pseudo training sample in the pseudo training sample subset.
11. A method of task processing, comprising:
performing a target natural language processing task using a local pre-trained language model, wherein the local pre-trained language model is trained by the local model training method according to any one of claims 6 to 10.
12. A global model training device applied to a server comprises:
a global model update unit configured to perform the following global model update operations: sending the global pre-training language model to each client; receiving local pre-training language models returned by the clients after responding to the local training of the global pre-training language models; determining a weight coefficient of each local pre-training language model; and carrying out weighted aggregation on each local pre-training language model according to the corresponding weight coefficient to obtain an updated global pre-training language model.
13. A local model training device applied to a client comprises:
the pseudo label generating unit is configured to respond to a received global pre-training language model sent by the server, input local sample data in a local training sample set into the global pre-training language model respectively, and obtain a pseudo label corresponding to the corresponding local sample data, wherein the local training sample comprises the local sample data and the corresponding local label;
a pseudo sample generating unit configured to generate a pseudo training sample in a pseudo training sample set by using local sample data in the local training sample set and a corresponding pseudo label;
a sample batching unit configured to determine the global pre-training language model as a current local pre-training language model and to divide the local training sample set and the corresponding pseudo-training sample set into at least two batches of local training sample subsets and corresponding pseudo-training sample subsets;
a local model updating unit configured to update, for each batch of a local training sample subset and a corresponding pseudo training sample subset, the current local pre-training language model based on the local training sample subset and the corresponding pseudo training sample subset;
a local model sending unit configured to send the current local pre-training language model to the server.
14. A task processing device comprising:
a task processing unit configured to perform a target natural language processing task using a local pre-trained language model, wherein the local pre-trained language model is trained by the local model training method according to any one of claims 6 to 10.
15. A server, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-5.
16. A client, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 6-10.
17. A computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method of any of claims 1-5 and/or the method of any of claims 6-10.
18. A model training system comprising the server of claim 15 and the client of claim 16.
CN202210917581.XA 2022-08-01 2022-08-01 Model training method, device, storage medium, client, server and system Pending CN115146657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210917581.XA CN115146657A (en) 2022-08-01 2022-08-01 Model training method, device, storage medium, client, server and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210917581.XA CN115146657A (en) 2022-08-01 2022-08-01 Model training method, device, storage medium, client, server and system

Publications (1)

Publication Number Publication Date
CN115146657A true CN115146657A (en) 2022-10-04

Family

ID=83413258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210917581.XA Pending CN115146657A (en) 2022-08-01 2022-08-01 Model training method, device, storage medium, client, server and system

Country Status (1)

Country Link
CN (1) CN115146657A (en)

Similar Documents

Publication Publication Date Title
CN109902186B (en) Method and apparatus for generating neural network
CN109460513B (en) Method and apparatus for generating click rate prediction model
CN108416310B (en) Method and apparatus for generating information
CN110619078B (en) Method and device for pushing information
WO2020207174A1 (en) Method and apparatus for generating quantized neural network
CN111831855B (en) Method, apparatus, electronic device, and medium for matching videos
CN112650841A (en) Information processing method and device and electronic equipment
CN112364860A (en) Training method and device of character recognition model and electronic equipment
CN111340220A (en) Method and apparatus for training a predictive model
CN110866040A (en) User portrait generation method, device and system
US20220391425A1 (en) Method and apparatus for processing information
CN110009101B (en) Method and apparatus for generating a quantized neural network
CN111680799A (en) Method and apparatus for processing model parameters
CN111008213A (en) Method and apparatus for generating language conversion model
CN111783731B (en) Method and device for extracting video features
CN112449217B (en) Method and device for pushing video, electronic equipment and computer readable medium
CN112241761A (en) Model training method and device and electronic equipment
CN110046670B (en) Feature vector dimension reduction method and device
CN109598344B (en) Model generation method and device
CN111782933A (en) Method and device for recommending book list
CN112308477A (en) Inventory positioning method and device
CN111709784B (en) Method, apparatus, device and medium for generating user retention time
CN115146657A (en) Model training method, device, storage medium, client, server and system
CN113220922A (en) Image searching method and device and electronic equipment
CN111353585A (en) Structure searching method and device of neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination