CN117455009A - Federal learning method, federal prediction method, apparatus, device, and storage medium - Google Patents


Info

Publication number
CN117455009A
Authority
CN
China
Prior art keywords
model
adapter
round
client
federal learning
Prior art date
Legal status
Pending
Application number
CN202311415967.1A
Other languages
Chinese (zh)
Inventor
杜逸超
张志锐
黄旭
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311415967.1A priority Critical patent/CN117455009A/en
Publication of CN117455009A publication Critical patent/CN117455009A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a federal learning method, a federal prediction method, an apparatus, a device, and a storage medium, belonging to the technical field of machine learning. The method comprises: acquiring a cascaded local backbone model and an adapter model, where the number of model parameters of the adapter model is smaller than that of the local backbone model; training the local backbone model and the adapter model based on private data in the client; and uploading the trained adapter model to a central server, where the adapter model is used to update a global model on the central server side. Because only the adapter model is updated during federal learning, and the adapter model has fewer model parameters than the local backbone model, the cost of federal learning can be reduced. Meanwhile, a local knowledge base is built on the client side based on the client's private data, and the global model generates its output results in combination with the local knowledge base, so that the results generated by the global model are more personalized.

Description

Federal learning method, federal prediction method, apparatus, device, and storage medium
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a federal learning method, a federal prediction method, an apparatus, a device, and a storage medium.
Background
With the development of artificial intelligence, the concept of "federal learning" was proposed to solve the data-island problem: the participating parties can jointly train a model and obtain model parameters without handing over their own data, so that leakage of data privacy is avoided. Federal learning involves two kinds of participants: a server and clients.
In federal learning, the server is responsible for issuing models to the clients and for integrating the models uploaded by the clients, while each client is responsible for training the model on its local private data and uploading the trained model to the server. Because model training is performed locally at the client and only model parameters, rather than the client's private data, are uploaded, the private data can be used for model training without being leaked, which solves the data-island problem.
However, in the related art, model training based on federal learning mostly relies on the FedAvg (Federated Averaging) algorithm: the model parameters of each client's local model are uploaded to the server, the server computes the average of the model parameters of all clients' local models, and then broadcasts this average back to all local devices. When the model to be trained is relatively large, FedAvg results in a large amount of computation on each client and a large communication overhead between the client and the server.
Therefore, how to design a low-cost, high-quality federal learning method is a problem that needs to be solved.
Disclosure of Invention
The application provides a federal learning method, a federal prediction method, an apparatus, a device, and a storage medium. The technical solutions are as follows:
according to an aspect of the present application, there is provided a federal learning method, the method being performed by a client participating in federal learning, the method comprising:
acquiring a cascade local backbone model and an adapter model, wherein the number of model parameters of the adapter model is smaller than that of the local backbone model;
private data in the client is obtained;
training the local backbone model and the adapter model based on the private data;
uploading the trained adapter model to a central server, wherein the trained adapter model is used for updating the global model at the central server side by adopting the federal learning mode.
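As an illustrative, non-normative sketch of this client-side procedure, the following Python code (assuming PyTorch) freezes a toy backbone, trains only a small cascaded adapter on local private data, and returns the adapter parameters for upload; the class and function names (ToyBackbone, Adapter, local_train) are hypothetical and not taken from the patent.

```python
# Hypothetical sketch of the client-side step: freeze the backbone,
# train only the cascaded adapter on private data, upload the adapter.
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):          # stands in for the large local backbone model
    def __init__(self, dim=32, vocab=100):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.body = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.body(self.embed(x)))

class Adapter(nn.Module):              # far fewer parameters than the backbone
    def __init__(self, dim=32, vocab=100):
        super().__init__()
        self.down = nn.Linear(dim, 8)
        self.up = nn.Linear(8, vocab)

    def forward(self, h):
        return self.up(torch.relu(self.down(h)))

def local_train(backbone, adapter, private_batches, epochs=1, lr=1e-3):
    for p in backbone.parameters():    # fix the backbone parameters
        p.requires_grad_(False)
    opt = torch.optim.Adam(adapter.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in private_batches:   # (source tokens, target tokens)
            logits = adapter(backbone(x))
            loss = loss_fn(logits.view(-1, logits.size(-1)), y.view(-1))
            opt.zero_grad()
            loss.backward()            # gradients flow only into the adapter
            opt.step()
    return adapter.state_dict(), float(loss)   # uploaded to the central server

# usage: a single toy batch of private (source, target) token ids
backbone, adapter = ToyBackbone(), Adapter()
batch = [(torch.randint(0, 100, (4, 10)), torch.randint(0, 100, (4, 10)))]
adapter_params, last_loss = local_train(backbone, adapter, batch)
```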
According to an aspect of the present application, there is provided a federal learning method, the method being performed by a client participating in federal learning, the method further comprising the following steps:
acquiring an i+1st round of adapter model issued by the central server;
Inputting the private data into the local backbone model and the i+1th round adapter model to obtain an i+1th round loss function value;
fixing the model parameters of the local backbone model to be unchanged;
updating model parameters of the i+1th round adapter model based on the i+1th round loss function value;
uploading the i+1st round of loss function values and the trained i+1st round of adapter models to the central server, wherein the i+1st round of loss function values are used for the central server to determine whether to execute the i+2nd round of training.
According to an aspect of the present application, there is provided a federal learning method, the method being performed by a client participating in federal learning, the method further comprising:
under the condition that a first output result appears in both the prediction result set and the search result set, interpolating the prediction result probability and the search result probability corresponding to the first output result to obtain an interpolation probability of the first output result;
under the condition that a second output result appears only in the prediction result set and does not appear in the search result set, interpolating the prediction result probability corresponding to the second output result with a search result probability set to zero to obtain an interpolation probability of the second output result;
under the condition that a third output result appears only in the search result set and does not appear in the prediction result set, interpolating the search result probability corresponding to the third output result with a prediction result probability set to zero to obtain an interpolation probability of the third output result;
and sorting all the output results based on their interpolation probabilities, and determining the output result with the highest interpolation probability as the final output result.
According to an aspect of the present application, there is provided a federation learning method, the method being performed by a client participating in federation learning, the private data including source data and target data, the target data including n sub-target data, n being a positive integer, the method further comprising:
extracting features of the source data and the first m sub-target data based on the global model as context representation of the (m+1) th sub-target data, wherein m is a positive integer and is less than or equal to n;
and storing the corresponding relation between the context representation of the (m+1) th sub-target data and the (m+1) th sub-target data into the local knowledge base.
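A minimal sketch, under assumed data structures, of how such a local knowledge base could be built: for each position m, the context representation of the source data plus the first m sub-target units is stored together with the (m+1)-th sub-target unit. The feature extractor below is a stand-in for the global model, and all names are hypothetical.

```python
# Hypothetical sketch: build a local datastore of (context representation -> next unit) pairs.
import numpy as np

def fake_global_model_features(source, prefix, dim=16):
    # Stand-in for the global model's context representation of (source, y_1..y_m).
    rng = np.random.default_rng(abs(hash((source, tuple(prefix)))) % (2**32))
    return rng.standard_normal(dim).astype(np.float32)

def build_local_knowledge_base(private_pairs):
    keys, values = [], []
    for source, target_units in private_pairs:        # target_units = [y_1, ..., y_n]
        for m in range(len(target_units)):            # context: source + first m units
            ctx = fake_global_model_features(source, target_units[:m])
            keys.append(ctx)                          # context representation
            values.append(target_units[m])            # the (m+1)-th sub-target unit
    return np.stack(keys), values

# usage: one private pair, with the target split into word units
kb_keys, kb_values = build_local_knowledge_base([("audio_001", ["hello", "world"])])
print(kb_keys.shape, kb_values)   # (2, 16) ['hello', 'world']
```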
According to an aspect of the present application, there is provided a federal learning method, the method being performed by a client participating in federal learning, the source data being a speech sequence; the target data is a text sequence; the sub-target data is a text unit, and the text unit is a word or a word; the method further comprises the steps of:
Extracting features of the speech sequence and the first m text units as a context representation of an (m+1) th text unit based on the global model;
storing the corresponding relation between the context representation of the (m+1) -th sub-target data and the (m+1) -th sub-target data in the local knowledge base, wherein the method comprises the following steps:
and storing the corresponding relation between the context representation of the (m+1) th text unit and the (m+1) th text unit into the local knowledge base.
According to an aspect of the present application, there is provided a federal learning method, the method being performed by a client participating in federal learning, the input information including source data and first m sub-target data; the method further comprises the steps of:
extracting features of the source data and the first m sub-target data as a context representation of the (m+1) th sub-target data based on the global model;
and searching according to the context representation of the (m+1) th sub-target data based on the local knowledge base, and obtaining the search result set through a k neighbor model.
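Continuing the assumed datastore layout sketched above, a k-nearest-neighbour lookup over the stored context representations can be turned into a retrieval probability distribution over sub-target units; the distance metric and temperature below are illustrative choices, not specified by the text.

```python
# Hypothetical sketch: retrieve a distribution over next units from the local knowledge base.
import numpy as np
from collections import defaultdict

def knn_retrieval_distribution(query, kb_keys, kb_values, k=4, temperature=1.0):
    dists = np.linalg.norm(kb_keys - query, axis=1)          # L2 distance to every stored context
    nearest = np.argsort(dists)[:k]                          # indices of the k nearest neighbours
    weights = np.exp(-dists[nearest] / temperature)          # closer neighbours weigh more
    probs = defaultdict(float)
    for idx, w in zip(nearest, weights):
        probs[kb_values[idx]] += w
    total = sum(probs.values())
    return {unit: w / total for unit, w in probs.items()}    # retrieval result set with probabilities

# usage (with the kb_keys / kb_values built in the previous sketch):
# query = fake_global_model_features("audio_001", ["hello"])
# print(knn_retrieval_distribution(query, kb_keys, kb_values, k=2))
```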
According to an aspect of the present application, there is provided a federal learning method, the method being performed by a central server, the method comprising:
Transmitting a cascaded local backbone model and an adapter model to at least two clients participating in federal learning, wherein the number of model parameters of the adapter model is smaller than that of the local backbone model;
receiving the trained adapter model uploaded by the at least two clients;
and aggregating based on the trained adapter models uploaded by at least two clients to obtain an aggregated adapter model.
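One plausible realization of this aggregation step is a data-size-weighted average of the uploaded adapter parameters, analogous to FedAvg but restricted to the adapter weights; the summary does not fix a specific aggregation rule, so the sketch below (assuming PyTorch state_dicts) is an assumption.

```python
# Hypothetical sketch: aggregate adapter state_dicts uploaded by clients.
import torch

def aggregate_adapters(client_state_dicts, client_data_sizes):
    total = float(sum(client_data_sizes))
    weights = [n / total for n in client_data_sizes]          # weight by local data size
    aggregated = {}
    for name in client_state_dicts[0]:                        # same adapter architecture on every client
        aggregated[name] = sum(w * sd[name].float() for w, sd in zip(weights, client_state_dicts))
    return aggregated                                         # issued as the next-round adapter
```

The aggregated dictionary can then be loaded back into an adapter instance with its load_state_dict method before being issued to the clients for the next round.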
According to an aspect of the present application, there is provided a federal prediction method, the method being performed by a client, the method comprising:
acquiring a global model, wherein the global model is obtained by training according to a federal learning method;
predicting the input information based on the global model to obtain a prediction result set; searching the input information based on a local knowledge base to obtain a search result set;
determining a final output result in the prediction result set and the retrieval result set;
the local knowledge base is constructed based on private data in the client, the prediction result set comprises at least one prediction result, and the retrieval result set comprises at least one retrieval result.
According to an aspect of the present application, there is provided a federal prediction method, the method being performed by a client, the determining a final output result in the prediction result set and the search result set, including:
under the condition that a first output result appears in both the prediction result set and the search result set, interpolating the prediction result probability and the search result probability corresponding to the first output result to obtain an interpolation probability of the first output result;
under the condition that a second output result appears only in the prediction result set and does not appear in the search result set, interpolating the prediction result probability corresponding to the second output result with a search result probability set to zero to obtain an interpolation probability of the second output result;
under the condition that a third output result appears only in the search result set and does not appear in the prediction result set, interpolating the search result probability corresponding to the third output result with a prediction result probability set to zero to obtain an interpolation probability of the third output result;
and sorting all the output results based on their interpolation probabilities, and determining the output result with the highest interpolation probability as the final output result.
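The interpolation rule described above can be sketched as follows; the interpolation coefficient lam is an assumed hyperparameter, and outputs missing from one set are treated as having probability zero in that set.

```python
# Hypothetical sketch of the interpolation between prediction and retrieval probabilities.
def interpolate(pred_probs, retr_probs, lam=0.5):
    """pred_probs / retr_probs: dicts mapping candidate outputs to probabilities."""
    candidates = set(pred_probs) | set(retr_probs)
    scores = {
        c: lam * pred_probs.get(c, 0.0) + (1.0 - lam) * retr_probs.get(c, 0.0)
        for c in candidates
    }
    # sort by interpolated probability and return the best output
    return max(scores, key=scores.get), scores

# usage
best, scores = interpolate({"hello": 0.7, "hollow": 0.3}, {"hello": 0.4, "yellow": 0.6})
print(best)   # 'hello'
```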
According to an aspect of the present application, there is provided a federal prediction method, the method being performed by a client, the source data being a speech sequence; the target data is a text sequence; the sub-target data is a text unit, and the text unit is a word or a word; the method comprises the following steps:
extracting features of the speech sequence and the first m text units as a context representation of an (m+1) th text unit based on the global model;
storing the corresponding relation between the context representation of the (m+1) -th sub-target data and the (m+1) -th sub-target data in the local knowledge base, wherein the method comprises the following steps:
and storing the corresponding relation between the context representation of the (m+1) th text unit and the (m+1) th text unit into the local knowledge base.
According to an aspect of the present application, there is provided a federal prediction method, the method being performed by a client, the input information comprising source data and first m sub-target data; the method comprises the following steps:
extracting features of the source data and the first m sub-target data as a context representation of the (m+1) th sub-target data based on the global model;
and searching according to the context representation of the (m+1) th sub-target data based on the local knowledge base, and obtaining the search result set through a k neighbor model.
According to an aspect of the present application, there is provided a federal learning apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a cascade local backbone model and an adapter model, and the number of model parameters of the adapter model is smaller than that of the local backbone model;
the second acquisition module is used for acquiring private data in the client;
the training module is used for training the local backbone model and the adapter model based on the private data;
and the uploading module is used for uploading the trained adapter model to a central server, and the trained adapter model is used for updating the global model at the central server side by adopting the federal learning mode.
According to an aspect of the present application, there is provided a federal learning apparatus, the apparatus comprising:
the first sending module is used for sending the cascaded local backbone model and the adapter model to at least two clients participating in federal learning, and the number of model parameters of the adapter model is smaller than that of the local backbone model;
the first receiving module is used for receiving the trained adapter models uploaded by the at least two clients;
And the aggregation module is used for aggregating the trained adapter models uploaded by the at least two clients to obtain an aggregated adapter model.
According to an aspect of the present application, there is provided a federal prediction apparatus, the apparatus comprising:
the third acquisition module is used for acquiring a global model, wherein the global model is obtained by training according to a federal learning method;
the second prediction module is used for predicting the input information based on the global model to obtain a prediction result set;
the second retrieval module is used for retrieving the input information based on a local knowledge base to obtain a retrieval result set;
the second determining module is used for determining a final output result in the prediction result set and the retrieval result set;
the local knowledge base is constructed based on private data in the client, the prediction result set comprises at least one prediction result, and the retrieval result set comprises at least one retrieval result.
According to an aspect of the present application, there is provided a computer device comprising: a processor and a memory, wherein at least one section of program is stored in the memory; the processor is configured to execute the at least one program in the memory to implement the federal learning method and/or the federal prediction method.
According to an aspect of the present application, there is provided a computer-readable storage medium having stored therein executable instructions that are loaded and executed by a processor to implement the federal learning method and/or federal prediction method described above.
According to an aspect of the present application, there is provided a computer program product comprising computer instructions stored in a computer readable storage medium, from which a processor reads and executes the computer instructions to implement the federal learning method and/or federal prediction method described above.
The beneficial effects brought by the technical solutions provided in the present application include at least the following:
In the present application, model training is pushed down to the client, and the client performs model training with the private data it stores. During model training, the local backbone model is not trained directly; instead, an adapter model is added, and training is carried out by updating the adapter model. Because the number of model parameters of the adapter model is smaller than that of the local backbone model, model training can be completed by updating only a small number of parameters, so that clients with lower configurations can also participate in federal learning. Moreover, during model training only the adapter model needs to be exchanged between the client and the server, which greatly reduces the communication overhead between them, so that a federal learning model can be obtained at lower cost.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 illustrates a schematic diagram of a computer system provided in an exemplary embodiment of the present application;
FIG. 2 illustrates a flowchart of a federal learning method provided in one exemplary embodiment of the present application;
FIG. 3 illustrates a flowchart of a federal learning method provided in one exemplary embodiment of the present application;
FIG. 4 illustrates a flowchart of a federal learning method provided in one exemplary embodiment of the present application;
FIG. 5 illustrates a flowchart of a federal learning method provided in one exemplary embodiment of the present application;
FIG. 6 illustrates a flowchart of a federal learning method provided in one exemplary embodiment of the present application;
FIG. 7 illustrates a flowchart of a federal learning method provided in one exemplary embodiment of the present application;
FIG. 8 illustrates a flowchart of a federal learning method provided in one exemplary embodiment of the present application;
FIG. 9 illustrates a flowchart of a federal learning method provided in one exemplary embodiment of the present application;
FIG. 10 illustrates a flowchart of a federal learning method provided in one exemplary embodiment of the present application;
FIG. 11 illustrates a flowchart of a federal learning method provided in an exemplary embodiment of the present application;
FIG. 12 illustrates a flowchart of a federal learning method provided in an exemplary embodiment of the present application;
FIG. 13 illustrates a flowchart of a federal learning method provided in an exemplary embodiment of the present application;
FIG. 14 illustrates a flowchart of a federal prediction method provided in one exemplary embodiment of the present application;
FIG. 15 illustrates a flowchart of a federal prediction method provided in one exemplary embodiment of the present application;
FIG. 16 illustrates a flowchart of a federal prediction method provided by an exemplary embodiment of the present application;
FIG. 17 illustrates a flowchart of a federal prediction method provided in an exemplary embodiment of the present application;
FIG. 18 illustrates a flowchart of a federal prediction method provided by an exemplary embodiment of the present application;
FIG. 19 illustrates a schematic diagram of a federal prediction method provided in an exemplary embodiment of the present application;
FIG. 20 illustrates an overall flow chart of a federal learning method provided in one exemplary embodiment of the present application;
FIG. 21 illustrates a schematic diagram of a federal learning method provided in an exemplary embodiment of the present application;
FIG. 22 illustrates a block diagram of a federal learning device according to an exemplary embodiment of the present application;
FIG. 23 illustrates a block diagram of a federal learning device according to an exemplary embodiment of the present application;
FIG. 24 illustrates a block diagram of a federal prediction appliance according to an exemplary embodiment of the present application;
fig. 25 shows a block diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions. For example, information such as setting operations referred to in the present application is acquired with sufficient authorization.
It should be understood that, although the terms first, second, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first parameter may also be referred to as a second parameter, and similarly, a second parameter may also be referred to as a first parameter, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
First, description is made of related terms related to the present application:
Federal learning (Federated Learning, FL): a machine learning framework that learns and trains with data while meeting privacy-security and data-security requirements. Federal learning is also known as joint learning. As a distributed machine-learning paradigm, federal learning can effectively solve the data-island problem: participants can model jointly without sharing data, so the data island is broken technically and collaborative training is realized. A data island refers to the phenomenon that data and data sets become closed or semi-closed during their formation, analysis and use, owing to asymmetry, redundancy and similar factors caused by incomplete subject mobility, technology, policy environment, system construction and the like. For example, in the field of voice-to-text, voice data is naturally privacy-sensitive; under the privacy-protection regulations and data-security requirements of users and enterprises, the private data of clients cannot be collected to train a voice-to-text model, and a data island is thus formed. Federal learning pushes model training down to the client: model training is carried out on the client, the client then uploads the trained model parameters to the server, and the server integrates the model parameters to obtain the final global model. Although the user's private data is used for model training in this process, the training is performed at the client and only model parameters are finally sent to the server; the client's private data is not leaked at any intermediate step, so the data-island problem is well solved.
FedAvg: a federal learning algorithm aggregates model parameters by weighted averaging. The basic idea of FedAvg is to upload the parameters of the local model to a server, which calculates the average of all model parameters and then broadcast this average back to all local devices. This process may be iterated multiple times until convergence. In order to ensure the accuracy of model aggregation, the FedAVg algorithm adopts a weighted average mode to conduct model aggregation. Specifically, the model parameters uploaded by each device will be given a weight and then weighted average. The weight of the model parameter uploaded by the device is assigned according to the local data size on the device, and the device weight is larger when the data size is larger. However, in the FedAvg algorithm, the weight of the model parameter uploaded by each device is assigned according to the local data size on the device. This approach may lead to the problem of data imbalance, i.e. devices with smaller amounts of data contribute less to the global model, thereby affecting the generalization performance of the model. And if the amount of model data trained is large, it may be difficult for the client to support training the model.
Artificial intelligence (Artificial Intelligence, AI): theory, methods, techniques and application systems that utilize digital computers or digital computer-controlled machines to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision. The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include, for example, sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. The pre-training model is also called a large model and a basic model, and can be widely applied to all large-direction downstream tasks of artificial intelligence after fine adjustment. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Speech technology (Speech Technology, ST): speech technology includes at least one of automatic speech recognition (Automatic Speech Recognition, ASR), speech synthesis (Text To Speech, TTS) and voiceprint recognition. It enables computers to listen, see, speak and feel, which is the development direction of human-computer interaction in the future; voice is becoming one of the best human-computer interaction modes. Large-model technology has brought a revolution to the development of speech technology: pre-trained models such as WavLM and UniSpeech, which use the Transformer architecture, have strong generalization and universality and can excellently complete speech-processing tasks in all directions.
Natural language processing (Nature Language Processing, NLP): an important direction in the fields of computer science and artificial intelligence. It is studying various theories and methods that enable effective communication between a person and a computer in natural language. The natural language processing relates to natural language, namely the language used by people in daily life, and is closely researched with linguistics; and also to computer science and mathematics. An important technique for model training in the artificial intelligence domain, a pre-training model, is developed from a large language model (Large Language Model, LLM) in the NLP domain. Through fine tuning, the large language model can be widely applied to downstream tasks. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic questions and answers, knowledge graph techniques, and the like.
Machine Learning (ML): a multi-domain interdisciplinary relates to multi-disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like. The pre-training model is the latest development result of deep learning, and integrates the technology.
Pre-training Model (PTM): also called a basic model or large model, a PTM is a deep neural network (Deep Neural Network, DNN) with a large number of parameters, trained on massive unlabeled data. The PTM uses the function-approximation capability of the large-parameter DNN to extract common features from the data, and is adapted to downstream tasks through techniques such as fine tuning (Fine Tuning), parameter-efficient fine tuning (Parameter-Efficient Fine-Tuning, PEFT) and prompt tuning (Prompt-Tuning). Therefore, the pre-training model can achieve ideal effects in few-shot or zero-shot scenarios. According to the data modality processed, PTMs can be divided into language models, visual models, speech models (e.g., VALL-E), multimodal models, etc., where language models include ELMo (Embeddings from Language Models), BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), etc.; visual models include Swin Transformer, ViT (Vision Transformer), V-MoE (Vision MoE), etc.; and multimodal models include ViLBERT (Vision-and-Language BERT), CLIP (Contrastive Language-Image Pre-training), Flamingo, Gato, etc. A multimodal model is a model that builds representations of features from two or more data modalities. The pre-training model is an important tool for producing artificial intelligence generated content (Artificial Intelligence Generated Content, AIGC) and can also serve as a general interface connecting multiple specific task models.
FIG. 1 illustrates a block diagram of a computer system 100 provided in an exemplary embodiment of the present application. The computer system 100 includes: a server 110 and a client 120.
The server 110 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial-intelligence platforms. The server 110 is used to provide the backbone model and adapters for the clients 120; the server 110 is also configured to integrate the adapters trained by the clients 120 into a global adapter. The server 110 may also be referred to as a central server.
The client 120 may be an electronic device such as a mobile phone, a tablet computer, an in-vehicle terminal, a wearable device, a PC (Personal Computer), an unmanned self-service terminal, or a smart speaker. A client of a target application may be installed on the client 120, where the target application may be an application supporting speech-to-text, for example a voice assistant, an input method supporting speech-to-text, a social application supporting speech-to-text, a game program embedded with a speech-to-text module, or an application supporting speech translation; the target application may also be an application supporting translation, such as an online translation program or a video application supporting online subtitle translation; the target application may also be a conversation-enabled application such as a mobile-phone assistant or a chat robot. The form of the target application is not limited in this embodiment, and includes, but is not limited to, a client installed in the client 120, an application, an applet, etc., and may also be a web page. The client 120 may also be a background server of the application that gathers private data for the corresponding application; for example, the client 120 is a background server of an application supporting speech-to-text, a background server of an application supporting translation, a background server of an application supporting conversations, or the like.
Communication between the server 110 and the client 120 may be via a network, such as a wired or wireless network.
Those skilled in the art will appreciate that the number of clients 120 described above may be greater or lesser. Such as: the number of clients 120 may be only one, or tens or hundreds, or more. The number of clients 120 and the device type are not limited in the embodiments of the present application.
Fig. 2 shows a flowchart of a federal learning method provided in an exemplary embodiment of the present application. The method is performed by a client participating in federal learning, which may be the client 120 shown in FIG. 1. The method comprises the following steps:
step 210: acquiring a cascade local backbone model and an adapter model, wherein the number of model parameters of the adapter model is smaller than that of the local backbone model;
in some embodiments, a client obtains a cascaded local backbone model and an adapter model from a central server; or the client acquires the adapter model from the central server, and cascades the adapter model with the local backbone model to obtain a cascade local backbone model and an adapter model.
Alternatively, the adapter model is a fine-tuning model, i.e., a model employing a fine-tuning method such as SFT (Supervised Fine-Tuning), LoRA (Low-Rank Adaptation), Freeze, P-Tuning (Prompt Tuning), and the like.
In some embodiments, the number of model parameters of the adapter model is less than the number of model parameters of the local backbone model; or, the number of model parameters of the adapter model is much smaller than the number of model parameters of the local backbone model.
Exemplarily, the number of model parameters of the local backbone model is 180B (Billion) and the number of model parameters of the adapter model is 4.7M (Million), where 1B = 10³M.
Step 220: private data in a client is obtained;
in some embodiments, the client stores private data, such as photo data, voice data, location data, and the like.
Step 230: training a local backbone model and an adapter model based on private data;
in some embodiments, the client trains the local backbone model and the adapter model a times based on the obtained private data; or, the client trains the local backbone model and the adapter model based on the obtained private data until the loss function converges.
In some embodiments, the client trains the local backbone model and the adapter model based on private data, and updates model parameters of the adapter model during training to obtain a trained adapter model.
In some embodiments, the private data is represented as D_s = {(x, y)}, where x = (x_1, …, x_|x|) denotes the sequence of source data, y = (y_1, …, y_|y|) denotes the sequence of target data, |D_s| is the amount of private data, the capital letter D denotes a data set, the subscript s indicates that the data is private data, x_i denotes the i-th element of the source-data sequence and, similarly, y_i denotes the i-th element of the target-data sequence.
The sequence of source data is a speech sequence, the speech sequence being a sequence with frames as its basic unit, and the sequence of target data is a text sequence with phonemes or words as its basic unit; or, the sequence of source data is a text sequence and the sequence of target data is a speech sequence; or, the sequence of source data is a text sequence in language a and the sequence of target data is a text sequence in language b, where language a and language b are different languages, such as English, Chinese, German or French; or, the sequence of source data is a text sequence of task prompt information and the sequence of target data is a text sequence of the output information of the executed task, such as the questions and answers in a generative language model.
Step 240: uploading the trained adapter model to a central server, wherein the trained adapter model is used for updating a global model at the central server side by adopting a federal learning mode.
In some embodiments, the client uploads the trained adapter model to the central server, which uses the trained adapter model to update the global model on the central server side.
In summary, in the method provided by the embodiments of the present application, model training is pushed down to the client, and the client performs model training with the private data it stores. During model training, the local backbone model is not trained directly; instead, an adapter model is added and training is carried out by updating the adapter model. Because the number of model parameters of the adapter model is smaller than that of the local backbone model, training can be completed by updating only a small number of parameters, so that clients with lower configurations can also participate in federal learning. In addition, during model training only the adapter model needs to be exchanged between the client and the server, which greatly reduces the communication overhead between them, so that a high-quality federal learning model can be obtained at lower cost.
In an alternative embodiment based on fig. 2, as shown in fig. 3, step 230 may alternatively be implemented as steps 231 to 233:
Step 231: inputting private data into a local backbone model and an adapter model to obtain a loss function value;
In some embodiments, a loss function is designed based on maximum likelihood estimation (Maximum Likelihood Estimation, MLE) and is used to represent the difference between the values predicted by the model and the actual values. The loss function L(δ_c) employed in the embodiments of the present application is shown in the following formula:

L(δ_c) = - Σ_{(x, y) ∈ D_c} log P(y | x; θ_0, δ_c)

where δ_c denotes the model parameters of the adapter model, θ_0 denotes the model parameters of the local backbone model, and log() is the logarithm with the natural constant e as its base.
In some embodiments, the private data is represented as D_c = {(x, y)}, where x = (x_1, …, x_|x|) denotes the sequence of source data, y = (y_1, …, y_|y|) denotes the sequence of target data, |D_c| is the amount of private data, the capital letter D denotes a data set, x_i denotes the i-th element of the source-data sequence and, similarly, y_i denotes the i-th element of the target-data sequence; the subscript c denotes client c, i.e. D_c is the private data of client c, δ_c is the adapter model of client c, and L(δ_c) is the loss function corresponding to the adapter model of client c.
In the formula above, P(y | x; θ_0, δ_c) can be understood as the probability that a candidate target data y occurs given that the source data x has occurred, under the condition that the model parameters of the local backbone model are θ_0 and the model parameters of the adapter model are δ_c.
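As an illustrative, non-normative sketch, the following snippet computes a loss of this form (the summed negative log-probability of the target sequences given the source sequences); the model object is hypothetical and stands in for the cascaded backbone and adapter.

```python
# Hypothetical sketch: L(delta_c) = - sum over (x, y) in D_c of log P(y | x; theta_0, delta_c)
import torch
import torch.nn.functional as F

def client_nll_loss(model, private_data):
    """model(x) -> logits of shape (target_len, vocab);
    private_data: list of (x, y) pairs where y is a 1-D LongTensor of target token ids."""
    loss = torch.tensor(0.0)
    for x, y in private_data:
        logits = model(x)                               # depends on theta_0 (frozen) and delta_c
        log_probs = F.log_softmax(logits, dim=-1)       # log P(token | context)
        loss = loss - log_probs[torch.arange(len(y)), y].sum()   # - log P(y | x)
    return loss
```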
Step 232: fixing model parameters of the local backbone model to be unchanged;
in some embodiments, the client will fix the model parameters of the local backbone model while the model is being trained, leaving it unchanged. For example, in the above-described loss function calculation, the parameter values of the functions are only parameters of the adapter model and do not include parameters of the local backbone model.
Step 233: model parameters of the adapter model are updated based on the loss function values.
In some embodiments, model parameters of the adapter model are updated based on differences in the loss function values before model parameter adjustment and the loss function values after model parameter adjustment.
For example, suppose the model parameters of the adapter model before parameter adjustment are δ_c with a corresponding loss function value of 0.2, and after parameter adjustment the model parameters of the adapter model are δ'_c with a corresponding loss function value of 0.1; then the model parameters of the adapter model are updated to δ'_c. Or, suppose the model parameters before parameter adjustment are δ_c with a corresponding loss function value of 0.05, and after parameter adjustment the model parameters are δ'_c with a corresponding loss function value of 0.1; then the model parameters of the adapter model are kept as δ_c (i.e., the pre-adjustment parameters are retained).
In some embodiments, a gradient function corresponding to the loss function is derived from the loss function; the gradient function is the derivative of the loss function, i.e., it is obtained by taking partial derivatives of the loss function. The model parameters of the adapter model are then updated by gradient descent according to the gradient function. It can be understood that the loss function represents the difference between the predicted value and the actual value, and the gradient function derived from it represents how that difference changes; descending along the gradient reduces the difference between the predicted value and the actual value, which is equivalent to improving the prediction. Gradient descent thus realizes the model training process by adjusting the model parameters, which is equivalent to the above-mentioned way of updating the model parameters according to how the loss function value changes before and after parameter adjustment.
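Written out explicitly, the adapter-only gradient-descent step takes the standard form below; the learning rate η is an assumed hyperparameter that is not named in the text.

```latex
% Adapter-only gradient-descent step (backbone parameters \theta_0 held fixed)
\delta_c \leftarrow \delta_c - \eta \, \nabla_{\delta_c}\, \mathcal{L}(\delta_c),
\qquad
\nabla_{\delta_c}\, \mathcal{L}(\delta_c)
  = -\sum_{(x,\, y) \in D_c} \nabla_{\delta_c} \log P\bigl(y \mid x;\, \theta_0, \delta_c\bigr)
```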
In summary, the method provided by the embodiment of the application only updates the parameters of the adapter model during model training, and the number of the model parameters of the adapter model is smaller than that of the model parameters of the local backbone model, so that the threshold of the client participating in federal learning can be effectively reduced, and the efficiency during federal learning can be effectively improved.
In an alternative embodiment based on fig. 3, where the loss function value includes an i-th round loss function value and the adapter model includes an i-th round adapter model, i is a positive integer, as shown in fig. 4, step 233 may alternatively be implemented as step 2331 and step 240 may alternatively be implemented as step 241.
Step 2331: updating model parameters of the ith round of adapter model based on the ith round of loss function value to obtain a trained ith round of adapter model;
in some embodiments, model parameters of the ith round of adapter model are updated based on differences in the ith round of loss function values before model parameter adjustment and the ith round of loss function values after model parameter adjustment.
For example, suppose the model parameters of the i-th round adapter model before parameter adjustment are δ_c with a corresponding i-th round loss function value of 0.2, and after parameter adjustment the model parameters are δ'_c with a corresponding i-th round loss function value of 0.1; then the model parameters of the i-th round adapter model are updated to δ'_c. Or, suppose the model parameters before parameter adjustment are δ_c with a corresponding i-th round loss function value of 0.05, and after parameter adjustment the model parameters are δ'_c with a corresponding i-th round loss function value of 0.1; then the model parameters of the i-th round adapter model are kept as δ_c (i.e., the pre-adjustment parameters are retained).
In some embodiments, a gradient function corresponding to the loss function is derived from the loss function; the gradient function is the derivative of the loss function, i.e., it is obtained by taking partial derivatives of the loss function. The model parameters of the adapter model are then updated by gradient descent according to the gradient function. It can be understood that the loss function represents the difference between the predicted value and the actual value, and the gradient function derived from it represents how that difference changes; descending along the gradient reduces the difference between the predicted value and the actual value, which is equivalent to improving the prediction. Gradient descent thus realizes the model training process by adjusting the model parameters, which is equivalent to the above-mentioned way of updating the model parameters according to how the loss function value changes before and after parameter adjustment.
Step 241: uploading the ith round of loss function values and the trained ith round of adapter models to a central server, wherein the ith round of loss function values are used for the central server to determine whether to execute the ith+1st round of training.
In some embodiments, the client uploads the trained ith round of adapter model and the ith round of loss function value corresponding to the adapter model to the central server, and the central server judges whether to execute the (i+1) th round of training according to the received at least one ith round of loss function value.
Optionally, the central server judges whether to execute the (i+1) -th training according to the average value of the received at least one (i) -th loss function value; or, the central server performs weighted average on at least one ith round of loss function value according to the received data volume of at least one client for model training, so as to obtain a corresponding weighted average value, and judges whether to perform the (i+1) th round of training according to the weighted average value.
For example, suppose there are 3 clients participating in federal learning, the i-th round loss function values of the 3 clients are 0.1, 0.3 and 0.2 respectively, and the corresponding amounts of training data are 50, 30 and 120 respectively. The central server averages the 3 received i-th round loss function values to obtain an average value of 0.2. Or, the central server calculates a weight for each of the 3 clients, where the weight of a client is its amount of data divided by the total amount of data of all clients participating in federal learning; the weights of the clients are thus 0.25, 0.15 and 0.6 respectively. Taking the weighted average of the 3 received i-th round loss function values with these weights gives a weighted average value of 0.19.
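The weighted-average computation in this example can be reproduced with a few lines of illustrative Python (the function name is hypothetical).

```python
# Hypothetical sketch: server-side weighted average of the clients' round-i loss values.
def weighted_average_loss(losses, data_sizes):
    total = sum(data_sizes)
    weights = [n / total for n in data_sizes]              # 50/200, 30/200, 120/200 -> 0.25, 0.15, 0.6
    return sum(w * l for w, l in zip(weights, losses))

print(weighted_average_loss([0.1, 0.3, 0.2], [50, 30, 120]))   # 0.19 (up to floating-point rounding)
```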
In summary, in the embodiments of the present application, during the i-th round of training the client uploads its i-th round loss function value and i-th round adapter model to the central server, and the central server determines, according to the i-th round loss function values, whether to perform the next round of training. Because only the model parameters of the i-th round adapter model need to be changed during training, a general federal learning model can be trained quickly.
In an alternative embodiment based on fig. 4, as shown in fig. 5, the method further comprises:
step 310: acquiring an i+1st round adapter model issued by a central server;
in some embodiments, the client obtains an i+1st round of adapter model issued by the central server; or, the client acquires a cascade local backbone model and an i+1st round adapter model which are issued by the central server.
Step 320: inputting private data into a local backbone model and an i+1th round adapter model to obtain an i+1th round loss function value;
in some embodiments, a loss function is designed based on maximum likelihood estimation (Maximum Likelihood Estimation, MLE) and is used to represent the difference between the model predicted and actual values.
The loss function L(δ_c) employed in the embodiments of the present application is shown in the following formula:

L(δ_c) = - Σ_{(x, y) ∈ D_c} log P(y | x; θ_0, δ_c)

where δ_c denotes the model parameters of the adapter model, θ_0 denotes the model parameters of the local backbone model, and log() is the logarithm with the natural constant e as its base.
In some embodiments, the private data is represented as D_c = {(x, y)}, where x = (x_1, …, x_|x|) denotes the sequence of source data, y = (y_1, …, y_|y|) denotes the sequence of target data, |D_c| is the amount of private data, the capital letter D denotes a data set, x_i denotes the i-th element of the source-data sequence and, similarly, y_i denotes the i-th element of the target-data sequence; the subscript c denotes client c, i.e. D_c is the private data of client c, δ_c is the adapter model of client c, and L(δ_c) is the loss function corresponding to the adapter model of client c.
In the formula above, P(y | x; θ_0, δ_c) can be understood as the probability that a candidate target data y occurs given that the source data x has occurred, under the condition that the model parameters of the local backbone model are θ_0 and the model parameters of the adapter model are δ_c.
Step 330: fixing model parameters of the local backbone model to be unchanged;
in some embodiments, the client will fix the model parameters of the local backbone model while the model is being trained, leaving it unchanged. For example, in the above-described loss function calculation, the parameter values of the functions are only parameters of the adapter model and do not include parameters of the local backbone model.
Step 340: updating model parameters of the i+1th round adapter model based on the i+1th round loss function value;
in some embodiments, model parameters of the i+1th round of adapter model are updated based on differences between the i+1th round of loss function values before model parameter adjustment and the i+1th round of loss function values after model parameter adjustment.
For example, suppose the model parameters of the (i+1)-th round adapter model before parameter adjustment are δ_c with a corresponding (i+1)-th round loss function value of 0.2, and after parameter adjustment the model parameters are δ'_c with a corresponding (i+1)-th round loss function value of 0.1; then the model parameters of the (i+1)-th round adapter model are updated to δ'_c. Or, suppose the model parameters before parameter adjustment are δ_c with a corresponding (i+1)-th round loss function value of 0.05, and after parameter adjustment the model parameters are δ'_c with a corresponding (i+1)-th round loss function value of 0.1; then the model parameters of the (i+1)-th round adapter model are kept as δ_c (i.e., the pre-adjustment parameters are retained).
In some embodiments, a gradient function corresponding to the loss function is derived from the loss function; the gradient function is the derivative of the loss function, i.e., it is obtained by taking partial derivatives of the loss function. The model parameters of the adapter model are then updated by gradient descent according to the gradient function. It can be understood that the loss function represents the difference between the predicted value and the actual value, and the gradient function derived from it represents how that difference changes; descending along the gradient reduces the difference between the predicted value and the actual value, which is equivalent to improving the prediction. Gradient descent thus realizes the training process of the adapter model by adjusting the model parameters, which is equivalent to the above-mentioned way of updating the model parameters according to how the loss function value changes before and after parameter adjustment.
Step 350: uploading the i+1st round of loss function values and the trained i+1st round of adapter models to a central server, wherein the i+1st round of loss function values are used for the central server to determine whether to execute the i+2nd round of training.
In some embodiments, the client uploads the trained i+1st round of adapter model and the i+1st round of loss function value corresponding to the adapter model to the central server, and the central server determines whether to execute the i+2nd round of training according to the received at least one i+1st round of loss function value.
In some embodiments, the central server determines whether to perform the (i+2)-th round of training based on the (i+1)-th round average loss function value. The (i+1)-th round average loss function value is the average of at least one (i+1)-th round loss function value, or it is the weighted average of at least one (i+1)-th round loss function value. When the (i+1)-th round average loss function value is greater than the target threshold, the (i+2)-th round of training is performed: an average loss function value greater than the target threshold means that the difference between the predicted values and the actual values generated by the model is still greater than the target threshold, i.e. training needs to continue. When the (i+1)-th round average loss function value is less than or equal to the target threshold, the (i+2)-th round of training is not needed: the difference between the predicted values and the actual values generated by the model is then less than or equal to the target threshold, i.e. the set model training effect has been reached, and training can end.
In some embodiments, the target threshold is a preset value; or, the target threshold is a value set by the central server.
Optionally, the central server determines whether to execute the (i+2)-th round of training according to the average of the received at least one (i+1)-th round loss function value; or, the central server computes a weighted average of the at least one (i+1)-th round loss function value according to the amounts of data that the at least one client used for model training, and determines whether to perform the (i+2)-th round of training according to the weighted average.

For example, there are 3 clients participating in federal learning, the loss function values corresponding to the 3 clients are 0.1, 0.3 and 0.2, respectively, and the corresponding amounts of data used for training are 50, 30 and 120, respectively. The central server averages the 3 received (i+1)-th round loss function values to obtain an average value of 0.2. Or, the central server calculates a weight for each of the 3 clients, where the weight of a client is the quotient of that client's data amount and the total data amount of all clients participating in federal learning, giving weights of 0.25, 0.15 and 0.6, respectively. Weighted-averaging the 3 received (i+1)-th round loss function values with these weights gives a weighted average of 0.1×0.25 + 0.3×0.15 + 0.2×0.6 = 0.19.
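The stopping decision described above can be sketched as follows; the target threshold value of 0.15 is an arbitrary illustrative choice, not a value prescribed by this application.

def should_continue_training(losses, data_amounts=None, target_threshold=0.15):
    # losses: the (i+1)-th round loss function values uploaded by the clients.
    # data_amounts: optional per-client amounts of training data, used for weighting.
    if data_amounts is None:
        average = sum(losses) / len(losses)                 # plain average
    else:
        n = sum(data_amounts)
        average = sum(loss * (d / n) for loss, d in zip(losses, data_amounts))
    return average > target_threshold                       # True -> perform round i+2

# With the numbers above: plain average 0.2, weighted average 0.19,
# so both variants ask for another round when the threshold is 0.15.
print(should_continue_training([0.1, 0.3, 0.2]))
print(should_continue_training([0.1, 0.3, 0.2], [50, 30, 120]))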
In summary, the method provided by the embodiment of the application shows the (i+1)-th round of the training process at the client. During model training, the central server determines whether to perform the next round of training according to the training effect of the (i+1)-th round, namely the (i+1)-th round loss function value, and the central server coordinates the overall model training, which makes model training more efficient.
Fig. 6 shows a flowchart of a federal learning method provided by an exemplary embodiment of the present application, which is performed by a client participating in federal learning, which may be the client as shown in fig. 2. The method further comprises the steps of:
step 410: the method comprises the steps of obtaining a global model, wherein the global model is obtained by integrating a local backbone model and a final adapter model by a central server, and the final adapter model is obtained by aggregating at least two adapter models of clients by the central server in a federal learning mode.
In some embodiments, the client requests a download of the global model from the central server; or, the client acquires the global model issued by the central server.
In some embodiments, the central server acquires the adapter models of at least two clients trained by the last round of model training, and performs aggregation on the acquired adapter models of at least two clients by adopting a federal learning mode to obtain a final adapter model, wherein a model aggregation formula is shown in the following formula.
δ = Σ_{c=1}^{C} (n_c / n) · δ_c

where c denotes client c, C is the total number of clients participating in federal learning, n_c is the data amount of client c, n is the sum of the data amounts of all clients participating in federal learning, δ_c is the model parameter of the adapter model of client c, and δ is the model parameter of the aggregated adapter model.
For example, there are 3 clients participating in federal learning, and the amounts of data used for training corresponding to the 3 clients are 50, 30 and 120, respectively, i.e. n is 200, so the aggregated model parameters can be expressed as δ = (1/4)·δ_1 + (3/20)·δ_2 + (3/5)·δ_3.
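A minimal sketch of this data-weighted aggregation, assuming each client's adapter is represented as a dictionary mapping parameter names to tensors (or numbers); the representation is an assumption for illustration.

def aggregate_adapters(client_adapters, data_amounts):
    # client_adapters: list of adapter parameter dicts, one per client (delta_c).
    # data_amounts: list of the data amounts n_c each client used for training.
    n = sum(data_amounts)
    aggregated = {}
    for name in client_adapters[0]:
        aggregated[name] = sum(
            (n_c / n) * delta_c[name]
            for delta_c, n_c in zip(client_adapters, data_amounts))
    return aggregated

# With data amounts 50, 30 and 120, every aggregated parameter equals
# 1/4 * delta_1 + 3/20 * delta_2 + 3/5 * delta_3, matching the example above.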
In some embodiments, the local backbone model f_{θ_0} and the final adapter model f_δ are integrated to obtain the global model f_{θ_g}, where f is a machine learning model, f_{θ_0} is the machine learning model with model parameters θ_0, f_δ is the machine learning model with model parameters δ, and f_{θ_g} is the machine learning model with model parameters θ_g; θ_g is the model parameter of the global model, θ_g = θ_0 + δ, θ_0 is the model parameter of the local backbone model, and δ is the model parameter of the final adapter model.
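One possible reading of θ_g = θ_0 + δ is that each adapter parameter is added to the backbone parameter it modifies, in the manner of additive adapters such as LoRA after merging. Under that assumption (parameter dicts keyed so that matching entries can be added), the integration can be sketched as:

def build_global_model(backbone_params, final_adapter_params):
    # theta_g = theta_0 + delta: add the final adapter parameters to the
    # matching backbone parameters; backbone entries without an adapter
    # counterpart are kept unchanged.
    global_params = dict(backbone_params)
    for name, delta in final_adapter_params.items():
        global_params[name] = global_params[name] + delta
    return global_params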
In summary, when model training is completed, the method provided by the embodiment of the application obtains the global model, the global model is obtained by integrating the central server according to the local backbone model and the final adapter model, and the final adapter model is obtained by aggregating adapter models uploaded by at least two clients participating in federal learning by the central server, so that the data training model of the clients can be ensured to be used in the federal learning process, but private data of the clients cannot be leaked, and data security during model training is ensured.
In an alternative embodiment based on fig. 6, as shown in fig. 7, the method further comprises:
in some embodiments, step 420-1 and step 420-2 may be performed in either order or simultaneously.
Step 420-1: predicting the input information based on the global model to obtain a prediction result set;
in some embodiments, the client predicts the input information based on a global model to obtain a set of predicted results, wherein the set of predicted results includes at least one predicted result.
Step 420-2: searching the input information based on the local knowledge base to obtain a search result set;
in some embodiments, the client retrieves the input information based on a local knowledge base, resulting in a set of retrieval results, wherein the local knowledge base is constructed based on private data within the client, the set of retrieval results comprising at least one retrieval result.
In some embodiments, private data within the client is input into a global model, and a local knowledge base is built at the client by the global model; or the client builds a local knowledge base according to the private data in the client.
Step 430: and determining a final output result in the prediction result set and the retrieval result set.
In some embodiments, the prediction result set and the search result set have an intersection, or the prediction result set and the search result set do not have an intersection.
In some embodiments, the cardinality of the prediction result set is equal to or different from the cardinality of the retrieval result set; the cardinality is used to represent the size of a set and may also be referred to as the potency of the set.
Illustratively, the set of prediction results is { a1, a2, a3, a5, a6}, and the set of search results is { a1, a4, a7}; or the prediction result set is { a1, a2, a3, a5, a6}, and the search result set is { a4, a7}; or the prediction result set is { a1, a2, a3}, and the search result set is { a4, a5, a6}; or the prediction result set is { a1, a2, a3}, and the search result set is { a1, a2, a3}.
In some embodiments, as shown in FIG. 8, step 430 may alternatively be implemented as steps 431-1 through 432.
In some embodiments, the set of predictors includes at least one predictor and a predictor probability corresponding to the predictor; the search result set comprises at least one search result and a search result probability corresponding to the search result.
Step 431-1: under the condition that the first output result is simultaneously present in the prediction result set and the search result set, interpolation is carried out on the prediction result probability and the search result probability corresponding to the first output result, and interpolation probability of the first output result is obtained;
In some embodiments, under the condition that the first output result is simultaneously present in the prediction result set and the search result set, interpolation is performed on the prediction result probability and the search result probability corresponding to the first output result, so as to obtain the interpolation probability of the first output result, wherein the first output result is an intersection element in the prediction result set and the search result set. For example, the prediction result set is { a1, a2, a3, a5, a6}, the search result set is { a1, a4, a7}, and the first output result may be one element in the set { a1 }.
In some embodiments, the prediction result probability and the search result probability corresponding to the first output result are interpolated, where the interpolation formula is shown as follows.
p = λ·P_retrieval + (1 − λ)·P_prediction

where P_retrieval is the search result probability, P_prediction is the prediction result probability, p is the interpolation probability of the output result, and λ is the interpolation coefficient, λ ∈ [0, 1].
Optionally, λ is a preset value; or lambda is a value set by the client; or λ is a value set by the central server.
Step 431-2: under the condition that the second output result only appears in the predicted result set and does not appear in the search result set, interpolation is carried out on the predicted result probability corresponding to the second output result and the search result probability set to be zero, so that interpolation probability of the second output result is obtained;
In some embodiments, the second output result is an element that is only present in the set of predicted results but not in the set of retrieved results, e.g., the set of predicted results is { a1, a2, a3, a5, a6}, the set of retrieved results is { a1, a4, a7}, then the second output result may be one element of the set { a2, a3, a5, a6 }.
In some embodiments, the prediction result probability corresponding to the second output result and the search result probability set to zero are interpolated, where the interpolation formula is consistent with the interpolation formula of the first output result, but since the search result probability is set to zero, the interpolation formula of the second output result may be abbreviated as:
p = (1 − λ)·P_prediction

where P_prediction is the prediction result probability, p is the interpolation probability of the output result, and λ is the interpolation coefficient, λ ∈ [0, 1].
Optionally, λ is a preset value; or lambda is a value set by the client; or λ is a value set by the central server.
Step 431-3: under the condition that the third output result only appears in the search result set and does not appear in the prediction result set, interpolation is carried out on the search result probability corresponding to the third output result and the prediction result probability set to be zero, so that interpolation probability of the third output result is obtained;
In some embodiments, the third output result is an element that is only present in the set of search results but not in the set of prediction results, e.g., the set of prediction results is { a1, a2, a3, a5, a6}, the set of search results is { a1, a4, a7}, then the third output result may be one element in the set { a4, a7}.
In some embodiments, the search result probability corresponding to the third output result and the prediction result probability set to zero are interpolated, where the interpolation formula is consistent with the interpolation formula of the first output result, but since the prediction result probability is set to zero, the interpolation formula of the third output result may be abbreviated as:
p = λ·P_retrieval

where P_retrieval is the search result probability, p is the interpolation probability of the output result, and λ is the interpolation coefficient, λ ∈ [0, 1].
Optionally, λ is a preset value; or lambda is a value set by the client; or λ is a value set by the central server.
Step 432: and sorting the output results based on interpolation probabilities of all the output results, and determining the output result with the highest interpolation probability as the final output result.
In some embodiments, the total output result is a union of the set of predicted results and the set of retrieved results. For example, the prediction result set is { a1, a2, a3, a5, a6}, the search result set is { a1, a4, a7}, and the output result set is { a1, a2, a3, a4, a5, a6, a7}.
In some embodiments, the output results are ranked based on interpolation probabilities for all output results, and the w output results with the highest interpolation probability are determined to be the final output result. Optionally, the final output result is at least one output result.
For example, when performing a text translation task, the source data is "I have an apple", the text sequence generated so far in the target language is the word for "I", and the second word is being generated; the prediction result set is {have, owned, already}, the retrieval result set is {have}, and the output results sorted by interpolation probability are "have, owned, already". "have" is output as the final output result; or, "have, owned" are output as the final output result; or, "have, owned, already" are output as the final output result.
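The interpolation and ranking in steps 431-1 to 432 can be sketched as follows, with the prediction result set and the retrieval result set represented as dictionaries mapping candidate results to their probabilities; the candidate strings and the default value λ = 0.5 are illustrative assumptions.

def fuse_and_rank(prediction_probs, retrieval_probs, lam=0.5, w=1):
    # prediction_probs / retrieval_probs: dicts mapping candidate output results
    # to their prediction / search result probabilities; lam is the interpolation
    # coefficient lambda in [0, 1]; w is how many output results to return.
    candidates = set(prediction_probs) | set(retrieval_probs)     # union of both sets
    interpolated = {
        c: lam * retrieval_probs.get(c, 0.0)                      # zero if absent from the retrieval set
           + (1 - lam) * prediction_probs.get(c, 0.0)             # zero if absent from the prediction set
        for c in candidates}
    ranked = sorted(interpolated, key=interpolated.get, reverse=True)
    return ranked[:w]

# e.g. fuse_and_rank({"have": 0.6, "owned": 0.3, "already": 0.1}, {"have": 1.0})
# returns ["have"].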
In summary, the method provided in the embodiment of the present application shows a manner of using a global model obtained by training with a federal learning method, and besides using the global model to directly predict to obtain a prediction result set, the embodiment of the present application further provides a local knowledge base, where the local knowledge base is searched to obtain a search result set, and the local knowledge base is constructed according to private data of a client, and a final output result is obtained according to the prediction result set and the search result set. That is, the final output result is generated according to the global model and the local knowledge base, and compared with the method of generating the output result by using only the global model, the method provided by the embodiment of the invention can combine the private data of the client, so that the output result is more personalized, i.e. more accords with the use scene of the client.
In an alternative embodiment based on fig. 7, as shown in fig. 9, the method further comprises:
in some embodiments, the private data includes source data and target data, and the target data includes n sub-target data, n being a positive integer. The private data is represented in the form D = {(x_s, y_s)}, where x denotes the sequence of source data and y denotes the sequence of target data; |D| is the amount of private data, and D, the capital letter of data, denotes the data set; the subscript s indicates that (x_s, y_s) is one piece of private data; x_i denotes the i-th element in the sequence of source data and, similarly, y_i denotes the i-th element in the sequence of target data.
Step 440: extracting features of the source data and the first m sub-target data as context representations of the (m+1) th sub-target data based on the global model, wherein m is a positive integer and is less than or equal to n;
in some embodiments, the source data and the first m sub-target data are input into the global model f_{θ_g} for feature extraction, and the obtained features of the source data and the first m sub-target data are used as the context representation of the (m+1)-th sub-target data. The context representation may be denoted h_{m+1} = f_{θ_g}(x, y_{≤m}), where x is the source data and y_{≤m} is the first m sub-target data.
In some embodiments, the sub-target data are divided according to the time steps of the model. A time step is the step unit used during training or generation in a sequence model. For example, if the target data is a text sequence and the time step is a word, then the sub-target data are the words that make up the target data, e.g. the set of sub-target data of "I have an apple" is {I, have, one, apple}; or, if the target data is a text sequence and the time step is a sentence, then the sub-target data are the sentences that make up the target data, e.g. the set of sub-target data of "I once had an apple. But I ate it yesterday. So I now have no apples." is {I once had an apple, but I ate it yesterday, so I now have no apples}. It should be noted that the division of the time steps depends on the actual effect of the model; only part of the possible time-step divisions are listed in this embodiment of the application, and the remaining division manners are not listed one by one, but the protection scope of the present application is not limited thereto.
Step 450: and storing the corresponding relation between the context representation of the (m+1) th sub-target data and the (m+1) th sub-target data into a local knowledge base.
In some embodiments, the correspondence between the context representation h_{m+1} of the (m+1)-th sub-target data and the (m+1)-th sub-target data y_{m+1} is stored in the local knowledge base.
In some embodiments, the local knowledge base stores data in the form of key-value pairs (K, V). Illustratively, the context representation of the (m+1) th sub-target data is used as a key (K), and the (m+1) th sub-target data is stored as a value (V) in the local knowledge base.
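The construction of the local knowledge base described in steps 440 and 450 can be sketched as follows; the encode method on the global model is an assumed interface for obtaining the context representation, and in practice the key vectors would typically be placed in a nearest-neighbour index rather than a plain list.

def build_local_knowledge_base(global_model, private_data):
    # private_data: iterable of (source_sequence, target_units) pairs.
    # Each entry of the knowledge base is a key-value pair (K, V): the key is the
    # context representation of the (m+1)-th sub-target unit, the value is that unit.
    knowledge_base = []
    for x, y in private_data:
        for m in range(len(y)):
            context = global_model.encode(x, y[:m])   # h_{m+1} = f_theta_g(x, y_<=m)
            knowledge_base.append((context, y[m]))    # (K, V) = (h_{m+1}, y_{m+1})
    return knowledge_base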
In some embodiments, steps 440 and 450 are the method of building the local knowledge base, and steps 440 and 450 may be performed before, after, or simultaneously with step 410.
To facilitate understanding of the foregoing solution, this embodiment of the application further describes the method for constructing the local knowledge base by taking the case where the source data is a voice sequence, the target data is a text sequence, the sub-target data are text units, and a text unit is a character or a word. As shown in fig. 10, step 440 may be implemented as step 441, and step 450 may be implemented as step 451.
Step 441: extracting the characteristics of the voice sequence and the first m text units as context representations of the (m+1) th text unit based on the global model;
There is private data 41 whose corresponding source data is a speech sequence x and whose target data is a text sequence y. The text sequence y is divided to obtain the set of sub-target data {I, have, one, apple}, and m is 3. The speech sequence x and the first m text units y_{≤m} 42 are input into the global model f_{θ_g} to obtain the context representation of the 4th text unit, h_4.
Step 451: and storing the corresponding relation between the context representation of the (m+1) th text unit and the (m+1) th text unit into a local knowledge base.
The client stores the extracted context representation 44 of the 4 th text unit and the 4 th text unit "apple" in the local knowledge base 45.
In summary, the method provided by the embodiment of the present application provides a method for constructing a local knowledge base, according to private data and a global model owned by a client, where the local knowledge base is used to instruct generation of a final output result, so that individuation of the model can be better implemented.
In an alternative embodiment based on fig. 7, step 420-2 may alternatively be implemented as step 510 and step 520, as shown in fig. 11.
In some embodiments, the input information includes source data and first m sub-target data.
Step 510: extracting features of the source data and the first m sub-target data as context representations of the (m+1) th sub-target data based on the global model;
in some embodiments, the client extracts features of the input information as a contextual representation of the information to be generated based on a global model; or, the client side extracts the characteristics of the input information as the context representation of the (m+1) th sub-target data based on the global model.
Step 520: based on the local knowledge base, searching is carried out according to the context representation of the (m+1) th sub-target data, and a search result set is obtained through a k neighbor model.
In some embodiments, the client retrieves in the local knowledge base according to the context representation of the (m+1) -th sub-target data, obtains k neighbors according to the context representation of the (m+1) -th sub-target data and the context representation stored in the local knowledge base, and composes the k neighbors into a retrieval result set, where k is a positive integer.
In some embodiments, the number k of neighbors obtained from the k-neighbor model is a preset value; or, the number k of the neighbors obtained according to the k neighbor model is set by the client; or, the number k of neighbors obtained according to the k neighbor model is set by the central server.
In some embodiments, when the k neighbors are found, the distance is measured by Euclidean distance, Manhattan distance, cosine similarity, correlation, and the like, which is not limited in this embodiment.
In some embodiments, the probability of each search result in the set of search results is calculated by a k-nearest neighbor model, and the probability calculation formula is shown as follows.
p_kNN(y_{m+1} | x, ŷ_{≤m}) ∝ Σ_{(h_i, v_i) ∈ N, v_i = y_{m+1}} exp(−d(h_i, f_{θ_g}(x, ŷ_{≤m})) / T)

where exp() is the exponential function with base e; ŷ_{≤m} is the first m sub-target data that have already been generated; N is the set of k retrieved neighbors, h_i is the context representation of the i-th neighbor, and v_i is the search result corresponding to the i-th neighbor; d is the distance, and d(h_i, f_{θ_g}(x, ŷ_{≤m})) can be understood as the distance between the context representation h_i of the i-th neighbor and the context representation of the sub-target data to be generated; the embodiment of the present application is described in terms of the squared Euclidean distance. T is a smoothing temperature used to control the smoothness of the search result probabilities in the search result set. For example, when T is greater than 1, the probability of the corresponding search result decreases; when the final output result is generated from the search results and the prediction results, the influence of the search results therefore decreases, and the generated content is closer to the content of the global model, i.e. more general content. When T is less than 1, the probability of the corresponding search result increases; when the final output result is generated, the influence of the search results therefore increases, and the generated content is closer to the content in the local knowledge base, i.e. more personalized content.

For the retrieved k neighbors, the search result probabilities of neighbors having the same search result are aggregated to obtain the final search result probability corresponding to that result, which can be understood as a de-duplication process. p_kNN(y_{m+1} | x, ŷ_{≤m}) can be understood as the probability that the candidate target data y_{m+1} occurs on the premise that the source data x and the first m generated sub-target data ŷ_{≤m} have occurred.
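A minimal sketch of this retrieval step, assuming the knowledge base is the plain list of (context representation, unit) pairs built earlier and that the context representations are plain Python vectors; the default values k = 8 and T = 1.0 are illustrative assumptions.

import math

def knn_retrieval_probs(query, knowledge_base, k=8, temperature=1.0):
    # query: context representation of the sub-target data to be generated.
    # knowledge_base: list of (context_representation, unit) pairs.
    def sq_euclidean(a, b):
        return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))

    # Take the k nearest neighbours by squared Euclidean distance.
    neighbours = sorted(knowledge_base, key=lambda kv: sq_euclidean(kv[0], query))[:k]

    # Aggregate exp(-d/T) scores of neighbours sharing the same search result
    # (the de-duplication step), then normalise into probabilities.
    scores = {}
    for h_i, v_i in neighbours:
        scores[v_i] = scores.get(v_i, 0.0) + math.exp(-sq_euclidean(h_i, query) / temperature)
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()} if total else {}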
In some embodiments, prediction is performed in teacher-forcing mode, i.e. ŷ_{≤m} is y_{≤m}, the actual sub-target data. For example, the source data corresponding to the input information is the voice data of the target data, the target data is "I have an apple", and the corresponding set of sub-target data is {I, have, one, apple}. Even if, when the 2nd sub-target data is generated, the final output result differs from the actual sub-target data "have", when the 3rd sub-target data is generated, the retrieval is still performed using the context representation corresponding to the source data and the actual prefix "I have".
In summary, the method provided by the embodiment of the application shows that the search is performed according to the local knowledge base, the k neighbor model is adopted during the search, k neighbors are obtained according to the context representation search of the information to be generated, the search result set is generated, and the search is performed according to the above mode, so that personalized information in the local knowledge base can be fully utilized, and the content generated during the model prediction is more personalized.
Fig. 12 shows a flowchart of a federal learning method provided by an exemplary embodiment of the present application, which is performed by a central server, which may be a server as shown in fig. 2. The method comprises the following steps:
step 610: transmitting the cascaded local backbone model and the adapter model to at least two clients participating in federal learning, wherein the number of model parameters of the adapter model is smaller than that of the local backbone model;
in some embodiments, a central server issues a cascaded local backbone model and an adapter model to at least two clients participating in federal learning; or, the central server receives requests for downloading the cascaded local backbone model and the adapter model from at least two clients participating in federal learning, and issues the cascaded local backbone model and the adapter model to the at least two clients participating in federal learning; or, the central server transmits an adapter model to at least two clients participating in federal learning; or, the central server transmits a local backbone model to at least two clients participating in federal learning; or, the central server transmits model parameters of the adapter model to at least two clients participating in federal learning; or, the central server transmits model parameters of the local backbone model to at least two clients participating in federal learning. It should be noted that the embodiments of the present application list only some of the manners of data interaction between the central server and the clients participating in federal learning; the remaining manners of data interaction are not listed one by one, but the protection scope of the embodiments of the present application is not limited thereto.
Optionally, the adapter model is a fine-tuning model, i.e. a model that employs a fine-tuning approach, such as LoRA, Freeze, P-Tuning, etc.
In some embodiments, the number of model parameters of the adapter model is less than the number of model parameters of the local backbone model; or, the number of model parameters of the adapter model is much smaller than the number of model parameters of the local backbone model.
Illustratively, the number of model parameters of the local backbone model is 180B (Billion), and the number of model parameters of the adapter model is 4.7M (Million), where 1B = 10^3 M.
Step 620: receiving the trained adapter model uploaded by at least two clients;
in some embodiments, the central server receives the trained adapter models uploaded by at least two clients; or, the central server receives the model parameters δ_c of the trained adapter models uploaded by at least two clients.
Step 630: and aggregating based on the trained adapter models uploaded by at least two clients to obtain an aggregated adapter model.
In some embodiments, when uploading the trained adapter model, the client also uploads the amount of data used to train the adapter model.
In some embodiments, the aggregation formula for aggregating the trained adapter models uploaded by at least two clients by the central server is shown below.
δ = Σ_{c=1}^{C} (n_c / n) · δ_c

where c denotes client c, C is the total number of clients participating in federal learning, n_c is the data amount of client c, n is the sum of the data amounts of all clients participating in federal learning, δ_c is the model parameter of the adapter model of client c, and δ is the model parameter of the aggregated adapter model.
For example, there are 3 clients participating in federal learning, and the amounts of data used for training corresponding to the 3 clients are 50, 30 and 120, respectively, i.e. n is 200, so the aggregated model parameters can be expressed as δ = (1/4)·δ_1 + (3/20)·δ_2 + (3/5)·δ_3.
In summary, according to the method provided by the embodiment of the application, the central server is responsible for issuing the adapter model and aggregating the adapter model uploaded by the client participating in federal learning, the adapter model is only required to be aggregated in the whole federal learning process, and the number of model parameters of the adapter model is smaller than that of the local backbone model, so that the configuration requirement of the central server is reduced, and the federal learning cost is reduced.
In an alternative embodiment based on fig. 12, step 620 may alternatively be implemented as steps 621 to 623-2, as shown in fig. 13.
In some embodiments, the adapter model includes a jth round adapter model, j being a positive integer.
Step 621: receiving a j-th round of adapter model and a j-th round of loss function value which are uploaded by at least two clients and are trained;
in some embodiments, the client also uploads the trained loss function value for the round of training when uploading the trained adapter model. The loss function value is used to represent the difference between the model predicted value and the actual value.
Step 622: obtaining a jth round average loss function value based on the jth round loss function values of at least two clients;
in some embodiments, the jth round average loss function value is an average of jth round loss function values of at least two clients; or, the jth round average loss function value is a weighted average of the jth round loss function values of the at least two clients.
For example, there are 3 clients participating in federal learning, the loss function values corresponding to the 3 clients are 0.1, 0.3 and 0.2, respectively, and the corresponding amounts of data used for training are 50, 30 and 120, respectively. The central server averages the 3 received j-th round loss function values to obtain an average value of 0.2, and the j-th round average loss function value is 0.2. Or, the central server calculates a weight for each of the 3 clients, where the weight of a client is the quotient of that client's data amount and the total data amount of all clients participating in federal learning, giving weights of 0.25, 0.15 and 0.6, respectively. Weighted-averaging the 3 received j-th round loss function values with these weights gives a weighted average of 0.19, and the j-th round average loss function value is 0.19.
Step 623-1: under the condition that the average loss function value of the jth round is larger than a target threshold, the aggregated adapter model is used as a j+1th round adapter to be issued to at least two clients participating in federal learning;
in some embodiments, the target threshold is a preset value; or, the target threshold is a value set by the central server.
When the average loss function value is greater than the target threshold, the difference between the predicted value and the actual value generated by the model is greater than the target threshold, that is, training needs to be continued, so that the central server takes the aggregated adapter model as the initial adapter model of the next training round, and the client continues the (j+1) th training round on the basis of the (j) th training round.
Step 623-2: and under the condition that the j-th round average loss function value is smaller than or equal to the target threshold value, integrating the aggregated adapter model as a final adapter model with the local backbone model to obtain a global model.
In some embodiments, the target threshold is a preset value; or, the target threshold is a value set by the central server.
When the average loss function value is smaller than or equal to the target threshold, the difference between the predicted value and the actual value generated by the model at the moment is smaller than or equal to the target threshold, that is, the set model training effect is achieved, training can be finished, and the aggregated adapter model is used as a final adapter model to be integrated with the local backbone model to obtain the global model.
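Putting steps 610 to 630 and steps 621 to 623-2 together, the central server's per-round control flow can be sketched as follows, reusing the aggregate_adapters and build_global_model sketches above; the client interface train_locally and the attribute backbone_params are assumptions for illustration.

def run_federated_round(server, clients, adapter, target_threshold):
    # Step 610 / 623-1: the current (aggregated) adapter is distributed to the clients,
    # each of which trains locally and reports (trained adapter, loss value, data amount).
    results = [client.train_locally(adapter) for client in clients]
    adapters = [r[0] for r in results]
    losses = [r[1] for r in results]
    data_amounts = [r[2] for r in results]

    # Steps 620-630 / 621-622: aggregate the adapters and compute the weighted average loss.
    aggregated = aggregate_adapters(adapters, data_amounts)
    n = sum(data_amounts)
    average_loss = sum(loss * (d / n) for loss, d in zip(losses, data_amounts))

    if average_loss > target_threshold:
        # Step 623-1: another round is needed; the aggregated adapter becomes the next-round adapter.
        return aggregated, False
    # Step 623-2: training ends; integrate the final adapter with the local backbone model.
    return build_global_model(server.backbone_params, aggregated), True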
In summary, the method provided by the embodiment of the application shows the processing procedure of the central server after the client performs the j-th round of training. The central server determines whether to perform the next round of training according to the training effect of the j-th round, namely the j-th round loss function value, and the central server coordinates the overall model training, which makes model training more efficient.
FIG. 14 illustrates a flowchart of a federal prediction method provided in an exemplary embodiment of the present application. The method is performed by a client, which may be a client as shown in fig. 2. The method comprises the following steps:
step 710: acquiring a global model which is trained according to a federal learning method;
in some embodiments, the client requests a download of the global model from the central server; or, the client acquires the global model issued by the central server.
In some embodiments, the client is a client that participates in federal learning; or, the client is a client which does not participate in federal learning.
Step 720-1: predicting the input information based on the global model to obtain a prediction result set;
in some embodiments, the client predicts the input information based on a global model to obtain a set of predicted results, wherein the set of predicted results includes at least one predicted result.
Step 720-2: searching the input information based on the local knowledge base to obtain a search result set;
in some embodiments, the client retrieves the input information based on a local knowledge base, resulting in a set of retrieval results, wherein the local knowledge base is constructed based on private data within the client, the set of retrieval results comprising at least one retrieval result.
In some embodiments, private data within the client is input into a global model, and a local knowledge base is built at the client by the global model; or the client builds a local knowledge base according to the private data in the client.
Step 730: and determining a final output result in the prediction result set and the retrieval result set.
In some embodiments, the prediction result set and the search result set have an intersection, or the prediction result set and the search result set do not have an intersection.
In some embodiments, the cardinality of the prediction result set is equal to or different from the cardinality of the retrieval result set; the cardinality is used to represent the size of a set and may also be referred to as the potency of the set.
Illustratively, the set of prediction results is { a1, a2, a3, a5, a6}, and the set of search results is { a1, a4, a7}; or the prediction result set is { a1, a2, a3, a5, a6}, and the search result set is { a4, a7}; or the prediction result set is { a1, a2, a3}, and the search result set is { a4, a5, a6}; or the prediction result set is { a1, a2, a3}, and the search result set is { a1, a2, a3}.
In summary, the method provided in the embodiment of the present application shows a method in which the client performs prediction by obtaining the global model, when a final output result is generated, a search result set is obtained by searching according to the local knowledge base, and a final output result is obtained according to the search result set and a prediction result set obtained by predicting the model.
In an alternative embodiment based on fig. 14, step 730 may alternatively be implemented as steps 731-1 through 732 as shown in fig. 15.
In some embodiments, the set of predictors includes at least one predictor and a predictor probability corresponding to the predictor; the search result set comprises at least one search result and a search result probability corresponding to the search result.
Step 731-1: under the condition that the first output result is simultaneously present in the prediction result set and the search result set, interpolation is carried out on the prediction result probability and the search result probability corresponding to the first output result, and interpolation probability of the first output result is obtained;
In some embodiments, under the condition that the first output result is simultaneously present in the prediction result set and the search result set, interpolation is performed on the prediction result probability and the search result probability corresponding to the first output result, so as to obtain the interpolation probability of the first output result, wherein the first output result is an intersection element in the prediction result set and the search result set. For example, the prediction result set is { a1, a2, a3, a5, a6}, the search result set is { a1, a4, a7}, and the first output result may be one element in the set { a1 }.
In some embodiments, the prediction result probability and the search result probability corresponding to the first output result are interpolated, where the interpolation formula is shown as follows.
p = λ·P_retrieval + (1 − λ)·P_prediction

where P_retrieval is the search result probability, P_prediction is the prediction result probability, p is the interpolation probability of the output result, and λ is the interpolation coefficient, λ ∈ [0, 1].
Optionally, λ is a preset value; or lambda is a value set by the client; or λ is a value set by the central server.
Step 731-2: under the condition that the second output result only appears in the predicted result set and does not appear in the search result set, interpolation is carried out on the predicted result probability corresponding to the second output result and the search result probability set to be zero, so that interpolation probability of the second output result is obtained;
In some embodiments, the second output result is an element that is only present in the set of predicted results but not in the set of retrieved results, e.g., the set of predicted results is { a1, a2, a3, a5, a6}, the set of retrieved results is { a1, a4, a7}, then the second output result may be one element of the set { a2, a3, a5, a6 }.
In some embodiments, the prediction result probability corresponding to the second output result and the search result probability set to zero are interpolated, where the interpolation formula is consistent with the interpolation formula of the first output result, but since the search result probability is set to zero, the interpolation formula of the second output result may be abbreviated as:
p = (1 − λ)·P_prediction

where P_prediction is the prediction result probability, p is the interpolation probability of the output result, and λ is the interpolation coefficient, λ ∈ [0, 1].
Optionally, λ is a preset value; or lambda is a value set by the client; or λ is a value set by the central server.
Step 731-3: under the condition that the third output result only appears in the search result set and does not appear in the prediction result set, interpolation is carried out on the search result probability corresponding to the third output result and the prediction result probability set to be zero, so that interpolation probability of the third output result is obtained;
In some embodiments, the third output result is an element that is only present in the set of search results but not in the set of prediction results, e.g., the set of prediction results is { a1, a2, a3, a5, a6}, the set of search results is { a1, a4, a7}, then the third output result may be one element in the set { a4, a7}.
In some embodiments, the search result probability corresponding to the third output result and the prediction result probability set to zero are interpolated, where the interpolation formula is consistent with the interpolation formula of the first output result, but since the prediction result probability is set to zero, the interpolation formula of the third output result may be abbreviated as:
p = λ·P_retrieval

where P_retrieval is the search result probability, p is the interpolation probability of the output result, and λ is the interpolation coefficient, λ ∈ [0, 1].
Optionally, λ is a preset value; or lambda is a value set by the client; or λ is a value set by the central server.
Step 732: and sorting the output results based on interpolation probabilities of all the output results, and determining the output result with the highest interpolation probability as the final output result.
In some embodiments, the total output result is a union of the set of predicted results and the set of retrieved results. For example, the prediction result set is { a1, a2, a3, a5, a6}, the search result set is { a1, a4, a7}, and the output result set is { a1, a2, a3, a4, a5, a6, a7}.
In some embodiments, the output results are ranked based on interpolation probabilities for all output results, and the w output results with the highest interpolation probability are determined to be the final output result. Optionally, the final output result is at least one output result.
For example, when performing a text translation task, the source data is "I have an apple", the text sequence generated so far in the target language is the word for "I", and the second word is being generated; the prediction result set is {have, owned, already}, the retrieval result set is {have}, and the output results sorted by interpolation probability are "have, owned, already". "have" is output as the final output result; or, "have, owned" are output as the final output result; or, "have, owned, already" are output as the final output result.
In summary, in addition to the method for directly predicting by using the global model to obtain the prediction result set, the embodiment of the application further provides a local knowledge base, the local knowledge base is searched to obtain the search result set, the local knowledge base is constructed according to private data of the client, and the final output result is obtained according to the prediction result set and the search result set. That is, the final output result is generated according to the global model and the local knowledge base, and compared with the method of generating the output result by using only the global model, the method provided by the embodiment of the invention can combine the private data of the client, so that the output result is more personalized, i.e. more accords with the use scene of the client.
In an alternative embodiment based on fig. 14, as shown in fig. 16, the method further comprises:
in some embodiments, the private data includes source data and target data, and the target data includes n sub-target data, n being a positive integer. The private data is represented in the form D = {(x_s, y_s)}, where x denotes the sequence of source data and y denotes the sequence of target data; |D| is the amount of private data, and D, the capital letter of data, denotes the data set; the subscript s indicates that (x_s, y_s) is one piece of private data; x_i denotes the i-th element in the sequence of source data and, similarly, y_i denotes the i-th element in the sequence of target data.
Step 740: extracting features of the source data and the first m sub-target data as context representations of the (m+1) th sub-target data based on the global model, wherein m is a positive integer and is less than or equal to n;
in some embodiments, the source data and the first m sub-target data are input into the global model f_{θ_g} for feature extraction, and the obtained features of the source data and the first m sub-target data are used as the context representation of the (m+1)-th sub-target data. The context representation may be denoted h_{m+1} = f_{θ_g}(x, y_{≤m}), where x is the source data and y_{≤m} is the first m sub-target data.
In some embodiments, the sub-target data are divided according to the time steps of the model. A time step is the step unit used during training or generation in a sequence model. For example, if the target data is a text sequence and the time step is a word, then the sub-target data are the words that make up the target data, e.g. the set of sub-target data of "I have an apple" is {I, have, one, apple}; or, if the target data is a text sequence and the time step is a sentence, then the sub-target data are the sentences that make up the target data, e.g. the set of sub-target data of "I once had an apple. But I ate it yesterday. So I now have no apples." is {I once had an apple, but I ate it yesterday, so I now have no apples}. It should be noted that the division of the time steps depends on the actual effect of the model; only part of the possible time-step divisions are listed in this embodiment of the application, and the remaining division manners are not listed one by one, but the protection scope of the present application is not limited thereto.
Step 750: and storing the corresponding relation between the context representation of the (m+1) th sub-target data and the (m+1) th sub-target data into a local knowledge base.
In some embodiments, the correspondence between the context representation h_{m+1} of the (m+1)-th sub-target data and the (m+1)-th sub-target data y_{m+1} is stored in the local knowledge base.
In some embodiments, the local knowledge base stores data in the form of key-value pairs (K, V). Illustratively, the context representation of the (m+1) th sub-target data is used as a key (K), and the (m+1) th sub-target data is stored as a value (V) in the local knowledge base.
In some embodiments, steps 740 and 750 are the method of building the local knowledge base, and steps 740 and 750 may be performed before, after, or simultaneously with step 710.
To facilitate understanding of the foregoing solution, this embodiment of the application further describes the method for constructing the local knowledge base by taking the case where the source data is a voice sequence, the target data is a text sequence, the sub-target data are text units, and a text unit is a character or a word. As shown in fig. 17, step 740 may be implemented as step 741, and step 750 may be implemented as step 751.
Step 741: extracting the characteristics of the voice sequence and the first m text units as context representations of the (m+1) th text unit based on the global model;
There is private data 71 whose corresponding source data is a speech sequence x and whose target data is a text sequence y. The text sequence y is divided to obtain the set of sub-target data {I, have, one, apple}, and m is 3. The speech sequence x and the first m text units y_{≤m} 72 are input into the global model f_{θ_g} to obtain the context representation of the 4th text unit, h_4.
Step 751: and storing the corresponding relation between the context representation of the (m+1) th text unit and the (m+1) th text unit into a local knowledge base.
The client stores the extracted context representation 74 of the 4 th text unit and the 4 th text unit "apple" in a local knowledge base 75.
In summary, the method provided by the embodiment of the present application provides a method for constructing a local knowledge base, according to private data and a global model owned by a client, where the local knowledge base is used to instruct generation of a final output result, so that individuation of the model can be better implemented.
In an alternative embodiment based on fig. 14, step 720-2 may alternatively be implemented as step 810 and step 820, as shown in fig. 18.
In some embodiments, the input information includes source data and first m sub-target data.
Step 810: extracting features of the source data and the first m sub-target data as context representations of the (m+1) th sub-target data based on the global model;
in some embodiments, the client extracts features of the input information as a contextual representation of the information to be generated based on a global model; or, the client side extracts the characteristics of the input information as the context representation of the (m+1) th sub-target data based on the global model.
Step 820: based on the local knowledge base, searching is carried out according to the context representation of the (m+1) th sub-target data, and a search result set is obtained through a k neighbor model.
In some embodiments, the client retrieves in the local knowledge base according to the context representation of the (m+1) -th sub-target data, obtains k neighbors according to the context representation of the (m+1) -th sub-target data and the context representation stored in the local knowledge base, and composes the k neighbors into a retrieval result set, where k is a positive integer.
In some embodiments, the number k of neighbors obtained from the k-neighbor model is a preset value; or, the number k of the neighbors obtained according to the k neighbor model is set by the client; or, the number k of neighbors obtained according to the k neighbor model is set by the central server.
In some embodiments, when the k neighbors are found, the distance is measured by Euclidean distance, Manhattan distance, cosine similarity, correlation, and the like, which is not limited in this embodiment.
In some embodiments, the probability of each search result in the set of search results is calculated by a k-nearest neighbor model, and the probability calculation formula is shown as follows.
p_kNN(y_{m+1} | x, ŷ_{≤m}) ∝ Σ_{(h_i, v_i) ∈ N, v_i = y_{m+1}} exp(−d(h_i, f_{θ_g}(x, ŷ_{≤m})) / T)

where exp() is the exponential function with base e; ŷ_{≤m} is the first m sub-target data that have already been generated; N is the set of k retrieved neighbors, h_i is the context representation of the i-th neighbor, and v_i is the search result corresponding to the i-th neighbor; d is the distance, and d(h_i, f_{θ_g}(x, ŷ_{≤m})) can be understood as the distance between the context representation h_i of the i-th neighbor and the context representation of the sub-target data to be generated; the embodiment of the present application is described in terms of the squared Euclidean distance. T is a smoothing temperature used to control the smoothness of the search result probabilities in the search result set. For example, when T is greater than 1, the probability of the corresponding search result decreases; when the final output result is generated from the search results and the prediction results, the influence of the search results therefore decreases, and the generated content is closer to the content of the global model, i.e. more general content. When T is less than 1, the probability of the corresponding search result increases; when the final output result is generated, the influence of the search results therefore increases, and the generated content is closer to the content in the local knowledge base, i.e. more personalized content.

For the retrieved k neighbors, the search result probabilities of neighbors having the same search result are aggregated to obtain the final search result probability corresponding to that result, which can be understood as a de-duplication process. p_kNN(y_{m+1} | x, ŷ_{≤m}) can be understood as the probability that the candidate target data y_{m+1} occurs on the premise that the source data x and the first m generated sub-target data ŷ_{≤m} have occurred.
In some embodiments, prediction is performed in teacher-forcing mode, i.e. ŷ_{≤m} is y_{≤m}, the actual sub-target data. For example, the source data corresponding to the input information is the voice data of the target data, the target data is "I have an apple", and the corresponding set of sub-target data is {I, have, one, apple}. Even if, when the 2nd sub-target data is generated, the final output result differs from the actual sub-target data "have", when the 3rd sub-target data is generated, the retrieval is still performed using the context representation corresponding to the source data and the actual prefix "I have".
In summary, the method provided by the embodiment of the application shows that the search is performed according to the local knowledge base, the k neighbor model is adopted during the search, k neighbors are obtained according to the context representation search of the information to be generated, the search result set is generated, and the search is performed according to the above mode, so that personalized information in the local knowledge base can be fully utilized, and the content generated during the model prediction is more personalized.
FIG. 19 illustrates a schematic diagram of a federal prediction method provided in an exemplary embodiment of the present application.
In the scenario of converting speech into text, the corresponding source data is a voice sequence x and the target data is a text sequence y. The real text corresponding to the voice sequence x is "I have an apple"; the first m text units have already been predicted, the text sequence consisting of these first m text units is "I have one", and m is 3. The voice sequence and the first m text units that have currently been predicted are taken as input information 81, and the input information 81 is input into a global model 82. On the one hand, the global model 82 predicts, based on the input information 81, a prediction result set and a prediction result probability distribution 83 corresponding to the prediction result set; on the other hand, the global model 82 extracts features of the input information 81 as the context representation 84 of the 4th text unit. The client performs a search in the local knowledge base 85 based on the context representation 84 of the 4th text unit, resulting in a search result set 86, the search result set 86 including the search results and the distance between each search result and the context representation 84 of the 4th text unit. Based on the k-nearest-neighbor model, a probability distribution 87 of the search results of the k nearest neighbors is calculated. The prediction result probabilities in the prediction result set and the search result probabilities in the search result set are interpolated, the interpolation probability of each candidate output result is calculated, and the final output result with the highest interpolation probability is output, thereby obtaining output information 88.
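A minimal sketch of one decoding step of this pipeline, reusing the knn_retrieval_probs and fuse_and_rank sketches above; predict_distribution and encode are assumed interfaces of the global model, and the parameter defaults are illustrative.

def generate_next_unit(global_model, knowledge_base, x, prefix, lam=0.5, k=8, temperature=1.0):
    # One decoding step of the federated prediction pipeline of fig. 19.
    prediction_probs = global_model.predict_distribution(x, prefix)  # prediction results 83
    context = global_model.encode(x, prefix)                         # context representation 84
    retrieval_probs = knn_retrieval_probs(context, knowledge_base, k, temperature)  # distribution 87
    # Interpolate the two distributions and output the most probable candidate (output 88).
    return fuse_and_rank(prediction_probs, retrieval_probs, lam, w=1)[0]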
FIG. 20 illustrates an overall flow chart of a federal learning method provided in one exemplary embodiment of the present application. The method comprises the following steps:
step 910: the central server transmits the cascaded local backbone model and the adapter model to the client side participating in federal learning;
in some embodiments, a central server issues a cascaded local backbone model and an adapter model to at least two clients participating in federal learning; or, the central server receives requests for downloading the cascaded local backbone model and the adapter model from at least two clients participating in federal learning, and issues the cascaded local backbone model and the adapter model to the at least two clients participating in federal learning; or, the central server transmits an adapter model to at least two clients participating in federal learning; or, the central server transmits a local backbone model to at least two clients participating in federal learning; or, the central server transmits model parameters of the adapter model to at least two clients participating in federal learning; or, the central server transmits model parameters of the local backbone model to at least two clients participating in federal learning. It should be noted that the embodiments of the present application list only some of the manners of data interaction between the central server and the clients participating in federal learning; the remaining manners of data interaction are not listed one by one, but the protection scope of the embodiments of the present application is not limited thereto.
Optionally, the adapter model is a fine-tuning model, i.e., a model that employs a parameter-efficient fine-tuning approach such as LoRA, Freeze, P-Tuning, and the like.
In some embodiments, the number of model parameters of the adapter model is less than the number of model parameters of the local backbone model; or, the number of model parameters of the adapter model is much smaller than the number of model parameters of the local backbone model.
For example, the number of model parameters of the local backbone model is 180B (billion) and the number of model parameters of the adapter model is 4.7M (million), where 1 B = 10^3 M.
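As a rough illustration of why the adapter contributes so few parameters, the sketch below attaches a LoRA-style low-rank adapter to a single frozen linear layer; the layer size, the rank, and the class name are hypothetical and are only meant to show the order-of-magnitude gap between frozen backbone parameters and trainable adapter parameters.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a low-rank adapter (LoRA-style sketch)."""
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # backbone weight stays frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

layer = LoRALinear(4096, 4096, rank=8)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(frozen, trainable)   # ~16.8M frozen vs ~65K trainable for this single layer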
Step 911: the central server transmits a cascaded local backbone model and an adapter model to a client 1 participating in federal learning;
step 912: the central server transmits a cascaded local backbone model and an adapter model to a client 2 participating in federal learning;
step 91c: the central server transmits a cascaded local backbone model and an adapter model to a client c participating in federal learning;
step 920: the client participating in federal learning trains the local backbone model and the adapter model based on the private data in the client;
in some embodiments, the client trains the local backbone model and the adapter model for a preset number of iterations based on the obtained private data; or, the client trains the local backbone model and the adapter model based on the obtained private data until the loss function converges.
In some embodiments, the client trains the local backbone model and the adapter model based on private data, and updates model parameters of the adapter model during training to obtain a trained adapter model.
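A minimal sketch of this client-side training, assuming a PyTorch-style backbone and adapter: only the adapter parameters are handed to the optimizer, so the local backbone model stays fixed. The names `backbone`, `adapter`, `compute_loss`, and `private_loader` are placeholders for illustration, not interfaces defined by the embodiment.

```python
import torch

def train_adapter_locally(backbone, adapter, private_loader, local_epochs=1, lr=1e-4):
    """Train only the adapter on the client's private data (sketch)."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)          # local backbone model is kept fixed

    optimizer = torch.optim.AdamW(adapter.parameters(), lr=lr)
    for _ in range(local_epochs):        # preset number of passes over the private data
        for source, target in private_loader:
            features = backbone(source)                  # frozen forward pass
            loss = adapter.compute_loss(features, target)  # placeholder loss interface
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return adapter.state_dict()          # only the adapter parameters are uploaded
```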
In some embodiments, the private data is represented in the form ofx represents the sequence of source data, y represents the sequence of target data,/for>The length of private data, D is the capitalization of data, and represents the data; the subscript s indicates that the data is a piece of private data,/->The ith element in the sequence representing the source data, and similarly,/>Representing the ith element in the sequence of target data.
For example, the sequence of source data is a speech sequence taking a frame as the basic unit, and the sequence of target data is a text sequence taking a phoneme or a word as the basic unit; or, the sequence of source data is a text sequence, and the sequence of target data is a speech sequence; or, the sequence of source data is a text sequence in language a, the sequence of target data is a text sequence in language b, and language a and language b are different languages, such as English, Chinese, German, French, and the like; or, the sequence of source data is a text sequence of task prompt information, and the sequence of target data is a text sequence of the output information of the executed task, such as questions and answers in a generative language model.
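These source/target pairings can be pictured with a small, hypothetical record type; the field names and sample contents below are illustrative placeholders only.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class PrivateSample:
    source: Sequence   # x: e.g. speech frames, or a text sequence
    target: Sequence   # y: e.g. phonemes/words, speech frames, or a translation

# Illustrative pairings for the tasks listed above (contents are placeholders):
asr_sample = PrivateSample(source=["frame_1", "frame_2"],
                           target=["I", "have", "an", "apple"])
mt_sample  = PrivateSample(source=["I", "have", "an", "apple"],
                           target=["ich", "habe", "einen", "Apfel"])
qa_sample  = PrivateSample(source=["What", "is", "federated", "learning", "?"],
                           target=["A", "distributed", "training", "paradigm", "..."])
```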
Step 930: uploading the trained adapter model by the client;
in some embodiments, the client uploads the trained adapter model to the central server, which uses the trained adapter model to update the global model on the central server side.
Step 931: the client 1 uploads the trained adapter model;
step 932: the client 2 uploads the trained adapter model;
step 93c: uploading the trained adapter model by the client c;
step 940: the central server judges whether the next round of training is needed;
in some embodiments, the central server determines whether to perform the next round of training based on the adapter models uploaded by the clients; or, the central server determines whether to perform the next round of training based on the loss function values uploaded by the clients.
Step 950: continuing training;
in some embodiments, when the central server determines that a further round of training is still needed, steps 910 to 940 are performed again to complete another round of training.
The r-th round of training is shown in fig. 21. In the r-th round, client 1 freezes the local backbone model 91 and only updates its adapter model LoRA_1^r. After each of the clients 1 to c completes the r-th round of training and uploads its r-th round adapter model, the central server 92 determines, based on the received adapter models of clients 1 to c, whether the (r+1)-th round of training is required, and aggregates LoRA_1^r to LoRA_c^r into LoRA^{r+1}. If training is still needed, LoRA^{r+1} is issued to the clients to continue training; if training is not needed, LoRA^{r+1} is taken as the final adapter and integrated into the global model. It should be noted that, in the embodiments of the present application, the adapter model is described by taking a LoRA component as an example, but the specific type of the adapter model is not limited.
Step 960: the central server obtains the final adapter and integrates the global model.
In some embodiments, the central server acquires the adapter models of the at least two clients obtained in the last round of model training, and aggregates the acquired adapter models of the at least two clients in a federal learning manner to obtain the final adapter model, the model aggregation formula being as follows:

δ = Σ_{c=1}^{C} (n_c / n) · δ_c

where c indexes the clients, C is the total number of clients participating in federal learning, n_c is the data volume of client c, n is the sum of the data volumes of all clients participating in federal learning, δ_c is the model parameter of the adapter model of client c, and δ is the aggregated model parameter.
For example, suppose there are 3 clients participating in federal learning and the training data volumes corresponding to the 3 clients are 50, 30, and 120 respectively, i.e., n is 200; the model parameter obtained by aggregation can then be expressed as (1/4)·δ_1 + (3/20)·δ_2 + (3/5)·δ_3.
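A minimal sketch of this data-volume-weighted aggregation, reproducing the 50/30/120 example; plain dictionaries stand in for the uploaded adapter parameters, and the helper name is hypothetical.

```python
def aggregate_adapters(client_params, client_data_sizes):
    """Weighted average of adapter parameters, weights proportional to data volume."""
    total = sum(client_data_sizes)
    aggregated = {}
    for key in client_params[0]:
        aggregated[key] = sum((n / total) * params[key]
                              for params, n in zip(client_params, client_data_sizes))
    return aggregated

# Example with 3 clients holding 50, 30 and 120 samples (n = 200):
deltas = [{"w": 1.0}, {"w": 2.0}, {"w": 4.0}]
print(aggregate_adapters(deltas, [50, 30, 120]))  # -> {'w': 2.95} (= 0.25*1 + 0.15*2 + 0.6*4)
```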
In some embodiments, the local backbone model f_{θ_0} and the final adapter model f_δ are integrated to obtain the global model f_{θ_g}, where f is a machine learning model, f_{θ_0} is the machine learning model with model parameters θ_0, f_δ is the machine learning model with model parameters δ, and f_{θ_g} is the machine learning model with model parameters θ_g, with θ_g = θ_0 + δ; θ_0 is the model parameter of the local backbone model and δ is the model parameter of the final adapter model.
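The integration step θ_g = θ_0 + δ can be sketched as adding the final adapter's parameters onto the matching backbone parameters; the dictionary representation and the key-matching convention below are assumptions made for illustration.

```python
def integrate_global_model(theta_0, delta):
    """Merge the final adapter into the backbone: theta_g = theta_0 + delta (sketch).

    Both arguments are parameter dictionaries; keys absent from `delta`
    (backbone-only parameters) are copied through unchanged.
    """
    theta_g = dict(theta_0)
    for name, value in delta.items():
        theta_g[name] = theta_0[name] + value
    return theta_g
```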
In summary, in the method provided by the embodiments of the present application, model training is pushed down to the clients, and each client performs model training with the private data it stores. During training, an adapter is added instead of directly training the local backbone model, and only the adapter is updated; because the number of model parameters of the adapter is smaller than the number of model parameters of the local backbone model, only a small number of parameters need to be updated to complete training. As a result, clients with lower-end configurations can also participate in federal learning, and a high-quality federal learning model can be obtained at a lower cost.
In order to more clearly understand the federal learning method and federal prediction method according to the embodiments of the present application, the present application will be specifically described in a Speech-to-Text (S2T) scenario.
Speech-to-text here includes automatic speech recognition (Automatic Speech Recognition, ASR) and end-to-end speech translation (Speech Translation, ST), and the training data used is private speech data.
Step 1, a central server sends a cascade pre-training model and an adapter model to at least two clients participating in federal learning, wherein the number of model parameters of the adapter model is smaller than that of the pre-training model.
In some embodiments, the pre-training model is obtained by training the central server based on stored private voice data or open source voice data; or, the pre-training model is a pre-training model supporting a speech-to-text task.
And 2, the client acquires a cascade pre-training model and an adapter model, wherein the number of model parameters of the adapter model is smaller than that of the pre-training model.
And step 3, the client acquires private voice data in the client.
And 4, training the pre-training model and the adapter model by the client based on the private voice data.
In some embodiments, the client inputs private voice data into the pre-training model and the adapter model resulting in loss function values.
In some embodiments, the model parameters of the client-side fixed pre-trained model remain unchanged.
In some embodiments, the client updates model parameters of the adapter model based on the loss function value.
In some embodiments, the client updates model parameters of the ith round of adapter models based on the ith round of loss function values, resulting in a trained ith round of adapter models.
And 5, uploading the trained adapter model to a central server by the client, wherein the trained adapter model is used for updating the voice-to-text model at the central server side by adopting a federal learning mode.
In some embodiments, clients participating in federal learning upload the ith round of loss function values and the trained ith round of adapter models to the central server, the ith round of loss function values being used by the central server to determine whether to perform the (i+1) th round of training.
And 6, the central server receives the trained adapter model uploaded by at least two clients.
And 7, the central server obtains an aggregated adapter model through aggregation based on the trained adapter model uploaded by at least two clients.
In some embodiments, the central server receives the trained jth round of adapter models and the jth round of loss function values uploaded by the at least two clients.
In some embodiments, the central server obtains a jth round average loss function value based on the jth round loss function values of the at least two clients.
In some embodiments, the central server issues the aggregated adapter model as a j+1st round of adapters to at least two clients participating in federal learning if the j-th round of average loss function values are greater than the target threshold.
In some embodiments, the central server integrates the aggregated adapter model with the pre-training model as a final adapter to obtain the speech-to-text model under the condition that the average loss function value of the jth round is less than or equal to the target threshold.
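Putting the round control together, a sketch of the server-side loop for this scenario might look as follows; `server`, `clients`, `train_and_upload`, `aggregate`, and the stopping threshold are all placeholders, and the loop simply stops issuing new adapters once the average loss reaches the target threshold.

```python
def run_federated_rounds(server, clients, target_threshold, max_rounds=100):
    """Server-side loop: aggregate adapters each round, stop when average loss is low enough."""
    adapter = server.initial_adapter()                # adapter issued for the first round
    for _ in range(max_rounds):
        # Each client trains locally and uploads (adapter parameters, loss value).
        uploads = [client.train_and_upload(adapter) for client in clients]
        params = [p for p, _ in uploads]
        losses = [l for _, l in uploads]
        sizes = [client.num_samples for client in clients]
        adapter = server.aggregate(params, sizes)     # data-volume-weighted aggregation
        avg_loss = sum(losses) / len(losses)
        if avg_loss <= target_threshold:
            break                                     # use this adapter as the final adapter
    return adapter                                    # integrated into the global model afterwards
```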
In some embodiments, the client obtains the i+1st round of adapter model issued by the central server.
In some embodiments, the client inputs the private speech data into the pre-training model and the i+1st round adapter model, resulting in the i+1st round loss function value.
In some embodiments, the model parameters of the client-side fixed pre-trained model remain unchanged.
In some embodiments, the client updates model parameters of the i+1th round adapter model based on the i+1th round loss function value.
In some embodiments, the client uploads the i+1st round of loss function values and the trained i+1st round of adapter models to the central server, the i+1st round of loss function values being used by the central server to determine whether to perform the i+2nd round of training.
And 8, the client acquires a voice-to-text model which is trained according to the federal learning method.
In some embodiments, the client obtains a voice-to-text model, where the voice-to-text model is obtained by integrating a pre-training model and a final adapter model by a central server, and the final adapter model is obtained by aggregating adapter models of at least two clients by the central server in a federal learning manner.
Step 9, the client predicts the input information based on the voice-to-text model to obtain a prediction result set; and retrieving the input information based on the local knowledge base to obtain a retrieval result set.
The local knowledge base is constructed based on private voice data in the client, the predicted result set comprises at least one predicted result, and the search result set comprises at least one search result.
And step 10, determining a final output result in the prediction result set and the retrieval result set by the client.
In some embodiments, under the condition that the first output result is simultaneously present in the prediction result set and the search result set, interpolation is performed on the prediction result probability and the search result probability corresponding to the first output result, so as to obtain the interpolation probability of the first output result.
In some embodiments, in a case where the second output result is only present in the prediction result set and not present in the search result set, interpolation is performed on the prediction result probability corresponding to the second output result and the search result probability set to zero, so as to obtain an interpolation probability of the second output result.
In some embodiments, in a case that the third output result only appears in the search result set and does not appear in the prediction result set, interpolation is performed on the search result probability corresponding to the third output result and the prediction result probability set to zero, so as to obtain an interpolation probability of the third output result.
In some embodiments, all output results are sorted based on their interpolation probabilities, and the output result with the highest interpolation probability is determined as the final output result.
In some embodiments, the private voice data includes source data and target data, the target data includes n sub-target data, n being a positive integer.
In some embodiments, based on the speech-to-text model, features of the source data and the first m sub-target data are extracted as contextual representations of the (m+1) -th sub-target data, m being a positive integer and m being less than or equal to n.
In some embodiments, the context representation of the (m+1) -th sub-target data and the corresponding relationship of the (m+1) -th sub-target data are stored in a local knowledge base.
In some embodiments, the source data is a voice sequence; the target data is a text sequence; the sub-target data is a text unit, and a text unit is a character or a word.
In some embodiments, based on the speech-to-text model, the speech sequence and features of the first m text units are extracted as contextual representations of the (m+1) th text unit.
In some embodiments, the context representation of the (m+1) th text unit and the correspondence of the (m+1) th text unit are stored in a local knowledge base.
In some embodiments, features of the source data and the first m sub-target data are extracted as contextual representations of the (m+1) th sub-target data based on a speech-to-text model.
In some embodiments, based on the local knowledge base, a search is performed according to the context representation of the (m+1) -th sub-target data, and a search result set is obtained through a k-nearest neighbor model.
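The construction and querying of the local knowledge base described in steps 9 and 10 can be sketched with a brute-force in-memory store; `context_representation` is the same assumed interface as in the earlier decoding sketch, and a real implementation would typically use an approximate-nearest-neighbor index instead.

```python
import numpy as np

class LocalKnowledgeBase:
    """In-memory (context representation -> text unit) store with brute-force kNN."""
    def __init__(self):
        self.keys, self.values = [], []

    def add(self, context, unit):
        self.keys.append(np.asarray(context, dtype=np.float32))
        self.values.append(unit)

    def search(self, query, k=8):
        query = np.asarray(query, dtype=np.float32)
        dists = [float(np.linalg.norm(key - query)) for key in self.keys]
        order = np.argsort(dists)[:k]
        return [(self.values[i], dists[i]) for i in order]   # (text unit, distance) pairs

def build_knowledge_base(global_model, private_samples):
    """Store one (context representation, next unit) pair per prefix of each target."""
    kb = LocalKnowledgeBase()
    for source, target in private_samples:
        for m in range(len(target)):                          # predict the (m+1)-th unit
            context = global_model.context_representation(source, target[:m])
            kb.add(context, target[m])
    return kb
```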
Referring to fig. 22, a block diagram of a federal learning device according to an embodiment of the present application is shown. The device has the function of realizing the federal learning method example, and the function can be realized by hardware or by executing corresponding software by hardware. The device may be the client described above or may be provided in the client. As shown in fig. 22, the apparatus 1000 may include: the first acquisition module 1010, the second acquisition module 1020, the training module 1030, and the uploading module 1040.
A first obtaining module 1010, configured to obtain a cascaded local backbone model and an adapter model, where the number of model parameters of the adapter model is smaller than the number of model parameters of the local backbone model;
a second obtaining module 1020, configured to obtain private data in the client;
a training module 1030 for training the local backbone model and the adapter model based on the private data;
and the uploading module 1040 is configured to upload the trained adapter model to a central server, where the trained adapter model is used to update the global model on the central server side by adopting the federal learning mode.
In some embodiments, the training module 1030 includes a loss calculation sub-module, a fixing sub-module, and a parameter update sub-module.
The loss calculation sub-module is used for inputting the private data into the local backbone model and the adapter model to obtain a loss function value;
the fixing sub-module is used for fixing the model parameters of the local backbone model to be unchanged;
a parameter updating sub-module for updating model parameters of the adapter model based on the loss function values.
In some embodiments, the parameter update sub-module includes an adapter update unit; the upload module 1040 includes an upload unit.
The adapter updating unit is used for updating the model parameters of the ith round of adapter model based on the ith round of loss function value to obtain a trained ith round of adapter model;
and the uploading unit is used for uploading the ith round of loss function value and the trained ith round of adapter model to the central server, wherein the ith round of loss function value is used for the central server to determine whether to execute the (i+1) th round of training.
In some embodiments, the apparatus 1000 further includes a first acquisition unit, a second acquisition unit, a fixing unit, a parameter updating unit, and a model uploading unit.
The first acquisition unit is used for acquiring an i+1st round of adapter model issued by the central server;
The second acquisition unit is used for inputting the private data into the local backbone model and the i+1th round adapter model to obtain an i+1th round loss function value;
the fixing unit is used for fixing the model parameters of the local backbone model to be unchanged;
a parameter updating unit configured to update model parameters of the i+1th round adapter model based on the i+1th round loss function value;
and the model uploading unit is used for uploading the i+1st round of loss function values and the trained i+1st round of adapter models to the central server, wherein the i+1st round of loss function values are used for the central server to determine whether to execute the i+2nd round of training.
In some embodiments, the apparatus 1000 further comprises a global model acquisition module.
The global model acquisition module is used for acquiring the global model, wherein the global model is obtained by integrating the local backbone model and a final adapter model by the central server, and the final adapter model is obtained by aggregating adapter models of at least two clients by the central server in a federal learning mode.
In some embodiments, the apparatus 1000 further comprises a first prediction module, a first retrieval module, and a first determination module.
The first prediction module is used for predicting the input information based on the global model to obtain a prediction result set;
the first retrieval module is used for retrieving the input information based on a local knowledge base to obtain a retrieval result set;
and the first determining module is used for determining a final output result in the prediction result set and the retrieval result set.
In some embodiments, the first determination module further comprises a first interpolation sub-module, a second interpolation sub-module, a third interpolation sub-module, and a first determination sub-module.
The first interpolation sub-module is used for interpolating the predicted result probability and the search result probability corresponding to the first output result to obtain the interpolation probability of the first output result under the condition that the first output result is simultaneously present in the predicted result set and the search result set;
the second interpolation sub-module is used for interpolating the predicted result probability corresponding to the second output result and the retrieval result probability set to be zero under the condition that the second output result only appears in the predicted result set and does not appear in the retrieval result set, so as to obtain the interpolation probability of the second output result;
The third interpolation sub-module is used for interpolating the retrieval result probability corresponding to the third output result and the prediction result probability set to be zero under the condition that the third output result only appears in the retrieval result set and does not appear in the prediction result set, so as to obtain the interpolation probability of the third output result;
and the first determining submodule is used for sorting based on interpolation probabilities of all output results and determining the output result with the highest interpolation probability as the final output result.
In some embodiments, the apparatus 1000 further comprises a first extraction module and a first storage module.
The first extraction module is used for extracting the characteristics of the source data and the first m sub-target data based on the global model to serve as a context representation of the (m+1) th sub-target data, wherein m is a positive integer and is less than or equal to n;
the first storage module is used for storing the corresponding relation between the context representation of the (m+1) th sub-target data and the (m+1) th sub-target data into the local knowledge base.
In some embodiments, the first extraction module is further configured to extract features of the speech sequence and the first m text units as a context representation of the (m+1) th text unit based on the global model; the first storage module is further configured to store a corresponding relationship between the context representation of the (m+1) th text unit and the (m+1) th text unit in the local knowledge base.
In some embodiments, the first retrieval module further comprises a first extraction sub-module and a first retrieval sub-module.
A first extraction sub-module for extracting features of the source data and the first m sub-target data as a contextual representation of the (m+1) -th sub-target data based on the global model;
and the first retrieval sub-module is used for retrieving according to the context representation of the (m+1) th sub-target data based on the local knowledge base, and obtaining the retrieval result set through a k neighbor model.
Referring to fig. 23, a block diagram of a federal learning device according to an embodiment of the present application is shown. The device has the function of realizing the federal learning method example, and the function can be realized by hardware or by executing corresponding software by hardware. The device may be the central server described above or may be provided in the central server. As shown in fig. 23, the apparatus 1100 may include: a first transmitting module 1110, a first receiving module 1120, and an aggregating module 1130.
A first sending module 1110, configured to send, to at least two clients participating in federation learning, a cascaded local backbone model and an adapter model, where the number of model parameters of the adapter model is smaller than the number of model parameters of the local backbone model;
A first receiving module 1120, configured to receive the trained adapter model uploaded by the at least two clients;
the aggregation module 1130 is configured to aggregate the trained adapter models uploaded by the at least two clients to obtain an aggregated adapter model.
In some embodiments, the first receiving module 1120 includes a receiving sub-module, a computing sub-module, a first determining sub-module, and a second determining sub-module.
A receiving sub-module, configured to receive the trained jth round of adapter model and the jth round of loss function values uploaded by the at least two clients;
a calculation sub-module, configured to obtain a jth round average loss function value based on the jth round loss function values of the at least two clients;
a first judging sub-module, configured to issue, as a j+1th round adapter, the aggregated adapter model to the at least two clients participating in federal learning, where the j-th round average loss function value is greater than a target threshold;
and the second judging sub-module is used for integrating the aggregated adapter model with the local backbone model to obtain a global model as a final adapter under the condition that the average loss function value of the jth round is smaller than or equal to the target threshold.
Referring to FIG. 24, a block diagram of a federal prediction apparatus according to one embodiment of the present application is shown. The device has the function of realizing the federal prediction method example, and the function can be realized by hardware or by executing corresponding software by hardware. The device may be the client described above or may be provided in the client. As shown in fig. 24, the apparatus 1200 may include: a third acquisition module 1210, a second prediction module 1220, a second retrieval module 1230, and a second determination module 1240.
A third obtaining module 1210, configured to obtain a global model, where the global model is obtained by training according to a federal learning method;
a second prediction module 1220, configured to predict the input information based on the global model, to obtain a prediction result set;
a second retrieving module 1230, configured to retrieve the input information based on a local knowledge base, to obtain a retrieval result set;
a second determining module 1240, configured to determine a final output result from the prediction result set and the search result set.
In some embodiments, the second determination module 1240 further comprises a fourth interpolation sub-module, a fifth interpolation sub-module, a sixth interpolation sub-module, and a second determination sub-module.
The fourth interpolation sub-module is used for interpolating the predicted result probability and the search result probability corresponding to the first output result to obtain the interpolation probability of the first output result under the condition that the first output result is simultaneously in the predicted result set and the search result set;
a fifth interpolation sub-module, configured to interpolate, when a second output result only appears in the prediction result set and does not appear in the search result set, a prediction result probability corresponding to the second output result and a search result probability set to zero, so as to obtain an interpolation probability of the second output result;
a sixth interpolation sub-module, configured to interpolate, when a third output result only appears in the search result set and does not appear in the prediction result set, a search result probability corresponding to the third output result and a prediction result probability set to zero, so as to obtain an interpolation probability of the third output result;
and the second determining submodule is used for sorting based on interpolation probabilities of all output results and determining the output result with the highest interpolation probability as the final output result.
In some embodiments, the apparatus 1200 further comprises a second extraction module, a second storage module.
The second extraction module is used for extracting the characteristics of the source data and the first m sub-target data based on the global model to serve as a context representation of the (m+1) th sub-target data, wherein m is a positive integer and is less than or equal to n;
and the second storage module is used for storing the corresponding relation between the context representation of the (m+1) th sub-target data and the (m+1) th sub-target data into the local knowledge base.
In some embodiments, the second extraction module is further for extracting features of the speech sequence and the first m text units as a context representation of the (m+1) th text unit based on the global model; the second storage module is further configured to store a corresponding relationship between the context representation of the (m+1) th text unit and the (m+1) th text unit in the local knowledge base.
In some embodiments, the second retrieval module further comprises a second extraction sub-module and a second retrieval sub-module.
A second extraction sub-module for extracting features of the source data and the first m sub-target data as a context representation of the (m+1) -th sub-target data based on the global model;
and the second retrieval sub-module is used for retrieving according to the context representation of the (m+1) th sub-target data based on the local knowledge base, and obtaining the retrieval result set through a k neighbor model.
It should be noted that: in the device provided in the above embodiment, when implementing the functions thereof, only the division of the above functional modules is used as an example, in practical application, the above functional allocation may be implemented by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Fig. 25 shows a block diagram of a computer device according to an exemplary embodiment of the present application.
The computer apparatus 1300 includes a central processing unit (Central Processing Unit, CPU) 1301, a system Memory 1304 including a random access Memory (Random Access Memory, RAM) 1302 and a Read-Only Memory (ROM) 1303, and a system bus 1305 connecting the system Memory 1304 and the central processing unit 1301. The computer device 1300 also includes a basic Input/Output system (I/O) 1306 to facilitate the transfer of information between various devices within the computer device, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
The basic input/output system 1306 includes a display 1308 for displaying information, and an input device 1309, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1308 and the input device 1309 are connected to the central processing unit 1301 through an input output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include an input/output controller 1310 for receiving and processing input from a keyboard, mouse, or electronic stylus, among a plurality of other devices. Similarly, the input output controller 1310 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable storage media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable storage medium (not shown) such as a hard disk or a compact disk-Only (CD-ROM) drive.
The computer-readable storage medium may include computer storage media and communication media without loss of generality. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technology, CD-ROM, digital versatile discs (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to those described above. The system memory 1304 and the mass storage device 1307 described above may be referred to collectively as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1301, the one or more programs containing instructions for implementing the above-described method embodiments, the central processing unit 1301 executing the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1300 may also operate through a remote computer device connected via a network, such as the Internet. That is, the computer device 1300 may be connected to the network 1312 via a network interface unit 1311 coupled to the system bus 1305, or the network interface unit 1311 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further stores one or more programs; the one or more programs include instructions for performing the steps, executed by the terminal device, of the methods provided by the embodiments of the present application.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, having stored thereon a computer program which, when executed by a processor, implements the federal learning method and/or federal prediction method described above.
In an exemplary embodiment, a computer program product is also provided, which, when executed by a processor, is adapted to implement the federal learning method and/or federal prediction method described above.
It should be understood that references herein to "a plurality" mean two or more. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. In addition, the step numbers described herein merely illustrate one possible execution order of the steps; in some other embodiments, the steps may be executed out of the numbered order, for example two differently numbered steps may be executed simultaneously, or two differently numbered steps may be executed in an order opposite to that shown, which is not limited by the embodiments of the present application.
The foregoing description of the preferred embodiments is merely exemplary and is not intended to limit the present application; any modifications, equivalent replacements, improvements, and the like made within the spirit and principle of the present application shall be included in the scope of protection of the present application.

Claims (15)

1. A federal learning method, the method performed by a client participating in federal learning, the method comprising:
Acquiring a cascade local backbone model and an adapter model, wherein the number of model parameters of the adapter model is smaller than that of the local backbone model;
private data in the client is obtained;
training the local backbone model and the adapter model based on the private data;
uploading the trained adapter model to a central server, wherein the trained adapter model is used for updating the global model at the central server side by adopting the federal learning mode.
2. The method of claim 1, wherein the training the local backbone model and the adapter model based on the private data comprises:
inputting the private data into the local backbone model and the adapter model to obtain a loss function value;
fixing the model parameters of the local backbone model to be unchanged;
model parameters of the adapter model are updated based on the loss function values.
3. The method of claim 2, wherein the loss function value comprises an i-th round of loss function values, the adapter model comprises an i-th round of adapter models, i being a positive integer;
The updating model parameters of the adapter model based on the loss function values includes:
updating model parameters of the ith round of adapter model based on the ith round of loss function value to obtain a trained ith round of adapter model;
the uploading the trained adapter model to the central server comprises the following steps:
uploading the ith round of loss function values and the trained ith round of adapter models to the central server, wherein the ith round of loss function values are used for the central server to determine whether to execute the ith+1st round of training.
4. A method according to any one of claims 1 to 3, wherein the method further comprises:
acquiring the global model, wherein the global model is obtained by integrating the local backbone model and a final adapter model by the central server, and the final adapter model is obtained by aggregating the adapter models of at least two clients by the central server in a federal learning mode.
5. The method according to claim 4, wherein the method further comprises:
predicting the input information based on the global model to obtain a prediction result set; searching the input information based on a local knowledge base to obtain a search result set;
Determining a final output result in the prediction result set and the retrieval result set;
the local knowledge base is constructed based on private data in the client, the prediction result set comprises at least one prediction result, and the retrieval result set comprises at least one retrieval result.
6. A federal learning method, the method performed by a central server, the method comprising:
transmitting a cascaded local backbone model and an adapter model to at least two clients participating in federal learning, wherein the number of model parameters of the adapter model is smaller than that of the local backbone model;
receiving the trained adapter model uploaded by the at least two clients;
and aggregating based on the trained adapter models uploaded by at least two clients to obtain an aggregated adapter model.
7. The method of claim 6, wherein the adapter model comprises a j-th round adapter model, j being a positive integer;
the receiving the trained adapter model uploaded by the at least two clients comprises:
receiving the trained jth round of adapter model and the jth round of loss function values uploaded by the at least two clients;
Obtaining a jth round average loss function value based on the jth round loss function values of the at least two clients;
under the condition that the j-th round average loss function value is larger than a target threshold, the aggregated adapter model is used as a j+1-th round adapter to be issued to the at least two clients participating in federal learning;
and under the condition that the j-th round average loss function value is smaller than or equal to the target threshold value, integrating the aggregated adapter model as a final adapter with a local backbone model to obtain a global model.
8. A federal prediction method, the method performed by a client, the method comprising:
acquiring a global model, wherein the global model is obtained by training according to a federal learning method;
predicting the input information based on the global model to obtain a prediction result set; searching the input information based on a local knowledge base to obtain a search result set;
determining a final output result in the prediction result set and the retrieval result set;
the local knowledge base is constructed based on private data in the client, the prediction result set comprises at least one prediction result, and the retrieval result set comprises at least one retrieval result.
9. The method of claim 8, wherein the private data comprises source data and target data, the target data comprising n sub-target data, n being a positive integer, the method further comprising:
extracting features of the source data and the first m sub-target data based on the global model as context representation of the (m+1) th sub-target data, wherein m is a positive integer and is less than or equal to n;
and storing the corresponding relation between the context representation of the (m+1) th sub-target data and the (m+1) th sub-target data into the local knowledge base.
10. A federal learning apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a cascade local backbone model and an adapter model, and the number of model parameters of the adapter model is smaller than that of the local backbone model;
the second acquisition module is used for acquiring private data in the client;
the training module is used for training the local backbone model and the adapter model based on the private data;
and the uploading module is used for uploading the trained adapter model to a central server, and the trained adapter model is used for updating the global model at the central server side by adopting the federal learning mode.
11. A federal learning apparatus, the apparatus comprising:
the first sending module is used for sending the cascaded local backbone model and the adapter model to at least two clients participating in federal learning, and the number of model parameters of the adapter model is smaller than that of the local backbone model;
the first receiving module is used for receiving the trained adapter models uploaded by the at least two clients;
and the aggregation module is used for aggregating the trained adapter models uploaded by the at least two clients to obtain an aggregated adapter model.
12. A federal prediction apparatus, the apparatus comprising:
the third acquisition module is used for acquiring a global model, wherein the global model is obtained by training according to a federal learning method;
the second prediction module is used for predicting the input information based on the global model to obtain a prediction result set;
the second retrieval module is used for retrieving the input information based on a local knowledge base to obtain a retrieval result set;
the second determining module is used for determining a final output result in the prediction result set and the retrieval result set;
The local knowledge base is constructed based on private data in the client, the prediction result set comprises at least one prediction result, and the retrieval result set comprises at least one retrieval result.
13. A computer device, the computer device comprising: a processor and a memory, wherein at least one section of program is stored in the memory; the processor is configured to execute the at least one program in the memory to implement the federal learning method according to any one of claims 1 to 5, or the federal learning method according to claim 6 or 7, or the federal prediction method according to claim 8 or 9.
14. A computer readable storage medium having stored therein executable instructions that are loaded and executed by a processor to implement the federal learning method of any one of claims 1 to 5, or the federal learning method of claim 6 or 7, or the federal prediction method of claim 8 or 9.
15. A computer program product comprising computer instructions stored in a computer readable storage medium, the computer instructions being read from the computer readable storage medium and executed by a processor to implement the federal learning method according to any one of claims 1 to 5, or the federal learning method according to claim 6 or 7, or the federal prediction method according to claim 8 or 9.