CN114925119A - Federated learning method, apparatus, system, device, medium, and program product - Google Patents

Federated learning method, apparatus, system, device, medium, and program product

Info

Publication number
CN114925119A
Authority
CN
China
Prior art keywords
data
intersection
participants
participant
learning task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210665673.3A
Other languages
Chinese (zh)
Inventor
高晓龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhixing Technology Co Ltd
Original Assignee
Shenzhen Zhixing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhixing Technology Co Ltd filed Critical Shenzhen Zhixing Technology Co Ltd
Priority to CN202210665673.3A
Publication of CN114925119A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2471 Distributed queries
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides a federated learning method, apparatus, system, electronic device, non-transitory computer-readable storage medium, and computer program product, which relate to the field of computer technologies, in particular to the fields of federated learning and privacy computing, and may be used for managing private data. The implementation scheme is as follows: obtaining a first ID set of a first data set of the participant from a data management system; obtaining an ID intersection of the plurality of participants, where the ID intersection is obtained based on the first ID set and a second ID set of a second data set of other participants among the plurality of participants, and the ID type of the second ID set is the same as that of the first ID set; and obtaining feature data associated with the federated learning task from the first data set based on the ID intersection, so as to perform subsequent subtasks of the federated learning task with the other participants based on the respective feature data.

Description

Federated learning method, apparatus, system, device, medium, and program product
Technical Field
The present disclosure relates to the field of computer technologies, in particular to the fields of federated learning and privacy computing, which can be used for managing private data; it relates in particular to a federated learning method, an apparatus, a system, an electronic device, a non-transitory computer-readable storage medium, and a computer program product.
Background
Federated machine learning, also known as federated learning, is a machine learning framework that can effectively help multiple participants use data and build machine learning models while satisfying the requirements of user privacy protection and data security. As a distributed machine learning paradigm, federated learning can effectively solve the problem of data silos: participants can complete a joint learning task without sharing their data, technically breaking down the data silos and enabling AI collaboration.
Federated learning defines a machine learning framework under which different data owners can collaborate without exchanging data (especially private data) by designing a virtual model. The virtual model is the optimal model that would be built if all parties aggregated their data together; the goal of federated learning is for this virtual model to approach, as closely as possible, the model obtained through traditional modeling, that is, by gathering the data of multiple data owners in one place for training. Under the federated mechanism, every participant (i.e., data owner) has the same identity and status, and a shared-data strategy can be established. Because the data never leaves its owner, user privacy is not leaked and data privacy regulations are not violated. Note that the federated learning task is not limited to federated modeling; it may also be, for example, a federated query task or a federated statistical task.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be assumed to have been recognized in any prior art.
Disclosure of Invention
The present disclosure provides a method, apparatus, system, electronic device, non-transitory computer readable storage medium, and computer program product for federated learning.
According to an aspect of the present disclosure, there is provided a federated learning method applied to any one of a plurality of participants who execute the same federated learning task, wherein the method includes: obtaining a first set of IDs for a first set of data for the participant from a data management system; obtaining an ID intersection of the plurality of participants, wherein the ID intersection is obtained based on the first ID set and a second ID set of a second data set of other participants in the plurality of participants, and the ID type of the second ID set is the same as that of the first ID set; and obtaining feature data associated with the federated learning task from the first data set based on the ID intersection to perform subsequent subtasks of the federated learning task with the other participants based on the respective feature data.
According to another aspect of the present disclosure, a federated learning method is provided, which is applied to a collaborator that is communicatively connected to each of a plurality of participants executing the same federated learning task, wherein the method includes: obtaining a plurality of ID sets from the plurality of participants, wherein the ID types of the plurality of ID sets are the same; calculating the intersection of the multiple ID sets to obtain an ID intersection; and sending the ID intersection to each of the multiple participants, so that the multiple participants respectively acquire feature data associated with the federated learning task based on the ID intersection, and further execute subsequent subtasks of the federated learning task based on the respective feature data.
According to another aspect of the present disclosure, there is provided a data management method applied to a data management system communicatively connected to any one of a plurality of participants performing the same federated learning task, wherein the method includes: obtaining a first data set of the participant; in response to receiving an ID acquisition request of the participant, sending a first ID set of the first data set to the participant; acquiring an ID intersection of the plurality of participants; and, in response to receiving a feature acquisition request of the participant, sending feature data associated with the federated learning task in the first data set to the participant based on the ID intersection, so that the plurality of participants perform subsequent subtasks of the federated learning task based on the respective feature data.
According to another aspect of the present disclosure, there is provided a federated learning apparatus applied to any one of a plurality of participants who perform the same federated learning task, wherein the apparatus includes: a first ID acquisition module configured to acquire a first ID set of a first data set of the participant from a data management system; an intersection acquisition module configured to acquire an ID intersection of the plurality of participants, the ID intersection being obtained based on the first ID set and a second ID set of a second data set of other participants of the plurality of participants, wherein an ID type of the second ID set is the same as an ID type of the first ID set; and a feature acquisition module configured to acquire feature data associated with the federated learning task from the first data set based on the ID intersection, so as to perform subsequent subtasks of the federated learning task with the other participants based on the respective feature data.
According to the above federated learning apparatus, the intersection acquisition module is configured to calculate the intersection of the first ID set and the second ID set in cooperation with the other participants to obtain the ID intersection.
According to the above federated learning apparatus, where the plurality of participants are each communicatively connected to a collaborator, the intersection acquisition module includes: a sending submodule configured to send the first ID set to the collaborator; and an acquisition submodule configured to acquire the ID intersection from the collaborator, where the ID intersection is the intersection of the first ID set and a second ID set calculated by the collaborator, the second ID set being acquired by the collaborator from the other participants.
According to the above federated learning apparatus, the data management system has a built-in relational database, and the first data set is stored in the relational database in the form of relational data.
According to the above federated learning apparatus, the first ID set corresponds to a column of data in the relational database corresponding to a preset ID type.
According to the above federated learning apparatus, the feature acquisition module includes: a feature selection submodule configured to acquire, from the relational database, the full feature data corresponding to the ID intersection; and a feature joining submodule configured to join the full feature data to generate the feature data associated with the federated learning task.
According to the above federated learning apparatus, the feature acquisition module includes: a feature selection submodule configured to acquire, from the relational database, feature data of a plurality of preset attribute columns corresponding to the ID intersection; and a feature joining submodule configured to join the feature data of the preset attribute columns to generate the feature data associated with the federated learning task.
According to the above federated learning apparatus, the subsequent subtasks of the federated learning task include at least one of a federated modeling task, a federated prediction task, a federated query task, and a federated statistical task.
According to another aspect of the present disclosure, there is provided a federated learning apparatus applied to a collaborator communicatively connected to each of a plurality of participants performing the same federated learning task, wherein the apparatus includes: an ID acquisition module configured to acquire a plurality of ID sets from the plurality of participants, wherein the ID types of the plurality of ID sets are the same; an intersection module configured to calculate the intersection of the ID sets to obtain an ID intersection; and a sending module configured to send the ID intersection to each of the plurality of participants, so that the plurality of participants respectively obtain feature data associated with the federated learning task based on the ID intersection, and further execute subsequent subtasks of the federated learning task based on the respective feature data.
According to another aspect of the present disclosure, there is provided a data management system communicatively connected to any one of a plurality of participants performing the same federated learning task, wherein the data management system includes: a data acquisition module configured to acquire a first data set of the participant; a sending module configured to send a first ID set of the first data set to the participant in response to receiving an ID acquisition request of the participant; and an ID obtaining module configured to obtain an ID intersection of the plurality of participants, wherein the sending module is further configured to send, in response to receiving a feature acquisition request of the participant, feature data associated with the federated learning task in the first data set to the participant based on the ID intersection, so that the plurality of participants perform subsequent subtasks of the federated learning task based on the respective feature data.
According to another aspect of the present disclosure, there is provided a federated learning system, including: the above federated learning apparatus applied to any one of a plurality of participants executing the same federated learning task; and the data management system described above.
According to another aspect of the present disclosure, there is provided an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the methods described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to any one of the above.
According to another aspect of the disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements a method according to any of the above.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of example only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a flow chart of a federated learning method in accordance with an exemplary embodiment of the present disclosure;
FIGS. 2A-2B illustrate a schematic diagram of a federated learning method in accordance with an exemplary embodiment of the present disclosure;
FIGS. 3A-3B illustrate schematic connections between participants in a federated learning system that does not contain collaborators, according to an exemplary embodiment of the present disclosure;
FIG. 3C illustrates a schematic diagram of a connection of a participant to a collaborator in a federated learning system that contains collaborators in accordance with an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a flow chart of a federated learning method in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 shows a flow diagram of a data management method according to an example embodiment of the present disclosure;
FIG. 6 illustrates a block diagram of a federated learning device in accordance with an exemplary embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of a federated learning device in accordance with an exemplary embodiment of the present disclosure;
FIG. 8 shows a block diagram of a data management system according to an example embodiment of the present disclosure;
fig. 9 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, it will be recognized by those of ordinary skill in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
Federated learning is a distributed machine learning framework with privacy protection and secure encryption technology; it aims to let distributed participants collaborate on a federated learning task without disclosing their private data to the other participants. The federated learning task may be, for example, the training of a machine learning model, or, for example, a federated query task or a federated statistical task.
Execution of a federated learning task involves pre-processing steps such as data import and data intersection. In the data import step, the full feature data, that is, all feature data in the data set, is imported into the federated learning system. In the data intersection step, the imported full feature data is intersected to obtain intersection feature data. The intersection feature data may then be used by federated learning tasks such as the training of machine learning models, federated query tasks, and federated statistical tasks. The inventor found that, in application scenarios with large data volumes, using the full feature data for pre-processing makes data loading slow during import and makes the computation in the intersection step heavy. The pre-processing of the federated learning task therefore consumes considerable time and computing resources, which in turn affects the overall efficiency of the federated learning task.
The present disclosure provides a federated learning method, apparatus, system, electronic device, non-transitory computer-readable storage medium, and computer program product that divide the pre-processing, which previously used the full feature data, into an intersection step that uses only ID sets and a separate feature acquisition step. In the intersection step, only the ID sets of the data sets are used to obtain the ID intersection. Feature data associated with the federated learning task is then acquired based on the ID intersection obtained in the intersection step, in order to execute the subsequent subtasks of the federated learning task. This reduces the data dimensionality processed during data import, intersection computation, and the other stages of the federated learning task, improving its overall performance and efficiency.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 shows a flow chart of a federated learning method 100 according to an exemplary embodiment of the present disclosure. The federated learning method 100 applies to any one of a plurality of participants performing the same federated learning task. As shown in fig. 1, the federated learning method 100 includes: step S102, acquiring a first ID set of a first data set of the participant from a data management system; step S104, obtaining an ID intersection of the plurality of participants, where the ID intersection is obtained based on the first ID set and a second ID set of a second data set of other participants among the plurality of participants, and the ID type of the second ID set is the same as that of the first ID set; and step S106, acquiring feature data associated with the federated learning task from the first data set based on the ID intersection, so as to execute subsequent subtasks of the federated learning task with the other participants based on the respective feature data.
In this way, the pre-processing of the federated learning task is divided into two independent steps: an intersection step and a feature acquisition step. When acquiring the ID set from the data management system, a participant only needs the ID set of its own data set, not the full feature data of that data set. Likewise, computing the ID intersection of the multiple participants requires only the ID data, not the full feature data. Because an ID set has a much lower data dimensionality than the full feature data, using it during pre-processing reduces the time and computing resources required for data transmission and ID intersection computation, improving the overall efficiency of executing the federated learning task.
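As a minimal illustration of this two-step pre-processing, the following Python sketch outlines the participant-side flow of steps S102-S106. All object and method names here (data_mgmt.get_ids, peer.get_ids, data_mgmt.get_features) are hypothetical placeholders, not interfaces defined by the disclosure, and the ID exchange is shown in plaintext for brevity.

```python
def run_pre_processing(data_mgmt, peers, dataset):
    """Participant-side pre-processing, split into an intersection step
    and a feature acquisition step (hypothetical interfaces)."""
    # Step S102: fetch only the ID column of this participant's data set.
    id_intersection = set(data_mgmt.get_ids(dataset))

    # Step S104: intersect with the ID sets of the other participants
    # (exchanged as ciphertext in practice; plaintext here for brevity).
    for peer in peers:
        id_intersection &= set(peer.get_ids())

    # Step S106: pull only the feature rows whose IDs fall in the
    # intersection, instead of importing the full feature data.
    return data_mgmt.get_features(dataset, ids=id_intersection)
```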
The ID in the embodiments of the present disclosure refers to an identifier that can uniquely mark an object; it may be, for example, a mobile phone number, an identity card number, or a user account, which is not limited here and may be set according to the actual application scenario. The ID set refers to the set of all IDs included in a participant's data set.
In some embodiments, the data management system has one or more built-in databases, such as a document database, a graph database, or a key-value database, to store the data sets of the participants.
In some embodiments, the data management system has a relational database built-in, wherein the first data set is stored in the relational database in the form of relational data. Further, the relational database may include a plurality of relational data tables, each of which may store data in a plurality of rows and columns.
By storing the data set in the form of relational data, operations such as retrieval, extraction, and joining of particular rows and columns of the data set can be implemented. In this embodiment, the relational database built into the data management system makes it possible to export only the ID data, or only the desired feature data, from a participant's data set.
In some embodiments, the first ID set corresponds to a column of data in the relational database corresponding to a preset ID type. This column may be a single column of ID data representing the user identity in the relational data. The content of the single ID column has multiple choices; for example, it may be user ID data such as a mobile phone number or an identity card number from personal information. A single column of data of the appropriate ID type can be exported as the ID data according to actual needs. In a data management system with a built-in relational database, the single ID column can be retrieved through an SQL query and then exported.
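As an illustration only (the disclosure does not prescribe a specific database engine or schema), the following sketch uses Python's built-in sqlite3 module as a stand-in for the relational database; the user_data table and its phone ID column are hypothetical examples.

```python
import sqlite3

# sqlite3 stands in for the relational database built into the data
# management system; a deployment might use MySQL, Oracle or Hive instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_data (phone TEXT PRIMARY KEY, age INTEGER, spend REAL)")
conn.executemany("INSERT INTO user_data VALUES (?, ?, ?)",
                 [("13800000001", 31, 12.5), ("13800000002", 45, 80.0)])

# Export only the single ID column rather than the full feature table.
id_set = {row[0] for row in conn.execute("SELECT phone FROM user_data")}
print(sorted(id_set))  # ['13800000001', '13800000002']
```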
By using only a single column of ID data, the data transmission speed and intersection computation speed during pre-processing can be maximized. Compared with a pre-processing flow that uses the full feature data, single-column ID data reduces the time and computing resources required for pre-processing without affecting the quality of the pre-processing result, thereby improving the efficiency of the overall federated learning task.
According to the solutions of the embodiments of the present disclosure, when a federated learning task is executed, the ID sets of the data sets of the multiple participants are obtained first; the ID intersection of the multiple participants is then obtained based on those ID sets; feature data associated with the federated learning task is acquired from the data management system based on the ID intersection; and the subsequent subtasks of the federated learning task are executed based on the feature data. The acquisition of each participant's ID set according to various embodiments has been described above; how the ID intersection of multiple participants performing the same federated learning task is obtained based on those ID sets is described in detail below.
According to some embodiments, in step S104, obtaining the ID intersection of the plurality of participants may be achieved through encrypted communication among the participants. In this case, step S104 may include: computing the intersection of the first ID set and the second ID set in collaboration with the other participants to obtain the ID intersection. Sharing data through encrypted communication among the participants to execute the federated learning task ensures data security.
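The disclosure does not fix a particular protocol for this cooperative computation. As a deliberately simplified sketch (a salted-hash exchange, which is weaker than a full cryptographic private set intersection protocol), two participants might compare blinded IDs as follows; the salt and the sample IDs are hypothetical:

```python
import hashlib

SALT = "shared-secret"  # assumed to be agreed between the participants out of band

def blind(ids):
    """Map each plaintext ID to a salted digest so raw IDs are never exchanged."""
    return {hashlib.sha256((SALT + i).encode()).hexdigest(): i for i in ids}

local = blind({"13800000001", "13800000002"})                # this participant
remote_digests = set(blind({"13800000002", "13800000003"}))  # digests received from the peer

id_intersection = {plain for digest, plain in local.items() if digest in remote_digests}
print(id_intersection)  # {'13800000002'}
```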
Fig. 2A shows a schematic diagram of a federal learning method in accordance with an exemplary embodiment of the present disclosure. Wherein a participant 201 is communicatively coupled to a data management system 202 and other participants 203. As shown in fig. 2A, the method includes:
a participant 201 obtains a first set of IDs 211 for a first set of data for the participant 201 from the data management system 202;
the participant 201 acquires a second ID set 212 from the other participants 203, where the second ID set 212 may be an ID set of a data set of the other participants 203 themselves, or an ID intersection of a plurality of participants that is calculated by the other participants 203 and does not include the participant 201;
participant 201 calculates the intersection of first ID set 211 and second ID set 212 to obtain an ID intersection 213 of a plurality of participants including participant 201;
in the case where the ID intersection 213 is the ID intersection of all participants performing the same federated learning task, it is determined that the step of obtaining the ID intersection has been completed. Participant 201 obtains feature data 214 associated with the federated learning task from the first data set from data management system 202 based on ID intersection 213 to perform subsequent subtasks of the federated learning task with other participants based on the respective feature data.
In the above process, it can be appreciated that a plurality of other parties may be communicatively coupled to party 201. Participant 201 may obtain multiple ID sets from multiple other participants. When there are multiple other parties, computing the ID intersection includes computing a common ID intersection between the first ID set 211 and all ID sets obtained from the multiple other parties.
Fig. 2A corresponds to a federated learning system that includes participants but no collaborator. In the step of obtaining the ID intersection, a participant thus only needs to obtain the ID set of its corresponding data set from the data management system, without any other feature data. Computing the ID intersection likewise requires only the ID sets, not other feature data. This reduces the time and computing resources required for data transmission and ID intersection computation, improving the overall efficiency of executing the federated learning task.
According to further embodiments, the multiple participants may each be communicatively connected to a collaborator. In step S104, obtaining the ID intersection of the plurality of participants may be completed through communication between the participants and the collaborator. In this case, step S104 may include: sending the first ID set to the collaborator; and acquiring the ID intersection from the collaborator, where the ID intersection is the intersection of the first ID set and a second ID set calculated by the collaborator, the second ID set being acquired by the collaborator from the other participants. Having the participants send their data to a commonly trusted collaborator to complete the ID intersection improves the efficiency of data sharing and hence the overall efficiency of executing the federated learning task. In addition, for data security, the data sets transmitted and received (acquired) above are all in ciphertext form; this is not repeated below.
Fig. 2B shows a schematic diagram of a federal learning method in accordance with an exemplary embodiment of the present disclosure. Wherein the participant 201 is communicatively coupled to the data management system 202 and the collaborator 204. The collaborator 204 is also communicatively connected to other participants 203. As shown in fig. 2B, the method includes:
a participant 201 obtains a first set of IDs 211 for a first set of data for the participant 201 from the data management system 202;
participant 201 sends a first set of IDs 211 to collaborator 204;
the other participant 203 sends a second set of IDs 212 to the collaborator 204;
the collaborator 204 calculates the intersection of the first ID set 211 and the second ID set 212 to obtain an ID intersection 213 of the multiple participants, including participants 201 and 203;
the participant 201 obtains the calculated ID intersection 213 from the collaborator 204;
participant 201 obtains feature data 214 associated with the federated learning task from the first data set from data management system 202 based on ID intersection 213 to perform subsequent subtasks of the federated learning task with other participants based on the respective feature data.
In the above process, it can be appreciated that a plurality of other parties may be communicatively connected to the collaborator 204 and send the set of IDs to the collaborator 204. When there are multiple other parties, computing the ID intersection includes computing a common ID intersection between all ID sets obtained from the party and the multiple other parties.
Fig. 2B corresponds to a federated learning system that includes participants and collaborators. Similar to the method shown in fig. 2A, the method shown in fig. 2B can reduce the time and the computing resources required by the processes of data transmission and ID intersection calculation, and improve the overall efficiency of performing the federal learning task.
As introduced above, the exemplary federated learning methods illustrated in figs. 2A-2B may involve a plurality of other participants. Interconnection schemes for a federated learning system including multiple participants are described in detail below with reference to figs. 3A to 3C.
Fig. 3A-3B illustrate schematic connections between participants in a federated learning system that does not include collaborators, according to an exemplary embodiment of the present disclosure. The federated learning system includes 4 participants (participant 301, participant 302, participant 303, participant 304).
As shown in fig. 3A, in some embodiments, the participants may be interconnected in a ring network scheme. In a ring network, each participant is communicatively connected to only one "neighbor" participant, forming a ring. Each participant obtains an ID set from only one other participant and further computes the ID intersection.
In a federated learning method such as that illustrated in fig. 2A, through the connection scheme of fig. 3A, a first participant (e.g., participant 301) may receive a second ID set from a second participant (e.g., participant 302) and intersect it with the first ID set that the first participant acquired from the data management system, arriving at an ID intersection. Where that ID intersection is the ID intersection of all participants performing the same federated learning task (e.g., participants 301-304 in fig. 3A), the ID intersection step is determined to be complete, and the first participant obtains feature data associated with the federated learning task from the data management system based on the ID intersection. Otherwise, participant 301 may continue by sending the ID intersection to participant 304 so that the intersection process can continue.
As shown in fig. 3B, in some embodiments, the participants may be interconnected in a peer-to-peer network scheme. In a peer-to-peer network, each participant is communicatively connected to all other participants. Each participant may obtain an ID set from any participant and further compute the ID intersection.
It is to be understood that the number of participants included in the federated learning system is not limited to the numbers illustrated in figs. 3A-3B; more or fewer participants may be included. Furthermore, beyond the interconnection schemes shown in figs. 3A-3B, a variety of different interconnection schemes may be employed among the participants in a federated learning system. In practical applications, the number of participants that can be connected to one participant may be determined, for example, based on the encryption method used for communication between participants, and the interconnection scheme selected accordingly.
Fig. 3C shows a schematic diagram of the connection of participants to a collaborator in a federated learning system that contains a collaborator, according to an exemplary embodiment of the present disclosure. As shown in fig. 3C, the federated learning system includes four participants (participants 301-304) and one collaborator 310. Each participant is communicatively connected to the collaborator 310.
In a federated learning approach such as that illustrated in fig. 2B above, a first party (e.g., party 301) may send a first set of IDs to a collaborator 310 and a second party (e.g., party 302) may send a second set of IDs to the collaborator 310 through the connectivity scheme of fig. 3C. The collaborator 310 calculates the intersection of the first ID set and the second ID set to obtain the ID intersection. The collaborator 310 then sends the ID intersection to the first and second parties. The first participant and the second participant acquire feature data associated with the federated learning task from the data management system in respective corresponding data sets based on the ID intersection.
It is to be understood that the number of parties included in the federated learning system is not limited to the number illustrated in FIG. 3C, and may include more or fewer parties. Multiple participants are communicatively coupled to the collaborator 310 in the same manner as in fig. 3C, and each of the multiple participants may send a respective ID set of the corresponding data set to the collaborator 310 and obtain an ID intersection from the collaborator. Based on the ID intersection, the plurality of participants obtain feature data associated with the federal learning task from the data management system.
The step of obtaining an ID intersection of multiple participants is described above, and how to obtain feature data associated with the federated learning task based on the ID intersection to perform subsequent subtasks of the federated learning task will be described in further detail below.
Returning to fig. 1, according to some embodiments, where a relational database is built into the data management system, step S106 in the federated learning method 100 may include: acquiring the full feature data corresponding to the ID intersection from the relational database; and joining the full feature data to generate the feature data associated with the federated learning task.
Based on the ID intersection of the plurality of participants acquired in step S104, each participant may, for example, export the full feature data corresponding to each ID in the ID intersection from the relational database built into the data management system, using an SQL join.
In some embodiments, the full feature data may be distributed across multiple relational data tables of the relational database. Feature data located in different relational data tables can be joined through a join operation in order to export the full feature data.
Thus, the federated learning method 100 obtains only the full feature data corresponding to the ID intersection, further reducing the amount of data read from the data management system. For participants with huge data volumes, the federated learning method 100 uses only the full feature data corresponding to the ID intersection, avoiding transmission and intersection computation over a large participant's entire feature data and further improving the efficiency of the federated learning task. In addition, by storing the data sets in the form of relational data, the feature data corresponding to each ID in the ID intersection can be joined and extracted efficiently even when feature data is stored in different relational data tables.
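Continuing the sqlite3 sketch from above (the profile, spend, and id_intersection tables are hypothetical), the full feature data for each ID in the intersection can be assembled with an SQL join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profile (phone TEXT PRIMARY KEY, age INTEGER);
    CREATE TABLE spend   (phone TEXT PRIMARY KEY, amount REAL);
    CREATE TABLE id_intersection (phone TEXT PRIMARY KEY);
    INSERT INTO profile VALUES ('13800000001', 31), ('13800000002', 45);
    INSERT INTO spend   VALUES ('13800000002', 80.0);
    INSERT INTO id_intersection VALUES ('13800000002');
""")

# Join feature columns spread over several relational data tables,
# keeping only rows whose ID appears in the ID intersection.
rows = conn.execute("""
    SELECT p.phone, p.age, s.amount
    FROM id_intersection AS i
    JOIN profile AS p ON p.phone = i.phone
    JOIN spend   AS s ON s.phone = i.phone
""").fetchall()
print(rows)  # [('13800000002', 45, 80.0)]
```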
According to other embodiments, step S106 of the federated learning method 100 includes: acquiring feature data of a plurality of preset attribute columns corresponding to the ID intersection from the relational database; and joining the feature data of the plurality of preset attribute columns to generate the feature data associated with the federated learning task.
Because of differences among the participants, in some application scenarios the federated learning task uses only some dimensions of the feature data (e.g., some participants care about only a subset of the features). In this solution, based on a plurality of preset feature attributes related to the federated learning task, only the feature data of the corresponding dimensions for the ID intersection is acquired for use by the task, which improves the flexibility and convenience of data use.
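Reusing the tables from the sketch above, restricting the query to preset attribute columns is simply a narrower SELECT; the choice of the age column is illustrative:

```python
# Select only the preset attribute columns needed by this federated
# learning task, rather than the full feature data.
rows = conn.execute("""
    SELECT p.phone, p.age
    FROM id_intersection AS i
    JOIN profile AS p ON p.phone = i.phone
""").fetchall()
print(rows)  # [('13800000002', 45)]
```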
According to some embodiments, the data management system supports data set importation of at least one of the following data source types: csv, txt, HTTP, FTP, MySQL, Oracle, Hive.
A data management system with a built-in relational database supports multiple types of data import. The imported data may have a local file format (e.g., csv, txt, etc.), a remote file format (e.g., HTTP, FTP, etc.), a database/table format (e.g., MySQL, Oracle, Hive, etc.). The imported data is stored in the relational database in the form of relational data (e.g., in the form of relational data tables).
In federated learning, the file formats used by the participants are not necessarily the same. Supporting the import of multiple data types and storing every participant's data set in the form of relational data therefore effectively unifies the data formats, making it convenient for multiple participants to join a federated learning task.
According to some embodiments, the subsequent subtasks of the federated learning task may include, for example, a federated modeling task (e.g., the training of a federated machine learning model). A federated modeling task involves training each participant's submodel multiple times; applying the above federated learning method reduces the time and computing resources required by pre-processing steps such as data transmission and ID intersection computation in each round of training, improving the efficiency of acquiring training data for the federated modeling task and hence the overall efficiency of the task.
In some embodiments, the subsequent subtasks of the federated learning task may further include at least one of a federated prediction task (e.g., a road congestion prediction task based on federated learning), a federated query task (e.g., a task of querying, based on federated learning, data results that satisfy specified conditions), a federated statistical task (e.g., a task of computing statistics, based on federated learning, over data results that satisfy specified conditions), and other tasks involving multiple participants. As with the federated modeling task, applying the federated learning method reduces the time and computing resources required by pre-processing steps such as ID intersection computation in federated prediction, query, and statistical tasks, or reduces the volume of data queried or aggregated based on the ID intersection, thereby improving the overall efficiency of the federated learning task.
Fig. 4 illustrates a flow chart of a federated learning method 400 according to an exemplary embodiment of the present disclosure. The method 400 is applied to a collaborator that is communicatively connected to each of a plurality of participants performing the same federated learning task, where the federated learning method 400 includes: step S402, acquiring a plurality of ID sets from the plurality of participants, where the ID types of the ID sets are the same; step S404, calculating the intersection of the ID sets to obtain the ID intersection; and step S406, sending the ID intersection to each of the multiple participants, so that the multiple participants respectively obtain feature data associated with the federated learning task based on the ID intersection and then execute subsequent subtasks of the federated learning task based on the respective feature data.
Thus, by using only ID sets in the intersection step, the method 400 reduces the time and computing resources required for data transmission between the participants and the collaborator and for the collaborator's ID intersection computation, thereby improving the overall efficiency of executing federated learning tasks.
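A minimal sketch of the collaborator side of steps S402-S406 follows; the ID sets are shown in plaintext here, whereas the disclosure exchanges them in ciphertext, and the sample sets are hypothetical:

```python
from functools import reduce

def collaborator_intersection(id_sets):
    """Step S404: intersect the ID sets received from all participants."""
    return reduce(lambda acc, ids: acc & ids, id_sets)

# Step S402: ID sets of the same ID type received from three participants.
received = [{"a", "b", "c"}, {"b", "c"}, {"c", "d"}]
print(collaborator_intersection(received))  # {'c'}, sent back to each participant in step S406
```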
Fig. 5 shows a flow diagram of a data management method 500 according to an example embodiment of the present disclosure. The method 500 applies to a data management system communicatively coupled to any one of a plurality of parties performing the same federal learning task. Wherein the method 500 comprises: step S502, a first data set of the participant is obtained; step S504, in response to receiving the ID acquisition request of the participant, sending a first ID set of a first data set to the participant; step S506, obtaining the ID intersection of a plurality of participants; step S508, in response to receiving the feature acquisition request of the participant, sending feature data associated with the federal learning task in the first data set to the participant based on the ID intersection, so that the plurality of participants execute subsequent subtasks of the federal learning task based on the respective feature data.
Thus, the method 500 can provide the ID set, reducing the dimensionality of the data transmitted and computed and increasing the processing speed of the intersection step, and can further provide the feature data associated with the federated learning task for executing its subsequent subtasks.
In some embodiments, a relational database may be built in the data management system, and the data sets corresponding to the participants are stored in the form of relational data in the relational database built in the data management system.
In some embodiments, the first ID set corresponds to a column of data in the relational database corresponding to a preset ID type. This column may be a single column of ID data representing the user identity in the relational data. The single ID column can be retrieved through an SQL query and then exported.
In some embodiments, the feature data may be the full feature data corresponding to the ID intersection obtained from the relational database, or the feature data of a plurality of preset attribute columns corresponding to the ID intersection. The feature data may be stored in multiple relational data tables of the relational database and joined and exported through an SQL join operation.
In some embodiments, a data management system with a relational database built in supports multiple types of data importation. The imported data may have a local file format (e.g., csv, txt, etc.), a remote file format (e.g., HTTP, FTP, etc.), and a database/table format (e.g., MySQL, Oracle, Hive, etc.).
According to some embodiments, the method 500 further comprises: obtaining a third data set of the participant, where the participant sends the third data set to the data management system and receives the first ID set of the first data set from the data management system in an asynchronous manner.
In some embodiments, the third data set may be a data set that supplements the federated learning task later on, such as a user portrait data set or a user history data set. Illustratively, when a participant (e.g., a vendor) performing the federated modeling task of a user recommendation model obtains, from the data management system, the first ID set of the participant's user consumption data set (corresponding to the first data set described above), new user consumption data may be generated at the participant; the user consumption data set containing the newly generated data corresponds to the third data set. The participant continues to send this third data set to the data management system for the next federated modeling task of the user recommendation model. In this way, data generated at the participant can be stored continuously in the data management system, while the ID sets of the data sets stored there (e.g., the first and third data sets) can be retrieved for the federated modeling task, for example at preset time nodes. The user recommendation model built later from the third data set may be used to further optimize the earlier model built from the first data set.
At each participant, the sending of data (e.g., the sending of the third set of data) and the obtaining of data (e.g., the obtaining of the first set of IDs) occur in an asynchronous manner.
In particular, each participant may perform the acquisition and transmission of data in an asynchronous manner with the data management system. For example, the participant may obtain an ID set or feature data associated with the federal learning task from the data management system based on a predetermined acquisition time window, and send the data set or ID intersection to the data management system based on a predetermined sending time window different from the acquisition time window, so as to avoid the problems of interface timeout, insufficient bandwidth, data collision, etc. caused by the participant performing data acquisition and sending synchronously.
In some embodiments, any data acquisition and transmission by each participant may be performed asynchronously, e.g., with the collaborators or with other participants, to avoid interface timeouts, bandwidth starvation, data collisions, etc.
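As one possible illustration of such decoupling (the disclosure does not specify a scheduling mechanism; the window lengths and transport stubs below are hypothetical), sending and acquisition can run on independent schedules:

```python
import asyncio

SEND_WINDOW = 2.0   # illustrative window lengths, in seconds
FETCH_WINDOW = 3.0

async def upload(batch):          # stand-in for the real sending transport
    await asyncio.sleep(0.1)

async def download_ids():         # stand-in for the real acquisition transport
    await asyncio.sleep(0.1)
    return {"13800000001"}

async def send_loop(outbox):
    while outbox:                 # sending runs in its own time window...
        await upload(outbox.pop())
        await asyncio.sleep(SEND_WINDOW)

async def fetch_loop(rounds):
    for _ in range(rounds):       # ...decoupled from acquisition
        print("got IDs:", await download_ids())
        await asyncio.sleep(FETCH_WINDOW)

async def main():
    await asyncio.gather(send_loop(["batch-1"]), fetch_loop(2))

asyncio.run(main())
```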
Fig. 6 illustrates a block diagram of a federated learning apparatus 600 according to an exemplary embodiment of the present disclosure. The federated learning apparatus 600 is applied to any one of a plurality of participants performing the same federated learning task, where the federated learning apparatus 600 includes: a first ID acquisition module 601 configured to acquire a first ID set of a first data set of the participant from the data management system; an intersection acquisition module 602 configured to acquire an ID intersection of the multiple participants, the ID intersection being obtained based on the first ID set and a second ID set of the second data sets of other participants among the multiple participants, where the ID type of the second ID set is the same as that of the first ID set; and a feature acquisition module 603 configured to acquire feature data associated with the federated learning task from the first data set based on the ID intersection, so as to perform subsequent subtasks of the federated learning task with the other participants based on the respective feature data.
The federated learning apparatus 600 may be adapted to perform operations similar to those of the federated learning method 100 described above, which are not repeated here.
Fig. 7 illustrates a block diagram of a federated learning apparatus 700 according to an exemplary embodiment of the present disclosure. The federated learning apparatus 700 is applied to a collaborator communicatively connected to each of a plurality of participants performing the same federated learning task, where the federated learning apparatus 700 includes: an ID acquisition module 701 configured to acquire a plurality of ID sets from the plurality of participants, where the ID types of the ID sets are the same; an intersection module 702 configured to calculate the intersection of the ID sets to obtain an ID intersection; and a sending module 703 configured to send the ID intersection to each of the multiple participants, so that the multiple participants respectively obtain feature data associated with the federated learning task based on the ID intersection and then perform subsequent subtasks of the federated learning task based on the respective feature data.
The federated learning apparatus 700 may be adapted to perform operations similar to those of the federated learning method 400 described above, which are not repeated here.
Fig. 8 shows a block diagram of a data management system 800 according to an exemplary embodiment of the present disclosure. The data management system 800 is communicatively connected to any one of a plurality of participants performing the same federated learning task, where the data management system 800 includes: a data acquisition module 801 configured to acquire a first data set of the participant; a sending module 802 configured to send a first ID set of the first data set to the participant in response to receiving an ID acquisition request of the participant; and an ID obtaining module 803 configured to obtain an ID intersection of the multiple participants.
According to some embodiments, the sending module 802 is further configured to send, in response to receiving a feature acquisition request of the participant, feature data associated with the federated learning task in the first data set to the participant based on the ID intersection, so that the plurality of participants perform subsequent subtasks of the federated learning task based on the respective feature data.
The data management system 800 may be adapted to perform operations similar to those of the data management method 500 described above, and will not be described herein.
According to another aspect of the disclosure, a federated learning system is provided, which includes the federated learning apparatus 600 shown in fig. 6 and the data management system 800 shown in fig. 8, where the federated learning apparatus 600 is communicatively connected to the data management system 800. Multiple federated learning apparatuses 600 may be communicatively connected to one another using the interconnection scheme shown in fig. 3A or fig. 3B.
According to some embodiments, the federated learning system includes the federated learning apparatus 600 shown in fig. 6, the data management system 800 shown in fig. 8, and the federated learning apparatus 700 shown in fig. 7, with the federated learning apparatus 600 communicatively connected to the federated learning apparatus 700 and the data management system 800. The federated learning apparatus 600 and the federated learning apparatus 700 may be communicatively connected via the interconnection scheme shown in fig. 3C.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the above-described method.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method described above.
Referring to fig. 9, an electronic device 900 will now be described; it is an example of a hardware device that can be applied to aspects of the present disclosure. The electronic device 900 may be any machine configured to perform processing and/or computing, and may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a robot, a smartphone, an in-vehicle computer, or any combination thereof. The methods described above may each be implemented, in whole or at least in part, by the electronic device 900 or a similar device or system.
The electronic device 900 may include components connected to or in communication with a bus 902, possibly via one or more interfaces. For example, the electronic device 900 may include the bus 902, one or more processors 904, one or more input devices 906, and one or more output devices 908. The one or more processors 904 may be any type of processor and may include, but are not limited to, one or more general-purpose processors and/or one or more special-purpose processors (e.g., special processing chips). The input device 906 may be any type of device capable of inputting information to the electronic device 900, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote control. The output device 908 may be any type of device capable of presenting information, and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator, and/or a printer. The electronic device 900 may also include a non-transitory storage device 910, which may be any non-transitory storage device capable of storing data, including but not limited to a magnetic disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, a magnetic tape or any other magnetic medium, an optical disc or any other optical medium, a ROM (read-only memory), a RAM (random access memory), a cache memory, and/or any other memory chip or cartridge, and/or any other medium from which a computer can read data, instructions, and/or code. The non-transitory storage device 910 may be removable from an interface, and may have data/programs (including instructions)/code for implementing the methods and steps described above. The electronic device 900 may also include a communication device 912, which may be any type of device or system that enables communication with external devices and/or with a network, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a Bluetooth™ device, an 802.11 device, a Wi-Fi device, a WiMAX device, a cellular communication device, and/or the like.
The electronic device 900 may also include a working memory 914, which may be any type of memory that can store programs (including instructions) and/or data useful for the operation of the processor 904, and which may include, but is not limited to, random access memory and/or read-only memory devices.
Software elements (programs) may be located in the working memory 914, including but not limited to an operating system 916, one or more application programs 918, drivers, and/or other data and code. Instructions for performing the methods and steps described above may be included in the one or more application programs 918, and the methods may be implemented by the processor 904 reading and executing the instructions of the one or more application programs 918. More specifically, the federated learning method 100, the federated learning method 400, and the data management method 500 described above may each be implemented, for example, by the processor 904 executing an application 918 having the instructions of steps S102-S106, steps S402-S406, and steps S502-S508, respectively. Further, other steps of the federated learning methods described above may be implemented, for example, by the processor 904 executing an application 918 having instructions for performing the corresponding steps. Executable code or source code of the instructions of the software elements (programs) may be stored in a non-transitory computer-readable storage medium, such as the storage device 910 described above, and may, upon execution, be loaded into the working memory 914 (possibly after being compiled and/or installed). Executable code or source code of the instructions of the software elements (programs) may also be downloaded from a remote location.
It will also be appreciated that various modifications may be made according to specific requirements. For example, customized hardware might also be used, and/or particular elements might be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. For example, some or all of the disclosed methods and apparatus may be implemented by programming hardware (e.g., programmable logic circuitry including Field Programmable Gate Arrays (FPGAs) and/or Programmable Logic Arrays (PLAs)) in an assembly language or a hardware programming language such as VERILOG, VHDL, or C++, using logic and algorithms according to the present disclosure.
It should also be understood that the foregoing methods may be implemented in a server-client mode. For example, a client may receive data input by a user and send the data to a server. Alternatively, the client may receive data input by the user, perform part of the processing of the foregoing methods, and send the resulting data to the server. The server may receive the data from the client, perform the foregoing methods or the remaining part thereof, and return the results of the execution to the client. The client may receive the results of the execution from the server and may present them to the user, for example, through an output device.
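As a minimal sketch of that split (with the network layer elided entirely, since the disclosure does not fix a transport, and all function names invented for illustration), the client-side partial processing and the server-side remainder can be modeled as two plain functions:

    # Client performs part of the method (e.g., normalizing and deduplicating
    # IDs before upload); the server performs the rest (here, an intersection).
    def client_preprocess(raw_ids):
        return {i.strip().lower() for i in raw_ids}

    def server_execute(client_ids, server_ids):
        return sorted(client_ids & set(server_ids))

    result = server_execute(client_preprocess([" U2", "u3 ", "u1"]),
                            ["u2", "u3", "u4"])
    print(result)  # ['u2', 'u3'] would be returned to the client for display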
It should also be understood that the components of the electronic device 900 may be distributed across a network. For example, some processes may be performed using one processor while other processes may be performed by another processor that is remote from the one processor. Other components of electronic device 900 may also be similarly distributed. As such, electronic device 900 may be interpreted as a distributed computing system that performs processing at multiple locations.
While embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatus are merely illustrative embodiments or examples, and that the scope of the invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced by equivalents. Furthermore, the steps may be performed in an order different from that described in the present disclosure, and various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (20)

1. A federated learning method applied to any one of a plurality of participants performing the same federated learning task, the method comprising:
obtaining, from a data management system, a first ID set of a first data set of the participant;
obtaining an ID intersection of the plurality of participants, the ID intersection being obtained based on the first ID set and a second ID set of a second data set of other participants in the plurality of participants, wherein an ID type of the second ID set is the same as an ID type of the first ID set; and
obtaining, from the data management system and based on the ID intersection, feature data in the first data set that is associated with the federated learning task, so as to execute subsequent subtasks of the federated learning task with the other participants based on the respective feature data.
2. The method of claim 1, wherein obtaining the ID intersection of the plurality of participants comprises:
computing, in cooperation with the other participants, an intersection of the first ID set and the second ID set to obtain the ID intersection.
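Claim 2 leaves the cooperative protocol open; in practice this step is usually realized with private set intersection (PSI). The sketch below is a deliberately simplified salted-hash variant, shown only to fix ideas: a production PSI would use a cryptographic protocol such as ECDH-based blinding, and the shared salt here is assumed to have been negotiated out of band.

    import hashlib

    def blind(ids, salt):
        # Each participant hashes its IDs with the jointly agreed salt so
        # that raw IDs are never exchanged (a simplification of real PSI).
        return {hashlib.sha256((salt + i).encode()).hexdigest(): i for i in ids}

    salt = "jointly-agreed-salt"                           # assumed pre-negotiated
    mine = blind({"alice", "bob", "carol"}, salt)          # local (first) ID set
    theirs = blind({"bob", "carol", "dave"}, salt).keys()  # hashes received

    id_intersection = {mine[h] for h in mine.keys() & theirs}
    assert id_intersection == {"bob", "carol"}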
3. The method of claim 1, wherein the plurality of participants are each communicatively connected to a collaborator, and wherein obtaining the ID intersection of the plurality of participants comprises:
sending the first set of IDs to the collaborator;
obtaining the ID intersection from the collaborator, wherein the ID intersection is the intersection of the first ID set and the second ID set as calculated by the collaborator, the second ID set being obtained by the collaborator from the other participants.
4. The method according to any one of claims 1-3, wherein the data management system has a built-in relational database, and the first data set is stored in the relational database in the form of relational data.
5. The method of claim 4, wherein the first ID set corresponds to a column of the first data set in the relational database that corresponds to a predetermined ID type.
6. The method of claim 4, wherein obtaining feature data associated with the federated learning task from the first data set based on the ID intersection comprises:
obtaining the full feature data corresponding to the ID intersection from the relational database; and
concatenating the full feature data to generate the feature data associated with the federated learning task.
7. The method of claim 4, wherein obtaining feature data associated with the federated learning task from the first data set based on the ID intersection comprises:
obtaining feature data of a plurality of preset attribute columns corresponding to the ID intersection from the relational database; and
concatenating the feature data of the plurality of preset attribute columns to generate the feature data associated with the federated learning task.
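Claims 6 and 7 differ only in whether the full set of attribute columns or a preset subset is pulled before concatenation. A minimal pandas sketch of both variants (the table and column names are invented for illustration, not taken from the disclosure):

    import pandas as pd

    df = pd.DataFrame({"uid": ["u1", "u2", "u3"],
                       "age": [30, 41, 25],
                       "income": [5.0, 8.5, 3.2]})
    id_intersection = ["u2", "u3"]

    # Claim 6: all feature columns for the intersected IDs.
    full = df[df["uid"].isin(id_intersection)]

    # Claim 7: only preset attribute columns for the intersected IDs.
    preset_cols = ["uid", "age"]                      # hypothetical preset
    subset = df.loc[df["uid"].isin(id_intersection), preset_cols]

    # "Concatenating" the columns: assemble one feature matrix ordered by ID,
    # ready for the subsequent subtask of the federated learning task.
    features = full.sort_values("uid").drop(columns=["uid"]).to_numpy()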
8. The method of claim 1, wherein the data management system supports importing data sets from at least one of the following data source types: csv, txt, HTTP, FTP, MySQL, Oracle, Hive.
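The source types of claim 8 map naturally onto standard loaders; a hedged dispatch sketch follows (paths and connection strings are placeholders, sqlalchemy is assumed to be available for the database sources, and the table name is invented since the disclosure fixes neither):

    import pandas as pd

    def import_dataset(source_type, location):
        # Dispatch on the data source types enumerated in claim 8.
        if source_type in ("csv", "txt"):
            return pd.read_csv(location)      # local delimited file
        if source_type in ("HTTP", "FTP"):
            return pd.read_csv(location)      # pandas accepts http/ftp URLs
        if source_type in ("MySQL", "Oracle", "Hive"):
            from sqlalchemy import create_engine
            # "location" would be a driver-specific connection string.
            return pd.read_sql("SELECT * FROM source_table",
                               create_engine(location))
        raise ValueError(f"unsupported source type: {source_type}")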
9. The method according to any one of claims 1-8, wherein the subsequent subtasks of the federated learning task include at least one of a federated modeling task, a federated prediction task, a federated query task, and a federated statistics task.
10. A federated learning method applied to a collaborator communicatively connected to each of a plurality of participants performing the same federated learning task, the method comprising:
obtaining a plurality of ID sets from the plurality of participants, wherein the ID types of the plurality of ID sets are the same;
computing the intersection of the plurality of ID sets to obtain an ID intersection; and
sending the ID intersection to each of the plurality of participants, so that the plurality of participants respectively obtain feature data associated with the federated learning task based on the ID intersection and execute subsequent subtasks of the federated learning task based on the respective feature data.
11. A data management method applied to a data management system communicatively connected to any one of a plurality of participants performing the same federated learning task, the method comprising:
obtaining a first data set of the participant;
in response to receiving an ID acquisition request from the participant, sending a first ID set of the first data set to the participant;
obtaining an ID intersection of the plurality of participants; and
in response to receiving a feature acquisition request from the participant, sending feature data in the first data set that is associated with the federated learning task to the participant based on the ID intersection, so that the plurality of participants execute subsequent subtasks of the federated learning task based on the respective feature data.
12. The method of claim 11, further comprising:
obtaining a third data set of the participant,
wherein the participant sends the third data set to the data management system and receives the first ID set of the first data set from the data management system in an asynchronous manner.
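The point of claim 12 is that uploading one data set need not block downloading the ID set of another. A minimal asyncio sketch with stubbed I/O (the sleeps stand in for network transfers, and all names are invented for illustration):

    import asyncio

    async def send_third_data_set(data):
        await asyncio.sleep(0.1)              # stand-in for the upload
        print("third data set sent")

    async def receive_first_id_set():
        await asyncio.sleep(0.1)              # stand-in for the download
        print("first ID set received")
        return {"u1", "u2"}

    async def main():
        # Upload and download run concurrently, per claim 12.
        _, first_ids = await asyncio.gather(
            send_third_data_set({"u9": [0.4]}),
            receive_first_id_set(),
        )
        return first_ids

    print(asyncio.run(main()))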
13. A federated learning apparatus applied to any one of a plurality of participants performing the same federated learning task, the apparatus comprising:
a first ID acquisition module configured to acquire, from a data management system, a first ID set of a first data set of the participant;
an intersection acquisition module configured to acquire an ID intersection of the plurality of participants, the ID intersection being obtained based on the first ID set and a second ID set of a second data set of other participants of the plurality of participants, wherein an ID type of the second ID set is the same as an ID type of the first ID set; and
a feature obtaining module configured to obtain, from the first data set and based on the ID intersection, feature data associated with the federated learning task, so as to execute subsequent subtasks of the federated learning task with the other participants based on the respective feature data.
14. A federated learning apparatus applied to a collaborator communicatively connected to each of a plurality of participants performing the same federated learning task, the apparatus comprising:
an ID acquisition module configured to acquire a plurality of ID sets from the plurality of participants, wherein the ID types of the plurality of ID sets are the same;
an intersection module configured to calculate an intersection of the plurality of ID sets to obtain an ID intersection; and
a sending module configured to send the ID intersection to each of the plurality of participants, so that the plurality of participants respectively obtain feature data associated with the federated learning task based on the ID intersection and execute subsequent subtasks of the federated learning task based on the respective feature data.
15. A data management system communicatively connected to any one of a plurality of participants performing the same federated learning task, the data management system comprising:
a data acquisition module configured to acquire a first data set of the participant;
a sending module configured to send a first ID set of the first data set to the participant in response to receiving an ID acquisition request from the participant;
an ID acquisition module configured to acquire an ID intersection of the plurality of participants,
wherein the sending module is further configured to send, in response to receiving a feature acquisition request from the participant, feature data in the first data set that is associated with the federated learning task to the participant based on the ID intersection, so that the plurality of participants execute subsequent subtasks of the federated learning task based on the respective feature data.
16. A federated learning system, comprising:
the federated learning apparatus of claim 13; and
the data management system of claim 15.
17. The system of claim 16, further comprising:
the federated learning apparatus of claim 14.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
19. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method according to any one of claims 1-12.
20. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-12.
CN202210665673.3A 2022-06-13 2022-06-13 Federal learning method, apparatus, system, device, medium, and program product Pending CN114925119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210665673.3A CN114925119A (en) 2022-06-13 2022-06-13 Federal learning method, apparatus, system, device, medium, and program product

Publications (1)

Publication Number Publication Date
CN114925119A 2022-08-19

Family

ID=82815426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210665673.3A Pending CN114925119A (en) 2022-06-13 2022-06-13 Federal learning method, apparatus, system, device, medium, and program product

Country Status (1)

Country Link
CN (1) CN114925119A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474119A (en) * 2023-10-20 2024-01-30 上海零数众合信息科技有限公司 Fault prediction method and system based on federal learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200285980A1 (en) * 2019-03-08 2020-09-10 NEC Laboratories Europe GmbH System for secure federated learning
CN112163896A (en) * 2020-10-19 2021-01-01 科技谷(厦门)信息技术有限公司 Federated learning system
CN112163979A (en) * 2020-10-19 2021-01-01 科技谷(厦门)信息技术有限公司 Urban traffic trip data analysis method based on federal learning
CN112651592A (en) * 2020-11-27 2021-04-13 科技谷(厦门)信息技术有限公司 Enterprise credit assessment system based on multimodal transport
CN114462060A (en) * 2022-01-25 2022-05-10 深圳致星科技有限公司 Federal learning system and federal learning task processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220819)