CN114897067A - Federated learning-based decision model training method and device and federated learning system - Google Patents

Federated learning-based decision model training method and device and federated learning system

Info

Publication number
CN114897067A
Authority
CN
China
Prior art keywords
data
sample
sample identification
participant
splitting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210493876.9A
Other languages
Chinese (zh)
Inventor
彭胜波
周吉文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210493876.9A priority Critical patent/CN114897067A/en
Publication of CN114897067A publication Critical patent/CN114897067A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure provides a federated learning-based decision model training method and apparatus, a federated learning system, an electronic device, a computer-readable storage medium, and a computer program product, relating to the field of artificial intelligence and in particular to federated learning. The scheme is as follows: the first participant sorts part of the sample data among a plurality of sample data based on one of at least one attribute value of that part of the sample data; divides the sorted sample data into a plurality of data packets; and, for each data packet: generates a sample identification ID code based on a sample identification ID set formed from the sample identification IDs of at least one sample data item; sends the sample identification ID code to a second participant; receives a count of each of at least one tag value corresponding to the data packet; and determines a splitting gain corresponding to the data packet based on the count. The first participant then determines the local splitting nodes included in the decision model based on the splitting gain corresponding to each data packet.

Description

Federated learning-based decision model training method and device and federated learning system
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a federated learning-based decision model training method and apparatus, a federated learning system, an electronic device, a computer-readable storage medium, and a computer program product.
Background
With the development of computer technology and the progress of artificial intelligence technology, Federated Learning has gradually become a hot topic in the field of artificial intelligence. It refers to a computing process in which a plurality of data holders can train models and obtain a final model without their original data leaving the local environment, so that the training task of a machine learning model can be completed through multi-party cooperation.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a decision model training method, apparatus, federated learning system, electronic device, computer-readable storage medium, and computer program product based on federated learning.
According to one aspect of the disclosure, a federated learning-based decision model training method is provided. The method is applied to a first participant holding a plurality of sample data in a federated learning system, and the federated learning system further comprises a second participant communicatively connected to the first participant. The method comprises the following steps: sorting part of the sample data among the plurality of sample data based on one of at least one attribute value of that part of the sample data; dividing the sorted sample data into a plurality of data packets; and, for each data packet: generating a sample identification ID code based on a sample identification ID set formed by the sample identification ID of at least one sample data item in the data packet; sending the sample identification ID code to the second participant; receiving, from the second participant, a count of each of at least one tag value corresponding to the data packet; and determining, based on the count, a splitting gain corresponding to the data packet, the splitting gain indicating the heterogeneity of classification results expected to be obtained by the decision model after node splitting at the data packet. The method further comprises determining the local splitting nodes included in the decision model based on the splitting gain corresponding to each data packet, thereby training the decision model.
According to another aspect of the disclosure, a federated learning-based decision model training method is provided. The method is applied to a second participant holding a plurality of tag data in the federated learning system, and the federated learning system further comprises a first participant communicatively connected to the second participant. The method comprises the following steps: receiving a plurality of sample identification ID codes from the first participant, each of the plurality of sample identification ID codes being generated by the first participant based on a sample identification ID set of at least one sample data item, among the plurality of sample data held by the first participant, that is divided into a corresponding one of a plurality of data packets; and, for each sample identification ID code: querying, from the plurality of tag data, the tag value of at least one tag data item associated with the sample identification ID code; determining, based on the tag values of the associated at least one tag data item, a count of each of at least one tag value corresponding to the respective data packet; and sending the count to the first participant, so that the first participant determines the local splitting node included in the decision model based on the count, thereby training the decision model.
According to another aspect of the present disclosure, a federated learning-based decision model training apparatus is provided. The apparatus is applied to a first participant holding a plurality of sample data in the federated learning system, and the federated learning system further comprises a second participant communicatively connected to the first participant. The apparatus includes a sorting unit, a grouping unit, an encoding unit, an encoding sending unit, a tag value count receiving unit, a splitting gain determining unit, and a training unit. The sorting unit is configured to sort part of the sample data among the plurality of sample data based on one of at least one attribute value of that part of the sample data; the grouping unit is configured to divide the sorted sample data into a plurality of data packets. For each data packet: the encoding unit is configured to generate a sample identification ID code based on a sample identification ID set consisting of the sample identification IDs of at least one sample data item in the data packet; the encoding sending unit is configured to send the sample identification ID code to the second participant; the tag value count receiving unit is configured to receive, from the second participant, a count of each of at least one tag value corresponding to the data packet; and the splitting gain determining unit is configured to determine, based on the count, a splitting gain corresponding to the data packet, the splitting gain indicating the heterogeneity of classification results expected to be obtained by the decision model after node splitting at the data packet. Furthermore, the training unit is configured to determine the local splitting nodes included in the decision model based on the splitting gain corresponding to each data packet, thereby training the decision model.
According to another aspect of the present disclosure, a federated learning-based decision model training apparatus is provided. The apparatus is applied to a second participant holding a plurality of tag data in the federated learning system, and the federated learning system further comprises a first participant communicatively connected to the second participant. The apparatus includes an encoding receiving unit, a tag value querying unit, a tag value counting unit, and a tag value count sending unit. The encoding receiving unit is configured to receive a plurality of sample identification ID codes from the first participant, each of the plurality of sample identification ID codes being generated by the first participant based on a sample identification ID set of at least one sample data item, among the sample data held by the first participant, that is divided into a corresponding one of a plurality of data packets. For each sample identification ID code: the tag value querying unit is configured to query, from the plurality of tag data, the tag value of at least one tag data item associated with the sample identification ID code; the tag value counting unit is configured to determine, based on the tag values of the associated at least one tag data item, a count of each of at least one tag value corresponding to the respective data packet; and the tag value count sending unit is configured to send the count to the first participant, so that the first participant determines the local splitting node included in the decision model based on the count, thereby training the decision model.
According to another aspect of the present disclosure, a federated learning system is provided, which includes the above-described federated learning-based decision model training apparatus applied to a first participant holding a plurality of sample data in the federated learning system, and the above-described federated learning-based decision model training apparatus applied to a second participant holding a plurality of tag data in the federated learning system.
According to another aspect of the present disclosure, there is provided an electronic device comprising at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of decision model training based on federated learning as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method of decision model training based on federated learning as described above.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements a method of decision model training based on federated learning as described above.
According to one or more embodiments of the present disclosure, the security of data in the federated learning process can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 illustrates a flow diagram of a federated learning-based decision model training method in accordance with an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a portion of the process of a federated learning-based decision model training method, in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates a flow diagram of a portion of the process of a federated learning-based decision model training method in accordance with an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a portion of the process of a federated learning-based decision model training method, in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates a flow diagram of a federated learning-based decision model training method in accordance with an embodiment of the present disclosure;
FIG. 7 illustrates a flow diagram of a portion of the process of a federated learning-based decision model training method in accordance with an embodiment of the present disclosure;
FIG. 8 illustrates a flow diagram of a portion of the process of a federated learning-based decision model training method in accordance with an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a portion of the process of a federated learning-based decision model training method, in accordance with an embodiment of the present disclosure;
FIG. 10 shows a block diagram of a federated learning-based decision model training apparatus applied to a first participant in a federated learning system, in accordance with an embodiment of the present disclosure;
FIG. 11 shows a block diagram of a federated learning-based decision model training apparatus applied to a second participant in a federated learning system, in accordance with an embodiment of the present disclosure;
FIG. 12 illustrates a process diagram for model training using a federated learning-based decision model training method according to an embodiment of the present disclosure; and
FIG. 13 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", and the like to describe various elements is not intended to limit the positional relationship, the temporal relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing the particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
Federated learning (also called federated machine learning or joint learning) is a machine learning framework that can effectively help a plurality of data holders to use data and perform machine learning modeling while meeting the requirements of user privacy protection, data security, and government regulations. According to the data distribution situation, federated learning can be realized as horizontal federated learning and vertical federated learning. Horizontal federated learning is also referred to as feature-aligned federated learning, i.e., the data features of the participants in horizontal federated learning are aligned; it is suitable for cases where the data features of the participants overlap more and the sample identifications (IDs) overlap less. Vertical federated learning is also called sample-aligned federated learning, i.e., the training samples of the participants in vertical federated learning are aligned; it is suitable for cases where the training sample IDs of the participants overlap more and the data features overlap less.
In the related art, implementations of federated learning may need an intermediate coordinating node (or coordinator, intermediate coordinating party) that participates in the computing tasks of data aggregation and data distribution, which increases the risk of privacy disclosure of the participants' data. In addition, in practical applications, it is difficult to find such an intermediate coordinator that is trusted by the other participants.
In addition, related-art implementations of federated learning, especially of decision models based on federated learning, tend to leak tag data: the participant that owns the tag data encrypts the tag data and sends it to the other participants. In a classification task, because the value types of the tag data are limited, the encrypted tag data exhibit a certain statistical regularity, and a participant holding sample data can relatively easily infer the original plaintext tag data from the ciphertext data. This poses a challenge to the participant that owns the tag data and to user privacy.
Furthermore, in the related art, the overall computational efficiency of federated learning is low. In order to protect the privacy of the tag data and sample data of the participants to the maximum extent, related technologies rely heavily on homomorphic encryption, which makes the calculation logic complex and the communication overhead high, so such methods cannot be well applied to large-scale data sets.
In view of this, the present disclosure provides a federated learning-based decision model training method that reduces or avoids leakage of tag data, thereby improving the security of data, particularly tag data, in federated learning. In addition, according to the decision model training method of the embodiments of the present disclosure, the communication traffic between two parties (or multiple parties) can be greatly reduced, so that the efficiency of federated learning is improved and the pressure on computing resources is reduced.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved are all in accordance with relevant laws and regulations and do not violate public order and good customs.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications capable of performing a federated learning-based decision model training method (e.g., a decision model training method applied to a first participant in a federated learning system that holds multiple sample data, or a decision model training method applied to a second participant in a federated learning system that holds multiple tag data, in accordance with embodiments of the present disclosure).
In some embodiments, the server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein, and is not intended to be limiting.
The client devices 101, 102, 103, 104, 105, and/or 106 may also run one or more services or software applications capable of performing federated learning-based decision model training methods (e.g., a decision model training method applied to a first participant in a federated learning system that holds multiple sample data, or a decision model training method applied to a second participant in a federated learning system that holds multiple tag data, in accordance with an embodiment of the present disclosure). The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptops), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various Mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-range servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 can also run any of a variety of additional server applications and/or mid-tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106. In some implementations, the server 120 may include one or more applications, such as applications for services such as object detection and recognition, signal conversion, etc., based on data such as image, video, voice, text, digital signals, etc., to process task requests such as voice interactions, text classification, image recognition, or keypoint detection, etc., received from the client devices 101, 102, 103, 104, 105, and/or 106. The server can train the neural network model by using the training samples according to a specific deep learning task, can test each sub-network in the super-network module of the neural network model, and determines the structure and parameters of the neural network model for executing the deep learning task according to the test result of each sub-network. Various data can be used as training sample data of the deep learning task, such as image data, audio data, video data or text data. After the training of the neural network model is completed, the server 120 may also automatically search out an optimal model structure through a model search technique to perform a corresponding task.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or a smart cloud computing server or a smart cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system, and addresses the drawbacks of high management difficulty and weak service scalability in traditional physical host and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The database 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to the command.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
FIG. 2 shows a flow diagram of a federated learning-based decision model training method 200, in accordance with an embodiment of the present disclosure.
The method 200 is applied to a first party holding a plurality of sample data in a federated learning system that also includes a second party communicatively coupled to the first party.
According to some embodiments, each of a plurality of sample data held by the first party comprises a sample identification ID and at least one attribute value, and the second party holds a plurality of tag data, the plurality of tag data being respectively associated with respective ones of the plurality of sample data.
As shown in fig. 2, the method 200 includes:
step S210, sorting partial sample data based on one attribute value of at least one attribute value of partial sample data in a plurality of sample data;
step S220, dividing the sequenced sample data into a plurality of data groups;
for each data packet, the method 200 further comprises:
step S230, generating a sample identification ID code based on a sample identification ID set formed by the sample identification ID of at least one sample data in the data grouping;
step S240, the sample identification ID code is sent to a second participant;
step S250, receiving the count of each label value in at least one label value corresponding to the data packet from the second participant; and
step S260, determining a splitting gain corresponding to the data packet based on the count, wherein the splitting gain indicates heterogeneity of classification results expected to be obtained by the decision model after node splitting at the data packet.
In addition, the method 200 further includes step S270, determining a local splitting node included in the decision model based on the splitting gain corresponding to each data packet, so as to train the decision model.
According to the method 200, a first participant holding sample data groups the sorted sample data and, for each data packet, sends the sample identification ID code corresponding to that packet to the second participant, receives from the second participant a count of each of at least one tag value corresponding to the data packet, and trains the decision model based on the counts of the tag values. It can be seen that the first participant neither sends the full amount of sample data it holds to the second participant holding the tag data, nor sends the attribute values of the sample data to the second participant, thereby avoiding leakage of the attribute information of the sample data held by the first participant.
In addition, the first participant receives from the second participant a count of each tag value corresponding to each data packet without receiving the tag data themselves, thereby being able to reduce or avoid leakage of the second participant's tag data. Especially when training decision models or classification models, it is difficult for the first participant to infer the original plaintext tag data from the counts of tag values, because the first participant receives neither the tag data themselves nor encrypted values of the tag data. The counts of the tag values can nevertheless be used to train the decision model, so that leakage of the second participant's tag data is reduced or avoided while the model training effect is ensured. This improves the security of data, particularly tag data, in federated learning.
The federated learning system may be a vertical federated learning system. The sample identification IDs of the plurality of sample data held by the first participant of the federated learning system may be unique numbers of the sample data, and the sample identification IDs of the tag data held by the second participant may be numbers corresponding to the sample identification IDs of the plurality of sample data held by the first participant. It should be understood that the second participant may also hold sample data corresponding to the tag data; for simplicity of description, the sample data of the second participant is not discussed below, and only the tag data it holds is described.
The attribute of the sample data may be a feature of the sample data. For example, the sample data may include a user's age, education level, consumption amount, and so on. Here, age, education level, and consumption amount may each be an attribute (or feature) of the sample data; the specific values of age, education level, and consumption amount may be attribute values (or feature values) of the sample data. It should be understood that in the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information are all in accordance with relevant laws and regulations and do not violate public order and good customs.
In some examples, at step S210, sorting the partial sample data based on one of the at least one attribute value of the partial sample data of the plurality of sample data may include: and according to an attribute value of the sample data, performing ascending sorting or descending sorting on part of the sample data. The sorted data has a certain rule, so that the grouping of the sample data is facilitated. For example, the sample data may be sorted according to their age in ascending order of age. As another example, the sample data may be sorted in descending order of education level according to the education level of the sample data.
In some examples, the partitioning of the sorted sample data into a plurality of data packets at step S220 may include grouping the sorted sample data evenly or by chi-square binning. For example, 10000 pieces of sample data sorted in ascending order of age may be distributed into 10 evenly divided data packets, where each data packet covers a value interval of 10 years, i.e., the 10 packets may be 1-10 years old, 11-20 years old, ..., 91-100 years old, respectively.
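As a non-limiting illustration of steps S210 and S220, the following Python sketch sorts sample data by one attribute and divides the sorted data into equal-width data packets. The function name, the dictionary-based sample representation, and the equal-width binning rule are illustrative assumptions rather than part of the disclosed embodiments.

```python
from typing import Any, Dict, List

def sort_and_bin(samples: List[Dict[str, Any]], attribute: str, n_bins: int) -> List[List[Dict[str, Any]]]:
    # Step S210: sort the partial sample data in ascending order of one attribute value.
    ordered = sorted(samples, key=lambda s: s[attribute])
    # Step S220: divide the sorted sample data into n_bins data packets (equal-width bins here).
    lo, hi = ordered[0][attribute], ordered[-1][attribute]
    width = (hi - lo) / n_bins or 1
    packets: List[List[Dict[str, Any]]] = [[] for _ in range(n_bins)]
    for s in ordered:
        idx = min(int((s[attribute] - lo) / width), n_bins - 1)
        packets[idx].append(s)
    return packets

# Example: group samples by age into 10 data packets.
packets = sort_and_bin([{"id": i, "age": a} for i, a in enumerate([23, 35, 41, 18, 67])], "age", 10)
```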
By grouping the data, on the one hand, subsequent communication with the second party based on the full amount of samples of the first party can be avoided, thereby improving data security. On the other hand, the data communication quantity can be reduced, so that the model training speed is improved, and the calculation pressure of calculation resources is reduced. In addition, through data grouping, some continuous attribute values (feature values) can be preprocessed, so that the continuous attribute values are discretized, and thus model training can be performed based on the discretized attribute values. Moreover, the discretized attributes can be subjected to feature crossing, so that the occurrence of overfitting of the machine learning model is reduced.
It should be understood that different data grouping rules may be selected according to different properties of the sample data, and are not limited herein.
According to some embodiments, the sample identification ID code may comprise a sample identification ID code array having a plurality of storage bits, and step S230 may comprise: for each sample data item in the data packet, mapping the sample identification ID of the sample data to part of the storage bits among the plurality of storage bits according to a first mapping rule.
Referring to fig. 3, fig. 3 shows a schematic diagram of a part of a process of a federal learning based decision model training method 200 in accordance with an embodiment of the present disclosure. As shown in fig. 3, the sample identification ID code may include a sample identification ID code array 310 having 12 storage bits. Here, the storage bits may be digital bits or binary bits in the category of computer science.
For example, for one sample data, the sample identification ID (A in fig. 3) of the sample data may be mapped to the 1st, 3rd, and 7th storage bits, respectively, of the 12 storage bits of the sample identification ID encoding array 310 according to the first mapping rule. For another sample data, the sample identification ID (B in fig. 3) of the sample data may be mapped to the 5th, 8th, and 10th storage bits, respectively, of the 12 storage bits of the sample identification ID encoding array 310 according to the first mapping rule.
In one example, the storage bit may be a binary bit; a plurality of different hash functions may be used to process the sample identification ID of each sample data to obtain a plurality of different hash values, and the positions in the ID encoding array 310 pointed to by the hash values (e.g., the 1st, 3rd, and 7th storage bits or the 5th, 8th, and 10th storage bits) are set to 1, while the remaining storage bits may each be set to 0.
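A minimal sketch of this binary-bit variant follows, assuming SHA-256-based hash functions and illustrative parameter values; the function name and array size are assumptions for illustration only.

```python
import hashlib

def encode_sample_ids(sample_ids, num_bits=128, num_hashes=3):
    # Sample identification ID encoding array with num_bits binary storage bits, all initialized to 0.
    bits = [0] * num_bits
    for sid in sample_ids:
        for k in range(num_hashes):
            # k-th hash function: SHA-256 over the salted sample identification ID.
            digest = hashlib.sha256(f"{k}:{sid}".encode()).hexdigest()
            # Set the storage bit pointed to by the hash value to 1.
            bits[int(digest, 16) % num_bits] = 1
    return bits

# The first participant encodes the sample IDs of one data packet and sends only this array.
id_code = encode_sample_ids(["A", "B"], num_bits=12)
```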
Therefore, the sample identification ID of each sample data item in the data packet can be mapped to part of the storage bits among the plurality of storage bits of the encoding array 310, and the encoding array 310 is used as the sample identification ID code. The sample identification ID code contains information related to the sample identification ID of each sample data item in the data packet, and the amount of data sent by the first participant when sending the sample identification ID code is small, so that the efficiency of information communication can be improved and the communication burden and memory consumption can be reduced. Moreover, the sample identification IDs are appropriately encrypted in this way, which can improve the security of the sample identification ID code during transmission.
FIG. 4 shows a flow diagram of a portion of a federated learning-based decision model training method 200, according to an embodiment of the present disclosure.
As shown in fig. 4, according to some embodiments, the step of mapping the sample identification ID of the sample data to part of the storage bits among the plurality of storage bits according to the first mapping rule may include:
step S431, dividing the sample identification ID of the sample data into a plurality of sub-parts;
step S432, encrypting each sub-part to obtain an encrypted character string; and
step S433, mapping the encrypted character string corresponding to each sub-part to part of the storage bits among the plurality of storage bits.
Referring to fig. 5, fig. 5 shows a schematic diagram of a part of a process of a federal learning based decision model training method 200 in accordance with an embodiment of the present disclosure. As shown in fig. 5, the sample identification ID code may include a sample identification ID code array 510 having 12 storage bits.
For example, for one sample data, the sample identification ID (C in fig. 5) of the sample data may be divided into a plurality of sub-parts; each sub-part is then encrypted to obtain the encrypted character strings C1, C2, and C3; and the encrypted character strings C1, C2, and C3 corresponding to the sub-parts are mapped to the 1st, 3rd, and 7th storage bits, respectively, of the 12 storage bits of the sample identification ID encoding array 510. For another sample data, the sample identification ID (D in fig. 5) of the sample data may be divided into a plurality of sub-parts; each sub-part is then encrypted to obtain the encrypted character strings D1, D2, and D3; and the encrypted character strings D1, D2, and D3 corresponding to the sub-parts are mapped to the 5th, 8th, and 10th storage bits, respectively, of the 12 storage bits of the sample identification ID encoding array 510.
In one example, the storage bits may be numeric bits in the field of computer science, such as bits capable of storing character strings. In one example, each sub-part may be encrypted using an algorithm such as MD5 or SHA-256 to obtain the encrypted character string. As shown in fig. 5, the positions in the ID encoding array 510 (e.g., the 1st, 3rd, and 7th storage bits or the 5th, 8th, and 10th storage bits) are set to the corresponding encrypted strings (e.g., C1, C2, C3, D1, D2, D3), respectively, and the remaining storage bits may each be set to N (null).
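The following sketch illustrates the variant of figures 4 and 5 under stated assumptions: each sample identification ID is split into a fixed number of sub-parts, each sub-part is hashed (SHA-256 is used here purely for illustration), and the resulting strings are placed into storage bits selected by the hash value; unused bits remain None (the "N" in fig. 5). The partitioning rule and slot selection are assumptions, not the disclosed mapping rule itself.

```python
import hashlib

def encode_with_substrings(sample_ids, num_slots=12, num_parts=3):
    slots = [None] * num_slots  # storage bits able to hold character strings; None == "N" in fig. 5
    for sid in sample_ids:
        # Step S431: divide the sample identification ID into num_parts sub-parts.
        step = max(1, len(sid) // num_parts)
        parts = [sid[i:i + step] for i in range(0, len(sid), step)][:num_parts]
        for part in parts:
            # Step S432: encrypt (hash) each sub-part to obtain an encrypted character string.
            digest = hashlib.sha256(part.encode()).hexdigest()
            # Step S433: map the encrypted string into one of the storage bits (hash-selected here).
            slots[int(digest, 16) % num_slots] = digest[:8]
    return slots

id_code = encode_with_substrings(["C1234567", "D7654321"])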
The encrypted character strings further improve the data security of the sample identification IDs and greatly reduce the possibility of data being overwritten (collisions) during encoding, thereby improving the performance of the trained model.
According to some embodiments, the step S260 of determining, based on the count, a splitting gain corresponding to the data packet may include:
determining, based on the counts, a respective proportion of the number of each tag value among the at least one tag value; and
determining the splitting gain based on the proportion corresponding to each tag value.
The above process will be described below with reference to the example of table 1.
Table 1 shows the sample identifications ID (1 to 10) of the 10 samples within one data packet, and the corresponding Label values (including three labels Label 1, Label 2, Label 3) of the respective samples at the second participant.
TABLE 1
[Table 1: sample identification IDs 1 to 10 of the 10 samples within the data packet and the corresponding Label value (Label 1, Label 2, or Label 3) held by the second participant for each sample.]
First, in step S250, the first participant receives, from the second participant, a count of each of the at least one tag value corresponding to the data packet; in this example, the counts are 6 samples with Label 1, 3 samples with Label 2, and 1 sample with Label 3. The first participant may then determine the proportion corresponding to each tag value (0.6, 0.3, and 0.1, respectively) and determine the splitting gain based on these proportions.
In one example, the splitting gain may be expressed as a Gini index. The Gini index may indicate the heterogeneity (or impurity) of the classification results expected to be obtained by the decision model after node splitting at the data packet: the higher the sample purity, the smaller the Gini index. The Gini index can be calculated using the following equation:
Gini = 1 - Σ_j (p_j)^2
where j indexes the classes of Label values at the split node, i.e., j runs over the types of Label values corresponding to the samples; for the example of Label 1, Label 2, and Label 3 there are 3 classes. p_j denotes the proportion of class j among the samples at the split node; for the above example, p_j is 0.6, 0.3, and 0.1, respectively.
Further, according to the above equation, the Gini index is:
Gini = 1 - (0.6^2 + 0.3^2 + 0.1^2) = 0.54
Thus, the corresponding splitting gain can be determined to be 0.54.
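A short sketch of this splitting-gain computation from the received counts follows; the function name is an illustrative assumption. Note that only the counts of the tag values are needed, never the tag data themselves.

```python
def gini_from_counts(counts):
    # Turn the counts received from the second participant into proportions p_j,
    # then compute Gini = 1 - sum(p_j ** 2).
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini_from_counts([6, 3, 1]))  # 0.54, matching the worked example above
```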
According to some embodiments, in step S270, determining the local splitting node included in the decision model based on the splitting gain corresponding to each data packet, so as to train the decision model, may include: selecting the splitting node with the minimum splitting gain from the splitting nodes corresponding to each data packet as the local splitting node, wherein the minimum splitting gain indicates that the classification results expected to be obtained by the decision model after node splitting at the data packet have the highest purity (i.e., the lowest heterogeneity).
Therefore, by selecting, as the local splitting node, the splitting node whose data packet yields the purest expected classification results, the purity of the classification result expected to be obtained at that node is the highest, and the classification effect is better.
According to some embodiments, the method 200 may further comprise:
sending the splitting attribute corresponding to the local splitting node to the second participant; and
in response to receiving indication information from the second participant, recording the splitting attribute and the splitting threshold corresponding to the local splitting node.
Thus, the splitting attribute and the splitting threshold of the model splitting node are recorded at the first participant, while the holder of the tag data (i.e., the second participant) does not need to record them. This simplifies the model structure in the federated learning system, reduces the amount of data communication in the training process and in the subsequent prediction process using the model, and can further reduce the pressure on computing resources.
According to some embodiments, the first party may include a plurality of computing units, and the method 200 may further include:
dividing the plurality of sample data into a plurality of data sets, wherein sample data of the same attribute category are divided into the same data set and each data set comprises sample data of at least one attribute category; the sample data of the same attribute category within any one data set constitutes the partial sample data mentioned above; and
each data set is assigned to a respective one of the plurality of computational units such that each computational unit trains the decision model in parallel.
Therefore, sample data of the same attribute category are divided into the same data set (i.e., the sample data are partitioned vertically according to attribute category), and each data set is assigned to a corresponding one of the plurality of computing units, so that the first participant can distribute the heavy model training task across the plurality of computing units, which train the decision model in parallel, thereby further improving the efficiency of model training. The applicant has found that this parallel training method is particularly suitable for model training scenarios with a large amount of sample data and many sample data attributes (such as training scenarios of random forest models).
In one example, each computing unit may be assigned a computer process, with each process separately performing a model training task for one computing unit.
In one example, the dividing of the data sets may be performed first, and then steps S210-S270 in the method 200 may be performed in each divided data set.
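A possible sketch of this parallel arrangement is shown below, assuming one worker process per computing unit; the data-set representation and the placeholder training routine are illustrative assumptions, and the actual per-data-set logic would be steps S210 to S270.

```python
from multiprocessing import Pool

def train_on_data_set(data_set):
    # Runs steps S210-S270 on one attribute-wise data set and returns the best local
    # split found for this subset (the actual training logic is omitted in this sketch).
    best_split = {"attribute": data_set["attribute"], "gain": None}
    return best_split

def parallel_train(data_sets, num_units=4):
    # One worker per computing unit; every computing unit trains the decision model in parallel.
    with Pool(processes=num_units) as pool:
        return pool.map(train_on_data_set, data_sets)

if __name__ == "__main__":
    results = parallel_train([{"attribute": "age"}, {"attribute": "education"}], num_units=2)
```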
FIG. 6 illustrates a flow diagram of a federated learning-based decision model training method 600, in accordance with an embodiment of the present disclosure.
the method 600 is applied to a second party holding a plurality of tag data in a federated learning system that also includes a first party communicatively coupled to the second party.
According to some embodiments, each of the plurality of tag data held by the second party comprises a sample identification ID and a tag value, and the first party holds a plurality of sample data, each associated with a respective one of the plurality of tag data.
As shown in fig. 6, method 600 includes:
step S610, receiving a plurality of sample identification ID codes from the first participant, each of the plurality of sample identification ID codes being generated by the first participant based on a sample identification ID set formed by the sample identification ID of at least one sample data item, among the plurality of sample data held by the first participant, that is divided into a corresponding one of a plurality of data packets;
for each sample identification ID code, the method 600 further comprises:
step S620, inquiring the label value of at least one label data associated with the sample identification ID code from a plurality of label data;
step S630, determining a count of each of at least one tag value corresponding to the corresponding data packet based on the tag value of the associated at least one tag data; and
step S640, sending the count to the first participant, so that the first participant determines the local split node included in the decision model based on the count, thereby training the decision model.
According to the method 600, upon receiving a sample identification ID code from the first participant holding sample data, the second participant queries the tag data it holds for the tag value of at least one tag data item associated with the sample identification ID code, and returns the count of each tag value to the first participant. Therefore, on the one hand, the second participant obtains neither the full amount of sample data nor the attribute information of the sample data held by the first participant, so that leakage of the attribute information of the sample data held by the first participant can be avoided.
On the other hand, the second party transmits the count of each tag value corresponding to each packet to the first party without transmitting the tag data itself, thereby being able to reduce or avoid the tag data of the second party from being leaked. Especially for training decision models or classification models, it is difficult for the first participant to extrapolate back from the count of tag values to the original plaintext tag data because the second participant does not send the tag data itself or the encrypted value of the tag data. The count of the label values can be used to train the decision model, thereby reducing or avoiding leakage of label data of the second participant on the basis of ensuring the model training effect. This improves the security of data in federal learning, especially tag data.
Referring to fig. 7, fig. 7 shows a flow diagram of a portion of the federated learning-based decision model training method 600, in accordance with an embodiment of the present disclosure. According to some embodiments, the sample identification ID code may include a sample identification ID code array having a plurality of storage bits, and some of the plurality of storage bits store the sample identification IDs of the sample data mapped thereto by the first participant according to the first mapping rule. The step S620 of querying, from the plurality of tag data, the tag value of at least one tag data item associated with the sample identification ID code may include:
step S721, according to the second mapping rule associated with the first mapping rule, parsing the sample ID code to obtain a sample ID set corresponding to the sample ID code; and
step S722, the tag value of the tag data associated with each sample identification ID in the sample identification ID set is queried from the plurality of tag data.
Because the sample identification ID code contains the information related to the sample identification ID of each sample data in the corresponding data packet, when the second participant receives the sample identification ID code, the data volume needing to be processed is small, so that the efficiency of information communication can be improved, and the communication burden and the memory consumption can be reduced. And the sample identification ID is properly encrypted, so that the security of the sample identification ID code during transmission can be improved.
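A minimal sketch of steps S620 to S640 on the second participant's side follows, assuming the sample identification ID set has already been recovered from the ID code and that the tag data are held as a mapping from sample ID to tag value; the function name and data layout are illustrative assumptions.

```python
from collections import Counter

def count_tag_values(recovered_ids, tag_data):
    # Step S620: query the tag value associated with every recovered sample identification ID.
    labels = [tag_data[sid] for sid in recovered_ids if sid in tag_data]
    # Step S630: determine the count of each tag value corresponding to the data packet.
    return dict(Counter(labels))

# Step S640: only these counts are sent back to the first participant, never the tag data.
counts = count_tag_values([1, 2, 3, 4], {1: "Label 1", 2: "Label 1", 3: "Label 2", 4: "Label 3"})
# {'Label 1': 2, 'Label 2': 1, 'Label 3': 1}
```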
According to some embodiments, a portion of the plurality of storage bits may have stored therein an encryption string mapped therein by the first party, the encryption string being encrypted from a plurality of subsections into which the sample identification ID of the sample data is divided.
The encrypted character strings further improve the data security of the sample identification IDs and greatly reduce the possibility of data being overwritten (collisions) during encoding, thereby improving the performance of the trained model.
FIG. 8 shows a flow diagram of a portion of a federated learning-based decision model training method 600, according to an embodiment of the present disclosure. As shown in fig. 8, according to some embodiments, the step S721 may include:
step S821, mapping the sample identification ID of each tag data item to part of the storage bits of a selection vector having the same number of storage bits as the sample identification ID encoding array; and
step S822, obtaining the sample identification ID set corresponding to the sample identification ID code from the sample identification ID code array by determining the intersection of the selection vector and the sample identification ID code array.
For example, the second participant may map the sample identification ID of each tag data item into part of the storage bits of a selection vector having the same number (e.g., 12) of storage bits as the sample identification ID encoding array (e.g., the sample identification ID encoding array 310 shown in fig. 3 or the sample identification ID encoding array 510 shown in fig. 5). The sample identification ID set corresponding to the sample identification ID code is then acquired from the sample identification ID code array by determining the intersection of the selection vector and the sample identification ID code array.
Therefore, the second party can be further ensured not to acquire the full amount of sample data of the first party, and the corresponding label data of the sample data is only inquired for the sample data corresponding to the sample identification ID coding array, so that the data security is further ensured. In addition, the intersection solving way obviously improves the correctness compared with the related art.
The above steps S821 and S822 will be further described with reference to fig. 9. FIG. 9 shows a schematic diagram of a portion of the process of a federated learning-based decision model training method 600, according to an embodiment of the present disclosure.
As shown in fig. 9, the second participant has received, for example, the sample identification ID code array 510 from the first participant. The second participant may map the sample identification ID (e.g., C to Z) of each tag data into a respective part of the storage bits of a selection vector 910 having the same number of storage bits as the sample identification ID code array 510.
Here, only C and Z are shown for the sake of brevity, and several sample identification IDs between C and Z are not shown.
In one example, the storage bits of the selection vector 910 may be binary bits. A plurality of different hash functions (the same hash functions as those used by the first participant) may be used to process the sample identification ID of each tag data to obtain a plurality of different hash values; the storage bits in the selection vector 910 pointed to by the hash values (e.g., the 1st, 3rd, and 7th storage bits corresponding to C) may be set to 1, and the remaining storage bits may be set to 0.
Further, the intersection of the selection vector 910 and the sample identification ID code array 510 may be determined, and the sample identification ID set 920 corresponding to the sample identification ID code may be obtained from the sample identification ID code array 510.
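For illustration only, the example of fig. 9 can be sketched as the following Bloom-filter-style membership test (in Python). The number of storage bits, the number of hash functions, and the hash construction are assumptions of this sketch; the only requirement stated above is that the second participant uses the same hash functions as the first participant.

import hashlib

NUM_BITS = 12    # same number of storage bits as the received sample identification ID code array
NUM_HASHES = 3   # same hash functions as those used by the first participant

def bit_positions(sample_id):
    # Derive NUM_HASHES storage-bit positions for one sample identification ID.
    return {int(hashlib.sha256(f"{i}:{sample_id}".encode()).hexdigest(), 16) % NUM_BITS
            for i in range(NUM_HASHES)}

def build_encoding_array(sample_ids):
    # First-participant side: set the storage bits pointed to by each held sample identification ID.
    array = [0] * NUM_BITS
    for sid in sample_ids:
        for pos in bit_positions(sid):
            array[pos] = 1
    return array

def recover_id_set(encoding_array, candidate_ids):
    # Second-participant side (steps S821/S822): keep a candidate ID only if every storage bit
    # of its selection vector is also set in the received encoding array (the intersection test).
    return {sid for sid in candidate_ids
            if all(encoding_array[pos] == 1 for pos in bit_positions(sid))}

# Hypothetical usage mirroring fig. 9.
array = build_encoding_array({"C", "F", "Z"})
print(recover_id_set(array, {"C", "D", "Z"}))  # typically {"C", "Z"}; Bloom-style coding may admit false positives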
According to some embodiments, the federated learning system may further include a third participant communicatively connected to the second participant, the third participant holding a plurality of sample data associated with respective ones of the plurality of tag data held by the second participant. The method 600 may further comprise:
receiving a splitting attribute corresponding to a first local splitting node from the first participant;
receiving a splitting attribute corresponding to a second local splitting node from the third participant;
selecting, from the first local splitting node and the second local splitting node, the node with the minimum splitting gain as a global splitting node, wherein the minimum splitting gain indicates that the heterogeneity of classification results expected to be obtained by the decision model after node splitting at the data packet is maximum; and
recording the splitting attribute of the global splitting node and the participant holding the global splitting node.
Thus, the splitting attribute of a global splitting node of the model and the participant holding that node are recorded at the second participant, while the holder of the sample data (i.e., the first participant) does not need to record the structure of the entire model (e.g., the structure of the decision tree). This simplifies the model structure in the federated learning system, reduces the amount of data communicated during training and during subsequent prediction with the model, and further relieves the pressure on computing resources.
According to some embodiments, the method 600 may further comprise: sending indication information to the first participant so that the first participant records the splitting attribute and the splitting threshold corresponding to the first local splitting node. Thus, the splitting attribute and the splitting threshold of a splitting node of the model are recorded at the first participant, while the holder of the tag data (i.e., the second participant) does not need to record them. This simplifies the model structure in the federated learning system, reduces the amount of data communicated during training and during subsequent prediction with the model, and further relieves the pressure on computing resources.
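For illustration only, the following sketch (in Python) shows one possible realization of the global-splitting-node selection described above. It assumes that each participant reports, together with its splitting attribute, the splitting gain of its local splitting node; the tuple layout and the example attribute names are hypothetical and not specified by the present disclosure.

def select_global_split_node(local_candidates):
    # Each candidate is assumed to be (participant, split_attribute, split_gain).
    participant, attribute, gain = min(local_candidates, key=lambda c: c[2])
    # Recorded at the second participant: the splitting attribute and the holder of the node.
    return {"holder": participant, "split_attribute": attribute}

print(select_global_split_node([("first participant", "age", 0.31),
                                ("third participant", "income", 0.27)]))
# -> {'holder': 'third participant', 'split_attribute': 'income'}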
According to an embodiment of the present disclosure, a decision model training apparatus 1000 based on federated learning is also provided. Fig. 10 shows a block diagram of a decision model training apparatus 1000 based on federated learning applied to a first participant in a federated learning system, in accordance with an embodiment of the present disclosure.
The apparatus 1000 is applied to a first participant holding a plurality of sample data in a federated learning system that further includes a second participant communicatively connected to the first participant.
According to some embodiments, each of the plurality of sample data held by the first participant comprises a sample identification ID and at least one attribute value, and the second participant holds a plurality of tag data respectively associated with respective ones of the plurality of sample data.
As shown in fig. 10, the apparatus 1000 includes: a sorting unit 1010, a grouping unit 1020, an encoding unit 1030, a code transmitting unit 1040, a tag value count receiving unit 1050, a splitting gain determination unit 1060, and a training unit 1070.
The sorting unit 1010 is configured to sort a part of sample data among the plurality of sample data based on one of at least one attribute value of the part of sample data; the grouping unit 1020 is configured to divide the sorted sample data into a plurality of data groups; for each data packet: the encoding unit 1030 is configured to generate a sample identification ID code based on a set of sample identification IDs of at least one sample data in the data packet; the code transmitting unit 1040 is configured to transmit the sample identification ID code to the second participant; the tag value count receiving unit 1050 is configured to receive, from the second participant, a count of each of at least one tag value corresponding to the data packet; and the splitting gain determination unit 1060 is configured to determine a splitting gain corresponding to the data packet based on the count, the splitting gain indicating heterogeneity of classification results expected to be obtained by the decision model after node splitting at the data packet. Furthermore, the training unit 1070 is configured to determine the local splitting nodes comprised by the decision model based on the splitting gain corresponding to each data packet, thereby training the decision model.
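For illustration only, the cooperation of these units for a single attribute can be sketched as follows (in Python). The Gini-style impurity used as the splitting gain is only one plausible heterogeneity measure and is an assumption of this sketch, as are the group size, the field names, and the placeholder query_counts, which stands in for encoding a group's sample identification IDs, sending the code to the second participant, and receiving the tag-value counts back.

def gini_from_counts(label_counts):
    # Illustrative splitting gain from the received counts: the proportion of each tag value,
    # combined into a Gini-style impurity (the exact formula is an assumption of this sketch).
    total = sum(label_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in label_counts.values())

def find_local_split(samples, attribute, group_size, query_counts):
    ordered = sorted(samples, key=lambda s: s[attribute])            # sorting unit 1010
    groups = [ordered[i:i + group_size]                              # grouping unit 1020
              for i in range(0, len(ordered), group_size)]
    best = None
    for group in groups:
        counts = query_counts({s["id"] for s in group})              # encoding unit 1030, code transmitting unit 1040, tag value count receiving unit 1050
        gain = gini_from_counts(counts)                              # splitting gain determination unit 1060
        if best is None or gain < best[0]:
            best = (gain, attribute, group)                          # training unit 1070 keeps the minimum-gain candidate
    return best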
According to an embodiment of the present disclosure, a decision model training apparatus 1100 based on federated learning is also provided. Fig. 11 shows a block diagram of a decision model training apparatus 1100 based on federated learning applied to a second participant in a federated learning system, according to an embodiment of the present disclosure.
The apparatus 1100 is applied to a second participant holding a plurality of tag data in a federated learning system that further includes a first participant communicatively connected to the second participant.
According to some embodiments, each of the plurality of tag data held by the second participant comprises a sample identification ID and a tag value, and the first participant holds a plurality of sample data, each associated with a respective one of the plurality of tag data.
As shown in fig. 11, the apparatus 1100 includes: an encoding receiving unit 1110, a tag value querying unit 1120, a tag value counting unit 1130, and a tag value counting transmitting unit 1140.
The code receiving unit 1110 is configured to receive a plurality of sample identification ID codes from the first participant, each of the plurality of sample identification ID codes being generated by the first participant based on a sample identification ID set formed by the sample identification IDs of at least one sample data, among the sample data held by the first participant, that is divided into a corresponding one of the data packets. For each sample identification ID code: the tag value querying unit 1120 is configured to query, from the plurality of tag data, the tag value of at least one tag data associated with the sample identification ID code; the tag value counting unit 1130 is configured to determine, based on the tag value of the associated at least one tag data, a count of each of at least one tag value corresponding to the respective data packet; and the tag value count sending unit 1140 is configured to send the count to the first participant, so that the first participant determines, based on the count, the local splitting node included in the decision model, thereby training the decision model.
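For illustration only, and reusing the recover_id_set and count_tag_values helpers from the sketches above, the cooperation of these units for one received sample identification ID code might look as follows; the send_counts callable is a hypothetical stand-in for the tag value count sending unit 1140.

def handle_sample_id_code(id_code_array, tag_table, send_counts):
    candidate_ids = set(tag_table)                              # sample identification IDs held by the second participant
    id_set = recover_id_set(id_code_array, candidate_ids)       # tag value querying unit 1120 (selection vector + intersection)
    counts = count_tag_values(id_set, tag_table)                 # tag value counting unit 1130
    send_counts(counts)                                          # tag value count sending unit 1140
    return counts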
According to another aspect of the present disclosure, a federated learning system is provided, which includes the federated learning-based decision model training apparatus 1000 described above, applied to a first participant holding a plurality of sample data in the federated learning system, and the federated learning-based decision model training apparatus 1100 described above, applied to a second participant holding a plurality of tag data in the federated learning system.
The decision model training method according to an embodiment of the present disclosure will be further described below in conjunction with fig. 12.
FIG. 12 illustrates a process diagram for model training using a federated learning-based decision model training method in accordance with an embodiment of the present disclosure.
As shown in fig. 12, a first participant 1210 in the federated learning system holds a plurality of sample data, and a second participant 1220 holds a plurality of tag data.
In step S1201, the first participant 1210 sorts partial sample data among the plurality of sample data based on one of at least one attribute value of the partial sample data.
At step S1202, the first participant 1210 divides the sorted sample data into a plurality of data packets.
In step S1203, the first participant 1210 generates a sample identification ID code based on a sample identification ID set formed by the sample identification IDs of at least one sample data in the data packet.
In step S1204, the first party 1210 sends the sample identification ID code to the second party 1220.
In step S1205, the second participant 1220 queries, from the plurality of tag data, the tag value of at least one tag data associated with the sample identification ID code.
In step S1206, the second participant 1220 determines, based on the tag value of the associated at least one tag data, a count of each of the at least one tag value corresponding to the respective data packet.
In step S1207, the second participant 1220 sends the count to the first participant 1210.
In step S1208, the first participant 1210 determines, based on the count, the splitting gain corresponding to the data packet.
In step S1209, the first participant 1210 determines local splitting nodes included in the decision model based on the splitting gain corresponding to each data packet, so as to train the decision model.
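As a purely illustrative numeric example of steps S1206 to S1208 (the exact gain formula is not fixed here; a Gini-style impurity computed from the proportion of each tag value is assumed only for concreteness): if the count received for one data packet is 8 samples with tag value 1 and 2 samples with tag value 0, the proportions are 0.8 and 0.2, and the corresponding value would be 1 - (0.8² + 0.2²) = 0.32; a data packet whose counts are 5 and 5 would yield 1 - (0.5² + 0.5²) = 0.5. Under the minimum-splitting-gain rule described above, the first data packet would then be the preferred splitting position.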
According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.
Referring to fig. 13, a block diagram of an electronic device 1300, which may be a server or a client of the present disclosure and which is an example of a hardware device to which aspects of the present disclosure may be applied, will now be described. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 13, the device 1300 includes a computing unit 1301 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1302 or a computer program loaded from a storage unit 1308 into a random access memory (RAM) 1303. In the RAM 1303, various programs and data necessary for the operation of the device 1300 can also be stored. The computing unit 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.
A number of components in the device 1300 are connected to the I/O interface 1305, including: an input unit 1306, an output unit 1307, the storage unit 1308, and a communication unit 1309. The input unit 1306 may be any type of device capable of inputting information to the device 1300; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 1307 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 1308 may include, but is not limited to, a magnetic disk or an optical disk. The communication unit 1309 allows the device 1300 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 1301 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1301 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1301 performs the respective methods and processes described above. For example, in some embodiments, the methods described in embodiments of the present disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1308. In some embodiments, some or all of the computer program may be loaded onto and/or installed onto the device 1300 via the ROM 1302 and/or the communication unit 1309. When the computer program is loaded into the RAM 1303 and executed by the computing unit 1301, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the computing unit 1301 may be configured in any other suitable manner (e.g., by means of firmware) to perform the methods according to embodiments of the present disclosure.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Notably, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (19)

1. A decision model training method based on federated learning, applied to a first participant holding a plurality of sample data in a federated learning system, wherein the federated learning system further comprises a second participant communicatively connected to the first participant, the method comprising:
sorting a part of sample data in the plurality of sample data based on one attribute value of at least one attribute value of the part of sample data;
partitioning the sorted sample data into a plurality of data packets;
for each data packet:
generating a sample identification ID code based on a sample identification ID set formed by the sample identification IDs of at least one sample data in the data packet;
sending the sample identification ID code to the second participant;
receiving, from the second participant, a count of each of the at least one tag value to which the data packet corresponds; and
determining, based on the count, a splitting gain corresponding to the data packet, the splitting gain indicating heterogeneity of classification results expected to be obtained by the decision model after node splitting at the data packet; and
determining local splitting nodes included in the decision model based on the splitting gain corresponding to each data packet, so as to train the decision model.
2. The method of claim 1, wherein the sample identification ID code comprises an array of sample identification ID codes having a plurality of storage bits, and wherein said generating a sample identification ID code based on a set of sample identification IDs of at least one sample data in the data packet comprises:
for each sample data in the data packet:
mapping the sample identification ID of the sample data into respective partial storage bits among the plurality of storage bits according to a first mapping rule.
3. The method of claim 2, wherein said mapping the sample identification ID of the sample data to a respective portion of the plurality of storage bits according to a first mapping rule comprises:
dividing the sample identification ID of the sample data into a plurality of sub-portions;
encrypting each sub-portion to obtain an encrypted string; and
mapping the encrypted string corresponding to each sub-portion into a respective part of the storage bits.
4. The method of claim 1, wherein the determining, based on the count, a split gain corresponding to the data packet comprises:
determining a respective proportion of the number of each type of tag value in the at least one tag value based on the count; and
determining the splitting gain based on the proportion corresponding to each tag value.
5. The method of any of claims 1 to 4, wherein the determining the local splitting node included in the decision model based on the splitting gain corresponding to each data packet comprises:
selecting, as the local splitting node, the splitting node with the minimum splitting gain from among the splitting nodes corresponding to the data packets, wherein the minimum splitting gain indicates that the heterogeneity of the classification result expected to be obtained by the decision model after node splitting at the data packet is maximum.
6. The method of claim 5, further comprising:
sending the splitting attribute corresponding to the local splitting node to the second participant; and
in response to receiving indication information from the second participant, recording the splitting attribute and a splitting threshold corresponding to the local splitting node.
7. The method of any of claims 1-4, wherein the first participant comprises a plurality of computing units, and the method further comprises:
dividing the plurality of sample data into a plurality of data sets, wherein the sample data with the same attribute category is divided into the same data set, and each data set comprises the sample data with at least one attribute category, wherein the sample data with the same attribute category in any one data set is the partial sample data in the plurality of sample data; and
assigning each data set to a respective one of the plurality of computational units such that each computational unit trains the decision model in parallel.
8. A decision model training method based on federated learning, applied to a second participant holding a plurality of tag data in a federated learning system, wherein the federated learning system further comprises a first participant communicatively connected to the second participant, the method comprising:
receiving a plurality of sample identification ID codes from the first participant, each of the plurality of sample identification ID codes being generated by the first participant based on a sample identification ID set formed by the sample identification IDs of at least one sample data, among the sample data held by the first participant, that is divided into a corresponding one of the data packets; and
ID code for each sample identification:
querying, from the plurality of tag data, the tag value of at least one tag data associated with the sample identification ID code;
determining, based on the tag values of the associated at least one tag data, a count for each of at least one tag value corresponding to the respective data packet; and
sending the count to the first participant to cause the first participant to determine a local split node included in the decision model based on the count to train the decision model.
9. The method of claim 8, wherein the sample identification ID code comprises a sample identification ID code array having a plurality of storage bits, and some of the plurality of storage bits have stored therein sample identification IDs of sample data mapped thereto by the first participant according to a first mapping rule, and wherein querying the plurality of tag data for tag values of at least one tag data associated with the sample identification ID code comprises:
analyzing the sample identification ID code to obtain a sample identification ID set corresponding to the sample identification ID code according to a second mapping rule associated with the first mapping rule; and
querying, from the plurality of tag data, the tag value of the tag data associated with each sample identification ID in the sample identification ID set.
10. The method of claim 9, wherein some of the plurality of storage bits have stored therein an encryption string mapped thereto by the first party, the encryption string being encrypted for a plurality of sub-portions into which sample identification IDs of sample data are partitioned.
11. The method of claim 9 or 10, wherein said parsing the sample identification ID code to obtain a sample identification ID set corresponding to the sample identification ID code according to the second mapping rule associated with the first mapping rule comprises:
mapping the sample identification ID of each tag data into a respective part of the storage bits of a selection vector having the same number of storage bits as the sample identification ID code array; and
acquiring the sample identification ID set corresponding to the sample identification ID code from the sample identification ID code array by determining the intersection of the selection vector and the sample identification ID code array.
12. The method of any of claims 8 to 10, wherein the federated learning system further comprises a third party communicatively connected to the second party, the third party holding a plurality of sample data associated with respective ones of a plurality of tag data held by the second party, the method further comprising:
receiving a splitting attribute corresponding to a first local splitting node from the first participant;
receiving a splitting attribute corresponding to a second local splitting node from the third participant;
selecting, from the first local splitting node and the second local splitting node, the node with the minimum splitting gain as a global splitting node, wherein the minimum splitting gain indicates that the heterogeneity of classification results expected to be obtained by the decision model after node splitting at the data packet is maximum; and
recording the splitting attribute of the global splitting node and the participant holding the global splitting node.
13. The method of claim 12, further comprising:
sending indication information to the first participant so that the first participant records the splitting attribute and a splitting threshold corresponding to the first local splitting node.
14. A decision model training apparatus based on federated learning, applied to a first participant holding a plurality of sample data in a federated learning system, wherein the federated learning system further comprises a second participant communicatively connected to the first participant, the apparatus comprising: a sorting unit, a grouping unit, an encoding unit, a code transmitting unit, a tag value count receiving unit, a splitting gain determination unit, and a training unit, wherein
the sorting unit is configured to sort a part of sample data in the plurality of sample data based on one attribute value of at least one attribute value of the part of sample data;
the grouping unit is configured to divide the sorted sample data into a plurality of data packets;
for each data packet:
the encoding unit is configured to generate a sample identification ID code based on a sample identification ID set consisting of sample identification IDs of at least one sample data in the data packet;
the code transmitting unit is configured to transmit the sample identification ID code to the second participant;
the tag value count receiving unit is configured to receive, from the second participant, a count of each of at least one tag value corresponding to the data packet; and
the splitting gain determination unit is configured to determine, based on the count, a splitting gain corresponding to the data packet, the splitting gain indicating heterogeneity of classification results expected to be obtained by the decision model after node splitting at the data packet; and
The training unit is configured to determine local splitting nodes comprised by the decision model based on the splitting gain corresponding to each data packet, thereby training the decision model.
15. A decision model training apparatus based on federated learning, applied to a second participant holding a plurality of tag data in a federated learning system, wherein the federated learning system further comprises a first participant communicatively connected to the second participant, the apparatus comprising: a code receiving unit, a tag value querying unit, a tag value counting unit, and a tag value count sending unit, wherein
the code receiving unit is configured to receive a plurality of sample identification ID codes from the first participant, each of the plurality of sample identification ID codes being generated by the first participant based on a sample identification ID set formed by the sample identification IDs of at least one sample data, among the sample data held by the first participant, that is divided into a corresponding one of the data packets;
ID code for each sample identification:
the tag value querying unit is configured to query, from the plurality of tag data, the tag value of at least one tag data associated with the sample identification ID code;
the tag value counting unit is configured to determine, based on the tag value of the associated at least one tag data, a count of each of at least one tag value corresponding to the respective data packet; and
The tag value count sending unit is configured to send the count to the first participant to cause the first participant to determine a local split node included in the decision model based on the count, thereby training the decision model.
16. A federated learning system, comprising:
the federated learning-based decision model training apparatus of claim 14; and
the federated learning-based decision model training apparatus of claim 15.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-13.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-13.
19. A computer program product comprising a computer program, wherein the computer program realizes the method of any one of claims 1-13 when executed by a processor.
CN202210493876.9A 2022-04-28 2022-04-28 Federal learning-based decision model training method and device and federal learning system Pending CN114897067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210493876.9A CN114897067A (en) 2022-04-28 2022-04-28 Federal learning-based decision model training method and device and federal learning system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210493876.9A CN114897067A (en) 2022-04-28 2022-04-28 Federal learning-based decision model training method and device and federal learning system

Publications (1)

Publication Number Publication Date
CN114897067A true CN114897067A (en) 2022-08-12

Family

ID=82721456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210493876.9A Pending CN114897067A (en) 2022-04-28 2022-04-28 Federal learning-based decision model training method and device and federal learning system

Country Status (1)

Country Link
CN (1) CN114897067A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024094058A1 (en) * 2022-11-03 2024-05-10 Huawei Technologies Co., Ltd. Model training method and related apparatus

Similar Documents

Publication Publication Date Title
WO2022262183A1 (en) Federated computing processing method and apparatus, electronic device, and storage medium
WO2023124029A1 (en) Deep learning model training method and apparatus, and content recommendation method and apparatus
CN113407850B (en) Method and device for determining and acquiring virtual image and electronic equipment
WO2023103390A1 (en) Task processing method, task processing apparatus, electronic device and storage medium
CN112836072B (en) Information display method and device, electronic equipment and medium
CN112532748B (en) Message pushing method, device, equipment, medium and computer program product
EP4170498A2 (en) Federated learning method and apparatus, device and medium
WO2023245938A1 (en) Object recommendation method and apparatus
WO2023142406A1 (en) Ranking method and apparatus, ranking model training method and apparatus, and electronic device and medium
CN114065864A (en) Federal learning method, federal learning device, electronic device, and storage medium
WO2021068493A1 (en) Method and apparatus for processing information
CN115883187A (en) Method, device, equipment and medium for identifying abnormal information in network traffic data
CN114897067A (en) Federal learning-based decision model training method and device and federal learning system
CN110059172B (en) Method and device for recommending answers based on natural language understanding
CN115082740A (en) Target detection model training method, target detection method, device and electronic equipment
WO2024027125A1 (en) Object recommendation method and apparatus, electronic device, and storage medium
WO2023174189A1 (en) Method and apparatus for classifying nodes of graph network model, and device and storage medium
WO2023240833A1 (en) Information recommendation method and apparatus, electronic device, and medium
JP2023554210A (en) Sort model training method and apparatus for intelligent recommendation, intelligent recommendation method and apparatus, electronic equipment, storage medium, and computer program
CN115809364B (en) Object recommendation method and model training method
EP4109357A2 (en) Model training method, apparatus and storage medium
US20180367492A1 (en) Providing notification based on dynamic group
CN111432080A (en) Ticket data processing method, electronic equipment and computer readable storage medium
US20230097986A1 (en) Data processing method
EP4155983A2 (en) Data processing method, apparatus, storage medium, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination