CN113051239A - Data sharing method, use method of model applying data sharing method and related equipment - Google Patents

Data sharing method, use method of model applying data sharing method and related equipment

Info

Publication number
CN113051239A
CN113051239A (application CN202110324495.3A)
Authority
CN
China
Prior art keywords
model
party
data
participant
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110324495.3A
Other languages
Chinese (zh)
Inventor
杨恺
王虎
韩雨锦
刘佳豪
黄志翔
彭南博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202110324495.3A
Publication of CN113051239A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/176: Support for shared access to files; File sharing support
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218: Protecting access to data via a platform, to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245: Protecting personal data, e.g. for financial or medical purposes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06N20/20: Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure provide a data sharing method, a method and an apparatus for using a model applying the data sharing method, a computer-readable storage medium and an electronic device, belonging to the technical field of computers and communication. The method comprises the following steps: the computer aligns data samples of the first and second participants based on a privacy intersection technique; the computer builds a sub-model of the first participant using the label and the data of the first participant; the computer constructs a sub-model of the second participant according to the label and the data of the second participant; and the computer obtains a global model according to the sub-model of the first participant and the sub-model of the second participant, so as to realize sharing of the data of the first participant and the data of the second participant. The disclosed method can realize training of a global model based on federated learning.

Description

Data sharing method, use method of model applying data sharing method and related equipment
Technical Field
The present disclosure relates to the field of computer and communication technologies, and in particular, to a data sharing method, a method and an apparatus for using a model applying the data sharing method, a computer-readable storage medium, and an electronic device.
Background
Data-driven intelligent technologies are being developed and applied at an increasing pace. As their industry coverage broadens and their integration with industry deepens, many new requirements are placed on technologies such as big data and artificial intelligence. One important aspect is data privacy and security. As the value of data becomes ever more prominent, data owners have come to appreciate the importance of protecting data privacy and security, and governments around the world are likewise enacting laws and regulations to protect data security. China also attaches great importance to data, and the relevant authorities have listed data as a factor of production. Different data owners want to collaborate to realize the value of their data, but are unwilling or unable to share the data itself, which creates the data island problem. To solve this problem, the concept of federated learning was proposed: multiple participants share no raw data, exchange only intermediate results, and ensure that the data cannot be inferred, while achieving the goal of jointly training a model. Ensuring that the data cannot be reversely inferred generally means guaranteeing, through appropriate security techniques, that no party can obtain the data information of any other party.
According to how the data held by the different participants is partitioned, federated learning can be classified into horizontal federated learning, longitudinal (vertical) federated learning, federated transfer learning, and the like. Horizontal federated learning partitions by samples and, thanks to the samples' good properties, admits relatively general federated learning schemes. Longitudinal federated learning, which is harder and has found many applications in industry, is characterized by different data holders possessing different feature dimensions of the same samples (often the same customers), owing to the different businesses of different companies.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The embodiments of the present disclosure provide a data sharing method, a method and an apparatus for using a model applying the data sharing method, a computer-readable storage medium and an electronic device, which can realize training of a global model based on federated learning.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to one aspect of the disclosure, a method for sharing data based on federal learning is provided, which includes:
the computer aligning data samples of the first and second parties based on a privacy intersection technique;
the computer building a sub-model of a first participant using the tags and the data of the first participant;
the computer constructs a sub-model of the second party according to the label and the data of the second party;
and the computer acquires a global model according to the submodel of the first party and the submodel of the second party so as to realize the sharing of the data of the first party and the data of the second party.
In one embodiment, the computer building a sub-model of the first party using the tag and the data of the first party comprises:
the computer constructs, at the first participant, a gradient boosting tree sub-model of the first participant using the tags and data of the first participant through an extreme gradient boosting tree (XGBoost) or a light gradient boosting machine (LightGBM) model.
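As a concrete illustration of this embodiment, the following minimal Python sketch trains one participant's local sub-model with the open-source xgboost package. It is a sketch under assumptions, not the reference implementation of this disclosure; the names X_party and y are placeholders.

```python
# Minimal sketch: each participant trains a gradient boosting tree sub-model
# on its own local feature columns only. Variable names are illustrative.
import xgboost as xgb

def train_local_submodel(X_party, y, n_trees=50, max_depth=4):
    """Train one participant's sub-model on purely local features."""
    model = xgb.XGBClassifier(
        n_estimators=n_trees,
        max_depth=max_depth,
        learning_rate=0.1,
        objective="binary:logistic",
    )
    model.fit(X_party, y)  # y is the label held by the Guest party
    return model

# Guest side: label and local features are both available in plaintext,
# so no encryption is needed for this sub-model:
# guest_model = train_local_submodel(X_guest, y)
```

For a Host party, the same construction runs under homomorphic encryption, as described in connection with fig. 6 below.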
In one embodiment, the computer obtaining a global model from the sub-model of the first party and the sub-model of the second party comprises:
and the computer acquires a fusion model and a cross model according to the submodel of the first party and the submodel of the second party.
In one embodiment, the first participant's sub-model comprises a model score and a predicted outcome and the second participant's sub-model comprises a model score and a predicted outcome, wherein the computer obtaining a fusion model and a cross model from the first participant's sub-model and the second participant's sub-model comprises:
the computer obtains the fusion model from the label, the model score of the sub-model of the first party, and the model score of the sub-model of the second party.
In one embodiment, the computer obtaining the fusion model from the tag, the model score for the sub-model of the first participant, and the model score for the sub-model of the second participant comprises:
the computer obtains a fusion model from the label, the model score of the sub-model of the first party, and the model score of the sub-model of the second party using a logistic regression or gradient boosting tree model.
In one embodiment, the computer obtaining a fusion model and a cross model from the sub-model of the first party and the sub-model of the second party comprises:
and the computer obtains the cross model according to the label, the output of the fusion model, the prediction result of the sub model of the first party and the prediction result of the sub model of the second party.
According to one aspect of the disclosure, a method for using a global model based on federal learning is provided, which is characterized by comprising:
the computer acquires a prediction request of a first participant;
the computer aligns data of the prediction request based on a privacy intersection technology;
the computer predicts the data of the prediction request by using the submodel of each participant;
the computer predicts the prediction result of the sub-model of each participant by using a global model to obtain the prediction result of the prediction request of the first participant so as to realize the sharing of the data of the first participant and the data of the second participant.
According to an aspect of the present disclosure, there is provided a data sharing apparatus based on federal learning, including:
an obtaining module configured to obtain a prediction request of a first participant;
an alignment module configured to align data of the prediction request based on a privacy intersection technique;
a sub-prediction module configured to predict data of the prediction request using sub-models of respective participants;
the main prediction module is configured to predict the prediction result of the sub-model of each participant by using a global model to obtain the prediction result of the prediction request of the first participant, so as to realize the sharing of the data of the first participant and the data of the second participant.
According to an aspect of the present disclosure, there is provided an electronic device including:
one or more processors;
a storage device configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method of any of the above embodiments.
According to an aspect of the present disclosure, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of the above embodiments.
In the technical scheme provided by some embodiments of the disclosure, the global model training based on the federal learning can be realized.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The following figures depict certain illustrative embodiments of the invention in which like reference numerals refer to like elements. These described embodiments are to be considered as exemplary embodiments of the disclosure and not limiting in any way.
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which the federated learning-based data sharing approach of embodiments of the present disclosure may be applied;
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use with the electronic device implementing embodiments of the present disclosure;
FIG. 3 schematically illustrates a related art longitudinal federated tree model framework diagram;
FIG. 4 schematically illustrates a flow chart of a federated learning-based data sharing method in accordance with an embodiment of the present disclosure;
FIG. 5 illustrates a conceptual overall framework diagram of one embodiment of the present disclosure;
FIG. 6 illustrates a sub-model training framework diagram of one embodiment of the present disclosure;
FIG. 7 illustrates a sub-model fusion framework diagram of one embodiment of the present disclosure;
fig. 8 schematically illustrates a block diagram of a federated learning-based data sharing apparatus in accordance with an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which the federated learning-based data sharing method of embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The staff member may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having display screens including, but not limited to, smart phones, tablets, portable and desktop computers, digital cinema projectors, and the like.
The server 105 may be a server that provides various services. For example, a staff member uses the terminal device 103 (which may also be the terminal device 101 or 102) to send a federated-learning-based data sharing request to the server 105. The server 105 may align data samples of the first and second participants based on a privacy intersection technique; construct a sub-model of the first participant using the label and data of the first participant; construct a sub-model of the second participant according to the label and data of the second participant; and obtain a global model according to the sub-model of the first participant and the sub-model of the second participant, so as to realize sharing of the data of the first participant and the data of the second participant. The server 105 may display the trained federated-learning-based global model on the terminal device 103, and the staff member may view it based on the content displayed there.
As another example, the terminal device 103 (which may also be the terminal device 101 or 102) may be a smart TV, a VR (Virtual Reality)/AR (Augmented Reality) head-mounted display, or a mobile terminal such as a smartphone or tablet computer on which navigation, ride-hailing, instant messaging or video applications (APPs) are installed, and a worker may send a federated-learning-based data sharing request to the server 105 through any of these. The server 105 can obtain the federated-learning-based global model in response to the request and return it to the smart TV, the VR/AR head-mounted display, or the navigation, ride-hailing, instant messaging or video APP, through which the global model is then displayed.
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present disclosure.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU)201 that can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data necessary for system operation are also stored. The CPU 201, ROM 202, and RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to bus 204.
The following components are connected to the I/O interface 205: an input portion 206 including a keyboard, a mouse, and the like; an output section 207 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 208 including a hard disk and the like; and a communication section 209 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 209 performs communication processing via a network such as the internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 210 as necessary, so that a computer program read out therefrom is installed into the storage section 208 as necessary.
In particular, the processes described below with reference to the flowcharts may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 209 and/or installed from the removable medium 211. The computer program, when executed by a Central Processing Unit (CPU)201, performs various functions defined in the methods and/or apparatus of the present application.
It should be noted that the computer readable storage medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM) or flash Memory), an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable storage medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF (Radio Frequency), etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, apparatus, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and/or units and/or sub-units described in the embodiments of the present disclosure may be implemented by software, or may be implemented by hardware, and the described modules and/or units and/or sub-units may also be disposed in a processor. Wherein the names of such modules and/or units and/or sub-units in some cases do not constitute a limitation on the modules and/or units and/or sub-units themselves.
As another aspect, the present application also provides a computer-readable storage medium, which may be contained in the electronic device described in the above embodiment; or may exist separately without being assembled into the electronic device. The computer-readable storage medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the embodiments below. For example, the electronic device may implement the steps of fig. 4.
In the related art, for example, a global model for federal learning may be trained by using a machine learning method, a deep learning method, or the like, and the range of application is different for different methods.
Longitudinal federated learning is most widely applied and developed in risk control scenarios. Its characteristic is that the label is held by only one party (hereinafter the Guest party), while the other parties hold only partial features of the data (hereinafter the Host parties). The Guest party hopes to improve the model's effectiveness through cooperation with the Host parties, so as to reduce risk. In this process, both the Guest party and the Host parties must ensure the security of their own data.
In risk control scenarios, tree models are a very important method because of their good interpretability. Among tree models, the gradient boosting tree model (GBM) effectively overcomes the insufficient performance of a single decision tree by integrating the strong learning capacity of many trees; it can handle both classification and regression problems and is widely applied in risk control scenarios.
For longitudinal federated gradient boosting tree model training, SecureBoost (secure boosting tree) provides a solution. The method comprises two stages: first, encrypted alignment of the data samples, and second, encrypted model training. The related-art longitudinal federated gradient boosting tree training procedure is shown in fig. 3.
Fig. 3 schematically illustrates a related art longitudinal federal tree model framework diagram.
Referring to fig. 3:
First stage: encrypted sample alignment
Because the training algorithm needs each participant to match up the features belonging to the same piece of data, and in order to protect data privacy and security, the framework aligns the data samples based on the privacy intersection (private set intersection, PSI) technique.
Second stage: encrypted model training
In the second stage, federated modeling is performed with each participant's data on the basis of the aligned samples. The intermediate data is protected with additively homomorphic encryption (such as Paillier) to secure the intermediate values and prevent leakage of private data. Using $[\![\cdot]\!]$ to represent the homomorphic encryption operation, additive homomorphic encryption enables addition of ciphertexts and multiplication of a ciphertext by a plaintext, namely:

$[\![u]\!] + [\![v]\!] = [\![u+v]\!]$ and $v \cdot [\![u]\!] = [\![vu]\!]$
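For concreteness, these additive homomorphic properties can be demonstrated with the open-source python-paillier library (phe); this library is one possible implementation used here purely for illustration and is not specified by this disclosure.

```python
# Demonstration of the additive homomorphic properties stated above,
# using the python-paillier ("phe") implementation of Paillier encryption.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

u, v = 3.5, 2.0
enc_u = public_key.encrypt(u)
enc_v = public_key.encrypt(v)

enc_sum = enc_u + enc_v   # [[u]] + [[v]] = [[u + v]]
enc_prod = enc_u * v      # v * [[u]] = [[v * u]] (scalar must be plaintext)

assert abs(private_key.decrypt(enc_sum) - (u + v)) < 1e-9
assert abs(private_key.decrypt(enc_prod) - (u * v)) < 1e-9
```

As in Step 0 below, the party that holds the label generates the key pair and distributes only the public key, so the other parties can compute on ciphertexts without being able to decrypt them.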
The keys to building the gradient boosting tree model are: 1) the construction of the $t$-th tree is learned with the label residual of round $t-1$ as its target; 2) within the construction of each tree, determining how each node is split and whether it is split at all; this is realized, through multi-party cooperation and information exchange, by computing the score of each feature after splitting at each threshold and splitting on the result with the maximum score.
Note that the existing data are: the Guest party's (participant A's) data $X_0$ and label $y$, and the data $X_1, \ldots, X_P$ of the $P$ Host parties (participants B, C, etc.), where the data of the $p$-th participant has dimension $d_p$. The modeling stage comprises the following specific steps:
step 0: and the Guest party generates a homomorphic encrypted secret key and a public key, and sends the public key to each Host party.
Step 1: for each characteristic dimension k of the N pieces of data of the p-th party, 1pAnd performing binning to obtain L quantile points S of the characteristicsk={sk1,...,skLAs a threshold candidate to be split.
The $t$-th tree ($t = 1, 2, \ldots, M$) is constructed by performing the following iteration:
step 2: guest's method calculates the predicted value after each t-1 iterations
Figure BDA0002994053340000101
First derivative g of (initially y)nAnd second derivative hnAnd after homomorphic encryption, transmitting the encrypted information to each Host party.
Step 3: for each leaf node of the current tree, each participant p needs to perform the following operations to determine whether to split or how to split, and the specific process is as follows:
Step 3.1: According to the $L$ threshold values $S_k$, divide the data samples whose ids are in the set $I$ into $L-1$ bin intervals;
Step 3.2: Based on the bins obtained in Step 3.1, if the $p$-th party is a Host party, it computes the ciphertext sums of the first derivatives and of the second derivatives within each bin using the public key and sends them to the Guest party for decryption; if it is the Guest party, it computes the sums directly in plaintext.
Step3.3: and the Guest party calculates the score of each splitting of each feature according to all the obtained first-order derivatives and second-order derivative aggregation values, and does not split if the maximum score is smaller than the threshold value gamma. Otherwise, splitting the current node according to the corresponding attribute and the threshold value to obtain two corresponding leaf nodes and respective sample id sets ILAnd IR
Step 4: for each leaf node of the t-th tree, the above Step3 is executed in a loop until all leaf nodes can not be split any more or the depth of the tree reaches the set maximum depth, and the optimal weight w of each leaf node is calculated by the Guest partyjThen the construction of the t-th tree is completed.
Step 5: updating the current prediction result of each sample according to the t trees constructed in the previous step
Figure BDA0002994053340000102
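For reference, in the standard XGBoost formulation that SecureBoost follows, the split score of Step 3.3 and the leaf weight of Step 4 are computed from the aggregated derivative sums as follows (reconstructed from the XGBoost literature, not quoted from this text):

$$\text{score} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma, \qquad w_j = -\frac{G_j}{H_j+\lambda},$$

where $G_\bullet = \sum_{n \in I_\bullet} g_n$, $H_\bullet = \sum_{n \in I_\bullet} h_n$ and $\lambda$ is the leaf regularization weight.

The Host-side half of Step 3.2, summing encrypted derivatives per bin, can be sketched as follows; the phe ciphertext objects and all function and variable names are assumptions for illustration:

```python
# Hypothetical sketch of Step 3.2 on a Host party: per-bin ciphertext sums of
# the first and second derivatives. enc_g and enc_h are lists of phe
# EncryptedNumber objects received from the Guest; bin_ids maps each sample
# to its bin index.
from collections import defaultdict

def aggregate_encrypted_derivatives(enc_g, enc_h, bin_ids):
    """Return {bin: ([[sum of g]], [[sum of h]])}, computed on ciphertexts."""
    sums = defaultdict(lambda: [None, None])
    for g_ct, h_ct, b in zip(enc_g, enc_h, bin_ids):
        sums[b][0] = g_ct if sums[b][0] is None else sums[b][0] + g_ct
        sums[b][1] = h_ct if sums[b][1] is None else sums[b][1] + h_ct
    return {b: tuple(v) for b, v in sums.items()}

# The Host sends these ciphertext sums to the Guest, which decrypts them and
# evaluates the split score; the Host never sees any g_n or h_n in plaintext.
```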
Therefore, in the modeling process, neither the Guest party nor the Host parties can obtain the other parties' data information, which guarantees the data security of every participant.
The disadvantages of the above related-art federated gradient boosting tree method are:
1. Constructing each node of the tree model requires information exchange between the Guest party and the Host parties, and the required amounts of computation and transmitted data are huge, so the algorithm is inefficient. Different participants often communicate over the public network across large geographical distances, which makes frequent data transmission very costly.
2. The construction of each node of the tree model and of each tree depends strongly on all participants and cannot be parallelized, so the operating efficiency of the algorithm is low.
Fig. 4 schematically shows a flowchart of a federated learning-based data sharing method according to an embodiment of the present disclosure. The method steps of the embodiment of the present disclosure may be executed by a computer, wherein the computer may be a terminal device, a server or a server cluster composed of a plurality of servers.
The method steps of the embodiments of the present disclosure may also be executed by the terminal device, the server, or the terminal device and the server interactively, for example, the server 105 in fig. 1 described above, but the present disclosure is not limited thereto.
In step S410, the computer aligns data samples of the first and second parties based on a privacy intersection technique.

In this step, the computer aligns the data samples of the first party and the second party based on the privacy intersection technique.
FIG. 5 illustrates a conceptual overall framework diagram of one embodiment of the present disclosure. Referring to fig. 5, the number of participants is 3, but the present disclosure is not limited thereto, and the number of participants may be 2 or more. Referring to fig. 5, the first party is, for example, party a and the second party is, for example, party B or party C. The "small number of interactions" in fig. 5, for example, correspond to the contents of Step2 and Step3.2 above.
This step corresponds to the encrypted sample alignment in fig. 5. Because the training algorithm requires each participant to match up the features belonging to the same piece of data, and in order to protect data privacy and security, the framework aligns the data samples based on the privacy intersection technique.
In the embodiments of the present disclosure, the terminal device may be implemented in various forms. For example, the terminal described in the present disclosure may include mobile terminals such as a mobile phone, a tablet, a notebook, a palmtop, a Personal Digital Assistant (PDA), a Portable Media Player (PMP), a modeling apparatus of a global model of federal learning, a wearable device, a smart band, a pedometer, a robot, an unmanned vehicle, and the like, and fixed terminals such as a digital TV (television), a desktop computer, and the like.
In step S420, the computer constructs a sub-model of the first participant using the tag and the data of the first participant.
In this step, the computer builds a sub-model of the first participant using the label and the data of the first participant. In one embodiment, this comprises: the first participant builds its gradient boosting tree sub-model from its label and data by means of an extreme gradient boosting tree (XGBoost) or a light gradient boosting machine (LightGBM). In one embodiment, the label may be the objective or target of the analysis the established global model is to perform.
FIG. 6 illustrates a sub-model training framework diagram of one embodiment of the present disclosure. Referring to fig. 5 and 6:
First, a gradient boosting tree sub-model (the sub-model of the first participant) is constructed from the label and data of the Guest party (the first participant, participant A) using methods such as XGBoost (eXtreme Gradient Boosting) or LightGBM (Light Gradient Boosting Machine). In the following, XGBoost is taken as an example, and the main steps for constructing the $t$-th tree ($t = 1, 2, \ldots, M$) are given:
Step 1.1: for each feature dimension k of all data, 1k={sk1,...,skLAs a threshold candidate to be split.
Step 1.2: for each leaf node of the current tree, the following operations are performed to determine whether the current tree needs to be split or not and how to split, and the sample set to be split is denoted as I. The initial node is the root node and the sample set is all samples. The specific process is as follows:
According to the thresholds in $S_k$, divide the data samples whose ids are in the set $I$ into the different bin intervals;
Based on the obtained bins, the Guest party calculates the aggregate first- and second-derivative values and then the score of splitting each feature at each threshold. If the maximum score is still smaller than the threshold $\gamma$, the node is not split. Otherwise, the current node is split according to the corresponding attribute and threshold, yielding two leaf nodes and their respective sample id sets $I_L$ and $I_R$.
Step 1.3: and (3) circularly executing the step 1.2 for each leaf node of the t-th tree until all the leaf nodes can not be split any more or the depth of the tree reaches the set maximum depth, calculating the optimal weight of each leaf node, and completing the construction of the t-th tree of the Guest square sub-model.
In step S430, the computer constructs a sub-model of the second party from the tag and data of the second party.
In this step, the computer constructs a sub-model of the second party from the tag and data of the second party.
Referring to fig. 5 and 6:
For each Host party (a second participant, participant B or participant C), a gradient boosting tree sub-model is constructed from the label and the Host party's data using methods such as XGBoost or LightGBM. The sub-model is constructed as follows: combining the Guest party's data label with each Host party, the steps shown in fig. 6 are completed under homomorphic encryption to construct the Host party's sub-model (refer to the corresponding related-art SecureBoost scheme in fig. 3).
In step S440, the computer obtains a global model according to the sub-model of the first participant and the sub-model of the second participant, so as to implement sharing of the data of the first participant and the data of the second participant.
In this step, the computer obtains a global model from the sub-model of the first party and the sub-model of the second party.

In one embodiment, the computer obtaining a global model from the sub-model of the first party and the sub-model of the second party comprises: the computer acquiring a fusion model and a cross model according to the sub-model of the first party and the sub-model of the second party.

In one embodiment, the first participant's sub-model comprises a model score and a predicted outcome and the second participant's sub-model comprises a model score and a predicted outcome, wherein the computer obtaining a fusion model and a cross model from the first participant's sub-model and the second participant's sub-model comprises: the computer obtaining the fusion model from the label, the model score of the sub-model of the first party, and the model score of the sub-model of the second party.

In one embodiment, the computer obtaining the fusion model from the label, the model score of the sub-model of the first participant, and the model score of the sub-model of the second participant comprises: the computer obtaining the fusion model from the label, the model score of the sub-model of the first party, and the model score of the sub-model of the second party using a logistic regression or gradient boosting tree model.

In one embodiment, the computer obtaining a fusion model and a cross model from the sub-model of the first party and the sub-model of the second party comprises: the computer obtaining the cross model according to the label, the output of the fusion model, the prediction result of the sub-model of the first party, and the prediction result of the sub-model of the second party.
FIG. 7 illustrates a sub-model fusion framework diagram of one embodiment of the present disclosure. Referring to fig. 7:
The sub-models of all participants are fused to obtain the prediction output of the global model (fusion model + cross model). The specific steps are as follows:

Step 3.1: From each party's sub-model $f_p$ and each party's data $X_p$, obtain the sub-model's output mapping score (model score) $z_p = f_p(X_p)$ and the corresponding classification result (prediction result) $\hat{y}_p$. The mapping scores $z_0, \ldots, z_P$ are fused directly to obtain the output $\hat{y}_{\text{fuse}}$.
The fusion model (direct fusion model) may be a logistic regression or a gradient boosting tree model. It takes the label $y$ as its training target and is constructed from the label and the model scores of the participants.
Step 3.2: The Guest party (the first participant, participant A) takes the sub-models' classification results $\hat{y}_0, \ldots, \hat{y}_P$ as the data features of each party and trains a cross model, whose goal is to cross the classification results of the sub-models and fit the residual of the direct fusion model, namely

$\hat{y}_{\text{cross}} \approx y - \hat{y}_{\text{fuse}}$
The cross model may use a gradient boosting tree model. It is constructed from the residual, i.e. the label minus the output of the fusion model, together with the prediction result of each participant's sub-model. The final prediction result is the output of the fusion model plus the output of the cross model.
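A minimal end-to-end sketch of Steps 3.1 and 3.2 for binary classification follows, using scikit-learn's LogisticRegression as the direct fusion model and GradientBoostingRegressor as the cross model, one permissible choice under the text above; all function and variable names are assumptions.

```python
# Sketch of the global model: direct fusion of sub-model scores plus a cross
# model fitted to the fusion residual. scores[p] = z_p = f_p(X_p) and
# preds[p] = yhat_p come from the P+1 sub-models.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

def fit_global_model(scores, preds, y):
    Z = np.column_stack(scores)        # sub-model scores, one column per party
    C = np.column_stack(preds)         # sub-model class predictions
    fusion = LogisticRegression().fit(Z, y)
    y_fuse = fusion.predict_proba(Z)[:, 1]
    residual = y - y_fuse              # the part direct fusion failed to fit
    cross = GradientBoostingRegressor().fit(C, residual)
    return fusion, cross

def predict_global(fusion, cross, scores, preds):
    Z, C = np.column_stack(scores), np.column_stack(preds)
    return fusion.predict_proba(Z)[:, 1] + cross.predict(C)
```

Training the cross model on the residual makes it responsible only for what direct fusion cannot capture, which is the residual-network idea this disclosure draws on.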
The key point of the framework is that the designed scheme does not model directly on global data, which avoids the cost of frequent data interaction. Instead, each party builds its sub-model more efficiently from local data. The output mapping scores of the sub-models are fused directly, and the classification results of the sub-models are crossed to fit the residual of the direct fusion, yielding the final prediction result. Because the sub-models remain confidential, the method effectively protects data privacy.
The federated-learning-based data sharing method of the present disclosure can be applied in the financial field. For example, the first participant may be an online shopping platform, the second participant may be a bank, and the label may be credit-card application status; through the above modeling method, the participants can establish a model for evaluating credit-card application risk.
The present disclosure provides a data sharing method for a federated tree that realizes longitudinal federated learning by first constructing sub-models and then fusing them. An efficient model construction scheme is designed: the construction of each sub-model depends only on local data features and not on the other Host parties, which reduces the amount of transmitted data and improves the efficiency of the overall algorithm. The designed sub-model construction method makes whether each sub-model node is split independent of the other parties' data features, so the sub-models can be constructed in parallel, further improving execution efficiency. The combination of direct fusion and crossing of the sub-models draws on the idea of residual networks: on the one hand, the results of each party's independent modeling are used effectively; on the other hand, the cross learning among sub-models fits the part of the label that direct fusion cannot, which improves algorithm performance.
In one embodiment, the present disclosure also includes a method of using a global model based on federated learning, the method comprising:
the computer acquires a prediction request of a first participant;
the computer aligns data of the prediction request based on a privacy intersection technology;
the computer predicts the data of the prediction request by using the submodel of each participant;
the computer predicts the prediction result of the sub-model of each participant by using a global model to obtain the prediction result of the prediction request of the first participant so as to realize the sharing of the data of the first participant and the data of the second participant.
The usage method of the present disclosure uses a global model based on federated learning established by the method shown in fig. 4.
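An illustrative serving flow for this use method might look as follows; the request object, the party interfaces and the psi_align routine are hypothetical, and predict_global refers to the fusion-plus-cross sketch given earlier.

```python
# Hypothetical inference flow: align the request data, let every sub-model
# predict on its local features, then combine with the global model.
def handle_prediction_request(request, parties, fusion, cross, psi_align):
    aligned = psi_align(request.sample_ids, parties)    # PSI-based alignment
    scores, preds = [], []
    for party in parties:
        z = party.submodel_score(aligned[party.name])   # local sub-model score
        scores.append(z)                                # (assumed numpy array)
        preds.append((z > 0.5).astype(int))
    # the global model fuses the sub-model outputs into the final prediction
    return predict_global(fusion, cross, scores, preds)
```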
Fig. 8 schematically illustrates a block diagram of a federated learning-based data sharing apparatus in accordance with an embodiment of the present disclosure. The federally learned data sharing apparatus 800 according to the embodiment of the present disclosure may be disposed on a terminal device, a server, or a part of the apparatus may be disposed on the terminal device and a part of the apparatus may be disposed on the server, for example, the apparatus may be disposed on the server 105 in fig. 1, but the present disclosure is not limited thereto.
The federal learning-based data sharing device 800 provided by the embodiments of the present disclosure may include an obtaining module 810, an aligning module 820, a sub-prediction module 830, and a main prediction module 840.
The obtaining module is configured to obtain a prediction request of a first participant; an alignment module configured to align data of the prediction request based on a privacy intersection technique; a sub-prediction module configured to predict data of the prediction request using sub-models of respective participants; the main prediction module is configured to predict the prediction result of the sub-model of each participant by using a global model to obtain the prediction result of the prediction request of the first participant, so as to realize the sharing of the data of the first participant and the data of the second participant.
According to an embodiment of the present disclosure, the above federated learning-based data sharing apparatus 800 may be used in a method for using the global federated learning-based model described in the present disclosure.
It is to be understood that the obtaining module 810, the aligning module 820, the sub-prediction module 830, and the main prediction module 840 may be combined into one module to be implemented, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present invention, at least one of the obtaining module 810, the aligning module 820, the sub-prediction module 830, and the main prediction module 840 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in a suitable combination of three implementations of software, hardware, and firmware. Alternatively, at least one of the acquisition module 810, the alignment module 820, the sub prediction module 830 and the main prediction module 840 may be at least partially implemented as a computer program module that, when executed by a computer, may perform the functions of the respective modules.
It should be noted that although several modules, units and sub-units of the apparatus for action execution are mentioned in the above detailed description, such division is not mandatory. Indeed, the features and functionality of two or more modules, units and sub-units described above may be embodied in one module, unit and sub-unit, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module, unit and sub-unit described above may be further divided into embodiments by a plurality of modules, units and sub-units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A data sharing method based on federal learning is characterized by comprising the following steps:
the computer aligning data samples of the first and second parties based on a privacy intersection technique;
the computer building a sub-model of a first participant using the tags and the data of the first participant;
the computer constructs a sub-model of the second party according to the label and the data of the second party;
and the computer acquires a global model according to the submodel of the first party and the submodel of the second party so as to realize the sharing of the data of the first party and the data of the second party.
2. The method of claim 1, wherein the computer building a sub-model of the first party using the tag and the data of the first party comprises:
the computer constructs, at the first participant, a gradient boosting tree sub-model of the first participant using the tags and data of the first participant through an extreme gradient boosting tree or a light gradient boosting machine (LightGBM) model.
3. The method of claim 1, wherein the computer obtaining a global model from the sub-model of the first party and the sub-model of the second party comprises:
and the computer acquires a fusion model and a cross model according to the submodel of the first party and the submodel of the second party.
4. The method of claim 3, wherein the first participant's submodel includes a model score and a predicted outcome and the second participant's submodel includes a model score and a predicted outcome, and wherein the computer obtaining a fusion model and a cross model from the first participant's submodel and the second participant's submodel comprises:
the computer obtains the fusion model from the label, the model score of the sub-model of the first party, and the model score of the sub-model of the second party.
5. The method of claim 4, wherein the computer obtaining the fusion model from the tag, the model score for the sub-model of the first participant, and the model score for the sub-model of the second participant comprises:
the computer obtains a fusion model from the label, the model score of the sub-model of the first party, and the model score of the sub-model of the second party using a logistic regression or gradient boosting tree model.
6. The method of claim 4, wherein the computer obtaining a fusion model and a cross model from the sub-model of the first party and the sub-model of the second party comprises:
and the computer obtains the cross model according to the label, the output of the fusion model, the prediction result of the sub model of the first party and the prediction result of the sub model of the second party.
7. A use method of a global model based on federal learning is characterized by comprising the following steps:
the computer acquires a prediction request of a first participant;
the computer aligns data of the prediction request based on a privacy intersection technology;
the computer predicts the data of the prediction request by using the submodel of each participant;
the computer predicts the prediction result of the sub-model of each participant by using a global model to obtain the prediction result of the prediction request of the first participant so as to realize the sharing of the data of the first participant and the data of the second participant.
8. A federally-learned-based data sharing apparatus, comprising:
an obtaining module configured to obtain a prediction request of a first participant;
an alignment module configured to align data of the prediction request based on a privacy intersection technique;
a sub-prediction module configured to predict data of the prediction request using sub-models of respective participants;
the main prediction module is configured to predict the prediction result of the sub-model of each participant by using a global model to obtain the prediction result of the prediction request of the first participant, so as to realize the sharing of the data of the first participant and the data of the second participant.
9. An electronic device, comprising:
one or more processors;
a storage device configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202110324495.3A 2021-03-26 2021-03-26 Data sharing method, use method of model applying data sharing method and related equipment Pending CN113051239A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110324495.3A CN113051239A (en) 2021-03-26 2021-03-26 Data sharing method, use method of model applying data sharing method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110324495.3A CN113051239A (en) 2021-03-26 2021-03-26 Data sharing method, use method of model applying data sharing method and related equipment

Publications (1)

Publication Number Publication Date
CN113051239A true CN113051239A (en) 2021-06-29

Family

ID=76515372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110324495.3A Pending CN113051239A (en) 2021-03-26 2021-03-26 Data sharing method, use method of model applying data sharing method and related equipment

Country Status (1)

Country Link
CN (1) CN113051239A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020029590A1 (en) * 2018-08-10 2020-02-13 深圳前海微众银行股份有限公司 Sample prediction method and device based on federated training, and storage medium
WO2020134704A1 (en) * 2018-12-28 2020-07-02 深圳前海微众银行股份有限公司 Model parameter training method based on federated learning, terminal, system and medium
US20200358599A1 (en) * 2019-05-07 2020-11-12 International Business Machines Corporation Private and federated learning
CN111860864A (en) * 2020-07-23 2020-10-30 深圳前海微众银行股份有限公司 Longitudinal federal modeling optimization method, device and readable storage medium
CN111784001A (en) * 2020-09-07 2020-10-16 腾讯科技(深圳)有限公司 Model training method and device and computer readable storage medium
CN112199709A (en) * 2020-10-28 2021-01-08 支付宝(杭州)信息技术有限公司 Multi-party based privacy data joint training model method and device
CN112257876A (en) * 2020-11-15 2021-01-22 腾讯科技(深圳)有限公司 Federal learning method, apparatus, computer device and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王亚珅 (WANG Yashen): "Survey of the Development of Federated Learning Technology for Data Sharing and Exchange" (面向数据共享交换的联邦学习技术发展综述), Unmanned Systems Technology (无人系统技术), no. 06, 15 November 2019 *
陈涛; 郭睿; 刘志强 (CHEN Tao; GUO Rui; LIU Zhiqiang): "Research on Aviation Application Models of Federated Learning Algorithms for Big-Data Privacy Protection" (面向大数据隐私保护的联邦学习算法航空应用模型研究), Information Security and Communications Privacy (信息安全与通信保密), no. 09, 10 September 2020 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435537A (en) * 2021-07-16 2021-09-24 同盾控股有限公司 Cross-feature federated learning method and prediction method based on Soft GBDT
CN113537597A (en) * 2021-07-16 2021-10-22 上海大学 Privacy protection-based material performance prediction method and system
CN113592097A (en) * 2021-07-23 2021-11-02 京东科技控股股份有限公司 Federal model training method and device and electronic equipment
CN113592097B (en) * 2021-07-23 2024-02-06 京东科技控股股份有限公司 Training method and device of federal model and electronic equipment
CN113626848A (en) * 2021-08-24 2021-11-09 北京沃东天骏信息技术有限公司 Sample data generation method and device, electronic equipment and computer readable medium
CN113722739A (en) * 2021-09-06 2021-11-30 京东科技控股股份有限公司 Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN113722739B (en) * 2021-09-06 2024-04-09 京东科技控股股份有限公司 Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN114386069A (en) * 2022-01-06 2022-04-22 北京数牍科技有限公司 Federal learning model training method based on condition privacy set intersection
CN114386069B (en) * 2022-01-06 2024-09-10 北京数牍科技有限公司 Federal learning model training method based on conditional privacy set intersection

Similar Documents

Publication Publication Date Title
CN113051239A (en) Data sharing method, use method of model applying data sharing method and related equipment
WO2022089256A1 (en) Method, apparatus and device for training federated neural network model, and computer program product and computer-readable storage medium
CN110084377B (en) Method and device for constructing decision tree
US11176469B2 (en) Model training methods, apparatuses, and systems
CN110309587B (en) Decision model construction method, decision method and decision model
CN113505882B (en) Data processing method based on federal neural network model, related equipment and medium
CN111428887B (en) Model training control method, device and system based on multiple computing nodes
CN107807991A (en) For handling the method and device of block chain data
CN111563267B (en) Method and apparatus for federal feature engineering data processing
US11410081B2 (en) Machine learning with differently masked data in secure multi-party computing
WO2023216494A1 (en) Federated learning-based user service strategy determination method and apparatus
CN113657471A (en) Construction method and device of multi-classification gradient lifting tree and electronic equipment
WO2023040429A1 (en) Data processing method, apparatus, and device for federated feature engineering, and medium
CN114186256A (en) Neural network model training method, device, equipment and storage medium
US20240005165A1 (en) Machine learning model training method, prediction method therefor, apparatus, device, computer-readable storage medium, and computer program product
CN111444335B (en) Method and device for extracting central word
US20230418794A1 (en) Data processing method, and non-transitory medium and electronic device
CN116306905A (en) Semi-supervised non-independent co-distributed federal learning distillation method and device
CN112417018A (en) Data sharing method and device
CN117094421B (en) Asymmetric longitudinal federal learning method, device, electronic equipment and storage medium
CN113722744B (en) Data processing method, device, equipment and medium for federal feature engineering
CN115378624B (en) Knowledge graph construction method and device, electronic equipment and storage medium
CN116304644B (en) Data processing method, device, equipment and medium based on federal learning
US20240064179A1 (en) Highly scalable four-dimensional geospatial data system for simulated worlds
CN116186720A (en) Modeling and using methods of data encryption model and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination