WO2022088606A1 - 基于联邦学习的gbdt与lr融合方法、装置、设备和存储介质 - Google Patents

基于联邦学习的gbdt与lr融合方法、装置、设备和存储介质 Download PDF

Info

Publication number
WO2022088606A1
WO2022088606A1 (PCT/CN2021/084670, CN2021084670W)
Authority
WO
WIPO (PCT)
Prior art keywords
sample
gradient
party
division
gbdt
Prior art date
Application number
PCT/CN2021/084670
Other languages
English (en)
French (fr)
Inventor
王健宗
肖京
何安珣
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022088606A1 publication Critical patent/WO2022088606A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes

Definitions

  • the present application relates to the technical field of model hosting, and in particular, to a method, apparatus, device and storage medium for GBDT and LR fusion based on federated learning.
  • GBDT: Gradient Boosting Decision Tree
  • LR: Logistic Regression, a generalized linear model
  • regulation of the financial industry is becoming increasingly strict, and financial data cannot be directly aggregated for machine learning model training.
  • the main purpose of this application is to provide a GBDT and LR fusion method, device, equipment and storage medium based on federated learning, which aims to solve the technical problem that financial data cannot be directly aggregated for the fusion model training of GBDT and LR models.
  • the present application provides a GBDT and LR fusion method based on federated learning, which is applied to the active party and includes the following steps:
  • the gradient sum group is obtained by the passive party grouping the second samples according to their attributes and then calculating the gradient sum of each group; the first sample and the second sample correspond to the same user, and the second sample does not have a label;
  • the sample space is obtained by the passive party dividing the second samples according to the division value, yielding the sample space corresponding to the left node or the right node;
  • a feature matrix is constructed according to the tree structure, and logistic regression is trained to obtain an LR model.
  • This application also provides another GBDT and LR fusion method based on federated learning, applied to the passive party, including the following steps:
  • the gradient sum group is encrypted and transmitted to the active party;
  • the division value is obtained by the active party decrypting the gradient sum group and selecting the optimal feature division according to the decrypted gradient sums;
  • the sample space corresponding to the left node or the right node is transmitted to the active party.
  • the application also provides a GBDT and LR fusion device based on federated learning, including:
  • a calculation unit configured to calculate the gradient of each first sample, and transmit the gradient to the passive party after encryption, wherein the first sample has a label
  • the first obtaining unit is used to obtain the encrypted gradient sum group of the passive party; wherein the gradient sum group is obtained by the passive party grouping the second samples according to their attributes and then calculating the gradient sum of each group; the first sample and the second sample correspond to the same user, and the second sample does not have a label;
  • a decryption unit configured to decrypt the gradient sum group, select the optimal feature division according to the decrypted gradient sums, and transmit the division value corresponding to the optimal feature division to the passive party;
  • a second obtaining unit configured to obtain the sample space in which the passive party has divided samples into the left node or the right node; wherein the sample space is obtained by the passive party dividing the second samples according to the division value;
  • splitting unit for splitting the first sample according to the sample space to obtain a tree structure corresponding to the GBDT model
  • the construction unit is used for constructing a feature matrix according to the tree structure, and performing logistic regression training to obtain an LR model.
  • This application also provides another GBDT and LR fusion device based on federated learning, including:
  • a third obtaining unit configured to obtain the age characteristic value of each second sample of the passive party
  • a sorting unit configured to sort the second sample according to the age feature value
  • a grouping unit configured to divide the second sample according to the preset quantile according to the sorting, to obtain each grouping
  • a computing unit for computing the gradient sum of each grouping to obtain the gradient sum group
  • an encryption unit used for encrypting the gradient sum group and transmitting it to the active party;
  • the fourth obtaining unit is used to obtain the division value of the active party; wherein the division value is obtained by the active party decrypting the gradient sum group and selecting the optimal feature division according to the decrypted gradient sums;
  • a division unit configured to divide the second samples belonging to the division value to the left node and the second samples not belonging to the division value to the right node, obtaining the sample space corresponding to the left node or the right node;
  • a transfer unit configured to transfer the sample space corresponding to the left node or the right node to the active party.
  • the present application also provides a computer device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the steps of a federated learning-based GBDT and LR fusion method when executing the computer program:
  • the gradient sum group is obtained by the passive party grouping the second samples according to their attributes and then calculating the gradient sum of each group; the first sample and the second sample correspond to the same user, and the second sample does not have a label;
  • the sample space is obtained by the passive party dividing the second samples according to the division value, yielding the sample space corresponding to the left node or the right node;
  • a feature matrix is constructed according to the tree structure, and logistic regression is trained to obtain an LR model.
  • the present application also provides a computer device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the steps of a federated learning-based GBDT and LR fusion method when executing the computer program:
  • the gradient sum group is encrypted and transmitted to the active party;
  • the division value is obtained by the active party decrypting the gradient sum group and selecting the optimal feature division according to the decrypted gradient sums;
  • the sample space corresponding to the left node or the right node is transmitted to the active party.
  • the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of a federated learning-based GBDT and LR fusion method are implemented:
  • the gradient sum group is obtained by the passive party grouping the second samples according to their attributes and then calculating the gradient sum of each group; the first sample and the second sample correspond to the same user, and the second sample does not have a label;
  • the sample space is obtained by the passive party dividing the second samples according to the division value, yielding the sample space corresponding to the left node or the right node;
  • a feature matrix is constructed according to the tree structure, and logistic regression is trained to obtain an LR model.
  • the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of a federated learning-based GBDT and LR fusion method are implemented:
  • the gradient sum group is encrypted and transmitted to the active party;
  • the division value is obtained by the active party decrypting the gradient sum group and selecting the optimal feature division according to the decrypted gradient sums;
  • the sample space corresponding to the left node or the right node is transmitted to the active party.
  • the GBDT and LR fusion method, device, equipment and storage medium based on federated learning can build a federated model with high interpretability and good performance without directly aggregating data for model training.
  • When training the LR model, the method constructs the feature matrix directly from the tree structure of the GBDT model, obtaining a highly interpretable LR model without tedious feature engineering.
  • This method only needs to transmit gradients when building the GBDT model.
  • the construction of the LR model is performed locally by the active party through training. Therefore, time efficiency essentially depends on the GBDT model, and time complexity is not increased.
  • the active party and the passive party do not need to know each other's data information, and their respective financial data will not be known by other parties, so that financial data can also be used for machine learning model training.
  • FIG. 1 is a schematic diagram of an implementation environment of an embodiment of the present application.
  • FIG. 2 is a schematic diagram of steps of a GBDT and LR fusion method based on federated learning in an embodiment of the present application
  • FIG. 3 is a schematic diagram of steps of another GBDT and LR fusion method based on federated learning in another embodiment of the present application;
  • FIG. 4 is a structural block diagram of a GBDT and LR fusion device based on federated learning in an embodiment of the present application
  • FIG. 5 is a structural block diagram of another GBDT and LR fusion device based on federated learning in an embodiment of the present application
  • FIG. 6 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
  • the active side has a first terminal 1
  • the passive side has at least one second terminal 2, and data communication can be performed between the first terminal 1 and the second terminal 2 through the network. The active party and the passive party have the same users; the first terminal 1 has sample data X1 and label data Y, and the second terminal 2 has sample data X2, X3, ..., XN.
  • Both the first terminal 1 and the second terminal 2 may include an independently running server, or a distributed server, or a server cluster composed of multiple servers, wherein the server may be a cloud server.
  • Federated learning can build machine learning systems without direct access to sample data, which remains in its original location, helping to ensure privacy and reduce communication costs.
  • the first terminal 1 calculates the gradient of each first sample, encrypts it, and transmits it to the second terminal 2; the second terminal 2 groups the second samples, calculates the gradient sum of each group to form a gradient sum group, and transmits it to the first terminal 1;
  • the first terminal 1 decrypts the gradient sum group, selects the optimal feature division according to the decrypted gradient sums, and transmits the division value of the optimal feature to the second terminal 2; the second terminal 2 divides the second samples according to the division value and passes the resulting sample space of the left node or the right node to the first terminal 1; the first terminal 1 splits the first samples according to the sample space to obtain the GBDT model, constructs a feature matrix according to the tree structure of the GBDT model, and performs logistic regression training to obtain the LR model.
  • an embodiment of the present application provides a GBDT and LR fusion method based on federated learning, applied to the active party, including the following steps:
  • Step S1: calculate the gradient of each first sample, encrypt the gradient, and transmit it to the passive party, wherein the first sample has a label;
  • Step S2: obtain the encrypted gradient sum group of the passive party; wherein the gradient sum group is obtained by the passive party grouping the second samples according to their attributes and then calculating the gradient sum of each group; the first sample and the second sample correspond to the same user, and the second sample does not have a label;
  • Step S3: decrypt the gradient sum group, select the optimal feature division according to the decrypted gradient sums, and transmit the division value corresponding to the optimal feature division to the passive party;
  • Step S4: obtain the sample space in which the passive party has divided samples into the left node or the right node; wherein the sample space is obtained by the passive party dividing the second samples according to the division value;
  • Step S5: split the first samples according to the sample space to obtain the tree structure corresponding to the GBDT model;
  • Step S6: construct a feature matrix according to the tree structure and perform logistic regression training to obtain the LR model.
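The flow of steps S1 to S5 can be sketched as a toy, single-split simulation. Encryption, networking, and the full boosting loop are elided, the log-loss gradients and the two-group split are illustrative assumptions, and none of the names below come from the patent:

```python
# Toy single-split walk-through of steps S1-S5; every name is illustrative.

def run_round(labels, preds, passive_feature):
    n = len(labels)
    # S1 (active): per-sample first/second-order gradients under log-loss
    g = [p - y for y, p in zip(labels, preds)]
    h = [p * (1 - p) for p in preds]
    # S2 (passive): sort by its private feature, form two equal groups and
    # report the (notionally encrypted) gradient sum of each group
    order = sorted(range(n), key=lambda i: passive_feature[i])
    groups = [order[: n // 2], order[n // 2:]]
    sums = [sum(g[i] for i in grp) for grp in groups]
    # S3 (active): pick the group with the largest absolute gradient mass
    # as the optimal division and send its index back as the division value
    division = max(range(len(sums)), key=lambda k: abs(sums[k]))
    # S4 (passive): samples of the chosen group form the left-node space
    left_space = set(groups[division])
    # S5 (active): split its own first samples using that sample space
    left = [i for i in range(n) if i in left_space]
    right = [i for i in range(n) if i not in left_space]
    return left, right

left, right = run_round([1, 0, 1, 0], [0.6, 0.4, 0.7, 0.2], [35, 61, 23, 47])
print(left, right)  # the two node sample spaces partition all samples
```

The point of the sketch is the message flow: per-sample gradients go one way, only aggregated group sums come back, and only a sample-space partition crosses the boundary afterward.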
  • the GBDT and LR fusion method based on federated learning proposed in this embodiment is applied to the active party.
  • the active party and the passive party have the same batch of users, but the user information owned by the active party and the passive party is different.
  • For example, Bank A has users A, B, and C, and Bank B has the same users A, B, and C. For these three users, Bank A has each user's phone number and a label indicating whether the loan is in good standing, while Bank B has each user's age, gender, and other information, and may also have a label.
  • The party whose label is used for training is called the active party, such as Bank A; the other party is the passive party, such as Bank B.
  • The passive party may also hold a label; in that case the active party's label is still used, and the passive party's label does not participate in the training.
  • The passive party may include multiple banks.
  • the gradient of each first sample in the active party is calculated. Specifically, at the t-th iteration, the first-order and second-order gradients are calculated as g_i = ∂l(y_i, ŷ_i^(t-1))/∂ŷ_i^(t-1) (1) and h_i = ∂²l(y_i, ŷ_i^(t-1))/∂(ŷ_i^(t-1))² (2), where l is the loss function, y_i is the label, and ŷ_i^(t-1) is the prediction from the previous iteration.
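For the common case of log-loss (an assumption here; the patent does not fix the loss function), formulas (1) and (2) reduce to the familiar first- and second-order terms:

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0])      # labels of the first samples
y_hat = np.array([0.8, 0.3, 0.4])  # predicted probabilities from iteration t-1
g = y_hat - y                      # formula (1): first-order gradient of log-loss
h = y_hat * (1.0 - y_hat)          # formula (2): second-order gradient (Hessian)
```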
  • the encrypted gradient sums of each group of the passive party are obtained, forming the gradient sum group.
  • the passive party groups the second samples.
  • the second samples can be grouped based on certain feature attributes of the second samples, such as gender or age; the gradients of the second samples in each group are calculated according to formulas (1) and (2), the gradient sum of each group is then calculated, and the gradient sums are encrypted and transmitted to the active party.
  • when training with multiple banks, the passive party includes multiple banks: the second samples of each bank are grouped, the gradient sum of each group is calculated to form the gradient sum group, and the gradient sum group is encrypted and then passed to the active party.
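Because additively homomorphic schemes such as Paillier let ciphertexts be summed without decryption, the passive party can aggregate encrypted gradients safely. A minimal textbook Paillier sketch, with toy key sizes and non-negative integer messages only (real deployments need large keys and an encoding for signed floating-point gradients):

```python
import math
import random

class Paillier:
    """Textbook additive-homomorphic Paillier cryptosystem (toy parameters)."""

    def __init__(self, p, q):
        self.n = p * q
        self.n2 = self.n * self.n
        self.g = self.n + 1
        self.lam = math.lcm(p - 1, q - 1)
        # with g = n + 1, L(g^lam mod n^2) = lam mod n, so:
        self.mu = pow(self.lam % self.n, -1, self.n)

    def encrypt(self, m):
        # c = g^m * r^n mod n^2 for a random r coprime with n
        r = random.randrange(2, self.n)
        while math.gcd(r, self.n) != 1:
            r = random.randrange(2, self.n)
        return pow(self.g, m, self.n2) * pow(r, self.n, self.n2) % self.n2

    def add(self, c1, c2):
        # the product of ciphertexts decrypts to the sum of plaintexts
        return c1 * c2 % self.n2

    def decrypt(self, c):
        ell = (pow(c, self.lam, self.n2) - 1) // self.n
        return ell * self.mu % self.n

# The passive party encrypts per-group gradient sums (scaled to integers);
# anyone can add them while encrypted, but only the key holder decrypts.
pk = Paillier(1009, 1013)                  # toy primes; never use in practice
cts = [pk.encrypt(v) for v in (3, 5, 7)]   # e.g. scaled gradient sums
total = cts[0]
for c in cts[1:]:
    total = pk.add(total, c)
print(pk.decrypt(total))                   # 15
```

The additive property is exactly what the protocol needs: ciphertext multiplication mod n² corresponds to plaintext addition, so group sums can be formed without exposing individual gradients.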
  • the base learner of the GBDT model adopts the tree model
  • the tree model splits according to the decrypted gradient sums; each split divides a node into a left node and a right node, and each first sample falls into a unique leaf node in each tree.
  • Select the optimal feature division and pass the corresponding division value of the optimal feature division to the passive party.
  • the optimal feature division represents the best way to split the tree model.
  • the division value is a hyperparameter: a value set before training starts, rather than a parameter learned through training.
  • the passive party divides the samples according to the division value; that is, the samples whose value in the feature column falls within the division-value range are divided into the left node, and the rest are divided into the right node. The passive party returns the resulting sample space of the left node or the right node to the active party, and the active party obtains it.
  • In step S5, after receiving the sample space divided into the left node, the active party knows which second samples fall into the left node, so the first samples can be split in the same way into left and right nodes. Splitting continues until the corresponding threshold is reached and leaf nodes are constructed, yielding the tree structure of the GBDT model.
  • In step S6, since the active party knows the tree structure of the GBDT model and the sample space falling into each leaf node, the active party assigns values to the first samples, constructs a sparse matrix, uses the sparse matrix as the feature matrix, and performs logistic regression training to obtain the LR model.
  • the GBDT and LR fusion method based on federated learning can construct a federated model with high interpretability and good performance without directly aggregating data for model training.
  • the method directly constructs the feature matrix according to the tree structure of the GBDT model, and can obtain the LR model with high interpretability without tedious feature construction.
  • This method only needs to transmit gradients when building the GBDT model.
  • the construction of the LR model is performed locally by the active party through training. Therefore, time efficiency essentially depends on the GBDT model, and time complexity is not increased.
  • this method does not disclose the data information of any bank when training a risk-control model with multiple banks; it can therefore make effective use of multi-party data to complete personal risk assessment while ensuring data security, enabling banks to identify users with poor credit standing at other banks.
  • the step S3 of decrypting the gradient sum group, selecting the optimal feature division according to the gradient sums, and transmitting the feature column and division value corresponding to the optimal feature division to the passive party includes:
  • the division value corresponding to the optimal feature division is transmitted to the passive party.
  • the active party decrypts the gradient sum group, obtains the gradient sum of each group, and calculates the gain of the first sample in the active party according to the gradient sum.
  • the step of calculating the gain of the first sample according to the decrypted gradient sum includes:
  • the gain of the first sample is calculated by the formula Gain = ½[g_l²/(h_l+λ) + g_r²/(h_r+λ) − (g_l+g_r)²/(h_l+h_r+λ)]; the value of the gain represents the quality of the tree-model split, and the larger the gain, the better the split.
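With the decrypted group sums in hand, the gain can be computed directly. A small sketch assuming the standard second-order boosting gain with regularization parameter λ (the ½ factor and the λ default are the usual boosting convention, not values fixed by the patent):

```python
def split_gain(g_l, h_l, g_r, h_r, lam=1.0):
    """Gain of splitting a node: left + right child scores minus the
    unsplit parent score, each score being G**2 / (H + lam)."""
    g, h = g_l + g_r, h_l + h_r
    return 0.5 * (g_l ** 2 / (h_l + lam)
                  + g_r ** 2 / (h_r + lam)
                  - g ** 2 / (h + lam))

# the candidate split with the largest gain is the optimal feature division
print(split_gain(2.0, 3.0, -1.0, 2.0))
```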
  • the step S5 of splitting the first sample according to the sample space to obtain a tree structure corresponding to the GBDT model includes:
  • the first sample corresponding to the second sample in the sample space is divided into left nodes
  • the active party knows the sample space divided into the left node by the passive party
  • the first samples corresponding to the second samples located at the left node are divided into the left node, and the remaining first samples are divided into the right node; the tree structure of the GBDT model is thus obtained, and the training of the GBDT model is completed.
  • the step S6 of constructing a feature matrix according to the tree structure and performing logistic regression training to obtain an LR model includes:
  • the first sample is assigned according to the one-hot encoding, and a sparse matrix is constructed
  • the trained GBDT model has generated multiple trees; each first sample falls into a unique leaf node in each tree, and each leaf node is regarded as a feature.
  • the leaf nodes are one-hot encoded, and the first samples are assigned values according to the one-hot encoding.
  • Each first sample obtains a corresponding feature vector, and all first samples together yield a sparse matrix in which each column represents a leaf node. This sparse matrix is put into logistic regression for training to complete the construction of the LR model.
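Locally (with no federation involved), the leaf one-hot construction described above is the classic GBDT+LR recipe and can be sketched with scikit-learn, assumed available here; `apply` returns the leaf index each sample falls into per tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# stage 1: train the GBDT and read off the leaf each sample falls into
gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=3,
                                  random_state=0).fit(X, y)
leaves = gbdt.apply(X).reshape(len(X), -1)   # one leaf index per tree

# stage 2: one-hot encode the leaves into a sparse feature matrix and
# train logistic regression on it
encoder = OneHotEncoder(handle_unknown="ignore").fit(leaves)
features = encoder.transform(leaves)         # sparse matrix, one column per leaf
lr = LogisticRegression(max_iter=1000).fit(features, y)
```

Each one-hot column corresponds to one leaf node, so the LR coefficients remain interpretable as weights on the decision paths the GBDT has learned.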
  • no transmission is performed during the construction of the LR model, so the construction process is very efficient and does not increase the time complexity.
  • an embodiment of the present application provides another GBDT and LR fusion method based on federated learning, which is applied to the passive side and includes the following steps:
  • Step S10 obtaining the age characteristic value of each second sample of the passive party
  • Step S20 sorting the second sample according to the age feature value
  • Step S30 dividing the second sample according to the preset quantile according to the sorting to obtain each grouping
  • Step S40 calculating the gradient sum of each grouping to obtain the gradient sum group
  • Step S50 encrypting the gradient sum group and passing it to the active party;
  • Step S60 obtaining the division value of the active party; wherein the division value is obtained by the active party decrypting the gradient sum group and selecting the optimal feature division according to the decrypted gradient sums;
  • Step S70 dividing the second samples belonging to the division value to the left node and the second samples not belonging to the division value to the right node, obtaining the sample space corresponding to the left node or the right node.
  • Step S80 Transfer the sample space corresponding to the left node or the right node to the active party.
  • the GBDT and LR fusion method based on federated learning proposed in this embodiment is applied to the passive party.
  • the second samples of the passive party have the age feature; the passive party obtains the age feature value of each second sample and sorts the samples by age value from youngest to oldest.
  • each second sample is grouped according to the sorting. Specifically, a quantile is preset, such as quartiles or quintiles. If quartiles are used, the sorted second samples are divided into four equal parts, resulting in four groups.
  • the gradient sum of each group is obtained by the passive party summing the gradients of the second samples in each group, yielding the gradient sum group. Specifically, the gradient sum of each group is encrypted with an additive homomorphic encryption algorithm and then transmitted to the active party.
  • each second sample is grouped according to the age feature value, and the number of second samples in each group is the same. By grouping with quantiles, one can directly know, with the second samples arranged from youngest to oldest, the ratio of the number of second samples below a certain value to the total number of second samples.
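The sorting-and-quartile grouping of steps S10 to S40 can be sketched with NumPy; the ages and gradients below are toy values, and `np.array_split` stands in for the quantile division:

```python
import numpy as np

ages = np.array([23, 61, 35, 47, 29, 52, 40, 33])               # passive party's feature
grads = np.array([0.1, -0.3, 0.2, 0.05, -0.1, 0.4, 0.0, 0.25])  # received gradients

order = np.argsort(ages)           # S20: sort samples youngest to oldest
groups = np.array_split(order, 4)  # S30: quartiles -> four equal-size groups
grad_sums = [float(grads[idx].sum()) for idx in groups]  # S40: gradient sum group
print(grad_sums)
```

Only `grad_sums` (encrypted, in the real protocol) leaves the passive party; the ages and per-sample gradients stay local.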
  • the passive party receives the division value of the optimal feature division of the active party
  • the second sample belonging to the division value is divided into the left node
  • the second samples that do not belong to the division value are divided into the right node
  • the GBDT and LR fusion method based on federated learning provided in this application can be used in the blockchain field, and the trained GBDT model and LR model are stored in the blockchain network.
  • the blockchain is a new application mode of computer technology combining distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • Blockchain is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the underlying platform of the blockchain can include processing modules such as user management, basic services, smart contracts, and operation monitoring.
  • the user management module is responsible for the identity information management of all blockchain participants, including maintenance of public and private key generation (account management), key management, and maintenance of the corresponding relationship between the user's real identity and blockchain address (authority management), etc.
  • the basic service module is deployed on all blockchain node devices to verify the validity of business requests and, after reaching consensus on valid requests, record them in storage.
  • For a new business request, the basic service first performs interface adaptation for analysis and authentication processing, then encrypts the business information through the consensus algorithm (consensus management); after encryption, the information is transmitted completely and consistently to the shared ledger (network communication) and recorded in storage. The smart contract module is responsible for contract registration and issuance, as well as contract triggering and contract execution.
  • developers can define contract logic through a programming language and publish it to the blockchain (contract registration); according to the logic of the contract terms, a key or other event triggers execution to complete the contract logic; the module also provides contract upgrade and cancellation functions;
  • the operation monitoring module is mainly responsible for deployment, configuration modification, and contract settings during product release, cloud adaptation, and the visual output of real-time status during product operation, such as alarms, monitoring network conditions, and monitoring node device health.
  • an embodiment of the present application provides a GBDT and LR fusion device based on federated learning, including:
  • the computing unit 10 is configured to calculate the gradient of each first sample, and transmit the gradient to the passive party after encryption, wherein the first sample has a label;
  • the first obtaining unit 20 is configured to obtain the encrypted gradient sum group of the passive party; wherein the gradient sum group is obtained by the passive party grouping the second samples according to their attributes and then calculating the gradient sum of each group; the first sample and the second sample correspond to the same user, and the second sample does not have a label;
  • the decryption unit 30 is used to decrypt the gradient sum group, select the optimal feature division according to the decrypted gradient sums, and pass the division value corresponding to the optimal feature division to the passive party;
  • the second obtaining unit 40 is configured to obtain the sample space in which the passive party has divided samples into the left node or the right node; wherein the sample space is obtained by the passive party dividing the second samples according to the division value;
  • a splitting unit 50 configured to split the first sample according to the sample space to obtain a tree structure corresponding to the GBDT model
  • the construction unit 60 is configured to construct a feature matrix according to the tree structure, and perform logistic regression training to obtain an LR model.
  • the decryption unit 30 includes:
  • a decryption subunit for decrypting the gradient sum group
  • a second calculation subunit configured to calculate the gain of the first sample according to the decrypted gradient sum
  • a delivery subunit configured to deliver the feature column and the partition value corresponding to the optimal feature partition to the passive party.
  • the second calculation subunit includes:
  • a calculation module for calculating the gain of the first sample by the formula Gain = ½[g_l²/(h_l+λ) + g_r²/(h_r+λ) − g²/(h+λ)], wherein g_l and h_l are the gradient information of the first samples split into the left node, g_r and h_r are the gradient information of the first samples split into the right node, g and h are the gradient information of the current first samples, and λ is the regularization parameter of the formula.
  • the splitting unit 50 includes:
  • a third dividing subunit configured to divide the first sample corresponding to the second sample in the sample space into a left node according to the sample space;
  • the fourth division subunit is used for dividing the remaining first samples into right nodes to obtain a tree structure corresponding to the GBDT model.
  • the construction unit 60 includes:
  • a coding unit used for one-hot coding of leaf nodes in the tree structure
  • an assignment unit configured to assign a value to the first sample according to the one-hot encoding to construct a sparse matrix
  • the training subunit is configured to use the sparse matrix as a feature matrix to perform logistic regression training to obtain the LR model.
  • an embodiment of the present application provides another GBDT and LR fusion device based on federated learning, including:
  • a third obtaining unit 1A configured to obtain the age characteristic value of each second sample of the passive party
  • a sorting unit 1B configured to sort the second sample according to the age feature value
  • the grouping unit 1C is configured to divide the second samples according to the preset quantiles according to the sorting to obtain each grouping;
  • the computing unit 1D is configured to calculate the gradient sum of each group to obtain the gradient-sum groups;
  • the encryption unit 1E is configured to encrypt the gradient-sum groups and transmit them to the active party;
  • the fourth obtaining unit 1F is configured to obtain the division value from the active party; wherein the division value is obtained by the active party decrypting the gradient-sum groups and selecting the optimal feature division according to the decrypted gradient sums;
  • the division unit 1G is configured to divide the second samples belonging to the division value into the left node and the second samples not belonging to the division value into the right node, obtaining the sample space corresponding to the left node or the right node;
  • the transfer unit 1H is configured to transfer the sample space corresponding to the left node or the right node to the active party.
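The passive-party pipeline above (sort by age, cut into preset quantiles, sum the gradients per group) can be sketched as follows; encryption of the sums is omitted here, and the function names are illustrative:

```python
def quantile_bins(ages, q=4):
    """Sort sample indices by age and cut them into q equal-size groups
    (quartiles by default), mirroring the preset-quantile grouping step."""
    order = sorted(range(len(ages)), key=lambda i: ages[i])
    size = len(order) // q
    bins = [order[i * size:(i + 1) * size] for i in range(q - 1)]
    bins.append(order[(q - 1) * size:])  # last group absorbs the remainder
    return bins

def bin_gradient_sums(bins, g, h):
    """Per-group sums of the first/second-order gradients g, h (which in the
    real protocol arrive additively encrypted from the active party)."""
    return [(sum(g[i] for i in b), sum(h[i] for i in b)) for b in bins]
```

With additive homomorphic encryption, the passive party can compute these per-group sums directly on the ciphertexts, which is why only encrypted gradient-sum groups ever leave either party.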
  • an embodiment of the present application further provides a computer device
  • the computer device may be a server, and its internal structure may be as shown in FIG. 6 .
  • the computer device includes a processor, memory, a network interface, and a database connected by a system bus.
  • the processor of the computer device is configured to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the nonvolatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer device is used to store the first sample data, the second sample data, and the like.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program implements a federated learning-based GBDT and LR fusion method when executed by the processor.
  • FIG. 6 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.
  • An embodiment of the present application further provides a computer-readable storage medium, where the storage medium may be a non-volatile storage medium or a volatile storage medium.
  • a computer program is stored thereon, and when the computer program is executed by the processor, a GBDT and LR fusion method based on federated learning is implemented.
  • the active party calculates the gradient of each first sample and transmits the gradients, after encryption, to the passive party, wherein the first samples have labels; obtains the encrypted gradient-sum groups of the passive party, wherein the gradient-sum groups are obtained by the passive party grouping the second samples by attribute and then calculating the gradient sum of each group; the first samples and the second samples correspond to the same users, and the second samples have no labels; decrypts the gradient-sum groups, selects the optimal feature division according to the decrypted gradient sums, and passes the division value corresponding to the optimal feature division to the passive party; obtains the sample space divided by the passive party into a left node or a right node, wherein the sample space is obtained by the passive party dividing the second samples according to the division value; splits the first samples according to the sample space to obtain the tree structure of the GBDT model; and constructs a feature matrix according to the tree structure and performs logistic regression training to obtain the LR model.
  • the GBDT and LR fusion method based on federated learning can build a federated model with high interpretability and good effect without directly aggregating data for model training.
  • the method directly constructs the feature matrix according to the tree structure of the GBDT model, and can obtain the LR model with high interpretability without tedious feature construction.
  • This method only needs to transmit gradients and related values when building the GBDT model; the LR model is trained directly on the active side, so the overall time efficiency essentially depends on the efficiency of the GBDT model, and the time complexity is not increased.
  • the active party and the passive party do not need to know each other's data information, and their respective financial data will not be known by other parties, so that financial data can also be used for machine learning model training.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).


Abstract

A GBDT and LR fusion method based on federated learning, and an apparatus, a device and a storage medium. The method comprises: calculating the gradient of each first sample, and transmitting the gradients, after encryption, to a passive party, wherein the first samples have labels (S1); obtaining the encrypted gradient-sum groups of the passive party's groups, wherein the gradient-sum groups are obtained by the passive party grouping the second samples by attribute and then calculating the gradient sum of each group; the first samples and the second samples correspond to the same users, and the second samples have no labels (S2); decrypting the gradient-sum groups, selecting the optimal feature division according to the decrypted gradient sums, and passing the division value corresponding to the optimal feature division to the passive party (S3); obtaining the sample space divided by the passive party into a left node or a right node, wherein the sample space is obtained by the passive party dividing the second samples according to the division value (S4); splitting the first samples according to the sample space to obtain a tree structure corresponding to the GBDT model (S5); and constructing a feature matrix according to the tree structure and performing logistic regression training to obtain an LR model (S6). The method, apparatus, device and storage medium enable the fused GBDT and LR model to be trained on financial data without directly aggregating the data.

Description

基于联邦学习的GBDT与LR融合方法、装置、设备和存储介质
本申请要求于2020年10月29日提交中国专利局、申请号为202011182203.9,发明名称为“基于联邦学习的GBDT与LR融合方法、装置、设备和存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及模型托管的技术领域,特别涉及一种基于联邦学习的GBDT与LR融合方法、装置、设备和存储介质。
背景技术
在金融场景下,经常涉及到一些风控模型的构建,并且由于业界需要可解释性高的模型,因此常使用简单有效的逻辑回归进行分类问题的处理。但逻辑回归是一个线性模型,并不能捕捉到非线性信息,需要大量特征工程,耗费人力物力,而GBDT(Gradient Boost Decision Tree,梯度提升树)正好可以用来发觉有区分度的特征、特征组合,减少特征工程中人力成本。但相应地,GBDT是一种集成方法,因此它的解释性较低。GBDT与LR(Logistic Regression,广义线性模型)的融合模型恰好结合了两者的优点,先采用GBDT来发掘有区分度的特征以及组合特征,进而使用LR构建解释性高的模型。
发明人意识到,现有的GBDT与LR的融合模型都是建立在开源数据的基础上进行模型的训练。而如今对金融行业的管控越来越严格,金融数据无法被直接聚合来进行机器学习模型训练。
技术问题
本申请的主要目的为提供一种基于联邦学习的GBDT与LR融合方法、装置、设备和存储介质,旨在解决金融数据无法直接聚合进行GBDT和LR模型的融合模型训练的技术问题。
技术解决方案
为实现上述目的,本申请提供了一种基于联邦学习的GBDT与LR融合方法,应用于主动方,包括以下步骤:
计算各个第一样本的梯度,将所述梯度经过加密后传给被动方,其中,所述第一样本具有标签;
获取被动方经过加密后的梯度和组;其中,所述梯度和组是通过所述被动方将各个第二样本按照属性进行分组后,计算各个分组的梯度和所得到的梯度和组;所述第一样本和所述第二样本对应相同的用户,所述第二样本不具有标签;
对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方;
获取所述被动方被划分为左结点或右结点的样本空间;其中,所述样本空间是通过所述被动方将所述第二样本根据所述划分值进行划分,所得到左结点或右结点对应的样本空间;
根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构;
根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型。
本申请还提供了另一种基于联邦学习的GBDT与LR融合方法,应用于被动方,包括以下步骤:
获取所述被动方的各个第二样本的年龄特征值;
对所述第二样本根据所述年龄特征值进行排序;
对所述第二样本根据所述排序按照预设分位数进行划分,得到各个分组;
计算各个分组的梯度和,得到所述梯度和组;
将所述梯度和组经过加密后传给主动方;
获取所述主动方的划分值;其中,所述划分值是所述主动方对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分所得到的;
将属于所述划分值的所述第二样本划分在左结点,将不属于所述划分值的所述第二样本划分在右结点,得到所述左结点或所述右结点对应的样本空间;
将所述左结点或所述右结点对应的样本空间传递给所述主动方。
本申请还提供了一种基于联邦学习的GBDT与LR融合装置,包括:
计算单元,用于计算各个第一样本的梯度,将所述梯度经过加密后传给被动方,其中,所述第一样本具有标签;
第一获取单元,用于获取被动方经过加密后的梯度和组;其中,所述梯度和组是通过所述被动方将各个第二样本按照属性进行分组后,计算各个分组的梯度和所得到的梯度和组;所述第一样本和所述第二样本对应相同的用户,所述第二样本不具有标签;
解密单元,用于对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方;
第二获取单元,用于获取所述被动方被划分为左结点或右结点的样本空间;其中,所述样本空间是通过所述被动方将所述第二样本根据所述划分值进行划分,所得到左结点或右结点对应的样本空间;
分裂单元,用于根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构;
构建单元,用于根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型。
本申请还提供了另一种基于联邦学习的GBDT与LR融合装置,包括:
第三获取单元,用于获取所述被动方的各个第二样本的年龄特征值;
排序单元,用于对所述第二样本根据所述年龄特征值进行排序;
分组单元,用于对所述第二样本根据所述排序按照预设分位数进行划分,得到各个分组;
计算单元,用于计算各个分组的梯度和,得到所述梯度和组;
加密单元,用于将所述梯度和组经过加密后传给主动方;
第四获取单元,用于获取所述主动方的划分值;其中,所述划分值是所述主动方对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分所得到的;
划分单元,用于将属于所述划分值的所述第二样本划分在左结点,将不属于所述划分值的所述第二样本划分在右结点,得到所述左结点或所述右结点对应的样本空间;
传递单元,用于将所述左结点或所述右结点对应的样本空间传递给所述主动方。
本申请还提供一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机程序,所述处理器执行所述计算机程序时实现一种基于联邦学习的GBDT与LR融合方法的步骤:
计算各个第一样本的梯度,将所述梯度经过加密后传给被动方,其中,所述第一样本具有标签;
获取被动方经过加密后的梯度和组;其中,所述梯度和组是通过所述被动方将各个第二样本按照属性进行分组后,计算各个分组的梯度和所得到的梯度和组;所述第一样本和所述第二样本对应相同的用户,所述第二样本不具有标签;
对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方;
获取所述被动方被划分为左结点或右结点的样本空间;其中,所述样本空间是通过所 述被动方将所述第二样本根据所述划分值进行划分,所得到左结点或右结点对应的样本空间;
根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构;
根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型。
本申请还提供一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机程序,所述处理器执行所述计算机程序时实现一种基于联邦学习的GBDT与LR融合方法的步骤:
获取所述被动方的各个第二样本的年龄特征值;
对所述第二样本根据所述年龄特征值进行排序;
对所述第二样本根据所述排序按照预设分位数进行划分,得到各个分组;
计算各个分组的梯度和,得到所述梯度和组;
将所述梯度和组经过加密后传给主动方;
获取所述主动方的划分值;其中,所述划分值是所述主动方对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分所得到的;
将属于所述划分值的所述第二样本划分在左结点,将不属于所述划分值的所述第二样本划分在右结点,得到所述左结点或所述右结点对应的样本空间;
将所述左结点或所述右结点对应的样本空间传递给所述主动方。
本申请还提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现一种基于联邦学习的GBDT与LR融合方法的步骤:
计算各个第一样本的梯度,将所述梯度经过加密后传给被动方,其中,所述第一样本具有标签;
获取被动方经过加密后的梯度和组;其中,所述梯度和组是通过所述被动方将各个第二样本按照属性进行分组后,计算各个分组的梯度和所得到的梯度和组;所述第一样本和所述第二样本对应相同的用户,所述第二样本不具有标签;
对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方;
获取所述被动方被划分为左结点或右结点的样本空间;其中,所述样本空间是通过所述被动方将所述第二样本根据所述划分值进行划分,所得到左结点或右结点对应的样本空间;
根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构;
根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型。
本申请还提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现一种基于联邦学习的GBDT与LR融合方法的步骤:
获取所述被动方的各个第二样本的年龄特征值;
对所述第二样本根据所述年龄特征值进行排序;
对所述第二样本根据所述排序按照预设分位数进行划分,得到各个分组;
计算各个分组的梯度和,得到所述梯度和组;
将所述梯度和组经过加密后传给主动方;
获取所述主动方的划分值;其中,所述划分值是所述主动方对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分所得到的;
将属于所述划分值的所述第二样本划分在左结点,将不属于所述划分值的所述第二样本划分在右结点,得到所述左结点或所述右结点对应的样本空间;
将所述左结点或所述右结点对应的样本空间传递给所述主动方。
有益效果
本申请提供的基于联邦学习的GBDT与LR融合方法、装置、设备和存储介质,能在不 直接将数据聚合进行模型训练的情况下构建一个可解释性高且效果较好的联邦模型。该方法在训练LR模型时,直接根据GBDT模型的树结构构建特征矩阵,无需繁琐的特征构造,就能得到解释性高的LR模型。该方法仅仅在构建GBDT模型时需要传输梯度等,LR模型的构建直接在主动方训练得到,因此时间效率基本取决于GBDT的模型效率,不会提升时间复杂度。同时,主动方与被动方之间并不需要知道对方的数据信息,各自的金融数据不会被其他方知晓,使得金融数据也可以用于机器学习模型训练。
附图说明
图1是本申请一实施例的实施环境示意图;
图2是本申请一实施例中基于联邦学习的GBDT与LR融合方法步骤示意图;
图3是本申请另一实施例中另一基于联邦学习的GBDT与LR融合方法步骤示意图;
图4是本申请一实施例中基于联邦学习的GBDT与LR融合装置结构框图;
图5是本申请一实施例中另一基于联邦学习的GBDT与LR融合装置结构框图;
图6为本申请一实施例的计算机设备的结构示意框图。
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。
本发明的最佳实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
参见图1,联邦学习下,主动方具有第一终端1,被动方至少具有一个第二终端2,所述第一终端1和所述第二终端2之间可通过网络进行数据通信;其中,主动方和被动方拥有相同用户,第一终端1具有样本数据X1和标签数据Y,第二终端2具有样本数据X2、X3……XN。第一终端1以及第二终端2均可以包括一个独立运行的服务器,或者分布式服务器,或者由多个服务器组成的服务器集群,其中服务器可以是云端服务器。联邦学习可以在不直接访问样本数据的情况下构建机器学习系统,样本数据保留在原始位置,有助于确保隐私并降低通信成本。
具体的，第一终端1计算第一样本的梯度并加密传递给第二终端2，第二终端2对第二样本进行分组并计算分组的梯度和形成梯度和组传递给第一终端1，第一终端1对梯度和组解密并根据解密后的梯度和选取最优特征划分，将最优特征的划分值传递给第二终端2，第二终端2根据划分值对第二样本进行划分，将划分完成后的左结点或右结点的样本空间传递给第一终端1，第一终端1根据样本空间对第一样本进行分裂，得到GBDT模型，根据GBDT模型的树结构构建特征矩阵，进行逻辑回归的训练，得到LR模型。
参照图2,本申请一实施例提供了一种基于联邦学习的GBDT与LR融合方法,应用于主动方,包括以下步骤:
步骤S1，计算各个第一样本的梯度，将所述梯度经过加密后传给被动方，其中，所述第一样本具有标签；
步骤S2,获取被动方经过加密后的梯度和组;其中,所述梯度和组是通过所述被动方将各个第二样本按照属性进行分组后,计算各个分组的梯度和所得到的梯度和组;所述第一样本和所述第二样本对应相同的用户,所述第二样本不具有标签;
步骤S3,对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方;
步骤S4,获取所述被动方被划分为左结点或右结点的样本空间;其中,所述样本空间是通过所述被动方将所述第二样本根据所述划分值进行划分,所得到左结点或右结点对应的样本空间;
步骤S5,根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构;
步骤S6,根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型。
本实施例中所提出的基于联邦学习的GBDT与LR融合方法应用在主动方,主动方和被动方拥有同一批用户,但主动方和被动方所拥有的用户信息不同,例如甲银行有用户:A,B,C;乙银行有用户A,B,C;这三位用户,甲银行有用户的电话、借贷情况是否良好的标签,乙银行有用户的年龄、性别等信息,将具有标签的一方称为主动方,如甲银行,将仅有特征、没有标签的一方称为被动方,如乙银行,在其他实施例中,被动方可能具有标签,当被动方具有标签时,以主动方的标签为主,被动方的标签不参与训练。具体的,被动方可包括多家银行。如上述步骤S1所述,计算主动方中各个第一样本的梯度,具体的,第t次迭代时,计算得到的梯度为:
$$g_i=\frac{\partial\, l\big(y_i,\hat{y}_i^{(t-1)}\big)}{\partial\,\hat{y}_i^{(t-1)}}\tag{1}$$
$$h_i=\frac{\partial^{2} l\big(y_i,\hat{y}_i^{(t-1)}\big)}{\partial\big(\hat{y}_i^{(t-1)}\big)^{2}}\tag{2}$$
其中，$h_i$ 为 $g_i$ 的二阶导数，即损失函数 $l$ 对上一轮预测值 $\hat{y}_i^{(t-1)}$ 的二阶导；$y_i$ 为第 $i$ 个第一样本的标签值。将所述梯度加密后传给被动方，具体的，可通过加法同态加密算法进行加密后传给被动方。
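For a concrete instantiation, if the loss l in formulas (1) and (2) is taken to be the logistic loss, the gradients reduce to the familiar sigmoid expressions sketched below (this choice of loss is an assumption; the patent leaves l unspecified):

```python
import math

def logloss_gradients(y, y_hat_prev):
    """First- and second-order gradients of the logistic loss evaluated at
    the previous round's raw predictions y_hat_prev, i.e. one common choice
    for the g_i and h_i of formulas (1)-(2)."""
    g, h = [], []
    for yi, fi in zip(y, y_hat_prev):
        p = 1.0 / (1.0 + math.exp(-fi))   # sigmoid of the raw score
        g.append(p - yi)                  # g_i: first derivative
        h.append(p * (1.0 - p))           # h_i: second derivative
    return g, h
```

These per-sample values are what the active party would encrypt with an additive homomorphic scheme before sending them to the passive party.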
如上述步骤S2所述,获取被动方各个分组经过加密后的梯度和组。被动方对第二样本进行分组,具体的,可基于第二样本中的一些特征属性进行分组,如性别、年龄等,根据上述公式(1)和(2)计算各个分组中第二样本的梯度,再计算每个分组的梯度和,将各个分组的梯度和加密后传给主动方。在另一实施例中,联合多方银行进行训练时,被动方包括多个银行,每个银行的第二样本进行分组,并计算分组的梯度和,然后形成梯度和组,再将梯度和组加密后传给主动方。
如上述步骤S3所述,由于GBDT模型的基学习器采用的是树模型,在主动方对接收到的梯度和进行解密后,树模型根据解密得到的梯度和进行分裂,每次分裂的时候将一个结点分为左结点和右结点,每个第一样本在每棵树中都会落入唯一的叶子结点中。选取最优特征划分,将最优特征划分相应的划分值传给被动方,最优特征划分表示着树模型分裂的最佳方式,划分值为一个超参,即是在开始训练之前设置值的参数,而不是通过训练得到的参数数据。
如上述步骤S4所述,被动方接收到特征列和划分值后,根据该划分值对样本进行划分,即将特征列的值在划分值区间内的样本划分在左结点,其余的划分在右结点,并将划分为左结点或右结点的样本空间返回给主动方,主动方获取被动方被划分为左结点或右结点的样本空间。
如上述步骤S5所述，主动方接收到划分为左结点的样本空间后，即可知道哪些第二样本落入左结点，因此可对第一样本做同样的分裂，进行左右结点的划分，最终达到相应的阈值构造叶子结点，得到GBDT模型的树结构。
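A minimal sketch of this division step on the active side, assuming the sample IDs are aligned between the two parties (the function name is illustrative):

```python
def split_by_sample_space(sample_ids, left_space):
    """Divide the active party's first samples according to the left-node
    sample space received from the passive party: samples in the space go
    to the left node, the rest go to the right node."""
    left_set = set(left_space)
    left = [s for s in sample_ids if s in left_set]
    right = [s for s in sample_ids if s not in left_set]
    return left, right
```

Applying this recursively per split, until the stopping threshold is reached, yields the tree structure of the GBDT model without the active party ever seeing the passive party's feature values.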
如上述步骤S6所述,由于主动方已知GBDT模型的树结构以及落入叶子结点的样本空间,因此主动方对第一样本进行赋值,构造稀疏矩阵,将该稀疏矩阵作为特征矩阵,进行逻辑回归的训练,得到LR模型。
本实施例中,基于联邦学习的GBDT与LR融合方法能在不直接将数据聚合进行模型训练的情况下构建一个可解释性高且效果较好的联邦模型。该方法在训练LR模型时,直接根据GBDT模型的树结构构建特征矩阵,无需繁琐的特征构造,就能得到解释性高的LR模型。该方法仅仅在构建GBDT模型时需要传输梯度等,LR模型的构建直接在主动方训练得到,因此时间效率基本取决于GBDT的模型效率,不会提升时间复杂度。同时,主动方与被动方之间并不需要知道对方的数据信息,各自的金融数据不会被其他方知晓,使得金融数据 也可以用于机器学习模型训练。具体的,本方法在联合多方银行训练风控模型时,并未泄露各方银行的数据信息,因此该方法能有效的在保障数据安全的情形下,尽可能的利用多方数据来完成对个人的风险评估,使得银行可以有效识别出在其它银行信贷情况不良的用户。
在一实施例中,所述对所述梯度和进行解密,根据所述梯度和选取最优特征划分,将所述最优特征划分对应的特征列及划分值传给被动方的步骤S3,包括:
对所述梯度和组进行解密;
根据解密后的所述梯度和计算所述第一样本的增益;
根据所述增益选取最优特征划分;
将所述最优特征划分所对应的划分值传递给所述被动方。
本实施例中,主动方对梯度和组进行解密,得到各个分组的梯度和,根据所述梯度和计算主动方中第一样本的增益。
在一实施例中,所述根据解密后的所述梯度和计算所述第一样本的增益的步骤,包括:
通过公式
$$G=\frac{1}{2}\left[\frac{g_l^{2}}{h_l+\lambda}+\frac{g_r^{2}}{h_r+\lambda}-\frac{g^{2}}{h+\lambda}\right]$$
计算所述第一样本的增益,其中,所述g l、h l为分裂为左结点中第一样本的梯度信息,所述g r、h r为分裂为右结点中第一样本的梯度信息,所述g、h为当前所述第一样本的梯度信息,所述λ为公式G的参数。本实施例中,通过上述公式计算第一样本的增益,增益的值能表征树模型分裂的优劣,当增益越大,则表明树模型分裂得越好。
根据增益选取最优特征划分,选择增益值最大时的所对应的树模型,将最优特征划分所对应的特征列和划分值传给被动方,使得被动方知晓主动方的分裂。
在一实施例中,所述根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构的步骤S5,包括:
根据所述样本空间,将所述样本空间内的所述第二样本相对应的所述第一样本划分为左结点;
将剩余所述第一样本划分为右结点,得到所述GBDT模型对应的树结构。
本实施例中,主动方知晓被动方的划分为左结点的样本空间后,将处于左结点的第二样本所对应的第一样本划分为左结点,剩余第一样本划分为右结点,得到GBDT模型的树结构,完成GBDT模型的训练。
在一实施例中,所述根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型的步骤S6,包括:
对所述树结构中的叶子结点作one-hot编码；
根据所述one-hot编码对所述第一样本进行赋值,构造稀疏矩阵;
将所述稀疏矩阵作为特征矩阵,进行逻辑回归的训练,得到所述LR模型。
本实施例中,训练结束后的GBDT模型产生了多棵树,每个第一样本在每棵树里都会落入唯一的叶子结点,将每个叶子结点都视为一个特征,对叶子结点做one-hot编码,根据one-hot编码对第一样本进行赋值,每一个第一样本得到相应的特征向量,所有的第一样本最终会得到一个稀疏矩阵,并且每一列代表该叶子结点的含义,将这一稀疏矩阵放入逻辑回归进行训练,完成LR模型的构建。本实施例中,LR模型的构建过程中没有进行传输,因此构建过程十分高效,不会提升时间复杂度。
参见图3,本申请一实施例提供了另一种基于联邦学习的GBDT与LR融合方法,应用在被动方,包括以下步骤:
步骤S10,获取所述被动方的各个第二样本的年龄特征值;
步骤S20,对所述第二样本根据所述年龄特征值进行排序;
步骤S30,对所述第二样本根据所述排序按照预设分位数进行划分,得到各个分组;
步骤S40,计算各个分组的梯度和,得到所述梯度和组;
步骤S50,将所述梯度和组经过加密后传给主动方;
步骤S60,获取所述主动方的划分值;其中,所述划分值是所述主动方对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分所得到的;
步骤S70,将属于所述划分值的所述第二样本划分在左结点,将不属于所述划分值的所述第二样本划分在右结点,得到所述左结点或所述右结点对应的样本空间;
步骤S80,将所述左结点或所述右结点对应的样本空间传递给所述主动方。
本实施例中所提出的基于联邦学习的GBDT与LR融合方法,应用于被动方,如上述步骤S10-S20所述,被动方的第二样本具有年龄这一特征,通过被动方获取第二样本的年龄特征值,根据年龄特征值进行排序,即按照年龄大小从小到大进行排序。
如上述步骤S30所述,根据所述排序对各个第二样本进行分组,具体的,预先设定一个分位数,如四分位、五分位,如采用四分位,则是将排序完成的若干第二样本分为四个等分,得到四个分组。
如上述步骤S40-S50所述,通过被动方将各个分组内的第二样本的梯度加和计算得到各个分组的梯度和,得到梯度和组。具体的,将每个分组的梯度和经过加法同态加密算法加密后传给主动方。通过预设分位数,根据年龄特征值对各个第二样本进行分组,每个分组内的第二样本数相同,通过分位数进行分组,可以直接的了解到第二样本从小至大排列之后小于某值的第二样本数占总第二样本数的比例。
如上述步骤S60-S80所述,被动方收到主动方最优特征划分的划分值后,将属于该划分值的第二样本划分在左结点,不属于该划分值的第二样本划分到右结点,将划分为左结点或右结点的样本空间传递给主动方,使得主动方可以知晓被动方的样本空间,便于主动方根据被动方的一半的样本空间就可以进行分裂,得到GBDT模型。
本申请提供的基于联邦学习的GBDT与LR融合方法可运用在区块链领域中,将训练好的GBDT模型和LR模型存储在区块链网络中,区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层。
区块链底层平台可以包括用户管理、基础服务、智能合约以及运营监控等处理模块。其中,用户管理模块负责所有区块链参与者的身份信息管理,包括维护公私钥生成(账户管理)、密钥管理以及用户真实身份和区块链地址对应关系维护(权限管理)等,并且在授权的情况下,监管和审计某些真实身份的交易情况,提供风险控制的规则配置(风控审计);基础服务模块部署在所有区块链节点设备上,用来验证业务请求的有效性,并对有效请求完成共识后记录到存储上,对于一个新的业务请求,基础服务先对接口适配解析和鉴权处理(接口适配),然后通过共识算法将业务信息加密(共识管理),在加密之后完整一致的传输至共享账本上(网络通信),并进行记录存储;智能合约模块负责合约的注册发行以及合约触发和合约执行,开发人员可以通过某种编程语言定义合约逻辑,发布到区块链上(合约注册),根据合约条款的逻辑,调用密钥或者其它的事件触发执行,完成合约逻辑,同时还提供对合约升级注销的功能;运营监控模块主要负责产品发布过程中的部署、配置的修改、合约设置、云适配以及产品运行中的实时状态的可视化输出,例如:告警、监控网络情况、监控节点设备健康状态等。
参见图4,本申请一实施例提供了一种基于联邦学习的GBDT与LR融合装置,包括:
计算单元10,用于计算各个第一样本的梯度,将所述梯度经过加密后传给被动方,其中,所述第一样本具有标签;
第一获取单元20,用于获取被动方经过加密后的梯度和组;其中,所述梯度和组是通 过所述被动方将各个第二样本按照属性进行分组后,计算各个分组的梯度和所得到的梯度和组;所述第一样本和所述第二样本对应相同的用户,所述第二样本不具有标签;
解密单元30,用于对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方;
第二获取单元40,用于获取所述被动方被划分为左结点或右结点的样本空间;其中,所述样本空间是通过所述被动方将所述第二样本根据所述划分值进行划分,所得到左结点或右结点对应的样本空间;
分裂单元50,用于根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构;
构建单元60,用于根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型。
在一实施例中,所述解密单元30,包括:
解密子单元,用于对所述梯度和组进行解密;
第二计算子单元,用于根据解密后的所述梯度和计算所述第一样本的增益;
选取子单元,用于根据所述增益选取最优特征划分;
传递子单元,用于将所述最优特征划分所对应的特征列及划分值传递给所述被动方。
在一实施例中,所述第二计算子单元,包括:
计算模块,用于通过公式
$$G=\frac{1}{2}\left[\frac{g_l^{2}}{h_l+\lambda}+\frac{g_r^{2}}{h_r+\lambda}-\frac{g^{2}}{h+\lambda}\right]$$
计算所述第一样本的增益,其中,所述g l、h l为分裂为左结点中第一样本的梯度信息,所述g r、h r为分裂为右结点中第一样本的梯度信息,所述g、h为当前所述第一样本的梯度信息,所述λ为公式G的参数。
在一实施例中,所述分裂单元50,包括:
第三划分子单元,用于根据所述样本空间,将与所述样本空间内的所述第二样本相对应的所述第一样本划分为左结点;
第四划分子单元,用于将剩余所述第一样本划分为右结点,得到所述GBDT模型对应的树结构。
在一实施例中,所述构建单元60,包括:
编码单元,用于对所述树结构中的叶子结点作one-hot编码;
赋值单元,用于根据所述one-hot编码对所述第一样本进行赋值,构造稀疏矩阵;
训练子单元,用于将所述稀疏矩阵作为特征矩阵,进行逻辑回归的训练,得到所述LR模型。
参见图5,本申请一实施例提供了另一种基于联邦学习的GBDT与LR融合装置,包括:
第三获取单元1A,用于获取所述被动方的各个第二样本的年龄特征值;
排序单元1B,用于对所述第二样本根据所述年龄特征值进行排序;
分组单元1C,用于对所述第二样本根据所述排序按照预设分位数进行划分,得到各个分组;
计算单元1D,用于计算各个分组的梯度和,得到所述梯度和组;
加密单元1E,用于将所述梯度和组经过加密后传给主动方;
第四获取单元1F,用于获取所述主动方的划分值;其中,所述划分值是所述主动方对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分所得到的;
划分单元1G,用于将属于所述划分值的所述第二样本划分在左结点,将不属于所述划分值的所述第二样本划分在右结点,得到所述左结点或所述右结点对应的样本空间;
传递单元1H,用于将所述左结点或所述右结点对应的样本空间传递给所述主动方。
在本实施例中,上述各个单元、子单元的具体实现请参照上述方法实施例中所述,在此不再进行赘述。
参照图6,本申请实施例中还提供一种计算机设备,该计算机设备可以是服务器,其内部结构可以如图6所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设计的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的数据库用于存储第一样本数据、第二样本数据等。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种基于联邦学习的GBDT与LR融合方法。
本领域技术人员可以理解,图6中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定。
本申请一实施例还提供一种计算机可读存储介质,上述存储介质可以是非易失性存储介质,也可以是易失性存储介质。其上存储有计算机程序,计算机程序被处理器执行时实现一种基于联邦学习的GBDT与LR融合方法。
综上所述,为本申请实施例中提供的基于联邦学习的GBDT与LR融合方法、装置、设备和存储介质,主动方计算各个第一样本的梯度,将所述梯度经过加密后传给被动方,其中,所述第一样本具有标签;获取被动方经过加密后的梯度和组;其中,所述梯度和组是通过所述被动方将各个第二样本按照属性进行分组后,计算各个分组的梯度和所得到的梯度和组;所述第一样本和所述第二样本对应相同的用户,所述第二样本不具有标签;对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方;获取所述被动方被划分为左结点或右结点的样本空间;其中,所述样本空间是通过所述被动方将所述第二样本根据所述划分值进行划分,所得到左结点或右结点对应的样本空间;根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构;根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型。基于联邦学习的GBDT与LR融合方法能在不直接将数据聚合进行模型训练的情况下构建一个可解释性高且效果较好的联邦模型。该方法在训练LR模型时,直接根据GBDT模型的树结构构建特征矩阵,无需繁琐的特征构造,就能得到解释性高的LR模型。该方法仅仅在构建GBDT模型时需要传输梯度等,LR模型的构建直接在主动方训练得到,因此时间效率基本取决于GBDT的模型效率,不会提升时间复杂度。同时,主动方与被动方之间并不需要知道对方的数据信息,各自的金融数据不会被其他方知晓,使得金融数据也可以用于机器学习模型训练。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储与一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的和实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可以包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM通过多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双速据率SDRAM(SSRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其它变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、装置、物品或者方法不仅包括那些要素,而且还包括没有明确列出的其它要素,或者是还包括为这种过程、装置、物品或者方法所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包 括该要素的过程、装置、物品或者方法中还存在另外的相同要素。
以上所述仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其它相关的技术领域,均同理包括在本申请的专利保护范围内。

Claims (20)

  1. 一种基于联邦学习的GBDT与LR融合方法,应用于主动方,其中,包括以下步骤:
    计算各个第一样本的梯度,将所述梯度经过加密后传给被动方,其中,所述第一样本具有标签;
    获取被动方经过加密后的梯度和组;其中,所述梯度和组是通过所述被动方将各个第二样本按照属性进行分组后,计算各个分组的梯度和所得到的梯度和组;所述第一样本和所述第二样本对应相同的用户,所述第二样本不具有标签;
    对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方;
    获取所述被动方被划分为左结点或右结点的样本空间;其中,所述样本空间是通过所述被动方将所述第二样本根据所述划分值进行划分,所得到左结点或右结点对应的样本空间;
    根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构;
    根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型。
  2. 根据权利要求1所述的基于联邦学习的GBDT与LR融合方法,其中,所述对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方的步骤,包括:
    对所述梯度和组进行解密;
    根据解密后的所述梯度和计算所述第一样本的增益;
    根据所述增益选取最优特征划分;
    将所述最优特征划分所对应的划分值传递给所述被动方。
  3. 根据权利要求2所述的基于联邦学习的GBDT与LR融合方法,其中,所述根据解密后的所述梯度和计算所述第一样本的增益的步骤,包括:
    通过公式
    $$G=\frac{1}{2}\left[\frac{g_l^{2}}{h_l+\lambda}+\frac{g_r^{2}}{h_r+\lambda}-\frac{g^{2}}{h+\lambda}\right]$$
    计算所述第一样本的增益,其中,所述g l、h l为分裂为左结点中第一样本的梯度信息,所述g r、h r为分裂为右结点中第一样本的梯度信息,所述g、h为当前所述第一样本的梯度信息,所述λ为公式G的参数。
  4. 根据权利要求1所述的基于联邦学习的GBDT与LR融合方法,其中,所述根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构的步骤,包括:
    根据所述样本空间,将所述样本空间内的所述第二样本相对应的所述第一样本划分为左结点;
    将剩余所述第一样本划分为右结点,得到所述GBDT模型对应的树结构。
  5. 根据权利要求1所述的基于联邦学习的GBDT与LR融合方法,其中,所述根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型的步骤,包括:
    对所述树结构中的叶子结点作one-hot编码;
    根据所述one-hot编码对所述第一样本进行赋值,构造稀疏矩阵;
    将所述稀疏矩阵作为特征矩阵,进行逻辑回归的训练,得到所述LR模型。
  6. 一种基于联邦学习的GBDT与LR融合方法,应用于被动方,其中,包括以下步骤:
    获取所述被动方的各个第二样本的年龄特征值;
    对所述第二样本根据所述年龄特征值进行排序;
    对所述第二样本根据所述排序按照预设分位数进行划分,得到各个分组;
    计算各个分组的梯度和,得到所述梯度和组;
    将所述梯度和组经过加密后传给主动方;
    获取所述主动方的划分值;其中,所述划分值是所述主动方对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分所得到的;
    将属于所述划分值的所述第二样本划分在左结点,将不属于所述划分值的所述第二样本划分在右结点,得到所述左结点或所述右结点对应的样本空间;
    将所述左结点或所述右结点对应的样本空间传递给所述主动方。
  7. 一种基于联邦学习的GBDT与LR融合装置,其中,包括:
    计算单元,用于计算各个第一样本的梯度,将所述梯度经过加密后传给被动方,其中,所述第一样本具有标签;
    第一获取单元,用于获取被动方经过加密后的梯度和组;其中,所述梯度和组是通过所述被动方将各个第二样本按照属性进行分组后,计算各个分组的梯度和所得到的梯度和组;所述第一样本和所述第二样本对应相同的用户,所述第二样本不具有标签;
    解密单元,用于对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方;
    第二获取单元,用于获取所述被动方被划分为左结点或右结点的样本空间;其中,所述样本空间是通过所述被动方将所述第二样本根据所述划分值进行划分,所得到左结点或右结点对应的样本空间;
    分裂单元,用于根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构;
    构建单元,用于根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型。
  8. 一种基于联邦学习的GBDT与LR融合装置,其中,包括:
    第三获取单元,用于获取所述被动方的各个第二样本的年龄特征值;
    排序单元,用于对所述第二样本根据所述年龄特征值进行排序;
    分组单元,用于对所述第二样本根据所述排序按照预设分位数进行划分,得到各个分组;
    计算单元,用于计算各个分组的梯度和,得到所述梯度和组;
    加密单元,用于将所述梯度和组经过加密后传给主动方;
    第四获取单元,用于获取所述主动方的划分值;其中,所述划分值是所述主动方对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分所得到的;
    划分单元,用于将属于所述划分值的所述第二样本划分在左结点,将不属于所述划分值的所述第二样本划分在右结点,得到所述左结点或所述右结点对应的样本空间;
    传递单元,用于将所述左结点或所述右结点对应的样本空间传递给所述主动方。
  9. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机程序,其中,所述处理器执行所述计算机程序时实现一种基于联邦学习的GBDT与LR融合方法的步骤:
    计算各个第一样本的梯度,将所述梯度经过加密后传给被动方,其中,所述第一样本具有标签;
    获取被动方经过加密后的梯度和组;其中,所述梯度和组是通过所述被动方将各个第二样本按照属性进行分组后,计算各个分组的梯度和所得到的梯度和组;所述第一样本和所述第二样本对应相同的用户,所述第二样本不具有标签;
    对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方;
    获取所述被动方被划分为左结点或右结点的样本空间;其中,所述样本空间是通过所述被动方将所述第二样本根据所述划分值进行划分,所得到左结点或右结点对应的样本空间;
    根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构;
    根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型。
  10. 根据权利要求9所述的计算机设备,其中,所述对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方的步骤,包括:
    对所述梯度和组进行解密;
    根据解密后的所述梯度和计算所述第一样本的增益;
    根据所述增益选取最优特征划分;
    将所述最优特征划分所对应的划分值传递给所述被动方。
  11. 根据权利要求10所述的计算机设备,其中,所述根据解密后的所述梯度和计算所述第一样本的增益的步骤,包括:
    通过公式
    $$G=\frac{1}{2}\left[\frac{g_l^{2}}{h_l+\lambda}+\frac{g_r^{2}}{h_r+\lambda}-\frac{g^{2}}{h+\lambda}\right]$$
    计算所述第一样本的增益,其中,所述g l、h l为分裂为左结点中第一样本的梯度信息,所述g r、h r为分裂为右结点中第一样本的梯度信息,所述g、h为当前所述第一样本的梯度信息,所述λ为公式G的参数。
  12. 根据权利要求9所述的计算机设备,其中,所述根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构的步骤,包括:
    根据所述样本空间,将所述样本空间内的所述第二样本相对应的所述第一样本划分为左结点;
    将剩余所述第一样本划分为右结点,得到所述GBDT模型对应的树结构。
  13. 根据权利要求9所述的计算机设备,其中,所述根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型的步骤,包括:
    对所述树结构中的叶子结点作one-hot编码;
    根据所述one-hot编码对所述第一样本进行赋值,构造稀疏矩阵;
    将所述稀疏矩阵作为特征矩阵,进行逻辑回归的训练,得到所述LR模型。
  14. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机程序,其中,所述处理器执行所述计算机程序时实现一种基于联邦学习的GBDT与LR融合方法的步骤:
    获取所述被动方的各个第二样本的年龄特征值;
    对所述第二样本根据所述年龄特征值进行排序;
    对所述第二样本根据所述排序按照预设分位数进行划分,得到各个分组;
    计算各个分组的梯度和,得到所述梯度和组;
    将所述梯度和组经过加密后传给主动方;
    获取所述主动方的划分值;其中,所述划分值是所述主动方对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分所得到的;
    将属于所述划分值的所述第二样本划分在左结点,将不属于所述划分值的所述第二样本划分在右结点,得到所述左结点或所述右结点对应的样本空间;
    将所述左结点或所述右结点对应的样本空间传递给所述主动方。
  15. 一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现一种基于联邦学习的GBDT与LR融合方法的步骤:
    计算各个第一样本的梯度,将所述梯度经过加密后传给被动方,其中,所述第一样本具有标签;
    获取被动方经过加密后的梯度和组;其中,所述梯度和组是通过所述被动方将各个第二样本按照属性进行分组后,计算各个分组的梯度和所得到的梯度和组;所述第一样本和所述第二样本对应相同的用户,所述第二样本不具有标签;
    对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方;
    获取所述被动方被划分为左结点或右结点的样本空间;其中,所述样本空间是通过所 述被动方将所述第二样本根据所述划分值进行划分,所得到左结点或右结点对应的样本空间;
    根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构;
    根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型。
  16. 根据权利要求15所述的计算机可读存储介质,其中,所述对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分,将所述最优特征划分对应的划分值传给所述被动方的步骤,包括:
    对所述梯度和组进行解密;
    根据解密后的所述梯度和计算所述第一样本的增益;
    根据所述增益选取最优特征划分;
    将所述最优特征划分所对应的划分值传递给所述被动方。
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述根据解密后的所述梯度和计算所述第一样本的增益的步骤,包括:
    通过公式
    $$G=\frac{1}{2}\left[\frac{g_l^{2}}{h_l+\lambda}+\frac{g_r^{2}}{h_r+\lambda}-\frac{g^{2}}{h+\lambda}\right]$$
    计算所述第一样本的增益,其中,所述g l、h l为分裂为左结点中第一样本的梯度信息,所述g r、h r为分裂为右结点中第一样本的梯度信息,所述g、h为当前所述第一样本的梯度信息,所述λ为公式G的参数。
  18. 根据权利要求15所述的计算机可读存储介质,其中,所述根据所述样本空间对所述第一样本进行分裂,得到GBDT模型对应的树结构的步骤,包括:
    根据所述样本空间,将所述样本空间内的所述第二样本相对应的所述第一样本划分为左结点;
    将剩余所述第一样本划分为右结点,得到所述GBDT模型对应的树结构。
  19. 根据权利要求15所述的计算机可读存储介质,其中,所述根据所述树结构构建特征矩阵,进行逻辑回归的训练,得到LR模型的步骤,包括:
    对所述树结构中的叶子结点作one-hot编码;
    根据所述one-hot编码对所述第一样本进行赋值,构造稀疏矩阵;
    将所述稀疏矩阵作为特征矩阵,进行逻辑回归的训练,得到所述LR模型。
  20. 一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现一种基于联邦学习的GBDT与LR融合方法的步骤:
    获取所述被动方的各个第二样本的年龄特征值;
    对所述第二样本根据所述年龄特征值进行排序;
    对所述第二样本根据所述排序按照预设分位数进行划分,得到各个分组;
    计算各个分组的梯度和,得到所述梯度和组;
    将所述梯度和组经过加密后传给主动方;
    获取所述主动方的划分值;其中,所述划分值是所述主动方对所述梯度和组进行解密,根据解密后的所述梯度和选取最优特征划分所得到的;
    将属于所述划分值的所述第二样本划分在左结点,将不属于所述划分值的所述第二样本划分在右结点,得到所述左结点或所述右结点对应的样本空间;
    将所述左结点或所述右结点对应的样本空间传递给所述主动方。
PCT/CN2021/084670 2020-10-29 2021-03-31 基于联邦学习的gbdt与lr融合方法、装置、设备和存储介质 WO2022088606A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011182203.9A CN112288101A (zh) 2020-10-29 2020-10-29 基于联邦学习的gbdt与lr融合方法、装置、设备和存储介质
CN2020111822039 2020-10-29

Publications (1)

Publication Number Publication Date
WO2022088606A1 true WO2022088606A1 (zh) 2022-05-05

Family

ID=74353220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084670 WO2022088606A1 (zh) 2020-10-29 2021-03-31 基于联邦学习的gbdt与lr融合方法、装置、设备和存储介质

Country Status (2)

Country Link
CN (1) CN112288101A (zh)
WO (1) WO2022088606A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196776A (zh) * 2023-09-09 2023-12-08 广东德澳智慧医疗科技有限公司 一种基于随机梯度提升树算法的跨境电商产品信用标记与售后系统

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112288101A (zh) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 基于联邦学习的gbdt与lr融合方法、装置、设备和存储介质
CN113516253B (zh) * 2021-07-02 2022-04-05 深圳市洞见智慧科技有限公司 一种联邦学习中数据加密优化方法及装置
CN113435537B (zh) * 2021-07-16 2022-08-26 同盾控股有限公司 基于Soft GBDT的跨特征联邦学习方法、预测方法
CN113689948A (zh) * 2021-08-18 2021-11-23 深圳先进技术研究院 呼吸机机械通气的人机异步检测方法、装置和相关设备
CN113657614B (zh) * 2021-09-02 2024-03-01 京东科技信息技术有限公司 联邦学习模型的更新方法和装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299728A (zh) * 2018-08-10 2019-02-01 深圳前海微众银行股份有限公司 联邦学习方法、系统及可读存储介质
CN110210233A (zh) * 2019-04-19 2019-09-06 平安科技(深圳)有限公司 预测模型的联合构建方法、装置、存储介质及计算机设备
US20190340534A1 (en) * 2016-09-26 2019-11-07 Google Llc Communication Efficient Federated Learning
CN111738359A (zh) * 2020-07-24 2020-10-02 支付宝(杭州)信息技术有限公司 一种两方决策树训练方法和系统
CN112288101A (zh) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 基于联邦学习的gbdt与lr融合方法、装置、设备和存储介质


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Decrypting the Longitudinal GBDT Algorithm in Angel PowerFL Federated Learning Platform", TENCENT BIG DATA, 9 September 2020 (2020-09-09), XP055924728, Retrieved from the Internet <URL:https://cloud.tencent.com/developer/article/1694641> *
QINBIN LI; ZEYI WEN; BINGSHENG HE: "Practical Federated Gradient Boosting Decision Trees", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 11 November 2019 (2019-11-11), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081529993 *


Also Published As

Publication number Publication date
CN112288101A (zh) 2021-01-29

Similar Documents

Publication Publication Date Title
WO2022088606A1 (zh) 基于联邦学习的gbdt与lr融合方法、装置、设备和存储介质
CN111598186B (zh) 基于纵向联邦学习的决策模型训练方法、预测方法及装置
US20210409191A1 (en) Secure Machine Learning Analytics Using Homomorphic Encryption
US20220230071A1 (en) Method and device for constructing decision tree
Qi et al. Cpds: Enabling compressed and private data sharing for industrial Internet of Things over blockchain
Paul et al. Blockchain technology and its types—a short review
CN111784001B (zh) 一种模型训练方法、设备及计算机可读存储介质
CN112257873A (zh) 机器学习模型的训练方法、装置、系统、设备及存储介质
CN113159327A (zh) 基于联邦学习系统的模型训练方法、装置、电子设备
JP2021515271A (ja) コンピュータにより実施される投票処理およびシステム
CN111081337B (zh) 一种协同任务预测方法及计算机可读存储介质
CN111340453B (zh) 联邦学习开发方法、装置、设备及存储介质
CN113221153B (zh) 图神经网络训练方法、装置、计算设备及存储介质
CN112039702B (zh) 基于联邦学习和相互学习的模型参数训练方法及装置
CN111311211A (zh) 一种基于区块链的数据处理方法以及设备
CN112686393A (zh) 一种联邦学习系统
CN113449048A (zh) 数据标签分布确定方法、装置、计算机设备和存储介质
CN113591097A (zh) 业务数据处理方法、装置、电子设备及存储介质
Zhang et al. SABlockFL: a blockchain-based smart agent system architecture and its application in federated learning
Yang et al. Accountable and verifiable secure aggregation for federated learning in IoT networks
Romero-Tris et al. Distributed system for private web search with untrusted partners
Wang et al. Blockchain-Enabled Lightweight Fine-Grained Searchable Knowledge Sharing for Intelligent IoT
CN116506227B (zh) 数据处理方法、装置、计算机设备和存储介质
CN117436047A (zh) 验证码生成方法、装置、计算机设备和存储介质
Bergers et al. Dwh-dim: a blockchain based decentralized integrity verification model for data warehouses

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21884350

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21884350

Country of ref document: EP

Kind code of ref document: A1