CN112288101A - GBDT and LR fusion method, device, equipment and storage medium based on federal learning


Info

Publication number
CN112288101A
CN112288101A (application CN202011182203.9A)
Authority
CN
China
Prior art keywords
gradient
sample
group
gbdt
passive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011182203.9A
Other languages
Chinese (zh)
Inventor
Wang Jianzong (王健宗)
Xiao Jing (肖京)
He Anxun (何安珣)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011182203.9A priority Critical patent/CN112288101A/en
Publication of CN112288101A publication Critical patent/CN112288101A/en
Priority to PCT/CN2021/084670 priority patent/WO2022088606A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioethics (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Security & Cryptography (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a GBDT and LR fusion method, device, equipment and storage medium based on federated learning, wherein the method comprises the following steps: calculating the gradient of each first sample, encrypting the gradient and transmitting it to a passive party; acquiring the encrypted gradient sums of each group of the passive party; decrypting the gradient-sum groups, selecting the optimal feature division according to the gradients and transmitting the division value corresponding to the optimal feature division to the passive party; obtaining the sample space that the passive party divided into the left node or the right node; splitting the first samples according to the sample space to obtain the tree structure corresponding to the GBDT model; and constructing a feature matrix according to the tree structure and performing logistic regression training to obtain the LR model. With the GBDT and LR fusion method, device, equipment and storage medium based on federated learning, fusion model training of GBDT and LR models can be performed without directly aggregating financial data.

Description

GBDT and LR fusion method, device, equipment and storage medium based on federal learning
Technical Field
The application relates to the technical field of model hosting, in particular to a GBDT and LR fusion method, a device, equipment and a storage medium based on federal learning.
Background
In financial scenarios, risk-control models are often constructed. Because the industry needs models with high interpretability, simple and effective logistic regression is often used for classification problems. However, logistic regression is a linear model: it cannot capture nonlinear information, and it requires a large amount of feature engineering that consumes manpower and material resources. GBDT (Gradient Boosting Decision Tree), by contrast, can be used to find discriminative features and feature combinations, reducing the labor cost of feature engineering; but as an ensemble method, GBDT is correspondingly less interpretable. The fusion model of GBDT and LR (Logistic Regression) combines the advantages of both: GBDT is first used to discover discriminative features and feature combinations, and LR is then used to build a model with high interpretability.
Existing GBDT and LR fusion models are trained on aggregated open-source data. Nowadays, the financial industry is increasingly strictly regulated, and financial data cannot be directly aggregated for machine learning model training.
Disclosure of Invention
The application mainly aims to provide a GBDT and LR fusion method, a device, equipment and a storage medium based on federal learning, and aims to solve the technical problem that financial data cannot be directly aggregated to carry out fusion model training of GBDT and LR models.
In order to achieve the above object, the present application provides a GBDT and LR fusion method based on federal learning, which is applied to an active side, and includes the following steps:
calculating the gradient of each first sample, encrypting the gradient and transmitting the gradient to a passive party, wherein the first sample has a label;
acquiring the encrypted gradient-sum groups of the passive side; the gradient-sum groups are obtained by the passive party grouping each second sample according to attributes and calculating the gradient sum of each group; the first sample and the second sample correspond to the same user, and the second sample has no label;
decrypting the gradient and the group, selecting the optimal feature partition according to the decrypted gradient and transmitting a partition value corresponding to the optimal feature partition to the passive side;
obtaining a sample space of the passive side divided into a left node or a right node; the sample space is obtained by dividing the second sample according to the division value through the passive party, and the sample space corresponding to a left node or a right node is obtained;
splitting the first sample according to the sample space to obtain a tree structure corresponding to the GBDT model;
and constructing a characteristic matrix according to the tree structure, and carrying out logistic regression training to obtain an LR model.
Further, the step of decrypting the gradient and the group, selecting an optimal feature partition according to the decrypted gradient and transmitting a partition value corresponding to the optimal feature partition to the passive side includes:
decrypting the gradient sum group;
calculating a gain of the first sample from the decrypted gradient sum;
selecting optimal feature division according to the gain;
and transmitting the division value corresponding to the optimal feature division to the passive side.
Further, the step of calculating the gain of the first sample according to the decrypted gradient sum comprises:
by the formula
Figure BDA0002750481840000021
Calculating a gain of the first sample, wherein the gl、hlLadder split into first samples in left nodeDegree information, said gr、hrThe gradient information of the first sample split into the right node is shown, G and h are the gradient information of the current first sample, and λ is a parameter of formula G.
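The gain formula above can be sketched in a few lines of plain Python (function and variable names here are illustrative, not from the patent); the active party would evaluate it for every candidate division and keep the split with the largest gain:

```python
def split_gain(gl, hl, gr, hr, lam=1.0):
    """Gain of splitting a node into (left, right) children.

    gl/hl and gr/hr are the sums of first/second-order gradients in the
    left and right child; the parent's sums are gl+gr and hl+hr, and lam
    is the regularization parameter (lambda in the formula).
    """
    def score(g, h):
        return g * g / (h + lam)

    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr))

# A larger gain indicates a better split of the tree model.
gain = split_gain(gl=-4.0, hl=2.0, gr=6.0, hr=3.0, lam=1.0)
```

A division is worth making only when its gain is positive; comparing gains across all candidate divisions yields the "optimal feature division" of step S3.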
Further, the step of splitting the first sample according to the sample space to obtain a tree structure corresponding to a GBDT model includes:
according to the sample space, dividing the first samples corresponding to the second samples in the sample space into left nodes;
and dividing the rest of the first samples into right nodes to obtain a tree structure corresponding to the GBDT model.
Further, the step of constructing a feature matrix according to the tree structure and performing logistic regression training to obtain an LR model includes:
performing one-hot coding on leaf nodes in the tree structure;
assigning values to the first sample according to the one-hot codes to construct a sparse matrix;
and taking the sparse matrix as a characteristic matrix, and carrying out logistic regression training to obtain the LR model.
The application also provides another GBDT and LR fusion method based on federal learning, which is applied to a passive side and comprises the following steps:
acquiring age characteristic values of second samples of the passive party;
sorting the second sample according to the age characteristic value;
dividing the second samples according to the sequence and a preset quantile to obtain each group;
calculating the gradient sum of each group to obtain the gradient sum group;
the gradient and the group are encrypted and then transmitted to a master side;
acquiring a division value of the active side; the division value is obtained by decrypting the gradient and the group by the active party and dividing according to the decrypted gradient and the selected optimal characteristic;
dividing the second samples belonging to the division values into left nodes, and dividing the second samples not belonging to the division values into right nodes to obtain sample spaces corresponding to the left nodes or the right nodes;
and transmitting the sample space corresponding to the left node or the right node to the master.
The application also provides a GBDT and LR fusion device based on federal learning, including:
the calculating unit is used for calculating the gradient of each first sample, encrypting the gradient and transmitting the encrypted gradient to a passive party, wherein the first samples have labels;
the first acquisition unit is used for acquiring the encrypted gradient and group of the passive side; the gradient sum group is obtained by calculating the gradient sum of each group after each second sample is grouped according to attributes by the passive party; the first and second swatches correspond to the same user, the second swatch not having a label
The decryption unit is used for decrypting the gradient and the group, selecting the optimal feature partition according to the decrypted gradient and transmitting a partition value corresponding to the optimal feature partition to the passive side;
the second acquisition unit is used for acquiring a sample space of the passive side divided into a left node or a right node; the sample space is obtained by dividing the second sample according to the division value through the passive party, and the sample space corresponding to a left node or a right node is obtained;
the splitting unit is used for splitting the first sample according to the sample space to obtain a tree structure corresponding to the GBDT model;
and the construction unit is used for constructing a characteristic matrix according to the tree structure, and carrying out training of logistic regression to obtain an LR model.
The application also provides another GBDT and LR fusion device based on federal learning, which comprises:
a third obtaining unit, configured to obtain an age characteristic value of each second sample of the passive party;
the sorting unit is used for sorting the second samples according to the age characteristic values;
the grouping unit is used for dividing the second samples according to the sequence and a preset quantile to obtain each group;
the calculating unit is used for calculating the gradient sum of each group to obtain the gradient sum group;
the encryption unit is used for encrypting the gradient and the group and transmitting the encrypted gradient and group to the master side;
a fourth obtaining unit, configured to obtain a division value of the master; the division value is obtained by decrypting the gradient and the group by the active party and dividing according to the decrypted gradient and the selected optimal characteristic;
the dividing unit is used for dividing the second samples belonging to the division values into a left node and dividing the second samples not belonging to the division values into a right node to obtain a sample space corresponding to the left node or the right node;
and the transmission unit is used for transmitting the sample space corresponding to the left node or the right node to the master.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of any of the above GBDT and LR fusion methods based on federated learning when executing the computer program.
The present application further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the above-described federate learning based GBDT and LR fusion methods.
According to the GBDT and LR fusion method, the device, the equipment and the storage medium based on the federal learning, a federal model with high interpretability and good effect can be constructed under the condition that model training is not directly carried out on data aggregation. According to the method, when an LR model is trained, a feature matrix is directly constructed according to a tree structure of a GBDT model, and the LR model with high interpretability can be obtained without complicated feature construction. The method only needs to transmit gradient and the like when the GBDT model is constructed, and the construction of the LR model is directly obtained by training on the active side, so that the time efficiency basically depends on the model efficiency of the GBDT, and the time complexity cannot be improved. Meanwhile, the data information of the opposite party is not required to be known between the active party and the passive party, and the respective financial data is not known by other parties, so that the financial data can also be used for machine learning model training.
Drawings
FIG. 1 is a schematic illustration of an implementation environment of an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating steps of a GBDT and LR fusion method based on federated learning according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another exemplary method for fusion of GBDT and LR based on federated learning according to another embodiment of the present application;
FIG. 4 is a block diagram of a GBDT and LR fusion device based on federated learning according to an embodiment of the present application;
FIG. 5 is a block diagram of another exemplary embodiment of a GBDT and LR fusion device based on federated learning;
fig. 6 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, under federated learning, an active party has a first terminal 1 and a passive party has at least one second terminal 2; the first terminal 1 and the second terminal 2 can perform data communication through a network. The active party and the passive party have the same users; the first terminal 1 has sample data X1 and tag data Y, and the second terminal 2 has sample data X2, X3 … XN. The first terminal 1 and the second terminal 2 may each include an independently operating server, a distributed server, or a server cluster composed of a plurality of servers, where the servers may be cloud servers. Federated learning can build machine learning systems without direct access to sample data: the data remains in its original location, which helps ensure privacy and reduce communication costs.
Specifically, the first terminal 1 calculates the gradient of each first sample and transmits it to the second terminal 2 in encrypted form. The second terminal 2 groups the second samples, calculates the gradient sum of each group, and transmits the gradient-sum groups to the first terminal 1. The first terminal 1 decrypts the gradient-sum groups, selects the optimal feature division according to the decrypted gradients, and transmits the division value of the optimal feature division to the second terminal 2. The second terminal 2 divides the second samples according to the division value and transmits the resulting sample space of the left node or the right node to the first terminal 1. Finally, the first terminal 1 splits the first samples according to the sample space to obtain the GBDT model, constructs a feature matrix according to the tree structure of the GBDT model, and performs logistic regression training to obtain the LR model.
Referring to fig. 2, an embodiment of the present application provides a GBDT and LR fusion method based on federal learning, which is applied to an active side, and includes the following steps:
step S1, firstly, calculating the gradient of each first sample, encrypting the gradient and transmitting the gradient to a passive party, wherein the first samples have labels;
step S2, acquiring the encrypted gradient-sum groups of the passive side; the gradient-sum groups are obtained by the passive party grouping each second sample according to attributes and calculating the gradient sum of each group; the first sample and the second sample correspond to the same user, and the second sample has no label;
step S3, decrypting the gradient-sum groups, selecting the optimal feature division according to the decrypted gradients, and transmitting the division value corresponding to the optimal feature division to the passive side;
step S4, obtaining a sample space of the passive side divided into a left node or a right node; the sample space is obtained by dividing the second sample according to the division value through the passive party, and the sample space corresponding to a left node or a right node is obtained;
step S5, splitting the first sample according to the sample space to obtain a tree structure corresponding to the GBDT model;
and step S6, constructing a feature matrix according to the tree structure, and performing logistic regression training to obtain an LR model.
The GBDT and LR fusion method based on federated learning proposed in this embodiment is applied to an active party. The active party and the passive party have the same batch of users, but hold different user information; for example, bank A has users A, B and C, and bank B also has users A, B and C, while the two banks hold different feature data for these users. In other embodiments, the passive party may also have labels; in that case the labels of the active party take precedence and the labels of the passive party do not participate in training. In particular, the passive party may include multiple banks. As described in step S1, the gradient of each first sample in the active party is calculated; specifically, during the t-th iteration, the calculated gradients are:
$$g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right) \tag{1}$$

$$h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right) \tag{2}$$

where h_i is the second derivative corresponding to the first derivative g_i, y_i is the label of the i-th first sample, and l is the loss function. The gradient is encrypted and then transmitted to the passive party; specifically, the gradient is encrypted by an additively homomorphic encryption algorithm before transmission.
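For the binary-classification logistic loss typically used in this setting, equations (1) and (2) have the well-known closed forms g_i = p_i − y_i and h_i = p_i(1 − p_i), where p_i is the sigmoid of the current raw prediction. A minimal sketch (helper name is illustrative, not from the patent):

```python
import math

def log_loss_gradients(y_true, y_pred_raw):
    """First/second-order gradients of the logistic loss with respect to
    the raw score y_hat^(t-1), as computed when fitting the t-th tree."""
    grads = []
    for y, raw in zip(y_true, y_pred_raw):
        p = 1.0 / (1.0 + math.exp(-raw))  # sigmoid of current prediction
        g = p - y                          # first derivative g_i
        h = p * (1.0 - p)                  # second derivative h_i
        grads.append((g, h))
    return grads

# At the first iteration the raw prediction is typically 0 (p = 0.5).
pairs = log_loss_gradients([1, 0], [0.0, 0.0])
```

These per-sample (g_i, h_i) pairs are what the active party encrypts and transmits to the passive party.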
As described in step S2 above, the encrypted gradient and group of each passive packet are obtained. The passive party groups the second samples, specifically, the second samples may be grouped based on some characteristic attributes in the second samples, such as gender, age, and the like, the gradient of the second samples in each group is calculated according to the above equations (1) and (2), the gradient sum of each group is calculated, and the gradient sum of each group is encrypted and then transmitted to the active party. In another embodiment, when training is performed in combination with a multi-party bank, the passive party comprises a plurality of banks, the second sample of each bank is grouped, the gradient sum of the grouping is calculated, then the gradient sum group is formed, and the gradient sum group is encrypted and then transmitted to the active party.
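The additive homomorphism that lets the passive party sum encrypted gradients it cannot read can be illustrated with a toy Paillier scheme. This is a sketch only: the tiny fixed primes are insecure, the names are illustrative, and real gradients (floats) would first be fixed-point encoded; a deployment would use a vetted library with 2048-bit keys.

```python
import math
import random

# Toy Paillier keypair with tiny fixed primes -- illustration only.
p, q = 17, 19
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)  # inverse of L(g^lam mod n^2)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def add_encrypted(c1, c2):
    # Multiplying two ciphertexts adds the underlying plaintexts.
    return (c1 * c2) % n2

# The passive party can sum per-sample gradients without decrypting them.
c_total = add_encrypted(encrypt(12), encrypt(30))
total = decrypt(c_total)
```

Only the active party, holding the private key (lam, mu), can recover the group's gradient sum; the passive party sees only ciphertexts.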
As described in step S3, since the base learner of the GBDT model is a tree model, after the active party decrypts the received gradient sums, the tree model splits according to the decrypted gradient sums; each split node is divided into a left node and a right node, and each first sample falls into a unique leaf node in each tree. The optimal feature division is then selected, and the division value corresponding to the optimal feature division is transmitted to the passive side. The optimal feature division represents the best way to split the tree model, and the division value is a hyperparameter, i.e., a value set before training starts rather than a parameter learned through training.
As described in step S4, after receiving the feature column and the partition value, the passive side partitions the samples according to the partition value, that is, the samples of the feature column value in the partition value interval are partitioned into the left node and the rest are partitioned into the right node, and returns the sample space partitioned into the left node or the right node to the active side, and the active side obtains the sample space partitioned into the left node or the right node by the passive side.
As described in step S5, after receiving the sample space divided into left nodes, the master can know which second samples fall into the left nodes, so that the first sample can be split equally, the left and right nodes are divided, and finally the corresponding threshold is reached to construct leaf nodes, so as to obtain the tree structure of the GBDT model.
As described in step S6, since the master knows the tree structure of the GBDT model and the sample space that falls into the leaf node, the master assigns a value to the first sample, constructs a sparse matrix, and trains a logistic regression using the sparse matrix as a feature matrix to obtain an LR model.
In this embodiment, the GBDT and LR fusion method based on federal learning can construct a federal model with high interpretability and good effect without model training by directly aggregating data. According to the method, when an LR model is trained, a feature matrix is directly constructed according to a tree structure of a GBDT model, and the LR model with high interpretability can be obtained without complicated feature construction. The method only needs to transmit gradient and the like when the GBDT model is constructed, and the construction of the LR model is directly obtained by training on the active side, so that the time efficiency basically depends on the model efficiency of the GBDT, and the time complexity cannot be improved. Meanwhile, the data information of the opposite party is not required to be known between the active party and the passive party, and the respective financial data is not known by other parties, so that the financial data can also be used for machine learning model training. Specifically, when the method is used for training the wind control model in combination with a multi-party bank, data information of the banks of all parties is not disclosed, so that the method can effectively utilize the multi-party data to complete personal risk assessment as much as possible under the condition of ensuring data safety, and the bank can effectively identify users with poor credit conditions in other banks.
In an embodiment, the step S3 of decrypting the gradient sum, selecting an optimal feature partition according to the gradient sum, and transmitting the feature column and the partition value corresponding to the optimal feature partition to the passive side includes:
decrypting the gradient sum group;
calculating a gain of the first sample from the decrypted gradient sum;
selecting optimal feature division according to the gain;
and transmitting the division value corresponding to the optimal feature division to the passive side.
In this embodiment, the master decrypts the gradient sum groups to obtain the gradient sum of each group, and calculates the gain of the first sample in the master according to the gradient sum.
In an embodiment, the step of calculating the gain of the first sample according to the decrypted gradient sum includes:
by the formula

$$\mathrm{Gain} = \frac{1}{2}\left[\frac{\left(\sum g_l\right)^2}{\sum h_l + \lambda} + \frac{\left(\sum g_r\right)^2}{\sum h_r + \lambda} - \frac{\left(\sum g\right)^2}{\sum h + \lambda}\right]$$

calculating the gain of the first sample, wherein g_l, h_l are the gradient information of the first samples split into the left node, g_r, h_r are the gradient information of the first samples split into the right node, g and h are the gradient information of the current first samples, and λ is a regularization parameter of the gain formula. In this embodiment, the gain of the first sample is calculated through the above formula; the value of the gain represents the splitting quality of the tree model, and a larger gain indicates a better split.
And selecting the optimal characteristic division according to the gain, selecting the corresponding tree model when the gain value is maximum, and transmitting the characteristic column and the division value corresponding to the optimal characteristic division to the passive side so that the passive side knows the splitting of the active side.
In an embodiment, the step S5 of splitting the first sample according to the sample space to obtain a tree structure corresponding to a GBDT model includes:
according to the sample space, dividing the first samples corresponding to the second samples in the sample space into left nodes;
and dividing the rest of the first samples into right nodes to obtain a tree structure corresponding to the GBDT model.
In this embodiment, after knowing the sample space of the passive side divided into the left nodes, the active side divides the first sample corresponding to the second sample at the left node into the left nodes, and divides the remaining first samples into the right nodes, so as to obtain the tree structure of the GBDT model, thereby completing the training of the GBDT model.
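The split described above is a simple partition by sample id once the left-node sample space arrives; a minimal sketch (ids and function name are hypothetical):

```python
def split_by_sample_space(all_ids, left_space):
    """Active party: first samples whose ids appear in the passive party's
    left-node sample space go to the left child, the rest to the right."""
    left_set = set(left_space)
    left = [i for i in all_ids if i in left_set]
    right = [i for i in all_ids if i not in left_set]
    return left, right

# Five first samples; the passive party reports ids 1 and 3 in the left node.
left, right = split_by_sample_space([0, 1, 2, 3, 4], left_space=[1, 3])
```

Because the two parties' samples correspond to the same users, a shared id is all the active party needs to mirror the passive party's node assignment.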
In an embodiment, the step S6 of constructing the feature matrix according to the tree structure and performing training of logistic regression to obtain an LR model includes:
performing one-hot coding on leaf nodes in the tree structure;
assigning values to the first sample according to the one-hot codes to construct a sparse matrix;
and taking the sparse matrix as a characteristic matrix, and carrying out logistic regression training to obtain the LR model.
In this embodiment, the GBDT model after training generates a plurality of trees, each first sample falls into a unique leaf node in each tree, each leaf node is regarded as a feature, one-hot encoding is performed on the leaf nodes, the first samples are assigned according to the one-hot encoding, each first sample obtains a corresponding feature vector, all the first samples finally obtain a sparse matrix, each column represents the meaning of the leaf node, and the sparse matrix is put into logistic regression for training to complete construction of the LR model. In this embodiment, no transmission is performed in the construction process of the LR model, so the construction process is very efficient, and the time complexity is not increased.
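The one-hot construction above can be sketched as follows: each trained tree maps a sample to exactly one leaf, and concatenating the per-tree one-hot vectors gives that sample's row of the feature matrix (function name and toy leaf indices are illustrative, not from the patent):

```python
def leaf_one_hot(leaf_indices, leaves_per_tree):
    """Build one LR feature row from GBDT leaf assignments.

    leaf_indices[t]    -- index of the leaf the sample falls into in tree t
    leaves_per_tree[t] -- number of leaves in tree t
    """
    row = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        one_hot = [0] * n_leaves
        one_hot[leaf] = 1  # exactly one leaf fires per tree
        row.extend(one_hot)
    return row

# Two trees with 3 and 2 leaves; the sample lands in leaf 2, then leaf 0.
row = leaf_one_hot([2, 0], [3, 2])
```

Stacking these rows over all first samples yields the sparse matrix that is fed into logistic regression; each column corresponds to one leaf node of the GBDT model.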
Referring to fig. 3, an embodiment of the present application provides another GBDT and LR fusion method based on federal learning, which is applied to a passive side, and includes the following steps:
step S10, acquiring age characteristic values of each second sample of the passive side;
step S20, sorting the second sample according to the age characteristic value;
step S30, dividing the second samples according to the sequence and preset quantiles to obtain each group;
step S40, calculating the gradient sum of each group to obtain the gradient sum group;
step S50, the gradient and the group are encrypted and then transmitted to the master side;
step S60, obtaining the division value of the master; the division value is obtained by decrypting the gradient and the group by the active party and dividing according to the decrypted gradient and the selected optimal characteristic;
step S70, dividing the second samples belonging to the division values into left nodes, and dividing the second samples not belonging to the division values into right nodes, to obtain sample spaces corresponding to the left nodes or the right nodes;
step S80, the sample space corresponding to the left node or the right node is transmitted to the master.
The GBDT and LR fusion method based on federal learning proposed in this embodiment is applied to the passive side, and as described in steps S10-S20, the second sample of the passive side has the characteristic of age, and the age characteristic value of the second sample is obtained by the passive side, and the samples are sorted according to the age characteristic value, that is, sorted from small to large according to the age.
As described in step S30, the second samples are grouped according to the sorting, specifically, a quantile is preset, such as a quartile and a quintile, and if a quartile is adopted, the sorted second samples are divided into four equal parts to obtain four groups.
As described in the above steps S40-S50, the gradient sum group is obtained by the passive side summing the gradients of the second samples in each group. Specifically, the gradient sum of each packet is encrypted by an addition homomorphic encryption algorithm and then transmitted to the master. The second samples are grouped according to the age characteristic value through a preset quantile, the number of the second samples in each group is the same, and the proportion of the second samples smaller than a certain value after the second samples are arranged from small to large in the second samples to the total number of the second samples can be directly known through grouping through the quantile.
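Steps S10 through S40 on the passive side can be sketched as below. For clarity the gradients are shown in plaintext, whereas in the actual protocol the passive party only holds them in encrypted form and sums them homomorphically; the dict keys and function names are illustrative.

```python
def quantile_groups(samples, n_bins):
    """Sort second samples by age and cut them into n_bins equal groups."""
    ordered = sorted(samples, key=lambda s: s["age"])
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def gradient_sum_groups(groups):
    """Per-group sums of (g, h): what the passive party encrypts and sends."""
    return [(sum(s["g"] for s in grp), sum(s["h"] for s in grp))
            for grp in groups]

samples = [{"age": a, "g": g, "h": 0.25}
           for a, g in [(25, 0.5), (40, -0.5), (31, 0.5), (58, -0.5)]]
groups = quantile_groups(samples, n_bins=2)   # two equal halves by age
sums = gradient_sum_groups(groups)
```

Grouping by quantile rather than by raw age value ensures equally sized groups and, as the text notes, directly exposes the rank information needed to pick split candidates.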
As described in steps S60-S80, after receiving the division value of the active party's optimal feature division, the passive party divides the second samples belonging to the division value into a left node and the second samples not belonging to it into a right node, and transmits the resulting sample space of the left node or the right node to the active party. The active party thus learns the passive party's sample space and can split its own first samples accordingly, thereby obtaining the GBDT model.
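The node split of steps S60-S80 can be sketched as follows. Here the division value is assumed to be an age threshold on a quantile boundary, and the "sample space" is represented as index lists; both are illustrative choices, not details from the patent.

```python
# Passive party: given the division value chosen by the active party,
# samples at or below the threshold go to the left node, the rest to the
# right node. The resulting index lists are what is sent back.

def split_by_division_value(age_values, division_value):
    left = [i for i, a in enumerate(age_values) if a <= division_value]
    right = [i for i, a in enumerate(age_values) if a > division_value]
    return left, right

ages = [35, 22, 58, 41, 19, 63, 30, 47]
left_space, right_space = split_by_division_value(ages, 35)
```

The active party only ever sees these index lists, never the underlying age values, which is what keeps the passive party's feature data private.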
The GBDT and LR fusion method based on federated learning provided by this application can be applied in the blockchain field, with the trained GBDT model and LR model stored in a blockchain network. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. It is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product services layer, and an application services layer.
The blockchain underlying platform can comprise processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for identity management of all blockchain participants, including public/private key generation and maintenance (account management), key management, and maintenance of the correspondence between users' real identities and blockchain addresses (authority management); with authorization, it can supervise and audit the transactions of certain real identities and provide risk-control rule configuration (risk-control audit). The basic service module is deployed on all blockchain node devices to verify the validity of service requests and, after consensus on a valid request is reached, record it to storage; for a new service request, the basic service first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the service information via a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger (network communication), and records it for storage. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution; developers can define contract logic in a programming language, publish it to the blockchain (contract registration), and, according to the logic of the contract terms, invoke keys or other triggering events to execute and complete the contract logic, while also providing contract upgrade and cancellation functions. The operation monitoring module is mainly responsible for deployment, configuration modification, contract settings, and cloud adaptation during product release, as well as visual output of real-time status during product operation, such as alarms, network-condition monitoring, and node-device health monitoring.
Referring to fig. 4, an embodiment of the present application provides a GBDT and LR fusion device based on federal learning, including:
the calculation unit 10 is configured to calculate a gradient of each first sample, encrypt the gradient, and transmit the encrypted gradient to a passive party, where the first sample has a label;
a first obtaining unit 20, configured to obtain the encrypted gradient-sum groups of the passive party; the gradient-sum groups are obtained by the passive party grouping the second samples by attribute and calculating the gradient sum of each group; the first and second samples correspond to the same user, and the second samples have no labels;
a decryption unit 30, configured to decrypt the gradient-sum groups, select the optimal feature division according to the decrypted gradient sums, and transmit the division value corresponding to the optimal feature division to the passive party;
a second obtaining unit 40, configured to obtain the sample space of the passive party divided into a left node or a right node; the sample space is obtained by the passive party dividing the second samples according to the division value, yielding the sample space corresponding to the left node or the right node;
a splitting unit 50, configured to split the first sample according to the sample space to obtain a tree structure corresponding to the GBDT model;
and a constructing unit 60, configured to construct a feature matrix according to the tree structure and perform logistic regression training to obtain the LR model.
In one embodiment, the decryption unit 30 includes:
a decryption subunit, configured to decrypt the gradient sum group;
a second calculating subunit, configured to calculate a gain of the first sample according to the decrypted gradient sum;
the selecting subunit is used for selecting the optimal characteristic division according to the gain;
and the transfer subunit is used for transferring the feature column and the division value corresponding to the optimal feature division to the passive side.
In one embodiment, the second computing subunit includes:
a calculation module for calculating the gain of the first sample by the formula

$$\mathrm{Gain}=\frac{1}{2}\left[\frac{g_l^{2}}{h_l+\lambda}+\frac{g_r^{2}}{h_r+\lambda}-\frac{g^{2}}{h+\lambda}\right]$$

wherein $g_l$, $h_l$ are the gradient information of the first samples split into the left node, $g_r$, $h_r$ are the gradient information of the first samples split into the right node, $g$ and $h$ are the gradient information of the current first samples (so $g=g_l+g_r$ and $h=h_l+h_r$), and $\lambda$ is a regularization parameter.
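The gain computation described above can be checked numerically. This sketch assumes the standard XGBoost-style split gain that the symbols $g_l, h_l, g_r, h_r$ suggest; the gradient values are made up for illustration.

```python
# Split gain with regularization parameter lam (λ): a larger gain means a
# better split. gl/hl and gr/hr are the first/second-order gradient sums of
# the samples sent to the left and right child nodes.

def split_gain(gl, hl, gr, hr, lam=1.0):
    g, h = gl + gr, hl + hr            # gradient sums of the parent node
    return 0.5 * (gl * gl / (hl + lam)
                  + gr * gr / (hr + lam)
                  - g * g / (h + lam))

gain = split_gain(gl=-4.0, hl=3.0, gr=6.0, hr=5.0)
# A split that separates negative from positive gradients yields positive gain.
```

In the federated setting, the active party evaluates this gain for each candidate split using the decrypted per-group gradient sums, without ever seeing the passive party's raw feature values.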
In one embodiment, the splitting unit 50 includes:
a third dividing subunit, configured to divide the first sample corresponding to the second sample in the sample space into left nodes according to the sample space;
and the fourth dividing subunit is used for dividing the rest of the first samples into right nodes to obtain the tree structure corresponding to the GBDT model.
In one embodiment, the building unit 60 includes:
an encoding subunit, configured to perform one-hot encoding on the leaf nodes in the tree structure;
an assignment subunit, configured to assign values to the first samples according to the one-hot codes to construct a sparse matrix;
and a training subunit, configured to perform logistic regression training with the sparse matrix as the feature matrix to obtain the LR model.
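The one-hot construction performed by these subunits can be sketched as follows. Each first sample lands in exactly one leaf per GBDT tree; one-hot encoding those leaf positions yields the sparse feature matrix fed to logistic regression. The leaf indices below are made up for illustration.

```python
# Build the LR feature matrix from per-tree leaf assignments:
# leaf_ids[i][t] is the leaf index of sample i in tree t, and each tree
# contributes a block of one-hot columns.

def leaves_to_feature_matrix(leaf_ids, leaves_per_tree):
    width = sum(leaves_per_tree)
    offsets = [sum(leaves_per_tree[:t]) for t in range(len(leaves_per_tree))]
    matrix = []
    for sample in leaf_ids:
        row = [0] * width
        for t, leaf in enumerate(sample):
            row[offsets[t] + leaf] = 1   # exactly one hot column per tree
        matrix.append(row)
    return matrix

# Two trees with 3 and 2 leaves respectively; three samples.
X = leaves_to_feature_matrix([[0, 1], [2, 0], [1, 1]], leaves_per_tree=[3, 2])
```

Each row contains exactly one nonzero per tree, which is why the matrix is sparse and why the resulting LR coefficients remain interpretable (one weight per leaf).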
Referring to fig. 5, an embodiment of the present application provides another GBDT and LR fusion apparatus based on federal learning, including:
a third obtaining unit 1A, configured to obtain an age characteristic value of each second sample of the passive party;
a sorting unit 1B for sorting the second samples according to the age characteristic values;
the grouping unit 1C is configured to divide the second samples according to the sorting and a preset quantile to obtain each group;
a calculating unit 1D, configured to calculate a gradient sum of each group, to obtain the gradient sum group;
the encryption unit 1E is configured to encrypt the gradient sum group and transmit the encrypted gradient sum group to the master;
a fourth obtaining unit 1F, configured to obtain the division value from the active party; the division value is obtained by the active party decrypting the gradient-sum groups and selecting the optimal feature division according to the decrypted gradient sums;
the dividing unit 1G is configured to divide the second sample belonging to the division value into a left node, and divide the second sample not belonging to the division value into a right node, so as to obtain a sample space corresponding to the left node or the right node;
and the transmission unit 1H is configured to transmit the sample space corresponding to the left node or the right node to the master.
In this embodiment, please refer to the above method embodiment for the specific implementation of each unit and sub-unit, which is not described herein again.
Referring to fig. 6, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing first sample data, second sample data, and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. When executed by the processor, the computer program implements the GBDT and LR fusion method based on federated learning.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the present teachings and is not intended to limit the scope of the present teachings as applied to computer devices.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for fusion of GBDT and LR based on federal learning is implemented.
In summary, in the GBDT and LR fusion method, apparatus, device, and storage medium based on federated learning provided in the embodiments of the present application, the active party calculates the gradient of each first sample, encrypts the gradients, and transmits them to the passive party, where the first samples have labels; obtains the encrypted gradient-sum groups of the passive party, the gradient-sum groups being obtained by the passive party grouping the second samples by attribute and calculating the gradient sum of each group, where the first and second samples correspond to the same user and the second samples have no labels; decrypts the gradient-sum groups, selects the optimal feature division according to the decrypted gradient sums, and transmits the division value corresponding to the optimal feature division to the passive party; obtains the sample space of the passive party divided into a left node or a right node, the sample space being obtained by the passive party dividing the second samples according to the division value; splits the first samples according to the sample space to obtain the tree structure corresponding to the GBDT model; and constructs a feature matrix according to the tree structure and performs logistic regression training to obtain the LR model. This GBDT and LR fusion method based on federated learning can construct a federated model with high interpretability and good performance without aggregating data for centralized model training. When training the LR model, the feature matrix is constructed directly from the tree structure of the GBDT model, so a highly interpretable LR model is obtained without laborious feature engineering.
The method only needs to transmit gradients and the like when constructing the GBDT model, and the LR model is trained directly on the active party, so the time efficiency essentially depends on the efficiency of the GBDT model and the time complexity is not increased. Moreover, the active party and the passive party need not know each other's data, and their respective financial data remain unknown to the other parties, so financial data can also be used for machine-learning model training.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored on a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only for the preferred embodiment of the present application and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (10)

1. A GBDT and LR fusion method based on federal learning is applied to an active side and is characterized by comprising the following steps:
calculating the gradient of each first sample, encrypting the gradient and transmitting the gradient to a passive party, wherein the first sample has a label;
acquiring the encrypted gradient-sum groups of the passive party; the gradient-sum groups are obtained by the passive party grouping the second samples by attribute and calculating the gradient sum of each group; the first and second samples correspond to the same user, and the second samples have no labels;
decrypting the gradient and the group, selecting the optimal feature partition according to the decrypted gradient and transmitting a partition value corresponding to the optimal feature partition to the passive side;
obtaining a sample space of the passive side divided into a left node or a right node; the sample space is obtained by dividing the second sample according to the division value through the passive party, and the sample space corresponding to a left node or a right node is obtained;
splitting the first sample according to the sample space to obtain a tree structure corresponding to the GBDT model;
and constructing a characteristic matrix according to the tree structure, and carrying out logistic regression training to obtain an LR model.
2. The GBDT and LR fusion method according to claim 1, wherein the step of decrypting the gradient-sum groups, selecting the optimal feature division according to the decrypted gradient sums, and transmitting the division value corresponding to the optimal feature division to the passive party comprises:
decrypting the gradient sum group;
calculating a gain of the first sample from the decrypted gradient sum;
selecting optimal feature division according to the gain;
and transmitting the division value corresponding to the optimal feature division to the passive side.
3. The method of claim 2, wherein the step of calculating the gain of the first sample based on the decrypted gradient sum comprises:
by the formula

$$\mathrm{Gain}=\frac{1}{2}\left[\frac{g_l^{2}}{h_l+\lambda}+\frac{g_r^{2}}{h_r+\lambda}-\frac{g^{2}}{h+\lambda}\right]$$

calculating the gain of the first sample, wherein $g_l$, $h_l$ are the gradient information of the first samples split into the left node, $g_r$, $h_r$ are the gradient information of the first samples split into the right node, $g$ and $h$ are the gradient information of the current first samples, and $\lambda$ is a regularization parameter.
4. The method according to claim 1, wherein the step of splitting the first sample according to the sample space to obtain a tree structure corresponding to a GBDT model comprises:
according to the sample space, dividing the first samples corresponding to the second samples in the sample space into left nodes;
and dividing the rest of the first samples into right nodes to obtain a tree structure corresponding to the GBDT model.
5. The GBDT and LR fusion method according to claim 1, wherein the step of constructing a feature matrix from the tree structure and performing logistic regression training to obtain an LR model comprises:
performing one-hot coding on leaf nodes in the tree structure;
assigning values to the first sample according to the one-hot codes to construct a sparse matrix;
and taking the sparse matrix as a characteristic matrix, and carrying out logistic regression training to obtain the LR model.
6. A GBDT and LR fusion method based on federal learning is applied to a passive side and is characterized by comprising the following steps:
acquiring age characteristic values of second samples of the passive party;
sorting the second sample according to the age characteristic value;
dividing the second samples according to the sequence and a preset quantile to obtain each group;
calculating the gradient sum of each group to obtain the gradient sum group;
the gradient and the group are encrypted and then transmitted to a master side;
acquiring the division value from the active party; the division value is obtained by the active party decrypting the gradient-sum groups and selecting the optimal feature division according to the decrypted gradient sums;
dividing the second samples belonging to the division values into left nodes, and dividing the second samples not belonging to the division values into right nodes to obtain sample spaces corresponding to the left nodes or the right nodes;
and transmitting the sample space corresponding to the left node or the right node to the master.
7. A GBDT and LR fusion device based on federal learning, comprising:
the calculating unit is used for calculating the gradient of each first sample, encrypting the gradient and transmitting the encrypted gradient to a passive party, wherein the first samples have labels;
a first obtaining unit, configured to obtain the encrypted gradient-sum groups of the passive party; the gradient-sum groups are obtained by the passive party grouping the second samples by attribute and calculating the gradient sum of each group; the first and second samples correspond to the same user, and the second samples have no labels;
The decryption unit is used for decrypting the gradient and the group, selecting the optimal feature partition according to the decrypted gradient and transmitting a partition value corresponding to the optimal feature partition to the passive side;
the second acquisition unit is used for acquiring a sample space of the passive side divided into a left node or a right node; the sample space is obtained by dividing the second sample according to the division value through the passive party, and the sample space corresponding to a left node or a right node is obtained;
the splitting unit is used for splitting the first sample according to the sample space to obtain a tree structure corresponding to the GBDT model;
and the construction unit is used for constructing a characteristic matrix according to the tree structure, and carrying out training of logistic regression to obtain an LR model.
8. A GBDT and LR fusion device based on federal learning, comprising:
a third obtaining unit, configured to obtain an age characteristic value of each second sample of the passive party;
the sorting unit is used for sorting the second samples according to the age characteristic values;
the grouping unit is used for dividing the second samples according to the sequence and a preset quantile to obtain each group;
the calculating unit is used for calculating the gradient sum of each group to obtain the gradient sum group;
the encryption unit is used for encrypting the gradient and the group and transmitting the encrypted gradient and group to the master side;
a fourth obtaining unit, configured to obtain the division value from the active party; the division value is obtained by the active party decrypting the gradient-sum groups and selecting the optimal feature division according to the decrypted gradient sums;
the dividing unit is used for dividing the second samples belonging to the division values into a left node and dividing the second samples not belonging to the division values into a right node to obtain a sample space corresponding to the left node or the right node;
and the transmission unit is used for transmitting the sample space corresponding to the left node or the right node to the master.
9. A computer device comprising a memory and a processor, the memory having a computer program stored therein, wherein the processor when executing the computer program performs the steps of the federal learning based GBDT and LR fusion method of any of claims 1 to 5 or 6.
10. A computer readable storage medium having stored thereon a computer program for implementing the steps of the federate learning based GBDT and LR fusion method of any one of claims 1 to 5 or 6 when executed by a processor.
CN202011182203.9A 2020-10-29 2020-10-29 GBDT and LR fusion method, device, equipment and storage medium based on federal learning Pending CN112288101A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011182203.9A CN112288101A (en) 2020-10-29 2020-10-29 GBDT and LR fusion method, device, equipment and storage medium based on federal learning
PCT/CN2021/084670 WO2022088606A1 (en) 2020-10-29 2021-03-31 Gbdt and lr fusion method and apparatus based on federated learning, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011182203.9A CN112288101A (en) 2020-10-29 2020-10-29 GBDT and LR fusion method, device, equipment and storage medium based on federal learning

Publications (1)

Publication Number Publication Date
CN112288101A true CN112288101A (en) 2021-01-29

Family

ID=74353220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011182203.9A Pending CN112288101A (en) 2020-10-29 2020-10-29 GBDT and LR fusion method, device, equipment and storage medium based on federal learning

Country Status (2)

Country Link
CN (1) CN112288101A (en)
WO (1) WO2022088606A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408668A (en) * 2021-07-30 2021-09-17 深圳前海微众银行股份有限公司 Decision tree construction method and device based on federated learning system and electronic equipment
CN113435537A (en) * 2021-07-16 2021-09-24 同盾控股有限公司 Cross-feature federated learning method and prediction method based on Soft GBDT
CN113516253A (en) * 2021-07-02 2021-10-19 深圳市洞见智慧科技有限公司 Data encryption optimization method and device in federated learning
CN113657614A (en) * 2021-09-02 2021-11-16 京东科技信息技术有限公司 Method and device for updating federal learning model
CN113689948A (en) * 2021-08-18 2021-11-23 深圳先进技术研究院 Man-machine asynchronous detection method and device for mechanical ventilation of breathing machine and related equipment
WO2022088606A1 (en) * 2020-10-29 2022-05-05 平安科技(深圳)有限公司 Gbdt and lr fusion method and apparatus based on federated learning, device, and storage medium

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN117196776A (en) * 2023-09-09 2023-12-08 广东德澳智慧医疗科技有限公司 Cross-border electronic commerce product credit marking and after-sale system based on random gradient lifting tree algorithm

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US20180089587A1 (en) * 2016-09-26 2018-03-29 Google Inc. Systems and Methods for Communication Efficient Distributed Mean Estimation
CN109299728B (en) * 2018-08-10 2023-06-27 深圳前海微众银行股份有限公司 Sample joint prediction method, system and medium based on construction of gradient tree model
CN110210233B (en) * 2019-04-19 2024-05-24 平安科技(深圳)有限公司 Combined construction method and device of prediction model, storage medium and computer equipment
CN111738359B (en) * 2020-07-24 2020-11-27 支付宝(杭州)信息技术有限公司 Two-party decision tree training method and system
CN112288101A (en) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 GBDT and LR fusion method, device, equipment and storage medium based on federal learning

Cited By (8)

Publication number Priority date Publication date Assignee Title
WO2022088606A1 (en) * 2020-10-29 2022-05-05 平安科技(深圳)有限公司 Gbdt and lr fusion method and apparatus based on federated learning, device, and storage medium
CN113516253A (en) * 2021-07-02 2021-10-19 深圳市洞见智慧科技有限公司 Data encryption optimization method and device in federated learning
CN113516253B (en) * 2021-07-02 2022-04-05 深圳市洞见智慧科技有限公司 Data encryption optimization method and device in federated learning
CN113435537A (en) * 2021-07-16 2021-09-24 同盾控股有限公司 Cross-feature federated learning method and prediction method based on Soft GBDT
CN113408668A (en) * 2021-07-30 2021-09-17 深圳前海微众银行股份有限公司 Decision tree construction method and device based on federated learning system and electronic equipment
CN113689948A (en) * 2021-08-18 2021-11-23 深圳先进技术研究院 Man-machine asynchronous detection method and device for mechanical ventilation of breathing machine and related equipment
CN113657614A (en) * 2021-09-02 2021-11-16 京东科技信息技术有限公司 Method and device for updating federal learning model
CN113657614B (en) * 2021-09-02 2024-03-01 京东科技信息技术有限公司 Updating method and device of federal learning model

Also Published As

Publication number Publication date
WO2022088606A1 (en) 2022-05-05

Similar Documents

Publication Publication Date Title
CN112288101A (en) GBDT and LR fusion method, device, equipment and storage medium based on federal learning
CN111897673B (en) Operation and maintenance fault root cause identification method and device, computer equipment and storage medium
Weng et al. Wolverine: fast, scalable, and communication-efficient zero-knowledge proofs for boolean and arithmetic circuits
US20210409191A1 (en) Secure Machine Learning Analytics Using Homomorphic Encryption
Zhang et al. Secure and efficient outsourcing of PCA-based face recognition
CN111598186A (en) Decision model training method, prediction method and device based on longitudinal federal learning
CN111784001B (en) Model training method and device and computer readable storage medium
CN110084377A (en) Method and apparatus for constructing decision tree
CN110601815B (en) Block chain data processing method and equipment
CN113449048B (en) Data label distribution determining method and device, computer equipment and storage medium
CN113362048B (en) Data label distribution determining method and device, computer equipment and storage medium
CN113221153B (en) Graph neural network training method and device, computing equipment and storage medium
CN111311211A (en) Data processing method and device based on block chain
CN113378148A (en) Internet of things equipment identity authentication system and method based on block chain
CN111901554B (en) Call channel construction method and device based on semantic clustering and computer equipment
CN114760054A (en) Key management method and device based on digital wallet and storage medium
CN112364059A (en) Correlation matching method, device, equipment and storage medium under multi-rule scene
CN115276969A (en) Wireless channel key generation method and device, computer equipment and storage medium
CN106452790A (en) Multi-party quantum digital signature method without trusted center
Tallapally et al. Competent multi-level encryption methods for implementing cloud security
Dhanaraj et al. Probit cryptographic blockchain for secure data transmission in intelligent transportation systems
Yang et al. Accountable and verifiable secure aggregation for federated learning in IoT networks
CN111343273B (en) Attribute-based strategy hiding outsourcing signcryption method in Internet of vehicles environment
CN115941351A (en) Trusted privacy computing system based on cloud service and encryption technology
Wang et al. Blockchain-Enabled Lightweight Fine-Grained Searchable Knowledge Sharing for Intelligent IoT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210129