CN113537333B - Method for training optimization tree model and longitudinal federal learning system - Google Patents


Publication number
CN113537333B
CN113537333B (application CN202110777115A, filed as CN202110777115.1A)
Authority
CN
China
Prior art keywords
sample data
user sample
bucket
data
gradient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110777115.1A
Other languages
Chinese (zh)
Other versions
CN113537333A (en)
Inventor
黄一珉
王湾湾
何浩
姚明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dongjian Intelligent Technology Co ltd
Original Assignee
Shenzhen Dongjian Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dongjian Intelligent Technology Co ltd filed Critical Shenzhen Dongjian Intelligent Technology Co ltd
Priority to CN202110777115.1A priority Critical patent/CN113537333B/en
Publication of CN113537333A publication Critical patent/CN113537333A/en
Application granted granted Critical
Publication of CN113537333B publication Critical patent/CN113537333B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 — Machine learning
    • G06N 20/20 — Ensemble learning
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 — Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 — Protecting data
    • G06F 21/602 — Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the invention provide a method for optimizing the training of a tree model, and a longitudinal federated learning system. In the method, the data provider buckets its first user sample data and, from the count of first user sample data in each bucket, identifies the bucket containing the most user sample data. It then builds an encrypted first gradient histogram from the homomorphically encrypted gradient information sent by the data demander and the first user sample data in every bucket except the largest one. Finally, the data demander determines the optimal split point of the tree model from the decrypted first gradient histogram together with the gradient information it derives for the largest bucket. Because the largest bucket is excluded from the encrypted computation, the time overhead of the encryption and decryption steps is reduced and the training efficiency of the tree model is improved.

Description

Method for training optimization tree model and longitudinal federal learning system
Technical Field
The invention relates to the technical field of machine learning, and in particular to a method for optimizing the training of a tree model and a longitudinal federated learning system.
Background
With the development of artificial intelligence, the value of data is increasingly emphasized. Data from different fields are often highly complementary, so there is great demand for fusing data across fields. However, owing to privacy protection, commercial interests, and regulatory policy, enterprises holding data in different fields cannot fuse their data directly. This gives rise to the data-island problem.
To solve the data-island problem, researchers proposed federated learning, a distributed machine-learning method. In federated learning, enterprises holding data in different fields act as participants: each participant trains a machine-learning model locally, intermediate training results are exchanged among the participants, and training continues on the basis of those intermediate results until a model meeting the participants' requirements is obtained. In this way, data from different fields are fused indirectly.
In longitudinal federated learning, tree models are among the most commonly used machine-learning models. A tree model is typically trained as follows: the data demander first computes gradient information from the label information of the user data it owns, then sends homomorphically encrypted gradient information to the data provider; the data provider builds an encrypted gradient histogram from the encrypted gradient information and the feature information of its own user data, and returns the encrypted histogram to the data demander; the data demander decrypts the histogram and searches for the globally optimal split point across both parties using a split-gain calculation formula. If the optimal split point belongs to the data provider, the data demander returns it to the data provider; if it belongs to the data demander, the data demander keeps it. The party holding the optimal split point then splits the samples on the corresponding node of its tree model and sends the split result to the other party, which updates its index between samples and tree nodes.
However, the inventors found that when a tree model is trained with this related art, the data provider performs encrypted summation over all gradient information to obtain the encrypted gradient histogram and sends it to the data demander; this incurs large time overhead, so the training efficiency of the tree model is low.
Disclosure of Invention
Embodiments of the invention aim to provide a method for optimizing the training of a tree model and a longitudinal federated learning system, so as to improve the training efficiency of the tree model and reduce the time overhead of training it. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present invention provides a method for optimizing the training of a tree model, applied to a longitudinal federated learning system that includes at least one data provider and at least one data demander. The method includes:
the data provider acquires homomorphically encrypted gradient information sent by the data demander;
the data provider buckets the locally stored first user sample data, and determines the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket;
the data provider builds an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket other than the bucket with the largest amount of user sample data, and sends the encrypted first gradient histogram to the data demander;
and the data demander decrypts the encrypted first gradient histogram, determines the gradient information corresponding to the bucket with the largest amount of user sample data based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determines the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to that bucket, wherein the gradient sum of the first user sample data equals the gradient sum of the second user sample data.
Optionally, the data provider bucketing the locally stored first user sample data includes:
the data provider performing equal-frequency bucketing on the first user sample data; or
the data provider performing equidistant bucketing on the first user sample data.
Optionally, when the first user sample data include first user sample data with missing values, the data provider bucketing the locally stored first user sample data includes:
the data provider placing the first user sample data with missing values into a bucket of their own, and performing equal-frequency or equidistant bucketing on the first user sample data other than those with missing values.
Optionally, the data provider bucketing the locally stored first user sample data and determining the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket includes:
the data provider determining the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket without missing values and the amount of first user sample data in the missing-value bucket;
and the data provider building an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket other than the largest bucket, and sending the encrypted first gradient histogram to the data demander, includes:
when the data provider determines that the missing-value bucket contains the most first user sample data, building an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket without missing values, and sending the encrypted first gradient histogram to the data demander.
Optionally, determining the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to the bucket with the largest amount of user sample data includes:
determining, by means of a split-gain calculation formula, a plurality of first split gains from the decrypted first gradient histogram and a plurality of second split gains from the gradient information corresponding to the bucket with the largest amount of user sample data;
and determining the maximum split gain among the plurality of first split gains and the plurality of second split gains, and taking the user feature corresponding to the maximum split gain, together with the feature value of that user feature, as the optimal split point.
In a second aspect, an embodiment of the present invention further provides a longitudinal federated learning system, which includes at least one data provider and at least one data demander;
the data provider is configured to acquire homomorphically encrypted gradient information sent by the data demander;
the data provider is further configured to bucket the locally stored first user sample data, and determine the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket;
the data provider is further configured to build an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket other than the bucket with the largest amount of user sample data, and to send the encrypted first gradient histogram to the data demander;
and the data demander is configured to decrypt the encrypted first gradient histogram, determine the gradient information corresponding to the bucket with the largest amount of user sample data based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determine the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to that bucket, wherein the gradient sum of the first user sample data equals the gradient sum of the second user sample data.
Optionally, the data provider is specifically configured to:
perform equal-frequency bucketing on the first user sample data; or
perform equidistant bucketing on the first user sample data.
Optionally, when the first user sample data include first user sample data with missing values, the data provider is specifically configured to:
place the first user sample data with missing values into a bucket of their own, and perform equal-frequency or equidistant bucketing on the first user sample data other than those with missing values.
Optionally, the data provider is specifically configured to:
determine the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket without missing values and the amount of first user sample data in the missing-value bucket;
and, when determining that the missing-value bucket contains the most first user sample data, build an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket without missing values, and send the encrypted first gradient histogram to the data demander.
Optionally, the data demander is specifically configured to:
determine, by means of a split-gain calculation formula, a plurality of first split gains from the decrypted first gradient histogram and a plurality of second split gains from the gradient information corresponding to the bucket with the largest amount of user sample data;
and determine the maximum split gain among the plurality of first split gains and the plurality of second split gains, and take the user feature corresponding to the maximum split gain, together with the feature value of that user feature, as the optimal split point.
Embodiments of the invention have the following beneficial effects:
In the method for optimizing the training of a tree model and the longitudinal federated learning system provided by the embodiments of the invention, the data provider acquires homomorphically encrypted gradient information sent by the data demander; it then buckets the locally stored first user sample data and determines the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket; the data provider builds an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket other than the largest bucket, and sends the encrypted first gradient histogram to the data demander; finally, the data demander decrypts the encrypted first gradient histogram, determines the gradient information corresponding to the largest bucket based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determines the optimal split point of the tree model from both. Because the first user sample data in the largest bucket are never used when building the encrypted gradient histogram, the amount of computation during encryption and decryption is reduced; the time overhead of the encryption and decryption processes is therefore lowered, and the training efficiency of the tree model is improved. Of course, not all of the advantages described above need to be achieved simultaneously when practicing any one product or method of the invention.
Drawings
To describe the embodiments of the present invention and the prior-art technical solutions more clearly, the drawings needed for those descriptions are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings.
FIG. 1 is a flowchart of a first implementation of a method for training an optimization tree model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a second implementation of a method for training an optimized tree model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a longitudinal federal learning system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part, not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art can derive from the embodiments given herein fall within the scope of the invention.
To solve the problems in the prior art, embodiments of the present invention provide a method for optimizing the training of a tree model and a longitudinal federated learning system, so as to improve the training efficiency of the tree model and reduce the time overhead of training it.
The method for optimizing the training of a tree model according to an embodiment of the present invention is described first. As shown in fig. 1, a flowchart of a first implementation of the method, the method may include:
s110, a data provider acquires homomorphic encrypted gradient information sent by a data demander;
s120, the data provider carries out bucket dividing processing on the first user sample data stored locally, and determines a bucket with the largest number of the user sample data based on the number of the first user sample data in each bucket;
s130, the data provider establishes an encrypted first gradient histogram based on homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest number of user sample data; sending the encrypted first gradient histogram to a data demand side;
and S140, the data demander decrypts the encrypted first gradient histogram, determines gradient information corresponding to the bucket with the largest quantity of user sample data based on the locally stored gradient sum of the second user sample data and the decrypted first gradient histogram, and determines the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to the bucket with the largest quantity of user sample data, wherein the gradient sum of the first user sample data is the same as the gradient sum of the second user sample data.
In some examples, the method of optimizing tree model training may be applied in a longitudinal federated learning system, which may include at least one data provider and at least one data demander.
The data provider stores first user sample data, and the data demander stores second user sample data. The second user sample data carry label information, which identifies the category of the second user sample data.
In some examples, the first and second user sample data may each include user information and user feature information, and the feature information in the two is not identical. For example, the first user sample data may include user behavior features, while the second user sample data may include user portrait information.
In still other examples, when the tree model is trained with the first and second user sample data, the data demander may first homomorphically encrypt its gradient information and then send the encrypted gradient information to the data provider, so that the data provider obtains the homomorphically encrypted gradient information sent by the data demander.
The data provider may then pre-process the first user sample data stored locally.
Specifically, the data provider may bucket the locally stored first user sample data, and then determine the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket.
for example, assuming that the first user sample data includes a user name and a user age, wherein the user age is 11-60 years, the first user sample data may be divided into 5 buckets, the first bucket includes users between 11-20 years of age and corresponding user names, the second bucket includes users between 21-30 years of age and corresponding user names, the third bucket includes users between 31-40 years of age and corresponding user names, the fourth bucket includes users between 41-50 years of age and corresponding user names, and the fifth bucket includes users between 51-60 years of age and corresponding user names.
In some examples, when bucketing the locally stored first user sample data, the data provider may perform equal-frequency bucketing on the first user sample data, or equidistant (equal-width) bucketing on the first user sample data.
Equal-frequency bucketing means that when the data provider divides the locally stored first user sample data into several buckets, each bucket holds the same or nearly the same number of first user sample data (for example, the relative deviation of the counts is at most 0.1).
Equidistant bucketing means that the data provider divides the range of the locally stored first user sample data, from its minimum to its maximum, into N equal intervals: if the minimum and maximum values are A and B, the bucket width is W = (B - A)/N and the bucket boundaries are A + W, A + 2W, …, A + (N - 1)W. The number of first user sample data in each bucket need not be equal.
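The two bucketing strategies can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the function names are assumptions introduced here.

```python
def equidistant_buckets(values, n):
    """Equal-width bucketing: split [min, max] into n intervals of width W = (B - A)/n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    buckets = [[] for _ in range(n)]
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - lo) / width), n - 1)
        buckets[idx].append(v)
    return buckets

def equal_frequency_buckets(values, n):
    """Equal-frequency bucketing: sorted values split into n buckets of (nearly) equal size."""
    s = sorted(values)
    size, rem = divmod(len(s), n)
    buckets, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        buckets.append(s[start:end])
        start = end
    return buckets
```

For the age example above (one user per age from 11 to 60), equal-frequency bucketing into 5 buckets yields 10 users per bucket, while equidistant bucketing produces 5 intervals of equal width.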
In some examples, after the buckets are obtained, the bucket with the largest amount of user sample data may be determined based on the amount of first user sample data in each bucket, in order to improve the training efficiency of the tree model and reduce the time overhead of training it.
For example, suppose the numbers of first user sample data in the first through fifth buckets are 6, 10, 5, 15 and 20 respectively; then, among the 5 buckets, the fifth bucket is determined to be the bucket with the largest amount of user sample data.
in the embodiment of the invention, the sub-bucket with the largest quantity of user sample data is determined by carrying out sub-bucket processing on the first user sample data and based on the quantity of the first user sample data in each sub-bucket; the complexity of subsequently building the encrypted first gradient histogram may be reduced.
After determining the sub-bucket with the largest quantity of user sample data, establishing an encrypted first gradient histogram based on homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest quantity of user sample data; sending the encrypted first gradient histogram to a data demand side;
specifically, the data provider may first obtain the encryption gradients of the sub-buckets except the sub-bucket with the largest number of user sample data based on the homomorphic encryption gradient information, and then sum up the encryption gradients corresponding to the sub-buckets, so as to obtain the encrypted first gradient histogram. The encrypted first gradient histogram is then sent to the data consumer.
For example, the data provider may build the first gradient histogram from the received homomorphically encrypted gradient information corresponding to the user age groups contained in each of the five buckets above.
In the embodiment of the invention, the first gradient histogram is obtained by performing encrypted summation only over the gradient information of the buckets with fewer user sample data; the gradient information of the bucket with the most user sample data is never encrypted and summed. The time overhead of computing the encrypted first gradient histogram is therefore reduced, and the training efficiency of the tree model is improved.
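A minimal sketch of this histogram construction, using a toy stand-in for an additively homomorphic scheme such as Paillier. A real deployment would use an actual homomorphic-encryption library; all names here are illustrative assumptions, not from the patent.

```python
class ToyAdditiveHE:
    """Toy stand-in for an additively homomorphic cipher: it models only
    the property Enc(a) (+) Enc(b) = Enc(a + b) and offers no security."""
    def encrypt(self, x):
        return ("enc", x)
    def add(self, c1, c2):
        return ("enc", c1[1] + c2[1])
    def decrypt(self, c):
        return c[1]

def encrypted_gradient_histogram(bucket_to_sample_ids, enc_gradients, he, largest_bucket):
    """Sum the encrypted per-sample gradients of every bucket except the
    largest one, which is skipped as in the patent's optimization."""
    histogram = {}
    for bucket, sample_ids in bucket_to_sample_ids.items():
        if bucket == largest_bucket:
            continue  # no encrypted summation for the largest bucket
        acc = he.encrypt(0)
        for sid in sample_ids:
            acc = he.add(acc, enc_gradients[sid])
        histogram[bucket] = acc
    return histogram
```

Skipping the largest bucket is exactly where the savings come from: the most expensive per-sample ciphertext additions are the ones avoided.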
After the data provider sends the encrypted first gradient histogram to the data demander, the data demander may decrypt it, determine the gradient information corresponding to the largest bucket based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determine the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to the largest bucket.
In some examples, the first gradient histogram reflects the gradient sums of all buckets other than the largest one. Since, during training, the gradient sum over the data provider's locally stored first user sample data equals the gradient sum over the data demander's locally stored second user sample data, the data demander can, after decrypting the encrypted first gradient histogram, recover the gradient information of the largest bucket from its locally known gradient sum and the decrypted histogram.
Consequently, the gradient information of the largest bucket never needs to be encrypted, summed, transmitted or decrypted; the data demander obtains it indirectly from the locally stored gradient sum of the second user sample data and the decrypted first gradient histogram.
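This recovery step amounts to a single subtraction; a sketch (the function name is an assumption introduced here):

```python
def recover_largest_bucket_gradient(total_gradient_sum, decrypted_histogram):
    """The data demander knows the global gradient sum from its own labels;
    subtracting the decrypted per-bucket sums leaves the largest bucket's sum."""
    return total_gradient_sum - sum(decrypted_histogram.values())
```

For example, with a global gradient sum of -3.0 and decrypted bucket sums {b1: -0.5, b3: 1.5}, the skipped bucket's gradient sum is -4.0.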
Therefore, the time overhead of encrypting and calculating the first gradient histogram can be reduced, the time overhead of decrypting the encrypted first gradient histogram can also be reduced, and the training efficiency of the tree model can be further improved.
After obtaining the gradient information corresponding to the bucket with the largest amount of user sample data, the data demander may determine the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to the bucket with the largest amount of user sample data.
In some examples, when determining the optimal split point of the tree model, the data demander may first use a split-gain calculation formula to compute a plurality of first split gains from the decrypted first gradient histogram and a plurality of second split gains from the gradient information corresponding to the largest bucket; it then determines the maximum gain among the first and second split gains, and finally takes the user feature corresponding to that maximum gain, together with the feature value of that user feature, as the optimal split point.
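The patent does not spell out its split-gain formula. A common choice for gradient-boosted trees, assumed here purely for illustration, is the XGBoost-style gain over per-bucket first-order (G) and second-order (H) gradient sums: gain = G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ).

```python
def split_gains(bucket_g, bucket_h, lam=1.0):
    """Candidate split gains at each boundary between consecutive buckets of
    one feature's gradient histogram (bucket_g: first-order sums, bucket_h:
    second-order sums), using the XGBoost gain formula as an assumed example."""
    G, H = sum(bucket_g), sum(bucket_h)
    gains, gl, hl = [], 0.0, 0.0
    for g, h in zip(bucket_g[:-1], bucket_h[:-1]):
        gl += g
        hl += h
        gr, hr = G - gl, H - hl  # right side is the complement of the left
        gains.append(gl * gl / (hl + lam) + gr * gr / (hr + lam) - G * G / (H + lam))
    return gains
```

The best split for the feature is the bucket boundary with the maximum gain; for bucket_g = [-4, -4, 4, 4] with unit Hessians, the middle boundary wins.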
After the optimal split point is obtained, it can be judged whether it belongs to the data provider or the data demander. If it belongs to the data provider, the data demander returns the split point to the data provider; if it belongs to the data demander, the data demander keeps it. The party holding the optimal split point then splits the samples on the corresponding node of its tree model and sends the split result to the other party, thereby realizing the training of the tree model.
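The node split performed by whichever party holds the optimal split point can be sketched as follows (all names are illustrative assumptions); the resulting partition is what the other party uses to update its sample-to-node index:

```python
def split_node_samples(node_samples, feature, threshold):
    """Partition the samples on a tree node by the optimal split point
    (feature, threshold); samples at or below the threshold go left."""
    left = [sid for sid, feats in node_samples.items() if feats[feature] <= threshold]
    right = [sid for sid, feats in node_samples.items() if feats[feature] > threshold]
    return {"left": left, "right": right}
```

Only sample identifiers and the left/right assignment are exchanged, so the counterpart learns the partition without seeing the raw feature values.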
In the method for optimizing the training of a tree model provided by the embodiment of the invention, the data provider acquires homomorphically encrypted gradient information sent by the data demander; it then buckets the locally stored first user sample data and determines the bucket with the largest amount of user sample data from the per-bucket counts; the data provider builds an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket other than the largest bucket, and sends it to the data demander; finally, the data demander decrypts the encrypted first gradient histogram, determines the gradient information of the largest bucket from the gradient sum of the locally stored second user sample data and the decrypted histogram, and determines the optimal split point of the tree model from both. Because the first user sample data in the largest bucket are never used when building the encrypted gradient histogram, the amount of encryption and decryption computation is reduced, the corresponding time overhead is lowered, and the training efficiency of the tree model is improved.
In some practical application scenarios, user sample data contains many missing values, and directly training a tree model with such data causes considerable computational redundancy. Therefore, on the basis of the method for optimizing tree-model training shown in fig. 1, the embodiment of the present invention further provides a possible implementation. As shown in fig. 2, which is a flowchart of a second implementation of the method for optimizing tree-model training according to the embodiment of the present invention, the method may include:
S210, the data provider acquires homomorphically encrypted gradient information sent by the data demander;
S220, the data provider treats the first user sample data with missing values as one bucket, and performs equal-frequency or equidistant bucketing on the remaining first user sample data;
S230, the data provider determines the bucket with the largest number of user sample data based on the number of first user sample data in each bucket;
S240, the data provider determines the bucket with the largest number of user sample data based on the number of first user sample data in each bucket without missing values and the number of first user sample data in the bucket with missing values;
S250, when the data provider determines that the bucket with missing values holds the largest number of first user sample data, it builds an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket without missing values, and sends the encrypted first gradient histogram to the data demander.
S260, the data demander decrypts the encrypted first gradient histogram, determines the gradient information corresponding to the bucket with the largest number of user sample data based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determines the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to that bucket, wherein the gradient sum of the first user sample data is the same as the gradient sum of the second user sample data.
In some examples, the first user sample data locally stored by the data provider may include first user sample data with missing values. User sample data with a missing value means that the user information in the sample lacks the corresponding user feature information; for example, it may lack the corresponding user behavior feature information or the corresponding user portrait information.
At this time, the data provider may treat the first user sample data having the missing value as one sub-bucket. For example, the user sample data lacking the corresponding user behavior feature information is taken as a bucket, or the user sample data lacking the corresponding user portrait information is taken as a bucket.
Equal-frequency or equidistant bucketing is then performed on the first user sample data other than the first user sample data with missing values.
By treating the first user sample data with missing values as a separate bucket, the influence of those samples on the process of computing the encrypted first gradient histogram from the gradient information of the other buckets can be reduced.
In still other examples, after obtaining the plurality of buckets, the data provider may further determine the bucket with the largest number of user sample data based on the number of first user sample data in each bucket without missing values and the number of first user sample data in the bucket with missing values.
When the bucket with missing values is determined to hold the largest number of first user sample data, the encrypted first gradient histogram may be built based on the homomorphically encrypted gradient information and the first user sample data in each bucket without missing values, and then sent to the data demander.
Because the number of first user sample data with missing values is usually large, skipping the encrypted summation of the gradient information for the missing-value bucket greatly reduces the time overhead of the encrypted summation process and thus greatly improves the training efficiency of the tree model.
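The missing-value handling above (step S220) can be sketched as follows for a single feature column, where `None` marks a missing value. The function name, the choice of equal-frequency bucketing for the non-missing samples, and the data are illustrative assumptions, not taken from the patent.

```python
# Sketch of step S220: samples whose feature value is missing form their own
# bucket; the remaining samples are split into equal-frequency buckets
# (each holding roughly the same number of samples).

def bucketize_with_missing(values, n_buckets):
    """Return {bucket_id: [sample indices]}; key "missing" holds the Nones."""
    missing = [i for i, v in enumerate(values) if v is None]
    present = sorted((i for i, v in enumerate(values) if v is not None),
                     key=lambda i: values[i])
    buckets = {"missing": missing}
    size = max(1, -(-len(present) // n_buckets))   # ceiling division
    for b in range(n_buckets):
        chunk = present[b * size:(b + 1) * size]
        if chunk:
            buckets[b] = chunk                      # roughly equal sample counts
    return buckets

feature = [3.0, None, 1.0, None, 2.0, 5.0, None, 4.0]
buckets = bucketize_with_missing(feature, n_buckets=2)
# -> the three None samples form the "missing" bucket; the five present
#    values are sorted and split into two near-equal buckets
```

If the "missing" bucket turns out to be the largest, it is exactly the bucket the provider skips when building the encrypted histogram.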
Corresponding to the above method embodiment, an embodiment of the present invention further provides a longitudinal federal learning system, as shown in fig. 3, which is a schematic structural diagram of a longitudinal federal learning system according to an embodiment of the present invention, and the system may include: at least one data provider 310 and at least one data consumer 320;
the data provider 310 is configured to obtain homomorphic encrypted gradient information sent by a data demander;
the data provider 310 is further configured to perform bucket dividing processing on the first user sample data stored locally, and determine a bucket with the largest number of user sample data based on the number of the first user sample data in each bucket;
the data provider 310 is further configured to establish an encrypted first gradient histogram based on homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest number of user sample data; and sends the encrypted first gradient histogram to the data consumer 320;
The data demander 320 is configured to decrypt the encrypted first gradient histogram, determine, based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, the gradient information corresponding to the bucket with the largest number of user sample data, and determine, based on the decrypted first gradient histogram and that gradient information, the optimal split point of the tree model, wherein the gradient sum of the first user sample data is the same as the gradient sum of the second user sample data.
In the longitudinal federal learning system provided by the embodiment of the invention, the data provider acquires the homomorphically encrypted gradient information sent by the data demander; it then buckets the locally stored first user sample data and determines the bucket with the largest number of user sample data based on the number of first user sample data in each bucket; the data provider builds an encrypted first gradient histogram from the homomorphically encrypted gradient information and the first user sample data in every bucket except the largest one, and sends it to the data demander; finally, the data demander decrypts the encrypted first gradient histogram, determines the gradient information corresponding to the largest bucket from the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determines the optimal split point of the tree model from the decrypted histogram together with that gradient information. Because the first user sample data in the largest bucket is not needed when building the encrypted gradient histogram, the amount of encryption and decryption computation is reduced, the time overhead of the encryption and decryption processes is lowered, and the training efficiency of the tree model is improved.
In some examples, the data provider 310 is specifically configured to:
performing equal-frequency bucketing on the first user sample data; or
performing equidistant bucketing on the first user sample data.
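As a concrete illustration of the equidistant option: equidistant (equal-width) bucketing cuts the feature's value range into fixed-width intervals, in contrast to equal-frequency bucketing, which balances the number of samples per bucket. The function and data below are illustrative assumptions.

```python
# Sketch of equidistant bucketing: each value is mapped to one of n_buckets
# fixed-width intervals spanning [min(values), max(values)].

def equidistant_buckets(values, n_buckets):
    """Assign each value a bucket id from fixed-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0     # guard against an all-equal column
    # clamp the maximum value into the last bucket instead of overflowing
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]

ids = equidistant_buckets([0.0, 2.5, 4.9, 5.0, 9.9, 10.0], n_buckets=2)
# values below 5.0 land in bucket 0, the rest in bucket 1
```

Either bucketing scheme yields the per-bucket sample counts the provider needs in order to identify, and skip, the largest bucket.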
In some examples, when the first user sample data includes first user sample data with a missing value, the data provider 310 is specifically configured to:
taking the first user sample data with the missing value as a sub-bucket; and performing equal-frequency bucket dividing processing or equal-distance bucket dividing processing on the first user sample data except the first user sample data with the missing value.
In some examples, the data provider 310 is specifically configured to:
determining the sub-bucket with the largest number of user sample data based on the number of the first user sample data in each sub-bucket without missing values and the number of the first user sample data in the sub-bucket with the missing values;
when it is determined that the number of the first user sample data in the bucket with missing values is the largest, establish an encrypted first gradient histogram based on the homomorphic encrypted gradient information and the first user sample data in each bucket without missing values, and send the encrypted first gradient histogram to the data demander 320.
In some examples, the data demander 320 is specifically configured to:
determining, by using a splitting gain calculation formula, a plurality of first splitting gains from the decrypted first gradient histogram and a plurality of second splitting gains from the gradient information corresponding to the bucket with the largest number of user sample data;
and determining the maximum splitting gain among the plurality of first splitting gains and the plurality of second splitting gains, and taking the user characteristic corresponding to the maximum splitting gain and the characteristic value of that user characteristic as the optimal split point.
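The patent does not spell out its "splitting gain calculation formula". A common choice in gradient-boosted tree training, into which the decrypted histogram sums would plug, is the XGBoost-style gain sketched below; `lam`, `gamma`, and the sample values are illustrative assumptions.

```python
# XGBoost-style split gain: GL/GR and HL/HR are the left/right sums of
# first- and second-order gradients; lam and gamma are the usual
# regularization constants. The candidate with the largest gain across
# both parties' features becomes the optimal split point.

def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Gain of splitting a node whose totals are (GL + GR, HL + HR)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# candidate split that separates strongly negative-gradient samples from the rest
gain = split_gain(GL=-4.0, HL=2.0, GR=4.0, HR=2.0)
```

Both parties' candidate gains are computed on the demander side after decryption, which is why only the demander ever sees plaintext gradient sums.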
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for optimizing training of a tree model, applied to a longitudinal federated learning system, the longitudinal federated learning system including at least one data provider and at least one data demander, the method comprising:
the data provider acquires homomorphic encrypted gradient information sent by the data demander;
the data provider carries out bucket dividing processing on first user sample data stored locally, and determines a sub-bucket with the largest number of user sample data based on the number of the first user sample data in each sub-bucket;
the data provider establishes an encrypted first gradient histogram based on the homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest number of the user sample data; sending the encrypted first gradient histogram to the data demander;
the data demander decrypts the encrypted first gradient histogram, determines gradient information corresponding to a bucket with the largest quantity of user sample data based on a locally stored gradient sum of second user sample data and the decrypted first gradient histogram, and determines an optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to the bucket with the largest quantity of user sample data, wherein the gradient sum of the first user sample data is the same as the gradient sum of the second user sample data.
2. The method of claim 1, wherein the data provider performs a bucket splitting process on the first user sample data stored locally, and the bucket splitting process comprises:
the data provider performs equal-frequency bucket division processing on the first user sample data; or
And the data provider performs equidistant bucket division processing on the first user sample data.
3. The method according to claim 1, wherein when the first user sample data includes first user sample data with missing values, the data provider performs bucket splitting on the locally stored first user sample data, including:
the data provider takes the first user sample data with the missing value as a sub-bucket; and performing equal-frequency bucket dividing processing or equal-distance bucket dividing processing on the first user sample data except the first user sample data with the missing value.
4. The method of claim 3, wherein the data provider performs a bucket splitting process on the first user sample data stored locally, and determines a bucket with the largest amount of user sample data based on the amount of the first user sample data in each bucket, including:
the data provider determines the sub-bucket with the maximum number of user sample data based on the number of the first user sample data in each sub-bucket without the missing value and the number of the first user sample data in the sub-bucket with the missing value;
the data provider establishes an encrypted first gradient histogram based on the homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest number of the user sample data; and sending the encrypted first gradient histogram to the data consumer, including:
and when the data provider determines that the number of the first user sample data in the sub-buckets with the missing values is the maximum, establishing the encrypted first gradient histogram based on the homomorphic encrypted gradient information and the first user sample data in each sub-bucket without the missing values, and sending the encrypted first gradient histogram to the data demander.
5. The method of claim 1, wherein the determining an optimal split point of the tree model based on the decrypted first gradient histogram and gradient information corresponding to a bucket with a largest number of user sample data comprises:
determining a plurality of first splitting gains of the decrypted first gradient histogram and a plurality of second splitting gains of the gradient information corresponding to the buckets with the largest quantity of the user sample data by adopting a splitting gain calculation formula based on the decrypted first gradient histogram and the gradient information corresponding to the buckets with the largest quantity of the user sample data;
determining a maximum splitting gain in the plurality of first splitting gains and the plurality of second splitting gains, and taking a user characteristic corresponding to the maximum splitting gain and a characteristic value of the user characteristic as the optimal splitting point.
6. A longitudinal federal learning system including at least one data provider and at least one data demander;
the data provider is used for acquiring homomorphic encrypted gradient information sent by the data demander;
the data provider is further used for performing bucket dividing processing on the first user sample data stored locally, and determining a bucket with the largest quantity of user sample data based on the quantity of the first user sample data in each bucket;
the data provider is further used for establishing an encrypted first gradient histogram based on the homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest number of the user sample data; sending the encrypted first gradient histogram to the data demander;
the data demander is configured to decrypt the encrypted first gradient histogram, determine, based on a locally stored gradient sum of second user sample data and the decrypted first gradient histogram, gradient information corresponding to a bucket with the largest amount of user sample data, and determine an optimal split point of a tree model based on the decrypted first gradient histogram and the gradient information corresponding to the bucket with the largest amount of user sample data, where the gradient sum of the first user sample data is the same as the gradient sum of the second user sample data.
7. The system of claim 6, wherein the data provider is specifically configured to:
performing equal-frequency bucket division processing on the first user sample data; or
And carrying out equidistant bucket dividing processing on the first user sample data.
8. The system according to claim 6, wherein when the first user sample data includes first user sample data having a missing value, the data provider is specifically configured to:
taking the first user sample data with the missing value as a sub-bucket; and performing equal-frequency bucket dividing processing or equal-distance bucket dividing processing on the first user sample data except the first user sample data with the missing value.
9. The system of claim 8, wherein the data provider is specifically configured to:
determining the sub-bucket with the largest number of user sample data based on the number of the first user sample data in each sub-bucket without missing values and the number of the first user sample data in the sub-bucket with the missing values;
when the maximum number of the first user sample data in the sub-buckets with the missing values is determined, establishing the encrypted first gradient histogram based on the homomorphic encrypted gradient information and the first user sample data in each sub-bucket without the missing values, and sending the encrypted first gradient histogram to the data demand side.
10. The system of claim 6, wherein the data consumer is specifically configured to:
determining a plurality of first splitting gains of the decrypted first gradient histogram and a plurality of second splitting gains of the gradient information corresponding to the buckets with the largest quantity of the user sample data by adopting a splitting gain calculation formula based on the decrypted first gradient histogram and the gradient information corresponding to the buckets with the largest quantity of the user sample data;
determining a maximum splitting gain in the plurality of first splitting gains and the plurality of second splitting gains, and taking a user characteristic corresponding to the maximum splitting gain and a characteristic value of the user characteristic as the optimal splitting point.
CN202110777115.1A 2021-07-09 2021-07-09 Method for training optimization tree model and longitudinal federal learning system Active CN113537333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110777115.1A CN113537333B (en) 2021-07-09 2021-07-09 Method for training optimization tree model and longitudinal federal learning system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110777115.1A CN113537333B (en) 2021-07-09 2021-07-09 Method for training optimization tree model and longitudinal federal learning system

Publications (2)

Publication Number Publication Date
CN113537333A CN113537333A (en) 2021-10-22
CN113537333B true CN113537333B (en) 2022-05-24

Family

ID=78127216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110777115.1A Active CN113537333B (en) 2021-07-09 2021-07-09 Method for training optimization tree model and longitudinal federal learning system

Country Status (1)

Country Link
CN (1) CN113537333B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563564B (en) * 2022-12-02 2023-03-17 腾讯科技(深圳)有限公司 Processing method and device of decision tree model, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364908A (en) * 2020-11-05 2021-02-12 浙江大学 Decision tree-oriented longitudinal federal learning method
CN112396189A (en) * 2020-11-27 2021-02-23 中国银联股份有限公司 Method and device for multi-party construction of federal learning model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11139961B2 (en) * 2019-05-07 2021-10-05 International Business Machines Corporation Private and federated learning
CN111368901A (en) * 2020-02-28 2020-07-03 深圳前海微众银行股份有限公司 Multi-party combined modeling method, device and medium based on federal learning
CN111695697B (en) * 2020-06-12 2023-09-08 深圳前海微众银行股份有限公司 Multiparty joint decision tree construction method, equipment and readable storage medium
CN113051557B (en) * 2021-03-15 2022-11-11 河南科技大学 Social network cross-platform malicious user detection method based on longitudinal federal learning
CN112990484B (en) * 2021-04-21 2021-07-20 腾讯科技(深圳)有限公司 Model joint training method, device and equipment based on asymmetric federated learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364908A (en) * 2020-11-05 2021-02-12 浙江大学 Decision tree-oriented longitudinal federal learning method
CN112396189A (en) * 2020-11-27 2021-02-23 中国银联股份有限公司 Method and device for multi-party construction of federal learning model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant