CN113537333B - Method for training optimization tree model and longitudinal federal learning system - Google Patents


Publication number
CN113537333B
CN113537333B (application CN202110777115A, filed as CN202110777115.1A)
Authority
CN
China
Prior art keywords
sample data
user sample
bucket
data
gradient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110777115.1A
Other languages
Chinese (zh)
Other versions
CN113537333A (en)
Inventor
黄一珉
王湾湾
何浩
姚明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dongjian Intelligent Technology Co ltd
Original Assignee
Shenzhen Dongjian Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dongjian Intelligent Technology Co ltd filed Critical Shenzhen Dongjian Intelligent Technology Co ltd
Priority to CN202110777115.1A priority Critical patent/CN113537333B/en
Publication of CN113537333A publication Critical patent/CN113537333A/en
Application granted granted Critical
Publication of CN113537333B publication Critical patent/CN113537333B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 — Machine learning
    • G06N 20/20 — Ensemble learning
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 — Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 — Protecting data
    • G06F 21/602 — Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the invention provide a method for optimizing the training of a tree model, and a longitudinal federated learning system. In the method, the data provider buckets its first user sample data and, from the count of first user sample data in each bucket, identifies the bucket containing the most user sample data. It then builds an encrypted first gradient histogram from the homomorphically encrypted gradient information sent by the data demander and the first user sample data in every bucket except the largest one. Finally, the data demander determines the optimal split point of the tree model from the decrypted first gradient histogram together with the gradient information it derives for the largest bucket. Because the largest bucket is excluded from the encrypted computation, the time overhead of the encryption and decryption steps is reduced and the training efficiency of the tree model is improved.

Description

Method for training optimization tree model and longitudinal federal learning system
Technical Field
The invention relates to the technical field of machine learning, and in particular to a method for optimizing the training of a tree model and a longitudinal federated learning system.
Background
With the development of artificial intelligence, the value of data is increasingly emphasized. Data from different fields are often highly complementary, so there is great demand for fusing data across fields. However, owing to privacy protection, commercial interests, and regulatory policy, enterprises holding data in different fields cannot fuse their data directly. This gives rise to the data-island problem.
To solve the data-island problem, researchers proposed federated learning, a distributed machine-learning method. In federated learning, enterprises holding data in different fields act as participants: each participant trains a machine-learning model locally, intermediate training results are exchanged among the participants, and training continues on the basis of those intermediate results until a model meeting the participants' requirements is obtained. In this way, data from different fields are fused indirectly.
In longitudinal federated learning, tree models are among the most commonly used machine-learning models. A tree model is typically trained as follows: the data demander first computes gradient information from the label information of the user data it owns, then sends homomorphically encrypted gradient information to the data provider; the data provider builds an encrypted gradient histogram from the encrypted gradient information and the feature information of its own user data, and returns the encrypted histogram to the data demander; the data demander decrypts the histogram and searches for the globally optimal split point across both parties using a split-gain calculation formula. If the optimal split point belongs to the data provider, the data demander returns it to the data provider; if it belongs to the data demander, the data demander keeps it. The party holding the optimal split point then splits the samples on the corresponding node of its tree model and sends the split result to the other party, which updates its index between samples and tree nodes.
However, the inventors found that when a tree model is trained with this related art, the data provider performs encrypted summation over all gradient information to obtain the encrypted gradient histogram and sends it to the data demander; this incurs large time overhead, so the training efficiency of the tree model is low.
Disclosure of Invention
Embodiments of the invention aim to provide a method for optimizing the training of a tree model and a longitudinal federated learning system, so as to improve the training efficiency of the tree model and reduce the time overhead of training it. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present invention provides a method for optimizing the training of a tree model, applied to a longitudinal federated learning system that includes at least one data provider and at least one data demander. The method includes:
the data provider acquires homomorphically encrypted gradient information sent by the data demander;
the data provider buckets the locally stored first user sample data, and determines the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket;
the data provider builds an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket other than the bucket with the largest amount of user sample data, and sends the encrypted first gradient histogram to the data demander;
and the data demander decrypts the encrypted first gradient histogram, determines the gradient information corresponding to the bucket with the largest amount of user sample data based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determines the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to that bucket, wherein the gradient sum of the first user sample data equals the gradient sum of the second user sample data.
Optionally, the data provider bucketing the locally stored first user sample data includes:
the data provider performing equal-frequency bucketing on the first user sample data; or
the data provider performing equidistant bucketing on the first user sample data.
Optionally, when the first user sample data include first user sample data with missing values, the data provider bucketing the locally stored first user sample data includes:
the data provider placing the first user sample data with missing values into a bucket of their own, and performing equal-frequency or equidistant bucketing on the first user sample data other than those with missing values.
Optionally, the data provider bucketing the locally stored first user sample data and determining the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket includes:
the data provider determining the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket without missing values and the amount of first user sample data in the missing-value bucket;
and the data provider building an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket other than the largest bucket, and sending the encrypted first gradient histogram to the data demander, includes:
when the data provider determines that the missing-value bucket contains the most first user sample data, building an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket without missing values, and sending the encrypted first gradient histogram to the data demander.
Optionally, determining the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to the bucket with the largest amount of user sample data includes:
determining, by means of a split-gain calculation formula, a plurality of first split gains from the decrypted first gradient histogram and a plurality of second split gains from the gradient information corresponding to the bucket with the largest amount of user sample data;
and determining the maximum split gain among the plurality of first split gains and the plurality of second split gains, and taking the user feature corresponding to the maximum split gain, together with the feature value of that user feature, as the optimal split point.
In a second aspect, an embodiment of the present invention further provides a longitudinal federated learning system, which includes at least one data provider and at least one data demander;
the data provider is configured to acquire homomorphically encrypted gradient information sent by the data demander;
the data provider is further configured to bucket the locally stored first user sample data, and determine the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket;
the data provider is further configured to build an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket other than the bucket with the largest amount of user sample data, and to send the encrypted first gradient histogram to the data demander;
and the data demander is configured to decrypt the encrypted first gradient histogram, determine the gradient information corresponding to the bucket with the largest amount of user sample data based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determine the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to that bucket, wherein the gradient sum of the first user sample data equals the gradient sum of the second user sample data.
Optionally, the data provider is specifically configured to:
perform equal-frequency bucketing on the first user sample data; or
perform equidistant bucketing on the first user sample data.
Optionally, when the first user sample data include first user sample data with missing values, the data provider is specifically configured to:
place the first user sample data with missing values into a bucket of their own, and perform equal-frequency or equidistant bucketing on the first user sample data other than those with missing values.
Optionally, the data provider is specifically configured to:
determine the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket without missing values and the amount of first user sample data in the missing-value bucket;
and, when determining that the missing-value bucket contains the most first user sample data, build an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket without missing values, and send the encrypted first gradient histogram to the data demander.
Optionally, the data demander is specifically configured to:
determine, by means of a split-gain calculation formula, a plurality of first split gains from the decrypted first gradient histogram and a plurality of second split gains from the gradient information corresponding to the bucket with the largest amount of user sample data;
and determine the maximum split gain among the plurality of first split gains and the plurality of second split gains, and take the user feature corresponding to the maximum split gain, together with the feature value of that user feature, as the optimal split point.
Embodiments of the invention have the following beneficial effects:
In the method for optimizing the training of a tree model and the longitudinal federated learning system provided by the embodiments of the invention, the data provider acquires homomorphically encrypted gradient information sent by the data demander; it then buckets the locally stored first user sample data and determines the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket; the data provider builds an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket other than the largest bucket, and sends the encrypted first gradient histogram to the data demander; finally, the data demander decrypts the encrypted first gradient histogram, determines the gradient information corresponding to the largest bucket based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determines the optimal split point of the tree model from both. Because the first user sample data in the largest bucket are never used when building the encrypted gradient histogram, the amount of computation during encryption and decryption is reduced; the time overhead of the encryption and decryption processes is therefore lowered, and the training efficiency of the tree model is improved. Of course, not all of the advantages described above need to be achieved simultaneously when practicing any one product or method of the invention.
Drawings
To describe the embodiments of the present invention and the prior-art technical solutions more clearly, the drawings needed for those descriptions are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings.
FIG. 1 is a flowchart of a first implementation of a method for training an optimization tree model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a second implementation of a method for training an optimized tree model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a longitudinal federal learning system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part, not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art can derive from the embodiments given herein fall within the scope of the invention.
To solve the problems in the prior art, embodiments of the present invention provide a method for optimizing the training of a tree model and a longitudinal federated learning system, so as to improve the training efficiency of the tree model and reduce the time overhead of training it.
The method for optimizing the training of a tree model according to an embodiment of the present invention is described first. As shown in fig. 1, a flowchart of a first implementation of the method, the method may include:
s110, a data provider acquires homomorphic encrypted gradient information sent by a data demander;
s120, the data provider carries out bucket dividing processing on the first user sample data stored locally, and determines a bucket with the largest number of the user sample data based on the number of the first user sample data in each bucket;
s130, the data provider establishes an encrypted first gradient histogram based on homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest number of user sample data; sending the encrypted first gradient histogram to a data demand side;
and S140, the data demander decrypts the encrypted first gradient histogram, determines gradient information corresponding to the bucket with the largest quantity of user sample data based on the locally stored gradient sum of the second user sample data and the decrypted first gradient histogram, and determines the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to the bucket with the largest quantity of user sample data, wherein the gradient sum of the first user sample data is the same as the gradient sum of the second user sample data.
In some examples, the method of optimizing tree model training may be applied in a longitudinal federated learning system, which may include at least one data provider and at least one data demander.
The data provider stores first user sample data, and the data demander stores second user sample data. The second user sample data carry label information, which identifies the category of the second user sample data.
In some examples, the first and second user sample data may each include user information and user feature information, and the feature information in the two is not identical. For example, the first user sample data may include user behavior features, while the second user sample data may include user portrait information.
In still other examples, when the tree model is trained with the first and second user sample data, the data demander may first homomorphically encrypt its gradient information and then send the encrypted gradient information to the data provider, so that the data provider obtains the homomorphically encrypted gradient information sent by the data demander.
The data provider may then pre-process the first user sample data stored locally.
Specifically, the data provider may bucket the locally stored first user sample data, and then determine the bucket with the largest amount of user sample data based on the amount of first user sample data in each bucket.
for example, assuming that the first user sample data includes a user name and a user age, wherein the user age is 11-60 years, the first user sample data may be divided into 5 buckets, the first bucket includes users between 11-20 years of age and corresponding user names, the second bucket includes users between 21-30 years of age and corresponding user names, the third bucket includes users between 31-40 years of age and corresponding user names, the fourth bucket includes users between 41-50 years of age and corresponding user names, and the fifth bucket includes users between 51-60 years of age and corresponding user names.
In some examples, when bucketing the locally stored first user sample data, the data provider may perform equal-frequency bucketing on the first user sample data, or equidistant (equal-width) bucketing on the first user sample data.
Equal-frequency bucketing means that when the data provider divides the locally stored first user sample data into several buckets, each bucket holds the same or nearly the same number of first user sample data (for example, the relative deviation of the counts is at most 0.1).
Equidistant bucketing means that the data provider divides the range of the locally stored first user sample data, from its minimum to its maximum, into N equal intervals: if the minimum and maximum values are A and B, the bucket width is W = (B - A)/N and the bucket boundaries are A + W, A + 2W, …, A + (N - 1)W. The number of first user sample data in each bucket need not be equal.
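The two bucketing strategies can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the function names are assumptions introduced here.

```python
def equidistant_buckets(values, n):
    """Equal-width bucketing: split [min, max] into n intervals of width W = (B - A)/n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    buckets = [[] for _ in range(n)]
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - lo) / width), n - 1)
        buckets[idx].append(v)
    return buckets

def equal_frequency_buckets(values, n):
    """Equal-frequency bucketing: sorted values split into n buckets of (nearly) equal size."""
    s = sorted(values)
    size, rem = divmod(len(s), n)
    buckets, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        buckets.append(s[start:end])
        start = end
    return buckets
```

For the age example above (one user per age from 11 to 60), equal-frequency bucketing into 5 buckets yields 10 users per bucket, while equidistant bucketing produces 5 intervals of equal width.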
In some examples, after the buckets are obtained, the bucket with the largest amount of user sample data may be determined based on the amount of first user sample data in each bucket, in order to improve the training efficiency of the tree model and reduce the time overhead of training it.
For example, suppose the numbers of first user sample data in the first through fifth buckets are 6, 10, 5, 15 and 20 respectively; then, among the 5 buckets, the fifth bucket is determined to be the bucket with the largest amount of user sample data.
in the embodiment of the invention, the sub-bucket with the largest quantity of user sample data is determined by carrying out sub-bucket processing on the first user sample data and based on the quantity of the first user sample data in each sub-bucket; the complexity of subsequently building the encrypted first gradient histogram may be reduced.
After determining the sub-bucket with the largest quantity of user sample data, establishing an encrypted first gradient histogram based on homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest quantity of user sample data; sending the encrypted first gradient histogram to a data demand side;
specifically, the data provider may first obtain the encryption gradients of the sub-buckets except the sub-bucket with the largest number of user sample data based on the homomorphic encryption gradient information, and then sum up the encryption gradients corresponding to the sub-buckets, so as to obtain the encrypted first gradient histogram. The encrypted first gradient histogram is then sent to the data consumer.
For example, the data provider may build the first gradient histogram from the received homomorphically encrypted gradient information corresponding to the user age groups contained in each of the five buckets above.
In the embodiment of the invention, the first gradient histogram is obtained by performing encrypted summation only over the gradient information of the buckets with fewer user sample data; the gradient information of the bucket with the most user sample data is never encrypted and summed. The time overhead of computing the encrypted first gradient histogram is therefore reduced, and the training efficiency of the tree model is improved.
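A minimal sketch of this histogram construction, using a toy stand-in for an additively homomorphic scheme such as Paillier. A real deployment would use an actual homomorphic-encryption library; all names here are illustrative assumptions, not from the patent.

```python
class ToyAdditiveHE:
    """Toy stand-in for an additively homomorphic cipher: it models only
    the property Enc(a) (+) Enc(b) = Enc(a + b) and offers no security."""
    def encrypt(self, x):
        return ("enc", x)
    def add(self, c1, c2):
        return ("enc", c1[1] + c2[1])
    def decrypt(self, c):
        return c[1]

def encrypted_gradient_histogram(bucket_to_sample_ids, enc_gradients, he, largest_bucket):
    """Sum the encrypted per-sample gradients of every bucket except the
    largest one, which is skipped as in the patent's optimization."""
    histogram = {}
    for bucket, sample_ids in bucket_to_sample_ids.items():
        if bucket == largest_bucket:
            continue  # no encrypted summation for the largest bucket
        acc = he.encrypt(0)
        for sid in sample_ids:
            acc = he.add(acc, enc_gradients[sid])
        histogram[bucket] = acc
    return histogram
```

Skipping the largest bucket is exactly where the savings come from: the most expensive per-sample ciphertext additions are the ones avoided.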
After the data provider sends the encrypted first gradient histogram to the data demander, the data demander may decrypt it, determine the gradient information corresponding to the largest bucket based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determine the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to the largest bucket.
In some examples, the first gradient histogram reflects the gradient sums of all buckets other than the largest one. Since, during training, the gradient sum over the data provider's locally stored first user sample data equals the gradient sum over the data demander's locally stored second user sample data, the data demander can, after decrypting the encrypted first gradient histogram, recover the gradient information of the largest bucket from its locally known gradient sum and the decrypted histogram.
Consequently, the gradient information of the largest bucket never needs to be encrypted, summed, transmitted or decrypted; the data demander obtains it indirectly from the locally stored gradient sum of the second user sample data and the decrypted first gradient histogram.
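This recovery step amounts to a single subtraction; a sketch (the function name is an assumption introduced here):

```python
def recover_largest_bucket_gradient(total_gradient_sum, decrypted_histogram):
    """The data demander knows the global gradient sum from its own labels;
    subtracting the decrypted per-bucket sums leaves the largest bucket's sum."""
    return total_gradient_sum - sum(decrypted_histogram.values())
```

For example, with a global gradient sum of -3.0 and decrypted bucket sums {b1: -0.5, b3: 1.5}, the skipped bucket's gradient sum is -4.0.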
Therefore, the time overhead of encrypting and calculating the first gradient histogram can be reduced, the time overhead of decrypting the encrypted first gradient histogram can also be reduced, and the training efficiency of the tree model can be further improved.
After obtaining the gradient information corresponding to the bucket with the largest amount of user sample data, the data demander may determine the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to the bucket with the largest amount of user sample data.
In some examples, when determining the optimal split point of the tree model, the data demander may first use a split-gain calculation formula to compute a plurality of first split gains from the decrypted first gradient histogram and a plurality of second split gains from the gradient information corresponding to the largest bucket; it then determines the maximum gain among the first and second split gains, and finally takes the user feature corresponding to that maximum gain, together with the feature value of that user feature, as the optimal split point.
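The patent does not spell out its split-gain formula. A common choice for gradient-boosted trees, assumed here purely for illustration, is the XGBoost-style gain over per-bucket first-order (G) and second-order (H) gradient sums: gain = G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ).

```python
def split_gains(bucket_g, bucket_h, lam=1.0):
    """Candidate split gains at each boundary between consecutive buckets of
    one feature's gradient histogram (bucket_g: first-order sums, bucket_h:
    second-order sums), using the XGBoost gain formula as an assumed example."""
    G, H = sum(bucket_g), sum(bucket_h)
    gains, gl, hl = [], 0.0, 0.0
    for g, h in zip(bucket_g[:-1], bucket_h[:-1]):
        gl += g
        hl += h
        gr, hr = G - gl, H - hl  # right side is the complement of the left
        gains.append(gl * gl / (hl + lam) + gr * gr / (hr + lam) - G * G / (H + lam))
    return gains
```

The best split for the feature is the bucket boundary with the maximum gain; for bucket_g = [-4, -4, 4, 4] with unit Hessians, the middle boundary wins.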
After the optimal split point is obtained, it can be judged whether it belongs to the data provider or the data demander. If it belongs to the data provider, the data demander returns the split point to the data provider; if it belongs to the data demander, the data demander keeps it. The party holding the optimal split point then splits the samples on the corresponding node of its tree model and sends the split result to the other party, thereby realizing the training of the tree model.
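The node split performed by whichever party holds the optimal split point can be sketched as follows (all names are illustrative assumptions); the resulting partition is what the other party uses to update its sample-to-node index:

```python
def split_node_samples(node_samples, feature, threshold):
    """Partition the samples on a tree node by the optimal split point
    (feature, threshold); samples at or below the threshold go left."""
    left = [sid for sid, feats in node_samples.items() if feats[feature] <= threshold]
    right = [sid for sid, feats in node_samples.items() if feats[feature] > threshold]
    return {"left": left, "right": right}
```

Only sample identifiers and the left/right assignment are exchanged, so the counterpart learns the partition without seeing the raw feature values.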
In the method for optimizing the training of a tree model provided by the embodiment of the invention, the data provider acquires homomorphically encrypted gradient information sent by the data demander; it then buckets the locally stored first user sample data and determines the bucket with the largest amount of user sample data from the per-bucket counts; the data provider builds an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket other than the largest bucket, and sends it to the data demander; finally, the data demander decrypts the encrypted first gradient histogram, determines the gradient information of the largest bucket from the gradient sum of the locally stored second user sample data and the decrypted histogram, and determines the optimal split point of the tree model from both. Because the first user sample data in the largest bucket are never used when building the encrypted gradient histogram, the amount of encryption and decryption computation is reduced, the corresponding time overhead is lowered, and the training efficiency of the tree model is improved.
In some practical application scenarios, user sample data contains many missing values, and directly training a tree model with such data causes considerable computational redundancy. Therefore, on the basis of the method for optimizing tree-model training shown in fig. 1, the embodiment of the present invention further provides a possible implementation. As shown in fig. 2, which is a flowchart of a second implementation of the method for optimizing tree-model training according to the embodiment of the present invention, the method may include:
S210, the data provider acquires homomorphically encrypted gradient information sent by the data demander;
S220, the data provider treats the first user sample data with missing values as one bucket, and performs equal-frequency or equidistant bucketing on the remaining first user sample data;
S230, the data provider determines the bucket with the largest number of user sample data based on the number of first user sample data in each bucket;
S240, the data provider determines the bucket with the largest number of user sample data based on the number of first user sample data in each bucket without missing values and the number of first user sample data in the bucket with missing values;
S250, when the data provider determines that the bucket with missing values holds the largest number of first user sample data, it builds an encrypted first gradient histogram based on the homomorphically encrypted gradient information and the first user sample data in each bucket without missing values, and sends the encrypted first gradient histogram to the data demander.
S260, the data demander decrypts the encrypted first gradient histogram, determines the gradient information corresponding to the bucket with the largest number of user sample data based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determines the optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to that bucket, wherein the gradient sum of the first user sample data is the same as the gradient sum of the second user sample data.
In some examples, the first user sample data locally stored by the data provider may include first user sample data with missing values. User sample data with a missing value means that the user information in the sample lacks the corresponding user feature information; for example, it may lack the corresponding user behavior feature information or the corresponding user portrait information.
At this time, the data provider may treat the first user sample data having the missing value as one sub-bucket. For example, the user sample data lacking the corresponding user behavior feature information is taken as a bucket, or the user sample data lacking the corresponding user portrait information is taken as a bucket.
Equal-frequency or equidistant bucketing is then performed on the first user sample data other than the first user sample data with missing values.
By treating the first user sample data with missing values as a separate bucket, the influence of those samples on the process of computing the encrypted first gradient histogram from the gradient information of the other buckets can be reduced.
In still other examples, after obtaining the plurality of buckets, the data provider may further determine the bucket with the largest number of user sample data based on the number of first user sample data in each bucket without missing values and the number of first user sample data in the bucket with missing values.
When the bucket with missing values is determined to hold the largest number of first user sample data, the encrypted first gradient histogram may be built based on the homomorphically encrypted gradient information and the first user sample data in each bucket without missing values, and then sent to the data demander.
Because the number of first user sample data with missing values is usually large, skipping the encrypted summation of the gradient information for the missing-value bucket greatly reduces the time overhead of the encrypted summation process and thus greatly improves the training efficiency of the tree model.
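The missing-value handling above (step S220) can be sketched as follows for a single feature column, where `None` marks a missing value. The function name, the choice of equal-frequency bucketing for the non-missing samples, and the data are illustrative assumptions, not taken from the patent.

```python
# Sketch of step S220: samples whose feature value is missing form their own
# bucket; the remaining samples are split into equal-frequency buckets
# (each holding roughly the same number of samples).

def bucketize_with_missing(values, n_buckets):
    """Return {bucket_id: [sample indices]}; key "missing" holds the Nones."""
    missing = [i for i, v in enumerate(values) if v is None]
    present = sorted((i for i, v in enumerate(values) if v is not None),
                     key=lambda i: values[i])
    buckets = {"missing": missing}
    size = max(1, -(-len(present) // n_buckets))   # ceiling division
    for b in range(n_buckets):
        chunk = present[b * size:(b + 1) * size]
        if chunk:
            buckets[b] = chunk                      # roughly equal sample counts
    return buckets

feature = [3.0, None, 1.0, None, 2.0, 5.0, None, 4.0]
buckets = bucketize_with_missing(feature, n_buckets=2)
# -> the three None samples form the "missing" bucket; the five present
#    values are sorted and split into two near-equal buckets
```

If the "missing" bucket turns out to be the largest, it is exactly the bucket the provider skips when building the encrypted histogram.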
Corresponding to the above method embodiment, an embodiment of the present invention further provides a longitudinal federal learning system, as shown in fig. 3, which is a schematic structural diagram of a longitudinal federal learning system according to an embodiment of the present invention, and the system may include: at least one data provider 310 and at least one data consumer 320;
the data provider 310 is configured to obtain homomorphic encrypted gradient information sent by a data demander;
the data provider 310 is further configured to perform bucket dividing processing on the first user sample data stored locally, and determine a bucket with the largest number of user sample data based on the number of the first user sample data in each bucket;
the data provider 310 is further configured to establish an encrypted first gradient histogram based on homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest number of user sample data; and sends the encrypted first gradient histogram to the data consumer 320;
The data demander 320 is configured to decrypt the encrypted first gradient histogram, determine, based on the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, the gradient information corresponding to the bucket with the largest number of user sample data, and determine, based on the decrypted first gradient histogram and that gradient information, the optimal split point of the tree model, wherein the gradient sum of the first user sample data is the same as the gradient sum of the second user sample data.
In the longitudinal federal learning system provided by the embodiment of the invention, the data provider acquires the homomorphically encrypted gradient information sent by the data demander; it then buckets the locally stored first user sample data and determines the bucket with the largest number of user sample data based on the number of first user sample data in each bucket; the data provider builds an encrypted first gradient histogram from the homomorphically encrypted gradient information and the first user sample data in every bucket except the largest one, and sends it to the data demander; finally, the data demander decrypts the encrypted first gradient histogram, determines the gradient information corresponding to the largest bucket from the gradient sum of the locally stored second user sample data and the decrypted first gradient histogram, and determines the optimal split point of the tree model from the decrypted histogram together with that gradient information. Because the first user sample data in the largest bucket is not needed when building the encrypted gradient histogram, the amount of encryption and decryption computation is reduced, the time overhead of the encryption and decryption processes is lowered, and the training efficiency of the tree model is improved.
In some examples, the data provider 310 is specifically configured to:
performing equal-frequency bucketing on the first user sample data; or
performing equidistant bucketing on the first user sample data.
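As a concrete illustration of the equidistant option: equidistant (equal-width) bucketing cuts the feature's value range into fixed-width intervals, in contrast to equal-frequency bucketing, which balances the number of samples per bucket. The function and data below are illustrative assumptions.

```python
# Sketch of equidistant bucketing: each value is mapped to one of n_buckets
# fixed-width intervals spanning [min(values), max(values)].

def equidistant_buckets(values, n_buckets):
    """Assign each value a bucket id from fixed-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0     # guard against an all-equal column
    # clamp the maximum value into the last bucket instead of overflowing
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]

ids = equidistant_buckets([0.0, 2.5, 4.9, 5.0, 9.9, 10.0], n_buckets=2)
# values below 5.0 land in bucket 0, the rest in bucket 1
```

Either bucketing scheme yields the per-bucket sample counts the provider needs in order to identify, and skip, the largest bucket.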
In some examples, when the first user sample data includes first user sample data with a missing value, the data provider 310 is specifically configured to:
taking the first user sample data with the missing value as a sub-bucket; and performing equal-frequency bucket dividing processing or equal-distance bucket dividing processing on the first user sample data except the first user sample data with the missing value.
In some examples, the data provider 310 is specifically configured to:
determining the sub-bucket with the largest number of user sample data based on the number of the first user sample data in each sub-bucket without missing values and the number of the first user sample data in the sub-bucket with the missing values;
when it is determined that the number of the first user sample data in the bucket with missing values is the largest, establish an encrypted first gradient histogram based on the homomorphic encrypted gradient information and the first user sample data in each bucket without missing values, and send the encrypted first gradient histogram to the data demander 320.
In some examples, the data demander 320 is specifically configured to:
determining, by using a splitting gain calculation formula, a plurality of first splitting gains from the decrypted first gradient histogram and a plurality of second splitting gains from the gradient information corresponding to the bucket with the largest number of user sample data;
and determining the maximum splitting gain among the plurality of first splitting gains and the plurality of second splitting gains, and taking the user characteristic corresponding to the maximum splitting gain and the characteristic value of that user characteristic as the optimal split point.
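The patent does not spell out its "splitting gain calculation formula". A common choice in gradient-boosted tree training, into which the decrypted histogram sums would plug, is the XGBoost-style gain sketched below; `lam`, `gamma`, and the sample values are illustrative assumptions.

```python
# XGBoost-style split gain: GL/GR and HL/HR are the left/right sums of
# first- and second-order gradients; lam and gamma are the usual
# regularization constants. The candidate with the largest gain across
# both parties' features becomes the optimal split point.

def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Gain of splitting a node whose totals are (GL + GR, HL + HR)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# candidate split that separates strongly negative-gradient samples from the rest
gain = split_gain(GL=-4.0, HL=2.0, GR=4.0, HR=2.0)
```

Both parties' candidate gains are computed on the demander side after decryption, which is why only the demander ever sees plaintext gradient sums.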
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for optimizing training of a tree model, applied to a longitudinal federated learning system, the longitudinal federated learning system including at least one data provider and at least one data demander, the method comprising:
the data provider acquires homomorphic encrypted gradient information sent by the data demander;
the data provider carries out bucket dividing processing on first user sample data stored locally, and determines a sub-bucket with the largest number of user sample data based on the number of the first user sample data in each sub-bucket;
the data provider establishes an encrypted first gradient histogram based on the homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest number of the user sample data; sending the encrypted first gradient histogram to the data demander;
the data demander decrypts the encrypted first gradient histogram, determines gradient information corresponding to a bucket with the largest quantity of user sample data based on a locally stored gradient sum of second user sample data and the decrypted first gradient histogram, and determines an optimal split point of the tree model based on the decrypted first gradient histogram and the gradient information corresponding to the bucket with the largest quantity of user sample data, wherein the gradient sum of the first user sample data is the same as the gradient sum of the second user sample data.
2. The method of claim 1, wherein the data provider performs a bucket splitting process on the first user sample data stored locally, and the bucket splitting process comprises:
the data provider performs equal-frequency bucket division processing on the first user sample data; or
And the data provider performs equidistant bucket division processing on the first user sample data.
3. The method according to claim 1, wherein when the first user sample data includes first user sample data with missing values, the data provider performs bucket splitting on the locally stored first user sample data, including:
the data provider takes the first user sample data with the missing value as a sub-bucket; and performing equal-frequency bucket dividing processing or equal-distance bucket dividing processing on the first user sample data except the first user sample data with the missing value.
4. The method of claim 3, wherein the data provider performs a bucket splitting process on the first user sample data stored locally, and determines a bucket with the largest amount of user sample data based on the amount of the first user sample data in each bucket, including:
the data provider determines the sub-bucket with the maximum number of user sample data based on the number of the first user sample data in each sub-bucket without the missing value and the number of the first user sample data in the sub-bucket with the missing value;
the data provider establishes an encrypted first gradient histogram based on the homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest number of the user sample data; and sending the encrypted first gradient histogram to the data consumer, including:
and when the data provider determines that the number of the first user sample data in the sub-buckets with the missing values is the maximum, establishing the encrypted first gradient histogram based on the homomorphic encrypted gradient information and the first user sample data in each sub-bucket without the missing values, and sending the encrypted first gradient histogram to the data demander.
5. The method of claim 1, wherein the determining an optimal split point of the tree model based on the decrypted first gradient histogram and gradient information corresponding to a bucket with a largest number of user sample data comprises:
determining a plurality of first splitting gains of the decrypted first gradient histogram and a plurality of second splitting gains of the gradient information corresponding to the buckets with the largest quantity of the user sample data by adopting a splitting gain calculation formula based on the decrypted first gradient histogram and the gradient information corresponding to the buckets with the largest quantity of the user sample data;
determining a maximum splitting gain in the plurality of first splitting gains and the plurality of second splitting gains, and taking a user characteristic corresponding to the maximum splitting gain and a characteristic value of the user characteristic as the optimal splitting point.
6. A longitudinal federal learning system including at least one data provider and at least one data demander;
the data provider is used for acquiring homomorphic encrypted gradient information sent by the data demander;
the data provider is further used for performing bucket dividing processing on the first user sample data stored locally, and determining a bucket with the largest quantity of user sample data based on the quantity of the first user sample data in each bucket;
the data provider is further used for establishing an encrypted first gradient histogram based on the homomorphic encrypted gradient information and first user sample data in each sub-bucket except the sub-bucket with the largest number of the user sample data; sending the encrypted first gradient histogram to the data demander;
the data demander is configured to decrypt the encrypted first gradient histogram, determine, based on a locally stored gradient sum of second user sample data and the decrypted first gradient histogram, gradient information corresponding to a bucket with the largest amount of user sample data, and determine an optimal split point of a tree model based on the decrypted first gradient histogram and the gradient information corresponding to the bucket with the largest amount of user sample data, where the gradient sum of the first user sample data is the same as the gradient sum of the second user sample data.
7. The system of claim 6, wherein the data provider is specifically configured to:
performing equal-frequency bucket division processing on the first user sample data; or
And carrying out equidistant bucket dividing processing on the first user sample data.
8. The system according to claim 6, wherein when the first user sample data includes first user sample data having a missing value, the data provider is specifically configured to:
taking the first user sample data with the missing value as a sub-bucket; and performing equal-frequency bucket dividing processing or equal-distance bucket dividing processing on the first user sample data except the first user sample data with the missing value.
9. The system of claim 8, wherein the data provider is specifically configured to:
determining the sub-bucket with the largest number of user sample data based on the number of the first user sample data in each sub-bucket without missing values and the number of the first user sample data in the sub-bucket with the missing values;
when the maximum number of the first user sample data in the sub-buckets with the missing values is determined, establishing the encrypted first gradient histogram based on the homomorphic encrypted gradient information and the first user sample data in each sub-bucket without the missing values, and sending the encrypted first gradient histogram to the data demand side.
10. The system of claim 6, wherein the data consumer is specifically configured to:
determining a plurality of first splitting gains of the decrypted first gradient histogram and a plurality of second splitting gains of the gradient information corresponding to the buckets with the largest quantity of the user sample data by adopting a splitting gain calculation formula based on the decrypted first gradient histogram and the gradient information corresponding to the buckets with the largest quantity of the user sample data;
determining a maximum splitting gain in the plurality of first splitting gains and the plurality of second splitting gains, and taking a user characteristic corresponding to the maximum splitting gain and a characteristic value of the user characteristic as the optimal splitting point.
CN202110777115.1A 2021-07-09 2021-07-09 Method for training optimization tree model and longitudinal federal learning system Active CN113537333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110777115.1A CN113537333B (en) 2021-07-09 2021-07-09 Method for training optimization tree model and longitudinal federal learning system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110777115.1A CN113537333B (en) 2021-07-09 2021-07-09 Method for training optimization tree model and longitudinal federal learning system

Publications (2)

Publication Number Publication Date
CN113537333A CN113537333A (en) 2021-10-22
CN113537333B true CN113537333B (en) 2022-05-24

Family

ID=78127216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110777115.1A Active CN113537333B (en) 2021-07-09 2021-07-09 Method for training optimization tree model and longitudinal federal learning system

Country Status (1)

Country Link
CN (1) CN113537333B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563564B (en) * 2022-12-02 2023-03-17 腾讯科技(深圳)有限公司 Processing method and device of decision tree model, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364908A (en) * 2020-11-05 2021-02-12 浙江大学 Decision tree-oriented longitudinal federal learning method
CN112396189A (en) * 2020-11-27 2021-02-23 中国银联股份有限公司 Method and device for multi-party construction of federal learning model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11139961B2 (en) * 2019-05-07 2021-10-05 International Business Machines Corporation Private and federated learning
CN111368901A (en) * 2020-02-28 2020-07-03 深圳前海微众银行股份有限公司 Multi-party combined modeling method, device and medium based on federal learning
CN111695697B (en) * 2020-06-12 2023-09-08 深圳前海微众银行股份有限公司 Multiparty joint decision tree construction method, equipment and readable storage medium
CN113051557B (en) * 2021-03-15 2022-11-11 河南科技大学 Social network cross-platform malicious user detection method based on longitudinal federal learning
CN112990484B (en) * 2021-04-21 2021-07-20 腾讯科技(深圳)有限公司 Model joint training method, device and equipment based on asymmetric federated learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364908A (en) * 2020-11-05 2021-02-12 浙江大学 Decision tree-oriented longitudinal federal learning method
CN112396189A (en) * 2020-11-27 2021-02-23 中国银联股份有限公司 Method and device for multi-party construction of federal learning model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant