CN115481415A - Communication cost optimization method, system, device and medium based on longitudinal federal learning - Google Patents

Communication cost optimization method, system, device and medium based on longitudinal federal learning

Info

Publication number
CN115481415A
Authority
CN
China
Prior art keywords: party, derivative, labeled, participant, sample
Prior art date
Legal status
Pending
Application number
CN202211008707.8A
Other languages
Chinese (zh)
Inventor
杨树森
袁博文
李亚男
任雪斌
赵鹏
韩青
郭思言
Current Assignee
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202211008707.8A
Publication of CN115481415A
Status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services


Abstract

The invention discloses a communication cost optimization method, system, device and medium based on longitudinal federal learning. The method comprises: obtaining the sample IDs common to all participants, and determining the data set with identical sample IDs that every participant needs for training; clustering all training samples into groups, calculating the derivative of the loss function for each sample, summing the derivatives of each group after clustering, encrypting the sums, and sending the ciphertext and the grouping to the unlabeled participant; calculating an approximate derivative for each sample from the received information, then calculating gains and sending the results to the labeled participant; and selecting, from the received information, the position with the maximum gain value as the optimal split point, and, if the split point belongs to the unlabeled participant, sending the information of the optimal split point back to the unlabeled participant. The method and the device ensure the privacy and security of the information transmitted by both parties, significantly reduce the volume of communicated data, and preserve the validity of the transmitted information.

Description

Communication cost optimization method, system, device and medium based on longitudinal federal learning
Technical Field
The invention belongs to the technical field of communication, and relates to a communication cost optimization method, a communication cost optimization system, communication cost optimization equipment and a communication cost optimization medium based on longitudinal federal learning.
Background
XGBoost is an algorithm based on GBDT; its basic idea is the same as that of GBDT, but it makes many improvements to the modeling details. An XGBoost model designed for the longitudinal federal learning scenario must combine the characteristics of longitudinal federal learning with those of the XGBoost algorithm. In a federated setting, unlike centralized modeling, the raw data of each participant can be neither stored directly at one party nor concentrated at a third party; that is, the data must not leave its local party. The key is to guarantee the privacy and security of each participant's local data as well as the validity of the model that is built. Federated learning is a distributed machine-learning framework in which different local data owners participate in model training on the premise that the data in each region remains private and secure. During model training, the participants may exchange model-related information (such as the model structure and parameters), whether in plaintext, encrypted, or with noise added, but the local data they hold and use for training is never transmitted to any other party. This communication pattern protects the local data of every participant in training and reduces the risk of leaking data privacy. Longitudinal federal learning suits scenarios where the sample spaces of the participants overlap heavily while the feature spaces of the samples overlap little or not at all.
In the existing XGBoost algorithm for the longitudinal federal learning scenario, the modeling characteristics of the algorithm require frequent, high-volume communication among all participants during XGBoost modeling under longitudinal federation, so that the algorithm has low communication efficiency and incurs large overhead in actual use.
Disclosure of Invention
The invention aims to solve the problem that in the prior art, communication efficiency is low due to the fact that communication with large data volume needs to be frequently carried out among all participants, and provides a communication cost optimization method, a communication cost optimization system, communication cost optimization equipment and a communication cost optimization medium based on longitudinal federal learning.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the communication cost optimization method based on longitudinal federal learning comprises the following steps:
s1, acquiring a common sample ID set of a labeled party and a non-labeled party;
s2, according to the common sample ID set, carrying out derivation on the loss function of the current decision tree to obtain a derivative result of each sample;
s3, performing K-means clustering on the common sample ID set by using the derivative result to obtain a clustering result;
s4, respectively carrying out summation calculation on derivative results of samples contained in each class according to clustering results, encrypting by using a homomorphic encryption algorithm to obtain a ciphertext, and sending the ciphertext and the clustering results to a non-tag participant;
s5, calculating a derivative approximate value of a sample in each class by the non-tag participant according to the ciphertext and the clustering result;
s6, dividing the common sample ID set according to the value of each characteristic of the labeled party and the unlabeled party, and calculating respective gain values of the labeled party and the unlabeled party and division information corresponding to the gain values based on the divided common sample ID set and derivative approximate values of samples in each class; wherein, the participant with the tag calculates to obtain a gain value in a plaintext form, and the participant without the tag calculates to obtain a gain value in a ciphertext form;
s7, decrypting the gain value of the non-labeled party, comparing the gain value with the gain value of the labeled party, taking the division information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node of the current decision tree, and dividing the common sample ID set by the labeled party and the non-labeled party respectively according to the division result of the new node;
s8, judging whether the generation of the current decision tree reaches a preset condition or not according to a common sample ID set divided by a labeled participant and a non-labeled participant respectively, stopping the generation when the preset condition is reached, calculating the weight of each leaf node of the current decision tree, recording the weight in a corresponding leaf node, and updating the predicted label value of each sample in the labeled participant; if the preset value is not reached, returning to S3;
s9, judging whether the generation quantity of the decision trees reaches a predicted condition or the residual error is smaller than a given threshold value according to the predicted label value of each sample in the labeled participants, and stopping generation when the predicted condition or the residual error is smaller than the given threshold value to obtain an optimized XGboost model; otherwise, return to S2.
The invention is further improved in that:
in the step S1, the common sample ID set is obtained by a privacy protection set intersection method, which specifically includes the following steps:
tagged participant possession data { I A (ID),f A (id) }, no-tag participant owns the data { I B (ID),f B (id)};
And (2) calculating a common sample ID set T (ID) by the tagged party and the untagged party according to a PSI method, wherein the common sample ID set T (ID) is shown in a formula (1):
T(ID)=I A (ID)∩I B (ID) (1)
wherein, I A (ID),I B (ID) represents the ID set of the samples owned by the tagged party and the untagged party, respectively, f A (id),f B And (ID) respectively represents the characteristic attribute corresponding to the sample ID epsilon I (ID) in the tagged party and the untagged party.
The current decision tree in S2 is the t-th tree, whose objective function L^{(t)} is shown in formula (2):
L^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)    (2)
wherein the loss function l represents the difference between the true value y_i and the predicted value \hat{y}_i, and \Omega(f_t) represents a regularization term; the (square) loss function is shown in formula (3):
l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2    (3)
The derivatives of the loss function comprise the first derivative g_i and the second derivative h_i, as shown in formulas (4) and (5):
g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})    (4)
h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})    (5)
in the S3, the K-means clustering is a self-adaptive K-means clustering method based on the contour coefficient, and the method specifically comprises the following steps:
s3.1, initializing an average contour coefficient threshold value to be-1;
s3.2, setting a loop body according to the number N of samples contained in the common sample set T (ID), wherein the loop times are set to be (1,N);
s3.3, setting an alarm capture mechanism in the loop body, and ending the loop when the number of clusters is smaller than a set threshold value;
s3.4, when the loop body is executed, updating the threshold value of the clustering number in the K-means clustering algorithm to be the number of times of executing the loop body;
s3.5, in each cycle, clustering the derivative result to obtain an average contour coefficient under the clustering, comparing the average contour coefficient with the existing average contour coefficient, updating the existing contour coefficient to the maximum contour coefficient, and recording the clustering result corresponding to the maximum contour coefficient;
and S3.6, after the circulation is finished, taking the cluster corresponding to the maximum outline coefficient as a final result of clustering the derivative result to obtain a training set subset ID set contained in each cluster.
In the S4, summing operation is respectively carried out on the derivative results of the samples contained in each class, the summing result is encrypted by using a homomorphic encryption Paillier algorithm, and the ciphertext and the clustering result are sent to the non-tag participants.
In S5, the unlabeled participant calculates the derivative approximations of the samples in each class from the ciphertext and the clustering result, comprising the first-derivative ciphertext approximation En(\bar{g}_i) and the second-derivative ciphertext approximation En(\bar{h}_i), calculated according to formulas (6) and (7):
En(\bar{g}_i) = En(G_j) / |T_j(ID)|    (6)
En(\bar{h}_i) = En(H_j) / |T_j(ID)|    (7)
wherein En(G_j), En(H_j) are the first-derivative sum and second-derivative sum in ciphertext form in each class, and T_j(ID) is the training-set subset ID set contained in each cluster.
In S6, the tagged party and the untagged party calculate their respective division gains according to formula (8):
Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \mu    (8)
wherein G_L, G_R, H_L, H_R are respectively the first-derivative sums and second-derivative sums contained in the left and right subtrees after the samples are divided based on a characteristic value, and \lambda, \mu are the regularization coefficients in the regular term \Omega(f_t);
The specific calculation steps include:
S6.1, dividing the sample ID set into a left subset I_L(ID) and a right subset I_R(ID) based on each characteristic-value division;
S6.2, calculating the first-derivative sum G_L and the second-derivative sum H_L on the set I_L(ID), and the first-derivative sum G_R and the second-derivative sum H_R on the set I_R(ID);
S6.3, calculating the corresponding gain value for each characteristic-value division according to formula (8).
A longitudinal federal learning-based communication cost optimization system, comprising:
the data acquisition module is used for acquiring a common sample ID set of a labeled party and an unlabeled party;
the first data calculation module is used for performing derivation on the loss function of the current decision tree according to the common sample ID set to obtain a derivative result of each sample;
the data processing module is used for carrying out K-means clustering on the common sample ID set according to the derivation result to obtain a clustering result;
the data encryption module is used for respectively carrying out summation calculation on derivative results of samples contained in each class according to clustering results, encrypting by using a homomorphic encryption algorithm to obtain a ciphertext, and sending the ciphertext and the clustering results to a non-tag participant;
the second data calculation module is used for calculating derivative approximate values of samples in each class by the non-tag participator according to the ciphertext and the clustering result;
the third data calculation module is used for dividing the common sample ID set according to the value of each characteristic of the labeled party and the unlabeled party, and calculating respective gain values of the labeled party and the unlabeled party and division information corresponding to the gain values based on the divided common sample ID set and derivative approximate values of samples in each class; wherein, the participant with the tag calculates to obtain a gain value in a plaintext form, and the participant without the tag calculates to obtain a gain value in a ciphertext form;
the data decryption module is used for decrypting the gain value of the non-labeled party, comparing the gain value with the gain value of the labeled party, taking the division information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node of the current decision tree, and dividing the common sample ID set by the labeled party and the non-labeled party according to the division result of the new node;
the first circulation module is used for judging whether the generation of the current decision tree reaches a preset condition or not according to a common sample ID set divided by a labeled participant and an unlabeled participant respectively, stopping the generation when the preset condition is reached, calculating the weight of each leaf node of the current decision tree, recording the weight in the corresponding leaf node, and updating the predicted label value of each sample in the labeled participant; when the preset value is not reached, returning to the data processing module;
the second circulation module is used for judging, according to the predicted label value of each sample in the labeled participant, whether the number of generated decision trees reaches a preset condition or whether the residual error is smaller than a given threshold, and stopping the generation when either condition is met, so as to obtain the optimized XGBoost model; otherwise, returning to the first data calculation module.
An apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method of any one of the preceding claims when executing the computer program.
A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method according to any of the preceding claims.
Compared with the prior art, the invention has the following beneficial effects:
According to the communication cost optimization method based on longitudinal federal learning, a K-means clustering algorithm is adopted, and the processed clustering information replaces the per-sample first-derivative and second-derivative sums sent in the prior art, which markedly reduces the communication traffic and effectively improves the communication efficiency; moreover, based on the processed and encrypted clustering information it receives, the unlabeled participant computes the derivative mean within each cluster as an approximation of the true value, which keeps the error introduced by computing with approximate values small.
Furthermore, the self-adaptive K-means clustering method based on the silhouette coefficient automatically adjusts the number of clusters, so that the cluster number is optimized.
Drawings
In order to more clearly explain the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of a communication cost optimization method based on longitudinal federated learning according to the present invention;
FIG. 2 is a block diagram of a longitudinal federal learning based communication cost optimization system of the present invention;
FIG. 3 is a logic diagram of a communication cost optimization modeling method based on longitudinal federated learning according to the present invention;
FIG. 4 is a logic flow diagram of the self-adaptive K-means clustering method based on the silhouette coefficient;
FIG. 5 is the mean square error (MSE) index of the XGBoost model based on the K-means method with manually input cluster numbers when training on the public Boston house-price data;
FIG. 6 is the mean absolute error (MAE) index of the XGBoost model based on the K-means method with manually input cluster numbers when training on the public Boston house-price data;
FIG. 7 is the R-squared (R2) index of the XGBoost model based on the K-means method with manually input cluster numbers when training on the public Boston house-price data;
FIG. 8 is the total communication traffic (TC) index of the XGBoost model during modeling based on the K-means method with manually input cluster numbers;
FIG. 9 is a comparison graph of the communication traffic and the final model performance in mean square error (MSE) when the XGBoost model is trained on the public Boston house-price data before and after optimization;
FIG. 10 is a comparison graph of the communication traffic and the final model performance in mean absolute error (MAE) when the XGBoost model is trained on the public Boston house-price data before and after optimization;
FIG. 11 is a comparison graph of the communication traffic and the final model performance in R-squared (R2) when the XGBoost model is trained on the public Boston house-price data before and after optimization;
FIG. 12 is a comparison graph of the total communication traffic (TC) when the XGBoost model is trained on the public Boston house-price data before and after optimization.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the embodiments of the present invention, it should be noted that if the terms "upper", "lower", "horizontal", "inner", etc. are used for indicating the orientation or positional relationship based on the orientation or positional relationship shown in the drawings or the orientation or positional relationship which is usually arranged when the product of the present invention is used, the description is merely for convenience and simplicity, and the indication or suggestion that the referred device or element must have a specific orientation, be constructed and operated in a specific orientation, and thus, cannot be understood as limiting the present invention. Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
Furthermore, the term "horizontal", if present, does not mean that the component is required to be absolutely horizontal, but may be slightly inclined. For example, "horizontal" merely means that the direction is more horizontal than "vertical" and does not mean that the structure must be perfectly horizontal, but may be slightly inclined.
In the description of the embodiments of the present invention, it should be further noted that unless otherwise explicitly stated or limited, the terms "disposed," "mounted," "connected," and "connected" should be interpreted broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
The invention is described in further detail below with reference to the accompanying drawings:
referring to fig. 1, a flow chart of a communication cost optimization method based on longitudinal federal learning in the present invention specifically includes the following steps:
s1, acquiring a common sample ID set of a labeled party and a non-labeled party;
participant U based on having label A And a non-tag participant U B ID set I of respectively owned samples A (ID)、I B (ID), acquiring a sample ID set T (ID) common to both parties by a privacy protection set transaction PSI method;
tagged participant U A Possession of data { I A (ID),f A (id) }, unlabeled participant U B Possession of data { I B (ID),f B (id)},U A And U B Calculating a sample set T (ID) common to both parties by using a PSI method, wherein the sample set T (ID) is shown as a formula (1):
T(ID)=I A (ID)∩I B (ID) (1)
wherein, I A (ID),I B (ID) represents ID sets of samples owned by tagged and untagged parties, respectively, f A (id),f B And (ID) respectively represents the characteristic attribute corresponding to the sample ID epsilon I (ID) in the tagged party and the untagged party.
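For illustration only, a minimal Python sketch of step S1 is given below. The function name is an assumption of this illustration; a real deployment would run a cryptographic PSI protocol so that neither party learns IDs outside the intersection, whereas the plain set intersection here merely fixes the interface:

```python
def common_sample_ids(ids_a: set, ids_b: set) -> set:
    """Formula (1): T(ID) = I_A(ID) ∩ I_B(ID).

    Stand-in for a privacy-preserving set intersection (PSI) protocol;
    a production system must not reveal either party's non-common IDs.
    """
    return ids_a & ids_b

# Example: parties holding IDs {1, 2, 3} and {2, 3, 4} obtain T(ID) = {2, 3}.
t_id = common_sample_ids({1, 2, 3}, {2, 3, 4})
```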
S2, according to the common sample ID set, carrying out derivation on the loss function of the current decision tree to obtain a derivative result of each sample;
Modeling of the t-th tree begins: the tagged participant U_A derives the square loss function of the current model based on the training-set sample ID set T(ID), and determines the first derivative g_i and the second derivative h_i of each sample in the training set;
The current decision tree is the t-th tree, whose objective function L^{(t)} is shown in formula (2):
L^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)    (2)
wherein the loss function l represents the difference between the true value y_i and the predicted value \hat{y}_i, and \Omega(f_t) represents a regularization term; the loss function is shown in formula (3):
l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2    (3)
The derivatives of the loss function comprise the first derivative g_i and the second derivative h_i, as shown in formulas (4) and (5):
g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})    (4)
h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})    (5)
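As a hedged illustration, for the square loss written above the derivatives have the closed forms g_i = 2(\hat{y}_i - y_i) and h_i = 2; the Python sketch below (function name assumed, not part of the invention) computes them for a batch of samples:

```python
import numpy as np

def square_loss_derivatives(y_true: np.ndarray, y_pred: np.ndarray):
    """Formulas (4) and (5) evaluated for the square loss l = (y - y_hat)^2."""
    g = 2.0 * (y_pred - y_true)       # first derivative g_i
    h = 2.0 * np.ones_like(y_true)    # second derivative h_i (constant for square loss)
    return g, h
```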
s3, performing K-means clustering on the sample ID set according to the derivative result to obtain a clustering result;
in tagged party U A Based on the result of derivation (g) i ,h i ) Carrying out self-adaptive K-means clustering based on contour coefficients on a training set sample set T (ID) to obtain a training set subset ID set T contained in each cluster j (ID), see FIG. 4, the specific steps are:
s3.1, initializing an average contour coefficient threshold value to be-1;
s3.2, setting a loop body according to the number N of samples contained in the common sample set T (ID), wherein the loop times are set to be (1,N);
s3.3, setting an alarm capture mechanism in the loop body, and automatically ending the loop when the number of clusters is smaller than a set threshold value so as to reduce the loop execution times and time overhead;
s3.4, when the loop body is executed, updating the threshold value of the clustering number in the K-means clustering algorithm to be the number of times of executing the loop body;
s3.5, in each cycle, the derivative result (g) i ,h i ) After clustering, obtaining the average contour coefficient under the current clustering, comparing the average contour coefficient with the existing average contour coefficient, updating the current existing contour coefficient into the maximum contour coefficient, and corresponding the maximum contour coefficient to (g) i ,h i ) Recording the clustering result;
s3.6, after the circulation is finished, taking the cluster corresponding to the maximum outline coefficient as the final cluster for clustering the derivative resultObtaining the training set subset ID set T contained in each cluster j (ID) where T j (ID)∈T(ID)。
S4, respectively carrying out summation calculation on the derivative results of the samples contained in each class according to the clustering result, encrypting with a homomorphic encryption algorithm to obtain a ciphertext, and sending the ciphertext and the clustering result to the non-tag participant;
At the tagged participant U_A, based on the K-means clustering result T_j(ID), the g_i and h_i of the samples contained in each class are respectively summed to obtain G_j and H_j; the G_j, H_j of each class are encrypted with the homomorphic-encryption Paillier algorithm to obtain the ciphertexts En(G_j), En(H_j), and {En(G_j), En(H_j), T_j(ID)} is sent to the non-tag participant U_B. The calculation of G_j, H_j is shown in formulas (6) and (7):
G_j = \sum_{i \in T_j(ID)} g_i    (6)
H_j = \sum_{i \in T_j(ID)} h_i    (7)
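A sketch of this step assuming the python-paillier library (phe) for the Paillier cryptosystem; key generation would be performed once by the tagged participant U_A, and all names and the key size are illustrative:

```python
import numpy as np
from phe import paillier  # python-paillier, assumed available (pip install phe)

# One-time key generation at the tagged participant U_A (illustrative key size).
pub_key, priv_key = paillier.generate_paillier_keypair(n_length=2048)

def encrypt_cluster_sums(g: np.ndarray, h: np.ndarray, labels: np.ndarray) -> dict:
    """Step S4: per-cluster sums G_j, H_j (formulas (6), (7)), Paillier-encrypted."""
    payload = {}
    for j in np.unique(labels):
        idx = np.where(labels == j)[0]                        # T_j(ID): members of cluster j
        G_j, H_j = float(g[idx].sum()), float(h[idx].sum())
        payload[int(j)] = (pub_key.encrypt(G_j),              # En(G_j)
                           pub_key.encrypt(H_j),              # En(H_j)
                           idx)                               # T_j(ID), sent in the clear
    return payload                                            # {En(G_j), En(H_j), T_j(ID)}
```

Because one encrypted pair is transmitted per cluster rather than per sample, the ciphertext volume shrinks from the order of the sample count to the order of the cluster count, which is the source of the communication saving described above.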
s5, calculating a derivative approximate value of each class of sample by the non-tag participant according to the ciphertext and the clustering result;
in a non-tag participant U B Based on the received ciphertext and the clustering result { En (G) j ),En(H j ),T j (ID) }, calculating ciphertext approximate values of the first derivative and the second derivative of each sample in the class
Figure BDA0003810058770000113
Wherein
Figure BDA0003810058770000114
Is calculated as shown in equations (8) and (9):
Figure BDA0003810058770000115
Figure BDA0003810058770000116
and with
Figure BDA0003810058770000117
As first and second derivative approximations for each sample in the current cluster.
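Paillier encryption is additively homomorphic, so a ciphertext can be multiplied by a plaintext scalar; dividing by the cluster size |T_j(ID)| therefore requires no decryption. A sketch continuing the assumed phe-based example above:

```python
def derivative_approximations(payload: dict) -> dict:
    """Step S5 at U_B: encrypted per-cluster means, formulas (8) and (9)."""
    approx = {}
    for j, (enc_G, enc_H, idx) in payload.items():
        inv_size = 1.0 / len(idx)             # 1 / |T_j(ID)|, a plaintext scalar
        approx[j] = (enc_G * inv_size,        # En(g_bar_i) for every i in T_j(ID)
                     enc_H * inv_size)        # En(h_bar_i)
    return approx
```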
S6, dividing the common sample ID set according to the values of each characteristic of the labeled participant and the unlabeled participant, and calculating the respective gain values of the labeled participant and the unlabeled participant and the division information corresponding to the gain values, based on the divided common sample ID set and the derivative approximations of the samples in each class; wherein the labeled participant calculates gain values in plaintext form, and the unlabeled participant calculates gain values in ciphertext form;
Based on each value of each characteristic of the tagged participant U_A and the non-tag participant U_B, the training set T(ID) is divided, and based on the sample ID set obtained from each characteristic-attribute-value division and the derivative of each sample in the set, the tagged participant U_A and the non-tag participant U_B calculate their respective division gains; the non-tag participant U_B sends its own gain values to the tagged participant U_A. The tagged participant calculates gain values Gain_A in plaintext form, and the non-tag participant calculates gain values En(Gain_B) in ciphertext form;
When the tagged participant U_A and the non-tag participant U_B divide the training set T(ID), the gain for each candidate characteristic-value division selected by either party is calculated as shown in formula (10):
Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \mu    (10)
wherein G_L, G_R, H_L, H_R are respectively the first-derivative sums and second-derivative sums contained in the left and right subtrees after the samples are divided based on a characteristic value, and \lambda, \mu are the regularization coefficients of the regular term \Omega(f_t);
The tagged participant U_A and the non-tag participant U_B calculate their respective division gains; the specific steps include:
S6.1, dividing the sample ID set into a left subset I_L(ID) and a right subset I_R(ID) based on each characteristic-value division;
S6.2, calculating the first-derivative sum G_L and the second-derivative sum H_L on the set I_L(ID), and the first-derivative sum G_R and the second-derivative sum H_R on the set I_R(ID);
S6.3, calculating the corresponding gain value for each characteristic-value division according to formula (10), a plaintext sketch of which is given below.
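A plaintext sketch of formula (10) as the tagged participant can evaluate it (an illustration; lam and mu stand for the regularization coefficients \lambda and \mu):

```python
def split_gain(G_L: float, H_L: float, G_R: float, H_R: float,
               lam: float, mu: float) -> float:
    """Formula (10): split gain from the left/right derivative sums."""
    left   = G_L ** 2 / (H_L + lam)
    right  = G_R ** 2 / (H_R + lam)
    parent = (G_L + G_R) ** 2 / (H_L + H_R + lam)
    return 0.5 * (left + right - parent) - mu
```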
S7, decrypting the gain values of the non-labeled participant, comparing them with the gain values of the labeled participant, taking the division information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node of the current decision tree, and dividing the common sample ID set by the labeled participant and the non-labeled participant respectively according to the division result of the new node;
The tagged participant U_A decrypts the received gain values En(Gain_B) sent by the non-tag participant U_B to obtain Gain_B, compares Gain_A with Gain_B, takes the division information corresponding to the maximum gain value as the information of the current split point of the current tree, generates a new node of the current tree, and, according to the division information of the node, the tagged participant U_A and the non-tag participant U_B each divide T(ID).
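A sketch of step S7 under one common realization of this exchange: an additively homomorphic cryptosystem such as Paillier cannot evaluate the ratios in formula (10) under encryption, so the unlabeled participant is assumed here (as in SecureBoost-style protocols) to send the encrypted derivative sums of each candidate split, which U_A decrypts before scoring with the split_gain sketch above; all names are illustrative:

```python
def best_split(plain_candidates: list, encrypted_candidates: list,
               priv_key, lam: float, mu: float):
    """Step S7 at U_A: decrypt U_B's candidate sums, score all splits, keep the max.

    plain_candidates:     [(gain, split_info), ...] computed locally by U_A
    encrypted_candidates: [((En(G_L), En(H_L), En(G_R), En(H_R)), split_info), ...]
    """
    scored = list(plain_candidates)
    for enc_sums, info in encrypted_candidates:
        G_L, H_L, G_R, H_R = (priv_key.decrypt(c) for c in enc_sums)
        scored.append((split_gain(G_L, H_L, G_R, H_R, lam, mu), info))
    return max(scored, key=lambda item: item[0])   # division info of the maximum gain
```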
S8, judging whether the generation of the current decision tree reaches a preset condition according to the common sample ID sets divided by the labeled participant and the non-labeled participant respectively; when the preset condition is reached, stopping the generation, calculating the weight W_k^t of each leaf node of the current decision tree, recording it in the corresponding leaf node, and updating the predicted label value of each sample in the tagged participant U_A; if the preset condition is not reached, returning to S3;
S9, judging, according to the predicted label value of each sample in the labeled participant, whether the number of generated decision trees reaches a preset condition or whether the residual error is smaller than a given threshold; stopping the generation when either condition is met to obtain the optimized XGBoost model; otherwise, returning to S2.
Referring to fig. 2, a structure diagram of a communication cost optimization system based on longitudinal federal learning according to the present invention includes:
the data acquisition module is used for acquiring a common sample ID set of a labeled party and an unlabeled party;
the first data calculation module is used for performing derivation on the loss function of the current decision tree according to the common sample ID set to obtain a derivative result of each sample;
the data processing module is used for carrying out K-means clustering on the common sample ID set according to the derivation result to obtain a clustering result;
the data encryption module is used for respectively carrying out summation calculation on derivative results of samples contained in each class according to clustering results, encrypting by using a homomorphic encryption algorithm to obtain a ciphertext, and sending the ciphertext and the clustering results to a non-tag participant;
the second data calculation module is used for calculating derivative approximate values of samples in each class by the non-tag participator according to the ciphertext and the clustering result;
the third data calculation module is used for dividing the common sample ID set according to the value of each characteristic of the labeled party and the unlabeled party, and calculating respective gain values of the labeled party and the unlabeled party and division information corresponding to the gain values based on the divided common sample ID set and derivative approximate values of samples in each class; wherein, the participant with the tag calculates to obtain a gain value in a plaintext form, and the participant without the tag calculates to obtain a gain value in a ciphertext form;
the data decryption module is used for decrypting the gain value of the non-labeled party, comparing the gain value with the gain value of the labeled party, using the partition information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node of the current decision tree, and partitioning the common sample ID set by the labeled party and the non-labeled party respectively according to the partition result of the new node;
the first circulation module is used for judging whether the generation of the current decision tree reaches a preset condition or not according to a common sample ID set divided by a labeled party and a non-labeled party respectively, stopping the generation when the preset condition is reached, calculating the weight of each leaf node of the current decision tree, recording the weight in the corresponding leaf node, and updating the predicted label value of each sample in the labeled party; when the preset value is not reached, returning to the data processing module;
the second circulation module is used for judging, according to the predicted label value of each sample in the labeled participant, whether the number of generated decision trees reaches a preset condition or whether the residual error is smaller than a given threshold, and stopping the generation when either condition is met, so as to obtain the optimized XGBoost model; otherwise, returning to the first data calculation module.
Referring to fig. 5-8, these are the results of training tests of the XGBoost model with manually input cluster numbers on the public Boston house-price dataset, where the mean square error (MSE), the mean absolute error (MAE) and the R-squared value (R2) serve as model performance evaluation indexes, together with the total communication traffic (TC) of the modeling process. Comparing the model performance obtained with different input cluster numbers in the figures shows that the smaller the number of clusters input, that is, the fewer the classes produced by the clustering operation, the smaller the communication volume of the XGBoost model.
Referring to fig. 9-12, these are the results of training tests of the XGBoost model before and after optimization on the public Boston house-price dataset, with the same indexes MSE, MAE and R2 as model performance evaluation indexes, together with the total communication traffic TC of the modeling process, where the Baseline Model denotes the XGBoost model before optimization and the Optimization Model denotes the XGBoost model after optimization, i.e., the XGBoost model with the silhouette-coefficient-based self-adaptive K-means clustering added. Comparison with the corresponding indexes of the XGBoost model before optimization shows that the optimized model loses no performance while the communication traffic TC is compressed by 51 percent. The method therefore greatly reduces the communication traffic and saves more communication resources than the existing method, while keeping the loss of the model's original performance minimal.
Compared with the existing longitudinal federal XGBoost modeling process, the silhouette-coefficient-based self-adaptive K-means clustering method automatically adjusts the cluster number, and optimizing the cluster number greatly improves the communication efficiency; at the same time, because the cluster number is optimized automatically, replacing the true values with the derivative means calculated from the clustering result introduces only an extremely small error into the subsequent related calculations, and on some public datasets the error is even zero.
An embodiment of the present invention provides a terminal device. The terminal device of this embodiment includes: a processor, a memory, and a computer program stored in the memory and executable on the processor. The processor implements the steps in each of the communication cost optimization method embodiments described above when executing the computer program. Alternatively, the processor implements the functions of the modules/units in the above device embodiments when executing the computer program.
The computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention.
The communication cost optimization device/terminal device may be a desktop computer, a notebook, a palm top computer, a cloud server, or other computing devices. The communication cost optimization device/terminal equipment may include, but is not limited to, a processor, a memory.
The Processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
The memory may be used to store the computer programs and/or modules, and the processor may implement the various functions of the communication cost optimization apparatus/terminal device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory.
The communication cost optimization device/terminal device integrated module/unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, read-Only Memory (ROM), random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The communication cost optimization method based on longitudinal federal learning is characterized by comprising the following steps of:
s1, acquiring a common sample ID set of a labeled party and a non-labeled party;
s2, according to the common sample ID set, carrying out derivation on the loss function of the current decision tree to obtain a derivative result of each sample;
s3, performing K-means clustering on the common sample ID set by the derivative result to obtain a clustering result;
s4, respectively carrying out summation calculation on the derivative result of the sample contained in each class according to the clustering result, encrypting by using a homomorphic encryption algorithm to obtain a ciphertext, and sending the ciphertext and the clustering result to a non-tag participant;
s5, calculating a derivative approximate value of each class of sample by the non-tag participant according to the ciphertext and the clustering result;
s6, dividing the common sample ID set according to the value of each characteristic of the labeled party and the unlabeled party, and calculating respective gain values of the labeled party and the unlabeled party and division information corresponding to the gain values based on the divided common sample ID set and derivative approximate values of samples in each class; wherein, the participant with the tag calculates to obtain a gain value in a plaintext form, and the participant without the tag calculates to obtain a gain value in a ciphertext form;
s7, decrypting the gain value of the non-labeled party, comparing the gain value with the gain value of the labeled party, taking the division information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node of the current decision tree, and dividing the common sample ID set by the labeled party and the non-labeled party respectively according to the division result of the new node;
s8, judging whether the generation of the current decision tree reaches a preset condition or not according to a common sample ID set divided by a labeled participant and a non-labeled participant respectively, stopping the generation when the preset condition is reached, calculating the weight of each leaf node of the current decision tree, recording the weight in a corresponding leaf node, and updating the predicted label value of each sample in the labeled participant; if the preset value is not reached, returning to S3;
s9, judging whether the generation quantity of the decision tree reaches a predicted condition or the residual error is smaller than a given threshold value or not according to the predicted label value of each sample in the labeled participant, and stopping generation when the predicted condition or the residual error is smaller than the given threshold value to obtain an optimized XGboost model; otherwise, return to S2.
2. The communication cost optimization method based on longitudinal federal learning as claimed in claim 1, wherein the step S1 of obtaining the common sample ID set by a privacy-preserving set intersection method specifically comprises the following steps:
the tagged participant owns data {I_A(ID), f_A(id)}, and the untagged participant owns data {I_B(ID), f_B(id)};
the tagged party and the untagged party calculate the common sample ID set T(ID) according to the PSI method, as shown in formula (1):
T(ID) = I_A(ID) ∩ I_B(ID)    (1)
wherein I_A(ID), I_B(ID) represent the sample ID sets owned by the tagged party and the untagged party respectively, and f_A(id), f_B(id) represent the characteristic attributes corresponding to each sample id ∈ I(ID) in the tagged party and the untagged party respectively.
3. The communication cost optimization method based on longitudinal federated learning according to claim 1, wherein the current decision tree in S2 is the t-th tree, whose objective function L^{(t)} is shown in formula (2):
L^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)    (2)
wherein the loss function represents the difference between the true value y_i and the predicted value \hat{y}_i, and \Omega(f_t) represents a regularization term; the loss function is shown in formula (3):
l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2    (3)
the derivatives of the loss function comprise the first derivative g_i and the second derivative h_i, as shown in formulas (4) and (5):
g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})    (4)
h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})    (5)
4. the communication cost optimization method based on longitudinal federal learning as claimed in claim 1, wherein in S3, K-means clustering is an adaptive K-means clustering method based on contour coefficients, and the specific steps are as follows:
s3.1, initializing an average contour coefficient threshold value to be-1;
s3.2, setting a loop body according to the number N of samples contained in the common sample set T (ID), wherein the loop times are set to be (1,N);
s3.3, setting an alarm capture mechanism in the loop body, and ending the loop when the number of clusters is smaller than a set threshold value;
s3.4, when the loop body is executed, updating the threshold value of the clustering number in the K-means clustering algorithm to be the number of times of executing the loop body;
s3.5, in each cycle, clustering the derivative result to obtain an average contour coefficient under the clustering, comparing the average contour coefficient with the existing average contour coefficient, updating the existing contour coefficient to the maximum contour coefficient, and recording the clustering result corresponding to the maximum contour coefficient;
and S3.6, after the circulation is finished, taking the cluster corresponding to the maximum outline coefficient as a final result of clustering the derivative result to obtain a training set subset ID set contained in each cluster.
5. The communication cost optimization method based on longitudinal federated learning according to claim 1, wherein in S4, a summation operation is performed on the derivative results of the samples included in each class, a homomorphic encryption Paillier algorithm is used to encrypt the summation result, and a ciphertext and a clustering result are sent to a non-tag participant.
6. The communication cost optimization method based on longitudinal federated learning of claim 1, wherein in S5, the unlabeled participant calculates the derivative approximations of the samples in each class from the ciphertext and the clustering result, comprising the first-derivative ciphertext approximation En(\bar{g}_i) and the second-derivative ciphertext approximation En(\bar{h}_i), calculated according to formulas (6) and (7):
En(\bar{g}_i) = En(G_j) / |T_j(ID)|    (6)
En(\bar{h}_i) = En(H_j) / |T_j(ID)|    (7)
wherein En(G_j), En(H_j) are the first-derivative sum and second-derivative sum in ciphertext form in each class, and T_j(ID) is the training-set subset ID set contained in each cluster.
7. The method of claim 1, wherein the tagged and untagged participants in S6 calculate their respective partition gains according to formula (8):
Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \mu    (8)
wherein G_L, G_R, H_L, H_R are respectively the first-derivative sums and second-derivative sums contained in the left and right subtrees after the samples are divided based on a characteristic value, and \lambda, \mu are the regularization coefficients of the regular term \Omega(f_t);
the specific calculation steps comprise:
S6.1, dividing the sample ID set into a left subset I_L(ID) and a right subset I_R(ID) based on each characteristic-value division;
S6.2, calculating the first-derivative sum G_L and the second-derivative sum H_L on the set I_L(ID), and the first-derivative sum G_R and the second-derivative sum H_R on the set I_R(ID);
S6.3, calculating the corresponding gain value for each characteristic-value division according to formula (8).
8. A longitudinal federal learning-based communication cost optimization system, comprising:
the data acquisition module is used for acquiring a common sample ID set of a labeled party and an unlabeled party;
the first data calculation module is used for performing derivation on the loss function of the current decision tree according to the common sample ID set to obtain a derivative result of each sample;
the data processing module is used for carrying out K-means clustering on the common sample ID set according to the derivation result to obtain a clustering result;
the data encryption module is used for respectively carrying out summation calculation on derivative results of samples contained in each class according to clustering results, encrypting by using a homomorphic encryption algorithm to obtain a ciphertext, and sending the ciphertext and the clustering results to a non-tag participant;
the second data calculation module is used for calculating derivative approximate values of samples in each class by the non-tag participator according to the ciphertext and the clustering result;
the third data calculation module is used for dividing the common sample ID set according to the value of each characteristic of the labeled party and the unlabeled party, and calculating respective gain values of the labeled party and the unlabeled party and division information corresponding to the gain values based on the divided common sample ID set and derivative approximate values of samples in each class; wherein, the participant with the tag calculates to obtain a gain value in a plaintext form, and the participant without the tag calculates to obtain a gain value in a ciphertext form;
the data decryption module is used for decrypting the gain value of the non-labeled party, comparing the gain value with the gain value of the labeled party, taking the division information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node of the current decision tree, and dividing the common sample ID set by the labeled party and the non-labeled party according to the division result of the new node;
the first circulation module is used for judging whether the generation of the current decision tree reaches a preset condition or not according to a common sample ID set divided by a labeled party and a non-labeled party respectively, stopping the generation when the preset condition is reached, calculating the weight of each leaf node of the current decision tree, recording the weight in the corresponding leaf node, and updating the predicted label value of each sample in the labeled party; when the preset value is not reached, returning to the data processing module;
the second circulation module is used for judging, according to the predicted label value of each sample in the labeled participant, whether the number of generated decision trees reaches a preset condition or whether the residual error is smaller than a given threshold, and stopping the generation when either condition is met, so as to obtain the optimized XGBoost model; otherwise, returning to the first data calculation module.
9. An apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202211008707.8A 2022-08-22 2022-08-22 Communication cost optimization method, system, device and medium based on longitudinal federal learning Pending CN115481415A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211008707.8A CN115481415A (en) 2022-08-22 2022-08-22 Communication cost optimization method, system, device and medium based on longitudinal federal learning

Publications (1)

Publication Number Publication Date
CN115481415A true CN115481415A (en) 2022-12-16

Family

ID=84422905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211008707.8A Pending CN115481415A (en) 2022-08-22 2022-08-22 Communication cost optimization method, system, device and medium based on longitudinal federal learning

Country Status (1)

Country Link
CN (1) CN115481415A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117675411A (en) * 2024-01-31 2024-03-08 智慧眼科技股份有限公司 Global model acquisition method and system based on longitudinal XGBoost algorithm
CN117675411B (en) * 2024-01-31 2024-04-26 智慧眼科技股份有限公司 Global model acquisition method and system based on longitudinal XGBoost algorithm


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination