CN115481415A - Communication cost optimization method, system, device and medium based on longitudinal federal learning - Google Patents
- Publication number: CN115481415A
- Application number: CN202211008707.8A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
Abstract
The invention discloses a communication cost optimization method, system, device and medium based on longitudinal federated learning. The method comprises: obtaining the sample IDs common to all participants, thereby determining the jointly held data set on which every participant trains; clustering all training samples into groups, computing the derivative of the loss function for each sample, summing the derivatives within each cluster, encrypting the sums, and sending the ciphertexts together with the cluster memberships to the unlabeled participant; the unlabeled participant computing an approximate derivative for each sample from the received information, then computing split gains and sending the results to the labeled participant; and the labeled participant selecting the position with the maximum gain value as the optimal split point from the received information and, if the split point belongs to the unlabeled participant, sending the split-point information back to the unlabeled participant. The method ensures the privacy and security of the information exchanged by both parties, significantly reduces the volume of data transmitted, and preserves the validity of the transmitted information.
Description
Technical Field
The invention belongs to the technical field of communication and relates to a communication cost optimization method, system, device and medium based on longitudinal federated learning.
Background
XGBoost is an algorithm based on GBDT; its basic idea is the same as GBDT's, but it improves many modeling details. Designing an XGBoost model for the longitudinal (vertical) federated learning scenario requires combining the characteristics of longitudinal federated learning with those of the XGBoost algorithm. In a federated scenario, unlike centralized modeling, the raw data of each participant can neither be stored directly at any single party nor gathered at a third party; that is, data never leaves its local holder. The key requirements are the privacy and security of each participant's local data and the validity of the resulting model. Federated learning is a distributed machine learning framework in which parties holding different local data jointly train a model while the privacy and security of data in each region are guaranteed. During training, the participants may exchange model-related information (such as model structure and parameters, whether in plaintext, encrypted, or with noise added), but the local data that each participant contributes to training is never transmitted to any other party. This communication pattern secures each party's local training data and reduces the risk of leaking data privacy. Longitudinal federated learning suits scenarios in which the participants' sample spaces overlap substantially while their feature spaces overlap little or not at all.
In the existing XGBoost algorithm for the longitudinal federated learning scenario, the modeling procedure requires frequent, high-volume communication among all participants. As a result, the algorithm suffers from low communication efficiency and incurs large overhead in practical use.
Disclosure of Invention
The invention aims to solve the prior-art problem that frequent, high-volume communication among participants leads to low communication efficiency, and provides a communication cost optimization method, system, device and medium based on longitudinal federated learning.
To achieve this purpose, the invention adopts the following technical scheme:
The communication cost optimization method based on longitudinal federated learning comprises the following steps:
S1, acquiring the common sample ID set of the labeled participant and the unlabeled participant;
S2, differentiating the loss function of the current decision tree over the common sample ID set to obtain the derivative results of each sample;
S3, performing K-means clustering on the common sample ID set using the derivative results to obtain a clustering result;
S4, summing the derivative results of the samples contained in each cluster, encrypting the sums with a homomorphic encryption algorithm to obtain ciphertexts, and sending the ciphertexts and the clustering result to the unlabeled participant;
S5, the unlabeled participant calculating a derivative approximation for the samples in each cluster from the ciphertexts and the clustering result;
S6, partitioning the common sample ID set according to each value of each feature held by the labeled and unlabeled participants, and calculating each participant's gain values and the corresponding split information based on the partitioned sample ID sets and the derivative approximations; the labeled participant obtains gain values in plaintext, while the unlabeled participant obtains gain values in ciphertext;
S7, decrypting the gain values of the unlabeled participant, comparing them with the gain values of the labeled participant, taking the split information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node, and having both participants partition the common sample ID set according to the split at the new node;
S8, judging, from the sample ID sets held by the two participants, whether generation of the current decision tree has reached a preset condition; if so, stopping growth, calculating the weight of each leaf node of the current decision tree, recording it in the corresponding leaf node, and updating the predicted label value of each sample at the labeled participant; otherwise, returning to S3;
S9, judging, from the predicted label values at the labeled participant, whether the number of generated decision trees has reached a preset condition or the residual is smaller than a given threshold; if so, stopping generation to obtain the optimized XGBoost model; otherwise, returning to S2.
The invention is further improved in that:
In step S1, the common sample ID set is obtained by a private set intersection (PSI) method, specifically as follows:
The labeled participant owns data {I_A(ID), f_A(id)} and the unlabeled participant owns data {I_B(ID), f_B(id)};
the two participants compute the common sample ID set T(ID) by the PSI method, as shown in formula (1):
T(ID) = I_A(ID) ∩ I_B(ID)    (1)
where I_A(ID) and I_B(ID) denote the sample ID sets owned by the labeled and unlabeled participants respectively, and f_A(id) and f_B(id) denote the feature attributes corresponding to sample id ∈ I(ID) at the labeled and unlabeled participants respectively.
The current decision tree in S2 is the t-th tree, and its objective function L^{(t)} is shown in formula (2):
L^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)    (2)
where the loss function l measures the difference between the true value y_i and the predicted value \hat{y}_i, and \Omega(f_t) is a regularization term; the squared loss function is shown in formula (3):
l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2    (3)
The derivatives of the loss function comprise the first derivative g_i and the second derivative h_i, shown in formulas (4) and (5):
g_i = \partial l / \partial \hat{y}_i^{(t-1)} = 2(\hat{y}_i^{(t-1)} - y_i)    (4)
h_i = \partial^2 l / \partial (\hat{y}_i^{(t-1)})^2 = 2    (5)
In S3, the K-means clustering is an adaptive K-means clustering method based on the silhouette coefficient, specifically comprising the following steps:
S3.1, initializing the average silhouette coefficient threshold to -1;
S3.2, setting a loop body according to the number N of samples contained in the common sample set T(ID), with the number of iterations ranging from 1 to N;
S3.3, setting an exception-capture mechanism in the loop body, ending the loop when the number of clusters falls below a set threshold;
S3.4, on each execution of the loop body, updating the cluster-number parameter of the K-means algorithm to the current iteration count;
S3.5, in each iteration, clustering the derivative results, obtaining the average silhouette coefficient under that clustering, comparing it with the best average silhouette coefficient so far, updating the stored coefficient to the larger value, and recording the clustering result corresponding to the maximum silhouette coefficient;
S3.6, after the loop ends, taking the clustering corresponding to the maximum silhouette coefficient as the final clustering of the derivative results, obtaining the training-subset ID set contained in each cluster.
In S4, the derivative results of the samples contained in each cluster are summed, the sums are encrypted with the Paillier homomorphic encryption algorithm, and the ciphertexts together with the clustering result are sent to the unlabeled participant.
In S5, the unlabeled participant calculates the derivative approximations of the samples in each cluster from the ciphertexts and the clustering result, comprising the first-derivative ciphertext approximation En(ĝ_i) and the second-derivative ciphertext approximation En(ĥ_i), calculated according to formulas (6) and (7):
En(ĝ_i) = En(G_j / |T_j(ID)|),  i ∈ T_j(ID)    (6)
En(ĥ_i) = En(H_j / |T_j(ID)|),  i ∈ T_j(ID)    (7)
where En(G_j) and En(H_j) are the first-derivative sum and second-derivative sum of cluster j in ciphertext form, and T_j(ID) is the training-subset ID set contained in cluster j.
In S6, the labeled and unlabeled participants calculate their respective split gains according to formula (8):
Gain = (1/2) [ G_L^2/(H_L + λ) + G_R^2/(H_R + λ) − (G_L + G_R)^2/(H_L + H_R + λ) ] − μ    (8)
where G_L, G_R, H_L, H_R are respectively the first-derivative sums and second-derivative sums contained in the left and right subtrees after the samples are split on the feature value, and λ, μ are the regularization coefficients in the regularization term Ω(f_t);
the specific calculation steps include:
S6.1, splitting the sample ID set into a left subset I_L(ID) and a right subset I_R(ID) based on each candidate feature value;
S6.2, calculating the first-derivative sum G_L and second-derivative sum H_L over the set I_L(ID), and the first-derivative sum G_R and second-derivative sum H_R over the set I_R(ID);
S6.3, calculating the gain value corresponding to each candidate feature-value split according to formula (8).
A communication cost optimization system based on longitudinal federated learning, comprising:
a data acquisition module for acquiring the common sample ID set of the labeled participant and the unlabeled participant;
a first data calculation module for differentiating the loss function of the current decision tree over the common sample ID set to obtain the derivative results of each sample;
a data processing module for performing K-means clustering on the common sample ID set according to the derivative results to obtain a clustering result;
a data encryption module for summing the derivative results of the samples contained in each cluster according to the clustering result, encrypting the sums with a homomorphic encryption algorithm to obtain ciphertexts, and sending the ciphertexts and the clustering result to the unlabeled participant;
a second data calculation module for the unlabeled participant to calculate the derivative approximations of the samples in each cluster from the ciphertexts and the clustering result;
a third data calculation module for partitioning the common sample ID set according to each value of each feature held by the labeled and unlabeled participants, and calculating each participant's gain values and the corresponding split information based on the partitioned sample ID sets and the derivative approximations, the labeled participant obtaining gain values in plaintext and the unlabeled participant obtaining gain values in ciphertext;
a data decryption module for decrypting the gain values of the unlabeled participant, comparing them with the gain values of the labeled participant, taking the split information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node, and having both participants partition the common sample ID set according to the split at the new node;
a first loop module for judging, from the sample ID sets held by the two participants, whether generation of the current decision tree has reached a preset condition, and if so, stopping growth, calculating the weight of each leaf node, recording it in the corresponding leaf node, and updating the predicted label value of each sample at the labeled participant, otherwise returning to the data processing module;
a second loop module for judging, from the predicted label values at the labeled participant, whether the number of generated decision trees has reached a preset condition or the residual is smaller than a given threshold, and if so, stopping generation to obtain the optimized XGBoost model, otherwise returning to the first data calculation module.
An apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method of any one of the preceding claims when executing the computer program.
A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method according to any of the preceding claims.
Compared with the prior art, the invention has the following beneficial effects:
In the communication cost optimization method based on longitudinal federated learning, a K-means clustering algorithm is adopted so that processed clustering information replaces the per-sample first- and second-derivative values sent in the prior art, which significantly reduces the communication volume and effectively improves communication efficiency; and the unlabeled participant calculates the derivative mean within each cluster as an approximation of the true values from the processed and encrypted clustering information, which limits the error introduced by computing with approximate values.
Furthermore, the adaptive K-means clustering method based on the silhouette coefficient adjusts the number of clusters automatically, so that the cluster count is optimized.
Drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be regarded as limiting its scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of the communication cost optimization method based on longitudinal federated learning according to the present invention;
FIG. 2 is a block diagram of the communication cost optimization system based on longitudinal federated learning of the present invention;
FIG. 3 is a logic diagram of the communication cost optimization modeling method based on longitudinal federated learning according to the present invention;
FIG. 4 is a logic flow diagram of the adaptive K-means clustering method based on the silhouette coefficient;
FIG. 5 shows the mean square error (MSE) of XGBoost models trained on the public Boston housing price data with different manually set K-means cluster numbers;
FIG. 6 shows the mean absolute error (MAE) of XGBoost models trained on the public Boston housing price data with different manually set K-means cluster numbers;
FIG. 7 shows the R-squared (R2) of XGBoost models trained on the public Boston housing price data with different manually set K-means cluster numbers;
FIG. 8 shows the total communication (TC) during modeling of XGBoost models with different manually set K-means cluster numbers;
FIG. 9 compares the mean square error (MSE) of the final models when the XGBoost model is trained on the public Boston housing price data before and after optimization;
FIG. 10 compares the mean absolute error (MAE) of the final models when the XGBoost model is trained on the public Boston housing price data before and after optimization;
FIG. 11 compares the R-squared (R2) of the final models when the XGBoost model is trained on the public Boston housing price data before and after optimization;
FIG. 12 compares the communication volume and total communication (TC) when the XGBoost model is trained on the public Boston housing price data before and after optimization.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the embodiments of the present invention, it should be noted that terms such as "upper", "lower", "horizontal" and "inner", if used, indicate orientations or positional relationships based on those shown in the drawings, or those in which the product of the invention is conventionally placed in use; they are used only for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the present invention. Furthermore, the terms "first", "second" and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Furthermore, the term "horizontal", if present, does not require that the component be absolutely horizontal; it may be slightly inclined. For example, "horizontal" merely means that the direction is more nearly horizontal than "vertical"; the structure need not be perfectly horizontal but may be slightly inclined.
In the description of the embodiments of the present invention, it should further be noted that, unless otherwise explicitly stated or limited, the terms "disposed", "mounted", "connected" and "coupled" should be interpreted broadly: for example, a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediate medium; or an internal communication between two elements. Those skilled in the art can understand the specific meanings of the above terms in the present invention according to the specific situation.
The invention is described in further detail below with reference to the accompanying drawings:
Referring to fig. 1, the flow of the communication cost optimization method based on longitudinal federated learning of the present invention specifically comprises the following steps:
S1, acquiring the common sample ID set of the labeled participant and the unlabeled participant;
Based on the sample ID sets I_A(ID) and I_B(ID) owned respectively by the labeled participant U_A and the unlabeled participant U_B, the sample ID set T(ID) common to both parties is obtained by a private set intersection (PSI) method;
The labeled participant U_A owns data {I_A(ID), f_A(id)} and the unlabeled participant U_B owns data {I_B(ID), f_B(id)}; U_A and U_B compute the common sample set T(ID) by the PSI method, as shown in formula (1):
T(ID) = I_A(ID) ∩ I_B(ID)    (1)
where I_A(ID) and I_B(ID) denote the sample ID sets owned by the labeled and unlabeled participants respectively, and f_A(id) and f_B(id) denote the feature attributes corresponding to sample id ∈ I(ID) at the labeled and unlabeled participants respectively.
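As a minimal sketch of the set operation in formula (1): the intersection below is computed in the clear as a stand-in for a real PSI protocol (which would reveal only the common IDs to each party); the function name and toy data are hypothetical.

```python
# Stand-in for the PSI step of S1: a plain set intersection.
# A real private-set-intersection protocol would compute the same
# result without exposing either party's non-common IDs.

def common_sample_ids(ids_a, ids_b):
    """T(ID) = I_A(ID) ∩ I_B(ID), as in formula (1)."""
    return sorted(set(ids_a) & set(ids_b))

# Toy data: each participant holds its own sample IDs.
I_A = [1, 2, 3, 5, 8]   # labeled participant U_A
I_B = [2, 3, 5, 7, 9]   # unlabeled participant U_B
print(common_sample_ids(I_A, I_B))  # -> [2, 3, 5]
```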
S2, according to the common sample ID set, carrying out derivation on the loss function of the current decision tree to obtain a derivative result of each sample;
Modeling of the t-th tree begins with the labeled participant U_A differentiating the squared loss function of the current model over the training sample ID set T(ID), determining the first derivative g_i and the second derivative h_i of each sample in the training set;
The current decision tree is the t-th tree, and its objective function L^{(t)} is shown in formula (2):
L^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)    (2)
where the loss function l measures the difference between the true value y_i and the predicted value \hat{y}_i, and \Omega(f_t) is a regularization term; the squared loss function is shown in formula (3):
l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2    (3)
The derivatives of the loss function comprise the first derivative g_i and the second derivative h_i, shown in formulas (4) and (5):
g_i = \partial l / \partial \hat{y}_i^{(t-1)} = 2(\hat{y}_i^{(t-1)} - y_i)    (4)
h_i = \partial^2 l / \partial (\hat{y}_i^{(t-1)})^2 = 2    (5)
s3, performing K-means clustering on the sample ID set according to the derivative result to obtain a clustering result;
The labeled participant U_A performs adaptive, silhouette-coefficient-based K-means clustering on the training sample set T(ID) using the derivative results (g_i, h_i), obtaining the training-subset ID set T_j(ID) contained in each cluster; referring to fig. 4, the specific steps are:
S3.1, initializing the average silhouette coefficient threshold to -1;
S3.2, setting a loop body according to the number N of samples contained in the common sample set T(ID), with the number of iterations ranging from 1 to N;
S3.3, setting an exception-capture mechanism in the loop body, ending the loop automatically when the number of clusters falls below a set threshold, so as to reduce the number of loop executions and the time overhead;
S3.4, on each execution of the loop body, updating the cluster-number parameter of the K-means algorithm to the current iteration count;
S3.5, in each iteration, clustering the derivative results (g_i, h_i), obtaining the average silhouette coefficient under the current clustering, comparing it with the best average silhouette coefficient so far, updating the stored coefficient to the larger value, and recording the clustering result of (g_i, h_i) corresponding to the maximum silhouette coefficient;
S3.6, after the loop ends, taking the clustering corresponding to the maximum silhouette coefficient as the final clustering of the derivative results, obtaining the training-subset ID set T_j(ID) contained in each cluster, where T_j(ID) ⊆ T(ID).
S4, respectively carrying out summation calculation on the derivative result of the sample contained in each class according to the clustering result, encrypting by using a homomorphic encryption algorithm to obtain a ciphertext, and sending the ciphertext and the clustering result to a non-tag participant;
Based on the K-means clustering result T_j(ID), the labeled participant U_A sums the g_i and h_i of the samples contained in each cluster to obtain G_j and H_j, encrypts each cluster's G_j and H_j with the Paillier homomorphic encryption algorithm to obtain the ciphertexts En(G_j) and En(H_j), and sends {En(G_j), En(H_j), T_j(ID)} to the unlabeled participant U_B. G_j and H_j are calculated as shown in formulas (6) and (7):
G_j = \sum_{i \in T_j(ID)} g_i    (6)
H_j = \sum_{i \in T_j(ID)} h_i    (7)
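The Paillier step of S4 can be illustrated with a toy implementation (Python 3.9+). The key size below is deliberately tiny and insecure, the primes and derivative values are hypothetical, and a production system would use a vetted library.

```python
import math, random

def paillier_keygen(p=1789, q=2003):
    """Textbook Paillier key generation from two (toy-sized) primes."""
    n = p * q
    n2 = n * n
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
    return (n, g), (lam, mu, n)

def encrypt(pk, m):
    """En(m) = g^m * r^n mod n^2 (negative m encoded mod n)."""
    n, g = pk
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2

def decrypt(sk, c):
    lam, mu, n = sk
    m = ((pow(c, lam, n * n) - 1) // n) * mu % n
    return m if m <= n // 2 else m - n   # decode negatives

def add_cipher(pk, c1, c2):
    """Additive homomorphism: En(a) * En(b) mod n^2 = En(a + b)."""
    n, _ = pk
    return (c1 * c2) % (n * n)

# Labeled party: sum the (integer-scaled) first derivatives of one cluster
# and send only the encrypted sum En(G_j).
pk, sk = paillier_keygen()
g_cluster = [-3, 1, 2, -4]                 # hypothetical g_i values
En_G_j = encrypt(pk, sum(g_cluster))       # En(G_j), here G_j = -4
```

The additive property also lets ciphertexts be aggregated without decryption, which is what makes it safe for the unlabeled party to work with the cluster sums rather than per-sample derivatives.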
s5, calculating a derivative approximate value of each class of sample by the non-tag participant according to the ciphertext and the clustering result;
Based on the received ciphertexts and clustering result {En(G_j), En(H_j), T_j(ID)}, the unlabeled participant U_B calculates the ciphertext approximations En(ĝ_i) and En(ĥ_i) of the first and second derivatives of each sample in each cluster, as shown in formulas (8) and (9):
En(ĝ_i) = En(G_j / |T_j(ID)|),  i ∈ T_j(ID)    (8)
En(ĥ_i) = En(H_j / |T_j(ID)|),  i ∈ T_j(ID)    (9)
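Ignoring the encryption layer, the approximation of S5 replaces each sample's derivative with its cluster mean. A plaintext sketch with hypothetical cluster sums and IDs (the patent performs the corresponding computation on Paillier ciphertexts):

```python
def approx_derivatives(cluster_sums, cluster_ids):
    """Each sample in cluster j gets g_hat = G_j / |T_j(ID)| (formula (8),
    here in plaintext for clarity)."""
    approx = {}
    for j, ids in cluster_ids.items():
        mean = cluster_sums[j] / len(ids)
        for i in ids:
            approx[i] = mean
    return approx

G = {0: -3.0, 1: 4.0}                    # first-derivative sums per cluster
T = {0: [10, 11, 12], 1: [13, 14]}       # T_j(ID)
print(approx_derivatives(G, T))
# -> {10: -1.0, 11: -1.0, 12: -1.0, 13: 2.0, 14: 2.0}
```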
S6, dividing the common sample ID set according to the value of each characteristic of the labeled party and the unlabeled party, and calculating respective gain values of the labeled party and the unlabeled party and division information corresponding to the gain values based on the divided common sample ID set and the approximate value of the derivative of the sample in each class; wherein, the participant with a label calculates to obtain a gain value in a plaintext form, and the participant without a label calculates to obtain a gain value in a ciphertext form;
The labeled participant U_A and the unlabeled participant U_B partition the training set T(ID) according to each value of each feature, and based on the sample ID sets obtained from each feature-value split and the derivatives of the samples in those sets, calculate their respective split gains; the unlabeled participant U_B sends its gain values to the labeled participant U_A. Here the labeled participant calculates gain values Gain_A in plaintext, while the unlabeled participant calculates gain values En(Gain_B) in ciphertext.
When the labeled participant U_A and the unlabeled participant U_B partition the training set T(ID), the gain for a split on a selected feature value is calculated by either participant according to formula (10):
Gain = (1/2) [ G_L^2/(H_L + λ) + G_R^2/(H_R + λ) − (G_L + G_R)^2/(H_L + H_R + λ) ] − μ    (10)
where G_L, G_R, H_L, H_R are respectively the first-derivative sums and second-derivative sums contained in the left and right subtrees after the samples are split on the feature value, and λ, μ are the regularization coefficients in the regularization term Ω(f_t);
the tagged participant U A And a non-tag participant U B Calculating respective division gains, and the specific steps comprise:
s6.1, dividing the sample ID set into a left subset I and a right subset I based on each eigenvalue division L (ID) and I R (ID);
S6.2, calculating a set I L First derivative on (ID) and G L And the second derivative sum H L Computing a set I R First derivative on (ID) and G R And the second derivative sum H R ;
And S6.2, calculating corresponding gain values when each characteristic value is divided according to the formula (10).
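Steps S6.1–S6.3 can be sketched as follows; `lam` and `mu` correspond to λ and μ in formula (10), and the feature values and derivatives below are hypothetical toy data.

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, mu=0.0):
    """XGBoost-style split gain of formula (10); lam and mu are the
    regularization coefficients of Omega(f_t)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - mu

def best_split(feature, g, h, lam=1.0, mu=0.0):
    """Evaluate every candidate split of one feature (S6.1-S6.3) and
    return (best gain, split threshold)."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    best = (float("-inf"), None)
    GL = HL = 0.0
    G, H = sum(g), sum(h)
    for pos, i in enumerate(order[:-1]):
        GL += g[i]; HL += h[i]                       # left-subset sums
        gain = split_gain(GL, HL, G - GL, H - HL, lam, mu)
        thr = (feature[i] + feature[order[pos + 1]]) / 2
        if gain > best[0]:
            best = (gain, thr)
    return best

gain, thr = best_split([1.0, 2.0, 10.0, 11.0],
                       [-1.0, -1.0, 1.0, 1.0],
                       [1.0, 1.0, 1.0, 1.0])
# thr == 6.0: the split between the two sample groups maximizes the gain
```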
S7, decrypting the gain value of the non-labeled party, comparing the gain value with the gain value of the labeled party, taking the division information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node of the current decision tree, and dividing the common sample ID set by the labeled party and the non-labeled party respectively according to the division result of the new node;
The labeled participant U_A decrypts the received gain values En(Gain_B) sent by the unlabeled participant U_B to obtain Gain_B, compares Gain_A and Gain_B, takes the split information corresponding to the maximum gain value as the information of the current split point of the current tree, generates a new node of the current tree, and, according to the split information of that node, both U_A and U_B partition T(ID).
S8, judging whether the generation of the current decision tree reaches a preset condition or not according to the common sample ID sets divided by the labeled participators and the unlabeled participators respectively, stopping the generation when the preset condition is reached, and calculating the weight W of each leaf node of the current decision tree k t Recorded in the corresponding leaf node, updated with the tagged participant U A The predicted tag value of each sample in (a); if the preset value is not reached, returning to S3;
S9, judging whether the number of generated decision trees reaches a preset condition or the residual error is smaller than a given threshold according to the predicted label values of the samples in the tagged participant; stopping generation when either condition is met, obtaining the optimized XGBoost model; otherwise, returning to S2.
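The overall flow of S2–S9 resembles a standard gradient-boosting loop with the stopping checks of S8 and S9. The following single-party, plaintext sketch illustrates only that loop structure, using squared loss and depth-1 trees (stumps); the clustering, homomorphic encryption, and two-party split search of S3–S7 are deliberately collapsed into a plain stump fit, and all names are illustrative, not from the patent.

```python
def fit_stump(x, residual):
    # best threshold split minimizing squared error of per-side mean predictions
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residual) if xi <= thr]
        right = [r for xi, r in zip(x, residual) if xi > thr]
        if not left or not right:
            continue
        wl, wr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - wl) ** 2 for r in left)
               + sum((r - wr) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, thr, wl, wr)
    _, thr, wl, wr = best
    return lambda xi: wl if xi <= thr else wr

def boost(x, y, max_trees=20, tol=1e-3, lr=0.5):
    pred = [0.0] * len(y)
    trees = []
    for _ in range(max_trees):                  # S9: tree-count condition
        residual = [yi - pi for yi, pi in zip(y, pred)]
        if sum(r * r for r in residual) < tol:  # S9: residual threshold
            break
        tree = fit_stump(x, residual)           # stand-in for S3-S7
        # S8: update predicted label values with the new tree
        pred = [pi + lr * tree(xi) for pi, xi in zip(pred, x)]
        trees.append(tree)
    return trees, pred
```

On a toy dataset the residual shrinks geometrically until the S9-style threshold triggers the stop.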
Referring to fig. 2, the structure of the communication cost optimization system based on longitudinal federal learning according to the present invention includes:
the data acquisition module is used for acquiring a common sample ID set of a labeled party and an unlabeled party;
the first data calculation module is used for performing derivation on the loss function of the current decision tree according to the common sample ID set to obtain a derivative result of each sample;
the data processing module is used for carrying out K-means clustering on the common sample ID set according to the derivation result to obtain a clustering result;
the data encryption module is used for respectively carrying out summation calculation on derivative results of samples contained in each class according to clustering results, encrypting by using a homomorphic encryption algorithm to obtain a ciphertext, and sending the ciphertext and the clustering results to a non-tag participant;
the second data calculation module is used for calculating derivative approximate values of samples in each class by the non-tag participator according to the ciphertext and the clustering result;
the third data calculation module is used for dividing the common sample ID set according to the value of each characteristic of the labeled party and the unlabeled party, and calculating respective gain values of the labeled party and the unlabeled party and division information corresponding to the gain values based on the divided common sample ID set and derivative approximate values of samples in each class; wherein, the participant with the tag calculates to obtain a gain value in a plaintext form, and the participant without the tag calculates to obtain a gain value in a ciphertext form;
the data decryption module is used for decrypting the gain value of the non-labeled party, comparing the gain value with the gain value of the labeled party, using the partition information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node of the current decision tree, and partitioning the common sample ID set by the labeled party and the non-labeled party respectively according to the partition result of the new node;
the first circulation module is used for judging whether the generation of the current decision tree reaches a preset condition according to the common sample ID sets divided by the tagged and untagged participants respectively, stopping the generation when the preset condition is reached, calculating the weight of each leaf node of the current decision tree, recording it in the corresponding leaf node, and updating the predicted label value of each sample in the tagged participant; when the preset condition is not reached, returning to the data processing module;
the second circulation module is used for judging whether the number of generated decision trees reaches a preset condition or the residual error is smaller than a given threshold according to the predicted label values of the samples in the tagged participant, stopping generation when either condition is met to obtain the optimized XGBoost model; otherwise, returning to the first data calculation module.
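The core of the traffic reduction in the data encryption and second data calculation modules is that only one encrypted derivative sum per cluster crosses the network, and the untagged participant recovers a per-sample approximation as the cluster mean. A minimal plaintext sketch of that bookkeeping follows; the homomorphic encryption is elided and all names are illustrative.

```python
from collections import defaultdict

def cluster_sums(derivatives, labels):
    # per-cluster derivative sums G_j and cluster sizes |T_j(ID)|
    # (this is what the tagged participant would encrypt and send)
    sums, counts = defaultdict(float), defaultdict(int)
    for d, c in zip(derivatives, labels):
        sums[c] += d
        counts[c] += 1
    return dict(sums), dict(counts)

def approximate(labels, sums, counts):
    # each sample's derivative is replaced by its cluster mean G_j / |T_j(ID)|
    # (this is what the untagged participant would reconstruct)
    return [sums[c] / counts[c] for c in labels]
```

With K clusters, the tagged participant transmits K ciphertexts per derivative kind instead of one per sample, which is the source of the communication saving.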
Referring to figs. 5-8, these are results of training tests on the public Boston housing price dataset for the XGBoost model with a manually input cluster number, where mean squared error (MSE), mean absolute error (MAE) and the R-squared value (R2) are used as model performance evaluation indexes, together with the total communication traffic (TC) of the modeling process. Comparing the model performance obtained with different input cluster numbers shows that the smaller the input cluster number, i.e. the fewer the classes produced by the clustering operation, the smaller the communication traffic of the XGBoost model.
Referring to figs. 9-12, these are results of training tests on the public Boston housing price dataset for the XGBoost model before and after optimization, where mean squared error (MSE), mean absolute error (MAE) and the R-squared value (R2) are used as model performance evaluation indexes, together with the total communication traffic (TC) of the modeling process; the Baseline Model denotes the XGBoost model before optimization, and the Optimization Model denotes the XGBoost model after optimization. The optimized model is the XGBoost model with the adaptive K-means clustering based on the silhouette coefficient added. Comparison with the corresponding indexes of the XGBoost model before optimization shows that the optimized model loses no performance while its communication traffic TC is compressed by 51%. The method therefore greatly reduces communication traffic and saves more communication resources than existing methods, while keeping the loss of the model's original performance minimal.
Compared with the existing longitudinal federated XGBoost modeling process, the adaptive K-means clustering method based on the silhouette coefficient automatically adjusts the cluster number, which greatly improves communication efficiency. Meanwhile, because the cluster number is optimized automatically, replacing the true derivative values with the cluster-mean derivatives computed from the clustering result introduces only an extremely small error into the subsequent calculations; on some public datasets the error is even zero.
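A stdlib-only sketch of the selection criterion described above: among candidate clusterings (one per candidate cluster number), keep the one with the largest average silhouette coefficient. The K-means runs themselves are assumed to have been done already, at least two clusters are assumed per candidate, and function names are illustrative.

```python
def avg_silhouette(values, labels):
    """Average silhouette coefficient of a 1-D clustering (>= 2 clusters)."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    scores = []
    for i, c in enumerate(labels):
        own = [abs(values[i] - values[j]) for j in clusters[c] if j != i]
        if not own:          # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                       # mean intra-cluster distance
        b = min(sum(abs(values[i] - values[j]) for j in idx) / len(idx)
                for k, idx in clusters.items() if k != c)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def best_labeling(values, candidates):
    # keep the candidate clustering with the largest average silhouette
    return max(candidates, key=lambda lab: avg_silhouette(values, lab))
```

Applied to the 1-D derivative values, this picks the cluster number automatically instead of requiring it as manual input.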
An embodiment of the present invention provides a terminal device. The terminal device of this embodiment includes: a processor, a memory, and a computer program stored in the memory and executable on the processor. The processor implements the steps in each of the communication cost optimization method embodiments described above when executing the computer program. Alternatively, the processor implements the functions of the modules/units in the above device embodiments when executing the computer program.
The computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention.
The communication cost optimization device/terminal device may be a desktop computer, a notebook, a palm top computer, a cloud server, or other computing devices. The communication cost optimization device/terminal equipment may include, but is not limited to, a processor, a memory.
The Processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
The memory may be used to store the computer programs and/or modules, and the processor may implement the various functions of the communication cost optimization apparatus/terminal device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory.
The integrated modules/units of the communication cost optimization device/terminal device, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in each jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals, in accordance with legislation and patent practice.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. The communication cost optimization method based on longitudinal federal learning is characterized by comprising the following steps of:
S1, acquiring a common sample ID set of a tagged participant and an untagged participant;
S2, according to the common sample ID set, differentiating the loss function of the current decision tree to obtain a derivative result for each sample;
S3, performing K-means clustering on the common sample ID set according to the derivative results to obtain a clustering result;
S4, summing the derivative results of the samples contained in each class according to the clustering result, encrypting the sums with a homomorphic encryption algorithm to obtain ciphertexts, and sending the ciphertexts and the clustering result to the untagged participant;
S5, the untagged participant calculating a derivative approximation for the samples of each class according to the ciphertexts and the clustering result;
S6, dividing the common sample ID set according to the value of each feature of the tagged and untagged participants, and calculating, based on the divided common sample ID set and the derivative approximations of the samples in each class, the respective gain values of the tagged and untagged participants and the division information corresponding to those gain values; wherein the tagged participant obtains its gain values in plaintext form and the untagged participant obtains its gain values in ciphertext form;
S7, decrypting the gain values of the untagged participant, comparing them with the gain values of the tagged participant, taking the division information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node of the current decision tree, and the tagged and untagged participants each dividing the common sample ID set according to the division result of the new node;
S8, judging whether the generation of the current decision tree reaches a preset condition according to the common sample ID sets divided by the tagged and untagged participants respectively; stopping the generation when the preset condition is reached, calculating the weight of each leaf node of the current decision tree, recording it in the corresponding leaf node, and updating the predicted label value of each sample in the tagged participant; if the preset condition is not reached, returning to S3;
S9, judging whether the number of generated decision trees reaches a preset condition or the residual error is smaller than a given threshold according to the predicted label values of the samples in the tagged participant; stopping generation when either condition is met, obtaining the optimized XGBoost model; otherwise, returning to S2.
2. The communication cost optimization method based on longitudinal federal learning as claimed in claim 1, wherein the step S1 of obtaining the common sample ID set by a privacy protection set intersection method specifically includes the following steps:
the tagged participant owns the data {I_A(ID), f_A(id)}, and the untagged participant owns the data {I_B(ID), f_B(id)};
the tagged party and the untagged party calculate the common sample ID set T(ID) according to the PSI method, as shown in formula (1):
T(ID) = I_A(ID) ∩ I_B(ID)   (1)
wherein I_A(ID) and I_B(ID) represent the sample ID sets owned by the tagged and untagged participants respectively, and f_A(id) and f_B(id) represent the characteristic attributes corresponding to a sample id ∈ I(ID) in the tagged and untagged participants respectively.
3. The communication cost optimization method based on longitudinal federated learning according to claim 1, wherein the current decision tree in S2 is the t-th tree, whose objective function L^(t) is as shown in formula (2):

L^(t) = Σ_i l(y_i, ŷ_i^(t-1) + f_t(x_i)) + Ω(f_t)   (2)

wherein the loss function l represents the difference between the true value y_i and the predicted value ŷ_i, and Ω(f_t) represents a regularization term; the loss function is as shown in formula (3):

l(y_i, ŷ_i) = (y_i − ŷ_i)²   (3)

the derivatives of the loss function include the first derivative g_i and the second derivative h_i, as shown in formulas (4) and (5):

g_i = ∂ l(y_i, ŷ_i^(t-1)) / ∂ ŷ_i^(t-1)   (4)
h_i = ∂² l(y_i, ŷ_i^(t-1)) / ∂ (ŷ_i^(t-1))²   (5)
4. The communication cost optimization method based on longitudinal federal learning according to claim 1, wherein in S3 the K-means clustering is an adaptive K-means clustering method based on the silhouette coefficient, the specific steps being as follows:
S3.1, initializing the average silhouette coefficient threshold to -1;
S3.2, setting a loop according to the number N of samples contained in the common sample set T(ID), with the loop count set to (1, N);
S3.3, setting an exception-capture mechanism in the loop body, ending the loop when the cluster number is smaller than a set threshold;
S3.4, on each execution of the loop body, updating the cluster-number threshold of the K-means clustering algorithm to the number of times the loop body has been executed;
S3.5, in each iteration, clustering the derivative results, obtaining the average silhouette coefficient under that clustering, comparing it with the stored average silhouette coefficient, keeping the larger of the two, and recording the clustering result corresponding to the maximum silhouette coefficient;
and S3.6, after the loop ends, taking the clustering corresponding to the maximum silhouette coefficient as the final clustering of the derivative results, obtaining the training-subset ID set contained in each cluster.
5. The communication cost optimization method based on longitudinal federated learning according to claim 1, wherein in S4 a summation operation is performed on the derivative results of the samples contained in each class, the summation results are encrypted with the Paillier homomorphic encryption algorithm, and the ciphertexts and the clustering result are sent to the untagged participant.
6. The communication cost optimization method based on longitudinal federated learning according to claim 1, wherein in S5 the untagged participant calculates, from the ciphertexts and the clustering result, the derivative approximations of the samples in each class, including the first-derivative ciphertext approximation En(ĝ_i) and the second-derivative ciphertext approximation En(ĥ_i), calculated according to formulas (6) and (7):

En(ĝ_i) = En(G_j) / |T_j(ID)|, i ∈ T_j(ID)   (6)
En(ĥ_i) = En(H_j) / |T_j(ID)|, i ∈ T_j(ID)   (7)

wherein En(G_j) and En(H_j) are the first-derivative sum and second-derivative sum in ciphertext form for each class, and T_j(ID) is the training-subset ID set contained in cluster j.
7. The method of claim 1, wherein the tagged and untagged participants in S6 calculate their respective division gains according to formula (8):

Gain = (1/2) [ G_L² / (H_L + λ) + G_R² / (H_R + λ) − (G_L + G_R)² / (H_L + H_R + λ) ] − μ   (8)

wherein G_L, G_R, H_L, H_R are respectively the first-derivative sums and second-derivative sums contained in the left and right subtrees after the samples are divided based on the feature value, and λ and μ are the regularization coefficients of the regularization term Ω(f_t);
the specific calculation steps include:
S6.1, based on each feature-value division, dividing the sample ID set into a left subset I_L(ID) and a right subset I_R(ID);
S6.2, calculating the first-derivative sum G_L and the second-derivative sum H_L over the set I_L(ID), and the first-derivative sum G_R and the second-derivative sum H_R over the set I_R(ID);
and S6.3, calculating the corresponding gain value for each feature-value division according to formula (8).
8. A longitudinal federal learning-based communication cost optimization system, comprising:
the data acquisition module is used for acquiring a common sample ID set of a labeled party and an unlabeled party;
the first data calculation module is used for performing derivation on the loss function of the current decision tree according to the common sample ID set to obtain a derivative result of each sample;
the data processing module is used for carrying out K-means clustering on the common sample ID set according to the derivation result to obtain a clustering result;
the data encryption module is used for respectively carrying out summation calculation on derivative results of samples contained in each class according to clustering results, encrypting by using a homomorphic encryption algorithm to obtain a ciphertext, and sending the ciphertext and the clustering results to a non-tag participant;
the second data calculation module is used for calculating derivative approximate values of samples in each class by the non-tag participator according to the ciphertext and the clustering result;
the third data calculation module is used for dividing the common sample ID set according to the value of each characteristic of the labeled party and the unlabeled party, and calculating respective gain values of the labeled party and the unlabeled party and division information corresponding to the gain values based on the divided common sample ID set and derivative approximate values of samples in each class; wherein, the participant with the tag calculates to obtain a gain value in a plaintext form, and the participant without the tag calculates to obtain a gain value in a ciphertext form;
the data decryption module is used for decrypting the gain value of the non-labeled party, comparing the gain value with the gain value of the labeled party, taking the division information corresponding to the maximum gain value as the current split point of the current decision tree, generating a new node of the current decision tree, and dividing the common sample ID set by the labeled party and the non-labeled party according to the division result of the new node;
the first circulation module is used for judging whether the generation of the current decision tree reaches a preset condition according to the common sample ID sets divided by the tagged and untagged participants respectively, stopping the generation when the preset condition is reached, calculating the weight of each leaf node of the current decision tree, recording it in the corresponding leaf node, and updating the predicted label value of each sample in the tagged participant; when the preset condition is not reached, returning to the data processing module;
the second circulation module is used for judging whether the number of generated decision trees reaches a preset condition or the residual error is smaller than a given threshold according to the predicted label values of the samples in the tagged participant, stopping generation when either condition is met to obtain the optimized XGBoost model; otherwise, returning to the first data calculation module.
9. An apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211008707.8A CN115481415A (en) | 2022-08-22 | 2022-08-22 | Communication cost optimization method, system, device and medium based on longitudinal federal learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115481415A true CN115481415A (en) | 2022-12-16 |
Family
ID=84422905
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211008707.8A Pending CN115481415A (en) | 2022-08-22 | 2022-08-22 | Communication cost optimization method, system, device and medium based on longitudinal federal learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115481415A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117675411A (en) * | 2024-01-31 | 2024-03-08 | 智慧眼科技股份有限公司 | Global model acquisition method and system based on longitudinal XGBoost algorithm |
CN117675411B (en) * | 2024-01-31 | 2024-04-26 | 智慧眼科技股份有限公司 | Global model acquisition method and system based on longitudinal XGBoost algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021249086A1 (en) | Multi-party joint decision tree construction method, device and readable storage medium | |
Zhang et al. | PIC: Enable large-scale privacy preserving content-based image search on cloud | |
Liu et al. | Cloud-enabled privacy-preserving collaborative learning for mobile sensing | |
Enthoven et al. | An overview of federated deep learning privacy attacks and defensive strategies | |
Liu et al. | Intelligent and secure content-based image retrieval for mobile users | |
CN109615021B (en) | Privacy information protection method based on k-means clustering | |
CN113420232B (en) | Privacy protection-oriented federated recommendation method for neural network of graph | |
CN113128701A (en) | Sample sparsity-oriented federal learning method and system | |
AU2016218947A1 (en) | Learning from distributed data | |
CN115660050A (en) | Robust federated learning method with efficient privacy protection | |
Yang et al. | Gradient leakage attacks in federated learning: Research frontiers, taxonomy and future directions | |
Qin et al. | Privacy-preserving outsourcing of image global feature detection | |
CN109086830B (en) | Typical correlation analysis near-duplicate video detection method based on sample punishment | |
Shao et al. | Federated generalized face presentation attack detection | |
Cheng et al. | SecureAD: A secure video anomaly detection framework on convolutional neural network in edge computing environment | |
Yang et al. | Practical feature inference attack in vertical federated learning during prediction in artificial Internet of Things | |
CN114417388B (en) | Power load prediction method, system, equipment and medium based on longitudinal federal learning | |
CN115481415A (en) | Communication cost optimization method, system, device and medium based on longitudinal federal learning | |
Ranbaduge et al. | Differentially private vertical federated learning | |
Chen et al. | Sparse general non-negative matrix factorization based on left semi-tensor product | |
CN118116554A (en) | Medical image caching processing method based on big data processing | |
CN117391816A (en) | Heterogeneous graph neural network recommendation method, device and equipment | |
CN116383470B (en) | Image searching method with privacy protection function | |
CN116467415A (en) | Bidirectional cross-domain session recommendation method based on GCNsformer hybrid network and multi-channel semantics | |
Hong et al. | Augmented Rotation‐Based Transformation for Privacy‐Preserving Data Clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||