CN116341004A - Vertical federated learning privacy leakage detection method based on feature embedding analysis - Google Patents
Vertical federated learning privacy leakage detection method based on feature embedding analysis
- Publication number
- CN116341004A (Application CN202310304542.7A; granted as CN116341004B)
- Authority
- CN
- China
- Prior art keywords
- data
- model
- shadow
- embedding
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N20/00—Machine learning
- Y02T10/40—Engine management systems
Abstract
The invention provides a vertical federated learning privacy leakage detection method based on feature embedding analysis. The method comprises the following steps: an inspector embeds shadow data into the training process of vertical federated learning; feature embeddings of the shadow data and of the private training data of a target participant of the vertical federated learning are acquired from the bottom model, and the feature embeddings are smoothed; a proxy model of the bottom model is cloned using the shadow data and the smoothed feature embeddings of the shadow data; and the proxy model is used to reconstruct the private training data of the target participant through feature embedding matching, thereby performing raw data leakage detection on the vertical federated learning. The method simultaneously realizes vulnerability analysis of model leakage, raw data leakage, and data attribute leakage without damaging the utility of the VFL.
Description
Technical Field
The invention relates to the technical field of network security and privacy, and in particular to a vertical federated learning privacy leakage detection method based on feature embedding analysis.
Background
FL (Federated Learning) has become a promising privacy-friendly machine learning mechanism: participants periodically exchange intermediate results rather than explicitly sharing training data, so that model training can still converge. Participants in VFL (Vertical Federated Learning) hold the same set of training samples but different feature subsets, i.e., vertically partitioned training data. In practice, VFL is suited to fusing knowledge from heterogeneous and confidential feature sources held by potentially competing companies to drive powerful predictive analytics. For example, an insurance company may wish to combine the loan credit of a principal with the banking records provided by different financial institutions to predict the future financial risk of that principal.
In a VFL system, the local participants share the same sample space but partition the feature space of the data, while the server holds the labels of the training data. Each local participant hosts its own bottom model for feature extraction and transmits the resulting feature embeddings to the server. The server trains a top model that takes as input the concatenation of the feature embeddings uploaded by the different participants. The feature embedding of a data instance is a compressed representation of a private training instance and can therefore serve as an information source for estimating the target training instance. According to the invention, an inspector can clone the bottom model and infer the original training data and data attributes without interfering with the utility of the VFL; privacy leakage in the VFL scenario can be detected using only a small amount of shadow data and an analysis of the intermediate results (i.e., the feature embeddings of the local data) submitted by a local participant to the server.
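To make this interaction concrete, the following is a minimal sketch of one VFL training step under the split architecture described above, written in PyTorch; the two-participant setup, the BottomModel/TopModel definitions, and all dimensions are illustrative assumptions rather than details prescribed by the invention:

```python
import torch
import torch.nn as nn

class BottomModel(nn.Module):
    """Hosted by a local participant; maps its vertical feature slice to an embedding."""
    def __init__(self, in_dim, emb_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))

    def forward(self, x):
        return self.net(x)  # feature embedding transmitted to the server

class TopModel(nn.Module):
    """Hosted by the label-holding server; stitches the uploaded embeddings."""
    def __init__(self, emb_dim, n_parties, n_classes):
        super().__init__()
        self.head = nn.Linear(emb_dim * n_parties, n_classes)

    def forward(self, embeddings):
        return self.head(torch.cat(embeddings, dim=1))

# One training step with two participants holding 10 and 8 features of the same samples.
bottom_a, bottom_b = BottomModel(10, 16), BottomModel(8, 16)
top = TopModel(emb_dim=16, n_parties=2, n_classes=2)
xa, xb = torch.randn(32, 10), torch.randn(32, 8)   # vertically partitioned batch
y = torch.randint(0, 2, (32,))                      # labels held by the server
loss = nn.functional.cross_entropy(top([bottom_a(xa), bottom_b(xb)]), y)
loss.backward()  # embedding gradients flow back to each participant's bottom model
```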
Although VFL is designed to protect privacy, many works have demonstrated that VFL still carries various risks of privacy disclosure. Previous work proposed that a malicious server infer the training data of a local participant by actively manipulating the feature-embedding gradients sent to the target participant, i.e., by hijacking the VFL training process. Such a feature hijacking attack replaces the normal bottom model hosted by the target participant with a carefully crafted one, which facilitates reconstruction of the proprietary training data. However, feature hijacking attacks cause a huge utility loss for the VFL-trained classifier and are therefore unsuitable for real-world scenarios.
In one prior-art VFL privacy leakage detection scheme, the server infers the training data of the target participant by actively manipulating the feature-embedding gradients sent to that participant, i.e., by hijacking the VFL training process.

Step 1: the server selects a new learning task to replace the original learning task chosen by the clients.

Step 2: the server uses its control over the clients' training process to hijack a client's bottom model and steer it toward a chosen target feature space.

Step 3: the server uses the hijacked target feature space to reversely recover the private training instances.

The drawback of this prior-art VFL privacy leakage detection scheme is that it causes a significant utility loss for the VFL-trained classifier, which limits its use in real environments.
Disclosure of Invention
The embodiment of the invention provides a vertical federated learning privacy leakage detection method based on feature embedding analysis, which effectively preserves the utility of the VFL.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A vertical federated learning privacy leakage detection method based on feature embedding analysis comprises the following steps:

an inspector embeds shadow data into the training process of vertical federated learning;

feature embeddings of the shadow data and of the private training data of a target participant of the vertical federated learning are acquired from the bottom model, and the feature embeddings are smoothed;

a proxy model of the bottom model is cloned using the shadow data and the smoothed feature embeddings of the shadow data;

and the proxy model is used to reconstruct the private training data of the target participant through feature embedding matching, thereby performing raw data leakage detection on the vertical federated learning.
Preferably, the inspector embedding shadow data into the training process of vertical federated learning comprises:

the server acts as the inspector; the inspector selects shadow users and registers them into the training process of vertical federated learning, where the original attributes of the shadow users' shadow data follow the same distribution as the attributes of the private training data of the target participants, so that the shadow data participates in the training process of the vertical federated learning.
Preferably, acquiring, from the bottom model, feature embeddings of the shadow data and of the private training data of a target participant of the vertical federated learning, and smoothing the feature embeddings, comprises:

during the training process of vertical federated learning, the inspector records the embedded shadow data, together with the feature embeddings of the shadow data and of the target participant's private training data on the bottom model, and smooths the feature embeddings over T consecutive rounds;
the smoothing mechanism for feature embedding is as follows: let f B For the bottom model of the target participant, assuming that the detector initiates detection at the t-th round, recording private training data x of the target participant v Continuous T on bottom modelFeature embedding at timeRecording shadow data x s, and xs Feature embedding at successive T times on a bottom modelFor-> and />Respectively smoothing to obtain feature embedded data +.> and />
Preferably, cloning a proxy model of the bottom model using the shadow data and the smoothed feature embeddings of the shadow data comprises:
the inspector learns a proxy model $\hat{f}_B$ from the recorded $(x_s, \bar{h}_s)$ mapping pairs, using $\hat{f}_B$ to approximate the mapping of the bottom model $f_B$ between the original feature space and the embedding space;

the learning process of the proxy model $\hat{f}_B$ minimizes the $\ell_2$ distance between the feature embedding generated by the proxy model, $\hat{f}_B(x_s; \omega_B)$, and the smoothed feature embedding generated by the true bottom model, $\bar{h}_s$:

$$\hat{\omega}_B = \arg\min_{\omega_B} \sum_{x_s} \big\| \hat{f}_B(x_s; \omega_B) - \bar{h}_s \big\|_2^2,$$

where $\omega_B$ denotes the model parameters of the proxy model $\hat{f}_B$ and $\hat{f}_B(x_s; \omega_B)$ denotes the output of the shadow data $x_s$ on the proxy model.
Preferably, using the proxy model to reconstruct the private training data of the target participant through feature embedding matching and performing raw data leakage detection on the vertical federated learning comprises:
the proxy model reconstructs the private training data of the target participant through feature embedding matching, and a generator $f_G$ is introduced to solve the regression problem of the feature embedding matching. The generator $f_G$ is a decoder model that takes random noise $x_n$ as input and outputs a reconstructed image $f_G(x_n)$. The reconstructed image $f_G(x_n)$ is fed into the proxy model to obtain its feature embedding $\hat{f}_B(f_G(x_n))$. The inspector finds the optimal parameters of the generator $f_G$ by minimizing the $\ell_2$ distance between the feature embedding of the reconstructed image provided by the proxy model and the true feature embedding $\bar{h}_v$ of the target image corresponding to the private training data, so that the reconstructed image $f_G(x_n)$ produces an embedding as close as possible to that of the target image. The optimization is:

$$\omega_G^* = \arg\min_{\omega_G} L_R\big(\hat{f}_B(f_G(x_n; \omega_G)),\, \bar{h}_v\big),$$

where $L_R(\cdot)$ is the embedding matching loss function, based on the mean square error;

after the optimal parameters of the generator $f_G$ are obtained, the degree to which vertical federated learning protects the raw data is detected by comparing the similarity between the reconstructed image $f_G(x_n)$ produced by the generator and the private training data $x_v$ of the target participant.
Preferably, the method further comprises:

the inspector takes the discrete attribute values of the shadow data as class labels in a classification task, uses the averaged feature embeddings of the shadow data as inputs to train an attribute decoder, and uses the attribute decoder to infer the attributes of the private training data of the target participant;
the attribute inference model for the private training data is defined as a multi-class classifier $f_C$, with one class label per unique category of each attribute. Let $y^P := [y_1, y_2, \ldots, y_P]$ denote the $P$ attributes corresponding to the training data. The inspector uses the recorded feature embedding/attribute value pairs $(\bar{h}_s, y_s^P)$ to train the multi-class classifier $f_C$, which produces $P$ class outputs. The objective of the optimization problem for $f_C$ is to minimize its empirical classification loss on the collected shadow data:

$$\omega_C^* = \arg\min_{\omega_C} \sum_{x_s} L_C\big(f_C(\bar{h}_s; \omega_C),\, y_s^P\big),$$

where $L_C$ is the cross-entropy loss, $f_C(\bar{h}_s; \omega_C)$ is the output of the classifier on the averaged feature embedding of the shadow data, and $y_s^P$ is the attribute label corresponding to the shadow data $x_s$;

once the multi-class classifier $f_C$ is trained, the inspector predicts the attributes of $x_v$ from the smoothed embedding $\bar{h}_v$ of the target participant's private training data, i.e., $\hat{y}_v^P = f_C(\bar{h}_v)$;

during training, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_s$ of the shadow data as input and the attribute label $y_s^P$ of the shadow data $x_s$ as the target output; during inference, $f_C$ takes the smoothed feature embedding $\bar{h}_v$ of the target participant's private training data as input and outputs the predicted attribute label $\hat{y}_v^P$ of the private training data $x_v$. The protection capability of the vertical federated learning model over user data is detected by comparing the predicted attribute label $\hat{y}_v^P$ with the true attributes $y_v^P$ of the private training data $x_v$.
According to the technical scheme provided by the embodiment of the invention, a vertical federated learning privacy leakage detection method based on feature embedding analysis is provided. Existing detection methods based on feature hijacking do not cover model leakage analysis and destroy the utility of the VFL. The invention simultaneously realizes vulnerability analysis of model leakage, raw data leakage, and data attribute leakage without damaging the utility of the VFL.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of the implementation of the vertical federated learning privacy leakage detection method based on feature embedding analysis according to an embodiment of the present invention;

Fig. 2 is a process flow diagram of the vertical federated learning privacy leakage detection method based on feature embedding analysis according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the purpose of facilitating an understanding of the embodiments of the invention, reference will now be made to several specific embodiments illustrated in the accompanying drawings; these should in no way be taken to limit the embodiments of the invention.
The embodiment of the invention designs a vertical federated learning privacy leakage detection method based on feature embedding analysis. The method can perform bottom model leakage detection, raw data leakage detection, and data attribute leakage detection. The inspector may enlist a small number of shadow users and register their data into the VFL training process before model training. Following the standard setup of existing privacy leakage detection, we assume that the original data attributes of the shadow users follow the same distribution as the attributes of the private training data. During training, the inspector records the complete shadow data, the data features, and the feature embeddings of all training data for subsequent privacy leakage analysis.
The first step in implementing privacy leakage detection is to apply a smoothing enhancement technique to the received feature embeddings, suppressing the embedding fluctuations caused by updates of the bottom model during training. Based on the collected original shadow data and the corresponding smoothed feature embeddings, a proxy model can be trained to approximate the transformation between original data and feature embeddings, thereby cloning the bottom model. With the cloned proxy model, the inspector can further match the feature embeddings of the target participant's private training data on the bottom model against the feature embeddings of reconstructed data in the proxy model, and optimize the reconstructed data to approach the true target data. In addition, the inspector can use the discrete attribute values and the corresponding average feature embeddings of the shadow data as class labels and features in a classification task to train an attribute decoder, and then infer the attributes of the target data using this attribute decoder.
The feature embeddings of the data and the corresponding gradients are the only information exchanged between the local participants and the server during co-training in a VFL system. A server may infer the training data of a local participant by actively manipulating the feature-embedding gradients sent to the target participant, in order to assess the vulnerability to data leakage in the VFL system. However, this approach forges the gradients returned to the target participant to force the participant-generated feature embeddings to converge to the feature space desired by the attacker, which compromises the utility of the VFL.
Fig. 1 is a schematic diagram of the implementation of the vertical federated learning privacy leakage detection method based on feature embedding analysis provided by the embodiment of the invention. The specific processing flow, shown in Fig. 2, comprises the following steps:

Step S1: the inspector embeds shadow data into the VFL training process, so that the shadow data participates in VFL training.

In the invention, the inspector may be an honest-but-curious server.

Step S2: record the shadow data, the features of the shadow data, and the feature embeddings of all data on the bottom model, and smooth the feature embeddings over T consecutive rounds.

Step S3: clone a proxy model of the bottom model using the shadow data and the smoothed feature embeddings.

Step S4: reconstruct the private training data based on the cloned proxy model and the smoothed feature embeddings of the private training data.

Step S5: train an attribute decoder using the features of the shadow data and the corresponding smoothed feature embeddings, and apply it to the feature embeddings of the private training data to infer their sensitive attributes.
Specifically, step S1 comprises: the inspector may employ a small number of shadow users and register them into the VFL training process. It is assumed that the original attributes of the shadow users' data follow the same distribution as the attributes of the private training data of the target participants in the VFL. The attributes of the shadow users are provided to the local feature set owned by each local participant before VFL training begins.
Specifically, step S2 comprises: recording the shadow data and the feature embeddings, and smoothing the feature embeddings.

During VFL training, the inspector records the embedded shadow data and the feature embeddings of all data on the bottom model, and smooths the feature embeddings over $T$ consecutive rounds. The specific smoothing mechanism is as follows: let $f_B$ be the bottom model of the target participant, and assume the inspector initiates detection at round $t$. The feature embeddings of the target participant's private training data $x_v$ on the bottom model over $T$ consecutive rounds, $\{f_B^i(x_v)\}_{i=t}^{t+T-1}$, and the feature embeddings of the shadow data $x_s$ over the same rounds, $\{f_B^i(x_s)\}_{i=t}^{t+T-1}$, are smoothed separately (by averaging over the window) to obtain the smoothed feature embeddings

$$\bar{h}_v = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_v), \qquad \bar{h}_s = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_s).$$
The smoothed feature embeddings help suppress the embedding fluctuations caused by updates to the bottom model during training, which stabilizes the results of the subsequent privacy leakage detection.
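As a concrete illustration of this recording-and-smoothing step, the following is a minimal sketch, assuming PyTorch tensors and the window average as the smoothing operator; the function and variable names are illustrative:

```python
import torch

def record_embedding(bottom_model, x, snapshots):
    """Append the current-round embedding f_B^i(x) to the recorded history."""
    with torch.no_grad():
        snapshots.append(bottom_model(x).detach().clone())

def smooth_embeddings(snapshots):
    """Average the T recorded embeddings into the smoothed embedding h-bar."""
    return torch.stack(snapshots, dim=0).mean(dim=0)

# Inside the VFL training loop, for rounds i = t, ..., t+T-1:
#     record_embedding(bottom_model, x_s, shadow_snaps)
#     record_embedding(bottom_model, x_v, target_snaps)
# h_bar_s = smooth_embeddings(shadow_snaps)   # smoothed shadow embedding
# h_bar_v = smooth_embeddings(target_snaps)   # smoothed target embedding
```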
Specifically, step S3 comprises: analyzing the leakage vulnerability of the bottom model via a model stealing attack against it.
Based on the shadow data $x_s$ recorded in step S2 and the corresponding smoothed feature embeddings $\bar{h}_s$, the inspector can train a proxy model to approximate the transformation from training data to feature embeddings, thereby cloning the bottom model. The inspector learns a proxy model $\hat{f}_B$ from the recorded $(x_s, \bar{h}_s)$ mapping pairs, using $\hat{f}_B$ to approximate the mapping of the bottom model $f_B$ between the original feature space and the embedding space.

The learning process of the proxy model $\hat{f}_B$ minimizes the $\ell_2$ distance between the feature embedding generated by the proxy model, $\hat{f}_B(x_s; \omega_B)$, and the smoothed feature embedding generated by the true bottom model, $\bar{h}_s$:

$$\hat{\omega}_B = \arg\min_{\omega_B} \sum_{x_s} \big\| \hat{f}_B(x_s; \omega_B) - \bar{h}_s \big\|_2^2,$$

where $\omega_B$ denotes the model parameters of the proxy model $\hat{f}_B$ and $\hat{f}_B(x_s; \omega_B)$ denotes the output of the shadow data $x_s$ on the proxy model.
By optimizing the above equation, the learned proxy model $\hat{f}_B$ can, given the original features of a data instance, generate a feature embedding approximately identical to that of the true bottom model $f_B$.
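A minimal sketch of this cloning step is given below, assuming PyTorch; the proxy architecture, epoch count, and learning rate are illustrative choices, not prescribed by the invention:

```python
import torch
import torch.nn as nn

def clone_bottom_model(x_shadow, h_bar_shadow, emb_dim, epochs=200, lr=1e-3):
    """Fit a proxy model to (shadow input, smoothed embedding) pairs by l2 matching."""
    proxy = nn.Sequential(
        nn.Linear(x_shadow.shape[1], 128), nn.ReLU(),
        nn.Linear(128, emb_dim),
    )
    opt = torch.optim.Adam(proxy.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # squared l2 distance between proxy embeddings and recorded smoothed embeddings
        loss = ((proxy(x_shadow) - h_bar_shadow) ** 2).sum(dim=1).mean()
        loss.backward()
        opt.step()
    return proxy

# usage: proxy = clone_bottom_model(x_shadow, h_bar_s, emb_dim=h_bar_s.shape[1])
```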
Specifically, step S4 comprises: analyzing the vulnerability to data leakage via a data reconstruction attack.

Based on the proxy model learned in step S3, the inspector can further recover the private training data of the target participant through feature embedding matching. This embedding matching process can be expressed as a regression problem: the goal is to find an estimate of the true raw attribute values whose feature embedding best matches the embedding generated by the true attributes.
To recover image data, the invention introduces a generator $f_G$ to help solve this regression problem. The generator $f_G$ is a decoder model that takes random noise $x_n$ as input and outputs a reconstructed image $f_G(x_n)$. An image typically contains hundreds or thousands of pixels, and directly estimating each pixel value rarely yields stable reconstructions, because the resulting high-dimensional regression task suffers from the curse of dimensionality. The reconstructed image is fed into the proxy model to obtain its feature embedding $\hat{f}_B(f_G(x_n))$. The inspector finds the optimal generator parameters $\omega_G$ by minimizing the $\ell_2$ distance between the feature embedding of the reconstructed image provided by the proxy model and the true smoothed feature embedding $\bar{h}_v$ of the target image, so that the reconstructed image $f_G(x_n)$ produces an embedding as close as possible to that of the target image. For image data, a total variation (TV) regularizer $R_{tv}(\cdot)$, with weight $\lambda$, is further added to improve the smoothness of the reconstructed image. The optimization is:

$$\omega_G^* = \arg\min_{\omega_G} L_R\big(\hat{f}_B(f_G(x_n; \omega_G)),\, \bar{h}_v\big) + \lambda\, R_{tv}\big(f_G(x_n; \omega_G)\big),$$

where $L_R(\cdot)$ is the embedding matching loss function, i.e., a loss based on the mean square error (MSE), and $\omega_G$ is treated as the optimization variable. The rationale is that if the proxy model accurately approximates the feature embedding transformation of the bottom model, then feature embedding matching drives the estimate $f_G(x_n)$ toward the true $x_v$. For reconstructing the numerical features of tabular data, $R_{tv}(\cdot)$ is dropped, since numerical attributes need not obey the smoothness constraint that holds for images. Furthermore, since the number of numerical features is typically far smaller than the number of pixels in an image, the attribute vector $x_v$ can be estimated directly, without introducing the generator module $f_G$, and the optimization reduces to:

$$\hat{x}_v = \arg\min_{x} L_R\big(\hat{f}_B(x),\, \bar{h}_v\big).$$
After the optimal parameters of the generator $f_G$ are obtained, the degree to which vertical federated learning protects the raw data is detected by comparing the similarity between the reconstructed image $f_G(x_n)$ produced by the generator and the private training data $x_v$ of the target participant.
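The following is a minimal sketch of the image-reconstruction variant, assuming PyTorch, a frozen proxy network that accepts image input, and an externally supplied decoder-style generator; the TV weight and step counts are illustrative:

```python
import torch
import torch.nn as nn

def tv_regularizer(img):
    """Total variation of an NCHW image batch; encourages smooth reconstructions."""
    return ((img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
            + (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean())

def reconstruct_image(proxy, h_bar_v, generator, noise, steps=500, lam=1e-2, lr=1e-3):
    """Optimize generator parameters so the proxy embedding of f_G(x_n) matches h_bar_v."""
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    mse = nn.MSELoss()  # embedding matching loss L_R
    for _ in range(steps):
        opt.zero_grad()
        img = generator(noise)                                  # f_G(x_n)
        loss = mse(proxy(img), h_bar_v) + lam * tv_regularizer(img)
        loss.backward()
        opt.step()  # only the generator is updated; the proxy stays fixed
    return generator(noise).detach()  # estimate of the private instance x_v
```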
Specifically, step S5 comprises: analyzing the vulnerability to data attribute leakage via an attribute inference attack.
Given the target attributes of the shadow training data, the discrete attribute values of the shadow data are used as class labels in a classification task. The attribute inference model is defined as a multi-class classifier $f_C$, with one class label per unique category of each attribute. Let $y^P := [y_1, y_2, \ldots, y_P]$ denote the $P$ attributes corresponding to the training data. The inspector trains the multi-class classifier $f_C$ on the recorded feature embedding/attribute value pairs $(\bar{h}_s, y_s^P)$, where $f_C(\cdot)$ produces $P$ class outputs. The objective of this optimization problem is to minimize the empirical classification loss of the classifier $f_C$ on the collected shadow data:

$$\omega_C^* = \arg\min_{\omega_C} \sum_{x_s} L_C\big(f_C(\bar{h}_s; \omega_C),\, y_s^P\big),$$

where $L_C$ is the cross-entropy loss, $f_C(\bar{h}_s; \omega_C)$ is the output of the classifier on the averaged feature embedding of the shadow data, and $y_s^P$ is the attribute label corresponding to the shadow data $x_s$.

Once the multi-class classifier $f_C$ is trained, the inspector predicts the attributes of $x_v$ from the smoothed embedding $\bar{h}_v$ of the target participant's private training data, i.e., $\hat{y}_v^P = f_C(\bar{h}_v)$.

During training, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_s$ of the shadow data as input and the attribute label $y_s^P$ of the shadow data $x_s$ as the target output; during inference, $f_C$ takes the smoothed feature embedding $\bar{h}_v$ of the target participant's private training data as input and outputs the predicted attribute label $\hat{y}_v^P$ of the private training data $x_v$. The protection capability of the vertical federated learning model over user data is detected by comparing the predicted attribute label $\hat{y}_v^P$ with the true attributes $y_v^P$ of the private training data $x_v$.
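A minimal sketch of the attribute decoder is shown below, assuming PyTorch and a single discrete attribute (the $P$-attribute case repeats this per attribute or uses $P$ output heads); the classifier architecture and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

def train_attribute_decoder(h_bar_shadow, y_shadow, n_classes, epochs=100, lr=1e-3):
    """Train f_C on (smoothed shadow embedding, attribute label) pairs via cross-entropy."""
    f_c = nn.Sequential(
        nn.Linear(h_bar_shadow.shape[1], 64), nn.ReLU(),
        nn.Linear(64, n_classes),
    )
    opt = torch.optim.Adam(f_c.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()  # L_C
    for _ in range(epochs):
        opt.zero_grad()
        ce(f_c(h_bar_shadow), y_shadow).backward()
        opt.step()
    return f_c

# Inference on the target participant's smoothed embedding h_bar_v:
# f_c = train_attribute_decoder(h_bar_s, y_s, n_classes)
# y_hat_v = f_c(h_bar_v).argmax(dim=1)   # predicted attribute of the private data
```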
The effectiveness of the vertical federated learning privacy leakage detection method based on feature embedding analysis provided by the embodiment of the invention was evaluated on three models of different complexity (FCNN, LeNet, and ResNet) and five datasets (Bank Marketing, Credit, Census, UTKFace, CelebA). Experimental results show that, compared with existing methods, the method achieves privacy leakage analysis of the model, the raw data, and the data features while leaving the VFL utility intact.
This shows that the intermediate results (feature embeddings) uploaded by participants still carry important information about the training data, and that with a properly designed method, high-fidelity original images can be reconstructed stably and effectively even without auxiliary data or a complex recovery procedure. We hope this work motivates a rethinking of the role of VFL in model and data privacy protection, and further strengthens the design and development of existing privacy-preserving frameworks.
In summary, the vertical federated learning privacy leakage detection method based on feature embedding analysis provided by the embodiment of the invention does not interfere with the VFL training process, thereby preserving the utility of the VFL; moreover, it simultaneously realizes privacy leakage analysis of the model, the raw data, and the data features.
The method effectively realizes comprehensive detection and analysis of privacy leakage vulnerabilities in the VFL without negatively affecting VFL utility, and the proposed smoothing strategy effectively resists noise interference during training.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
In this specification, each embodiment is described in a progressive manner; identical or similar parts of the embodiments refer to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of a given embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (6)
1. A vertical federated learning privacy leakage detection method based on feature embedding analysis, characterized by comprising:
embedding, by an inspector, shadow data into the training process of vertical federated learning;

acquiring, from the bottom model, feature embeddings of the shadow data and of the private training data of a target participant of the vertical federated learning, and smoothing the feature embeddings;

cloning a proxy model of the bottom model using the shadow data and the smoothed feature embeddings of the shadow data;

and reconstructing the private training data of the target participant through feature embedding matching using the proxy model, thereby performing raw data leakage detection on the vertical federated learning.
2. The method of claim 1, wherein embedding, by the inspector, shadow data into the training process of vertical federated learning comprises:

the server acting as the inspector, the inspector selecting shadow users and registering them into the training process of vertical federated learning, wherein the original attributes of the shadow users' shadow data follow the same distribution as the attributes of the private training data of the target participants, so that the shadow data participates in the training process of the vertical federated learning.
3. The method according to claim 1 or 2, wherein acquiring, from the bottom model, feature embeddings of the shadow data and of the private training data of a target participant of the vertical federated learning, and smoothing the feature embeddings, comprises:

during the training process of vertical federated learning, the inspector recording the embedded shadow data, together with the feature embeddings of the shadow data and of the target participant's private training data on the bottom model, and smoothing the feature embeddings over $T$ consecutive rounds;
the smoothing mechanism for feature embedding is as follows: order the fB For the bottom model of the target participant, assume that the detector is at the first t Wheel initiated detection, recording private training data of target participants x v Feature embedding at successive T times on a bottom modelRecording shadow data x s, and xs Feature embedding at successive T times on a bottom modelFor-> and />Respectively smoothing to obtain feature embedded data +.> and />
4. The method of claim 3, wherein cloning a proxy model of the bottom model using the shadow data and the smoothed feature embeddings of the shadow data comprises:
the inspector learning a proxy model $\hat{f}_B$ from the recorded $(x_s, \bar{h}_s)$ mapping pairs, using $\hat{f}_B$ to approximate the mapping of the bottom model $f_B$ between the original feature space and the embedding space;

the learning process of the proxy model $\hat{f}_B$ minimizing the $\ell_2$ distance between the feature embedding generated by the proxy model, $\hat{f}_B(x_s; \omega_B)$, and the smoothed feature embedding generated by the true bottom model, $\bar{h}_s$:

$$\hat{\omega}_B = \arg\min_{\omega_B} \sum_{x_s} \big\| \hat{f}_B(x_s; \omega_B) - \bar{h}_s \big\|_2^2,$$

where $\omega_B$ denotes the model parameters of the proxy model $\hat{f}_B$.
5. The method of claim 4, wherein reconstructing the private training data of the target participant through feature embedding matching using the proxy model and performing raw data leakage detection on the vertical federated learning comprises:
the proxy model reconstructing the private training data of the target participant through feature embedding matching, a generator $f_G$ being introduced to solve the regression problem of the feature embedding matching; the generator $f_G$ being a decoder model that takes random noise $x_n$ as input and outputs a reconstructed image $f_G(x_n)$; the reconstructed image $f_G(x_n)$ being fed into the proxy model to obtain its feature embedding $\hat{f}_B(f_G(x_n))$; the inspector finding the optimal parameters of the generator $f_G$ by minimizing the $\ell_2$ distance between the feature embedding of the reconstructed image provided by the proxy model and the true feature embedding $\bar{h}_v$ of the target image corresponding to the private training data, so that the reconstructed image $f_G(x_n)$ produces an embedding as close as possible to that of the target image; the optimization being:

$$\omega_G^* = \arg\min_{\omega_G} L_R\big(\hat{f}_B(f_G(x_n; \omega_G)),\, \bar{h}_v\big),$$

where $L_R(\cdot)$ is the embedding matching loss function, based on the mean square error;

after the optimal parameters of the generator $f_G$ are obtained, the degree to which vertical federated learning protects the raw data being detected by comparing the similarity between the reconstructed image $f_G(x_n)$ produced by the generator and the private training data $x_v$ of the target participant.
6. The method of claim 3, further comprising:

the inspector taking the discrete attribute values of the shadow data as class labels in a classification task, using the averaged feature embeddings of the shadow data as inputs to train an attribute decoder, and using the attribute decoder to infer the attributes of the private training data of the target participant;
the attribute inference model for the private training data being defined as a multi-class classifier $f_C$, with one class label per unique category of each attribute; letting $y^P := [y_1, y_2, \ldots, y_P]$ denote the $P$ attributes corresponding to the training data, the inspector training the multi-class classifier $f_C$ on the recorded feature embedding/attribute value pairs $(\bar{h}_s, y_s^P)$, where $f_C(\cdot)$ produces $P$ class outputs, the objective of the optimization problem for $f_C$ being to minimize its empirical classification loss on the collected shadow data:

$$\omega_C^* = \arg\min_{\omega_C} \sum_{x_s} L_C\big(f_C(\bar{h}_s; \omega_C),\, y_s^P\big),$$

where $L_C$ is the cross-entropy loss, $f_C(\bar{h}_s; \omega_C)$ is the output of the classifier on the averaged feature embedding of the shadow data, and $y_s^P$ is the attribute label corresponding to the shadow data $x_s$;

once the multi-class classifier $f_C$ is trained, the inspector predicting the attributes of $x_v$ from the smoothed embedding $\bar{h}_v$ of the target participant's private training data, i.e., $\hat{y}_v^P = f_C(\bar{h}_v)$;

wherein, during training, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_s$ of the shadow data as input and the attribute label $y_s^P$ of the shadow data $x_s$ as the target output; during inference, $f_C$ takes the smoothed feature embedding $\bar{h}_v$ of the target participant's private training data as input and outputs the predicted attribute label $\hat{y}_v^P$ of the private training data $x_v$; and the protection capability of the vertical federated learning model over user data is detected by comparing the predicted attribute label $\hat{y}_v^P$ with the true attributes $y_v^P$ of the private training data $x_v$.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310304542.7A (CN116341004B) | 2023-03-27 | 2023-03-27 | Vertical federated learning privacy leakage detection method based on feature embedding analysis |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310304542.7A (CN116341004B) | 2023-03-27 | 2023-03-27 | Vertical federated learning privacy leakage detection method based on feature embedding analysis |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN116341004A | 2023-06-27 |
| CN116341004B | 2023-09-08 |
Family
ID=86881870
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310304542.7A (granted as CN116341004B, Active) | Vertical federated learning privacy leakage detection method based on feature embedding analysis | 2023-03-27 | 2023-03-27 |
Country Status (1)

| Country | Link |
|---|---|
| CN | CN116341004B |
Patent Citations (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220222539A1 | 2021-01-12 | 2022-07-14 | SAP SE | Adversarial learning of privacy preserving representations |
| CN113094758A | 2021-06-08 | 2021-07-09 | Huazhong University of Science and Technology | Gradient-perturbation-based federated learning data privacy protection method and system |
| CN114936372A | 2022-04-06 | 2022-08-23 | Xiangtan University | Model protection method based on three-party homomorphic encryption for vertical federated learning |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117592042A | 2024-01-17 | 2024-02-23 | Hangzhou Hikvision Digital Technology Co., Ltd. | Privacy leakage detection method and device for a federated recommendation system |
| CN117592042B | 2024-01-17 | 2024-04-05 | Hangzhou Hikvision Digital Technology Co., Ltd. | Privacy leakage detection method and device for a federated recommendation system |
Also Published As

| Publication Number | Publication Date |
|---|---|
| CN116341004B | 2023-09-08 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |