CN116341004A - Longitudinal federal learning privacy leakage detection method based on feature embedding analysis

Longitudinal federal learning privacy leakage detection method based on feature embedding analysis

Info

Publication number
CN116341004A
CN116341004A
Authority
CN
China
Prior art keywords
data
model
shadow
embedding
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310304542.7A
Other languages
Chinese (zh)
Other versions
CN116341004B (en)
Inventor
王伟
许向蕊
管晓宏
沈超
刘鹏睿
吕晓婷
郝玉蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Xian Jiaotong University
Original Assignee
Beijing Jiaotong University
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University, Xian Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202310304542.7A priority Critical patent/CN116341004B/en
Publication of CN116341004A publication Critical patent/CN116341004A/en
Application granted granted Critical
Publication of CN116341004B publication Critical patent/CN116341004B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioethics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a longitudinal federal learning privacy leakage detection method based on feature embedding analysis. The method comprises the following steps: an inspector embeds shadow data into the training process of longitudinal federal learning (VFL); the feature embeddings of the shadow data and of the private training data of a target participant on the bottom model are acquired and smoothed; a proxy model that clones the bottom model is trained using the shadow data and its smoothed feature embeddings; and the private training data of the target participant is reconstructed through feature embedding matching with the proxy model, thereby detecting original data leakage of the longitudinal federal learning. The method simultaneously realizes vulnerability analysis of model leakage, original data leakage, and data feature leakage without damaging the utility of the VFL.

Description

Longitudinal federal learning privacy leakage detection method based on feature embedding analysis
Technical Field
The invention relates to the technical field of network security and privacy protection, and in particular to a longitudinal federal learning privacy leakage detection method based on feature embedding analysis.
Background
FL (Federated Learning) has become a promising privacy-friendly machine learning mechanism: participants periodically exchange intermediate results, rather than explicitly sharing training data, to achieve convergence of model training. Participants of VFL (Vertical Federated Learning) hold the same set of training samples but different feature subsets, i.e., vertically partitioned training data. In practice, VFL is suitable for fusing knowledge from heterogeneous and confidential feature sources held by potentially competing companies to drive powerful predictive analysis. For example, an insurance company may wish to combine the loan records of a customer with banking records of the same customer provided by different financial institutions to predict that customer's future financial risk.
In a VFL system, the local participants share the same sample space but partition the feature space of the data, while the server holds the labels of the training data. Each local participant hosts its own bottom model for extracting features from its data and transmits the corresponding feature embeddings to the server. The server trains a top model that takes the feature embeddings uploaded by the different participants, stitched together, as input. The feature embedding of a data instance is a compressed representation of a private training instance and can therefore serve as an information source for estimating the target training instance. According to the invention, an inspector can clone the bottom model and infer the original training data and data attributes without interfering with the utility of the VFL; privacy leakage in the VFL scenario can be detected using only a small amount of shadow data and an analysis of the intermediate results (namely, the feature embeddings of local data) submitted by a local participant to the server.
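For illustration only, the following minimal sketch shows such a split (bottom/top) architecture in PyTorch-style code; the module names, dimensions, and two-party setup are illustrative assumptions, not part of the claimed method.

```python
import torch
import torch.nn as nn

class BottomModel(nn.Module):
    """Local feature extractor hosted by one participant (illustrative)."""
    def __init__(self, in_dim: int, emb_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))

    def forward(self, x):
        # The embedding returned here is the intermediate result sent to the server.
        return self.net(x)

class TopModel(nn.Module):
    """Server-side model trained on the stitched embeddings (illustrative)."""
    def __init__(self, emb_dim: int, n_parties: int, n_classes: int):
        super().__init__()
        self.head = nn.Linear(emb_dim * n_parties, n_classes)

    def forward(self, embeddings):
        # Stitch the per-participant embeddings together as the top model input.
        return self.head(torch.cat(embeddings, dim=1))

# One forward pass: each participant embeds its own vertical feature slice.
bottoms = [BottomModel(10, 16), BottomModel(10, 16)]
top = TopModel(emb_dim=16, n_parties=2, n_classes=2)
x_slices = [torch.randn(8, 10), torch.randn(8, 10)]  # vertically partitioned batch
logits = top([b(x) for b, x in zip(bottoms, x_slices)])
```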
Although VFL is designed to protect privacy, many studies have demonstrated that it still carries various risks of privacy leakage. Previous work proposed that a malicious server can infer the training data of a local participant by actively manipulating the feature-embedding gradients sent to the target participant, i.e., by hijacking the VFL training process. Such a feature hijacking attack replaces the normal bottom model hosted by the target participant with a carefully crafted model, facilitating the reconstruction of proprietary training data. However, feature hijacking attacks cause a huge utility penalty for the VFL-trained classifier and are therefore unsuitable for real-world scenarios.
In one prior-art VFL privacy leakage detection scheme, the server infers the training data of the target participant by actively manipulating the feature-embedding gradients sent to the target participant, i.e., by hijacking the VFL training process.
Step 1: the server selects a new learning task to replace the original learning task selected by the client.
Step 2: the server uses its control over the client's training process to hijack the client's bottom model and direct it to a chosen target feature space.
Step 3: the server uses the hijacked target feature space to recover the private training instances by inversion.
The drawback of the above prior-art VFL privacy leakage detection scheme is that it causes a significant utility penalty for the VFL-trained classifier, limiting its use in real environments.
Disclosure of Invention
The embodiment of the invention provides a longitudinal federal learning privacy leakage detection method based on feature embedding analysis, which effectively preserves the utility of the VFL.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A longitudinal federal learning privacy leakage detection method based on feature embedding analysis comprises the following steps:
an inspector embeds shadow data into the training process of longitudinal federal learning;
the feature embeddings of the shadow data and of the private training data of a target participant of the longitudinal federal learning on the bottom model are acquired, and the feature embeddings are smoothed;
a proxy model of the bottom model is cloned using the shadow data and its smoothed feature embeddings;
the private training data of the target participant is reconstructed through feature embedding matching with the proxy model, and original data leakage of the longitudinal federal learning is detected.
Preferably, the inspector embedding shadow data into the training process of longitudinal federal learning includes:
a server acts as the inspector; the inspector selects shadow users and registers them into the training process of the longitudinal federal learning, where the original attributes of the shadow users' shadow data follow the same distribution as the attributes of the private training data of the target participant of the longitudinal federal learning, so that the shadow data participates in the training process of the longitudinal federal learning.
Preferably, the acquiring of the feature embeddings of the shadow data and of the private training data of a target participant of the longitudinal federal learning on the bottom model, and the smoothing of the feature embeddings, includes:
during the training process of the longitudinal federal learning, the inspector records the embedded shadow data together with the feature embeddings of the shadow data and of the private training data of the target participant on the bottom model, and smooths the feature embeddings over T consecutive rounds;
the smoothing mechanism for the feature embeddings is as follows: let $f_B$ be the bottom model of the target participant, and assume that the inspector initiates detection at round $t$. Record the feature embeddings of the private training data $x_v$ of the target participant on the bottom model at $T$ consecutive rounds, $\{f_B^t(x_v), f_B^{t+1}(x_v), \ldots, f_B^{t+T-1}(x_v)\}$, and record the shadow data $x_s$ and its feature embeddings on the bottom model at the same $T$ consecutive rounds, $\{f_B^t(x_s), f_B^{t+1}(x_s), \ldots, f_B^{t+T-1}(x_s)\}$. Smooth each sequence to obtain the smoothed feature embeddings $\bar{h}_v$ and $\bar{h}_s$:

$$\bar{h}_v = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_v), \qquad \bar{h}_s = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_s)$$
Preferably, the cloning of the proxy model of the bottom model using the shadow data and its smoothed feature embeddings includes:
the inspector uses the recorded set of $(x_s, \bar{h}_s)$ mapping pairs to learn a proxy model $\hat{f}_B$, and uses the proxy model $\hat{f}_B$ to approximate the mapping of the bottom model $f_B$ between the original feature space and the embedding space;
the learning process of the proxy model $\hat{f}_B$ is as follows: minimize the $\ell_2$ distance between the feature embedding $\hat{f}_B(x_s; \omega_B)$ generated by the proxy model and the smoothed feature embedding $\bar{h}_s$ generated by the true bottom model:

$$\omega_B^* = \arg\min_{\omega_B} \sum_{x_s} \left\|\hat{f}_B(x_s; \omega_B) - \bar{h}_s\right\|_2$$

where $\omega_B$ denotes the model parameters of the proxy model $\hat{f}_B$, and $\hat{f}_B(x_s; \omega_B)$ denotes the output of the shadow data $x_s$ on the proxy model.
Preferably, the reconstructing of the private training data of the target participant through feature embedding matching with the proxy model, and the detecting of original data leakage of the longitudinal federal learning, includes:
the private training data of the target participant is reconstructed through feature embedding matching with the proxy model, and a generator $f_G$ is used to solve the regression problem of the feature embedding matching. The generator $f_G$ is a decoder model that takes random noise $x_n$ as input and outputs a reconstructed image $f_G(x_n)$; the reconstructed image $f_G(x_n)$ is fed into the proxy model to obtain its feature embedding $\hat{f}_B(f_G(x_n))$. The inspector finds the optimal parameters of the generator $f_G$ by minimizing the $\ell_2$ distance between the feature embedding $\hat{f}_B(f_G(x_n))$ of the reconstructed image provided by the proxy model and the true feature embedding $\bar{h}_v$ of the target image corresponding to the private training data, so that the reconstructed image $f_G(x_n)$ produces an embedding as close as possible to that of the target image. The optimization is as follows:

$$\omega_G^* = \arg\min_{\omega_G} L_R\left(\hat{f}_B(f_G(x_n)),\, \bar{h}_v\right)$$

where $L_R(\cdot)$ is the embedding-matching loss function based on the mean square error;
after obtaining the optimal parameters of the generator $f_G$, the reconstructed image $f_G(x_n)$ obtained by the generator is compared with the private training data $x_v$ of the target participant to obtain the similarity between the two, thereby detecting the degree to which the longitudinal federal learning protects the original data.
Preferably, the method further comprises:
the inspector takes the discrete attribute values of the shadow data as class labels in a classification task, takes the average (smoothed) feature embeddings of the shadow data as input, trains an attribute decoder, and uses the attribute decoder to infer the attributes of the private training data of the target participant;
the attribute inference model for the private training data is defined as a multi-class classifier $f_C$, with one class label for each unique category of each attribute. Let $y^P := [y_1, y_2, \ldots, y_P]$ denote the $P$ attributes corresponding to the training data. The inspector trains the multi-class classifier $f_C$ using the recorded feature-embedding/attribute-value pairs $(\bar{h}_s, y_s^P)$, where $\hat{y}^P$ denotes the $P$-class output of the classifier. The objective of the optimization problem of the multi-class classifier $f_C$ is to minimize the empirical classification loss of the classifier $f_C$ on the collected shadow data:

$$\omega_C^* = \arg\min_{\omega_C} \sum_{x_s} L_C\left(f_C(\bar{h}_s; \omega_C),\, y_s^P\right)$$

where $L_C$ is the cross-entropy loss, $f_C(\bar{h}_s; \omega_C)$ is the output of the average feature embedding of the shadow data on the classifier, and $y_s^P$ is the attribute label corresponding to the shadow data $x_s$;
after the multi-class classifier $f_C$ has been trained, the inspector predicts the attributes of $x_v$ from the smoothed embedding $\bar{h}_v$ of the private training data of the target participant, i.e., $\hat{y}_v^P = f_C(\bar{h}_v)$. During training, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_s$ of the shadow data as input and outputs the attribute label $\hat{y}_s^P$ corresponding to the shadow data $x_s$; at inference time, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_v$ of the private training data of the target participant as input and outputs the attribute label $\hat{y}_v^P$ corresponding to the private training data $x_v$. By comparing the predicted attribute label $\hat{y}_v^P$ with the true attribute label $y_v^P$ of the private training data $x_v$, the protection capability of the longitudinal federal learning model over user data is detected.
According to the technical scheme provided by the embodiment of the invention, a longitudinal federal learning privacy leakage detection method based on feature embedding analysis is provided. Prior detection methods based on feature hijacking do not cover model leakage analysis and destroy the utility of the VFL. The invention simultaneously realizes vulnerability analysis of model leakage, original data leakage, and data feature leakage without damaging the utility of the VFL.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; other drawings may be derived from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of implementation of a longitudinal federal learning privacy disclosure detection method based on feature embedding analysis according to an embodiment of the present invention;
fig. 2 is a process flow diagram of a longitudinal federal learning privacy disclosure detection method based on feature embedding analysis according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
To facilitate an understanding of the embodiments of the invention, reference will now be made to several specific embodiments illustrated in the accompanying drawings; these illustrations in no way limit the embodiments of the invention.
The embodiment of the invention designs a longitudinal federal learning privacy leakage detection method based on feature embedding analysis. The method can perform bottom model leakage detection, original data leakage detection, and data attribute leakage detection. Before model training, the inspector recruits a small number of shadow users and registers their data into the VFL training process. Following the standard setup of existing privacy leakage detection, we assume that the original data attributes of the shadow users follow the same distribution as the attributes of the private training data. During training, the inspector records the complete shadow data, the data features, and the feature embeddings of all training data for subsequent privacy leakage analysis.
The first step in implementing privacy leakage detection is to apply a smoothing enhancement technique to the received feature embeddings, to suppress the embedding fluctuations caused by updates of the bottom model during training. Based on the collected original shadow data and the corresponding smoothed feature embeddings, a proxy model can be trained to approximate the transformation from original data to feature embeddings, thereby cloning the bottom model. Based on the cloned proxy model, the inspector can further match the feature embeddings of the private training data of the target participant on the bottom model against the feature embeddings of reconstructed data in the proxy model, optimizing the reconstructed data until it is arbitrarily close to the true target data. In addition, the inspector can use the discrete attribute values of the shadow data as class labels and the corresponding average feature embeddings as features in a classification task to train an attribute decoder, and then infer the attributes of the target data with that decoder.
The feature embeddings of the data and the corresponding gradients are the only information exchanged between the local participants and the server during co-training in a VFL system. The server may infer the training data of a local participant by actively manipulating the feature-embedding gradients sent to the target participant, in order to assess the vulnerability to data leakage in the VFL system. However, this approach falsifies the gradients returned to the target participant to force the participant-generated feature embeddings to converge to the feature space desired by the attacker, which compromises the utility of the VFL.
The implementation schematic diagram of the longitudinal federal learning privacy leakage detection method based on feature embedding analysis provided by the embodiment of the invention is shown in fig. 1, the specific processing flow is shown in fig. 2, and the method comprises the following processing steps:
step S1: embedding shadow data by a detector in the VFL training process, so that the shadow data participates in the training process of the VFL;
in the invention, the inspector can be an honest but curious server,
step S2: recording the shadow data, the characteristics of the shadow data and the characteristic embedding of all the data on the bottom model, and smoothing the characteristic embedding at the continuous T time.
Step S3: embedding the smoothed characteristics of the shadow data and the private training data of the target participant into a cloned proxy model;
step S4: reconstructing private training data based on the cloned proxy model and smoothed features of the private training data;
step S5: the feature of the shadow data and the corresponding smooth feature are utilized to embed a training attribute decoder and applied to the feature embedding of the private training data so as to infer the sensitive attribute of the private training data.
Specifically, the step S1 includes: the inspector recruits a small number of shadow users and registers these shadow users into the VFL training process. It is assumed that the original attributes of the shadow users' shadow data follow the same distribution as the attributes of the private training data of the target participants in the VFL. The attributes of the shadow users are provided to the local feature set owned by each local participant before the VFL training begins.
Specifically, the step S2 includes: recording the shadow data and the feature embeddings, and smoothing the feature embeddings.
During the training of the VFL, the inspector needs to record the embedded shadow data and the feature embeddings of all the data on the bottom model, and to smooth the feature embeddings over T consecutive rounds. The specific smoothing mechanism is as follows: let $f_B$ be the bottom model of the target participant, and assume that the inspector initiates detection at round $t$. Record the feature embeddings of the private training data $x_v$ of the target participant on the bottom model at $T$ consecutive rounds, $\{f_B^t(x_v), f_B^{t+1}(x_v), \ldots, f_B^{t+T-1}(x_v)\}$, and record the shadow data $x_s$ and its feature embeddings on the bottom model at the same $T$ consecutive rounds, $\{f_B^t(x_s), f_B^{t+1}(x_s), \ldots, f_B^{t+T-1}(x_s)\}$. Smooth each sequence to obtain the smoothed feature embeddings $\bar{h}_v$ and $\bar{h}_s$:

$$\bar{h}_v = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_v), \qquad \bar{h}_s = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_s)$$
The smoothed feature embeddings help to suppress the embedding fluctuations caused by bottom-model updates during training, which stabilizes the results of the subsequent privacy leakage detection.
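As an illustrative sketch (not part of the claimed method), this smoothing step can be realized by averaging the recorded per-round embedding tensors; the helper name and tensor shapes below are assumptions:

```python
import torch

def smooth_embeddings(history):
    """Average the feature embeddings recorded over T consecutive rounds.

    history: list of T tensors, each of shape (n_samples, emb_dim), where
    entry i holds the embeddings produced by the bottom model at round t+i.
    """
    return torch.stack(history, dim=0).mean(dim=0)

# e.g. embeddings of the shadow data recorded over T = 5 rounds
history_s = [torch.randn(32, 16) for _ in range(5)]
h_s_bar = smooth_embeddings(history_s)  # smoothed embedding, shape (32, 16)
```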
Specifically, the step S3 includes: analyzing the leakage vulnerability of the bottom model via a model stealing attack on the bottom model.
Based on the shadow data $x_s$ recorded in step S2 and the corresponding smoothed feature embeddings $\bar{h}_s$, the inspector can train a proxy model to approximate the transformation from training data to feature embeddings, thereby cloning the bottom model. The inspector uses the recorded set of $(x_s, \bar{h}_s)$ mapping pairs to learn a proxy model $\hat{f}_B$, and uses the proxy model $\hat{f}_B$ to approximate the mapping of the bottom model $f_B$ between the original feature space and the embedding space.
The learning process of the proxy model $\hat{f}_B$ is as follows: minimize the $\ell_2$ distance between the feature embedding $\hat{f}_B(x_s; \omega_B)$ generated by the proxy model and the smoothed feature embedding $\bar{h}_s$ generated by the true bottom model:

$$\omega_B^* = \arg\min_{\omega_B} \sum_{x_s} \left\|\hat{f}_B(x_s; \omega_B) - \bar{h}_s\right\|_2$$

where $\omega_B$ denotes the model parameters of the proxy model $\hat{f}_B$, and $\hat{f}_B(x_s; \omega_B)$ denotes the output of the shadow data $x_s$ on the proxy model.
By optimizing the above equation, given the original features of a data instance, the learned proxy model $\hat{f}_B$ can generate feature embeddings approximately identical to those of the true bottom model $f_B$.
Specifically, the step S4 includes: analyzing the vulnerability to data leakage via a data reconstruction attack.
Based on the proxy model learned in step S3, the inspector can further recover the private training data of the target participant through feature embedding matching. This embedding-matching process can be expressed as a regression problem: the goal is to find estimates of the true original attribute values whose feature embeddings best match those generated from the true attributes.
To recover image data, the invention introduces a generator $f_G$ to help solve the regression problem. The generator $f_G$ is a decoder model that takes random noise $x_n$ as input and outputs a reconstructed image $f_G(x_n)$. An image typically contains hundreds or thousands of pixels; directly estimating the value of each pixel makes it difficult to obtain stable reconstruction results, because solving such a high-dimensional regression task is prone to the curse of dimensionality. The reconstructed image is fed into the proxy model to obtain its feature embedding $\hat{f}_B(f_G(x_n))$. The inspector finds the optimal parameters $\omega_G$ of the generator $f_G$ by minimizing the $\ell_2$ distance between the feature embedding of the reconstructed image provided by the proxy model and the true feature embedding of the target image, so that the reconstructed image $f_G(x_n)$ produces an embedding as close as possible to that of the target image. The optimization formula is as follows:

$$\omega_G^* = \arg\min_{\omega_G} L_R\left(\hat{f}_B(f_G(x_n)),\, \bar{h}_v\right)$$

where $L_R(\cdot)$ is the embedding-matching loss function, i.e., a loss based on the mean square error (MSE).
In this optimization problem, we treat the estimate $\omega_G$ as the variable. The reason is that, if the proxy model accurately approximates the feature-embedding transformation of the bottom model, then performing feature embedding matching drives the estimate $f_G(x_n)$ close to the true $x_v$. For recovering image data, we further add a total variation (TV) regularization term $R_{tv}(f_G(x_n))$ to the objective, to improve the smoothness of the reconstructed image. For reconstructing the numerical features of tabular data, we drop $R_{tv}(\cdot)$, because numerical attributes do not necessarily obey smoothness constraints as images do. Furthermore, since the number of numerical features is typically much smaller than the number of pixels in an image, we can estimate the attributes $\hat{x}_v$ directly, without introducing the generator module $f_G$. For reconstructing the numerical attributes of tabular data, the optimization therefore reduces to:

$$\hat{x}_v^* = \arg\min_{\hat{x}_v} L_R\left(\hat{f}_B(\hat{x}_v),\, \bar{h}_v\right)$$

where $L_R(\cdot)$ is the embedding-matching loss function based on the mean square error;
after obtaining the optimal parameters of the generator $f_G$, the reconstructed image $f_G(x_n)$ obtained by the generator is compared with the private training data $x_v$ of the target participant to obtain the similarity between the two, thereby detecting the degree to which the longitudinal federal learning protects the original data.
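An illustrative sketch of the generator-based embedding matching with TV regularization, assuming PyTorch; the generator and proxy architectures, image size, and hyperparameters are stand-in assumptions:

```python
import torch
import torch.nn as nn

def tv_regularizer(img):
    """Total variation of a batch of images (smoothness prior R_tv)."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw

def reconstruct(proxy, generator, noise, h_v_bar, steps=300, lam=1e-2, lr=1e-2):
    """Optimize the generator so that the proxy embedding of the
    reconstructed image matches the smoothed target embedding (MSE + TV)."""
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    mse = nn.MSELoss()  # embedding-matching loss L_R
    for _ in range(steps):
        opt.zero_grad()
        img = generator(noise)                              # f_G(x_n)
        loss = mse(proxy(img), h_v_bar) + lam * tv_regularizer(img)
        loss.backward()
        opt.step()
    return generator(noise).detach()                        # reconstructed data

# Stand-in modules: a toy generator producing 1x28x28 images and a toy proxy.
generator = nn.Sequential(nn.Linear(8, 28 * 28), nn.Tanh(),
                          nn.Unflatten(1, (1, 28, 28)))
proxy = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 16))
noise = torch.randn(4, 8)            # random noise input x_n
h_v_bar = torch.randn(4, 16)         # smoothed target embeddings
recon = reconstruct(proxy, generator, noise, h_v_bar, steps=50)
```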
Specifically, the step S5 includes: analyzing the vulnerability to data attribute leakage via an attribute inference attack.
Given the target attributes of the training shadow data, we use the discrete attribute values of the shadow data as class labels in a classification task. We define the attribute inference model as a multi-class classifier, with one class label for each unique category of each attribute. Let $y^P := [y_1, y_2, \ldots, y_P]$ denote the $P$ attributes corresponding to the training data. The inspector trains a multi-class classifier $f_C$ using the recorded feature-embedding/attribute-value pairs $(\bar{h}_s, y_s^P)$, where $\hat{y}^P$ denotes the $P$-class output of the classifier. The objective of this optimization problem is to minimize the empirical classification loss of the classifier $f_C$ on the collected shadow data:

$$\omega_C^* = \arg\min_{\omega_C} \sum_{x_s} L_C\left(f_C(\bar{h}_s; \omega_C),\, y_s^P\right)$$

where $L_C$ is the cross-entropy loss, $f_C(\bar{h}_s; \omega_C)$ is the output of the average feature embedding of the shadow data on the classifier, and $y_s^P$ is the attribute label corresponding to the shadow data $x_s$;
after the multi-class classifier $f_C$ has been trained, the inspector predicts the attributes of $x_v$ from the smoothed embedding $\bar{h}_v$ of the private training data of the target participant, i.e., $\hat{y}_v^P = f_C(\bar{h}_v)$. During training, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_s$ of the shadow data as input and outputs the attribute label $\hat{y}_s^P$ corresponding to the shadow data $x_s$; at inference time, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_v$ of the private training data of the target participant as input and outputs the attribute label $\hat{y}_v^P$ corresponding to the private training data $x_v$. By comparing the predicted attribute label $\hat{y}_v^P$ with the true attribute label $y_v^P$ of the private training data $x_v$, the protection capability of the longitudinal federal learning model over user data is detected.
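An illustrative sketch of the attribute decoder, assuming PyTorch and random stand-ins for the recorded embeddings and labels (architecture and hyperparameters are assumptions):

```python
import torch
import torch.nn as nn

def train_attribute_decoder(emb_dim, n_classes, h_s_bar, y_s, epochs=100, lr=1e-3):
    """Train the multi-class classifier f_C on pairs of (smoothed shadow
    embedding, discrete attribute label) using cross-entropy loss L_C."""
    f_c = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                        nn.Linear(64, n_classes))
    opt = torch.optim.Adam(f_c.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = ce(f_c(h_s_bar), y_s)
        loss.backward()
        opt.step()
    return f_c

h_s_bar = torch.randn(32, 16)             # smoothed shadow embeddings
y_s = torch.randint(0, 4, (32,))          # discrete attribute labels (4 classes)
f_c = train_attribute_decoder(16, 4, h_s_bar, y_s)

h_v_bar = torch.randn(8, 16)              # smoothed target embeddings
y_v_pred = f_c(h_v_bar).argmax(dim=1)     # inferred attribute labels
```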
We evaluated the effectiveness of the feature-embedding-analysis-based longitudinal federal learning privacy leakage detection method provided by the embodiment of the invention on three models of different complexity (FCNN, LeNet, and ResNet) and five datasets (Bank Marketing, Credit, Census, UTKFace, CelebA). Experimental results show that, compared with existing methods, the method realizes privacy leakage analysis of the model, the original data, and the data features while leaving the utility of the VFL undamaged.
We thus show that the intermediate results uploaded by participants still carry important information about the training data, and that with a properly designed method, high-fidelity original images can be reconstructed stably and effectively even without auxiliary data or complex recovery procedures. We hope this work motivates people to rethink the role of VFL in model and data privacy protection, and further strengthens the design and development of existing privacy protection frameworks.
In summary, the feature-embedding-analysis-based longitudinal federal learning privacy leakage detection method provided by the embodiment of the invention does not interfere with the training process of the VFL, thereby preserving the utility of the VFL. Second, the method simultaneously realizes privacy leakage analysis of the model, the original data, and the data features.
The feature-embedding-analysis-based longitudinal federal learning privacy leakage detection method provided by the invention effectively realizes all-round detection and analysis of privacy leakage vulnerabilities in the VFL, and the proposed detection method has no negative influence on the utility of the VFL; the proposed smoothing strategy can effectively resist noise interference during training.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment mainly describes its differences from the others. In particular, the apparatus and system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (6)

1. A longitudinal federal learning privacy leakage detection method based on feature embedding analysis, characterized by comprising the following steps:
an inspector embeds shadow data into the training process of longitudinal federal learning;
the feature embeddings of the shadow data and of the private training data of a target participant of the longitudinal federal learning on the bottom model are acquired, and the feature embeddings are smoothed;
a proxy model of the bottom model is cloned using the shadow data and its smoothed feature embeddings;
the private training data of the target participant is reconstructed through feature embedding matching with the proxy model, and original data leakage of the longitudinal federal learning is detected.
2. The method of claim 1, wherein the inspector embedding shadow data into the training process of longitudinal federal learning comprises:
a server acts as the inspector; the inspector selects shadow users and registers them into the training process of the longitudinal federal learning, where the original attributes of the shadow users' shadow data follow the same distribution as the attributes of the private training data of the target participant of the longitudinal federal learning, so that the shadow data participates in the training process of the longitudinal federal learning.
3. The method according to claim 1 or 2, wherein the acquiring of the feature embeddings of the shadow data and of the private training data of a target participant of the longitudinal federal learning on the bottom model, and the smoothing of the feature embeddings, comprises:
in the training process of the longitudinal federal learning, the inspector records the embedded shadow data together with the feature embeddings of the shadow data and of the private training data of the target participant on the bottom model, and smooths the feature embeddings over T consecutive rounds;
the smoothing mechanism for the feature embeddings is as follows: let $f_B$ be the bottom model of the target participant, and assume that the inspector initiates detection at round $t$; record the feature embeddings of the private training data $x_v$ of the target participant on the bottom model at $T$ consecutive rounds, $\{f_B^t(x_v), f_B^{t+1}(x_v), \ldots, f_B^{t+T-1}(x_v)\}$, and record the shadow data $x_s$ and its feature embeddings on the bottom model at the same $T$ consecutive rounds, $\{f_B^t(x_s), f_B^{t+1}(x_s), \ldots, f_B^{t+T-1}(x_s)\}$; smooth each sequence to obtain the smoothed feature embeddings $\bar{h}_v$ and $\bar{h}_s$:

$$\bar{h}_v = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_v), \qquad \bar{h}_s = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_s)$$
4. The method of claim 3, wherein the cloning of the proxy model of the bottom model using the shadow data and its smoothed feature embeddings comprises:
the inspector uses the recorded set of $(x_s, \bar{h}_s)$ mapping pairs to learn a proxy model $\hat{f}_B$, and uses the proxy model $\hat{f}_B$ to approximate the mapping of the bottom model $f_B$ between the original feature space and the embedding space;
the learning process of the proxy model $\hat{f}_B$ is as follows: minimize the $\ell_2$ distance between the feature embedding $\hat{f}_B(x_s; \omega_B)$ generated by the proxy model and the smoothed feature embedding $\bar{h}_s$ generated by the true bottom model:

$$\omega_B^* = \arg\min_{\omega_B} \sum_{x_s} \left\|\hat{f}_B(x_s; \omega_B) - \bar{h}_s\right\|_2$$

where $\omega_B$ denotes the model parameters of the proxy model $\hat{f}_B$, and $\hat{f}_B(x_s; \omega_B)$ denotes the output of the shadow data $x_s$ on the proxy model.
5. The method of claim 4, wherein the reconstructing of the private training data of the target participant through feature embedding matching with the proxy model, and the detecting of original data leakage of the longitudinal federal learning, comprises:
the private training data of the target participant is reconstructed through feature embedding matching with the proxy model, and a generator $f_G$ is used to solve the regression problem of the feature embedding matching; the generator $f_G$ is a decoder model that takes random noise $x_n$ as input and outputs a reconstructed image $f_G(x_n)$; the reconstructed image $f_G(x_n)$ is fed into the proxy model to obtain its feature embedding $\hat{f}_B(f_G(x_n))$; the inspector finds the optimal parameters of the generator $f_G$ by minimizing the $\ell_2$ distance between the feature embedding $\hat{f}_B(f_G(x_n))$ of the reconstructed image provided by the proxy model and the true feature embedding $\bar{h}_v$ of the target image corresponding to the private training data, so that the reconstructed image $f_G(x_n)$ produces an embedding as close as possible to that of the target image; the optimization is as follows:

$$\omega_G^* = \arg\min_{\omega_G} L_R\left(\hat{f}_B(f_G(x_n)),\, \bar{h}_v\right)$$

where $L_R(\cdot)$ is the embedding-matching loss function based on the mean square error;
after obtaining the optimal parameters of the generator $f_G$, the reconstructed image $f_G(x_n)$ obtained by the generator is compared with the private training data $x_v$ of the target participant to obtain the similarity between the two, thereby detecting the degree to which the longitudinal federal learning protects the original data.
6. A method according to claim 3, wherein the method further comprises:
the inspector takes the discrete attribute values of the shadow data as class labels in a classification task, takes the average (smoothed) feature embeddings of the shadow data as input, trains an attribute decoder, and uses the attribute decoder to infer the attributes of the private training data of the target participant;
the attribute inference model for the private training data is defined as a multi-class classifier $f_C$, with one class label for each unique category of each attribute; let $y^P := [y_1, y_2, \ldots, y_P]$ denote the $P$ attributes corresponding to the training data; the inspector trains the multi-class classifier $f_C$ using the recorded feature-embedding/attribute-value pairs $(\bar{h}_s, y_s^P)$, where $\hat{y}^P$ denotes the $P$-class output of the classifier; the objective of the optimization problem of the multi-class classifier $f_C$ is to minimize the empirical classification loss of the classifier $f_C$ on the collected shadow data:

$$\omega_C^* = \arg\min_{\omega_C} \sum_{x_s} L_C\left(f_C(\bar{h}_s; \omega_C),\, y_s^P\right)$$

where $L_C$ is the cross-entropy loss, $f_C(\bar{h}_s; \omega_C)$ is the output of the average feature embedding of the shadow data on the classifier, and $y_s^P$ is the attribute label corresponding to the shadow data $x_s$;
after the multi-class classifier $f_C$ has been trained, the inspector predicts the attributes of $x_v$ from the smoothed embedding $\bar{h}_v$ of the private training data of the target participant, i.e., $\hat{y}_v^P = f_C(\bar{h}_v)$; during training, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_s$ of the shadow data as input and outputs the attribute label $\hat{y}_s^P$ corresponding to the shadow data $x_s$; at inference time, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_v$ of the private training data of the target participant as input and outputs the attribute label $\hat{y}_v^P$ corresponding to the private training data $x_v$; by comparing the predicted attribute label $\hat{y}_v^P$ with the true attribute label $y_v^P$ of the private training data $x_v$, the protection capability of the longitudinal federal learning model over user data is detected.
CN202310304542.7A 2023-03-27 2023-03-27 Longitudinal federal learning privacy leakage detection method based on feature embedding analysis Active CN116341004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310304542.7A CN116341004B (en) 2023-03-27 2023-03-27 Longitudinal federal learning privacy leakage detection method based on feature embedding analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310304542.7A CN116341004B (en) 2023-03-27 2023-03-27 Longitudinal federal learning privacy leakage detection method based on feature embedding analysis

Publications (2)

Publication Number Publication Date
CN116341004A true CN116341004A (en) 2023-06-27
CN116341004B CN116341004B (en) 2023-09-08

Family

ID=86881870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310304542.7A Active CN116341004B (en) 2023-03-27 2023-03-27 Longitudinal federal learning privacy leakage detection method based on feature embedding analysis

Country Status (1)

Country Link
CN (1) CN116341004B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592042A (en) * 2024-01-17 2024-02-23 杭州海康威视数字技术股份有限公司 Privacy disclosure detection method and device for federal recommendation system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094758A (en) * 2021-06-08 2021-07-09 华中科技大学 Gradient disturbance-based federated learning data privacy protection method and system
US20220222539A1 (en) * 2021-01-12 2022-07-14 Sap Se Adversarial learning of privacy preserving representations
CN114936372A (en) * 2022-04-06 2022-08-23 湘潭大学 Model protection method based on three-party homomorphic encryption longitudinal federal learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220222539A1 (en) * 2021-01-12 2022-07-14 Sap Se Adversarial learning of privacy preserving representations
CN113094758A (en) * 2021-06-08 2021-07-09 华中科技大学 Gradient disturbance-based federated learning data privacy protection method and system
CN114936372A (en) * 2022-04-06 2022-08-23 湘潭大学 Model protection method based on three-party homomorphic encryption longitudinal federal learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592042A (en) * 2024-01-17 2024-02-23 杭州海康威视数字技术股份有限公司 Privacy disclosure detection method and device for federal recommendation system
CN117592042B (en) * 2024-01-17 2024-04-05 杭州海康威视数字技术股份有限公司 Privacy disclosure detection method and device for federal recommendation system

Also Published As

Publication number Publication date
CN116341004B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
Qian et al. Thinking in frequency: Face forgery detection by mining frequency-aware clues
Guo et al. Fake face detection via adaptive manipulation traces extraction network
Li et al. Auditing privacy defenses in federated learning via generative gradient leakage
CN113688855B (en) Data processing method, federal learning training method, related device and equipment
Yuan et al. Robust visual tracking with correlation filters and metric learning
Salehi et al. Arae: Adversarially robust training of autoencoders improves novelty detection
Gan et al. Multigraph fusion for dynamic graph convolutional network
US20230021661A1 (en) Forgery detection of face image
Li et al. Privacy-preserving lightweight face recognition
CN116341004B (en) Longitudinal federal learning privacy leakage detection method based on feature embedding analysis
Chen et al. Self-supervised vision transformer-based few-shot learning for facial expression recognition
CN115563650A (en) Privacy protection system for realizing medical data based on federal learning
Roy et al. 3D CNN architectures and attention mechanisms for deepfake detection
Huang et al. Robust zero-watermarking scheme based on a depthwise overparameterized VGG network in healthcare information security
CN111726472A (en) Image anti-interference method based on encryption algorithm
Yin et al. Dynamic difference learning with spatio-temporal correlation for deepfake video detection
Wang et al. Cross-view representation learning for multi-view logo classification with information bottleneck
Xu et al. Visual-semantic transformer for face forgery detection
Ye et al. Privacy-preserving age estimation for content rating
Li et al. High-capacity coverless image steganographic scheme based on image synthesis
Zhang et al. Effective presentation attack detection driven by face related task
Zhou et al. Neural encoding and decoding with a flow-based invertible generative model
Xu et al. FLPM: A property modification scheme for data protection in federated learning
Ilyas et al. E-Cap Net: an efficient-capsule network for shallow and deepfakes forgery detection
Zhu et al. A face occlusion removal and privacy protection method for IoT devices based on generative adversarial networks

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant