CN116341004A - Longitudinal federal learning privacy leakage detection method based on feature embedding analysis

Longitudinal federal learning privacy leakage detection method based on feature embedding analysis

Info

Publication number
CN116341004A
CN116341004A
Authority
CN
China
Prior art keywords
data
model
shadow
embedding
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310304542.7A
Other languages
Chinese (zh)
Other versions
CN116341004B (en)
Inventor
王伟
许向蕊
管晓宏
沈超
刘鹏睿
吕晓婷
郝玉蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Xian Jiaotong University
Original Assignee
Beijing Jiaotong University
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University, Xian Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202310304542.7A priority Critical patent/CN116341004B/en
Publication of CN116341004A publication Critical patent/CN116341004A/en
Application granted granted Critical
Publication of CN116341004B publication Critical patent/CN116341004B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioethics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a longitudinal federal learning privacy leakage detection method based on feature embedding analysis. The method comprises the following steps: an inspector embeds shadow data into the training process of longitudinal federal learning (VFL); the feature embeddings of the shadow data and of the private training data of a target participant on the bottom model are acquired and smoothed; a proxy model that clones the bottom model is trained using the shadow data and its smoothed feature embeddings; and the private training data of the target participant is reconstructed through feature embedding matching with the proxy model, thereby detecting original data leakage of the longitudinal federal learning. The method simultaneously realizes vulnerability analysis of model leakage, original data leakage, and data feature leakage without damaging the utility of the VFL.

Description

Longitudinal federal learning privacy leakage detection method based on feature embedding analysis
Technical Field
The invention relates to the technical field of network security and privacy protection, and in particular to a longitudinal federal learning privacy leakage detection method based on feature embedding analysis.
Background
FL (Federated Learning) has become a promising privacy-friendly machine learning mechanism: participants periodically exchange intermediate results, rather than explicitly sharing training data, to achieve convergence of model training. Participants of VFL (Vertical Federated Learning) hold the same set of training samples but different feature subsets, i.e., vertically partitioned training data. In practice, VFL is suitable for fusing knowledge from heterogeneous and confidential feature sources held by potentially competing companies to drive powerful predictive analysis. For example, an insurance company may wish to combine the loan records of a customer with banking records of the same customer provided by different financial institutions to predict that customer's future financial risk.
In a VFL system, the local participants share the same sample space but partition the feature space of the data, while the server holds the labels of the training data. Each local participant hosts its own bottom model for extracting features from its data and transmits the corresponding feature embeddings to the server. The server trains a top model that takes the feature embeddings uploaded by the different participants, stitched together, as input. The feature embedding of a data instance is a compressed representation of a private training instance and can therefore serve as an information source for estimating the target training instance. According to the invention, an inspector can clone the bottom model and infer the original training data and data attributes without interfering with the utility of the VFL; privacy leakage in the VFL scenario can be detected using only a small amount of shadow data and an analysis of the intermediate results (namely, the feature embeddings of local data) submitted by a local participant to the server.
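For illustration only, the following minimal sketch shows such a split (bottom/top) architecture in PyTorch-style code; the module names, dimensions, and two-party setup are illustrative assumptions, not part of the claimed method.

```python
import torch
import torch.nn as nn

class BottomModel(nn.Module):
    """Local feature extractor hosted by one participant (illustrative)."""
    def __init__(self, in_dim: int, emb_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))

    def forward(self, x):
        # The embedding returned here is the intermediate result sent to the server.
        return self.net(x)

class TopModel(nn.Module):
    """Server-side model trained on the stitched embeddings (illustrative)."""
    def __init__(self, emb_dim: int, n_parties: int, n_classes: int):
        super().__init__()
        self.head = nn.Linear(emb_dim * n_parties, n_classes)

    def forward(self, embeddings):
        # Stitch the per-participant embeddings together as the top model input.
        return self.head(torch.cat(embeddings, dim=1))

# One forward pass: each participant embeds its own vertical feature slice.
bottoms = [BottomModel(10, 16), BottomModel(10, 16)]
top = TopModel(emb_dim=16, n_parties=2, n_classes=2)
x_slices = [torch.randn(8, 10), torch.randn(8, 10)]  # vertically partitioned batch
logits = top([b(x) for b, x in zip(bottoms, x_slices)])
```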
Although VFL is designed to protect privacy, many studies have demonstrated that it still carries various risks of privacy leakage. Previous work proposed that a malicious server can infer the training data of a local participant by actively manipulating the feature-embedding gradients sent to the target participant, i.e., by hijacking the VFL training process. Such a feature hijacking attack replaces the normal bottom model hosted by the target participant with a carefully crafted model, facilitating the reconstruction of proprietary training data. However, feature hijacking attacks cause a huge utility penalty for the VFL-trained classifier and are therefore unsuitable for real-world scenarios.
In one prior-art VFL privacy leakage detection scheme, the server infers the training data of the target participant by actively manipulating the feature-embedding gradients sent to the target participant, i.e., by hijacking the VFL training process.
Step 1: the server selects a new learning task to replace the original learning task selected by the client.
Step 2: the server uses its control over the client's training process to hijack the client's bottom model and direct it to a chosen target feature space.
Step 3: the server uses the hijacked target feature space to recover the private training instances by inversion.
The drawback of the above prior-art VFL privacy leakage detection scheme is that it causes a significant utility penalty for the VFL-trained classifier, limiting its use in real environments.
Disclosure of Invention
The embodiment of the invention provides a longitudinal federal learning privacy leakage detection method based on feature embedding analysis, which effectively preserves the utility of the VFL.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A longitudinal federal learning privacy leakage detection method based on feature embedding analysis comprises the following steps:
an inspector embeds shadow data into the training process of longitudinal federal learning;
the feature embeddings of the shadow data and of the private training data of a target participant of the longitudinal federal learning on the bottom model are acquired, and the feature embeddings are smoothed;
a proxy model of the bottom model is cloned using the shadow data and its smoothed feature embeddings;
the private training data of the target participant is reconstructed through feature embedding matching with the proxy model, and original data leakage of the longitudinal federal learning is detected.
Preferably, the inspector embedding shadow data into the training process of longitudinal federal learning includes:
a server acts as the inspector; the inspector selects shadow users and registers them into the training process of the longitudinal federal learning, where the original attributes of the shadow users' shadow data follow the same distribution as the attributes of the private training data of the target participant of the longitudinal federal learning, so that the shadow data participates in the training process of the longitudinal federal learning.
Preferably, the acquiring of the feature embeddings of the shadow data and of the private training data of a target participant of the longitudinal federal learning on the bottom model, and the smoothing of the feature embeddings, includes:
during the training process of the longitudinal federal learning, the inspector records the embedded shadow data together with the feature embeddings of the shadow data and of the private training data of the target participant on the bottom model, and smooths the feature embeddings over T consecutive rounds;
the smoothing mechanism for the feature embeddings is as follows: let $f_B$ be the bottom model of the target participant, and assume that the inspector initiates detection at round $t$. Record the feature embeddings of the private training data $x_v$ of the target participant on the bottom model at $T$ consecutive rounds, $\{f_B^t(x_v), f_B^{t+1}(x_v), \ldots, f_B^{t+T-1}(x_v)\}$, and record the shadow data $x_s$ and its feature embeddings on the bottom model at the same $T$ consecutive rounds, $\{f_B^t(x_s), f_B^{t+1}(x_s), \ldots, f_B^{t+T-1}(x_s)\}$. Smooth each sequence to obtain the smoothed feature embeddings $\bar{h}_v$ and $\bar{h}_s$:

$$\bar{h}_v = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_v), \qquad \bar{h}_s = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_s)$$
Preferably, the cloning of the proxy model of the bottom model using the shadow data and its smoothed feature embeddings includes:
the inspector uses the recorded set of $(x_s, \bar{h}_s)$ mapping pairs to learn a proxy model $\hat{f}_B$, and uses the proxy model $\hat{f}_B$ to approximate the mapping of the bottom model $f_B$ between the original feature space and the embedding space;
the learning process of the proxy model $\hat{f}_B$ is as follows: minimize the $\ell_2$ distance between the feature embedding $\hat{f}_B(x_s; \omega_B)$ generated by the proxy model and the smoothed feature embedding $\bar{h}_s$ generated by the true bottom model:

$$\omega_B^* = \arg\min_{\omega_B} \sum_{x_s} \left\|\hat{f}_B(x_s; \omega_B) - \bar{h}_s\right\|_2$$

where $\omega_B$ denotes the model parameters of the proxy model $\hat{f}_B$, and $\hat{f}_B(x_s; \omega_B)$ denotes the output of the shadow data $x_s$ on the proxy model.
Preferably, the reconstructing of the private training data of the target participant through feature embedding matching with the proxy model, and the detecting of original data leakage of the longitudinal federal learning, includes:
the private training data of the target participant is reconstructed through feature embedding matching with the proxy model, and a generator $f_G$ is used to solve the regression problem of the feature embedding matching. The generator $f_G$ is a decoder model that takes random noise $x_n$ as input and outputs a reconstructed image $f_G(x_n)$; the reconstructed image $f_G(x_n)$ is fed into the proxy model to obtain its feature embedding $\hat{f}_B(f_G(x_n))$. The inspector finds the optimal parameters of the generator $f_G$ by minimizing the $\ell_2$ distance between the feature embedding $\hat{f}_B(f_G(x_n))$ of the reconstructed image provided by the proxy model and the true feature embedding $\bar{h}_v$ of the target image corresponding to the private training data, so that the reconstructed image $f_G(x_n)$ produces an embedding as close as possible to that of the target image. The optimization is as follows:

$$\omega_G^* = \arg\min_{\omega_G} L_R\left(\hat{f}_B(f_G(x_n)),\, \bar{h}_v\right)$$

where $L_R(\cdot)$ is the embedding-matching loss function based on the mean square error;
after obtaining the optimal parameters of the generator $f_G$, the reconstructed image $f_G(x_n)$ obtained by the generator is compared with the private training data $x_v$ of the target participant to obtain the similarity between the two, thereby detecting the degree to which the longitudinal federal learning protects the original data.
Preferably, the method further comprises:
the inspector takes the discrete attribute values of the shadow data as class labels in a classification task, takes the average (smoothed) feature embeddings of the shadow data as input, trains an attribute decoder, and uses the attribute decoder to infer the attributes of the private training data of the target participant;
the attribute inference model for the private training data is defined as a multi-class classifier $f_C$, with one class label for each unique category of each attribute. Let $y^P := [y_1, y_2, \ldots, y_P]$ denote the $P$ attributes corresponding to the training data. The inspector trains the multi-class classifier $f_C$ using the recorded feature-embedding/attribute-value pairs $(\bar{h}_s, y_s^P)$, where $\hat{y}^P$ denotes the $P$-class output of the classifier. The objective of the optimization problem of the multi-class classifier $f_C$ is to minimize the empirical classification loss of the classifier $f_C$ on the collected shadow data:

$$\omega_C^* = \arg\min_{\omega_C} \sum_{x_s} L_C\left(f_C(\bar{h}_s; \omega_C),\, y_s^P\right)$$

where $L_C$ is the cross-entropy loss, $f_C(\bar{h}_s; \omega_C)$ is the output of the average feature embedding of the shadow data on the classifier, and $y_s^P$ is the attribute label corresponding to the shadow data $x_s$;
after the multi-class classifier $f_C$ has been trained, the inspector predicts the attributes of $x_v$ from the smoothed embedding $\bar{h}_v$ of the private training data of the target participant, i.e., $\hat{y}_v^P = f_C(\bar{h}_v)$. During training, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_s$ of the shadow data as input and outputs the attribute label $\hat{y}_s^P$ corresponding to the shadow data $x_s$; at inference time, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_v$ of the private training data of the target participant as input and outputs the attribute label $\hat{y}_v^P$ corresponding to the private training data $x_v$. By comparing the predicted attribute label $\hat{y}_v^P$ with the true attribute label $y_v^P$ of the private training data $x_v$, the protection capability of the longitudinal federal learning model over user data is detected.
According to the technical scheme provided by the embodiment of the invention, a longitudinal federal learning privacy leakage detection method based on feature embedding analysis is provided. Prior detection methods based on feature hijacking do not cover model leakage analysis and destroy the utility of the VFL. The invention simultaneously realizes vulnerability analysis of model leakage, original data leakage, and data feature leakage without damaging the utility of the VFL.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; other drawings may be derived from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of implementation of a longitudinal federal learning privacy disclosure detection method based on feature embedding analysis according to an embodiment of the present invention;
fig. 2 is a process flow diagram of a longitudinal federal learning privacy disclosure detection method based on feature embedding analysis according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
To facilitate an understanding of the embodiments of the invention, reference will now be made to several specific embodiments illustrated in the accompanying drawings; these illustrations in no way limit the embodiments of the invention.
The embodiment of the invention designs a longitudinal federal learning privacy leakage detection method based on feature embedding analysis. The method can perform bottom model leakage detection, original data leakage detection, and data attribute leakage detection. Before model training, the inspector recruits a small number of shadow users and registers their data into the VFL training process. Following the standard setup of existing privacy leakage detection, we assume that the original data attributes of the shadow users follow the same distribution as the attributes of the private training data. During training, the inspector records the complete shadow data, the data features, and the feature embeddings of all training data for subsequent privacy leakage analysis.
The first step in implementing privacy leakage detection is to apply a smoothing enhancement technique to the received feature embeddings, to suppress the embedding fluctuations caused by updates of the bottom model during training. Based on the collected original shadow data and the corresponding smoothed feature embeddings, a proxy model can be trained to approximate the transformation from original data to feature embeddings, thereby cloning the bottom model. Based on the cloned proxy model, the inspector can further match the feature embeddings of the private training data of the target participant on the bottom model against the feature embeddings of reconstructed data in the proxy model, optimizing the reconstructed data until it is arbitrarily close to the true target data. In addition, the inspector can use the discrete attribute values of the shadow data as class labels and the corresponding average feature embeddings as features in a classification task to train an attribute decoder, and then infer the attributes of the target data with that decoder.
The feature embeddings of the data and the corresponding gradients are the only information exchanged between the local participants and the server during co-training in a VFL system. The server may infer the training data of a local participant by actively manipulating the feature-embedding gradients sent to the target participant, in order to assess the vulnerability to data leakage in the VFL system. However, this approach falsifies the gradients returned to the target participant to force the participant-generated feature embeddings to converge to the feature space desired by the attacker, which compromises the utility of the VFL.
The implementation schematic diagram of the longitudinal federal learning privacy leakage detection method based on feature embedding analysis provided by the embodiment of the invention is shown in fig. 1, the specific processing flow is shown in fig. 2, and the method comprises the following processing steps:
step S1: embedding shadow data by a detector in the VFL training process, so that the shadow data participates in the training process of the VFL;
in the invention, the inspector can be an honest but curious server,
step S2: recording the shadow data, the characteristics of the shadow data and the characteristic embedding of all the data on the bottom model, and smoothing the characteristic embedding at the continuous T time.
Step S3: embedding the smoothed characteristics of the shadow data and the private training data of the target participant into a cloned proxy model;
step S4: reconstructing private training data based on the cloned proxy model and smoothed features of the private training data;
step S5: the feature of the shadow data and the corresponding smooth feature are utilized to embed a training attribute decoder and applied to the feature embedding of the private training data so as to infer the sensitive attribute of the private training data.
Specifically, the step S1 includes: the inspector recruits a small number of shadow users and registers these shadow users into the VFL training process. It is assumed that the original attributes of the shadow users' shadow data follow the same distribution as the attributes of the private training data of the target participants in the VFL. The attributes of the shadow users are provided to the local feature set owned by each local participant before the VFL training begins.
Specifically, the step S2 includes: recording the shadow data and the feature embeddings, and smoothing the feature embeddings.
During the training of the VFL, the inspector needs to record the embedded shadow data and the feature embeddings of all the data on the bottom model, and to smooth the feature embeddings over T consecutive rounds. The specific smoothing mechanism is as follows: let $f_B$ be the bottom model of the target participant, and assume that the inspector initiates detection at round $t$. Record the feature embeddings of the private training data $x_v$ of the target participant on the bottom model at $T$ consecutive rounds, $\{f_B^t(x_v), f_B^{t+1}(x_v), \ldots, f_B^{t+T-1}(x_v)\}$, and record the shadow data $x_s$ and its feature embeddings on the bottom model at the same $T$ consecutive rounds, $\{f_B^t(x_s), f_B^{t+1}(x_s), \ldots, f_B^{t+T-1}(x_s)\}$. Smooth each sequence to obtain the smoothed feature embeddings $\bar{h}_v$ and $\bar{h}_s$:

$$\bar{h}_v = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_v), \qquad \bar{h}_s = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_s)$$
The smoothed feature embeddings help to suppress the embedding fluctuations caused by bottom-model updates during training, which stabilizes the results of the subsequent privacy leakage detection.
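As an illustrative sketch (not part of the claimed method), this smoothing step can be realized by averaging the recorded per-round embedding tensors; the helper name and tensor shapes below are assumptions:

```python
import torch

def smooth_embeddings(history):
    """Average the feature embeddings recorded over T consecutive rounds.

    history: list of T tensors, each of shape (n_samples, emb_dim), where
    entry i holds the embeddings produced by the bottom model at round t+i.
    """
    return torch.stack(history, dim=0).mean(dim=0)

# e.g. embeddings of the shadow data recorded over T = 5 rounds
history_s = [torch.randn(32, 16) for _ in range(5)]
h_s_bar = smooth_embeddings(history_s)  # smoothed embedding, shape (32, 16)
```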
Specifically, the step S3 includes: analyzing the leakage vulnerability of the bottom model via a model stealing attack on the bottom model.
Based on the shadow data $x_s$ recorded in step S2 and the corresponding smoothed feature embeddings $\bar{h}_s$, the inspector can train a proxy model to approximate the transformation from training data to feature embeddings, thereby cloning the bottom model. The inspector uses the recorded set of $(x_s, \bar{h}_s)$ mapping pairs to learn a proxy model $\hat{f}_B$, and uses the proxy model $\hat{f}_B$ to approximate the mapping of the bottom model $f_B$ between the original feature space and the embedding space.
The learning process of the proxy model $\hat{f}_B$ is as follows: minimize the $\ell_2$ distance between the feature embedding $\hat{f}_B(x_s; \omega_B)$ generated by the proxy model and the smoothed feature embedding $\bar{h}_s$ generated by the true bottom model:

$$\omega_B^* = \arg\min_{\omega_B} \sum_{x_s} \left\|\hat{f}_B(x_s; \omega_B) - \bar{h}_s\right\|_2$$

where $\omega_B$ denotes the model parameters of the proxy model $\hat{f}_B$, and $\hat{f}_B(x_s; \omega_B)$ denotes the output of the shadow data $x_s$ on the proxy model.
By optimizing the above equation, given the original features of a data instance, the learned proxy model $\hat{f}_B$ can generate feature embeddings approximately identical to those of the true bottom model $f_B$.
Specifically, the step S4 includes: analyzing the vulnerability to data leakage via a data reconstruction attack.
Based on the proxy model learned in step S3, the inspector can further recover the private training data of the target participant through feature embedding matching. This embedding-matching process can be expressed as a regression problem: the goal is to find estimates of the true original attribute values whose feature embeddings best match those generated from the true attributes.
To recover image data, the invention introduces a generator $f_G$ to help solve the regression problem. The generator $f_G$ is a decoder model that takes random noise $x_n$ as input and outputs a reconstructed image $f_G(x_n)$. An image typically contains hundreds or thousands of pixels; directly estimating the value of each pixel makes it difficult to obtain stable reconstruction results, because solving such a high-dimensional regression task is prone to the curse of dimensionality. The reconstructed image is fed into the proxy model to obtain its feature embedding $\hat{f}_B(f_G(x_n))$. The inspector finds the optimal parameters $\omega_G$ of the generator $f_G$ by minimizing the $\ell_2$ distance between the feature embedding of the reconstructed image provided by the proxy model and the true feature embedding of the target image, so that the reconstructed image $f_G(x_n)$ produces an embedding as close as possible to that of the target image. The optimization formula is as follows:

$$\omega_G^* = \arg\min_{\omega_G} L_R\left(\hat{f}_B(f_G(x_n)),\, \bar{h}_v\right)$$

where $L_R(\cdot)$ is the embedding-matching loss function, i.e., a loss based on the mean square error (MSE).
In this optimization problem, we treat the estimate $\omega_G$ as the variable. The reason is that, if the proxy model accurately approximates the feature-embedding transformation of the bottom model, then performing feature embedding matching drives the estimate $f_G(x_n)$ close to the true $x_v$. For recovering image data, we further add a total variation (TV) regularization term $R_{tv}(f_G(x_n))$ to the objective, to improve the smoothness of the reconstructed image. For reconstructing the numerical features of tabular data, we drop $R_{tv}(\cdot)$, because numerical attributes do not necessarily obey smoothness constraints as images do. Furthermore, since the number of numerical features is typically much smaller than the number of pixels in an image, we can estimate the attributes $\hat{x}_v$ directly, without introducing the generator module $f_G$. For reconstructing the numerical attributes of tabular data, the optimization therefore reduces to:

$$\hat{x}_v^* = \arg\min_{\hat{x}_v} L_R\left(\hat{f}_B(\hat{x}_v),\, \bar{h}_v\right)$$

where $L_R(\cdot)$ is the embedding-matching loss function based on the mean square error;
after obtaining the optimal parameters of the generator $f_G$, the reconstructed image $f_G(x_n)$ obtained by the generator is compared with the private training data $x_v$ of the target participant to obtain the similarity between the two, thereby detecting the degree to which the longitudinal federal learning protects the original data.
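An illustrative sketch of the generator-based embedding matching with TV regularization, assuming PyTorch; the generator and proxy architectures, image size, and hyperparameters are stand-in assumptions:

```python
import torch
import torch.nn as nn

def tv_regularizer(img):
    """Total variation of a batch of images (smoothness prior R_tv)."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw

def reconstruct(proxy, generator, noise, h_v_bar, steps=300, lam=1e-2, lr=1e-2):
    """Optimize the generator so that the proxy embedding of the
    reconstructed image matches the smoothed target embedding (MSE + TV)."""
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    mse = nn.MSELoss()  # embedding-matching loss L_R
    for _ in range(steps):
        opt.zero_grad()
        img = generator(noise)                              # f_G(x_n)
        loss = mse(proxy(img), h_v_bar) + lam * tv_regularizer(img)
        loss.backward()
        opt.step()
    return generator(noise).detach()                        # reconstructed data

# Stand-in modules: a toy generator producing 1x28x28 images and a toy proxy.
generator = nn.Sequential(nn.Linear(8, 28 * 28), nn.Tanh(),
                          nn.Unflatten(1, (1, 28, 28)))
proxy = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 16))
noise = torch.randn(4, 8)            # random noise input x_n
h_v_bar = torch.randn(4, 16)         # smoothed target embeddings
recon = reconstruct(proxy, generator, noise, h_v_bar, steps=50)
```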
Specifically, the step S5 includes: analyzing the vulnerability to data attribute leakage via an attribute inference attack.
Given the target attributes of the training shadow data, we use the discrete attribute values of the shadow data as class labels in a classification task. We define the attribute inference model as a multi-class classifier, with one class label for each unique category of each attribute. Let $y^P := [y_1, y_2, \ldots, y_P]$ denote the $P$ attributes corresponding to the training data. The inspector trains a multi-class classifier $f_C$ using the recorded feature-embedding/attribute-value pairs $(\bar{h}_s, y_s^P)$, where $\hat{y}^P$ denotes the $P$-class output of the classifier. The objective of this optimization problem is to minimize the empirical classification loss of the classifier $f_C$ on the collected shadow data:

$$\omega_C^* = \arg\min_{\omega_C} \sum_{x_s} L_C\left(f_C(\bar{h}_s; \omega_C),\, y_s^P\right)$$

where $L_C$ is the cross-entropy loss, $f_C(\bar{h}_s; \omega_C)$ is the output of the average feature embedding of the shadow data on the classifier, and $y_s^P$ is the attribute label corresponding to the shadow data $x_s$;
after the multi-class classifier $f_C$ has been trained, the inspector predicts the attributes of $x_v$ from the smoothed embedding $\bar{h}_v$ of the private training data of the target participant, i.e., $\hat{y}_v^P = f_C(\bar{h}_v)$. During training, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_s$ of the shadow data as input and outputs the attribute label $\hat{y}_s^P$ corresponding to the shadow data $x_s$; at inference time, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_v$ of the private training data of the target participant as input and outputs the attribute label $\hat{y}_v^P$ corresponding to the private training data $x_v$. By comparing the predicted attribute label $\hat{y}_v^P$ with the true attribute label $y_v^P$ of the private training data $x_v$, the protection capability of the longitudinal federal learning model over user data is detected.
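An illustrative sketch of the attribute decoder, assuming PyTorch and random stand-ins for the recorded embeddings and labels (architecture and hyperparameters are assumptions):

```python
import torch
import torch.nn as nn

def train_attribute_decoder(emb_dim, n_classes, h_s_bar, y_s, epochs=100, lr=1e-3):
    """Train the multi-class classifier f_C on pairs of (smoothed shadow
    embedding, discrete attribute label) using cross-entropy loss L_C."""
    f_c = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                        nn.Linear(64, n_classes))
    opt = torch.optim.Adam(f_c.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = ce(f_c(h_s_bar), y_s)
        loss.backward()
        opt.step()
    return f_c

h_s_bar = torch.randn(32, 16)             # smoothed shadow embeddings
y_s = torch.randint(0, 4, (32,))          # discrete attribute labels (4 classes)
f_c = train_attribute_decoder(16, 4, h_s_bar, y_s)

h_v_bar = torch.randn(8, 16)              # smoothed target embeddings
y_v_pred = f_c(h_v_bar).argmax(dim=1)     # inferred attribute labels
```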
We evaluated the effectiveness of the feature-embedding-analysis-based longitudinal federal learning privacy leakage detection method provided by the embodiment of the invention on three models of different complexity (FCNN, LeNet, and ResNet) and five datasets (Bank Marketing, Credit, Census, UTKFace, CelebA). Experimental results show that, compared with existing methods, the method realizes privacy leakage analysis of the model, the original data, and the data features while leaving the utility of the VFL undamaged.
We thus show that the intermediate results uploaded by participants still carry important information about the training data, and that with a properly designed method, high-fidelity original images can be reconstructed stably and effectively even without auxiliary data or complex recovery procedures. We hope this work motivates people to rethink the role of VFL in model and data privacy protection, and further strengthens the design and development of existing privacy protection frameworks.
In summary, the feature-embedding-analysis-based longitudinal federal learning privacy leakage detection method provided by the embodiment of the invention does not interfere with the training process of the VFL, thereby preserving the utility of the VFL. Second, the method simultaneously realizes privacy leakage analysis of the model, the original data, and the data features.
The feature-embedding-analysis-based longitudinal federal learning privacy leakage detection method provided by the invention effectively realizes all-round detection and analysis of privacy leakage vulnerabilities in the VFL, and the proposed detection method has no negative influence on the utility of the VFL; the proposed smoothing strategy can effectively resist noise interference during training.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment mainly describes its differences from the others. In particular, the apparatus and system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (6)

1. A longitudinal federal learning privacy leakage detection method based on feature embedding analysis, characterized by comprising the following steps:
an inspector embeds shadow data into the training process of longitudinal federal learning;
the feature embeddings of the shadow data and of the private training data of a target participant of the longitudinal federal learning on the bottom model are acquired, and the feature embeddings are smoothed;
a proxy model of the bottom model is cloned using the shadow data and its smoothed feature embeddings;
the private training data of the target participant is reconstructed through feature embedding matching with the proxy model, and original data leakage of the longitudinal federal learning is detected.
2. The method of claim 1, wherein the inspector embedding shadow data into the training process of longitudinal federal learning comprises:
a server acts as the inspector; the inspector selects shadow users and registers them into the training process of the longitudinal federal learning, where the original attributes of the shadow users' shadow data follow the same distribution as the attributes of the private training data of the target participant of the longitudinal federal learning, so that the shadow data participates in the training process of the longitudinal federal learning.
3. The method according to claim 1 or 2, wherein the acquiring of the feature embeddings of the shadow data and of the private training data of a target participant of the longitudinal federal learning on the bottom model, and the smoothing of the feature embeddings, comprises:
in the training process of the longitudinal federal learning, the inspector records the embedded shadow data together with the feature embeddings of the shadow data and of the private training data of the target participant on the bottom model, and smooths the feature embeddings over T consecutive rounds;
the smoothing mechanism for the feature embeddings is as follows: let $f_B$ be the bottom model of the target participant, and assume that the inspector initiates detection at round $t$; record the feature embeddings of the private training data $x_v$ of the target participant on the bottom model at $T$ consecutive rounds, $\{f_B^t(x_v), f_B^{t+1}(x_v), \ldots, f_B^{t+T-1}(x_v)\}$, and record the shadow data $x_s$ and its feature embeddings on the bottom model at the same $T$ consecutive rounds, $\{f_B^t(x_s), f_B^{t+1}(x_s), \ldots, f_B^{t+T-1}(x_s)\}$; smooth each sequence to obtain the smoothed feature embeddings $\bar{h}_v$ and $\bar{h}_s$:

$$\bar{h}_v = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_v), \qquad \bar{h}_s = \frac{1}{T}\sum_{i=t}^{t+T-1} f_B^i(x_s)$$
4. The method of claim 3, wherein the cloning of the proxy model of the bottom model using the shadow data and its smoothed feature embeddings comprises:
the inspector uses the recorded set of $(x_s, \bar{h}_s)$ mapping pairs to learn a proxy model $\hat{f}_B$, and uses the proxy model $\hat{f}_B$ to approximate the mapping of the bottom model $f_B$ between the original feature space and the embedding space;
the learning process of the proxy model $\hat{f}_B$ is as follows: minimize the $\ell_2$ distance between the feature embedding $\hat{f}_B(x_s; \omega_B)$ generated by the proxy model and the smoothed feature embedding $\bar{h}_s$ generated by the true bottom model:

$$\omega_B^* = \arg\min_{\omega_B} \sum_{x_s} \left\|\hat{f}_B(x_s; \omega_B) - \bar{h}_s\right\|_2$$

where $\omega_B$ denotes the model parameters of the proxy model $\hat{f}_B$, and $\hat{f}_B(x_s; \omega_B)$ denotes the output of the shadow data $x_s$ on the proxy model.
5. The method of claim 4, wherein the reconstructing of the private training data of the target participant through feature embedding matching with the proxy model, and the detecting of original data leakage of the longitudinal federal learning, comprises:
the private training data of the target participant is reconstructed through feature embedding matching with the proxy model, and a generator $f_G$ is used to solve the regression problem of the feature embedding matching; the generator $f_G$ is a decoder model that takes random noise $x_n$ as input and outputs a reconstructed image $f_G(x_n)$; the reconstructed image $f_G(x_n)$ is fed into the proxy model to obtain its feature embedding $\hat{f}_B(f_G(x_n))$; the inspector finds the optimal parameters of the generator $f_G$ by minimizing the $\ell_2$ distance between the feature embedding $\hat{f}_B(f_G(x_n))$ of the reconstructed image provided by the proxy model and the true feature embedding $\bar{h}_v$ of the target image corresponding to the private training data, so that the reconstructed image $f_G(x_n)$ produces an embedding as close as possible to that of the target image; the optimization is as follows:

$$\omega_G^* = \arg\min_{\omega_G} L_R\left(\hat{f}_B(f_G(x_n)),\, \bar{h}_v\right)$$

where $L_R(\cdot)$ is the embedding-matching loss function based on the mean square error;
after obtaining the optimal parameters of the generator $f_G$, the reconstructed image $f_G(x_n)$ obtained by the generator is compared with the private training data $x_v$ of the target participant to obtain the similarity between the two, thereby detecting the degree to which the longitudinal federal learning protects the original data.
6. A method according to claim 3, wherein the method further comprises:
the inspector takes the discrete attribute values of the shadow data as class labels in a classification task, takes the average (smoothed) feature embeddings of the shadow data as input, trains an attribute decoder, and uses the attribute decoder to infer the attributes of the private training data of the target participant;
the attribute inference model for the private training data is defined as a multi-class classifier $f_C$, with one class label for each unique category of each attribute; let $y^P := [y_1, y_2, \ldots, y_P]$ denote the $P$ attributes corresponding to the training data; the inspector trains the multi-class classifier $f_C$ using the recorded feature-embedding/attribute-value pairs $(\bar{h}_s, y_s^P)$, where $\hat{y}^P$ denotes the $P$-class output of the classifier; the objective of the optimization problem of the multi-class classifier $f_C$ is to minimize the empirical classification loss of the classifier $f_C$ on the collected shadow data:

$$\omega_C^* = \arg\min_{\omega_C} \sum_{x_s} L_C\left(f_C(\bar{h}_s; \omega_C),\, y_s^P\right)$$

where $L_C$ is the cross-entropy loss, $f_C(\bar{h}_s; \omega_C)$ is the output of the average feature embedding of the shadow data on the classifier, and $y_s^P$ is the attribute label corresponding to the shadow data $x_s$;
after the multi-class classifier $f_C$ has been trained, the inspector predicts the attributes of $x_v$ from the smoothed embedding $\bar{h}_v$ of the private training data of the target participant, i.e., $\hat{y}_v^P = f_C(\bar{h}_v)$; during training, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_s$ of the shadow data as input and outputs the attribute label $\hat{y}_s^P$ corresponding to the shadow data $x_s$; at inference time, the multi-class classifier $f_C$ takes the smoothed feature embedding $\bar{h}_v$ of the private training data of the target participant as input and outputs the attribute label $\hat{y}_v^P$ corresponding to the private training data $x_v$; by comparing the predicted attribute label $\hat{y}_v^P$ with the true attribute label $y_v^P$ of the private training data $x_v$, the protection capability of the longitudinal federal learning model over user data is detected.
CN202310304542.7A 2023-03-27 2023-03-27 Longitudinal federal learning privacy leakage detection method based on feature embedding analysis Active CN116341004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310304542.7A CN116341004B (en) 2023-03-27 2023-03-27 Longitudinal federal learning privacy leakage detection method based on feature embedding analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310304542.7A CN116341004B (en) 2023-03-27 2023-03-27 Longitudinal federal learning privacy leakage detection method based on feature embedding analysis

Publications (2)

Publication Number Publication Date
CN116341004A true CN116341004A (en) 2023-06-27
CN116341004B CN116341004B (en) 2023-09-08

Family

ID=86881870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310304542.7A Active CN116341004B (en) 2023-03-27 2023-03-27 Longitudinal federal learning privacy leakage detection method based on feature embedding analysis

Country Status (1)

Country Link
CN (1) CN116341004B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592042A (en) * 2024-01-17 2024-02-23 杭州海康威视数字技术股份有限公司 Privacy disclosure detection method and device for federal recommendation system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094758A (en) * 2021-06-08 2021-07-09 华中科技大学 Gradient disturbance-based federated learning data privacy protection method and system
US20220222539A1 (en) * 2021-01-12 2022-07-14 Sap Se Adversarial learning of privacy preserving representations
CN114936372A (en) * 2022-04-06 2022-08-23 湘潭大学 Model protection method based on three-party homomorphic encryption longitudinal federal learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220222539A1 (en) * 2021-01-12 2022-07-14 Sap Se Adversarial learning of privacy preserving representations
CN113094758A (en) * 2021-06-08 2021-07-09 华中科技大学 Gradient disturbance-based federated learning data privacy protection method and system
CN114936372A (en) * 2022-04-06 2022-08-23 湘潭大学 Model protection method based on three-party homomorphic encryption longitudinal federal learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592042A (en) * 2024-01-17 2024-02-23 杭州海康威视数字技术股份有限公司 Privacy disclosure detection method and device for federal recommendation system
CN117592042B (en) * 2024-01-17 2024-04-05 杭州海康威视数字技术股份有限公司 Privacy disclosure detection method and device for federal recommendation system

Also Published As

Publication number Publication date
CN116341004B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
Qian et al. Thinking in frequency: Face forgery detection by mining frequency-aware clues
Guo et al. Fake face detection via adaptive manipulation traces extraction network
Li et al. Auditing privacy defenses in federated learning via generative gradient leakage
CN113688855B (en) Data processing method, federal learning training method, related device and equipment
Yuan et al. Robust visual tracking with correlation filters and metric learning
Salehi et al. Arae: Adversarially robust training of autoencoders improves novelty detection
Gan et al. Multigraph fusion for dynamic graph convolutional network
US20230021661A1 (en) Forgery detection of face image
Li et al. Privacy-preserving lightweight face recognition
CN116341004B (en) Longitudinal federal learning privacy leakage detection method based on feature embedding analysis
Chen et al. Self-supervised vision transformer-based few-shot learning for facial expression recognition
CN115563650A (en) Privacy protection system for realizing medical data based on federal learning
Roy et al. 3D CNN architectures and attention mechanisms for deepfake detection
Huang et al. Robust zero-watermarking scheme based on a depthwise overparameterized VGG network in healthcare information security
CN111726472A (en) Image anti-interference method based on encryption algorithm
Yin et al. Dynamic difference learning with spatio-temporal correlation for deepfake video detection
Wang et al. Cross-view representation learning for multi-view logo classification with information bottleneck
Xu et al. Visual-semantic transformer for face forgery detection
Ye et al. Privacy-preserving age estimation for content rating
Li et al. High-capacity coverless image steganographic scheme based on image synthesis
Zhang et al. Effective presentation attack detection driven by face related task
Zhou et al. Neural encoding and decoding with a flow-based invertible generative model
Xu et al. FLPM: A property modification scheme for data protection in federated learning
Ilyas et al. E-Cap Net: an efficient-capsule network for shallow and deepfakes forgery detection
Zhu et al. A face occlusion removal and privacy protection method for IoT devices based on generative adversarial networks

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant