WO2024045581A1 - A privacy-preserving data sharing method and system based on distributed GAN - Google Patents

A privacy-preserving data sharing method and system based on distributed GAN

Info

Publication number
WO2024045581A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
training
data owner
privacy protection
central server
Prior art date
Application number
PCT/CN2023/083568
Other languages
English (en)
French (fr)
Inventor
王超
王硕
吴爱燕
薛晓卿
何云华
肖珂
Original Assignee
北方工业大学 (North China University of Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北方工业大学 (North China University of Technology)
Publication of WO2024045581A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L 63/105 Multiple levels of security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L 63/0407 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the identity of one or more communicating identities is hidden
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Definitions

  • the present invention relates to the technical fields of data sharing and privacy protection, and in particular to a privacy-preserving data sharing method and system based on distributed GAN.
  • federated learning breaks the limitation of artificial intelligence technology that requires centralized data collection and processing. Federated learning can therefore be used in a wide range of IoT (Internet of Things) services, providing a new solution for privacy-preserving data sharing. For example, in IoV (Internet of Vehicles), data sharing between vehicles can improve service quality.
  • IoV: Internet of Vehicles
  • the author proposes a new architecture based on federated learning [Lu Y, Huang X, Zhang K, et al. Blockchain empowered asynchronous federated learning for secure data sharing in internet of vehicles [J].IEEE Transactions on Vehicular Technology,2020,69(4):4298-4311.].
  • GAN: Generative Adversarial Network
  • CPSS: Cyber-Physical-Social Systems
  • interaction between cyberspace and the physical world is achieved through the sharing of spatio-temporal data.
  • the author uses a modified GAN model and runs two games simultaneously (between the generator, the discriminator, and a differentially private identifier) [Qu Y, Yu S, Zhou W, et al. Gan-driven personalized spatial-temporal private data sharing in cyber-physical social systems [J]. IEEE Transactions on Network Science and Engineering, 2020,7(4):2576-2586.].
  • the present invention is directed at the problem of how to protect the privacy of data owners and encourage them to share data.
  • the present invention provides the following technical solutions:
  • the present invention provides a privacy-preserving data sharing method based on distributed GAN.
  • the method is implemented by a privacy-preserving data sharing system based on distributed GAN.
  • the system includes a central server and multiple data owners;
  • the method includes:
  • the central server provides multiple personalized contracts.
  • Each data owner among the multiple data owners selects a personalized contract from multiple personalized contracts.
  • Each data owner uses the data owner's local private data set to pre-train the data owner's local generative adversarial network GAN model to obtain the pre-trained local GAN model.
  • the central server designs a privacy protection level selection strategy.
  • multiple personalized contracts in S1 include multiple privacy protection levels and rewards corresponding to the multiple privacy protection levels.
  • each data owner in S3 uses the data owner's local private data set to pre-train the data owner's local generative adversarial network GAN model.
  • the pre-trained local GAN model includes:
  • Each data owner obtains the original GAN model from the central server.
  • Each data owner uses the data owner's local private data set to pre-train the original GAN model to obtain the pre-trained local GAN model.
  • the local GAN model includes a local generator and a local discriminator.
  • Each data owner hides the pretrained local generator.
  • multiple data owners in S5 optimize the central generator model of the central server based on the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN model, including:
  • the central server determines the privacy protection level ε of the data owner who assists in this round of training based on the privacy protection level selection strategy.
  • the central server obtains multiple data owners with a privacy protection level of ε among the multiple data owners, based on the privacy protection level ε and the personalized contract selected by each data owner.
  • the central server randomly selects a data owner from the multiple data owners with a privacy protection level of ε as the data owner to assist in training.
  • S54: the data owner who assists in training optimizes the central generator model of the central server based on that data owner's pre-trained local GAN model. After optimization, S51 is executed for iterative training until the number of iterations reaches a preset threshold; iteration then stops and training of the central generator model is completed.
  • the central server in S51 determines the privacy protection level ε of the data owner assisting this round of training based on the privacy protection level selection strategy, including:
  • the central server determines the attenuation function of the noise scale according to the number of iterations of the central generator model training process.
  • the central server determines the noise scale based on the attenuation function.
  • the central server determines the privacy protection level ε of the data owner assisting this round of training based on the noise scale.
  • the data owner who assists in training in S54 optimizes the central generator model of the central server based on the pre-trained local GAN model of the data owner who assists in training, including:
  • the data owner who assists in training obtains the data generated by the central generator model from the central server.
  • the data owner who assists in training updates the local discriminator in the pre-trained local GAN model based on the data generated by the central generator model and the private data set of the data owner who assists in training.
  • the data owner assisting in training calculates the gradient based on the updated local discriminator.
  • the data owner who assists in training perturbs the gradient based on the personalized differential privacy theory and obtains the perturbed gradient.
  • the central server optimizes the central generator model of the central server according to the perturbed gradient.
  • perturbing the gradient based on personalized differential privacy theory in S544 includes:
  • the gradient is perturbed based on the Gaussian mechanism and a degree of perturbation; the degree of perturbation is determined by the privacy protection level of the personalized contract.
  • the present invention provides a privacy-preserving data sharing system based on distributed GAN, which is used to implement a privacy-preserving data sharing method based on distributed GAN.
  • the system includes a central server and multiple data owners, where:
  • the central server is used to provide multiple personalized contracts and design a privacy protection level selection strategy.
  • Multiple data owners are used to select a personalized contract from the multiple personalized contracts; use each data owner's local private data set to pre-train that data owner's local generative adversarial network GAN model to obtain the pre-trained local GAN model; and, based on the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN model, optimize the central generator model of the central server to complete privacy-preserving data sharing.
  • the multiple personalized contracts include multiple privacy protection levels and rewards corresponding to the multiple privacy protection levels.
  • the multiple data owners are further used to:
  • Each data owner obtains the original GAN model from the central server.
  • Each data owner uses the data owner's local private data set to pre-train the original GAN model to obtain the pre-trained local GAN model.
  • the local GAN model includes a local generator and a local discriminator.
  • Each data owner hides the pretrained local generator.
  • the multiple data owners are further used to:
  • the central server determines the privacy protection level ε of the data owner who assists in this round of training based on the privacy protection level selection strategy.
  • the central server obtains multiple data owners with a privacy protection level of ε among the multiple data owners, based on the privacy protection level ε and the personalized contract selected by each data owner.
  • the central server randomly selects a data owner from the multiple data owners with a privacy protection level of ε as the data owner to assist in training.
  • S54: the data owner who assists in training optimizes the central generator model of the central server based on that data owner's pre-trained local GAN model. After optimization, S51 is executed for iterative training until the number of iterations reaches a preset threshold; iteration then stops and training of the central generator model is completed.
  • the central server is further used to:
  • the central server determines the attenuation function of the noise scale according to the number of iterations of the central generator model training process.
  • the central server determines the noise scale based on the attenuation function.
  • the central server determines the privacy protection level ε of the data owner assisting this round of training based on the noise scale.
  • multiple data owners further used to:
  • the data owner who assists in training obtains the data generated by the central generator model from the central server.
  • the data owner who assists in training updates the local discriminator in the pre-trained local GAN model based on the data generated by the central generator model and the private data set of the data owner who assists in training.
  • the data owner assisting in training calculates the gradient based on the updated local discriminator.
  • the data owner who assists in training perturbs the gradient based on the personalized differential privacy theory and obtains the perturbed gradient.
  • the central server optimizes the central generator model of the central server according to the perturbed gradient.
  • multiple data owners further used to:
  • the gradient is perturbed based on the Gaussian mechanism and a degree of perturbation; the degree of perturbation is determined by the privacy protection level of the personalized contract.
  • a privacy-preserving data sharing scheme based on asynchronous distributed GAN is proposed to address the privacy issues in IoT data sharing.
  • a central generative model is trained in a personalized privacy-preserving manner using data sets local to each data owner.
  • the proposed distributed GAN training framework can use the local data set of the data owner to collaboratively train the central generation model to achieve data sharing without transmitting the original data, and then use the central generation model to reconstruct the data set for downstream tasks.
  • a gradient "desensitization" strategy is proposed to maximize the availability of gradients while protecting user privacy, and to achieve model optimization under the guarantee of differential privacy.
  • designing multi-level privacy protection contracts for data owners with different privacy preferences and proposing a differential privacy level selection strategy can balance data availability and user privacy protection needs, and complete model training with minimal privacy consumption.
  • Figure 1 is a schematic flow chart of a privacy-preserving data sharing method based on distributed GAN provided by an embodiment of the present invention
  • Figure 2 is a block diagram of a privacy-preserving data sharing system based on distributed GAN provided by an embodiment of the present invention.
  • an embodiment of the present invention provides a privacy-preserving data sharing method based on distributed GAN, which can be implemented by a privacy-preserving data sharing system based on distributed GAN.
  • as shown in Figure 1, the processing flow of the privacy-preserving data sharing method based on distributed GAN may include the following steps:
  • the central server provides multiple personalized contracts.
  • multiple personalized contracts in S1 include multiple privacy protection levels and rewards corresponding to the multiple privacy protection levels.
  • at the beginning of data sharing, the central server designs a series of personalized contracts (ε1, r1), (ε2, r2), ..., (εK, rK) with different privacy protection levels and rewards, to meet the privacy protection needs of data owners with different privacy preferences. The higher the privacy protection level, the smaller the reward. Data owners can choose the corresponding contract to maximize their profits.
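As an illustration only, the contract menu and a data owner's profit-maximizing choice can be sketched as follows; the contract values, the linear privacy-cost model, and all names are hypothetical and not taken from the patent.

```python
# Hypothetical contract menu (epsilon_k, r_k): a larger epsilon means weaker
# privacy protection and therefore a larger reward. Values are illustrative.
CONTRACTS = [(0.5, 10.0), (1.0, 20.0), (2.0, 35.0), (4.0, 55.0)]

def choose_contract(privacy_cost_per_epsilon, contracts=CONTRACTS):
    """Return the contract maximizing reward minus the owner's privacy cost,
    assuming (hypothetically) a cost linear in the privacy budget epsilon."""
    return max(contracts,
               key=lambda c: c[1] - privacy_cost_per_epsilon * c[0])
```

A privacy-sensitive owner (high cost per unit of ε) ends up with a low-ε, low-reward contract, while a less sensitive owner picks a high-ε, high-reward one, which is the self-selection behavior the contract design relies on.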
  • the server then publishes the data requirements and contracts to the data owners registered in the system.
  • the central server has powerful computing power and communication bandwidth.
  • the purpose is to recruit enough data owners to collaboratively train a central generator until it has strong data generation capabilities.
  • Embodiments of the present invention assume that the central server will not violate the defined protocol, but may try to infer the user's privacy.
  • Each data owner among the multiple data owners selects a personalized contract from multiple personalized contracts.
  • the data owner set is composed of multiple data owners; each data owner u has a private dataset containing N_u data samples. These data owners have certain computing and communication capabilities and want to use their private data sets to participate in training tasks in exchange for some compensation, but they want to protect their privacy from inference attacks by the central server. In addition, different users have different privacy preferences (i.e., sensitivity to privacy exposure), thus requiring personalized privacy protection.
  • Each data owner uses the data owner's local private data set to pre-train the data owner's local generative adversarial network GAN model to obtain the pre-trained local GAN model.
  • each data owner in S3 uses the data owner's local private data set to pre-train the data owner's local generative adversarial network GAN model.
  • the pre-trained local GAN model includes:
  • Each data owner obtains the original GAN model from the central server.
  • Each data owner uses the data owner's local private data set to pre-train the original GAN model to obtain the pre-trained local GAN model.
  • the local GAN model includes a local generator and a local discriminator.
  • Each data owner hides the pretrained local generator.
  • embodiments of the present invention propose a privacy-preserving asynchronous distributed GAN training framework, which uses the local data sets of the data owners to collaboratively train the central generator model.
  • the pre-training process includes: first pre-processing the private data set according to the data requirements, and then training the local GAN model.
  • the pre-training process is detailed in Algorithm 1 below:
  • after pre-training is completed, each data owner has a trained generator and discriminator locally.
  • the generator, which has learned the local data distribution, is hidden, while the discriminator is stored locally to assist in training the central generator.
  • the purpose of assisted training is to use data owner u's local discriminator and private dataset to train the central generator.
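A minimal runnable sketch of this local pre-training step, using a toy one-dimensional GAN (a shift generator and a logistic discriminator) purely for illustration; the model classes, the update rules, and the returned dictionary layout are assumptions, not the patent's actual Algorithm 1.

```python
import math
import random

class ToyGenerator:
    """Toy 1-D generator: outputs theta + z for standard Gaussian noise z."""
    def __init__(self):
        self.theta = 0.0
    def sample(self, n):
        return [self.theta + random.gauss(0.0, 1.0) for _ in range(n)]

class ToyDiscriminator:
    """Toy logistic discriminator D(x) = sigmoid(w*x + b)."""
    def __init__(self):
        self.w, self.b = 0.0, 0.0
    def prob_real(self, x):
        return 1.0 / (1.0 + math.exp(-(self.w * x + self.b)))

def pretrain_local_gan(private_data, steps=200, lr=0.05, batch=8):
    """Alternating gradient steps on the toy GAN; afterwards the generator is
    kept hidden and only the discriminator is exposed for assisted training."""
    G, D = ToyGenerator(), ToyDiscriminator()
    for _ in range(steps):
        real = random.sample(private_data, min(batch, len(private_data)))
        fake = G.sample(len(real))
        # discriminator ascent on log D(real) + log(1 - D(fake))
        for x in real:
            p = D.prob_real(x)
            D.w += lr * (1.0 - p) * x
            D.b += lr * (1.0 - p)
        for x in fake:
            p = D.prob_real(x)
            D.w -= lr * p * x
            D.b -= lr * p
        # generator ascent on log D(fake): d/dtheta = (1 - D(x)) * w
        for x in G.sample(len(real)):
            G.theta += lr * (1.0 - D.prob_real(x)) * D.w
    # the trained generator stays hidden; only the discriminator is shared
    return {"discriminator": D, "_hidden_generator": G}
```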
  • the central server designs a privacy protection level selection strategy.
  • the embodiment of the present invention designs a privacy protection level selection strategy to select the corresponding data owner to assist in training in each round.
  • multiple data owners in S5 optimize the central generator model of the central server based on the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN model, including:
  • the central server determines the privacy protection level ε of the data owner who assists in this round of training based on the privacy protection level selection strategy.
  • the central server in S51 determines the privacy protection level ε of the data owner who assists this round of training based on the privacy protection level selection strategy, including:
  • the central server determines the attenuation function of the noise scale according to the number of iterations of the central generator model training process.
  • the central server determines the noise scale based on the attenuation function.
  • the central server determines the privacy protection level ε of the data owner assisting this round of training based on the noise scale.
  • the central server obtains multiple data owners with a privacy protection level of ε among the multiple data owners, based on the privacy protection level ε and the personalized contract selected by each data owner.
  • the central server randomly selects a data owner from the multiple data owners with a privacy protection level of ε as the data owner to assist in training.
  • the central server designs a privacy protection level selection strategy to determine the privacy protection level ε of the data owner who assists in this round of training. A data owner who has signed a contract with privacy protection level ε is then randomly selected, and his local discriminator is used for this round of training.
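The level-then-owner selection can be sketched as below; the rule that maps the round's noise scale to the closest contracted level is an assumption for illustration, as are the variable names and the example grouping.

```python
import random

def select_assisting_owner(sigma, owners_by_level):
    """Pick the contracted privacy level closest to this round's noise scale
    sigma (hypothetical matching rule), then choose one of the owners who
    signed that level uniformly at random."""
    level = min(owners_by_level, key=lambda lvl: abs(lvl - sigma))
    return level, random.choice(owners_by_level[level])

# owners grouped by the privacy level of the contract they signed
owners_by_level = {0.5: ["u1", "u2"], 1.0: ["u3"], 2.0: ["u4", "u5"]}
```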
  • S54: the data owner who assists in training optimizes the central generator model of the central server based on that data owner's pre-trained local GAN model. After optimization, S51 is executed for iterative training until the number of iterations reaches a preset threshold; iteration then stops and training of the central generator model is completed.
  • the data owner who assists in training in S54 optimizes the central generator model of the central server based on the pre-trained local GAN model of the data owner who assists in training, including:
  • the data owner who assists in training obtains the data generated by the central generator model from the central server.
  • the selected data owner u receives the data generated by the central generator.
  • the data owner who assists in training updates the local discriminator in the pre-trained local GAN model based on the data generated by the central generator model and the private data set of the data owner who assists in training.
  • the data owner assisting in training calculates the gradient based on the updated local discriminator.
  • the data owner who assists in training perturbs the gradient based on the personalized differential privacy theory and obtains the perturbed gradient.
  • the data owner u perturbs the calculated gradient based on the differential privacy theory, where the degree of perturbation is determined by the privacy protection level specified in the signed contract. Then, it sends the perturbed gradients to the central server for generator optimization.
  • perturbing the gradient based on personalized differential privacy theory in S544 includes:
  • the gradient is perturbed based on the Gaussian mechanism and a degree of perturbation.
  • the degree of perturbation is determined by the privacy protection level of the personalized contract.
  • the central server optimizes the central generator model of the central server according to the perturbed gradient.
  • the central server updates the central generator model according to the perturbation gradient of the selected data owner. Then the central server reselects the privacy protection level and data owner for the next round of auxiliary training until the central generator training is completed.
  • the embodiment of the present invention proposes a personalized privacy protection strategy, which achieves differential privacy protection by perturbing the gradient calculated locally by the data owner, and the privacy protection level is specified by a contract signed by each data owner.
  • embodiments of the present invention propose a privacy protection level selection strategy to select different privacy protection levels in different training stages and complete training with minimal privacy loss.
  • the server selects a data owner based on the policy and uses its local discriminator to optimize the central generator.
  • the optimization process of the center generator is described in Algorithm 2 below:
  • Each data owner's assisted training process (line 7) is shown in Algorithm 3 below.
  • data owner u uses its local discriminator and private dataset to optimize the central generator.
  • the selected data owner will first receive the generated data from the central server and update the discriminator with the generated data and the local dataset. Then, the local discriminator is used to calculate the gradient, and perturb the gradient in a personalized differential privacy way before the gradient is returned. The degree of perturbation is determined by the privacy protection level in the signed contract.
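The round just described can be sketched as a small driver function; the four callables stand in for the owner's local implementations and are placeholders, not the patent's Algorithm 3.

```python
def assist_training_round(generated_batch, private_batch, discriminator,
                          update_discriminator, grad_fn, perturb):
    """One assisted-training round: (1) refresh the local discriminator with
    generated vs. private data, (2) compute the gradient for the central
    generator through it, (3) perturb the gradient per the signed contract."""
    discriminator = update_discriminator(discriminator, private_batch,
                                         generated_batch)
    gradient = grad_fn(discriminator, generated_batch)
    return perturb(gradient)
```

Only the perturbed gradient leaves the device; the discriminator, the private batch, and the raw gradient never do.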
  • the personalized privacy protection method can be further explained as follows: generally speaking, privacy issues in machine learning arise because model training requires a large amount of user data, and the model absorbs many data features over multiple rounds of training iterations; attackers can then use model parameters, gradients, etc. to infer information about the input data. Similarly, training a GAN model also requires a large amount of user data: the generator is trained to produce simulated data that mimics the distribution of the real data, while the discriminator must be fed a large amount of real data during training in order to distinguish real data from simulated data. Therefore, to protect the privacy of each data owner, the local generator needs to be hidden, and the gradients calculated by the local discriminator need to be perturbed in a personalized differentially private manner.
  • the perturbation mechanism of the embodiment of the present invention can narrow the range over which the gradient is perturbed, thereby reducing the destruction of useful information; according to the chain rule, the scope of the perturbation mechanism can be reduced.
  • according to Equation (2), the backpropagation of the gradient information can be divided into two parts by the chain rule: ∇θG = JθG G(z; θG)^T · ∇x Du(x), where x = G(z; θG) is the simulated data. The first part, ∇x Du(x), is computed by the local discriminator of each data owner based on the simulated data it receives; the other part, JθG G(z; θG), is the Jacobian matrix calculated by the central generator, which is independent of the training data. Therefore, the perturbation range can be narrowed to the first part, and the perturbation process based on the Gaussian mechanism can be further described as formulas (3) and (4).
  • the noise variance σ² directly affects the scale of the noise.
  • the larger the variance σ² of the Gaussian noise, the larger the added noise and the higher the level of privacy protection.
  • the noise variance σ² is determined by the privacy protection level in the contract signed by each user, thereby achieving personalized differential privacy protection.
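A common concrete instantiation of such a mechanism is to clip the gradient and add Gaussian noise calibrated to the clipping bound; the sketch below follows that standard recipe, with the caveat that the patent's exact formulas (3) and (4) may calibrate differently, and the parameter names are assumptions.

```python
import math
import random

def perturb_gradient(grad, sigma, clip_norm=1.0):
    """Clip grad to L2 norm clip_norm, then add zero-mean Gaussian noise with
    standard deviation sigma * clip_norm per coordinate. A larger sigma
    (a stricter contract) means stronger perturbation."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale + random.gauss(0.0, sigma * clip_norm) for g in grad]
```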
  • DP-SGD: Differentially Private Stochastic Gradient Descent
  • the noise selection strategy of the embodiment of the present invention follows the idea that, as the generation capability of the central generator increases, the expected scale of the perturbation noise in the gradient becomes smaller, thereby further optimizing the model.
  • the strategy for selecting the noise scale in this embodiment of the present invention is to monitor the performance of the central generator and gradually select data owners with smaller noise scales.
  • the embodiment of the present invention proposes a strategy for selecting an appropriate noise scale based on training iteration rounds.
  • the noise scale should be determined based on the attenuation function of the noise scale, and the corresponding data owner should be further selected to assist in training.
  • the attenuation function takes the training round n as a parameter, and the noise scale is negatively related to n.
  • σ0 is the initial noise parameter
  • n is the number of iteration rounds
  • k is the decay rate.
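An exponential decay is one natural instantiation of such an attenuation function; the form below, σ(n) = σ0 · exp(-k·n), is an assumption consistent with the stated negative relation between noise scale and round n, since the exact functional form is not fixed here.

```python
import math

def noise_sigma(n, sigma0=2.0, k=0.05):
    """Attenuation function: the noise scale starts at sigma0 and decays
    exponentially at rate k as the training round n grows."""
    return sigma0 * math.exp(-k * n)
```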
  • the server pays out the reward to each data owner based on the valuation specified in the signed contract.
  • a privacy-preserving data sharing scheme based on asynchronous distributed GAN is proposed.
  • a central generative model is trained in a personalized privacy-preserving manner using data sets local to each data owner.
  • the proposed distributed GAN training framework can use the local data set of the data owner to collaboratively train the central generation model to achieve data sharing without transmitting the original data, and then use the central generation model to reconstruct the data set for downstream tasks.
  • a gradient "desensitization" strategy is proposed to maximize the availability of gradients while protecting user privacy, and achieve model optimization under the guarantee of differential privacy.
  • Designing multi-level privacy protection contracts for data owners with different privacy preferences and proposing a differential privacy level selection strategy can balance data availability and user privacy protection needs, and complete model training with minimal privacy consumption.
  • an embodiment of the present invention provides a privacy-preserving data sharing system based on distributed GAN.
  • the system is used to implement a privacy-preserving data sharing method based on distributed GAN.
  • the system includes a central server and multiple data owners, where:
  • the central server is used to provide multiple personalized contracts and design privacy protection level selection strategies.
  • Multiple data owners are used to select a personalized contract from the multiple personalized contracts; use each data owner's local private data set to pre-train that data owner's local generative adversarial network GAN model to obtain the pre-trained local GAN model; and, based on the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN model, optimize the central generator model of the central server to complete privacy-preserving data sharing.
  • multiple personalized contracts include multiple privacy protection levels and rewards corresponding to the multiple privacy protection levels.
  • the multiple data owners are further used to:
  • Each data owner obtains the original GAN model from the central server.
  • Each data owner uses the data owner's local private data set to pre-train the original GAN model to obtain the pre-trained local GAN model.
  • the local GAN model includes a local generator and a local discriminator.
  • Each data owner hides the pretrained local generator.
  • the multiple data owners are further used to:
  • the central server determines the privacy protection level ε of the data owner who assists in this round of training based on the privacy protection level selection strategy.
  • the central server obtains multiple data owners with a privacy protection level of ε among the multiple data owners, based on the privacy protection level ε and the personalized contract selected by each data owner.
  • the central server randomly selects a data owner from the multiple data owners with a privacy protection level of ε as the data owner to assist in training.
  • S54: the data owner who assists in training optimizes the central generator model of the central server based on that data owner's pre-trained local GAN model. After optimization, S51 is executed for iterative training until the number of iterations reaches a preset threshold; iteration then stops and training of the central generator model is completed.
  • the central server is further used to:
  • the central server determines the attenuation function of the noise scale according to the number of iterations of the central generator model training process.
  • the central server determines the noise scale based on the attenuation function.
  • the central server determines the privacy protection level ε of the data owner assisting this round of training based on the noise scale.
  • multiple data owners further used to:
  • the data owner who assists in training obtains the data generated by the central generator model from the central server.
  • the data owner who assists in training updates the local discriminator in the pre-trained local GAN model based on the data generated by the central generator model and the private data set of the data owner who assists in training.
  • the data owner assisting in training calculates the gradient based on the updated local discriminator.
  • the data owner who assists in training perturbs the gradient based on the personalized differential privacy theory and obtains the perturbed gradient.
  • the central server optimizes the central generator model of the central server according to the perturbed gradient.
  • multiple data owners further used to:
  • the gradient is perturbed based on the Gaussian mechanism and the degree of disturbance; the degree of disturbance is determined by the privacy protection level of the personalized contract.
  • a privacy-preserving data sharing scheme based on asynchronous distributed GAN is proposed.
  • a central generative model is trained in a personalized privacy-preserving manner using data sets local to each data owner.
  • the proposed distributed GAN training framework can use the local data set of the data owner to collaboratively train the central generation model to achieve data sharing without transmitting the original data, and then use the central generation model to reconstruct the data set for downstream tasks.
  • a gradient "desensitization" strategy is proposed to maximize the availability of gradients while protecting user privacy, and achieve model optimization under the guarantee of differential privacy.
  • Designing multi-level privacy protection contracts for data owners with different privacy preferences and proposing a differential privacy level selection strategy can balance data availability and user privacy protection needs, and complete model training with minimal privacy consumption.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention discloses a privacy-preserving data sharing method and system based on a distributed GAN, relating to the technical field of data sharing and privacy protection. The method comprises: a central server provides multiple personalized contracts; each of multiple data owners selects a personalized contract according to its own privacy protection needs; each data owner pre-trains its local GAN model using its private data set; the central server designs a privacy protection level selection strategy; the data owner assisting in training optimizes the central generator model of the central server, completing privacy-preserving data sharing. The present invention can use the data owners' local data sets to collaboratively train a central generative model to achieve data sharing without transmitting raw data, complete model training under the guarantee of differential privacy, and design contracts with different privacy protection levels for data owners with different privacy preferences.

Description

A privacy-preserving data sharing method and system based on a distributed GAN
Technical Field
The present invention relates to the technical field of data sharing and privacy protection, and in particular to a privacy-preserving data sharing method and system based on a distributed GAN.
Background Art
The number of sensing devices is growing explosively, bringing with it massive volumes of data generated by IoT terminals. Such high-quality data has enabled machine learning to make a huge impact in fields such as image recognition, autonomous driving, and product recommendation. Highly usable data has become the main driving force behind the development of machine learning. However, there are still situations where not enough training data is available for machine learning tasks, mainly because of public concern over data leakage and growing privacy awareness. Specifically, shared data may contain users' private information, and data owners are unwilling to share data because of privacy leakage concerns. In addition, there are cases where confidential data cannot be transmitted and can only be kept locally by its owner. Therefore, protecting data owners' privacy and incentivizing them to share data is becoming one of the key bottlenecks for the further development of machine learning.
To address privacy issues in data sharing, researchers have proposed a series of solutions. Some researchers use technologies such as ABE (Attribute-Based Encryption), SMC (Secure Multi-Party Computation), and blockchain to achieve privacy protection by hiding user identities in data sharing or designing fine-grained access control mechanisms, for example [Pu Y, Hu C, Deng S, et al. R2PEDS: a recoverable and revocable privacy-preserving edge data sharing scheme[J]. IEEE Internet of Things Journal, 2020, 7(9): 8077-8089.], [Zheng X, Cai Z. Privacy-preserved data sharing towards multiple parties in industrial IoTs[J]. IEEE Journal on Selected Areas in Communications, 2020, 38(5): 968-979.], and [Xu X, Liu Q, Zhang X, et al. A blockchain-powered crowdsourcing method with privacy preservation in mobile environment[J]. IEEE Transactions on Computational Social Systems, 2019, 6(6): 1407-1419.]. However, such schemes focus on authentication and access control mechanisms, which not only require transmitting raw data but also incur substantial additional computation. The rise of federated learning provides a new solution, enabling model training without transmitting raw data. However, when the training task changes or the machine learning model is updated, private data sets must be accessed repeatedly, which increases the risk of privacy leakage.
Existing AI-based approaches to privacy protection in IoT data sharing fall roughly into two categories: federated-learning-based data sharing and GAN-based data sharing. Neither requires uploading users' raw data, which protects user privacy to some extent, but both still have limitations, which are introduced and summarized below.
The rise of federated learning has broken the limitation that AI techniques require centralized data collection and processing. Federated learning can therefore be used in a wide range of IoT (Internet of Things) services, providing a new solution for privacy-preserving data sharing. For example, in the IoV (Internet of Vehicles), data sharing between vehicles can improve service quality. To reduce transmission load and address privacy issues in data sharing, the authors of [Lu Y, Huang X, Zhang K, et al. Blockchain empowered asynchronous federated learning for secure data sharing in internet of vehicles[J]. IEEE Transactions on Vehicular Technology, 2020, 69(4): 4298-4311.] proposed a new federated-learning-based architecture. They developed a hybrid blockchain architecture composed of a blockchain and a local DAG (Directed Acyclic Graph) to improve the security and reliability of model parameters. The paper [Yin L, Feng J, Xun H, et al. A privacy-preserving federated learning for multiparty data sharing in social IoTs[J]. IEEE Transactions on Network Science and Engineering, 2021, 8(3): 2706-2718.] also uses federated learning to achieve data sharing, but the authors propose a new hybrid privacy protection method to overcome data- and content-level disclosure in federated learning. They employ advanced functional encryption algorithms and local Bayesian differential privacy to preserve the features of the uploaded data and the weight of each participant in the weighted-sum process.
Since GANs (Generative Adversarial Networks) are applicable to various types of data, many researchers replace direct data transmission with jointly trained GANs to achieve privacy-preserving data sharing. In CPSS (Cyber-Physical-Social Systems), human interaction from cyberspace to the physical world is realized through the sharing of spatio-temporal data. To trade off privacy protection against data utility, the authors of [Qu Y, Yu S, Zhou W, et al. Gan-driven personalized spatial-temporal private data sharing in cyber-physical social systems[J]. IEEE Transactions on Network Science and Engineering, 2020, 7(4): 2576-2586.] use a modified GAN model that runs two games simultaneously (between the generator, the discriminator, and a differentially private identifier). In the paper [Chang Q, Qu H, Zhang Y, et al. Synthetic learning: Learn from distributed asynchronized discriminator gan without sharing medical image data[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 13856-13866.], the authors propose a privacy-preserving and communication-efficient distributed GAN framework called the Asynchronized Discriminator GAN (AsynDGAN). It is designed to learn from distributed discriminators and train a segmentation model using only synthetic images produced by a central generator.
Both categories of methods still have limitations: 1) Solutions using federated learning can train task models without uploading data. However, these schemes still carry a considerable risk of privacy leakage, because private data sets must be re-accessed repeatedly whenever the task changes or the machine learning architecture is updated. 2) Existing GAN-based schemes cannot balance privacy protection against data availability and cannot satisfy data owners' personalized privacy protection needs.
Summary of the Invention
The present invention is proposed to address the problem of how to protect data owners' privacy while incentivizing them to share data.
To solve the above technical problem, the present invention provides the following technical solutions:
In one aspect, the present invention provides a privacy-preserving data sharing method based on a distributed GAN. The method is implemented by a privacy-preserving data sharing system based on a distributed GAN, and the system comprises a central server and multiple data owners.
The method comprises:
S1. The central server provides multiple personalized contracts.
S2. Each of the multiple data owners selects one personalized contract from the multiple personalized contracts.
S3. Each data owner pre-trains its local generative adversarial network (GAN) model using the data owner's local private data set to obtain a pre-trained local GAN model.
S4. The central server designs a privacy protection level selection strategy.
S5. The multiple data owners optimize the central generator model of the central server according to the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN models, completing privacy-preserving data sharing.
Optionally, the multiple personalized contracts in S1 include multiple privacy protection levels and rewards corresponding to the multiple privacy protection levels.
Optionally, in S3, each data owner pre-training its local generative adversarial network (GAN) model using its local private data set to obtain the pre-trained local GAN model comprises:
S31. Each data owner obtains the original GAN model from the central server.
S32. Each data owner pre-trains the original GAN model using its local private data set to obtain the pre-trained local GAN model.
Optionally, the local GAN model includes a local generator and a local discriminator.
After obtaining the pre-trained local GAN model in S32, the method further comprises:
Each data owner hides the pre-trained local generator.
Optionally, in S5, the multiple data owners optimizing the central generator model of the central server according to the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN models comprises:
S51. The central server determines, according to the privacy protection level selection strategy, the privacy protection level ρ of the data owner assisting the current training round.
S52. The central server obtains, according to the privacy protection level ρ and the personalized contract selected by each data owner, the data owners among the multiple data owners whose privacy protection level is ρ.
S53. The central server randomly selects one data owner from the data owners with privacy protection level ρ as the assisting data owner.
S54. The assisting data owner optimizes the central generator model of the central server according to its pre-trained local GAN model; after optimization, execution returns to S51 for iterative training until the number of iterations reaches a preset threshold, at which point iteration stops and training of the central generator model is complete.
Optionally, in S51, the central server determining the privacy protection level ρ of the data owner assisting the current training round according to the privacy protection level selection strategy comprises:
S511. The central server determines a decay function for the noise scale according to the number of iterations in the central generator model training process.
S512. The central server determines the noise scale according to the decay function.
S513. The central server determines, according to the noise scale, the privacy protection level ρ of the data owner assisting the current training round.
Optionally, in S54, the assisting data owner optimizing the central generator model of the central server according to the assisting data owner's pre-trained local GAN model comprises:
S541. The assisting data owner obtains data generated by the central generator model from the central server.
S542. The assisting data owner updates the local discriminator in its pre-trained local GAN model using the data generated by the central generator model and its own private data set.
S543. The assisting data owner computes a gradient using the updated local discriminator.
S544. The assisting data owner perturbs the gradient based on personalized differential privacy theory to obtain a perturbed gradient.
S545. The central server optimizes its central generator model according to the perturbed gradient.
Optionally, perturbing the gradient based on personalized differential privacy theory in S544 comprises:
Perturbing the gradient based on the Gaussian mechanism and a perturbation degree, where the perturbation degree is determined by the privacy protection level of the personalized contract.
In another aspect, the present invention provides a privacy-preserving data sharing system based on a distributed GAN. The system is used to implement the privacy-preserving data sharing method based on a distributed GAN and comprises a central server and multiple data owners, wherein:
The central server is configured to provide multiple personalized contracts and to design a privacy protection level selection strategy.
The multiple data owners are configured to: select one personalized contract from the multiple personalized contracts; pre-train their local generative adversarial network (GAN) models using their local private data sets to obtain pre-trained local GAN models; and optimize the central generator model of the central server according to the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN models, completing privacy-preserving data sharing.
Optionally, the multiple personalized contracts include multiple privacy protection levels and rewards corresponding to the multiple privacy protection levels.
Optionally, the multiple data owners are further configured such that:
S31. Each data owner obtains the original GAN model from the central server.
S32. Each data owner pre-trains the original GAN model using its local private data set to obtain the pre-trained local GAN model.
Optionally, the local GAN model includes a local generator and a local discriminator.
The multiple data owners are further configured such that:
Each data owner hides the pre-trained local generator.
Optionally, the multiple data owners are further configured such that:
S51. The central server determines, according to the privacy protection level selection strategy, the privacy protection level ρ of the data owner assisting the current training round.
S52. The central server obtains, according to the privacy protection level ρ and the personalized contract selected by each data owner, the data owners among the multiple data owners whose privacy protection level is ρ.
S53. The central server randomly selects one data owner from the data owners with privacy protection level ρ as the assisting data owner.
S54. The assisting data owner optimizes the central generator model of the central server according to its pre-trained local GAN model; after optimization, execution returns to S51 for iterative training until the number of iterations reaches a preset threshold, at which point iteration stops and training of the central generator model is complete.
Optionally, the central server is further configured such that:
S511. The central server determines a decay function for the noise scale according to the number of iterations in the central generator model training process.
S512. The central server determines the noise scale according to the decay function.
S513. The central server determines, according to the noise scale, the privacy protection level ρ of the data owner assisting the current training round.
Optionally, the multiple data owners are further configured such that:
S541. The assisting data owner obtains data generated by the central generator model from the central server.
S542. The assisting data owner updates the local discriminator in its pre-trained local GAN model using the data generated by the central generator model and its own private data set.
S543. The assisting data owner computes a gradient using the updated local discriminator.
S544. The assisting data owner perturbs the gradient based on personalized differential privacy theory to obtain a perturbed gradient.
S545. The central server optimizes its central generator model according to the perturbed gradient.
Optionally, the multiple data owners are further configured such that:
The gradient is perturbed based on the Gaussian mechanism and a perturbation degree; the perturbation degree is determined by the privacy protection level of the personalized contract.
The technical solutions provided by the embodiments of the present invention bring at least the following beneficial effects:
In the above solution, a privacy-preserving data sharing scheme based on an asynchronous distributed GAN is proposed to address privacy issues in IoT data sharing. Combining differential privacy theory with a distributed GAN, a central generative model is trained in a personalized privacy-preserving manner using each data owner's local data set. The proposed distributed GAN training framework can use the data owners' local data sets to collaboratively train the central generative model, achieving data sharing without transmitting raw data; the central generative model can then be used to reconstruct data sets for downstream tasks. A gradient "desensitization" strategy based on differential privacy theory preserves the usability of gradients as much as possible while protecting user privacy, realizing model optimization under the guarantee of differential privacy. Designing multi-level privacy protection contracts for data owners with different privacy preferences, together with the proposed differential privacy level selection strategy, balances data availability against users' privacy protection needs and completes model training with minimal privacy consumption.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the privacy-preserving data sharing method based on a distributed GAN provided by an embodiment of the present invention;
Fig. 2 is a block diagram of the privacy-preserving data sharing system based on a distributed GAN provided by an embodiment of the present invention.
Detailed Description
To make the technical problems to be solved, the technical solutions, and the advantages of the present invention clearer, a detailed description is given below with reference to the drawings and specific embodiments.
As shown in Fig. 1, an embodiment of the present invention provides a privacy-preserving data sharing method based on a distributed GAN, which can be implemented by a privacy-preserving data sharing system based on a distributed GAN. As shown in the flowchart of Fig. 1, the processing flow of the method may include the following steps:
S1. The central server provides multiple personalized contracts.
Optionally, the multiple personalized contracts in S1 include multiple privacy protection levels and rewards corresponding to the multiple privacy protection levels.
In a feasible implementation, at the start of data sharing the central server designs a series of personalized contracts (ρ1, r1), (ρ2, r2), ..., (ρK, rK) with different privacy protection levels and rewards, to satisfy the privacy protection needs of data owners with different privacy preferences. The higher the privacy protection level, the smaller the reward, and each data owner can choose the corresponding contract to maximize its own profit. The server then publishes the data requirements and the contracts to the data owners registered in the system.
The central server has strong computing power and communication bandwidth. Its goal is to recruit enough data owners to collaboratively train a central generator until it has a strong data generation capability. The embodiments of the present invention assume that the central server does not deviate from the defined protocol but may attempt to infer users' private information.
S2. Each of the multiple data owners selects one personalized contract from the multiple personalized contracts.
In a feasible implementation, the data owner set consists of multiple data owners, each owning a private data set containing Nu data samples. These data owners have certain computing and communication capabilities and want to participate in the training task using their private data sets in exchange for rewards, but they want to protect their privacy against inference attacks from the central server. In addition, different users have different privacy preferences (i.e., different sensitivity to privacy exposure) and therefore require personalized privacy protection.
S3. Each data owner pre-trains its local generative adversarial network (GAN) model using the data owner's local private data set to obtain a pre-trained local GAN model.
Optionally, in S3, each data owner pre-training its local GAN model using its local private data set to obtain the pre-trained local GAN model comprises:
S31. Each data owner obtains the original GAN model from the central server.
In a feasible implementation, each qualified data owner signs a specific contract with the server according to its own privacy protection needs and downloads the original GAN model.
S32. Each data owner pre-trains the original GAN model using its local private data set to obtain the pre-trained local GAN model.
Optionally, the local GAN model includes a local generator and a local discriminator.
After obtaining the pre-trained local GAN model in S32, the method further comprises:
Each data owner hides the pre-trained local generator.
In a feasible implementation, the embodiments of the present invention propose a privacy-preserving asynchronous distributed GAN training framework, which uses the data owners' local data sets to collaboratively train a central generative model.
Further, all participating data owners pre-train GAN models locally using their private data sets. After pre-training, the generator, which can produce simulated data, is hidden, while the local discriminator is used to assist the server in training the central generator.
Further, the pre-training process includes first preprocessing the private data set according to the data requirements and then training the local GAN model. The pre-training process is detailed in Algorithm 1 below:

After pre-training, each data owner locally holds a trained generator and discriminator. The generator, which has learned the local data distribution, is hidden, while the discriminator is stored locally to assist in training the central generator. The purpose of assisted training is to train the central generator using the local discriminator and the private data set on data owner u.
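Algorithm 1 itself is not reproduced in this text. As a rough, hedged sketch of the pre-training loop described above (preprocess the private data set, then alternate discriminator and generator updates), the skeleton below uses plain dicts as stand-ins for real networks; the epoch and step counts are illustrative assumptions, not the patent's parameters.

```python
import random

def pretrain_local_gan(private_data, epochs=3, d_steps=2, g_steps=1, seed=0):
    """Sketch of the local pre-training loop: preprocess the private data set,
    then alternate discriminator and generator update steps.
    Dicts stand in for real networks; no actual learning happens here."""
    rng = random.Random(seed)
    data = [x for x in private_data if x is not None]   # stand-in preprocessing
    G, D = {"steps": 0}, {"steps": 0}
    for _ in range(epochs):
        for _ in range(d_steps):
            real = rng.sample(data, min(4, len(data)))  # real batch for D
            fake = [rng.random() for _ in real]         # simulated batch from G
            D["steps"] += 1                             # stand-in for a D update
        for _ in range(g_steps):
            G["steps"] += 1                             # stand-in for a G update
    return G, D

G, D = pretrain_local_gan(list(range(10)))
# After pre-training, the generator is hidden; only D assists the server.
generator_hidden, local_discriminator = True, D
assert (G["steps"], D["steps"]) == (3, 6)
```

The skeleton only fixes the order of operations (preprocess, D steps, G steps); any concrete GAN objective and optimizer would slot into the two update stubs.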
S4. The central server designs a privacy protection level selection strategy.
In a feasible implementation, to optimize the central generator at minimal privacy cost, the embodiments of the present invention design a privacy protection level selection strategy to select the corresponding data owner to assist training in each round.
S5. The multiple data owners optimize the central generator model of the central server according to the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN models, completing privacy-preserving data sharing.
Optionally, in S5, the multiple data owners optimizing the central generator model of the central server according to the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN models comprises:
S51. The central server determines, according to the privacy protection level selection strategy, the privacy protection level ρ of the data owner assisting the current training round.
Optionally, in S51, the central server determining the privacy protection level ρ of the data owner assisting the current training round according to the privacy protection level selection strategy comprises:
S511. The central server determines a decay function for the noise scale according to the number of iterations in the central generator model training process.
S512. The central server determines the noise scale according to the decay function.
S513. The central server determines, according to the noise scale, the privacy protection level ρ of the data owner assisting the current training round.
S52. The central server obtains, according to the privacy protection level ρ and the personalized contract selected by each data owner, the data owners among the multiple data owners whose privacy protection level is ρ.
S53. The central server randomly selects one data owner from the data owners with privacy protection level ρ as the assisting data owner.
In a feasible implementation, the central server designs the privacy protection level selection strategy and determines the privacy protection level ρ of the data owner assisting the current training round. It then randomly selects one of the data owners who signed a contract with privacy protection level ρ and uses that owner's local discriminator for the current round of training.
S54. The assisting data owner optimizes the central generator model of the central server according to its pre-trained local GAN model; after optimization, execution returns to S51 for iterative training until the number of iterations reaches a preset threshold, at which point iteration stops and training of the central generator model is complete.
Optionally, in S54, the assisting data owner optimizing the central generator model of the central server according to the assisting data owner's pre-trained local GAN model comprises:
S541. The assisting data owner obtains data generated by the central generator model from the central server.
In a feasible implementation, the selected data owner u receives the data generated by the central generator.
S542. The assisting data owner updates the local discriminator in its pre-trained local GAN model using the data generated by the central generator model and its own private data set.
S543. The assisting data owner computes a gradient using the updated local discriminator.
S544. The assisting data owner perturbs the gradient based on personalized differential privacy theory to obtain a perturbed gradient.
In a feasible implementation, data owner u perturbs the computed gradient based on differential privacy theory, where the degree of perturbation is determined by the privacy protection level specified in the signed contract. It then sends the perturbed gradient to the central server for optimizing the generator.
Optionally, perturbing the gradient based on personalized differential privacy theory in S544 comprises:
Perturbing the gradient based on the Gaussian mechanism and a perturbation degree.
The perturbation degree is determined by the privacy protection level of the personalized contract.
S545. The central server optimizes its central generator model according to the perturbed gradient.
In a feasible implementation, the central server updates the central generator model according to the perturbed gradient of the selected data owner. The central server then reselects a privacy protection level and a data owner for the next round of assisted training, until training of the central generator is complete.
The embodiments of the present invention propose a personalized privacy protection strategy, which achieves a differential privacy guarantee by perturbing the gradients computed locally by the data owners, with the privacy protection level specified by the contract signed by each data owner.
Further, there is no discriminator on the central server; its optimization depends entirely on the discriminators on the data owners' side. To maximize model performance at minimal privacy cost, the embodiments of the present invention propose a privacy protection level selection strategy that selects different privacy protection levels at different training stages, completing training with minimal privacy loss. In each iteration, the server selects one data owner according to the strategy and uses that owner's local discriminator to optimize the central generator. The optimization process of the central generator is described in Algorithm 2 below:
The assisted training process of each data owner (line 7) is shown in Algorithm 3 below. In the assisted training phase, data owner u uses its local discriminator and private data set to optimize the central generator. In detail, the selected data owner first receives generated data from the central server and updates its discriminator using the generated data and its local data set. It then computes a gradient using the local discriminator and, before sending the gradient back, perturbs it in a personalized differentially private manner, with the perturbation degree determined by the privacy protection level in the signed contract.
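Algorithms 2 and 3 themselves are not reproduced in this text. The sketch below reconstructs only the per-round data flow they describe, under toy assumptions: a one-dimensional stand-in "gradient" and an identity perturbation placeholder, not the patent's actual models.

```python
def perturb(g_u):
    """Placeholder for the personalized DP step of S544; identity here.
    A real implementation would clip g_u and add contract-dependent Gaussian noise."""
    return list(g_u)

def assist_round(generated_batch, private_batch):
    """One assisted-training round on data owner u (S541-S544 sketch): receive
    generated data, update the local discriminator (folded into a toy signal),
    compute the gradient g_u, perturb it, and return it to the server."""
    g_u = [sum(private_batch) / len(private_batch)
           - sum(generated_batch) / len(generated_batch)]  # toy discriminator signal
    return perturb(g_u)

def server_update(theta, g_tilde, lr=0.1):
    """S545: the server applies the perturbed gradient to the central generator."""
    return [t + lr * g for t, g in zip(theta, g_tilde)]

g = assist_round([0.1, 0.2], [0.9, 1.1])
assert abs(g[0] - 0.85) < 1e-9
assert abs(server_update([0.0], g)[0] - 0.085) < 1e-9
```

The point of the sketch is the asynchrony: only one owner's discriminator contributes per round, and only the perturbed gradient ever leaves that owner.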

The personalized privacy protection method can be further explained as follows. Generally, privacy issues in machine learning arise because model training requires a large amount of user data, and after many training iterations the model captures many data features. An attacker can use model parameters, gradients, and the like to infer information about the input data. Likewise, training a GAN model requires a large amount of user data: the generator is trained to produce simulated data that mimics the distribution of real data, while the discriminator must be fed a large amount of real data during training to distinguish real from simulated data. Therefore, to protect each data owner's privacy, the local generator must be hidden, and the gradients computed by the local discriminator must be perturbed in a personalized differentially private manner.
According to the composition theorem of differential privacy, if every SGD (Stochastic Gradient Descent) step satisfies differential privacy [Lee J, Kifer D. Concentrated differentially private gradient descent with adaptive per-iteration privacy budget[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 1656-1665.], then the final model is also differentially private. The gradient descent process of the central generator is given by Equation (1):
g̃ = (1/|B|) ( Σ_{x∈B} ∇_θ L(M; x) + N(0, σ²C²I) )     (1)
where M is the machine learning model, B is the batch of data, ∇_θ L(M; x) is the computed stochastic gradient, N(0, σ²C²I) is Gaussian noise, and g̃ is the perturbed stochastic gradient.
In contrast, the perturbation mechanism of the embodiments of the present invention can narrow the range of the gradient that is perturbed, thereby reducing the destruction of useful information. According to the chain rule, the scope of the perturbation mechanism can be reduced.
As shown in Equation (2), the back-propagation of the gradient information can be divided into two parts:
∇_{θG} L = J_{θG} G(z; θG) · g_u     (2)
The first part, g_u, is computed by each data owner's local discriminator based on the received simulated data, while the other part, J_{θG} G(z; θG), is the Jacobian matrix computed by the central generator, which is independent of the training data. Therefore, the perturbation scope can be narrowed to the first part, and the perturbation process based on the Gaussian mechanism can be further described as Equations (3) and (4):
ḡ_u = g_u / max(1, ||g_u||₂ / C)     (3)
g̃_u = ḡ_u + N(0, σ²C²I)     (4)
where g_u is the gradient computed by data owner u using the local discriminator and N(0, σ²C²I) is Gaussian noise. The clip operation is performed using the L2 norm: replacing the gradient g_u with g_u / max(1, ||g_u||₂ / C) ensures ||ḡ_u|| ≤ C. Notably, the noise variance σ² directly determines the noise scale: the larger the variance of the Gaussian noise, the larger the noise scale and the higher the privacy protection level. Since the noise variance σ² is determined by the privacy protection level in the contract signed by each user, a personalized differential privacy guarantee is achieved.
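The clip-and-perturb step of Equations (3) and (4) can be sketched directly. This is a minimal stand-alone illustration assuming gradients are plain Python lists; in the scheme, σ would come from the signed contract's privacy level rather than being passed in by hand.

```python
import math
import random

def clip_l2(g, C):
    """Equation (3): replace g with g / max(1, ||g||_2 / C) so that ||g|| <= C."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = max(1.0, norm / C)
    return [x / scale for x in g]

def perturb(g, C, sigma, rng=random):
    """Equation (4): add N(0, sigma^2 * C^2 * I) noise to the clipped gradient."""
    clipped = clip_l2(g, C)
    return [x + rng.gauss(0.0, sigma * C) for x in clipped]

g = [3.0, 4.0]                                   # ||g||_2 = 5
clipped = clip_l2(g, C=1.0)
assert abs(math.hypot(*clipped) - 1.0) < 1e-9    # projected onto the C-ball
assert clip_l2([0.3, 0.4], C=1.0) == [0.3, 0.4]  # small gradients pass unchanged
# A larger sigma (a higher privacy level in the contract) means a larger noise scale.
```

Clipping bounds the L2 sensitivity of the released quantity at C, which is what makes the Gaussian noise with standard deviation σC a valid Gaussian mechanism.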
Further, DP-SGD (Differentially Private Stochastic Gradient Descent) still has two key problems. On the one hand, training a GAN model usually requires a large number of iterations, which leads to a large privacy loss. On the other hand, each data owner requires a different privacy protection level, which means the noise scale in each DP-SGD step differs; this directly affects both the privacy loss and the performance of the final model. Therefore, the embodiments of the present invention design a privacy protection level selection strategy that selects a data owner with a specific privacy protection level in each training round, reducing the privacy cost while completing model training. Specifically, the noise selection strategy follows the idea that, as the generation capability of the central generator strengthens, the scale of the perturbation noise in the expected gradient should become smaller so that the model can be optimized further. The strategy for selecting the noise scale is to monitor the performance of the central generator and gradually select data owners with smaller noise scales. However, during training, the local discriminator of each data owner cannot be accessed directly; only the perturbed gradients can be obtained, so it is difficult to evaluate the performance of the central generator using each data owner's local discriminator. The embodiments of the present invention therefore propose a strategy that selects an appropriate noise scale based on the training iteration round. Specifically, the noise scale is determined according to a decay function of the noise scale, and the corresponding data owner is then selected to assist training. The decay function takes the number of training rounds n as its parameter, and the noise scale is negatively correlated with n. The decay function is given by Equation (5):
ρt = ρ0 / (1 + kn)     (5)
where ρ0 is the initial noise parameter, n is the number of iteration rounds, and k is the decay rate. After determining the noise scale via the decay function, the central server selects the contract most similar to the noise scale, and finally selects one data owner among those who signed that contract to assist the current round of training.
Further, after training of the central generator is completed, the server redeems rewards for each data owner according to the valuation specified in the signed contract.
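The level-selection flow described above (decay the noise scale by Equation (5), match the closest contract level, then draw one signed owner at random) can be sketched as follows; the contract set, owner registry, and parameter values are illustrative assumptions only.

```python
import random

# Hypothetical contracts: (privacy level rho, reward r); higher level, smaller reward.
CONTRACTS = [(0.8, 10), (0.4, 20), (0.1, 40)]

# Hypothetical registry: owner id -> privacy level of the contract they signed.
OWNERS = {"u1": 0.8, "u2": 0.8, "u3": 0.4, "u4": 0.1}

def noise_scale(rho0: float, k: float, n: int) -> float:
    """Decay function rho_t = rho_0 / (1 + k*n) from Equation (5)."""
    return rho0 / (1 + k * n)

def select_owner(n: int, rho0: float = 0.8, k: float = 0.5, seed: int = 0) -> str:
    """Pick the contract level closest to the decayed noise scale (S511-S513),
    then pick one signed owner of that level uniformly at random (S52-S53)."""
    target = noise_scale(rho0, k, n)
    level = min((rho for rho, _ in CONTRACTS), key=lambda r: abs(r - target))
    candidates = [u for u, r in OWNERS.items() if r == level]
    return random.Random(seed).choice(candidates)

# Early rounds match the high-noise (high-privacy) contract; later rounds decay.
assert noise_scale(0.8, 0.5, 0) == 0.8
assert select_owner(0) in ("u1", "u2")
assert select_owner(20) == "u4"  # 0.8/(1+10) ≈ 0.073, closest to level 0.1
```

This matches the stated design intent: high-noise owners carry the noisy early rounds, and lower-noise owners refine the generator once it is already strong.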
In the embodiments of the present invention, a privacy-preserving data sharing scheme based on an asynchronous distributed GAN is proposed to address privacy issues in IoT data sharing. Combining differential privacy theory with a distributed GAN, a central generative model is trained in a personalized privacy-preserving manner using each data owner's local data set. The proposed distributed GAN training framework can use the data owners' local data sets to collaboratively train the central generative model, achieving data sharing without transmitting raw data; the central generative model can then be used to reconstruct data sets for downstream tasks. A gradient "desensitization" strategy based on differential privacy theory preserves the usability of gradients as much as possible while protecting user privacy, realizing model optimization under the guarantee of differential privacy. Designing multi-level privacy protection contracts for data owners with different privacy preferences, together with the proposed differential privacy level selection strategy, balances data availability against users' privacy protection needs and completes model training with minimal privacy consumption.
As shown in Fig. 2, an embodiment of the present invention provides a privacy-preserving data sharing system based on a distributed GAN, which is used to implement the privacy-preserving data sharing method based on a distributed GAN. The system comprises a central server and multiple data owners, wherein:
The central server is configured to provide multiple personalized contracts and to design a privacy protection level selection strategy.
The multiple data owners are configured to: select one personalized contract from the multiple personalized contracts; pre-train their local generative adversarial network (GAN) models using their local private data sets to obtain pre-trained local GAN models; and optimize the central generator model of the central server according to the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN models, completing privacy-preserving data sharing.
Optionally, the multiple personalized contracts include multiple privacy protection levels and rewards corresponding to the multiple privacy protection levels.
Optionally, the multiple data owners are further configured such that:
S31. Each data owner obtains the original GAN model from the central server.
S32. Each data owner pre-trains the original GAN model using its local private data set to obtain the pre-trained local GAN model.
Optionally, the local GAN model includes a local generator and a local discriminator.
The multiple data owners are further configured such that:
Each data owner hides the pre-trained local generator.
Optionally, the multiple data owners are further configured such that:
S51. The central server determines, according to the privacy protection level selection strategy, the privacy protection level ρ of the data owner assisting the current training round.
S52. The central server obtains, according to the privacy protection level ρ and the personalized contract selected by each data owner, the data owners among the multiple data owners whose privacy protection level is ρ.
S53. The central server randomly selects one data owner from the data owners with privacy protection level ρ as the assisting data owner.
S54. The assisting data owner optimizes the central generator model of the central server according to its pre-trained local GAN model; after optimization, execution returns to S51 for iterative training until the number of iterations reaches a preset threshold, at which point iteration stops and training of the central generator model is complete.
Optionally, the central server is further configured such that:
S511. The central server determines a decay function for the noise scale according to the number of iterations in the central generator model training process.
S512. The central server determines the noise scale according to the decay function.
S513. The central server determines, according to the noise scale, the privacy protection level ρ of the data owner assisting the current training round.
Optionally, the multiple data owners are further configured such that:
S541. The assisting data owner obtains data generated by the central generator model from the central server.
S542. The assisting data owner updates the local discriminator in its pre-trained local GAN model using the data generated by the central generator model and its own private data set.
S543. The assisting data owner computes a gradient using the updated local discriminator.
S544. The assisting data owner perturbs the gradient based on personalized differential privacy theory to obtain a perturbed gradient.
S545. The central server optimizes its central generator model according to the perturbed gradient.
Optionally, the multiple data owners are further configured such that:
The gradient is perturbed based on the Gaussian mechanism and a perturbation degree; the perturbation degree is determined by the privacy protection level of the personalized contract.
In the embodiments of the present invention, a privacy-preserving data sharing scheme based on an asynchronous distributed GAN is proposed to address privacy issues in IoT data sharing. Combining differential privacy theory with a distributed GAN, a central generative model is trained in a personalized privacy-preserving manner using each data owner's local data set. The proposed distributed GAN training framework can use the data owners' local data sets to collaboratively train the central generative model, achieving data sharing without transmitting raw data; the central generative model can then be used to reconstruct data sets for downstream tasks. A gradient "desensitization" strategy based on differential privacy theory preserves the usability of gradients as much as possible while protecting user privacy, realizing model optimization under the guarantee of differential privacy. Designing multi-level privacy protection contracts for data owners with different privacy preferences, together with the proposed differential privacy level selection strategy, balances data availability against users' privacy protection needs and completes model training with minimal privacy consumption.
Those of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments can be completed by hardware, or by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

  1. A privacy-preserving data sharing method based on a distributed GAN, characterized in that the method is implemented by a privacy-preserving data sharing system based on a distributed GAN, the system comprising a central server and multiple data owners;
    the method comprising:
    S1. the central server providing multiple personalized contracts;
    S2. each of the multiple data owners selecting one personalized contract from the multiple personalized contracts;
    S3. each data owner pre-training its local generative adversarial network (GAN) model using the data owner's local private data set to obtain a pre-trained local GAN model;
    S4. the central server designing a privacy protection level selection strategy;
    S5. the multiple data owners optimizing the central generator model of the central server according to the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN models, completing privacy-preserving data sharing.
  2. The method according to claim 1, characterized in that the multiple personalized contracts in S1 include multiple privacy protection levels and rewards corresponding to the multiple privacy protection levels.
  3. The method according to claim 1, characterized in that, in S3, each data owner pre-training its local generative adversarial network (GAN) model using the data owner's local private data set to obtain the pre-trained local GAN model comprises:
    S31. each data owner obtaining the original GAN model from the central server;
    S32. each data owner pre-training the original GAN model using the data owner's local private data set to obtain the pre-trained local GAN model.
  4. The method according to claim 3, characterized in that the local GAN model includes a local generator and a local discriminator;
    after obtaining the pre-trained local GAN model in S32, the method further comprises:
    each data owner hiding the pre-trained local generator.
  5. The method according to claim 1, characterized in that, in S5, the multiple data owners optimizing the central generator model of the central server according to the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN models comprises:
    S51. the central server determining, according to the privacy protection level selection strategy, the privacy protection level ρ of the data owner assisting the current training round;
    S52. the central server obtaining, according to the privacy protection level ρ and the personalized contract selected by each data owner, the data owners among the multiple data owners whose privacy protection level is ρ;
    S53. the central server randomly selecting one data owner from the data owners with privacy protection level ρ as the assisting data owner;
    S54. the assisting data owner optimizing the central generator model of the central server according to the assisting data owner's pre-trained local GAN model, and after optimization returning to S51 for iterative training until the number of iterations reaches a preset threshold, at which point iteration stops and training of the central generator model is complete.
  6. The method according to claim 5, characterized in that, in S51, the central server determining, according to the privacy protection level selection strategy, the privacy protection level ρ of the data owner assisting the current training round comprises:
    S511. the central server determining a decay function for the noise scale according to the number of iterations in the central generator model training process;
    S512. the central server determining the noise scale according to the decay function;
    S513. the central server determining, according to the noise scale, the privacy protection level ρ of the data owner assisting the current training round.
  7. The method according to claim 5, characterized in that, in S54, the assisting data owner optimizing the central generator model of the central server according to the assisting data owner's pre-trained local GAN model comprises:
    S541. the assisting data owner obtaining data generated by the central generator model from the central server;
    S542. the assisting data owner updating the local discriminator in the pre-trained local GAN model according to the data generated by the central generator model and the assisting data owner's private data set;
    S543. the assisting data owner computing a gradient according to the updated local discriminator;
    S544. the assisting data owner perturbing the gradient based on personalized differential privacy theory to obtain a perturbed gradient;
    S545. the central server optimizing the central generator model of the central server according to the perturbed gradient.
  8. The method according to claim 7, characterized in that perturbing the gradient based on personalized differential privacy theory in S544 comprises:
    perturbing the gradient based on the Gaussian mechanism and a perturbation degree, wherein the perturbation degree is determined by the privacy protection level of the personalized contract.
  9. A privacy-preserving data sharing system based on a distributed GAN, characterized in that the system is used to implement a privacy-preserving data sharing method based on a distributed GAN, the system comprising a central server and multiple data owners, wherein:
    the central server is configured to provide multiple personalized contracts and to design a privacy protection level selection strategy;
    the multiple data owners are configured to: select one personalized contract from the multiple personalized contracts; pre-train the data owners' local generative adversarial network (GAN) models using their local private data sets to obtain pre-trained local GAN models; and optimize the central generator model of the central server according to the privacy protection level selection strategy, the personalized contract selected by each data owner, and the pre-trained local GAN models, completing privacy-preserving data sharing.
  10. The system according to claim 9, characterized in that the multiple data owners are further configured such that:
    S51. the central server determines, according to the privacy protection level selection strategy, the privacy protection level ρ of the data owner assisting the current training round;
    S52. the central server obtains, according to the privacy protection level ρ and the personalized contract selected by each data owner, the data owners among the multiple data owners whose privacy protection level is ρ;
    S53. the central server randomly selects one data owner from the data owners with privacy protection level ρ as the assisting data owner;
    S54. the assisting data owner optimizes the central generator model of the central server according to the assisting data owner's pre-trained local GAN model, and after optimization execution returns to S51 for iterative training until the number of iterations reaches a preset threshold, at which point iteration stops and training of the central generator model is complete.
PCT/CN2023/083568 2022-08-28 2023-03-24 A privacy-preserving data sharing method and system based on a distributed GAN WO2024045581A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211036310.XA CN115442099B (zh) 2022-08-28 2022-08-28 A privacy-preserving data sharing method and system based on a distributed GAN
CN202211036310.X 2022-08-28

Publications (1)

Publication Number Publication Date
WO2024045581A1 true WO2024045581A1 (zh) 2024-03-07

Family

ID=84244624

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/083568 WO2024045581A1 (zh) 2022-08-28 2023-03-24 A privacy-preserving data sharing method and system based on a distributed GAN

Country Status (3)

Country Link
CN (1) CN115442099B (zh)
LU (1) LU504296B1 (zh)
WO (1) WO2024045581A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117852627A * 2024-03-05 2024-04-09 Xiangjiang Laboratory A pre-trained model fine-tuning method and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115442099B (zh) 2022-08-28 2023-06-06 North China University of Technology A privacy-preserving data sharing method and system based on a distributed GAN

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160283735A1 (en) * 2015-03-24 2016-09-29 International Business Machines Corporation Privacy and modeling preserved data sharing
CN113255004A * 2021-06-16 2021-08-13 Dalian University of Technology A secure and efficient content caching method for federated learning
CN114841364A * 2022-04-14 2022-08-02 Beijing Institute of Technology A federated learning method satisfying personalized local differential privacy requirements
CN115442099A * 2022-08-28 2022-12-06 North China University of Technology A privacy-preserving data sharing method and system based on a distributed GAN

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704930B * 2017-09-25 2021-02-26 Advanced New Technologies Co., Ltd. Modeling method, apparatus, and system based on shared data, and electronic device
CN110348241B * 2019-07-12 2021-08-03 Zhejiang Lab A multi-center collaborative prognosis prediction system under a data sharing strategy

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160283735A1 (en) * 2015-03-24 2016-09-29 International Business Machines Corporation Privacy and modeling preserved data sharing
CN113255004A * 2021-06-16 2021-08-13 Dalian University of Technology A secure and efficient content caching method for federated learning
CN114841364A * 2022-04-14 2022-08-02 Beijing Institute of Technology A federated learning method satisfying personalized local differential privacy requirements
CN115442099A * 2022-08-28 2022-12-06 North China University of Technology A privacy-preserving data sharing method and system based on a distributed GAN

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117852627A * 2024-03-05 2024-04-09 Xiangjiang Laboratory A pre-trained model fine-tuning method and system

Also Published As

Publication number Publication date
LU504296B1 (en) 2024-03-08
CN115442099A (zh) 2022-12-06
CN115442099B (zh) 2023-06-06

Similar Documents

Publication Publication Date Title
Wu et al. An adaptive federated learning scheme with differential privacy preserving
Zhao et al. Privacy-preserving collaborative deep learning with unreliable participants
WO2024045581A1 (zh) 一种基于分布式gan的隐私保护数据共享方法及系统
Liu et al. Fedcoin: A peer-to-peer payment system for federated learning
Chen et al. Fedgraph: Federated graph learning with intelligent sampling
Lee et al. Digestive neural networks: A novel defense strategy against inference attacks in federated learning
Kang et al. Privacy-preserving federated adversarial domain adaptation over feature groups for interpretability
Miao et al. Federated deep reinforcement learning based secure data sharing for Internet of Things
CN112668877B (zh) 结合联邦学习和强化学习的事物资源信息分配方法及系统
CN114417427B (zh) 一种面向深度学习的数据敏感属性脱敏系统及方法
CN113297175A (zh) 数据处理方法、装置、系统和可读存储介质
Joachims et al. Recommendations as treatments
Sun et al. The QoS and privacy trade-off of adversarial deep learning: an evolutionary game approach
Cheng et al. Dynamic games for social model training service market via federated learning approach
Alnajar et al. Tactile internet of federated things: Toward fine-grained design of FL-based architecture to meet TIoT demands
CN117390664A (zh) 面向联邦学习的博弈驱动隐私自适应定价方法和装置
CN112101555A (zh) 多方联合训练模型的方法和装置
Luo et al. Privacy-preserving clustering federated learning for non-IID data
Wang et al. Variance of the gradient also matters: Privacy leakage from gradients
CN114723012A (zh) 基于分布式训练系统的计算方法和装置
CN113657611A (zh) 联合更新模型的方法及装置
Wu et al. Federated Split Learning with Data and Label Privacy Preservation in Vehicular Networks
Shen et al. Simultaneously advising via differential privacy in cloud servers environment
Zhang et al. FedMPT: Federated Learning for Multiple Personalized Tasks Over Mobile Computing
Li Adaptive image restoration by a novel neuro-fuzzy approach using complex fuzzy sets

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23858623

Country of ref document: EP

Kind code of ref document: A1