CN115454949A - Shared data determination method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115454949A
CN115454949A (application CN202210892219.1A)
Authority
CN
China
Prior art keywords: data, record data, network, discriminator, record
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210892219.1A
Other languages
Chinese (zh)
Inventor
苏森
程祥
王振亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202210892219.1A
Publication of CN115454949A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 — File systems; File servers
    • G06F 16/17 — Details of further file system functions
    • G06F 16/176 — Support for shared access to files; File sharing support
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods
    • G06N 3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The application provides a shared data determination method and apparatus, an electronic device, and a storage medium. The method includes: receiving the generated record data and the sensitive record data of the current batch; updating the local discriminator network according to the generated record data and the sensitive record data of the current batch; constructing a local discriminator response using the updated local discriminator network, and, at the data sharing platform, training the relationship discriminators on pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator responses, and updating the generator network; inputting pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of generated record data; and constructing the target shared data according to the weight of each generated record data. The method and apparatus can realize sharing of vertically partitioned data while avoiding privacy disclosure, and ensure that the shared data has high usability.

Description

Shared data determination method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for determining shared data, an electronic device, and a storage medium.
Background
In the related art, to share vertically partitioned data, each data owner generally builds a generative model from its own local data set, generates data with the learned model, and the data generated by the parties is finally integrated into a shared data set. However, because the IDs of the respective local data sets do not match, the shared data produced in this way has poor usability.
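To make the ID-mismatch problem concrete, the following sketch shows why independently generated local data sets cannot be joined back into usable integrated records. The IDs and attribute values are invented purely for illustration:

```python
# Two owners hold different attributes of the same users, keyed by user ID.
hospital = {101: "diabetes", 102: "asthma", 103: "flu"}          # owner A
bank = {101: "low risk", 102: "high risk", 103: "low risk"}      # owner B

# Independently trained generative models emit records under fresh,
# unaligned IDs (simulated here by simply renumbering the values).
synthetic_hospital = {200 + i: v for i, v in enumerate(hospital.values())}
synthetic_bank = {300 + i: v for i, v in enumerate(bank.values())}

# The real data joins on every ID; the independently generated data on none.
real_overlap = hospital.keys() & bank.keys()
synthetic_overlap = synthetic_hospital.keys() & synthetic_bank.keys()
```

Because no synthetic ID appears on both sides, no integrated record can be formed, which is exactly the usability gap described above.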
Disclosure of Invention
In view of the above, an object of the present application is to provide a method, an apparatus, an electronic device, and a storage medium for determining shared data.
To this end, in a first aspect, the present application provides a method for determining shared data, including:
S1: receiving the generated record data and the sensitive record data of the current batch;
S2: updating the local discriminator network according to the generated record data and the sensitive record data of the current batch;
S3: constructing a local discriminator response using the updated local discriminator network, and, at the data sharing platform, training the relationship discriminators on pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator responses, and updating the generator network;
S4: inputting pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of generated record data;
S5: constructing the target shared data according to the weight of each generated record data.
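The control flow of steps S1-S5 can be sketched as follows, with scalar toy stand-ins for all networks. The class names and update rules are invented for illustration only and are not the patent's implementation:

```python
# Toy stand-ins: a "generator" and "discriminator" reduced to single scalars.
class ToyGenerator:
    def __init__(self):
        self.theta = 0.0
    def __call__(self, z):
        return self.theta + z
    def update(self, response):
        self.theta += 0.1 * response

class ToyDiscriminator:
    def __init__(self):
        self.w = 0.0
    def update(self, fake, real):
        self.w += 0.01 * (real - fake)
    def respond(self, fake):
        return self.w * fake

def run_sharing_rounds(n_iters=3, z=1.0, real=1.0):
    g, d = ToyGenerator(), ToyDiscriminator()
    generated_sets = []
    for _ in range(n_iters):
        fake = g(z)                      # S1: generated record of the batch
        d.update(fake, real)             # S2: update local discriminator
        g.update(d.respond(fake))        # S3: generator update from response
        generated_sets.append(g(z))      # S4: collect generated records
    weights = [1.0 / n_iters] * n_iters  # S5: equal weights in this sketch
    shared = sum(w * x for w, x in zip(weights, generated_sets))
    return shared, generated_sets

shared, generated_sets = run_sharing_rounds()
```

The real method assigns per-record weights rather than the equal weights used here; the sketch only fixes the order of operations.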
In one possible implementation, receiving the generated record data and the sensitive record data of the current batch includes:
receiving the generated record data of the current batch from the generator network;
and sampling a pre-acquired sensitive data set to obtain the sensitive record data of the current batch.
In one possible implementation, updating the local discriminator network according to the generated record data and the sensitive record data of the current batch includes:
determining the loss function of the current batch from the generated record data and the sensitive record data using the local discriminator network;
determining the gradient information of the local discriminator network from the loss function, and clipping the gradient information;
perturbing the clipped gradient information with Gaussian noise sampled from a Gaussian distribution according to an adaptive noise generation technique, so as to determine the update parameters;
and updating the local discriminator network according to the update parameters.
In a possible implementation, before determining the loss function of the current batch from the generated record data and the sensitive record data with the local discriminator network, the method further includes: initializing the parameters of the discriminator network, the parameters including a first-order momentum estimate $\hat{m}$ and a second-order momentum estimate $\hat{v}$.
In a possible implementation, perturbing the clipped gradient information with Gaussian noise sampled from the Gaussian distribution to determine the update parameters includes: updating the first-order momentum estimate and the second-order momentum estimate according to the update formulas

$$\hat{m}_t = \beta_1 \hat{m}_{t-1} + (1-\beta_1)\,g_t, \qquad \hat{v}_t = \beta_2 \hat{v}_{t-1} + (1-\beta_2)\,g_t^2,$$

where $\hat{m}_t$ denotes the updated first-order momentum estimate, $\hat{v}_t$ denotes the updated second-order momentum estimate, $\beta_1$ denotes the first decay rate, $\beta_2$ denotes the second decay rate, $\hat{m}_{t-1}$ and $\hat{v}_{t-1}$ denote the first- and second-order momentum estimates of round $t-1$, and $g_t$ denotes the gradient vector of the discriminator in round $t$.
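The two momentum recursions above are the standard Adam moment updates; a minimal sketch, with typical default values assumed for the decay rates:

```python
def update_moments(m_prev, v_prev, grad, beta1=0.9, beta2=0.999):
    """Return the updated first- and second-order momentum estimates."""
    m = beta1 * m_prev + (1.0 - beta1) * grad        # first-order estimate
    v = beta2 * v_prev + (1.0 - beta2) * grad ** 2   # second-order estimate
    return m, v

# One update starting from zero moments with gradient 2.0.
m, v = update_moments(0.0, 0.0, 2.0)
```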
In one possible implementation, inputting the pre-collected random vectors into the updated generator network to obtain the generated record data set includes:
inputting a pre-collected random vector into the updated generator network, and determining the generated record data with the updated generator network;
and repeatedly executing steps S1-S3 until the number of iterations reaches a threshold, and collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.
In a possible implementation, constructing the target shared data according to the weight of each generated record data further includes:
saving the weight of the generated record data determined by the generator network in each iteration;
inputting hidden vectors drawn from a prior distribution into the generator network in each iteration to determine a plurality of synthetic record data;
and assigning a weight to each synthetic record data. The update formula for the weight in each iteration (given as an equation image in the original filing) is expressed in terms of the weight $w_{ri}$, the synthetic record data $\hat{x}_i$, the generated record $\tilde{x}_{ri}$ of the $r$-th generator, the number $R$ of selected generator networks, the distance function $d_j$, and the feature number $M$. The update formula for the synthetic record data in each iteration (likewise an equation image) is expressed in terms of the generated record $\tilde{x}_{ri}$ of the $r$-th generator.
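Step S5's aggregation can be sketched as a weighted combination of the candidate records produced by the $R$ generators. The exact weighting formula in the filing is not reproduced; normalized weights and a plain weighted average are assumed here purely for illustration:

```python
def aggregate(records, weights):
    """Weighted average of R candidate records for one shared record.

    `records` holds one candidate value per generator; `weights` holds the
    (unnormalized) weight assigned to each candidate.
    """
    total = sum(weights)
    norm = [w / total for w in weights]          # normalize the weights
    return sum(w * r for w, r in zip(norm, records))

# Three generators propose 1.0, 2.0, 4.0; the first is trusted twice as much.
shared_value = aggregate([1.0, 2.0, 4.0], [2.0, 1.0, 1.0])
```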
In a second aspect, the present application provides a shared data determining apparatus, including:
a receiving module configured to receive the generated record data and the sensitive record data of the current batch;
a first updating module configured to update the local discriminator network according to the generated record data and the sensitive record data of the current batch;
a second updating module configured to construct a local discriminator response using the updated local discriminator network, and, at the data sharing platform, to train the relationship discriminators on pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator responses, and to update the generator network;
a determining module configured to input pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of generated record data;
and a construction module configured to construct the target shared data according to the weight of each generated record data.
In a third aspect, the present application provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the method for determining shared data according to the first aspect.
In a fourth aspect, the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the shared data determination method according to the first aspect.
As can be seen from the foregoing, the shared data determination method and apparatus, electronic device, and storage medium provided by the present application receive the generated record data and the sensitive record data of the current batch; update the local discriminator network accordingly; construct a local discriminator response with the updated local discriminator network, train the relationship discriminators at the data sharing platform on pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator responses, and update the generator network; input pre-collected random vectors into the updated generator network to obtain a generated record data set comprising a plurality of generated record data; and construct the target shared data according to the weight of each generated record data. Sharing of vertically partitioned data can thus be realized while avoiding privacy disclosure, and the resulting shared data has high usability.
Drawings
To illustrate the technical solutions of the present application or the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram illustrating a multi-party data sharing scenario in the related art.
Fig. 2 shows an exemplary flowchart of a shared data determination method provided in an embodiment of the present application.
FIG. 3 shows a schematic diagram of a multi-party data sharing algorithm satisfying differential privacy according to an embodiment of the application.
Fig. 4 (a) shows an IS score comparison of various algorithms according to embodiments of the present application on the first MNIST dataset.
Fig. 4 (b) shows an FID score comparison of various algorithms according to embodiments of the present application on the first MNIST dataset.
Fig. 4 (c) shows an IS score comparison of various algorithms according to embodiments of the present application on the second MNIST dataset.
Fig. 4 (d) shows an FID score comparison of various algorithms according to embodiments of the present application on the second MNIST dataset.
Fig. 5 (a) shows a comparison graph of the results of a first experiment of various algorithms according to embodiments of the present application at different privacy budgets.
Fig. 5 (b) shows a comparison graph of the results of a second experiment with different privacy budgets for various algorithms according to embodiments of the present application.
Fig. 6 (a) shows a comparison graph of the results of the first experiment of various algorithms according to the embodiments of the present application at different numbers of participants.
Fig. 6 (b) shows a comparison of results of a second experiment with various algorithms according to embodiments of the present application at different numbers of participants.
Fig. 7 shows an exemplary structural diagram of a shared data determining apparatus provided in an embodiment of the present application.
Fig. 8 shows an exemplary structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the present application belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application do not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item preceding the word comprises the element or item listed after the word and its equivalent, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
As described in the background section, data sharing helps unlock the economic value hidden in data. As reported by McKinsey, if data in seven industries (commerce, finance, medical health, education, transportation, electric power, and oil and natural gas) were mutually disclosed, great economic benefit would follow. Data sharing also helps mine the knowledge contained in data. For example: data of hospitals and disease control centers, once shared, can be used to analyze disease propagation patterns and thereby improve public health; data of multiple shopping platforms, once shared, can support personalized commodity recommendation and improve the shopping experience of consumers; and data of multiple banks, once shared, enables better credit assessment of customers and monitoring of multi-party loan and financial fraud. In general, however, data is distributed among multiple data owners (for example, residents' medical records in hospitals and customers' account records in banks) and contains a large amount of sensitive information; directly sharing the data of different owners may cause serious privacy disclosure.
Referring to fig. 1, the scenario mainly involves three roles: a data sharing platform, data owners, and data consumers. Each data owner holds a local sensitive data set covering different attributes of the same group of users. The sharing platform assists in sharing the local sensitive data sets and builds a data sharing model; the new shared data generated by the model has the same statistical distribution characteristics as the integrated data set, and because the local sensitive data sets are never shared directly, the privacy of each owner's data is protected to a certain extent. Data consumers can then use the shared data for various data analysis and mining tasks.
As can be seen from the above process, the data sharing model and the final shared data obtained by the data consumer avoid direct disclosure of the data owners' privacy; however, the private data of each owner may still leak while the multi-party data sharing model is being formed. Specifically, for each local sensitive data set, three roles may pose a privacy threat: 1) the data sharing platform; 2) the other data owners participating in data sharing; and 3) data consumers or other potential attackers who may obtain the final shared data.
Privacy-preserving multi-party data sharing technology offers a feasible way to address the privacy disclosure caused by multi-party data sharing. Differential Privacy (DP), proposed in recent years, provides such a solution. Unlike traditional anonymity-based privacy models (e.g., k-anonymity [1] and l-diversity), differential privacy provides a strict, quantifiable guarantee for sensitive data: by adding an appropriate amount of noise to a statistical result, it ensures that adding or removing a single record in the data set has no obvious influence on the result, thereby meeting the requirement of privacy protection.
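The noise-addition idea can be illustrated with a toy Laplace-mechanism count query. This is a minimal sketch of the general DP principle, not the patent's mechanism; the epsilon value and random seed are arbitrary:

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via inverse-CDF transform sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, rng):
    """A count query has sensitivity 1, so the noise scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

# The released count is close to, but not exactly, the true count of 1000.
noisy = dp_count(1000, epsilon=1.0, rng=random.Random(0))
```

Smaller epsilon means larger noise and stronger privacy; the trade-off between epsilon and data usability is exactly what the experiments in figs. 5 (a)-(b) vary.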
The applicant has found through research that, in the related art, an intuitive method for sharing vertically partitioned data is the following: each data owner builds a generative model such as a GAN or a Bayesian network from its own local data set, generates data with the learned model, and the data generated by the parties is finally integrated into a shared data set. However, because the IDs of the respective local data sets do not match, the usability of the resulting data is hard to guarantee.
Therefore, the shared data determination method and apparatus, electronic device, and storage medium provided by the present application receive the generated record data and the sensitive record data of the current batch; update the local discriminator network accordingly; construct a local discriminator response with the updated local discriminator network, train the relationship discriminators at the data sharing platform on pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator responses, and update the generator network; input pre-collected random vectors into the updated generator network to obtain a generated record data set comprising a plurality of generated record data; and construct the target shared data according to the weight of each generated record data. Sharing of vertically partitioned data can thus be realized while avoiding privacy disclosure, and the resulting shared data has high usability.
The method for determining shared data provided in the embodiments of the present application is specifically described below by using specific embodiments.
Referring to fig. 2, a method for determining shared data provided in the embodiment of the present application specifically includes the following steps:
S1: receiving the generated record data and the sensitive record data of the current batch;
S2: updating the local discriminator network according to the generated record data and the sensitive record data of the current batch;
S3: constructing a local discriminator response using the updated local discriminator network, and, at the data sharing platform, training the relationship discriminators on pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator responses, and updating the generator network;
S4: inputting pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of generated record data;
S5: constructing the target shared data according to the weight of each generated record data.
In some embodiments, the present application may be applied to a scenario with multiple data owners and a data sharing platform; for example, each data owner k holds one discriminator network, and the data sharing platform holds one generator network G and two relationship discriminators.
In some embodiments, a hidden variable z (typically random noise following a Gaussian distribution) produces generated samples through the generator network G. For the discriminator D this is a binary classification problem, and V(D, G) is the cross-entropy loss common in binary classification. To ensure that V(D, G) attains its maximum, the discriminator can be trained for k iterations, followed by one iteration of the generator. The specific training process is as follows:
The parameters of both the generator G and the discriminator D are initialized. N samples are drawn from the training set, and the generator produces n samples from the defined noise distribution. With the generator G fixed, the discriminator D is trained to distinguish real from generated samples as well as possible. After k updates of the discriminator D, the generator G is updated once, so that the discriminator becomes as unable as possible to distinguish real from generated samples. After many update iterations, in the ideal state, the final discriminator D cannot tell whether a sample comes from the real training set or from the generator G; the discrimination probability is then 0.5, and training is complete.
In some embodiments, the DPGDAN algorithm presented in this application involves two phases. In phase 1, the data sharing platform trains jointly with the K data owners; specifically, the discriminators and the generator are trained alternately. Each data owner updates the parameters of its own discriminator using an adaptive gradient perturbation method, and the data sharing platform updates the parameters of its generator network using the received discriminator feedback, which satisfies differential privacy. In phase 2, random vectors are sampled from a prior distribution and passed to the generator network obtained in training to construct synthetic records, which are aggregated into the final shared data records.
In some embodiments, the DPGDAN algorithm involves training four kinds of neural networks: the local discriminator network held by each data owner, the generator network held by the data sharing platform, and two relationship discriminator networks. In particular, each local discriminator network is used to distinguish local sensitive data records from generator-generated records, while the generator network is dedicated to generating synthetic records that are considered "true" by the K discriminator networks. Given these different objectives, the objective function of the generator can be characterized as follows:
The generator's objective function (given as an equation image in the original filing) is expressed in terms of the batch size $m$, the real-record relationship discriminator, the synthetic-record relationship discriminator, the generator's integrated record $\tilde{x}_i$, and the generator's partial records $\tilde{x}_{iA}$ and $\tilde{x}_{iB}$ with respect to owners A and B.
The objective function of the synthetic-record relationship discriminator (given as an equation image in the original filing) is expressed in terms of the batch size $m$, the real-record relationship discriminator, the synthetic-record relationship discriminator, the integrated record after the obfuscation operation, the gradient with respect to the entire record, and the integrated record generated by the generator.
The objective function of the real-record relationship discriminator (given as an equation image in the original filing) is expressed in terms of the batch size $m$, the real-record relationship discriminator, the synthetic-record relationship discriminator, the integrated record data $x_i$, and the obfuscated integrated record $\bar{x}_i$.
Two local discriminators, $D_A$ and $D_B$, are used to distinguish real local record data from generated local record data. Assuming the standard cross-entropy form described above, their objective functions can be written respectively as

$$\mathcal{L}_{D_A} = \frac{1}{m}\sum_{i=1}^{m}\Big[\log D_A(x_{iA}) + \log\big(1 - D_A(\tilde{x}_{iA})\big)\Big],$$
$$\mathcal{L}_{D_B} = \frac{1}{m}\sum_{i=1}^{m}\Big[\log D_B(x_{iB}) + \log\big(1 - D_B(\tilde{x}_{iB})\big)\Big],$$

where $x_{iA}$ and $x_{iB}$ denote the local record data, and $\tilde{x}_{iA}$ and $\tilde{x}_{iB}$ denote the generated local records.
In some embodiments, updating the generator network requires gradient descent on its objective function; however, that gradient contains sensitive information from the discriminators' feedback. To update the generator without disclosing any data owner's privacy, the generator's gradient computation is first decomposed, the sensitive information related to the discriminator networks is identified, and that information is then desensitized.
Based on the chain rule of calculus, the gradient computation of the generator can be decomposed as

$$\nabla_{\theta_G}\mathcal{L} \;=\; \sum_{k=1}^{K}\frac{\partial \mathcal{L}}{\partial y_k}\cdot\frac{\partial y_k}{\partial \theta_G},$$

where $\mathcal{L}$ denotes the objective function, $K$ the number of local discriminators, $y_k$ the generator output fed to discriminator $k$, $\partial\mathcal{L}/\partial y_k$ the discriminator-network response (which contains sensitive information), and $\partial y_k/\partial \theta_G$ the non-sensitive computation factor.
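The point of the decomposition is that only one factor of the product carries sensitive information. A scalar numeric sketch (all values invented for illustration):

```python
def generator_gradient(dL_dy, dy_dtheta):
    """Chain rule: the generator gradient factors into a sensitive
    discriminator-side term dL/dy and a non-sensitive generator-side
    Jacobian dy/dtheta. Only dL_dy needs desensitization."""
    return dL_dy * dy_dtheta

# Example: generator output y = theta * z with z = 2.0, so dy/dtheta = z.
z = 2.0
dL_dy = -0.5            # sensitive feedback coming from a discriminator
grad = generator_gradient(dL_dy, dy_dtheta=z)
```

In the privacy-preserving scheme, noise would be applied to `dL_dy` before this multiplication, leaving `dy_dtheta` untouched.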
In some embodiments, determining the loss function of the current batch with the local discriminator network further includes: initializing the parameters of the discriminator network, the parameters including a first-order momentum estimate $\hat{m}$ and a second-order momentum estimate $\hat{v}$.
In some embodiments, receiving the generated record data and the sensitive record data of the current batch includes: receiving the generated record data of the current batch from the generator network; and sampling a pre-acquired sensitive data set to obtain the sensitive record data of the current batch.
In some embodiments, updating the local discriminator network according to the generated record data and the sensitive record data of the current batch includes: determining the loss function of the current batch from the generated record data and the sensitive record data using the local discriminator network; determining the gradient information of the local discriminator network from the loss function, and clipping the gradient information; perturbing the clipped gradient information with Gaussian noise sampled from a Gaussian distribution according to the adaptive noise generation technique, so as to determine the update parameters; and updating the local discriminator network according to the update parameters.
After decomposing the gradient calculation, the following steps can be iterated repeatedly to desensitize the discriminator-network feedback used to update the generator-network parameters, providing privacy protection.

Step 1: Initialize the discriminator-network parameters, the first-order momentum estimate m_0, the second-order momentum estimate v_0, and the noise scale σ_0.

Step 2: Each data owner receives a batch of synthetic record data from the generator network.

Step 3: Each data owner locally samples a batch of sensitive record data.

Step 4: Input the data records obtained in steps 2 and 3 into the discriminator network and compute its loss function:

L_D = E_k[D(G(z_k))] − E_k[D(x_k)]

where E_k denotes the mean over the batch, D is the discriminator network, G is the generator network, z_k is a hidden vector, and x_k is a real data record.

Step 5: Compute the gradient information of the discriminator network from the loss function.

Step 6: Following the adaptive noise generation method, sample Gaussian noise from a Gaussian distribution to perturb the pruned gradient:

g_t ← g_t + N(0, σ_t^2 I)

where g_t is the pruned gradient of round t and σ_t is the current noise scale.

Step 7: Update the first-order momentum estimate m_t and the second-order momentum estimate v_t:

m_t = β_1 · m_{t−1} + (1 − β_1) · g_t
v_t = β_2 · v_{t−1} + (1 − β_2) · g_t^2

where m_t denotes the updated first-order momentum estimate, v_t denotes the updated second-order momentum estimate, β_1 denotes the first decay rate, β_2 denotes the second decay rate, m_{t−1} and v_{t−1} denote the estimates of round t−1, and g_t denotes the (perturbed) gradient vector of the discriminator in round t.

Step 8: Update the weights of the discriminator network.
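The momentum updates of step 7 are the standard first- and second-moment recursions applied to the perturbed gradient; a minimal list-based sketch (the default decay rates are illustrative):

```python
def momentum_update(m_prev, v_prev, g_t, beta1=0.9, beta2=0.999):
    """One round of step 7: update the first-order estimate m_t and the
    second-order estimate v_t from the perturbed gradient g_t."""
    m_t = [beta1 * m + (1 - beta1) * g for m, g in zip(m_prev, g_t)]
    v_t = [beta2 * v + (1 - beta2) * g * g for v, g in zip(v_prev, g_t)]
    return m_t, v_t

m_t, v_t = momentum_update([0.0], [0.0], [2.0])
# m_t[0] ≈ (1 - 0.9) * 2.0 = 0.2;  v_t[0] ≈ (1 - 0.999) * 4.0 = 0.004
```

The step-8 weight update then uses these estimates in the usual way for a momentum-based optimizer.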
In some embodiments, inputting the pre-collected random vector into the updated generator network to obtain the generated record data set includes: inputting a pre-collected random vector into the updated generator network and determining generated record data by using the updated generator network; and repeatedly executing steps S1-S3 until the number of iterations reaches a threshold, and collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.
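The repeat-until-threshold loop above can be sketched as follows, where `run_iteration` is a hypothetical stand-in for one execution of steps S1-S3 that returns the current generator:

```python
def collect_generator_snapshots(run_iteration, num_iterations, latent_batch):
    """Repeat steps S1-S3 until the iteration count reaches the threshold,
    recording the generator's outputs after each iteration.
    run_iteration (hypothetical) performs one S1-S3 round and returns the
    current generator as a function from a latent vector to a record."""
    snapshots = []
    for _ in range(num_iterations):
        generator = run_iteration()
        snapshots.append([generator(z) for z in latent_batch])
    return snapshots

# Toy run: the generator of round r simply scales the latent value by r.
counter = {"round": 0}
def run_iteration():
    counter["round"] += 1
    r = counter["round"]
    return lambda z: r * z

snaps = collect_generator_snapshots(run_iteration, 3, [1.0, 2.0])
# snaps == [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
```

Each inner list is one generator snapshot's output for the same latent batch; the snapshot aggregation described below consumes exactly this kind of collection.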
In some embodiments, the specific steps of snapshot aggregation include:
storing the weight information of the generator network obtained by different iterations;
feeding the hidden vector extracted from the prior distribution into a generator network to obtain corresponding synthetic record data;
assigning a weight to each of the synthetic record data;
and updating the weights and the shared data according to an updating formula.
Specifically, constructing the target shared data according to the weight of each generated record data further includes: saving the weight (parameter) information of the generator network obtained in each iteration; inputting hidden vectors sampled from a prior distribution into the generator network of each iteration to determine a plurality of synthetic record data; and assigning a weight to each of the synthetic record data. In the weight-update formula applied in each iteration, w_ri denotes the weight, x_i denotes the synthetic record data, x_ri denotes the generated record of the r-th generator, R denotes the number of generator networks selected, d_j denotes the distance function, and M denotes the number of features. The update formula for the composite record data in each iteration combines the generated records of the R generators according to their weights, where x_ri again denotes the generated record of the r-th generator.
The snapshot aggregation method is based on the following idea: ideally, a trained generator can reproduce the true data distribution. Shared records are synthesized following standard GAN practice, i.e. feeding the trained generator network hidden vectors z_i sampled from the prior distribution and taking the generator's outputs as the shared data. However, because the privacy budget is limited, the generator network and the discriminator network may not be trained for a sufficiently long time, so using only the final trained generator network to produce the shared records would ignore the useful information carried by the generators obtained during training. To this end, the present application proposes a snapshot aggregation method that uses the generator networks obtained during training to improve the utility of the final shared data.
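The aggregation step — combining the records produced by R saved generator snapshots according to their weights — can be sketched as below; the plain weighted average and the example weights are illustrative assumptions, not the application's exact update formula:

```python
def aggregate_snapshots(candidates, weights):
    """Combine the synthetic records produced by R generator snapshots
    into a single shared record as a per-feature weighted average.
    candidates: list of R feature vectors; weights: R weights summing to 1."""
    num_features = len(candidates[0])
    return [sum(w * rec[j] for w, rec in zip(weights, candidates))
            for j in range(num_features)]

# Two snapshots, weighted equally.
shared = aggregate_snapshots([[0.0, 2.0], [2.0, 4.0]], [0.5, 0.5])
# shared == [1.0, 3.0]
```

In the application's scheme the weights themselves are updated per iteration from per-feature distances, so snapshots whose records lie closer to the consensus would contribute more.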
For the comparison, the widely used real data sets Adult and Bank are used for experimental verification. Table 2 lists the hyper-parameter values used in the experiments. The Adult data set comprises 48,842 U.S. census records, and the Bank data set contains 45,211 account records from a Portuguese banking institution. The statistics of both data sets are shown in Table 1 below.
TABLE 1 Adult and Bank data set statistical analysis
Figure BDA0003768040520000125
TABLE 2 Superparameter settings
Figure BDA0003768040520000126
Figure BDA0003768040520000131
The method is evaluated following the standard practice for data-sharing tasks: the utility of the shared data is measured by its effectiveness for machine learning. Specifically, we first train a predictive model on the shared data, and then test the trained model on a real test set. The higher the accuracy of the predictive model, the better the data utility.
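This train-on-shared, test-on-real protocol works with any prediction model; the sketch below uses a toy nearest-centroid classifier in place of the prediction models used in the experiments (all names are illustrative):

```python
def fit_centroids(X, y):
    """Train a toy nearest-centroid classifier on (synthetic) shared data."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the nearest centroid (squared L2)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)))

def utility(shared_X, shared_y, test_X, test_y):
    """Accuracy on real test data of a model trained only on shared data."""
    centroids = fit_centroids(shared_X, shared_y)
    hits = sum(predict(centroids, x) == t for x, t in zip(test_X, test_y))
    return hits / len(test_y)

# Shared (synthetic) training data versus a held-out 'real' test set.
acc = utility([[0.0], [0.1], [1.0], [0.9]], [0, 0, 1, 1],
              [[0.05], [0.95]], [0, 1])
# acc == 1.0
```

In the actual experiments the same protocol is applied with stronger prediction models; only the measured accuracy changes, not the evaluation pipeline.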
The effectiveness and performance of the method of the present application are illustrated by comparing its accuracy results with those of the prior art. Referring to fig. 3, the effectiveness of the shared data determination algorithm provided by the present application is demonstrated. For the experimental results, refer to figs. 4(a)-4(d), 5(a)-5(b), and 6(a)-6(b), in which the DPGDAN algorithm denotes the algorithm provided in the embodiments of the present application, while the Nonprivat algorithm and the Nosplit algorithm are algorithms in the related art. As can be seen from these figures, the present application achieves performance close to that of the comparison algorithms under different prediction models, different privacy budgets, and different numbers of participants, and better preserves the distribution characteristics of the original data under a given privacy budget. It should be noted that the comparison algorithms used in the present application only indicate an upper performance bound and cannot be directly applied to the privacy-protected vertically partitioned data sharing task.
As can be seen from figs. 4(a)-4(d), the shared data generated by the present application can well support a variety of data prediction or machine learning tasks. As can be seen from figs. 5(a)-5(b), the present application maintains good performance under different privacy budgets, because the snapshot aggregation method ensures a better balance between privacy and data utility, and the learning paradigm of the present application also uses the data of all parties to guide model updates. In addition, the 1-to-K design of the present application (one generator served by K discriminators) makes it more likely that the generator network receives informative signals from the discriminator networks, thereby better guiding the generator towards the true data distribution.
As can be seen from figs. 6(a)-6(b), the accuracy of the present application decreases only slightly compared with the baseline. The reason is that increasing the number of discriminators reduces the variance of the min-max objective function throughout training and increases the convergence speed.
As can be seen from the foregoing, the shared data determination method, apparatus, electronic device and storage medium provided by the present application receive the generated record data and the sensitive record data of the current batch; update the local discriminator network according to the generated record data and the sensitive record data of the current batch by using the local discriminator network; construct a local discriminator response by using the updated local discriminator network, train a relation discriminator by using a data sharing platform according to pre-acquired real integrated record training data, synthetic integrated record training data and the discriminator responses, and update the generator network; input a pre-collected random vector into the updated generator network to obtain a generated record data set comprising a plurality of generated record data; and construct target shared data according to the weight of each generated record data. Vertically partitioned data sharing can thus be realized without privacy disclosure, and the resulting shared data has high utility.
In order to solve the problem, mentioned in the background art, that unilateral data publishing methods satisfying differential privacy cannot be directly applied to vertically partitioned data sharing, the applicant proposes a multi-party data sharing method satisfying differential privacy (the DPGDAN algorithm). The main idea is that each data owner and the data sharing platform jointly train a customized generative adversarial network (GAN) to extract the distribution information of each local sensitive data set. Specifically, each data owner holds a discriminator and trains it on its local sensitive data set; the discriminator's feedback information is then desensitized and uploaded to the data sharing platform, which updates the parameters of the generator using the collected discriminator feedback and the feedback of the relation discriminator. After training is completed, the generator generates the shared data with the aid of the discriminators. Because feedback information from every data owner is used when updating the generator, the finally shared data has high utility.
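One training round of this protocol can be caricatured as follows; the `Owner` class, the zero noise, and the feedback-averaging "generator update" are all illustrative stand-ins for the real discriminator training, desensitization, and generator optimization:

```python
class Owner:
    """Toy data owner holding a local 'discriminator'. Desensitization is
    modelled as additive noise (set to 0.0 here so the run is deterministic);
    all names and the scoring rule are illustrative."""
    def __init__(self, noise=0.0):
        self.noise = noise
        self.score = 0.0

    def update_discriminator(self, synthetic_batch):
        # stand-in for training the discriminator on local sensitive data
        self.score = sum(synthetic_batch) / len(synthetic_batch)

    def desensitized_response(self):
        # feedback is perturbed before leaving the owner's premises
        return self.score + self.noise

def train_round(owners, synthetic_batch, generator_state):
    """Each owner updates locally and uploads desensitized feedback;
    the platform aggregates the feedback to update the generator."""
    feedback = []
    for owner in owners:
        owner.update_discriminator(synthetic_batch)
        feedback.append(owner.desensitized_response())
    return generator_state + sum(feedback) / len(feedback)

state = train_round([Owner(), Owner()], [1.0, 3.0], generator_state=0.0)
# state == 2.0
```

The key structural point mirrored here is that raw sensitive data never leaves an owner: only the desensitized discriminator feedback reaches the platform-side generator update.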
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that the description describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the described embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 7 shows an exemplary structural diagram of a shared data determination apparatus provided in an embodiment of the present application.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a shared data determining device.
Referring to fig. 7, the shared data determination apparatus includes: the device comprises a receiving module, a first updating module, a second updating module, a determining module and a constructing module; wherein,
the receiving module is configured to receive the generated record data and the sensitive record data of the current batch;
a first updating module configured to update the local arbiter network according to the generated record data and the sensitive record data of the current batch by using the local arbiter network;
the second updating module is configured to construct a local discriminator response by using the updated local discriminator network, train a relation discriminator by using the data sharing platform according to pre-acquired real integrated record training data, synthetic integrated record training data and the discriminator responses, and update the generator network;
a determining module configured to input a pre-collected random vector into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of generated record data;
a construction module configured to construct target shared data according to the weight of each of the generated record data.
In one possible implementation, the receiving module is further configured to:
receiving the generation record data for a current batch from the generator network;
and sampling according to a pre-acquired sensitive data set to obtain the sensitive record data of the current batch.
In one possible implementation, the first update module is further configured to:
determining a current batch loss function according to the generated record data and the sensitive record data of the current batch by using the local arbiter network;
determining gradient information of the local discriminator network according to the current batch loss function, and pruning the gradient information;
according to the self-adaptive noise generation technology, gaussian noise obtained by sampling from Gaussian distribution is used for disturbing the gradient information after pruning so as to determine updating parameters;
and updating the local arbiter network according to the updated parameters.
In one possible implementation manner, the apparatus further includes an initialization module;
the initialization module is further configured to:
initialize parameters of the discriminator network, the parameters of the discriminator network comprising a first-order momentum estimate m and a second-order momentum estimate v.
In one possible implementation, the second update module is further configured to:
update the first-order momentum estimate m_t and the second-order momentum estimate v_t according to an update formula to determine the update parameters; wherein the update formula is expressed as

m_t = β_1 · m_{t−1} + (1 − β_1) · g_t
v_t = β_2 · v_{t−1} + (1 − β_2) · g_t^2

where m_t denotes the updated first-order momentum estimate, v_t denotes the updated second-order momentum estimate, β_1 denotes the first decay rate, β_2 denotes the second decay rate, m_{t−1} denotes the first-order momentum estimate of round t−1, v_{t−1} denotes the second-order momentum estimate of round t−1, and g_t denotes the gradient vector of the discriminator in round t.
In one possible implementation, the determining module is further configured to:
inputting a pre-collected random vector into the updated generator network, and determining generated record data by using the updated generator network;
and repeatedly executing steps S1-S3 until the number of iterations reaches a threshold, and collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.
In one possible implementation manner, the apparatus further includes: a third updating module;
the third update module is further configured to:
storing the weight (parameter) information of the generator network obtained in each iteration;
inputting hidden vectors sampled from a prior distribution into the generator network of each iteration to determine a plurality of synthetic record data;
assigning a weight to each of the synthetic record data; wherein, in the weight-update formula applied in each iteration, w_ri denotes the weight, x_i denotes the synthetic record data, x_ri denotes the generated record of the r-th generator, R denotes the number of generator networks selected, d_j denotes the distance function, and M denotes the number of features; and the update formula for the composite record data in each iteration combines the generated records of the R generators according to their weights.
For convenience of description, the above devices are described as being divided into various modules by functions, which are described separately. Of course, the functionality of the various modules may be implemented in the same one or more pieces of software and/or hardware in the practice of the present application.
The apparatus in this embodiment is configured to implement the corresponding shared data determining method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Fig. 8 shows an exemplary structural diagram of an electronic device provided in an embodiment of the present application.
Based on the same inventive concept, corresponding to the method of any embodiment, the application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the shared data determination method of any embodiment is implemented. Fig. 8 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the device may include: a processor 810, a memory 820, an input/output interface 830, a communication interface 840, and a bus 850. Wherein processor 810, memory 820, input/output interface 830, and communication interface 840 are communicatively coupled to each other within the device via bus 850.
The processor 810 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification.
The Memory 820 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 820 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 820 and called to be executed by the processor 810.
The input/output interface 830 is used for connecting an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various sensors, etc., and the output devices may include a display, speaker, vibrator, indicator light, etc.
The communication interface 840 is used for connecting a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (for example, USB, network cable, etc.), and can also realize communication in a wireless mode (for example, mobile network, WIFI, bluetooth, etc.).
Bus 850 includes a path to transfer information between various components of the device, such as processor 810, memory 820, input/output interface 830, and communication interface 840.
It should be noted that although the device only shows the processor 810, the memory 820, the input/output interface 830, the communication interface 840 and the bus 850, in a specific implementation, the device may also comprise other components necessary for normal operation. In addition, it will be understood by those skilled in the art that the apparatus may include only the components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the drawings.
The electronic device of the embodiment is used for implementing the corresponding shared data determining method in any one of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to the method of any of the embodiments, the present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the method of determining shared data according to any of the embodiments.
Computer-readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the embodiment are used to enable the computer to execute the shared data determination method according to any one of the above embodiments, and have the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures, such as Dynamic RAM (DRAM), may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalents, improvements, and the like that may be made without departing from the spirit or scope of the embodiments of the present application are intended to be included within the scope of the claims.

Claims (10)

1. A method for determining shared data, comprising:
S1: receiving the generated record data and the sensitive record data of the current batch;
S2: updating the local arbiter network according to the generated record data and the sensitive record data of the current batch by using the local arbiter network;
S3: constructing a local discriminator response by using the updated local discriminator network, training a relation discriminator by using a data sharing platform according to pre-obtained real integrated record training data, synthetic integrated record training data and the discriminator response, and updating a generator network;
S4: inputting a pre-collected random vector into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of generated record data;
S5: constructing target shared data according to the weight of each generated record data.
2. The method of claim 1, wherein receiving production record data and sensitive record data for a current batch comprises:
receiving the generation record data for a current batch from the generator network;
and sampling according to a pre-acquired sensitive data set to obtain the sensitive record data of the current batch.
3. The method of claim 1, wherein said updating the local arbiter network according to the generated record data and the sensitive record data of the current batch by using the local arbiter network comprises:
determining a current batch loss function according to the generated record data and the sensitive record data of the current batch by using the local arbiter network;
determining gradient information of the local discriminator network according to the current batch loss function, and pruning the gradient information;
according to the self-adaptive noise generation technology, gaussian noise obtained by sampling from Gaussian distribution is used for disturbing the gradient information after pruning so as to determine updating parameters;
and updating the local arbiter network according to the updated parameters.
4. The method of claim 3, wherein determining the current batch loss function from the generated record data and the sensitive record data of the current batch using the local arbiter network further comprises:
initializing parameters of the discriminator network, the parameters of the discriminator network comprising: a first-order momentum estimate m and a second-order momentum estimate v.
5. The method of claim 4, wherein perturbing the pruned gradient information with Gaussian noise sampled from the Gaussian distribution to determine the update parameters comprises:
updating the first-order momentum estimate m_t and the second-order momentum estimate v_t according to an update formula to determine the update parameters; wherein the update formula is expressed as

m_t = β_1 · m_{t−1} + (1 − β_1) · g_t
v_t = β_2 · v_{t−1} + (1 − β_2) · g_t^2

wherein m_t denotes the updated first-order momentum estimate, v_t denotes the updated second-order momentum estimate, β_1 denotes the first decay rate, β_2 denotes the second decay rate, m_{t−1} denotes the first-order momentum estimate of round t−1, v_{t−1} denotes the second-order momentum estimate of round t−1, and g_t denotes the gradient vector of the discriminator in round t.
6. The method of claim 1, wherein inputting the pre-collected random vectors into an updated generator network to obtain the generated record data set comprises:
inputting a pre-collected random vector into the updated generator network, and determining generated record data by using the updated generator network;
and repeatedly executing the steps S1-S3 until the number of iterations reaches a threshold, and collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.
7. The method of claim 6, wherein said constructing target shared data according to the weight of each of said generated record data further comprises:
storing the weight (parameter) information of the generator network obtained in each iteration;
inputting hidden vectors sampled from a prior distribution into the generator network of each iteration to determine a plurality of synthetic record data;
assigning a weight to each of the synthetic record data; wherein, in the weight-update formula applied in each iteration, w_ri denotes the weight, x_i denotes the synthetic record data, x_ri denotes the generated record of the r-th generator, R denotes the number of generator networks selected, d_j denotes the distance function, and M denotes the number of features; and wherein the update formula for the composite record data in each iteration combines the generated records of the R generators according to their weights.
8. A shared data determining apparatus, comprising:
the receiving module is configured to receive the generated record data and the sensitive record data of the current batch;
a first update module configured to update the local arbiter network according to the generated record data and sensitive record data of the current batch using a local arbiter network;
the second updating module is configured to construct a local discriminator response by using the updated local discriminator network, train a relation discriminator by using the data sharing platform according to pre-acquired real integrated record training data, synthetic integrated record training data and the discriminator response, and update the generator network;
a determining module configured to input a pre-collected random vector into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of generated record data;
a construction module configured to construct target shared data according to the weight of each of the generated record data.
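The module flow claimed above (receive batch, update local discriminator, update generator via the platform, generate records, weight and assemble shared data) can be sketched as plain functions. All names are invented for illustration; the real claim describes neural networks (local discriminator, relation discriminator, generator) that are only stubbed here:

```python
# Hypothetical sketch of the claimed apparatus's module pipeline (names invented).

def receive(batch):
    # Receiving module: split a batch into generated and sensitive records.
    return batch["generated"], batch["sensitive"]

def update_discriminator(state, generated, sensitive):
    # First updating module: one (stub) update step of the local discriminator.
    state["disc_steps"] += 1
    return state

def update_generator(state, disc_response):
    # Second updating module: the platform trains the relation discriminator
    # with the discriminator response, then updates the generator (stubbed).
    state["gen_steps"] += 1
    return state

def generate_records(state, random_vectors):
    # Determining module: run random vectors through the generator (stubbed),
    # yielding a record set with a weight per generated record.
    return [{"record": v, "weight": 1.0 / len(random_vectors)}
            for v in random_vectors]

def build_shared_data(records):
    # Construction module: assemble target shared data according to weights.
    return [r for r in records if r["weight"] > 0]

state = {"disc_steps": 0, "gen_steps": 0}
gen, sens = receive({"generated": [1], "sensitive": [2]})
state = update_discriminator(state, gen, sens)
state = update_generator(state, disc_response="stub")
shared = build_shared_data(generate_records(state, [0.1, 0.2]))
assert len(shared) == 2
```

Each function corresponds to one claimed module, which makes the dataflow between modules explicit even though the learning steps themselves are stubbed out.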
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to implement the method of any one of claims 1 to 7.
CN202210892219.1A 2022-07-27 2022-07-27 Shared data determination method and device, electronic equipment and storage medium Pending CN115454949A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210892219.1A CN115454949A (en) 2022-07-27 2022-07-27 Shared data determination method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115454949A true CN115454949A (en) 2022-12-09

Family

ID=84297189

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination