CN115454949A - Shared data determination method and device, electronic equipment and storage medium - Google Patents
- Publication number: CN115454949A (application CN202210892219.1A)
- Authority: CN (China)
- Prior art keywords
- data
- record data
- network
- discriminator
- record
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/176—Support for shared access to files; File sharing support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Abstract
The application provides a shared data determination method and apparatus, an electronic device, and a storage medium. The method comprises: receiving generated record data and sensitive record data of a current batch; updating a local discriminator network according to the generated record data and the sensitive record data of the current batch; constructing a local discriminator response using the updated local discriminator network, training a relationship discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator response, and updating a generator network using a data sharing platform; inputting pre-collected random vectors into the updated generator network to obtain a generated record data set comprising a plurality of generated record data; and constructing target shared data according to the weight of each generated record datum. The method enables sharing of vertically partitioned data while avoiding privacy leakage, and ensures that the shared data retains high utility.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for determining shared data, an electronic device, and a storage medium.
Background
In the related art, sharing of vertically partitioned data is generally implemented as follows: each data owner builds a generative model from its own local data set, generates data with the learned model, and the data generated by all parties is then integrated into a shared data set. However, because of ID mismatch between the respective local data sets, the shared data obtained in this way has poor utility.
Disclosure of Invention
In view of the above, an object of the present application is to provide a method, an apparatus, an electronic device, and a storage medium for determining shared data.
To this end, in a first aspect, the application provides a method for determining shared data, including:
S1: receiving generated record data and sensitive record data of a current batch;
S2: updating a local discriminator network according to the generated record data and the sensitive record data of the current batch;
S3: constructing a local discriminator response using the updated local discriminator network, training a relationship discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator response, and updating a generator network using a data sharing platform;
S4: inputting pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of generated record data;
S5: constructing target shared data according to the weight of each generated record datum.
In one possible implementation, receiving the generated record data and the sensitive record data of the current batch includes:
receiving the generated record data of the current batch from the generator network; and
sampling from a pre-acquired sensitive data set to obtain the sensitive record data of the current batch.
In one possible implementation, updating the local discriminator network according to the generated record data and the sensitive record data of the current batch includes:
determining a current-batch loss function from the generated record data and the sensitive record data of the current batch using the local discriminator network;
determining gradient information of the local discriminator network from the current-batch loss function, and clipping the gradient information;
perturbing the clipped gradient information with Gaussian noise sampled from a Gaussian distribution, according to an adaptive noise generation technique, to determine update parameters; and
updating the local discriminator network according to the update parameters.
In one possible implementation, determining the current-batch loss function from the generated record data and the sensitive record data of the current batch using the local discriminator network further includes:
initializing parameters of the discriminator network, the parameters comprising a first-order momentum estimate and a second-order momentum estimate.
In one possible implementation, perturbing the clipped gradient information with Gaussian noise sampled from the Gaussian distribution to determine the update parameters includes:
updating the first-order momentum estimate and the second-order momentum estimate according to an update formula to determine the update parameters, where the update formula is expressed as
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
where $m_t$ denotes the updated first-order momentum estimate, $v_t$ denotes the updated second-order momentum estimate, $\beta_1$ denotes the first decay rate, $\beta_2$ denotes the second decay rate, $m_{t-1}$ denotes the first-order momentum estimate of round $t-1$, $v_{t-1}$ denotes the second-order momentum estimate of round $t-1$, and $g_t$ denotes the gradient vector of the discriminator in round $t$ (squared element-wise in the second expression).
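The momentum update can be sketched in a few lines of Python. This is a minimal illustration that assumes the update takes the usual exponential-moving-average form suggested by the symbol definitions; the function name `update_moments` and the default decay rates 0.9 and 0.999 are assumptions for illustration and do not come from the application.

```python
def update_moments(m_prev, v_prev, grad, beta1=0.9, beta2=0.999):
    """Return updated first- and second-order momentum estimates.

    m_prev, v_prev: estimates from round t-1 (lists of floats)
    grad:           gradient vector of round t (list of floats)
    beta1, beta2:   decay rates (assumed defaults, not from the source)
    """
    # First-order estimate: moving average of the gradient.
    m = [beta1 * mp + (1.0 - beta1) * g for mp, g in zip(m_prev, grad)]
    # Second-order estimate: moving average of the element-wise squared gradient.
    v = [beta2 * vp + (1.0 - beta2) * g * g for vp, g in zip(v_prev, grad)]
    return m, v


# Both estimates start at zero, matching the initialization step.
m, v = update_moments([0.0, 0.0], [0.0, 0.0], [1.0, -2.0])
```

In the setting described here, `grad` would be the gradient after clipping and noise have been applied, so the moment estimates never see the raw gradient.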
In one possible implementation, inputting the pre-collected random vectors into the updated generator network to obtain the generated record data set includes:
inputting a pre-collected random vector into the updated generator network, and determining generated record data using the updated generator network; and
repeating steps S1 to S3 until the number of iterations reaches a threshold, and collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.
In one possible implementation, constructing the target shared data according to the weight of each generated record datum further includes:
saving the weight of the generated record data determined by the generator network in each iteration;
inputting latent vectors drawn from a prior distribution into the generator network of each iteration to determine a plurality of synthetic record data; and
assigning a weight to each synthetic record datum, where the formula for updating the weight in each iteration is expressed as
[formula not reproduced in the source text]
where $w_{ri}$ denotes the weight, $R$ denotes the number of generator networks selected, $d_j$ denotes a distance function, $M$ denotes the number of features, and the remaining symbols denote the synthetic record data and the record generated by the $r$-th generator.
The formula for updating the synthetic record data in each iteration is expressed as
[formula not reproduced in the source text]
In a second aspect, the present application provides a shared data determining apparatus, including:
a receiving module configured to receive generated record data and sensitive record data of a current batch;
a first updating module configured to update a local discriminator network according to the generated record data and the sensitive record data of the current batch;
a second updating module configured to construct a local discriminator response using the updated local discriminator network, train a relationship discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator response, and update a generator network using a data sharing platform;
a determining module configured to input pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of generated record data;
a construction module configured to construct target shared data according to the weight of each generated record datum.
In a third aspect, the present application provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the method for determining shared data according to the first aspect.
In a fourth aspect, the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the shared data determination method according to the first aspect.
As can be seen from the foregoing, the shared data determination method, apparatus, electronic device, and storage medium provided by the present application receive generated record data and sensitive record data of a current batch; update a local discriminator network according to the generated record data and the sensitive record data of the current batch; construct a local discriminator response using the updated local discriminator network, train a relationship discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator response, and update a generator network using a data sharing platform; input pre-collected random vectors into the updated generator network to obtain a generated record data set comprising a plurality of generated record data; and construct target shared data according to the weight of each generated record datum. Vertically partitioned data can thus be shared while privacy leakage is avoided, and the resulting shared data has high utility.
Drawings
To illustrate the technical solutions of the present application or the related art more clearly, the drawings used in describing the embodiments or the related art are briefly introduced below. The drawings described below are merely embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram illustrating a multi-party data sharing scenario in the related art.
Fig. 2 shows an exemplary flowchart of a shared data determination method provided in an embodiment of the present application.
Fig. 3 shows a schematic diagram of a multi-party data sharing algorithm satisfying differential privacy according to an embodiment of the present application.
Fig. 4 (a) shows a comparison of the IS scores of various algorithms on the first MNIST dataset according to embodiments of the present application.
Fig. 4 (b) shows a comparison of the FID scores of various algorithms on the first MNIST dataset according to embodiments of the present application.
Fig. 4 (c) shows a comparison of the IS scores of various algorithms on the second MNIST dataset according to embodiments of the present application.
Fig. 4 (d) shows a comparison of the FID scores of various algorithms on the second MNIST dataset according to embodiments of the present application.
Fig. 5 (a) shows a comparison graph of the results of a first experiment of various algorithms according to embodiments of the present application at different privacy budgets.
Fig. 5 (b) shows a comparison graph of the results of a second experiment with different privacy budgets for various algorithms according to embodiments of the present application.
Fig. 6 (a) shows a comparison graph of the results of the first experiment of various algorithms according to the embodiments of the present application at different numbers of participants.
Fig. 6 (b) shows a comparison of results of a second experiment with various algorithms according to embodiments of the present application at different numbers of participants.
Fig. 7 shows an exemplary structural diagram of a shared data determining apparatus provided in an embodiment of the present application.
Fig. 8 shows an exemplary structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the present application belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application do not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item preceding the word comprises the element or item listed after the word and its equivalent, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
As described in the background section, data sharing helps unlock the economic value hidden in data. According to a McKinsey report, if data in seven industries (commerce, finance, health care, education, transportation, electric power, and oil and gas) were opened up and shared, substantial economic benefit would result. Data sharing also helps mine the knowledge contained in data. For example, data shared between hospitals and disease control centers can be used to analyze how diseases spread, improving the level of public health care; data shared across multiple shopping platforms can be used for personalized product recommendation, improving the shopping experience of consumers; and data shared among multiple banks supports better customer credit assessment and monitoring of multi-party lending and financial fraud. In practice, however, data is usually distributed among multiple data owners (for example, residents' medical records held by hospitals and customers' account records held by banks) and contains a large amount of sensitive information, so directly sharing the data of different owners may cause serious privacy leakage.
Referring to Fig. 1, the scenario mainly involves three roles: a data sharing platform, data owners, and data users. Each data owner holds a local sensitive data set covering different attributes of the same group of users. The sharing platform assists in sharing the local sensitive data sets and builds a data sharing model. New shared data generated by the data sharing model has the same statistical distribution characteristics as the integrated data set, while the local sensitive data sets themselves are never shared directly, so the privacy of each data owner is protected to a certain extent. Data users can use the shared data for various data analysis and mining tasks.
As the above process shows, because the data user obtains only the data sharing model and the final shared data, direct exposure of the data owners' private data to the data user can be avoided. However, before the multi-party data sharing model has been formed, the private data of each data owner may still be leaked. Specifically, for each local sensitive data set, three roles may pose a privacy threat: 1) the data sharing platform; 2) the other data owners participating in data sharing; 3) data users or other potential attackers who may obtain the final shared data.
Privacy-preserving multi-party data sharing offers a feasible way to address the privacy leakage caused by multi-party data sharing, and the differential privacy (DP) techniques proposed in recent years provide a concrete means of doing so. Unlike traditional anonymity-based privacy models (e.g., k-anonymity [1] and l-diversity), differential privacy provides a rigorous, quantifiable guarantee of privacy protection for sensitive data. By adding an appropriate amount of noise to a statistical result, it ensures that modifying any single record in the data set has no significant effect on that result, thereby meeting the privacy requirement.
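As a small illustration of "adding noise to a statistical result", the sketch below applies the classical Laplace mechanism to a counting query. The Laplace mechanism is a standard differential privacy building block chosen here for brevity; it is not the Gaussian gradient perturbation used later in this application, and the function names and the `epsilon` parameter are illustrative assumptions.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Noisy count of records satisfying `predicate`.

    Changing one record changes the true count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
has_condition = [1, 0, 1, 1, 0]  # toy sensitive attribute, one entry per person
noisy = private_count(has_condition, lambda r: r == 1, epsilon=1.0, rng=rng)
```

A smaller `epsilon` means a larger noise scale and therefore stronger privacy at the cost of a less accurate count.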
The applicant has found through research that, in the related art, an intuitive approach to sharing vertically partitioned data is as follows: each data owner builds a generative model such as a GAN or a Bayesian network from its own local data set, generates data with the learned model, and the data generated by all parties is then integrated into a shared data set. However, because of ID mismatch between the respective local data sets, the utility of the resulting data is difficult to guarantee.
Therefore, the shared data determination method, apparatus, electronic device, and storage medium provided by the present application receive generated record data and sensitive record data of a current batch; update a local discriminator network according to the generated record data and the sensitive record data of the current batch; construct a local discriminator response using the updated local discriminator network, train a relationship discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator response, and update a generator network using a data sharing platform; input pre-collected random vectors into the updated generator network to obtain a generated record data set comprising a plurality of generated record data; and construct target shared data according to the weight of each generated record datum. Vertically partitioned data can thus be shared while privacy leakage is avoided, and the resulting shared data has high utility.
The method for determining shared data provided in the embodiments of the present application is specifically described below by using specific embodiments.
Referring to fig. 2, a method for determining shared data provided in the embodiment of the present application specifically includes the following steps:
S1: receiving generated record data and sensitive record data of a current batch;
S2: updating a local discriminator network according to the generated record data and the sensitive record data of the current batch;
S3: constructing a local discriminator response using the updated local discriminator network, training a relationship discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator response, and updating a generator network using a data sharing platform;
S4: inputting pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of generated record data;
S5: constructing target shared data according to the weight of each generated record datum.
In some embodiments, the present application may be applied to a scenario with multiple data owners and one data sharing platform. For example, each data owner k holds one discriminator network, and the data sharing platform holds one generator network G and two relationship discriminators.
In some embodiments, a latent variable z (typically random noise following a Gaussian distribution) is passed through the generator network G to produce generated samples. For the discriminator D this is a binary classification problem, and V(D, G) is the cross-entropy loss commonly used in binary classification. To keep V(D, G) close to its maximum with respect to D, the discriminator can be trained for k iterations before the generator is updated once. The specific training process is as follows:
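For reference, the value function of the standard GAN formulation, which this paragraph appears to describe, can be written as below. The distribution names $p_{\mathrm{data}}$ and $p_z$ are notation assumed here; they do not appear in the application.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator D maximizes this quantity (it is the negative of the binary cross-entropy on real versus generated samples), while the generator G minimizes it.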
The parameters of the generator G and the discriminator D are initialized. Then n samples are drawn from the training set, and the generator produces n samples from the defined noise distribution. With the generator G held fixed, the discriminator D is trained to distinguish real samples from generated ones as well as possible. After k updates of the discriminator D, the generator G is updated once so that the discriminator becomes as unable as possible to tell real from generated. After many such iterations, in the ideal state the final discriminator D cannot tell whether a sample comes from the real training set or from the generator G; the discrimination probability is then 0.5, and training is complete.
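The alternating schedule described above (k discriminator updates, then one generator update) can be sketched as follows. Only the update order is shown; `step_d` and `step_g` are hypothetical callables standing in for the actual discriminator and generator update steps, which are not specified at this point in the text.

```python
def alternating_schedule(num_rounds, k):
    """Yield "D" k times and then "G" once, for each training round."""
    for _ in range(num_rounds):
        for _ in range(k):
            yield "D"  # discriminator update, generator held fixed
        yield "G"      # generator update, discriminator held fixed


def train(num_rounds, k, step_d, step_g):
    # step_d / step_g are placeholders for the real update functions.
    for who in alternating_schedule(num_rounds, k):
        if who == "D":
            step_d()
        else:
            step_g()


# Count how many times each network would be updated.
counts = {"D": 0, "G": 0}
train(
    num_rounds=4,
    k=3,
    step_d=lambda: counts.update(D=counts["D"] + 1),
    step_g=lambda: counts.update(G=counts["G"] + 1),
)
```

With `num_rounds=4` and `k=3`, the discriminator is stepped twelve times and the generator four times.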
In some embodiments, the DPGDAN algorithm presented in this application involves two phases. In phase 1, the data sharing platform and the K data owners jointly train a K-to-1 model, in which the discriminators and the generator are trained alternately. Each data owner updates the parameters of its own discriminator using an adaptive gradient perturbation method, and the data sharing platform updates the parameters of its generator network using the discriminator feedback it receives, which satisfies differential privacy. In phase 2, random vectors are sampled from a prior distribution and passed to the generator networks obtained during training to construct synthetic records, and the synthetic records are aggregated to obtain the final shared data records.
In some embodiments, the DPGDAN algorithm of the present application involves training four kinds of neural networks: the local discriminator network held by each data owner, and the generator network and relationship discriminator networks held by the data sharing platform. Specifically, each local discriminator network distinguishes local sensitive data records from records produced by the generator, while the generator network aims to produce synthetic records that all K discriminator networks judge to be "real". Given these different goals, the objective function of the generator can be written as follows:
[formula not reproduced in the source text]
where m denotes the batch size, and the remaining symbols denote, in order: the real-record relationship discriminator; the synthetic-record relationship discriminator; a pair of relation-type discriminators; the integrated record produced by the generator; and the generator's partial records with respect to A and B.
The objective function of the synthetic-record relationship discriminator is as follows:
[formula not reproduced in the source text]
where m denotes the batch size, and the remaining symbols denote, in order: the real-record relationship discriminator; the synthetic-record relationship discriminator; a pair of relation-type discriminators; the integrated record after the obfuscation operation; the gradient with respect to the integrated record; and the integrated record generated by the generator.
The objective function of the real-record relationship discriminator is as follows:
[formula not reproduced in the source text]
where m denotes the batch size, and the remaining symbols denote, in order: the real-record relationship discriminator; the synthetic-record relationship discriminator; a pair of relation-type discriminators; the integrated record data; and the obfuscated integrated record; $x_i$ denotes an integrated record.
The two local discriminators distinguish real local record data from generated local record data; their objective functions are, respectively:
[formula not reproduced in the source text]
where $x_{iA}$ and $x_{iB}$ denote local record data.
In some embodiments, updating the generator network requires gradient descent on its objective function, but that gradient contains sensitive information carried by the discriminator feedback. To update the generator without leaking the privacy of any data owner, the generator's gradient computation is first decomposed, the sensitive information related to the discriminator networks is identified, and that information is then desensitized.
By the chain rule of calculus, the generator's gradient computation can be decomposed as follows:
[formula not reproduced in the source text]
where the symbols denote the objective function, the number K of local discriminators, the response of the discriminator network (which contains sensitive information), and a non-sensitive computation factor.
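The idea of splitting the generator gradient into a sensitive discriminator response and a non-sensitive factor can be illustrated with a toy example. This sketch assumes the decomposition has the usual chain-rule shape, a sum over the K data owners of (derivative of the loss with respect to the generated record) times (derivative of the generated record with respect to the generator parameter). The scalar generator, the logistic loss, and the constants `A` and `W` are all hypothetical and are not taken from the application.

```python
import math

A = [1.0, -0.5]  # hypothetical generator output weights, one per data owner
W = [1.5, 2.0]   # hypothetical private discriminator weights, one per data owner


def generator(theta, z):
    # Platform side: one generated attribute x_k = a_k * theta * z per owner.
    return [a * theta * z for a in A]


def local_loss(w, x):
    # Owner side: toy generator loss -log(sigmoid(w * x)).
    return math.log1p(math.exp(-w * x))


def disc_response(w, x):
    # Owner side: d loss / d x. Depends on the private weight w, so this is
    # the sensitive factor that would need to be perturbed before release.
    return -w / (1.0 + math.exp(w * x))


def local_jacobian(z):
    # Platform side: d x_k / d theta = a_k * z, computable without owner data.
    return [a * z for a in A]


def generator_gradient(responses, jacobian):
    # Chain rule: sum over owners of (sensitive response) * (non-sensitive factor).
    return sum(r * j for r, j in zip(responses, jacobian))


def total_loss(theta, z):
    return sum(local_loss(w, x) for w, x in zip(W, generator(theta, z)))


theta, z = 0.7, 0.8
responses = [disc_response(w, x) for w, x in zip(W, generator(theta, z))]
grad = generator_gradient(responses, local_jacobian(z))
```

Only `responses` crosses from the data owners to the platform, which is why desensitizing the discriminator response is sufficient to protect the generator update in this toy setting.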
In some embodiments, determining, with the local discriminator network, a current-batch loss function from the generated record data and the sensitive record data of the current batch further comprises: initializing parameters of the discriminator network, the parameters comprising a first-order momentum estimate and a second-order momentum estimate.
In some embodiments, receiving the generated record data and the sensitive record data of the current batch includes: receiving the generated record data of the current batch from the generator network; and sampling from a pre-acquired sensitive data set to obtain the sensitive record data of the current batch.
In some embodiments, updating the local discriminator network according to the generated record data and the sensitive record data of the current batch by using the local discriminator network comprises: determining a current-batch loss function from the generated record data and the sensitive record data of the current batch by using the local discriminator network; determining gradient information of the local discriminator network from the current-batch loss function, and pruning the gradient information; perturbing the pruned gradient information, following the adaptive noise generation technique, with Gaussian noise sampled from a Gaussian distribution, so as to determine update parameters; and updating the local discriminator network according to the update parameters.
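The pruning and perturbation of the gradient can be sketched as follows. This is a minimal sketch under stated assumptions: "pruning" is taken to mean clipping the gradient to a maximum L2 norm, the hypothetical parameter `clip_norm` is that bound, and the noise standard deviation is taken to be `sigma * clip_norm`. The adaptive schedule for the noise scale is not reproduced here because its formula is not given in the text.

```python
import math
import random

def clip_gradient(grad, clip_norm):
    """Scale grad so that its L2 norm is at most clip_norm (assumed pruning rule)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def perturb_gradient(grad, clip_norm, sigma, rng):
    """Clip grad, then add independent Gaussian noise N(0, (sigma * clip_norm)^2)."""
    clipped = clip_gradient(grad, clip_norm)
    return [g + rng.gauss(0.0, sigma * clip_norm) for g in clipped]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
raw = [3.0, 4.0]        # toy gradient with L2 norm 5
clipped = clip_gradient(raw, clip_norm=1.0)
noisy = perturb_gradient(raw, clip_norm=1.0, sigma=0.5, rng=rng)
```

Bounding the norm first limits how much any single batch can influence the update, which is what lets a fixed noise scale mask that influence.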
After the gradient calculation has been decomposed, the following steps can be iterated to desensitize the parameters of the generator network and to generate privacy-preserving discriminator network feedback.
Step 1: initialize the discriminator network parameters, the first-order momentum estimate, the second-order momentum estimate, and the noise scale $\sigma_0$.
Step 4: input the data records obtained in steps 1 and 2 into a generator network, and calculate the loss function of the data records:
where the expectation operator denotes the mean, D is the discriminator network, G is the generator network, z is the hidden vector, and $x_k$ is the real record data.
Step 5: calculate the gradient information of the discriminator network from the loss function.
Step 6: following the adaptive noise generation method, sample Gaussian noise from a Gaussian distribution to perturb the gradient:
where the symbols denote, in order, the updated first-order momentum estimate and the updated second-order momentum estimate; $\beta_1$ denotes the first decay rate; $\beta_2$ denotes the second decay rate; and the remaining symbols denote the first-order momentum estimate of round t-1, the second-order momentum estimate of round t-1, and the gradient vector of the discriminator in round t.
Step 8: update the weights of the discriminator network.
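The momentum bookkeeping in the steps above can be sketched as follows. Because the update formulas are not reproduced in the text, this sketch assumes they follow the standard Adam optimizer, with bias correction included as in that optimizer. The function name `adam_step`, the learning rate `lr` and the constant `eps` are illustrative assumptions; the input gradient stands in for the perturbed gradient of round t.

```python
import math

def adam_step(weights, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update (assumed form; the source formula is not shown).

    m, v : first- and second-order momentum estimates from round t-1
    grad : (perturbed) gradient vector of the discriminator in round t
    """
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = [mi / (1 - beta1 ** t) for mi in new_m]
    v_hat = [vi / (1 - beta2 ** t) for vi in new_v]
    new_w = [w - lr * mh / (math.sqrt(vh) + eps)
             for w, mh, vh in zip(weights, m_hat, v_hat)]
    return new_w, new_m, new_v

# Round t = 1, starting from zero momentum estimates (as in Step 1).
weights, m, v = adam_step([1.0, 1.0], [2.0, -4.0], [0.0, 0.0], [0.0, 0.0], t=1)
```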
In some embodiments, inputting the pre-collected random vector into the updated generator network to obtain the generated record data set includes: inputting a pre-collected random vector into the updated generator network, and determining generated record data by using the updated generator network; and repeating steps S1-S3 until the number of iterations reaches a threshold, and collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.
In some embodiments, the specific steps of snapshot aggregation include:
storing the weight information of the generator network obtained by different iterations;
feeding the hidden vector extracted from the prior distribution into a generator network to obtain corresponding synthetic record data;
assigning a weight to each of the synthetic record data;
and updating the weights and the shared data according to an updating formula.
Specifically, constructing the target shared data according to the weight of each piece of generated record data further includes: saving the weight of the generated record data determined by the generator network in each iteration; inputting hidden vectors drawn from a prior distribution into the generator network of each iteration to determine a plurality of pieces of synthetic record data; and assigning a weight to each piece of synthetic record data; wherein the update formula for updating the weight in each iteration is expressed as
where $w_{ri}$ denotes the weight; the other symbols denote the synthetic record data and the record generated by the r-th generator; R denotes the number of selected generator networks; $d_j$ denotes the distance function; and M denotes the number of features.
The update formula for updating the synthetic record data in each iteration is expressed as
The snapshot aggregation method is based on the following idea. Ideally, a trained generator is able to reproduce the true data distribution. Shared records are synthesized by following the standard practice in GANs, i.e., feeding the trained generator network hidden vectors $z_i$ sampled from the prior distribution and taking the generator outputs as shared data. However, because the privacy budget is limited, the generator network and the discriminator network may not be trained for long enough. Using only the final trained generator network to generate shared records would therefore ignore useful information held by the generators obtained during training. To this end, the present application proposes a snapshot aggregation method that uses the generator networks obtained during training to improve the utility of the final shared data.
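The idea of combining records from several generator snapshots can be sketched as follows. The weight and record update formulas are not reproduced in the text, so this sketch swaps in a simple inverse-distance weighting with a squared-difference distance over M numeric features; that rule, the function name `aggregate_snapshots` and the toy snapshot functions are all assumptions made for illustration and are not the formulas of the present application.

```python
def aggregate_snapshots(records, n_iters=5, eps=1e-9):
    """Combine R records (M numeric features each) into one shared record.

    records: outputs of R generator snapshots for the same hidden vector.
    Assumed rule: weight_r is proportional to 1 / (distance to aggregate + eps).
    """
    R, M = len(records), len(records[0])
    agg = [sum(rec[j] for rec in records) / R for j in range(M)]  # start at the mean
    weights = [1.0 / R] * R
    for _ in range(n_iters):
        dists = [sum((rec[j] - agg[j]) ** 2 for j in range(M)) for rec in records]
        inv = [1.0 / (d + eps) for d in dists]
        total = sum(inv)
        weights = [x / total for x in inv]
        agg = [sum(w * rec[j] for w, rec in zip(weights, records)) for j in range(M)]
    return weights, agg

# Three toy "snapshots" of a generator; the third one is deliberately off.
snapshots = [
    lambda z: [z[0] + 1.0, z[1] + 2.0],
    lambda z: [z[0] + 1.1, z[1] + 2.1],
    lambda z: [z[0] + 5.0, z[1] - 3.0],
]
z = [0.0, 0.0]  # hidden vector; a fixed value keeps the sketch deterministic
records = [g(z) for g in snapshots]
weights, shared_record = aggregate_snapshots(records)
```

With this assumed rule, a snapshot whose output lies far from the consensus receives a small weight, so a poorly trained snapshot contributes little to the shared record.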
In the comparison, the widely used real data sets Adult and Bank are used for experimental verification. The Adult data set contains 48842 U.S. census records; the Bank data set contains 45211 account records from a Portuguese banking institution. The statistics of both data sets are shown in Table 1 below, and Table 2 lists the hyperparameter values used in the experiments.
Table 1. Statistics of the Adult and Bank data sets
Table 2. Hyperparameter settings
The method is evaluated following the standard practice of data sharing tasks, namely, the utility of shared data is measured through the effectiveness of machine learning. In particular, we first train the predictive model using shared data, and then test the trained predictive model on the actual test set. The higher the accuracy of the prediction model, the better the data utility.
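The evaluation protocol described above (train on shared data, test on real data) can be sketched as follows. The nearest-centroid classifier and the toy numbers are placeholders chosen only to keep the example self-contained; they are not the prediction models or data sets used in the experiments.

```python
def fit_centroids(rows, labels):
    """Train a nearest-centroid classifier: one mean vector per class label."""
    sums, counts = {}, {}
    for row, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], row)))

def accuracy(centroids, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

# Toy placeholder data: "shared" records for training, "real" records for testing.
shared_X = [[0.1, 0.2], [0.0, 0.4], [0.9, 1.0], [1.1, 0.8]]
shared_y = [0, 0, 1, 1]
real_test_X = [[0.2, 0.1], [1.0, 0.9], [0.8, 1.2]]
real_test_y = [0, 1, 1]

model = fit_centroids(shared_X, shared_y)           # train on shared data only
utility = accuracy(model, real_test_X, real_test_y)  # score on the real test set
```

The key point of the protocol is that the real data is never used for training, so the score reflects how much predictive signal the shared data preserved.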
The effectiveness and performance of the method of the present application are illustrated by comparing its accuracy in experiments with that of the prior art. Fig. 3 demonstrates the effectiveness of the shared data determination algorithm provided by the present application. The experimental results are shown in figs. 4 (a) -4 (d), 5 (a) -5 (b), and 6 (a) -6 (b), in which the DPGDAN algorithm denotes the algorithm provided in the embodiments of the present application, and the Nonprivat and Nosplit algorithms are algorithms in the related art. As can be seen from these figures, the present application achieves performance close to that of the comparison algorithms under different prediction models, different privacy budgets, and different numbers of participants, and better preserves the distribution characteristics of the original data under a given privacy budget. It should be noted that the comparison algorithms are used only to indicate an upper bound on performance and cannot be applied directly to the privacy-preserving vertically partitioned data sharing task.
As can be seen from fig. 4 (a) -4 (d), the shared data generated by the present application can support a variety of data prediction or machine learning tasks. As can be seen from fig. 5 (a) -5 (b), the present application maintains good performance under different privacy budgets, because the snapshot aggregation method ensures a better balance between privacy and data utility, and the learning paradigm of the present application can also use the data of all parties to guide model updates. In addition, the 1-summary-K design of the present application makes it more likely that the generator network receives information signals from the discriminator networks, thereby better guiding the generator towards the true data distribution.
As can be seen from fig. 6 (a) -6 (b), the accuracy of the present application decreases by only a small amount compared to the baseline. The reason is that increasing the number of discriminators reduces the variance of the min-max objective function throughout the run and increases the convergence speed.
As can be seen from the foregoing, the shared data determination method, apparatus, electronic device and storage medium provided by the present application receive generated record data and sensitive record data of a current batch; update the local discriminator network according to the generated record data and the sensitive record data of the current batch by using the local discriminator network; construct a local discriminator response by using the updated local discriminator network, and use the data sharing platform to train a relation discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data and the discriminator response, and to update the generator network; input a pre-collected random vector into the updated generator network to obtain a generated record data set, the generated record data set including a plurality of pieces of generated record data; and construct target shared data according to the weight of each piece of generated record data. Vertically partitioned data can thus be shared without disclosing private information, and the finally obtained shared data has high usability.
To solve the problem, mentioned in the background art, that single-party data publishing methods satisfying differential privacy cannot be applied directly to vertically partitioned data sharing, the applicant proposes a multi-party data sharing method satisfying differential privacy (the DPGDAN algorithm). The main idea is that the data owners and the data sharing platform jointly train a customized generative adversarial network (GAN) to extract the distribution information of each local sensitive data set. Specifically, each data owner holds a discriminator and trains it on its local sensitive data set, then desensitizes the feedback information of the discriminator and uploads it to the data sharing platform; the data sharing platform updates the parameters of the generator using the collected discriminator feedback together with the feedback of a relation discriminator. After training is completed, the generator generates shared data with the aid of the discriminators. Because feedback information from every data owner is used when updating the generator, the finally shared data has high utility.
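The vertically partitioned setting can be illustrated with a small sketch: every data owner holds a different subset of attributes for the same records. It is assumed here, for illustration only, that a generated record is split into per-owner slices so that each local discriminator is given just the attributes its owner holds; the attribute layout, helper names and toy record are hypothetical.

```python
# Hypothetical illustration of vertically partitioned records: every owner
# holds the same individuals but a different subset of attributes.

def split_vertically(record, feature_groups):
    """Return one local slice per data owner, following feature_groups."""
    return [[record[i] for i in group] for group in feature_groups]

def integrate(slices, feature_groups, n_features):
    """Reassemble the local slices into one integrated record."""
    record = [None] * n_features
    for local, group in zip(slices, feature_groups):
        for value, i in zip(local, group):
            record[i] = value
    return record

feature_groups = [[0, 1], [2, 3, 4]]             # owner A, owner B (assumed layout)
generated = [39, "Bachelors", 40, "Private", 1]  # toy generated record
local_slices = split_vertically(generated, feature_groups)
restored = integrate(local_slices, feature_groups, n_features=5)
```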
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that the description describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the described embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 7 shows an exemplary structural diagram of a shared data determination apparatus provided in an embodiment of the present application.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a shared data determining device.
Referring to fig. 7, the shared data determination apparatus includes: a receiving module, a first updating module, a second updating module, a determining module and a construction module; wherein,
the receiving module is configured to receive the generated record data and the sensitive record data of the current batch;
the first updating module is configured to update the local discriminator network according to the generated record data and the sensitive record data of the current batch by using the local discriminator network;
the second updating module is configured to construct a local discriminator response by using the updated local discriminator network, and to use the data sharing platform to train a relation discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data and the discriminator response, and to update the generator network;
the determining module is configured to input a pre-collected random vector into the updated generator network to obtain a generated record data set, the generated record data set including a plurality of pieces of generated record data;
the construction module is configured to construct target shared data according to the weight of each piece of generated record data.
In one possible implementation, the receiving module is further configured to:
receiving the generation record data for a current batch from the generator network;
and sampling according to a pre-acquired sensitive data set to obtain the sensitive record data of the current batch.
In one possible implementation, the first update module is further configured to:
determining a current-batch loss function from the generated record data and the sensitive record data of the current batch by using the local discriminator network;
determining gradient information of the local discriminator network from the current-batch loss function, and pruning the gradient information;
perturbing the pruned gradient information, following the adaptive noise generation technique, with Gaussian noise sampled from a Gaussian distribution, so as to determine update parameters;
and updating the local discriminator network according to the update parameters.
In one possible implementation, the apparatus further includes an initialization module;
the initialization module is configured to:
initialize parameters of the discriminator network, the parameters comprising a first-order momentum estimate and a second-order momentum estimate.
In one possible implementation, the second update module is further configured to:
update the first-order momentum estimate and the second-order momentum estimate according to an update formula to determine the update parameters; wherein the update formula is expressed as
where the symbols denote, in order, the updated first-order momentum estimate and the updated second-order momentum estimate; $\beta_1$ denotes the first decay rate; $\beta_2$ denotes the second decay rate; and the remaining symbols denote the first-order momentum estimate of round t-1, the second-order momentum estimate of round t-1, and the gradient vector of the discriminator in round t.
In one possible implementation, the determining module is further configured to:
inputting a random vector collected in advance into an updated generator network, and determining to generate record data by using the updated generator network;
and repeating steps S1-S3 until the number of iterations reaches a threshold, and collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.
In one possible implementation manner, the apparatus further includes: a third updating module;
the third update module is further configured to:
storing the weight of the generated record data determined according to the generator network in each iteration;
inputting hidden vectors extracted according to a prior distribution into the generator network in each iteration to determine a plurality of synthetic record data;
assigning a weight to each piece of synthetic record data; wherein the update formula for updating the weight in each iteration is expressed as
where $w_{ri}$ denotes the weight; the other symbols denote the synthetic record data and the record generated by the r-th generator; R denotes the number of selected generator networks; $d_j$ denotes the distance function; and M denotes the number of features.
The update formula for updating the synthetic record data in each iteration is expressed as
For convenience of description, the above devices are described as being divided into various modules by functions, which are described separately. Of course, the functionality of the various modules may be implemented in the same one or more pieces of software and/or hardware in the practice of the present application.
The apparatus in this embodiment is configured to implement the corresponding shared data determining method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Fig. 8 shows an exemplary structural diagram of an electronic device provided in an embodiment of the present application.
Based on the same inventive concept, corresponding to the method of any embodiment, the application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the shared data determination method of any embodiment is implemented. Fig. 8 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the device may include: a processor 810, a memory 820, an input/output interface 830, a communication interface 840, and a bus 850. Wherein processor 810, memory 820, input/output interface 830, and communication interface 840 are communicatively coupled to each other within the device via bus 850.
The processor 810 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification.
The Memory 820 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 820 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 820 and called to be executed by the processor 810.
The input/output interface 830 is used for connecting an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various sensors, etc., and the output devices may include a display, speaker, vibrator, indicator light, etc.
The communication interface 840 is used for connecting a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (for example, USB, network cable, etc.), and can also realize communication in a wireless mode (for example, mobile network, WIFI, bluetooth, etc.).
It should be noted that although the device only shows the processor 810, the memory 820, the input/output interface 830, the communication interface 840 and the bus 850, in a specific implementation, the device may also comprise other components necessary for normal operation. In addition, it will be understood by those skilled in the art that the apparatus may include only the components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the drawings.
The electronic device of the embodiment is used for implementing the corresponding shared data determining method in any one of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to the method of any of the embodiments, the present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the method of determining shared data according to any of the embodiments.
The computer-readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The computer instructions stored in the storage medium of the embodiment are used to enable the computer to execute the shared data determination method according to any one of the above embodiments, and have the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures, such as Dynamic RAM (DRAM), may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalents, improvements, and the like that may be made without departing from the spirit or scope of the embodiments of the present application are intended to be included within the scope of the claims.
Claims (10)
1. A method for determining shared data, comprising:
S1: receiving generated record data and sensitive record data of a current batch;
S2: updating a local discriminator network according to the generated record data and the sensitive record data of the current batch by using the local discriminator network;
S3: constructing a local discriminator response by using the updated local discriminator network, and using a data sharing platform to train a relation discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data and the discriminator response, and to update a generator network;
S4: inputting a pre-collected random vector into the updated generator network to obtain a generated record data set, the generated record data set including a plurality of pieces of generated record data;
S5: constructing target shared data according to the weight of each piece of generated record data.
2. The method of claim 1, wherein receiving the generated record data and the sensitive record data of the current batch comprises:
receiving the generated record data of the current batch from the generator network;
and sampling from a pre-acquired sensitive data set to obtain the sensitive record data of the current batch.
3. The method of claim 1, wherein updating the local discriminator network according to the generated record data and the sensitive record data of the current batch by using the local discriminator network comprises:
determining a current-batch loss function from the generated record data and the sensitive record data of the current batch by using the local discriminator network;
determining gradient information of the local discriminator network from the current-batch loss function, and pruning the gradient information;
perturbing the pruned gradient information, following the adaptive noise generation technique, with Gaussian noise sampled from a Gaussian distribution, so as to determine update parameters;
and updating the local discriminator network according to the update parameters.
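4. The method of claim 3, wherein determining a current-batch loss function from the generated record data and the sensitive record data of the current batch using the local discriminator network further comprises:
initializing parameters of the discriminator network; the parameters of the discriminator network comprise: a first-order momentum estimate and a second-order momentum estimate.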
5. The method of claim 4, wherein perturbing the pruned gradient information with Gaussian noise sampled from the Gaussian distribution to determine the update parameters comprises:
updating the first-order momentum estimate and the second-order momentum estimate according to an update formula to determine the update parameters; wherein the update formula is expressed as
where the symbols denote, in order, the updated first-order momentum estimate and the updated second-order momentum estimate; $\beta_1$ denotes the first decay rate; $\beta_2$ denotes the second decay rate; and the remaining symbols denote the first-order momentum estimate of round t-1, the second-order momentum estimate of round t-1, and the gradient vector of the discriminator in round t.
6. The method of claim 1, wherein inputting the pre-collected random vectors into an updated generator network to obtain the generated record data set comprises:
inputting a random vector collected in advance into an updated generator network, and determining to generate record data by using the updated generator network;
and repeating steps S1-S3 until the number of iterations reaches a threshold, and collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.
7. The method of claim 6, wherein said constructing target shared data according to the weight of each of said generated record data further comprises:
storing the weight of the generated record data determined according to the generator network in each iteration;
inputting hidden vectors extracted according to a prior distribution into the generator network in each iteration to determine a plurality of synthetic record data;
assigning a weight to each piece of synthetic record data; wherein the update formula for updating the weight in each iteration is expressed as
where $w_{ri}$ denotes the weight; the other symbols denote the synthetic record data and the record generated by the r-th generator; R denotes the number of selected generator networks; $d_j$ denotes the distance function; and M denotes the number of features.
The update formula for updating the synthetic record data in each iteration is expressed as
8. A shared data determining apparatus, comprising:
a receiving module configured to receive generated record data and sensitive record data of a current batch;
a first update module configured to update a local discriminator network according to the generated record data and the sensitive record data of the current batch by using the local discriminator network;
a second update module configured to construct a local discriminator response by using the updated local discriminator network, and to use a data sharing platform to train a relation discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data and the discriminator response, and to update a generator network;
a determining module configured to input a pre-collected random vector into the updated generator network to obtain a generated record data set, the generated record data set including a plurality of pieces of generated record data;
a construction module configured to construct target shared data according to the weight of each piece of generated record data.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to implement the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210892219.1A CN115454949A (en) | 2022-07-27 | 2022-07-27 | Shared data determination method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210892219.1A CN115454949A (en) | 2022-07-27 | 2022-07-27 | Shared data determination method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115454949A true CN115454949A (en) | 2022-12-09 |
Family
ID=84297189
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210892219.1A Pending CN115454949A (en) | 2022-07-27 | 2022-07-27 | Shared data determination method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115454949A (en) |
- 2022-07-27: CN application CN202210892219.1A, published as CN115454949A (en), status: Pending
Similar Documents
Publication | Title |
---|---|
US12014253B2 (en) | System and method for building predictive model for synthesizing data |
CN108140075B (en) | Classifying user behavior as anomalous |
Hur et al. | A variable impacts measurement in random forest for mobile cloud computing |
EP3889829A1 (en) | Integrated clustering and outlier detection using optimization solver machine |
Sina Mirabdolbaghi et al. | Model optimization analysis of customer churn prediction using machine learning algorithms with focus on feature reductions |
CN113408668A (en) | Decision tree construction method and device based on federated learning system and electronic equipment |
CN112883227B (en) | Video abstract generation method and device based on multi-scale time sequence characteristics |
Du et al. | Modeling spatial cross-correlation of multiple ground motion intensity measures (SAs, PGA, PGV, Ia, CAV, and significant durations) based on principal component and geostatistical analyses |
CN112529101A (en) | Method and device for training classification model, electronic equipment and storage medium |
Islam et al. | Incorporating spatial information in machine learning: The Moran eigenvector spatial filter approach |
Wang et al. | Landslide susceptibility analysis based on a PSO-DBN prediction model in an earthquake-stricken area |
Babaei et al. | InstanceSHAP: an instance-based estimation approach for Shapley values |
CN108446738A (en) | A kind of clustering method, device and electronic equipment |
Zhao et al. | Pareto-based many-objective convolutional neural networks |
CN115454949A (en) | Shared data determination method and device, electronic equipment and storage medium |
CN117235633A (en) | Mechanism classification method, mechanism classification device, computer equipment and storage medium |
CN111368337B (en) | Sample generation model construction and simulation sample generation method and device for protecting privacy |
CN114443593B (en) | Multi-party data sharing method based on generation of countermeasure network and related equipment |
Pelegrina et al. | A novel multi-objective-based approach to analyze trade-offs in Fair Principal Component Analysis |
Dharmawan et al. | Tsunami tide prediction in shallow water using recurrent neural networks: Model implementation in the Indonesia Tsunami Early Warning System |
CN111221880B (en) | Feature combination method, device, medium, and electronic apparatus |
CN113947431A (en) | User behavior quality evaluation method, device, equipment and storage medium |
CN114549174A (en) | User behavior prediction method and device, computer equipment and storage medium |
CN111445282A (en) | Service processing method, device and equipment based on user behaviors |
CN116703498B (en) | Commodity recommendation method and device, electronic equipment and storage medium |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |