CN114912142A - Data desensitization method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114912142A
CN114912142A (application CN202210429481.2A)
Authority
CN
China
Prior art keywords
data
model
desensitization
sample
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210429481.2A
Other languages
Chinese (zh)
Inventor
张正欣
牟黎明
王豪
肖春亮
张宏
何坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhou Lvmeng Chengdu Technology Co ltd
Nsfocus Technologies Inc
Nsfocus Technologies Group Co Ltd
Original Assignee
Shenzhou Lvmeng Chengdu Technology Co ltd
Nsfocus Technologies Inc
Nsfocus Technologies Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhou Lvmeng Chengdu Technology Co ltd, Nsfocus Technologies Inc, Nsfocus Technologies Group Co Ltd filed Critical Shenzhou Lvmeng Chengdu Technology Co ltd
Priority to CN202210429481.2A
Publication of CN114912142A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G06F 21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218: Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245: Protecting personal data, e.g. for financial or medical purposes
    • G06F 21/6254: Protecting personal data, e.g. for financial or medical purposes, by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses a data desensitization method and device, an electronic device and a storage medium. A data desensitization adversarial network model is trained in advance; it comprises a generative model and a discriminative model obtained through adversarial learning, so the desensitized data produced by the model offers good security. To generate desensitized data, the original data to be processed is acquired, classified and labeled, and a condition vector is generated from the classification labels. After a random distribution vector is generated, the random distribution vector and the condition vector are input into the data desensitization adversarial network model, and the generative model in the model outputs the desensitized data corresponding to the original data. Adding the condition vector refines the desensitized data, which comprises both numerical information and category information, so the desensitized data remains highly usable.

Description

Data desensitization method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence and data security, in particular to a data desensitization method, a data desensitization device, electronic equipment and a storage medium.
Background
At present, with the development of big data, artificial intelligence and the Internet of Things, effectively protecting personal privacy and preventing the leakage of enterprises' sensitive data has become a growing focus for security vendors. The integration of new information technology with industry has promoted the circulation and sharing of industrial data. Industrial data, however, is highly sensitive: if enterprise data, APP data and the like are tampered with or stolen, user security, enterprise security and even national security are seriously threatened. Against this background, data desensitization technology has emerged as an effective way to address these security issues and risks.
Data desensitization refers to transforming sensitive information according to preset rules or a transformation algorithm so that individuals can no longer be identified, or so that the sensitive information is hidden outright. Desensitized data inevitably suffers some information loss, so privacy protection and data availability are inherently in tension.
Traditional desensitization techniques for structured data are based on anonymization and perturbation. Both incur a large information loss, and both preserve a one-to-one correspondence between the original data and the desensitized data, leaving the desensitized data at great risk of re-identification.
How to guarantee both the security and the availability of desensitized data is therefore a pressing technical problem.
Disclosure of Invention
The embodiments of the invention provide a data desensitization method and device, an electronic device and a storage medium, which are used to ensure the security and availability of desensitized data.
An embodiment of the invention provides a data desensitization method comprising the following steps:
acquiring original data to be processed, classifying and labeling the original data, and generating a condition vector according to the classification labels;
generating a random distribution vector, and inputting the random distribution vector and the condition vector into a pre-trained data desensitization adversarial network model, wherein the data desensitization adversarial network model comprises a generative model and a discriminative model trained through adversarial learning;
and outputting, based on the generative model in the data desensitization adversarial network model, desensitized data corresponding to the original data, wherein the desensitized data comprises numerical information and category information.
Further, the process of training the generative model and the discriminative model through adversarial learning comprises:
classifying and labeling the sample data in a training set, and generating a sample condition vector according to the classification labels;
generating a sample random distribution vector, inputting the sample random distribution vector and the sample condition vector into the generative model of the data desensitization adversarial network model, and outputting sample desensitization data from the generative model;
inputting the sample desensitization data and the sample condition vector into the discriminative model of the data desensitization adversarial network model, which outputs a first probability that the sample desensitization data is a real sample;
inputting the sample data and the sample condition vector into the discriminative model, which outputs a second probability that the sample data is a real sample;
and training the generative model and the discriminative model according to the first probability and the second probability.
Further, training the generative model and the discriminative model according to the first probability and the second probability comprises:
calculating the Wasserstein distance between the first probability and the second probability using the WGAN method, and training the generative model and the discriminative model according to the Wasserstein distance.
Further, the training process of the discriminative model comprises:
before each parameter update, clipping the computed gradient and adding random noise to it using a differential privacy algorithm; and updating the parameters according to the clipped, noised gradient.
Further, the process of determining the gradient comprises:
determining a joint loss function of the generative adversarial network and the conditional generative adversarial network, a statistical-information loss function covering expectation and variance, and a hinge loss function, and determining a target loss function and the corresponding gradient from the joint loss function, the statistical-information loss function and the hinge loss function.
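As a rough illustration of the statistical-information loss named in this claim, the sketch below compares per-feature expectation and variance between a real batch and a generated batch. The function names and the squared-error form are assumptions for illustration; the patent does not spell out the exact formula.

```python
def batch_mean_var(batch):
    """Per-feature mean (expectation) and variance of a batch given as
    a list of equal-length feature rows."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(d)]
    return means, variances

def statistical_info_loss(real_batch, fake_batch):
    """Squared gap between the real and generated batches' per-feature
    means and variances (the squared-error form is an illustrative assumption)."""
    (rm, rv), (fm, fv) = batch_mean_var(real_batch), batch_mean_var(fake_batch)
    return sum((a - b) ** 2 for a, b in zip(rm, fm)) + \
           sum((a - b) ** 2 for a, b in zip(rv, fv))

real = [[0.0, 1.0], [2.0, 3.0]]
# identical batches give zero statistical-information loss
```

Minimizing such a term pushes the generator to reproduce the overall distributional statistics of the original table, which is exactly what desensitized structured data needs to stay usable.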
Further, the structure of the generative model comprises, in order from input to output: a fully connected layer, a ReLU layer, a residual network layer, a fully connected layer and a Softmax layer.
Further, the structure of the discriminative model comprises, in order from input to output: a fully connected layer, a LeakyReLU layer and a fully connected layer.
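The two layer stacks above can be sketched as plain forward passes. All layer widths, the random initialization, and the 0.2 LeakyReLU slope are illustrative assumptions, not values given in the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fc(n_in, n_out):
    """A randomly initialized fully connected layer (weights illustrative)."""
    W = rng.normal(size=(n_in, n_out)) * 0.1
    b = np.zeros(n_out)
    return lambda x: x @ W + b

# Generator: FC -> ReLU -> residual block -> FC -> Softmax
g_in, g_res, g_out = fc(16, 32), fc(32, 32), fc(32, 8)

def generator(z_and_cond):
    h = relu(g_in(z_and_cond))
    h = h + relu(g_res(h))      # residual connection
    return softmax(g_out(h))    # category part of the output

# Discriminator: FC -> LeakyReLU -> FC (one score per sample)
d_in, d_out = fc(8, 32), fc(32, 1)

def discriminator(x):
    return d_out(leaky_relu(d_in(x)))

batch = rng.normal(size=(4, 16))  # random vector + condition vector, concatenated
fake = generator(batch)           # shape (4, 8); each row sums to 1
scores = discriminator(fake)      # shape (4, 1)
```

The residual connection matches FIG. 7's residual network layer and eases gradient flow through the generator; the Softmax output carries the category information mentioned throughout the description.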
In another aspect, an embodiment of the invention provides a data desensitization apparatus comprising:
an acquisition module configured to acquire original data to be processed, classify and label the original data, and generate a condition vector according to the classification labels;
an input module configured to generate a random distribution vector and input the random distribution vector and the condition vector into a pre-trained data desensitization adversarial network model, wherein the data desensitization adversarial network model comprises a generative model and a discriminative model trained through adversarial learning;
and an output module configured to output, based on the generative model in the data desensitization adversarial network model, desensitized data corresponding to the original data, wherein the desensitized data comprises numerical information and category information.
Further, the apparatus further comprises:
a training module configured to classify and label the sample data in a training set and generate a sample condition vector according to the classification labels; generate a sample random distribution vector, input the sample random distribution vector and the sample condition vector into the generative model of the data desensitization adversarial network model, and have the generative model output sample desensitization data; input the sample desensitization data and the sample condition vector into the discriminative model of the data desensitization adversarial network model, which outputs a first probability that the sample desensitization data is a real sample; input the sample data and the sample condition vector into the discriminative model, which outputs a second probability that the sample data is a real sample; and train the generative model and the discriminative model according to the first probability and the second probability.
The training module is specifically configured to calculate the Wasserstein distance between the first probability and the second probability using the WGAN method, and to train the generative model and the discriminative model according to the Wasserstein distance.
The training module is specifically configured to clip the computed gradient and add random noise to it using a differential privacy algorithm before each parameter update, and to update the parameters according to the clipped, noised gradient.
The training module is specifically configured to determine a joint loss function of the generative adversarial network and the conditional generative adversarial network, a statistical-information loss function covering expectation and variance, and a hinge loss function, and to determine a target loss function and the corresponding gradient from these three loss functions.
In another aspect, an embodiment of the invention provides an electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another via the communication bus;
a memory for storing a computer program;
a processor for implementing any of the above method steps when executing a program stored in the memory.
In another aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps of any one of the above.
The embodiments of the invention provide a data desensitization method and device, an electronic device and a storage medium. The method comprises: acquiring original data to be processed, classifying and labeling the original data, and generating a condition vector according to the classification labels; generating a random distribution vector, and inputting the random distribution vector and the condition vector into a pre-trained data desensitization adversarial network model, wherein the model comprises a generative model and a discriminative model trained through adversarial learning; and outputting, based on the generative model, desensitized data corresponding to the original data, wherein the desensitized data comprises numerical information and category information.
The technical scheme has the following advantages or beneficial effects:
In the embodiments of the invention, a data desensitization adversarial network model is trained in advance; it comprises a generative model and a discriminative model obtained through adversarial learning. Once training is complete, the discriminative model can hardly tell the desensitized data produced by the generative model apart from the original data, so the desensitized data obtained from the model offers good security. To generate desensitized data, the original data to be processed is acquired, classified and labeled, and a condition vector is generated from the classification labels. After a random distribution vector is generated, the random distribution vector and the condition vector are input into the data desensitization adversarial network model, and the generative model outputs the desensitized data corresponding to the original data. Adding the condition vector refines the desensitized data, which comprises both numerical information and category information, so the desensitized data remains highly usable.
Drawings
To illustrate the technical solutions of the embodiments more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described here represent only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a data desensitization process provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a training process for generating a model and a discriminant model according to an embodiment of the present invention;
FIG. 3 is a diagram of an overall framework for desensitization based on neural networks according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a conventional GAN network structure;
FIG. 5 is a flow chart of data desensitization provided by embodiments of the present invention;
FIG. 6 is a schematic diagram of a generative model structure provided in an embodiment of the present invention;
FIG. 7 is a schematic diagram of a residual network structure according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a discriminant model according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a data desensitization apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the attached drawings, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms to which the present invention relates will be explained first.
Data representation form classification:
data in reality mainly exists in two expressions, namely structured data and unstructured data. Common structured data is data that exists in the form of a database or the like. There are many types of unstructured data, and common unstructured data includes data in the form of text, images, voice, and so on.
The structured data has a very fixed mode and is characterized in that: data is in row units, one row of data represents information of one entity, and the attribute of each row of data is the same. The storage and arrangement of the structured data is very regular, which is helpful for operations such as query and modification. Desensitization to structured data therefore often starts with an overall distribution of data.
Data desensitization:
data desensitization refers to data transformation of sensitive information according to a preset rule or a transformation algorithm, so that the individual identity cannot be identified or the sensitive information is directly hidden. The desensitized data can have a certain information loss, so privacy disclosure and data availability are natural contradictions.
The prior art and the technical defects are as follows:
Traditional structured data desensitization techniques:
Traditional research on structured data desensitization largely falls into anonymization-based techniques and perturbation-based techniques.
Common anonymization-based techniques include k-anonymity and its successors, L-diversity and t-closeness, all of which make a record indistinguishable within the whole dataset by generalizing its quasi-identifiers.
Perturbation-based techniques add noise to the data, obscuring it. Both families of techniques incur a large information loss, and both retain a one-to-one correspondence between the original and desensitized data, which leaves the desensitized data at great risk of re-identification.
A concrete example is given in Table 1 below; its contents are purely illustrative:
name (I) Sex Age (age) Post code Purchasing preferences
Alice For male 24 100083 Body-building apparatus
Bob For male 23 100084 Garment
Candy Woman 26 100102 Skin care product
Davis Woman 27 100104 Cooking tool
Edith For male 36 102208 Body-building apparatus
Frank For male 36 102201 Garment
Grace Female 34 102218 Book with detachable cover
Holley Woman 33 102219 Garment
TABLE 1
The data is represented in the form of tables, each row representing a piece of data (record) and each column representing an attribute (attribute). Each piece of data is associated with a particular user/individual. Table 1 attributes can be divided into three categories:
identifiers (identifiers): generally, the unique identification of an individual, such as name, address, telephone, etc., is not shown when the data needs to be disclosed;
quasi-identifier (quasi-identifier): such as zip code, age, birthday, etc. are not unique, but can help researchers manage the identification of relevant data;
sensitive data: the user does not want data that is known to someone. Such as purchasing preferences, salaries, etc., which are of the greatest interest to researchers.
In addition to the three categories described above, there are also categories of attributes that are not shown in table 1, such as:
non-sensitive data: can be directly disclosed without any dangerous data, such as serial numbers.
In brief, the purpose of k-anonymity is to ensure that every record in the published data is indistinguishable from at least k-1 other records on the quasi-identifiers.
For example, suppose a published dataset is protected with 2-anonymity. If an attacker tries to identify Frank's sensitive information by querying his age, zip code and gender, the attacker will find at least two records sharing that same age, zip code and gender. The attacker therefore cannot tell which of the two records is Frank's, and Frank's privacy is preserved. This is shown in Table 2 below: the names (identifiers) are all replaced with "*"; gender, age and zip code are quasi-identifiers; and purchase preference is the sensitive attribute.
[Table 2: example of 2-anonymity; rendered as an image in the original publication]
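The k-anonymity property described above can be checked mechanically by counting quasi-identifier combinations. A minimal sketch, where the record layout and field names are illustrative:

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True if every combination of quasi-identifier values
    occurs at least k times in the records."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())

# Generalized records in the spirit of Table 2 (values illustrative)
table = [
    {"gender": "M", "age": "(20,30]", "zip": "10008*", "pref": "Fitness"},
    {"gender": "M", "age": "(20,30]", "zip": "10008*", "pref": "Clothing"},
    {"gender": "M", "age": "(30,40]", "zip": "10220*", "pref": "Fitness"},
    {"gender": "M", "age": "(30,40]", "zip": "10220*", "pref": "Clothing"},
]
# each (gender, age, zip) combination appears twice, so the table is 2-anonymous
```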
Although k-anonymity reduces the likelihood of re-identification attacks (records may still be recovered or identified by an attacker with background knowledge or other information sources), information about other sensitive attributes in the table can still leak. Such sensitive-attribute information enables homogeneity attacks and background-knowledge attacks, collectively known as attribute disclosure attacks. To counter them, L-diversity was proposed on top of k-anonymity, introducing the notion of diversity for sensitive attributes: among records sharing the same quasi-identifier values, the sensitive data must be diverse enough that user privacy cannot be inferred through background knowledge or similar means. L-diversity guarantees that each quasi-identifier group contains at least L distinct values of the sensitive attribute. Table 3 below shows an example of 3-diversity: each combination of quasi-identifier values contains 3 different sensitive values.
| Name | Gender | Age     | Zip code | Purchase preference |
| ---- | ------ | ------- | -------- | ------------------- |
| *    | Male   | (20,30] | 10008*   | Fitness equipment   |
| *    | Male   | (20,30] | 10008*   | Fitness equipment   |
| *    | Male   | (20,30] | 10008*   | Fitness equipment   |
| *    | Male   | (20,30] | 10008*   |                     |
| *    | Male   | (20,30] | 10008*   |                     |
| *    | Male   | (20,30] | 10008*   |                     |
| *    | Male   | (20,30] | 10008*   | Books               |
| *    | Male   | (20,30] | 10008*   | Cookware            |
Table 3: example of 3-diversity
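The L-diversity property can likewise be checked by counting distinct sensitive values per quasi-identifier group; a minimal sketch with illustrative field names:

```python
from collections import defaultdict

def is_l_diverse(records, quasi_ids, sensitive, l):
    """True if every quasi-identifier group contains at least
    l distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_ids)].add(r[sensitive])
    return all(len(vals) >= l for vals in groups.values())

# One quasi-identifier group with 3 distinct sensitive values, as in Table 3
group = [
    {"gender": "M", "age": "(20,30]", "zip": "10008*", "pref": "Fitness"},
    {"gender": "M", "age": "(20,30]", "zip": "10008*", "pref": "Books"},
    {"gender": "M", "age": "(20,30]", "zip": "10008*", "pref": "Cookware"},
]
# the group is 3-diverse but not 4-diverse
```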
L-diversity, however, remains vulnerable to attackers who know the global distribution of the attributes. To address this, the t-closeness algorithm was subsequently proposed: t-closeness guarantees that the distribution of the sensitive attribute within each quasi-identifier group stays close to its distribution over the whole dataset, with the difference between the two distributions not exceeding a threshold t. Yet even when k-anonymity, L-diversity and t-closeness are all guaranteed simultaneously, information can still leak through an attacker's background knowledge.
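A t-closeness-style check compares the sensitive-value distribution inside a group with the overall distribution. The sketch below uses total variation distance for simplicity; the original t-closeness formulation uses the Earth Mover's Distance, so this is only an approximation of the idea:

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of sensitive values."""
    total = len(values)
    return {v: c / total for v, c in Counter(values).items()}

def tv_distance(p, q):
    """Total variation distance between two discrete distributions
    (a simple stand-in for the Earth Mover's Distance of t-closeness)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

overall = ["A", "A", "B", "B"]   # sensitive values over the whole table
group   = ["A", "B"]             # sensitive values inside one QI group
d = tv_distance(distribution(overall), distribution(group))
# identical 50/50 distributions give distance 0, satisfying any threshold t > 0
```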
Furthermore, data perturbation is typically a numerical transformation that adds "noise" to the database. Unlike anonymization methods, perturbation techniques modify not only the quasi-identifiers but also the sensitive data itself, which can significantly impair the availability of the modified data.
In view of the above, the technical problems the invention aims to solve and the inventive concept are as follows:
A neural-network-based desensitization system has the following advantages: (1) it breaks the one-to-one mapping between data before and after desensitization, eliminating the risk of reverse attacks on the desensitized data; (2) it achieves better usability and security after desensitization; (3) it has better fault tolerance and adaptability for data from different sources (different regions, enterprises, and so on); and (4) it can learn from examples.
Therefore, targeting the shortcomings of anonymization- and perturbation-based desensitization, and starting from securing the data generation stage, the invention proposes an adversarial generation network (ResNet + GAN + DP) for generating and desensitizing structured data, ultimately producing desensitized data with high usability and security.
The data desensitization process provided by the embodiments of the present invention is described in detail below.
Fig. 1 is a schematic diagram of a data desensitization process provided by an embodiment of the present invention, where the process includes:
S101: acquiring original data to be processed, classifying and labeling the original data, and generating a condition vector according to the classification labels.
S102: generating a random distribution vector, and inputting the random distribution vector and the condition vector into a pre-trained data desensitization adversarial network model, wherein the model comprises a generative model and a discriminative model trained through adversarial learning.
S103: outputting, based on the generative model in the data desensitization adversarial network model, desensitized data corresponding to the original data, wherein the desensitized data comprises numerical information and category information.
The data desensitization method provided by the embodiments of the invention is applied to an electronic device, which may be a PC, a tablet computer or the like, or a server.
After acquiring the original data to be processed, the electronic device determines the category information of the original data, classifies and labels the original data accordingly, and generates a condition vector from the classification labels. For example, if the classification label of the original data includes "name", the label "name" is encoded to obtain the condition vector. The electronic device also generates a randomly distributed vector, which may also be called a latent variable. The random distribution vector and the condition vector are then input into the pre-trained data desensitization adversarial network model, and the generative model in that model outputs the desensitized data corresponding to the original data, comprising numerical information and category information.
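The "name" example above might be encoded as follows. One-hot encoding and the label set are illustrative assumptions, since the patent only states that the classification label is encoded into a condition vector:

```python
import random

def condition_vector(label, categories):
    """One-hot encode a classification label such as "name" into a
    condition vector (the one-hot scheme is an illustrative assumption)."""
    return [1.0 if c == label else 0.0 for c in categories]

categories = ["name", "age", "zip_code"]     # illustrative label set
cond = condition_vector("name", categories)  # [1.0, 0.0, 0.0]

rng = random.Random(42)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]  # random distribution vector (latent variable)
model_input = z + cond                       # concatenated input to the generative model
```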
This embodiment mainly describes the training process of the data desensitization adversarial network model. The model comprises a generative model and a discriminative model, so training the model amounts to training the generative model and the discriminative model.
Fig. 2 is a schematic diagram of a training process of a generative model and a discriminant model according to an embodiment of the present invention, where the training process includes:
S201: classifying and labeling the sample data in the training set, and generating a sample condition vector according to the classification labels.
S202: generating a sample random distribution vector, inputting the sample random distribution vector and the sample condition vector into the generative model of the data desensitization adversarial network model, and having the generative model output sample desensitization data.
S203: inputting the sample desensitization data and the sample condition vector into the discriminative model of the data desensitization adversarial network model, which outputs a first probability that the sample desensitization data is a real sample.
S204: inputting the sample data and the sample condition vector into the discriminative model, which outputs a second probability that the sample data is a real sample.
S205: training the generative model and the discriminative model according to the first probability and the second probability.
When the electronic device trains the generation model and the discrimination model, a training set is stored in advance, and the data in the training set is called sample data. The sample data is classified and marked, and a sample condition vector is generated according to the classification marking. The generated sample random distribution vector and the sample condition vector are then input into the generation model in the data desensitization adversarial network model, and the generation model outputs sample desensitization data.
The sample desensitization data and the sample condition vector are input into the discrimination model in the data desensitization adversarial network model, and the discrimination model outputs a first probability that the sample desensitization data is a true sample; the sample data and the sample condition vector are input into the same discrimination model, which outputs a second probability that the sample data is a true sample. Finally, the generation model and the discrimination model are trained according to the first probability and the second probability.
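As an illustration of step S201, classification marking can be sketched as mapping each attribute-class label to a one-hot sample condition vector. This is a minimal pure-Python sketch; the attribute names and the one-hot encoding are assumptions for exposition, not the patent's exact implementation:

```python
# Hypothetical sketch of S201: turn a classification mark into a one-hot
# sample condition vector over the known attribute classes.

def build_condition_vector(label, classes):
    """Return a one-hot condition vector for `label` over `classes`."""
    vec = [0.0] * len(classes)
    vec[classes.index(label)] = 1.0
    return vec

# Hypothetical attribute classes of the sample data.
classes = ["name", "address", "occupation", "id_number"]
cond = build_condition_vector("address", classes)  # [0.0, 1.0, 0.0, 0.0]
```

In practice such a sparse vector would be passed through an embedding layer, as described later for the condition Embedding layer.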
The traditional adversarial network model suffers from non-convergent training, vanishing gradients, mode collapse and similar problems. To address them, training the generation model and the discrimination model according to the first probability and the second probability in the embodiment of the invention comprises: calculating the Wasserstein distance between the first probability and the second probability by adopting the WGAN method, and training the generation model and the discrimination model according to the Wasserstein distance. The Wasserstein distance replaces the JS divergence and KL divergence used to measure differences between distributions in the traditional adversarial network model, and compensates for its shortcomings to a great extent.
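A minimal sketch of how the first and second probabilities (the critic's scores on generated and real samples) enter the WGAN objective. This pure-Python fragment is illustrative only and assumes the scores have already been computed by the discrimination model:

```python
# Illustrative WGAN critic objective: the difference between the mean critic
# score on generated samples and on real samples approximates the
# Wasserstein distance between the two distributions.

def wgan_losses(scores_fake, scores_real):
    mean_fake = sum(scores_fake) / len(scores_fake)
    mean_real = sum(scores_real) / len(scores_real)
    d_loss = mean_fake - mean_real  # discriminator (critic) minimizes this
    g_loss = -mean_fake             # generator tries to raise its fakes' scores
    return d_loss, g_loss

d_loss, g_loss = wgan_losses([0.2, 0.4], [0.9, 0.7])
```

The critic is driven to separate the two means, while the generator is driven to close the gap; no log terms appear, unlike the original GAN loss.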
In order to further ensure the security of the desensitized data, in the embodiment of the present invention the training process of the discrimination model comprises: before updating the parameters, clipping the computed gradient with a differential privacy algorithm and adding random noise; then updating the parameters according to the clipped, noised gradient.
To ensure data security, the embodiment of the present invention introduces differential privacy, which can largely guarantee the security of the generated desensitized data: by definition, if the training process of a model is differentially private, the model obtained when the training set contains a particular sample is, with high probability, indistinguishable from the model obtained when it does not. Differential privacy is therefore added to the training process of the discrimination model, so that the whole data desensitization adversarial network model remains differentially private.
In the embodiment of the invention the discrimination model is trained with the DP-SGD algorithm, an extension of the traditional stochastic gradient descent (SGD) algorithm. During parameter optimization, SGD first computes the gradient of the loss function with respect to the model parameters θt, then subtracts the product of the gradient and the learning rate lr from the current parameters to update them to θt+1, repeating this process until the loss converges. DP-SGD additionally clips the computed gradient and adds corresponding noise before each parameter update: after the gradient is clipped, random noise obeying a Gaussian distribution is added, introducing randomness into the discriminator model; that is, Gaussian noise is added to the clipped gradients, the average is computed, and that value is used for the parameter update.
The loss function is the key to training the data desensitization adversarial network model and guides its training process; essentially, training the model is the process of optimizing the loss function. In the embodiment of the present invention, determining the gradient comprises: determining a joint loss function of the generative adversarial network and the conditional generative adversarial network, a statistical information loss function comprising expectation and variance, and a hinge loss function, and determining the target loss function and the corresponding gradient from these three.
In the whole network of the embodiment of the invention, in addition to the common joint loss function of WGAN and CGAN, a statistical information loss function and a hinge loss function are added in order to control the desensitization effect and improve the quality of the generated data; the target loss function and the corresponding gradient are determined from the joint loss function, the statistical information loss function and the hinge loss function.
In the embodiment of the present invention, the structure of the generation model comprises, from input side to output side: a fully connected layer, a ReLU layer, a residual network layer, a fully connected layer, and a Softmax layer. The structure of the discrimination model comprises, from input side to output side: a fully connected layer, a LeakyReLU layer, and a fully connected layer.
The following describes in detail the training process of the generation model and the discrimination model in the data desensitization adversarial network model; after their training is completed, the generation model is used to generate desensitized data.
The English abbreviations are interpreted as follows:
WGAN (Wasserstein Generative Adversarial Nets): Wasserstein generative adversarial network;
CGAN (Conditional Generative Adversarial Nets): conditional generative adversarial network;
DP (Differential Privacy): differential privacy;
ResNet: residual network;
DP-SGD (Differentially Private Stochastic Gradient Descent): stochastic gradient descent based on differential privacy;
l-diversity;
t-closeness.
The embodiment of the invention provides a data desensitization method based on a neural network. First, a sample random distribution vector and the corresponding sample condition vector are taken as the input of the generation model (G), while the discrimination model (D) judges whether sample data is real data or data generated by G. Then the discrimination model is trained with the parameters of the generation model fixed; differential privacy (DP) is added during this training through gradient clipping and noise addition. Finally, the parameters of the discrimination model are fixed and its output probability is fed back to G to guide the gradient update of G. After multiple iterations of this alternating optimization, G can generate data. The ideas of WGAN and CGAN are adopted in the training process to optimize the quality of the generated desensitized data, and the DP, WGAN and CGAN strategies together ensure the security and availability of the generated data, thereby better serving personal, enterprise and national data security. In the embodiment of the invention, the desensitized data generated by the GAN with DP is close to the sample data in distribution and highly usable, but is not real data, so the desensitized data is highly secure.
WGAN stabilizes the updates of the D network by clipping the parameters of the D network.
CGAN was the first attempt at conditional generation with a GAN; it adds condition information c directly to the network input to control the conditional output of the network.
Differential privacy technique: it ensures that the change of a single individual in a data set has almost no effect on the statistical result over the whole data set, thereby providing strong privacy protection for individual sensitive information. DP describes, in probabilistic terms, the influence of a tiny change in the original data set on the final statistical output, and constrains that probability change through a privacy parameter ε, achieving the purpose of privacy protection.
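The patent applies DP through DP-SGD with Gaussian noise. Purely as a generic illustration of the ε-constrained definition above (an assumption for exposition, not the patent's mechanism), the classic Laplace mechanism on a count query shows how one individual's presence barely shifts the released statistic:

```python
import math
import random

# Generic DP illustration (not the patent's method): release a count query
# with Laplace noise of scale sensitivity / epsilon, so adjacent data sets
# (differing in one record) produce similarly distributed outputs.

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of Laplace(0, scale).
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon, rng):
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # adding/removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

ages = [23, 41, 35, 67, 52]  # hypothetical records
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=random.Random(7))
```

Smaller ε means larger noise scale and stronger privacy, mirroring the role ε plays in the definition above.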
Fig. 3 is a desensitization overall framework diagram based on a neural network according to an embodiment of the present invention, which includes the following specific steps:
Step one: input data construction.
For the whole system, the input data must be constructed first. The input of the generation model (G) is formed by adding or concatenating a sample random distribution vector, also called the hidden vector z, and the corresponding sample condition vector. The sample random distribution vector is a random noise vector; training on such vectors yields the generated desensitization data. The sample condition vector is an embedding vector of the attribute-class label of the sample data; its role is to guide which attribute class the corresponding generated value should belong to. For example, a database may have four attributes such as name, address, occupation and identification number. Each attribute is encoded to obtain a randomly initialized embedding vector representation e_cond, and an embedding vector representation e_z is generated for the attribute value (hidden variable z). The encoding of the data fed to the generator is then given by Equation 1:
e_i = concat[e_cond, e_z]   (Equation 1).
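Equation 1 can be sketched as a plain concatenation; the dimensions and values below are illustrative assumptions:

```python
# Sketch of Equation 1: the generator input e_i is the concatenation of a
# condition embedding e_cond and the hidden-vector embedding e_z.

def build_generator_input(e_cond, e_z):
    return e_cond + e_z  # concat[e_cond, e_z]

e_cond = [0.1, 0.9]      # hypothetical embedding of an attribute-class label
e_z = [0.3, -0.2, 0.5]   # hypothetical embedding of the hidden variable z
e_i = build_generator_input(e_cond, e_z)  # 5-dimensional generator input
```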
Step two: training.
The idea behind the traditional GAN network is a two-player game in which the sum of the two players' payoffs is constant. In a GAN the two players are the generation model (G) and the discrimination model (D), as shown in fig. 4. The generation model produces realistic samples, while the discrimination model judges whether its input is real or fake. Generally, the generation model keeps improving its counterfeiting ability until it can deceive the discrimination model, while the discrimination model keeps improving its ability to tell real from fake; this forms the game, i.e. the adversarial process. The two networks, with opposing objectives, are trained alternately. At convergence, if the discrimination network can no longer judge where a sample came from, the generation network can produce samples that conform to the real data distribution. Applied to desensitization, this means the desensitized (generated) data conforms to the real data distribution and can be used for subsequent data analysis and mining (availability) while not being real data (security).
However, the original GAN network suffers from non-convergent training, vanishing gradients, mode collapse and similar problems. To address these, this scheme uses the WGAN strategy: the Wasserstein distance (a measure of the distance between two probability distributions) replaces the JS divergence and KL divergence used in the original GAN to measure differences between distributions, largely compensating for the GAN's deficiencies. Meanwhile, because the original GAN offers no control over which class an output belongs to at generation time, this scheme adopts the CGAN network, which improves on the GAN by adding a conditional distribution: extra condition information, which can be a class label or other auxiliary information, is supplied to both the generator and the discriminator of the original GAN, realizing a conditional generation model.
With these GAN concepts in hand, we now describe how to desensitize structured data using a GAN; the training procedure is shown in fig. 3. First, the input data e_i is fed into the GAN generation model to fit the data distribution of the training data (i.e. the data to be desensitized), and a series of transformations yields the generated data (i.e. desensitized data) G(e_i). The generated data is then fed into the discrimination model for prediction, and the discrimination model parameters are optimized to obtain D(G(e_i)). Next, the discrimination model parameters are fixed and the discrimination probability is fed back to G to guide the gradient update of G. Once model training has converged, the generation model can generate data and desensitize it at the same time. To control the quality of the generated data, i.e. the degree of desensitization and the degree of security, the embodiment of the present invention introduces a statistical information loss function, which controls the degree of fitting and hence the degree of desensitization. Because the CGAN training idea is adopted, condition information must also be added, which amounts to placing a layer of conditional constraint (class-label constraint) on the hidden variable z to constrain the generated data, as in Equation 7. In addition, the DP idea (gradient clipping, noise addition) is applied to the training of the discriminator to better ensure data security.
Step three: desensitization process.
Once the trained model is available, desensitization is performed with it: only the trained GAN generation model is used, and the e_i vector of the data to be desensitized is input into the generation model (G) to obtain G(e_i), as shown in fig. 5. Desensitized data can be generated on demand, and data users can employ the generated data for statistical analysis of the data distribution, data mining and other purposes.
Step four: constructing the generation model (G) and the discrimination model (D).
Having introduced the training and desensitization processes of the whole desensitization system, the generation model (G) and the discrimination model (D) in the network structure, the core modules of the whole GAN network, are now described in detail.
Generation model (G):
The generation model can take many forms and can combine different neural network models; the main structure of the generator described herein is a multilayer perceptron combined with a ResNet structure, as shown in FIG. 6. First, a brief description of the ResNet network.
ResNet is a residual network, which can be understood as a sub-network that can be stacked in multiple layers to form a very deep network. The plus sign in fig. 6 can simply be understood as vector addition, part of the residual implementation: the residual network adds the information from the previous layer to that layer's nonlinearly transformed output. Consider the structure of ResNet in FIG. 7. It is generally understood that for neural networks, the deeper the network, the more information it can capture and the richer its features. Experiments show, however, that as the network deepens the optimization result worsens and the accuracy on both test and training data drops, because deepening the network causes gradient explosion and vanishing-gradient problems. The residual network was created in response: the output within the frame is obtained not only from the nonlinear transformation of the previous layer's output but also by adding the shallow-layer information x, which greatly alleviates the loss of information as the number of network layers grows.
The generation model of the embodiment of the invention is built with the ResNet structure as its main component, combining the multilayer perceptron and ResNet to retain the model's shallow information representation. Its input is formed by adding or concatenating the hidden variable z and the condition embedding vector, as described in step one. The condition Embedding layer converts the sparse one-hot encoded condition vector into a distributed (Embedding) representation. The process by which the generation model produces data is given by Equation 2:
h_w1 = e_i + relu(w_1 e_i + b_1);
h_w2 = h_w1 + relu(w_2 h_w1 + b_2);
g(i) = Gumbel-Softmax(w_3 h_w2 + b_3)   (Equation 2).
The input vector e_i passes successively through two residual network modules and finally through Gumbel-Softmax to obtain the generated data. Sampling the output from the Gumbel-Softmax distribution ensures that the parameters of both the generation model and the discrimination model can be updated by a gradient-based algorithm.
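A minimal pure-Python sketch of Gumbel-Softmax sampling (illustrative; the patent does not give its exact implementation): Gumbel noise is added to the logits and a temperature-scaled softmax yields a near-discrete but differentiable output:

```python
import math
import random

# Illustrative Gumbel-Softmax: perturb logits with Gumbel(0, 1) noise, then
# apply a temperature-scaled softmax; low tau pushes the output toward one-hot
# while keeping it differentiable for gradient-based training.

def gumbel_softmax(logits, tau, rng):
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scores)                     # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=random.Random(0))
```

The output is always a valid probability vector over the generator's categories.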
Discrimination model (D):
After the generation model, the discrimination model must be constructed. It also uses a multilayer-perceptron framework; the activation function in each fully connected layer is LeakyReLU, which avoids the dying-neuron problem of the ReLU activation. In each forward pass, either the output of the generation model or the original data is taken as input, a binary (real/fake) judgment is made, and the parameters of the discrimination model are updated. The structure of the discrimination model is shown in FIG. 8, and the discriminator's decision process is given by Equation 3. Step five adds the DP idea to ensure the security of the generated data.
h_w1 = leakyrelu(w_1 g(i) + b_1);
h_w2 = leakyrelu(w_2 h_w1 + b_2);
D(i) = w_3 h_w2 + b_3   (Equation 3).
Step five: differential privacy for the discriminator.
To ensure data security, the embodiment of the present invention introduces differential privacy, which can largely guarantee the security of the generated data: by definition, if the training process of a model is differentially private, the model obtained when the training set contains a particular sample is, with high probability, indistinguishable from the model obtained when it does not. Differential privacy is added at the discriminator so that the whole network maintains differential privacy.
In the embodiment of the invention the discriminator is trained with the DP-SGD algorithm, an extension of the traditional stochastic gradient descent (SGD) algorithm. During parameter optimization, SGD first computes the gradient of the loss function with respect to the model parameters θt, then subtracts the product of the gradient and the learning rate lr from the current parameters to update them to θt+1, repeating this process until the loss converges. DP-SGD additionally clips the computed gradient and adds corresponding noise before each parameter update.
The embodiment of the present invention performs L2-norm clipping on the gradient during the update of the discriminator parameters, as shown in Equation 4.
ḡ(x_i) = g(x_i) / max(1, ||g(x_i)||_2 / C)   (Equation 4)
where g(x_i) denotes the gradient vector and C is the gradient threshold. The formula means that if the L2 norm of the gradient is less than or equal to C, the gradient is unchanged; otherwise the gradient is scaled down by the ratio C / ||g(x_i)||_2. Gradient clipping avoids gradient explosion and reduces overfitting, so the model converges better.
After the gradient is clipped, random noise obeying a Gaussian distribution is added, introducing randomness into the discriminator model: Gaussian noise is added to the clipped gradients, the batch average is computed, and that value is used for the parameter update, as shown in Equation 5:
g = (1/B) Σ_i ( ḡ(x_i) + N(0, σ²C²·I) );
θ_{t+1} = θ_t − lr·g   (Equation 5)
where B is the batch size.
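Equations 4 and 5 together can be sketched in a few lines of pure Python. This is an illustrative sketch under simplifying assumptions; real discriminator gradients are high-dimensional and computed by backpropagation:

```python
import math
import random

# Sketch of DP-SGD (Equations 4-5): clip each per-sample gradient to L2 norm
# C, add Gaussian noise with standard deviation sigma*C per coordinate,
# average over the batch, and take a gradient step with learning rate lr.

def l2_clip(grad, C):
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(theta, per_sample_grads, C, sigma, lr, rng):
    dim = len(theta)
    summed = [0.0] * dim
    for grad in per_sample_grads:
        clipped = l2_clip(grad, C)
        for j in range(dim):
            summed[j] += clipped[j] + rng.gauss(0.0, sigma * C)
    g_tilde = [s / len(per_sample_grads) for s in summed]
    return [t - lr * g for t, g in zip(theta, g_tilde)]
```

For example, `l2_clip([3.0, 4.0], C=1.0)` scales the gradient (norm 5) down to roughly `[0.6, 0.8]`, matching the C / ||g||_2 ratio of Equation 4.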
where σ is the noise level; higher noise gives better security, but too much noise may distort the data.
It is also important for DP-SGD to control the privacy loss throughout the training of the discriminator so that it does not exceed the overall privacy budget. The embodiment of the present invention therefore uses the moments accountant privacy estimation method, shown in Equation 6:
c(o; M, d, d′) = log ( Pr[M(d) = o] / Pr[M(d′) = o] )   (Equation 6)
where M is the randomized algorithm, d and d′ denote adjacent data sets differing in a single record, and o is an output. The moments accountant bounds this privacy loss tightly, so DP-SGD can run more training iterations and learn more while staying within the privacy budget, achieving a better result under the security guarantee; that is, the discriminator retains good data security while the model remains effective.
The whole GAN network training process: the discrimination model is trained first, then the generation model.
Training process of the discrimination model: first the generation model parameters are fixed, and a random distribution vector with the corresponding condition vector is taken as the input of the generation model to obtain its output. The discriminator is then fitted (trained) against this output (the generated sample probability distribution) and the real sample probability distribution: the loss and the corresponding gradient are calculated with the joint WGAN/CGAN loss function of Equation 7, the statistical information loss function of Equation 8 and the hinge loss function of Equation 9; the computed gradient is clipped and noised using the differential privacy technique; and one round of discriminator parameter updates is completed with the differentially private gradient.
Training process of the generation model: the discrimination model parameters are fixed after a round of updates, and the discrimination probability is fed back to the generator to guide the generator's parameter updates.
After multiple rounds of alternating optimization of the discrimination model and the generation model, the training converges and the loss function is minimized.
Step six: the loss function.
With the architecture of the whole network in place, the loss function of the embodiment of the present invention is described below; it is the key to the neural network and guides its training. The whole network uses the common joint loss function of WGAN and CGAN, Equation 7, and additionally a statistical information loss function to control the desensitization effect (the quality of data generation).
min_G max_D V(D, G) = E_{x∼P_data}[D(x|y)] − E_{z∼P_z}[D(G(z|y))]   (Equation 7)
where x is the original data, z is the hidden variable, P_z denotes the distribution of the generation model's input variable, G is the generation model whose output is the generated data, D is the discrimination model, and y is the additional condition information.
The statistical information loss function, which mainly considers expectation and variance, can be used during model learning to optimize the quality of the generated data. The desensitization process should also control the quality of the generated data; specifically, the expectation and variance of the generated data are controlled by Equation 8, whose physical meaning is the difference between the expectation and variance of the original data and of the desensitized data.
L_f-mean = || E_{x∼P_data} f(x) − E_{z∼P_z} f(G(z)) ||_2 ;
L_f-sd = || SD_{x∼P_data} f(x) − SD_{z∼P_z} f(G(z)) ||_2   (Equation 8).
where f denotes a series of linear and nonlinear transformations of the data, E denotes the expectation of the model's last-layer features, and SD denotes their standard deviation. In a desensitization scenario it is usually undesirable to share desensitized data that is very similar to the original data with an untrusted party, so the desensitization effect (the quality of the generated data) must be controlled. Equation 9 therefore adds hinge loss terms over L_f-mean and L_f-sd; m_f-mean and m_f-sd are hyper-parameters of the hinge loss, and adjusting these two parameters controls the quality of data generation and the degree of desensitization.
L_statis = max(0, L_f-mean − m_f-mean) + max(0, L_f-sd − m_f-sd)   (Equation 9).
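Equations 8 and 9 can be sketched together in pure Python. This is an illustrative sketch under simplifying assumptions: `f` is reduced to the identity on a single scalar feature, and the thresholds are toy values:

```python
import math

# Sketch of Equations 8-9: compare the expectation and standard deviation of
# real vs generated features, then apply hinge thresholds m_f_mean / m_f_sd so
# the generator is not pushed to match the original data too closely.

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def statis_loss(real_feats, fake_feats, m_f_mean, m_f_sd):
    l_f_mean = abs(mean(real_feats) - mean(fake_feats))  # Equation 8, mean term
    l_f_sd = abs(sd(real_feats) - sd(fake_feats))        # Equation 8, SD term
    return max(0.0, l_f_mean - m_f_mean) + max(0.0, l_f_sd - m_f_sd)  # Equation 9

loss = statis_loss([1.0, 3.0], [2.0, 2.0], m_f_mean=0.0, m_f_sd=0.0)
```

Raising the thresholds m_f_mean and m_f_sd zeroes out small statistical differences, which is precisely how the degree of desensitization is tuned.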
Step seven: security and usability.
For a desensitization algorithm, an important principle is the need for security and availability assessments. Because the data generated by the GAN network still consists of concrete values or categories after desensitization, it can be used directly for model training at the data-availability level and gives an intuitive picture during data analysis and mining.
In conclusion, this structured-data desensitization algorithm achieves solid results in both availability and security, and brings a new idea for desensitization algorithms that must balance the two.
The embodiment of the invention performs data desensitization with a neural network algorithm based on a ResNet + GAN network. Unlike existing anonymization and perturbation algorithms, it overcomes the weakness that a one-to-one correspondence between original and desensitized data leaves earlier desensitization algorithms open to re-identification attacks, providing a new direction for desensitization algorithms. The whole system network uses the WGAN + CGAN + DP concept; compared with traditional desensitization algorithms it controls the quality of the generated data and guarantees its security, enhancing both the availability and the security of the desensitized data.
Fig. 9 is a schematic structural diagram of a data desensitization apparatus according to an embodiment of the present invention, where the apparatus includes:
the acquiring module 91 is configured to acquire original data to be processed, perform classification marking on the original data, and generate a condition vector according to the classification marking;
an input module 92, configured to generate a random distribution vector and input the random distribution vector and the condition vector into a pre-trained data desensitization adversarial network model, where the data desensitization adversarial network model comprises a generation model and a discrimination model trained through adversarial learning;
and an output module 93, configured to output desensitized data corresponding to the original data based on the generation model in the data desensitization adversarial network model, where the desensitized data comprises numerical information and category information.
The device further comprises:
a training module 94, configured to perform classification marking on sample data in a training set and generate a sample condition vector according to the classification marking; generate a sample random distribution vector and input the sample random distribution vector and the sample condition vector into the generation model in the data desensitization adversarial network model, the generation model outputting sample desensitization data; input the sample desensitization data and the sample condition vector into the discrimination model in the data desensitization adversarial network model, the discrimination model outputting a first probability that the sample desensitization data is a true sample; input the sample data and the sample condition vector into the discrimination model, the discrimination model outputting a second probability that the sample data is a true sample; and train the generation model and the discrimination model according to the first probability and the second probability.
The training module 94 is specifically configured to calculate a Wasserstein distance between the first probability and the second probability by using a WGAN method, and train the generation model and the discriminant model according to the Wasserstein distance.
The training module 94 is specifically configured to clip the determined gradient with a differential privacy algorithm and add random noise before updating the parameters, and to update the parameters according to the clipped, noised gradient.
The training module 94 is specifically configured to determine a joint loss function of the generated countermeasure network and the conditionally generated countermeasure network, a statistical information loss function including expectation and variance, and a hinge loss function, and determine a target loss function and a corresponding gradient according to the joint loss function, the statistical information loss function, and the hinge loss function.
An embodiment of the present invention further provides an electronic device, as shown in fig. 10, including: the system comprises a processor 301, a communication interface 302, a memory 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 complete mutual communication through the communication bus 304;
the memory 303 has stored therein a computer program which, when executed by the processor 301, causes the processor 301 to perform the steps of:
acquiring original data to be processed, performing classification marking on the original data, and generating a condition vector according to the classification marking;
generating a random distribution vector, and inputting the random distribution vector and the condition vector into a pre-trained data desensitization adversarial network model, wherein the data desensitization adversarial network model comprises a generation model and a discrimination model trained through adversarial learning;
and outputting desensitized data corresponding to the original data based on the generation model in the data desensitization adversarial network model, wherein the desensitized data comprises numerical information and category information.
Based on the same inventive concept, the embodiment of the present invention further provides an electronic device, and as the principle of the electronic device for solving the problem is similar to the data desensitization method, the implementation of the electronic device may refer to the implementation of the method, and repeated details are not repeated.
The electronic device provided by the embodiment of the invention can be a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a network side device and the like.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 302 is used for communication between the above-described electronic apparatus and other apparatuses.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the aforementioned processor.
The Processor may be a general-purpose Processor, including a central processing unit, a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc.
When the processor executes the program stored in the memory in the embodiment of the invention, original data to be processed are obtained, the original data are classified and marked, and a condition vector is generated according to the classification marking; a random distribution vector is generated, and the random distribution vector and the condition vector are input into a pre-trained data desensitization adversarial network model, wherein the data desensitization adversarial network model comprises a generative model and a discriminative model trained through adversarial learning; and desensitized data corresponding to the original data are output based on the generative model in the data desensitization adversarial network model, wherein the desensitized data comprises numerical information and category information.
In the embodiment of the invention, a data desensitization adversarial network model is trained in advance; the model comprises a generative model and a discriminative model obtained by training through adversarial learning. After the model is trained, the discriminative model can hardly distinguish the desensitized data produced by the generative model from the original data, so the desensitized data obtained from the model has good security. When desensitized data are generated, original data to be processed are obtained, the original data are classified and marked, and a condition vector is generated according to the classification marking. After a random distribution vector is generated, the random distribution vector and the condition vector are input into the data desensitization adversarial network model, and desensitized data corresponding to the original data are output by the generative model in the model. Adding the condition vector optimizes the desensitized data, which comprises numerical information and category information, so that the data remain highly usable after desensitization.
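The adversarial training described above — the generator and discriminator trained until the discriminator cannot tell desensitized data from original data — can be illustrated with a toy calculation. The linear critic score and the WGAN-style objectives below are assumptions for illustration, not the patent's exact networks:

```python
def critic_score(x, weight=1.0):
    # Stand-in for the discriminative model: a linear score per sample.
    return weight * x

def discriminator_losses(d_real, d_fake):
    # WGAN-style objectives built from the two score batches:
    # the critic minimizes -(E[D(real)] - E[D(fake)]),
    # the generator minimizes -E[D(fake)].
    mean_real = sum(d_real) / len(d_real)
    mean_fake = sum(d_fake) / len(d_fake)
    return -(mean_real - mean_fake), -mean_fake

real_scores = [critic_score(x) for x in (2.0, 2.1)]  # scores on real samples
fake_scores = [critic_score(x) for x in (0.0, 0.1)]  # scores on generated samples
critic_loss, gen_loss = discriminator_losses(real_scores, fake_scores)
```

In an actual training loop the two losses would be minimized alternately; when the score distributions of real and generated data coincide, the critic loss goes to zero and the discriminator can no longer separate the two.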
An embodiment of the present invention further provides a computer-readable storage medium, in which a computer program executable by an electronic device is stored; when the program runs on the electronic device, the electronic device is caused to execute the following steps:
acquiring original data to be processed, performing classification marking on the original data, and generating a condition vector according to the classification marking;
generating a random distribution vector, and inputting the random distribution vector and the condition vector into a pre-trained data desensitization adversarial network model, wherein the data desensitization adversarial network model comprises a generative model and a discriminative model, and the generative model and the discriminative model are trained through adversarial learning;
and outputting desensitized data corresponding to the original data based on the generative model in the data desensitization adversarial network model, wherein the desensitized data comprises numerical information and category information.
Based on the same inventive concept, embodiments of the present invention further provide a computer-readable storage medium. Since the principle by which the processor solves the problem when executing the computer program stored on the computer-readable storage medium is similar to that of the data desensitization method, the implementation may refer to the implementation of the method, and repeated details are omitted.
The computer-readable storage medium may be any available medium or data storage device that can be accessed by a processor in an electronic device, including but not limited to: magnetic memory such as floppy disks, hard disks, magnetic tape and magneto-optical (MO) disks; optical memory such as CDs, DVDs, BDs and HVDs; and semiconductor memory such as ROM, EPROM, EEPROM, non-volatile memory (NAND flash) and Solid State Disks (SSDs).
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method of data desensitization, the method comprising:
acquiring original data to be processed, carrying out classification marking on the original data, and generating a condition vector according to the classification marking;
generating a random distribution vector, and inputting the random distribution vector and the condition vector into a pre-trained data desensitization adversarial network model, wherein the data desensitization adversarial network model comprises a generative model and a discriminative model, and the generative model and the discriminative model are trained through adversarial learning;
and outputting desensitized data corresponding to the original data based on the generative model in the data desensitization adversarial network model, wherein the desensitized data comprises numerical information and category information.
2. The method of claim 1, wherein the process of training the generative model and the discriminative model through adversarial learning comprises:
performing classification marking on sample data in a training set, and generating a sample condition vector according to the classification marking;
generating a sample random distribution vector, inputting the sample random distribution vector and the sample condition vector into the generative model in the data desensitization adversarial network model, and outputting sample desensitization data by the generative model;
inputting the sample desensitization data and the sample condition vector into the discriminative model in the data desensitization adversarial network model, wherein the discriminative model outputs a first probability that the sample desensitization data is a real sample;
inputting the sample data and the sample condition vector into the discriminative model, wherein the discriminative model outputs a second probability that the sample data is a real sample;
and training the generative model and the discriminative model according to the first probability and the second probability.
3. The method of claim 2, wherein training the generative model and the discriminative model according to the first probability and the second probability comprises:
calculating the Wasserstein distance between the first probability and the second probability by adopting a WGAN method, and training the generative model and the discriminative model according to the Wasserstein distance.
4. The method of claim 2, wherein the training process of the discriminative model comprises:
clipping the determined gradient by adopting a differential privacy algorithm before updating parameters, and adding random noise; and updating the parameters according to the gradient after clipping and noise addition.
5. The method of claim 4, wherein determining the gradient comprises:
respectively determining a joint loss function of the generative adversarial network and the conditional generative adversarial network, a statistical-information loss function including expectation and variance, and a hinge loss function; and determining a target loss function and a corresponding gradient according to the joint loss function, the statistical-information loss function and the hinge loss function.
6. The method of claim 1, wherein the structure of the generative model comprises, in order from the input side to the output side: a fully connected layer, a ReLU layer, a residual network layer, a fully connected layer, and a Softmax layer.
7. The method of claim 1, wherein the structure of the discriminative model comprises, in order from the input side to the output side: a fully connected layer, a LeakyReLU layer, and a fully connected layer.
8. A data desensitization apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring original data to be processed, carrying out classification marking on the original data and generating a condition vector according to the classification marking;
the input module is used for generating a random distribution vector and inputting the random distribution vector and the condition vector into a pre-trained data desensitization adversarial network model, wherein the data desensitization adversarial network model comprises a generative model and a discriminative model, and the generative model and the discriminative model are trained through adversarial learning;
and the output module is used for outputting desensitized data corresponding to the original data based on the generative model in the data desensitization adversarial network model, wherein the desensitized data comprises numerical information and category information.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
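Claims 4 and 5 describe clipping the gradient under a differential privacy algorithm, adding random noise before the parameter update, and combining a hinge loss into the target loss. A minimal sketch under stated assumptions (the clip threshold, noise scale, and the exact hinge form are illustrative choices, not taken from the patent):

```python
import math
import random

def clip_gradient(grad, clip_norm):
    # Differential-privacy step 1: rescale the gradient so its
    # L2 norm does not exceed clip_norm.
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    return [g * scale for g in grad]

def add_gaussian_noise(grad, sigma, clip_norm, rng):
    # Differential-privacy step 2: add Gaussian noise calibrated to
    # the clipping bound before the parameter update.
    return [g + rng.gauss(0.0, sigma * clip_norm) for g in grad]

def hinge_loss(d_real, d_fake):
    # Hinge-style critic loss: mean(relu(1 - D(real))) + mean(relu(1 + D(fake))).
    relu = lambda v: max(0.0, v)
    return (sum(relu(1.0 - r) for r in d_real) / len(d_real)
            + sum(relu(1.0 + f) for f in d_fake) / len(d_fake))

rng = random.Random(0)
sanitized = add_gaussian_noise(clip_gradient([3.0, 4.0], 1.0), 0.5, 1.0, rng)
```

A parameter update would then use `sanitized` in place of the raw gradient; the target loss of claim 5 would add this hinge term to the joint adversarial loss and the statistical (expectation and variance) loss.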
CN202210429481.2A 2022-04-22 2022-04-22 Data desensitization method and device, electronic equipment and storage medium Pending CN114912142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210429481.2A CN114912142A (en) 2022-04-22 2022-04-22 Data desensitization method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114912142A true CN114912142A (en) 2022-08-16

Family

ID=82765048


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115659408A (en) * 2022-12-05 2023-01-31 国网湖北省电力有限公司信息通信公司 Method, system and storage medium for sharing sensitive data of power system
CN117592114A (en) * 2024-01-19 2024-02-23 中国电子科技集团公司第三十研究所 Network parallel simulation oriented data desensitization method, system and readable storage medium
CN117592114B (en) * 2024-01-19 2024-04-19 中国电子科技集团公司第三十研究所 Network parallel simulation oriented data desensitization method, system and readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination