CN112861179B

CN112861179B - Method for desensitizing personal digital space data based on a text generative adversarial network

Info

Publication number: CN112861179B
Application number: CN202110199023.XA
Authority: CN
Inventors: 孙伟; 官明哲; 张武军
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2021-02-22
Filing date: 2021-02-22
Publication date: 2023-04-07
Anticipated expiration: 2041-02-22
Also published as: CN112861179A

Abstract

The invention provides a method for desensitizing personal digital space data based on a text generative adversarial network, comprising the following steps: S1: acquiring a data file to be desensitized in a personal digital space, and constructing a text generative adversarial network model; S2: parsing the data file to be desensitized to obtain a parsed file containing sensitive information; S3: inputting the parsed file as source data into the text generative adversarial network model for training; S4: judging whether the trained text generative adversarial network model has converged; if so, obtaining desensitized text data with the same statistical characteristics as the source data; if not, returning to step S3. The method solves the problem that existing data desensitization techniques can change the structured format of medical source data when applied in a medical scenario.

Description

Method for desensitizing personal digital space data based on a text generative adversarial network

Technical Field

The invention relates to the technical field of data desensitization, and in particular to a method for desensitizing personal digital space data based on a text generative adversarial network.

Background

Data desensitization is a data processing technique that reduces or removes the sensitivity of data. It lowers the risk and harm of data leakage and effectively protects the privacy of user data. In Internet-based healthcare, users can store, view, and share personal health and medical data through a personal digital space. However, this data is exposed to the risk of leaking sensitive medical information during online consultation, online medicine purchase, appointment booking, and similar processes. User data in the medical industry is highly authentic and highly sensitive, and once a user's sensitive personal information is leaked, it may pose a potential threat to the user's life. With data desensitization, information in the personal digital space can be used for business analysis and processing while leakage of user data is avoided.

Existing data desensitization methods usually rely on masking, generalization, and similar techniques to protect private data while preserving its usability, so that the desensitized data can still be used in application scenarios such as development and testing, data mining, and data distribution. Common techniques include: data replacement, which replaces the data in sensitive information with random data; data shuffling, which exchanges values between rows of the source data; numerical transformation, which transforms numerical data such as age and time; data occlusion, which replaces or alters sensitive data with special symbols such as "+" or "NULL"; data deletion, which deletes and clears sensitive data; and data generalization, which represents data at a coarser granularity, widening the range the data represents and eliminating sensitive information. However, when these existing techniques are applied in a medical scenario, they can change the structured format of the medical source data and therefore cannot meet the requirements for desensitizing and protecting users' sensitive medical information.
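As an illustration of the conventional techniques listed above, the following minimal Python sketch implements four of them on toy values. The function names, the `*` mask symbol, the ten-year age band, and all sample values are assumptions made for this example only; they do not come from the patent.

```python
import random

def replace_value(value, rng):
    """Data replacement: swap each digit for a random digit, keep other characters."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in value)

def shuffle_column(rows, key, rng):
    """Data shuffling: permute one column's values across rows."""
    values = [row[key] for row in rows]
    rng.shuffle(values)
    return [{**row, key: v} for row, v in zip(rows, values)]

def mask(value, keep=3, symbol="*"):
    """Data occlusion: keep the first `keep` characters and mask the rest."""
    return value[:keep] + symbol * max(len(value) - keep, 0)

def generalize_age(age, width=10):
    """Data generalization: report an age band instead of the exact age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Made-up sample values, for illustration only.
rng = random.Random(0)
rows = [{"name": "A", "age": 34}, {"name": "B", "age": 61}, {"name": "C", "age": 47}]
masked_phone = mask("13800001234")
age_band = generalize_age(34)
shuffled = shuffle_column(rows, "age", rng)
replaced = replace_value("ID-2021-0042", rng)
```

Masking and generalization change the shape of a field (a phone number becomes partly symbols, an age becomes a range), which is the kind of format change the background section objects to.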

In the prior art, for example, the Chinese patent published on August 16, 2019, "A data desensitization method, apparatus, device, and computer-readable storage medium" (publication No. CN110135193A), maximizes the degree of data desensitization, ensures that private information is not disclosed, and effectively improves the usability of the desensitized data. However, it does not evaluate a complete sequence and may change the structured format of the source data.

Disclosure of Invention

The invention provides a method for desensitizing personal digital space data based on a text generative adversarial network, in order to overcome the technical defect that existing data desensitization techniques can change the structured format of medical source data when applied in a medical scenario.

In order to solve the technical problems, the technical scheme of the invention is as follows:

A method for desensitizing personal digital space data based on a text generative adversarial network, comprising the following steps:

S1: acquiring a data file to be desensitized in a personal digital space, and constructing a text generative adversarial network model;

S2: parsing the data file to be desensitized to obtain a parsed file containing sensitive information;

S3: inputting the parsed file as source data into the text generative adversarial network model for training;

S4: judging whether the trained text generative adversarial network model has converged;

if so, obtaining desensitized text data with the same statistical characteristics as the source data;

if not, returning to step S3.
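The control flow of steps S2 to S4 can be sketched as a loop that repeats training until a convergence test passes. This is only a skeleton under stated assumptions: the four callables (`parse`, `train_round`, `has_converged`, `generate`) and the `max_rounds` safety limit are hypothetical placeholders, and the toy stand-ins below exist only to make the sketch runnable.

```python
def desensitize(data_file, parse, train_round, has_converged, generate, max_rounds=1000):
    """Run steps S2-S4: parse, train, test convergence, and repeat S3 if needed."""
    source = parse(data_file)                  # S2: parsed file with sensitive info
    for round_no in range(1, max_rounds + 1):
        d_error = train_round(source)          # S3: one adversarial training round
        if has_converged(d_error):             # S4: convergence test
            return generate(), round_no        # converged: emit desensitized text
    raise RuntimeError("model did not converge within max_rounds")

# Toy stand-ins so the sketch runs; a real model would replace all four callables.
errors = iter([0.9, 0.5, 0.2, 0.05])
text, rounds = desensitize(
    data_file='{"name": "Zhang San"}',
    parse=lambda f: f,
    train_round=lambda src: next(errors),
    has_converged=lambda err: err < 0.1,
    generate=lambda: '{"name": "Li Si"}',
)
```

The `max_rounds` limit is an addition for safety in this sketch; the method as described simply returns to step S3 until the model converges.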

Preferably, the data file to be desensitized is based on semi-structured medical information data in a distributed database.

Preferably, the text generative adversarial network model comprises a generator and a discriminator.

Preferably, the generator generates the sequence using a recurrent neural network.

Preferably, the discriminator discriminates the sequence generated by the generator using a convolutional neural network.

Preferably, in step S3, the text generative adversarial network model is trained in combination with a Monte Carlo search strategy.

Preferably, the specific steps of training the text generative adversarial network model are as follows:

inputting the vectors obtained by encoding the words of the source data into the embedding layer of the recurrent neural network to obtain the embedding-layer vectors $x_1, \ldots, x_T$, and outputting the hidden-layer vectors $h_1, \ldots, h_T$, where

$$h_t = R(h_{t-1}, x_t)$$

where $h_{t-1}$ is the hidden-layer vector of the previous state, and $h_t$ and $x_t$ are the hidden-layer vector and the embedding-layer vector of the current state, respectively; $t \in T$, $T$ is the word-vector sequence number, and $R$ is the RNN;

passing the hidden-layer vector through the softmax layer of the recurrent neural network to obtain the probability distribution of $y_t$ in the sequence $Y_{1:t}$ generated in the current state:

$$p(y_t \mid x_1, \ldots, x_t) = \mathrm{softmax}(b + W h_t)$$

where $b$ is the bias vector, $W$ is the weight matrix, and $Y_{1:t}$ is a sequence of length $t$;
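The recurrence and the softmax output above can be traced in a few lines of plain Python. This is a sketch under assumptions: the text does not specify the recurrent cell R, so a simple tanh cell with hypothetical weight matrices `U` and `V` is used, and all sizes and weight values are toy choices.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rnn_step(h_prev, x_t, U, V):
    """h_t = R(h_{t-1}, x_t), with R assumed to be tanh(U h_{t-1} + V x_t)."""
    return [math.tanh(a + b) for a, b in zip(matvec(U, h_prev), matvec(V, x_t))]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(h_t, W, b):
    """p(y_t | x_1..x_t) = softmax(b + W h_t)."""
    return softmax([bi + wi for bi, wi in zip(b, matvec(W, h_t))])

# Toy sizes: hidden size 2, embedding size 2, vocabulary size 3.
U = [[0.5, 0.0], [0.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]

h = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0]):      # two embedded tokens x_1, x_2
    h = rnn_step(h, x, U, V)
probs = next_token_distribution(h, W, b)
```

Sampling the next token from `probs` and feeding its embedding back in as the next input is how such a generator would extend the sequence one token at a time.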

the reward $Q$ for the current sentence is denoted as

$$Q = D(Y_{1:t})$$

an N-time Monte Carlo search is then performed: running the Monte Carlo search strategy yields N output sequences from the current state to the end of the sequence, from which a more accurate reward $Q$ is obtained;
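A minimal sketch of the Monte Carlo reward estimate follows, assuming that the more accurate reward is the mean discriminator score over the N completed sequences. The averaging rule, the uniform toy policy, and the toy discriminator are illustrative assumptions, not the patented model.

```python
import random

def rollout(prefix, length, sample_next, rng):
    """Complete `prefix` to `length` tokens by sampling from the generator policy."""
    seq = list(prefix)
    while len(seq) < length:
        seq.append(sample_next(seq, rng))
    return seq

def mc_reward(prefix, length, n_rollouts, sample_next, discriminator, rng):
    """Average the discriminator score over N completed rollouts of `prefix`."""
    scores = [
        discriminator(rollout(prefix, length, sample_next, rng))
        for _ in range(n_rollouts)
    ]
    return sum(scores) / n_rollouts

# Toy stand-ins: a uniform policy over digits and a discriminator that
# scores a sequence by its fraction of even digits.
def sample_next(seq, rng):
    return rng.choice("0123456789")

def discriminator(seq):
    return sum(ch in "02468" for ch in seq) / len(seq)

rng = random.Random(0)
q = mc_reward(prefix="24", length=6, n_rollouts=16,
              sample_next=sample_next, discriminator=discriminator, rng=rng)
q_complete = mc_reward(prefix="246802", length=6, n_rollouts=16,
                       sample_next=sample_next, discriminator=discriminator, rng=rng)
```

When the prefix is already a complete sequence, no tokens are sampled and the estimate reduces to the discriminator score of that sequence.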

for each sequence, the embedding-layer vectors $x_1, \ldots, x_T$ are concatenated row by row to represent the current sequence as the sequence vector $d_{1:T}$;

a convolution operation is performed on the sequence vector $d_{1:T}$ with the convolution kernel $\omega$, in which values at corresponding positions are multiplied, $\rho$ is a nonlinear function, and $c_i$ is the output value of the convolutional layer;

after the pooling layer, the vector $c = \max(c_1, \ldots, c_{T-l+1})$ is passed through the sigmoid function of the fully connected layer, which outputs the probability that the sequence is judged to be real, namely the reward $Q$;
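The discriminator path described above (row-wise concatenation, convolution over windows, max pooling, sigmoid output) can be sketched as follows. Several details are assumptions for illustration: a single kernel spanning l rows, ReLU as the nonlinear function, a scalar bias term, and a fully connected layer reduced to one weight and one bias.

```python
import math

def conv_features(d, kernel, bias=0.0):
    """One feature c_i per window of l consecutive rows of d."""
    l = len(kernel)
    feats = []
    for i in range(len(d) - l + 1):                 # T - l + 1 windows in total
        window = d[i:i + l]
        s = sum(w * x for w_row, x_row in zip(kernel, window)
                for w, x in zip(w_row, x_row))      # element-wise product, summed
        feats.append(max(0.0, s + bias))            # nonlinearity assumed to be ReLU
    return feats

def discriminate(embeddings, kernel, fc_weight, fc_bias):
    """Probability that the sequence is real, used as the reward Q."""
    d = [list(x) for x in embeddings]               # row-wise stack of x_1..x_T
    c = max(conv_features(d, kernel))               # max-over-time pooling
    return 1.0 / (1.0 + math.exp(-(fc_weight * c + fc_bias)))   # sigmoid

# Toy input: T = 4 tokens with 2-dimensional embeddings, kernel window l = 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, -1.0], [0.5, 0.5]]
features = conv_features(x, kernel)
q = discriminate(x, kernel, fc_weight=1.0, fc_bias=0.0)
```

With T = 4 and l = 2 the convolution yields T - l + 1 = 3 features, and pooling keeps only the largest of them before the sigmoid.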

updating the parameters of the generator according to the magnitude of the reward $Q$, thereby reducing the loss of the generated sentence; and training cyclically, so that the model converges when the error of the discriminator is minimal.

Preferably, the loss of the current sentence is obtained by solving the binary cross entropy based on the output distribution of the discriminator, specifically: let $p$ be the probability that the output $P$ is in state 1 and $1-p$ the probability that it is in state 0, and let $q$ be the probability that the input $Q$ is in state 1 and $1-q$ the probability that it is in state 0; then the cross entropy of $P$ and $Q$ is

$$H(P \mid Q) = -\bigl(p \log q + (1-p)\log(1-q)\bigr).$$
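The binary cross entropy above is straightforward to evaluate numerically. In the sketch below, the clipping constant `eps` is an added safeguard against log(0), and the discriminator score of 0.8 is a made-up value.

```python
import math

def binary_cross_entropy(p, q, eps=1e-12):
    """H(P|Q) = -(p*log(q) + (1-p)*log(1-q)), with q clipped away from 0 and 1."""
    q = min(max(q, eps), 1.0 - eps)
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))

d_score = 0.8                                              # hypothetical discriminator output D(Y)
loss_labelled_real = binary_cross_entropy(1.0, d_score)    # reduces to -log D(Y)
loss_labelled_fake = binary_cross_entropy(0.0, d_score)    # reduces to -log(1 - D(Y))
```

Setting the target p to 1 or 0 removes one of the two terms, which is how the general formula collapses to a single logarithm for a fixed label.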

Preferably, for a generated sequence, when the generator produces a fake sequence, the cross entropy when the discriminator judges it to be real is

$$\mathrm{loss} = -\bigl(1 \cdot \log D(Y_{1:T}) + 0 \cdot \log(1 - D(Y_{1:T}))\bigr) = -\log D(Y_{1:T})$$

For a discriminated sequence, the discriminator identifies the true source of the sequence. When a sequence is real, the cross entropy when the discriminator judges it to be real is

$$\mathrm{loss} = -\log D(Y_{1:T})$$

When a sequence is fake, the cross entropy when the discriminator judges it to be fake is

$$\mathrm{loss} = -\log\bigl(1 - D(Y_{1:T})\bigr)$$

The minimum cross entropy is calculated by the following formula:

Preferably, having the same statistical characteristics means that the proportion of digits or characters in the text is the same.
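One way to check this statistical criterion is to compare character-class proportions before and after desensitization. The sketch below assumes three classes (digits, letters, everything else) and an optional tolerance `tol`; both choices, and the sample strings, are illustrative assumptions.

```python
def class_proportions(text):
    """Share of digits, letters, and other characters in `text`."""
    n = len(text)
    if n == 0:
        return {"digit": 0.0, "letter": 0.0, "other": 0.0}
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    return {
        "digit": digits / n,
        "letter": letters / n,
        "other": (n - digits - letters) / n,
    }

def same_statistics(source, desensitized, tol=0.0):
    """True if every character-class proportion matches within `tol`."""
    a, b = class_proportions(source), class_proportions(desensitized)
    return all(abs(a[k] - b[k]) <= tol for k in a)

# Made-up strings, for illustration only.
source = "ID:440106-1985"
desensitized = "ID:731520-2049"
other = "ID:ABCDEF-2049"
```

Here `desensitized` keeps the digit and letter proportions of `source` while changing every digit, whereas `other` replaces digits with letters and fails the check.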

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

The invention provides a method for desensitizing personal digital space data based on a text generative adversarial network. By training the text generative adversarial network model, desensitized data is generated with the same statistical characteristics and structure as the parsed file containing sensitive information. This achieves desensitization of structured text information and gives a good text desensitization effect without affecting the structure of the data in the personal digital space.

Drawings

FIG. 1 is a flow chart of the steps for implementing the technical solution of the present invention;

FIG. 2 is a flow chart of the desensitization operation of the text generative adversarial network model in the present invention;

FIG. 3 is a network diagram of the generator in the present invention;

FIG. 4 is a network diagram of the discriminator in the present invention;

FIG. 5 is a schematic structural diagram of the text generative adversarial network model in the present invention;

FIG. 6 is a graph showing a comparison between before and after desensitization with Gaussian-distributed random numbers according to the present invention.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent;

for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;

it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The technical solution of the present invention is further described with reference to the drawings and the embodiments.

Example 1

As shown in FIGS. 1-2, a method for desensitizing personal digital space data based on a text generative adversarial network comprises the following steps:

S1: acquiring a data file to be desensitized in a personal digital space;

more specifically, the data file to be desensitized is based on semi-structured medical information data in a distributed database;

constructing a text generative adversarial network model;

more specifically, the text generative adversarial network model comprises a generator and a discriminator;

more specifically, as shown in FIG. 3, the generator generates sequences using a recurrent neural network;

more specifically, as shown in FIG. 4, the discriminator uses a convolutional neural network to discriminate the sequences generated by the generator;

S2: parsing the data file to be desensitized to obtain a parsed file containing sensitive information; the parsed file is a file in JSON format;
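As a rough illustration of step S2, the sketch below parses a JSON record and collects the leaf fields treated as sensitive. How sensitive fields are actually identified is not described in the text; the key names, the path notation, and the sample record are all hypothetical.

```python
import json

def flatten(node, path=""):
    """Yield (path, value) pairs for every leaf of a parsed JSON document."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield path, node

def sensitive_fields(document, sensitive_keys):
    """Keep only leaves whose final key is in `sensitive_keys`."""
    return {
        path: value
        for path, value in flatten(json.loads(document))
        if path.split(".")[-1].split("[")[0] in sensitive_keys
    }

# Made-up record and key names, for illustration only.
record = '{"patient": {"name": "X", "id_no": "0001", "visits": [{"dept": "ENT"}]}}'
found = sensitive_fields(record, {"name", "id_no"})
```

Keeping the path of each value makes it possible to write generated replacements back into the same positions, so the JSON structure itself stays unchanged.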

S3: inputting the parsed file as source data into the text generative adversarial network model for training;

more specifically, in step S3, the text generative adversarial network model is trained in combination with a Monte Carlo search strategy;

more specifically, as shown in FIG. 5, the specific steps of training the text generative adversarial network model are as follows:

inputting the vectors obtained by encoding the words of the source data into the embedding layer of the recurrent neural network to obtain the embedding-layer vectors $x_1, \ldots, x_T$, and outputting the hidden-layer vectors $h_1, \ldots, h_T$, where

$$h_t = R(h_{t-1}, x_t)$$

passing the hidden-layer vector through the softmax layer of the recurrent neural network to obtain the probability distribution of $y_t$ in the sequence $Y_{1:t}$ generated in the current state:

$$p(y_t \mid x_1, \ldots, x_t) = \mathrm{softmax}(b + W h_t)$$

the reward $Q$ for the current sentence is denoted as

$$Q = D(Y_{1:t})$$

In order to obtain the discriminator's evaluation of a complete sequence, a Monte Carlo search strategy is used to generate the $T - t$ words that are still unknown, so that a complete sequence is available for evaluation; an N-time Monte Carlo search is performed for this purpose;

the embedding-layer vectors are concatenated row by row;

in the convolution, values at corresponding positions are multiplied, $\rho$ is a nonlinear function, and $c_i$ is the output value of the convolutional layer;

after the pooling layer, the vector $c = \max(c_1, \ldots, c_{T-l+1})$ is passed through the sigmoid function of the fully connected layer, which outputs the probability that the sequence is judged to be "real", namely the reward $Q$;

cyclic training is performed using the policy gradient method, and the parameters of the generator are updated according to the magnitude of the reward $Q$, thereby reducing the loss of the generated sentence; through cyclic training, the model converges when the error of the discriminator is minimal;
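The policy-gradient update can be illustrated with a one-step softmax policy trained by the REINFORCE rule. This is a toy sketch, not the patented training procedure: the three-token vocabulary, the learning rate, and the reward function standing in for the discriminator are all assumptions.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, rng, lr=0.5):
    """One REINFORCE update: theta += lr * Q * grad log pi(action)."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    q = reward_fn(action)                       # reward Q from the discriminator
    # Gradient of log-softmax w.r.t. the logits is one_hot(action) - probs.
    return [
        theta + lr * q * ((1.0 if k == action else 0.0) - p)
        for k, (theta, p) in enumerate(zip(logits, probs))
    ]

# Toy reward: pretend the discriminator finds token 2 most realistic.
def reward(token):
    return 1.0 if token == 2 else 0.1

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
before = softmax(logits)[2]
for _ in range(200):
    logits = reinforce_step(logits, reward, rng)
after = softmax(logits)[2]
```

Tokens that earn a higher reward receive larger updates, so the policy shifts probability toward outputs the discriminator scores as more realistic.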

more specifically, the loss of the current sentence is obtained by solving the binary cross entropy based on the output distribution of the discriminator, specifically: let $p$ be the probability that the output $P$ is in state 1 and $1-p$ the probability that it is in state 0, and let $q$ be the probability that the input $Q$ is in state 1 and $1-q$ the probability that it is in state 0; then the cross entropy of $P$ and $Q$ is

$$H(P \mid Q) = -\bigl(p \log q + (1-p)\log(1-q)\bigr);$$

more specifically, for a generated sequence, when the generator produces a fake sequence, the cross entropy when the discriminator judges it to be real is

$$\mathrm{loss} = -\bigl(1 \cdot \log D(Y_{1:T}) + 0 \cdot \log(1 - D(Y_{1:T}))\bigr) = -\log D(Y_{1:T})$$

For a discriminated sequence, the discriminator identifies the true source of the sequence. When a sequence is real, the cross entropy when the discriminator judges it to be real is

$$\mathrm{loss} = -\log D(Y_{1:T})$$

The minimum cross entropy is calculated by the following formula:

in practical implementation, the smaller the cross entropy, the more accurately the discriminator identifies sequences;

S4: judging whether the trained text generative adversarial network model has converged;

if so, desensitized text data with the same statistical characteristics as the source data is obtained; a comparison before and after desensitization with Gaussian-distributed random numbers is shown in FIG. 6;

more specifically, having the same statistical characteristics means that the proportion of digits or characters in the text is the same;

if not, the procedure returns to step S3.

Table 1 is a comparison of textual data before and after desensitization by the described method.

TABLE 1

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the present invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A method for desensitizing personal digital space data based on a text generative adversarial network, comprising the following steps:

S1: acquiring a data file to be desensitized in a personal digital space, and constructing a text generative adversarial network model; the text generative adversarial network model comprises a generator and a discriminator;

S2: parsing the data file to be desensitized to obtain a parsed file containing sensitive information;

in step S3, training the text generative adversarial network model in combination with a Monte Carlo search strategy;

the specific steps of training the text generative adversarial network model are as follows:

inputting the vectors obtained by encoding the words of the source data into the embedding layer of the recurrent neural network to obtain the embedding-layer vectors $x_1, \ldots, x_T$, and outputting the hidden-layer vectors $h_1, \ldots, h_T$, where

$$h_t = R(h_{t-1}, x_t)$$

where $h_{t-1}$ is the hidden-layer vector of the previous state, and $h_t$ and $x_t$ are the hidden-layer vector and the embedding-layer vector of the current state, respectively; $t \le T$, $T$ is the word-vector sequence number, and $R$ is the RNN;

passing the hidden-layer vector through the softmax layer of the recurrent neural network to obtain the probability distribution of $y_t$ in the sequence $Y_{1:t}$ generated in the current state:

$$p(y_t \mid x_1, \ldots, x_t) = \mathrm{softmax}(b + W h_t)$$

the reward $Q$ for the current sentence is denoted as $Q = D(Y_{1:t})$;

an N-time Monte Carlo search is performed;

for each sequence, the embedding-layer vectors $x_1, \ldots, x_T$ are concatenated row by row to represent the current sequence as the sequence vector $d_{1:T}$;

a convolution operation is performed on the sequence vector $d_{1:T}$ with the convolution kernel $\omega$, in which values at corresponding positions are multiplied, $\rho$ is a nonlinear function, and $c_i$ is the output value of the convolutional layer;

after the pooling layer, the vector $c = \max(c_1, \ldots, c_{T-l+1})$ is passed through the sigmoid function of the fully connected layer, which outputs the probability that the sequence is judged to be real, namely the reward $Q$;

updating the parameters of the generator according to the magnitude of the reward $Q$, thereby reducing the loss of the generated sentence; through cyclic training, the model converges when the error of the discriminator is minimal;

if not, the procedure returns to step S3.

2. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 1, wherein the data file to be desensitized is based on semi-structured medical information data in a distributed database.

3. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 1, wherein the generator generates sequences using a recurrent neural network.

4. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 1, wherein the discriminator uses a convolutional neural network to discriminate the sequences generated by the generator.

5. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 1, wherein the loss of the current sentence is obtained by solving the binary cross entropy based on the output distribution of the discriminator, specifically: let $p$ be the probability that the output $P$ is in state 1 and $1-p$ the probability that it is in state 0, and let $q$ be the probability that the input $Q$ is in state 1 and $1-q$ the probability that it is in state 0; then the cross entropy of $P$ and $Q$ is

$$H(P \mid Q) = -\bigl(p \log q + (1-p)\log(1-q)\bigr).$$

6. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 5, wherein, for a generated sequence, when the generator produces a fake sequence, the cross entropy when the discriminator judges it to be real is

$$\mathrm{loss} = -\bigl(1 \cdot \log D(Y_{1:T}) + 0 \cdot \log(1 - D(Y_{1:T}))\bigr) = -\log D(Y_{1:T})$$

The minimum cross entropy is calculated by the following formula:

7. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 1, wherein having the same statistical characteristics means that the proportion of digits or characters in the text is the same.