CN112861179A

CN112861179A - Method for desensitizing personal digital space data based on a text generative adversarial network

Info

Publication number: CN112861179A
Application number: CN202110199023.XA
Authority: CN
Inventors: 孙伟; 官明哲; 张武军
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2021-02-22
Filing date: 2021-02-22
Publication date: 2021-05-28
Anticipated expiration: 2041-02-22
Also published as: CN112861179B

Abstract

The invention provides a method for desensitizing personal digital space data based on a text generative adversarial network, comprising the following steps. S1: acquire a data file to be desensitized from a personal digital space, and construct a text generative adversarial network model. S2: parse the data file to be desensitized to obtain a parsed file containing sensitive information. S3: input the parsed file as source data into the text generative adversarial network model for training. S4: judge whether the trained text generative adversarial network model has converged; if so, obtain desensitized text data with the same statistical characteristics as the source data; if not, return to step S3. The method solves the problem that existing data desensitization techniques can change the structured format of medical source data when applied in a medical scenario.

Description

Method for desensitizing personal digital space data based on a text generative adversarial network

Technical Field

The invention relates to the technical field of data desensitization, and in particular to a method for desensitizing personal digital space data based on a text generative adversarial network.

Background

Data desensitization is a data processing technique that reduces or removes the sensitivity of data. Adopting it lowers the risk and harm of data leakage and effectively protects the privacy of user data. In Internet-based healthcare, users can store, view and share personal medical and health data through a personal digital space. However, this data is exposed to leakage of sensitive medical information during online consultations, online medicine purchases, outpatient appointment booking and similar processes. User data in the medical industry is highly authentic and highly sensitive, and once a user's sensitive personal information is leaked, it can pose a potential threat to the user's life. With data desensitization, information in the personal digital space can be used for business-related analysis and processing while leakage of user data is avoided.

Existing data desensitization usually relies on masking, generalization and similar techniques, which protect private data while preserving its usability, so that the desensitized data can still be used in application scenarios such as development and testing, data mining and data distribution. Common techniques include: data replacement, which replaces the data in sensitive information with random data; data shuffling, which exchanges values between rows of the source data; numerical transformation, which transforms numerical data such as age and time; data occlusion, which replaces or alters sensitive data with special symbols such as "+, NULL"; data deletion, which deletes and clears sensitive data; and data generalization, which represents data at a coarser granularity than the original, widening the range the data represents and eliminating sensitive information. However, when these existing techniques are applied in a medical scenario, they can change the structured format of the medical source data, and so cannot meet the requirements for desensitizing and protecting users' sensitive medical information.
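As a rough illustration of four of the conventional techniques listed above (replacement, shuffling, occlusion and generalization), the following Python sketch may help. The function names, the `*` masking symbol, the number of characters kept and the ten-year age bracket are all assumptions made for this example and do not come from the patent.

```python
import random

def replace_value(value: str, rng: random.Random) -> str:
    """Data replacement: swap every digit for a random digit, keep other characters."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in value)

def shuffle_column(rows: list[dict], column: str, rng: random.Random) -> list[dict]:
    """Data shuffling: permute one column's values across rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: v} for row, v in zip(rows, values)]

def mask_value(value: str, keep: int = 3, symbol: str = "*") -> str:
    """Data occlusion: keep the first `keep` characters and mask the rest."""
    return value[:keep] + symbol * max(len(value) - keep, 0)

def generalize_age(age: int, width: int = 10) -> str:
    """Data generalization: report an age bracket instead of the exact age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"
```

Each of these rewrites individual fields by rule, which is the kind of processing the patent contrasts with generating a whole replacement text.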

In the prior art, for example, the Chinese patent "A data desensitization method, apparatus, device, and computer readable storage medium" (publication number CN110135193A, published on August 16, 2019) maximizes the degree of data desensitization, ensures that private information is not revealed, and effectively improves the practicability of the desensitized data. However, it does not evaluate complete sequences, and it changes the structured format of the source data.

Disclosure of Invention

To overcome the technical defect that existing data desensitization techniques can change the structured format of medical source data when applied in a medical scenario, the invention provides a method for desensitizing personal digital space data based on a text generative adversarial network.

To solve the above technical problem, the technical solution of the invention is as follows:

A method for desensitizing personal digital space data based on a text generative adversarial network, comprising the following steps:

S1: acquiring a data file to be desensitized from the personal digital space, and constructing a text generative adversarial network model;

S2: parsing the data file to be desensitized to obtain a parsed file containing sensitive information;

S3: inputting the parsed file as source data into the text generative adversarial network model for training;

S4: judging whether the trained text generative adversarial network model has converged; if so, obtaining desensitized text data with the same statistical characteristics as the source data; if not, returning to step S3.
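The control flow of steps S2 to S4 can be sketched as a loop, as below. This is only a skeleton under stated assumptions: `parse`, `train_round` and `generate` are hypothetical callables standing in for the parser and the model, and treating "converged" as "the loss changed by less than `tolerance` between rounds" is a choice made for this example, not a criterion given in the patent.

```python
from typing import Callable

def run_desensitization(
    parse: Callable[[str], list[str]],
    train_round: Callable[[list[str]], float],
    generate: Callable[[], list[str]],
    raw_file: str,
    tolerance: float = 1e-3,
    max_rounds: int = 1000,
) -> list[str]:
    """Parse (S2), train repeatedly (S3), and generate once converged (S4)."""
    source = parse(raw_file)                      # S2: parsed file as source data
    previous = float("inf")
    for _ in range(max_rounds):
        loss = train_round(source)                # S3: one round of adversarial training
        if abs(previous - loss) < tolerance:      # S4: assumed convergence test
            return generate()                     # desensitized text data
        previous = loss                           # not converged: back to S3
    raise RuntimeError("model did not converge within max_rounds")
```

The `max_rounds` guard is also an addition for the sketch; it simply keeps the loop from running forever if the assumed convergence test is never met.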

Preferably, the data file to be desensitized is based on semi-structured medical information data in a distributed database.

Preferably, the text generative adversarial network model comprises a generator and a discriminator.

Preferably, the generator generates the sequence using a recurrent neural network.

Preferably, the discriminator discriminates the sequence generated by the generator using a convolutional neural network.

Preferably, in step S3, the text generative adversarial network model is trained using a Monte Carlo search strategy.

Preferably, the specific steps of training the text generative adversarial network model are as follows:

The vectors obtained by encoding the words of the source data are input to the embedding layer of the recurrent neural network to obtain the embedding-layer vectors $x_1, \ldots, x_T$, and the hidden-layer vectors $h_1, \ldots, h_T$ are output, giving

$$h_t = R(h_{t-1}, x_t)$$

where $h_{t-1}$ is the hidden-layer vector of the previous state; $h_t$ and $x_t$ are the hidden-layer vector and the embedding-layer vector of the current state, respectively; $t \in T$, where $T$ is the word vector sequence number; and $R$ is the RNN network;

the hidden-layer vector is passed through the softmax layer of the recurrent neural network to obtain the distribution probability of $y_t$ in the sequence $Y_{1:t}$ generated up to the current state:

$$p(y_t \mid x_1, \ldots, x_t) = \mathrm{softmax}(b + W h_t)$$

where $b$ is the bias vector, $W$ is the weight matrix, and $Y_{1:t}$ is a sequence of length $t$;
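The recurrence and the softmax output above can be traced with a small pure-Python sketch. The patent only says that R is an RNN, so the tanh (Elman-style) cell used here is an assumption, as are the weight names `w_h` and `w_x`, the layer sizes and the random initial values.

```python
import math
import random

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rnn_step(h_prev, x, w_h, w_x):
    """h_t = R(h_{t-1}, x_t), with R assumed to be a tanh cell."""
    pre = [a + b for a, b in zip(matvec(w_h, h_prev), matvec(w_x, x))]
    return [math.tanh(p) for p in pre]

def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def token_distribution(h, w, b):
    """p(y_t | x_1..x_t) = softmax(b + W h_t)."""
    return softmax([bi + wi for bi, wi in zip(b, matvec(w, h))])

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
hidden, embed, vocab = 4, 3, 5                  # illustrative sizes only
w_h, w_x = random_matrix(hidden, hidden, rng), random_matrix(hidden, embed, rng)
w, b = random_matrix(vocab, hidden, rng), [0.0] * vocab

h = [0.0] * hidden
for x in random_matrix(6, embed, rng):          # six stand-in embedding vectors
    h = rnn_step(h, x, w_h, w_x)
probs = token_distribution(h, w, b)             # distribution over the next token
```

Sampling a token from `probs` and feeding its embedding back in as the next `x` would extend the generated sequence by one position.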

the reward $Q$ for the current sentence is denoted as

$$Q = D(Y_{1:t})$$

For an n-time Monte Carlo search, it is expressed as

Running the Monte Carlo search strategy yields N output sequences from the current state to the end of the sequence, and hence a more accurate reward Q, denoted as

For each sequence, the embedding-layer vectors $x_1, \ldots, x_T$ are concatenated to represent the current sequence

where the concatenation is performed row-wise;

a convolution kernel $\omega$ is applied to the sequence vector $d_{1:T}$ to perform the convolution operation

where the multiplication is taken over corresponding positions, $\rho$ is a nonlinear function, and $c_i$ is the output value of the convolutional layer;

after the pooling layer, the vector $c = \max(c_1, \ldots, c_{T-l+1})$ is obtained, and the sigmoid function of the fully connected layer outputs the probability that the sequence is judged to be real, namely the reward $Q$;
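The discriminator path described above (row-wise concatenation, convolution, max pooling, sigmoid) can be sketched in plain Python as follows. This is a simplified illustration, not the patented network: it uses a single kernel, ReLU as the nonlinear function, and a scalar weight and bias in place of a full fully connected layer, all of which are assumptions for the example.

```python
import math

def conv_features(d, kernel, bias=0.0):
    """Slide a kernel spanning `window` rows over the T-row sequence matrix d.

    Produces T - window + 1 outputs c_i, each the sum of element-wise products
    passed through a nonlinearity (ReLU here, as an assumption).
    """
    window = len(kernel)
    feats = []
    for i in range(len(d) - window + 1):
        rows = d[i:i + window]
        total = sum(w * v for krow, drow in zip(kernel, rows) for w, v in zip(krow, drow))
        feats.append(max(0.0, total + bias))
    return feats

def discriminator_score(d, kernel, w_out, b_out, bias=0.0):
    """Max-over-time pooling followed by a sigmoid: probability the sequence is real."""
    c = max(conv_features(d, kernel, bias))
    return 1.0 / (1.0 + math.exp(-(w_out * c + b_out)))

# Four embedding vectors of width 2, stacked row by row to form the sequence matrix.
d = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernel = [[1, 0], [0, 1]]                 # hypothetical kernel covering 2 rows
features = conv_features(d, kernel)
score = discriminator_score(d, kernel, w_out=1.0, b_out=-2.0)
```

With these toy values the three window outputs are 2, 1 and 1, the pooled value is 2, and the sigmoid input is 0, so the score is 0.5.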

The parameters of the generator are updated according to the magnitude of the reward Q, thereby reducing the loss of the generated sentences; training is repeated in a loop, and the model converges when the error of the discriminator is minimal.
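The Monte Carlo reward estimate used in the training steps above can be sketched as follows. The text states that N sequences are rolled out from the current state to the end of the sequence; averaging the discriminator scores of those N rollouts is an assumption borrowed from the usual sequence-GAN formulation, and `sample_next` and `discriminator` are hypothetical stand-ins for the generator and the discriminator.

```python
import random

def rollout_reward(prefix, sample_next, discriminator, seq_len, n_rollouts, rng):
    """Estimate the reward Q of a partial sequence from N completed rollouts."""
    if len(prefix) >= seq_len:
        return discriminator(prefix)          # complete sequence: score it directly
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < seq_len:             # fill in the still-unknown tokens
            seq.append(sample_next(seq, rng))
        total += discriminator(seq)
    return total / n_rollouts                 # assumed: mean score over the rollouts

# Toy stand-ins: a uniform sampler and a "discriminator" that scores the share of digits.
def uniform_sampler(seq, rng):
    return rng.choice("0123456789ab")

def digit_share(seq):
    return sum(ch.isdigit() for ch in seq) / len(seq)

q = rollout_reward(list("12"), uniform_sampler, digit_share, 6, 16, random.Random(0))
```

A larger `n_rollouts` reduces the variance of the estimate at the cost of more generator and discriminator calls.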

Preferably, the loss of the current sentence is obtained by computing the binary cross entropy on the output distribution of the discriminator. Specifically, let p be the probability that output P is in state 1 and 1-p the probability that output P is in state 0; let q be the probability that input Q is in state 1 and 1-q the probability that input Q is in state 0. Then the cross entropy of P and Q is

$$H(P \mid Q) = -\bigl(p \log q + (1-p)\log(1-q)\bigr)$$
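The binary cross entropy above translates directly into a few lines of Python. The clamping constant `eps` is an addition for this sketch, used only to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(p: float, q: float, eps: float = 1e-12) -> float:
    """H(P|Q) = -(p*log(q) + (1-p)*log(1-q)), with q clamped away from 0 and 1."""
    q = min(max(q, eps), 1.0 - eps)
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))

# Label p = 1 ("real"): the loss shrinks as the predicted probability q approaches 1.
confident = binary_cross_entropy(1.0, 0.9)
unsure = binary_cross_entropy(1.0, 0.5)
```

With p = 1 the second term vanishes and the loss reduces to -log q, which is the simplification the following paragraph relies on.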

Preferably, for the generated sequence, when the generator generates a false sequence, the cross entropy when the discriminator judges it to be true is

$$\text{loss} = -\bigl(1 \cdot \log D(Y_{1:T}) + 0 \cdot \log(1 - D(Y_{1:T}))\bigr) = -\log D(Y_{1:T})$$

For the sequence being discriminated, the discriminator identifies the true source of the sequence. When a sequence is true, the cross entropy when the discriminator judges it to be true is

When a sequence is false, the cross entropy when the discriminator judges it to be false is

The minimum cross entropy is calculated by the following formula:
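The expressions for the true case, the false case and the minimum are not reproduced in this text. Assuming they follow from the binary cross entropy defined earlier, with label p = 1 for a true sequence and p = 0 for a false one, they would plausibly take the form below. The last line is the standard discriminator objective for generative adversarial networks and is an assumption here, not a formula confirmed by the source; $p_{\text{data}}$ (the source data distribution) and $G$ (the generator's distribution) are notation introduced for this sketch.

```latex
\begin{aligned}
\text{true sequence judged true:}\quad
  & \text{loss} = -\bigl(1 \cdot \log D(Y_{1:T}) + 0 \cdot \log(1 - D(Y_{1:T}))\bigr) = -\log D(Y_{1:T}) \\
\text{false sequence judged false:}\quad
  & \text{loss} = -\bigl(0 \cdot \log D(Y_{1:T}) + 1 \cdot \log(1 - D(Y_{1:T}))\bigr) = -\log\bigl(1 - D(Y_{1:T})\bigr) \\
\text{assumed discriminator objective:}\quad
  & \min_{D}\; -\mathbb{E}_{Y \sim p_{\text{data}}}\bigl[\log D(Y)\bigr]
     - \mathbb{E}_{Y \sim G}\bigl[\log\bigl(1 - D(Y)\bigr)\bigr]
\end{aligned}
```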

Preferably, the same statistical characteristics means that the proportion of numbers or characters in the text is the same.
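One way to check this property is to compare the share of digits and letters in the source text and the generated text, as in the sketch below. The three character classes and the `tolerance` parameter are assumptions for illustration; the patent speaks of the proportions being the same and does not say how the comparison is carried out.

```python
def char_class_shares(text: str) -> dict[str, float]:
    """Return the share of digits, letters and other characters in the text."""
    counts = {"digit": 0, "letter": 0, "other": 0}
    for ch in text:
        if ch.isdigit():
            counts["digit"] += 1
        elif ch.isalpha():
            counts["letter"] += 1
        else:
            counts["other"] += 1
    total = len(text) or 1                      # avoid dividing by zero on empty text
    return {name: n / total for name, n in counts.items()}

def same_statistics(source: str, generated: str, tolerance: float = 0.05) -> bool:
    """True if every character-class share differs by at most `tolerance`."""
    a, b = char_class_shares(source), char_class_shares(generated)
    return all(abs(a[name] - b[name]) <= tolerance for name in a)
```

For example, `same_statistics("id:4821 ZHANG", "id:9037 WANGX")` holds because both strings contain four digits, seven letters and two other characters.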

Compared with the prior art, the technical solution of the invention has the following beneficial effects:

The invention provides a method for desensitizing personal digital space data based on a text generative adversarial network. By training the text generative adversarial network model, it generates desensitized data with the same statistical characteristics and structure as the parsed file containing sensitive information, thereby desensitizing structured text information and achieving a good text desensitization effect without affecting the structure of the data in the personal digital space.

Drawings

FIG. 1 is a flow chart of the steps for implementing the technical solution of the invention;

FIG. 2 is a flow chart of the desensitization workflow of the text generative adversarial network model of the invention;

FIG. 3 is a network diagram of the generator of the invention;

FIG. 4 is a network diagram of the discriminator of the invention;

FIG. 5 is a schematic structural diagram of the text generative adversarial network model of the invention;

FIG. 6 is a comparison of data before and after desensitization with Gaussian-distributed random numbers according to the invention.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent;

for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;

it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Example 1

As shown in FIGS. 1-2, a method for desensitizing personal digital space data based on a text generative adversarial network comprises the following steps:

S1: acquiring a data file to be desensitized from the personal digital space;

more specifically, the data file to be desensitized is based on semi-structured medical information data in a distributed database;

constructing a text generative adversarial network model;

more specifically, the text generative adversarial network model comprises a generator and a discriminator;

more specifically, as shown in FIG. 3, the generator generates sequences using a recurrent neural network;

more specifically, as shown in FIG. 4, the discriminator uses a convolutional neural network to discriminate the sequences generated by the generator;

S2: parsing the data file to be desensitized to obtain a parsed file containing sensitive information; the parsed file is a JSON file;
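A minimal sketch of the parsing in step S2 might walk a JSON record and collect the fields treated as sensitive. The key names in `SENSITIVE_KEYS` and the sample record are hypothetical; the patent does not list which fields of the medical data count as sensitive.

```python
import json

SENSITIVE_KEYS = {"name", "id_number", "phone"}   # hypothetical field names

def extract_sensitive(node, path=""):
    """Return (path, value) pairs for sensitive leaf fields in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in SENSITIVE_KEYS and not isinstance(value, (dict, list)):
                found.append((child, str(value)))
            else:
                found.extend(extract_sensitive(value, child))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            found.extend(extract_sensitive(item, f"{path}[{i}]"))
    return found

record = json.loads(
    '{"patient": {"name": "Zhang San", "phone": "13800000000"},'
    ' "visits": [{"dept": "cardiology", "id_number": "X123"}]}'
)
fields = extract_sensitive(record)
```

Keeping the path of each value makes it possible to write generated replacements back into the same positions, so the structure of the record is left unchanged.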

more specifically, in step S3, the text generative adversarial network model is trained using a Monte Carlo search strategy;

more specifically, as shown in FIG. 5, the specific steps of training the text generative adversarial network model are as follows:

the vectors obtained by encoding the words of the source data are input to the embedding layer of the recurrent neural network to obtain the embedding-layer vectors $x_1, \ldots, x_T$, and the hidden-layer vectors $h_1, \ldots, h_T$ are output, giving

$$h_t = R(h_{t-1}, x_t)$$

$$p(y_t \mid x_1, \ldots, x_t) = \mathrm{softmax}(b + W h_t)$$

the reward $Q$ for the current sentence is denoted as

$$Q = D(Y_{1:t})$$

To obtain the discriminator's evaluation of a complete sequence, a Monte Carlo search strategy is used to generate the $T - t$ words that are still unknown, so that a complete sequence is available for evaluation; for an n-time Monte Carlo search, it is expressed as

where the concatenation is performed row-wise;

training is carried out in a loop using the policy gradient method, and the parameters of the generator are updated according to the magnitude of the reward Q, thereby reducing the loss of the generated sentences; through this loop training, the model converges when the error of the discriminator is minimal;
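The policy gradient update mentioned above can be illustrated on a toy one-token "generator". This is a REINFORCE-style sketch under stated assumptions: the generator is reduced to a single softmax over three tokens, the reward function is a stand-in for the discriminator, and the learning rate is arbitrary. It only shows how a higher reward Q pushes probability toward the rewarded token.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def reinforce_step(logits, reward_fn, rng, lr=0.5):
    """One policy-gradient update: logits += lr * Q * grad log pi(token)."""
    probs = softmax(logits)
    token = rng.choices(range(len(logits)), weights=probs)[0]
    q = reward_fn(token)
    # The gradient of log softmax with respect to logit j is 1[j == token] - probs[j].
    return [
        theta + lr * q * ((1.0 if j == token else 0.0) - probs[j])
        for j, theta in enumerate(logits)
    ]

def reward(token):
    """Stand-in for the discriminator: only token 2 looks 'real'."""
    return 1.0 if token == 2 else 0.0

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = reinforce_step(logits, reward, rng)
final_probs = softmax(logits)
```

After the loop, `final_probs[2]` is larger than its starting value of one third, since every update with a non-zero reward raises the logit of token 2 and lowers the others.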

more specifically, the loss of the current sentence is obtained by computing the binary cross entropy on the output distribution of the discriminator, specifically: let p be the probability that output P is in state 1 and 1-p the probability that output P is in state 0; let q be the probability that input Q is in state 1 and 1-q the probability that input Q is in state 0; then the cross entropy of P and Q is

$$H(P \mid Q) = -\bigl(p \log q + (1-p)\log(1-q)\bigr)$$

more specifically, for the generated sequence, when the generator generates a false sequence, the cross entropy when the discriminator judges it to be true is

$$\text{loss} = -\bigl(1 \cdot \log D(Y_{1:T}) + 0 \cdot \log(1 - D(Y_{1:T}))\bigr) = -\log D(Y_{1:T})$$

The minimum cross entropy is calculated by the following formula:

in practice, for the discriminator to identify sequences accurately, the smaller the cross entropy, the better;

if so, desensitized text data with the same statistical characteristics as the source data is obtained; FIG. 6 shows a comparison before and after desensitization with Gaussian-distributed random numbers;

more specifically, the same statistical characteristics means that the proportion of numbers or characters in the text is the same;

if not, the process returns to step S3.

Table 1 is a comparison of textual data before and after desensitization by the described method.

TABLE 1

It should be understood that the above embodiments are merely examples given to illustrate the invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the invention.

Claims

1. A method for desensitizing personal digital space data based on a text generative adversarial network, comprising the following steps:

S1: acquiring a data file to be desensitized from the personal digital space, and constructing a text generative adversarial network model;

if not, the process returns to step S3.

2. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 1, wherein the data file to be desensitized is based on semi-structured medical information data in a distributed database.

3. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 1, wherein the text generative adversarial network model comprises a generator and a discriminator.

4. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 3, wherein the generator generates sequences using a recurrent neural network.

5. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 3, wherein the discriminator uses a convolutional neural network to discriminate the sequences generated by the generator.

6. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 3, wherein in step S3 the text generative adversarial network model is trained using a Monte Carlo search strategy.

7. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 6, wherein the specific steps of training the text generative adversarial network model are:

inputting the vectors obtained by encoding the words of the source data into the embedding layer of the recurrent neural network to obtain the embedding-layer vectors $x_1, \ldots, x_T$, and outputting the hidden-layer vectors $h_1, \ldots, h_T$, to obtain

$$h_t = R(h_{t-1}, x_t)$$

where $h_{t-1}$ is the hidden-layer vector of the previous state; $h_t$ and $x_t$ are the hidden-layer vector and the embedding-layer vector of the current state, respectively; $t \in T$, where $T$ is the word vector sequence number; and $R$ is the RNN network;

obtaining, from the hidden-layer vector through the softmax layer of the recurrent neural network, the distribution probability of $y_t$ in the sequence $Y_{1:t}$ generated up to the current state:

$$p(y_t \mid x_1, \ldots, x_t) = \mathrm{softmax}(b + W h_t)$$

the reward $Q$ for the current sentence being denoted as

$$Q = D(Y_{1:t})$$

For an n-time Monte Carlo search, it is expressed as

For each sequence, the embedding-layer vectors $x_1, \ldots, x_T$ are concatenated to represent the current sequence

where the concatenation is performed row-wise;

a convolution kernel $\omega$ is applied to the sequence vector $d_{1:T}$ to perform the convolution operation

after the pooling layer, the vector $c = \max(c_1, \ldots, c_{T-l+1})$ is obtained, and the sigmoid function of the fully connected layer outputs the probability that the sequence is judged to be real, namely the reward $Q$;

8. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 7, wherein the loss of the current sentence is obtained by computing the binary cross entropy on the output distribution of the discriminator, specifically: let p be the probability that output P is in state 1 and 1-p the probability that output P is in state 0; let q be the probability that input Q is in state 1 and 1-q the probability that input Q is in state 0; then the cross entropy of P and Q is

$$H(P \mid Q) = -\bigl(p \log q + (1-p)\log(1-q)\bigr)$$

9. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 8, wherein, for a generated sequence, when the generator generates a false sequence, the cross entropy when the discriminator judges it to be true is

$$\text{loss} = -\bigl(1 \cdot \log D(Y_{1:T}) + 0 \cdot \log(1 - D(Y_{1:T}))\bigr) = -\log D(Y_{1:T})$$

The minimum cross entropy is calculated by the following formula:

10. The method for desensitizing personal digital space data based on a text generative adversarial network according to claim 1, wherein the same statistical characteristics means that the proportion of numbers or characters in the text is the same.