CN111241587B - Data desensitization method and device

Info

Publication number: CN111241587B
Application number: CN202010071239.3A
Authority: CN
Inventors: 张美跃; 周业; 陈佳伟; 周定云; 俞宏青; 俞基锋
Original assignee: Hengruitong Fujian Information Technology Co ltd
Current assignee: Hengruitong Fujian Information Technology Co ltd
Priority date: 2020-01-21
Filing date: 2020-01-21
Publication date: 2023-09-29
Anticipated expiration: 2040-01-21
Also published as: CN111241587A

Abstract

The data desensitization method and device acquire original data and apply kernelization (nucleation) processing to obtain new data; perform dimension reduction on the new data to obtain dimension-reduced data, which removes redundant information, simplifies computation, and reduces unnecessary overhead; and perform centralization processing on the dimension-reduced data to obtain desensitized data, protecting private data while preserving the usability of the data.

Description

Data desensitization method and device

Technical Field

The invention relates to the technical field of data processing, in particular to a data desensitizing method and device.

Background

In recent years, with the development of information technology, the volume of personal data generated has grown exponentially, and large amounts of personal information are stored and published by government departments, commercial institutions, and the like. Data publishing, as a means of information sharing, facilitates data exchange and sharing but also increases the risk of disclosing private personal data. "Private data" is sensitive information that the data owner does not want others to know, such as home address, identity card number, phone number, disease information, and location information. For example, to study the usage of each type of drug and patients' illnesses, the relevant departments may need to provide drug purchase table data, which contains a large amount of private data. If the drug purchase table data is published directly, patients' private information may be revealed. The simplest way to process the table data so that patients' disease information is not disclosed is to remove the patient name attribute; however, an attacker can still infer personal identity from the sensitive attributes by means of background knowledge, linking attacks, and the like. Conversely, if all sensitive attributes are removed from the data, the data becomes meaningless for research.

At present, research on privacy disclosure in data publishing mainly concerns methods such as restricted data publishing, data perturbation, and k-anonymity. Although these methods protect data privacy to some extent, they have security and usability shortcomings. Restricted data publishing mainly cuts off the associations between data items, but this reduces the usability of the data, and the amount of data to publish is hard to control. Data perturbation alters the data by adding suitable noise, which helps preserve data characteristics, but yields low clustering usability and high computational cost. k-anonymity ensures that the published data contains k mutually indistinguishable records, so that an attacker cannot tell which specific individual a piece of private information belongs to; this protects personal privacy to some extent but reduces the clustering usability of the data.
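To make the k-anonymity condition described above concrete, the following minimal sketch checks whether every combination of quasi-identifier values occurs at least k times in a table. The function name `is_k_anonymous`, the attribute names, and the sample records are hypothetical and are not taken from the patent.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical drug-purchase records (illustrative values only).
records = [
    {"age": "30-39", "zip": "350*", "drug": "A"},
    {"age": "30-39", "zip": "350*", "drug": "B"},
    {"age": "40-49", "zip": "351*", "drug": "A"},
]

# The third record is the only one with its (age, zip) combination,
# so the table is not 2-anonymous.
print(is_k_anonymous(records, ["age", "zip"], 2))
```

Generalizing values until the check passes is what costs clustering usability: the coarser the published values, the less structure remains in the data.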

Therefore, existing privacy protection mechanisms for data publishing have two main problems: on the one hand, they are computationally complex and costly; on the other hand, they struggle to balance data usability and privacy.

Disclosure of Invention

(I) Technical problem to be solved

In order to solve the problems in the prior art, the invention provides a data desensitizing method and device, which can reduce the calculation cost and protect the private data on the premise of ensuring the usability of the data.

(II) Technical solution

To achieve the above purpose, the main technical solution adopted by the invention is:

a data desensitization method, comprising the following steps:

S1: acquire original data and perform kernelization (nucleation) processing to obtain new data;

S2: perform dimension reduction on the new data to obtain dimension-reduced data;

S3: perform centralization processing on the dimension-reduced data to obtain desensitized data.

To achieve the above purpose, another main technical solution adopted by the invention is:

a data desensitization device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements steps S1 to S3 of the above method when executing the program.

(III) Beneficial effects

The beneficial effects of the invention are as follows: original data is acquired and kernelized to obtain new data; the new data is reduced in dimension to obtain dimension-reduced data, which removes redundant information, simplifies computation, and reduces unnecessary overhead; and the dimension-reduced data is centralized to obtain desensitized data, protecting private data while preserving the usability of the data.

Drawings

FIG. 1 is a flow chart of a data desensitization method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a data desensitization device according to an embodiment of the present invention.

[Description of reference numerals]

1: data desensitization device;

2: a memory;

3: a processor.

Detailed Description

The invention will be better explained by the following detailed description of the embodiments with reference to the drawings.

Example 1

Referring to FIG. 1, a data desensitization method includes the following steps:

S1: acquire original data and perform kernelization (nucleation) processing to obtain new data;

S2: perform dimension reduction on the new data to obtain dimension-reduced data;

S3: perform centralization processing on the dimension-reduced data to obtain desensitized data.

Step S1 specifically comprises:

acquiring original data and kernelizing the nonlinear data in the original data to convert it into linear data, thereby obtaining new data.

Step S2 specifically comprises:

performing dimension reduction on the new data through principal component analysis to obtain dimension-reduced data.

The centralization processing includes:

constructing equivalence sets corresponding to the dimension-reduced data according to the distance minimization principle.

The centralization processing further includes:

calculating the mean of each column of the dimension-reduced data and replacing the specific values in each column with that column's mean.

Example 2

This embodiment differs from Example 1 in that it further explains, with reference to a specific application scenario, how the above data desensitization method is implemented:

the invention mainly comprises two stages: data dimension reduction and centralization processing.

1. Data dimension reduction stage

Acquire the original table data $S_{n\times h}$ to be published, where n is the number of records in the table data and h is its dimension. First, the numerical nonlinear data in $S_{n\times h}$ is converted into numerical linear data through kernelization, giving new table data $S'_{n\times h}$. Then, the new data $S'_{n\times h}$ is reduced in dimension by principal component analysis to obtain the dimension-reduced table data $S''$.

Each record includes m public attributes and t sensitive attributes, where m + t = w. Let $U = (u_1, u_2, \dots, u_m)$ be the public attributes in the table data, where $u_i$ (i = 1, 2, …, m) is the i-th public attribute, and let $V = (v_1, v_2, \dots, v_t)$ be the sensitive attributes, where $v_j$ (j = 1, 2, …, t) is the j-th sensitive attribute. The numerical nonlinear table data extracted from the original table data $S_{n\times h}$ is denoted $T_{n\times l}$; it has n samples and dimension l.

The specific implementation steps are as follows:

First, the numerical nonlinear data in the table data is kernelized, converting the table data $S_{n\times h}$ into $S'_{n\times h}$.

S111: extract the nonlinear data in the table data and represent it by a matrix A: $A = (A_1, A_2, \dots, A_n)^T$, where $A_f = (a_{f1}, a_{f2}, \dots, a_{fl})$ denotes the f-th row of data in A;

S112: project each row $A_f = (a_{f1}, a_{f2}, \dots, a_{fl})$ of A in turn onto the hyperplane $Z_f = (z_{f1}, z_{f2}, \dots, z_{fd})$ to obtain the projected data, where f = 1, 2, …, n;

S113: obtain the f-th row of the projected data: each $z_{fj}$ satisfies a condition [formula not reproduced in the source text], where $w_{fi}$ is the image of the data $a_{fi}$ in the hyperplane, i.e. $w_{fi} = \phi(a_{fi})$;

S114: calculate $z_{fj}$ [formula not reproduced in the source text]; the formula involves the j-th component of $a_{fi}$ and $\lambda_f$, the eigenvalue of $A_f$;

S115: introduce a kernel function: $k_f(a_{fi}, a_{fj}) = \phi(a_{fi})^T \phi(a_{fj})$;

S116: calculate $K_f a^j = \lambda_j a^j$, where $K_f$ is the kernelized vector of the f-th row of data;

S117: finally obtain the kernel matrix: $K = (K_1, K_2, \dots, K_n)^T$.
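The formulas for S113 and S114 are not reproduced in this text. For orientation only, the standard kernel principal component analysis derivation, which steps S113 to S116 appear to follow, can be written as below. The mapping of the patent's symbols onto this derivation ($z$ as a projection direction, $w_i = \phi(a_i)$ as the feature-space image of a data point, $\alpha_i$ as expansion coefficients, $K$ as the Gram matrix) is an assumption, not a reconstruction of the patent's own equations.

```latex
\begin{align*}
\Big(\sum_{i} w_i w_i^{T}\Big) z &= \lambda z, & w_i &= \phi(a_i) \\
z &= \sum_{i} w_i \alpha_i, & \alpha_i &= \frac{1}{\lambda}\, w_i^{T} z \\
K \alpha &= \lambda \alpha, & K_{ij} &= k(a_i, a_j) = \phi(a_i)^{T} \phi(a_j)
\end{align*}
```

The last line is what makes the kernel function useful: the eigenproblem is expressed entirely through inner products $k(a_i, a_j)$, so the mapping $\phi$ never has to be evaluated explicitly.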

By kernelizing the numerical nonlinear data in the original table data, the table data is converted into $S'_{n\times h} = (S_{n\times(h-l)}, K_{n\times l})$.
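As a minimal sketch of the kernel step, the code below builds a Gram matrix with a Gaussian (RBF) kernel and solves the eigenproblem of step S116 with NumPy. The choice of the RBF kernel, the `gamma` parameter, the function name `rbf_kernel_matrix`, and the sample data are all assumptions; the patent does not name a specific kernel. The sketch also computes an n × n matrix over records and does not attempt to reproduce the $K_{n\times l}$ shape stated above.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) = phi(x_i)^T phi(x_j)."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by rounding before exponentiating.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Hypothetical nonlinear numerical columns: n = 4 records, l = 2 attributes.
A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = rbf_kernel_matrix(A, gamma=0.5)

# Eigen-decomposition K a = lambda a (compare step S116); eigh returns
# eigenvalues in ascending order for a symmetric matrix.
eigvals, eigvecs = np.linalg.eigh(K)
print(K.shape, eigvals[-1])
```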

Next, the numerical linear data in $S'_{n\times h}$ is reduced in dimension by principal component analysis to obtain the dimension-reduced table data $S''$. The specific implementation steps are as follows:

S121: calculate the mean of each column of data: $E_j = \frac{1}{n}\sum_{i=1}^{n} s_{ij}$, where j = 1, …, h;

S122: center each item of the linear data, i.e. subtract the mean of the corresponding column from each item: $s_{ij} = s_{ij} - E_j$, where i = 1, …, n;

S123: calculate the covariance matrix F [formula not reproduced in the source text];

S124: perform eigenvalue decomposition on F to obtain the eigenvalues $\lambda_i$ and the corresponding eigenvectors $\mu_i$: solve $|\lambda E - F| = 0$ for $\lambda$ to obtain the eigenvalues, where E here denotes the identity matrix; substitute each $\lambda$ into $(\lambda E - F)\mu = 0$; the linearly independent solution vectors are the eigenvectors;

S125: sort the eigenvalues $\lambda_i$: $\lambda_1 > \lambda_2 > \dots > \lambda_h$, with corresponding eigenvectors $\mu_1, \mu_2, \dots, \mu_h$;

S126: select the number of principal components: given a usability threshold $\alpha$ and a parameter b for the number of principal components retained, test whether $1 - p \le \alpha$; if the inequality holds, output b; otherwise let b = b + 1. Here p is computed from the eigenvalues $\lambda_i$ [formula not reproduced in the source text];

S127: output the set of eigenvectors corresponding to the first b eigenvalues: $V_b = \{\mu_1, \mu_2, \dots, \mu_b\}$;

S128: normalize the eigenvectors in $V_b$ to obtain the feature matrix A: first calculate the modulus of each eigenvector in $V_b$, then divide each eigenvector by its modulus to obtain a matrix of unit vectors [both formulas not reproduced in the source text];

S129: calculate the projection: $S''_{n\times b} = S'_{n\times h} A$.
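Steps S121 to S129 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `pca_reduce` and the sample data are hypothetical, the covariance uses NumPy's default 1/(n − 1) scaling, and p in S126 is assumed to be the cumulative share of the first b eigenvalues in the eigenvalue total, because the source formula for p is not reproduced.

```python
import numpy as np

def pca_reduce(S, alpha):
    """Reduce S (n x h) to the first b principal components, following S121-S129."""
    E = S.mean(axis=0)                                  # S121: mean of each column
    centered = S - E                                    # S122: subtract column means
    F = np.atleast_2d(np.cov(centered, rowvar=False))   # S123: covariance, 1/(n-1) scaling
    eigvals, eigvecs = np.linalg.eigh(F)                # S124: eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]                   # S125: sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # S126: ASSUMPTION - p is the cumulative share of the first b eigenvalues.
    share = np.cumsum(eigvals) / np.sum(eigvals)
    b = 1
    while b < len(eigvals) and 1.0 - share[b - 1] > alpha:
        b += 1
    A = eigvecs[:, :b]                                  # S127: first b eigenvectors
    A = A / np.linalg.norm(A, axis=0)                   # S128: scale each to unit length
    return centered @ A, b                              # S129: projected data and b

# Hypothetical table: the third column is the sum of the first two,
# so it carries no extra information.
x = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
S = np.column_stack([x, y, x + y])
reduced, b = pca_reduce(S, alpha=0.05)
print(b, reduced.shape)  # 2 (8, 2)
```

With a looser threshold such as `alpha=0.3`, the same data keeps only one component, which shows how the threshold trades usability against dimension.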

2. Centralization stage

S21: create a data set $S^*$ corresponding to the dimension-reduced table data $S''$ and initialize it [initial value not reproduced in the source text]; set the number r of equivalence sets to obtain r equivalence sets $D_1, \dots, D_r$, initialize them [initial value not reproduced in the source text], and let j = 1;

S22: select any record $s_i$ from $S''$ as the initial element of the equivalence set $D_j$, i.e. $D_j = \{s_i\}$ and $S'' = S'' - \{s_i\}$;

S23: find the record $s_i$ in $S''$ that is closest to the equivalence set $D_j$, and set $D_j \leftarrow D_j + \{s_i\}$ and $S'' = S'' - \{s_i\}$; repeat this step until the number of records in $D_j$ is greater than or equal to k;

S24: centralize the elements of the equivalence set $D_j$: calculate the mean of each column attribute in $D_j$ and replace the specific values of each column attribute with that mean to obtain a new equivalence set $D'_j$;

S25: $S^* = S^* + \{D'_j\}$; if j < 5, let j = j + 1 and repeat from step S22; otherwise end.
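The greedy construction in S21 to S25 can be sketched as below. Several details are assumptions rather than the patent's specification: the distance between a record and an equivalence set is taken as the Euclidean distance to the set's centroid, the first unassigned record is used as the seed, the loop runs until all records are assigned instead of for a preset number of sets, and leftover records (fewer than k) are folded into the last set. The function name `centralize` and the sample table are hypothetical.

```python
import numpy as np

def centralize(S, k):
    """Group records greedily into sets of at least k and replace each by its set mean."""
    S = np.asarray(S, dtype=float)
    remaining = list(range(len(S)))      # indices of records not yet assigned (S'')
    out = np.empty_like(S)               # desensitized table (S*)
    while remaining:
        group = [remaining.pop(0)]       # S22: seed the set D_j with one record
        while len(group) < k and remaining:
            # S23: ASSUMPTION - distance to a set = Euclidean distance to its centroid.
            centroid = S[group].mean(axis=0)
            dists = np.linalg.norm(S[remaining] - centroid, axis=1)
            group.append(remaining.pop(int(np.argmin(dists))))
        if len(remaining) < k:           # ASSUMPTION: fold leftovers into the last set
            group.extend(remaining)
            remaining = []
        out[group] = S[group].mean(axis=0)   # S24: replace values by column means
    return out

# Hypothetical dimension-reduced table with two obvious clusters.
table = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
desensitized = centralize(table, k=2)
print(desensitized)
```

Because every record is replaced by the mean of its own set, the column means of the whole table are unchanged, while each published row is shared by at least k records (provided the table has at least k records).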

Specifically:

(1) Existing privacy protection methods for data publishing have difficulty preserving the clustering usability of the original numerical data. To address this, the invention applies a clustering approach: it constructs equivalence sets for the n records according to the distance minimization principle and replaces the attribute values in each equivalence set with their mean. This achieves centralized anonymity, which secures data privacy while keeping information loss small. The effectiveness and security of the algorithm are also analyzed theoretically.

(2) Existing data publishing protection mechanisms suffer from large data overhead and high computational complexity. To address this, the invention applies privacy protection after reducing the dimension of the data: numerical nonlinear data is converted into linear data through kernelization, and the linear data is then reduced in dimension by principal component analysis. This removes redundant information, simplifying computation and reducing unnecessary overhead.

(3) By defining reasonable distances between two records and between a record and an equivalence set, the invention lets these distances directly reflect the information loss of the centralized data.
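The text does not give a formula for information loss. One common way to quantify it, shown here purely as an assumption, is the ratio of the squared distance between original and centralized records (SSE) to the total squared distance of the original records from the global mean (SST). The function name `information_loss` and the sample values are hypothetical.

```python
import numpy as np

def information_loss(original, centralized):
    """Ratio SSE / SST: 0 means no loss, 1 means every record became the global mean."""
    sse = np.sum((original - centralized) ** 2)
    sst = np.sum((original - original.mean(axis=0)) ** 2)
    return float(sse / sst)

# Hypothetical records before and after replacing each pair by its mean.
original = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centralized = np.array([[0.0, 1.0], [0.0, 1.0], [10.0, 11.0], [10.0, 11.0]])
loss = information_loss(original, centralized)
print(round(loss, 4))
```

Grouping nearby records keeps this ratio small, which is the intuition behind building equivalence sets by distance minimization.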

Example 3

Referring to FIG. 2, a data desensitization device 1 includes a memory 2, a processor 3, and a computer program stored in the memory 2 and executable on the processor 3, wherein the processor 3 implements the steps of Example 1 when executing the program.

The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent changes made by the specification and drawings of the present invention, or direct or indirect application in the relevant art, are included in the scope of the present invention.

Claims

1. A method of desensitizing data comprising the steps of:

S3: performing centralization processing on the dimension-reduced data to obtain desensitized data;

step S1 specifically comprises:

acquiring original data, and performing kernelization (nucleation) processing on nonlinear data in the original data to convert the nonlinear data into linear data, thereby obtaining new data;

S111: extracting the nonlinear data in the table data and representing it by a matrix A: $A = (A_1, A_2, \dots, A_n)^T$, where $A_f = (a_{f1}, a_{f2}, \dots, a_{fl})$ denotes the f-th row of data in A;

S112: projecting each row $A_f = (a_{f1}, a_{f2}, \dots, a_{fl})$ of A in turn onto the hyperplane $Z_f = (z_{f1}, z_{f2}, \dots, z_{fl})$ to obtain the projected data, where f = 1, 2, …, n;

S113: obtaining the f-th row of the projected data: each $z_{fj}$ satisfies a condition [formula not reproduced in the source text], where $w_{fi}$ is the image of the data $a_{fi}$ in the hyperplane, i.e. $w_{fi} = \phi(a_{fi})$, and $w_{fi}^T$ denotes the transpose of $w_{fi}$;

S114: calculating $z_{fj}$ [formula not reproduced in the source text]; the formula involves the j-th component of $a_{fi}$ and $\lambda_f$, the eigenvalue of $A_f$;

S116: calculating $K_f a^j_{fi} = \lambda_f a^j_{fi}$, where $K_f$ is the kernelized vector of the f-th row of data;

S117: finally obtaining the kernel matrix: $K = (K_1, K_2, \dots, K_n)^T$;

The step S2 specifically comprises the following steps:

performing dimension reduction processing on the new data through principal component analysis to obtain dimension reduced data;

S121: calculating the mean of each column of data: $E_j = \frac{1}{n}\sum_{i=1}^{n} s'_{ij}$, where j = 1, …, h;

S122: centering each item of the linear data, i.e. subtracting the mean of the corresponding column from each item: $s'_{ij} = s'_{ij} - E_j$, where i = 1, …, n;

S123: calculating the covariance matrix F [formula not reproduced in the source text];

S124: performing eigenvalue decomposition on F to obtain the eigenvalues $\lambda_i$ and the corresponding eigenvectors $\mu_i$: solving $|\lambda E - F| = 0$ for $\lambda$ to obtain the eigenvalues, where E here denotes the identity matrix; substituting each $\lambda$ into $(\lambda E - F)\mu = 0$, the linearly independent solution vectors being the eigenvectors;

S125: sorting the eigenvalues $\lambda_i$: $\lambda_1 > \lambda_2 > \dots > \lambda_n$, with corresponding eigenvectors $\mu_1, \mu_2, \dots, \mu_n$;

S126: selecting the number of principal components: given a usability threshold $\alpha$ and a parameter b for the number of principal components retained, testing whether $1 - p \le \alpha$; if the inequality holds, outputting b; otherwise letting b = b + 1, where p is defined by a formula not reproduced in the source text;

S128: normalizing the eigenvectors in $V_b$ to obtain the feature matrix A': first calculating the modulus of each eigenvector in $V_b$, then dividing each eigenvector by its modulus to obtain a matrix of unit vectors [both formulas not reproduced in the source text];

S129: calculating the projection: $S''_{n\times b} = S'_{n\times h} A'$;

The centralization processing includes:

constructing equivalence sets corresponding to the dimension-reduced data according to the distance minimization principle;

S21: creating a data set $S^*$ corresponding to the dimension-reduced table data $S''$ and initializing it [initial value not reproduced in the source text]; setting the number r of equivalence sets to obtain r equivalence sets $D_1, \dots, D_r$, initializing them [initial value not reproduced in the source text], and letting j = 1;

S22: selecting any record $s''_i$ from $S''$ as the initial element of the equivalence set $D_j$, i.e. $D_j = \{s''_i\}$ and $S'' = S'' - \{s''_i\}$;

S23: finding the record $s''_i$ in $S''$ that is closest to the equivalence set $D_j$, and setting $D_j \leftarrow D_j + \{s''_i\}$ and $S'' = S'' - \{s''_i\}$; repeating this step until the number of records in $D_j$ is greater than or equal to k;

the centralization processing further includes:

calculating the mean of each column of the dimension-reduced data and replacing the specific values of each column with that column's mean;

S25: $S^* = S^* + \{D'_j\}$; if j < 5, letting j = j + 1 and repeating from step S22; otherwise ending;

by kernelizing the numerical nonlinear data in the original table data, the table data is converted into $S'_{n\times h} = (S_{n\times(h-l)}, K_{n\times l})$; next, the numerical linear data in $S'_{n\times h}$ is reduced in dimension by principal component analysis to obtain the dimension-reduced table data $S''$.

2. An apparatus for desensitizing data, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of:

the step S1 specifically comprises the following steps:

S112: projecting each row $A_f = (a_{f1}, a_{f2}, \dots, a_{fl})$ of A in turn onto the hyperplane $Z_f = (z_{f1}, z_{f2}, \dots, z_{fl})$ to obtain the projected data, where f = 1, 2, …, n;

S115: introducing a kernel function: $k_f(a_{fi}, a_{fj}) = \phi(a_{fi})^T \phi(a_{fj})$;

S117: finally obtaining the kernel matrix: $K = (K_1, K_2, \dots, K_n)^T$;

The step S2 specifically comprises the following steps:

S121: calculating the mean of each column of data: $E_j = \frac{1}{n}\sum_{i=1}^{n} s'_{ij}$, where j = 1, …, h;

S123: calculating the covariance matrix F [formula not reproduced in the source text];

S124: performing eigenvalue decomposition on F to obtain the eigenvalues $\lambda_i$ and the corresponding eigenvectors $\mu_i$: solving $|\lambda E - F| = 0$ for $\lambda$ to obtain the eigenvalues, where E here denotes the identity matrix; substituting each $\lambda$ into $(\lambda E - F)\mu = 0$, the linearly independent solution vectors being the eigenvectors;

S129: calculating the projection: $S''_{n\times b} = S'_{n\times h} A'$;

The centering process includes:

the centralizing process further includes: