CN112464289A - Method for cleaning private data - Google Patents

Method for cleaning private data

Info

Publication number
CN112464289A
CN112464289A (application CN202011453316.8A)
Authority
CN
China
Prior art keywords
data
value
missing
private
values
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011453316.8A
Other languages
Chinese (zh)
Other versions
CN112464289B (en)
Inventor
吴晓鸰
胡庆鹏
胡可
凌捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202011453316.8A
Publication of CN112464289A
Application granted
Publication of CN112464289B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Bioethics (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for cleaning private data, which comprises the following steps: S1: obtain private data from a data owner and preprocess it; S2: form a first missing data set from the attribute data with missing values, and a non-missing data set from the attribute data without missing values; S3: within the first missing data set, form a second missing data set from the normal data values and an abnormal data set from the abnormal data values; S4: construct a data-filling prediction model with a data mining algorithm, and use it to predict and fill the missing values of each item of attribute data in the second missing data set, obtaining a filled data set; S5: merge the filled data set with the non-missing data set to obtain a merged data set, and send the merged data set and the abnormal data set back to the data owner, completing the cleaning of the private data. The invention thus provides a private data cleaning method that solves the problem that existing data cleaning methods cannot clean private data.

Description

Method for cleaning private data
Technical Field
The invention relates to the technical field of private data processing, in particular to a private data cleaning method.
Background
While data mining creates wealth, it also creates a problem of privacy disclosure. Data mining plays a positive role in deep, trend-oriented applications, but at the same time brings a number of problems: for data such as financial transactions, medical records, and network communications, sensitive information may be leaked during mining.
In the field of data mining, privacy falls into two categories. The first is sensitive information contained in the original data itself: conventional data mining operates on unencrypted raw data, meaning that raw data containing personal or corporate private information, such as home phone numbers, bank account numbers, and property status, must be handed to the data miner in order to extract useful knowledge, which may adversely affect the individuals concerned. The second is sensitive knowledge implied by the original data, such as rules describing the behaviour characteristics of a company's best clients, which would seriously damage the enterprise's core competitiveness if obtained illegally by someone with ulterior motives.
However, conventional data cleaning methods operate directly on the source data and require knowledge of the meaning and values of each data field; they therefore cannot clean private data.
In the prior art, for example, the Chinese patent granted on 7/3/2020, "A data link method based on privacy protection and secure multiparty computation" (publication number CN110609831B), blocks local data with an improved k-means classification method to prevent an adversary from obtaining users' sensitive information, but it cannot fill the missing values of private data.
Disclosure of Invention
The invention provides a private data cleaning method, aiming at overcoming the technical defect that the existing data cleaning method cannot clean the private data.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A method for cleaning private data comprises the following steps:
S1: obtaining private data from a data owner, and preprocessing the private data;
the private data comprises a plurality of items of attribute data;
S2: classifying the attribute data with missing values in the private data to form a first missing data set;
classifying each item of attribute data without missing values in the private data to form a non-missing data set;
S3: in the first missing data set, classifying the normal data values of each item of attribute data into a second missing data set, and classifying the abnormal data values into an abnormal data set;
S4: constructing a data-filling prediction model with a data mining algorithm, and predicting and filling the missing values of each item of attribute data in the second missing data set with the model, so as to obtain a filled data set;
S5: merging the filled data set and the non-missing data set to obtain a merged data set, and sending the merged data set and the abnormal data set back to the data owner, thereby completing the cleaning of the private data.
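As a minimal sketch (not the patented implementation), the column splitting of S2 and the merging of S5 can be illustrated with pandas on an invented toy frame; the column names, the data, and the simple mean-fill stand-in for S4 are all assumptions:

```python
import numpy as np
import pandas as pd

def split_by_missing(df: pd.DataFrame):
    """S2: separate attributes with missing values from complete attributes."""
    missing_cols = [c for c in df.columns if df[c].isna().any()]
    complete_cols = [c for c in df.columns if c not in missing_cols]
    return df[missing_cols], df[complete_cols]

# Toy private data: "age" has a gap, "race" is complete (invented values).
df = pd.DataFrame({
    "age":  [1.0, 0.0, None, 1.0],
    "race": [0, 1, 0, 0],
})
first_missing, non_missing = split_by_missing(df)

# S5 (after a filling step; mean imputation is only a placeholder for the
# prediction model of S4): merge the filled set back with the non-missing set.
filled = first_missing.fillna(first_missing.mean())
merged = pd.concat([filled, non_missing], axis=1)
```

The merged frame has the same rows as the input and no remaining gaps, which is what is returned to the data owner together with the abnormal data set.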
Preferably, the privacy data further comprises a data item identifier for uniquely determining an item of attribute data.
Preferably, the private data is encrypted by the data owner, and the data owner labels each item of attribute data so as to indicate whether it is categorical data, continuous data, or a class label;
wherein:
for categorical data, the labels further distinguish ordered categorical data from unordered categorical data;
and for continuous data, the continuous data are sorted according to their data item identifiers.
Preferably, in step S1, the private data is preprocessed as follows: the continuous data are discretized to obtain discretized data, and the categorical data are divided into two categories, namely ordered categorical data and unordered categorical data.
Preferably, the continuous data is discretized with a discretization algorithm based on information entropy, specifically: each data value of an item of continuous data in the private data is traversed, and partition points are set to partition the data values recursively; partitioning stops when the information entropy after partitioning is below a preset entropy threshold or the number of data bins reaches the specified number of bins. At each step, the chosen partition point is the one that minimizes the information entropy of the resulting partition.
Preferably, in the information-entropy-based discretization algorithm,
the amount of information l(x) is inversely related to the probability p(x) of the occurrence of event x:
l(x) = -log_2 p(x)
The information entropy E(x) is expressed as:
E(x) = -Σ_{i=1}^{n} p(x_i) · log_s p(x_i)
and the weighted average is taken as the total entropy:
E = Σ_{i=1}^{n} (m_i / m) · E_i
where x_i, i = 1, 2, ..., n, are the n events, s is the base that fixes the measurement unit of the information entropy, E_i is the information entropy of the i-th partition, m is the total number of divided data over the n events, with
m = Σ_{i=1}^{n} m_i
and m_i is the number of divided data in the i-th event.
Preferably, in step S3, the method further comprises: identifying the data values of each item of attribute data in the first missing data set with a K-means clustering algorithm, so as to distinguish normal data values from abnormal data values; specifically comprising the following steps:
S3.1: reducing the dimensionality of the data values of each item of attribute data in the first missing data set by principal component analysis;
S3.2: randomly selecting k data values from the dimension-reduced data values as cluster centers, correspondingly obtaining k clusters;
S3.3: calculating the distance between each data value and each cluster center, and assigning each data value to the nearest cluster;
S3.4: calculating the mean of the data values in each cluster, and taking the mean as the new cluster center;
S3.5: judging whether the change of the cluster centers has stabilized;
if yes, obtaining the final k clusters;
if not, returning to step S3.3;
S3.6: for the final k clusters, calculating the distance between each data value and the center of the cluster in which it lies;
S3.7: comparing this distance with a preset distance threshold;
if the distance between a data value and the center of its cluster is greater than the preset distance threshold, identifying the data value as an abnormal data value;
and if the distance is not greater than the preset distance threshold, identifying the data value as a normal data value.
Preferably, in the K-means clustering algorithm, the value of k is determined by the elbow (inflection point) method, the silhouette coefficient method, the gap statistic, or an empirical method.
Preferably, in step S4, the data mining algorithm is a BP neural network algorithm.
Preferably, in step S4, the data mining algorithm is a CART (Classification And Regression Tree) algorithm.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a method for cleaning private data, which can clean the data under the condition of not knowing the meaning and the data value of a data field, fully protect the private data and improve the data security; meanwhile, missing values of the private data are filled by using a data mining algorithm.
Drawings
FIG. 1 is a flow chart of the steps for implementing the technical solution of the present invention;
FIG. 2 is a schematic diagram of a clustering result obtained after clustering data values in the present invention;
FIG. 3 is a schematic diagram of outlier rejection of a clustering result in accordance with the present invention;
FIG. 4 is a gradient descent curve obtained by training with a BP neural network according to the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1, a method for cleaning private data comprises the following steps:
S1: obtaining private data from a data owner, and preprocessing the private data;
the private data comprises a number of items of attribute data with missing values and attribute data without missing values; the attribute data comprises normal data values and abnormal data values;
S2: classifying the attribute data with missing values in the private data to form a first missing data set;
classifying each item of attribute data without missing values in the private data to form a non-missing data set;
S3: in the first missing data set, classifying the normal data values of each item of attribute data into a second missing data set, and classifying the abnormal data values into an abnormal data set;
S4: constructing a data-filling prediction model with a data mining algorithm, and predicting and filling the missing values of each item of attribute data in the second missing data set with the model, so as to obtain a filled data set;
S5: merging the filled data set and the non-missing data set to obtain a merged data set, and sending the merged data set and the abnormal data set back to the data owner, thereby completing the cleaning of the private data.
Example 2
More specifically, the privacy data further includes a data item identification for uniquely determining an item of attribute data.
More specifically, the private data is encrypted by the data owner, and each item of attribute data is labelled by the data owner so as to indicate whether it is categorical data, continuous data, or a class label;
wherein:
for categorical data, the labels further distinguish ordered categorical data from unordered categorical data;
and for continuous data, the continuous data are sorted according to their data item identifiers.
More specifically, in step S1, the private data is preprocessed as follows: the continuous data are discretized to obtain discretized data, and the categorical data are divided into two categories, namely ordered categorical data and unordered categorical data.
In a specific implementation, unordered categorical data has no ordering or degree-of-difference relation among its classes or attribute values. Unordered categorical data may be processed with one-hot encoding: an attribute with m possible values is represented in m dimensions.
For example, [Chongqing, Guangdong, Sichuan] can be encoded as [100, 010, 001].
For example, using the OneHotEncoder class from the sklearn framework, and assuming the data set has two attributes, gender and height, the following code is executed (the category lists are given explicitly so that the encoding order matches the description below):

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(categories=[['male', 'female'], ['low', 'middle', 'high']])
enc.fit([['male', 'high'], ['male', 'middle'], ['female', 'low']])
print(enc.transform([['female', 'high']]).toarray())

Gender and height are one-hot encoded: for gender, male is 10 and female is 01; for height, low is 100, middle is 010, and high is 001. The code outputs [[0. 1. 0. 0. 1.]], where 01 represents female and 001 represents high.
An ordered categorical variable can directly use the values after division: for example, [low, middle, high] can be converted directly into [0, 1, 2], and the mapping and discretization can be done directly with the map function in pandas.
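A minimal sketch of this ordered-categorical mapping, assuming a hypothetical height column:

```python
import pandas as pd

# Ordered categories map directly to integer codes via pandas' map.
height = pd.Series(["low", "middle", "high", "middle"], name="height")
height_coded = height.map({"low": 0, "middle": 1, "high": 2})
```

Unlike one-hot encoding, this keeps the ordering information in a single integer column.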
More specifically, the continuous data is discretized with a discretization algorithm based on information entropy, specifically: each data value of an item of continuous data in the private data is traversed, and partition points are set to partition the data values recursively; partitioning stops when the information entropy after partitioning is below a preset entropy threshold or the number of data bins reaches the specified number of bins. At each step, the chosen partition point is the one that minimizes the information entropy of the resulting partition.
More specifically, in the information-entropy-based discretization algorithm,
the amount of information l(x) is inversely related to the probability p(x) of the occurrence of event x:
l(x) = -log_2 p(x)
The information entropy E(x) is expressed as:
E(x) = -Σ_{i=1}^{n} p(x_i) · log_s p(x_i)
and the weighted average is taken as the total entropy:
E = Σ_{i=1}^{n} (m_i / m) · E_i
where x_i, i = 1, 2, ..., n, are the n events and s is the base that fixes the measurement unit of the information entropy: when s = 2 the unit is the bit, and when s = e the unit is the nat. E_i is the information entropy of the i-th partition, m is the total number of divided data over the n events, with
m = Σ_{i=1}^{n} m_i
and m_i is the number of divided data in the i-th event.
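The quantities above can be sketched in Python; the helper names entropy and total_entropy, the toy values, and the label column are assumptions for illustration, not part of the patent:

```python
import math
from collections import Counter

def entropy(labels, s=2):
    """E(x) = -sum_i p(x_i) * log_s p(x_i); s = 2 gives bits."""
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    return -sum(p * math.log(p, s) for p in probs if p > 0)

def total_entropy(partitions, s=2):
    """Weighted average E = sum_i (m_i / m) * E_i over the partitions."""
    m = sum(len(part) for part in partitions)
    return sum(len(part) / m * entropy(part, s) for part in partitions)

# A pure split has total entropy 0; a maximally mixed set has entropy 1 bit.
pure = total_entropy([["a", "a"], ["b", "b"]])
mixed = entropy(["a", "b"])

# Choosing the partition point that minimizes total entropy, per the rule above.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
best = min(
    (total_entropy([[l for v, l in zip(values, labels) if v <= t],
                    [l for v, l in zip(values, labels) if v > t]]), t)
    for t in values[:-1]
)
```

Here the split at 3 separates the two label groups perfectly, so it attains total entropy 0; the recursion described above would then repeat this selection inside each bin.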
More specifically, step S3 further comprises: identifying the data values of each item of attribute data in the first missing data set with a K-means clustering algorithm, so as to distinguish normal data values from abnormal data values; specifically comprising the following steps:
S3.1: reducing the dimensionality of the data values of each item of attribute data in the first missing data set by principal component analysis;
S3.2: randomly selecting k data values from the dimension-reduced data values as cluster centers, correspondingly obtaining k clusters;
S3.3: calculating the distance between each data value and each cluster center, and assigning each data value to the nearest cluster;
S3.4: calculating the mean of the data values in each cluster, and taking the mean as the new cluster center;
S3.5: judging whether the change of the cluster centers has stabilized;
if yes, obtaining the final k clusters;
if not, returning to step S3.3;
S3.6: for the final k clusters, calculating the distance between each data value and the center of the cluster in which it lies;
S3.7: comparing this distance with a preset distance threshold;
if the distance between a data value and the center of its cluster is greater than the preset distance threshold, identifying the data value as an abnormal data value;
and if the distance is not greater than the preset distance threshold, identifying the data value as a normal data value.
In the specific implementation process, the normal data values and the abnormal data values are separated through a K-means clustering algorithm, so that the accuracy of subsequent data processing is improved.
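Steps S3.1 to S3.7 can be sketched with scikit-learn on synthetic data; the value of k, the threshold rule (mean plus two standard deviations of the distances), and the data itself are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Three synthetic 4-dimensional blobs standing in for attribute data values.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 4)),
    rng.normal(loc=5.0, scale=0.5, size=(30, 4)),
    rng.normal(loc=-5.0, scale=0.5, size=(30, 4)),
])

Xr = PCA(n_components=2).fit_transform(X)                     # S3.1: reduce dimensionality
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xr)  # S3.2-S3.5: cluster to convergence
centers = km.cluster_centers_[km.labels_]                     # center of each point's own cluster
dist = np.linalg.norm(Xr - centers, axis=1)                   # S3.6: distance to own cluster center
threshold = dist.mean() + 2 * dist.std()                      # assumed preset distance threshold
abnormal = np.where(dist > threshold)[0]                      # S3.7: beyond threshold -> abnormal
normal = np.where(dist <= threshold)[0]
```

The indices in abnormal would be moved to the abnormal data set, and the rest form the second missing data set.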
More specifically, in the K-means clustering algorithm, the value of k is determined by the elbow (inflection point) method, the silhouette coefficient method, the gap statistic, or an empirical method.
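One of these choices, the silhouette coefficient method, can be sketched as follows; the candidate range of k and the toy blobs are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs; the true k is 3 by construction.
X, _ = make_blobs(n_samples=90, centers=[[0, 0], [10, 10], [0, 10]],
                  cluster_std=0.5, random_state=0)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points
best_k = max(scores, key=scores.get)         # pick the k with the best score
```

The silhouette score peaks where clusters are compact and well separated, so on these blobs best_k recovers 3.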
More specifically, in step S4, the data mining algorithm is a BP neural network algorithm.
In a specific implementation process, the BP neural network algorithm comprises the following steps:
S4.1: randomly initialize the weights and biases of the BP neural network, where each neural unit j of the network has a bias θ_j;
S4.2: forward propagation from the input layer of the BP neural network:
I_j = Σ_i w_ij · O_i + θ_j
where I_j is the net input of neuron j in the current layer, O_i is the output value of neuron i, w_ij is the weight of the connection between neuron i and neuron j, and neuron i lies in the layer above neuron j.
I_j is then passed through a nonlinear transformation to obtain the output O_j, which serves as the input to the next-layer neuron k:
O_j = 1 / (1 + e^(-I_j))
S4.3: error calculation.
Error of the output layer I_o:
Err_o = O_o · (1 - O_o) · (T_o - O_o)
where T_o is the true value, O_o is the predicted value, and Err_o is the error of the output layer I_o.
Error of a hidden layer:
Err_j = O_j · (1 - O_j) · Σ_k Err_k · w_jk
where Err_k is the error of neuron k in the next layer below neuron j, and w_jk is the weight of the connection between I_j and I_k.
S4.4: weight update and bias update:
ΔW_ij = l · Err_j · O_i
W_ij = W_ij + ΔW_ij
where Err_j is the error of neuron j and l is the learning rate, usually taken in [0, 1].
Bias update:
Δθ_j = l · Err_j
θ_j = θ_j + Δθ_j
S4.5: repeat the above steps until a termination condition is reached: the weight updates fall below a threshold, the prediction error rate falls below a threshold, or a preset number of iterations is reached.
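The update rules above can be sketched as a small NumPy implementation in batch form; the network sizes, learning rate, toy OR target, and iteration count are all assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # S4.2 nonlinearity: O_j = 1 / (1 + e^(-I_j))
    return 1.0 / (1.0 + np.exp(-x))

# S4.1: random weights, zero biases (one hidden layer of 3 units).
W1 = rng.normal(scale=0.5, size=(2, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1)); b2 = np.zeros(1)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [1.]])  # toy OR target
l = 0.5                                  # learning rate in [0, 1]

def forward(X):
    H = sigmoid(X @ W1 + b1)  # S4.2: I_j = sum_i w_ij O_i + theta_j
    O = sigmoid(H @ W2 + b2)
    return H, O

_, O0 = forward(X)
err_before = np.mean((T - O0) ** 2)
for _ in range(2000):
    H, O = forward(X)
    Err_o = O * (1 - O) * (T - O)         # S4.3: output-layer error
    Err_h = H * (1 - H) * (Err_o @ W2.T)  # S4.3: hidden-layer error
    W2 += l * H.T @ Err_o; b2 += l * Err_o.sum(axis=0)  # S4.4: updates
    W1 += l * X.T @ Err_h; b1 += l * Err_h.sum(axis=0)
_, O1 = forward(X)
err_after = np.mean((T - O1) ** 2)
```

Training drives the mean squared error down, mirroring the S4.5 stopping criteria (error below a threshold or an iteration budget reached).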
More specifically, in step S4, the data mining algorithm is a CART algorithm.
In the specific implementation process, the CART algorithm is used for missing-value prediction and filling, so that quantitative regression can be achieved alongside qualitative classification; the CART algorithm comprises the following steps:
First, a feature J is selected, and a split point s on that feature divides all the data into two parts R_1 and R_2 satisfying
Q = min_{c_1} Σ_{x_i ∈ R_1(J,s)} (y_i - c_1)² + min_{c_2} Σ_{x_i ∈ R_2(J,s)} (y_i - c_2)²
where c_1 denotes the average value of y_i over R_1, c_2 denotes the average value of y_i over R_2, and Q denotes the sum of the squared deviations within the two newly generated child nodes. For a given feature, the split point s is chosen so that Q attains its minimum; all features J are then traversed to obtain each pair (J, s) and its corresponding Q, the pair with the smallest Q is selected, its J and s are taken as the root node, and the parts R_1 and R_2 to the left and right of s become the left and right subtrees of s;
the above operation is performed recursively on the left and right subtrees until a stopping condition is met (for example, a limit on the number of leaves or on the depth of the tree); the output value of each leaf node is determined from the values it contains, usually by taking their average;
and in prediction, the feature to be filled is taken as y, a regression decision tree is used to capture the relationship between the feature to be filled and the remaining features, and the missing values are filled.
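The filling step can be sketched with scikit-learn's CART implementation (DecisionTreeRegressor); the toy frame, column names, and values are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# The attribute with missing values plays the role of y; the remaining
# attributes are the predictors (all values invented).
df = pd.DataFrame({
    "x1": [0, 1, 0, 1, 0, 1],
    "x2": [1, 1, 0, 0, 1, 0],
    "y":  [1.0, 2.0, 0.0, 1.0, np.nan, np.nan],
})
known = df[df["y"].notna()]    # rows where y is observed: training data
unknown = df[df["y"].isna()]   # rows with the gaps to fill

tree = DecisionTreeRegressor(random_state=0)      # CART regression tree
tree.fit(known[["x1", "x2"]], known["y"])

filled = df.copy()
filled.loc[unknown.index, "y"] = tree.predict(unknown[["x1", "x2"]])
```

Each leaf of the fitted tree predicts the average y of its training rows, so the gaps are filled with the leaf averages described above.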
Example 3
A simulation experiment is carried out on 100 pieces of experimental data using the private data cleaning method described above.
Table 1 is a portion of 100 experimental data.
TABLE 1
(Table 1 is reproduced as an image in the original publication.)
The fields sex and race are unordered categories and are one-hot encoded, but since each takes only two values they are represented by 0 and 1. The fields age, fnlwgt, education-num and hours-per-week are continuous data and are therefore discretized. The income field is a class label with 2 values, so it is also represented by 0 and 1. The experimental data are then encrypted and preprocessed.
Table 2 shows the results of a partial experimental data preprocessing.
TABLE 2
age fnlwgt edu-num race sex hrs-per-wk income
1 0 0 0 0 1 0
1 0 2 0 0 1 0
0 0 2 0 1 1 0
1 0 2 0 1 1 1
1 0 1 0 1 1 0
1 0 3 0 1 1 1
1 0 0 0 0 1 0
1 0 2 0 0 1 0
1 0 2 0 0 1 0
1 0 1 0 1 1 0
1 2 1 1 0 1 0
0 2 2 1 0 1 0
1 2 2 1 0 1 1
Table 3 is the result of data discretization.
TABLE 3
(Table 3 is reproduced as an image in the original publication.)
After discretization, identical attribute values may be assigned different codes: for example, the interval 13-14 maps to 1 and the interval 14-16 maps to 3, so an edu-num value of 14 is assigned according to its position in the data. Because the entropy-based discretization partitions by threshold, the same value can fall on either side of a partition point and thus into different divisions; creating a new division for such boundary values makes the method more accurate and better optimized.
Dimensionality reduction is performed with PCA (principal component analysis), followed by bisecting K-means or ordinary K-means clustering. The result is visualized in FIG. 2, and the outliers are removed as shown in FIG. 3.
A BP neural network is then trained with 10 neurons in the hidden layer, 12500 iterations, and a learning rate of 25%. With 70 records as the training set and 50 records as the test set, the gradient descent curve shown in FIG. 4 is obtained; it can be seen that convergence is reached at around 15000 iterations. Experiments and tests show that, by adjusting the learning rate and the number of iterations, the prediction accuracy reaches about 85%.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (10)

1. A method for cleaning private data, characterized by comprising the following steps:
S1: obtaining private data from a data owner, and preprocessing the private data;
the private data comprises a plurality of items of attribute data;
S2: classifying the attribute data with missing values in the private data to form a first missing data set;
classifying each item of attribute data without missing values in the private data to form a non-missing data set;
S3: in the first missing data set, classifying the normal data values of each item of attribute data into a second missing data set, and classifying the abnormal data values into an abnormal data set;
S4: constructing a data-filling prediction model with a data mining algorithm, and predicting and filling the missing values of each item of attribute data in the second missing data set with the model, so as to obtain a filled data set;
S5: merging the filled data set and the non-missing data set to obtain a merged data set, and sending the merged data set and the abnormal data set back to the data owner, thereby completing the cleaning of the private data.
2. The method for cleaning private data according to claim 1, wherein the private data further comprises data item identifiers, each of which uniquely determines an item of attribute data.
3. A method for cleaning private data according to claim 1, wherein the private data is encrypted by a data owner and each attribute data is identified by the data owner, so as to respectively indicate whether each attribute data is classified data, continuous data or class label;
wherein the content of the first and second substances,
for the classification data, further comprising identifying ordered classification data and unordered classification data;
and for the continuous data, sorting each continuous data according to the data item identification.
4. The method for cleaning private data according to claim 3, wherein in step S1 the preprocessing of the private data specifically comprises: discretizing the continuous data to obtain discretized data; and dividing the categorical data into two classes, namely ordered categorical data and unordered categorical data.
5. The method for cleaning private data according to claim 4, wherein the continuous data is discretized with a discretization algorithm based on information entropy, specifically: traversing the data values of an item of continuous data in the private data and recursively partitioning them at partition points, stopping when the information entropy after partitioning falls below a preset entropy threshold or when the number of resulting data groups reaches the specified number of groups; wherein each partition point is chosen so that the information entropy after partitioning is minimized.
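A minimal sketch of the recursive, entropy-minimizing partitioning described in claim 5. The parameter names `entropy_threshold` and `max_bins`, and the use of class labels to score candidate cuts, are assumptions:

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(values, labels):
    """Return (cut, weighted entropy) of the cut minimising post-split entropy."""
    pairs = sorted(zip(values, labels))
    vs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    best = None
    for i in range(1, len(vs)):
        if vs[i] == vs[i - 1]:
            continue  # no cut between equal values
        left, right = ys[:i], ys[i:]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if best is None or e < best[1]:
            best = ((vs[i - 1] + vs[i]) / 2, e)
    return best

def discretize(values, labels, entropy_threshold=0.3, max_bins=4):
    """Recursively split until entropy drops below the threshold or bins run out."""
    cuts = []
    def recurse(vals, labs, bins_left):
        if bins_left <= 1 or entropy(labs) < entropy_threshold:
            return  # stopping conditions of claim 5
        split = best_split(vals, labs)
        if split is None:
            return
        cut, _ = split
        cuts.append(cut)
        lv = [(v, y) for v, y in zip(vals, labs) if v <= cut]
        rv = [(v, y) for v, y in zip(vals, labs) if v > cut]
        recurse([v for v, _ in lv], [y for _, y in lv], bins_left - 1)
        recurse([v for v, _ in rv], [y for _, y in rv], bins_left - 1)
    recurse(list(values), list(labels), max_bins)
    return sorted(cuts)
```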
6. The method for cleaning private data according to claim 5, wherein, in the information-entropy-based discretization algorithm:
the information content I(x) is inversely related to the probability p(x) of the event x, i.e. I(x) = -log2 p(x);
the information entropy E(x) is expressed as:
E(x) = -Σ_{i=1}^{n} p(x_i) log2 p(x_i);
and the weighted average is taken as the total entropy:
E = Σ_{i=1}^{n} (m_i / m) E_i,
wherein x_i are the n events, i = 1, 2, ..., n, and the entropy is measured in bits (base-2 logarithm); E_i is the information entropy of the i-th partition; m is the total number of divided data values over the n events, with m = Σ_{i=1}^{n} m_i; and m_i is the number of divided data values in the i-th partition.
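The quantities defined in claim 6 can be checked numerically; the helper names below are illustrative only:

```python
import math

def info(p):
    # information content I(x) = -log2 p(x)
    return -math.log2(p)

def shannon_entropy(probs):
    # E(x) = -sum_i p(x_i) log2 p(x_i)
    return sum(p * info(p) for p in probs if p > 0)

def total_entropy(partitions):
    # weighted average E = sum_i (m_i / m) E_i over the divided partitions
    m = sum(len(part) for part in partitions)
    total = 0.0
    for part in partitions:
        n = len(part)
        probs = [part.count(y) / n for y in set(part)]
        total += n / m * shannon_entropy(probs)
    return total
```

For example, a partition with labels [0, 1] contributes one bit of entropy, while a pure partition [0, 0] contributes zero, so their weighted average is 0.5 bits.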
7. The method for cleaning private data according to claim 1, wherein step S3 further comprises: identifying the data values of each item of attribute data in the first missing data set with a K-means clustering algorithm, so as to distinguish normal data values from abnormal data values; specifically comprising:
S3.1: reducing the dimensionality of the data values of each item of attribute data in the first missing data set by principal component analysis;
S3.2: randomly selecting k data values from the dimension-reduced data values as cluster centers, thereby obtaining k clusters;
S3.3: calculating the distance between each data value and each cluster center, and assigning each data value to the nearest cluster;
S3.4: calculating the mean of the data values in each cluster, and taking the mean as the new cluster center;
S3.5: judging whether the cluster centers have stabilized;
if yes, obtaining the final k clusters;
if not, returning to step S3.3;
S3.6: in the final k clusters, calculating the distance between each data value and the center of the cluster it belongs to;
S3.7: comparing that distance with a preset distance threshold;
if the distance between the data value and the center of its cluster is greater than the preset distance threshold, identifying the data value as an abnormal data value;
and if the distance is not greater than the preset distance threshold, identifying the data value as a normal data value.
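Steps S3.1–S3.7 can be sketched with NumPy as follows; the values of k, the distance threshold, and the target dimensionality are assumed parameters, and the SVD-based projection stands in for a full principal component analysis:

```python
import numpy as np

def pca_reduce(X, dim):
    """S3.1: project X onto its top `dim` principal components."""
    Xc = X - X.mean(axis=0)
    # right singular vectors of the centred data are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

def kmeans_outliers(X, k=2, dist_threshold=3.0, dim=2, iters=100, seed=0):
    """S3.2-S3.7: cluster, then flag points far from their cluster centre."""
    rng = np.random.default_rng(seed)
    Z = pca_reduce(X, min(dim, X.shape[1]))
    # S3.2: random initial centres
    centres = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        # S3.3: assign each point to the nearest centre
        d = np.linalg.norm(Z[:, None, :] - centres[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # S3.4: recompute centres as cluster means (keep empty clusters in place)
        new = np.array([Z[assign == j].mean(axis=0) if np.any(assign == j)
                        else centres[j] for j in range(k)])
        # S3.5: stop when the centres are stable
        if np.allclose(new, centres):
            break
        centres = new
    # S3.6-S3.7: distance to the point's own centre vs. the preset threshold
    dist = np.linalg.norm(Z - centres[assign], axis=1)
    return dist > dist_threshold  # True marks an abnormal data value
```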
8. The method for cleaning private data according to claim 7, wherein in the K-means clustering algorithm the value of k is determined by the elbow (inflection point) method, the silhouette coefficient method, the gap statistic method, or an empirical method.
9. The method for cleaning private data according to claim 1, wherein in step S4, the data mining algorithm is a BP neural network algorithm.
10. The method for cleaning private data according to claim 1, wherein in step S4, the data mining algorithm is a CART algorithm.
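As an illustrative stand-in for the CART-based filling model of claim 10 (a BP neural network per claim 9 would play the same role), a depth-1 regression stump can be sketched; the function and variable names are assumptions, not part of the claims:

```python
import numpy as np

def cart_stump_impute(X_known, y_known, X_missing):
    """Fill missing target values with a one-split CART-style regression stump.

    A depth-1 sketch: pick the (feature, threshold) pair minimising the
    squared error around the two leaf means, then predict the leaf mean
    for each row whose target value is missing.
    """
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in range(X_known.shape[1]):
        for t in np.unique(X_known[:, f])[:-1]:
            mask = X_known[:, f] <= t
            if not mask.any() or mask.all():
                continue
            lm, rm = y_known[mask].mean(), y_known[~mask].mean()
            sse = (((y_known[mask] - lm) ** 2).sum()
                   + ((y_known[~mask] - rm) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    _, f, t, lm, rm = best
    return np.where(X_missing[:, f] <= t, lm, rm)
```

A production model would grow a full tree (or train a BP network) on the non-missing attributes; the stump only illustrates the predict-and-fill role the model plays in step S4.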
CN202011453316.8A 2020-12-11 2020-12-11 Method for cleaning private data Active CN112464289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011453316.8A CN112464289B (en) 2020-12-11 2020-12-11 Method for cleaning private data


Publications (2)

Publication Number Publication Date
CN112464289A true CN112464289A (en) 2021-03-09
CN112464289B CN112464289B (en) 2023-01-17

Family

ID=74801445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011453316.8A Active CN112464289B (en) 2020-12-11 2020-12-11 Method for cleaning private data

Country Status (1)

Country Link
CN (1) CN112464289B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874779A (en) * 2017-03-10 2017-06-20 广东工业大学 A kind of data mining method for secret protection and system
CN108805193A (en) * 2018-06-01 2018-11-13 广东电网有限责任公司 A kind of power loss data filling method based on mixed strategy
CN109740694A (en) * 2019-01-24 2019-05-10 燕山大学 A kind of smart grid inartful loss detection method based on unsupervised learning
CN110232420A (en) * 2019-06-21 2019-09-13 安阳工学院 A kind of clustering method of data
CN110674621A (en) * 2018-07-03 2020-01-10 北京京东尚科信息技术有限公司 Attribute information filling method and device
CN111241079A (en) * 2020-01-08 2020-06-05 哈尔滨工业大学 Data cleaning method and device and computer readable storage medium
CN111861013A (en) * 2020-07-23 2020-10-30 长沙理工大学 Power load prediction method and device


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11983152B1 (en) * 2022-07-25 2024-05-14 Blackrock, Inc. Systems and methods for processing environmental, social and governance data
CN115687329A (en) * 2022-11-15 2023-02-03 联洋国融(北京)科技有限公司 Filling method and device for processing missing values of multiple data sources based on privacy calculation
CN115687329B (en) * 2022-11-15 2023-05-30 联洋国融(北京)科技有限公司 Filling method and device for processing missing values of multiple data sources based on privacy calculation

Also Published As

Publication number Publication date
CN112464289B (en) 2023-01-17

Similar Documents

Publication Publication Date Title
CN109657947B (en) Enterprise industry classification-oriented anomaly detection method
Bandekar et al. Design and analysis of machine learning algorithms for the reduction of crime rates in India
CN112464289B (en) Method for cleaning private data
US20220036137A1 (en) Method for detecting anomalies in a data set
CN111222638B (en) Neural network-based network anomaly detection method and device
CN111539444A (en) Gaussian mixture model method for modified mode recognition and statistical modeling
CN111641608A (en) Abnormal user identification method and device, electronic equipment and storage medium
CN112270355A (en) Active safety prediction method based on big data technology and SAE-GRU
US11748448B2 (en) Systems and techniques to monitor text data quality
Alghobiri A comparative analysis of classification algorithms on diverse datasets
CN116823496A (en) Intelligent insurance risk assessment and pricing system based on artificial intelligence
CN107169515B (en) Personal income classification method based on improved naive Bayes
Azimlu et al. House price prediction using clustering and genetic programming along with conducting a comparative study
US11144938B2 (en) Method and system for predictive modeling of consumer profiles
Rouzot et al. Learning Optimal Fair Scoring Systems for Multi-Class Classification
Sulayman et al. User modeling via anomaly detection techniques for user authentication
CN113179276A (en) Intelligent intrusion detection method and system based on explicit and implicit feature learning
CN117093849A (en) Digital matrix feature analysis method based on automatic generation model
Wu et al. Fairness and cost constrained privacy-aware record linkage
Marabad Credit card fraud detection using machine learning
CN113516189B (en) Website malicious user prediction method based on two-stage random forest algorithm
CN115510248A (en) Method for constructing and analyzing person behavior characteristic knowledge graph based on deep learning
CN113469288A (en) High-risk personnel early warning method integrating multiple machine learning algorithms
Sharma Credit Card Fraud Detection Predictive Modeling
CN113822755B (en) Identification method of credit risk of individual user by feature discretization technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant