CN112464289A - Method for cleaning private data - Google Patents

Method for cleaning private data

Info

Publication number
CN112464289A
CN112464289A (application CN202011453316.8A)
Authority
CN
China
Prior art keywords
data
value
missing
private
values
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011453316.8A
Other languages
Chinese (zh)
Other versions
CN112464289B (en)
Inventor
吴晓鸰
胡庆鹏
胡可
凌捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202011453316.8A
Publication of CN112464289A
Application granted
Publication of CN112464289B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Bioethics (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for cleaning private data, which comprises the following steps: S1: obtain private data from a data owner and preprocess it; S2: form a first missing data set from the attribute data with missing values, and a non-missing data set from the attribute data without missing values; S3: within the first missing data set, form a second missing data set from the normal data values and an abnormal data set from the abnormal data values; S4: construct a data-filling prediction model with a data mining algorithm, and use it to predict and fill the missing values of each item of attribute data in the second missing data set, obtaining a filled data set; S5: merge the filled data set with the non-missing data set to obtain a merged data set, and send the merged data set and the abnormal data set back to the data owner, completing the cleaning of the private data. The invention thus provides a private data cleaning method that solves the problem that existing data cleaning methods cannot clean private data.

Description

Method for cleaning private data
Technical Field
The invention relates to the technical field of private data processing, in particular to a private data cleaning method.
Background
While data mining creates wealth, it also creates a problem of privacy disclosure. Data mining plays a positive role in deep, trend-oriented applications, but at the same time brings a number of problems: for data such as financial transactions, medical records, and network communications, sensitive information may be leaked during mining.
In the field of data mining, privacy falls into two categories. The first is sensitive information contained in the original data itself: conventional data mining operates on unencrypted raw data, meaning that raw data containing personal or corporate private information, such as home phone numbers, bank account numbers, and property status, must be handed to the data miner in order to extract useful knowledge, which may adversely affect the individuals concerned. The second is sensitive knowledge implied by the original data, such as rules describing the behaviour characteristics of a company's best clients, which would seriously damage the enterprise's core competitiveness if obtained illegally by someone with ulterior motives.
However, conventional data cleaning methods operate directly on the source data and require knowledge of the meaning and values of each data field; they therefore cannot clean private data.
In the prior art, for example, the Chinese patent granted on 7/3/2020, "A data link method based on privacy protection and secure multiparty computation" (publication number CN110609831B), blocks local data with an improved k-means classification method to prevent an adversary from obtaining users' sensitive information, but it cannot fill the missing values of private data.
Disclosure of Invention
The invention provides a private data cleaning method, aiming at overcoming the technical defect that the existing data cleaning method cannot clean the private data.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A method for cleaning private data comprises the following steps:
S1: obtaining private data from a data owner, and preprocessing the private data;
the private data comprises a plurality of items of attribute data;
S2: classifying the attribute data with missing values in the private data to form a first missing data set;
classifying each item of attribute data without missing values in the private data to form a non-missing data set;
S3: in the first missing data set, classifying the normal data values of each item of attribute data into a second missing data set, and classifying the abnormal data values into an abnormal data set;
S4: constructing a data-filling prediction model with a data mining algorithm, and predicting and filling the missing values of each item of attribute data in the second missing data set with the model, so as to obtain a filled data set;
S5: merging the filled data set and the non-missing data set to obtain a merged data set, and sending the merged data set and the abnormal data set back to the data owner, thereby completing the cleaning of the private data.
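As a minimal sketch (not the patented implementation), the column splitting of S2 and the merging of S5 can be illustrated with pandas on an invented toy frame; the column names, the data, and the simple mean-fill stand-in for S4 are all assumptions:

```python
import numpy as np
import pandas as pd

def split_by_missing(df: pd.DataFrame):
    """S2: separate attributes with missing values from complete attributes."""
    missing_cols = [c for c in df.columns if df[c].isna().any()]
    complete_cols = [c for c in df.columns if c not in missing_cols]
    return df[missing_cols], df[complete_cols]

# Toy private data: "age" has a gap, "race" is complete (invented values).
df = pd.DataFrame({
    "age":  [1.0, 0.0, None, 1.0],
    "race": [0, 1, 0, 0],
})
first_missing, non_missing = split_by_missing(df)

# S5 (after a filling step; mean imputation is only a placeholder for the
# prediction model of S4): merge the filled set back with the non-missing set.
filled = first_missing.fillna(first_missing.mean())
merged = pd.concat([filled, non_missing], axis=1)
```

The merged frame has the same rows as the input and no remaining gaps, which is what is returned to the data owner together with the abnormal data set.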
Preferably, the privacy data further comprises a data item identifier for uniquely determining an item of attribute data.
Preferably, the private data is encrypted by the data owner, and the data owner labels each item of attribute data so as to indicate whether it is categorical data, continuous data, or a class label;
wherein:
for categorical data, the labels further distinguish ordered categorical data from unordered categorical data;
and for continuous data, the continuous data are sorted according to their data item identifiers.
Preferably, in step S1, the private data is preprocessed as follows: the continuous data are discretized to obtain discretized data, and the categorical data are divided into two categories, namely ordered categorical data and unordered categorical data.
Preferably, the continuous data is discretized with a discretization algorithm based on information entropy, specifically: each data value of an item of continuous data in the private data is traversed, and partition points are set to partition the data values recursively; partitioning stops when the information entropy after partitioning is below a preset entropy threshold or the number of data bins reaches the specified number of bins. At each step, the chosen partition point is the one that minimizes the information entropy of the resulting partition.
Preferably, in the information-entropy-based discretization algorithm,
the amount of information l(x) is inversely related to the probability p(x) of the occurrence of event x:
l(x) = -log_2 p(x)
The information entropy E(x) is expressed as:
E(x) = -Σ_{i=1}^{n} p(x_i) · log_s p(x_i)
and the weighted average is taken as the total entropy:
E = Σ_{i=1}^{n} (m_i / m) · E_i
where x_i, i = 1, 2, ..., n, are the n events, s is the base that fixes the measurement unit of the information entropy, E_i is the information entropy of the i-th partition, m is the total number of divided data over the n events, with
m = Σ_{i=1}^{n} m_i
and m_i is the number of divided data in the i-th event.
Preferably, in step S3, the method further comprises: identifying the data values of each item of attribute data in the first missing data set with a K-means clustering algorithm, so as to distinguish normal data values from abnormal data values; specifically comprising the following steps:
S3.1: reducing the dimensionality of the data values of each item of attribute data in the first missing data set by principal component analysis;
S3.2: randomly selecting k data values from the dimension-reduced data values as cluster centers, correspondingly obtaining k clusters;
S3.3: calculating the distance between each data value and each cluster center, and assigning each data value to the nearest cluster;
S3.4: calculating the mean of the data values in each cluster, and taking the mean as the new cluster center;
S3.5: judging whether the change of the cluster centers has stabilized;
if yes, obtaining the final k clusters;
if not, returning to step S3.3;
S3.6: for the final k clusters, calculating the distance between each data value and the center of the cluster in which it lies;
S3.7: comparing this distance with a preset distance threshold;
if the distance between a data value and the center of its cluster is greater than the preset distance threshold, identifying the data value as an abnormal data value;
and if the distance is not greater than the preset distance threshold, identifying the data value as a normal data value.
Preferably, in the K-means clustering algorithm, the value of k is determined by the elbow (inflection point) method, the silhouette coefficient method, the gap statistic, or an empirical method.
Preferably, in step S4, the data mining algorithm is a BP neural network algorithm.
Preferably, in step S4, the data mining algorithm is a CART (Classification And Regression Tree) algorithm.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a method for cleaning private data, which can clean the data under the condition of not knowing the meaning and the data value of a data field, fully protect the private data and improve the data security; meanwhile, missing values of the private data are filled by using a data mining algorithm.
Drawings
FIG. 1 is a flow chart of the steps for implementing the technical solution of the present invention;
FIG. 2 is a schematic diagram of a clustering result obtained after clustering data values in the present invention;
FIG. 3 is a schematic diagram of outlier rejection of a clustering result in accordance with the present invention;
FIG. 4 is a gradient descent curve obtained by training with a BP neural network according to the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1, a method for cleaning private data comprises the following steps:
S1: obtaining private data from a data owner, and preprocessing the private data;
the private data comprises a number of items of attribute data with missing values and attribute data without missing values; the attribute data comprises normal data values and abnormal data values;
S2: classifying the attribute data with missing values in the private data to form a first missing data set;
classifying each item of attribute data without missing values in the private data to form a non-missing data set;
S3: in the first missing data set, classifying the normal data values of each item of attribute data into a second missing data set, and classifying the abnormal data values into an abnormal data set;
S4: constructing a data-filling prediction model with a data mining algorithm, and predicting and filling the missing values of each item of attribute data in the second missing data set with the model, so as to obtain a filled data set;
S5: merging the filled data set and the non-missing data set to obtain a merged data set, and sending the merged data set and the abnormal data set back to the data owner, thereby completing the cleaning of the private data.
Example 2
More specifically, the privacy data further includes a data item identification for uniquely determining an item of attribute data.
More specifically, the private data is encrypted by the data owner, and each item of attribute data is labelled by the data owner so as to indicate whether it is categorical data, continuous data, or a class label;
wherein:
for categorical data, the labels further distinguish ordered categorical data from unordered categorical data;
and for continuous data, the continuous data are sorted according to their data item identifiers.
More specifically, in step S1, the private data is preprocessed as follows: the continuous data are discretized to obtain discretized data, and the categorical data are divided into two categories, namely ordered categorical data and unordered categorical data.
In a specific implementation, unordered categorical data has no ordering or degree-of-difference relation among its classes or attribute values. Unordered categorical data may be processed with one-hot encoding: an attribute with m possible values is represented in m dimensions.
For example, [Chongqing, Guangdong, Sichuan] can be encoded as [100, 010, 001].
For example, using the OneHotEncoder class from the sklearn framework, and assuming the data set has two attributes, gender and height, the following code is executed (the category lists are given explicitly so that the encoding order matches the description below):

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(categories=[['male', 'female'], ['low', 'middle', 'high']])
enc.fit([['male', 'high'], ['male', 'middle'], ['female', 'low']])
print(enc.transform([['female', 'high']]).toarray())

Gender and height are one-hot encoded: for gender, male is 10 and female is 01; for height, low is 100, middle is 010, and high is 001. The code outputs [[0. 1. 0. 0. 1.]], where 01 represents female and 001 represents high.
An ordered categorical variable can directly use the values after division: for example, [low, middle, high] can be converted directly into [0, 1, 2], and the mapping and discretization can be done directly with the map function in pandas.
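A minimal sketch of this ordered-categorical mapping, assuming a hypothetical height column:

```python
import pandas as pd

# Ordered categories map directly to integer codes via pandas' map.
height = pd.Series(["low", "middle", "high", "middle"], name="height")
height_coded = height.map({"low": 0, "middle": 1, "high": 2})
```

Unlike one-hot encoding, this keeps the ordering information in a single integer column.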
More specifically, the continuous data is discretized with a discretization algorithm based on information entropy, specifically: each data value of an item of continuous data in the private data is traversed, and partition points are set to partition the data values recursively; partitioning stops when the information entropy after partitioning is below a preset entropy threshold or the number of data bins reaches the specified number of bins. At each step, the chosen partition point is the one that minimizes the information entropy of the resulting partition.
More specifically, in the information-entropy-based discretization algorithm,
the amount of information l(x) is inversely related to the probability p(x) of the occurrence of event x:
l(x) = -log_2 p(x)
The information entropy E(x) is expressed as:
E(x) = -Σ_{i=1}^{n} p(x_i) · log_s p(x_i)
and the weighted average is taken as the total entropy:
E = Σ_{i=1}^{n} (m_i / m) · E_i
where x_i, i = 1, 2, ..., n, are the n events and s is the base that fixes the measurement unit of the information entropy: when s = 2 the unit is the bit, and when s = e the unit is the nat. E_i is the information entropy of the i-th partition, m is the total number of divided data over the n events, with
m = Σ_{i=1}^{n} m_i
and m_i is the number of divided data in the i-th event.
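The quantities above can be sketched in Python; the helper names entropy and total_entropy, the toy values, and the label column are assumptions for illustration, not part of the patent:

```python
import math
from collections import Counter

def entropy(labels, s=2):
    """E(x) = -sum_i p(x_i) * log_s p(x_i); s = 2 gives bits."""
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    return -sum(p * math.log(p, s) for p in probs if p > 0)

def total_entropy(partitions, s=2):
    """Weighted average E = sum_i (m_i / m) * E_i over the partitions."""
    m = sum(len(part) for part in partitions)
    return sum(len(part) / m * entropy(part, s) for part in partitions)

# A pure split has total entropy 0; a maximally mixed set has entropy 1 bit.
pure = total_entropy([["a", "a"], ["b", "b"]])
mixed = entropy(["a", "b"])

# Choosing the partition point that minimizes total entropy, per the rule above.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
best = min(
    (total_entropy([[l for v, l in zip(values, labels) if v <= t],
                    [l for v, l in zip(values, labels) if v > t]]), t)
    for t in values[:-1]
)
```

Here the split at 3 separates the two label groups perfectly, so it attains total entropy 0; the recursion described above would then repeat this selection inside each bin.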
More specifically, step S3 further comprises: identifying the data values of each item of attribute data in the first missing data set with a K-means clustering algorithm, so as to distinguish normal data values from abnormal data values; specifically comprising the following steps:
S3.1: reducing the dimensionality of the data values of each item of attribute data in the first missing data set by principal component analysis;
S3.2: randomly selecting k data values from the dimension-reduced data values as cluster centers, correspondingly obtaining k clusters;
S3.3: calculating the distance between each data value and each cluster center, and assigning each data value to the nearest cluster;
S3.4: calculating the mean of the data values in each cluster, and taking the mean as the new cluster center;
S3.5: judging whether the change of the cluster centers has stabilized;
if yes, obtaining the final k clusters;
if not, returning to step S3.3;
S3.6: for the final k clusters, calculating the distance between each data value and the center of the cluster in which it lies;
S3.7: comparing this distance with a preset distance threshold;
if the distance between a data value and the center of its cluster is greater than the preset distance threshold, identifying the data value as an abnormal data value;
and if the distance is not greater than the preset distance threshold, identifying the data value as a normal data value.
In the specific implementation process, the normal data values and the abnormal data values are separated through a K-means clustering algorithm, so that the accuracy of subsequent data processing is improved.
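Steps S3.1 to S3.7 can be sketched with scikit-learn on synthetic data; the value of k, the threshold rule (mean plus two standard deviations of the distances), and the data itself are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Three synthetic 4-dimensional blobs standing in for attribute data values.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 4)),
    rng.normal(loc=5.0, scale=0.5, size=(30, 4)),
    rng.normal(loc=-5.0, scale=0.5, size=(30, 4)),
])

Xr = PCA(n_components=2).fit_transform(X)                     # S3.1: reduce dimensionality
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xr)  # S3.2-S3.5: cluster to convergence
centers = km.cluster_centers_[km.labels_]                     # center of each point's own cluster
dist = np.linalg.norm(Xr - centers, axis=1)                   # S3.6: distance to own cluster center
threshold = dist.mean() + 2 * dist.std()                      # assumed preset distance threshold
abnormal = np.where(dist > threshold)[0]                      # S3.7: beyond threshold -> abnormal
normal = np.where(dist <= threshold)[0]
```

The indices in abnormal would be moved to the abnormal data set, and the rest form the second missing data set.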
More specifically, in the K-means clustering algorithm, the value of k is determined by the elbow (inflection point) method, the silhouette coefficient method, the gap statistic, or an empirical method.
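One of these choices, the silhouette coefficient method, can be sketched as follows; the candidate range of k and the toy blobs are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs; the true k is 3 by construction.
X, _ = make_blobs(n_samples=90, centers=[[0, 0], [10, 10], [0, 10]],
                  cluster_std=0.5, random_state=0)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points
best_k = max(scores, key=scores.get)         # pick the k with the best score
```

The silhouette score peaks where clusters are compact and well separated, so on these blobs best_k recovers 3.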
More specifically, in step S4, the data mining algorithm is a BP neural network algorithm.
In a specific implementation process, the BP neural network algorithm comprises the following steps:
S4.1: randomly initialize the weights and biases of the BP neural network, where each neural unit j of the network has a bias θ_j;
S4.2: forward propagation from the input layer of the BP neural network:
I_j = Σ_i w_ij · O_i + θ_j
where I_j is the net input of neuron j in the current layer, O_i is the output value of neuron i, w_ij is the weight of the connection between neuron i and neuron j, and neuron i lies in the layer above neuron j.
I_j is then passed through a nonlinear transformation to obtain the output O_j, which serves as the input to the next-layer neuron k:
O_j = 1 / (1 + e^(-I_j))
S4.3: error calculation.
Error of the output layer I_o:
Err_o = O_o · (1 - O_o) · (T_o - O_o)
where T_o is the true value, O_o is the predicted value, and Err_o is the error of the output layer I_o.
Error of a hidden layer:
Err_j = O_j · (1 - O_j) · Σ_k Err_k · w_jk
where Err_k is the error of neuron k in the next layer below neuron j, and w_jk is the weight of the connection between I_j and I_k.
S4.4: weight update and bias update:
ΔW_ij = l · Err_j · O_i
W_ij = W_ij + ΔW_ij
where Err_j is the error of neuron j and l is the learning rate, usually taken in [0, 1].
Bias update:
Δθ_j = l · Err_j
θ_j = θ_j + Δθ_j
S4.5: repeat the above steps until a termination condition is reached: the weight updates fall below a threshold, the prediction error rate falls below a threshold, or a preset number of iterations is reached.
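The update rules above can be sketched as a small NumPy implementation in batch form; the network sizes, learning rate, toy OR target, and iteration count are all assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # S4.2 nonlinearity: O_j = 1 / (1 + e^(-I_j))
    return 1.0 / (1.0 + np.exp(-x))

# S4.1: random weights, zero biases (one hidden layer of 3 units).
W1 = rng.normal(scale=0.5, size=(2, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1)); b2 = np.zeros(1)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [1.]])  # toy OR target
l = 0.5                                  # learning rate in [0, 1]

def forward(X):
    H = sigmoid(X @ W1 + b1)  # S4.2: I_j = sum_i w_ij O_i + theta_j
    O = sigmoid(H @ W2 + b2)
    return H, O

_, O0 = forward(X)
err_before = np.mean((T - O0) ** 2)
for _ in range(2000):
    H, O = forward(X)
    Err_o = O * (1 - O) * (T - O)         # S4.3: output-layer error
    Err_h = H * (1 - H) * (Err_o @ W2.T)  # S4.3: hidden-layer error
    W2 += l * H.T @ Err_o; b2 += l * Err_o.sum(axis=0)  # S4.4: updates
    W1 += l * X.T @ Err_h; b1 += l * Err_h.sum(axis=0)
_, O1 = forward(X)
err_after = np.mean((T - O1) ** 2)
```

Training drives the mean squared error down, mirroring the S4.5 stopping criteria (error below a threshold or an iteration budget reached).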
More specifically, in step S4, the data mining algorithm is a CART algorithm.
In the specific implementation process, the CART algorithm is used for missing-value prediction and filling, so that quantitative regression can be achieved alongside qualitative classification; the CART algorithm comprises the following steps:
First, a feature J is selected, and a split point s on that feature divides all the data into two parts R_1 and R_2 satisfying
Q = min_{c_1} Σ_{x_i ∈ R_1(J,s)} (y_i - c_1)² + min_{c_2} Σ_{x_i ∈ R_2(J,s)} (y_i - c_2)²
where c_1 denotes the average value of y_i over R_1, c_2 denotes the average value of y_i over R_2, and Q denotes the sum of the squared deviations within the two newly generated child nodes. For a given feature, the split point s is chosen so that Q attains its minimum; all features J are then traversed to obtain each pair (J, s) and its corresponding Q, the pair with the smallest Q is selected, its J and s are taken as the root node, and the parts R_1 and R_2 to the left and right of s become the left and right subtrees of s;
the above operation is performed recursively on the left and right subtrees until a stopping condition is met (for example, a limit on the number of leaves or on the depth of the tree); the output value of each leaf node is determined from the values it contains, usually by taking their average;
and in prediction, the feature to be filled is taken as y, a regression decision tree is used to capture the relationship between the feature to be filled and the remaining features, and the missing values are filled.
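The filling step can be sketched with scikit-learn's CART implementation (DecisionTreeRegressor); the toy frame, column names, and values are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# The attribute with missing values plays the role of y; the remaining
# attributes are the predictors (all values invented).
df = pd.DataFrame({
    "x1": [0, 1, 0, 1, 0, 1],
    "x2": [1, 1, 0, 0, 1, 0],
    "y":  [1.0, 2.0, 0.0, 1.0, np.nan, np.nan],
})
known = df[df["y"].notna()]    # rows where y is observed: training data
unknown = df[df["y"].isna()]   # rows with the gaps to fill

tree = DecisionTreeRegressor(random_state=0)      # CART regression tree
tree.fit(known[["x1", "x2"]], known["y"])

filled = df.copy()
filled.loc[unknown.index, "y"] = tree.predict(unknown[["x1", "x2"]])
```

Each leaf of the fitted tree predicts the average y of its training rows, so the gaps are filled with the leaf averages described above.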
Example 3
A simulation experiment is carried out on 100 pieces of experimental data using the private data cleaning method described above.
Table 1 is a portion of 100 experimental data.
TABLE 1
(Table 1 is reproduced as an image in the original publication.)
The fields sex and race are unordered categories and are one-hot encoded, but since each takes only two values they are represented by 0 and 1. The fields age, fnlwgt, education-num and hours-per-week are continuous data and are therefore discretized. The income field is a class label with 2 values, so it is also represented by 0 and 1. The experimental data are then encrypted and preprocessed.
Table 2 shows the results of a partial experimental data preprocessing.
TABLE 2
age fnlwgt edu-num race sex hrs-per-wk income
1 0 0 0 0 1 0
1 0 2 0 0 1 0
0 0 2 0 1 1 0
1 0 2 0 1 1 1
1 0 1 0 1 1 0
1 0 3 0 1 1 1
1 0 0 0 0 1 0
1 0 2 0 0 1 0
1 0 2 0 0 1 0
1 0 1 0 1 1 0
1 2 1 1 0 1 0
0 2 2 1 0 1 0
1 2 2 1 0 1 1
Table 3 is the result of data discretization.
TABLE 3
(Table 3 is reproduced as an image in the original publication.)
After discretization, identical attribute values may be assigned different codes: for example, the interval 13-14 maps to 1 and the interval 14-16 maps to 3, so an edu-num value of 14 is assigned according to its position in the data. Because the entropy-based discretization partitions by threshold, the same value can fall on either side of a partition point and thus into different divisions; creating a new division for such boundary values makes the method more accurate and better optimized.
Dimensionality reduction is performed with PCA (principal component analysis), followed by bisecting K-means or ordinary K-means clustering. The result is visualized in FIG. 2, and the outliers are removed as shown in FIG. 3.
A BP neural network is then trained with 10 neurons in the hidden layer, 12500 iterations, and a learning rate of 25%. With 70 records as the training set and 50 records as the test set, the gradient descent curve shown in FIG. 4 is obtained; it can be seen that convergence is reached at around 15000 iterations. Experiments and tests show that, by adjusting the learning rate and the number of iterations, the prediction accuracy reaches about 85%.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (10)

1. A method for cleaning private data, characterized by comprising the following steps:
S1: obtaining private data from a data owner, and preprocessing the private data;
the private data comprises a plurality of items of attribute data;
S2: classifying the attribute data with missing values in the private data to form a first missing data set;
classifying each item of attribute data without missing values in the private data to form a non-missing data set;
S3: in the first missing data set, classifying the normal data values of each item of attribute data into a second missing data set, and classifying the abnormal data values into an abnormal data set;
S4: constructing a data-filling prediction model with a data mining algorithm, and predicting and filling the missing values of each item of attribute data in the second missing data set with the model, so as to obtain a filled data set;
S5: merging the filled data set and the non-missing data set to obtain a merged data set, and sending the merged data set and the abnormal data set back to the data owner, thereby completing the cleaning of the private data.
2. The method for cleaning private data according to claim 1, wherein the private data further comprises data item identifiers, each of which uniquely determines an item of attribute data.
3. A method for cleaning private data according to claim 1, wherein the private data is encrypted by a data owner and each attribute data is identified by the data owner, so as to respectively indicate whether each attribute data is classified data, continuous data or class label;
wherein the content of the first and second substances,
for the classification data, further comprising identifying ordered classification data and unordered classification data;
and for the continuous data, sorting each continuous data according to the data item identification.
4. The method for cleaning private data according to claim 3, wherein in step S1 the preprocessing of the private data specifically comprises: discretizing the continuous data to obtain discretized data; and dividing the categorical data into two classes, namely ordered categorical data and unordered categorical data.
5. The method for cleaning private data according to claim 4, wherein the continuous data is discretized with a discretization algorithm based on information entropy, specifically: traversing the data values of an item of continuous data in the private data and recursively partitioning them at partition points, stopping when the information entropy after partitioning falls below a preset entropy threshold or when the number of resulting data groups reaches the specified number of groups; wherein each partition point is chosen so that the information entropy after partitioning is minimized.
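A minimal sketch of the recursive, entropy-minimizing partitioning described in claim 5. The parameter names `entropy_threshold` and `max_bins`, and the use of class labels to score candidate cuts, are assumptions:

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(values, labels):
    """Return (cut, weighted entropy) of the cut minimising post-split entropy."""
    pairs = sorted(zip(values, labels))
    vs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    best = None
    for i in range(1, len(vs)):
        if vs[i] == vs[i - 1]:
            continue  # no cut between equal values
        left, right = ys[:i], ys[i:]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if best is None or e < best[1]:
            best = ((vs[i - 1] + vs[i]) / 2, e)
    return best

def discretize(values, labels, entropy_threshold=0.3, max_bins=4):
    """Recursively split until entropy drops below the threshold or bins run out."""
    cuts = []
    def recurse(vals, labs, bins_left):
        if bins_left <= 1 or entropy(labs) < entropy_threshold:
            return  # stopping conditions of claim 5
        split = best_split(vals, labs)
        if split is None:
            return
        cut, _ = split
        cuts.append(cut)
        lv = [(v, y) for v, y in zip(vals, labs) if v <= cut]
        rv = [(v, y) for v, y in zip(vals, labs) if v > cut]
        recurse([v for v, _ in lv], [y for _, y in lv], bins_left - 1)
        recurse([v for v, _ in rv], [y for _, y in rv], bins_left - 1)
    recurse(list(values), list(labels), max_bins)
    return sorted(cuts)
```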
6. The method for cleaning private data according to claim 5, wherein, in the information-entropy-based discretization algorithm:
the information content I(x) is inversely related to the probability p(x) of the event x, i.e. I(x) = -log2 p(x);
the information entropy E(x) is expressed as:
E(x) = -Σ_{i=1}^{n} p(x_i) log2 p(x_i);
and the weighted average is taken as the total entropy:
E = Σ_{i=1}^{n} (m_i / m) E_i,
wherein x_i are the n events, i = 1, 2, ..., n, and the entropy is measured in bits (base-2 logarithm); E_i is the information entropy of the i-th partition; m is the total number of divided data values over the n events, with m = Σ_{i=1}^{n} m_i; and m_i is the number of divided data values in the i-th partition.
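The quantities defined in claim 6 can be checked numerically; the helper names below are illustrative only:

```python
import math

def info(p):
    # information content I(x) = -log2 p(x)
    return -math.log2(p)

def shannon_entropy(probs):
    # E(x) = -sum_i p(x_i) log2 p(x_i)
    return sum(p * info(p) for p in probs if p > 0)

def total_entropy(partitions):
    # weighted average E = sum_i (m_i / m) E_i over the divided partitions
    m = sum(len(part) for part in partitions)
    total = 0.0
    for part in partitions:
        n = len(part)
        probs = [part.count(y) / n for y in set(part)]
        total += n / m * shannon_entropy(probs)
    return total
```

For example, a partition with labels [0, 1] contributes one bit of entropy, while a pure partition [0, 0] contributes zero, so their weighted average is 0.5 bits.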
7. The method for cleaning private data according to claim 1, wherein step S3 further comprises: identifying the data values of each item of attribute data in the first missing data set with a K-means clustering algorithm, so as to distinguish normal data values from abnormal data values; specifically comprising:
S3.1: reducing the dimensionality of the data values of each item of attribute data in the first missing data set by principal component analysis;
S3.2: randomly selecting k data values from the dimension-reduced data values as cluster centers, thereby obtaining k clusters;
S3.3: calculating the distance between each data value and each cluster center, and assigning each data value to the nearest cluster;
S3.4: calculating the mean of the data values in each cluster, and taking the mean as the new cluster center;
S3.5: judging whether the cluster centers have stabilized;
if yes, obtaining the final k clusters;
if not, returning to step S3.3;
S3.6: in the final k clusters, calculating the distance between each data value and the center of the cluster it belongs to;
S3.7: comparing that distance with a preset distance threshold;
if the distance between the data value and the center of its cluster is greater than the preset distance threshold, identifying the data value as an abnormal data value;
and if the distance is not greater than the preset distance threshold, identifying the data value as a normal data value.
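Steps S3.1–S3.7 can be sketched with NumPy as follows; the values of k, the distance threshold, and the target dimensionality are assumed parameters, and the SVD-based projection stands in for a full principal component analysis:

```python
import numpy as np

def pca_reduce(X, dim):
    """S3.1: project X onto its top `dim` principal components."""
    Xc = X - X.mean(axis=0)
    # right singular vectors of the centred data are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

def kmeans_outliers(X, k=2, dist_threshold=3.0, dim=2, iters=100, seed=0):
    """S3.2-S3.7: cluster, then flag points far from their cluster centre."""
    rng = np.random.default_rng(seed)
    Z = pca_reduce(X, min(dim, X.shape[1]))
    # S3.2: random initial centres
    centres = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        # S3.3: assign each point to the nearest centre
        d = np.linalg.norm(Z[:, None, :] - centres[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # S3.4: recompute centres as cluster means (keep empty clusters in place)
        new = np.array([Z[assign == j].mean(axis=0) if np.any(assign == j)
                        else centres[j] for j in range(k)])
        # S3.5: stop when the centres are stable
        if np.allclose(new, centres):
            break
        centres = new
    # S3.6-S3.7: distance to the point's own centre vs. the preset threshold
    dist = np.linalg.norm(Z - centres[assign], axis=1)
    return dist > dist_threshold  # True marks an abnormal data value
```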
8. The method for cleaning private data according to claim 7, wherein in the K-means clustering algorithm the value of k is determined by the elbow (inflection point) method, the silhouette coefficient method, the gap statistic method, or an empirical method.
9. The method for cleaning private data according to claim 1, wherein in step S4, the data mining algorithm is a BP neural network algorithm.
10. The method for cleaning private data according to claim 1, wherein in step S4, the data mining algorithm is a CART algorithm.
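As an illustrative stand-in for the CART-based filling model of claim 10 (a BP neural network per claim 9 would play the same role), a depth-1 regression stump can be sketched; the function and variable names are assumptions, not part of the claims:

```python
import numpy as np

def cart_stump_impute(X_known, y_known, X_missing):
    """Fill missing target values with a one-split CART-style regression stump.

    A depth-1 sketch: pick the (feature, threshold) pair minimising the
    squared error around the two leaf means, then predict the leaf mean
    for each row whose target value is missing.
    """
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in range(X_known.shape[1]):
        for t in np.unique(X_known[:, f])[:-1]:
            mask = X_known[:, f] <= t
            if not mask.any() or mask.all():
                continue
            lm, rm = y_known[mask].mean(), y_known[~mask].mean()
            sse = (((y_known[mask] - lm) ** 2).sum()
                   + ((y_known[~mask] - rm) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    _, f, t, lm, rm = best
    return np.where(X_missing[:, f] <= t, lm, rm)
```

A production model would grow a full tree (or train a BP network) on the non-missing attributes; the stump only illustrates the predict-and-fill role the model plays in step S4.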
CN202011453316.8A 2020-12-11 2020-12-11 Method for cleaning private data Active CN112464289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011453316.8A CN112464289B (en) 2020-12-11 2020-12-11 Method for cleaning private data


Publications (2)

Publication Number Publication Date
CN112464289A true CN112464289A (en) 2021-03-09
CN112464289B CN112464289B (en) 2023-01-17

Family

ID=74801445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011453316.8A Active CN112464289B (en) 2020-12-11 2020-12-11 Method for cleaning private data

Country Status (1)

Country Link
CN (1) CN112464289B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874779A (en) * 2017-03-10 2017-06-20 广东工业大学 A kind of data mining method for secret protection and system
CN108805193A (en) * 2018-06-01 2018-11-13 广东电网有限责任公司 A kind of power loss data filling method based on mixed strategy
CN109740694A (en) * 2019-01-24 2019-05-10 燕山大学 A kind of smart grid inartful loss detection method based on unsupervised learning
CN110232420A (en) * 2019-06-21 2019-09-13 安阳工学院 A kind of clustering method of data
CN110674621A (en) * 2018-07-03 2020-01-10 北京京东尚科信息技术有限公司 Attribute information filling method and device
CN111241079A (en) * 2020-01-08 2020-06-05 哈尔滨工业大学 Data cleaning method and device and computer readable storage medium
CN111861013A (en) * 2020-07-23 2020-10-30 长沙理工大学 Power load prediction method and device


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11983152B1 (en) * 2022-07-25 2024-05-14 Blackrock, Inc. Systems and methods for processing environmental, social and governance data
CN115687329A (en) * 2022-11-15 2023-02-03 联洋国融(北京)科技有限公司 Filling method and device for processing missing values of multiple data sources based on privacy calculation
CN115687329B (en) * 2022-11-15 2023-05-30 联洋国融(北京)科技有限公司 Filling method and device for processing missing values of multiple data sources based on privacy calculation

Also Published As

Publication number Publication date
CN112464289B (en) 2023-01-17

Similar Documents

Publication Publication Date Title
CN109657947B (en) Enterprise industry classification-oriented anomaly detection method
Bandekar et al. Design and analysis of machine learning algorithms for the reduction of crime rates in India
CN112464289B (en) Method for cleaning private data
US20220036137A1 (en) Method for detecting anomalies in a data set
CN111222638B (en) Neural network-based network anomaly detection method and device
CN111539444A (en) Gaussian mixture model method for modified mode recognition and statistical modeling
CN111641608A (en) Abnormal user identification method and device, electronic equipment and storage medium
CN112270355A (en) Active safety prediction method based on big data technology and SAE-GRU
US11748448B2 (en) Systems and techniques to monitor text data quality
Alghobiri A comparative analysis of classification algorithms on diverse datasets
CN116823496A (en) Intelligent insurance risk assessment and pricing system based on artificial intelligence
CN107169515B (en) Personal income classification method based on improved naive Bayes
Azimlu et al. House price prediction using clustering and genetic programming along with conducting a comparative study
US11144938B2 (en) Method and system for predictive modeling of consumer profiles
Rouzot et al. Learning Optimal Fair Scoring Systems for Multi-Class Classification
Sulayman et al. User modeling via anomaly detection techniques for user authentication
CN113179276A (en) Intelligent intrusion detection method and system based on explicit and implicit feature learning
CN117093849A (en) Digital matrix feature analysis method based on automatic generation model
Wu et al. Fairness and cost constrained privacy-aware record linkage
Marabad Credit card fraud detection using machine learning
CN113516189B (en) Website malicious user prediction method based on two-stage random forest algorithm
CN115510248A (en) Method for constructing and analyzing person behavior characteristic knowledge graph based on deep learning
CN113469288A (en) High-risk personnel early warning method integrating multiple machine learning algorithms
Sharma Credit Card Fraud Detection Predictive Modeling
CN113822755B (en) Identification method of credit risk of individual user by feature discretization technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant