CN117216796B

CN117216796B - Energy big data privacy protection method based on privacy class

Info

Publication number: CN117216796B
Application number: CN202311238880.1A
Authority: CN
Inventors: 张宸; 于翔; 詹昕; 王春蕾; 刘钰; 严安; 刘全; 崔惠
Original assignee: Yangzhou Power Supply Branch Of State Grid Jiangsu Electric Power Co ltd
Current assignee: Yangzhou Power Supply Branch Of State Grid Jiangsu Electric Power Co ltd
Priority date: 2023-09-22
Filing date: 2023-09-22
Publication date: 2024-05-28
Anticipated expiration: 2043-09-22
Also published as: CN117216796A

Abstract

The invention discloses an energy big data privacy protection method based on privacy class, which comprises the following steps: firstly, classifying the features of an original data set through privacy class marking; secondly, clustering the data set based on a K-means++ clustering algorithm aiming at the characteristics of the power grid data; and finally, searching optimal noise parameter configuration for different categories by utilizing a genetic algorithm, and balancing anonymization effect and query result accuracy. The method can provide privacy protection with finer granularity according to the sensitivity and importance of the data, improves the availability of the published data, and effectively differentiates the sensitivity of the query function.

Description

Energy big data privacy protection method based on privacy class

Technical Field

The invention belongs to the field of data privacy protection, and particularly relates to an energy big data privacy protection method based on a privacy class.

Background

With the rapid development of information technologies such as the Internet of Things and artificial intelligence, the smart grid has become increasingly intelligent and information-driven, and it now influences many aspects of social life. As an important infrastructure, the smart grid aggregates large amounts of sensitive power data. Such data often needs to be released and shared externally so that research institutions outside the grid can analyze and mine it. While grid enterprises provide personalized services for data users, this sharing also brings potential privacy disclosure and data security problems.

Privacy protection technology has received extensive attention from industry, and many researchers have studied this problem. Existing privacy protection techniques fall into three categories: techniques based on data encryption, techniques based on restricted release, and techniques based on data distortion.

Techniques based on data encryption typically employ symmetric encryption, asymmetric encryption, homomorphic encryption, and the like to protect data cryptographically, ensuring that only authorized users can decrypt and access the data. However, data encryption algorithms involve complex data processing and high computational overhead.

Techniques based on restricted release limit and control the release of data, but an attacker with background knowledge can still use associated information to infer private information.

Techniques based on data distortion achieve user privacy protection by moderately perturbing or distorting the original data, but may result in reduced usability and accuracy of the data.

The existing differential privacy model perturbs data by adding noise to query results, so that an attacker cannot easily infer an individual's private information even after obtaining all data other than the target data. However, the introduced noise may make query results inaccurate and cause information loss, reducing the availability of the published data.

Disclosure of Invention

The present invention aims to solve at least one of the technical problems existing in the prior art.

The technical scheme of the invention is as follows: an energy big data privacy protection method based on privacy class, the method comprising the following steps:

step 1, carrying out privacy level marking and screening on an original data set, selecting part of the attributes as a feature set, selecting a candidate set with high correlation with the feature set by utilizing the Pearson correlation coefficient, and finally forming an output data set together;

step 2, realizing classification of the data set based on a K-means++ clustering algorithm;

step 3, searching optimal noise parameter configuration for different categories by utilizing a genetic algorithm, and carrying out differential privacy on each group of classified data.

In step 1, the method comprises the following steps:

step 1-1, extracting an original electric power data set from a power grid database;

step 1-2, letting W be the power data set, which comprises n data records, each record comprising a plurality of attributes (A_1, ..., A_n); defining a privacy level list rank = {1, 2, 3, 4, 5} that divides the power grid data into five privacy levels, wherein level 5 is the highest privacy level;

step 1-3, placing the attributes whose privacy level is higher than the threshold μ into a feature set F = {f_1, f_2, ..., f_k}, and taking the remaining attributes F' as a candidate set;

and step 1-4, outputting a final data set classified based on the privacy level.
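The threshold split of step 1-3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the attribute names, their level labels, and the helper name `split_by_privacy_level` are hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_privacy_level(levels: Dict[str, int], mu: int) -> Tuple[List[str], List[str]]:
    """Split attribute names into the feature set F (level > mu) and the candidate set F'."""
    feature_set = [name for name, level in levels.items() if level > mu]
    candidate_set = [name for name, level in levels.items() if level <= mu]
    return feature_set, candidate_set


# Hypothetical attribute-to-level labelling, for illustration only.
levels = {
    "consumption_kwh": 5,
    "id_number": 4,
    "location": 3,
    "activity_period": 2,
    "regional_total": 1,
}

F, F_prime = split_by_privacy_level(levels, mu=3)
print(F)        # attributes with a privacy level above the threshold
print(F_prime)  # remaining attributes, screened later by correlation
```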

In the steps 1-4, the method comprises the following steps:

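step 1-4-1, calculating the Pearson correlation coefficient r between the candidate set variable f' and the feature set variable f:

$$ r = \frac{\sum_{i=1}^{n}(f_i-\bar{f})(f'_i-\bar{f'})}{\sqrt{\sum_{i=1}^{n}(f_i-\bar{f})^2}\,\sqrt{\sum_{i=1}^{n}(f'_i-\bar{f'})^2}} $$

wherein n is the number of records, f_i and f'_i are the values of f and f' in the i-th record, and \bar{f} and \bar{f'} are their mean values;

step 1-4-2, comparing the Pearson correlation coefficient with a threshold ρ, and screening out from the candidate set an attribute set B that is highly correlated with the feature set;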

step 1-4-3, forming the final data set D = F ∪ B.
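Steps 1-4-1 to 1-4-3 can be sketched in a few lines. The sketch assumes that a candidate attribute is kept when the absolute value of its correlation with at least one feature attribute exceeds ρ; the source does not state how several feature attributes are combined, so that rule, the column names, and the sample values are all illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def screen_candidates(data: Dict[str, List[float]], features: List[str],
                      candidates: List[str], rho: float) -> List[str]:
    """Return B: candidates with |r| > rho against at least one feature (assumed rule)."""
    return [c for c in candidates
            if any(abs(pearson(data[c], data[f])) > rho for f in features)]


# Hypothetical columns, for illustration only.
data = {
    "consumption_kwh": [1.0, 2.0, 3.0, 4.0, 5.0],
    "bill": [2.1, 3.9, 6.2, 8.0, 9.9],
    "meter_reads": [3.0, 1.0, 4.0, 1.0, 5.0],
}
F = ["consumption_kwh"]
B = screen_candidates(data, F, ["bill", "meter_reads"], rho=0.8)
D = F + B  # final attribute set D = F ∪ B
print(D)
```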

In step 2, it includes:

step 2-1, randomly selecting a record t from the data set as the first cluster center, based on the attribute with the highest privacy level;

step 2-2, calculating the Euclidean distance d_ij between each record x_i and the current cluster-center record x_j over the M attributes:

$$ d_{ij} = \sqrt{\sum_{m=1}^{M}(x_{im}-x_{jm})^2} $$

wherein x_im and x_jm represent the m-th attribute of the i-th record and the m-th attribute of the j-th record, respectively, and M is the number of attributes;

step 2-3, calculating the probability P_i that each record is selected as the next cluster center, the probability being obtained by the following formula:

wherein d_ij represents the Euclidean distance between record i and record j, and n is the number of records;

step 2-4, selecting the next cluster center by using a roulette method;

step 2-5, repeating steps 2-2 to 2-4 until K cluster centers are selected;

step 2-6, classifying the data set into K classes through the K-means algorithm.
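The center-selection loop of steps 2-1 to 2-5 can be sketched as below. Because the selection formula is not reproduced in this text, the sketch assumes the standard K-means++ weighting, in which a record's selection probability is proportional to the squared distance to its nearest already-chosen center. It also picks the first center uniformly at random. The function name and sample records are hypothetical.

```python
import math
import random
from typing import List, Sequence


def kmeanspp_seeds(records: List[Sequence[float]], k: int,
                   rng: random.Random) -> List[Sequence[float]]:
    """Choose k initial cluster centers with K-means++ style roulette selection."""
    centers = [rng.choice(records)]  # step 2-1: first center picked at random
    while len(centers) < k:
        # step 2-2: squared distance from each record to its nearest chosen center
        weights = [min(math.dist(r, c) for c in centers) ** 2 for r in records]
        if sum(weights) == 0:
            # every record coincides with a center; fall back to a uniform pick
            centers.append(rng.choice(records))
            continue
        # steps 2-3 and 2-4: roulette-wheel draw, farther records are more likely
        centers.append(rng.choices(records, weights=weights, k=1)[0])
    return centers


records = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeanspp_seeds(records, k=2, rng=random.Random(0))
print(seeds)
```

A record that is already a center has weight zero and so cannot be drawn again while other records remain at a positive distance.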

In the steps 2-6, the method comprises the following steps:

step 2-6-1, calculating the distance between each record and each cluster center, and assigning each record to its nearest cluster center;

step 2-6-2, calculating the mean of all records assigned to each class, and taking that mean as the new cluster center of the class;

step 2-6-3, repeating the above two steps until the data set is divided into K classes.
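The assignment and update loop of steps 2-6-1 to 2-6-3 can be sketched as follows. The stopping rule used here (stop when the assignments no longer change, or after `max_iter` rounds) is an assumption, since the text only says the two steps are repeated; the function name and sample records are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def kmeans(records: List[Sequence[float]], centers: List[Sequence[float]],
           max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Plain K-means iterations starting from the given initial centers."""
    centers = [list(c) for c in centers]
    labels: List[int] = [-1] * len(records)
    for _ in range(max_iter):
        # step 2-6-1: assign each record to its nearest center
        new_labels = [min(range(len(centers)), key=lambda j: math.dist(r, centers[j]))
                      for r in records]
        if new_labels == labels:
            break  # assumed stopping rule: assignments are stable
        labels = new_labels
        # step 2-6-2: move each center to the mean of its members
        for j in range(len(centers)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


records = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(records, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centers)
```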

In step 3, it includes:

step 3-1, searching optimal noise parameter configurations for different categories by utilizing a genetic algorithm;

step 3-2, performing differential privacy protection on each classified record.

In step 3-1, it includes:

Step 3-1-1, defining a multi-objective fitness function fit:

fit(x) = w1*Anon(x) + w2*Accuracy(x)

where x represents the noise parameter configuration, including the differential privacy budget ε and the global sensitivity Δf; Anon(x) represents the degree of data anonymization calculated from the noise parameter configuration x; Accuracy(x) represents the accuracy of the query result calculated from the noise parameter configuration x; and w1 and w2 are weight coefficients;

step 3-1-2, initializing a population; randomly generating an initial population P = {x_1, x_2, ..., x_N} comprising N individuals, each individual representing a noise parameter configuration, i.e. x_i = (ε_i, Δf_i);

the initial population P needs to be binary coded, that is, converted from decimal to binary, by the mapping:

x′ = g(x)

wherein the mapping g maps the decimal value x to a binary value x′;

step 3-1-3, evaluating fitness; calculating the fitness value fit_i = fit(x_i) for each individual x_i ∈ P;

step 3-1-4, selection operation; selecting parent individuals from the population P by using a roulette method, and constructing a parent set Q = {q_1, q_2, ..., q_N};

step 3-1-5, crossover operation; generating a child set R = {r_1, r_2, ..., r_N} from the individuals in the parent set Q through a crossover operator;

step 3-1-6, mutation operation; introducing randomness into the individuals of the child set R through a mutation operator (gene mutation), and generating a mutated child set M = {m_1, m_2, ..., m_N};

step 3-1-7, evaluating fitness; carrying out fitness evaluation on each child individual m_i ∈ M, and calculating the fitness value fit_i = fit(m_i);

step 3-1-8, updating the population; generating a new population P = {p_1, p_2, ..., p_N} from the individuals with high fitness in the child set M;

step 3-1-9, repeating steps 3-1-4 to 3-1-8 until the maximum number of iterations is reached or a fitness value meeting the requirements is found;

Step 3-1-10, returning the individual with the best fitness value, namely the best noise parameter configuration;

step 3-1-11, performing the above operations on each of the K groups of data respectively to obtain K groups of optimal noise parameter configurations.
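Steps 3-1-1 to 3-1-11 can be sketched as a small binary-coded genetic algorithm. Everything specific in it is an assumption made for illustration: the source does not define Anon(x) or Accuracy(x), so `toy_fitness` uses hypothetical placeholder formulas, and the 8-bit encoding, the search ranges, single-point crossover, bit-flip mutation, and a fixed generation count are likewise illustrative choices.

```python
import math
import random
from typing import Callable, Optional, Tuple

BITS = 8                 # bits per parameter (assumed)
EPS_RANGE = (0.1, 2.0)   # assumed search range for the privacy budget epsilon
SENS_RANGE = (0.5, 2.0)  # assumed search range for the global sensitivity


def encode(value: float, lo: float, hi: float) -> str:
    """g: map a decimal value in [lo, hi] to a BITS-wide binary string."""
    level = round((value - lo) / (hi - lo) * (2 ** BITS - 1))
    return format(level, f"0{BITS}b")


def decode(bits: str, lo: float, hi: float) -> float:
    return lo + int(bits, 2) / (2 ** BITS - 1) * (hi - lo)


def to_params(chrom: str) -> Tuple[float, float]:
    return decode(chrom[:BITS], *EPS_RANGE), decode(chrom[BITS:], *SENS_RANGE)


def toy_fitness(params: Tuple[float, float], w1: float = 0.5, w2: float = 0.5) -> float:
    eps, sens = params
    b = sens / eps                # Laplace scale implied by the configuration
    anon = 1.0 - math.exp(-b)     # hypothetical stand-in for Anon(x)
    accuracy = math.exp(-b * b)   # hypothetical stand-in for Accuracy(x)
    return w1 * anon + w2 * accuracy


def mutate(chrom: str, p_mut: float, rng: random.Random) -> str:
    return "".join(("1" if bit == "0" else "0") if rng.random() < p_mut else bit
                   for bit in chrom)


def run_ga(fitness: Callable[[Tuple[float, float]], float], pop_size: int = 20,
           generations: int = 40, p_mut: float = 0.02,
           rng: Optional[random.Random] = None) -> Tuple[float, float]:
    """Return the best (epsilon, sensitivity) found; pop_size is assumed even."""
    rng = rng or random.Random()

    def score(chrom: str) -> float:
        return fitness(to_params(chrom))

    pop = [encode(rng.uniform(*EPS_RANGE), *EPS_RANGE) +
           encode(rng.uniform(*SENS_RANGE), *SENS_RANGE) for _ in range(pop_size)]
    best = max(pop, key=score)
    for _ in range(generations):
        weights = [score(c) for c in pop]                        # fitness evaluation
        parents = rng.choices(pop, weights=weights, k=pop_size)  # roulette selection
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):           # single-point crossover
            cut = rng.randrange(1, 2 * BITS)
            children += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        pop = [mutate(c, p_mut, rng) for c in children]          # mutation, new population
        best = max(pop + [best], key=score)                      # remember the best so far
    return to_params(best)


# One configuration per cluster (K = 3 here).
configs = [run_ga(toy_fitness, rng=random.Random(k)) for k in range(3)]
print(configs)
```

Roulette selection as written needs non-negative fitness values that are not all zero. In a real setting the fitness passed for each group would depend on that group's data rather than being the same toy function.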

In step 3-2, it includes:

step 3-2-1, adding noise to each piece of data in a group by using the Laplace mechanism, so as to realize data privacy protection:

q′ = q + Laplace(b)

wherein b = Δf/ε represents the scale (degree of dispersion) of the noise, and the global sensitivity Δf and the privacy budget ε are the optimal noise parameter configuration found by the genetic algorithm in step 3-1; q is the data before privacy protection, and q′ is the data after privacy protection;

step 3-2-2, repeating step 3-2-1 for the K groups, each group performing the differential privacy operation with its own optimal noise parameter configuration;

and step 3-2-3, outputting the data set after privacy protection.
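The per-group noise addition of steps 3-2-1 to 3-2-3 can be sketched as below, assuming each group holds numeric values and has its own (ε, Δf) pair. The group contents and configurations are made up for illustration. Laplace noise is drawn as the scaled difference of two unit exponential variates, which follows a Laplace distribution with scale b.

```python
import random
from typing import Dict, List, Tuple


def laplace_noise(b: float, rng: random.Random) -> float:
    """Draw from Laplace(0, b) as b times the difference of two Exp(1) variates."""
    return b * (rng.expovariate(1.0) - rng.expovariate(1.0))


def protect(groups: Dict[int, List[float]], configs: Dict[int, Tuple[float, float]],
            rng: random.Random) -> Dict[int, List[float]]:
    """Apply q' = q + Laplace(b), with b = sensitivity / epsilon chosen per group."""
    protected = {}
    for k, values in groups.items():
        eps, sens = configs[k]
        b = sens / eps  # larger b means more dispersed noise
        protected[k] = [q + laplace_noise(b, rng) for q in values]
    return protected


# Hypothetical clustered values and per-cluster (epsilon, sensitivity) pairs.
groups = {0: [12.0, 15.5, 11.2], 1: [48.0, 52.3]}
configs = {0: (0.5, 1.0), 1: (2.0, 1.0)}
released = protect(groups, configs, random.Random(42))
print(released)
```

In this example group 0 uses b = 2.0 and group 1 uses b = 0.5, so the first group receives noticeably stronger perturbation.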

The privacy class includes:

the fifth-level privacy class comprises the electricity usage time, the electricity consumption, and the electricity usage behavior of the user;

the fourth-level privacy class comprises sensitive personal identity information such as the user's name, ID card number, and address;

the third-level privacy class comprises the real-time location and movement trajectory of the user;

the second-level privacy class comprises the user's electricity usage habits, activity time periods, and energy-saving awareness;

the first-level privacy class comprises total power usage statistics and trend analysis for a particular region or country.

Compared with the prior art, the invention has the following remarkable advantages:

1) A non-uniform privacy protection strategy is adopted: personalized privacy protection measures are applied to different types of data according to the characteristics and privacy requirements of the data. The optimal noise parameter configuration is searched for by a genetic algorithm and optimized for each data set, so that more flexible and fine-grained privacy protection is realized.

2) Multiple objectives, such as the degree of data anonymization and the accuracy of query results, are considered together. Through the definition of the fitness function and the optimization process of the genetic algorithm, data privacy is protected while the accuracy of query results is maintained as far as possible.

3) The optimal noise parameter configuration is searched for by a genetic algorithm, and each data set is optimized individually. The method can therefore better adapt to the characteristics and privacy requirements of different data sets, improving both the privacy protection effect and the data processing efficiency.

4) The method has good scalability and adaptability for privacy protection of power grid data. It can adapt to grid data sets of different scales and complexity and can be customized according to actual requirements, which gives it strong generality.

The invention solves the problem of privacy protection and data release sharing by combining cluster analysis and differential privacy. The method effectively differentiates the sensitivity of the query function and improves the availability of the published data.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a flow of feature classification based on privacy classes in the present invention;

FIG. 3 is a flowchart of a clustering and heterogeneous privacy preserving method in an embodiment of the present invention;

Fig. 4 is a functional block diagram of the present invention.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

As shown in FIGS. 1-4, the invention provides an energy big data privacy protection method based on privacy class, which comprises the following steps:

Step 1, marking and screening privacy classes of an original data set; selecting part of the attributes as a feature set, selecting a candidate set with high correlation with the feature set by using the Pearson correlation coefficient, and finally forming an output data set together;

Step 1 preprocesses the data so that the method better matches practical scenarios. For example, when power data is released to a bank, not all of the information is highly private; the power consumption time, for instance, has a low privacy level and needs no privacy protection. In this way, privacy protection oriented to the data itself is achieved.

Combining the clustering method facilitates the subsequent privacy operations.

Step 3, performing non-uniform privacy protection processing: searching optimal noise parameter configuration for different categories by using a genetic algorithm, and carrying out differential privacy on each group of classified data.

Traditional methods do not consider non-uniform privacy protection: they always apply uniform privacy protection to the data and ignore fine-grained differences among the data.

The method can adapt to the grid data sets with different scales and complexity, and is optimized in a customized mode according to actual requirements, and has strong universality and adaptability.

Further, in step 1, it includes:

Step 1-2, let W be the power data set, which includes n data records, each record containing several attributes (A_1, ..., A_n). With the privacy level list rank = {1, 2, 3, 4, 5}, the invention divides the power grid data into 5 privacy levels, level 5 being the highest.

(1) Fifth-level privacy class (First Level Privacy): such information may include sensitive content such as the user's lifestyle and behavior patterns, and is generally considered the highest level of privacy. Detailed information such as the user's electricity usage time, electricity consumption, and electricity usage behavior is classified at this level.

(2) Fourth-level privacy class (Second Level Privacy): such information is classified at a high level of privacy and must be strictly protected; it covers sensitive personal identity information such as the user's name, identification number, and address.

(3) Third-level privacy class (Third Level Privacy): such information relates to the whereabouts and range of activity of the user and therefore requires a high level of privacy protection; it covers sensitive location information such as the user's real-time location and movement trajectory.

(4) Second-level privacy class (Fourth Level Privacy): such information may reveal the lifestyle and behavioral characteristics of the user and is classified at a general level of privacy; it covers information such as the user's electricity usage habits, activity time periods, and energy-saving awareness.

(5) First-level privacy class (Fifth Level Privacy): such information is typically used for energy planning and decision making and requires a lower level of privacy protection; it covers total power usage statistics and trend analysis for a particular region or country.

The threshold μ is a hyperparameter and can be chosen manually within the interval [1, 5].

Further, in the steps 1-4, it includes:

Wherein n is the number of records;

Step 1-4-3, forming the final data set D = F ∪ B.

As can be seen from D = F ∪ B, the data set to be processed is composed of the feature set F, which has a high privacy level, and the set B, which is highly correlated with F. The purpose of this processing is to apply privacy protection only to these data and to ignore irrelevant (low-privacy-level) data, thereby improving the efficiency of privacy computation and of data release.

Fig. 2 is a flow chart of feature classification based on privacy classes in this embodiment, and by performing privacy class marking and screening on an original data set, data with higher correlation with the feature set is finally obtained, so as to improve the correlation and usefulness of the data.

Further, in step 2, it includes:

Wherein x_im and x_jm represent the m-th attribute of the i-th record and the m-th attribute of the j-th record, respectively, and M is the number of attributes;

Step 2-4, selecting the next cluster center by using a roulette method; the principle of roulette selection is: the record with larger distance has larger probability of being selected as the clustering center;

Step 2-6, classifying the data set into K classes through the K-means algorithm.

Further, in the step 2-6, it includes:

Step 2-6 classifies the records and then differential privacy is performed for each class, thus forming a "heterogeneous privacy preserving policy".

Further, in step 3, it includes:

The traditional method perturbs all data with uniform noise, whereas the invention realizes a non-uniform privacy protection strategy by setting specific noise parameters for different classes of records.

Further, in step 3-1, it includes:

Step 3-1-1, the invention comprehensively considers two factors and defines a multi-objective fitness function fit:

fit(x)＝w1*Anon(x)+w2*Accuracy(x)

x′＝g(x)

Step 3-1-4, selection operation; selecting parent individuals from the population P by using a roulette method, in which an individual with higher fitness has a higher probability of being selected, and constructing a parent set Q = {q_1, q_2, ..., q_N};

Step 3-1-5, crossover operation; the individuals in the parent set Q are recombined by a crossover operator, i.e. the codes located at the same positions in several individuals are randomly exchanged to generate new child individuals, giving a child set R = {r_1, r_2, ..., r_N};

Step 3-1-6, mutation operation; a certain randomness is introduced into the individuals of the child set R by a mutation operator, i.e. the code at one position is randomly changed, giving a mutated child set M = {m_1, m_2, ..., m_N};

step 3-1-10, returning the individual with the best fitness value, i.e. the best noise parameter configuration (differential privacy budget, global sensitivity);

Further, in step 3-2, it includes:

q′＝q+Laplace(b)

and step 3-2-3, outputting the data set after privacy protection.

Fig. 3 is a flowchart of the clustering and non-uniform privacy protection method in this embodiment. The data are divided into different clusters through cluster analysis on the attributes with the highest privacy class, and a non-uniform differential privacy technique is applied to each cluster. In this way, the usability of the data can be improved while the privacy of the data is protected.

In summary, the method provided by the invention offers non-uniform privacy protection and addresses privacy protection in data release and sharing by combining cluster analysis with differential privacy. Non-uniform privacy protection makes up for the shortcomings of traditional methods, such as inaccurate query results, information loss, and low availability of published data. The method effectively improves the usability of the data while protecting its privacy.

The foregoing has outlined and described the basic principles, features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. The energy big data privacy protection method based on the privacy class is characterized by comprising the following steps:

In step 1, the method comprises the following steps:

Step 1-4, outputting a final data set classified based on privacy classes;

step 3, searching optimal noise parameter configuration for different categories by utilizing a genetic algorithm, and carrying out differential privacy on each group of classified data;

in step 3, it includes:

step 3-2, performing differential privacy protection on each well-classified record;

In step 3-1, it includes:

Step 3-1-1, defining a multi-objective fitness function fit:

fit(x)＝w1*Anon(x)+w2*Accuracy(x)

x′＝g(x)

step 3-1-11, performing the operations on the divided K groups of data respectively to obtain K groups of optimal noise parameter configurations;

in step 3-2, it includes:

q′＝q+Laplace(b)

and step 3-2-3, outputting the data set after privacy protection.

2. The privacy class-based energy big data privacy protection method according to claim 1, wherein in step 1-4, comprising:

Wherein n is the number of records;

Step 1-4-3, forming the final data set D = F ∪ B.

3. The privacy class-based energy big data privacy protection method according to claim 2, wherein in step 2, comprising:

step 2-4, selecting the next cluster center by using a roulette method;

Step 2-6, classifying the data set into K classes through the K-means algorithm.

4. The privacy class-based energy big data privacy protection method according to claim 3, wherein in step 2-6, comprising:

5. The privacy class-based energy big data privacy protection method of claim 1,

The privacy class includes:

the primary privacy level includes total power usage statistics and trend analysis for a predetermined area or country.