CN112131608A - Classification tree differential privacy protection method satisfying the LKC model - Google Patents

Classification tree differential privacy protection method satisfying the LKC model

Info

Publication number
CN112131608A
CN112131608A
Authority
CN
China
Prior art keywords
sequence
data
lkc
track
classification tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011227876.1A
Other languages
Chinese (zh)
Other versions
CN112131608B (en)
Inventor
Li Xiaohui
Bai Yuliang
Li Bo
Yi Huawei
Jia Xu
Li Rui
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning University of Technology
Original Assignee
Liaoning University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning University of Technology
Publication of CN112131608A
Application granted granted Critical
Publication of CN112131608B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29Geographical information databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Remote Sensing (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a classification tree differential privacy protection method satisfying an LKC model, which comprises the following steps: step 1, determining the sequence set requiring global suppression according to the data to be published; step 2, computing the newly generated minimum violating sequences from the trajectory data in the sequence set, wherein a minimum violating sequence is discarded when a new minimum violating sequence is generated; and step 3, when no new minimum violating sequence is generated, building a classification tree from the trajectory data in the sequence set and adding noise to the data through the Laplace mechanism to obtain the data for publication.

Description

Classification tree differential privacy protection method satisfying the LKC model
Technical Field
The invention relates to the technical field of information security, and in particular to a classification tree differential privacy protection method satisfying an LKC model.
Background
Trajectory data contain the personal information of a large number of mobile users, and researchers extract a great deal of valuable information from trajectory data by analyzing and exploring it, which motivates privacy protection research on user information. If trajectory data are published without effective privacy protection, an attacker with background knowledge can infer a user's private information, such as illnesses or family income, by analyzing the trajectories, which may cause the user economic loss and even personal safety problems. Conversely, if the original trajectory data set is processed improperly during publication, a large amount of user information is lost, the availability and integrity of the published data are reduced, and information is wasted. Ensuring that published trajectory data do not leak user privacy while retaining high data availability is therefore an ongoing research topic.
Research on privacy protection methods for trajectory data publication has already produced notable results. For example, Mohammed et al. proposed an LKC privacy model applicable to RFID data and realized it with an anonymization algorithm: the algorithm first identifies the minimum violating sequence set in the trajectory data set and then globally suppresses the violating sequences with a greedy method, aiming to minimize the loss of maximal frequent sequences; however, global suppression deletes a large amount of data, so data availability is not effectively improved. Chen et al. introduced the concept of local suppression through the (K, C)L privacy model and its algorithm, which first determines all sequences in the trajectory data set that fail the (K, C)L privacy model and then simplifies the data set through local suppression while preserving efficient data availability. Ghasemzadeh et al. studied the case C = 1 of the LKC-privacy model and realized privacy protection of trajectory data through global suppression. Komishani et al. proposed a privacy protection algorithm that generalizes sensitive information, suppressing high-dimensional trajectory data sets by building classification trees over the sensitive attributes; but because the length of the background knowledge mastered by an attacker is uncertain, it suppresses a large amount of data, so the data set loses mining value.
Disclosure of Invention
Based on the existing research results and the problems above, a classification tree differential privacy protection method satisfying the LKC model is designed and developed. The invention aims to solve the problem of reduced data availability caused by global suppression of trajectory data while reducing the risk of user privacy disclosure.
The technical scheme provided by the invention is as follows:
a classification tree differential privacy protection method meeting an LKC model comprises the following steps:
step 1, determining a sequence set needing global suppression according to data to be issued;
step 2, calculating a newly generated minimum violation sequence according to the trajectory data in the sequence set;
wherein the minimum violation sequence is discarded when a new minimum violation sequence is generated;
and 3, when a new minimum violation sequence is not generated, establishing a classification tree according to the track data in the sequence set and adding noise to the data through a Laplace mechanism to obtain release data.
Preferably, in step 2, computing the newly generated minimum violating sequences comprises:
step 2.1, finding the minimum violating sequence set in the trajectory data set within the sequence set, and determining the maximal frequent sequence set according to a given frequency threshold;
step 2.2, constructing the MFS tree and determining the suppression order according to the suppression priority scores of the position points;
step 2.3, updating the MFS according to the suppression order;
and step 2.4, recalculating the suppression priority scores of the remaining position points and updating the minimum violating sequence set to obtain the minimum violating sequence.
Preferably, in step 2.2, the suppression priority score is

Score(p) = Elim(p) / Loss(p)

where Elim(p) is the number of minimum violating sequences that suppressing position point p can eliminate, and Loss(p) is the utility loss brought about by suppressing position point p.
Preferably, in step 2.2, the point with the highest suppression priority score is selected for suppression each time, thereby determining the suppression order.
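As an illustration of this selection rule, the following minimal Python sketch computes Score(p) = Elim(p)/Loss(p) and picks the best point; the data structures (MVS and MFS represented as collections of position-point tuples) and the use of MFS damage as the utility-loss measure are assumptions made for illustration, not fixed by the method above.

def score(point, mvs_set, mfs_set):
    """Suppression priority score Score(p) = Elim(p) / Loss(p)."""
    # Elim(p): minimum violating sequences removed if point p is suppressed.
    elim = sum(1 for seq in mvs_set if point in seq)
    # Loss(p): utility loss, approximated here (an assumption) by the number
    # of maximal frequent sequences (MFS) damaged when p is suppressed.
    loss = sum(1 for seq in mfs_set if point in seq)
    return elim / loss if loss else float("inf")

def pick_point(mvs_set, mfs_set):
    # Select the position point with the highest score for suppression.
    candidates = {p for seq in mvs_set for p in seq}
    return max(candidates, key=lambda p: score(p, mvs_set, mfs_set))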
Preferably, step 2 further comprises: when a new minimum violating sequence is generated, verifying whether the trajectory data set satisfies the LKC-privacy model; if any data sequence in the trajectory data set fails to satisfy the LKC-privacy model, the minimum violating sequence set is updated until all data sequences satisfy the LKC-privacy model;
wherein the trajectory data set satisfies the LKC-privacy model when the following conditions hold for any subsequence p:
|p| < L;
|T(p)| ≥ K; and
Conf(s|T(p)) ≤ C;
where Conf(s|T(p)) = |T(p ∪ {s})|/|T(p)| is the confidence computed under the given condition, L is the maximum trajectory length mastered by an attacker, T is the trajectory data set of all users, S is the set of sensitive attribute values in the data set T, p is any subsequence in T, s ∈ S, C (0 ≤ C ≤ 1) is the confidence threshold of the anonymity set, and K is the anonymity number of the sequence.
Preferably, in step 3, the process of building the classification tree comprises the following steps (a minimal sketch follows the list):
step 3.1, initializing the trajectory data sets of all users, and selecting two groups of frequent sequences from them to construct the classification tree;
step 3.2, according to the number of times any two position points occur in each trajectory record, selecting the trajectory sequence corresponding to the most frequent position point as the first group;
step 3.3, picking out the least frequent sequence among all sequences containing the most frequent position point, and then picking the most frequent position point from the trajectory where that sequence lies as the second group;
and step 3.4, repeating steps 3.2 and 3.3 to place the other trajectories into the first and second groups until all trajectories are placed in the classification tree, yielding the final classification tree.
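The grouping in steps 3.2 and 3.3 can be sketched as below; because the description leaves tie-breaking and the exact counting of occurrences open, the concrete choices in this Python sketch (point frequency counted once per trajectory, set-based membership) are assumptions for illustration only.

from collections import Counter

def seed_two_groups(trajectories):
    counts = Counter(p for t in trajectories for p in set(t))
    # Step 3.2: the most frequent position point seeds the first group.
    first_point = counts.most_common(1)[0][0]
    with_first = [t for t in trajectories if first_point in t]
    # Step 3.3: among sequences containing that point, take the least
    # frequent one, then seed the second group with the most frequent
    # position point of the trajectory it lies in.
    rarest = min(with_first, key=lambda t: sum(counts[p] for p in set(t)))
    second_point = max((p for p in set(rarest) if p != first_point),
                       key=lambda p: counts[p], default=first_point)
    # Step 3.4: place every trajectory with the seed point it matches.
    group1 = [t for t in trajectories if first_point in t]
    group2 = [t for t in trajectories if first_point not in t]
    return (first_point, group1), (second_point, group2)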
Preferably, the process of adding noise to the data through the Laplace mechanism in step 3 comprises:
the privacy budget used in the iterative splitting of the classification tree is finely divided for the Laplace mechanism: a share εm' is allocated on average to each incremental update data set, and εm' is in turn divided evenly into two parts,

ε1 = ε2 = εm'/2,

used respectively for the Laplace mechanism during the data iteration and for adding Laplace noise to the leaf nodes;
for any function f: T → R^d, if the output of algorithm A satisfies A(T) = f(T) + <Lap1(Δf/ε), Lap2(Δf/ε), …, Lapd(Δf/ε)>, then A satisfies ε-differential privacy;
where T is the trajectory data set, R^d is the d-dimensional real space into which f maps, A(T) is the output of algorithm A on the trajectory data set T, f(T) is the output of the function f on T, and the Lapi(Δf/ε) (1 ≤ i ≤ d) are mutually independent Laplace variables.
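A minimal Python sketch of this noise addition, using NumPy; the sensitivity Δf = 1 in the usage example is the standard value for counting queries and is an assumption here.

import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # A(T) = f(T) + <Lap_1(Δf/ε), ..., Lap_d(Δf/ε)>: i.i.d. Laplace noise
    # with scale Δf/ε added to each coordinate of the true answer f(T).
    true_answer = np.asarray(true_answer, dtype=float)
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(0.0, scale, size=true_answer.shape)

# Example: noisy counts for three leaf nodes under epsilon = 0.5.
noisy = laplace_mechanism([42, 17, 8], sensitivity=1.0, epsilon=0.5)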
Compared with the prior art, the invention has the following beneficial effects: in the process of publishing trajectory data, global suppression is replaced by local suppression, which improves the availability of the trajectory data; meanwhile, a classification tree is built from the user information in the trajectory data set and noise is added to the data through the Laplace mechanism, improving the security of the data to be published while preserving data availability. Experimental verification shows that, compared with other algorithms, the proposed algorithm effectively reduces the maximal frequent sequence (MFS) loss rate and the sequence loss rate, and yields a lower average relative error for counting queries under the same privacy budget.
Drawings
Fig. 1 is a flowchart of a classification tree differential privacy protection method satisfying an LKC model according to the present invention.
FIG. 2 is a graph showing the effect of different K values on MFS loss rate according to the present invention.
FIG. 3 shows the effect of different K values on the sequence loss rate according to the present invention.
FIG. 4 is a graph showing the effect of different C values on MFS loss rate according to the present invention.
FIG. 5 is a graph showing the effect of different C values on the sequence loss rate according to the present invention.
Fig. 6 shows the effect of data set length on the average relative error when ε = 0.5.
Fig. 7 shows the effect of data set length on the average relative error when ε = 1.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings, so that those skilled in the art can implement it with reference to the description.
As shown in Fig. 1, the present invention provides a classification tree differential privacy protection method satisfying an LKC model. Considering that global suppression of trajectory data reduces data availability, local suppression is adopted instead to process the data: the MVS set in the trajectory data set is found, the maximal frequent sequence set is determined according to a given frequency threshold E and the MFS tree is constructed, the suppression order is determined according to the suppression priority scores of the position points, and the minimum violating sequence set is updated. During noise addition, a classification tree algorithm is used and the Laplace noise mechanism is introduced to protect the data, which improves security during trajectory data publication and reduces the data loss rate caused by global suppression. The method specifically comprises the following steps:
step 1, computing the newly generated minimum violating sequences (NewMVS): find the MVS set in the trajectory data set, find the maximal frequent sequence set according to a given frequency threshold E, then construct the MFS tree, and determine the suppression order according to the suppression priority score of each position point p, where the suppression priority score is the number of MVS that suppressing point p can eliminate, Elim(p), divided by the utility loss Loss(p) brought about by suppressing point p:

Score(p) = Elim(p) / Loss(p);

each time the point p with the highest score is selected and suppressed in its sequence, the maximal frequent sequence set (MFS) is updated, the suppression priority scores of the other position points are recalculated, and the minimum violating sequence (MVS) set is updated (a sketch of this greedy loop follows below);
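The greedy loop just described can be summarized as the following Python sketch; find_mvs, find_mfs and score are assumed to be supplied by the surrounding steps, and the set-inclusion containment test is a simplification standing in for true subsequence matching.

def local_suppression(dataset, find_mvs, find_mfs, score):
    mvs = find_mvs(dataset)
    while mvs:
        mfs = find_mfs(dataset)
        candidates = {p for seq in mvs for p in seq}
        best = max(candidates, key=lambda p: score(p, mvs, mfs))
        # Local suppression: drop `best` only from trajectories that
        # actually contain a violating sequence through it.
        violating = [seq for seq in mvs if best in seq]
        dataset = [
            ([p for p in traj if p != best], s)
            if any(set(v) <= set(traj) for v in violating) else (traj, s)
            for traj, s in dataset
        ]
        mvs = find_mvs(dataset)  # suppression may create new MVS; recompute
    return dataset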
step 2, verifying whether the trajectory data set satisfies the LKC-privacy model, so as to judge whether the minimum violating sequence set must continue to be updated: if sequences in the trajectory data set fail the conditions below, the minimum violating sequence set is updated and it is checked whether a new minimum violating sequence is generated, until all sequences satisfy the conditions; if a new minimum violating sequence is generated, the violating sequence is discarded; if not, a classification tree is built from the sensitive information in the trajectory data set;
wherein L is the maximum trajectory length mastered by an attacker, T is the trajectory data set of all users, S is the set of sensitive attribute values in the data set T, and K is the anonymity number of the sequence; the trajectory data set T satisfies LKC-privacy if and only if the following conditions hold for every subsequence p in T with |p| < L:
|T(p)| ≥ K, where T(p) is the set of users whose trajectories contain p; and
Conf(s|T(p)) ≤ C, where Conf(s|T(p)) = |T(p ∪ {s})|/|T(p)|; Conf abbreviates confidence and denotes the confidence computed under the given condition, compared against a given confidence threshold C, 0 ≤ C ≤ 1, s ∈ S; C is the confidence threshold of the anonymity set, and the degree of anonymity can be flexibly adjusted according to requirements;
step 3, building the classification tree: first initialize the data set T and select two groups of frequent sequences from the trajectory data set to construct the classification tree; according to the number of times any two position points occur in each trajectory record, select the most frequent trajectory sequence as the first group, then pick the least frequent sequence among all sequences of those position points, then pick the most frequent position point from the trajectory where that sequence lies as the second group, and iteratively place the other trajectories into the two groups until all trajectories are placed in the classification tree, completing its construction;
Step 4, redistributing the privacy budget: privacy budgets for use in iterative partitioning of classification trees a refined partitioning scheme with the Laplacian mechanism, with an average assigned to each incremental update data set first'mThen will'mIs divided into two parts
Figure BDA0002764191940000061
The method is respectively used for a Laplace mechanism in the data iteration process and adding Laplace noise to leaf nodes;
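A short Python sketch of this allocation, assuming the total budget ε is shared evenly by m incremental updates before each share is halved:

def split_budget(epsilon_total, m):
    # Each incremental update data set receives εm' = ε / m; εm' is then
    # halved: one half drives the Laplace mechanism during iterative tree
    # splitting, the other adds Laplace noise to the leaf nodes.
    eps_m = epsilon_total / m
    return eps_m / 2.0, eps_m / 2.0

eps_split, eps_leaf = split_budget(1.0, m=5)   # e.g. ε = 1.0 over 5 updates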
step 5, adding noise: for any function f: T → R^d, if the output of algorithm A satisfies A(T) = f(T) + <Lap1(Δf/ε), Lap2(Δf/ε), …, Lapd(Δf/ε)>, then A satisfies ε-differential privacy; where T denotes the trajectory data set, R^d denotes the d-dimensional real space into which f maps, A(T) denotes the output of algorithm A on the trajectory data set T, f(T) denotes the output of the function f on T, and the Lapi(Δf/ε) (1 ≤ i ≤ d) are mutually independent Laplace variables; the amount of noise is proportional to Δf and inversely proportional to ε.
Examples
To demonstrate the effectiveness of the invention, the experiments were run in a Python environment, with the algorithm implemented in the MyEclipse integrated development environment. The experimental hardware environment was an Intel(R) Core(TM) i7-5500U CPU @ 2.40 GHz with 8.0 GB of RAM, running a Linux operating system. The invention uses the initial data set provided by the GeoLife project of Microsoft Research Asia for experimental verification; the data set contains 18670 real user trajectories and is widely used in trajectory data research experiments.
As shown in Figs. 2-5, data loss is an important reference for measuring the availability of trajectory data; the invention measures the availability of trajectory data in terms of both maximal frequent sequences (MFS) and trajectory sequences:
(1) MFS data loss MFSLoss, which depends on the number of MFS in the original trajectory data set and the number of MFS remaining in the data set after local suppression:

MFSLoss = (M(T) − M(T°)) / M(T)

where M(T) is the number of MFS in the original trajectory data set, and M(T°) is the number of MFS in the data set after local suppression;
(2) trajectory sequence loss TLoss, which depends on the number of sequences in the original trajectory data set and the number of sequences after data processing:

TLoss = (L(T) − L(T°)) / L(T)

where L(T) is the number of trajectories in the original trajectory data set, and L(T°) is the number of trajectories in the data set after local suppression.
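Both availability metrics are simple ratios; a minimal Python sketch, assuming the MFS and trajectory counts have been computed beforehand:

def mfs_loss(n_mfs_original, n_mfs_after):
    # MFSLoss = (M(T) - M(T°)) / M(T)
    return (n_mfs_original - n_mfs_after) / n_mfs_original

def trajectory_loss(n_traj_original, n_traj_after):
    # TLoss = (L(T) - L(T°)) / L(T)
    return (n_traj_original - n_traj_after) / n_traj_original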
As shown in Fig. 6 and Fig. 7, the average relative error of counting-query results is used as the standard for measuring data loss. For a counting query R, the relative error is

error(R) = |R(T′) − R(T)| / max{R(T), b}

where R(T) denotes the counting query on the original data set, R(T′) denotes the counting query on the published data set, and b is a sanity bound set to prevent the denominator from being too small.
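A small Python sketch of this metric; averaging element-wise over a workload of queries is an assumption about how the figures aggregate the error:

def relative_error(noisy, true, b):
    # |R(T') - R(T)| / max(R(T), b); the sanity bound b keeps the
    # denominator from becoming too small.
    return abs(noisy - true) / max(true, b)

def average_relative_error(noisy_answers, true_answers, b):
    pairs = list(zip(noisy_answers, true_answers))
    return sum(relative_error(n, t, b) for n, t in pairs) / len(pairs)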
Results of the experiment
As shown in Fig. 2 and Fig. 3, as the value of K increases, the MFS loss and the sequence loss increase: a larger K produces more minimum violating sequences (MVS), so more sequences must be suppressed and the data loss grows. Compared with the TP-NSA algorithm in the figures, the KTP algorithm representing the invention causes less data loss.
As can be seen from Fig. 4 and Fig. 5, as the value of C increases, the MFS loss and the sequence loss decrease: a larger C reduces the number of minimum violating sequences (MVS) that must be suppressed, so both losses fall gradually. The results show that the data processing of the KTP algorithm representing the invention achieves a lower data loss rate than the TP-NSA algorithm in the figures.
As can be seen from Fig. 6 and Fig. 7, the average relative error of the data gradually increases as the length of the trajectory data set grows, but in both experiments the average relative error decreases under the larger privacy budget ε. Compared with the HDFPM algorithm in the figures, the CTL algorithm representing the invention is more effective: the average relative error is reduced, user trajectory privacy is effectively protected, and data availability is improved.
While embodiments of the invention have been described above, the invention is not limited to the applications set forth in the description and the embodiments; it is fully applicable in the various fields to which the invention pertains, and further modifications may readily be made by those skilled in the art. Accordingly, without departing from the general concept defined by the appended claims and their equivalents, the invention is not limited to the specific details shown and described herein.

Claims (7)

1. A classification tree differential privacy protection method satisfying an LKC model, characterized by comprising the following steps:
step 1, determining the sequence set requiring global suppression according to the data to be published;
step 2, computing the newly generated minimum violating sequences from the trajectory data in the sequence set;
wherein a minimum violating sequence is discarded when a new minimum violating sequence is generated;
and step 3, when no new minimum violating sequence is generated, building a classification tree from the trajectory data in the sequence set and adding noise to the data through the Laplace mechanism to obtain the data for publication.
2. The classification tree differential privacy protection method satisfying an LKC model of claim 1, wherein computing the newly generated minimum violating sequences in step 2 comprises:
step 2.1, finding the minimum violating sequence set in the trajectory data set within the sequence set, and determining the maximal frequent sequence set according to a given frequency threshold;
step 2.2, constructing the MFS tree and determining the suppression order according to the suppression priority scores of the position points;
step 2.3, updating the MFS according to the suppression order;
and step 2.4, recalculating the suppression priority scores of the remaining position points and updating the minimum violating sequence set to obtain the minimum violating sequence.
3. The classification tree differential privacy protection method satisfying an LKC model of claim 2, wherein, in step 2.2, the suppression priority score is

Score(p) = Elim(p) / Loss(p)

where Elim(p) is the number of minimum violating sequences that suppressing position point p can eliminate, and Loss(p) is the utility loss brought about by suppressing position point p.
4. The classification tree differential privacy protection method satisfying an LKC model of claim 3, wherein, in step 2.2, the point with the highest suppression priority score is selected for suppression each time, thereby determining the suppression order.
5. The classification tree differential privacy protection method satisfying an LKC model of claim 1, wherein step 2 further comprises: when a new minimum violating sequence is generated, verifying whether the trajectory data set satisfies the LKC-privacy model; if any data sequence in the trajectory data set fails to satisfy the LKC-privacy model, the minimum violating sequence set is updated until all data sequences satisfy the LKC-privacy model;
wherein the trajectory data set satisfies the LKC-privacy model when the following conditions hold for any subsequence p:
|p| < L;
|T(p)| ≥ K; and
Conf(s|T(p)) ≤ C;
where Conf(s|T(p)) = |T(p ∪ {s})|/|T(p)| is the confidence computed under the given condition, L is the maximum trajectory length mastered by an attacker, T is the trajectory data set of all users, S is the set of sensitive attribute values in the data set T, p is any subsequence in T, s ∈ S, C (0 ≤ C ≤ 1) is the confidence threshold of the anonymity set, and K is the anonymity number of the sequence.
6. The classification tree differential privacy protection method satisfying an LKC model of claim 1, wherein, in step 3, the process of building the classification tree comprises the following steps:
step 3.1, initializing the trajectory data sets of all users, and selecting two groups of frequent sequences from them to construct the classification tree;
step 3.2, according to the number of times any two position points occur in each trajectory record, selecting the trajectory sequence corresponding to the most frequent position point as the first group;
step 3.3, picking out the least frequent sequence among all sequences containing the most frequent position point, and then picking the most frequent position point from the trajectory where that sequence lies as the second group;
and step 3.4, repeating steps 3.2 and 3.3 to place the other trajectories into the first and second groups until all trajectories are placed in the classification tree, yielding the final classification tree.
7. The classification tree differential privacy protection method satisfying an LKC model of claim 6, wherein the process of adding noise to the data through the Laplace mechanism in step 3 comprises:
the privacy budget used in the iterative splitting of the classification tree is finely divided for the Laplace mechanism: a share εm' is allocated on average to each incremental update data set, and εm' is in turn divided evenly into two parts,

ε1 = ε2 = εm'/2,

used respectively for the Laplace mechanism during the data iteration and for adding Laplace noise to the leaf nodes;
for any function f: T → R^d, if the output of algorithm A satisfies A(T) = f(T) + <Lap1(Δf/ε), Lap2(Δf/ε), …, Lapd(Δf/ε)>, then A satisfies ε-differential privacy;
where T is the trajectory data set, R^d is the d-dimensional real space into which f maps, A(T) is the output of algorithm A on the trajectory data set T, f(T) is the output of the function f on T, and the Lapi(Δf/ε) (1 ≤ i ≤ d) are mutually independent Laplace variables.
CN202011227876.1A 2020-08-03 2020-11-06 Classification tree differential privacy protection method satisfying the LKC model Active CN112131608B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010766771.7A 2020-08-03 2020-08-03 Classification tree differential privacy protection method satisfying the LKC model
CN2020107667717 2020-08-03

Publications (2)

Publication Number Publication Date
CN112131608A (en) 2020-12-25
CN112131608B CN112131608B (en) 2024-01-26

Family

ID=72952899

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010766771.7A Pending CN111859460A (en) 2020-08-03 2020-08-03 Classification tree differential privacy protection method satisfying the LKC model
CN202011227876.1A Active CN112131608B (en) 2020-08-03 2020-11-06 Classification tree differential privacy protection method satisfying the LKC model

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010766771.7A Pending CN111859460A (en) 2020-08-03 2020-08-03 Classification tree differential privacy protection method satisfying the LKC model

Country Status (1)

Country Link
CN (2) CN111859460A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032399A (en) * 2021-03-30 2021-06-25 Beijing University of Posts and Telecommunications Data processing method and device
CN113722752A (en) * 2021-08-19 2021-11-30 State Grid Electric Power Research Institute Co., Ltd. LFP tree and proxy vector based track privacy data publishing method, device and system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560984B (en) * 2020-12-25 2022-04-05 Guangxi Normal University Differential privacy protection method for adaptive K-Nets clustering

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526975A (en) * 2017-08-10 2017-12-29 Renmin University of China Method for decision trees based on differential privacy protection
US20190272387A1 (en) * 2018-03-01 2019-09-05 International Business Machines Corporation Data de-identification across different data sources using a common data model
CN109145633A (en) * 2018-06-08 2019-01-04 Zhejiang Jieshang Artificial Intelligence Research and Development Co., Ltd. Trajectory data privacy protection method, electronic device, storage medium and system
CN109726758A (en) * 2018-12-28 2019-05-07 Liaoning University of Technology Data fusion publishing algorithm based on differential privacy
CN109726594A (en) * 2019-01-09 2019-05-07 Nanjing University of Aeronautics and Astronautics Novel trajectory data publishing method based on differential privacy
CN110750806A (en) * 2019-07-16 2020-02-04 Institute of Automation, Heilongjiang Academy of Sciences High-dimensional location trajectory data privacy protection publishing system and method based on TP-MFSA suppression publication
CN116611101A (en) * 2023-03-03 2023-08-18 Guangzhou University Differential privacy trajectory data protection method based on interactive query

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
XIAOHUI LI et al.: "A trajectory data publishing algorithm satisfying local suppression", INTERNATIONAL JOURNAL OF DISTRIBUTED SENSOR NETWORKS, pages 1-9 *
BAI YULIANG: "Research on privacy protection of location-based services for data publishing", China Master's Theses Full-text Database, Information Science and Technology, no. 3, pages 138-26 *
BAI YULIANG et al.: "Research on optimized suppression differential privacy protection for trajectory data publishing", Journal of Chinese Computer Systems, vol. 42, no. 8, pages 1787-1792 *
DENG JINSONG; LUO YONGLONG; YU QINGYING; CHEN FULONG: "Privacy-preserving publication of trajectory data based on non-sensitive information analysis", Journal of Computer Applications, no. 02 *
DENG JINSONG; LUO YONGLONG; YU QINGYING; CHEN FULONG: "Trajectory privacy protection method supporting local suppression and sequence reconstruction", Journal of Chinese Computer Systems, no. 03, pages 478-482 *
MA CHUNGUANG; ZHANG LEI; YANG SONGTAO: "A survey of location trajectory privacy protection", Netinfo Security, no. 10, pages 24-31 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032399A (en) * 2021-03-30 2021-06-25 Beijing University of Posts and Telecommunications Data processing method and device
CN113722752A (en) * 2021-08-19 2021-11-30 State Grid Electric Power Research Institute Co., Ltd. LFP tree and proxy vector based track privacy data publishing method, device and system
CN113722752B (en) * 2021-08-19 2024-04-09 State Grid Electric Power Research Institute Co., Ltd. Track privacy data publishing method, device and system based on LFP tree and proxy vector

Also Published As

Publication number Publication date
CN111859460A (en) 2020-10-30
CN112131608B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
CN112131608A Classification tree differential privacy protection method satisfying the LKC model
Zhao et al. A survey on differential privacy for unstructured data content
US7302420B2 (en) Methods and apparatus for privacy preserving data mining using statistical condensing approach
CN108363928B (en) Adaptive differential privacy protection method in associated medical data
Fung et al. Privacy-preserving data publishing: A survey of recent developments
Du et al. Privacy-maxent: integrating background knowledge in privacy quantification
Cao et al. ρ-uncertainty: inference-proof transaction anonymization
US7904471B2 (en) Method, apparatus and computer program product for preserving privacy in data mining
US20070233711A1 (en) Method and apparatus for privacy preserving data mining by restricting attribute choice
Ravikumar et al. A secure protocol for computing string distance metrics
EP2228735A2 (en) Efficient multi-dimensional suppression for k-anonymity
Wang et al. Generalized bucketization scheme for flexible privacy settings
Aggarwal et al. Privacy-preserving data mining: a survey
Chae et al. Software plagiarism detection via the static API call frequency birthmark
US8775364B2 (en) Model-theoretic approach to data anonymity and inference control
Rodriguez-Garcia et al. Semantic noise: privacy-protection of nominal microdata through uncorrelated noise addition
Aggarwal et al. A survey of randomization methods for privacy-preserving data mining
Saraswathi et al. Enhancing utility and privacy using t-closeness for multiple sensitive attributes
Hua et al. A survey of utility-based privacy-preserving data transformation methods
Nissim Private Data Analysis via Output Perturbation: A Rigorous Approach to Constructing Sanitizers and Privacy Preserving Algorithms
Paul Positive and negative association rule mining using correlation threshold and dual confidence approach
Chen et al. A Sensitivity-Adaptive-Uncertainty Model for Set-Valued Data
Singh et al. Conditional adherence based classification of transactions for database intrusion detection and prevention
Jia et al. Database query system with budget option for differential privacy against repeated attacks
Wang et al. HighPU: a high privacy-utility approach to mining frequent itemset with differential privacy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant