CN112131608B - Classification tree differential privacy protection method meeting LKC model - Google Patents
- Publication number
- CN112131608B (application CN202011227876.1A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- data
- track
- data set
- classification tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F21/6254: Protecting personal data, e.g. for financial or medical purposes, by anonymising data, e.g. decorrelating personal data from the owner's identification
- G06F16/29: Geographical information databases
- G06F16/906: Clustering; Classification
Abstract
The invention discloses a classification tree differential privacy protection method satisfying the LKC model, comprising the following steps: step 1, determining the sequence set requiring global suppression from the data to be published; step 2, computing the newly generated minimum violation sequences from the trajectory data in the sequence set, wherein any newly generated minimum violation sequence is discarded; and step 3, when no new minimum violation sequence is generated, building a classification tree from the trajectory data in the sequence set and adding noise to the data through the Laplace mechanism to obtain the data for publication.
Description
Technical Field
The invention relates to the technical field of information security, and in particular to a classification tree differential privacy protection method satisfying the LKC model.
Background
Trajectory data contain a large amount of personal information about mobile users, and researchers extract valuable knowledge from them through analysis and exploration, which motivates the study of protecting user privacy in such data. If trajectory data are not effectively privacy-protected before publication, an attacker with background knowledge can infer private information about a user, such as medical conditions or household income, by analyzing the data, which may cause the user economic loss and even endanger personal safety. Conversely, if the original trajectory data set is over-processed before publication, a great deal of user information is lost, reducing the availability and integrity of the published data and wasting information. Ensuring that published trajectory data do not leak user privacy while retaining high data availability is therefore an open research problem.
Research on privacy protection methods for trajectory data publication has already produced some results. For example, Mohammed et al. proposed an LKC privacy model applicable to RFID data and implemented it with an anonymization algorithm. The algorithm first identifies the set of minimum violating sequences in the trajectory data set and then globally suppresses the violating sequences with a greedy method, aiming to minimize the loss of maximal frequent sequences; however, global suppression deletes a large amount of data, so data availability is not effectively improved. Chen et al. introduced the concept of local suppression through the (K,C)_L privacy model and a corresponding algorithm, which first determines all sequences in the trajectory data set that violate the (K,C)_L privacy model and then simplifies the data set by local suppression while preserving efficient data availability. Ghasemzadeh et al. studied the case C = 1 of the LKC-privacy model and achieved trajectory privacy protection through global suppression. Komishani et al. proposed a privacy protection algorithm that generalizes sensitive information, suppressing high-dimensional trajectory data sets by building a classification tree over the sensitive attributes; because the length of the background knowledge held by an attacker is uncertain, however, a large amount of data is suppressed and the mining value of the data set is lost.
Disclosure of Invention
Based on the existing research results and their shortcomings, the invention designs and develops a classification tree differential privacy protection method satisfying the LKC model, aiming to remedy the reduction in data availability caused by global suppression of trajectory data while reducing the risk of user privacy leakage.
The technical scheme provided by the invention is as follows:
A classification tree differential privacy protection method satisfying the LKC model comprises the following steps:
step 1, determining the sequence set requiring global suppression from the data to be published;
step 2, computing the newly generated minimum violation sequences from the trajectory data in the sequence set;
wherein when a new minimum violation sequence is generated, that minimum violation sequence is discarded;
and step 3, when no new minimum violation sequence is generated, building a classification tree from the trajectory data in the sequence set and adding noise to the data through the Laplace mechanism to obtain the data for publication.
Preferably, in step 2, computing the newly generated minimum violation sequences comprises:
step 2.1, finding the minimum violation sequence set of the trajectory data set from the sequence set, and determining the maximal frequent sequence set according to a given frequency threshold;
step 2.2, constructing an MFS tree and determining the suppression order according to the suppression priority scores of the location points;
step 2.3, updating the MFS set according to the suppression order;
and step 2.4, recalculating the suppression priority scores of the remaining location points and updating the minimum violation sequence set to obtain the minimum violation sequences.
Preferably, in step 2.2, the suppression priority score is
Score(p) = Eliminate(p) / Loss(p),
where Eliminate(p) is the number of minimum violation sequences that can be eliminated by suppressing location point p, and Loss(p) is the utility loss caused by suppressing location point p.
Preferably, in step 2.2, the point with the highest suppression priority score is selected for suppression each time, which determines the suppression order.
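As an illustration, the greedy selection of the highest-scoring location point can be sketched in Python; the score tables and all names below are illustrative assumptions, not data or identifiers from the patent:

```python
# Hedged sketch of the greedy rule in step 2.2: each round, suppress the
# location point p with the highest Score(p) = Eliminate(p) / Loss(p).

def pick_suppression_point(eliminate, loss):
    """Return the location point p maximizing Eliminate(p) / Loss(p)."""
    return max(eliminate, key=lambda p: eliminate[p] / loss[p])

# Eliminate(p): number of minimum violation sequences removed by suppressing p.
eliminate = {"a": 4, "b": 6, "c": 2}
# Loss(p): utility loss caused by suppressing p.
loss = {"a": 2.0, "b": 1.5, "c": 1.0}

print(pick_suppression_point(eliminate, loss))  # point "b" wins with score 6/1.5 = 4.0
```

After a point is suppressed, the method recomputes Eliminate and Loss for the remaining points and repeats, as described in steps 2.3 and 2.4.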
Preferably, step 2 further comprises: when a new minimum violation sequence is generated, verifying whether the trajectory data set satisfies the LKC-privacy model; if any data sequence in the trajectory data set fails to satisfy the LKC-privacy model, the minimum violation sequence set is updated until all data sequences satisfy the LKC-privacy model;
wherein the trajectory data set satisfies the LKC-privacy model when every subsequence p in T with |p| < L satisfies:
|T(p)| ≥ K; and
Conf(s|T(p)) ≤ C;
where Conf(s|T(p)) = |T(p ∪ {s})| / |T(p)|, Conf is the confidence computed under the different conditions, L is the maximum trajectory length assumed to be known to an attacker, T is the trajectory data set of all users, S is the set of sensitive attribute values in T, p is any subsequence in T, 0 ≤ C ≤ 1, s ∈ S, C is the confidence threshold of the anonymity set, and K is the anonymity parameter, i.e. the minimum number of users sharing each sequence.
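A small Python sketch of checking these conditions on a toy data set follows; the data layout (user id mapped to a tuple of location points, user id mapped to a sensitive value) and all function names are illustrative assumptions, not part of the claimed method:

```python
from itertools import combinations

def is_subseq(p, traj):
    """True if p is an order-preserving subsequence of traj."""
    it = iter(traj)
    return all(x in it for x in p)

def satisfies_lkc(trajectories, sensitive, L, K, C):
    """Check |T(p)| >= K and Conf(s|T(p)) <= C for every subsequence p with |p| < L."""
    subseqs = set()
    for traj in trajectories.values():
        for n in range(1, L):                      # subsequence lengths |p| < L
            subseqs.update(combinations(traj, n))  # order-preserving subsequences
    for p in subseqs:
        support = [u for u, t in trajectories.items() if is_subseq(p, t)]
        if len(support) < K:                       # anonymity condition |T(p)| >= K
            return False
        for s in set(sensitive.values()):          # confidence condition
            if sum(sensitive[u] == s for u in support) / len(support) > C:
                return False
    return True

trajs = {1: ("a", "b"), 2: ("a", "b"), 3: ("a", "c")}
sens = {1: "flu", 2: "flu", 3: "cold"}
print(satisfies_lkc(trajs, sens, L=2, K=2, C=1.0))  # False: point "c" is held by only one user
```

With K = 1 the same toy data set passes, since every length-1 subsequence then has enough supporting users.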
Preferably, in step 3, the process of building the classification tree comprises the following steps:
step 3.1, initializing the trajectory data set of all users and selecting two groups of frequent sequences from it to construct the classification tree;
step 3.2, according to the number of times every pair of location points occurs in each trajectory record, selecting the trajectory sequence corresponding to the most frequent location points as the first group;
step 3.3, among all sequences containing the most frequent location points, selecting the least frequent sequence, and on the trajectory where that sequence lies selecting the most frequent location point as the second group;
and step 3.4, repeating steps 3.2 and 3.3 to place the remaining trajectories into the first and second groups until all trajectories are placed in the classification tree, yielding the final classification tree.
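One plausible reading of the pair-frequency grouping above can be sketched in Python; since the patent describes the grouping only in prose, the seeding rule and all names here are illustrative assumptions:

```python
from collections import Counter
from itertools import combinations

def is_subseq(p, traj):
    """True if p is an order-preserving subsequence of traj."""
    it = iter(traj)
    return all(x in it for x in p)

def split_two_groups(trajectories):
    """Count co-occurrences of every ordered pair of location points, then seed
    the first group with the trajectories containing the most frequent pair and
    put the remaining trajectories in the second group."""
    pair_freq = Counter()
    for traj in trajectories:
        pair_freq.update(combinations(traj, 2))
    top_pair = max(pair_freq, key=pair_freq.get)
    first = [t for t in trajectories if is_subseq(top_pair, t)]
    second = [t for t in trajectories if not is_subseq(top_pair, t)]
    return first, second

groups = split_two_groups([("a", "b", "c"), ("a", "b"), ("c", "d")])
print(groups)  # the first group holds the two trajectories containing the pair ("a", "b")
```

In the method itself this split would be applied iteratively, per step 3.4, until every trajectory is placed in the tree.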
Preferably, adding noise to the data through the Laplace mechanism in step 3 comprises:
the privacy budget ε used in the iterative partitioning of the classification tree is finely divided for the Laplace mechanism: ε is evenly distributed over the incremental update data sets as ε'_m, and each ε'_m is in turn split evenly into two halves, used respectively for the Laplace mechanism during the data iteration and for adding Laplace noise to the leaf nodes;
for any function f: T → R^d, if the output of algorithm A satisfies A(T) = f(T) + ⟨Lap_1(Δf/ε), Lap_2(Δf/ε), …, Lap_d(Δf/ε)⟩, then A satisfies ε-differential privacy;
where T is the trajectory data set, R^d is the d-dimensional real-valued range of the mapping f: T → R^d, A(T) is the output of algorithm A on T, f(T) is the output of f on T, and the Lap_i(Δf/ε), 1 ≤ i ≤ d, are mutually independent Laplace variables.
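A minimal sketch of this d-dimensional Laplace mechanism, using only the Python standard library, is shown below; the inverse-CDF sampler and all function names are illustrative, not from the patent:

```python
import math
import random

def sample_laplace(scale, rng):
    """Draw one Laplace(0, scale) variable by inverse-CDF sampling."""
    u = rng.random() - 0.5          # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(f_output, sensitivity, epsilon, rng=None):
    """A(T) = f(T) + <Lap_1(Δf/ε), ..., Lap_d(Δf/ε)>: add independent Laplace
    noise of scale Δf/ε to each coordinate of a d-dimensional query answer."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [x + sample_laplace(scale, rng) for x in f_output]

rng = random.Random(7)
noisy = laplace_mechanism([120.0, 45.0, 8.0], sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy)  # three noisy counts, each perturbed with noise of scale Δf/ε = 2.0
```

A larger ε yields a smaller noise scale Δf/ε, which matches the privacy/utility trade-off seen in the experiments below.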
Compared with the prior art, the invention has the following beneficial effects: during trajectory data publication, local suppression replaces global suppression to improve data availability; at the same time, a classification tree is built from the user information in the trajectory data set and noise is added to the data through the Laplace mechanism, improving the security of the data to be published while preserving its availability. Experiments show that, compared with other algorithms, the proposed algorithm effectively reduces the MFS (maximal frequent sequence) loss rate and the sequence loss rate, and achieves a lower average relative error of count queries under the same privacy budget.
Drawings
Fig. 1 is a flowchart of a classification tree differential privacy protection method satisfying an LKC model according to the present invention.
FIG. 2 shows the effect of different K values on the MFS loss rate according to the present invention.
FIG. 3 shows the effect of different K values on the sequence loss rate according to the present invention.
FIG. 4 shows the effect of different C values on the MFS loss rate according to the present invention.
FIG. 5 shows the effect of different C values on the sequence loss rate according to the present invention.
FIG. 6 shows the effect of data set length on the average relative error when ε = 0.5.
FIG. 7 shows the effect of data set length on the average relative error when ε = 1.
Detailed Description
The present invention is described in further detail below with reference to the drawings, to enable those skilled in the art to practice it by following the description.
As shown in FIG. 1, the present invention provides a classification tree differential privacy protection method satisfying the LKC model. Considering that global suppression of trajectory data reduces data availability, the method processes the data with local suppression instead: it finds the MVS (minimum violating sequence) set in the trajectory data set, finds the maximal frequent sequence set according to a given frequency threshold E, constructs an MFS tree, determines the suppression order according to the suppression priority scores of the location points, and updates the minimum violation sequences. When noise is added, a classification tree algorithm is used and the Laplace noise mechanism is applied to protect the data, improving security during trajectory data publication and reducing the data loss rate caused by global suppression. The method specifically comprises the following steps:
Step 1, compute the newly generated minimum violation sequences (NewMVS): find the MVS set in the trajectory data set and the maximal frequent sequence set according to a given frequency threshold E, then construct an MFS tree and determine the suppression order according to the suppression priority score of each location point p, where Score(p) = Eliminate(p) / Loss(p), Eliminate(p) being the number of MVSs that suppressing point p can eliminate and Loss(p) the utility loss caused by suppressing point p:
each time, select the point p with the highest score, suppress its sequences, update the MFS (maximal frequent sequence) set, then recalculate the suppression priority scores of the remaining location points and update the minimum violation sequence (MVS) set;
Step 2, verify whether the trajectory data set satisfies the LKC-privacy model to decide whether the minimum violation sequence set must be updated further: if any sequence in the trajectory data set violates the model, the minimum violation sequence set is updated until all sequences satisfy it. Then judge whether a new minimum violation sequence has been generated: if so, discard the violating sequence; if not, build a classification tree from the sensitive information in the trajectory data set;
wherein L is the maximum trajectory length assumed to be known to an attacker, T is the trajectory data set of all users, S is the set of sensitive attribute values in T, and K is the anonymity parameter of the sequences; the trajectory data set T satisfies LKC-privacy only when every subsequence p in T with |p| < L satisfies the following conditions:
|T(p)| ≥ K, where T(p) is the set of users whose trajectories contain p;
Conf(s|T(p)) ≤ C, where Conf(s|T(p)) = |T(p ∪ {s})| / |T(p)|, Conf (short for confidence) is the confidence computed under the different conditions and is compared with a given confidence threshold C, 0 ≤ C ≤ 1, s ∈ S; C is the confidence threshold of the anonymity set, and the degree of anonymity can be adjusted flexibly as required;
Step 3, build the classification tree: first initialize the data set T and select two groups of frequent sequences from the trajectory data set to construct the classification tree. According to the number of times every pair of location points occurs in each trajectory record, select the trajectory sequence with the highest count as the first group; among all sequences containing those location points, select the least frequent sequence and, on the trajectory where it lies, select the most frequent location point as the second group; iteratively place the remaining trajectories into the two groups until all trajectories are placed in the classification tree, completing its construction.
Step 4, reassign the privacy budget: the privacy budget ε used in the iterative partitioning of the classification tree is finely divided for the Laplace mechanism. First, ε is evenly distributed over the incremental update data sets as ε'_m; each ε'_m is then split evenly into two halves, used respectively for the Laplace mechanism during the data iteration and for adding Laplace noise to the leaf nodes;
Step 5, add noise: for any function f: T → R^d, if the output of algorithm A satisfies A(T) = f(T) + ⟨Lap_1(Δf/ε), Lap_2(Δf/ε), …, Lap_d(Δf/ε)⟩, then A satisfies ε-differential privacy; where T denotes the trajectory data set, R^d the d-dimensional real-valued range of f: T → R^d, A(T) the output of algorithm A on T, f(T) the output of f on T, and the Lap_i(Δf/ε), 1 ≤ i ≤ d, are mutually independent Laplace variables. The noise magnitude is proportional to Δf and inversely proportional to ε.
Examples
To demonstrate the effectiveness of the invention, the algorithm was implemented in a Python environment using the MyEclipse integrated development environment. The experimental hardware environment was an Intel(R) Core(TM) i7-5500U CPU at 2.40 GHz with 8.0 GB of RAM, running a Linux operating system. The experiments use the open-source data set provided by the GeoLife project of Microsoft Research Asia for verification; the data set contains 18,670 real user trajectories and is widely used in trajectory data research.
As shown in FIGS. 2 to 5, data loss is an important measure of trajectory data availability; the invention evaluates it for both the maximal frequent sequences (MFS) and the trajectory sequences:
(1) The MFS data loss MFSLoss depends on the number of MFSs in the original trajectory data set and the number remaining after local suppression:
MFSLoss = (M(T) − M(T′)) / M(T),
where M(T) is the number of MFSs in the original trajectory data set and M(T′) is the number of MFSs in the data set after local suppression;
(2) The trajectory sequence loss TLoss depends on the number of sequences in the original trajectory data set and the number after processing:
TLoss = (L(T) − L(T′)) / L(T),
where L(T) is the number of trajectories in the original trajectory data set and L(T′) is the number of trajectories in the data set after local suppression.
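A minimal sketch of the two loss metrics, assuming the standard ratio form (original count minus remaining count, divided by the original count):

```python
def mfs_loss(m_orig, m_after):
    """Fraction of maximal frequent sequences lost to local suppression."""
    return (m_orig - m_after) / m_orig

def trajectory_loss(l_orig, l_after):
    """Fraction of trajectory sequences lost to local suppression."""
    return (l_orig - l_after) / l_orig

# Illustrative numbers, not results from the patent's experiments.
print(mfs_loss(200, 150))          # 0.25: a quarter of the MFSs were lost
print(trajectory_loss(1000, 900))  # 0.1: a tenth of the trajectories were lost
```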
As shown in FIGS. 6 and 7, the average relative error of count queries is used as the measure of data loss. For a count query R,
error(R) = |R(T′) − R(T)| / max{R(T), b},
where R(T) denotes the count query on the original data set, R(T′) the count query on the processed data set, and b is a sanity bound set to prevent the denominator from becoming too small.
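A minimal sketch of the average relative error of count queries, assuming the common bounded-denominator form |R(T′) − R(T)| / max{R(T), b}; the inputs below are illustrative, not data from the patent's experiments:

```python
def average_relative_error(true_counts, noisy_counts, b):
    """Average relative error over a set of count queries; the sanity bound b
    keeps queries with tiny true counts from dominating the denominator."""
    errs = [abs(n - t) / max(t, b) for t, n in zip(true_counts, noisy_counts)]
    return sum(errs) / len(errs)

err = average_relative_error([100, 0], [90, 5], b=10)
print(err)  # (10/100 + 5/10) / 2 ≈ 0.3
```

Note how b = 10 caps the second query's error at 0.5 even though its true count is zero.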
Experimental results
As shown in FIGS. 2 and 3, MFS loss and sequence loss grow as the K value increases, because a larger K produces more minimum violation sequences (MVSs) and therefore more sequences that must be suppressed. Although the TP-NSA algorithm shown in the figures also reduces data loss to some extent, the KTP algorithm of the invention causes less data loss.
As can be seen from FIGS. 4 and 5, MFS loss and sequence loss decrease as the C value increases, because a larger C reduces the number of minimum violating sequences (MVSs) that must be suppressed, so both losses fall gradually. The results show that, compared with the TP-NSA algorithm shown in the figures, the KTP algorithm of the invention has a lower data loss rate.
As can be seen from FIGS. 6 and 7, the average relative error of the data grows gradually with the length of the trajectory data set, but in both experiments it is smaller when the privacy budget is larger. Compared with the HDMPM algorithm shown in the figures, the CTL algorithm of the invention is more effective: it reduces the average relative error, effectively protects user trajectory privacy, and improves data availability.
Although embodiments of the present invention have been disclosed above, the invention is not limited to the details and embodiments shown and described; it is well suited to various fields of use readily apparent to those skilled in the art, and may accordingly be modified without departing from the general concepts defined in the claims and their equivalents.
Claims (2)
1. A classification tree differential privacy protection method satisfying the LKC model, characterized by comprising the following steps:
step 1, determining the sequence set requiring global suppression from the data to be published;
step 2, computing the newly generated minimum violation sequences from the trajectory data in the sequence set;
wherein when a new minimum violation sequence is generated, that minimum violation sequence is discarded;
step 3, when no new minimum violation sequence is generated, building a classification tree from the trajectory data in the sequence set and adding noise to the data through the Laplace mechanism to obtain the data for publication;
in said step 2, computing the newly generated minimum violation sequences comprises:
step 2.1, finding the minimum violation sequence set of the trajectory data set from the sequence set, and determining the maximal frequent sequence set according to a given frequency threshold;
step 2.2, constructing an MFS tree and determining the suppression order according to the suppression priority scores of the location points;
step 2.3, updating the MFS set according to the suppression order;
step 2.4, recalculating the suppression priority scores of the remaining location points and updating the minimum violation sequence set to obtain the minimum violation sequences;
in said step 2.2, the suppression priority score is
Score(p) = Eliminate(p) / Loss(p),
where Eliminate(p) is the number of minimum violation sequences that can be eliminated by suppressing location point p, and Loss(p) is the utility loss caused by suppressing location point p;
in said step 2.2, the point with the highest suppression priority score is selected for suppression each time, which determines the suppression order;
said step 2 further comprises: when a new minimum violation sequence is generated, verifying whether the trajectory data set satisfies the LKC-privacy model; if any data sequence in the trajectory data set fails to satisfy the LKC-privacy model, updating the minimum violation sequence set until all data sequences satisfy the LKC-privacy model;
wherein the trajectory data set satisfies the LKC-privacy model when every subsequence p in T with |p| < L satisfies:
|T(p)| ≥ K; and
Conf(s|T(p)) ≤ C;
where Conf(s|T(p)) = |T(p ∪ {s})| / |T(p)|, Conf is the confidence computed under the different conditions, L is the maximum trajectory length assumed to be known to an attacker, T is the trajectory data set of all users, S is the set of sensitive attribute values in T, p is any subsequence in T, 0 ≤ C ≤ 1, s ∈ S, C is the confidence threshold of the anonymity set, and K is the anonymity parameter of the sequences;
in said step 3, the process of building the classification tree comprises the following steps:
step 3.1, initializing the trajectory data set of all users and selecting two groups of frequent sequences from it to construct the classification tree;
step 3.2, according to the number of times every pair of location points occurs in each trajectory record, selecting the trajectory sequence corresponding to the most frequent location points as the first group;
step 3.3, among all sequences containing the most frequent location points, selecting the least frequent sequence, and on the trajectory where that sequence lies selecting the most frequent location point as the second group;
and step 3.4, repeating steps 3.2 and 3.3 to place the remaining trajectories into the first and second groups until all trajectories are placed in the classification tree, yielding the final classification tree.
2. The classification tree differential privacy protection method according to claim 1, wherein adding noise to the data through the Laplace mechanism in said step 3 comprises:
the privacy budget ε used in the iterative partitioning of the classification tree is finely divided for the Laplace mechanism: ε is evenly distributed over the incremental update data sets as ε'_m, and each ε'_m is in turn split evenly into two halves, used respectively for the Laplace mechanism during the data iteration and for adding Laplace noise to the leaf nodes;
for any function f: T → R^d, if the output of algorithm A satisfies A(T) = f(T) + ⟨Lap_1(Δf/ε), Lap_2(Δf/ε), …, Lap_d(Δf/ε)⟩, then A satisfies ε-differential privacy;
where T is the trajectory data set, R^d is the d-dimensional real-valued range of the mapping f: T → R^d, A(T) is the output of algorithm A on T, f(T) is the output of f on T, and the Lap_i(Δf/ε), 1 ≤ i ≤ d, are mutually independent Laplace variables.
Applications Claiming Priority (2)
- CN2020107667717: priority date 2020-08-03
- CN202010766771.7A (CN111859460A): priority date 2020-08-03, filing date 2020-08-03, "Classification tree difference privacy protection method meeting LKC model"
Publications (2)
- CN112131608A: published 2020-12-25
- CN112131608B: published 2024-01-26
Family
ID=72952899
Family Applications (2)
- CN202010766771.7A (CN111859460A, pending): priority date 2020-08-03, filing date 2020-08-03, "Classification tree difference privacy protection method meeting LKC model"
- CN202011227876.1A (CN112131608B, active): priority date 2020-08-03, filing date 2020-11-06, "Classification tree differential privacy protection method meeting LKC model"
Family Applications Before (1)
- CN202010766771.7A (CN111859460A, pending): priority date 2020-08-03, filing date 2020-08-03, "Classification tree difference privacy protection method meeting LKC model"
Country Status (1)
- CN: CN111859460A
Families Citing this family (3)
- CN112560984B (priority 2020-12-25, published 2022-04-05): Differential privacy protection method for self-adaptive K-Nets clustering
- CN113032399B (priority 2021-03-30, published 2022-08-30): Data processing method and device
- CN113722752B (priority 2021-08-19, published 2024-04-09): Track privacy data publishing method, device and system based on LFP tree and proxy vector
Citations (6)
- CN107526975A (priority 2017-08-10, published 2017-12-29): A decision tree method based on differential privacy protection
- CN109145633A (priority 2018-06-08, published 2019-01-04): Trajectory data privacy protection method, electronic device, storage medium and system
- CN109726594A (priority 2019-01-09, published 2019-05-07): A novel trajectory data publishing method based on differential privacy
- CN109726758A (priority 2018-12-28, published 2019-05-07): A data fusion publishing algorithm based on differential privacy
- CN110750806A (priority 2019-07-16, published 2020-02-04): High-dimensional location trajectory data privacy protection publishing system and method based on TP-MFSA suppression release
- CN116611101A (priority 2023-03-03, published 2023-08-18): Differential privacy trajectory data protection method based on interactive query
Family Cites Families (1)
- US10936750B2 (priority 2018-03-01, published 2021-03-02), International Business Machines Corporation: "Data de-identification across different data sources using a common data model"
Application events (2020)
- 2020-08-03: CN202010766771.7A filed in China (CN111859460A, pending)
- 2020-11-06: CN202011227876.1A filed in China (CN112131608B, active)
Non-Patent Citations (6)
- Xiaohui Li et al., "A trajectory data publishing algorithm satisfying local suppression", International Journal of Distributed Sensor Networks, pp. 1-9.
- Ma Chunguang, Zhang Lei, Yang Songtao, "A survey of location trajectory privacy protection", Netinfo Security, No. 10, pp. 24-31.
- Deng Jinsong, Luo Yonglong, Yu Qingying, Chen Fulong, "Privacy-preserving trajectory data publishing based on non-sensitive information analysis", Journal of Computer Applications, No. 2.
- Deng Jinsong, Luo Yonglong, Yu Qingying, Chen Fulong, "Trajectory privacy protection method supporting local suppression and sequence reconstruction", Journal of Chinese Computer Systems, No. 3, pp. 478-482.
- Bai Yuliang, "Research on privacy protection of location-based services for data publishing", China Master's Theses Full-text Database, Information Science and Technology, No. 3, I138-26.
- Bai Yuliang et al., "Research on optimized suppression differential privacy protection for trajectory data publishing", Journal of Chinese Computer Systems, Vol. 42, No. 8, pp. 1787-1792.
Also Published As
- CN111859460A: published 2020-10-30
- CN112131608A: published 2020-12-25
Similar Documents
- CN112131608B: Classification tree differential privacy protection method meeting LKC model
- CN108363928B: Adaptive differential privacy protection method in associated medical data
- Fung et al.: Privacy-preserving data publishing: A survey of recent developments
- US11893133B2: Budget tracking in a differentially private database system
- Hajian et al.: A methodology for direct and indirect discrimination prevention in data mining
- US8738636B2: Ontology alignment with semantic validation
- Du et al.: Privacy-maxent: integrating background knowledge in privacy quantification
- Rafiei et al.: Group-based privacy preservation techniques for process mining
- Wang et al.: Generalized bucketization scheme for flexible privacy settings
- Gkountouna et al.: Anonymizing collections of tree-structured data
- Aggarwal et al.: Privacy-preserving data mining: a survey
- US8775364B2: Model-theoretic approach to data anonymity and inference control
- Orooji et al.: A novel microdata privacy disclosure risk measure
- Moqurrab et al.: UtilityAware: A framework for data privacy protection in e-health
- CN116186757A: Method for publishing condition feature selection differential privacy data with enhanced utility
- Kassem et al.: Differential inference testing: A practical approach to evaluate sanitizations of datasets
- US20230274004A1: Subject Level Privacy Attack Analysis for Federated Learning
- CN112822004B: Belief network-based targeted privacy protection data publishing method
- Hua et al.: A survey of utility-based privacy-preserving data transformation methods
- CN114817977B: Anonymous protection method based on sensitive attribute value constraint
- Hu et al.: IceBerg: Deep Generative Modeling for Constraint Discovery and Anomaly Detection
- Wang et al.: HighPU: a high privacy-utility approach to mining frequent itemset with differential privacy
- CN113722752B: Track privacy data publishing method, device and system based on LFP tree and proxy vector
- Yu et al.: Trajectory personalization privacy preservation method based on multi-sensitivity attribute generalization and local suppression
- Fukuchi et al.: Locally Differentially Private Minimum Finding
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant