CN107247909B - Differential privacy method for protecting multiple positions in position information service - Google Patents

Differential privacy method for protecting multiple positions in position information service

Info

Publication number
CN107247909B
CN107247909B
Authority
CN
China
Prior art keywords
privacy
consumption
point
user
test
Prior art date
Legal status
Active
Application number
CN201710433690.3A
Other languages
Chinese (zh)
Other versions
CN107247909A (en)
Inventor
朱马克 (Zhu Make)
华景煜 (Hua Jingyu)
仲盛 (Zhong Sheng)
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201710433690.3A
Publication of CN107247909A
Application granted
Publication of CN107247909B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G06F 21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218: Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245: Protecting personal data, e.g. for financial or medical purposes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2221/00: Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 2221/21: Indexing scheme relating to G06F21/00 and subgroups, addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 2221/2111: Location-sensitive, e.g. geographical location, GPS

Abstract

The invention discloses a differential privacy method for protecting multiple locations in location information services, which improves the original geo-indistinguishability algorithm and provides a predict-and-test mechanism. The method reduces the overall privacy consumption by spending a small amount of privacy to construct an approximation of the true location. It can greatly reduce privacy consumption without unduly compromising data availability. To test the performance of the proposed mechanism, we performed experiments on two popular datasets. The results show that our mechanism indeed greatly reduces the privacy consumption while guaranteeing the availability of the data.

Description

Differential privacy method for protecting multiple positions in position information service
Technical Field
The invention relates to a differential privacy method for protecting a plurality of positions in a position information service, belonging to the technical field of position information prediction.
Background
In recent years, with the popularization of GPS-equipped smartphones, location information services have played an increasingly important role in people's lives. Almost all smartphones use the user's location data, explicitly or implicitly, for a variety of purposes. For example, Facebook uses the user's location data to find friends near the user, and news applications use it to push local news.
Unfortunately, although location information services provide great convenience to our lives, they also raise serious privacy concerns. Users are often reluctant to expose their real-time location data to third parties, including service providers, because once users upload their location data they cannot control what third parties do with it. A malicious third party may use the location data to track a user and thereby infer the user's home address, locations of interest, and even sensitive information such as health conditions or religious beliefs. There is therefore an urgent need for privacy-preserving location information services that hide users' location information while maintaining high quality of service.
Existing solutions to this problem fall into two broad categories. The first is cryptographic methods, which encrypt the location data before the user uploads it. They can completely protect the user's personal privacy and provide provable guarantees. However, encryption severely impairs the availability of the data, making it difficult for service providers to offer valuable location services; moreover, cryptographic methods are often very time-consuming, which is intolerable on handheld devices. The second category perturbs the raw data, preventing third parties from obtaining the user's exact location. Compared with cryptographic methods, perturbation methods are much lighter weight and compromise data availability far less, so third parties are also more willing to accept them. However, the security of this approach remains questionable, as it still reveals (inexact) data about the user.
More recently, Andres et al proposed geographic indistinguishability, the first formal privacy model based on location perturbation. This model evolved from differential privacy and can provide provable privacy guarantees. In particular, a perturbation mechanism satisfies geographic indistinguishability if any two locations that are less than a given threshold apart produce the same output location with close probability. Thus, a third party cannot single out the user's true location within a range of nearby locations. Andres et al designed a corresponding algorithm that satisfies geographic indistinguishability by adding planar (two-dimensional) Laplacian noise. However, this mechanism is primarily designed to protect a single location. If it is used directly to protect multiple locations, the overall privacy consumption grows with the number of locations. This means the number of location queries a user can perform is very limited; otherwise the privacy budget is quickly exhausted and the user's privacy is destroyed. This limitation is unacceptable, because many real-life location service providers use a user's location data many times a day or even an hour. For example, when using online navigation, a driver may issue a location query every few seconds to obtain real-time road information. There is therefore a strong need to improve the existing geo-indistinguishable perturbation mechanism so as to reduce privacy consumption as much as possible while maintaining the original data availability.
Disclosure of Invention
The purpose of the invention is as follows: as location information services become more prevalent and the privacy issues they raise grow, users are often reluctant to expose their locations to third parties (including service providers). To address this problem, geographic indistinguishability, a variant of differential privacy, has been proposed. This privacy model makes neighboring locations produce the same output location with similar probability by adding noise to the original location. However, the method was originally designed to protect a single location; if used directly to protect multiple locations, the overall privacy consumption increases rapidly with the number of locations, leaving the user a very limited number of location queries. The invention provides a differential privacy method for protecting multiple locations in location information services: it improves the original geo-indistinguishability algorithm and proposes a Predict and Test Mechanism (PTM). The method reduces the overall privacy consumption by spending a small amount of privacy to construct an approximation of the true location. It can greatly reduce privacy consumption without unduly compromising data availability. To test the performance of the proposed mechanism, we performed experiments on two popular datasets. The results show that our mechanism indeed greatly reduces the privacy consumption while guaranteeing the availability of the data.
The technical scheme is as follows: a differential privacy method for protecting multiple locations in a location information service, comprising the steps of:
Step 1: use a threshold R (usually set to 1-2 times the radius of the variation circle) to test the accuracy of the predicted point and determine the published point; input the true trace points X_1, X_2, … that the user wants to protect; draw a value l from the gamma distribution Γ(2, 1/ε), whose density is
f(x) = x^(α-1) e^(-x/β) / (Γ(α) β^α) = ε² x e^(-εx), x ≥ 0,
where α = 2, β = 1/ε, x is the random variable (after sampling an x, set l = x), and ε is the user-specified privacy parameter;
Step 2: draw a value θ from the uniform distribution U(0, 2π), whose density is
f(x) = 1/(b - a), a ≤ x ≤ b,
where a = 0, b = 2π, x is the random variable (after sampling an x, set θ = x);
Step 3: let Y_1 = X_1 + (l cos θ, l sin θ);
Step 4: publish Y_1;
Step 5: let i = 2, s = 0 and f = 1;
Step 6: draw a value l from the gamma distribution Γ(2, 1/ε);
Step 7: draw a value θ from the uniform distribution U(0, 2π);
Step 8: let Y_i = X_i + (l cos θ, l sin θ);
Step 9: construct the predicted point P and generate a random number ran between 0 and 1; if ran < s/(s + f): publish Y_i = P, set i = i + 1, and return to step 6; otherwise execute step 10;
Step 10: if dis(Y_i, P) < R: publish Y_i = P and set s = s + 1; otherwise publish Y_i and set f = f + 1;
Step 11: set i = i + 1 and return to step 6.
The method for constructing the predicted point P is:
when the user is stationary: P_i = Y_{i-1};
when the user moves slowly (typically meaning that the distance between two consecutive true positions does not exceed R): P_i = Y_{i-1};
when the user moves at high speed (typically meaning that the distance between two consecutive true positions exceeds R): P_i = 2Y_{i-1} - Y_{i-2}.
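For illustration only, the steps above can be sketched in Python roughly as follows (a minimal sketch, not the patented implementation; the function names and the choice of predictor passed in are ours):

```python
import math
import random

def gi_noise(eps):
    """Planar Laplacian noise: radius l ~ Gamma(2, 1/eps), angle ~ U(0, 2*pi)."""
    l = random.gammavariate(2, 1.0 / eps)
    theta = random.uniform(0.0, 2.0 * math.pi)
    return l * math.cos(theta), l * math.sin(theta)

def dis(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def ptm_publish(true_points, eps, R, predict):
    """Steps 1-11: publish one perturbed point per true point.
    `predict` maps the list of already-published points to a predicted point P."""
    noise = gi_noise(eps)                               # steps 1-3
    x0 = true_points[0]
    published = [(x0[0] + noise[0], x0[1] + noise[1])]  # step 4
    s, f = 0, 1                                         # step 5
    for X in true_points[1:]:
        nx, ny = gi_noise(eps)                          # steps 6-8
        Y = (X[0] + nx, X[1] + ny)
        P = predict(published)                          # step 9: predicted point
        if random.random() < s / (s + f):
            published.append(P)                         # skip the test, publish P
        elif dis(Y, P) < R:                             # step 10: the test
            published.append(P)
            s += 1
        else:
            published.append(Y)
            f += 1
    return published

# Example: a stationary user, predicting with the last published point.
track = ptm_publish([(0.0, 0.0)] * 50, eps=0.02, R=200.0,
                    predict=lambda pts: pts[-1])
```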
Advantageous effects: compared with the prior art, the differential privacy method for protecting multiple locations in location information services provided by the invention has the following advantages:
(1) When a user wants to perform a new round of location queries, the invention first predicts the user's current location based on the past query history (i.e., the user's published points). The predicted location is then compared with the location obtained by the original geo-indistinguishable perturbation mechanism. If the distance between the two locations is less than a predefined threshold, we consider the prediction successful and use the predicted location for the query; otherwise the prediction is considered failed, and the location obtained by the original geo-indistinguishable perturbation mechanism is used. We prove that this mechanism still satisfies geographic indistinguishability and that the privacy consumption is significantly reduced.
(2) Simple and efficient prediction methods are designed for the three main scenarios (stationary users, slow users and high-speed users). We also add a skip strategy for further improvement: it senses the availability of the user's published points and trades a little availability for a further reduction in privacy consumption. Based on this method, the privacy consumption is greatly reduced; in some scenarios it is even reduced to a constant level.
(3) Experiments were performed on real datasets to verify the performance of the invention, based on two popular datasets: Geolife and T-Drive. Experiments were run in three scenarios (stationary, slow and high-speed), and the corresponding privacy consumption and availability were evaluated. The results show that the method saves 98%, 81% and 55% of privacy consumption in the three scenarios respectively, while the availability of user data is not greatly affected.
Drawings
FIG. 1(a) is a schematic diagram of the position of Y' when the test succeeds;
FIG. 1(b) is a schematic diagram of the position of Y' when the test fails;
FIG. 2(a) is a graph of how ε_c varies with d when the test succeeds;
FIG. 2(b) is a graph of how ε_c varies with R when the test succeeds;
FIG. 3(a) is a graph of the test success probability as a function of d;
FIG. 3(b) is a graph of the test success probability as a function of R;
FIG. 4(a) is the cumulative distribution function of the privacy consumption for a stationary user when R = 100;
FIG. 4(b) is the cumulative distribution function of the error for a stationary user when R = 100;
FIG. 4(c) is the cumulative distribution function of the privacy consumption for a stationary user when R = 200;
FIG. 4(d) is the cumulative distribution function of the error for a stationary user when R = 200;
FIG. 4(e) is the cumulative distribution function of the privacy consumption for a stationary user when R = 300;
FIG. 4(f) is the cumulative distribution function of the error for a stationary user when R = 300;
FIG. 5(a) is the cumulative distribution function of the privacy consumption for a slow user when R = 100;
FIG. 5(b) is the cumulative distribution function of the error for a slow user when R = 100;
FIG. 5(c) is the cumulative distribution function of the privacy consumption for a slow user when R = 200;
FIG. 5(d) is the cumulative distribution function of the error for a slow user when R = 200;
FIG. 5(e) is the cumulative distribution function of the privacy consumption for a slow user when R = 300;
FIG. 5(f) is the cumulative distribution function of the error for a slow user when R = 300;
FIG. 6(a) is the cumulative distribution function of the privacy consumption for a high-speed user when R = 100;
FIG. 6(b) is the cumulative distribution function of the error for a high-speed user when R = 100;
FIG. 6(c) is the cumulative distribution function of the privacy consumption for a high-speed user when R = 200;
FIG. 6(d) is the cumulative distribution function of the error for a high-speed user when R = 200;
FIG. 6(e) is the cumulative distribution function of the privacy consumption for a high-speed user when R = 300;
FIG. 6(f) is the cumulative distribution function of the error for a high-speed user when R = 300.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
Differential privacy and location information protection
This section introduces the concept of differential privacy and its extension to location information services (i.e., the prior-art geographic indistinguishability). We then analyze the expected privacy consumption and expected error of the perturbation mechanism designed by Andres et al.; these serve as the comparison baselines for our experiments.
Differential privacy
Differential privacy has become a very popular privacy model because it provides a provable privacy guarantee. The key idea is that the output of a statistical database should not change much when at most one data record is modified. In other words, the loss of one individual's data has a very limited effect on the overall output, thereby protecting individual privacy. Differential privacy is defined as follows:
definition (ε -differential privacy): one publication mechanism a satisfies epsilon-differential privacy if and only if a satisfies the following condition: for any adjacent data sets D and D' (i.e., they differ by at most one record), we have the arbitrary output Z ∈ range (A):
Figure GDA0002276426220000061
Here ε measures the privacy level of the publication mechanism: the smaller the value, the stronger the privacy guarantee.
To satisfy differential privacy, we need to add noise to the raw output of the database. The amplitude of the noise is determined by the sensitivity, which is defined as follows:
Definition (sensitivity): for any adjacent datasets D and D' and a query function f: D → R^d, the sensitivity is defined as:
Δf = max_{D,D'} ‖f(D) - f(D')‖_1
the laplacian mechanism is a commonly used algorithm to implement differential privacy. It achieves differential privacy by adding noise to each output that meets a certain laplacian distribution, namely:
Definition (Laplace mechanism): for an arbitrary query function f: D → R^d, if a mechanism A outputs f(D) + Lap(Δf/ε), then A satisfies ε-differential privacy, where Lap(Δf/ε) denotes noise drawn from the Laplace distribution with scale Δf/ε in each coordinate.
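For illustration (not part of the patent), a minimal sketch of the Laplace mechanism on a counting query, whose sensitivity is 1, might look as follows; the query and parameter values are our own example:

```python
import numpy as np

def dp_count(data, predicate, eps):
    """Laplace mechanism for a counting query: adding or removing one record
    changes the count by at most 1, so the sensitivity is 1 and the noise
    scale is delta_f / eps = 1 / eps."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / eps)

# A noisy count of the records greater than 4, with privacy parameter 0.5.
print(dp_count([3, 7, 1, 9, 4], lambda x: x > 4, eps=0.5))
```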
However, it is difficult to directly apply standard differential privacy to location protection because locations do not have the concept of adjacency and the distance between different locations is arbitrary and continuous. To apply differential privacy to location protection, we need a generalized definition of differential privacy, as follows:
Definition (generalized ε-differential privacy): a publication mechanism A satisfies generalized ε-differential privacy if and only if, for any two datasets D and D' that differ in at most k records and any output Z ∈ Range(A):
Pr[A(D) = Z] ≤ e^(kε) · Pr[A(D') = Z]
this privacy definition shows that the larger the database changes, the larger the change in the probability distribution of the output we allow. This is the basis for applying differential privacy to location protection.
Location privacy protection
In view of the characteristics of two-dimensional space and the practical application requirements, we need to make some changes when applying differential privacy to location privacy.
First, location publishing has no query function f. To simplify the problem, we can assume one: the identity function, whose output exactly equals its input. This facilitates the definition of subsequent concepts, such as sensitivity.
Second, location publishing has no notion of adjacent locations, which is why we use generalized differential privacy. We need a metric describing the difference between locations, analogous to the number of records by which two databases differ; the Euclidean distance is clearly a good choice.
Finally, we cannot protect the privacy of all points on earth. We preset a threshold 2r and only protect privacy for locations at distance at most 2r. In other words, we draw a circle centered at the true position with radius r and only need to ensure that any two points in the circle produce the same output with similar probability. We refer to this circle as the variation circle C.
Based on the above discussion, we have the following definitions:
Definition (geo-indistinguishability): a publication mechanism A satisfies generalized ε-geographic indistinguishability if and only if, for any two positions X_1 and X_2 within a variation circle C of radius r and any output position Y, we have:
Pr[A(X_1) = Y] ≤ e^(2rε) · Pr[A(X_2) = Y]
Obviously, for any two points X_1 and X_2 in C we have dis(X_1, X_2) ≤ 2r, where dis(X_1, X_2) denotes the Euclidean distance between the two points.
To satisfy the geographical indistinguishability, we need to add a noise vector to the true location. Andres et al introduced a simple and general mechanism as follows:
Algorithm 1 (geo-indistinguishable continuous mechanism):
Input: a true point X; a privacy parameter ε;
Output: a published point Y;
1: draw a value l from the gamma distribution Γ(2, 1/ε);
2: draw a value θ from the uniform distribution U(0, 2π);
3: Y = X + (l cos θ, l sin θ);
4: output the published point Y.
The correctness of the algorithm has been proven and reference is made to the paper of Andres et al. For convenience, we will refer to this algorithm as the GICM algorithm.
Expected value of privacy consumption and error
Many prior works use ε as the comparison baseline in their experiments. However, due to the particularities of location privacy protection, we found that using ε directly to measure the security level of a publication mechanism is problematic. Our analysis is as follows: the privacy consumption, which we denote ε_c, is a better measure of privacy strength. In the Laplace mechanism we can calculate this value by the formula
ε_c = ln( Pr[A(D) = Z] / Pr[A(D') = Z] ),
where ε_c indicates how much new knowledge we can obtain by observing the output Z; in the Laplace mechanism this value (in the worst case over adjacent D and D') equals ε. Likewise, for the GICM algorithm we can define the privacy consumption ε_c in the same way, normalized by the diameter 2r of the variation circle:
ε_c = (1/(2r)) · max_{X_1,X_2 ∈ C} ln( Pr[A(X_1) = Y] / Pr[A(X_2) = Y] ).
The above definition shows how much new knowledge can be obtained after observing the output point Y. Unlike in the Laplace mechanism, however, this value is not equal to ε in GICM. Note that ε speaks about all possible outputs Y, whereas ε_c concerns one particular output Y. Take the GICM algorithm as an example: assume the input point is X and the output point is Y, the radius of the variation circle C is r, and the distance between X and Y is l. If Y falls outside circle C, i.e., l ≥ r, we can calculate ε_c:
ε_c = (1/(2r)) ln( e^(-ε(l-r)) / e^(-ε(l+r)) ) = ε.
This seems unproblematic. However, when l < r the situation changes. We can still calculate ε_c (the point of C nearest to Y is now Y itself, at distance 0):
ε_c = (1/(2r)) ln( 1 / e^(-ε(l+r)) ) = ε(l + r)/(2r).
This value is smaller than the privacy parameter ε. The result indicates that when the output point Y falls within the variation circle, Y does not leak as much information as expected. Therefore, ε alone cannot accurately describe how much information the mechanism leaks. Considering that the output point Y has its own probability distribution, we choose E(ε_c) to measure the information leakage. Applying this to the GICM algorithm, we get:
Theorem: for the GICM algorithm, the expected privacy consumption is
E(ε_c) = (1 + εr/2)(1 - e^(-εr)) / r,
where r is the radius of the variation circle and ε is the privacy parameter.
Proof: we need to consider two cases: Y falls within the variation circle (l < r) and Y falls outside it (l ≥ r). With X as the coordinate origin, the noise radius l has density ε²l e^(-εl), so:
E(ε_c) = ∫_0^r [ε(l + r)/(2r)] ε²l e^(-εl) dl + ∫_r^∞ ε · ε²l e^(-εl) dl = (1 + εr/2)(1 - e^(-εr)) / r.
As r increases, the expected value E(ε_c) decreases from ε to ε/2, which matches intuition: when r is close to 0, most output points Y fall outside the variation circle, so E(ε_c) is close to ε; when r becomes large enough, the vast majority of output points can be viewed as falling approximately at the center of the variation circle, so E(ε_c) is close to ε/2.
In addition to the expected privacy consumption, we also care about the expected distance between the true point and the published point, i.e., the expected error E(error), which reflects the availability of the published point. E(error) is not difficult to calculate:
Theorem: for the GICM algorithm, the expected error is E(error) = 2/ε.
Proof: again take X as the coordinate origin; then
E(error) = ∫_0^∞ l · ε²l e^(-εl) dl = 2/ε.
this explains why the privacy parameter epsilon cannot be set too small, otherwise the availability of the distribution point will be severely affected.
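Both expectations can be checked numerically. The sketch below assumes the per-output case analysis given above (ε_c = ε when l ≥ r, and ε(l + r)/(2r) when l < r); with ε = 0.02 and r = 100 it should print values near 100 and 0.0173:

```python
import random

def mc_expectations(eps, r, trials=200_000):
    """Monte-Carlo estimates of E(error) and E(eps_c) for GICM."""
    err_sum, cons_sum = 0.0, 0.0
    for _ in range(trials):
        l = random.gammavariate(2, 1.0 / eps)   # noise radius
        err_sum += l                            # the error is exactly l
        # eps_c = eps outside the variation circle, eps*(l+r)/(2r) inside it
        cons_sum += eps if l >= r else eps * (l + r) / (2 * r)
    return err_sum / trials, cons_sum / trials

print(mc_expectations(0.02, 100.0))
```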
Prediction and test mechanism
Although the GICM algorithm works well for a single point, applying it directly to multiple points, such as trajectory data, creates privacy issues. By the composition property of differential privacy, if each point on the trajectory satisfies ε-differential privacy, the total privacy consumption accumulates: the privacy consumption of the entire trajectory is nε, where n is the number of points in the trajectory. Therefore, if we use the GICM algorithm directly to protect our privacy, we cannot perform many location queries before the privacy budget is exhausted.
In view of the above, reducing the privacy consumption is of great interest. Since we consider the online setting, we can only process points one by one, so the problem becomes: how can we reduce privacy consumption when handling a single point? The answer is to exploit historical data. Here, historical data means the published points and other information already known to an attacker that can be used to infer the user's true location (note that using it reveals no additional privacy, since we use no information about the true point). From this historical data we can obtain a predicted point. Although the predicted point is unlikely to fall exactly on the true point, the distance between them will not be too large as long as the prediction is good enough. An intuitive approach is to publish the predicted point directly; its advantage is evident: since no information of the true point is used at all, the privacy consumption is 0. However, since each predicted point is based on previously published data, which are themselves predictions, prediction errors accumulate, eventually rendering the prediction unusable. We therefore want to use a small amount of information about the true position to correct the accumulated prediction error without incurring too much extra privacy consumption. Most importantly, the method we design still satisfies geographic indistinguishability, i.e., differential privacy.
Based on the above discussion, we designed the Predict and Test Mechanism (PTM). Our algorithm still takes the true point X and the privacy parameter ε as inputs and outputs a position Y. Algorithm 2 shows the three main steps. First, it uses Algorithm 1 to generate a noisy point Y' from X; we call this the preliminary published point. Then it uses the historical knowledge held by the attacker to generate a predicted point P (the prediction method is discussed in the next section). In the last step, the algorithm uses a threshold R to test the accuracy of the predicted point and determine the published point: if the distance between Y' and P is less than R, we consider the prediction successful and publish P, as shown in FIG. 1(a); otherwise we consider the test failed and publish Y', as shown in FIG. 1(b).
Algorithm 2 (Predict and Test Mechanism):
Input: a true point X; a threshold R; a privacy parameter ε;
Output: a published point Y;
1: draw a value l from the gamma distribution Γ(2, 1/ε);
2: draw a value θ from the uniform distribution U(0, 2π);
3: Y' = X + (l cos θ, l sin θ);
4: construct the predicted point P;
5: if dis(Y', P) < R, publish P; otherwise publish Y'.
Theorem: the prediction and testing mechanisms satisfy geographic indistinguishability.
And (3) proving that: let us use figure 1 as an illustration. In the figure, we use P as the origin of coordinates, then the coordinates of X are (-d, 0).
When the test fails, the algorithm reverts to the GICM algorithm, which is apparently satisfied with geographic indistinguishability. When the test is successful, the calculation is somewhat complex. If P is issued, then Y' should be within the test circle. The center of the test circle is P and the radius is R. We can therefore calculate Pr (a (x) ═ P):
Figure GDA0002276426220000111
we need to find the maximum and minimum values of Pr (a) (x) ═ P). It can be seen that
Figure GDA0002276426220000112
Decreases with increasing d. That is, the farther X is away from P, the smaller Pr (a) (X) ═ P). We can find a point within the modified circle, which is closest to P, and set the distance between them as dmin. We set dmax in the same way. Then, for any two points X within the variation circle C1And X2We have:
Figure GDA0002276426220000121
we have again:
Figure GDA0002276426220000122
the inequality is true because of the triangle inequality, and dmax-dmin is less than or equal to 2 r. Finally, we have:
Figure GDA0002276426220000123
namely:
Figure GDA0002276426220000124
and (5) finishing the certification.
Note that we cannot calculate E(ε_c) and E(error) for Algorithm 2, since the prediction method is left unspecified. For analysis, however, we can calculate the privacy consumption and error of the algorithm for a given predicted point P.
Theorem: for a given predicted point P, the privacy consumption of the algorithm is:
ε_c = (1/(2r)) ln( Pr(A(X_min) = P) / Pr(A(X_max) = P) ) if the test succeeds, and ε_c = ε(l + r)/(2r) (for l < r) or ε (for l ≥ r) if the test fails;
and the error of the algorithm is:
error = dis(X, P) = d if the test succeeds, and error = dis(X, Y') = l if the test fails.
and (3) proving that: epsilon when the test is successfulcThe calculation of (c) can be derived from the last proof. Epsilon at test failurecCan be derived from the section "expected value of privacy consumption and error". The calculation of the error is trivial and is therefore omitted.
Since the formula for the privacy consumption is complex, it is hard to see directly how much our algorithm reduces it. We therefore ran a series of experiments to observe how ε_c changes with the other variables. Let ε = 0.02, d = 100 (the distance between the true point and the predicted point), R = 300 and r = 100. FIG. 2 shows how ε_c changes with d and R when the test succeeds; the straight line in FIG. 2 represents E(ε_c) of the GICM algorithm, which we use for comparison. FIG. 3 shows how the probability of test success varies with d and R.
From FIG. 2(a) we can see that ε_c increases with d and finally reaches ε. When d is less than 370, ε_c is smaller than the E(ε_c) of GICM; when d is less than 150, ε_c is only about one tenth of it. FIG. 3(a) shows that the smaller d is, the higher the probability of test success. Thus, if our prediction is reasonably accurate, ε_c can be made very small.
FIG. 2(b) shows that ε_c decreases with increasing R and eventually reaches 0, while FIG. 3(b) shows that the probability of test success increases with R and eventually reaches 1. Therefore, a larger R not only reduces the privacy consumption when the test succeeds but also increases the probability of success. However, R should not be too large, otherwise the error becomes unacceptable.
Prediction method and further improvements
As described in the previous section, our method performs prediction using historical data. In this section we present our prediction methods, as well as an improvement that further reduces the privacy consumption. Together they ensure that our algorithm works well in real production use.
Prediction method
We first note that no single method fits all scenarios. We therefore consider three main scenarios and design a prediction method for each: stationary users, slow users and high-speed users. For convenience, we write X_1, X_2, …, X_n for the true points we want to protect, Y_1, Y_2, …, Y_n for the published points, and P_1, P_2, …, P_n for the corresponding predicted points. Suppose we have published the point Y_{i-1} and now want to protect X_i; the first thing to do is construct the predicted point P_i.
Let us first consider the extreme case: the user is stationary. This is quite likely in real life, for example a person sitting in a cafe while querying nearby restaurants or movie theaters. One ideal strategy is to use GICM to protect the first point and then reuse the first published point for all subsequent queries. Privacy is consumed only at the first point, and the overall privacy consumption drops to a constant level. Inspired by this, we let P_i = Y_{i-1}. Although seemingly simple, this method works very well in practice. The reason is simple: the last published point Y_{i-1} is not far from the last true point X_{i-1} (otherwise the last published point had no availability), and X_i = X_{i-1}, so the distance between Y_{i-1} and X_i is not too great either. Together with the skip mechanism described in the next section, we can bring the privacy consumption in this scenario down to a constant level.
The second scenario is a user who moves, but slowly. Such scenarios are also common, for example a user looking at a phone while taking a walk. We found that the prediction method for the stationary scenario also applies here, since a slow user can be regarded as approximately stationary. Another explanation: Y_{i-1} is close to X_{i-1}, and X_{i-1} is close to X_i, so Y_{i-1} is also quite close to X_i.
The situation is more complicated for high-speed users. The previous prediction method does not work well, because X_{i-1} and X_i are usually far apart; the relationship between consecutive true points becomes weak, so accurate prediction is difficult. In some scenarios we can nevertheless predict fairly accurately. We use the prediction P_i = 2Y_{i-1} - Y_{i-2}, based on the idea that the user's heading and speed do not vary much, so we may approximately assume that the user advances the same distance in the same direction each time.
One problem remains: how do we know which scenario we are currently in? Fortunately, this is not difficult to solve: we can determine the scenario using the published points alone. Dense published points indicate a slow or even stationary scenario; sparse published points indicate a high-speed scenario. Note that this process causes no additional privacy consumption, since the attacker can also see the published points.
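A minimal sketch of the three predictors and the scenario detection described above (the specific gap threshold is our own simplification; the patent only states that dense published points indicate a slow or stationary scenario):

```python
import math

def predict_point(published, scenario):
    """P_i = Y_{i-1} for stationary/slow users; P_i = 2*Y_{i-1} - Y_{i-2}
    for high-speed users (roughly constant heading and speed)."""
    if scenario != "high_speed" or len(published) < 2:
        return published[-1]
    (x2, y2), (x1, y1) = published[-2], published[-1]
    return 2 * x1 - x2, 2 * y1 - y2

def detect_scenario(published, gap_threshold):
    """Guess the scenario from the published points alone, so no additional
    privacy is consumed (an attacker sees these points too)."""
    if len(published) < 2:
        return "stationary"
    (x2, y2), (x1, y1) = published[-2], published[-1]
    gap = math.hypot(x1 - x2, y1 - y2)
    return "slow" if gap <= gap_threshold else "high_speed"
```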
Further improvements
Our goal is to reduce the privacy consumption as much as possible without significantly increasing the error. In our experiments, however, we found a very interesting phenomenon: the privacy consumption and the error sometimes decrease simultaneously. This means there is room to reduce the privacy consumption further. Privacy and error are a trade-off: when the error is smaller than we require, we can further reduce the privacy consumption by sacrificing some availability. In other words, our mechanism can be further improved.
The first problem is how to detect this phenomenon. This step matters because the phenomenon does not always occur. A straightforward approach is to calculate the errors of the previous published points and use their average to decide whether to act. However, doing so causes additional privacy consumption, because it uses information about the true points that the attacker does not have. A more suitable quantity is the prediction success rate, i.e., the number of successful predictions divided by the total number of predictions. This variable reflects whether the error is too small, since the smaller the error, the higher the prediction success rate; and it causes no additional privacy consumption, since the attacker can compute the test success rate as well.
The other problem is how to reduce the privacy consumption by sacrificing some availability. A straightforward approach is to adjust the parameters dynamically, for example decreasing the privacy parameter ε or increasing the test radius R. We found these options not ideal, because the privacy they save is not significant. We therefore adopt an efficient and simple approach: skip the test step entirely. When we find the error small enough, we directly use the predicted point as the published point. The privacy consumption then becomes exactly 0, since the true point is not used at all. This approach does not add much error, because in most cases our predictions do not deviate too far.
Based on the above discussion, our skip mechanism is as follows. We denote the number of test successes by s and the number of test failures by f. For the first point, we use the original GICM algorithm and set s = 0 and f = 1. For subsequent points, we use PTM. Each time we predict a point, we generate a random number ran between 0 and 1. If ran < s/(s + f), we skip the test step and use the predicted point as the published point, keeping s and f unchanged; otherwise, we perform the test step and update s and f according to the result. In short, the skip probability equals the test success rate.
Experiments
To understand the actual performance of our algorithm, we performed experiments on two well-known datasets. We consider three scenarios: stationary users, slow users and high-speed users. We compare the corresponding privacy consumption and error with those of the GICM algorithm.
Basic setting
We first present the two data sets used:
(1) Geolife. This dataset was collected by Microsoft Research Asia. It contains 182 users and 17,621 trajectories, with a total length of 1,292,951 km and a total duration of 50,176 hours (from 2007 to August 2012). Most of the locations are in Beijing, China. We mainly use this dataset for the slow scenario, since most movement in it is slow.
(2) T-Drive. This dataset contains the trajectories of 10,357 taxis over one week, with a total length of about 9,000,000 km and about 15,000,000 locations in total. Since taxis move very fast, we use this dataset for high-speed users.
There are three important parameters to set at the start: the radius r of the variation circle, the radius R of the test circle, and the privacy parameter ε. These parameters are correlated, so we cannot set them arbitrarily. For example, if ε = 0.1, then r = 1000 is meaningless, since most published points then have an error below about 40, and 1000 is obviously too large. In our experiments we set r = 100, ε = 0.02, and R = 100, 150, 200 or even 300 (for comparison); the unit is meters. Other parameter values are possible, but we found that the performance of the mechanism does not change much, so we use the above settings as an illustration.
Using E (. epsilon.)c) And e (error) as baseline for comparison. E (. epsilon.) is calculated using the above parameter settingsc) 0.0173, E (error) is 100. Although the corresponding E (. epsilon.) of PTM could not be calculatedc) And e (error), but the average may be used instead and compared to a baseline to illustrate the performance of the mechanism.
Stationary user
In this scenario we use simulated data instead of real data, for a simple reason: the user is stationary, so simulated data is indistinguishable from real data. We generate 100 trajectories with 1000 identical points each, run PTM 100 times on these trajectories, record all privacy consumptions and errors, and plot them as cumulative distribution functions. We then compute the average privacy consumption and error and compare them with E(ε_c) and E(error) of the GICM algorithm. FIG. 4 shows our results.
The three graphs on the left of FIG. 4 (a, c, e) show the cumulative distribution functions of the privacy consumption for different test radii R. FIG. 4(a) shows that when R = 100, about 35% of the privacy consumptions do not exceed ε/10. The computed average privacy consumption is 0.0096, meaning we save 44.51% of the privacy consumption. This result is not dramatic, but in FIG. 4(c), when R = 200, about 85% of the privacy consumptions do not exceed ε/10; the cumulative distribution function of the privacy consumption is very close to the line y = 0.9, and the computed average privacy consumption is 0.0017. That is, we save more than 90% of the privacy consumption. When R reaches 300, FIG. 4(e) shows that most points incur no privacy consumption at all, thanks to our skip mechanism. The average privacy consumption is 0.00023: we save more than 98% of the privacy consumption. This is a huge improvement, almost the ideal case of a constant level.
The three graphs on the right of FIG. 4 (b, d, f) show the cumulative distribution functions of the errors for different R. The three curves are very close, which indicates that our PTM mechanism does not hurt the availability of the published points (sometimes it even improves availability, as in FIG. 4(f)).
Slow speed user
In the slow scenario, where the distance between consecutive points is small, we use the Geolife dataset. Although the users in Geolife use a variety of transport modes, such as walking, bicycles, cars and subways, the vast majority walk, so the dataset is appropriate for the slow scenario. Of course, we can also delete some non-walking trajectories to make the dataset fit our requirements. We randomly chose 1000 trajectories and ran the PTM mechanism 1000 times. As in the stationary scenario, we collect all privacy consumptions and errors, plot their cumulative distribution functions, and then compare the means with E(ε_c) and E(error) of the GICM algorithm.
The results are shown in FIG. 5. The three graphs on the left (a, c, e) show the cumulative distribution functions of the privacy consumption as R varies from 100 to 300. They are very close to those of FIG. 4, which means our mechanism also works well in the slow scenario. In fact, we calculated that our PTM saves 40.46%, 81.5% and 93% of the privacy consumption for R = 100, 200 and 300, respectively.
From the three graphs on the right (b, d, f) we can see that the error increases with R. When R = 100, the three curves are almost identical; as R increases, the cumulative distribution function of the error becomes lower. This is easily explained: when R is too large, errors accumulate more easily, making our predictions less and less accurate. In real life, however, this degree of availability loss is acceptable. We calculated that the errors increase by 5%, 22% and 46% for the different R. The privacy saved clearly outweighs the availability lost.
High speed user
Finally, the high-speed users. In this scenario we use the T-Drive dataset. Most of its data come from taxis, so the users move at relatively high speed. We sample points from the raw data at a fixed frequency to ensure that the distance between two consecutive points exceeds 100 meters in most cases. We again plot the cumulative distribution functions and compute the means. Note that the prediction method used in this scenario differs from that of the first two scenarios.
The results are shown in FIG. 6. Our PTM can still save about half of the privacy consumption: specifically, 28%, 55% and 69%. The reason it cannot save as much as in the first two scenarios is simple: the user moves quickly, so the trajectory is less regular and we cannot predict as accurately. We also note that the privacy savings barely increase as R changes from 200 to 300, which suggests that the accuracy of our prediction method and the test radius R jointly constrain the overall performance of PTM.
FIG. 6 also shows that availability is affected. While the availability loss in FIG. 6(b) and FIG. 6(d) is still acceptable, the loss in FIG. 6(f) is not. In fact, setting R = 300 does not save much more privacy; R = 200 is a good choice. We conclude that an overly large R is generally not useful when the prediction accuracy is not high enough.

Claims (2)

1. A differential privacy method for protecting multiple locations in a location information service, comprising the steps of:
step 1, using a threshold R to test the accuracy of the predicted point and determine the published point; inputting the true trace points X_1, X_2, … that the user wants to protect; drawing a value l from the gamma distribution Γ(2, 1/ε), where ε is a privacy parameter;
step 2, drawing a value θ from the uniform distribution U(0, 2π);
step 3, letting Y_1 = X_1 + (l cos θ, l sin θ);
step 4, outputting Y_1;
step 5, letting i = 2, s = 0 and f = 1;
step 6, drawing a value l from the gamma distribution Γ(2, 1/ε);
step 7, drawing a value θ from the uniform distribution U(0, 2π);
step 8, letting Y_i = X_i + (l cos θ, l sin θ);
step 9, constructing a predicted point P and generating a random number ran, the value of ran being between 0 and 1; if ran < s/(s + f): publishing Y_i = P, setting i = i + 1, and returning to step 6; otherwise: executing step 10;
step 10, if dis(Y_i, P) < R: publishing Y_i = P and setting s = s + 1; otherwise: publishing Y_i and setting f = f + 1;
step 11, setting i = i + 1 and returning to step 6;
in the above steps, Y_i denotes the i-th published point, s denotes the number of test successes, f denotes the number of test failures, and dis(Y_i, P) denotes the Euclidean distance between Y_i and P.
2. The differential privacy method for protecting multiple locations in a location information service of claim 1, wherein the method of constructing the predicted point P is:
when the user is stationary, Pi=Yi-1
When the user moves slowly, Pi=Yi-1
When the user moves at high speed, Pi=2Yi-1-Yi-2
CN201710433690.3A 2017-06-09 2017-06-09 Differential privacy method for protecting multiple positions in position information service Active CN107247909B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710433690.3A CN107247909B (en) 2017-06-09 2017-06-09 Differential privacy method for protecting multiple positions in position information service


Publications (2)

Publication Number Publication Date
CN107247909A CN107247909A (en) 2017-10-13
CN107247909B true CN107247909B (en) 2020-05-05

Family

ID=60019278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710433690.3A Active CN107247909B (en) 2017-06-09 2017-06-09 Differential privacy method for protecting multiple positions in position information service

Country Status (1)

Country Link
CN (1) CN107247909B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110022531B (en) * 2019-03-01 2021-01-19 华南理工大学 Localized differential privacy urban garbage data report and privacy calculation method
CN110516476B (en) * 2019-08-31 2022-05-13 贵州大学 Geographical indistinguishable location privacy protection method based on frequent location classification
CN110633402B (en) * 2019-09-20 2021-05-04 东北大学 Three-dimensional space-time information propagation prediction method with differential privacy mechanism
CN112487471B (en) * 2020-10-27 2022-01-28 重庆邮电大学 Differential privacy publishing method and system of associated metadata
CN114065287B (en) * 2021-11-18 2024-05-07 南京航空航天大学 Track differential privacy protection method and system for resisting predictive attack


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100077484A1 (en) * 2008-09-23 2010-03-25 Yahoo! Inc. Location tracking permissions and privacy

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104135362A (en) * 2014-07-21 2014-11-05 南京大学 Availability computing method of data published based on differential privacy
CN105095447A (en) * 2015-07-24 2015-11-25 武汉大学 Distributed w-event differential privacy infinite streaming data distribution method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Location Privacy via Differential Private Perturbation of Cloaking Area; Hoa Ngo et al.; 2015 IEEE 28th Computer Security Foundations Symposium; 2015-09-07; full text *
Privacy protection against big data analysis: research status and progress; Tong Wei et al.; Chinese Journal of Network and Information Security; 2016-04-30; Vol. 2, No. 4; full text *
A location privacy protection method with random anonymity; Yang Songtao et al.; Journal of Harbin Engineering University; 2015-03-31; Vol. 36, No. 3; full text *

Also Published As

Publication number Publication date
CN107247909A (en) 2017-10-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant