CN107247909B - Differential privacy method for protecting multiple positions in position information service - Google Patents

Differential privacy method for protecting multiple positions in position information service

Info

Publication number
CN107247909B
CN107247909B
Authority
CN
China
Prior art keywords
privacy
consumption
point
user
test
Prior art date
Legal status
Active
Application number
CN201710433690.3A
Other languages
Chinese (zh)
Other versions
CN107247909A (en)
Inventor
朱马克 (Zhu Make)
华景煜 (Hua Jingyu)
仲盛 (Zhong Sheng)
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201710433690.3A
Publication of CN107247909A
Application granted
Publication of CN107247909B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G06F 21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218: Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245: Protecting personal data, e.g. for financial or medical purposes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2221/00: Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 2221/21: Indexing scheme relating to G06F21/00 and subgroups, addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 2221/2111: Location-sensitive, e.g. geographical location, GPS

Abstract

The invention discloses a differential privacy method for protecting multiple locations in location information services, which improves the original geo-indistinguishability algorithm and provides a predict-and-test mechanism. The method reduces the overall privacy consumption by spending a small amount of privacy to construct an approximation of the true location. It can greatly reduce privacy consumption without unduly compromising data availability. To test the performance of the proposed mechanism, we performed experiments on two popular datasets. The results show that our mechanism indeed greatly reduces the privacy consumption while guaranteeing the availability of the data.

Description

Differential privacy method for protecting multiple positions in position information service
Technical Field
The invention relates to a differential privacy method for protecting a plurality of positions in a position information service, belonging to the technical field of position information prediction.
Background
In recent years, with the popularization of GPS-equipped smartphones, location information services have played an increasingly important role in people's lives. Almost all smartphones use the user's location data, explicitly or implicitly, for a variety of purposes. For example, Facebook uses the user's location data to find friends near the user, and news applications use it to push local news.
Unfortunately, although location information services provide great convenience to our lives, they also raise serious privacy concerns. Users are often reluctant to expose their real-time location data to third parties, including service providers, because once users upload their location data they cannot control what third parties do with it. A malicious third party may use the location data to track a user and thereby infer the user's home address, locations of interest, and even sensitive information such as health conditions or religious beliefs. There is therefore an urgent need for privacy-preserving location information services that hide users' location information while maintaining high quality of service.
Existing solutions to this problem fall into two broad categories. The first is cryptographic methods, which encrypt the location data before the user uploads it. They can completely protect the user's personal privacy and provide provable guarantees. However, encryption severely impairs the availability of the data, making it difficult for service providers to offer valuable location services; moreover, cryptographic methods are often very time-consuming, which is intolerable on handheld devices. The second category perturbs the raw data, preventing third parties from obtaining the user's exact location. Compared with cryptographic methods, perturbation methods are much lighter weight and compromise data availability far less, so third parties are also more willing to accept them. However, the security of this approach remains questionable, as it still reveals (inexact) data about the user.
More recently, Andres et al proposed geographic indistinguishability, the first formal privacy model based on location perturbation. This model evolved from differential privacy and can provide provable privacy guarantees. In particular, a perturbation mechanism satisfies geographic indistinguishability if any two locations that are less than a given threshold apart produce the same output location with close probability. Thus, a third party cannot single out the user's true location within a range of nearby locations. Andres et al designed a corresponding algorithm that satisfies geographic indistinguishability by adding planar (two-dimensional) Laplacian noise. However, this mechanism is primarily designed to protect a single location. If it is used directly to protect multiple locations, the overall privacy consumption grows with the number of locations. This means the number of location queries a user can perform is very limited; otherwise the privacy budget is quickly exhausted and the user's privacy is destroyed. This limitation is unacceptable, because many real-life location service providers use a user's location data many times a day or even an hour. For example, when using online navigation, a driver may issue a location query every few seconds to obtain real-time road information. There is therefore a strong need to improve the existing geo-indistinguishable perturbation mechanism so as to reduce privacy consumption as much as possible while maintaining the original data availability.
Disclosure of Invention
The purpose of the invention is as follows: as location information services become more prevalent and the privacy issues they raise grow, users are often reluctant to expose their locations to third parties (including service providers). To address this problem, geographic indistinguishability, a variant of differential privacy, has been proposed. This privacy model makes neighboring locations produce the same output location with similar probability by adding noise to the original location. However, the method was originally designed to protect a single location; if used directly to protect multiple locations, the overall privacy consumption increases rapidly with the number of locations, leaving the user a very limited number of location queries. The invention provides a differential privacy method for protecting multiple locations in location information services: it improves the original geo-indistinguishability algorithm and proposes a Predict and Test Mechanism (PTM). The method reduces the overall privacy consumption by spending a small amount of privacy to construct an approximation of the true location. It can greatly reduce privacy consumption without unduly compromising data availability. To test the performance of the proposed mechanism, we performed experiments on two popular datasets. The results show that our mechanism indeed greatly reduces the privacy consumption while guaranteeing the availability of the data.
The technical scheme is as follows: a differential privacy method for protecting multiple locations in a location information service, comprising the steps of:
Step 1: use a threshold R (usually set to 1-2 times the radius of the variation circle) to test the accuracy of the predicted point and determine the published point; input the true trace points X_1, X_2, … that the user wants to protect; draw a value l from the gamma distribution Γ(2, 1/ε), whose density is
f(x) = x^(α-1) e^(-x/β) / (Γ(α) β^α) = ε² x e^(-εx), x ≥ 0,
where α = 2, β = 1/ε, x is the random variable (after sampling an x, set l = x), and ε is the user-specified privacy parameter;
Step 2: draw a value θ from the uniform distribution U(0, 2π), whose density is
f(x) = 1/(b - a), a ≤ x ≤ b,
where a = 0, b = 2π, x is the random variable (after sampling an x, set θ = x);
Step 3: let Y_1 = X_1 + (l cos θ, l sin θ);
Step 4: publish Y_1;
Step 5: let i = 2, s = 0 and f = 1;
Step 6: draw a value l from the gamma distribution Γ(2, 1/ε);
Step 7: draw a value θ from the uniform distribution U(0, 2π);
Step 8: let Y_i = X_i + (l cos θ, l sin θ);
Step 9: construct the predicted point P and generate a random number ran between 0 and 1; if ran < s/(s + f): publish Y_i = P, set i = i + 1, and return to step 6; otherwise execute step 10;
Step 10: if dis(Y_i, P) < R: publish Y_i = P and set s = s + 1; otherwise publish Y_i and set f = f + 1;
Step 11: set i = i + 1 and return to step 6.
The method for constructing the predicted point P is:
when the user is stationary: P_i = Y_{i-1};
when the user moves slowly (typically meaning that the distance between two consecutive true positions does not exceed R): P_i = Y_{i-1};
when the user moves at high speed (typically meaning that the distance between two consecutive true positions exceeds R): P_i = 2Y_{i-1} - Y_{i-2}.
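For illustration only, the steps above can be sketched in Python roughly as follows (a minimal sketch, not the patented implementation; the function names and the choice of predictor passed in are ours):

```python
import math
import random

def gi_noise(eps):
    """Planar Laplacian noise: radius l ~ Gamma(2, 1/eps), angle ~ U(0, 2*pi)."""
    l = random.gammavariate(2, 1.0 / eps)
    theta = random.uniform(0.0, 2.0 * math.pi)
    return l * math.cos(theta), l * math.sin(theta)

def dis(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def ptm_publish(true_points, eps, R, predict):
    """Steps 1-11: publish one perturbed point per true point.
    `predict` maps the list of already-published points to a predicted point P."""
    noise = gi_noise(eps)                               # steps 1-3
    x0 = true_points[0]
    published = [(x0[0] + noise[0], x0[1] + noise[1])]  # step 4
    s, f = 0, 1                                         # step 5
    for X in true_points[1:]:
        nx, ny = gi_noise(eps)                          # steps 6-8
        Y = (X[0] + nx, X[1] + ny)
        P = predict(published)                          # step 9: predicted point
        if random.random() < s / (s + f):
            published.append(P)                         # skip the test, publish P
        elif dis(Y, P) < R:                             # step 10: the test
            published.append(P)
            s += 1
        else:
            published.append(Y)
            f += 1
    return published

# Example: a stationary user, predicting with the last published point.
track = ptm_publish([(0.0, 0.0)] * 50, eps=0.02, R=200.0,
                    predict=lambda pts: pts[-1])
```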
Advantageous effects: compared with the prior art, the differential privacy method for protecting multiple locations in location information services provided by the invention has the following advantages:
(1) When a user wants to perform a new round of location queries, the invention first predicts the user's current location based on the past query history (i.e., the user's published points). The predicted location is then compared with the location obtained by the original geo-indistinguishable perturbation mechanism. If the distance between the two locations is less than a predefined threshold, we consider the prediction successful and use the predicted location for the query; otherwise the prediction is considered failed, and the location obtained by the original geo-indistinguishable perturbation mechanism is used. We prove that this mechanism still satisfies geographic indistinguishability and that the privacy consumption is significantly reduced.
(2) Simple and efficient prediction methods are designed for the three main scenarios (stationary users, slow users and high-speed users). We also add a skip strategy for further improvement: it senses the availability of the user's published points and trades a little availability for a further reduction in privacy consumption. Based on this method, the privacy consumption is greatly reduced; in some scenarios it is even reduced to a constant level.
(3) Experiments were performed on real datasets to verify the performance of the invention, based on two popular datasets: Geolife and T-Drive. Experiments were run in three scenarios (stationary, slow and high-speed), and the corresponding privacy consumption and availability were evaluated. The results show that the method saves 98%, 81% and 55% of privacy consumption in the three scenarios respectively, while the availability of user data is not greatly affected.
Drawings
FIG. 1(a) is a schematic diagram of the position of Y' when the test succeeds;
FIG. 1(b) is a schematic diagram of the position of Y' when the test fails;
FIG. 2(a) is a graph of how ε_c varies with d when the test succeeds;
FIG. 2(b) is a graph of how ε_c varies with R when the test succeeds;
FIG. 3(a) is a graph of the test success probability as a function of d;
FIG. 3(b) is a graph of the test success probability as a function of R;
FIG. 4(a) is the cumulative distribution function of the privacy consumption for a stationary user when R = 100;
FIG. 4(b) is the cumulative distribution function of the error for a stationary user when R = 100;
FIG. 4(c) is the cumulative distribution function of the privacy consumption for a stationary user when R = 200;
FIG. 4(d) is the cumulative distribution function of the error for a stationary user when R = 200;
FIG. 4(e) is the cumulative distribution function of the privacy consumption for a stationary user when R = 300;
FIG. 4(f) is the cumulative distribution function of the error for a stationary user when R = 300;
FIG. 5(a) is the cumulative distribution function of the privacy consumption for a slow user when R = 100;
FIG. 5(b) is the cumulative distribution function of the error for a slow user when R = 100;
FIG. 5(c) is the cumulative distribution function of the privacy consumption for a slow user when R = 200;
FIG. 5(d) is the cumulative distribution function of the error for a slow user when R = 200;
FIG. 5(e) is the cumulative distribution function of the privacy consumption for a slow user when R = 300;
FIG. 5(f) is the cumulative distribution function of the error for a slow user when R = 300;
FIG. 6(a) is the cumulative distribution function of the privacy consumption for a high-speed user when R = 100;
FIG. 6(b) is the cumulative distribution function of the error for a high-speed user when R = 100;
FIG. 6(c) is the cumulative distribution function of the privacy consumption for a high-speed user when R = 200;
FIG. 6(d) is the cumulative distribution function of the error for a high-speed user when R = 200;
FIG. 6(e) is the cumulative distribution function of the privacy consumption for a high-speed user when R = 300;
FIG. 6(f) is the cumulative distribution function of the error for a high-speed user when R = 300.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
Differential privacy and location information protection
This section introduces the concept of differential privacy and its extension to location information services (i.e., the prior-art geographic indistinguishability). We then analyze the expected privacy consumption and expected error of the perturbation mechanism designed by Andres et al.; these serve as the comparison baselines for our experiments.
Differential privacy
Differential privacy has become a very popular privacy model because it provides a provable privacy guarantee. The key idea is that the output of a statistical database should not change much when at most one data record is modified. In other words, the loss of one individual's data has a very limited effect on the overall output, thereby protecting individual privacy. Differential privacy is defined as follows:
definition (ε -differential privacy): one publication mechanism a satisfies epsilon-differential privacy if and only if a satisfies the following condition: for any adjacent data sets D and D' (i.e., they differ by at most one record), we have the arbitrary output Z ∈ range (A):
Figure GDA0002276426220000061
Here ε measures the privacy level of the publication mechanism: the smaller the value, the stronger the privacy guarantee.
To satisfy differential privacy, we need to add noise to the raw output of the database. The amplitude of the noise is determined by the sensitivity, which is defined as follows:
Definition (sensitivity): for any adjacent datasets D and D' and a query function f: D → R^d, the sensitivity is defined as:
Δf = max_{D,D'} ‖f(D) - f(D')‖_1
the laplacian mechanism is a commonly used algorithm to implement differential privacy. It achieves differential privacy by adding noise to each output that meets a certain laplacian distribution, namely:
Definition (Laplace mechanism): for an arbitrary query function f: D → R^d, if a mechanism A outputs f(D) + Lap(Δf/ε), then A satisfies ε-differential privacy, where Lap(Δf/ε) denotes noise drawn from the Laplace distribution with scale Δf/ε in each coordinate.
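For illustration (not part of the patent), a minimal sketch of the Laplace mechanism on a counting query, whose sensitivity is 1, might look as follows; the query and parameter values are our own example:

```python
import numpy as np

def dp_count(data, predicate, eps):
    """Laplace mechanism for a counting query: adding or removing one record
    changes the count by at most 1, so the sensitivity is 1 and the noise
    scale is delta_f / eps = 1 / eps."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / eps)

# A noisy count of the records greater than 4, with privacy parameter 0.5.
print(dp_count([3, 7, 1, 9, 4], lambda x: x > 4, eps=0.5))
```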
However, it is difficult to directly apply standard differential privacy to location protection because locations do not have the concept of adjacency and the distance between different locations is arbitrary and continuous. To apply differential privacy to location protection, we need a generalized definition of differential privacy, as follows:
Definition (generalized ε-differential privacy): a publication mechanism A satisfies generalized ε-differential privacy if and only if, for any two datasets D and D' that differ in at most k records and any output Z ∈ Range(A):
Pr[A(D) = Z] ≤ e^(kε) · Pr[A(D') = Z]
this privacy definition shows that the larger the database changes, the larger the change in the probability distribution of the output we allow. This is the basis for applying differential privacy to location protection.
Location privacy protection
In view of the characteristics of two-dimensional space and the practical application requirements, we need to make some changes when applying differential privacy to location privacy.
First, location publishing has no query function f. To simplify the problem, we can assume one: the identity function, whose output exactly equals its input. This facilitates the definition of subsequent concepts, such as sensitivity.
Second, location publishing has no notion of adjacent locations, which is why we use generalized differential privacy. We need a metric describing the difference between locations, analogous to the number of records by which two databases differ; the Euclidean distance is clearly a good choice.
Finally, we cannot protect the privacy of all points on earth. We preset a threshold 2r and only protect privacy for locations at distance at most 2r. In other words, we draw a circle centered at the true position with radius r and only need to ensure that any two points in the circle produce the same output with similar probability. We refer to this circle as the variation circle C.
Based on the above discussion, we have the following definitions:
Definition (geo-indistinguishability): a publication mechanism A satisfies generalized ε-geographic indistinguishability if and only if, for any two positions X_1 and X_2 within a variation circle C of radius r and any output position Y, we have:
Pr[A(X_1) = Y] ≤ e^(2rε) · Pr[A(X_2) = Y]
Obviously, for any two points X_1 and X_2 in C we have dis(X_1, X_2) ≤ 2r, where dis(X_1, X_2) denotes the Euclidean distance between the two points.
To satisfy the geographical indistinguishability, we need to add a noise vector to the true location. Andres et al introduced a simple and general mechanism as follows:
Algorithm 1 (geo-indistinguishable continuous mechanism):
Input: a true point X; a privacy parameter ε;
Output: a published point Y;
1: draw a value l from the gamma distribution Γ(2, 1/ε);
2: draw a value θ from the uniform distribution U(0, 2π);
3: Y = X + (l cos θ, l sin θ);
4: output the published point Y.
The correctness of the algorithm has been proven and reference is made to the paper of Andres et al. For convenience, we will refer to this algorithm as the GICM algorithm.
Expected value of privacy consumption and error
Many prior works use ε as the comparison baseline in their experiments. However, due to the particularities of location privacy protection, we found that using ε directly to measure the security level of a publication mechanism is problematic. Our analysis is as follows: the privacy consumption, which we denote ε_c, is a better measure of privacy strength. In the Laplace mechanism we can calculate this value by the formula
ε_c = ln( Pr[A(D) = Z] / Pr[A(D') = Z] ),
where ε_c indicates how much new knowledge we can obtain by observing the output Z; in the Laplace mechanism this value (in the worst case over adjacent D and D') equals ε. Likewise, for the GICM algorithm we can define the privacy consumption ε_c in the same way, normalized by the diameter 2r of the variation circle:
ε_c = (1/(2r)) · max_{X_1,X_2 ∈ C} ln( Pr[A(X_1) = Y] / Pr[A(X_2) = Y] ).
The above definition shows how much new knowledge can be obtained after observing the output point Y. Unlike in the Laplace mechanism, however, this value is not equal to ε in GICM. Note that ε speaks about all possible outputs Y, whereas ε_c concerns one particular output Y. Take the GICM algorithm as an example: assume the input point is X and the output point is Y, the radius of the variation circle C is r, and the distance between X and Y is l. If Y falls outside circle C, i.e., l ≥ r, we can calculate ε_c:
ε_c = (1/(2r)) ln( e^(-ε(l-r)) / e^(-ε(l+r)) ) = ε.
This seems unproblematic. However, when l < r the situation changes. We can still calculate ε_c (the point of C nearest to Y is now Y itself, at distance 0):
ε_c = (1/(2r)) ln( 1 / e^(-ε(l+r)) ) = ε(l + r)/(2r).
This value is smaller than the privacy parameter ε. The result indicates that when the output point Y falls within the variation circle, Y does not leak as much information as expected. Therefore, ε alone cannot accurately describe how much information the mechanism leaks. Considering that the output point Y has its own probability distribution, we choose E(ε_c) to measure the information leakage. Applying this to the GICM algorithm, we get:
Theorem: for the GICM algorithm, the expected privacy consumption is
E(ε_c) = (1 + εr/2)(1 - e^(-εr)) / r,
where r is the radius of the variation circle and ε is the privacy parameter.
Proof: we need to consider two cases: Y falls within the variation circle (l < r) and Y falls outside it (l ≥ r). With X as the coordinate origin, the noise radius l has density ε²l e^(-εl), so:
E(ε_c) = ∫_0^r [ε(l + r)/(2r)] ε²l e^(-εl) dl + ∫_r^∞ ε · ε²l e^(-εl) dl = (1 + εr/2)(1 - e^(-εr)) / r.
As r increases, the expected value E(ε_c) decreases from ε to ε/2, which matches intuition: when r is close to 0, most output points Y fall outside the variation circle, so E(ε_c) is close to ε; when r becomes large enough, the vast majority of output points can be viewed as falling approximately at the center of the variation circle, so E(ε_c) is close to ε/2.
In addition to the expected privacy consumption, we also care about the expected distance between the true point and the published point, i.e., the expected error E(error), which reflects the availability of the published point. E(error) is not difficult to calculate:
Theorem: for the GICM algorithm, the expected error is E(error) = 2/ε.
Proof: again take X as the coordinate origin; then
E(error) = ∫_0^∞ l · ε²l e^(-εl) dl = 2/ε.
this explains why the privacy parameter epsilon cannot be set too small, otherwise the availability of the distribution point will be severely affected.
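Both expectations can be checked numerically. The sketch below assumes the per-output case analysis given above (ε_c = ε when l ≥ r, and ε(l + r)/(2r) when l < r); with ε = 0.02 and r = 100 it should print values near 100 and 0.0173:

```python
import random

def mc_expectations(eps, r, trials=200_000):
    """Monte-Carlo estimates of E(error) and E(eps_c) for GICM."""
    err_sum, cons_sum = 0.0, 0.0
    for _ in range(trials):
        l = random.gammavariate(2, 1.0 / eps)   # noise radius
        err_sum += l                            # the error is exactly l
        # eps_c = eps outside the variation circle, eps*(l+r)/(2r) inside it
        cons_sum += eps if l >= r else eps * (l + r) / (2 * r)
    return err_sum / trials, cons_sum / trials

print(mc_expectations(0.02, 100.0))
```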
Prediction and test mechanism
Although the GICM algorithm works well for a single point, applying it directly to multiple points, such as trajectory data, creates privacy issues. By the composition property of differential privacy, if each point on the trajectory satisfies ε-differential privacy, the total privacy consumption accumulates: the privacy consumption of the entire trajectory is nε, where n is the number of points in the trajectory. Therefore, if we use the GICM algorithm directly to protect our privacy, we cannot perform many location queries before the privacy budget is exhausted.
In view of the above, reducing the privacy consumption is of great interest. Since we consider the online setting, we can only process points one by one, so the problem becomes: how can we reduce privacy consumption when handling a single point? The answer is to exploit historical data. Here, historical data means the published points and other information already known to an attacker that can be used to infer the user's true location (note that using it reveals no additional privacy, since we use no information about the true point). From this historical data we can obtain a predicted point. Although the predicted point is unlikely to fall exactly on the true point, the distance between them will not be too large as long as the prediction is good enough. An intuitive approach is to publish the predicted point directly; its advantage is evident: since no information of the true point is used at all, the privacy consumption is 0. However, since each predicted point is based on previously published data, which are themselves predictions, prediction errors accumulate, eventually rendering the prediction unusable. We therefore want to use a small amount of information about the true position to correct the accumulated prediction error without incurring too much extra privacy consumption. Most importantly, the method we design still satisfies geographic indistinguishability, i.e., differential privacy.
Based on the above discussion, we designed the Predict and Test Mechanism (PTM). Our algorithm still takes the true point X and the privacy parameter ε as inputs and outputs a position Y. Algorithm 2 shows the three main steps. First, it uses Algorithm 1 to generate a noisy point Y' from X; we call this the preliminary published point. Then it uses the historical knowledge held by the attacker to generate a predicted point P (the prediction method is discussed in the next section). In the last step, the algorithm uses a threshold R to test the accuracy of the predicted point and determine the published point: if the distance between Y' and P is less than R, we consider the prediction successful and publish P, as shown in FIG. 1(a); otherwise we consider the test failed and publish Y', as shown in FIG. 1(b).
Algorithm 2 (Predict and Test Mechanism):
Input: a true point X; a threshold R; a privacy parameter ε;
Output: a published point Y;
1: draw a value l from the gamma distribution Γ(2, 1/ε);
2: draw a value θ from the uniform distribution U(0, 2π);
3: Y' = X + (l cos θ, l sin θ);
4: construct the predicted point P;
5: if dis(Y', P) < R, publish P; otherwise publish Y'.
Theorem: the prediction and testing mechanisms satisfy geographic indistinguishability.
And (3) proving that: let us use figure 1 as an illustration. In the figure, we use P as the origin of coordinates, then the coordinates of X are (-d, 0).
When the test fails, the algorithm reverts to the GICM algorithm, which is apparently satisfied with geographic indistinguishability. When the test is successful, the calculation is somewhat complex. If P is issued, then Y' should be within the test circle. The center of the test circle is P and the radius is R. We can therefore calculate Pr (a (x) ═ P):
Figure GDA0002276426220000111
we need to find the maximum and minimum values of Pr (a) (x) ═ P). It can be seen that
Figure GDA0002276426220000112
Decreases with increasing d. That is, the farther X is away from P, the smaller Pr (a) (X) ═ P). We can find a point within the modified circle, which is closest to P, and set the distance between them as dmin. We set dmax in the same way. Then, for any two points X within the variation circle C1And X2We have:
Figure GDA0002276426220000121
we have again:
Figure GDA0002276426220000122
the inequality is true because of the triangle inequality, and dmax-dmin is less than or equal to 2 r. Finally, we have:
Figure GDA0002276426220000123
namely:
Figure GDA0002276426220000124
and (5) finishing the certification.
Note that we cannot calculate E(ε_c) and E(error) for Algorithm 2, since the prediction method is left unspecified. For analysis, however, we can calculate the privacy consumption and error of the algorithm for a given predicted point P.
Theorem: for a given predicted point P, the privacy consumption of the algorithm is:
ε_c = (1/(2r)) ln( Pr(A(X_min) = P) / Pr(A(X_max) = P) ) if the test succeeds, and ε_c = ε(l + r)/(2r) (for l < r) or ε (for l ≥ r) if the test fails;
and the error of the algorithm is:
error = dis(X, P) = d if the test succeeds, and error = dis(X, Y') = l if the test fails.
and (3) proving that: epsilon when the test is successfulcThe calculation of (c) can be derived from the last proof. Epsilon at test failurecCan be derived from the section "expected value of privacy consumption and error". The calculation of the error is trivial and is therefore omitted.
Since the formula for the privacy consumption is complex, it is hard to see directly how much our algorithm reduces it. We therefore ran a series of experiments to observe how ε_c changes with the other variables. Let ε = 0.02, d = 100 (the distance between the true point and the predicted point), R = 300 and r = 100. FIG. 2 shows how ε_c changes with d and R when the test succeeds; the straight line in FIG. 2 represents E(ε_c) of the GICM algorithm, which we use for comparison. FIG. 3 shows how the probability of test success varies with d and R.
From FIG. 2(a) we can see that ε_c increases with d and finally reaches ε. When d is less than 370, ε_c is smaller than the E(ε_c) of GICM; when d is less than 150, ε_c is only about one tenth of it. FIG. 3(a) shows that the smaller d is, the higher the probability of test success. Thus, if our prediction is reasonably accurate, ε_c can be made very small.
FIG. 2(b) shows that ε_c decreases with increasing R and eventually reaches 0, while FIG. 3(b) shows that the probability of test success increases with R and eventually reaches 1. Therefore, a larger R not only reduces the privacy consumption when the test succeeds but also increases the probability of success. However, R should not be too large, otherwise the error becomes unacceptable.
Prediction method and further improvements
As described in the previous section, our method performs prediction using historical data. In this section we present our prediction methods, as well as an improvement that further reduces the privacy consumption. Together they ensure that our algorithm works well in real production use.
Prediction method
We first note that no single method fits all scenarios. We therefore consider three main scenarios and design a prediction method for each: stationary users, slow users and high-speed users. For convenience, we write X_1, X_2, …, X_n for the true points we want to protect, Y_1, Y_2, …, Y_n for the published points, and P_1, P_2, …, P_n for the corresponding predicted points. Suppose we have published the point Y_{i-1} and now want to protect X_i; the first thing to do is construct the predicted point P_i.
Let us first consider the extreme case: the user is stationary. This is quite likely in real life, for example a person sitting in a cafe while querying nearby restaurants or movie theaters. One ideal strategy is to use GICM to protect the first point and then reuse the first published point for all subsequent queries. Privacy is consumed only at the first point, and the overall privacy consumption drops to a constant level. Inspired by this, we let P_i = Y_{i-1}. Although seemingly simple, this method works very well in practice. The reason is simple: the last published point Y_{i-1} is not far from the last true point X_{i-1} (otherwise the last published point had no availability), and X_i = X_{i-1}, so the distance between Y_{i-1} and X_i is not too great either. Together with the skip mechanism described in the next section, we can bring the privacy consumption in this scenario down to a constant level.
The second scenario is a user who moves, but slowly. Such scenarios are also common, for example a user looking at a phone while taking a walk. We found that the prediction method for the stationary scenario also applies here, since a slow user can be regarded as approximately stationary. Another explanation: Y_{i-1} is close to X_{i-1}, and X_{i-1} is close to X_i, so Y_{i-1} is also quite close to X_i.
The situation is more complicated for high-speed users. The previous prediction method does not work well, because X_{i-1} and X_i are usually far apart; the relationship between consecutive true points becomes weak, so accurate prediction is difficult. In some scenarios we can nevertheless predict fairly accurately. We use the prediction P_i = 2Y_{i-1} - Y_{i-2}, based on the idea that the user's heading and speed do not vary much, so we may approximately assume that the user advances the same distance in the same direction each time.
One problem remains: how do we know which scenario we are currently in? Fortunately, this is not difficult to solve: we can determine the scenario using the published points alone. Dense published points indicate a slow or even stationary scenario; sparse published points indicate a high-speed scenario. Note that this process causes no additional privacy consumption, since the attacker can also see the published points.
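A minimal sketch of the three predictors and the scenario detection described above (the specific gap threshold is our own simplification; the patent only states that dense published points indicate a slow or stationary scenario):

```python
import math

def predict_point(published, scenario):
    """P_i = Y_{i-1} for stationary/slow users; P_i = 2*Y_{i-1} - Y_{i-2}
    for high-speed users (roughly constant heading and speed)."""
    if scenario != "high_speed" or len(published) < 2:
        return published[-1]
    (x2, y2), (x1, y1) = published[-2], published[-1]
    return 2 * x1 - x2, 2 * y1 - y2

def detect_scenario(published, gap_threshold):
    """Guess the scenario from the published points alone, so no additional
    privacy is consumed (an attacker sees these points too)."""
    if len(published) < 2:
        return "stationary"
    (x2, y2), (x1, y1) = published[-2], published[-1]
    gap = math.hypot(x1 - x2, y1 - y2)
    return "slow" if gap <= gap_threshold else "high_speed"
```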
Further improvements
Our goal is to reduce the privacy consumption as much as possible without significantly increasing the error. In our experiments, however, we found a very interesting phenomenon: the privacy consumption and the error sometimes decrease simultaneously. This means there is room to reduce the privacy consumption further. Privacy and error are a trade-off: when the error is smaller than we require, we can further reduce the privacy consumption by sacrificing some availability. In other words, our mechanism can be further improved.
The first problem is how to detect this phenomenon. This step matters because the phenomenon does not always occur. A straightforward approach is to calculate the errors of the previous published points and use their average to decide whether to act. However, doing so causes additional privacy consumption, because it uses information about the true points that the attacker does not have. A more suitable quantity is the prediction success rate, i.e., the number of successful predictions divided by the total number of predictions. This variable reflects whether the error is too small, since the smaller the error, the higher the prediction success rate; and it causes no additional privacy consumption, since the attacker can compute the test success rate as well.
The other problem is how to reduce the privacy consumption by sacrificing some availability. A straightforward approach is to adjust the parameters dynamically, for example decreasing the privacy parameter ε or increasing the test radius R. We found these options not ideal, because the privacy they save is not significant. We therefore adopt an efficient and simple approach: skip the test step entirely. When we find the error small enough, we directly use the predicted point as the published point. The privacy consumption then becomes exactly 0, since the true point is not used at all. This approach does not add much error, because in most cases our predictions do not deviate too far.
Based on the above discussion, our skip mechanism is as follows. We denote the number of test successes by s and the number of test failures by f. For the first point, we use the original GICM algorithm and set s = 0 and f = 1. For subsequent points, we use PTM. Each time we predict a point, we generate a random number ran between 0 and 1. If ran < s/(s + f), we skip the test step and use the predicted point as the published point, keeping s and f unchanged; otherwise, we perform the test step and update s and f according to the result. In short, the skip probability equals the test success rate.
Experiments
To understand the actual performance of our algorithm, we performed experiments on two well-known datasets. We consider three scenarios: stationary users, slow users and high-speed users. We compare the corresponding privacy consumption and error with those of the GICM algorithm.
Basic setting
We first present the two data sets used:
(1) Geolife. This dataset was collected by Microsoft Research Asia. It contains 182 users and 17,621 trajectories, with a total length of 1,292,951 km and a total duration of 50,176 hours (from 2007 to August 2012). Most of the locations are in Beijing, China. We mainly use this dataset for the slow scenario, since most movement in it is slow.
(2) T-Drive. This dataset contains the trajectories of 10,357 taxis over one week, with a total length of about 9,000,000 km and about 15,000,000 locations in total. Since taxis move very fast, we use this dataset for high-speed users.
There are three important parameters to set at the start: the radius r of the variation circle, the radius R of the test circle, and the privacy parameter ε. These parameters are correlated, so we cannot set them arbitrarily. For example, if ε = 0.1, then r = 1000 is meaningless, since most published points then have an error below about 40, and 1000 is obviously too large. In our experiments we set r = 100, ε = 0.02, and R = 100, 150, 200 or even 300 (for comparison); the unit is meters. Other parameter values are possible, but we found that the performance of the mechanism does not change much, so we use the above settings as an illustration.
Using E (. epsilon.)c) And e (error) as baseline for comparison. E (. epsilon.) is calculated using the above parameter settingsc) 0.0173, E (error) is 100. Although the corresponding E (. epsilon.) of PTM could not be calculatedc) And e (error), but the average may be used instead and compared to a baseline to illustrate the performance of the mechanism.
Stationary user
In this scenario we use simulated data instead of real data, for a simple reason: the user is stationary, so simulated data is indistinguishable from real data. We generate 100 trajectories with 1000 identical points each, run PTM 100 times on these trajectories, record all privacy consumptions and errors, and plot them as cumulative distribution functions. We then compute the average privacy consumption and error and compare them with E(ε_c) and E(error) of the GICM algorithm. FIG. 4 shows our results.
The three graphs on the left of FIG. 4 (a, c, e) show the cumulative distribution functions of the privacy consumption for different test radii R. FIG. 4(a) shows that when R = 100, about 35% of the privacy consumptions do not exceed ε/10. The computed average privacy consumption is 0.0096, meaning we save 44.51% of the privacy consumption. This result is not dramatic, but in FIG. 4(c), when R = 200, about 85% of the privacy consumptions do not exceed ε/10; the cumulative distribution function of the privacy consumption is very close to the line y = 0.9, and the computed average privacy consumption is 0.0017. That is, we save more than 90% of the privacy consumption. When R reaches 300, FIG. 4(e) shows that most points incur no privacy consumption at all, thanks to our skip mechanism. The average privacy consumption is 0.00023: we save more than 98% of the privacy consumption. This is a huge improvement, almost the ideal case of a constant level.
The three graphs on the right of FIG. 4 (b, d, f) show the cumulative distribution functions of the errors for different R. The three curves are very close, which indicates that our PTM mechanism does not hurt the availability of the published points (sometimes it even improves availability, as in FIG. 4(f)).
Slow speed user
In the slow scenario, where the distance between consecutive points is small, we use the Geolife dataset. Although the users in Geolife use a variety of transport modes, such as walking, bicycles, cars and subways, the vast majority walk, so the dataset is appropriate for the slow scenario. Of course, we can also delete some non-walking trajectories to make the dataset fit our requirements. We randomly chose 1000 trajectories and ran the PTM mechanism 1000 times. As in the stationary scenario, we collect all privacy consumptions and errors, plot their cumulative distribution functions, and then compare the means with E(ε_c) and E(error) of the GICM algorithm.
The results are shown in FIG. 5. The three graphs on the left (a, c, e) show the cumulative distribution functions of the privacy consumption as R varies from 100 to 300. They are very close to those of FIG. 4, which means our mechanism also works well in the slow scenario. In fact, we calculated that our PTM saves 40.46%, 81.5% and 93% of the privacy consumption for R = 100, 200 and 300, respectively.
From the three graphs on the right (b, d, f) we can see that the error increases with R. When R = 100, the three curves are almost identical; as R increases, the cumulative distribution function of the error becomes lower. This is easily explained: when R is too large, errors accumulate more easily, making our predictions less and less accurate. In real life, however, this degree of availability loss is acceptable. We calculated that the errors increase by 5%, 22% and 46% for the different R. The privacy saved clearly outweighs the availability lost.
High speed user
Finally, the high-speed users. In this scenario we use the T-Drive dataset. Most of its data come from taxis, so the users move at relatively high speed. We sample points from the raw data at a fixed frequency to ensure that the distance between two consecutive points exceeds 100 meters in most cases. We again plot the cumulative distribution functions and compute the means. Note that the prediction method used in this scenario differs from that of the first two scenarios.
The results are shown in FIG. 6. Our PTM can still save about half of the privacy consumption: specifically, 28%, 55% and 69%. The reason it cannot save as much as in the first two scenarios is simple: the user moves quickly, so the trajectory is less regular and we cannot predict as accurately. We also note that the privacy savings barely increase as R changes from 200 to 300, which suggests that the accuracy of our prediction method and the test radius R jointly constrain the overall performance of PTM.
FIG. 6 also shows that availability is affected. While the availability loss in FIG. 6(b) and FIG. 6(d) is still acceptable, the loss in FIG. 6(f) is not. In fact, setting R = 300 does not save much more privacy; R = 200 is a good choice. We conclude that an overly large R is generally not useful when the prediction accuracy is not high enough.

Claims (2)

1. A differential privacy method for protecting multiple locations in a location information service, comprising the steps of:
step 1, using a threshold R to test the accuracy of the predicted point and determine the published point; inputting the true trace points X_1, X_2, … that the user wants to protect; drawing a value l from the gamma distribution Γ(2, 1/ε), where ε is a privacy parameter;
step 2, drawing a value θ from the uniform distribution U(0, 2π);
step 3, letting Y_1 = X_1 + (l cos θ, l sin θ);
step 4, outputting Y_1;
step 5, letting i = 2, s = 0 and f = 1;
step 6, drawing a value l from the gamma distribution Γ(2, 1/ε);
step 7, drawing a value θ from the uniform distribution U(0, 2π);
step 8, letting Y_i = X_i + (l cos θ, l sin θ);
step 9, constructing a predicted point P and generating a random number ran, the value of ran being between 0 and 1; if ran < s/(s + f): publishing Y_i = P, setting i = i + 1, and returning to step 6; otherwise: executing step 10;
step 10, if dis(Y_i, P) < R: publishing Y_i = P and setting s = s + 1; otherwise: publishing Y_i and setting f = f + 1;
step 11, setting i = i + 1 and returning to step 6;
in the above steps, Y_i denotes the i-th published point, s denotes the number of test successes, f denotes the number of test failures, and dis(Y_i, P) denotes the Euclidean distance between Y_i and P.
2. The differential privacy method for protecting multiple locations in a location information service of claim 1, wherein the method of constructing the predicted point P is:
when the user is stationary, Pi=Yi-1
When the user moves slowly, Pi=Yi-1
When the user moves at high speed, Pi=2Yi-1-Yi-2
CN201710433690.3A 2017-06-09 2017-06-09 Differential privacy method for protecting multiple positions in position information service Active CN107247909B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710433690.3A CN107247909B (en) 2017-06-09 2017-06-09 Differential privacy method for protecting multiple positions in position information service


Publications (2)

Publication Number Publication Date
CN107247909A CN107247909A (en) 2017-10-13
CN107247909B true CN107247909B (en) 2020-05-05

Family

ID=60019278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710433690.3A Active CN107247909B (en) 2017-06-09 2017-06-09 Differential privacy method for protecting multiple positions in position information service

Country Status (1)

Country Link
CN (1) CN107247909B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110022531B (en) * 2019-03-01 2021-01-19 华南理工大学 Localized differential privacy urban garbage data report and privacy calculation method
CN110516476B (en) * 2019-08-31 2022-05-13 贵州大学 Geographical indistinguishable location privacy protection method based on frequent location classification
CN110633402B (en) * 2019-09-20 2021-05-04 东北大学 Three-dimensional space-time information propagation prediction method with differential privacy mechanism
CN112487471B (en) * 2020-10-27 2022-01-28 重庆邮电大学 Differential privacy publishing method and system of associated metadata
CN114065287B (en) * 2021-11-18 2024-05-07 南京航空航天大学 Track differential privacy protection method and system for resisting predictive attack


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100077484A1 (en) * 2008-09-23 2010-03-25 Yahoo! Inc. Location tracking permissions and privacy

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104135362A (en) * 2014-07-21 2014-11-05 南京大学 Availability computing method of data published based on differential privacy
CN105095447A (en) * 2015-07-24 2015-11-25 武汉大学 Distributed w-event differential privacy infinite streaming data distribution method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Location Privacy via Differential Private Perturbation of Cloaking Area; Hoa Ngo et al.; 2015 IEEE 28th Computer Security Foundations Symposium; 2015-09-07; full text *
Privacy protection against big data analysis: research status and progress; Tong Wei et al.; Chinese Journal of Network and Information Security; 2016-04-30; Vol. 2, No. 4; full text *
A location privacy protection method with random anonymity; Yang Songtao et al.; Journal of Harbin Engineering University; 2015-03-31; Vol. 36, No. 3; full text *

Also Published As

Publication number Publication date
CN107247909A (en) 2017-10-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant