CN111770454B - Game method for position privacy protection and platform task allocation in mobile crowd sensing - Google Patents
- Publication number
- CN111770454B (application CN202010629965.2A)
- Authority
- CN
- China
- Prior art keywords
- user
- task
- privacy
- platform
- disturbance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/02—Services making use of location information
- H04W4/029—Location-based management or tracking services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W12/00—Security arrangements; Authentication; Protecting privacy or anonymity
- H04W12/02—Protecting privacy or anonymity, e.g. protecting personally identifiable information [PII]
Abstract
The invention provides a game method for position privacy protection and platform task allocation in mobile crowd sensing. First, the interaction between the users and the platform is simulated through a trusted third party: each user selects a privacy budget to add noise to its location, and the platform allocates tasks according to each user's perturbed location. The interaction process is then modeled as a game and its equilibrium point is derived. Finally, a reinforcement learning method is used to continuously try different location perturbation strategies and output the optimal location perturbation scheme. Experimental results show that the mechanism improves the overall utility of the users as much as possible while optimizing the task allocation utility, so that the users and the platform achieve a win-win outcome. The method addresses the following problem: an MCS system must provide personalized privacy protection to attract more users to participate in tasks, yet in the presence of malicious attackers, strengthening the users' privacy protection degrades location usability and reduces task allocation utility.
Description
Technical Field
The technical scheme belongs to the field of network technology, and particularly relates to a win-win game method for protecting location privacy and allocating platform tasks in mobile crowd sensing (MCS).
Background
In recent years, the explosive development of Internet of Things technology has greatly promoted the popularity of mobile crowd sensing (MCS). A typical MCS system consists of data requesters, a server (the MCS platform), and mobile users. The server distributes the data requesters' tasks to the mobile users in the MCS system; the mobile users complete data collection with their mobile smart devices, send the data back to the server, and receive a certain reward.
Task allocation is one of the most important links in an MCS system. Its goal is to optimize the utility of the overall system while accomplishing all (or most) of the tasks in the target sensing area. Minimizing the travel distance is typically selected as the optimization target of MCS task allocation. However, the travel distance cannot be calculated without the users' location information, and if the true location is transmitted to the MCS platform, the user is exposed to the risk of personal privacy disclosure. Therefore, to attract more users to participate in sensing tasks, the MCS system must provide location privacy protection for the users.
The spatial cloaking technique from traditional location privacy protection can also be used to protect user location privacy in MCS task allocation. However, the level of privacy protection this technique provides is easily weakened if a malicious attacker in the MCS system has some prior knowledge. Differential privacy techniques can provide strong location privacy protection for users regardless of the adversary's prior knowledge. In addition, considering that different users have different privacy protection requirements, the MCS system needs to offer privacy protection with several different privacy budgets for users to choose from.
The travel distance is an important index for measuring MCS task allocation cost. Researchers have proposed the ActiveCrowd task allocation framework, which considers time sensitivity, aims at minimizing the total movement distance, and solves the user selection problem of multi-task allocation in MCS. Because the MCS platform must know the true locations of all users, this may leak user location privacy and reduce users' willingness to participate in sensing. Other researchers use the traditional spatial cloaking technique from LBS to protect users' location privacy during task allocation. Researchers have also proposed a spatial crowdsourcing mechanism based on differential privacy and geographic positioning, which provides efficient services while giving all users location privacy protection under the same privacy budget. Some researchers use differential privacy to obfuscate user locations and provide all users with the same degree of location privacy protection during task allocation. However, such frameworks can hardly accommodate users' differentiated privacy protection requirements. Considering users' personalized privacy protection requirements, researchers have further proposed a personalized privacy-preserving task allocation framework that lets users specify their privacy budgets following the idea of K-anonymity, thereby providing personalized location privacy protection. However, users choose their privacy budgets rather arbitrarily; in particular, when a malicious attacker exists in the MCS system, users select privacy budgets with stronger protection, which reduces the usability of their locations and is unfavorable to the MCS platform's task allocation.
Disclosure of Invention
From the foregoing discussion of the prior art, it can be seen that when designing a task allocation framework that provides personalized privacy protection, besides ensuring that the MCS platform allocates tasks efficiently, the framework should also provide each user with the most appropriate location privacy protection.
Game theory is an effective approach to the performance trade-off problem in MCS systems; for example, in related studies of MCS incentive mechanisms, game theory provides auction-based, pricing-based, and reputation-based mechanisms to motivate users to participate in MCS sensing. The trusted third party (TTP) is the most important component of the mechanism: it not only provides location privacy protection for the users, but also simulates the interaction between the users' privacy budget selection and the MCS platform's task allocation, and thereby establishes the most appropriate personalized privacy protection for each user.
The mobile crowd sensing (MCS) system needs to provide personalized privacy protection for users in order to attract more users to participate in tasks. However, because malicious attackers exist, strengthening the users' privacy protection degrades location usability and reduces task allocation utility.
The invention provides a game method for location privacy protection and platform task allocation in mobile crowd sensing, which is a reinforcement-learning-based win-win game method between users and the platform and comprises the following steps:
first, the interaction between the users and the MCS platform is simulated through a trusted third party (TTP): each user selects a privacy budget to add noise to its location, and the MCS platform allocates tasks according to each user's perturbed location;
then, the interaction process is modeled as a game and the equilibrium point is derived;
finally, a reinforcement learning method is used to continuously try different location perturbation strategies and output the optimal location perturbation scheme.
The win-win game method for user location privacy protection and platform task allocation uses a reinforcement learning algorithm to train an offline model that outputs the optimal location perturbation strategy by continuously trying combinations of all users' location perturbation schemes. Experimental results show that, in an MCS system providing personalized privacy protection, the privacy budget-task allocation game can establish personalized and most appropriate location privacy protection for each user, improving the users' privacy protection as much as possible while guaranteeing the task allocation utility, thereby achieving a win-win situation between the users and the platform.
Drawings
FIG. 1 is an overall frame of an MCS system;
FIG. 2 is a schematic illustration of a privacy budget-task allocation game in a trusted third party TTP;
FIG. 3 is a decision framework based on reinforcement learning;
FIGS. 4a and 4b are schematic diagrams comparing the performance of the algorithm of the present invention with a random algorithm;
wherein: FIG. 4a is user overall utility and FIG. 4b is task assignment utility;
FIGS. 5a and 5b are schematic diagrams of the effect of the number of users;
wherein: FIG. 5a is the user's overall utility and FIG. 5b is the average travel distance;
FIGS. 6a and 6b are schematic diagrams of the effect of the task publication radius;
wherein: fig. 6a is the user's overall utility and fig. 6b is the average travel distance.
Detailed Description
The technical scheme is further explained below with reference to the drawings and the detailed embodiments.
1 Overview
The mobile crowd sensing (MCS) system needs to provide personalized privacy protection for users in order to attract more users to participate in tasks. However, because malicious attackers exist, strengthening the users' privacy protection degrades location usability and reduces task allocation utility.
To address this problem, the game method provided by the invention first simulates the interaction between the users and the platform through a trusted third party: each user selects a privacy budget to add noise to its location, and the platform allocates tasks according to each user's perturbed location. The interaction process is then modeled as a game and its equilibrium point is derived. Finally, a reinforcement learning method is used to continuously try different location perturbation strategies and output the optimal location perturbation scheme. Experimental results show that the mechanism improves the overall utility of the users as much as possible while optimizing the task allocation utility, so that the users and the platform achieve a win-win outcome. The method thus addresses the problem that an MCS system must provide personalized privacy protection to attract more users to participate in tasks, yet in the presence of malicious attackers, strengthening the users' privacy protection degrades location usability and reduces task allocation utility.
In the mobile crowd sensing (MCS) system, after receiving a task request, the MCS platform issues the tasks; users who wish to perform a task provide location information to the MCS platform; and the MCS platform selects users and distributes the tasks. The game method is characterized in that a trusted third party (TTP) simulates the interaction between the users and the MCS platform, and comprises the following steps: 1) for the tasks issued by the MCS platform, users who wish to execute tasks transmit their true distances to the applied tasks and their privacy budgets to the TTP; 2) the interaction process between the users and the MCS platform is simulated inside the TTP, and each user's optimal perturbed location is obtained; 3) the MCS platform selects users and allocates tasks according to the optimal perturbed locations received from the TTP.
In step 2), the interaction process between the users and the MCS platform is simulated by a Stackelberg game: the users as a whole act as the leader in the Stackelberg game model, and the MCS platform acts as the follower in the model. The leader-follower interaction proceeds as follows:
2.1) the leader selects a privacy budget and communicates to the follower the perturbation policy of its location;
2.2) the follower allocates tasks according to the leader's perturbation strategy with the goal of minimizing the travel distance;
2.3) after receiving the follower's task allocation result, the leader adjusts the perturbation strategy and transmits the new location perturbation strategy to the follower; step 2.2) is repeated until the equilibrium point is reached, at which point the loop ends and the optimal location perturbation strategy is obtained; at the equilibrium point, the leader's utility is maximized while the task allocation utility is guaranteed, which is the optimal state;
2.4) obtaining the optimal disturbance position of the user according to the optimal position disturbance strategy during the balance point, and then entering the step 3) for processing.
In the steps 2.2) and 2.3), different position disturbance strategies are continuously tried by using a reinforcement learning method, and finally, an optimal position disturbance strategy is obtained.
First, a Markov decision process is used to represent the process of obtaining the optimal perturbation strategy; then a Q-learning algorithm is adopted to solve the Markov decision process and obtain, starting from the initial state s^(1), the final execution action that maximizes the converged cumulative reward value.
the following were used:
2 System model
As shown in fig. 1, the overall framework of the MCS system includes an MCS platform, a user and a trusted third party TTP.
Upon receiving a task request, the platform issues a set of tasks {t_1, t_2, ..., t_n}, whose corresponding set of task release radii is {r_1, r_2, ..., r_n}. In the task allocation stage, the platform allocates tasks with the goal of minimizing the travel distance according to the task application information transmitted by the trusted third party, generates an allocation matrix A_{n×m}, and thereby completes task allocation.
The set of mobile users in the system is denoted {w_1, w_2, ..., w_m}. After the platform releases the tasks, each user w_i sends a triplet to the trusted third party consisting of: the maximum privacy budget that user w_i accepts (the larger the privacy budget, the weaker the privacy protection and the greater the possibility of location privacy disclosure), the set of tasks that user w_i applies for, and the true distance vector d_i from user w_i to its k_i applied tasks. After the platform allocates the tasks, the user travels to the specific task location and executes the task assigned to it according to the allocation matrix A_{n×m}.
The trusted third party is the provider of location privacy protection as well as the decision maker of each user's optimal location perturbation scheme, and is an extremely important part of the system. It offers a set of privacy budgets with h different degrees of protection. After receiving the triplets uploaded by the users, the trusted third party provides each user w_i with a privacy budget ε_i that satisfies w_i's personalized requirement, obtaining the location perturbation strategy vector π = (ε_1, ε_2, ..., ε_m). It then simulates the platform allocating tasks with minimized travel distance and generates an allocation matrix A_{n×m}; next, it adjusts the users' location perturbation strategy π according to the allocation matrix so as to maximize the users' utility, performs task allocation again, and iterates continuously until the optimal location perturbation strategy π* is obtained. Finally, the trusted third party uploads user w_i's task application information (the selected privacy budget, the applied task set, and the perturbed distance vector) to the real platform.
It is assumed that each user can apply for several tasks, each task can be assigned to only one user, and each user can execute only one task.
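The following minimal sketch (an illustration only, not taken from the patent text; all field and function names are assumptions) shows how the triplet each user sends to the TTP and the task allocation matrix could be represented:

```python
# Illustrative data structures for the system model; names are assumptions.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class UserReport:
    """Triplet that user w_i sends to the trusted third party."""
    max_privacy_budget: float    # largest epsilon the user accepts (weakest protection tolerated)
    requested_tasks: List[int]   # indices of the k_i tasks the user applies for
    true_distances: List[float]  # true distance to each applied task, in km

def empty_allocation(n_tasks: int, m_users: int) -> np.ndarray:
    """Allocation matrix with one 0/1 entry per (task, user) pair."""
    return np.zeros((n_tasks, m_users), dtype=int)

# Example: a user applying for tasks 0 and 3, accepting at most epsilon = 5
report = UserReport(max_privacy_budget=5.0,
                    requested_tasks=[0, 3],
                    true_distances=[0.8, 1.6])
A = empty_allocation(n_tasks=5, m_users=10)
```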
3 problem modeling
This section first introduces the concept of generalized differential privacy, then analyzes the location privacy protection provided by the system, then introduces the platform's task allocation method, and finally explains the user privacy protection-platform task allocation game and derives its equilibrium point.
3.1 generalized differential privacy
For any two adjacent data sets x and x' and any output set Y, if the output distributions M(x) and M(x') differ on Y by at most a factor of e^ε, i.e., M(x)(Y) ≤ e^ε · M(x')(Y), then the mechanism M satisfies differential privacy with privacy budget ε. For any two locations x and x', if their Euclidean distance satisfies d(x, x') ≤ r, then under the obfuscation mechanism M the output distributions M(x) and M(x') differ by at most a factor of e^(εr), where ε represents the privacy budget per unit distance. In this case, even a malicious attacker who knows the obfuscation mechanism M cannot discern the true location.
Definition 1 (d_x-differential privacy). A mechanism M satisfies d_x-differential privacy if and only if, for arbitrary inputs x, x' and any output set Y,
M(x)(Y) ≤ e^(d_x(x, x')) · M(x')(Y),
where M(x)(Y) denotes the probability that the output M(x) belongs to the set Y, and d_x(x, x') = ε · d(x, x'); here ε is the privacy budget (the smaller ε, the stronger the privacy protection), and d(x, x') denotes the distance between x and x'.
Definition 2 (Laplace mechanism). Suppose X and Y are two sets and d_x is a metric on X. If, for every element x in X, the mechanism outputs y in Y with a probability density proportional to e^(−ε·d(x, y)), then this mechanism maps (X, d_x) to Y and satisfies d_x-differential privacy.
In particular, when the elements of X and Y are one-dimensional, the Laplace mechanism means that the perturbed value y is obtained by adding the corresponding noise to the initial value x, i.e., y = x + Lap(1/ε). In this case the mechanism M satisfies d_x-differential privacy with d_x = ε·|x − x'|.
Obviously, for any mechanism M that satisfies d_x-differential privacy, if d_x ≤ d'_x, then M also satisfies d'_x-differential privacy.
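As a small illustration of the one-dimensional Laplace mechanism described above (a sketch only; the function name and the use of NumPy are assumptions, not from the patent), the perturbed value is simply y = x + Lap(1/ε), so a smaller ε injects more noise and yields stronger protection:

```python
import numpy as np

def laplace_perturb(x: float, epsilon: float, rng=np.random.default_rng()) -> float:
    """Return y = x + Lap(0, 1/epsilon); satisfies d_x-differential privacy with d_x = epsilon*|x - x'|."""
    return x + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Stronger protection (epsilon = 0.5) scatters the output far more than weaker protection (epsilon = 5)
strong = [laplace_perturb(2.0, 0.5) for _ in range(5)]
weak = [laplace_perturb(2.0, 5.0) for _ in range(5)]
```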
3.2 location privacy protection
Users who are willing to perform a task need to upload the true distance and privacy budget to the applied task to the TTP. The TTP adds corresponding Laplace noise to the real distance according to the received privacy budget, so that an attacker cannot deduce the real position information of the user even knowing a specific position fuzzy mechanism, and the position privacy of the user is protected.
Because the user needs to upload the true distance of the applied task to the TTP, and the TTP finally uploads the disturbed disturbance distance to the MCS platform, the more the number of tasks applied by the user is, the more the exposed location information is, and the possibility of privacy disclosure is correspondingly increased. Meanwhile, the possibility of privacy disclosure has a direct relation with the privacy budget.
Proposition 1: the mechanism M(d_i, ε_i) = d_i + Lap(1/ε_i), which adds independent Laplace noise to every component of user w_i's true distance vector, satisfies differential privacy at level ε_i·Σ_j r_j, where the sum runs over the k_i tasks that user w_i applies for.
Proof: For arbitrary distance vectors d_i and d'_i, let d_ij ∈ d_i and d'_ij ∈ d'_i both denote possible true distances from user w_i to task t_j, so that |d_ij − d'_ij| ≤ r_j. Let d̃_i denote the perturbed distance vector of user w_i to its applied task set that is reported to the MCS platform, i.e., d̃_i = d_i + (η_1, η_2, ..., η_{k_i}), where η_1, η_2, ..., η_{k_i} are k_i independent and identically distributed random variables following Laplace(0, 1/ε_i). Thus the ratio of the output densities under d_i and d'_i is at most exp(ε_i·Σ_j |d_ij − d'_ij|) ≤ exp(ε_i·Σ_j r_j), so the mechanism M(d_i, ε_i) = d_i + Lap(1/ε_i) satisfies differential privacy at level ε_i·Σ_j r_j.
After the trusted third party receives the triplet transmitted by user w_i, it provides location privacy protection for w_i with M(d_i, ε_i), where the budget ε_i it selects does not exceed the maximum budget the user accepts, i.e., it provides privacy protection at least as strong as the user requires. By Proposition 1, the perturbation mechanism M(d_i, ε_i) then still satisfies the corresponding level of differential privacy.
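The perturbation applied by the TTP can be sketched as follows (an illustrative sketch under the assumptions above; the function name is not from the patent): each of the k_i true distances receives independent Laplace(0, 1/ε_i) noise.

```python
import numpy as np

def perturb_distance_vector(d_true: np.ndarray, epsilon_i: float,
                            rng=np.random.default_rng()) -> np.ndarray:
    """d_tilde = d_true + (eta_1, ..., eta_k), with eta_j ~ Laplace(0, 1/epsilon_i) i.i.d."""
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon_i, size=d_true.shape)
    return d_true + noise

d_i = np.array([0.8, 1.6, 0.3])                        # true distances to the applied tasks (km)
d_tilde = perturb_distance_vector(d_i, epsilon_i=2.0)  # what the TTP reports to the platform
```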
3.3 task Allocation
According to the users' privacy budgets, applied task sets, and perturbed distance vectors transmitted by the trusted third party, the platform sorts the applicants of each task in descending order of their probability of being closer to the task. After the descending sequences of applicants for all tasks have been calculated, each task is assigned to the nearest user.
Suppose users w_a and w_b are any two applicants for task t_j, and d_aj and d_bj denote their true distances to the task, respectively. When d_aj < d_bj, task t_j is more likely to be assigned to w_a; in other words, when d_aj < d_bj, user w_a should be ranked before user w_b in the ordered sequence of task t_j. The perturbed distance d̃_aj is obtained by adding Laplace noise to d_aj, so d_aj = d̃_aj − μ_a, and similarly d_bj = d̃_bj − μ_b, where μ_a and μ_b are random variables following Laplace(0, 1/ε_a) and Laplace(0, 1/ε_b), respectively. Therefore

P(d_aj < d_bj) = P(d̃_aj − μ_a < d̃_bj − μ_b).   (4)

By evaluating the double integral in equation (4), the probability that user w_a is closer to the task than user w_b can be calculated, which determines the relative order of w_a and w_b in the sequence of task t_j. For all applicants of task t_j, pairwise comparison yields a user sequence ordered by ascending distance to t_j. Performing the same calculation for the other tasks yields the ranking matrix S_{n×m}.
Row S_j represents the ordered sequence for task t_j; the element s_ji = k means that user w_k, who applied to execute task t_j, is ranked at the i-th position among all applicants. When i is greater than the number of applicants for t_j, s_ji = ∞. The task assignment problem aiming at minimizing the overall travel distance then reduces to assigning each task to the first user in its row of the ranking matrix S_{n×m}. However, when the same user is ranked first for several tasks, a conflict occurs, i.e., those tasks would all be allocated to that user; the conflict is eliminated, and the optimal allocation scheme obtained, by solving a 0-1 integer linear program in combination with equation (4).
The end result of task assignment is an assignment matrix A_{n×m}. For any a_ij ∈ A_{n×m}, a_ij ∈ {0, 1}, and a_ij = 1 indicates that task t_j is allocated to user w_i. The constraint Σ_i a_ij ≤ 1 states that each task is distributed to at most one user, and the constraint Σ_j a_ij ≤ 1 states that each user executes at most one task.
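The ordering step can be illustrated with the following sketch. The patent evaluates the closeness probability in equation (4) as a closed-form double integral over the two Laplace densities; here it is estimated by Monte Carlo purely to keep the example short, and the 0-1 integer program for conflict resolution is not shown. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_a_closer(d_tilde_a, eps_a, d_tilde_b, eps_b, n_samples=100_000):
    """Estimate P(d_a < d_b), where d_x = d_tilde_x - mu_x and mu_x ~ Laplace(0, 1/eps_x)."""
    mu_a = rng.laplace(0.0, 1.0 / eps_a, n_samples)
    mu_b = rng.laplace(0.0, 1.0 / eps_b, n_samples)
    return np.mean((d_tilde_a - mu_a) < (d_tilde_b - mu_b))

def rank_applicants(applicants):
    """applicants: list of (user_id, perturbed_distance, epsilon); rank by pairwise 'closer' wins."""
    wins = {uid: 0.0 for uid, _, _ in applicants}
    for idx, (ua, da, ea) in enumerate(applicants):
        for ub, db, eb in applicants[idx + 1:]:
            p = prob_a_closer(da, ea, db, eb)
            wins[ua] += p
            wins[ub] += 1.0 - p
    return sorted(wins, key=wins.get, reverse=True)  # most-likely-closest applicant first

# Example: three applicants for one task; a strongly protected user (small epsilon) is less certain
order = rank_applicants([("w1", 0.9, 2.0), ("w2", 1.1, 0.5), ("w3", 1.4, 5.0)])
```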
3.4 privacy budget-task distribution Game
To provide the most appropriate privacy protection for the users, the TTP needs to simulate the users selecting perturbation strategies, simulate the platform assigning tasks, and simulate the user-platform interaction. This interaction process is modeled as a Stackelberg game: the users as a whole act as the leader and convey their overall location perturbation strategy to the platform; the MCS platform acts as the follower and allocates tasks with the goal of minimizing the travel distance according to the users' perturbation strategy; after receiving the platform's task allocation result, the users adjust their overall perturbation strategy to maximize their overall utility, and the interaction continues in this way.
The two game parties are two virtual entities inside the TTP: a leader and a follower. The leader simulates the users selecting perturbation strategies, and the follower simulates the platform allocating tasks. As shown in FIG. 2, the leader first selects a privacy budget ε_i for user w_i and provides the protection mechanism M(d_i, ε_i) that satisfies the privacy budget ε_i; the users' overall protection strategy is recorded as π. The mechanism M(d_i, ε_i) perturbs the true distance vector d_i that user w_i uploaded for its applied tasks into the perturbed vector d̃_i = d_i + Lap(1/ε_i), and the leader passes the current strategy π to the follower. According to the received π, the follower allocates tasks with the goal of minimizing the travel distance and obtains the task allocation matrix A_{n×m}, whose entries a_ij take the value 0 or 1: a_ij = 1 indicates that task t_j is assigned to user w_i, and a_ij = 0 indicates that task t_j is not allocated to user w_i.
After the platform's tasks have been allocated, the expected utility of user w_i is given by equation (9), which trades off the location privacy protection the user obtains against whether the user is assigned a task. Here λ_i is user w_i's privacy weight coefficient, representing the strength of the user's preference between location privacy protection and being assigned tasks; λ_i > 1 indicates that the user prefers protecting location privacy. After the trusted third party applies differential privacy protection with privacy budget ε_i, the expected deviation between user w_i's perturbed distance vector and its true distance vector is k_i/ε_i. The expected utility of the users as a whole is then given by equation (10), obtained by aggregating the expected utilities of all individual users.
The platform's utility function is expressed in terms of the number of assigned tasks and the expected travel distance of the assigned tasks: the platform's utility is the inverse of the average travel distance, so the greater the average travel distance, the lower the platform's utility.
A rational user wants to maximize its personal utility as much as possible. That is, after being assigned a task, the user tends to increase its privacy protection strength to better protect its privacy; if it is not assigned a task, it tends to reduce its privacy protection degree so that it has a better chance of being selected, thereby improving its personal utility. Therefore, after each simulated task allocation by the follower, the leader adjusts the privacy protection strategies of all users according to the current allocation matrix so as to maximize the users' overall expected utility, and the follower then re-allocates the tasks according to the adjusted privacy strategy with the goal of minimizing the travel distance. Through this continuous interaction, the leader and the follower finally reach an equilibrium point.
This equilibrium point is the optimal state point at which the users' overall utility is maximized while the task allocation utility is optimized. At the equilibrium, the optimal perturbation strategy the users would select given the current task allocation result is exactly the current strategy, and the optimal task allocation the platform would produce given the current perturbation strategy is exactly the current allocation result.
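A compact sketch of this leader-follower iteration is given below (illustrative only; `assign` and `best_response` are placeholders standing in for the follower's minimum-travel-distance allocation and the leader's utility-maximizing budget adjustment, which the patent defines through its utility equations):

```python
from typing import Callable, List

def stackelberg_fixed_point(pi0: List[float],
                            assign: Callable[[List[float]], object],
                            best_response: Callable[[List[float], object], List[float]],
                            max_rounds: int = 100) -> List[float]:
    """Iterate leader/follower best responses until the strategy stops changing."""
    pi = list(pi0)
    for _ in range(max_rounds):
        allocation = assign(pi)                   # follower: allocate tasks, minimizing travel distance
        pi_next = best_response(pi, allocation)   # leader: adjust per-user budgets to raise overall utility
        if pi_next == pi:                         # fixed point = equilibrium of the game
            break
        pi = pi_next
    return pi
```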
Since each of the m users chooses its privacy budget from h candidate values, the selection space of the strategy π has size h^m, and the time complexity of traversing it is O(h^m). The time complexity of one task allocation is approximately O(n²), so the overall time complexity is about O(h^m·n²). Because the number m of users in the system is often very large, this time complexity is prohibitive, and brute-force enumeration is clearly not the best way to solve the problem.
4 location perturbation decision based on reinforcement learning
Reinforcement learning is useful for solving the problem of an agent maximizing return value during interaction with the environment, and a common model is the standard Markov Decision Process (MDP). Therefore, the invention adopts a reinforcement learning method to solve the problem of disturbance strategy decision for maximizing the utility of the user under the condition of high-efficiency task allocation. This section introduces the MDP of the location perturbation strategy decision problem, followed by the Q-learning algorithm for solving the optimal perturbation strategy.
4.1 MDP of decisions
The Markov decision process is a sequential decision model used to simulate an agent executing actions and obtaining returns in an environment whose system state has the Markov property. It is usually expressed as a five-tuple <S, A, P, R, γ>, where S denotes the system state, A denotes the agent's action, P denotes the transition function between system states, R denotes the reward, and γ denotes the discount factor.
The process by which the trusted third party selects the optimal perturbation strategy for the users can be seen as a Markov process. The agent is the leader inside the trusted third party, and the environment is the interaction process between the leader and the follower. The five elements of the MDP for the location perturbation strategy decision problem are described in detail below.
The system state consists of the perturbation strategy vector π and the task allocation matrix A. The initial state is s^(1) = [π^(0), A^(0)], where π^(0) sets each user's privacy budget to the initial value it uploaded to the trusted third party, i.e., the minimum degree of privacy protection the user accepts.
A perturbation strategy π is an action of the leader. Since each user may adopt any location perturbation scheme whose privacy budget is taken from the set provided by the trusted third party and meets the user's privacy requirement, the leader's action strategy space is the set of all such strategy vectors.
At time t, the system in state s^(t) takes action π^(t) and arrives at state s^(t+1). Because the state consists of the perturbation strategy and the task allocation matrix, and the task allocation matrix depends on the perturbation strategy, the state at the next moment is determined by the current state and the current action, satisfying

P(s^(t+1) | s^(1), π^(1), s^(2), π^(2), ..., s^(t), π^(t)) = P(s^(t+1) | s^(t), π^(t)),   (14)

i.e., the state transition has the Markov property.
The reward R represents the reward for executing the corresponding action in the current state. Equation (10) is used to calculate the reward value: after action π^(t) is taken in state s^(t), the return value equals the users' overall utility at that moment.
The discount factor γ, with 0 ≤ γ ≤ 1, indicates the relative importance of future and current rewards: γ = 0 means that only the current reward is considered, while γ = 1 means that the future reward is as important as the current reward.
Since both the state space and the action space are finite, the perturbation decision problem is a finite Markov decision process. After the perturbation decision is converted into an MDP, the optimal perturbation selection problem in the privacy-preserving task allocation game becomes: find, starting from the initial state s^(1), the final execution action that maximizes the converged cumulative reward value.
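The following minimal sketch shows one possible encoding of this MDP (names and the exact state packing are assumptions made only for illustration): a state bundles the current strategy π with the allocation it induces, an action is the next strategy vector, and the reward is the users' overall utility under the new allocation.

```python
from dataclasses import dataclass
from typing import Callable, Tuple
import numpy as np

@dataclass(frozen=True)
class State:
    pi: Tuple[float, ...]          # one privacy budget per user
    allocation: Tuple[int, ...]    # flattened 0/1 task-allocation matrix

def step(state: State,
         action: Tuple[float, ...],
         assign: Callable[[Tuple[float, ...]], np.ndarray],
         overall_utility: Callable[[Tuple[float, ...], np.ndarray], float]):
    """Apply the action (a new strategy), let the simulated platform re-assign, return (s', r)."""
    new_alloc = assign(action)                                # deterministic given the action
    next_state = State(pi=action,
                       allocation=tuple(int(v) for v in np.ravel(new_alloc)))
    reward = overall_utility(action, new_alloc)               # equation (10) stands in here
    return next_state, reward
```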
4.2 Q-learning based location perturbation decision algorithm
The Q-learning algorithm is an effective model-free reinforcement learning algorithm for solving the Markov decision process. By continual trial-and-error learning in different environments, the agent finds the best strategy, i.e., the one that maximizes the converged return value.
In the Q-learning algorithm, the agent creates a decision matrix Q, where rows represent states and columns represent actions, storing the values of the state-action pairs (s, pi), and initializing to a zero matrix. The Q matrix is iteratively updated by Bellman Equation (Bellman Equation) as follows:
Q(s, π) ← (1 − α)·Q(s, π) + α·(u_w(s, π) + γ·V(s')),   (15)

where α ∈ (0, 1) is the learning rate, and the larger its value, the less of the previous training result is retained; u_w(s, π) is the return value obtained by executing action π in state s; s' is the next state reached after executing action π in state s; γ is the discount factor, 0 ≤ γ ≤ 1, indicating how strongly the future reward and the current reward affect the action-value function (Q function), where γ = 0 means the action value depends only on the current reward and γ = 1 means the future reward is as important as the current reward; and the function V(·) returns the maximum Q value of the next state.
Based on the decision matrix Q and the current state s, the leader uses an e-greedy strategy to keep the algorithm from falling into a local optimum: in state s, the leader performs the currently optimal action with probability 1 − e and selects an action at random with probability e.
The Q-learning based perturbation scheme decision algorithm is described as follows:
Input: the privacy budget initial values uploaded by all users and the set of privacy budget values selectable in the system
Output: π*
Begin
Step 1. Initialize the learning rate α, the discount factor γ, the decision matrix Q as a zero matrix, and the task allocation matrix as a zero matrix
Step 2. for k ← 1 to episode do
Step 3.   s^(k) = [A^(k−1), π^(k−1)]
Step 4.   The leader selects a perturbation strategy π^(k) (the users' initial privacy budgets in the first iteration; e-greedy selection afterwards)
Step 5.   The leader perturbs the user locations under π^(k) and transmits the privacy budgets, perturbed locations, and applied task sets to the follower
Step 6.   The follower allocates the tasks and generates the allocation matrix A^(k)
Step 7.   for i ← 1 to m do
Step 8.     User w_i calculates its utility according to equation (9)
Step 9.   end for
Step 10.  Calculate the users' overall utility, i.e., the reward, according to equation (10)
Step 11.  Update Q(s^(k), π^(k)) according to equation (15)
Step 12.  Update V(s^(k)) according to equation (16)
Step 13. end for
Step 14. return π*
End
The algorithm takes as input the privacy budget initial values uploaded by all users and the set of privacy budget values selectable in the system, and outputs the optimal perturbation strategy π*.
In step 1, a learning rate alpha and a discount factor gamma used in the algorithm are initialized, a decision matrix Q is initialized to a zero matrix, and a task allocation matrix is initialized to the zero matrix.
Steps 2-13 form a loop, with episode denoting the maximum number of training iterations. In step 4, during the first iteration the leader provides privacy protection at the privacy level given by the users' uploaded initial privacy budgets; from the second iteration onwards, the leader selects a perturbation scheme with the e-greedy algorithm, exploiting the best perturbation strategy trained so far with probability 1 − e and selecting a perturbation strategy at random with probability e to avoid local optima. In steps 5-6, the follower allocates tasks according to the received user privacy budgets, perturbed locations, and applied task sets, and generates the allocation matrix. Steps 7-9 calculate the utility of each user based on the current allocation matrix. In step 10, the overall utility, i.e., the reward for taking action π^(k) in the current state s^(k), is calculated from the individual users' utilities. Steps 11-12 update the values of the state-action pairs in the decision matrix Q.
Step 14 outputs the location perturbation strategy π* once convergence is reached or the number of iterations is exhausted.
The algorithm loops a total of episode times. In each iteration, the leader can obtain the current optimal location perturbation strategy π from the Q table with time complexity O(1); the time complexity of the follower's task allocation is O(n²); and calculating the utility of all users takes O(m). In summary, the time complexity of the proposed Q-learning based location perturbation decision algorithm is O(episode × max(m, n²)).
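A runnable sketch of this decision loop is given below under simplifying assumptions (it is an illustration, not the patent's exact algorithm): the strategy space is enumerated explicitly from a small set of budgets, and `assign` and `overall_utility` are placeholders for the follower's minimum-travel-distance allocation and the overall user utility of equation (10).

```python
import itertools
import random
from collections import defaultdict

def q_learning_perturbation(budget_options, n_users, assign, overall_utility,
                            episodes=200, alpha=0.2, gamma=0.7, exploit_prob=0.8):
    """Return the per-user budget vector with the highest learned Q value."""
    actions = list(itertools.product(budget_options, repeat=n_users))  # all strategy vectors
    Q = defaultdict(float)                    # Q[(state, action)], implicitly a zero matrix
    state = "s1"                              # abstract initial state s(1)
    for _ in range(episodes):
        if random.random() < exploit_prob:    # exploit the currently best-valued action
            action = max(actions, key=lambda a: Q[(state, a)])
        else:                                 # explore a random strategy to escape local optima
            action = random.choice(actions)
        allocation = assign(action)           # follower: allocate tasks under these budgets
        reward = overall_utility(action, allocation)       # users' overall utility (reward)
        next_state = action                   # the allocation is determined by the strategy
        v_next = max(Q[(next_state, a)] for a in actions)  # V(s') = max_a Q(s', a)
        Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * (reward + gamma * v_next)
        state = next_state
    return max(actions, key=lambda a: Q[(state, a)])
```

The enumeration of all h^m strategy vectors is only workable for small examples; it is used here purely to keep the sketch short.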
5 Experimental and results analysis
The performance of the privacy budget-task allocation gaming mechanism is evaluated through simulation experiments. The following describes specific experimental environmental parameters and analyzes the experimental results.
Table 1 lists the value settings of the basic parameters in the experiment. In a sensing area of 5 km × 5 km, 10 users participate in sensing, 5 sensing tasks are to be distributed by the platform, and the release radius of each task is 1 km. Each user selects the maximum privacy budget it can accept; the initial privacy budget of each user is assumed to be 5, and the most appropriate privacy budget is then selected for each user during the algorithm iterations. The privacy weight coefficient λ_i of each user follows a normal distribution with mean 1 and variance 5, because location privacy protection and being assigned tasks are, on the whole, equally important to the users. The learning rate, discount factor, and greedy strategy coefficient in Q-learning are set to 0.2, 0.7, and 0.8, respectively.
Table 1 Experimental environment parameter settings

Parameter | Value
---|---
Sensing area | 5 km × 5 km
Number of users | 10
Number of tasks | 5
Task release radius | 1 km
Initial privacy budget of each user | 5
Privacy weight coefficient λ_i | normal distribution (mean 1, variance 5)
Learning rate | 0.2
Discount factor | 0.7
Greedy strategy coefficient | 0.8
5.1 evaluation of Q-learning Algorithm Performance
A random algorithm that randomly selects a perturbation strategy for each user is used as the baseline against the Q-learning algorithm of the present invention.
FIGS. 4a and 4b compare the performance of the Q-learning algorithm used in the present invention and the random algorithm in terms of the users' overall utility and the task allocation utility, respectively. The figures show that the Q-learning algorithm clearly outperforms the random algorithm in both the users' overall utility and the task allocation utility. In the random algorithm, a privacy budget is selected at random for each user in every iteration, so the task allocation result differs from iteration to iteration, the expected user utility and task allocation utility fluctuate up and down, and convergence cannot be reached. FIG. 4a shows that the users' overall utility under the Q-learning algorithm first increases and then levels off. This is because, when the Q-learning algorithm has just started, the initial privacy budget uploaded by each user, which carries the least privacy protection, is selected by default, so the expected utility of the users assigned tasks is low. As the number of iterations grows, the algorithm keeps selecting more appropriate privacy budgets for the users, and the users' overall expected utility increases. Likewise, in FIG. 4b, since the users' privacy protection is weak at the beginning, the usability of the user locations in the task allocation stage is high, so the task allocation result is close to the optimum. As the users' expected utility increases, their privacy protection becomes stronger and location usability decreases, so the expected travel distance increases slightly and the task allocation utility decreases slightly. The experimental results show that the proposed mechanism can better protect the users' location privacy and improve the users' overall utility while optimizing the task allocation utility, achieving a win-win situation between the users and the platform.
5.2 influence of the number of Users on System Performance
The number of mobile users, an indispensable part of the MCS system, is an important factor in measuring system performance. FIGS. 5a and 5b show the influence of the number of users on system performance in an MCS system with 5 tasks and a task release radius of 1 km. FIG. 5b shows that the average travel distance of both the No-privacy scheme and the Q-learning algorithm proposed by the present invention decreases as the number of users increases. This is because an increase in the number of users produces new candidates closer to the tasks; when a task is assigned to such a new candidate, the average travel distance drops markedly, improving the overall task allocation utility. Meanwhile, the average travel distance of the randomly selecting baseline may increase, because candidates farther from the tasks also appear. Since the number of tasks is fixed and users close to the tasks can choose stronger protection schemes, the utility of the users assigned tasks does not fluctuate greatly with the total number of users. The experimental results show that increasing the number of users effectively reduces the average travel distance, bringing it close to the optimal value without privacy protection, and markedly improves the task allocation utility.
5.3 impact of task publishing radius on System Performance
The release radius of a task also affects system performance: if the release radius is too small, there may be no user within the task's release range, and the task cannot be distributed and executed. FIGS. 6a and 6b show the effect of the task release radius on system performance in an MCS system with 10 users and 5 tasks. As can be seen from FIGS. 6a and 6b, when the release radius is less than 1 km, the users' overall utility and the average travel distance both grow as the task release radius increases. This is because tasks that previously had no users within their release area gain applicants and are successfully distributed as the radius grows. When the radius exceeds 1 km, the users' overall utility and the average travel distance level off. One reason is that all tasks have been allocated and no new users are assigned tasks; the other is that the task allocation matrix no longer changes as the release radius increases.
The experimental result shows that the algorithm can improve the overall utility of the user while ensuring the task allocation utility in the MCS system providing the personalized privacy protection. Meanwhile, the effect is better in an MCS system with larger task release radius and more users participating in sensing tasks.
6 concluding remarks
The invention provides a win-win game mechanism for user location privacy protection and platform task allocation in mobile crowd sensing (MCS), and solves for the equilibrium point by means of reinforcement learning. The core ideas are: providing personalized location privacy protection for users so as to attract more users to participate in MCS sensing tasks; and improving the overall user utility as much as possible while optimizing the platform's task allocation utility through the game. Experimental results show that the proposed game mechanism handles the trade-off between task allocation and user location privacy protection well, and performs better in systems with a larger task release radius and more users participating in sensing tasks.
Claims (2)
1. A game method for position privacy protection and platform task allocation in mobile crowd sensing is characterized in that in a mobile crowd sensing system MCS, after receiving a task request, an MCS platform issues a task; users who wish to perform tasks provide location information to the MCS platform; the MCS platform selects users and distributes tasks, and is characterized in that a trusted third party TTP simulates the interaction between the users and the MCS platform; the method comprises the following steps: 1) for the tasks issued by the MCS platform, users who wish to execute the tasks transmit the true distance and privacy budget to the applied tasks to the TTP; 2) simulating the interaction process of the user and the MCS platform in the TTP, and obtaining the optimal disturbance position of the user; 3) the MCS platform selects a user allocation task according to the optimal disturbance position of the user from the TTP;
in the step 2), the interaction process between the users and the MCS platform is simulated by a Stackelberg game: the users as a whole act as the leader in the Stackelberg game model, and the MCS platform acts as the follower in the model; the leader-follower interaction proceeds as follows:
2.1) the leader selects a privacy budget and communicates to the follower the perturbation policy of its location;
2.2) the follower allocates tasks according to the leader's perturbation strategy with the goal of minimizing the travel distance;
2.3) after receiving the follower's task allocation result, the leader adjusts the perturbation strategy and transmits the new location perturbation strategy to the follower; step 2.2) is repeated until the equilibrium point is reached, at which point the loop ends and the optimal location perturbation strategy is obtained; at the equilibrium point, the leader's utility is maximized while the task allocation utility is guaranteed, which is the optimal state;
2.4) obtaining the optimal disturbance position of the user according to the optimal position disturbance strategy during the balance point, and then entering the step 3) for processing.
2. The game method for location privacy protection and platform task allocation in mobile crowd sensing according to claim 1, wherein in the steps 2.2) and 2.3), different location perturbation strategies are tried continuously by using a reinforcement learning method, and finally an optimal location perturbation strategy is obtained;
firstly, a Markov decision process is used for representing a process of obtaining an optimal disturbance strategy:
in the Markov decision process, the agent is the leader, and the environment is the interaction process between the leader and the follower; the five elements of the Markov decision process are:
element 1: at time t, the system state s^(t) consists of the location perturbation strategy π^(t−1) and the task allocation matrix A^(t−1);
the initial state is s^(1) = [π^(0), A^(0)], where π^(0) represents each user's privacy budget at the initial value transmitted to the TTP, namely the minimum privacy protection strength accepted by the user;
element 2: a location perturbation strategy π is an action of the leader; since each user may adopt any location perturbation scheme whose privacy budget is taken from the set provided by the TTP and meets the leader's privacy requirement, the leader's action strategy space is the set of all such strategy vectors;
Element 3: at time t, the system state s(t)Taking action pi(t)Post arrival state s(t+1)(ii) a The system state is composed of a position disturbance strategy and a task allocation matrix, and the task allocation matrix depends on the disturbance strategy, so that the state at the next moment is determined by the current state and the current action, P(s)(t+1)|s(1),π(1),s(2),π(2),..,s(t),π(t))=P(s(t+1)|s(t),π(t)) I.e., state transitions are markov;
element 4: the reward R represents the reward for executing the corresponding action in the current state; after taking action π^(t) in state s^(t), the return value equals the users' overall utility at that moment;
element 5: the discount factor γ, with 0 ≤ γ ≤ 1, represents the relative importance of future and current rewards; γ = 0 means that only the current reward is considered, and γ = 1 means that the future reward is as important as the current reward;
because both the state space and the action space are limited, the position disturbance decision problem is a limited Markov decision process;
then, a Q-learning algorithm is adopted to solve the Markov decision process and obtain, starting from the initial state s^(1), the final execution action that maximizes the converged cumulative reward value;
in the Q-learning algorithm, a decision matrix Q is created by the agent, where rows represent states and columns represent actions, used to store the values of the state-action pairs;
initialization: initializing a learning rate alpha and a discount factor gamma used in the algorithm, initializing a decision matrix Q to a zero matrix, and initializing a task allocation matrix to the zero matrix;
first, the leader provides privacy protection at the corresponding privacy level using the privacy budget initial values, and selects an action π through the e-greedy algorithm;
Then, executing an action pi, and uploading the disturbed user position to a follower; the follower distributes tasks according to the received privacy budget, the disturbance position and the application task set and generates a distribution matrix A(k);
Calculating the utility of each user according to the current distribution matrix;
then the overall utility is calculated from the utilities of the individual users, i.e., the reward for taking action π^(k) in the current state s^(k);
iteratively updating the values of the state-action pairs in the Q matrix through a Bellman equation;
the above process is repeated until the final execution action that maximizes the converged cumulative reward value is obtained;
and the location perturbation strategy π* is output when convergence is reached or the number of iterations ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010629965.2A CN111770454B (en) | 2020-07-03 | 2020-07-03 | Game method for position privacy protection and platform task allocation in mobile crowd sensing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010629965.2A CN111770454B (en) | 2020-07-03 | 2020-07-03 | Game method for position privacy protection and platform task allocation in mobile crowd sensing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111770454A CN111770454A (en) | 2020-10-13 |
CN111770454B true CN111770454B (en) | 2021-06-01 |
Family
ID=72723507
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010629965.2A Active CN111770454B (en) | 2020-07-03 | 2020-07-03 | Game method for position privacy protection and platform task allocation in mobile crowd sensing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111770454B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112288478A (en) * | 2020-10-28 | 2021-01-29 | 中山大学 | Edge computing service incentive method based on reinforcement learning |
CN112543420B (en) * | 2020-11-03 | 2024-04-16 | 深圳前海微众银行股份有限公司 | Task processing method, device and server |
CN112967118B (en) * | 2021-02-03 | 2023-06-20 | 华南理工大学 | Mobile crowd sensing excitation method, device, system and storage medium |
CN112866993B (en) * | 2021-02-06 | 2022-10-21 | 北京信息科技大学 | Time sequence position publishing method and system |
CN113377655B (en) * | 2021-06-16 | 2023-06-20 | 南京大学 | Task allocation method based on MAS-Q-learning |
CN114254722B (en) * | 2021-11-17 | 2022-12-06 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-intelligent-model fusion method for game confrontation |
CN114415735B (en) * | 2022-03-31 | 2022-06-14 | 天津大学 | Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method |
CN116744289B (en) * | 2023-06-02 | 2024-02-09 | 中国矿业大学 | Intelligent position privacy protection method for 3D space mobile crowd sensing application |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103533078B (en) * | 2013-10-24 | 2017-07-21 | 无锡赛思汇智科技有限公司 | A kind of method and system for generating map |
CN103761485B (en) * | 2014-01-13 | 2017-01-11 | 清华大学 | Privacy protection method |
CN105407482B (en) * | 2015-11-04 | 2019-01-22 | 上海交通大学 | The guard method of user location privacy in mobile gunz sensing network |
CN105528248B (en) * | 2015-12-04 | 2019-04-30 | 北京邮电大学 | Intelligent perception incentive mechanism under multitask collaboration application |
US10111031B2 (en) * | 2016-01-22 | 2018-10-23 | The United States Of America As Represented By The Secretary Of The Air Force | Object detection and tracking system |
CN108200610B (en) * | 2018-02-26 | 2021-10-22 | 重庆邮电大学 | Crowd sensing resource allocation method adopting distributed game |
CN108668253A (en) * | 2018-04-09 | 2018-10-16 | 南京邮电大学 | A kind of gunz cooperative sensing motivational techniques based on evolutionary Game |
CN109214205B (en) * | 2018-08-01 | 2021-07-02 | 安徽师范大学 | K-anonymity-based position and data privacy protection method in crowd-sourcing perception |
CN110390560A (en) * | 2019-06-28 | 2019-10-29 | 浙江师范大学 | A kind of mobile intelligent perception multitask pricing method based on Stackelberg game |
2020
- 2020-07-03 CN CN202010629965.2A patent/CN111770454B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN111770454A (en) | 2020-10-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |