CN115982737B - Optimal privacy protection strategy method based on reinforcement learning - Google Patents

Optimal privacy protection strategy method based on reinforcement learning

Info

Publication number
CN115982737B
CN115982737B (application CN202211656580.0A)
Authority
CN
China
Prior art keywords
state
observer
constructing
event
observation
Prior art date
Legal status
Active
Application number
CN202211656580.0A
Other languages
Chinese (zh)
Other versions
CN115982737A (en)
Inventor
王德光
何家汉
张志恒
Current Assignee
Guizhou University
Original Assignee
Guizhou University
Priority date
Filing date
Publication date
Application filed by Guizhou University
Priority to CN202211656580.0A
Publication of CN115982737A
Application granted
Publication of CN115982737B
Legal status: Active

Links

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses an optimal privacy protection strategy method based on reinforcement learning, which comprises the following steps: S1, establishing a deterministic finite automaton model G for the system; S2, constructing an observation decision and a sensor activation strategy; S3, constructing a state estimation function and a detection function; S4, combining S1, S2 and S3 to construct the most permissive observer; S5, assigning an activation cost to each observation decision in the most permissive observer and a switching cost to each sensor switching operation; S6, recasting the most permissive observer with numerical costs from S5 as a deterministic finite Markov decision process; S7, combining with S6, solving the optimal sensor activation strategy by improved Q-learning. By adopting this optimal privacy protection strategy method based on reinforcement learning, the invention avoids rebuilding the model whenever additional constraints are imposed on the most permissive observer, and is suitable for handling the most permissive observer both with and without cost constraints.

Description

Optimal privacy protection strategy method based on reinforcement learning
Technical Field
The invention relates to the technical field of privacy protection, in particular to an optimal privacy protection strategy method based on reinforcement learning.
Background
In recent years, with the development of cyber-physical systems, the scale of information transmission between different devices has kept increasing, so the security of information transmission is particularly important. Information security requires that certain confidential information of a system cannot be discovered by an intruder. Discrete event systems are dynamic systems with discrete states that are driven by events. Smart grids, cyber-physical systems, and the like can be modeled logically as discrete event systems. Opacity is a property of a discrete event system that describes its security and privacy: if an intruder cannot determine whether the system is in a secret state by observing the system's behavior, the system is opaque.
When the system is not opaque, opacity can be enforced by methods such as supervisory control, insertion functions, and dynamic sensor activation. Supervisory control protects confidential information by restricting system behavior: if some behavior of the system would leak a secret, that behavior is forbidden by the supervisor. While supervisory control can guarantee opacity, it imposes strong constraints and limitations on system behavior. An insertion function inserts fictitious events into the system's output to alter its observed behavior, thereby ensuring opacity; however, this synthesis approach has high computational complexity.
Dynamic sensor activation methods change the set of observable events by turning sensors on and off so that the system satisfies certain properties, such as K-diagnosability and opacity. Such methods do not restrict system behavior and are therefore non-intrusive to the system. In practical applications, the number of sensors available for monitoring events is limited and their cost is high; factors such as sensor availability, service life, battery power, and computation and communication resources must be considered. If too many sensors are switched on, confidential information of the system may be revealed to an intruder; if too few are switched on, the information available to the user is limited. In addition, frequently switching sensors on or off, as well as keeping them running, consumes more energy or bandwidth. How to solve for an optimal sensor activation strategy under limited resources while keeping the system secure therefore has significant research value.
The most permissive observer is a finite two-player game structure in which all sensor activation strategies satisfying current-state opacity are embedded. The two players have opposing objectives, so conventional path-planning algorithms such as A* and Dijkstra cannot cope with such a flexible game structure. Mean-payoff games can process the most permissive observer without cost constraints by evaluating the average cost per step, but they cannot handle the most permissive observer with cost constraints, nor real-valued costs.
Disclosure of Invention
The invention aims to provide an optimal privacy protection strategy method based on reinforcement learning that handles the most permissive observer with cost constraints, avoids rebuilding the model when additional constraints are imposed on the most permissive observer, and is also suitable for processing the most permissive observer without cost constraints.
In order to achieve the above object, the present invention provides an optimal privacy protection strategy method based on reinforcement learning, comprising the following steps:
S1, establishing a deterministic finite automaton model G for the system, based on the privacy protection strategy problem;
S2, constructing an observation decision, wherein the observation decision is a current considerable event set and is changed according to the historical behavior of the system; constructing a sensor activation strategy, wherein the sensor activation strategy is a constructed observation decision on system behavior; the dynamic projection is a mapping under the sensor activation strategy, and the event sequence of the system filters out the events which do not belong to the current observation decision through the dynamic projection;
S3, constructing a state estimation function to estimate the current state of the system, and constructing a detection function to check whether current-state opacity is satisfied by the state estimate;
S4, combining S1, S2 and S3 to construct the most permissive observer;
S5, assigning to each observation decision in the most permissive observer an activation cost, and to each sensor switching operation a switching cost;
S6, recasting the most permissive observer with numerical costs from S5 as a deterministic finite Markov decision process;
S7, combining with S6, solving the optimal sensor activation strategy through improved Q-learning, and carrying out experiments and result analysis.
Preferably, in step S1, the deterministic finite automaton model is $G = (X, \Sigma, \delta, x_0)$, where $X$ is a finite state set, $\Sigma$ is a finite event set, $\delta$ is the transfer function, and $x_0$ is the initial state; the event set is partitioned into a dynamic event set and a constantly unobservable event set, and a dynamic event changes its observability dynamically according to the behavior of the system.
Preferably, in step S4, the specific process of constructing the most permissive observer is as follows:
First, establish the space of $Y$-states and $Z$-states; the $Y$-states and $Z$-states are information states that capture the relation between the observation decisions and the occurrence of events. Search the space of $Y$-states and $Z$-states until a $Z$-state violating current-state opacity is encountered;
then, prune that $Z$-state and the corresponding $Y$-states, until the structure converges.
Preferably, in step S7, solving the optimal sensor activation strategy specifically comprises:
S71, inputting the state $y_0$ as the initial state;
S72, if the current state of the traversal is not a terminal state and the set number of traversals has not been reached: if a random number is less than the greedy rate, executing (1), otherwise executing (2):
(1) selecting from the $Q$-table the action $a$ with the maximum $Q$ value in the current state, the $Q$-table being a matrix of size (number of states) x (number of actions);
(2) randomly selecting a valid action $a$ for the current state;
if the current state is a terminal state or the set number of traversals has been reached, ending the current episode;
S73, executing the action $a$ and obtaining the next reachable states $s'$ and the reward $r$;
S74, iteratively updating the $Q$ value according to the update formula;
S75, updating the state $s$; when multiple states are reachable, randomly selecting one of them;
S76, repeating steps S71, S72, S73, S74 and S75 until the $Q$ values converge or the set number of iterations is reached;
S77, selecting for each state the action $a$ with the maximum corresponding $Q$ value, and integrating the above to obtain the optimal sensor activation strategy $\omega^*$.
The optimal privacy protection strategy method based on reinforcement learning solves the optimal privacy protection strategy by reinforcement learning and achieves the following three aims: (1) an intruder can never determine, by observing the behavior of the system, whether the system is currently in a secret state; (2) the cost corresponding to the solved sensor activation strategy is minimal; (3) the method is suitable for handling both the most permissive observer with cost constraints and the most permissive observer without cost constraints.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of an embodiment of the optimal privacy protection strategy method based on reinforcement learning according to the present invention;
FIG. 2 is the deterministic finite automaton model G of the system in an embodiment of the optimal privacy protection strategy method based on reinforcement learning according to the present invention;
FIG. 3 is the most permissive observer in an embodiment of the optimal privacy protection strategy method based on reinforcement learning according to the present invention;
FIG. 4 is the most permissive observer with numerical costs in an embodiment of the optimal privacy protection strategy method based on reinforcement learning according to the present invention.
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
Examples
As shown in FIG. 1, the optimal privacy protection strategy method based on reinforcement learning comprises the following steps:
S1, establishing a deterministic finite automaton model G for the system, based on the privacy protection strategy problem.
The deterministic finite automaton model is

$G = (X, \Sigma, \delta, x_0)$

where $X$ is a finite state set, $\Sigma$ is a finite event set, $\delta$ is the transfer function, and $x_0$ is the initial state. The event set is partitioned into a dynamic event set and a constantly unobservable event set; a dynamic event changes its observability dynamically according to the behavior of the system. $X_S \subseteq X$ is the secret state set; $\Sigma_s$ is the set of events with dynamically switchable observability, and $\Sigma_{uo}$ is the set of constantly unobservable events. In the example of FIG. 2, the initial state is state 0, and the secret state set $X_S$ is as shown in FIG. 2.
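For concreteness, the automaton model can be encoded as a small data structure. The following is a minimal Python sketch; the DFA class and the concrete states, events, transitions, and secret set below are illustrative stand-ins, not the values of FIG. 2.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    states: set        # X: finite state set
    events: set        # Sigma: finite event set
    delta: dict        # transfer function: (state, event) -> state
    x0: object         # initial state
    secret: set        # X_S: secret state set
    dyn_events: set    # Sigma_s: events with switchable observability
    uo_events: set     # Sigma_uo: constantly unobservable events

# Hypothetical example in the spirit of FIG. 2 (not the patent's actual figure):
G = DFA(
    states={0, 1, 2, 3, 4},
    events={"a", "b", "u"},
    delta={(0, "a"): 1, (0, "b"): 4, (1, "u"): 2, (4, "a"): 3},
    x0=0,
    secret={2},
    dyn_events={"a", "b"},
    uo_events={"u"},
)
```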
S2, constructing an observation decision $\theta$, which is the current observable event set $\theta \subseteq \Sigma_s$ and changes according to the historical behavior of the system; constructing a sensor activation strategy $\omega$, which assigns an observation decision $\theta$ to each behavior of the system; $\Theta$ is the set of all observation decisions, and $\theta \in \Theta$. The dynamic projection $P_\omega$ is the mapping under the sensor activation strategy $\omega$, through which each event sequence of the system is filtered so that events not belonging to the current observation decision $\theta$ are removed.

For an event sequence $s$ of the system, an event $e$, and the empty sequence $\varepsilon$, the dynamic projection is defined recursively by

$P_\omega(\varepsilon) = \varepsilon, \qquad P_\omega(se) = \begin{cases} P_\omega(s)\,e & \text{if } e \in \omega(s), \\ P_\omega(s) & \text{otherwise.} \end{cases}$
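The dynamic projection can be implemented directly from this recursive definition. A minimal sketch follows, assuming the policy omega is given as a function from the observed prefix to the current observation decision (depending only on the observed prefix is a feasibility assumption, and the function name dynamic_projection is illustrative):

```python
def dynamic_projection(s, omega):
    """P_omega: filter the event sequence s under activation policy omega.
    omega maps the observed prefix so far to the current observation
    decision (the set of events whose sensors are currently on)."""
    observed = []
    for e in s:
        theta = omega(tuple(observed))   # current observation decision
        if e in theta:                   # sensor for e is on: event is visible
            observed.append(e)
        # otherwise e is filtered out (sensor off or constantly unobservable)
    return observed

# e.g., a policy that always observes only event "a":
print(dynamic_projection(["a", "u", "b", "a"], lambda prefix: {"a"}))  # ['a', 'a']
```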
S3, constructing a state estimation function to estimate the current state of the system: the state estimate is the set of states consistent with the observations produced under the sensor activation strategy $\omega$.
A detection function $D$ is constructed to verify whether current-state opacity is satisfied by a state estimate:

$D(\hat{x}) = 1 \text{ if } \hat{x} \not\subseteq X_S, \quad D(\hat{x}) = 0 \text{ otherwise.}$

If $D(\hat{x}) = 1$ for every reachable state estimate $\hat{x}$, the system satisfies current-state opacity.
S4, combining S1, S2 and S3 to construct the most permissive observer.
The most permissive observer is a seven-tuple $MPO = (Y, Z, \Sigma, \Theta, f_{yz}, f_{zy}, y_0)$: $Y$ and $Z$ are the state sets; $f_{yz}$ and $f_{zy}$ are, respectively, the transfer function from $Y$-states to $Z$-states and the transfer function from $Z$-states to $Y$-states; $y_0$ is the initial $Y$-state. The ellipses in FIG. 3 are $Y$-states and the rectangles are $Z$-states. The specific steps are as follows:
a) input $y_0$ as the initial state;
b) for each observation decision $\theta$, a $y$-state transfers through the observation decision $\theta$ to a $z$-state; if the detection value of that $z$-state satisfies $D(I(z)) = 1$, the transition is added to the most permissive observer, where $I(z)$ is the state-estimate component of the $z$-state; in FIG. 3, for example, the state-estimate component of the $z$-state $(\{0,4\}, \{a,b\})$ is $\{0,4\}$;
c) if the $z$-state is not yet in the most permissive observer, it is added; for any event $e$, if the transition of the $z$-state through event $e$ to a $y$-state is valid, the state transfers to the $y$-state $f_{zy}(z, e)$;
d) if that $y$-state is not yet in the most permissive observer, it is added;
e) steps (b), (c) and (d) are invoked recursively;
f) the most permissive observer is obtained by pruning: if a $y$-state has no valid transition, that state is removed, together with the $z$-states that can reach it; a compact implementation sketch of these steps is given below.
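Steps (a) through (f) amount to a breadth-first exploration of the $Y$/$Z$ game graph with on-the-fly opacity checking, followed by pruning. The compact sketch below, reusing unobservable_reach and D from above, is one possible reading of these steps rather than the patent's exact construction; enumerating all observation decisions by powerset is exponential and is shown only for small examples.

```python
from itertools import combinations

def powerset(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def build_mpo(dfa):
    decisions = powerset(dfa.dyn_events)   # candidate observation decisions Theta
    y0 = frozenset({dfa.x0})
    Y, Z, f_yz, f_zy = {y0}, set(), {}, {}
    frontier = [y0]                        # steps (a)-(e): recursive expansion
    while frontier:
        y = frontier.pop()
        for theta in decisions:
            est = frozenset(unobservable_reach(dfa, y, theta))
            if D(dfa, est) == 0:           # step (b): z-state would violate opacity
                continue
            z = (est, theta)
            f_yz[(y, theta)] = z
            if z in Z:
                continue
            Z.add(z)
            for e in theta:                # step (c): valid observable transitions
                ny = frozenset(dfa.delta[(x, e)] for x in est if (x, e) in dfa.delta)
                if ny:
                    f_zy[(z, e)] = ny
                    if ny not in Y:        # step (d): newly reached y-state
                        Y.add(ny)
                        frontier.append(ny)
    while True:                            # step (f): prune dead-end y-states
        dead_y = {y for y in Y if not any(k[0] == y for k in f_yz)}
        if not dead_y:
            break
        Y -= dead_y
        dead_z = {k[0] for k, ny in f_zy.items() if ny in dead_y}
        Z -= dead_z
        f_yz = {k: z for k, z in f_yz.items() if k[0] in Y and z in Z}
        f_zy = {k: ny for k, ny in f_zy.items() if k[0] in Z and ny in Y}
    return Y, Z, f_yz, f_zy, y0
```

Calling build_mpo(G) on the toy automaton above returns the pruned game structure and its initial state.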
S5, respectively endowing the operation of each observation decision and the switch sensor in the most permissible observer with an activation cost and a switching cost.
The activation cost $c_a(\theta)$ is the cost of an observation decision $\theta$. The switching cost is calculated as follows: for any observation decisions $\theta$ and $\theta'$,

$c_s(\theta, \theta') = \sum_{e \in \theta' \setminus \theta} c_{on}(e) + \sum_{e \in \theta \setminus \theta'} c_{off}(e),$

where $c_{on}(e)$ and $c_{off}(e)$ are the opening and closing costs of the sensor for event $e$. Assigning the activation and switching costs of FIG. 3 to the most permissive observer yields FIG. 4, the most permissive observer with numerical costs.
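As a small worked check of the switching-cost formula, under assumed unit on/off costs (the concrete values used in FIG. 3 and FIG. 4 are not reproduced here):

```python
def switching_cost(theta_old, theta_new, c_on, c_off):
    """c_s(theta, theta'): cost of moving from decision theta_old to theta_new;
    c_on / c_off map each event to its switch-on / switch-off cost."""
    on = set(theta_new) - set(theta_old)    # sensors to switch on
    off = set(theta_old) - set(theta_new)   # sensors to switch off
    return sum(c_on[e] for e in on) + sum(c_off[e] for e in off)

# With assumed unit costs:
unit = {"a": 1, "b": 1}
print(switching_cost({"a"}, {"b"}, unit, unit))  # 2: switch "b" on and "a" off
```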
S6, constructing the maximum allowable observer with the numerical cost in S5 as a deterministic finite Markov decision process.
A deterministic finite Markov decision process is represented by a five-tuple $(S, A, T, R, \gamma)$, where $A$ is the action space, $S$ is the state space, $T$ is the transfer function, $R$ is the reward, and $\gamma$ is the attenuation factor.
The most permissive observer with numerical costs is equivalent to a deterministic finite Markov decision process: the actions are the observation decisions valid in the current state; the states are the states of the most permissive observer; the transfer function is the transition relation by which the current state reaches the next state through an observation decision. Since the current state may reach more than one state through an observation decision, a transition yields a set of reachable states. The reward can be expressed as the sum of the negative of the numerical cost and a real constant.
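One way to realize this reward is sketched below, assuming the real constant is a fixed offset kappa; kappa and the cost tables are illustrative assumptions, and switching_cost is the helper sketched above:

```python
def reward(theta_prev, theta, c_act, c_on, c_off, kappa=10.0):
    """Reward = -(activation cost + switching cost) + real constant kappa.
    c_act maps each observation decision to its activation cost; kappa and
    the cost tables are illustrative assumptions, not the patent's values."""
    cost = c_act[frozenset(theta)] + switching_cost(theta_prev, theta, c_on, c_off)
    return kappa - cost
```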
S7, combining with S6, solving the optimal sensor activation strategy through improved Q-learning, and carrying out experiments and result analysis.
Solving the optimal sensor activation strategy specifically comprises the following steps:
S71, inputting the state $y_0$ as the initial state;
S72, if the current state of the traversal is not a terminal state and the set number of traversals has not been reached: if a random number is less than the greedy rate $\epsilon$, executing (1), otherwise executing (2):
(1) selecting from the $Q$-table the action $a$ with the maximum $Q$ value in the current state, the $Q$-table being a matrix of size (number of states) x (number of actions);
(2) randomly selecting a valid action $a$ for the current state;
if the current state is a terminal state or the set number of traversals has been reached, ending the current episode;
S73, executing the action $a$ and obtaining the next reachable states $s'$ and the reward $r$;
S74, iteratively updating the $Q$ value according to the formula

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$

where $\alpha$ is the learning rate and $\gamma$ is the attenuation factor;
S75, updating the state $s$; when multiple states are reachable, randomly selecting one of them;
S76, repeating steps S71, S72, S73, S74 and S75 until the $Q$ values converge or the set number of iterations is reached;
S77, selecting for each state the action $a$ with the maximum corresponding $Q$ value, and integrating the above to obtain the optimal sensor activation strategy $\omega^*$; a code sketch of steps S71 to S77 is given below.
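Putting steps S71 to S77 together, a minimal tabular Q-learning loop over the derived deterministic finite Markov decision process could look as follows. The mdp interface (s0, actions, step, terminal), the episode cap, and the hyperparameter values are assumptions rather than the patent's exact settings; note that, per S72, the random number is compared against a greedy rate, so exploration occurs with probability 1 - greedy_rate.

```python
import random
from collections import defaultdict

def q_learning(mdp, episodes=5000, alpha=0.1, gamma=0.9, greedy_rate=0.9,
               max_steps=200):
    """Tabular Q-learning over the deterministic finite MDP derived from the
    most permissive observer with numerical costs. `mdp` is assumed to expose:
    s0, actions(s) (non-empty for non-terminal s), step(s, a) -> (reachable
    next states, reward), and terminal(s)."""
    Q = defaultdict(float)                     # Q-table: (state, action) -> value
    for _ in range(episodes):
        s = mdp.s0                             # S71: start from the initial state
        for _ in range(max_steps):             # S72: bounded traversal
            if mdp.terminal(s):
                break
            acts = list(mdp.actions(s))
            if random.random() < greedy_rate:  # S72(1): greedy action
                a = max(acts, key=lambda act: Q[(s, act)])
            else:                              # S72(2): random valid action
                a = random.choice(acts)
            nxts, r = mdp.step(s, a)           # S73: reachable states and reward
            s2 = random.choice(list(nxts))     # S75: pick one reachable state
            best = max((Q[(s2, a2)] for a2 in mdp.actions(s2)), default=0.0)
            # S74: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
            s = s2
    policy = {}                                # S77: greedy policy extraction
    for (s, a) in Q:
        if s not in policy or Q[(s, a)] > Q[(s, policy[s])]:
            policy[s] = a
    return Q, policy
```

The returned policy maps each state to its greedy observation decision; integrating it along the runs of the game structure yields the sensor activation strategy $\omega^*$ of step S77.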
For the most permissive observer with numerical costs of FIG. 4, the optimal strategy is solved as described above.
The corresponding policy on FIG. 3 is obtained by reading off, in each state, the observation decision with the maximum $Q$ value.
The results are analyzed as follows: integrating the above policy yields the sensor activation strategy $\omega^*$, which assigns an observation decision to each event sequence of the system. From the results, the optimal sensor activation strategy has the system always keep switched on only the sensor that monitors a single event.
Therefore, the invention adopts the optimal privacy protection strategy method based on reinforcement learning, considers the most permissive observer with cost constraints, avoids rebuilding the model when additional constraints are imposed on the most permissive observer, and is also suitable for handling the most permissive observer without cost constraints.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the present invention may be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the present invention.

Claims (2)

1. An optimal privacy protection strategy method based on reinforcement learning, characterized by comprising the following steps:
S1, establishing a deterministic finite automaton model G for the system, based on the privacy protection strategy problem;
S2, constructing an observation decision, wherein the observation decision is a current considerable event set and is changed according to the historical behavior of the system; constructing a sensor activation strategy, wherein the sensor activation strategy is a constructed observation decision on system behavior; the dynamic projection is a mapping under the sensor activation strategy, and the event sequence of the system filters out the events which do not belong to the current observation decision through the dynamic projection;
S3, constructing a state estimation function to estimate the current state of the system, and constructing a detection function to check whether current-state opacity is satisfied by the state estimate;
S4, combining S1, S2 and S3 to construct the most permissive observer;
S5, assigning to each observation decision in the most permissive observer an activation cost, and to each sensor switching operation a switching cost;
S6, recasting the most permissive observer with numerical costs from S5 as a deterministic finite Markov decision process;
S7, combining with S6, solving the optimal sensor activation strategy through improved Q-learning, and carrying out experiments and result analysis;
the most permissive observer is a seven-tuple;/>For the state set->,/>Respectively->Status to->Transfer function and->Status to->State transfer function->Is->Initial state of states->For a limited set of events->For the set of observation decisions +.>Is a finite state set;
the specific process of constructing the most licensed observer is as follows:
a) Input deviceAs an initial state;
b) For each observation decisionyStatus is decided by the observation->Transfer tozStatus of if thezDetection value of stateThen add the transition to the most licensed observer, whereI(z)Is thatzA state estimating section of the state;
c) If it iszAdding the state to the most licensed observer if the state is not in the most licensed observer; for any eventIf (3)zState passing eventeReach toyThe transition of the state is valid, the state transitions to +.>A state;
d) If it isAdding the state to the most licensed observer if the state is not in the most licensed observer;
e) Recursively invoking steps (b) (c) (d);
f) Solving for the most licensed observer, if anyyThe transition of the state is not valid, the state is removed, and the state can be reached at the same timezA state;
in the step S7, solving the optimal sensor activation strategy specifically comprises:
S71, inputting the state $y_0$ as the initial state;
S72, if the current state of the traversal is not a terminal state and the set number of traversals has not been reached: if a random number is less than the greedy rate, executing (1), otherwise executing (2):
(1) selecting from the $Q$-table the action $a$ with the maximum $Q$ value in the current state, the $Q$-table being a matrix of size (number of states) x (number of actions);
(2) randomly selecting a valid action $a$ for the current state;
if the current state is a terminal state or the set number of traversals has been reached, ending the current episode;
S73, executing the action $a$ and obtaining the next reachable states $s'$ and the reward $r$;
S74, iteratively updating the $Q$ value according to the formula $Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma \max_{a'} Q(s', a') - Q(s, a) ]$;
S75, updating the state $s$; when multiple states are reachable, randomly selecting one of them;
S76, repeating steps S71, S72, S73, S74 and S75 until the $Q$ values converge or the set number of iterations is reached;
S77, selecting for each state the action $a$ with the maximum corresponding $Q$ value, and integrating the above to obtain the optimal sensor activation strategy $\omega^*$.
2. The optimal privacy protection strategy method based on reinforcement learning according to claim 1, wherein in S1 the deterministic finite automaton model is $G = (X, \Sigma, \delta, x_0)$, where $X$ is a finite state set, $\Sigma$ is a finite event set, $\delta$ is the transfer function, and $x_0$ is the initial state; the event set is partitioned into a dynamic event set and a constantly unobservable event set, and a dynamic event changes its observability dynamically according to the behavior of the system.
CN202211656580.0A 2022-12-22 2022-12-22 Optimal privacy protection strategy method based on reinforcement learning Active CN115982737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211656580.0A CN115982737B (en) 2022-12-22 2022-12-22 Optimal privacy protection strategy method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211656580.0A CN115982737B (en) 2022-12-22 2022-12-22 Optimal privacy protection strategy method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN115982737A CN115982737A (en) 2023-04-18
CN115982737B (en) 2023-07-21

Family

ID=85962088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211656580.0A Active CN115982737B (en) 2022-12-22 2022-12-22 Optimal privacy protection strategy method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115982737B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373997A1 (en) * 2017-06-21 2018-12-27 International Business Machines Corporation Automatically state adjustment in reinforcement learning
US20210232970A1 (en) * 2020-01-24 2021-07-29 Jpmorgan Chase Bank, N.A. Systems and methods for risk-sensitive reinforcement learning
KR102615244B1 (en) * 2020-04-07 2023-12-19 한국전자통신연구원 Apparatus and method for recommending user's privacy control
CN113420326B (en) * 2021-06-08 2022-06-21 浙江工业大学之江学院 Deep reinforcement learning-oriented model privacy protection method and system
CN113935024B (en) * 2021-10-09 2024-04-26 天津科技大学 Discrete event system information security judging method with uncertainty observation

Also Published As

Publication number Publication date
CN115982737A (en) 2023-04-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant