CN115982737B - Optimal privacy protection strategy method based on reinforcement learning - Google Patents

Optimal privacy protection strategy method based on reinforcement learning

Info

Publication number
CN115982737B
CN115982737B (application CN202211656580.0A)
Authority
CN
China
Prior art keywords
state
observer
constructing
event
observation
Prior art date
Legal status
Active
Application number
CN202211656580.0A
Other languages
Chinese (zh)
Other versions
CN115982737A (en)
Inventor
王德光
何家汉
张志恒
Current Assignee
Guizhou University
Original Assignee
Guizhou University
Priority date
Filing date
Publication date
Application filed by Guizhou University
Priority to CN202211656580.0A
Publication of CN115982737A
Application granted
Publication of CN115982737B
Legal status: Active

Links

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses an optimal privacy protection strategy method based on reinforcement learning, which comprises the following steps: S1, establishing a deterministic finite automaton model G for the system; S2, constructing an observation decision and a sensor activation strategy; S3, constructing a state estimation function and a detection function; S4, combining S1, S2 and S3 to construct the most permissive observer; S5, assigning an activation cost to each observation decision in the most permissive observer and a switching cost to each sensor switching operation; S6, recasting the most permissive observer with numerical costs from S5 as a deterministic finite Markov decision process; S7, combining with S6, solving the optimal sensor activation strategy by improved Q-learning. By adopting this optimal privacy protection strategy method based on reinforcement learning, the invention avoids rebuilding the model whenever additional constraints are imposed on the most permissive observer, and is suitable for handling the most permissive observer both with and without cost constraints.

Description

Optimal privacy protection strategy method based on reinforcement learning
Technical Field
The invention relates to the technical field of privacy protection, in particular to an optimal privacy protection strategy method based on reinforcement learning.
Background
In recent years, with the development of cyber-physical systems, the scale of information transmission between different devices has kept increasing, so the security of information transmission is particularly important. Information security requires that certain confidential information of a system cannot be discovered by an intruder. Discrete event systems are dynamic systems with discrete states that are driven by events. Smart grids, cyber-physical systems, and the like can be modeled logically as discrete event systems. Opacity is a property of a discrete event system that describes its security and privacy: if an intruder cannot determine whether the system is in a secret state by observing the system's behavior, the system is opaque.
When the system is not opaque, opacity can be enforced by methods such as supervisory control, insertion functions, and dynamic sensor activation. Supervisory control protects confidential information by restricting system behavior: if some behavior of the system would leak a secret, that behavior is forbidden by the supervisor. While supervisory control can guarantee opacity, it imposes strong constraints and limitations on system behavior. An insertion function inserts fictitious events into the system's output to alter its observed behavior, thereby ensuring opacity; however, this synthesis approach has high computational complexity.
Dynamic sensor activation methods change the set of observable events by turning sensors on and off so that the system satisfies certain properties, such as K-diagnosability and opacity. Such methods do not restrict system behavior and are therefore non-intrusive to the system. In practical applications, the number of sensors available for monitoring events is limited and their cost is high; factors such as sensor availability, service life, battery power, and computation and communication resources must be considered. If too many sensors are switched on, confidential information of the system may be revealed to an intruder; if too few are switched on, the information available to the user is limited. In addition, frequently switching sensors on or off, as well as keeping them running, consumes more energy or bandwidth. How to solve for an optimal sensor activation strategy under limited resources while keeping the system secure therefore has significant research value.
The most permissive observer is a finite two-player game structure in which all sensor activation strategies satisfying current-state opacity are embedded. The two players have opposing objectives, so conventional path-planning algorithms such as A* and Dijkstra cannot cope with such a flexible game structure. Mean-payoff games can process the most permissive observer without cost constraints by evaluating the average cost per step, but they cannot handle the most permissive observer with cost constraints, nor real-valued costs.
Disclosure of Invention
The invention aims to provide an optimal privacy protection strategy method based on reinforcement learning that handles the most permissive observer with cost constraints, avoids rebuilding the model when additional constraints are imposed on the most permissive observer, and is also suitable for processing the most permissive observer without cost constraints.
In order to achieve the above object, the present invention provides an optimal privacy protection strategy method based on reinforcement learning, comprising the following steps:
S1, establishing a deterministic finite automaton model G for the system, based on the privacy protection strategy problem;
S2, constructing an observation decision, wherein the observation decision is a current considerable event set and is changed according to the historical behavior of the system; constructing a sensor activation strategy, wherein the sensor activation strategy is a constructed observation decision on system behavior; the dynamic projection is a mapping under the sensor activation strategy, and the event sequence of the system filters out the events which do not belong to the current observation decision through the dynamic projection;
S3, constructing a state estimation function to estimate the current state of the system, and constructing a detection function to check whether current-state opacity is satisfied by the state estimate;
S4, combining S1, S2 and S3 to construct the most permissive observer;
S5, assigning to each observation decision in the most permissive observer an activation cost, and to each sensor switching operation a switching cost;
S6, recasting the most permissive observer with numerical costs from S5 as a deterministic finite Markov decision process;
S7, combining with S6, solving the optimal sensor activation strategy through improved Q-learning, and carrying out experiments and result analysis.
Preferably, in step S1, the deterministic finite automaton model is $G = (X, \Sigma, \delta, x_0)$, where $X$ is a finite state set, $\Sigma$ is a finite event set, $\delta$ is the transfer function, and $x_0$ is the initial state; the event set is partitioned into a dynamic event set and a constantly unobservable event set, and a dynamic event changes its observability dynamically according to the behavior of the system.
Preferably, in step S4, the specific process of constructing the most permissive observer is as follows:
First, establish the space of $Y$-states and $Z$-states; the $Y$-states and $Z$-states are information states that capture the relation between the observation decisions and the occurrence of events. Search the space of $Y$-states and $Z$-states until a $Z$-state violating current-state opacity is encountered;
then, prune that $Z$-state and the corresponding $Y$-states, until the structure converges.
Preferably, in step S7, solving the optimal sensor activation strategy specifically comprises:
S71, inputting the state $y_0$ as the initial state;
S72, if the current state of the traversal is not a terminal state and the set number of traversals has not been reached: if a random number is less than the greedy rate, executing (1), otherwise executing (2):
(1) selecting from the $Q$-table the action $a$ with the maximum $Q$ value in the current state, the $Q$-table being a matrix of size (number of states) x (number of actions);
(2) randomly selecting a valid action $a$ for the current state;
if the current state is a terminal state or the set number of traversals has been reached, ending the current episode;
S73, executing the action $a$ and obtaining the next reachable states $s'$ and the reward $r$;
S74, iteratively updating the $Q$ value according to the update formula;
S75, updating the state $s$; when multiple states are reachable, randomly selecting one of them;
S76, repeating steps S71, S72, S73, S74 and S75 until the $Q$ values converge or the set number of iterations is reached;
S77, selecting for each state the action $a$ with the maximum corresponding $Q$ value, and integrating the above to obtain the optimal sensor activation strategy $\omega^*$.
The optimal privacy protection strategy method based on reinforcement learning solves the optimal privacy protection strategy by reinforcement learning and achieves the following three aims: (1) an intruder can never determine, by observing the behavior of the system, whether the system is currently in a secret state; (2) the cost corresponding to the solved sensor activation strategy is minimal; (3) the method is suitable for handling both the most permissive observer with cost constraints and the most permissive observer without cost constraints.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of an embodiment of the optimal privacy protection strategy method based on reinforcement learning according to the present invention;
FIG. 2 is the deterministic finite automaton model G of the system in an embodiment of the optimal privacy protection strategy method based on reinforcement learning according to the present invention;
FIG. 3 is the most permissive observer in an embodiment of the optimal privacy protection strategy method based on reinforcement learning according to the present invention;
FIG. 4 is the most permissive observer with numerical costs in an embodiment of the optimal privacy protection strategy method based on reinforcement learning according to the present invention.
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
Examples
As shown in FIG. 1, the optimal privacy protection strategy method based on reinforcement learning comprises the following steps:
S1, establishing a deterministic finite automaton model G for the system, based on the privacy protection strategy problem.
The deterministic finite automaton model is

$G = (X, \Sigma, \delta, x_0)$

where $X$ is a finite state set, $\Sigma$ is a finite event set, $\delta$ is the transfer function, and $x_0$ is the initial state. The event set is partitioned into a dynamic event set and a constantly unobservable event set; a dynamic event changes its observability dynamically according to the behavior of the system. $X_S \subseteq X$ is the secret state set; $\Sigma_s$ is the set of events with dynamically switchable observability, and $\Sigma_{uo}$ is the set of constantly unobservable events. In the example of FIG. 2, the initial state is state 0, and the secret state set $X_S$ is as shown in FIG. 2.
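For concreteness, the automaton model can be encoded as a small data structure. The following is a minimal Python sketch; the DFA class and the concrete states, events, transitions, and secret set below are illustrative stand-ins, not the values of FIG. 2.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    states: set        # X: finite state set
    events: set        # Sigma: finite event set
    delta: dict        # transfer function: (state, event) -> state
    x0: object         # initial state
    secret: set        # X_S: secret state set
    dyn_events: set    # Sigma_s: events with switchable observability
    uo_events: set     # Sigma_uo: constantly unobservable events

# Hypothetical example in the spirit of FIG. 2 (not the patent's actual figure):
G = DFA(
    states={0, 1, 2, 3, 4},
    events={"a", "b", "u"},
    delta={(0, "a"): 1, (0, "b"): 4, (1, "u"): 2, (4, "a"): 3},
    x0=0,
    secret={2},
    dyn_events={"a", "b"},
    uo_events={"u"},
)
```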
S2, constructing an observation decision $\theta$, which is the current observable event set $\theta \subseteq \Sigma_s$ and changes according to the historical behavior of the system; constructing a sensor activation strategy $\omega$, which assigns an observation decision $\theta$ to each behavior of the system; $\Theta$ is the set of all observation decisions, and $\theta \in \Theta$. The dynamic projection $P_\omega$ is the mapping under the sensor activation strategy $\omega$, through which each event sequence of the system is filtered so that events not belonging to the current observation decision $\theta$ are removed.

For an event sequence $s$ of the system, an event $e$, and the empty sequence $\varepsilon$, the dynamic projection is defined recursively by

$P_\omega(\varepsilon) = \varepsilon, \qquad P_\omega(se) = \begin{cases} P_\omega(s)\,e & \text{if } e \in \omega(s), \\ P_\omega(s) & \text{otherwise.} \end{cases}$
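The dynamic projection can be implemented directly from this recursive definition. A minimal sketch follows, assuming the policy omega is given as a function from the observed prefix to the current observation decision (depending only on the observed prefix is a feasibility assumption, and the function name dynamic_projection is illustrative):

```python
def dynamic_projection(s, omega):
    """P_omega: filter the event sequence s under activation policy omega.
    omega maps the observed prefix so far to the current observation
    decision (the set of events whose sensors are currently on)."""
    observed = []
    for e in s:
        theta = omega(tuple(observed))   # current observation decision
        if e in theta:                   # sensor for e is on: event is visible
            observed.append(e)
        # otherwise e is filtered out (sensor off or constantly unobservable)
    return observed

# e.g., a policy that always observes only event "a":
print(dynamic_projection(["a", "u", "b", "a"], lambda prefix: {"a"}))  # ['a', 'a']
```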
S3, constructing a state estimation function to estimate the current state of the system: the state estimate is the set of states consistent with the observations produced under the sensor activation strategy $\omega$.
A detection function $D$ is constructed to verify whether current-state opacity is satisfied by a state estimate:

$D(\hat{x}) = 1 \text{ if } \hat{x} \not\subseteq X_S, \quad D(\hat{x}) = 0 \text{ otherwise.}$

If $D(\hat{x}) = 1$ for every reachable state estimate $\hat{x}$, the system satisfies current-state opacity.
S4, combining S1, S2 and S3 to construct the most permissive observer.
The most permissive observer is a seven-tuple $MPO = (Y, Z, \Sigma, \Theta, f_{yz}, f_{zy}, y_0)$: $Y$ and $Z$ are the state sets; $f_{yz}$ and $f_{zy}$ are, respectively, the transfer function from $Y$-states to $Z$-states and the transfer function from $Z$-states to $Y$-states; $y_0$ is the initial $Y$-state. The ellipses in FIG. 3 are $Y$-states and the rectangles are $Z$-states. The specific steps are as follows:
a) input $y_0$ as the initial state;
b) for each observation decision $\theta$, a $y$-state transfers through the observation decision $\theta$ to a $z$-state; if the detection value of that $z$-state satisfies $D(I(z)) = 1$, the transition is added to the most permissive observer, where $I(z)$ is the state-estimate component of the $z$-state; in FIG. 3, for example, the state-estimate component of the $z$-state $(\{0,4\}, \{a,b\})$ is $\{0,4\}$;
c) if the $z$-state is not yet in the most permissive observer, it is added; for any event $e$, if the transition of the $z$-state through event $e$ to a $y$-state is valid, the state transfers to the $y$-state $f_{zy}(z, e)$;
d) if that $y$-state is not yet in the most permissive observer, it is added;
e) steps (b), (c) and (d) are invoked recursively;
f) the most permissive observer is obtained by pruning: if a $y$-state has no valid transition, that state is removed, together with the $z$-states that can reach it; a compact implementation sketch of these steps is given below.
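Steps (a) through (f) amount to a breadth-first exploration of the $Y$/$Z$ game graph with on-the-fly opacity checking, followed by pruning. The compact sketch below, reusing unobservable_reach and D from above, is one possible reading of these steps rather than the patent's exact construction; enumerating all observation decisions by powerset is exponential and is shown only for small examples.

```python
from itertools import combinations

def powerset(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def build_mpo(dfa):
    decisions = powerset(dfa.dyn_events)   # candidate observation decisions Theta
    y0 = frozenset({dfa.x0})
    Y, Z, f_yz, f_zy = {y0}, set(), {}, {}
    frontier = [y0]                        # steps (a)-(e): recursive expansion
    while frontier:
        y = frontier.pop()
        for theta in decisions:
            est = frozenset(unobservable_reach(dfa, y, theta))
            if D(dfa, est) == 0:           # step (b): z-state would violate opacity
                continue
            z = (est, theta)
            f_yz[(y, theta)] = z
            if z in Z:
                continue
            Z.add(z)
            for e in theta:                # step (c): valid observable transitions
                ny = frozenset(dfa.delta[(x, e)] for x in est if (x, e) in dfa.delta)
                if ny:
                    f_zy[(z, e)] = ny
                    if ny not in Y:        # step (d): newly reached y-state
                        Y.add(ny)
                        frontier.append(ny)
    while True:                            # step (f): prune dead-end y-states
        dead_y = {y for y in Y if not any(k[0] == y for k in f_yz)}
        if not dead_y:
            break
        Y -= dead_y
        dead_z = {k[0] for k, ny in f_zy.items() if ny in dead_y}
        Z -= dead_z
        f_yz = {k: z for k, z in f_yz.items() if k[0] in Y and z in Z}
        f_zy = {k: ny for k, ny in f_zy.items() if k[0] in Z and ny in Y}
    return Y, Z, f_yz, f_zy, y0
```

Calling build_mpo(G) on the toy automaton above returns the pruned game structure and its initial state.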
S5, respectively endowing the operation of each observation decision and the switch sensor in the most permissible observer with an activation cost and a switching cost.
The activation cost $c_a(\theta)$ is the cost of an observation decision $\theta$. The switching cost is calculated as follows: for any observation decisions $\theta$ and $\theta'$,

$c_s(\theta, \theta') = \sum_{e \in \theta' \setminus \theta} c_{on}(e) + \sum_{e \in \theta \setminus \theta'} c_{off}(e),$

where $c_{on}(e)$ and $c_{off}(e)$ are the opening and closing costs of the sensor for event $e$. Assigning the activation and switching costs of FIG. 3 to the most permissive observer yields FIG. 4, the most permissive observer with numerical costs.
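As a small worked check of the switching-cost formula, under assumed unit on/off costs (the concrete values used in FIG. 3 and FIG. 4 are not reproduced here):

```python
def switching_cost(theta_old, theta_new, c_on, c_off):
    """c_s(theta, theta'): cost of moving from decision theta_old to theta_new;
    c_on / c_off map each event to its switch-on / switch-off cost."""
    on = set(theta_new) - set(theta_old)    # sensors to switch on
    off = set(theta_old) - set(theta_new)   # sensors to switch off
    return sum(c_on[e] for e in on) + sum(c_off[e] for e in off)

# With assumed unit costs:
unit = {"a": 1, "b": 1}
print(switching_cost({"a"}, {"b"}, unit, unit))  # 2: switch "b" on and "a" off
```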
S6, constructing the maximum allowable observer with the numerical cost in S5 as a deterministic finite Markov decision process.
A deterministic finite Markov decision process is represented by a five-tuple $(S, A, T, R, \gamma)$, where $A$ is the action space, $S$ is the state space, $T$ is the transfer function, $R$ is the reward, and $\gamma$ is the attenuation factor.
The most permissive observer with numerical costs is equivalent to a deterministic finite Markov decision process: the actions are the observation decisions valid in the current state; the states are the states of the most permissive observer; the transfer function is the transition relation by which the current state reaches the next state through an observation decision. Since the current state may reach more than one state through an observation decision, a transition yields a set of reachable states. The reward can be expressed as the sum of the negative of the numerical cost and a real constant.
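One way to realize this reward is sketched below, assuming the real constant is a fixed offset kappa; kappa and the cost tables are illustrative assumptions, and switching_cost is the helper sketched above:

```python
def reward(theta_prev, theta, c_act, c_on, c_off, kappa=10.0):
    """Reward = -(activation cost + switching cost) + real constant kappa.
    c_act maps each observation decision to its activation cost; kappa and
    the cost tables are illustrative assumptions, not the patent's values."""
    cost = c_act[frozenset(theta)] + switching_cost(theta_prev, theta, c_on, c_off)
    return kappa - cost
```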
S7, combining with S6, solving the optimal sensor activation strategy through improved Q-learning, and carrying out experiments and result analysis.
Solving the optimal sensor activation strategy specifically comprises the following steps:
S71, inputting the state $y_0$ as the initial state;
S72, if the current state of the traversal is not a terminal state and the set number of traversals has not been reached: if a random number is less than the greedy rate $\epsilon$, executing (1), otherwise executing (2):
(1) selecting from the $Q$-table the action $a$ with the maximum $Q$ value in the current state, the $Q$-table being a matrix of size (number of states) x (number of actions);
(2) randomly selecting a valid action $a$ for the current state;
if the current state is a terminal state or the set number of traversals has been reached, ending the current episode;
S73, executing the action $a$ and obtaining the next reachable states $s'$ and the reward $r$;
S74, iteratively updating the $Q$ value according to the formula

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$

where $\alpha$ is the learning rate and $\gamma$ is the attenuation factor;
S75, updating the state $s$; when multiple states are reachable, randomly selecting one of them;
S76, repeating steps S71, S72, S73, S74 and S75 until the $Q$ values converge or the set number of iterations is reached;
S77, selecting for each state the action $a$ with the maximum corresponding $Q$ value, and integrating the above to obtain the optimal sensor activation strategy $\omega^*$; a code sketch of steps S71 to S77 is given below.
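Putting steps S71 to S77 together, a minimal tabular Q-learning loop over the derived deterministic finite Markov decision process could look as follows. The mdp interface (s0, actions, step, terminal), the episode cap, and the hyperparameter values are assumptions rather than the patent's exact settings; note that, per S72, the random number is compared against a greedy rate, so exploration occurs with probability 1 - greedy_rate.

```python
import random
from collections import defaultdict

def q_learning(mdp, episodes=5000, alpha=0.1, gamma=0.9, greedy_rate=0.9,
               max_steps=200):
    """Tabular Q-learning over the deterministic finite MDP derived from the
    most permissive observer with numerical costs. `mdp` is assumed to expose:
    s0, actions(s) (non-empty for non-terminal s), step(s, a) -> (reachable
    next states, reward), and terminal(s)."""
    Q = defaultdict(float)                     # Q-table: (state, action) -> value
    for _ in range(episodes):
        s = mdp.s0                             # S71: start from the initial state
        for _ in range(max_steps):             # S72: bounded traversal
            if mdp.terminal(s):
                break
            acts = list(mdp.actions(s))
            if random.random() < greedy_rate:  # S72(1): greedy action
                a = max(acts, key=lambda act: Q[(s, act)])
            else:                              # S72(2): random valid action
                a = random.choice(acts)
            nxts, r = mdp.step(s, a)           # S73: reachable states and reward
            s2 = random.choice(list(nxts))     # S75: pick one reachable state
            best = max((Q[(s2, a2)] for a2 in mdp.actions(s2)), default=0.0)
            # S74: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
            s = s2
    policy = {}                                # S77: greedy policy extraction
    for (s, a) in Q:
        if s not in policy or Q[(s, a)] > Q[(s, policy[s])]:
            policy[s] = a
    return Q, policy
```

The returned policy maps each state to its greedy observation decision; integrating it along the runs of the game structure yields the sensor activation strategy $\omega^*$ of step S77.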
For the most permissive observer with numerical costs of FIG. 4, the optimal strategy is solved as described above.
The corresponding policy on FIG. 3 is obtained by reading off, in each state, the observation decision with the maximum $Q$ value.
The results are analyzed as follows: integrating the above policy yields the sensor activation strategy $\omega^*$, which assigns an observation decision to each event sequence of the system. From the results, the optimal sensor activation strategy has the system always keep switched on only the sensor that monitors a single event.
Therefore, the invention adopts the optimal privacy protection strategy method based on reinforcement learning, considers the most permissive observer with cost constraints, avoids rebuilding the model when additional constraints are imposed on the most permissive observer, and is also suitable for handling the most permissive observer without cost constraints.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the present invention may be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the present invention.

Claims (2)

1. An optimal privacy protection strategy method based on reinforcement learning, characterized by comprising the following steps:
S1, establishing a deterministic finite automaton model G for the system, based on the privacy protection strategy problem;
S2, constructing an observation decision, wherein the observation decision is a current considerable event set and is changed according to the historical behavior of the system; constructing a sensor activation strategy, wherein the sensor activation strategy is a constructed observation decision on system behavior; the dynamic projection is a mapping under the sensor activation strategy, and the event sequence of the system filters out the events which do not belong to the current observation decision through the dynamic projection;
S3, constructing a state estimation function to estimate the current state of the system, and constructing a detection function to check whether current-state opacity is satisfied by the state estimate;
S4, combining S1, S2 and S3 to construct the most permissive observer;
S5, assigning to each observation decision in the most permissive observer an activation cost, and to each sensor switching operation a switching cost;
S6, recasting the most permissive observer with numerical costs from S5 as a deterministic finite Markov decision process;
S7, combining with S6, solving the optimal sensor activation strategy through improved Q-learning, and carrying out experiments and result analysis;
the most permissive observer is a seven-tuple;/>For the state set->,/>Respectively->Status to->Transfer function and->Status to->State transfer function->Is->Initial state of states->For a limited set of events->For the set of observation decisions +.>Is a finite state set;
the specific process of constructing the most licensed observer is as follows:
a) Input deviceAs an initial state;
b) For each observation decisionyStatus is decided by the observation->Transfer tozStatus of if thezDetection value of stateThen add the transition to the most licensed observer, whereI(z)Is thatzA state estimating section of the state;
c) If it iszAdding the state to the most licensed observer if the state is not in the most licensed observer; for any eventIf (3)zState passing eventeReach toyThe transition of the state is valid, the state transitions to +.>A state;
d) If it isAdding the state to the most licensed observer if the state is not in the most licensed observer;
e) Recursively invoking steps (b) (c) (d);
f) Solving for the most licensed observer, if anyyThe transition of the state is not valid, the state is removed, and the state can be reached at the same timezA state;
in the step S7, solving the optimal sensor activation strategy specifically comprises:
S71, inputting the state $y_0$ as the initial state;
S72, if the current state of the traversal is not a terminal state and the set number of traversals has not been reached: if a random number is less than the greedy rate, executing (1), otherwise executing (2):
(1) selecting from the $Q$-table the action $a$ with the maximum $Q$ value in the current state, the $Q$-table being a matrix of size (number of states) x (number of actions);
(2) randomly selecting a valid action $a$ for the current state;
if the current state is a terminal state or the set number of traversals has been reached, ending the current episode;
S73, executing the action $a$ and obtaining the next reachable states $s'$ and the reward $r$;
S74, iteratively updating the $Q$ value according to the formula $Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma \max_{a'} Q(s', a') - Q(s, a) ]$;
S75, updating the state $s$; when multiple states are reachable, randomly selecting one of them;
S76, repeating steps S71, S72, S73, S74 and S75 until the $Q$ values converge or the set number of iterations is reached;
S77, selecting for each state the action $a$ with the maximum corresponding $Q$ value, and integrating the above to obtain the optimal sensor activation strategy $\omega^*$.
2. The optimal privacy protection strategy method based on reinforcement learning according to claim 1, wherein in S1 the deterministic finite automaton model is $G = (X, \Sigma, \delta, x_0)$, where $X$ is a finite state set, $\Sigma$ is a finite event set, $\delta$ is the transfer function, and $x_0$ is the initial state; the event set is partitioned into a dynamic event set and a constantly unobservable event set, and a dynamic event changes its observability dynamically according to the behavior of the system.
CN202211656580.0A 2022-12-22 2022-12-22 Optimal privacy protection strategy method based on reinforcement learning Active CN115982737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211656580.0A CN115982737B (en) 2022-12-22 2022-12-22 Optimal privacy protection strategy method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211656580.0A CN115982737B (en) 2022-12-22 2022-12-22 Optimal privacy protection strategy method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN115982737A CN115982737A (en) 2023-04-18
CN115982737B (en) 2023-07-21

Family

ID=85962088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211656580.0A Active CN115982737B (en) 2022-12-22 2022-12-22 Optimal privacy protection strategy method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115982737B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373997A1 (en) * 2017-06-21 2018-12-27 International Business Machines Corporation Automatically state adjustment in reinforcement learning
US20210232970A1 (en) * 2020-01-24 2021-07-29 Jpmorgan Chase Bank, N.A. Systems and methods for risk-sensitive reinforcement learning
KR102615244B1 (en) * 2020-04-07 2023-12-19 한국전자통신연구원 Apparatus and method for recommending user's privacy control
CN113420326B (en) * 2021-06-08 2022-06-21 浙江工业大学之江学院 Deep reinforcement learning-oriented model privacy protection method and system
CN113935024B (en) * 2021-10-09 2024-04-26 天津科技大学 Discrete event system information security judging method with uncertainty observation

Also Published As

Publication number Publication date
CN115982737A (en) 2023-04-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant