CN109375514A - Design method of an optimal tracking controller in the presence of false data injection attacks - Google Patents

Design method of an optimal tracking controller in the presence of false data injection attacks

Info

Publication number
CN109375514A
CN109375514A (application CN201811453386.6A)
Authority
CN
China
Prior art keywords
policy
algorithm
false data
optimal
following
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811453386.6A
Other languages
Chinese (zh)
Other versions
CN109375514B (en)
Inventor
刘皓 (Liu Hao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shensu Intelligent Agricultural Machinery Equipment Henan Co ltd
Original Assignee
Shenyang Aerospace University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Aerospace University filed Critical Shenyang Aerospace University
Priority to CN201811453386.6A priority Critical patent/CN109375514B/en
Publication of CN109375514A publication Critical patent/CN109375514A/en
Application granted granted Critical
Publication of CN109375514B publication Critical patent/CN109375514B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention relates to an intelligent tracking controller that, in the presence of false data injection attacks, computes the optimal tracking control law in real time so that the output of the system can track the reference input. The controller may include different control-algorithm processors, employing adaptive dynamic programming based on game theory and Q-learning, and is applicable when the system dynamics are unknown, and even when only input-output data can be obtained. The present invention is suitable for systems connected to their controllers through wireless networks, or transmitting data over wireless communication networks, and has great application value in UAV formation flight and intelligent vehicles.

Description

Design method of an optimal tracking controller in the presence of false data injection attacks
Technical field
The present invention relates to a method that uses game theory, adaptive dynamic programming, and reinforcement learning to determine an optimal tracking controller for linear discrete-time systems subject to false data injection attacks.
Background technique
Optimal tracking control is an important topic in the control field with a wide range of application backgrounds, for example trajectory tracking of intelligent vehicles and unmanned aerial vehicles, and tracking control of robots. The purpose of optimal tracking control is to make the output of the system track the reference input (or reference trajectory) in an optimal sense, which can be achieved by minimizing a previously given quadratic performance index. It should be pointed out that, with the development and application of network technology, wireless transmission is increasingly used for the remote transmission of data. However, the presence of the wireless network makes the transmitted data vulnerable to adversarial attacks, mainly denial-of-service attacks, replay attacks, and false data injection attacks. Studying optimal tracking control under network attacks therefore has important practical significance. The present invention mainly addresses false data injection attacks.
Traditional optimal tracking control designs the corresponding tracking controller with dynamic programming. However, dynamic programming is a backward-in-time recursion, so it cannot be computed online, and it suffers from the curse of dimensionality. Adaptive dynamic programming belongs to the field of artificial intelligence and is fundamentally based on reinforcement learning theory; it imitates the way humans learn from feedback from a complex environment and solves for the control policy recursively forward in time, so it can be executed online.
Computing the optimal control law by Q-learning may dispense with the system matrices of the original system and the reference trajectory generator, and is therefore suitable when some of the dynamics are unknown. Moreover, this method can iteratively solve for the optimal tracking control policy using only input-output data, without current state information.
Summary of the invention
The present invention aims to propose a design method for an optimal tracking controller of discrete-time systems under false data injection attacks, solving the problem that tracking previously failed in the presence of such attacks. The system structure of the invention is shown in Fig. 1. The technical solution of the present invention is implemented as follows:
1) establish the false data attack model and the augmented system model;
2) using game theory, establish the game model of the attacker and the defender; the defender is the controller and the attacker is the false data injector;
3) establish the Bellman equation and, by optimal control theory, solve for the optimal control policy and attack policy; solve the game algebraic Riccati equation by policy iteration and value iteration;
4) use the Q-function-based reinforcement learning method to solve for the optimal policies of both players, including policy iteration and value iteration;
5) based only on input-output data, iteratively solve for the optimal policy with Q-learning.
Detailed description of the invention
Fig. 1 is the system structure diagram in the presence of false data injection attacks.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only serve to illustrate the present invention and are not intended to limit it.
With reference to Fig. 1, the invention proposes a method using game theory, adaptive dynamic programming, and Q-learning to solve the optimal tracking control problem of discrete-time systems. The specific embodiment is as follows:
1) Establishing the false data attack model and the augmented model
Consider the following system model
x_{k+1} = A x_k + B u_k   (1)
where A and B are the system matrices. Suppose the control input u_k is attacked during transmission; after the false data injection attack the system model becomes
x_{k+1} = A x_k + B(u_k + Σ_{j=1}^{q} Γ_j a_k^j)   (2)
where q is the number of attackers and Γ_j = diag(γ_{1j}, …, γ_{mj}); γ_{ij} = 1 indicates that the i-th transmission channel is attacked by the j-th attacker, otherwise it is not attacked; a_k^j is the false data injected by the j-th attacker at time k.
Assume the reference trajectory model has the following form
r_{k+1} = T r_k   (3)
where the matrix T is the trajectory model matrix; it should be noted that T need not be Hurwitz. Combining (2) and (3) and defining the augmented state X_k = [x_k^T, r_k^T]^T, the augmented system equation is obtained as follows
X_{k+1} = Ã X_k + B̃(u_k + Σ_{j=1}^{q} Γ_j a_k^j),   Ã = [A 0; 0 T],   B̃ = [B; 0]   (4)
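As a numerical illustration of this step, the block structure of the augmented model can be sketched as follows. The plant, input, and reference matrices are invented placeholders (the patent leaves A, B, and T generic), and a single attacked input channel is assumed:

```python
import numpy as np

# Illustrative matrices: a 2-state plant, one input, and a scalar
# reference generator r_{k+1} = T r_k (these are NOT the patent's).
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
T = np.array([[1.0]])            # marginally stable; T need not be Hurwitz

n, m = A.shape[0], B.shape[1]
p = T.shape[0]

# Augmented state X_k = [x_k; r_k] gives
#   X_{k+1} = A_aug X_k + B_aug (u_k + attack terms)
A_aug = np.block([[A,                np.zeros((n, p))],
                  [np.zeros((p, n)), T               ]])
B_aug = np.vstack([B, np.zeros((p, m))])
D_aug = B_aug.copy()             # one attacked channel in this sketch
```

Because the reference generator is driven by neither the control nor the attack, its rows in B_aug are zero; that is what makes the augmented formulation work even when T is only marginally stable.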
2) Using game theory, establishing the game model of the attacker and the defender
In general, controllers take many forms, for example state feedback, output feedback, and dynamic output feedback; likewise, the injected false data can vary widely. The present invention assumes that the tracking controller and the false data are linear functions of the augmented state X_k, i.e.
u_k = K X_k,   a_k = L X_k   (5), (6)
where K = [K_1, K_2] and L = [(L_1)^T, …, (L_q)^T]^T are the feedback gains of the defender and the attacker, respectively. The two players choose the following payoff functions:
where Q_e ≥ 0 and R > 0 are weight matrices and γ ∈ (0,1) is the discount factor. The optimal policies of the defender and the attacker are then given by (9) and (10).
Solving (9) and (10) is equivalent to solving the following game problem:
3) Establishing the Bellman equation and solving the optimal control policy and attack policy by optimal control theory
First, define the following utility function:
Then, by calculation, the following optimal-control Bellman equation is obtained:
According to optimal control theory, the optimal value function is quadratic, V*(X_k) = X_k^T P X_k with P > 0. Hence, by solving the optimality equation, the optimal policies of the two players are obtained as follows:
where
Θ = [(Θ_1)^T, (Θ_2)^T, …, (Θ_q)^T]^T,
L(P) = [(L_1(P))^T, (L_2(P))^T, …, (L_q(P))^T]^T,
and the matrix P > 0 satisfies the following game algebraic Riccati equation:
The result above follows from dynamic programming and can only be computed offline. We now use reinforcement learning to compute the optimal policies of both players online. Algorithms 1 and 2 below give the policy iteration and value iteration procedures, respectively.
Algorithm 1: online policy iteration
1. Initialization: set j = 0 and select stabilizing initial policies K^0 and L^0.
2. Policy evaluation: solve the following equation for P^{j+1}:
3. Policy improvement:
4. Stopping condition: ||K^{j+1} − K^j|| < ε, ||L^{j+1} − L^j|| < ε.
Algorithm 2: value iteration
1. Initialization: set j = 0 and select initial policies K^0 and L^0, which need not be stabilizing.
2. Policy evaluation: solve the following equation for P^{j+1}:
3. Policy improvement:
4. Stopping condition: ||K^{j+1} − K^j|| < ε, ||L^{j+1} − L^j|| < ε.
As can be seen from Algorithm 1, solving equation (17) requires the system data to be known, and the initial policies must be stabilizing, otherwise the equation has no solution. Algorithm 2 improves on this correspondingly: the initial policies no longer need to be stabilizing.
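The value iteration of Algorithm 2 can be sketched on a small discounted zero-sum LQ game. All numbers here are illustrative assumptions (plant A, B, attack channel D, weights Q, R, an attacker penalty Ra, and discount gamma); the recursion is the standard game-Riccati value iteration implied by the Bellman equation above, started from P = 0:

```python
import numpy as np

# Illustrative data: a 2-state plant attacked through its input channel.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
D = B.copy()                       # attacker shares the input channel
Q = np.eye(2)                      # state weight
R = np.eye(1)                      # control weight
Ra = 10.0 * np.eye(1)              # attacker penalty (assumed large enough)
gamma = 0.9                        # discount factor

def game_blocks(P):
    """Blocks of the quadratic min-max problem at value matrix P."""
    M = np.block([[R + gamma * B.T @ P @ B,   gamma * B.T @ P @ D],
                  [gamma * D.T @ P @ B,      -Ra + gamma * D.T @ P @ D]])
    S = gamma * np.vstack([B.T @ P @ A, D.T @ P @ A])
    return M, S

P = np.zeros((2, 2))               # value iteration may start from zero
for _ in range(1000):
    M, S = game_blocks(P)
    P_next = Q + gamma * A.T @ P @ A - S.T @ np.linalg.solve(M, S)
    if np.linalg.norm(P_next - P) < 1e-12:
        P = P_next
        break
    P = P_next

M, S = game_blocks(P)
KL = -np.linalg.solve(M, S)        # saddle-point gains: u = K x, a = L x
K, L = KL[:1, :], KL[1:, :]
```

The saddle point exists only while the attacker block -Ra + gamma·DᵀPD stays negative definite; the relatively large Ra above guarantees this for the chosen plant.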
4) Solving the optimal policies of both players with the Q-function-based reinforcement learning method
Define the Q-function as follows:
For convenience, write it again in the following compact form:
where
Then, by solving the stationarity equations of the Q-function with respect to u_k and a_k, the following optimal policies of the two players are obtained:
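The stationarity step can be made concrete: once a symmetric Q-matrix H is available, the saddle-point gains follow from a single linear solve over its (u, a) blocks, with no system matrices involved. The partitioning and numbers below are illustrative assumptions, not the patent's:

```python
import numpy as np

# Illustrative dimensions: augmented state n, control m, attack q.
n, m, q = 3, 1, 1
rng = np.random.default_rng(0)

# A stand-in for a learned symmetric Q-matrix H.  For a well-posed saddle
# point the (u, u) block must be positive definite and the (a, a) block
# negative definite; the off-diagonal blocks are arbitrary here.
Hxu = rng.standard_normal((n, m))
Hxa = rng.standard_normal((n, q))
Hua = rng.standard_normal((m, q))
Hxx = np.eye(n)
Huu = 2.0 * np.eye(m)
Haa = -3.0 * np.eye(q)
H = np.block([[Hxx,   Hxu,   Hxa],
              [Hxu.T, Huu,   Hua],
              [Hxa.T, Hua.T, Haa]])

# Stationarity of Q(X, u, a) in (u, a):
#   [Huu Hua; Hua' Haa] [u; a] = -[Hxu'; Hxa'] X
Hblock = np.block([[Huu, Hua], [Hua.T, Haa]])
Hcross = np.vstack([Hxu.T, Hxa.T])
KL = -np.linalg.solve(Hblock, Hcross)
K, L = KL[:m, :], KL[m:, :]        # u = K X, a = L X
```

Setting the gradient of the quadratic form to zero in (u, a) gives exactly the linear system solved above; Huu > 0 and Haa < 0 make the stationary point a min-max saddle rather than a pure minimum.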
Substituting (20) into (19) yields the Bellman equation based on the Q-function, an important equation in the iterative process. The Q-function-based policy iteration and value iteration methods are given in Algorithm 3 and Algorithm 4, respectively.
Algorithm 3: Q-function-based policy iteration algorithm
1. Initialization: set j = 0 and select H^0 = (H^0)^T.
2. Policy evaluation: solve the following equation for H^{j+1}:
3. Policy improvement:
4. Stopping condition: ||H^{j+1} − H^j|| < ε.
Algorithm 4: Q-function-based value iteration algorithm
1. Initialization: set j = 0 and select H^0 = (H^0)^T.
2. Policy evaluation: solve the following equation for H^{j+1}:
3. Policy improvement:
4. Stopping condition: ||H^{j+1} − H^j|| < ε.
It is worth noting that the Q-function-based iterative Algorithms 3 and 4 do not require prior knowledge of the augmented system matrices.
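As a sketch of what a model-free variant of Algorithm 4 can look like, the quadratic Q-matrix below is re-fitted by least squares from simulated transitions at every iteration, so the system matrices appear only in the data generator and never in the learning update. All dynamics, dimensions, and weights are invented for illustration:

```python
import numpy as np

# Illustrative stand-ins (not the patent's matrices): the plant is used only
# to *generate* transitions; the learning update itself touches data alone.
rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
D = B.copy()
Q, R, Ra, gamma = np.eye(2), np.eye(1), 10.0 * np.eye(1), 0.9
n, m, q = 2, 1, 1
d = n + m + q

idx = [(i, j) for i in range(d) for j in range(i, d)]

def phi(z):
    """Quadratic features of z = [X; u; a] for a symmetric parameterization."""
    return np.array([(2.0 if i != j else 1.0) * z[i] * z[j] for i, j in idx])

def unpack(h):
    """Parameter vector back to the symmetric matrix H."""
    H = np.zeros((d, d))
    for c, (i, j) in enumerate(idx):
        H[i, j] = H[j, i] = h[c]
    return H

P = np.zeros((n, n))                       # V_0 = 0
for _ in range(400):
    X = rng.standard_normal((60, n))       # exploratory states and actions
    U = rng.standard_normal((60, m))
    Aa = rng.standard_normal((60, q))
    Xn = X @ A.T + U @ B.T + Aa @ D.T      # simulated one-step transitions
    cost = (np.einsum('ki,ij,kj->k', X, Q, X)
            + np.einsum('ki,ij,kj->k', U, R, U)
            - np.einsum('ki,ij,kj->k', Aa, Ra, Aa))
    target = cost + gamma * np.einsum('ki,ij,kj->k', Xn, P, Xn)
    Phi = np.vstack([phi(np.concatenate(z)) for z in zip(X, U, Aa)])
    H = unpack(np.linalg.lstsq(Phi, target, rcond=None)[0])
    Hblock, Hcross = H[n:, n:], H[n:, :n]  # (u,a) block and its state coupling
    P_next = H[:n, :n] - Hcross.T @ np.linalg.solve(Hblock, Hcross)
    if np.linalg.norm(P_next - P) < 1e-10:
        P = P_next
        break
    P = P_next

KL = -np.linalg.solve(Hblock, Hcross)      # learned saddle-point gains [K; L]
```

Because each least-squares target is exactly quadratic in (X, u, a), the fitted H reproduces the model-based value-iteration update whenever the batch is rich enough; exploration noise on both u and a plays the role of the persistency-of-excitation condition.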
5) Iteratively solving the optimal policy by Q-learning based on input-output data
Assume the system is observable; then the system state X_k can be expressed by the following input-output sequence:
where
As can be seen from the above equation, there exists a constant κ > 0 such that rank(V_N) < n + p for N < κ and rank(V_N) = n + p for N ≥ κ, where n is the state dimension of the original system and p is the output dimension. Therefore, choosing N ≥ κ makes the matrix V_N have full column rank. Define
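The horizon κ can be found numerically as the smallest N for which the stacked observability matrix has full column rank. The sketch below uses an illustrative pair (A, C) and, for simplicity, reconstructs an initial state from outputs alone; the patent's V_N additionally involves the input sequence:

```python
import numpy as np

# Illustrative observable pair (not the patent's matrices).
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
n = A.shape[0]

def stacked_observability(A, C, N):
    """V_N = [C; CA; ...; CA^{N-1}]."""
    return np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(N)])

# Smallest horizon with full column rank (kappa in the text).
kappa = next(N for N in range(1, n + 1)
             if np.linalg.matrix_rank(stacked_observability(A, C, N)) == n)

# Reconstructing an unknown initial state from kappa output samples of the
# autonomous dynamics x_{k+1} = A x_k, y_k = C x_k.
x0 = np.array([1.5, -0.7])
ys = [C @ np.linalg.matrix_power(A, i) @ x0 for i in range(kappa)]
x0_hat = np.linalg.lstsq(stacked_observability(A, C, kappa),
                         np.concatenate(ys), rcond=None)[0]
```

For this pair the second output row C·A already supplies the missing direction, so κ equals the state dimension; in general κ is bounded above by n + p for the augmented system, which is why N ≥ κ guarantees full column rank of V_N.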
Then the Q-function can be written in the following form:
Therefore, the optimal policies of the two players are obtained as
where
The Bellman equation based on the Q-function and input-output data can be written as
Linearly parameterizing the Q-function yields
In the above formula, the unknown matrix is symmetric, so only its upper-triangular elements need to be identified. Based on the above analysis, Algorithms 5 and 6 give the policy iteration and value iteration methods using Q-learning, respectively; these methods use only input-output data.
Algorithm 5: policy iteration algorithm using Q-learning
1. Initialization: set j = 0 and select stabilizing initial policies K^0 and L^0.
2. Policy evaluation: solve the following equation for h^{j+1}:
3. Policy improvement:
4. Stopping condition: ||H^{j+1} − H^j|| < ε.
Algorithm 6: value iteration algorithm using Q-learning
1. Initialization: set j = 0 and select arbitrary initial policies K^0 and L^0.
2. Policy evaluation: solve the following equation for h^{j+1}:
3. Policy improvement:
4. Stopping condition: ||H^{j+1} − H^j|| < ε.
As can be seen from Algorithm 6, the initial policies of the two players need not be stabilizing. In addition, the number of samples used in the recursive computation must satisfy a corresponding lower bound.

Claims (5)

1. A design method of an optimal tracking controller in the presence of false data injection attacks, characterized by comprising the following steps:
Step 1: establish the false data attack model and the augmented system model;
Step 2: using game theory, establish the game model of the attacker and the defender;
Step 3: use the Q-function-based reinforcement learning method to solve for the optimal policies of both players, including policy iteration and value iteration;
Step 4: based on input-output data, iteratively solve for the optimal policy with Q-learning.
2. The design method of an optimal tracking controller in the presence of false data injection attacks according to claim 1, characterized in that step 1 specifically comprises:
Consider the following system model:
x_{k+1} = A x_k + B u_k
where A and B are the system matrices. If the control input u_k is attacked during transmission, the system model after the false data injection attack becomes:
x_{k+1} = A x_k + B(u_k + Σ_{j=1}^{q} Γ_j a_k^j)
where q is the number of attackers; γ_{ij} = 1 indicates that the i-th transmission channel is attacked by the j-th attacker, otherwise it is not attacked; a_k^j is the false data injected by the j-th attacker at time k;
Assume the reference trajectory model has the following form:
r_{k+1} = T r_k
where the matrix T is the trajectory model matrix; the augmented system can then be stated as:
3. The design method of an optimal tracking controller in the presence of false data injection attacks according to claim 1, characterized in that step 2 specifically comprises:
Assume that the tracking controller and the false data are linear functions of the augmented state X_k, i.e.
where K = [K_1, K_2] and L are the feedback gains of the defender and the attacker, respectively.
The two players choose the following payoff functions:
where γ ∈ (0,1) is the discount factor and Q_e and R are given positive semidefinite and positive definite matrices, respectively; the optimal policies of the defender and the attacker are designed as:
4. The design method of an optimal tracking controller in the presence of false data injection attacks according to claim 1, characterized in that step 3 specifically comprises:
Define the Q-function as follows:
By solving the stationarity equations of the Q-function with respect to u_k and a_k, the following optimal policies of the two players are obtained:
where the Q-function-based policy iteration and value iteration methods are given in Algorithm 1 and Algorithm 2, respectively;
Algorithm 1: the Q-function-based policy iteration algorithm comprises the following steps:
1) Initialization: set j = 0 and select H^0 = (H^0)^T;
2) Policy evaluation: solve the following equation for H^{j+1}:
3) Policy improvement:
4) Stopping condition: ||H^{j+1} − H^j|| < ε;
Algorithm 2: the Q-function-based value iteration algorithm comprises the following steps:
1) Initialization: set j = 0 and select H^0 = (H^0)^T;
2) Policy evaluation: solve the following equation for H^{j+1}:
3) Policy improvement:
4) Stopping condition: ||H^{j+1} − H^j|| < ε.
5. The design method of an optimal tracking controller in the presence of false data injection attacks according to claim 1, characterized in that step 4 specifically comprises:
The system state X_k can be expressed by the following input-output sequence:
Then the Q-function can be written in the following form:
Therefore, the optimal policies of the two players are:
where
the policy iteration and value iteration methods using Q-learning are given in Algorithm 3 and Algorithm 4, respectively:
Algorithm 3: the policy iteration algorithm using Q-learning comprises the following steps:
1) Initialization: set j = 0 and select stabilizing initial policies K^0 and L^0;
2) Policy evaluation: solve the following equation for h^{j+1}:
3) Policy improvement:
4) Stopping condition: ||H^{j+1} − H^j|| < ε;
Algorithm 4: the value iteration algorithm using Q-learning comprises the following steps:
1) Initialization: set j = 0 and select arbitrary initial policies K^0 and L^0;
2) Policy evaluation: solve the following equation for h^{j+1}:
3) Policy improvement:
4) Stopping condition: ||H^{j+1} − H^j|| < ε.
CN201811453386.6A 2018-11-30 2018-11-30 Design method of optimal tracking controller in presence of false data injection attack Active CN109375514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811453386.6A CN109375514B (en) 2018-11-30 2018-11-30 Design method of optimal tracking controller in presence of false data injection attack


Publications (2)

Publication Number Publication Date
CN109375514A (en) 2019-02-22
CN109375514B CN109375514B (en) 2021-11-05

Family

ID=65376219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811453386.6A Active CN109375514B (en) 2018-11-30 2018-11-30 Design method of optimal tracking controller in presence of false data injection attack

Country Status (1)

Country Link
CN (1) CN109375514B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2140650B1 (en) * 2007-03-30 2011-05-25 International Business Machines Corporation Method and system for resilient packet traceback in wireless mesh and sensor networks
CN104994569A (en) * 2015-06-25 2015-10-21 厦门大学 Multi-user reinforcement learning-based cognitive wireless network anti-hostile interference method
CN106937295A (en) * 2017-02-22 2017-07-07 沈阳航空航天大学 Heterogeneous network high energy efficiency power distribution method based on game theory
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q
CN107819785A (en) * 2017-11-28 2018-03-20 东南大学 A kind of double-deck defence method towards power system false data injection attacks
CN108181816A (en) * 2018-01-05 2018-06-19 南京航空航天大学 A kind of synchronization policy update method for optimally controlling based on online data
CN108196448A (en) * 2017-12-25 2018-06-22 北京理工大学 False data injection attacks method based on inaccurate mathematical model
CN108512837A (en) * 2018-03-16 2018-09-07 西安电子科技大学 A kind of method and system of the networks security situation assessment based on attacking and defending evolutionary Game


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HAO LIU et al.: "Optimal Tracking Control of Linear Discrete-Time Systems Under Cyber Attacks", IFAC 2020 *
YING CHEN et al.: "Evaluation of Reinforcement Learning Based False Data Injection Attack to Automatic Voltage Control", IEEE *
YUZHE LI et al.: "SINR-based DoS Attack on Remote State Estimation: A Game-theoretic Approach", IEEE *
LIU Hao: ""Attack and Defense" of Cyber-Physical Systems" (信息物理系统的"攻与防"), Journal of Shenyang Aerospace University *
TIAN Jiwei et al.: "Optimal Defense Strategy Against Load Redistribution Attacks Based on Game Theory" (基于博弈论的负荷重分配攻击最佳防御策略), Computer Simulation (计算机仿真) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109932905A (en) * 2019-03-08 2019-06-25 辽宁石油化工大学 A kind of optimal control method of the Observer State Feedback based on non-strategy
CN109932905B (en) * 2019-03-08 2021-11-09 辽宁石油化工大学 Optimization control method based on non-strategy observer state feedback
CN110083064B (en) * 2019-04-29 2022-02-15 辽宁石油化工大学 Network optimal tracking control method based on non-strategy Q-learning
CN110083064A (en) * 2019-04-29 2019-08-02 辽宁石油化工大学 A kind of network optimal track control method based on non-strategy Q- study
CN111273543A (en) * 2020-02-15 2020-06-12 西北工业大学 PID optimization control method based on strategy iteration
CN111273543B (en) * 2020-02-15 2022-10-04 西北工业大学 PID optimization control method based on strategy iteration
CN111673750A (en) * 2020-06-12 2020-09-18 南京邮电大学 Speed synchronization control scheme of master-slave type multi-mechanical arm system under deception attack
CN111673750B (en) * 2020-06-12 2022-03-04 南京邮电大学 Speed synchronization control scheme of master-slave type multi-mechanical arm system under deception attack
CN112149361A (en) * 2020-10-10 2020-12-29 中国科学技术大学 Adaptive optimal control method and device for linear system
CN112149361B (en) * 2020-10-10 2024-05-17 中国科学技术大学 Self-adaptive optimal control method and device for linear system
CN112650057B (en) * 2020-11-13 2022-05-20 西北工业大学深圳研究院 Unmanned aerial vehicle model prediction control method based on anti-spoofing attack security domain
CN112650057A (en) * 2020-11-13 2021-04-13 西北工业大学深圳研究院 Unmanned aerial vehicle model prediction control method based on anti-spoofing attack security domain
CN113885330A (en) * 2021-10-26 2022-01-04 哈尔滨工业大学 Information physical system safety control method based on deep reinforcement learning
CN113885330B (en) * 2021-10-26 2022-06-17 哈尔滨工业大学 Information physical system safety control method based on deep reinforcement learning
CN114415633A (en) * 2022-01-10 2022-04-29 云境商务智能研究院南京有限公司 Security tracking control method based on dynamic event trigger mechanism under multi-network attack
CN114415633B (en) * 2022-01-10 2024-02-02 云境商务智能研究院南京有限公司 Security tracking control method based on dynamic event triggering mechanism under multi-network attack
CN115877871A (en) * 2023-03-03 2023-03-31 北京航空航天大学 Non-zero and game unmanned aerial vehicle formation control method based on reinforcement learning

Also Published As

Publication number Publication date
CN109375514B (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN109375514A (en) A kind of optimal track control device design method when the injection attacks there are false data
Yan et al. A path planning algorithm for UAV based on improved Q-learning
CN108803349B (en) Optimal consistency control method and system for nonlinear multi-agent system
Duan et al. Imperialist competitive algorithm optimized artificial neural networks for UCAV global path planning
Givigi et al. A reinforcement learning adaptive fuzzy controller for differential games
Yu et al. Distributed multi‐agent deep reinforcement learning for cooperative multi‐robot pursuit
Fang et al. Target‐driven visual navigation in indoor scenes using reinforcement learning and imitation learning
Schultz et al. Improving tactical plans with genetic algorithms
Wei et al. Recurrent MADDPG for object detection and assignment in combat tasks
Yue et al. Deep reinforcement learning for UAV intelligent mission planning
Liu et al. Task assignment in ground-to-air confrontation based on multiagent deep reinforcement learning
CN111811532B (en) Path planning method and device based on impulse neural network
CN115047907B (en) Air isomorphic formation command method based on multi-agent PPO algorithm
Xiao et al. Graph attention mechanism based reinforcement learning for multi-agent flocking control in communication-restricted environment
Cao et al. Autonomous maneuver decision of UCAV air combat based on double deep Q network algorithm and stochastic game theory
Xu et al. Pursuit and evasion game between UVAs based on multi-agent reinforcement learning
Esrafilian et al. Model-aided deep reinforcement learning for sample-efficient UAV trajectory design in IoT networks
Yang et al. Learning graph-enhanced commander-executor for multi-agent navigation
Zhao et al. Deep Reinforcement Learning‐Based Air Defense Decision‐Making Using Potential Games
CN116165886A (en) Multi-sensor intelligent cooperative control method, device, equipment and medium
Tuba et al. Water cycle algorithm for robot path planning
Lin et al. Choice of discount rate in reinforcement learning with long-delay rewards
Liu et al. A distributed driving decision scheme based on reinforcement learning for autonomous driving vehicles
Bromo Reinforcement Learning Based Strategic Exploration Algorithm for UAVs Fleets
Yang et al. An interrelated imitation learning method for heterogeneous drone swarm coordination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220718

Address after: 452370 Building 2, Xingfu industrial new town, Micun Town, Xinmi City, Zhengzhou City, Henan Province

Patentee after: Shensu intelligent agricultural machinery equipment (Henan) Co.,Ltd.

Address before: 110136, Liaoning, Shenyang, Shenbei New Area moral South Avenue No. 37

Patentee before: SHENYANG AEROSPACE University