CN114527666B - CPS system reinforcement learning control method based on attention mechanism - Google Patents

CPS system reinforcement learning control method based on attention mechanism

Info

Publication number
CN114527666B
CN114527666B (application CN202210221958.8A)
Authority
CN
China
Prior art keywords
sensor
environment
state
strategy
control object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210221958.8A
Other languages
Chinese (zh)
Other versions
CN114527666A (en)
Inventor
卢岩涛 (Lu Yantao)
李青 (Li Qing)
孙仕琦 (Sun Shiqi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202210221958.8A
Publication of CN114527666A
Application granted
Publication of CN114527666B
Legal status: Active

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B 13/04: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators
    • G05B 13/042: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators, in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides a reinforcement learning control method for a CPS (cyber-physical system) based on an attention mechanism, comprising the following steps: the control object selects a suitable policy through a policy network and executes it in the environment; the environment changes and responds under the executed policy and generates a reward; a plurality of preset sensors detect the environment to obtain detection information from the plurality of sensors; the sensor detection information is fed into a self-attention network, the obtained reward and the current state of the sensor information are input into the policy network at the same time, the gradient of the policy network is updated, the policy for the next time period is selected as the network's next input, and these steps are repeated to complete the learning control method. When a reinforcement learning algorithm is used to solve a practical control problem, the method imposes more relaxed and convenient design requirements on the reward; that is, part of the information can be learned through the implicit knowledge carried by the sensors.

Description

CPS system reinforcement learning control method based on attention mechanism
Technical Field
The invention belongs to the technical field of CPS system learning control methods, and particularly relates to a CPS system reinforcement learning control method based on an attention mechanism.
Background
In current CPS (cyber-physical) systems, designing a sound intelligent control algorithm that combines the sensing information of many sensors has become a long-standing problem. Among intelligent algorithms, reinforcement learning, at the forefront of academic research, has received a great deal of attention. Although reinforcement learning, and Q-learning in particular, is a black-box, machine-learning-based model and therefore less interpretable than traditional model-based approaches, it does not need to be redesigned for each plant model, adapts well, trains comparatively easily, and behaves more intelligently; this combination of characteristics has made it widely favored.
However, a significant problem remains: conventional reinforcement learning models are formulated for learning in the abstract, without the modifications required for application to a CPS system, and because a complex CPS system contains a large number of sensors, training the reinforcement learning model becomes considerably harder, which limits the improvement the model can achieve.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a CPS system reinforcement learning control method based on an attention mechanism that addresses the problems identified in the background art.
To solve this technical problem, the invention adopts the following technical scheme: a CPS system reinforcement learning control method based on an attention mechanism, comprising the following steps:
S1, the control object selects a suitable policy through a policy network and executes it in the environment;
S2, the environment changes and responds under the executed policy, and a reward is generated;
S3, a plurality of preset sensors detect the environment to obtain detection information from the plurality of sensors;
S4, the sensor detection information is fed into a self-attention network; at the same time, the self-attention network automatically acquires the control object's last action and computes the needed sensor information using both the sensor detection information and that last action as references;
S5, the reward and the current state of the screened sensor information are input into the policy network simultaneously, the gradient of the policy network is updated, the policy for the next time period is selected as the next input to the policy network, and the steps are repeated to complete the learning control method (see the interface sketch below).
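To make the data flow of S1 to S5 concrete, the following is a minimal interface sketch; the class and method names (Environment, SensorArray, SelfAttentionScreen, PolicyNetwork) are illustrative assumptions and are not specified by the invention.

```python
# Minimal sketch of the S1-S5 data flow; all names here are assumptions.
from typing import Protocol, Sequence

class Environment(Protocol):
    def execute(self, action: int) -> float:
        """S1-S2: apply the selected policy's action; return the generated reward."""

class SensorArray(Protocol):
    def read(self) -> Sequence[float]:
        """S3: the preset sensors detect the environment."""

class SelfAttentionScreen(Protocol):
    def screen(self, sensor_info: Sequence[float], last_action: int) -> Sequence[float]:
        """S4: compute the needed sensor information from the readings and the last action."""

class PolicyNetwork(Protocol):
    def update_and_select(self, screened: Sequence[float], reward: float) -> int:
        """S5: update the network's gradient and select the next period's policy."""
```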
Further, the learning control method is divided into a training mode and an execution mode.
Further, the execution mode includes the steps of:
S101, at time k, the state of the control object is $s_k^{agent} \in S_{agent}$ and the state of the environment is $s_k^{env} \in S_{env}$; take action $u_k \in A$;
S102, under the influence of this action, the state of the environment becomes $s_{k+1}^{env} = F_{env}(s_k^{env}, u_k)$, the state of the control object becomes $s_{k+1}^{agent} = F_{state}(s_k^{agent}, u_k)$, and the reward value is $r_k = F_{reward}(s_k^{agent}, u_k)$;
S103, for the environment state $s_{k+1}^{env}$ at time k+1, the sensors capture information in the environment, obtaining $s_{k+1}^{sensor} = F_{sensor}(s_{k+1}^{env})$;
S104, based on the sensor information $s_{k+1}^{sensor}$ of this time period and the action $u_k$ of the previous time period, the screened sensor information is obtained using the self-attention model: $s_{k+1}^{att\_sensor} = \sigma_{attention}(s_{k+1}^{sensor}, u_k)$;
S105, combining the above information, the control object infers the action to be performed in the next time period: $u_{k+1} = \sigma_{agent}(s_{k+1}^{att\_sensor}, s_{k+1}^{agent})$;
S106, execute action $u_{k+1}$ and return to S101;
wherein $S_{env}$ represents the state space of the environment; $S_{agent}$ represents the state space of the control object; $S_{sensor}$ represents the states of the parameters obtained by the sensors;
$A$ represents the finite set of actions the control object can take; $P$ represents the transition probability, i.e. the probability that, after an action is taken, $s_k^{env}$ transfers to $s_{k+1}^{env}$; $R$ is the reward function; $\gamma$ represents the discount factor;
Sensor reading of the environment: $F_{sensor}: S_{env} \to S_{sensor}$;
Environment change: $F_{env}: S_{env} \times A \to S_{env}$;
Reward function: $F_{reward}: S_{agent} \times A \to R$;
State change function: $F_{state}: S_{agent} \times A \to S_{agent}$;
There are also two end-to-end models that can be obtained by machine learning: the self-attention neural network $\sigma_{attention}: S_{sensor} \times A \to S_{att\_sensor}$, and the neural network through which the control object selects its action policy, $\sigma_{agent}: S_{att\_sensor} \times S_{agent} \to A$;
$S_{sensor}$ represents the information obtained by the sensors sensing the external environment;
$S_{att\_sensor}$ represents the sensor information retained after passing through the self-attention mechanism.
Further, the training mode includes the steps of:
S201, at time k, the state of the control object is $s_k^{agent} \in S_{agent}$ and the state of the environment is $s_k^{env} \in S_{env}$; take action $u_k \in A$;
S202, under the influence of this action, the state of the environment becomes $s_{k+1}^{env} = F_{env}(s_k^{env}, u_k)$, the state of the control object becomes $s_{k+1}^{agent} = F_{state}(s_k^{agent}, u_k)$, and the reward value is $r_k = F_{reward}(s_k^{agent}, u_k)$;
S203, for the environment state $s_{k+1}^{env}$ at time k+1, the sensors capture information in the environment, obtaining $s_{k+1}^{sensor} = F_{sensor}(s_{k+1}^{env})$;
S204, based on the sensor information $s_{k+1}^{sensor}$ of this time period and the action $u_k$ of the previous time period, the screened sensor information is obtained using the self-attention model: $s_{k+1}^{att\_sensor} = \sigma_{attention}(s_{k+1}^{sensor}, u_k)$;
S205, combining the above information, the control object infers the action to be performed in the next time period: $u_{k+1} = \sigma_{agent}(s_{k+1}^{att\_sensor}, s_{k+1}^{agent})$;
S206, execute action $u_{k+1}$ and return to S201, collecting each data pairing along the way;
S207, with the collected data pairings as a data set, perform joint gradient descent on the neural networks $\sigma_{attention}$ and $\sigma_{agent}$, take the updated parameters as the new networks, and return to the first step until convergence.
Compared with the prior art, the invention has the following advantages:
The method introduces a self-attention mechanism into the screening of sensor information and uses the action of the previous time period as part of that screening, so this information is taken into account. As a result, when a reinforcement learning algorithm is used to solve a practical control problem, the design requirements on the reward are more relaxed and convenient; that is, part of the information can be learned through the implicit knowledge carried by the sensors. Meanwhile, because attention-based screening of the sensors is in place, a large number of sensors can be added when building the CPS system, making the method suitable for more application scenarios and widening the applicable range of the CPS system.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely. Obviously, the described embodiments are only some embodiments of the invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
The invention provides the following technical scheme: a CPS system reinforcement learning control method based on an attention mechanism, comprising the following steps:
S1, the control object selects a suitable policy through a policy network and executes it in the environment;
S2, the environment changes and responds under the executed policy, and a reward is generated;
S3, a plurality of preset sensors detect the environment to obtain detection information from the plurality of sensors;
S4, the sensor detection information is fed into a self-attention network; at the same time, the self-attention network automatically acquires the control object's last action and computes the needed sensor information using both the sensor detection information and that last action as references;
S5, the reward and the current state of the screened sensor information are input into the policy network simultaneously, the gradient of the policy network is updated, the policy for the next time period is selected as the next input to the policy network, and the steps are repeated to complete the learning control method.
Specifically, the method involves a main control object, the agent, which contains a policy-selector network for automatically choosing a control policy, and an external environment, which is chiefly the external scene in which the control method is applied;
When the agent executes a policy in the environment, some interactions with the environment occur, such as walking around obstacles or taking away objects in the scene. These changes produce stimuli in the environment and simultaneously affect the states of both the control object and the environment, so that certain changes arise;
After these changes occur, the reward mechanism determines how much of the final target remains to be completed, both for the environment as a whole and between the control object and the target;
By comparing the distance between the control object and the target before and after the policy is executed, one can tell whether the policy played a positive or a negative role for overall control, and the reward is then defined according to that role;
If the effect on overall control is positive, a positive reward is given; if it is negative, a penalty is given.
The reward can be defined as needed for the specific application scenario; for example, in a robot path-planning task it can be defined in terms of the distance between the robot and the target, as in the sketch below;
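As one concrete illustration of such a scenario-specific reward, the sketch below uses the robot path-planning example; the negative-distance form is an assumption chosen for illustration, not a form prescribed by the invention.

```python
import math

def path_planning_reward(robot_xy: tuple, goal_xy: tuple) -> float:
    """Hypothetical reward for the robot path-planning example: moving
    closer to the target raises the reward, moving away lowers it."""
    return -math.dist(robot_xy, goal_xy)

# A policy step that reduces the distance yields a higher reward than the
# previous step (a positive effect); increasing it acts as a penalty.
```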
in the context of algorithm application, a large number of sensors, such as infrared sensors, distance sensors, temperature sensors, pressure sensors, etc., may be included to construct a series of sensors for external environmental conditions in real time, to obtain the state of the environment, so that the environment can be sensitively captured after the environment interacts with the control object to generate a change.
The sensor system is connected to a screening network built on a self-attention mechanism. Its main purpose is to combine the execution policy of the control object in the previous step with the correlation between actions and environment interaction, together with self-attention over the sensing space, to screen the large volume of sensor information; this end-to-end, machine-learning-based screening network directly yields the needed sensor information. One possible layer-level realisation is sketched below.
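The invention does not fix a layer-level architecture for the screening network; the sketch below is one plausible reading in PyTorch, assuming each scalar sensor reading is embedded as a token and the previous action is injected as an additional token before standard self-attention. All layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SensorScreeningNetwork(nn.Module):
    """Assumed realisation of sigma_attention: S_sensor x A -> S_att_sensor.
    Screens many sensor readings with self-attention conditioned on the
    previous action u_k."""
    def __init__(self, n_sensors: int, n_actions: int, d_model: int = 32):
        super().__init__()
        self.sensor_embed = nn.Linear(1, d_model)             # one token per scalar reading
        self.action_embed = nn.Embedding(n_actions, d_model)  # previous action as an extra token
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out = nn.Linear(d_model, 1)

    def forward(self, sensors: torch.Tensor, last_action: torch.Tensor) -> torch.Tensor:
        # sensors: (batch, n_sensors); last_action: (batch,) action indices
        tokens = self.sensor_embed(sensors.unsqueeze(-1))     # (batch, n_sensors, d_model)
        act = self.action_embed(last_action).unsqueeze(1)     # (batch, 1, d_model)
        seq = torch.cat([act, tokens], dim=1)                 # action token + sensor tokens
        attended, _ = self.attn(seq, seq, seq)                # self-attention screening
        return self.out(attended[:, 1:, :]).squeeze(-1)       # screened info, (batch, n_sensors)
```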
The output of the self-attention network and the reward from the environment are, after data normalization and coupling, input into the agent's policy-selection network to select a suitable policy. The policy space must be designed specifically for each scenario and can be continuous or discrete: a discrete action space contains various discrete actions, such as switching a device on or off or picking up an object, while a continuous action space contains a continuum of actions, for example controlling the speed and angle at which a robot moves. A sketch of such a network follows.
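A matching sketch of the policy-selection network σ_agent, again with assumed layer sizes; the `discrete` flag switches between the discrete and continuous action spaces described above.

```python
import torch
import torch.nn as nn

class PolicySelectionNetwork(nn.Module):
    """Assumed realisation of sigma_agent: S_att_sensor x S_agent -> A."""
    def __init__(self, n_sensors: int, state_dim: int, action_dim: int, discrete: bool = True):
        super().__init__()
        self.discrete = discrete
        self.body = nn.Sequential(
            nn.Linear(n_sensors + state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim),
        )

    def forward(self, screened: torch.Tensor, agent_state: torch.Tensor):
        out = self.body(torch.cat([screened, agent_state], dim=-1))
        if self.discrete:
            # discrete actions, e.g. switch a device, pick up an object
            return torch.distributions.Categorical(logits=out)
        # continuous actions, e.g. a robot's speed and angle (unit variance assumed)
        return torch.distributions.Normal(out, torch.ones_like(out))
```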
The learning control method is also divided into a training mode and an execution mode.
The execution mode comprises the following steps:
S101, at time k, the state of the control object is $s_k^{agent} \in S_{agent}$ and the state of the environment is $s_k^{env} \in S_{env}$; take action $u_k \in A$;
S102, under the influence of this action, the state of the environment becomes $s_{k+1}^{env} = F_{env}(s_k^{env}, u_k)$, the state of the control object becomes $s_{k+1}^{agent} = F_{state}(s_k^{agent}, u_k)$, and the reward value is $r_k = F_{reward}(s_k^{agent}, u_k)$;
S103, for the environment state $s_{k+1}^{env}$ at time k+1, the sensors capture information in the environment, obtaining $s_{k+1}^{sensor} = F_{sensor}(s_{k+1}^{env})$;
S104, based on the sensor information $s_{k+1}^{sensor}$ of this time period and the action $u_k$ of the previous time period, the screened sensor information is obtained using the self-attention model: $s_{k+1}^{att\_sensor} = \sigma_{attention}(s_{k+1}^{sensor}, u_k)$;
S105, combining the above information, the control object infers the action to be performed in the next time period: $u_{k+1} = \sigma_{agent}(s_{k+1}^{att\_sensor}, s_{k+1}^{agent})$;
S106, execute action $u_{k+1}$ and return to S101;
wherein $S_{env}$ represents the state space of the environment; $S_{agent}$ represents the state space of the control object; $S_{sensor}$ represents the states of the parameters obtained by the sensors;
$A$ represents the finite set of actions the control object can take; $P$ represents the transition probability, i.e. the probability that, after an action is taken, $s_k^{env}$ transfers to $s_{k+1}^{env}$; $R$ is the reward function; $\gamma$ represents the discount factor;
Sensor reading of the environment: $F_{sensor}: S_{env} \to S_{sensor}$;
Environment change: $F_{env}: S_{env} \times A \to S_{env}$;
Reward function: $F_{reward}: S_{agent} \times A \to R$;
State change function: $F_{state}: S_{agent} \times A \to S_{agent}$;
There are also two end-to-end models that can be obtained by machine learning: the self-attention neural network $\sigma_{attention}: S_{sensor} \times A \to S_{att\_sensor}$, and the neural network through which the control object selects its action policy, $\sigma_{agent}: S_{att\_sensor} \times S_{agent} \to A$;
$S_{sensor}$ represents the information obtained by the sensors sensing the external environment;
$S_{att\_sensor}$ represents the sensor information retained after passing through the self-attention mechanism. Composed together, these maps yield the execution loop sketched below.
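Composed, S101 to S106 form the following loop. The `env` object is assumed to expose the maps $F_{env}$, $F_{state}$ and $F_{sensor}$ defined above through `apply`, `agent_state` and `read_sensors`; those method names, like the initial action tensor, are illustrative assumptions.

```python
import torch

def run_execution_mode(env, screen_net, policy_net, u_k, steps: int = 1000):
    """Sketch of S101-S106; env's methods stand in for F_env/F_state/F_sensor.
    u_k: initial action as a (batch,) LongTensor of action indices."""
    for _ in range(steps):
        env.apply(u_k)                        # S102: environment and agent states change
        agent_state = env.agent_state()       # s_{k+1}^agent = F_state(s_k^agent, u_k)
        sensors = env.read_sensors()          # S103: s_{k+1}^sensor = F_sensor(s_{k+1}^env)
        with torch.no_grad():                 # execution mode performs no gradient updates
            screened = screen_net(sensors, u_k)               # S104: sigma_attention
            u_k = policy_net(screened, agent_state).sample()  # S105: sigma_agent
        # S106: loop back to S101 with action u_{k+1}
```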
The training mode comprises the following steps:
S201, at time k, the state of the control object is $s_k^{agent} \in S_{agent}$ and the state of the environment is $s_k^{env} \in S_{env}$; take action $u_k \in A$;
S202, under the influence of this action, the state of the environment becomes $s_{k+1}^{env} = F_{env}(s_k^{env}, u_k)$, the state of the control object becomes $s_{k+1}^{agent} = F_{state}(s_k^{agent}, u_k)$, and the reward value is $r_k = F_{reward}(s_k^{agent}, u_k)$;
S203, for the environment state $s_{k+1}^{env}$ at time k+1, the sensors capture information in the environment, obtaining $s_{k+1}^{sensor} = F_{sensor}(s_{k+1}^{env})$;
S204, based on the sensor information $s_{k+1}^{sensor}$ of this time period and the action $u_k$ of the previous time period, the screened sensor information is obtained using the self-attention model: $s_{k+1}^{att\_sensor} = \sigma_{attention}(s_{k+1}^{sensor}, u_k)$;
S205, combining the above information, the control object infers the action to be performed in the next time period: $u_{k+1} = \sigma_{agent}(s_{k+1}^{att\_sensor}, s_{k+1}^{agent})$;
S206, execute action $u_{k+1}$ and return to S201, collecting each data pairing along the way;
S207, with the collected data pairings as a data set, perform joint gradient descent on the neural networks $\sigma_{attention}$ and $\sigma_{agent}$, take the updated parameters as the new networks, and return to the first step until convergence. One assumed realisation of this joint update is sketched below.
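The invention states only that the collected pairings form a data set on which σ_attention and σ_agent undergo joint gradient descent; the REINFORCE-style surrogate loss below is one common way to realise such an update and is an assumption of this sketch, not a prescription of the patent.

```python
import torch

def joint_update(screen_net, policy_net, batch, optimizer, gamma: float = 0.99) -> float:
    """One joint gradient-descent step over sigma_attention and sigma_agent (S207).
    batch holds tensors (sensors, agent_state, last_action, action, reward)
    collected over an episode in S201-S206; the loss form is an assumption."""
    sensors, agent_state, last_action, action, reward = batch
    returns = torch.zeros_like(reward)
    running = 0.0
    for t in reversed(range(len(reward))):   # discounted return with discount factor gamma
        running = reward[t] + gamma * running
        returns[t] = running
    screened = screen_net(sensors, last_action)          # gradients reach sigma_attention
    log_prob = policy_net(screened, agent_state).log_prob(action)
    loss = -(log_prob * returns).mean()                  # REINFORCE-style surrogate
    optimizer.zero_grad()
    loss.backward()                                      # joint descent over both networks
    optimizer.step()
    return loss.item()
```

A single optimizer over both networks is what makes the descent joint, e.g. `optimizer = torch.optim.Adam(list(screen_net.parameters()) + list(policy_net.parameters()))`.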
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (1)

1. A CPS system reinforcement learning control method based on an attention mechanism, characterized by comprising the following steps:
S1, the control object selects a suitable policy through a policy network and executes it in the environment;
S2, the environment changes and responds under the executed policy, and a reward is generated;
S3, a plurality of preset sensors detect the environment to obtain detection information from the plurality of sensors;
S4, the sensor detection information is fed into a self-attention network; at the same time, the self-attention network automatically acquires the control object's last action and computes the needed sensor information using both the sensor detection information and that last action as references;
S5, the reward and the current state of the screened sensor information are input into the policy network simultaneously, the gradient of the policy network is updated, the policy for the next time period is selected as the next input to the policy network, and the steps are repeated to complete the learning control method, which is further divided into a training mode and an execution mode;
the execution mode comprises the following steps:
S101, at time k, the state of the control object is $s_k^{agent} \in S_{agent}$ and the state of the environment is $s_k^{env} \in S_{env}$; take action $u_k \in A$;
S102, under the influence of this action, the state of the environment becomes $s_{k+1}^{env} = F_{env}(s_k^{env}, u_k)$, the state of the control object becomes $s_{k+1}^{agent} = F_{state}(s_k^{agent}, u_k)$, and the reward value is $r_k = F_{reward}(s_k^{agent}, u_k)$;
S103, for the environment state $s_{k+1}^{env}$ at time k+1, the sensors capture information in the environment, obtaining $s_{k+1}^{sensor} = F_{sensor}(s_{k+1}^{env})$;
S104, based on the sensor information $s_{k+1}^{sensor}$ of this time period and the action $u_k$ of the previous time period, the screened sensor information is obtained using the self-attention model: $s_{k+1}^{att\_sensor} = \sigma_{attention}(s_{k+1}^{sensor}, u_k)$;
S105, combining the above information, the control object infers the action to be performed in the next time period: $u_{k+1} = \sigma_{agent}(s_{k+1}^{att\_sensor}, s_{k+1}^{agent})$;
S106, execute action $u_{k+1}$ and return to S101;
wherein $S_{env}$ represents the state space of the environment; $S_{agent}$ represents the state space of the control object; $S_{sensor}$ represents the states of the parameters obtained by the sensors;
$A$ represents the finite set of actions the control object can take; $P$ represents the transition probability, i.e. the probability that, after an action is taken, $s_k^{env}$ transfers to $s_{k+1}^{env}$; $R$ is the reward function; $\gamma$ represents the discount factor;
Sensor reading of the environment: $F_{sensor}: S_{env} \to S_{sensor}$;
Environment change: $F_{env}: S_{env} \times A \to S_{env}$;
Reward function: $F_{reward}: S_{agent} \times A \to R$;
State change function: $F_{state}: S_{agent} \times A \to S_{agent}$;
There are also two end-to-end models that can be obtained by machine learning: the self-attention neural network $\sigma_{attention}: S_{sensor} \times A \to S_{att\_sensor}$, and the neural network through which the control object selects its action policy, $\sigma_{agent}: S_{att\_sensor} \times S_{agent} \to A$;
$S_{sensor}$ represents the information obtained by the sensors sensing the external environment;
$S_{att\_sensor}$ represents the sensor information retained after passing through the self-attention mechanism;
the training mode comprises the following steps:
S201, at time k, the state of the control object is $s_k^{agent} \in S_{agent}$ and the state of the environment is $s_k^{env} \in S_{env}$; take action $u_k \in A$;
S202, under the influence of this action, the state of the environment becomes $s_{k+1}^{env} = F_{env}(s_k^{env}, u_k)$, the state of the control object becomes $s_{k+1}^{agent} = F_{state}(s_k^{agent}, u_k)$, and the reward value is $r_k = F_{reward}(s_k^{agent}, u_k)$;
S203, for the environment state $s_{k+1}^{env}$ at time k+1, the sensors capture information in the environment, obtaining $s_{k+1}^{sensor} = F_{sensor}(s_{k+1}^{env})$;
S204, based on the sensor information $s_{k+1}^{sensor}$ of this time period and the action $u_k$ of the previous time period, the screened sensor information is obtained using the self-attention model: $s_{k+1}^{att\_sensor} = \sigma_{attention}(s_{k+1}^{sensor}, u_k)$;
S205, combining the above information, the control object infers the action to be performed in the next time period: $u_{k+1} = \sigma_{agent}(s_{k+1}^{att\_sensor}, s_{k+1}^{agent})$;
S206, execute action $u_{k+1}$ and return to S201, collecting each data pairing along the way;
S207, with the collected data pairings as a data set, perform joint gradient descent on the neural networks $\sigma_{attention}$ and $\sigma_{agent}$, take the updated parameters as the new networks, and return to the first step until convergence.
CN202210221958.8A 2022-03-09 2022-03-09 CPS system reinforcement learning control method based on attention mechanism Active CN114527666B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210221958.8A CN114527666B (en) 2022-03-09 2022-03-09 CPS system reinforcement learning control method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210221958.8A CN114527666B (en) 2022-03-09 2022-03-09 CPS system reinforcement learning control method based on attention mechanism

Publications (2)

Publication Number Publication Date
CN114527666A CN114527666A (en) 2022-05-24
CN114527666B true CN114527666B (en) 2023-08-11

Family

Family ID: 81626389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210221958.8A Active CN114527666B (en) 2022-03-09 2022-03-09 CPS system reinforcement learning control method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN114527666B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3827092B2 (en) * 2003-10-22 2006-09-27 オムロン株式会社 Control system setting device, control system setting method, and setting program

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110824917A (en) * 2019-10-29 2020-02-21 西北工业大学 Semiconductor chip test path planning method based on attention mechanism reinforcement learning
CN111881772A (en) * 2020-07-06 2020-11-03 上海交通大学 Multi-mechanical arm cooperative assembly method and system based on deep reinforcement learning
CN113255054A (en) * 2021-03-14 2021-08-13 南京晓庄学院 Reinforcement learning automatic driving method based on heterogeneous fusion characteristics
CN113283169A (en) * 2021-05-24 2021-08-20 北京理工大学 Three-dimensional group exploration method based on multi-head attention asynchronous reinforcement learning
CN113255936A (en) * 2021-05-28 2021-08-13 浙江工业大学 Deep reinforcement learning strategy protection defense method and device based on simulation learning and attention mechanism
CN113313267A (en) * 2021-06-28 2021-08-27 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN113641192A (en) * 2021-07-06 2021-11-12 暨南大学 Route planning method for unmanned aerial vehicle crowd sensing task based on reinforcement learning
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN114038212A (en) * 2021-10-19 2022-02-11 南京航空航天大学 Signal lamp control method based on two-stage attention mechanism and deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yantao Lu et al. "Efficient Human Activity Classification from Egocentric Videos Incorporating Actor-Critic Reinforcement Learning." 2019 IEEE International Conference on Image Processing (ICIP), 2019, full text. *

Also Published As

Publication number Publication date
CN114527666A 2022-05-24

Similar Documents

Publication Publication Date Title
Salmeron et al. Dynamic optimization of fuzzy cognitive maps for time series forecasting
CN114818515A (en) Multidimensional time sequence prediction method based on self-attention mechanism and graph convolution network
CN108683614B (en) Virtual reality equipment cluster bandwidth allocation device based on threshold residual error network
CN111027686A (en) Landslide displacement prediction method, device and equipment
CN113325721B (en) Model-free adaptive control method and system for industrial system
CN109344992B (en) Modeling method for user control behavior habits of smart home integrating time-space factors
CN113077052A (en) Reinforced learning method, device, equipment and medium for sparse reward environment
CN116842856B (en) Industrial process optimization method based on deep reinforcement learning
EP3502978A1 (en) Meta-learning system
CN109615058A (en) A kind of training method of neural network model
Zhou et al. Time-varying trajectory modeling via dynamic governing network for remaining useful life prediction
CN114527666B (en) CPS system reinforcement learning control method based on attention mechanism
CN114781248A (en) Off-line reinforcement learning method and device based on state offset correction
JP7438365B2 (en) Learning utilization system, utilization device, learning device, program and learning utilization method
Kuo et al. Generalized part family formation through fuzzy self-organizing feature map neural network
CN117058235A (en) Visual positioning method crossing various indoor scenes
Mabu et al. Adaptability analysis of genetic network programming with reinforcement learning in dynamically changing environments
CN113743572A (en) Artificial neural network testing method based on fuzzy
CN116798198A (en) Sensor abnormality detection and early warning method based on multivariate time sequence prediction model
Markinos et al. Introducing Fuzzy Cognitive Maps for decision making in precision agriculture
CN116128028A (en) Efficient deep reinforcement learning algorithm for continuous decision space combination optimization
CN115459982A (en) Power network false data injection attack detection method
CN113723757A (en) Decision generation model training method, decision generation method and device
JP2022035737A (en) Control system, control method, control device and program
CN110084358A (en) A kind of smart home multi-Sensor Information Fusion Approach neural network based

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant