CN115392435A - Hierarchical decision method based on deep reinforcement learning - Google Patents

Hierarchical decision method based on deep reinforcement learning

Info

Publication number
CN115392435A
CN115392435A (application CN202211014161.7A)
Authority
CN
China
Prior art keywords
decision
tactical
layer
current
observation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211014161.7A
Other languages
Chinese (zh)
Inventor
朱燎原
包骐豪
夏少杰
瞿崇晓
王宇峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 52 Research Institute filed Critical CETC 52 Research Institute
Priority to CN202211014161.7A
Publication of CN115392435A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a hierarchical decision method based on deep reinforcement learning. The method initializes tactical decision layer agents and an intention recognition layer agent for a decision object; the intention recognition agent uses a deep recurrent Q-network (DRQN) to generate an intention recognition layer behavior from the input intention recognition observation information, and a tactical decision layer agent is selected according to that behavior; the selected tactical decision layer agent uses the deep deterministic policy gradient (DDPG) algorithm to compute its behavior from the tactical decision observation information. The tactical decision layer and the intention recognition layer are trained independently but coupled at decision time, which avoids the convergence difficulties of joint training, speeds up algorithm convergence during training, and improves the overall decision-making capability of the agent.

Description

Hierarchical decision method based on deep reinforcement learning
Technical Field
The application belongs to the technical field of deep reinforcement learning, and particularly relates to a hierarchical decision method based on deep reinforcement learning.
Background
Modern fighter aircraft are developing toward high automation, informatization, and intelligence. Battlefield environments are complex and changeable and battlefield information is diverse, so relying on the pilot alone to make combat decisions within a short time is a heavy burden. An intelligent decision support system is therefore urgently needed to assist the pilot in making real-time decisions in the face of complex battlefield situations.
Deep reinforcement learning is an artificial intelligence approach that does not depend on labeled samples; it improves the intelligence level of a decision system by learning from interaction with the environment and iteratively training the model. Deep reinforcement learning mainly addresses sequential decision problems and can make real-time decisions according to current environment information, which makes it well suited to air combat game scenarios and a current research hotspot. Existing deep reinforcement learning algorithms for air combat applications have the following problems: the reward function is difficult to shape, convergence is slow, and interpretability is poor. These problems make training very difficult and reduce algorithm efficiency.
In a battlefield environment with complex conditions, fierce confrontation, and rapid change, battlefield situation assessment and target tactical intention recognition infer the situation from limited sensed information and assess the enemy's combat intention, the threat level, and the own side's probability of winning, forming a basic judgment that is of great significance for subsequent decision making.
The main shortcomings of existing air combat decision technology based on reinforcement learning are slow convergence caused by a large output space, difficult reward modeling, and poor interpretability.
Disclosure of Invention
The aim of the application is to provide a hierarchical decision method based on deep reinforcement learning for air combat confrontation simulation and deduction, solving the problems that reinforcement learning models are difficult to converge, rewards are sparse, and results are hard to interpret in air combat simulation confrontation.
In order to achieve the purpose, the technical scheme of the application is as follows:
a hierarchical decision-making method based on deep reinforcement learning comprises the following steps:
initializing tactical decision layer agents and intention recognition layer agents of decision objects;
the intention recognition agent uses a deep recurrent Q-network to generate an intention recognition layer behavior from the input intention recognition observation information, and a tactical decision layer agent is selected according to the intention recognition layer behavior;
the selected tactical decision layer agent uses the deep deterministic policy gradient algorithm to compute its behavior from the tactical decision observation information;
and the decision object executes the behavior command output by the tactical decision layer agent and updates its state information.
Further, the intention recognition observation information comprises the states of the observation objects over a preset discrete time window of length T, where the state of an observation object at a single moment comprises:
at least one of relative distance between the observed objects, relative angle of the observed objects, heading angle of the observed objects, survival status of the observed objects, radar status of the observed objects, weapon status of the observed objects, and motion status of the observed objects.
Further, the number of tactical decision layer agents is the same as the number of behaviors of the intention recognition agent.
Further, the tactical decision observation information comprises the state of the current decision object at the current moment and the states of the current decision object and the other observation objects, including: at least one of the relative distance between the current decision object and the other observation objects at the current moment, the relative angle between the current decision object and the other observation objects at the current moment, the heading angles of the current decision object and the other observation objects at the current moment, the radar state of the current decision object at the current moment, the weapon state of the current decision object at the current moment, the survival states of the current decision object and the other observation objects at the current moment, and the motion states of the current decision object and the other observation objects at the current moment.
The hierarchical decision method based on deep reinforcement learning of this application adopts a technical scheme consisting of two parts: a deep reinforcement learning algorithm for agent tactical decision and a deep reinforcement learning algorithm for intention recognition. The former handles the agent's low-level tactical decisions in a continuous behavior space; the latter handles the agent's upper-level intention recognition in a discrete behavior space. The two iterate with each other, which accelerates convergence and improves the overall decision-making capability of the agent. The advantages of the scheme are: (1) the tactical decision layer and the intention recognition layer are independent during training and coupled at decision time, which avoids the convergence difficulties of joint training and speeds up algorithm convergence; (2) the air combat decision is divided into an intention layer and a tactical layer, which better matches human cognition, so the training result has good interpretability; (3) the hierarchical algorithm splits the problem into two interrelated subproblems, making it possible for the agent to implement more complex tactical combinations.
Drawings
FIG. 1 is a flowchart of a hierarchical decision method based on deep reinforcement learning according to the present application;
FIG. 2 is a block diagram of a hierarchical decision process according to the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the application and are not restrictive.
An agent in reinforcement learning learns knowledge by interacting with the environment. In the interaction mode of this application, the intention recognition agent obtains observation information obs_IA from the environment and computes an output a_IA; a tactical decision agent is selected according to a_IA; the tactical decision agent obtains observation information obs_DA from the environment, computes a behavior a_DA, and returns it to the environment. The intention recognition agent affects the environment by switching tactical decision agents, while the tactical decision agents interact with the environment directly. Through continuous training, the intention recognition agent maximizes its return and intelligently switches among tactical models according to situation information, thereby realizing strategy scheduling. The method can be applied to hierarchical decision making in air combat and to intelligent game AI; the following embodiments use air combat as an example.
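The interaction loop just described can be sketched as follows; the class and method names (IntentionAgent, TacticalAgent, the environment interface) are illustrative assumptions rather than part of the disclosure:

```python
# Hypothetical sketch of the hierarchical decision loop described above.
# IntentionAgent, TacticalAgent and the environment API are assumptions.

def hierarchical_step(env, intention_agent, tactical_agents):
    # Upper layer: the intention recognition agent observes the situation
    # sequence and outputs a discrete behavior index a_IA.
    obs_ia = env.get_intention_observation()   # time series of states
    a_ia = intention_agent.act(obs_ia)         # index into the tactic set

    # The index selects one tactical decision agent from the pool
    # [DA_0, DA_1, ..., DA_{n-1}].
    tactical_agent = tactical_agents[a_ia]

    # Lower layer: the selected tactical agent maps the current combat
    # situation to a continuous action a_DA (maneuver, attack, switches).
    obs_da = env.get_tactical_observation()
    a_da = tactical_agent.act(obs_da)

    # The decision object (aircraft) executes the command and the
    # environment state is updated.
    next_state, reward, done = env.step(a_da)
    return next_state, reward, done
```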
In one embodiment, as shown in fig. 1, a hierarchical decision method based on deep reinforcement learning proposed by the present application includes:
s1, initializing tactical decision-making layer agents and intention identification layer agents of decision-making objects.
The environment initialization of this embodiment mainly constructs two types of agents: tactical decision layer agents and an intention recognition layer agent. For example, initialization yields a tactical decision layer agent set [DA_0, DA_1, ..., DA_{n-1}] and an intention recognition layer agent IA. The tactical decision layer agents are constructed according to tactics, and the number of constructed tactical decision layer agents equals the number of behaviors of the intention recognition layer agent.
The intention recognition agent recognizes the enemy's intention and outputs the own side's intention recognition layer behavior (i.e., a tactic), such as pursuit, tail evasion, lateral guidance, climb, descend, attack, and the like. The corresponding tactical decision agent is selected according to the output of the intention recognition agent, and that tactical decision agent outputs decision commands such as maneuver commands, attack commands, jamming switch commands, and sensor switch commands.
For example, the tactics mainly include pursuit, tail evasion, lateral guidance, climb, descend, attack, and the like. The output of a tactical decision layer agent mainly comprises maneuver commands (target waypoints), attack commands, jamming switch commands, sensor switch commands, and the like.
It should be noted that both the tactical decision layer agents and the intention recognition layer agent are deployed on the decision object that needs to make decisions. Taking air combat as an example, they are deployed on the own-side aircraft to make decisions for that aircraft; the decision object is the own-side aircraft.
S2, the intention recognition agent uses a deep recurrent Q-network to generate the intention recognition layer behavior a_IA from the input intention recognition observation information obs_IA, and a tactical decision layer agent is selected according to a_IA.
The observation information is obtained from the real-time situation of the air combat environment: in a simulation environment it is obtained through an API data interface, while in a real battlefield environment it is obtained comprehensively through fighter sensors, ground radar, and the like.
Assume the behavior set of the intention recognition agent is: pursuit, tail evasion, lateral guidance, climb, descend, and attack. a_IA = 0 means the intention-layer tactic is pursuit, so the tactical decision agent corresponding to pursuit is selected. That tactical decision agent outputs the specific maneuver path, jamming switch, sensor switch, and weapon commands during the pursuit.
a_IA = 1 means the intention-layer tactic is tail evasion, so the tactical decision agent corresponding to tail evasion is selected; it outputs the specific maneuver path, jamming switch, sensor switch, and weapon commands during the evasion.
This step realizes intention recognition: a Deep Recurrent Q-Network (DRQN) is used to generate the intention recognition layer behavior a_IA from the intention recognition observation information obs_IA.
The Deep Recurrent Q-Network (DRQN) is an artificial intelligence algorithm that combines recurrent neural networks with Deep Q-Network (DQN) reinforcement learning. In air combat situation assessment, after a large amount of situation information is obtained, the problem can be described as: learn a representation of the input information with a network and output a targeted decision.
The change of the air combat situation is a continuous process, and the situation is difficult to describe with information from a single moment. To address this, the DRQN algorithm replaces a fully connected layer in DQN with an LSTM network. When the model is trained on partial observations and evaluated on full observations, its effectiveness depends on the completeness of the observation data; when it is trained on full observations and evaluated on partial observations, the performance of DRQN degrades less than that of DQN. The recurrent network therefore adapts better when the quality of the observations changes.
The network structure of DQN cannot learn temporal sequences, whereas DRQN can use the situation over the last n seconds, rather than only the previous second, as the state for learning and decision making. Because the current state no longer depends only on the last moment, the system loses the Markov property, and the reward is related to the current situation information as well as the situations at several previous moments.
In some cases the air combat situation information is only partially observable, and the MDP (Markov decision process) becomes a POMDP (partially observable Markov decision process). For a partially observable Markov decision process, processing incomplete observations with an RNN (recurrent neural network) gives better results, so DRQN handles missing information better than DQN.
A key issue for DRQN is to use a suitable network to fit the value function Q(s, a; θ) used for evaluation. DRQN only modifies the network structure of DQN: a fully connected layer after the input layer is replaced with an RNN, such as an LSTM or a GRU, and the network finally outputs the value Q(s, a) for each action a. During training, the fully connected layer, the output layer, and the recurrent layer are updated and iterated together. DRQN uses the recurrent network as a feature extractor; at each step the input is the time-series situation information and the output is the Q-value of each action. The DRQN network structure comprises an input layer, an LSTM/GRU layer, a fully connected layer, and an output layer.
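A minimal PyTorch sketch of such a DRQN structure (input layer, LSTM layer, fully connected layer, output layer); the layer sizes and framework choice are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Recurrent Q-network: feature layer -> LSTM -> Q-values per action."""
    def __init__(self, state_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.feature = nn.Linear(state_dim, hidden_dim)   # input layer
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)      # Q(s, a) per action

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, T, state_dim), the time-series situation input
        x = torch.relu(self.feature(obs_seq))
        x, hidden = self.lstm(x, hidden)
        q_values = self.head(x[:, -1])   # Q-values from the last time step
        return q_values, hidden
```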
The input situation information of the DRQN is a time series: the states of the observation objects relevant to the intention recognition layer agent over the past T historical moments. Selecting suitable situation information helps improve the learning speed of the model. In this embodiment, the intention recognition observation information comprises the states of the observation objects over a preset discrete time window of length T, where the state of an observation object at a single moment comprises:
at least one of relative distance between observed objects, relative angle of observed objects, heading angle of observed objects, survival state of observed objects, radar state of observed objects, weapon state of observed objects, and motion state of observed objects.
If the current moment is t, the situation information can be described as [x_{t-T+1}, x_{t-T+2}, …, x_{t-1}, x_t]. Assuming the observation objects are m red-side aircraft and n blue-side aircraft, the state vector x_i acquired at a single moment comprises the following elements:
the relative distances between the aircraft;
the relative angles between the aircraft;
the heading angles of the m + n aircraft;
the states and flight times of all missiles of the m + n aircraft;
the speeds of the m + n aircraft;
the detection radar states of the m + n aircraft;
the photoelectric radar states of the m + n aircraft;
the jammer states of the m + n aircraft;
the survival states of the m + n aircraft.
The state vectors at the T moments are combined to form the final situation information: [x_{t-T+1}, x_{t-T+2}, …, x_{t-1}, x_t]. Considering that in actual flight a pilot weighs situation information over a period of time and makes a situation judgment only after continuously observing the battlefield, the length T of the input time series is generally greater than 1. The observation information of this embodiment includes the states of all observation objects in the application scene at the T moments.
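As an illustration, the sliding window of the last T state vectors could be maintained with a fixed-length buffer such as the sketch below; the class name and the zero-padding at the start of a round are assumptions, not part of the patent:

```python
from collections import deque
import numpy as np

T = 8  # illustrative window length; the text only requires T > 1

class SituationBuffer:
    """Keeps the last T single-step state vectors x_i and returns the
    stacked sequence [x_{t-T+1}, ..., x_t] used as DRQN input."""
    def __init__(self, state_dim, length=T):
        self.buffer = deque(maxlen=length)
        self.state_dim = state_dim
        self.length = length

    def push(self, x_t):
        self.buffer.append(np.asarray(x_t, dtype=np.float32))

    def sequence(self):
        # Pad with zeros at the start of a round until T steps exist.
        pad = [np.zeros(self.state_dim, dtype=np.float32)] * (self.length - len(self.buffer))
        return np.stack(pad + list(self.buffer))   # shape (T, state_dim)
```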
The decisions of the intention recognition agent are relatively sparse, so the combat outcome can be used directly as the basis for optimizing the intention recognition agent. The reward functions R_r for the red side and R_b for the blue side are designed accordingly (the explicit formulas appear as images in the original): with m red-side aircraft and n blue-side aircraft, s_i is the survival-state reward parameter of the i-th aircraft, and α_1, α_2, α_3 are adjustable reward coefficients. If an own-side aircraft is shot down, a penalty is given; if an enemy aircraft is shot down, a reward is given; if the round ends in a win, a reward is given; if the round ends in a loss, a penalty is given.
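Since the explicit reward formulas are only given as images, the following is a hedged reading of the description above (penalty for own losses, reward for enemy kills, terminal win/loss bonus), with α1, α2, α3 kept as adjustable coefficients; the function signature is an assumption:

```python
def intention_reward(own_alive, enemy_alive, own_total, enemy_total,
                     episode_done, episode_won,
                     alpha1=1.0, alpha2=1.0, alpha3=5.0):
    """Hedged reconstruction of the intention-layer reward described in the
    text; the exact formulas in the patent are not recoverable here."""
    r = 0.0
    r -= alpha1 * (own_total - own_alive)      # penalty for own aircraft lost
    r += alpha2 * (enemy_total - enemy_alive)  # reward for enemy aircraft downed
    if episode_done:                           # terminal win/loss bonus
        r += alpha3 if episode_won else -alpha3
    return r
```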
The DRQN algorithm constructs the network Q(s, a; θ) from an LSTM and a fully connected network and trains the neural network by optimizing the temporal-difference error. With probability ε a tactical action a_t is selected at random for exploration, and with probability 1 - ε the action a_t = argmax_a Q(s, a; θ) is selected. A dual-network structure is used: the Q network and the target Q network have the same model structure and initial parameters; the model parameters of the Q network are updated in real time during iterative training and are periodically assigned to the target network, which is used to compute the temporal-difference error. The input of the Q(s, a; θ) network is a situation sequence.
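A short sketch of the ε-greedy selection and the periodic copy from the Q network to the target network described above, assuming q_net and target_net are two instances of the DRQN module sketched earlier:

```python
import random
import torch

def select_action(q_net, obs_seq, epsilon, n_actions):
    # With probability epsilon explore a random tactic,
    # otherwise take a_t = argmax_a Q(s, a; theta).
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values, _ = q_net(obs_seq.unsqueeze(0))  # obs_seq: (T, state_dim)
    return int(q_values.argmax(dim=1).item())

def sync_target(q_net, target_net):
    # Periodically assign the Q-network parameters to the target network,
    # which is only used to compute the temporal-difference target.
    target_net.load_state_dict(q_net.state_dict())
```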
The deep reinforcement learning algorithm for air combat intention recognition realizes tactic switching through situation assessment. It addresses the agent's intention recognition problem, and separating tactical decision training from intention recognition training accelerates the convergence of the agent and alleviates the sparse-reward problem. The time-series situation information is fed into a recurrent neural network to evaluate the value of each behavior, and the behavior with the maximum value is selected; each behavior corresponds to a tactic, which determines the corresponding tactical decision layer agent.
S3, the tactical decision layer agent uses the deep deterministic policy gradient algorithm to compute its behavior a_DA from the tactical decision observation information obs_DA.
The tactical decision observation information is obtained from the real-time situation of the air combat environment: in a simulation environment it is obtained through an API data interface, while in a real battlefield environment it is obtained comprehensively through fighter sensors, ground radar, and the like.
This step performs tactical decision making. The Deep Deterministic Policy Gradient (DDPG) algorithm selects the output command of the tactical decision layer agent, realizing tactical game decisions of the tactical decision agent in a continuous behavior space. The input of the model is the combat situation observed by the tactical decision agent in the current state, and the output is the decision action to be executed by the tactical decision agent.
Taking air combat as an example, the tactical decision observation information of this embodiment comprises the state of the current decision object and the states of the current decision object and the other observation objects, including: at least one of the relative distance between the current decision object and the other observation objects at the current moment, the relative angle between the current decision object and the other observation objects at the current moment, the heading angles of the current decision object and the other observation objects at the current moment, the radar state of the current decision object at the current moment, the weapon state of the current decision object at the current moment, the survival states of the current decision object and the other observation objects at the current moment, and the motion states of the current decision object and the other observation objects at the current moment.
Assume the observation objects are m red-side aircraft and n blue-side aircraft. The combat situation comprises the following elements:
the relative distances between the own aircraft and the other m + n - 1 aircraft at the current moment;
the relative angles between the own aircraft and the other m + n - 1 aircraft at the current moment;
the heading angles of all m + n aircraft at the current moment;
the states and flight times of all missiles of the own aircraft at the current moment;
the detection radar state of the own aircraft at the current moment;
the photoelectric radar state of the own aircraft at the current moment;
the jammer state of the own aircraft at the current moment;
the speeds of the m + n aircraft at the current moment;
the survival states of the m + n aircraft at the current moment.
During the interaction between the countermeasure simulation environment and the agent, the environment returns the state and reward to the agent, and the agent returns the decision behavior to the simulation environment and updates the aircraft state. Through this interaction, the aircraft, missiles, radars, and other assets are controlled.
The reward function is the criterion by which the agent judges its own performance, and establishing a reasonable reward function R(s) is a key issue in reinforcement learning. The reward function should balance realizability and sparsity, trading off tactic implementation against exploration. To build the basic tactic library, reward functions are designed for tactics including but not limited to pursuit, evasion, tangential evasion, guidance, and cross attack. Taking the pursuit tactic as an example, the reward function is modeled as follows. Assume the pursuing agent is taken as the reference frame and the relative bearing angle of the target is θ, with |θ| ≤ 180°; the reward function R can then be described as:
R(θ)=α*(90-abs(θ))/90
where α is a constant that can be adjusted as needed. The shape of the reward function is not unique; for value optimization it is preferable that positive and negative rewards are balanced over the range of the argument.
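A direct sketch of this pursuit reward, with α as the adjustable constant and θ given in degrees:

```python
def pursuit_reward(theta_deg, alpha=1.0):
    """R(theta) = alpha * (90 - |theta|) / 90: positive when the target is
    within +/-90 degrees of the pursuer's nose, negative otherwise."""
    return alpha * (90.0 - abs(theta_deg)) / 90.0
```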
This step builds a continuous-space air combat tactical decision model based on the DDPG algorithm and decides the actions of the tactical decision agent. The training process of the DDPG agent game decision is described below; the algorithm training steps are as follows.
1. Initialize the weights of the critic network Q(s, a | θ^Q) and the actor network μ(s | θ^μ).
2. Initialize the weights θ^Q' and θ^μ' of the target networks Q' and μ', and the experience pool R.
3. Initialize a random exploration process and obtain the initial state observation s_0.
4. Update step by step until s reaches a terminal state:
(1) select an action a_t based on the current policy and the exploration noise;
(2) execute action a_t and receive the reward r_t and the observation s_{t+1} from the environment;
(3) store (a_t, s_t, r_t, s_{t+1}) in the experience pool R;
(4) randomly sample a mini-batch of data from the experience pool R;
(5) update the actor and critic networks by computing the corresponding gradients (the update formulas appear as images in the original);
(6) update the target network parameters:
θ^Q' ← τ θ^Q + (1 - τ) θ^Q'
θ^μ' ← τ θ^μ + (1 - τ) θ^μ'
when the initial training is carried out, the step 1 and the step 2 are carried out, when each battlefield bureau starts, the step 3 is carried out, and then the step 4 is carried out in a circulating way in the battlefield bureau until the model training is finished. Wherein, state s t Battle observed by agent in current state for tactical decision layerSituation, i.e. observation, a t Decision actions, r, performed by tactical decision-making layer agents in the current situation of combat t Instant reward, s, obtained after tactical decision layer agent executes decision action t+1 And after the tactical decision layer intelligent agent executes the decision action, the tactical situation returned by the next state environment is entered. Tactical decision level agents interact with the air combat environment through a series of s, a and r with the goal of maximizing jackpot. The air war agent is in the current state s, and then calculates the decision behavior a according to the strategy μ = μ (a | s) t Will decide action a t And executed in the environment until the end of the war office. The optimal action cost function is calculated as follows:
Figure BDA0003811837860000092
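The mini-batch update of steps (4) to (6) can be sketched with the standard DDPG formulas as below; the actor/critic modules, optimizers, and hyperparameters are assumptions, the critic is assumed to take (state, action) as input, and terminal-state masking is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch  # tensors sampled from the experience pool R

    # Critic: minimize the temporal-difference error against the target networks.
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize Q(s, mu(s)) by gradient ascent (minimize the negative).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks: theta' <- tau*theta + (1-tau)*theta'.
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```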
the tactical decision model based on the DDPG algorithm is trained to have the following characteristics:
(1) The decision-making action of the air war intelligent agent is continuous action, and the algorithm utilizes a deep neural network to construct a criticc network Q (s, a | theta) Q ) And operator network mu (s | theta) μ )。
(2) The critic network is used for evaluating the current behavior and training the critic network by optimizing a time difference error function; the operator network is used to compute decision behavior and is trained by maximizing Q.
(3) Early in training, the agent first accumulates experience in the memory pool R. When the data in the memory pool reach a certain capacity, the agent samples data from it and updates the model, with every sample having the same sampling probability. While the model is being updated, the memory pool R is also updated: it stores only the latest N training samples and discards older ones.
(4) A dual-network structure is used: the Q network and the target network have the same model structure and initial parameters. The model parameters of the Q network are updated independently, and a weighted portion of them is then assigned to the target network model. Such small-range updates of the target network make the model easier to converge.
After the tactical decision models are trained, decision models are generated: each tactic corresponds to a specific model, and the models together form a selectable model library. The intention recognition layer can conveniently call these models to switch among them and make hierarchical decisions.
S4, the decision object executes the behavior command a_DA output by the tactical decision layer agent and updates its state information.
An aircraft is an individual in the real physical world or in a simulated world, and an agent is an algorithm model that provides behavior decisions for the aircraft. The behavior decision of the aircraft is realized jointly by the outputs of the tactical agent and the intention agent.
As noted above, the intention recognition agent recognizes the enemy's intention and outputs the own side's tactic (pursuit, tail evasion, lateral guidance, climb, attack, and the like). The tactical decision agent outputs decision commands according to that tactic, such as maneuver commands, attack commands, jamming switch commands, and sensor switch commands. The commands output by the tactical decision agent are the commands that the decision object, i.e., the aircraft, finally executes, and the aircraft updates accordingly. For example, if the waypoint output by the tactical decision agent is [1000, 1000], the aircraft flies to waypoint [1000, 1000]; if the decision command output by the tactical decision agent is to turn on jamming, the aircraft turns on jamming.
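As an illustration of this final step, a hedged sketch of how a tactical-layer output might be decoded into aircraft commands; the field names and the aircraft command interface are hypothetical, not taken from the patent:

```python
def execute_decision(aircraft, a_da):
    """Decode a hypothetical tactical-layer action vector and apply it to the
    decision object. The field layout is illustrative only."""
    waypoint = (a_da["waypoint_x"], a_da["waypoint_y"])   # e.g. [1000, 1000]
    aircraft.set_waypoint(waypoint)                        # maneuver command
    if a_da["fire"] > 0.5:
        aircraft.launch_missile()                          # attack command
    aircraft.set_jammer(a_da["jammer_on"] > 0.5)           # jamming switch
    aircraft.set_radar(a_da["radar_on"] > 0.5)             # sensor switch
    aircraft.update_state()                                # refresh state info
```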
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (4)

1. A hierarchical decision method based on deep reinforcement learning is characterized in that the hierarchical decision method based on deep reinforcement learning comprises the following steps:
initializing tactical decision layer agents and intention recognition layer agents of decision objects;
the intention recognition agent uses a deep recurrent Q-network to generate an intention recognition layer behavior from the input intention recognition observation information, and a tactical decision layer agent is selected according to the intention recognition layer behavior;
the selected tactical decision layer agent uses the deep deterministic policy gradient algorithm to compute its behavior from the tactical decision observation information;
and the decision object executes the behavior command output by the tactical decision layer agent and updates its state information.
2. The deep reinforcement learning-based hierarchical decision method according to claim 1, wherein the intention recognition observation information comprises the states of the observation objects over a preset discrete time window of length T, and the state of an observation object at a single moment comprises:
at least one of relative distance between observed objects, relative angle of observed objects, heading angle of observed objects, survival state of observed objects, radar state of observed objects, weapon state of observed objects, and motion state of observed objects.
3. The deep reinforcement learning-based hierarchical decision method according to claim 1, characterized in that the number of tactical decision layer agents is the same as the number of behaviors of the intention recognition agent.
4. The deep reinforcement learning-based hierarchical decision method according to claim 1, wherein the tactical decision observation information comprises the state of the current decision object at the current moment and the states of the current decision object and the other observation objects, including: at least one of the relative distance between the current decision object and the other observation objects at the current moment, the relative angle between the current decision object and the other observation objects at the current moment, the heading angles of the current decision object and the other observation objects at the current moment, the radar state of the current decision object at the current moment, the weapon state of the current decision object at the current moment, the survival states of the current decision object and the other observation objects at the current moment, and the motion states of the current decision object and the other observation objects at the current moment.
CN202211014161.7A 2022-08-23 2022-08-23 Hierarchical decision method based on deep reinforcement learning Pending CN115392435A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211014161.7A CN115392435A (en) 2022-08-23 2022-08-23 Hierarchical decision method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211014161.7A CN115392435A (en) 2022-08-23 2022-08-23 Hierarchical decision method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115392435A true CN115392435A (en) 2022-11-25

Family

ID=84121372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211014161.7A Pending CN115392435A (en) 2022-08-23 2022-08-23 Hierarchical decision method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115392435A (en)

Similar Documents

Publication Publication Date Title
CN111221352B (en) Control system based on cooperative game countermeasure of multiple unmanned aerial vehicles
CN113791634A (en) Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN111240353B (en) Unmanned aerial vehicle collaborative air combat decision method based on genetic fuzzy tree
CN112180967B (en) Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN112198892B (en) Multi-unmanned aerial vehicle intelligent cooperative penetration countermeasure method
CN105678030B (en) Divide the air-combat tactics team emulation mode of shape based on expert system and tactics tactics
CN111461294B (en) Intelligent aircraft brain cognitive learning method facing dynamic game
CN115291625A (en) Multi-unmanned aerial vehicle air combat decision method based on multi-agent layered reinforcement learning
Wei et al. Recurrent MADDPG for object detection and assignment in combat tasks
CN113893539B (en) Cooperative fighting method and device for intelligent agent
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN116360503B (en) Unmanned plane game countermeasure strategy generation method and system and electronic equipment
CN114638339A (en) Intelligent agent task allocation method based on deep reinforcement learning
CN114722701A (en) Method for obtaining war and chess deduction cooperation strategy based on deep reinforcement learning model
Bae et al. Deep reinforcement learning-based air-to-air combat maneuver generation in a realistic environment
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
CN115293022A (en) Aviation soldier intelligent agent confrontation behavior modeling method based on OptiGAN and spatiotemporal attention
Wang et al. Deep reinforcement learning-based air combat maneuver decision-making: literature review, implementation tutorial and future direction
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
CN117408376A (en) Soldier chess operator position prediction method and system based on battlefield division and attraction map
CN115392435A (en) Hierarchical decision method based on deep reinforcement learning
Kong et al. Multi-ucav air combat in short-range maneuver strategy generation using reinforcement learning and curriculum learning
CN115661576A (en) Method for identifying airplane group intention under sample imbalance
CN114330093A (en) Multi-platform collaborative intelligent confrontation decision-making method for aviation soldiers based on DQN
CN115097861A (en) Multi-Unmanned Aerial Vehicle (UAV) capture strategy method based on CEL-MADDPG

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination