CN118153621A - Multi-agent trapping method based on double-layer graph attention reinforcement learning - Google Patents
Multi-agent trapping method based on double-layer graph attention reinforcement learning
- Publication number
- CN118153621A (application number CN202410141040.1A)
- Authority
- CN
- China
- Prior art keywords
- agent
- attention
- double
- trapping
- prey
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/008—Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Abstract
The invention relates to the technical field of multi-agent systems, and in particular provides a multi-agent trapping method based on double-layer graph attention reinforcement learning, comprising the following steps: step S101, sensing the observation state information of the agent; step S102, feeding the observation state information of the agent into a pre-trained double-layer graph attention reinforcement learning model, and obtaining the action execution instruction output by the model; step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, and otherwise returning to step S101. The observation state information of the agent includes: the agent's own position and velocity information, the position and velocity information of the other agents, the position information of the obstacles, and the position and velocity information of the trapping target. The technical scheme provided by the invention effectively solves the multi-objective optimization problem in multi-agent trapping tasks.
Description
Technical Field
The invention relates to the technical field of multi-agent systems, and in particular to a multi-agent trapping method based on double-layer graph attention reinforcement learning.
Background
Multi-agent trapping is a classical but challenging problem in the multi-agent community, with wide civilian and military applications such as cooperative interception in aerospace and cooperative search and rescue by multiple robots. In a multi-agent trapping task, the pursuers and the prey act in a dynamically changing environment: the pursuers must cooperate to form a desired formation that surrounds a specific prey in an encircling ring, trapping it while avoiding collisions both with each other and with obstacles, and the prey in turn learns to escape as far as possible. The relationships among the agents are complicated, and as the task scale grows, the agents must rapidly extract key information from a large amount of redundant, changing environmental information, dynamically adjust their own targets, learn effective cooperation strategies, and transfer the learned strategies to large-scale, complex new tasks. These are the key challenges of multi-agent trapping.
Current research on the multi-agent trapping problem falls mainly into two categories: control-theoretic methods and learning-based methods. Control-theoretic methods typically treat all prey as one global objective and solve the problem by estimating their central location. Most subsequent work focuses on two aspects, task allocation and encirclement control: the pursuers are first grouped, and each group is then planned to cooperatively trap its assigned prey. However, while multiple prey are being trapped, all agents change dynamically, so performing task allocation in advance severely restricts flexible cooperation among the pursuers and reduces trapping efficiency. In addition, most control-theoretic methods ignore the agents' obstacle-avoidance problem and depend heavily on accurate control models, making them hard to adapt to real tasks in which obstacles are present.
Because of these limitations of control-theoretic methods, learning-based methods that introduce reinforcement learning to solve the multi-prey trapping problem have shown great potential. The paper "Multi-robot cooperative target encirclement through learning distributed transferable policy", published by Zhang et al. at the International Joint Conference on Neural Networks in 2020, proposes GACM, a distributed transferable policy network framework based on deep reinforcement learning. It adopts a graph attention communication mechanism to model multi-agent interaction as a graph, extracts the agents' cooperation information from the graph, and uses a long short-term memory (LSTM) network to process the uncertain number of obstacle observations in the environment, thereby solving the multi-agent cooperative trapping problem. However, that work studies trapping of a single prey and, lacking multi-objective allocation, cannot be applied to multi-prey scenarios. The paper "Multi-Target Encirclement with Collision Avoidance via Deep Reinforcement Learning using Relational Graphs", published by Zhang et al. at the 2022 International Conference on Robotics and Automation, presents MECADRL-RG, a decentralized method based on robot-level and target-level relational graphs learned with deep reinforcement learning, which solves the multi-target encirclement with collision avoidance (MECA) problem for distributed multi-robot systems. After the trapping targets are grouped, a robot-level relational graph, composed of three heterogeneous relational graphs between each robot and the other robots, the targets, and the obstacles, is modeled and learned with graph attention networks (GATs), extracting spatial relationship representations among the different agents instead of simply stacking observation information. In addition, a target-level relational graph is constructed with GAT to capture the spatial relationship from each robot to each target.
Further, to predict each target's motion trajectory, the target's motion is modeled through the target-level relational graph and learned by supervised learning. In a real multi-prey trapping task, however, the trapping target is not fixed: while trapping the prey, each pursuer must continuously adjust its own trapping target according to its state and the information it acquires, dynamically assign cooperating teammates, take environmental factors and obstacle avoidance into account, guarantee its own safety, complete the trapping, and maximize the trapping success rate.
Disclosure of Invention
To overcome these defects, the invention provides a multi-agent trapping method based on double-layer graph attention reinforcement learning.
In a first aspect, a multi-agent trapping method based on double-layer graph attention reinforcement learning is provided, comprising:
Step S101, sensing the observation state information of the agent;
Step S102, feeding the observation state information of the agent into a pre-trained double-layer graph attention reinforcement learning model, and obtaining the action execution instruction output by the model;
Step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, and otherwise returning to step S101;
wherein the observation state information of the agent includes: the agent's own position and velocity information, the position and velocity information of the other agents, the position information of the obstacles, and the position and velocity information of the trapping target.
Preferably, the pre-trained double-layer graph attention reinforcement learning model is obtained as follows:
constructing a multi-agent trapping task simulation scene containing pursuer agents equipped with double-layer graph attention reinforcement learning models, prey agents equipped with double-layer graph attention reinforcement learning models, and obstacles;
training the double-layer graph attention reinforcement learning model of each agent in the multi-agent trapping task simulation scene, using the soft actor-critic (SAC) algorithm as the base algorithm;
and taking the double-layer graph attention reinforcement learning models of the pursuer agents as the pre-trained double-layer graph attention reinforcement learning model.
Further, the double-layer graph attention reinforcement learning model includes an observation encoder network, a graph attention network, and a Q network;
the observation encoder network consists of a fully connected layer with a ReLU activation function, takes the position and velocity information of an agent as input, and outputs a 128-dimensional embedding vector;
the graph attention network adopts an attention mechanism with 128-dimensional queries, keys, and values;
the Q network consists of two fully connected layers with ReLU activation functions, each containing 128 neurons.
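The described architecture can be sketched as follows. This is a minimal NumPy illustration of the three components: the 128-dimensional layer widths come from the text, while the weight initialization, the single-head scaled dot-product attention form, and all class and variable names are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 128  # embedding width used throughout the described model


def relu(x):
    return np.maximum(x, 0.0)


class ObsEncoder:
    """One fully connected layer with ReLU, mapping an observation to a 128-d embedding."""
    def __init__(self, obs_dim):
        self.W = rng.normal(0, 0.1, (obs_dim, D))
        self.b = np.zeros(D)

    def __call__(self, obs):
        return relu(obs @ self.W + self.b)


def attention(query, keys, values):
    """Scaled dot-product attention over 128-d query/key/value vectors."""
    scores = keys @ query / np.sqrt(D)        # one score per neighbouring agent
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax attention weights
    return weights @ values                   # 128-d aggregated context


class QNetwork:
    """Two fully connected ReLU layers of 128 neurons each, plus a scalar head."""
    def __init__(self, in_dim):
        self.W1 = rng.normal(0, 0.1, (in_dim, D)); self.b1 = np.zeros(D)
        self.W2 = rng.normal(0, 0.1, (D, D));      self.b2 = np.zeros(D)
        self.w_out = rng.normal(0, 0.1, D)

    def __call__(self, x):
        h = relu(relu(x @ self.W1 + self.b1) @ self.W2 + self.b2)
        return float(h @ self.w_out)


# Forward pass for one agent attending over 4 neighbours:
enc = ObsEncoder(obs_dim=10)
own = enc(rng.normal(size=10))
others = np.stack([enc(rng.normal(size=10)) for _ in range(4)])
context = attention(own, others, others)
q_value = QNetwork(2 * D)(np.concatenate([own, context]))
```

In the full model this attention step would be applied twice, once within a group of agents and once across groups, before the Q network scores the aggregated embedding.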
Further, when training the double-layer graph attention reinforcement learning model of each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a pursuer agent equipped with the double-layer graph attention reinforcement learning model is:
reward_pursuer = -d_{p,e} + reward_encirclement + reward_collide1
In the above formula, reward_pursuer is the total reward of the pursuer, d_{p,e} = ||p_p - p_e|| is the distance between the pursuer and the prey, reward_encirclement is the trapping reward, reward_collide1 is the collision reward of the pursuer, p_p is the position of the pursuer agent, p_e is the position of the prey agent, and l is the trapping radius of the pursuer.
Further, the collision reward of the pursuer is:
reward_collide1 = -η, if d_{p,p'} < d_min or d_{p,o} < d_min; reward_collide1 = 0, otherwise
In the above formula, d_{p,p'} is the distance between the pursuer and another pursuer, d_{p,o} is the distance between the pursuer and an obstacle, η is a hyperparameter, d_min is the minimum allowed separation determined by the sizes of the agents and obstacles, p_{p'} is the position of the other pursuer, and p_o is the position of the obstacle;
the trapping reward is:
reward_encirclement = 15, if d_{p,e} < l and angle_gap < tolerance; reward_encirclement = 0, otherwise
In the above formula, angle_gap is the difference between the angle formed by the pursuers' formation and the expected encirclement angle, and tolerance is the error tolerance range.
Further, the difference between the formation angle of the pursuers and the expected encirclement angle is:
angle_gap = angle_expect - angle_diff
In the above formula, angle_expect is the expected average encirclement angle when the pursuers form the formation, and angle_diff is the actual formation angle of the pursuers, where:
angle_expect = 2π/M
angle_diff = |atan2(y_{p'} - y_e, x_{p'} - x_e) - atan2(y_p - y_e, x_p - x_e)|
In the above formulas, M is the number of pursuers forming the encirclement formation, x_e is the horizontal-axis position of the prey agent, y_e is the vertical-axis position of the prey agent, x_p is the horizontal-axis position of the pursuer agent, y_p is the vertical-axis position of the pursuer agent, and x_{p'} and y_{p'} are the horizontal-axis and vertical-axis positions of the pursuer agent adjacent to the pursuer at (x_p, y_p).
Further, when training the double-layer graph attention reinforcement learning model of each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a prey agent equipped with the double-layer graph attention reinforcement learning model is:
reward_prey = d_{p,e} + reward_collide2
In the above formula, reward_prey is the total reward of the prey (encouraging it to keep its distance d_{p,e} from the pursuer) and reward_collide2 is the collision reward of the prey.
Further, the collision reward of the prey is:
reward_collide2 = -η, if d_{e,o} < d_min or d_{e,e'} < d_min; reward_collide2 = 0, otherwise
In the above formula, d_{e,o} is the distance between the prey and an obstacle, and d_{e,e'} is the distance between the prey and another prey.
Further, the distance between the pursuer and the prey, the distance between the prey and an obstacle, and the distance between the prey and another prey are:
d_{p,e} = sqrt((x_p - x_e)² + (y_p - y_e)²)
d_{e,o} = sqrt((x_e - x_o)² + (y_e - y_o)²)
d_{e,e'} = sqrt((x_e - x_{e'})² + (y_e - y_{e'})²)
In the above formulas, x_o and y_o are the horizontal-axis and vertical-axis positions of the obstacle, and x_{e'} and y_{e'} are the horizontal-axis and vertical-axis positions of the other prey agent.
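Taken together, the reward terms above can be sketched in code as follows. Only the encirclement bonus of 15 is stated in the text; the numeric values of l, tolerance, eta (η) and the safety distance d_safe, as well as the atan2 form of the formation angle, are illustrative assumptions.

```python
import math


def dist(a, b):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def angle_gap(prey, pursuer, neighbour, m):
    """Gap between the expected encirclement angle 2*pi/M and the actual
    angle subtended at the prey by a pursuer and its adjacent pursuer."""
    expect = 2 * math.pi / m
    a1 = math.atan2(pursuer[1] - prey[1], pursuer[0] - prey[0])
    a2 = math.atan2(neighbour[1] - prey[1], neighbour[0] - prey[0])
    diff = abs(a2 - a1)
    diff = min(diff, 2 * math.pi - diff)   # wrap the angle into [0, pi]
    return abs(expect - diff)


def pursuer_reward(pursuer, prey, neighbour, obstacles, m,
                   l=0.3, tolerance=0.2, eta=5.0, d_safe=0.1):
    """reward_pursuer = -d_{p,e} + reward_encirclement + reward_collide1."""
    d_pe = dist(pursuer, prey)
    r = -d_pe                              # distance-shaping term
    if d_pe < l and angle_gap(prey, pursuer, neighbour, m) < tolerance:
        r += 15.0                          # trapping (encirclement) reward
    if dist(pursuer, neighbour) < d_safe or any(
            dist(pursuer, o) < d_safe for o in obstacles):
        r -= eta                           # collision penalty
    return r
```

For example, four pursuers (M = 4) placed at right angles around the prey and within the trapping radius receive the full encirclement bonus.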
The technical scheme provided by the invention has at least one or more of the following beneficial effects:
According to the technical scheme, a double-layer graph attention reinforcement learning model is designed. An "intra-group" attention layer models the interactions among individual agents, while an "inter-group" attention layer models group-level information interaction, so that the agents can adaptively extract the state-dependency relationships among multiple agents, focus their information exchange on the agents most relevant to them, and flexibly adjust their pursuit targets and cooperating teammates. During model training, the prey and the obstacles in the environment are considered together and constrained separately, so that the prey's escape behaviour and the agents' obstacle avoidance are both handled within the task, and the multi-objective optimization problem in multi-agent trapping tasks is thereby effectively solved.
Drawings
Fig. 1 is a schematic flow chart of the main steps of a multi-agent trapping method based on double-layer graph attention reinforcement learning according to an embodiment of the invention.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the drawings.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As stated in the Background section, multi-agent trapping is a classical but challenging problem in multi-agent cooperation with wide civilian and military applications, and both control-theoretic and learning-based approaches to it suffer from the limitations described above.
To address these problems, the invention provides the multi-agent trapping method based on double-layer graph attention reinforcement learning summarized above.
The above-described scheme is explained in detail below.
Example 1
Referring to fig. 1, fig. 1 is a schematic flow chart of the main steps of a multi-agent trapping method based on double-layer graph attention reinforcement learning according to an embodiment of the present invention. As shown in fig. 1, the method mainly includes the following steps:
Step S101, sensing the observation state information of the agent;
Step S102, feeding the observation state information of the agent into a pre-trained double-layer graph attention reinforcement learning model, and obtaining the action execution instruction output by the model;
Step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, and otherwise returning to step S101;
wherein the observation state information of the agent includes: the agent's own position and velocity information, the position and velocity information of the other agents, the position information of the obstacles, and the position and velocity information of the trapping target.
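The S101–S103 loop above can be sketched as follows. StubEnv and the policy lambda stand in for the simulation environment and the pre-trained double-layer graph attention model, whose concrete interfaces are not specified in the text; all names here are assumptions.

```python
class StubEnv:
    """Toy stand-in for the trapping environment: succeeds after a fixed number of steps."""
    def __init__(self, steps_to_capture=3):
        self.steps_to_capture = steps_to_capture
        self.t = 0

    def observe(self):
        # S101: own position/velocity, other agents, obstacles, prey state
        return {"self": (0.0, 0.0, 0.0, 0.0), "others": [],
                "obstacles": [], "prey": (1.0, 1.0, 0.0, 0.0)}

    def step(self, action):
        self.t += 1
        return self.t >= self.steps_to_capture  # True once trapping succeeds


def run_trapping_episode(env, policy, max_steps=100):
    for step in range(max_steps):
        obs = env.observe()          # Step S101: sense observation state
        action = policy(obs)         # Step S102: model outputs the action command
        captured = env.step(action)  # Step S103: execute the corresponding action
        if captured:
            return step + 1          # trapping succeeded: end the operation
    return None                      # otherwise the loop keeps returning to S101


steps_used = run_trapping_episode(StubEnv(), policy=lambda obs: (0.1, 0.0))
```

With the stub environment capturing on the third step, the loop returns after three S101–S103 iterations.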
In this embodiment, the pre-trained double-layer graph attention reinforcement learning model is obtained as follows:
constructing a multi-agent trapping task simulation scene containing pursuer agents equipped with double-layer graph attention reinforcement learning models, prey agents equipped with double-layer graph attention reinforcement learning models, and obstacles.
Here an agent is an unmanned node (e.g., an unmanned aerial vehicle or a robot) with sensing, communication, movement, storage, and computation capabilities, including but not limited to agent particles constructed in a simulation environment. The simulated trapping task environment is an entity, constructed according to the trapping scene parameters, that interacts with the agents: each agent observes the state of the environment and acts in it according to control instructions based on that state.
In a specific embodiment, the multi-agent trapping task simulation scene is built in the MPE (multiagent-particle-envs) multi-agent simulation environment open-sourced by OpenAI, in preparation for training the double-layer graph attention reinforcement learning model. The steps are as follows:
install the MPE simulation environment on any computer with Ubuntu (version 18.04 or later) and the PyTorch deep learning framework, and construct the agent trapping task simulation scene;
in the constructed simulation scene, set the size of the whole simulation map to 1 x 1, and set the number of agents and obstacles in the environment, the speed, acceleration, size, and colour of the agents, and the initial positions, sizes, and colours of the obstacles.
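The scene parameters listed above can be gathered in a small configuration object, for example as below. The field names and all default values are illustrative assumptions; only the 1 x 1 map size and the list of configurable attributes (counts, speed, acceleration, size, colour, initial positions) come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class TrappingScenario:
    """Illustrative parameter bundle for the multi-agent trapping simulation scene."""
    map_size: tuple = (1.0, 1.0)          # whole simulation map is 1 x 1
    n_pursuers: int = 6
    n_prey: int = 2
    n_obstacles: int = 3
    pursuer_max_speed: float = 1.0
    prey_max_speed: float = 1.3           # assumed: prey slightly faster than pursuers
    acceleration: float = 3.0
    agent_size: float = 0.05
    obstacle_size: float = 0.1
    pursuer_color: tuple = (0.85, 0.35, 0.35)
    prey_color: tuple = (0.35, 0.85, 0.35)
    obstacle_positions: list = field(
        default_factory=lambda: [(0.3, 0.3), (-0.4, 0.1), (0.0, -0.5)])


scenario = TrappingScenario()
```

Such a configuration object would then be handed to the MPE scenario constructor when the simulation world is initialized.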
Training the double-layer graph attention reinforcement learning model of each agent in the multi-agent trapping task simulation scene, using the soft actor-critic (SAC) algorithm as the base algorithm;
and taking the double-layer graph attention reinforcement learning models of the pursuer agents as the pre-trained double-layer graph attention reinforcement learning model.
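This training procedure can be summarized schematically as follows. AgentModel and sac_update are placeholders for the double-layer graph attention model and the SAC actor-critic/entropy updates, not a real SAC implementation; the point is only that every agent (pursuers and prey) is trained, and the pursuer-side models are then kept as the pre-trained policy.

```python
class AgentModel:
    """Stand-in for one agent's double-layer graph attention RL model."""
    def __init__(self, role):
        self.role = role        # "pursuer" or "prey"
        self.updates = 0

    def sac_update(self, batch):
        # Placeholder for the SAC actor, critic, and temperature updates.
        self.updates += 1


def train_all_agents(models, batches):
    """Train every agent's model on each batch, then keep the pursuer models."""
    for batch in batches:
        for model in models:
            model.sac_update(batch)
    return [m for m in models if m.role == "pursuer"]


models = [AgentModel("pursuer"), AgentModel("pursuer"), AgentModel("prey")]
pretrained = train_all_agents(models, batches=range(10))
```

The returned list corresponds to the "pre-trained double-layer graph attention reinforcement learning model" deployed in steps S101–S103.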
In one embodiment, an agent is composed of a perception module, an action module, a storage module, a double-layer graph attention-based reinforcement learning model, and a control module.
The perception module is connected with the feature encoding module and the storage module. It obtains the agent's own state (its position and velocity information) and the environment state (the positions and velocities of other agents and the positions of obstacles) from the simulated task environment, and sends this information to the feature encoding module and the storage module.
The action module is the actuator for the agent's control instructions; it is connected with the control module, receives instructions from the control module, and moves in the simulated task environment according to those instructions.
The storage module is a memory with more than 1 GB of available space. It is connected with the perception module, the control module, and the data expansion module; it receives observation information from the perception module, control instruction information from the control module, and reward information from the simulated trapping task environment, and combines them into trajectory data of the agent's interaction with the simulated trapping task environment. The trajectory data is stored as four-tuples (s_t, a_t, r_t, s_{t+1}), where s_t is the observation state information received from the perception module at the agent's t-th interaction with the environment, a_t is the control instruction from the control module executed at the t-th interaction, r_t is the reward value fed back by the environment for the control instruction a_t at the t-th interaction, and s_{t+1} is the observation state information received from the perception module after the environment state changes (i.e., the observation state information at the (t+1)-th interaction).
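The storage module described above behaves like a standard experience replay buffer over (s_t, a_t, r_t, s_{t+1}) four-tuples. A minimal sketch (class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s_t, a_t, r_t, s_{t+1}) four-tuples of agent-environment
    interaction, as described for the storage module above."""

    def __init__(self, capacity=100_000):
        # deque drops the oldest tuples once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random minibatch, unzipped into per-field tuples
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=10)
for t in range(5):
    buf.push((t,), 0, 1.0, (t + 1,))
states, actions, rewards, next_states = buf.sample(3)
```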
In one embodiment, the double-layer graph attention-based reinforcement learning model includes: an observation encoder network, a graph attention network, and a Q network.
The observation encoder network consists of a fully connected layer with a ReLU activation function; it takes the position and velocity information of an agent as input and outputs a 128-dimensional embedding vector.
The graph attention network adopts an attention mechanism with 128-dimensional queries, keys, and values. The aggregated information and the agent's own state are concatenated and updated through a single fully connected layer of 128 neurons with a ReLU activation function. At each time step, a trajectory containing the tuples (s_t, a_t, r_t, s_{t+1})_{1…N}, where N is the number of tuples, is generated and added to the storage module. After each episode, the environment is reset and 4 updates are performed on the attention evaluation network and the policy network.
The Q network consists of two fully connected layers with ReLU activation functions, each containing 128 neurons. For each update, 1024 time steps of data are sampled from the storage module, and gradient descent is then performed on the Q network's loss objective and the policy objective.
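A PyTorch sketch of the three networks described above. The attention block below is a single-head scaled dot-product attention; the action-space size and observation dimension are assumptions for illustration, and the double-layer structure of the patent's graph attention is not reproduced:

```python
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """Fully connected layer with ReLU producing the 128-dim embedding."""
    def __init__(self, obs_dim, dim=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(obs_dim, dim), nn.ReLU())

    def forward(self, obs):
        return self.fc(obs)

class GraphAttention(nn.Module):
    """Scaled dot-product attention with 128-dim query/key/value (sketch)."""
    def __init__(self, dim=128):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, own, others):
        # own: (B, dim) agent embedding; others: (B, N, dim) neighbour embeddings
        q = self.q(own).unsqueeze(1)              # (B, 1, dim)
        k, v = self.k(others), self.v(others)     # (B, N, dim)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return (attn @ v).squeeze(1)              # (B, dim) aggregated info

class QNetwork(nn.Module):
    """Two fully connected layers of 128 units with ReLU."""
    def __init__(self, in_dim, n_actions, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, n_actions),
        )

    def forward(self, x):
        return self.net(x)

# Forward pass: embed own (x, y, vx, vy), aggregate over 5 neighbours,
# concatenate, and score a hypothetical 5-action discrete space.
enc, att, q = ObservationEncoder(4), GraphAttention(), QNetwork(256, 5)
emb = enc(torch.randn(8, 4))
agg = att(emb, torch.randn(8, 5, 128))
qvals = q(torch.cat([emb, agg], dim=-1))
```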
In one embodiment, when training the double-layer graph attention-based reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a pursuer agent equipped with the double-layer graph attention-based reinforcement learning model is as follows:
In the above formula, reward_pursuer is the total reward of the pursuer, d_pe is the distance between the pursuer and the prey, reward_encirclement is the trapping reward, reward_collide1 is the collision reward of the pursuer, p_p is the position of the pursuer agent, p_e is the position of the prey agent, and l is the trapping radius of the pursuer.
In one embodiment, the collision reward of the pursuer is as follows:
In the above formula, d_pp is the distance between the pursuer and other pursuers, d_po is the distance between the pursuer and an obstacle, η is a hyper-parameter, p_p' is the position of another pursuer, and p_o is the position of the obstacle;
the trapping reward is as follows:
reward_encirclement = 15, if d_pe ≤ l and angle_gap < tolerance
In the above formula, angle_gap is the difference between the formation angle among the pursuers and the expected encirclement angle, and tolerance is the error tolerance range.
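The pursuer reward formulas are rendered as images in the source and are not reproduced here; the sketch below assumes illustrative forms for the distance-shaping and collision terms, keeping only what the text states: a trapping bonus of 15 when the pursuer is within the trapping radius l and the formation-angle error is within tolerance, and a collision penalty weighted by the hyper-parameter η:

```python
import math

# Hypothetical pursuer reward sketch. The negative-distance shaping and the
# 0.1 collision threshold are assumptions; the +15 trapping bonus, trapping
# radius l, angle tolerance, and eta weight follow the text above.
def pursuer_reward(p_pursuer, p_prey, dists_to_others, dists_to_obstacles,
                   angle_gap, l=0.3, tolerance=0.1, eta=0.1):
    d = math.dist(p_pursuer, p_prey)       # pursuer-prey distance d_pe
    reward = -d                            # closer to the prey is better (assumed)
    # Trapping reward: within radius l with formation angle inside tolerance
    if d <= l and angle_gap < tolerance:
        reward += 15.0
    # Collision penalty: penalise near-contacts with teammates and obstacles
    close = sum(1.0 for dd in dists_to_others + dists_to_obstacles if dd < 0.1)
    reward -= eta * close
    return reward

r_trapped = pursuer_reward((0.0, 0.0), (0.1, 0.0), [1.0], [1.0], angle_gap=0.05)
r_far = pursuer_reward((0.0, 0.0), (0.9, 0.0), [1.0], [1.0], angle_gap=0.5)
```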
In one embodiment, the difference between the formation angle among the pursuers and the expected encirclement angle is computed as follows:
angle_gap = angle_except − angle_diff
In the above formula, angle_except is the expected average encirclement angle when the pursuers form the formation, and angle_diff is the actual angle when the pursuers form the formation, wherein:
angle_except = 2π/M
In the above formulas, M is the number of pursuers forming the encirclement formation, x_e is the horizontal-axis position of the prey agent, y_e is the vertical-axis position of the prey agent, x_p is the horizontal-axis position of the pursuer agent, y_p is the vertical-axis position of the pursuer agent, x_p' is the horizontal-axis position of the pursuer agent adjacent to the pursuer agent at (x_p, y_p), and y_p' is the vertical-axis position of that adjacent pursuer agent.
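The expected angle 2π/M is given explicitly above; the patent's angle_diff formula is rendered as an image, so the atan2-based computation over the axis positions below is an assumed reconstruction:

```python
import math

def expected_angle(m):
    """angle_except = 2*pi / M for M pursuers in the encirclement formation."""
    return 2 * math.pi / m

def actual_angle(prey, pursuer, neighbour):
    """Angle at the prey between a pursuer and its adjacent pursuer,
    computed from the (x, y) axis positions (assumed atan2 reconstruction)."""
    a1 = math.atan2(pursuer[1] - prey[1], pursuer[0] - prey[0])
    a2 = math.atan2(neighbour[1] - prey[1], neighbour[0] - prey[0])
    diff = abs(a1 - a2) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)   # wrap into [0, pi]

def angle_gap(m, prey, pursuer, neighbour):
    """angle_gap = angle_except - angle_diff, as in the formula above."""
    return expected_angle(m) - actual_angle(prey, pursuer, neighbour)

# Four pursuers evenly spaced should give an angle gap of zero:
gap = angle_gap(4, (0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
```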
In one embodiment, when training the double-layer graph attention-based reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a prey agent equipped with the double-layer graph attention-based reinforcement learning model is as follows:
In the above formula, reward_prey is the total reward of the prey and reward_collide2 is the collision reward of the prey.
In one embodiment, the collision reward of the prey is as follows:
In the above formula, d_eo is the distance between the prey and an obstacle, and d_ee is the distance between the prey and other prey.
In one embodiment, the distance between the pursuer and the prey, the distance between the prey and an obstacle, and the distance between the prey and other prey are computed as follows:
In the above formulas, x_o is the horizontal-axis position of the obstacle, y_o is the vertical-axis position of the obstacle, x_e' is the horizontal-axis position of another prey agent, and y_e' is the vertical-axis position of another prey agent.
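The distance formulas are rendered as images in the source, but from the axis positions defined above they are ordinary 2-D Euclidean distances, which can be sketched as:

```python
import math

# Standard 2-D Euclidean distance over the (x, y) axis positions defined
# above; used alike for pursuer-prey, prey-obstacle, and prey-prey distances.
def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

d_pursuer_prey = distance((0.2, 0.3), (0.6, 0.6))
d_prey_obstacle = distance((0.6, 0.6), (0.6, 0.1))
```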
Finally, the trained double-layer graph attention-based reinforcement learning model is saved in .pt format.
This embodiment varies the numbers of pursuers, prey, and obstacles, as well as the relative speed between pursuers and prey, to verify the performance of the method. For comparison, the MADDPG algorithm, which uses no attention, and the G2ANet and MAAC algorithms, which use attention, are considered. Three common evaluation indexes are used to evaluate the different models: trapping success rate, mean episode reward (MER), and mean episode length (MEL).
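The three evaluation indexes can be sketched as follows; the episode-record field names are illustrative assumptions:

```python
# Sketch of the three evaluation indexes named above: trapping success rate,
# mean episode reward (MER), and mean episode length (MEL). The per-episode
# dict fields ("success", "reward", "steps") are hypothetical.
def evaluate(episodes):
    n = len(episodes)
    success_rate = sum(1 for e in episodes if e["success"]) / n
    mer = sum(e["reward"] for e in episodes) / n   # mean episode reward
    mel = sum(e["steps"] for e in episodes) / n    # mean episode length
    return success_rate, mer, mel

records = [
    {"success": True,  "reward": 10.0, "steps": 20},
    {"success": False, "reward": 2.0,  "steps": 50},
]
sr, mer, mel = evaluate(records)
```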
To evaluate the effectiveness of the proposed method, experiments were performed from two aspects. First, 3 different task scenarios were set up, with the maximum acceleration of all pursuers and of the prey fixed at 10. As shown in Table 1, when 2 obstacles O exist in the environment and 6 pursuers P cooperatively trap 2 prey E, all methods can complete the task, but the trapping success rate of the proposed method is 33.05% higher than that of the attention-free MADDPG method, and the MER of our algorithm is improved by 3.08 over the attention-based MAAC method. When the number of obstacles in the environment increases, the trapping MER of the MAAC algorithm, which also uses attention, decreases, while the MER convergence value of the proposed method changes little, indicating that our method is more robust. When 9 pursuers cooperate to trap 3 prey, the MADDPG algorithm essentially cannot complete the task, and its trapping success rate drops sharply. The attention-based MAAC method also degrades: its MEL increases significantly, and the agents have more difficulty learning strategies. In this setting, the success rate of the proposed method is 7.71% higher than MAAC, its MER does not converge to a lower level despite the increased number of agents, and its MEL also reaches a better level, indicating that the agents learn a more advanced strategy.
TABLE 1
Second, to evaluate the generalization of the method, different speeds were set for the pursuers and the prey, and comparison experiments were completed. In general, when the pursuer speed V_p is greater than the prey speed V_e, the trapping task is easily completed; when the pursuer speed is less than the prey speed, a higher-level cooperation strategy is needed among the pursuers to complete the trapping task. As shown in Table 2, when the pursuer speed is greater than the prey speed, all algorithms can complete the trapping task, and the MEL of the proposed method is reduced by 6.72 compared with the MAAC algorithm. When the pursuer speed equals the prey speed, the success rate of the MADDPG algorithm decreases by 10.68%, while the trapping success rate of the proposed method does not decrease noticeably and its MER convergence level is improved by 7.57 over MAAC. When the pursuer speed is less than the prey speed, the MADDPG method can hardly learn an effective trapping strategy, and the learning time of the MAAC agents is also significantly prolonged, whereas the strategy learned by the proposed method can still cope with the faster prey: compared with MAAC, the trapping time is reduced by 7.07 and the success rate is improved by 12.51%.
TABLE 2
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.
Claims (9)
1. A multi-agent trapping method based on double-layer graph attention reinforcement learning, characterized by comprising the following steps:
Step S101, perceiving observation state information of an agent;
Step S102, taking the observation state information of the agent as the input of a pre-trained double-layer graph attention-based reinforcement learning model, and obtaining an action execution instruction output by the pre-trained double-layer graph attention-based reinforcement learning model;
Step S103, executing the corresponding action based on the action execution instruction; if the trapping succeeds, ending the operation, otherwise returning to step S101;
wherein the observation state information of the agent includes: the position and velocity information of the agent itself, the position and velocity information of other agents, the position information of obstacles, and the position and velocity information of the trapping target.
2. The method of claim 1, wherein the acquisition process of the pre-trained double-layer graph attention-based reinforcement learning model comprises:
constructing a multi-agent trapping task simulation scene comprising pursuer agents equipped with a double-layer graph attention-based reinforcement learning model, prey agents equipped with a double-layer graph attention-based reinforcement learning model, and obstacles;
training the double-layer graph attention-based reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene, using the SAC algorithm as the base algorithm; and
taking the double-layer graph attention-based reinforcement learning model corresponding to the pursuer agents as the pre-trained double-layer graph attention-based reinforcement learning model.
3. The method of claim 2, wherein the double-layer graph attention-based reinforcement learning model comprises: an observation encoder network, a graph attention network, and a Q network;
the observation encoder network consists of a fully connected layer with a ReLU activation function, takes the position and velocity information of an agent as input, and outputs a 128-dimensional embedding vector;
the graph attention network adopts an attention mechanism with 128-dimensional queries, keys, and values;
the Q network consists of two fully connected layers with ReLU activation functions, each containing 128 neurons.
4. The method of claim 2, wherein, in training the double-layer graph attention-based reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a pursuer agent equipped with the double-layer graph attention-based reinforcement learning model is as follows:
In the above formula, reward_pursuer is the total reward of the pursuer, d_pe is the distance between the pursuer and the prey, reward_encirclement is the trapping reward, reward_collide1 is the collision reward of the pursuer, p_p is the position of the pursuer agent, p_e is the position of the prey agent, and l is the trapping radius of the pursuer.
5. The method of claim 4, wherein the collision reward of the pursuer is as follows:
In the above formula, d_pp is the distance between the pursuer and other pursuers, d_po is the distance between the pursuer and an obstacle, η is a hyper-parameter, p_p' is the position of another pursuer, and p_o is the position of the obstacle;
the trapping reward is as follows:
reward_encirclement = 15, if d_pe ≤ l and angle_gap < tolerance
In the above formula, angle_gap is the difference between the formation angle among the pursuers and the expected encirclement angle, and tolerance is the error tolerance range.
6. The method of claim 5, wherein the difference between the formation angle among the pursuers and the expected encirclement angle is computed as follows:
angle_gap = angle_except − angle_diff
In the above formula, angle_except is the expected average encirclement angle when the pursuers form the formation, and angle_diff is the actual angle when the pursuers form the formation, wherein:
angle_except = 2π/M
In the above formulas, M is the number of pursuers forming the encirclement formation, x_e is the horizontal-axis position of the prey agent, y_e is the vertical-axis position of the prey agent, x_p is the horizontal-axis position of the pursuer agent, y_p is the vertical-axis position of the pursuer agent, x_p' is the horizontal-axis position of the pursuer agent adjacent to the pursuer agent at (x_p, y_p), and y_p' is the vertical-axis position of that adjacent pursuer agent.
7. The method of claim 6, wherein, in training the double-layer graph attention-based reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a prey agent equipped with the double-layer graph attention-based reinforcement learning model is as follows:
In the above formula, reward_prey is the total reward of the prey and reward_collide2 is the collision reward of the prey.
8. The method of claim 7, wherein the collision reward of the prey is as follows:
In the above formula, d_eo is the distance between the prey and an obstacle, and d_ee is the distance between the prey and other prey.
9. The method of claim 8, wherein the distance between the pursuer and the prey, the distance between the prey and an obstacle, and the distance between the prey and other prey are computed as follows:
In the above formulas, x_o is the horizontal-axis position of the obstacle, y_o is the vertical-axis position of the obstacle, x_e' is the horizontal-axis position of another prey agent, and y_e' is the vertical-axis position of another prey agent.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410141040.1A CN118153621A (en) | 2024-02-01 | 2024-02-01 | Multi-agent trapping method based on double-layer diagram attention reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118153621A true CN118153621A (en) | 2024-06-07 |