CN118153621A - Multi-agent trapping method based on double-layer graph attention reinforcement learning - Google Patents

Multi-agent trapping method based on double-layer graph attention reinforcement learning

Info

Publication number
CN118153621A
CN118153621A
Authority
CN
China
Prior art keywords
agent
attention
double
trapping
prey
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410141040.1A
Other languages
Chinese (zh)
Inventor
史殿习
李彤月
郝锋
王震
张轶
邱春平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese People's Liberation Army 32806 Unit
Original Assignee
Chinese People's Liberation Army 32806 Unit
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese People's Liberation Army 32806 Unit filed Critical Chinese People's Liberation Army 32806 Unit
Priority to CN202410141040.1A priority Critical patent/CN118153621A/en
Publication of CN118153621A publication Critical patent/CN118153621A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/008Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Robotics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to the technical field of multi-agent systems and in particular provides a multi-agent trapping method based on double-layer graph attention reinforcement learning, comprising the following steps: step S101, sensing the observation state information of the agent; step S102, feeding the observation state information of the agent into a pre-trained reinforcement learning model based on double-layer graph attention, and obtaining the action execution instruction output by the model; step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, otherwise returning to step S101. The observation state information of the agent includes: the position and velocity of the agent itself, the positions and velocities of the other agents, the positions of the obstacles, and the positions and velocities of the trapping targets. The technical scheme provided by the invention effectively solves the multi-objective optimization problem in multi-agent trapping tasks.

Description

Multi-agent trapping method based on double-layer graph attention reinforcement learning
Technical Field
The invention relates to the technical field of multi-agent systems, and in particular to a multi-agent trapping method based on double-layer graph attention reinforcement learning.
Background
Multi-agent trapping is a classical but challenging problem in the multi-agent community, with wide use in civilian and military applications such as cooperative interception in aerospace and cooperative search and rescue by multiple robots. In a multi-agent trapping task, the pursuers and the prey act in a dynamically changing environment: the pursuers must cooperate to form the expected formation around a specific prey, close the encirclement, and trap the prey while avoiding collisions with each other and with obstacles, whereas the prey learns to escape as far as possible. The relationships among agents are complex, and as the task scale grows, each agent must quickly extract the key information from a large amount of redundant, changing environment information, dynamically adjust its own target, and learn an effective cooperation strategy so that the trained policy adapts to large-scale, complex new tasks; these are the key challenges of multi-agent trapping.
Current research on the multi-agent trapping problem falls mainly into two categories: methods based on control theory and methods based on learning. Control-theoretic methods usually solve the problem by estimating the central location of all prey, which are treated as a single global objective. Most subsequent work focuses on two aspects, task allocation and encirclement control: the pursuers are first grouped, and each group is then planned to cooperatively trap its assigned prey. However, when multiple prey are being trapped all agents change dynamically, and performing task allocation in advance severely restricts flexible cooperation among the pursuers and reduces trapping efficiency. In addition, most control-theoretic methods ignore the obstacle avoidance of the agents and depend heavily on an accurate control model, so they are difficult to apply in real tasks where obstacles exist.
Because of the limitations of control-theoretic methods, learning-based methods that introduce reinforcement learning to solve the multi-prey trapping problem show great potential. The paper "Multi-robot cooperative target encirclement through learning distributed transferable policy", published by Zhang et al. at the International Joint Conference on Neural Networks in 2020, proposes GACM, a distributed transferable policy network framework based on deep reinforcement learning. It adopts a graph attention communication mechanism to model multi-agent interaction as a graph and extract the agents' cooperation information, while a long short-term memory network (LSTM) handles the uncertain amount of obstacle information in the environment, thereby solving the multi-agent cooperative trapping problem. However, that work studies trapping of a single prey and, lacking multi-objective allocation, cannot be applied to multi-prey scenarios. The paper "Multi-Target Encirclement with Collision Avoidance via Deep Reinforcement Learning using Relational Graphs", published by Zhang et al. at the 2022 International Conference on Robotics and Automation, presents MECADRL-RG, a decentralized method based on robot-level and target-level relational graphs learned with deep reinforcement learning, which solves the MECA problem of distributed multi-robot systems. After the trapping targets are grouped, a robot-level relational graph composed of three heterogeneous relation graphs between each robot and the other robots, the targets, and the obstacles is modeled and learned with graph attention networks (GATs), extracting spatial relation representations among different agents instead of simply stacking observations. In addition, a target-level relational graph is built with GAT to construct the spatial relation from each robot to each target, and the motion of each target is modeled through the target-level relational graph and learned by supervised learning in order to predict its trajectory. However, in a real multi-prey trapping task the trapping target is not fixed: while trapping, a pursuer continually adjusts its own target according to its state and the information it acquires, dynamically allocates cooperating teammates, takes environmental factors and obstacle avoidance into account, guarantees its own safety, and completes the trapping so as to maximize the trapping success rate.
Disclosure of Invention
In order to overcome the above deficiencies, the invention provides a multi-agent trapping method based on double-layer graph attention reinforcement learning.
In a first aspect, a multi-agent trapping method based on double-layer graph attention reinforcement learning is provided, the method comprising:
Step S101, sensing the observation state information of the agent;
Step S102, feeding the observation state information of the agent into a pre-trained reinforcement learning model based on double-layer graph attention, and obtaining the action execution instruction output by the pre-trained model;
Step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, otherwise returning to step S101;
wherein the observation state information of the agent includes: the position and velocity of the agent itself, the positions and velocities of the other agents, the positions of the obstacles, and the positions and velocities of the trapping targets.
Preferably, the pre-trained reinforcement learning model based on double-layer graph attention is obtained as follows:
constructing a multi-agent trapping task simulation scene comprising pursuer agents equipped with the double-layer graph attention reinforcement learning model, prey agents equipped with the same model, and obstacles;
training the double-layer graph attention reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene, using the SAC algorithm as the basic algorithm;
and taking the trained model corresponding to the pursuer agents as the pre-trained reinforcement learning model based on double-layer graph attention.
Further, the reinforcement learning model based on double-layer graph attention comprises an observation encoder network, a graph attention network, and a Q network;
the observation encoder network consists of a fully connected layer with a ReLU activation function, takes the position and velocity information of an agent as input, and outputs a 128-dimensional embedding vector;
the graph attention network adopts an attention mechanism with 128-dimensional queries, keys and values;
the Q network consists of two fully connected layers with ReLU activation functions, each containing 128 neurons.
Further, when the SAC algorithm is used as the basic algorithm to train the double-layer graph attention reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene, the reward function of a pursuer agent equipped with the model is defined as follows:
In the above formula, reward_pursuer is the total reward of the pursuer, which is determined by the distance between the pursuer and the prey, the trapping reward reward_encirclement, the collision reward reward_collide1 of the pursuer, the position of the pursuer agent, the position of the prey agent, and the trapping radius l of the pursuer.
Further, the collision reward of the pursuer is defined as follows:
In the above formula, the quantities involved are the distance between the pursuer and the other pursuers, the distance between the pursuer and the obstacle, the hyperparameter η, the positions of the other pursuers, and the position of the obstacle;
the trapping reward is:
reward_encirclement = 15, if the distance between the pursuer and the prey is within the trapping radius l and angle_gap < tolerance,
where angle_gap is the difference between the angle of the formation formed by the pursuers and the expected encirclement angle, and tolerance is the allowed error range.
Further, the difference between the angle of the formation formed by the pursuers and the expected encirclement angle is computed as
angle_gap = angle_expect − angle_diff,
where angle_expect is the expected average encirclement angle of the pursuers' formation and angle_diff is the actual angle of the formation, with
angle_expect = 2π/M,
where M is the number of pursuers forming the encirclement; angle_diff is computed from the horizontal- and vertical-axis positions of the prey agent, of the pursuer agent, and of the pursuer agent adjacent to it.
Further, when the SAC algorithm is used as the basic algorithm to train the double-layer graph attention reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene, the reward function of a prey agent equipped with the model is defined as follows:
In the above formula, reward_prey is the total reward of the prey and reward_collide is the collision reward of the prey.
Further, the collision reward of the prey is defined as follows:
In the above formula, the quantities involved are the distance between the prey and the obstacle and the distance between the prey and the other prey.
Further, the distance between the pursuer and the prey, the distance between the prey and the obstacle, and the distance between the prey and the other prey are defined as follows:
In the above formulas, the remaining symbols denote the horizontal- and vertical-axis positions of the obstacle and the horizontal- and vertical-axis positions of the other prey agents.
The technical scheme provided by the invention has at least one or more of the following beneficial effects:
The invention provides a multi-agent trapping method based on double-layer graph attention reinforcement learning, comprising: step S101, sensing the observation state information of the agent; step S102, feeding the observation state information of the agent into a pre-trained reinforcement learning model based on double-layer graph attention, and obtaining the action execution instruction output by the model; step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, otherwise returning to step S101; wherein the observation state information of the agent includes the position and velocity of the agent itself, the positions and velocities of the other agents, the positions of the obstacles, and the positions and velocities of the trapping targets. In this technical scheme, a reinforcement learning model based on double-layer graph attention is designed: an 'intra-group' attention layer models the interaction among individual agents, and an 'inter-group' attention layer models group-level information interaction, so that an agent can adaptively extract the state dependencies among multiple agents, focus its information interaction on the agents most relevant to it, and flexibly adjust its pursuit target and cooperating teammates. During model training, the pursuers, the prey, and the obstacles in the environment are all taken into account and constrained separately, so that prey escape and agent obstacle avoidance are handled within the task, effectively solving the multi-objective optimization problem in multi-agent trapping tasks.
Drawings
Fig. 1 is a schematic flow chart of the main steps of a multi-agent trapping method based on double-layer graph attention reinforcement learning according to an embodiment of the invention.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the drawings.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As disclosed in the background art, multi-agent trapping is a classical but challenging problem in the field of multi-agent collaboration, with wide use in civilian and military applications such as cooperative interception in aerospace and cooperative search and rescue by multiple robots. In a multi-agent trapping task, the pursuers and the prey act in a dynamically changing environment: the pursuers must cooperate to form the expected formation around a specific prey, close the encirclement, and trap the prey while avoiding collisions with each other and with obstacles, whereas the prey learns to escape as far as possible. The relationships among agents are complex, and as the task scale grows, each agent must quickly extract the key information from a large amount of redundant, changing environment information, dynamically adjust its own target, and learn an effective cooperation strategy so that the trained policy adapts to large-scale, complex new tasks; these are the key challenges of multi-agent trapping.
Current research on the multi-agent trapping problem falls mainly into two categories: methods based on control theory and methods based on learning. Control-theoretic methods usually solve the problem by estimating the central location of all prey, which are treated as a single global objective. Most subsequent work focuses on two aspects, task allocation and encirclement control: the pursuers are first grouped, and each group is then planned to cooperatively trap its assigned prey. However, when multiple prey are being trapped all agents change dynamically, and performing task allocation in advance severely restricts flexible cooperation among the pursuers and reduces trapping efficiency. In addition, most control-theoretic methods ignore the obstacle avoidance of the agents and depend heavily on an accurate control model, so they are difficult to apply in real tasks where obstacles exist.
Because of the limitations of control-theoretic methods, learning-based methods that introduce reinforcement learning to solve the multi-prey trapping problem show great potential. The paper "Multi-robot cooperative target encirclement through learning distributed transferable policy", published by Zhang et al. at the International Joint Conference on Neural Networks in 2020, proposes GACM, a distributed transferable policy network framework based on deep reinforcement learning. It adopts a graph attention communication mechanism to model multi-agent interaction as a graph and extract the agents' cooperation information, while a long short-term memory network (LSTM) handles the uncertain amount of obstacle information in the environment, thereby solving the multi-agent cooperative trapping problem. However, that work studies trapping of a single prey and, lacking multi-objective allocation, cannot be applied to multi-prey scenarios. The paper "Multi-Target Encirclement with Collision Avoidance via Deep Reinforcement Learning using Relational Graphs", published by Zhang et al. at the 2022 International Conference on Robotics and Automation, presents MECADRL-RG, a decentralized method based on robot-level and target-level relational graphs learned with deep reinforcement learning, which solves the MECA problem of distributed multi-robot systems. After the trapping targets are grouped, a robot-level relational graph composed of three heterogeneous relation graphs between each robot and the other robots, the targets, and the obstacles is modeled and learned with graph attention networks (GATs), extracting spatial relation representations among different agents instead of simply stacking observations. In addition, a target-level relational graph is built with GAT to construct the spatial relation from each robot to each target, and the motion of each target is modeled through the target-level relational graph and learned by supervised learning in order to predict its trajectory. However, in a real multi-prey trapping task the trapping target is not fixed: while trapping, a pursuer continually adjusts its own target according to its state and the information it acquires, dynamically allocates cooperating teammates, takes environmental factors and obstacle avoidance into account, guarantees its own safety, and completes the trapping so as to maximize the trapping success rate.
In order to address the above problems, the present invention provides a multi-agent trapping method based on double-layer graph attention reinforcement learning, comprising: step S101, sensing the observation state information of the agent; step S102, feeding the observation state information of the agent into a pre-trained reinforcement learning model based on double-layer graph attention, and obtaining the action execution instruction output by the model; step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, otherwise returning to step S101; wherein the observation state information of the agent includes the position and velocity of the agent itself, the positions and velocities of the other agents, the positions of the obstacles, and the positions and velocities of the trapping targets. In this technical scheme, a reinforcement learning model based on double-layer graph attention is designed: an 'intra-group' attention layer models the interaction among individual agents, and an 'inter-group' attention layer models group-level information interaction, so that an agent can adaptively extract the state dependencies among multiple agents, focus its information interaction on the agents most relevant to it, and flexibly adjust its pursuit target and cooperating teammates. During model training, the pursuers, the prey, and the obstacles in the environment are all taken into account and constrained separately, so that prey escape and agent obstacle avoidance are handled within the task, effectively solving the multi-objective optimization problem in multi-agent trapping tasks.
The above-described scheme is explained in detail below.
Example 1
Referring to Fig. 1, which is a schematic flow chart of the main steps of a multi-agent trapping method based on double-layer graph attention reinforcement learning according to an embodiment of the present invention. As shown in Fig. 1, the method mainly includes the following steps:
Step S101, sensing the observation state information of the agent;
Step S102, feeding the observation state information of the agent into a pre-trained reinforcement learning model based on double-layer graph attention, and obtaining the action execution instruction output by the pre-trained model;
Step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, otherwise returning to step S101;
wherein the observation state information of the agent includes: the position and velocity of the agent itself, the positions and velocities of the other agents, the positions of the obstacles, and the positions and velocities of the trapping targets.
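For readability, a minimal Python sketch of this perceive-decide-act loop is given below; the environment and model interfaces (env.observe, model.act, env.step, env.trapping_successful) and the step limit are illustrative assumptions, not interfaces defined by the patent.

```python
import torch

def run_trapping_episode(env, model, max_steps=500):
    """Minimal sketch of steps S101-S103: perceive, query the pre-trained
    double-layer graph attention model, act, and stop once trapping succeeds.
    All environment/model method names below are illustrative assumptions."""
    for _ in range(max_steps):
        # Step S101: perceive own position/velocity, other agents, obstacles, prey
        obs = env.observe()                      # e.g. a flat numpy array
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)

        # Step S102: the pre-trained model maps the observation to an action command
        with torch.no_grad():
            action = model.act(obs_t)            # e.g. a 2-D acceleration command

        # Step S103: execute the action; end if trapping succeeded, otherwise loop
        env.step(action.squeeze(0).numpy())
        if env.trapping_successful():
            return True
    return False
```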
In this embodiment, the pre-trained reinforcement learning model based on double-layer graph attention is obtained as follows:
constructing a multi-agent trapping task simulation scene comprising pursuer agents equipped with the double-layer graph attention reinforcement learning model, prey agents equipped with the same model, and obstacles;
where an agent is an unmanned node (e.g., an unmanned aerial vehicle or a robot) with sensing, communication, movement, storage and computation capabilities, including but not limited to agent particles constructed in a simulation environment. The simulated trapping task environment is an entity constructed according to the trapping scene parameters that interacts with the agents; the agents observe the state of the environment and act in it according to control instructions based on that state.
In a specific embodiment, the multi-agent trapping task simulation scene is built in the open-source OpenAI MPE (multiagent-particle-envs) multi-agent simulation environment, in preparation for training the double-layer graph attention reinforcement learning model. The steps are as follows:
install the MPE simulation environment on any computer with Ubuntu (version 18.04 or later) and the PyTorch deep learning framework, and construct the agent trapping task simulation scene;
in the constructed simulation scene, set the size of the whole simulation map to 1×1, and set the number of agents and obstacles in the environment, the velocity, acceleration, size and color of the agents, and the initial positions, sizes and colors of the obstacles.
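As an illustration, such a scene can be described by a small configuration object before it is handed to the MPE scenario builder; every field name and default value below is an assumption, except the 1×1 map size and the entity counts and maximum acceleration of 10 taken from the experiments later in this description.

```python
from dataclasses import dataclass

@dataclass
class TrappingSceneConfig:
    """Hypothetical configuration for the MPE-based trapping scene.
    Physical parameters are placeholders; only the 1x1 world size and the
    pursuer/prey/obstacle roles and counts come from the text."""
    world_size: float = 1.0        # map is 1 x 1
    num_pursuers: int = 6          # e.g. 6 pursuers trapping 2 prey (Table 1 setting)
    num_prey: int = 2
    num_obstacles: int = 2
    pursuer_max_accel: float = 10.0
    prey_max_accel: float = 10.0
    pursuer_size: float = 0.05     # illustrative radii and colors
    prey_size: float = 0.05
    obstacle_size: float = 0.1
    pursuer_color: tuple = (0.2, 0.2, 0.8)
    prey_color: tuple = (0.8, 0.2, 0.2)
    obstacle_color: tuple = (0.4, 0.4, 0.4)

# Example: the 9-pursuer / 3-prey scenario used in the experiments
large_scene = TrappingSceneConfig(num_pursuers=9, num_prey=3)
```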
Training the double-layer graph attention reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene, using the SAC algorithm as the basic algorithm;
and taking the trained model corresponding to the pursuer agents as the pre-trained reinforcement learning model based on double-layer graph attention.
In one embodiment, an agent consists of a perception module, an action module, a storage module, a reinforcement learning model based on double-layer graph attention, and a control module.
The perception module is connected to the feature encoding module and the storage module. It obtains the agent's own state (its position and velocity) and the environment state (the positions and velocities of the other agents and the positions of the obstacles) from the simulation task environment, and sends this information to the feature encoding module and the storage module.
The action module is the executor of the agent's control instructions; it is connected to the control module, receives instructions from it, and moves in the simulation task environment according to those control instructions.
The storage module is a memory with more than 1 GB of available space. It is connected to the perception module, the control module and the data expansion module; it receives observation information from the perception module, control instruction information from the control module, and reward information from the simulated trapping task environment, and combines them into trajectory data of the agent's interaction with the simulated trapping task environment. The trajectory data are stored as four-tuples (s_t, a_t, r_t, s_{t+1}), where s_t is the observation state received from the perception module at the agent's t-th interaction with the simulated trapping task environment, a_t is the control instruction from the control module executed at the t-th interaction, r_t is the reward value fed back by the environment for the control instruction a_t at the t-th interaction, and s_{t+1} is the observation state received from the perception module after the environment state has changed (i.e., the observation state at the (t+1)-th interaction).
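The storage module thus behaves like a standard replay buffer over (s_t, a_t, r_t, s_{t+1}) tuples; the sketch below is one plausible realization, with the class name, capacity, and method names as illustrative assumptions. The batch size of 1024 matches the sampling described for the Q-network updates below.

```python
import random
from collections import deque

class TrajectoryBuffer:
    """Stores (s_t, a_t, r_t, s_{t+1}) tuples from agent-environment interaction
    and serves random batches for SAC updates. The capacity is an assumed value."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=1024):
        # The embodiment samples 1024 time steps of data per update
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```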
In one embodiment, the reinforcement learning model based on double-layer graph attention comprises an observation encoder network, a graph attention network, and a Q network.
The observation encoder network consists of a fully connected layer with a ReLU activation function; it takes the position and velocity information of an agent as input and outputs a 128-dimensional embedding vector.
The graph attention network adopts an attention mechanism with 128-dimensional queries, keys and values. The aggregated information and the agent's own state are concatenated and updated through a single fully connected layer of 128 neurons with a ReLU activation function. At each time step, a trajectory containing the tuples (s_t, a_t, r_t, s_{t+1})_{1...N}, where N is the number of tuples, is generated and added to the storage module. After each episode the environment is reset, and 4 updates are performed on the attention evaluation network and the policy network.
The Q network consists of two fully connected layers with ReLU activation functions, each containing 128 neurons. For each update, 1024 time steps of data are sampled from the storage module, and gradient descent is then performed on the Q-network loss objective and the policy objective.
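A minimal PyTorch sketch of these three components is shown below, using the stated 128-dimensional widths; how the intra-group and inter-group attention levels are wired together is not fully specified in the text, so reusing one attention layer at both levels and the exact fusion of own and aggregated features are illustrative assumptions.

```python
import torch
import torch.nn as nn

EMB = 128  # embedding / attention / hidden width stated in the embodiment

class ObservationEncoder(nn.Module):
    """One fully connected layer + ReLU mapping an agent's position/velocity
    observation to a 128-dimensional embedding."""
    def __init__(self, obs_dim):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(obs_dim, EMB), nn.ReLU())

    def forward(self, obs):                       # obs: (batch, obs_dim)
        return self.fc(obs)

class GraphAttentionLayer(nn.Module):
    """Scaled dot-product attention with 128-dim queries, keys and values,
    usable for both the intra-group and inter-group levels (an assumption)."""
    def __init__(self):
        super().__init__()
        self.q = nn.Linear(EMB, EMB)
        self.k = nn.Linear(EMB, EMB)
        self.v = nn.Linear(EMB, EMB)

    def forward(self, own, others):               # own: (B, EMB), others: (B, N, EMB)
        q = self.q(own).unsqueeze(1)              # (B, 1, EMB)
        k, v = self.k(others), self.v(others)     # (B, N, EMB)
        attn = torch.softmax(q @ k.transpose(1, 2) / EMB ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)              # aggregated message, (B, EMB)

class QNetwork(nn.Module):
    """Own embedding and aggregated information are concatenated and fused by a
    128-neuron ReLU layer, then passed through two fully connected layers of
    128 neurons to produce Q-values, following the description above."""
    def __init__(self, act_dim):
        super().__init__()
        self.mix = nn.Sequential(nn.Linear(2 * EMB, EMB), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(EMB, EMB), nn.ReLU(),
                                  nn.Linear(EMB, act_dim))

    def forward(self, own_emb, aggregated):
        h = self.mix(torch.cat([own_emb, aggregated], dim=-1))
        return self.head(h)
```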
In one embodiment, when the SAC algorithm is used as the basic algorithm to train the double-layer graph attention reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene, the reward function of a pursuer agent equipped with the model is defined as follows:
In the above formula, reward_pursuer is the total reward of the pursuer, which is determined by the distance between the pursuer and the prey, the trapping reward reward_encirclement, the collision reward reward_collide1 of the pursuer, the position of the pursuer agent, the position of the prey agent, and the trapping radius l of the pursuer.
In one embodiment, the collision reward of the pursuer is defined as follows:
In the above formula, the quantities involved are the distance between the pursuer and the other pursuers, the distance between the pursuer and the obstacle, the hyperparameter η, the positions of the other pursuers, and the position of the obstacle;
the trapping reward is:
reward_encirclement = 15, if the distance between the pursuer and the prey is within the trapping radius l and angle_gap < tolerance,
where angle_gap is the difference between the angle of the formation formed by the pursuers and the expected encirclement angle, and tolerance is the allowed error range.
In one embodiment, the difference between the angle of the formation formed by the pursuers and the expected encirclement angle is computed as
angle_gap = angle_expect − angle_diff,
where angle_expect is the expected average encirclement angle of the pursuers' formation and angle_diff is the actual angle of the formation, with
angle_expect = 2π/M,
where M is the number of pursuers forming the encirclement; angle_diff is computed from the horizontal- and vertical-axis positions of the prey agent, of the pursuer agent, and of the pursuer agent adjacent to it.
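For illustration, the encirclement bonus can be evaluated from these definitions roughly as follows; since the formula for angle_diff and the behaviour outside the bonus condition are described only in words, the atan2-based angle, the zero reward otherwise, and the default tolerance value are assumptions.

```python
import math

def angle_expect(num_pursuers: int) -> float:
    """Expected average encirclement angle: 2*pi / M (from the embodiment)."""
    return 2.0 * math.pi / num_pursuers

def angle_diff(prey_xy, pursuer_xy, neighbor_xy):
    """Assumed realization of the 'actual formation angle': the angle at the prey
    between a pursuer and its adjacent pursuer, computed with atan2."""
    a1 = math.atan2(pursuer_xy[1] - prey_xy[1], pursuer_xy[0] - prey_xy[0])
    a2 = math.atan2(neighbor_xy[1] - prey_xy[1], neighbor_xy[0] - prey_xy[0])
    d = abs(a1 - a2)
    return min(d, 2.0 * math.pi - d)   # wrap into [0, pi]

def encirclement_reward(dist_to_prey, prey_xy, pursuer_xy, neighbor_xy,
                        num_pursuers, trap_radius, tolerance=0.1):
    """reward_encirclement = 15 when the pursuer is within the trapping radius
    and the angle gap is within tolerance; 0 otherwise (the 'otherwise 0' case
    and the tolerance value are assumptions)."""
    gap = angle_expect(num_pursuers) - angle_diff(prey_xy, pursuer_xy, neighbor_xy)
    if dist_to_prey <= trap_radius and gap < tolerance:
        return 15.0
    return 0.0
```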
In one embodiment, when the SAC algorithm is used as the basic algorithm to train the double-layer graph attention reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene, the reward function of a prey agent equipped with the model is defined as follows:
In the above formula, reward_prey is the total reward of the prey and reward_collide is the collision reward of the prey.
In one embodiment, the collision reward of the prey is defined as follows:
In the above formula, the quantities involved are the distance between the prey and the obstacle and the distance between the prey and the other prey.
In one embodiment, the distance between the pursuer and the prey, the distance between the prey and the obstacle, and the distance between the prey and the other prey are defined as follows:
In the above formulas, the remaining symbols denote the horizontal- and vertical-axis positions of the obstacle and the horizontal- and vertical-axis positions of the other prey agents.
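For illustration, the sketch below computes these pairwise quantities from the listed coordinates; treating them as planar Euclidean distances is an assumption, and the coordinate values shown are placeholders.

```python
import math

def planar_distance(a_xy, b_xy) -> float:
    """Assumed Euclidean distance between two entities given their
    horizontal- and vertical-axis positions (x, y)."""
    return math.hypot(a_xy[0] - b_xy[0], a_xy[1] - b_xy[1])

# Usage: the distances entering the pursuer and prey reward terms
pursuer, prey, obstacle, other_prey = (0.1, 0.2), (0.4, 0.6), (0.7, 0.3), (0.9, 0.9)
d_pursuer_prey = planar_distance(pursuer, prey)      # pursuer-prey distance
d_prey_obstacle = planar_distance(prey, obstacle)    # prey-obstacle distance
d_prey_prey = planar_distance(prey, other_prey)      # prey-other-prey distance
```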
Finally, the trained reinforcement learning model based on double-layer graph attention is saved in .pt format.
In this embodiment, the numbers of pursuers, prey and obstacles and the relative speed of the pursuers and the prey are set separately to verify the performance of the method. For comparison, the MADDPG algorithm, which uses no attention at all, and the G2ANet and MAAC algorithms, which use attention, are considered. Three common evaluation metrics are used to assess the different models: trapping success rate, mean episode reward (MER), and mean episode length (MEL).
To evaluate the effectiveness of the proposed method, experiments were carried out from two aspects. On the one hand, 3 different task scenarios were set, with the maximum acceleration of all pursuers and of the prey fixed at 10. As shown in Table 1, when 2 obstacles O exist in the environment and 6 pursuers P cooperatively trap 2 prey E, all methods can complete the task, but the trapping success rate of the method of the invention is 33.05% higher than that of the attention-free MADDPG method, and the MER of our algorithm is 3.08 higher than that of the attention-based MAAC method. When the number of obstacles in the environment increases, the MER of the MAAC algorithm, which also uses attention, decreases, while the MER convergence value of the method of the invention changes little, indicating that our method is more robust. When 9 pursuers cooperate to trap 3 prey, the MADDPG algorithm essentially fails to complete the task and its trapping success rate drops sharply. The attention-based MAAC method also degrades: its MEL increases significantly and the agents learn the strategy with more difficulty. In this setting the success rate of the method of the invention is 7.71% higher than that of MAAC, its MER does not converge to a lower level as the number of agents grows, and its MEL also attains a higher level, indicating that the agents learn a more advanced strategy.
TABLE 1
On the other hand, in order to evaluate the generalization of the method, comparison experiments were performed with different speed settings for the pursuers and the prey. In general, when the speed V_p of the pursuers is greater than the speed V_e of the prey, the trapping task is easy to complete; when the pursuers are slower than the prey, a higher-level cooperation strategy among the pursuers is needed to complete the task. As shown in Table 2, when the pursuers are faster than the prey, all algorithms can complete the trapping task, and the MEL of the method of the invention is 6.72 lower than that of the MAAC algorithm. When the pursuers and the prey have the same speed, the success rate of the MADDPG algorithm drops by 10.68%, whereas the trapping success rate of the proposed method shows no obvious decrease, and its MER convergence level is 7.57 higher than that of the MAAC method. When the pursuers are slower than the prey, the MADDPG method can hardly learn an effective trapping strategy and the learning time of the MAAC agents also increases significantly, while the strategy learned by the proposed method can still cope with the faster prey: compared with MAAC, the trapping time is reduced by 7.07 and the success rate is improved by 12.51%.
TABLE 2
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (9)

1. A multi-agent trapping method based on double-layer graph attention reinforcement learning, characterized in that the method comprises:
Step S101, sensing the observation state information of the agent;
Step S102, feeding the observation state information of the agent into a pre-trained reinforcement learning model based on double-layer graph attention, and obtaining the action execution instruction output by the pre-trained model;
Step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, otherwise returning to step S101;
wherein the observation state information of the agent includes: the position and velocity of the agent itself, the positions and velocities of the other agents, the positions of the obstacles, and the positions and velocities of the trapping targets.
2. The method of claim 1, wherein the pre-trained reinforcement learning model based on double-layer graph attention is obtained by:
constructing a multi-agent trapping task simulation scene comprising pursuer agents equipped with the double-layer graph attention reinforcement learning model, prey agents equipped with the same model, and obstacles;
training the double-layer graph attention reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene, using the SAC algorithm as the basic algorithm;
and taking the trained model corresponding to the pursuer agents as the pre-trained reinforcement learning model based on double-layer graph attention.
3. The method of claim 2, wherein the reinforcement learning model based on double-layer graph attention comprises an observation encoder network, a graph attention network, and a Q network;
the observation encoder network consists of a fully connected layer with a ReLU activation function, takes the position and velocity information of an agent as input, and outputs a 128-dimensional embedding vector;
the graph attention network adopts an attention mechanism with 128-dimensional queries, keys and values;
the Q network consists of two fully connected layers with ReLU activation functions, each containing 128 neurons.
4. The method of claim 2, wherein, when the SAC algorithm is used as the basic algorithm to train the double-layer graph attention reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene, the reward function of a pursuer agent equipped with the model is defined as follows:
In the above formula, reward_pursuer is the total reward of the pursuer, which is determined by the distance between the pursuer and the prey, the trapping reward reward_encirclement, the collision reward reward_collide1 of the pursuer, the position of the pursuer agent, the position of the prey agent, and the trapping radius l of the pursuer.
5. The method of claim 4, wherein the collision reward of the pursuer is defined as follows:
In the above formula, the quantities involved are the distance between the pursuer and the other pursuers, the distance between the pursuer and the obstacle, the hyperparameter η, the positions of the other pursuers, and the position of the obstacle;
the trapping reward is:
reward_encirclement = 15, if the distance between the pursuer and the prey is within the trapping radius l and angle_gap < tolerance,
where angle_gap is the difference between the angle of the formation formed by the pursuers and the expected encirclement angle, and tolerance is the allowed error range.
6. The method of claim 5, wherein the difference between the angle of the formation formed by the pursuers and the expected encirclement angle is computed as
angle_gap = angle_expect − angle_diff,
where angle_expect is the expected average encirclement angle of the pursuers' formation and angle_diff is the actual angle of the formation, with
angle_expect = 2π/M,
where M is the number of pursuers forming the encirclement; angle_diff is computed from the horizontal- and vertical-axis positions of the prey agent, of the pursuer agent, and of the pursuer agent adjacent to it.
7. The method of claim 6, wherein, when the SAC algorithm is used as the basic algorithm to train the double-layer graph attention reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene, the reward function of a prey agent equipped with the model is defined as follows:
In the above formula, reward_prey is the total reward of the prey and reward_collide is the collision reward of the prey.
8. The method of claim 7, wherein the collision reward of the prey is defined as follows:
In the above formula, the quantities involved are the distance between the prey and the obstacle and the distance between the prey and the other prey.
9. The method of claim 8, wherein the distance between the pursuer and the prey, the distance between the prey and the obstacle, and the distance between the prey and the other prey are defined as follows:
In the above formulas, the remaining symbols denote the horizontal- and vertical-axis positions of the obstacle and the horizontal- and vertical-axis positions of the other prey agents.
CN202410141040.1A 2024-02-01 2024-02-01 Multi-agent trapping method based on double-layer diagram attention reinforcement learning Pending CN118153621A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410141040.1A CN118153621A (en) 2024-02-01 2024-02-01 Multi-agent trapping method based on double-layer diagram attention reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410141040.1A CN118153621A (en) 2024-02-01 2024-02-01 Multi-agent trapping method based on double-layer diagram attention reinforcement learning

Publications (1)

Publication Number Publication Date
CN118153621A true CN118153621A (en) 2024-06-07

Family

ID=91300107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410141040.1A Pending CN118153621A (en) 2024-02-01 2024-02-01 Multi-agent trapping method based on double-layer diagram attention reinforcement learning

Country Status (1)

Country Link
CN (1) CN118153621A (en)

Similar Documents

Publication Publication Date Title
Huegle et al. Dynamic input for deep reinforcement learning in autonomous driving
US9764468B2 (en) Adaptive predictor apparatus and methods
Zhang et al. Collective behavior coordination with predictive mechanisms
CN108776483A (en) AGV paths planning methods and system based on ant group algorithm and multiple agent Q study
CN113050640B (en) Industrial robot path planning method and system based on generation of countermeasure network
CN114741886B (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
Hao et al. Independent generative adversarial self-imitation learning in cooperative multiagent systems
Guo et al. Distributed reinforcement learning for coordinate multi-robot foraging
CN113253738B (en) Multi-robot cooperation trapping method and device, electronic equipment and storage medium
CN113759935B (en) Intelligent group formation mobile control method based on fuzzy logic
Felbrich et al. Autonomous robotic additive manufacturing through distributed model‐free deep reinforcement learning in computational design environments
CN114083539B (en) Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning
Wang et al. Research on pursuit-evasion games with multiple heterogeneous pursuers and a high speed evader
Curran et al. Using PCA to efficiently represent state spaces
Cao et al. Dynamic task assignment for multi-AUV cooperative hunting
CN113485323B (en) Flexible formation method for cascading multiple mobile robots
Li et al. Robot path planning using improved artificial bee colony algorithm
Zhan et al. Flocking of discrete-time multi-agent systems with predictive mechanisms
CN117553798A (en) Safe navigation method, equipment and medium for mobile robot in complex crowd scene
Pierre End-to-end deep learning for robotic following
CN117518907A (en) Control method, device, equipment and storage medium of intelligent agent
Vaščák et al. Use and perspectives of fuzzy cognitive maps in robotics
CN109752952A (en) Method and device for acquiring multi-dimensional random distribution and strengthening controller
Huang et al. A deep reinforcement learning approach to preserve connectivity for multi-robot systems
Zhang et al. Recent advances in robot trajectory planning in a dynamic environment

Legal Events

Date Code Title Description
PB01 Publication