CN118153621A - Multi-agent trapping method based on double-layer graph attention reinforcement learning - Google Patents
Multi-agent trapping method based on double-layer graph attention reinforcement learning
- Publication number
- CN118153621A (application number CN202410141040.1A)
- Authority
- CN
- China
- Prior art keywords
- agent
- attention
- double
- trapping
- prey
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/008—Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Abstract
The invention relates to the technical field of multi-agent systems, and in particular provides a multi-agent trapping method based on double-layer graph attention reinforcement learning, comprising the following steps: step S101, sensing the observation state information of the agent; step S102, feeding the observation state information of the agent into a pre-trained double-layer graph attention reinforcement learning model, and obtaining the action execution instruction output by the model; step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, and otherwise returning to step S101. The observation state information of the agent includes: the agent's own position and velocity information, the position and velocity information of the other agents, the position information of the obstacles, and the position and velocity information of the trapping target. The technical scheme provided by the invention effectively solves the multi-objective optimization problem in multi-agent trapping tasks.
Description
Technical Field
The invention relates to the technical field of multi-agent systems, and in particular to a multi-agent trapping method based on double-layer graph attention reinforcement learning.
Background
Multi-agent trapping is a classical but challenging problem in the multi-agent community, with wide civilian and military applications such as cooperative interception in aerospace and cooperative search and rescue by multiple robots. In a multi-agent trapping task, the pursuers and the prey act in a dynamically changing environment: the pursuers must cooperate to form a desired formation that surrounds a specific prey in an encircling ring, trapping it while avoiding collisions both with each other and with obstacles, and the prey in turn learns to escape as far as possible. The relationships among the agents are complicated, and as the task scale grows, the agents must rapidly extract key information from a large amount of redundant, changing environmental information, dynamically adjust their own targets, learn effective cooperation strategies, and transfer the learned strategies to large-scale, complex new tasks. These are the key challenges of multi-agent trapping.
Current research on the multi-agent trapping problem falls mainly into two categories: control-theoretic methods and learning-based methods. Control-theoretic methods typically treat all prey as one global objective and solve the problem by estimating their central location. Most subsequent work focuses on two aspects, task allocation and encirclement control: the pursuers are first grouped, and each group is then planned to cooperatively trap its assigned prey. However, while multiple prey are being trapped, all agents change dynamically, so performing task allocation in advance severely restricts flexible cooperation among the pursuers and reduces trapping efficiency. In addition, most control-theoretic methods ignore the agents' obstacle-avoidance problem and depend heavily on accurate control models, making them hard to adapt to real tasks in which obstacles are present.
Because of these limitations of control-theoretic methods, learning-based methods that introduce reinforcement learning to solve the multi-prey trapping problem have shown great potential. The paper "Multi-robot cooperative target encirclement through learning distributed transferable policy", published by Zhang et al. at the International Joint Conference on Neural Networks in 2020, proposes GACM, a distributed transferable policy network framework based on deep reinforcement learning. It adopts a graph attention communication mechanism to model multi-agent interaction as a graph, extracts the agents' cooperation information from the graph, and uses a long short-term memory (LSTM) network to process the uncertain number of obstacle observations in the environment, thereby solving the multi-agent cooperative trapping problem. However, that work studies trapping of a single prey and, lacking multi-objective allocation, cannot be applied to multi-prey scenarios. The paper "Multi-Target Encirclement with Collision Avoidance via Deep Reinforcement Learning using Relational Graphs", published by Zhang et al. at the 2022 International Conference on Robotics and Automation, presents MECADRL-RG, a decentralized method based on robot-level and target-level relational graphs learned with deep reinforcement learning, which solves the multi-target encirclement with collision avoidance (MECA) problem for distributed multi-robot systems. After the trapping targets are grouped, a robot-level relational graph, composed of three heterogeneous relational graphs between each robot and the other robots, the targets, and the obstacles, is modeled and learned with graph attention networks (GATs), extracting spatial relationship representations among the different agents instead of simply stacking observation information. In addition, a target-level relational graph is constructed with GAT to capture the spatial relationship from each robot to each target.
Further, to predict each target's motion trajectory, the target's motion is modeled through the target-level relational graph and learned by supervised learning. In a real multi-prey trapping task, however, the trapping target is not fixed: while trapping the prey, each pursuer must continuously adjust its own trapping target according to its state and the information it acquires, dynamically assign cooperating teammates, take environmental factors and obstacle avoidance into account, guarantee its own safety, complete the trapping, and maximize the trapping success rate.
Disclosure of Invention
To overcome these defects, the invention provides a multi-agent trapping method based on double-layer graph attention reinforcement learning.
In a first aspect, a multi-agent trapping method based on double-layer graph attention reinforcement learning is provided, comprising:
Step S101, sensing the observation state information of the agent;
Step S102, feeding the observation state information of the agent into a pre-trained double-layer graph attention reinforcement learning model, and obtaining the action execution instruction output by the model;
Step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, and otherwise returning to step S101;
wherein the observation state information of the agent includes: the agent's own position and velocity information, the position and velocity information of the other agents, the position information of the obstacles, and the position and velocity information of the trapping target.
Preferably, the pre-trained double-layer graph attention reinforcement learning model is obtained as follows:
constructing a multi-agent trapping task simulation scene containing pursuer agents equipped with double-layer graph attention reinforcement learning models, prey agents equipped with double-layer graph attention reinforcement learning models, and obstacles;
training the double-layer graph attention reinforcement learning model of each agent in the multi-agent trapping task simulation scene, using the soft actor-critic (SAC) algorithm as the base algorithm;
and taking the double-layer graph attention reinforcement learning models of the pursuer agents as the pre-trained double-layer graph attention reinforcement learning model.
Further, the double-layer graph attention reinforcement learning model includes an observation encoder network, a graph attention network, and a Q network;
the observation encoder network consists of a fully connected layer with a ReLU activation function, takes the position and velocity information of an agent as input, and outputs a 128-dimensional embedding vector;
the graph attention network adopts an attention mechanism with 128-dimensional queries, keys, and values;
the Q network consists of two fully connected layers with ReLU activation functions, each containing 128 neurons.
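The described architecture can be sketched as follows. This is a minimal NumPy illustration of the three components: the 128-dimensional layer widths come from the text, while the weight initialization, the single-head scaled dot-product attention form, and all class and variable names are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 128  # embedding width used throughout the described model


def relu(x):
    return np.maximum(x, 0.0)


class ObsEncoder:
    """One fully connected layer with ReLU, mapping an observation to a 128-d embedding."""
    def __init__(self, obs_dim):
        self.W = rng.normal(0, 0.1, (obs_dim, D))
        self.b = np.zeros(D)

    def __call__(self, obs):
        return relu(obs @ self.W + self.b)


def attention(query, keys, values):
    """Scaled dot-product attention over 128-d query/key/value vectors."""
    scores = keys @ query / np.sqrt(D)        # one score per neighbouring agent
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax attention weights
    return weights @ values                   # 128-d aggregated context


class QNetwork:
    """Two fully connected ReLU layers of 128 neurons each, plus a scalar head."""
    def __init__(self, in_dim):
        self.W1 = rng.normal(0, 0.1, (in_dim, D)); self.b1 = np.zeros(D)
        self.W2 = rng.normal(0, 0.1, (D, D));      self.b2 = np.zeros(D)
        self.w_out = rng.normal(0, 0.1, D)

    def __call__(self, x):
        h = relu(relu(x @ self.W1 + self.b1) @ self.W2 + self.b2)
        return float(h @ self.w_out)


# Forward pass for one agent attending over 4 neighbours:
enc = ObsEncoder(obs_dim=10)
own = enc(rng.normal(size=10))
others = np.stack([enc(rng.normal(size=10)) for _ in range(4)])
context = attention(own, others, others)
q_value = QNetwork(2 * D)(np.concatenate([own, context]))
```

In the full model this attention step would be applied twice, once within a group of agents and once across groups, before the Q network scores the aggregated embedding.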
Further, when training the double-layer graph attention reinforcement learning model of each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a pursuer agent equipped with the double-layer graph attention reinforcement learning model is:
reward_pursuer = -d_{p,e} + reward_encirclement + reward_collide1
In the above formula, reward_pursuer is the total reward of the pursuer, d_{p,e} = ||p_p - p_e|| is the distance between the pursuer and the prey, reward_encirclement is the trapping reward, reward_collide1 is the collision reward of the pursuer, p_p is the position of the pursuer agent, p_e is the position of the prey agent, and l is the trapping radius of the pursuer.
Further, the collision reward of the pursuer is:
reward_collide1 = -η, if d_{p,p'} < d_min or d_{p,o} < d_min; reward_collide1 = 0, otherwise
In the above formula, d_{p,p'} is the distance between the pursuer and another pursuer, d_{p,o} is the distance between the pursuer and an obstacle, η is a hyperparameter, d_min is the minimum allowed separation determined by the sizes of the agents and obstacles, p_{p'} is the position of the other pursuer, and p_o is the position of the obstacle;
the trapping reward is:
reward_encirclement = 15, if d_{p,e} < l and angle_gap < tolerance; reward_encirclement = 0, otherwise
In the above formula, angle_gap is the difference between the angle formed by the pursuers' formation and the expected encirclement angle, and tolerance is the error tolerance range.
Further, the difference between the formation angle of the pursuers and the expected encirclement angle is:
angle_gap = angle_expect - angle_diff
In the above formula, angle_expect is the expected average encirclement angle when the pursuers form the formation, and angle_diff is the actual formation angle of the pursuers, where:
angle_expect = 2π/M
angle_diff = |atan2(y_{p'} - y_e, x_{p'} - x_e) - atan2(y_p - y_e, x_p - x_e)|
In the above formulas, M is the number of pursuers forming the encirclement formation, x_e is the horizontal-axis position of the prey agent, y_e is the vertical-axis position of the prey agent, x_p is the horizontal-axis position of the pursuer agent, y_p is the vertical-axis position of the pursuer agent, and x_{p'} and y_{p'} are the horizontal-axis and vertical-axis positions of the pursuer agent adjacent to the pursuer at (x_p, y_p).
Further, when training the double-layer graph attention reinforcement learning model of each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a prey agent equipped with the double-layer graph attention reinforcement learning model is:
reward_prey = d_{p,e} + reward_collide2
In the above formula, reward_prey is the total reward of the prey (encouraging it to keep its distance d_{p,e} from the pursuer) and reward_collide2 is the collision reward of the prey.
Further, the collision reward of the prey is:
reward_collide2 = -η, if d_{e,o} < d_min or d_{e,e'} < d_min; reward_collide2 = 0, otherwise
In the above formula, d_{e,o} is the distance between the prey and an obstacle, and d_{e,e'} is the distance between the prey and another prey.
Further, the distance between the pursuer and the prey, the distance between the prey and an obstacle, and the distance between the prey and another prey are:
d_{p,e} = sqrt((x_p - x_e)² + (y_p - y_e)²)
d_{e,o} = sqrt((x_e - x_o)² + (y_e - y_o)²)
d_{e,e'} = sqrt((x_e - x_{e'})² + (y_e - y_{e'})²)
In the above formulas, x_o and y_o are the horizontal-axis and vertical-axis positions of the obstacle, and x_{e'} and y_{e'} are the horizontal-axis and vertical-axis positions of the other prey agent.
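Taken together, the reward terms above can be sketched in code as follows. Only the encirclement bonus of 15 is stated in the text; the numeric values of l, tolerance, eta (η) and the safety distance d_safe, as well as the atan2 form of the formation angle, are illustrative assumptions.

```python
import math


def dist(a, b):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def angle_gap(prey, pursuer, neighbour, m):
    """Gap between the expected encirclement angle 2*pi/M and the actual
    angle subtended at the prey by a pursuer and its adjacent pursuer."""
    expect = 2 * math.pi / m
    a1 = math.atan2(pursuer[1] - prey[1], pursuer[0] - prey[0])
    a2 = math.atan2(neighbour[1] - prey[1], neighbour[0] - prey[0])
    diff = abs(a2 - a1)
    diff = min(diff, 2 * math.pi - diff)   # wrap the angle into [0, pi]
    return abs(expect - diff)


def pursuer_reward(pursuer, prey, neighbour, obstacles, m,
                   l=0.3, tolerance=0.2, eta=5.0, d_safe=0.1):
    """reward_pursuer = -d_{p,e} + reward_encirclement + reward_collide1."""
    d_pe = dist(pursuer, prey)
    r = -d_pe                              # distance-shaping term
    if d_pe < l and angle_gap(prey, pursuer, neighbour, m) < tolerance:
        r += 15.0                          # trapping (encirclement) reward
    if dist(pursuer, neighbour) < d_safe or any(
            dist(pursuer, o) < d_safe for o in obstacles):
        r -= eta                           # collision penalty
    return r
```

For example, four pursuers (M = 4) placed at right angles around the prey and within the trapping radius receive the full encirclement bonus.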
The technical scheme provided by the invention has at least one or more of the following beneficial effects:
According to the technical scheme, a double-layer graph attention reinforcement learning model is designed. An "intra-group" attention layer models the interactions among individual agents, while an "inter-group" attention layer models group-level information interaction, so that the agents can adaptively extract the state-dependency relationships among multiple agents, focus their information exchange on the agents most relevant to them, and flexibly adjust their pursuit targets and cooperating teammates. During model training, the prey and the obstacles in the environment are considered together and constrained separately, so that the prey's escape behaviour and the agents' obstacle avoidance are both handled within the task, and the multi-objective optimization problem in multi-agent trapping tasks is thereby effectively solved.
Drawings
Fig. 1 is a schematic flow chart of the main steps of a multi-agent trapping method based on double-layer graph attention reinforcement learning according to an embodiment of the invention.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the drawings.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As stated in the Background section, multi-agent trapping is a classical but challenging problem in multi-agent cooperation with wide civilian and military applications, and both control-theoretic and learning-based approaches to it suffer from the limitations described above.
To address these problems, the invention provides the multi-agent trapping method based on double-layer graph attention reinforcement learning summarized above.
The above-described scheme is explained in detail below.
Example 1
Referring to fig. 1, fig. 1 is a schematic flow chart of the main steps of a multi-agent trapping method based on double-layer graph attention reinforcement learning according to an embodiment of the present invention. As shown in fig. 1, the method mainly includes the following steps:
Step S101, sensing the observation state information of the agent;
Step S102, feeding the observation state information of the agent into a pre-trained double-layer graph attention reinforcement learning model, and obtaining the action execution instruction output by the model;
Step S103, executing the corresponding action based on the action execution instruction, ending the operation if the trapping succeeds, and otherwise returning to step S101;
wherein the observation state information of the agent includes: the agent's own position and velocity information, the position and velocity information of the other agents, the position information of the obstacles, and the position and velocity information of the trapping target.
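The S101–S103 loop above can be sketched as follows. StubEnv and the policy lambda stand in for the simulation environment and the pre-trained double-layer graph attention model, whose concrete interfaces are not specified in the text; all names here are assumptions.

```python
class StubEnv:
    """Toy stand-in for the trapping environment: succeeds after a fixed number of steps."""
    def __init__(self, steps_to_capture=3):
        self.steps_to_capture = steps_to_capture
        self.t = 0

    def observe(self):
        # S101: own position/velocity, other agents, obstacles, prey state
        return {"self": (0.0, 0.0, 0.0, 0.0), "others": [],
                "obstacles": [], "prey": (1.0, 1.0, 0.0, 0.0)}

    def step(self, action):
        self.t += 1
        return self.t >= self.steps_to_capture  # True once trapping succeeds


def run_trapping_episode(env, policy, max_steps=100):
    for step in range(max_steps):
        obs = env.observe()          # Step S101: sense observation state
        action = policy(obs)         # Step S102: model outputs the action command
        captured = env.step(action)  # Step S103: execute the corresponding action
        if captured:
            return step + 1          # trapping succeeded: end the operation
    return None                      # otherwise the loop keeps returning to S101


steps_used = run_trapping_episode(StubEnv(), policy=lambda obs: (0.1, 0.0))
```

With the stub environment capturing on the third step, the loop returns after three S101–S103 iterations.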
In this embodiment, the pre-trained double-layer graph attention reinforcement learning model is obtained as follows:
constructing a multi-agent trapping task simulation scene containing pursuer agents equipped with double-layer graph attention reinforcement learning models, prey agents equipped with double-layer graph attention reinforcement learning models, and obstacles.
Here an agent is an unmanned node (e.g., an unmanned aerial vehicle or a robot) with sensing, communication, movement, storage, and computation capabilities, including but not limited to agent particles constructed in a simulation environment. The simulated trapping task environment is an entity, constructed according to the trapping scene parameters, that interacts with the agents: each agent observes the state of the environment and acts in it according to control instructions based on that state.
In a specific embodiment, the multi-agent trapping task simulation scene is built in the MPE (multiagent-particle-envs) multi-agent simulation environment open-sourced by OpenAI, in preparation for training the double-layer graph attention reinforcement learning model. The steps are as follows:
install the MPE simulation environment on any computer with Ubuntu (version 18.04 or later) and the PyTorch deep learning framework, and construct the agent trapping task simulation scene;
in the constructed simulation scene, set the size of the whole simulation map to 1 x 1, and set the number of agents and obstacles in the environment, the speed, acceleration, size, and colour of the agents, and the initial positions, sizes, and colours of the obstacles.
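The scene parameters listed above can be gathered in a small configuration object, for example as below. The field names and all default values are illustrative assumptions; only the 1 x 1 map size and the list of configurable attributes (counts, speed, acceleration, size, colour, initial positions) come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class TrappingScenario:
    """Illustrative parameter bundle for the multi-agent trapping simulation scene."""
    map_size: tuple = (1.0, 1.0)          # whole simulation map is 1 x 1
    n_pursuers: int = 6
    n_prey: int = 2
    n_obstacles: int = 3
    pursuer_max_speed: float = 1.0
    prey_max_speed: float = 1.3           # assumed: prey slightly faster than pursuers
    acceleration: float = 3.0
    agent_size: float = 0.05
    obstacle_size: float = 0.1
    pursuer_color: tuple = (0.85, 0.35, 0.35)
    prey_color: tuple = (0.35, 0.85, 0.35)
    obstacle_positions: list = field(
        default_factory=lambda: [(0.3, 0.3), (-0.4, 0.1), (0.0, -0.5)])


scenario = TrappingScenario()
```

Such a configuration object would then be handed to the MPE scenario constructor when the simulation world is initialized.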
Training the double-layer graph attention reinforcement learning model of each agent in the multi-agent trapping task simulation scene, using the soft actor-critic (SAC) algorithm as the base algorithm;
and taking the double-layer graph attention reinforcement learning models of the pursuer agents as the pre-trained double-layer graph attention reinforcement learning model.
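This training procedure can be summarized schematically as follows. AgentModel and sac_update are placeholders for the double-layer graph attention model and the SAC actor-critic/entropy updates, not a real SAC implementation; the point is only that every agent (pursuers and prey) is trained, and the pursuer-side models are then kept as the pre-trained policy.

```python
class AgentModel:
    """Stand-in for one agent's double-layer graph attention RL model."""
    def __init__(self, role):
        self.role = role        # "pursuer" or "prey"
        self.updates = 0

    def sac_update(self, batch):
        # Placeholder for the SAC actor, critic, and temperature updates.
        self.updates += 1


def train_all_agents(models, batches):
    """Train every agent's model on each batch, then keep the pursuer models."""
    for batch in batches:
        for model in models:
            model.sac_update(batch)
    return [m for m in models if m.role == "pursuer"]


models = [AgentModel("pursuer"), AgentModel("pursuer"), AgentModel("prey")]
pretrained = train_all_agents(models, batches=range(10))
```

The returned list corresponds to the "pre-trained double-layer graph attention reinforcement learning model" deployed in steps S101–S103.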
In one embodiment, an agent is composed of a perception module, an action module, a storage module, a double-layer graph attention-based reinforcement learning model, and a control module.
The perception module is connected with the feature encoding module and the storage module. It obtains the agent's own state (its position and velocity information) and the environment state (the positions and velocities of other agents and the positions of obstacles) from the simulated task environment, and sends this information to the feature encoding module and the storage module.
The action module is the actuator for the agent's control instructions; it is connected with the control module, receives instructions from the control module, and moves in the simulated task environment according to those instructions.
The storage module is a memory with more than 1 GB of available space. It is connected with the perception module, the control module, and the data expansion module; it receives observation information from the perception module, control instruction information from the control module, and reward information from the simulated trapping task environment, and combines them into trajectory data of the agent's interaction with the simulated trapping task environment. The trajectory data is stored as four-tuples (s_t, a_t, r_t, s_{t+1}), where s_t is the observation state information received from the perception module at the agent's t-th interaction with the environment, a_t is the control instruction from the control module executed at the t-th interaction, r_t is the reward value fed back by the environment for the control instruction a_t at the t-th interaction, and s_{t+1} is the observation state information received from the perception module after the environment state changes (i.e., the observation state information at the (t+1)-th interaction).
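The storage module described above behaves like a standard experience replay buffer over (s_t, a_t, r_t, s_{t+1}) four-tuples. A minimal sketch (class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s_t, a_t, r_t, s_{t+1}) four-tuples of agent-environment
    interaction, as described for the storage module above."""

    def __init__(self, capacity=100_000):
        # deque drops the oldest tuples once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random minibatch, unzipped into per-field tuples
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=10)
for t in range(5):
    buf.push((t,), 0, 1.0, (t + 1,))
states, actions, rewards, next_states = buf.sample(3)
```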
In one embodiment, the double-layer graph attention-based reinforcement learning model includes: an observation encoder network, a graph attention network, and a Q network.
The observation encoder network consists of a fully connected layer with a ReLU activation function; it takes the position and velocity information of an agent as input and outputs a 128-dimensional embedding vector.
The graph attention network adopts an attention mechanism with 128-dimensional queries, keys, and values. The aggregated information and the agent's own state are concatenated and updated through a single fully connected layer of 128 neurons with a ReLU activation function. At each time step, a trajectory containing the tuples (s_t, a_t, r_t, s_{t+1})_{1…N}, where N is the number of tuples, is generated and added to the storage module. After each episode, the environment is reset and 4 updates are performed on the attention evaluation network and the policy network.
The Q network consists of two fully connected layers with ReLU activation functions, each containing 128 neurons. For each update, 1024 time steps of data are sampled from the storage module, and gradient descent is then performed on the Q network's loss objective and the policy objective.
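A PyTorch sketch of the three networks described above. The attention block below is a single-head scaled dot-product attention; the action-space size and observation dimension are assumptions for illustration, and the double-layer structure of the patent's graph attention is not reproduced:

```python
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """Fully connected layer with ReLU producing the 128-dim embedding."""
    def __init__(self, obs_dim, dim=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(obs_dim, dim), nn.ReLU())

    def forward(self, obs):
        return self.fc(obs)

class GraphAttention(nn.Module):
    """Scaled dot-product attention with 128-dim query/key/value (sketch)."""
    def __init__(self, dim=128):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, own, others):
        # own: (B, dim) agent embedding; others: (B, N, dim) neighbour embeddings
        q = self.q(own).unsqueeze(1)              # (B, 1, dim)
        k, v = self.k(others), self.v(others)     # (B, N, dim)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return (attn @ v).squeeze(1)              # (B, dim) aggregated info

class QNetwork(nn.Module):
    """Two fully connected layers of 128 units with ReLU."""
    def __init__(self, in_dim, n_actions, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, n_actions),
        )

    def forward(self, x):
        return self.net(x)

# Forward pass: embed own (x, y, vx, vy), aggregate over 5 neighbours,
# concatenate, and score a hypothetical 5-action discrete space.
enc, att, q = ObservationEncoder(4), GraphAttention(), QNetwork(256, 5)
emb = enc(torch.randn(8, 4))
agg = att(emb, torch.randn(8, 5, 128))
qvals = q(torch.cat([emb, agg], dim=-1))
```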
In one embodiment, when training the double-layer graph attention-based reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a pursuer agent equipped with the double-layer graph attention-based reinforcement learning model is as follows:
In the above formula, reward_pursuer is the total reward of the pursuer, d_pe is the distance between the pursuer and the prey, reward_encirclement is the trapping reward, reward_collide1 is the collision reward of the pursuer, p_p is the position of the pursuer agent, p_e is the position of the prey agent, and l is the trapping radius of the pursuer.
In one embodiment, the collision reward of the pursuer is as follows:
In the above formula, d_pp is the distance between the pursuer and other pursuers, d_po is the distance between the pursuer and an obstacle, η is a hyper-parameter, p_p' is the position of another pursuer, and p_o is the position of the obstacle;
the trapping reward is as follows:
reward_encirclement = 15, if d_pe ≤ l and angle_gap < tolerance
In the above formula, angle_gap is the difference between the formation angle among the pursuers and the expected encirclement angle, and tolerance is the error tolerance range.
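The pursuer reward formulas are rendered as images in the source and are not reproduced here; the sketch below assumes illustrative forms for the distance-shaping and collision terms, keeping only what the text states: a trapping bonus of 15 when the pursuer is within the trapping radius l and the formation-angle error is within tolerance, and a collision penalty weighted by the hyper-parameter η:

```python
import math

# Hypothetical pursuer reward sketch. The negative-distance shaping and the
# 0.1 collision threshold are assumptions; the +15 trapping bonus, trapping
# radius l, angle tolerance, and eta weight follow the text above.
def pursuer_reward(p_pursuer, p_prey, dists_to_others, dists_to_obstacles,
                   angle_gap, l=0.3, tolerance=0.1, eta=0.1):
    d = math.dist(p_pursuer, p_prey)       # pursuer-prey distance d_pe
    reward = -d                            # closer to the prey is better (assumed)
    # Trapping reward: within radius l with formation angle inside tolerance
    if d <= l and angle_gap < tolerance:
        reward += 15.0
    # Collision penalty: penalise near-contacts with teammates and obstacles
    close = sum(1.0 for dd in dists_to_others + dists_to_obstacles if dd < 0.1)
    reward -= eta * close
    return reward

r_trapped = pursuer_reward((0.0, 0.0), (0.1, 0.0), [1.0], [1.0], angle_gap=0.05)
r_far = pursuer_reward((0.0, 0.0), (0.9, 0.0), [1.0], [1.0], angle_gap=0.5)
```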
In one embodiment, the difference between the formation angle among the pursuers and the expected encirclement angle is computed as follows:
angle_gap = angle_except − angle_diff
In the above formula, angle_except is the expected average encirclement angle when the pursuers form the formation, and angle_diff is the actual angle when the pursuers form the formation, wherein:
angle_except = 2π/M
In the above formulas, M is the number of pursuers forming the encirclement formation, x_e is the horizontal-axis position of the prey agent, y_e is the vertical-axis position of the prey agent, x_p is the horizontal-axis position of the pursuer agent, y_p is the vertical-axis position of the pursuer agent, x_p' is the horizontal-axis position of the pursuer agent adjacent to the pursuer agent at (x_p, y_p), and y_p' is the vertical-axis position of that adjacent pursuer agent.
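The expected angle 2π/M is given explicitly above; the patent's angle_diff formula is rendered as an image, so the atan2-based computation over the axis positions below is an assumed reconstruction:

```python
import math

def expected_angle(m):
    """angle_except = 2*pi / M for M pursuers in the encirclement formation."""
    return 2 * math.pi / m

def actual_angle(prey, pursuer, neighbour):
    """Angle at the prey between a pursuer and its adjacent pursuer,
    computed from the (x, y) axis positions (assumed atan2 reconstruction)."""
    a1 = math.atan2(pursuer[1] - prey[1], pursuer[0] - prey[0])
    a2 = math.atan2(neighbour[1] - prey[1], neighbour[0] - prey[0])
    diff = abs(a1 - a2) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)   # wrap into [0, pi]

def angle_gap(m, prey, pursuer, neighbour):
    """angle_gap = angle_except - angle_diff, as in the formula above."""
    return expected_angle(m) - actual_angle(prey, pursuer, neighbour)

# Four pursuers evenly spaced should give an angle gap of zero:
gap = angle_gap(4, (0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
```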
In one embodiment, when training the double-layer graph attention-based reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a prey agent equipped with the double-layer graph attention-based reinforcement learning model is as follows:
In the above formula, reward_prey is the total reward of the prey and reward_collide2 is the collision reward of the prey.
In one embodiment, the collision reward of the prey is as follows:
In the above formula, d_eo is the distance between the prey and an obstacle, and d_ee is the distance between the prey and other prey.
In one embodiment, the distance between the pursuer and the prey, the distance between the prey and an obstacle, and the distance between the prey and other prey are computed as follows:
In the above formulas, x_o is the horizontal-axis position of the obstacle, y_o is the vertical-axis position of the obstacle, x_e' is the horizontal-axis position of another prey agent, and y_e' is the vertical-axis position of another prey agent.
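The distance formulas are rendered as images in the source, but from the axis positions defined above they are ordinary 2-D Euclidean distances, which can be sketched as:

```python
import math

# Standard 2-D Euclidean distance over the (x, y) axis positions defined
# above; used alike for pursuer-prey, prey-obstacle, and prey-prey distances.
def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

d_pursuer_prey = distance((0.2, 0.3), (0.6, 0.6))
d_prey_obstacle = distance((0.6, 0.6), (0.6, 0.1))
```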
Finally, the trained double-layer graph attention-based reinforcement learning model is saved in .pt format.
This embodiment varies the numbers of pursuers, prey, and obstacles, as well as the relative speed between pursuers and prey, to verify the performance of the method. For comparison, the MADDPG algorithm, which uses no attention, and the G2ANet and MAAC algorithms, which use attention, are considered. Three common evaluation indexes are used to evaluate the different models: trapping success rate, mean episode reward (MER), and mean episode length (MEL).
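The three evaluation indexes can be sketched as follows; the episode-record field names are illustrative assumptions:

```python
# Sketch of the three evaluation indexes named above: trapping success rate,
# mean episode reward (MER), and mean episode length (MEL). The per-episode
# dict fields ("success", "reward", "steps") are hypothetical.
def evaluate(episodes):
    n = len(episodes)
    success_rate = sum(1 for e in episodes if e["success"]) / n
    mer = sum(e["reward"] for e in episodes) / n   # mean episode reward
    mel = sum(e["steps"] for e in episodes) / n    # mean episode length
    return success_rate, mer, mel

records = [
    {"success": True,  "reward": 10.0, "steps": 20},
    {"success": False, "reward": 2.0,  "steps": 50},
]
sr, mer, mel = evaluate(records)
```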
To evaluate the effectiveness of the proposed method, experiments were performed from two aspects. First, 3 different task scenarios were set up, with the maximum acceleration of all pursuers and of the prey fixed at 10. As shown in Table 1, when 2 obstacles O exist in the environment and 6 pursuers P cooperatively trap 2 prey E, all methods can complete the task, but the trapping success rate of the proposed method is 33.05% higher than that of the attention-free MADDPG method, and the MER of our algorithm is improved by 3.08 over the attention-based MAAC method. When the number of obstacles in the environment increases, the trapping MER of the MAAC algorithm, which also uses attention, decreases, while the MER convergence value of the proposed method changes little, indicating that our method is more robust. When 9 pursuers cooperate to trap 3 prey, the MADDPG algorithm essentially cannot complete the task, and its trapping success rate drops sharply. The attention-based MAAC method also degrades: its MEL increases significantly, and the agents have more difficulty learning strategies. In this setting, the success rate of the proposed method is 7.71% higher than MAAC, its MER does not converge to a lower level despite the increased number of agents, and its MEL also reaches a better level, indicating that the agents learn a more advanced strategy.
TABLE 1
Second, to evaluate the generalization of the method, different speeds were set for the pursuers and the prey, and comparison experiments were completed. In general, when the pursuer speed V_p is greater than the prey speed V_e, the trapping task is easily completed; when the pursuer speed is less than the prey speed, a higher-level cooperation strategy is needed among the pursuers to complete the trapping task. As shown in Table 2, when the pursuer speed is greater than the prey speed, all algorithms can complete the trapping task, and the MEL of the proposed method is reduced by 6.72 compared with the MAAC algorithm. When the pursuer speed equals the prey speed, the success rate of the MADDPG algorithm decreases by 10.68%, while the trapping success rate of the proposed method does not decrease noticeably and its MER convergence level is improved by 7.57 over MAAC. When the pursuer speed is less than the prey speed, the MADDPG method can hardly learn an effective trapping strategy, and the learning time of the MAAC agents is also significantly prolonged, whereas the strategy learned by the proposed method can still cope with the faster prey: compared with MAAC, the trapping time is reduced by 7.07 and the success rate is improved by 12.51%.
TABLE 2
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.
Claims (9)
1. A multi-agent trapping method based on double-layer graph attention reinforcement learning, characterized by comprising the following steps:
Step S101, perceiving observation state information of an agent;
Step S102, taking the observation state information of the agent as the input of a pre-trained double-layer graph attention-based reinforcement learning model, and obtaining an action execution instruction output by the pre-trained double-layer graph attention-based reinforcement learning model;
Step S103, executing the corresponding action based on the action execution instruction; if the trapping succeeds, ending the operation, otherwise returning to step S101;
wherein the observation state information of the agent includes: the position and velocity information of the agent itself, the position and velocity information of other agents, the position information of obstacles, and the position and velocity information of the trapping target.
2. The method of claim 1, wherein the acquisition process of the pre-trained double-layer graph attention-based reinforcement learning model comprises:
constructing a multi-agent trapping task simulation scene comprising pursuer agents equipped with a double-layer graph attention-based reinforcement learning model, prey agents equipped with a double-layer graph attention-based reinforcement learning model, and obstacles;
training the double-layer graph attention-based reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene, using the SAC algorithm as the base algorithm; and
taking the double-layer graph attention-based reinforcement learning model corresponding to the pursuer agents as the pre-trained double-layer graph attention-based reinforcement learning model.
3. The method of claim 2, wherein the double-layer graph attention-based reinforcement learning model comprises: an observation encoder network, a graph attention network, and a Q network;
the observation encoder network consists of a fully connected layer with a ReLU activation function, takes the position and velocity information of an agent as input, and outputs a 128-dimensional embedding vector;
the graph attention network adopts an attention mechanism with 128-dimensional queries, keys, and values;
the Q network consists of two fully connected layers with ReLU activation functions, each containing 128 neurons.
4. The method of claim 2, wherein, in training the double-layer graph attention-based reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a pursuer agent equipped with the double-layer graph attention-based reinforcement learning model is as follows:
In the above formula, reward_pursuer is the total reward of the pursuer, d_pe is the distance between the pursuer and the prey, reward_encirclement is the trapping reward, reward_collide1 is the collision reward of the pursuer, p_p is the position of the pursuer agent, p_e is the position of the prey agent, and l is the trapping radius of the pursuer.
5. The method of claim 4, wherein the collision reward of the pursuer is as follows:
In the above formula, d_pp is the distance between the pursuer and other pursuers, d_po is the distance between the pursuer and an obstacle, η is a hyper-parameter, p_p' is the position of another pursuer, and p_o is the position of the obstacle;
the trapping reward is as follows:
reward_encirclement = 15, if d_pe ≤ l and angle_gap < tolerance
In the above formula, angle_gap is the difference between the formation angle among the pursuers and the expected encirclement angle, and tolerance is the error tolerance range.
6. The method of claim 5, wherein the difference between the formation angle among the pursuers and the expected encirclement angle is computed as follows:
angle_gap = angle_except − angle_diff
In the above formula, angle_except is the expected average encirclement angle when the pursuers form the formation, and angle_diff is the actual angle when the pursuers form the formation, wherein:
angle_except = 2π/M
In the above formulas, M is the number of pursuers forming the encirclement formation, x_e is the horizontal-axis position of the prey agent, y_e is the vertical-axis position of the prey agent, x_p is the horizontal-axis position of the pursuer agent, y_p is the vertical-axis position of the pursuer agent, x_p' is the horizontal-axis position of the pursuer agent adjacent to the pursuer agent at (x_p, y_p), and y_p' is the vertical-axis position of that adjacent pursuer agent.
7. The method of claim 6, wherein, in training the double-layer graph attention-based reinforcement learning model corresponding to each agent in the multi-agent trapping task simulation scene with the SAC algorithm as the base algorithm, the reward function of a prey agent equipped with the double-layer graph attention-based reinforcement learning model is as follows:
In the above formula, reward_prey is the total reward of the prey and reward_collide2 is the collision reward of the prey.
8. The method of claim 7, wherein the collision reward of the prey is as follows:
In the above formula, d_eo is the distance between the prey and an obstacle, and d_ee is the distance between the prey and other prey.
9. The method of claim 8, wherein the distance between the pursuer and the prey, the distance between the prey and an obstacle, and the distance between the prey and other prey are computed as follows:
In the above formulas, x_o is the horizontal-axis position of the obstacle, y_o is the vertical-axis position of the obstacle, x_e' is the horizontal-axis position of another prey agent, and y_e' is the vertical-axis position of another prey agent.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410141040.1A CN118153621A (en) | 2024-02-01 | 2024-02-01 | Multi-agent trapping method based on double-layer diagram attention reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118153621A true CN118153621A (en) | 2024-06-07 |