CN118070066A - Training method and device for underwater multi-agent trapping escape game strategy
Info
- Publication number
- CN118070066A (application CN202410466338.XA)
- Authority
- CN
- China
- Prior art keywords
- agent
- game strategy
- strategy
- determining
- escape
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/20—Pattern recognition; Analysing
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
- G06F30/28—Design optimisation, verification or simulation using fluid dynamics, e.g. using Navier-Stokes equations or computational fluid dynamics [CFD]
- G06F2113/08—Details relating to the application field: Fluids
- G06F2119/14—Force analysis or force optimisation, e.g. static or dynamic forces
Abstract
The invention relates to the technical field of swarm intelligence control and provides a training method and device for an underwater multi-agent trapping escape game strategy. The method comprises the following steps: acquiring the action information of the multiple agents in the game strategy at the previous moment; determining, based on the action information, the resultant force on the multiple agents in the underwater environment, and performing kinematic analysis on the resultant force to obtain the swimming state of the multiple agents at the current moment; and determining the action information of the multiple agents in the game strategy at the next moment based on the swimming state at the current moment, until the trapping escape game strategy is obtained. By acquiring the action information of the multiple agents at the previous moment and constructing a hydrodynamic model and a dynamic model to build a state transition model for the underwater scene, the method realizes the state transition, obtains the game strategy at the next moment, and finally yields a trapping escape game strategy for the underwater scene, so that trapping escape tasks among different underwater agents can be effectively realized.
Description
Technical Field
The invention relates to the technical field of swarm intelligence control, and in particular to a training method and device for an underwater multi-agent trapping escape game strategy.
Background
The control of multi-robot systems, especially multi-robot pursuit-escape (PE) games, plays an increasingly important role in fields such as group cooperation, game decision-making, and biological cluster analysis. In practical applications of multi-agent reinforcement learning algorithms, a common solution is to treat all agents as one super agent based on a fully centralized structure, combining the actions of every agent into one joint action. Moreover, most research on game problems focuses on ground or aerial robots.
However, exploring game strategies for multiple biomimetic robots in an underwater environment with highly nonlinear characteristics remains a difficult challenge.
Disclosure of Invention
The invention provides a training method and device for an underwater multi-agent trapping escape game strategy, which address the problem in the prior art that exploring game strategies for multiple biomimetic robots in a highly nonlinear underwater environment remains a difficult challenge.
The invention provides a training method for an underwater multi-agent trapping escape game strategy, which comprises the following steps:
Acquiring action information of the multi-agent in the game strategy at the previous moment, wherein the multi-agent comprises a capturer and an evader;
Based on the action information, determining resultant force of the multi-agent in an underwater environment, and performing kinematic analysis on the resultant force to obtain a swimming state of the multi-agent at the current moment;
and determining action information of the multi-agent in a game strategy at the next moment based on the swimming state at the current moment until a final trapping escape game strategy is obtained.
According to the training method of the underwater multi-agent trapping escape game strategy, the action information comprises the oscillation frequency of the neuron signals governing the motion of the multi-agent and the turning offset angle of the tail fin of the multi-agent;
The determining, based on the motion information, a resultant force experienced by the multi-agent in an underwater environment, comprising:
And determining the resultant force suffered by the multi-agent in the underwater environment based on the oscillation frequency and the turning offset angle.
According to the training method of the underwater multi-agent trapping escape game strategy provided by the invention, the resultant force is subjected to kinematic analysis to obtain the swimming state of the multi-agent at the current moment, and the training method comprises the following steps:
Taking partial derivatives of the kinetic energy corresponding to the resultant force to obtain momentum information of the multi-agent;
Determining acceleration information of the multi-agent based on the resultant force and momentum information of the multi-agent;
Determining the swimming state of the multi-agent at the current moment based on the acceleration information of the multi-agent;
The swimming state includes coordinates, swimming speed, and posture of the agent.
According to the training method of the underwater multi-agent trapping escape game strategy provided by the invention, the action information of the multi-agent in the game strategy at the next moment is determined based on the swimming state at the current moment until the final trapping escape game strategy is obtained, and the training method comprises the following steps:
Based on the swimming state of all the intelligent agents at the current moment, respectively determining the action information of each intelligent agent in the game strategy at the next moment until the shared game strategy is obtained;
And taking the shared game strategy as a final trapping escape game strategy, wherein the final trapping escape game strategy is used for determining action information of each agent in the game strategy at the next moment based on the swimming state of each agent at the current moment and the final trapping escape game strategy in an execution stage.
According to the training method of the underwater multi-agent trapping escape game strategy provided by the invention, the action information of each agent in the game strategy at the next moment is respectively determined based on the swimming states of all agents at the current moment, and the training method comprises the following steps:
Determining a strategy reward at the previous moment based on the global swimming state at the current moment;
And respectively determining action information of each agent in a game strategy at the next moment based on the maximized accumulated strategy rewards.
According to the training method of the underwater multi-agent trapping escape game strategy provided by the invention, the strategy rewards at the last moment are determined based on the global swimming state at the current moment, and the training method comprises the following steps:
Determining the capture reward at the previous moment based on the coordinates of the capturers and the evader among the agents and a preset game rule;
Determining the boundary reward at the previous moment based on the coordinates of the agent and the agent's motion boundary;
determining the enhanced game reward at the previous moment based on the coordinates of the capturers and the evaders among the agents;
and determining the strategy reward at the previous moment based on the capture reward, the boundary reward, and the enhanced game reward.
The invention also provides a training device for the underwater multi-agent trapping escape game strategy, which comprises:
an acquisition unit, which acquires action information in the game strategy of the multiple agents at the previous moment, wherein the multiple agents comprise a capturer and an evader;
The state transfer unit is used for determining resultant force of the multi-agent in the underwater environment based on the action information, and performing kinematic analysis on the resultant force to obtain a swimming state of the multi-agent at the current moment;
and the game unit determines action information of the multi-agent in a game strategy at the next moment based on the swimming state at the current moment until a final trapping escape game strategy is obtained.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the training method of the underwater multi-agent trapping escape game strategy is realized when the processor executes the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a training method of an underwater multi-agent trapping escape game strategy as described in any of the above.
The invention also provides a computer program product, which comprises a computer program, wherein the computer program is executed by a processor to realize the training method of the underwater multi-agent trapping escape game strategy.
According to the training method and device for the underwater multi-agent trapping escape game strategy provided by the invention, the action information of the multi-agent trapping escape game strategy at the previous moment is acquired; a hydrodynamic model and a dynamic model are constructed to build a state transition model for the underwater scene, realizing the state transition of the multi-agent; the game strategy of the multi-agent at the next moment is obtained; and the final trapping escape game strategy for the underwater scene is obtained through training, so that trapping escape tasks among different underwater agents can be effectively realized.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a training method of an underwater multi-agent trapping escape game strategy provided by the invention;
FIG. 2 is a schematic illustration of a trap escape task provided by the present invention;
FIG. 3 is a schematic diagram of a coordinate system of the underwater biomimetic robotic fish provided by the invention;
FIG. 4 is a second flow chart of the training method of the underwater multi-agent trapping escape game strategy provided by the invention;
FIG. 5 shows reward curves from simulation training with three different algorithms provided by the present invention;
FIG. 6 is a schematic structural diagram of a training device for an underwater multi-agent trapping escape game strategy provided by the invention;
fig. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To address the above problems, the invention provides a training method for an underwater multi-agent trapping escape game strategy, which is used to train a multi-agent trapping escape game strategy in an underwater environment. Fig. 1 is a schematic flow chart of the training method of the underwater multi-agent trapping escape game strategy provided by the invention; as shown in Fig. 1, the method includes:
Step 110, obtaining action information in the game strategy of the multi-agent at the previous moment, wherein the multi-agent comprises a capturer and an evader;
In a specific underwater multi-agent scenario, the multiple agents may be divided into a capturer team and an evader team. The coordination and collaboration within the capturer team, and the competition and antagonism between the capturer team and the evader team, may be referred to as a multi-agent game strategy. For a multi-agent in an underwater environment, each agent may be a biomimetic machine corresponding to a living organism in the real underwater environment, for example a biomimetic robotic fish, or another biomimetic robot with strong locomotion capability.
In one embodiment, Fig. 2 is a schematic diagram of the trapping escape task of the present invention. As shown in Fig. 2, a capturer team P and an evader E perform the trapping escape task. One coordinate system is the agent (body) coordinate system, whose x-axis points in the positive direction of the agent's head; the other is the world coordinate system. Each capturer has a sensing radius, each agent's tail-fin surface has a vibration amplitude, and a distance is defined between the i-th capturer and the j-th evader. All capturers in the capturer team are homogeneous, while capturers and evaders are heterogeneous: the maximum speed of a capturer is set to 0.4 m/s, its maximum angular speed to 0.67 rad/s, and its body length to 0.65 m. The evader team contains one evader, whose maximum speed is 0.6 m/s, maximum angular speed is 0.89 rad/s, and body length is 0.5 m. In each round of the game, the trapping task is considered successful if the distance between any capturer and the evader becomes smaller than the sensing radius R. Conversely, if by the end of the round the distance between every capturer and the evader has never become smaller than the sensing radius R, the evader is considered to have survived successfully.
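For illustration only, the task setup above can be captured in a small configuration sketch. The class, field, and function names below are hypothetical (they do not appear in the patent); the numeric values are the ones given in this embodiment, and the sensing radius R is left as a parameter.

```python
import math
from dataclasses import dataclass

@dataclass
class AgentSpec:
    """Illustrative per-role parameters from the embodiment above."""
    max_speed: float          # m/s
    max_angular_speed: float  # rad/s
    body_length: float        # m

CAPTURER_SPEC = AgentSpec(max_speed=0.4, max_angular_speed=0.67, body_length=0.65)
EVADER_SPEC = AgentSpec(max_speed=0.6, max_angular_speed=0.89, body_length=0.5)

def capture_succeeded(capturer_xy, evader_xy, sensing_radius):
    """The trapping task succeeds if the distance between any capturer and
    the evader is smaller than the sensing radius R."""
    dx, dy = capturer_xy[0] - evader_xy[0], capturer_xy[1] - evader_xy[1]
    return math.hypot(dx, dy) < sensing_radius
```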
It will be appreciated that formulating a reasonable game strategy involves real-time path planning for the multiple agents, as well as coordination and control among the agents. Therefore, by obtaining the action information of the multi-agent in the game strategy at the previous moment through a reinforcement learning method, the current state of each agent after executing the action information at the previous moment can be obtained, and the action information of the multi-agent at the next moment is determined based on the current state; this is recorded as the trapping escape game strategy.
It should be noted that, in the prior art, a trapping escape game strategy can also be obtained by other training methods. For example, some work controls the conditions and applications in the game based on relaxation and introduces the concept of relaxation games; other work studies how to apply Monte Carlo tree search to pursuit games and proposes methods such as deterministic analysis, location classification, and alliance reduction; and for approaches based on actor-critic-opponent (ACMO) algorithms, the computational complexity tends to increase exponentially with the number of agents, resulting in poor scalability of the game algorithm. In contrast, reinforcement learning methods can avoid accurate modeling of the strategy itself, thereby reducing modeling complexity and ensuring algorithm scalability.
The motion information of the multi-agent herein may be used to control neural signals of the multi-agent for locomotion. It can be understood that the neural signals for controlling the multi-agent to perform straight running, steering and other motion paths in the underwater environment are used as the action information of the multi-agent, so that the action information of the multi-agent is digitized.
Specifically, the action information of the multi-agent in the game strategy at the last moment can be obtained by acquiring the oscillation frequency and the steering angle information of the neural signals for controlling the multi-agent to perform the motion at the last moment.
It is also understood that after the multi-agent performs the motion information in the underwater environment, the information such as the coordinate position, speed, and posture of the multi-agent changes, which can be regarded as a state transition. However, in an underwater environment, it is difficult to directly acquire information such as coordinate positions, speeds, postures and the like of multiple intelligent agents in the underwater environment. Thus, step 120 may be performed to achieve a state transition by analyzing the motion information of the multi-agent and the underwater fluid environment, resulting in a swimming state of the multi-agent at the current time.
Step 120, based on the action information, determining resultant force of the multi-agent in the underwater environment, and performing kinematic analysis on the resultant force to obtain a swimming state of the multi-agent at the current moment;
Here, the resultant force on the multi-agent in the underwater environment may be determined based on the fluid force the multi-agent is subjected to in the underwater environment and the quasi-steady lift and drag. Fluid force here refers to the resistance of the water and the interaction forces with the water experienced by the multi-agent in the underwater environment. It will be appreciated that multi-agents in an underwater environment are typically designed with reference to the hydrodynamic properties of real underwater organisms, to improve their efficiency and stability of movement in the water. In addition, the quasi-steady lift and drag refer to the lift force, perpendicular to the direction of motion, and the drag force, along the direction of motion, experienced by the multi-agent in a steady or approximately steady motion state.
In addition, the swimming state refers to the current movement state and posture of the multi-agent after the multi-agent has executed the action information. For example, the information such as the coordinate position, the speed, the gesture and the like of the current moment of the multi-agent can be obtained.
Specifically, first, the joint angle can be obtained from the motion information of the joints of the multiple agents, and the joint angular velocity and joint angular acceleration can be obtained from the joint angle. Then, the fluid force and fluid moment on the multi-agent can be obtained through a pre-constructed viscous drag model, and the tail-fin force and moment on the multi-agent can be obtained through a pre-constructed quasi-steady lift-drag model. The resultant force and resultant moment on the multiple agents can thus be obtained.
Further, dynamic modeling and kinematic analysis can be performed on the resultant force received by the multi-agent: for example, a dynamic model is constructed, the state transition of the multi-agent is calculated from its current force information and action information, and the coordinate position, speed, attitude, and other information of the multi-agent at the current moment are obtained, i.e., the swimming state of the multi-agent at the current moment.
It should be noted that a viscous drag model and a quasi-steady lift-drag model may be constructed separately, and together used as the hydrodynamic model. In practical training, each force model in the hydrodynamic model can be adjusted according to the actual forces on the agent in the underwater environment; that is, the hydrodynamic model may also include other force models.
Step 130, determining action information of the multi-agent in the game strategy at the next moment based on the swimming state at the current moment, until the final trapping escape game strategy is obtained.
Specifically, the strategy reward of the game strategy at the previous moment can be calculated from the swimming state of the multiple agents at the current moment. The swimming state at the current moment and the strategy reward at the previous moment can then be input into a multi-agent deep deterministic policy gradient network, which outputs the action information of the multi-agent in the game strategy at the next moment. This loop repeats until the strategy reward no longer improves (i.e., the accumulated reward is maximized) or the number of iterations reaches a preset step length, thereby obtaining the final trapping escape game strategy.
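The loop just described (actions at the previous moment, state transition, reward, actions at the next moment) can be sketched roughly as follows. The objects `env` (the state transition model) and `maddpg` (the deep deterministic policy gradient networks) are hypothetical placeholders, not part of the patent; this is a reading aid, not the patented implementation.

```python
def train(env, maddpg, num_rounds, steps_per_round):
    """Rough sketch: repeat state transition and policy updates until the
    accumulated strategy reward stops improving or the step budget is used up."""
    for round_idx in range(num_rounds):
        swim_states = env.reset()                           # swimming states at the initial moment
        for step in range(steps_per_round):
            actions = maddpg.act(swim_states)               # action info for the next moment
            next_states, rewards, done = env.step(actions)  # hydrodynamic + dynamic state transition
            maddpg.store(swim_states, actions, rewards, next_states)
            maddpg.update()                                 # centralized critics, decentralized actors
            swim_states = next_states
            if done:
                break
```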
According to the method provided by the embodiment of the invention, the action information of the game strategy of the multi-agent at the last moment is obtained, the state transfer model facing the underwater scene is constructed by constructing the hydrodynamic model and the dynamic model, the state transfer of the multi-agent is realized, the game strategy of the multi-agent at the next moment is obtained, the final capture escape game strategy facing the underwater scene is further trained, and the capture escape task among different underwater agents can be effectively realized.
Based on any of the above embodiments, the motion information includes an oscillation frequency of a neuron signal that dominates the multi-agent motion, and a turning offset angle of the multi-agent tail fin;
In step 120, the determining, based on the motion information, a resultant force that the multi-agent receives in an underwater environment includes:
And determining the resultant force suffered by the multi-agent in the underwater environment based on the oscillation frequency and the turning offset angle.
Here, the action information may be obtained based on a CPG (Central Pattern Generator) network. A CPG network is a self-organizing biological neural circuit that governs rhythmic movement in living organisms; it refers to a neural circuit in the central nervous system in which mutual inhibition between neurons produces self-oscillation and thus a stable periodic signal. The invention can output the action information of the agent based on the CPG network through oscillator equations whose terms are as follows:
- the rate of change of the oscillator state variables that generate the oscillation signal at time t;
- the oscillation frequency, which describes the swimming speed of the agent;
- a coupling coefficient, which affects the convergence speed;
- the oscillator state variables that generate the oscillation signal at time t-1;
- the vibration amplitude of the tail-fin surface of the i-th agent;
- the bias of the i-th agent, which describes the offset angle of the i-th agent's tail fin when turning;
- the differentiation with respect to time;
- the joint angle of the tail-fin surface of the i-th agent.
Thus, a control model in which the CPG network controls the agent's action information can be constructed from the following quantities: the joint angle of the agent's tail-fin surface; the vibration amplitude of the tail-fin surface; the oscillation frequency; the bias of the agent, describing the offset angle of the tail fin when turning; and time t. The oscillation frequency determines the swimming speed of the agent, and the bias determines the tail-fin deflection angle of the agent.
Based on the CPG network, the oscillation frequency parameter f, which controls the swimming speed of the robotic fish, and the bias parameter b, which influences the swimming offset, are used to design an action space that conforms to the control characteristics of the agent.
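As a reading aid, a minimal sketch of the tail-fin control signal implied by these quantities is shown below, assuming the common sinusoidal CPG output form (joint angle = amplitude x sin(2*pi*f*t) + bias). The exact oscillator equations of the invention are given by the original formulas and may differ.

```python
import math

def tail_fin_joint_angle(amplitude, frequency, bias, t):
    """Assumed sinusoidal CPG output: the oscillation frequency f sets the
    swimming speed, and the bias b sets the turning offset of the tail fin."""
    return amplitude * math.sin(2.0 * math.pi * frequency * t) + bias
```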
As shown in Fig. 2, the multi-agent may be propelled and steered in an underwater environment by the swinging and biasing of the tail fin. The swinging of the tail fin is controlled by the oscillation frequency of the neuron signals governing the multi-agent's motion in the action information, and the bias is controlled by the turning offset angle of the multi-agent's tail fin. Therefore, with an appropriate action design, the mapping from the agent's action space to its control parameters is fully embodied, and agile action control of the agent is realized.
Specifically, first, the joint angular velocity and joint angular acceleration can be obtained from the turning offset angle of the caudal-fin surface joint, and the propulsion speed of the agent can be obtained from the oscillation frequency of the neuron signals governing the multi-agent's motion. Then, the fluid force and fluid moment on each agent in the underwater environment can be obtained through the viscous drag model, in which: the fluid force on the agent is determined by a fluid drag coefficient and the agent's propulsion speed, and the fluid moment is determined by a fluid drag torque coefficient and the agent's angular velocity.
In addition, the force and moment exerted by the agent's tail fin can be calculated from the angle of the tail-fin surface joint using the quasi-steady lift-drag model, in which the relevant quantities are: the force of the agent's tail fin; the tail-fin length; the mass per unit length of the tail fin; the angle of the tail-fin swing at time t; the angular acceleration of the tail-fin swing at time t; and the moment exerted by the agent's tail fin.
Further, the forces received by the agents can be summed to obtain a resultant force received by the agents, and the moments of the agents can be summed to obtain a resultant moment of the agents.
It should be noted that a viscous drag model and a quasi-steady lift-drag model may be constructed separately, and together used as the hydrodynamic model. In practical training, each force model in the hydrodynamic model can be adjusted according to the actual forces on the agent in the underwater environment; that is, the hydrodynamic model may also include other force models.
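A rough sketch of how the force terms named above might be composed into a resultant force and moment is given below. The linear drag law, the added-mass style fin term, and the lever arm are assumptions made for readability; they are not the patent's exact hydrodynamic model, whose formulas are given as images in the original.

```python
def hydrodynamic_forces(v, omega, fin_ang_acc, c_d, c_m, fin_length, fin_mass_per_len):
    """Illustrative composition of the named force terms:
    - viscous drag force/moment, driven by propulsion speed v and angular velocity omega
      (a linear drag law is assumed here for simplicity);
    - quasi-steady tail-fin force/moment, driven by the fin's angular acceleration."""
    drag_force = -c_d * v
    drag_moment = -c_m * omega
    fin_force = fin_mass_per_len * fin_length * fin_ang_acc   # assumed added-mass style term
    fin_moment = 0.5 * fin_length * fin_force                 # assumed lever arm: half the fin length
    return drag_force + fin_force, drag_moment + fin_moment
```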
According to the method provided by this embodiment of the invention, the oscillation frequency of the neuron signals governing the agent's motion and the turning offset angle of the agent's tail fin are used as the agent's action information, and this action information is translated into the resultant force borne by the agent through the hydrodynamic model, realizing the state transition of the agent in the underwater environment.
Based on any of the foregoing embodiments, in step 120, the performing kinematic analysis on the resultant force to obtain a swimming state of the multi-agent at the current moment includes:
Taking partial derivatives of the kinetic energy corresponding to the resultant force to obtain momentum information of the multi-agent;
Determining acceleration information of the multi-agent based on the resultant force and momentum information of the multi-agent;
Determining the swimming state of the multi-agent at the current moment based on the acceleration information of the multi-agent;
The swimming state includes coordinates, swimming speed, and posture of the agent.
Here, the coordinates of the agent may reflect the location information of the agent at the current time; the posture of the intelligent body can be an included angle which indicates the body of the intelligent body and the horizontal direction, and the propelling direction of the intelligent body can be reflected.
Specifically, the relationship between the resultant force and the kinetic energy can be constructed from the kinetic energy theorem and Newton's laws. The momentum and angular momentum can then be obtained by taking partial derivatives of the kinetic energy with respect to the velocity and the angular velocity, respectively. Further, the acceleration information of the multi-agent can be determined from the resultant force and the momentum information of the multi-agent, for example through equations involving: the rate of change of momentum; the skew-symmetric (cross-product) matrix of the momentum; the agent's angular velocity; the resultant force F; the rate of change of angular momentum; the skew-symmetric matrix of the angular momentum; the agent's propulsion speed; and the resultant moment.
Further, the acceleration information of the multi-agent in the agent coordinate system can be converted into acceleration information in the world coordinate system, for example through equations involving: the acceleration of the agent in the world coordinate system; the time derivative of the rotation matrix mapping the agent coordinate system to the world coordinate system; the skew-symmetric matrix of the angular velocity; the rotation matrix mapping the agent coordinate system to the world coordinate system; the acceleration of the agent in the agent coordinate system; the angular acceleration of the agent in the world coordinate system; and the angular acceleration of the agent in the agent coordinate system.
Finally, the coordinates of the multi-agent in the world coordinate system are determined from the acceleration and angular acceleration in the world coordinate system, and the speed and attitude of the multi-agent are determined; that is, the swimming state of the multi-agent at the current moment is determined. The swimming state includes the coordinates, swimming speed, and attitude of the agent. Fig. 3 is a schematic diagram of the coordinate systems of the underwater biomimetic robotic fish; as shown in Fig. 3, it depicts the agent coordinate system and the world coordinate system, the joint angle of the caudal-fin surface (i.e., the offset angle between the agent coordinate system and the target point), the attitude of the robotic fish, and the deflection angle between the robotic fish and the target point, which can represent the deflection angle between a capturer and the evader.
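A planar (2-D) sketch of the body-frame to world-frame conversion and the integration into the swimming state is given below; the patent's formulation is three-dimensional and uses rotation matrices and skew-symmetric terms, so this is only an illustrative simplification.

```python
import math

def body_to_world_acceleration(a_body, psi):
    """Rotate a body-frame acceleration (ax, ay) into the world frame using
    the agent's heading angle psi (planar simplification)."""
    ax, ay = a_body
    return (ax * math.cos(psi) - ay * math.sin(psi),
            ax * math.sin(psi) + ay * math.cos(psi))

def integrate_swimming_state(pos, vel, acc_world, dt):
    """Euler-integrate the world-frame acceleration into coordinates and
    swimming speed; the attitude is integrated analogously from the angular
    acceleration."""
    vel = (vel[0] + acc_world[0] * dt, vel[1] + acc_world[1] * dt)
    pos = (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)
    return pos, vel
```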
According to the method provided by the embodiment of the invention, the dynamic model is constructed through the momentum theorem and Newton's law, the dynamic model is used for carrying out kinematic analysis to obtain the swimming state of multiple intelligent agents, the state transfer of the intelligent agents is realized, and the global state corresponding to the action information at the last moment is obtained.
In practical applications of multi-agent reinforcement learning algorithms, a common solution is to treat all agents as one super agent based on a fully centralized structure, combining the actions of every agent into one joint action. While this approach ensures environmental invariance and embodies the rationality of the joint policy, the state-space dimensions and policy complexity increase exponentially as the number of agents increases. This complexity prevents scalability to large-scale trapping escape tasks. Furthermore, complex joint actions place higher demands on the communication capabilities and real-time performance of the multi-robot system.
To address the above problems, based on any of the above embodiments, determining the action information of the multi-agent in the game strategy at the next moment based on the swimming state at the current moment, until the final trapping escape game strategy is obtained, includes:
Based on the swimming state of all the intelligent agents at the current moment, respectively determining the action information of each intelligent agent in the game strategy at the next moment until the shared game strategy is obtained;
And taking the shared game strategy as a final trapping escape game strategy, wherein the final trapping escape game strategy is used for determining action information of each agent in the game strategy at the next moment based on the swimming state of each agent at the current moment and the final trapping escape game strategy in an execution stage.
Here, the shared game strategy is a game strategy obtained by predicting from the swimming states of all agents, i.e., based on global observation information. Specifically, in the training stage of the multi-agent trapping escape game strategy, training may be performed under the CTDE (Centralized Training with Decentralized Execution) paradigm: from the swimming states of all agents at the current moment, the action information of each agent in the game strategy at the next moment is determined respectively based on a MADDPG (Multi-Agent Deep Deterministic Policy Gradient) network. Since the action information at the next moment is obtained from global observations, the game strategy obtained by MADDPG training can be recorded as the shared game strategy, which is then used as the final trapping escape game strategy. In the execution stage of the game strategy, the action information of each agent in the game strategy at the next moment can be determined in real time based on that agent's swimming state at the current moment and its game strategy.
It can be understood that training the game strategy by constructing the MADDPG network in fact constructs a DDPG (Deep Deterministic Policy Gradient) network for each agent, so that every agent can share the shared game strategy obtained from training. Each DDPG network includes an evaluation network (Critic) and an action policy network (Actor). Both the Critic and the Actor contain two hidden layers of 64 nodes each. The input dimension of the Critic is the sum of the dimensions of all agents' observations and actions. The input dimension of the Actor is 16, matching the agent's observation dimension (the speeds and positions of all agents), and its output dimension is 2, matching the agent's action-space dimension. All networks use ReLU as the activation function. The training parameters are set as follows: a discount factor of 0.95, a policy learning rate of 0.01, a soft-update factor of 0.01, a replay buffer, and a batch size of 1024.
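A sketch of Actor and Critic networks with the dimensions and hyperparameters listed above (two hidden layers of 64 nodes, ReLU activations, observation dimension 16, action dimension 2, centralized critic input equal to the sum of all agents' observation and action dimensions). It is written in PyTorch purely as an illustration; the bounded Tanh output on the actor is an assumption.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Per-agent policy: local observation (dim 16) -> action (dim 2: frequency f, bias b)."""
    def __init__(self, obs_dim: int = 16, act_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bounded action output (assumption)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

class CentralCritic(nn.Module):
    """Centralized critic: concatenated observations and actions of all agents -> value."""
    def __init__(self, n_agents: int, obs_dim: int = 16, act_dim: int = 2, hidden: int = 64):
        super().__init__()
        in_dim = n_agents * (obs_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs: torch.Tensor, all_actions: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([all_obs, all_actions], dim=-1))
```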
It should be noted that, compared with treating all agents as one super agent in a fully centralized structure and combining each agent's actions into one joint action, the embodiment of the invention adopts a distributed structure, so that each agent in the multi-agent system can make decisions independently from its own observations and learn its own strategy. Moreover, by distributing the computational load across the agents, this approach reduces computational complexity and preserves the scalability of the system as the number of agents increases, which makes it suitable for large-scale game problems. In addition, in the execution stage, the real-time decision network takes each agent's real-time state information as input and outputs independent control signals for each agent, enabling real-time control of every agent, improving response speed, and reducing communication pressure.
According to the method provided by this embodiment of the invention, the game strategy is trained with centralized training and decentralized execution, so that the computational complexity in the execution stage does not grow exponentially with the number of agents, ensuring that the algorithm scales stably to tasks with different numbers of agents.
Fig. 4 is a second flowchart of a training method of an underwater multi-agent trapping escape game strategy provided by the present invention, as shown in fig. 4, the method includes:
First, the multiple agents are divided into a capturer team and an evader, and a state transition model is obtained by modeling the underwater environment in advance; the state transition model comprises a hydrodynamic model, a dynamic model, and a kinematic model. The state transition of the underwater environment can thus be realized by constructing the state transition model.
First, from the agent's action information at the previous moment, the forces on the agent in the underwater environment are obtained through the viscous drag model and the quasi-steady lift-drag model respectively, yielding the resultant force on the agent in the underwater environment. Then, the swimming state of the agent at the current moment can be obtained from that resultant force through the dynamic model constructed from the momentum theorem and Newton's laws. Finally, the swimming state in the agent coordinate system can be transformed between coordinate systems based on the pre-constructed kinematic model, and the agent's position is updated. In this way, the agents can acquire a global observation O from the underwater environment, namely the observation information of the three capturers in the figure and the observation information of the evader.
The network is then trained based on MADDPG using the CTDE training paradigm to obtain the final trapping escape game strategy. In the CTDE-based training paradigm, the training length may be set to a fixed number of rounds, with each round 50 steps long. Specifically, during the training phase, a central evaluation network (Critic) performs value evaluation based on the global information and actions of all agents. The goal of each agent's learning is to maximize its cumulative reward; the discount factor may be 0.95.
During the execution phase, each agent selects its action based only on its own partial observation of the swimming state. The state transition model then outputs the swimming state at the current moment, and the corresponding reward is obtained from each agent's own evaluation network (Critic). Finally, the action policy network (Actor) produces the action information of the game strategy at the next moment from the swimming state at the current moment and the corresponding reward. Here, the joint action set of all agents, the reward set of all agents, and the state transition probability P together describe the game. It should be noted that the game strategy used in the execution stage is the shared game strategy obtained in the training stage.
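The decentralized execution step can be sketched as below: each trained actor consumes only its own local observation, while the state transition model advances the game. The helper names are hypothetical.

```python
def execute_step(actors, local_observations, env):
    """Decentralized execution: agent i selects its action from its own partial
    observation only; the policies are the shared strategy obtained in training."""
    actions = [actor(obs) for actor, obs in zip(actors, local_observations)]
    next_states, rewards = env.step(actions)  # state transition model advances the game
    return actions, next_states, rewards
```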
It should be noted that the method combines the hydrodynamic model, the dynamic model, and the motion mechanism of the underwater biomimetic robotic fish to construct a state transition model for underwater conditions, and builds a multi-agent reinforcement learning training framework for cooperative-cognitive adversarial decision-making in the underwater environment. Training proceeds step by step with centralized training based on the multi-agent deep deterministic policy gradient network, finally yielding a trapping escape strategy for a biomimetic robot cluster in the underwater scene, which can effectively realize trapping escape tasks among different underwater biomimetic robots.
It should be noted that the game strategy may be trained with several different algorithms, such as the MADDPG algorithm, the DDPG algorithm, and the VDN (Value Decomposition Networks) algorithm. Fig. 5 shows reward curves from simulation training with the three different algorithms; the vertical axis represents the average reward and the horizontal axis the training rounds. The training results indicate that a satisfactory trapping escape strategy cannot be obtained with the DDPG algorithm: although it converges more quickly, its final reward value is the lowest. In contrast, the MADDPG and VDN algorithms achieve higher cumulative rewards after training. Because VDN is better suited to task scenarios where each agent can learn independently, while MADDPG can make comprehensive use of the global information of the environment so that the agents learn a collaborative decision strategy, MADDPG obtains higher rewards in the trapping escape task.
Based on any of the foregoing embodiments, the determining, based on the swimming states of all the agents at the current time, the action information of each agent in the game policy at the next time includes:
Determining a strategy reward at the previous moment based on the global swimming state at the current moment;
And respectively determining action information of each agent in a game strategy at the next moment based on the maximized accumulated strategy rewards.
Here, the global swimming state at the current moment may refer to the agent's own current swimming state, the swimming states of other agents in the same team, and the swimming states of agents in the other team. For example, the global swimming state of the i-th capturer at the current moment may consist of: the speed and coordinates in the i-th agent's own swimming state; the coordinates of the other capturers; and the coordinates and speeds of the evaders.
Specifically, first, the speed and coordinates in the global swimming state at the current moment can be passed through the central evaluation network (Critic) to output the strategy reward at the previous moment. Then, the action information of the game strategy at the next moment can be obtained through the action policy network (Actor) by maximizing the accumulated strategy rewards.
It should be noted that, compared with single-agent reinforcement learning, multiple agents must deal with problems such as non-stationarity of the environment and diversity of objectives. Therefore, obtaining the action information at the next moment by maximizing the accumulated strategy rewards of the different teams further ensures that the multi-agent game strategy in the underwater environment is obtained successfully.
Based on any of the foregoing embodiments, the determining the policy bonus at the previous time based on the global running state at the current time includes:
Determining the capture reward at the previous moment based on the coordinates of the capturers and the evader among the agents and a preset game rule;
Determining the boundary reward at the previous moment based on the coordinates of the agent and the agent's motion boundary;
determining the enhanced game reward at the previous moment based on the coordinates of the capturers and the evaders among the agents;
and determining the strategy reward at the previous moment based on the capture reward, the boundary reward, and the enhanced game reward.
Here, the capture reward reflects the game reward between the capturer team and the evader team; the boundary reward ensures that no agent falls into undesirable situations such as going out of bounds during competition within the team; and the enhanced reward is used to reinforce the emergence of trapping and escaping behavior.
Specifically, the distance between each capturer and the evader can be obtained from the coordinates of the capturers and the evader. The capture reward at the previous moment is then determined from these distances according to a preset game rule. The game rule here may be: when the distance between any capturer and the evader is smaller than the sensing radius, the capture succeeds; if, within the set time, the distance between every capturer and the evader never becomes smaller than the capturer's sensing radius, the capture fails. It will be appreciated that when the capture succeeds, the capture reward is positive for the capturers and negative for the evader. For example, when the current capture condition is satisfied, the capture reward of each agent can be expressed as follows: when agent i is an evader, the capture reward is -10 (the value can be chosen according to the actual situation); when agent i is a capturer, the capture reward is 10; in all other cases, the capture reward is 0.
In addition, whether the coordinates of the agent exceed the agent's motion boundary is used to determine the boundary loss at the current moment, and an agent that crosses the motion boundary receives a corresponding penalty. For example, the training environment of each agent may be a bounded region surrounded by non-collidable walls. The boundary reward may be defined as follows: when the distance between the i-th agent's coordinates and the boundary is smaller than 1.5 meters, the boundary reward is 0; when the distance is between 1.5 and 2 meters, the boundary reward takes an intermediate value; and when the distance is greater than 2 meters, the boundary reward takes a larger value. It can be appreciated that, under these conditions, the greater the distance between the agent's position and the boundary, the greater the boundary reward the agent obtains; the smaller the distance, the smaller the boundary reward.
Finally, the enhanced reward can be obtained from the distance between each capturer and the evader. For a capturer, the smaller the distance to the evader, the larger the enhanced reward, and the larger the distance, the smaller the reward; for the evader, the larger the distance to the capturers, the larger the enhanced reward, and the smaller the distance, the smaller the reward. In other words, when agent i is a capturer and agent j is an evader, agent i's enhanced reward increases as their distance decreases; when agent i is an evader and agent j is a capturer, agent i's enhanced reward increases as their distance increases.
Finally, for a single agent, the strategy reward of that agent at the current moment may be calculated by adding the capture reward, the boundary reward, and the enhanced game reward directly, or by a weighted summation.
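A per-agent strategy reward assembled from the three components above can be sketched as follows. The +10/-10 capture values and the 1.5 m / 2 m boundary bands follow the description; the non-zero boundary magnitudes, the distance-based enhancement shape, and the equal weights are assumptions, since the original reward expressions are given as images.

```python
def capture_reward(is_capturer, capture_happened):
    if not capture_happened:
        return 0.0
    return 10.0 if is_capturer else -10.0

def boundary_reward(dist_to_boundary):
    # Bands follow the text; the non-zero magnitudes are placeholder assumptions.
    if dist_to_boundary < 1.5:
        return 0.0
    if dist_to_boundary <= 2.0:
        return 0.5 * (dist_to_boundary - 1.5)  # grows with distance from the boundary
    return 0.25 + 0.1 * dist_to_boundary

def enhanced_reward(is_capturer, dist_to_opponent):
    # Capturers are rewarded for closing the distance, the evader for opening it.
    return -dist_to_opponent if is_capturer else dist_to_opponent

def strategy_reward(is_capturer, capture_happened, dist_to_boundary, dist_to_opponent,
                    weights=(1.0, 1.0, 1.0)):
    w_c, w_b, w_e = weights
    return (w_c * capture_reward(is_capturer, capture_happened)
            + w_b * boundary_reward(dist_to_boundary)
            + w_e * enhanced_reward(is_capturer, dist_to_opponent))
```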
Based on any of the above embodiments, Fig. 6 is a schematic structural diagram of the training device for the underwater multi-agent trapping escape game strategy provided by the invention; as shown in Fig. 6, the device includes:
an obtaining unit 610, configured to obtain action information of the multi-agent in a game strategy at the previous moment, wherein the multi-agent includes a capturer and an evader;
a state transfer unit 620, configured to determine a resultant force that the multi-agent receives in an underwater environment based on the action information, and to perform a kinematic analysis on the resultant force to obtain a swimming state of the multi-agent at the current moment;
and a game unit 630, configured to determine action information of the multi-agent in a game strategy at the next moment based on the swimming state at the current moment, until a final trapping escape game strategy is obtained.
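Purely as an illustration of how these three units fit together (the class name, method names, and unit interfaces below are invented for this sketch and are not taken from the patent), the device of fig. 6 might be wired as follows:

```python
class TrainingDevice:
    """Skeleton mirroring fig. 6: obtaining unit, state transfer unit, game unit."""

    def __init__(self, obtaining_unit, state_transfer_unit, game_unit):
        self.obtaining_unit = obtaining_unit              # corresponds to unit 610
        self.state_transfer_unit = state_transfer_unit    # corresponds to unit 620
        self.game_unit = game_unit                        # corresponds to unit 630

    def training_step(self):
        # 610: action information of the multi-agent at the previous moment.
        actions = self.obtaining_unit()
        # 620: resultant force and kinematic analysis -> current swimming state.
        state = self.state_transfer_unit(actions)
        # 630: next-moment action information from the current swimming state.
        return self.game_unit(state)

# Example wiring with trivial stand-in callables.
device = TrainingDevice(
    obtaining_unit=lambda: {"agent_0": (1.0, 0.1)},
    state_transfer_unit=lambda actions: {"agent_0": (0.0, 0.0)},
    game_unit=lambda state: {"agent_0": (1.2, -0.05)},
)
print(device.training_step())
```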
According to the device provided by the embodiment of the invention, the action information of the multi-agent in the game strategy at the previous moment is obtained; a state transfer model oriented to the underwater scene is constructed from a hydrodynamic model and a dynamic model, so that the state transfer of the multi-agent is realized and the game strategy of the multi-agent at the next moment is obtained; and the final trapping escape game strategy oriented to the underwater scene is further obtained through training, so that trapping escape tasks among different underwater agents can be effectively realized.
Based on any of the above embodiments, the action information includes an oscillation frequency of a neuron signal that dominates the motion of the multi-agent, and a turning offset angle of the tail fin of the multi-agent;
The state transfer unit is specifically configured to:
And determining the resultant force suffered by the multi-agent in the underwater environment based on the oscillation frequency and the turning offset angle.
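One way to picture this mapping, purely as an assumption since the patent's hydrodynamic model is not reproduced in this section, is a simplified thrust model in which the tail-beat frequency sets the thrust magnitude, the tail-fin offset angle deflects the thrust direction relative to the body heading, and a quadratic drag opposes the motion. All coefficients and the planar point-mass simplification below are illustrative.

```python
import numpy as np

def resultant_force(freq, offset_angle, heading, velocity,
                    k_thrust=0.5, c_drag=0.8):
    """Toy resultant-force model for a fish-like agent (all coefficients assumed).

    freq:         oscillation frequency of the signal driving the tail (Hz).
    offset_angle: turning offset angle of the tail fin (rad).
    heading:      body heading angle in the horizontal plane (rad).
    velocity:     np.ndarray(2,) current swimming velocity.
    """
    # Thrust magnitude grows with tail-beat frequency (quadratic assumption).
    thrust_mag = k_thrust * freq ** 2
    # Thrust acts along the body axis, rotated by the tail-fin offset angle.
    direction = heading + offset_angle
    thrust = thrust_mag * np.array([np.cos(direction), np.sin(direction)])
    # Quadratic drag opposing the current velocity.
    speed = np.linalg.norm(velocity)
    drag = -c_drag * speed * velocity
    return thrust + drag
```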
Based on any of the above embodiments, the state transfer unit is further specifically configured to:
Taking a partial derivative of the kinetic energy corresponding to the resultant force to obtain momentum information of the multi-agent;
Determining acceleration information of the multi-agent based on the resultant force and momentum information of the multi-agent;
Determining the swimming state of the multi-agent at the current moment based on the acceleration information of the multi-agent;
The swimming state includes coordinates, swimming speed, and posture of the agent.
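A minimal, explicitly Euler-integrated version of this state transfer might look as follows. The point-mass simplification, the mass value, the time step, and the posture-from-velocity approximation are assumptions for illustration; the patent derives momentum from a partial derivative of the kinetic energy of its full dynamic model.

```python
import numpy as np

def update_swimming_state(state, force, mass=1.0, dt=0.1):
    """Advance one agent's swimming state (coordinates, swimming speed, posture).

    state: dict with "pos" (np.ndarray(2,)), "vel" (np.ndarray(2,)), "heading" (rad).
    force: np.ndarray(2,) resultant force from the hydrodynamic model.
    """
    # For a point mass, momentum p = m*v = d(0.5*m*|v|^2)/dv, so acceleration = F/m.
    acc = force / mass
    vel = state["vel"] + acc * dt                 # integrate acceleration -> velocity
    pos = state["pos"] + vel * dt                 # integrate velocity -> coordinates
    heading = float(np.arctan2(vel[1], vel[0]))   # posture approximated by velocity heading
    return {"pos": pos, "vel": vel, "heading": heading}
```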
Based on any of the above embodiments, the game unit is specifically configured to:
Based on the swimming state of all the intelligent agents at the current moment, respectively determining the action information of each intelligent agent in the game strategy at the next moment until the shared game strategy is obtained;
And taking the shared game strategy as a final trapping escape game strategy, wherein the final trapping escape game strategy is used for determining action information of each agent in the game strategy at the next moment based on the swimming state of each agent at the current moment and the final trapping escape game strategy in an execution stage.
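This is the familiar centralized-training, decentralized-execution pattern: one policy is trained using the swimming states of all agents, and at execution time each agent queries the same shared policy with only its own local swimming state. A minimal sketch follows, with the network architecture, observation layout, and action meaning assumed for illustration only.

```python
import numpy as np

class SharedPolicy:
    """Tiny linear policy shared by all agents (architecture is an assumption)."""

    def __init__(self, obs_dim, act_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(act_dim, obs_dim))

    def act(self, local_obs):
        # Execution stage: each agent feeds only its own swimming state.
        return np.tanh(self.W @ np.asarray(local_obs))

# Training stage (centralized): parameters would be updated from the joint
# swimming state of all agents; execution stage (decentralized): each agent
# calls policy.act(own_state) independently.
policy = SharedPolicy(obs_dim=6, act_dim=2)
action_i = policy.act(np.zeros(6))   # e.g. [oscillation frequency, offset angle]
```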
Based on any of the above embodiments, the gaming unit is further specifically configured to:
Determining a strategy reward at the previous moment based on the global swimming state at the current moment;
And respectively determining action information of each agent in a game strategy at the next moment by maximizing the accumulated strategy rewards.
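"Maximizing the accumulated strategy rewards" is the usual reinforcement-learning objective of maximizing the expected discounted return. The return computation below is standard rather than patent-specific, and the discount factor is an assumed value.

```python
def discounted_return(rewards, gamma=0.95):
    """Cumulative discounted strategy reward G_t = sum_k gamma^k * r_{t+k}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: per-step strategy rewards of one agent over a short episode.
print(discounted_return([0.1, 0.2, -0.1, 10.0]))
```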
Based on any of the above embodiments, the game unit is further specifically configured to:
Determining capture rewards at the previous moment based on coordinates of the capturers and the evaders among the agents and a preset game rule;
Determining boundary rewards at the previous moment based on the coordinates of the agents and the motion boundary of the agents;
determining reinforcement game rewards at the previous moment based on coordinates of the capturers and the evaders among the agents;
and determining the strategy rewards at the previous moment based on the capture rewards, the boundary rewards and the reinforcement game rewards.
Fig. 7 illustrates a physical schematic diagram of an electronic device. As shown in fig. 7, the electronic device may include: a processor 710, a communication interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communication interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform a training method of an underwater multi-agent trapping escape game strategy, the method comprising: acquiring action information of a multi-agent in a game strategy at the previous moment, wherein the multi-agent comprises a capturer and an evader; based on the action information, determining a resultant force of the multi-agent in the underwater environment, and performing kinematic analysis on the resultant force to obtain a swimming state of the multi-agent at the current moment; and determining action information of the multi-agent in a game strategy at the next moment based on the swimming state at the current moment, until a final trapping escape game strategy is obtained.
Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, the software product comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, wherein when the computer program is executed by a processor, the computer can execute the training method of the underwater multi-agent trapping escape game strategy provided by the above methods, the method comprising: acquiring action information of a multi-agent in a game strategy at the previous moment, wherein the multi-agent comprises a capturer and an evader; based on the action information, determining a resultant force of the multi-agent in the underwater environment, and performing kinematic analysis on the resultant force to obtain a swimming state of the multi-agent at the current moment; and determining action information of the multi-agent in a game strategy at the next moment based on the swimming state at the current moment, until a final trapping escape game strategy is obtained.
In still another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the training method of the underwater multi-agent trapping escape game strategy provided by the above methods, the method comprising: acquiring action information of a multi-agent in a game strategy at the previous moment, wherein the multi-agent comprises a capturer and an evader; based on the action information, determining a resultant force of the multi-agent in the underwater environment, and performing kinematic analysis on the resultant force to obtain a swimming state of the multi-agent at the current moment; and determining action information of the multi-agent in a game strategy at the next moment based on the swimming state at the current moment, until a final trapping escape game strategy is obtained.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A training method of an underwater multi-agent trapping escape game strategy, characterized by comprising the following steps:
Acquiring action information of a multi-agent in a game strategy at the previous moment, wherein the multi-agent comprises a capturer and an evader;
Based on the action information, determining resultant force of the multi-agent in an underwater environment, and performing kinematic analysis on the resultant force to obtain a swimming state of the multi-agent at the current moment;
and determining action information of the multi-agent in a game strategy at the next moment based on the swimming state at the current moment until a final trapping escape game strategy is obtained.
2. The training method of the underwater multi-agent trapping escape game strategy of claim 1, wherein the action information includes an oscillation frequency of a neuron signal that dominates the motion of the multi-agent, and a turning offset angle of the tail fin of the multi-agent;
the determining, based on the action information, a resultant force experienced by the multi-agent in an underwater environment, comprising:
And determining the resultant force suffered by the multi-agent in the underwater environment based on the oscillation frequency and the turning offset angle.
3. The training method of the underwater multi-agent trapping escape game strategy according to claim 1, wherein the performing kinematic analysis on the resultant force to obtain a swimming state of the multi-agent at a current moment comprises:
Taking a partial derivative of the kinetic energy corresponding to the resultant force to obtain momentum information of the multi-agent;
Determining acceleration information of the multi-agent based on the resultant force and momentum information of the multi-agent;
Determining the swimming state of the multi-agent at the current moment based on the acceleration information of the multi-agent;
The swimming state includes coordinates, swimming speed, and posture of the agent.
4. The training method of the underwater multi-agent trapping escape game strategy according to any one of claims 1 to 3, wherein the determining, based on the swimming state at the current moment, action information of the multi-agent in the game strategy at the next moment until a final trapping escape game strategy is obtained includes:
Based on the swimming state of all the intelligent agents at the current moment, respectively determining the action information of each intelligent agent in the game strategy at the next moment until the shared game strategy is obtained;
And taking the shared game strategy as a final trapping escape game strategy, wherein the final trapping escape game strategy is used for determining action information of each agent in the game strategy at the next moment based on the swimming state of each agent at the current moment and the final trapping escape game strategy in an execution stage.
5. The training method of the underwater multi-agent trapping escape game strategy according to claim 4, wherein the determining the action information of each agent in the game strategy at the next time based on the swimming states of all agents at the current time respectively comprises:
Determining a strategy reward at the previous moment based on the global swimming state at the current moment;
And respectively determining action information of each agent in a game strategy at the next moment by maximizing the accumulated strategy rewards.
6. The training method of the underwater multi-agent trapping escape game strategy according to claim 5, wherein the determining the strategy rewards at the previous moment based on the global swimming state at the current moment comprises:
Determining capture rewards at the previous moment based on coordinates of the capturers and the evaders among the agents and a preset game rule;
Determining boundary rewards at the previous moment based on the coordinates of the agents and the motion boundary of the agents;
determining reinforcement game rewards at the previous moment based on coordinates of the capturers and the evaders among the agents;
and determining the strategy rewards at the previous moment based on the capture rewards, the boundary rewards and the reinforcement game rewards.
7. A training device of an underwater multi-agent trapping escape game strategy, characterized by comprising:
an obtaining unit, configured to obtain action information of a multi-agent in a game strategy at the previous moment, wherein the multi-agent comprises a capturer and an evader;
a state transfer unit, configured to determine a resultant force of the multi-agent in an underwater environment based on the action information, and to perform a kinematic analysis on the resultant force to obtain a swimming state of the multi-agent at the current moment;
and a game unit, configured to determine action information of the multi-agent in a game strategy at the next moment based on the swimming state at the current moment, until a final trapping escape game strategy is obtained.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the training method of the underwater multi-agent trapping escape game strategy of any one of claims 1 to 6 when the program is executed.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements a training method of the underwater multi-agent trap escape game strategy of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements a training method of an underwater multi-agent trap escape game strategy as claimed in any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410466338.XA CN118070066B (en) | 2024-04-18 | 2024-04-18 | Training method and device for underwater multi-agent trapping escape game strategy |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410466338.XA CN118070066B (en) | 2024-04-18 | 2024-04-18 | Training method and device for underwater multi-agent trapping escape game strategy |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118070066A true CN118070066A (en) | 2024-05-24 |
CN118070066B CN118070066B (en) | 2024-08-13 |
Family
ID=91104492
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410466338.XA Active CN118070066B (en) | 2024-04-18 | 2024-04-18 | Training method and device for underwater multi-agent trapping escape game strategy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118070066B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117093010A (en) * | 2023-10-20 | 2023-11-21 | 清华大学 | Underwater multi-agent path planning method, device, computer equipment and medium |
CN117574950A (en) * | 2023-11-16 | 2024-02-20 | 湖南科技大学 | Multi-agent self-organizing collaborative trapping method in non-convex environment |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117093010A (en) * | 2023-10-20 | 2023-11-21 | 清华大学 | Underwater multi-agent path planning method, device, computer equipment and medium |
CN117574950A (en) * | 2023-11-16 | 2024-02-20 | 湖南科技大学 | Multi-agent self-organizing collaborative trapping method in non-convex environment |
Non-Patent Citations (2)
Title |
---|
CHANGLIN QIU 等: "Multiagent-Reinforcement-Learning-Based Stable Path Tracking Control for a Bionic Robotic Fish With Reaction Wheel", IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, vol. 70, no. 12, 31 December 2023 (2023-12-31), pages 12670 - 12679 * |
RU TONG 等: "A Survey on Reinforcement Learning Methods in Bionic Underwater Robots", BIOMIMETICS 2023, vol. 8, 20 April 2023 (2023-04-20), pages 1 - 29 * |
Also Published As
Publication number | Publication date |
---|---|
CN118070066B (en) | 2024-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109992000B (en) | Multi-unmanned aerial vehicle path collaborative planning method and device based on hierarchical reinforcement learning | |
Shi et al. | End-to-end navigation strategy with deep reinforcement learning for mobile robots | |
Amarjyoti | Deep reinforcement learning for robotic manipulation-the state of the art | |
Nelson et al. | Fitness functions in evolutionary robotics: A survey and analysis | |
Lin et al. | Evolutionary digital twin: A new approach for intelligent industrial product development | |
CN111240356B (en) | Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning | |
CN114741886B (en) | Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation | |
Ziya et al. | Comparative study for deep reinforcement learning with CNN, RNN, and LSTM in autonomous navigation | |
CN115220458A (en) | Distributed decision-making method for multi-robot multi-target enclosure based on reinforcement learning | |
Gan et al. | Multi-usv cooperative chasing strategy based on obstacles assistance and deep reinforcement learning | |
CN112651486A (en) | Method for improving convergence rate of MADDPG algorithm and application thereof | |
CN114083539A (en) | Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning | |
CN116362289A (en) | Improved MATD3 multi-robot collaborative trapping method based on BiGRU structure | |
Wang et al. | Multi-agent deep reinforcement learning based on maximum entropy | |
CN118070066B (en) | Training method and device for underwater multi-agent trapping escape game strategy | |
CN117908565A (en) | Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning | |
Bugajska et al. | Coevolution of form and function in the design of micro air vehicles | |
Dutta et al. | Exploring with sticky mittens: Reinforcement learning with expert interventions via option templates | |
Liu et al. | Learning multi-agent cooperation via considering actions of teammates | |
CN116203987A (en) | Unmanned aerial vehicle cluster collaborative obstacle avoidance method based on deep reinforcement learning | |
CN115933734A (en) | Multi-machine exploration method and system under energy constraint based on deep reinforcement learning | |
Latzke et al. | Imitative reinforcement learning for soccer playing robots | |
Rafati et al. | Learning sparse representations in reinforcement learning | |
Tang et al. | Reinforcement learning for robots path planning with rule-based shallow-trial | |
Saravanan et al. | Exploring spiking neural networks in single and multi-agent rl methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |