CN116540553A - Mobile robot safe movement method based on reinforcement learning - Google Patents

Mobile robot safe movement method based on reinforcement learning

Info

Publication number
CN116540553A
CN116540553A (application CN202310814855.7A)
Authority
CN
China
Prior art keywords
amr
representing
controllable
reinforcement learning
uncontrollable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310814855.7A
Other languages
Chinese (zh)
Other versions
CN116540553B (en)
Inventor
熊昊
曾伟锋
江翰韬
陆文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202310814855.7A priority Critical patent/CN116540553B/en
Publication of CN116540553A publication Critical patent/CN116540553A/en
Application granted granted Critical
Publication of CN116540553B publication Critical patent/CN116540553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems electric
    • G05B13/04 Adaptive control systems electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to the technical field of mobile robots and discloses a reinforcement-learning-based safe motion method for mobile robots. The technical scheme comprises: S1, constructing the equation of motion and a nonlinear affine system; S2, constructing a protective barrier based on a control barrier function (CBF) and proposing a multi-agent reinforcement learning algorithm that contains the CBF-based protective barrier, so as to realize reinforcement-learning-based safe motion of the mobile robot (AMR). The reinforcement-learning-based safe motion method can ensure the safety of the mobile robot during operation.

Description

Mobile robot safe movement method based on reinforcement learning
Technical Field
The invention relates to the technical field of mobile robots, in particular to a mobile robot safe motion method based on reinforcement learning.
Background
In recent years, more and more autonomous mobile robots (AMRs) have been put into use. Reinforcement learning (RL)-based methods have achieved great success in motion planning for large numbers of AMRs, but the risk of AMRs running out of control is also rising sharply: if some AMRs become uncontrollable, existing RL-based motion methods cannot guarantee the safety of the remaining functional AMRs.
Disclosure of Invention
The invention aims to provide a reinforcement-learning-based safe motion method for mobile robots, which handles complex high-level tasks through a multi-agent reinforcement learning (MARL) algorithm with a CBF-based shield and handles the safety of each functional AMR through the low-level CBF-based shield, so that the safety of the mobile robot during operation can be ensured.
In order to achieve the above purpose, the invention provides a mobile robot safe motion method based on reinforcement learning, which comprises the following specific steps:
s1, constructing a motion equation and a nonlinear affine system of a mobile robot AMR, wherein the motion equation and the nonlinear affine system are specifically as follows:
the problem to be solved by the invention is that when some AMRs in a two-dimensional space of a warehouse are out of control, the AMRs interfere the controllable AMRs based on the safety motion planning problem of reinforcement learning; the AMR models are set to be the same, and the kinematics model is known, each controllable AMR and each uncontrollable AMR can observe the pose of the nearby AMR, the AMR refers to a differential driving robot DDR, and the DDR can be expressed in two-dimensional Cartesian coordinates as shown in figure 1 toRepresenting the pose state of DDR in two-dimensional Cartesian coordinates, wherein +.>And->Represents the location of DDR in two-dimensional cartesian coordinates, and (2)>The evolution motion equation of the DDR state is specifically:
$\dot x = v\cos\theta, \qquad \dot y = v\sin\theta, \qquad \dot\theta = \omega,$
where $\dot x$, $\dot y$ and $\dot\theta$ denote the first time derivatives of $x$, $y$ and $\theta$, respectively, and $v$ denotes the magnitude of the translational speed of the DDR; in the present invention $v$ is assumed to be a constant, and the action of the DDR is to adjust the angular velocity $\omega$.
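As an illustrative sketch (not part of the patent), the kinematics above can be integrated with a forward-Euler step; the step size and speed values are assumptions of this example:

```python
import math

def ddr_step(x, y, theta, v, omega, dt):
    """One forward-Euler step of the DDR unicycle model:
    x' = v*cos(theta), y' = v*sin(theta), theta' = omega."""
    x_new = x + v * math.cos(theta) * dt
    y_new = y + v * math.sin(theta) * dt
    theta_new = theta + omega * dt
    return x_new, y_new, theta_new

# Drive straight along the x-axis: heading and y stay at zero.
x, y, theta = 0.0, 0.0, 0.0
for _ in range(100):
    x, y, theta = ddr_step(x, y, theta, v=1.0, omega=0.0, dt=0.01)
```

After one second of simulated time the robot has advanced one meter along the x-axis, which matches the constant-speed assumption of the model.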
In step S1, the nonlinear affine system is:
$\dot x = f(x) + g(x)\,u,$
where $x \in \mathcal{X}$ denotes the system state, $\mathcal{X} \subseteq \mathbb{R}^n$ the generalized state space; $u \in \mathcal{U}$ denotes the control input (i.e., the action), $\mathcal{U} \subseteq \mathbb{R}^m$ the generalized action space; and $f$ and $g$ are two locally Lipschitz functions. The safety of the system can be ensured by ensuring the invariance of a safety set: given a continuously differentiable function $h: \mathcal{X} \to \mathbb{R}$, the set C defined by $h$ satisfies:
$C = \{x \in \mathcal{X} : h(x) \ge 0\};$
$\partial C = \{x \in \mathcal{X} : h(x) = 0\}, \qquad \mathrm{Int}(C) = \{x \in \mathcal{X} : h(x) > 0\}.$
The set C is referred to as the safety set, where $\partial C$ denotes the boundary of C and $\mathrm{Int}(C)$ denotes the interior of C.
for system and set C, there is a relative orderIs->Is an Exponentially Controlled Barrier Function (ECBF), if there is +.>Satisfy the following requirements
Representation->Space of real dimension->Representing the upper bound->Representing a function pair +.>Li Daoshu of (A)>Representing a function pair +.>Is->Weight Li Daoshu,/->Representing a function pair +.>Li Daoshu of (A)>Indicating that the system is +.>The state of the moment of time,indicating that the system is +.>Status of moment->Representing a constant matrix,/->Representing the power of +.>An exponential matrix of (a);
wherein the method comprises the steps of
、/>And->Respectively indicate->First time derivative, second time derivative and +.>The order time derivative.
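To make the ECBF machinery concrete, the following sketch applies a relative-degree-2 ECBF to a double integrator with $h(x) = p$; the system, the gains $k_1$ and $k_2$, and the nominal input are illustrative assumptions of this example, not the patent's AMR model:

```python
def ecbf_filter(u_nom, p, v, k1=4.0, k2=4.0):
    """Relative-degree-2 ECBF filter for the double integrator
    p' = v, v' = u with h(x) = p (keep p >= 0).  The ECBF condition
    h'' + k2*h' + k1*h = u + k2*v + k1*p >= 0 yields the
    least-deviation safe input max(u_nom, -k1*p - k2*v)."""
    return max(u_nom, -k1 * p - k2 * v)

# A nominal controller pushes hard toward the unsafe region p < 0;
# the filter keeps the barrier value h = p nonnegative.
p, v, dt = 1.0, 0.0, 0.001
min_h = p
for _ in range(2000):  # simulate 2 seconds with forward Euler
    u = ecbf_filter(u_nom=-10.0, p=p, v=v)
    p, v = p + v * dt, v + u * dt
    min_h = min(min_h, p)
```

With these gains the closed-loop barrier dynamics are critically damped, so $h$ decays toward zero from above but never crosses it, which is exactly the forward invariance the ECBF condition promises.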
S2, constructing a protective barrier based on a control barrier function CBF and a multi-agent reinforcement learning algorithm based on the CBF protective barrier so as to realize the safety movement of the mobile robot AMR based on reinforcement learning, wherein the safety movement is specifically as follows:
complex advanced tasks can be handled through multi-agent reinforcement learning and AMR safety issues can be handled through CBF-based protection barriers. The safety motion method is extensible, and can be deployed on a large number of AMRs of the same model through a small number of AMRs learned safety motion methods;
the protective barrier of the CBF may be determined based on a plurality of security targets, where k CBFs of the plurality of security targets are noted asA complex CBF can be realized by fusing a plurality of CBFs through Boolean operation, and the combination of +.>The representation is:
and (3) obtaining the CBF-based protection barrier based on the composite CBF derived by the above formula.
The CBF-based protective barrier modifies the action of an AMR only if the AMR tends to violate the safety condition. The CBF-based protective barrier determines a safe action $u_t^{\text{safe}}$ by
$u_t^{\text{safe}} = \arg\min_{u \in \mathcal{U}} \left\|u - u_t^{\text{nom}}\right\|^2 \quad \text{s.t.} \quad L_f^r h(x_t) + L_g L_f^{r-1} h(x_t)\,u + K_b\,\eta_b(x_t) \ge 0,$
where $u_t^{\text{nom}}$ is the nominal action of the AMR determined by multi-agent reinforcement learning, and the minimization is taken over the bounded action space $\mathcal{U}$.
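Because the DDR action is the scalar angular velocity, the minimization above reduces to a one-dimensional problem with a single affine constraint $a\,\omega + b \ge 0$ and box bounds, which admits a closed form; the infeasibility fallback below is an assumption of this sketch:

```python
def safe_action_1d(omega_nom, a, b, omega_max):
    """Closed-form solution of
        argmin_{|w| <= omega_max} (w - omega_nom)^2  s.t.  a*w + b >= 0.
    If the constraint cannot be met inside the box, return the
    least-violating bound (an assumption of this illustration)."""
    lo, hi = -omega_max, omega_max
    if a > 0:
        lo = max(lo, -b / a)   # constraint becomes w >= -b/a
    elif a < 0:
        hi = min(hi, -b / a)   # constraint becomes w <= -b/a
    if lo > hi:                # infeasible within the box
        return omega_max if a > 0 else -omega_max
    return min(max(omega_nom, lo), hi)  # clip nominal into feasible interval

# Nominal action already safe -> returned unchanged.
w1 = safe_action_1d(0.2, a=1.0, b=1.0, omega_max=1.0)
# Nominal action unsafe -> projected onto the constraint boundary.
w2 = safe_action_1d(-0.9, a=1.0, b=0.5, omega_max=1.0)
```

The projection-by-clipping form reflects the barrier's minimal-interference principle: the nominal action passes through untouched whenever it already satisfies the safety constraint.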
Algorithm 1: the multi-agent reinforcement learning algorithm containing a protective barrier based on a control barrier function is specifically:
1: design the CBF-based protective barrier $h$ for the AMRs;
2: initialize the multi-agent reinforcement learning network parameters;
3: for round = 1 to M do
4: reset the environment;
5: receive the initial state $s_0$;
6: for time step t = 1 to T do
7: for controllable AMR number i = 1 to N do
8: select a nominal action $u_{i,t}^{\text{nom}}$ based on multi-agent reinforcement learning;
9: determine the safe action $u_{i,t}^{\text{safe}}$ based on $u_{i,t}^{\text{nom}}$ and $h$;
10: update the action;
11: end for
12: execute the joint safe action;
13: obtain the reward $r_t$ and the new state $s_{t+1}$;
14: store the transition $(s_t, u_t^{\text{safe}}, r_t, s_{t+1})$;
15: update the state $s_t \leftarrow s_{t+1}$;
16: end for
17: update the parameters of the network;
18: end for
19: return the policy determined by the network parameters and the CBF-based protective barrier.
For the system composed of the i-th controllable AMR and one uncontrollable AMR, let $(x_i, y_i, \theta_i)$ denote the pose of the i-th controllable AMR and $(x_u, y_u, \theta_u)$ denote the pose of the uncontrollable AMR. The difference between the positions of the i-th controllable AMR and the uncontrollable AMR, denoted $\Delta p_i$, can be written as:
$\Delta p_i = [x_i - x_u,\; y_i - y_u]^{\mathsf T}.$
Assume that $\theta_i, \theta_u \in [-\pi, \pi)$, and let $v_i$ and $v_u$ denote the magnitudes of the translational speeds of the i-th controllable AMR and the uncontrollable AMR, respectively; the two magnitudes are the same, $v_i = v_u = v$. The difference between the velocities of the i-th controllable AMR and the uncontrollable AMR, denoted $\Delta v_i$, can be expressed as:
$\Delta v_i = v\,[\cos\theta_i - \cos\theta_u,\; \sin\theta_i - \sin\theta_u]^{\mathsf T}.$
The angular velocities of the i-th controllable AMR and the uncontrollable AMR are denoted $\omega_i$ and $\omega_u$, respectively; $\omega_{\max}$ denotes the maximum possible angular velocity of both the controllable and the uncontrollable AMR, with $|\omega_i| \le \omega_{\max}$ and $|\omega_u| \le \omega_{\max}$.
For the i-th controllable AMR, a CBF-based protective barrier is designed according to the worst case so as to improve the safety of the AMR. In the worst case, the uncontrollable AMR pursues the i-th controllable AMR according to the optimal pursuit strategy:
$\omega_u = \omega_{\max}\,\mathrm{sgn}\big(\phi_i - \theta_u\big),$
where $\phi_i - \theta_u$ is wrapped into $[-\pi, \pi)$ and $\mathrm{sgn}(\cdot)$ is the sign function:
$\mathrm{sgn}(a) = \begin{cases} 1, & a \ge 0, \\ -1, & a < 0, \end{cases}$
and $\phi_i$ denotes the angle, measured counterclockwise from the x-axis of the two-dimensional Cartesian frame, of the line connecting the uncontrollable AMR position to the i-th controllable AMR position, with $\phi_i \in [-\pi, \pi)$.
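The worst-case pursuit strategy can be sketched as a bang-bang steering law; the angle-wrapping helper and the pose arguments are conventions of this illustration:

```python
import math

def wrap(angle):
    """Wrap an angle into [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def pursuer_omega(x_u, y_u, theta_u, x_i, y_i, omega_max):
    """Worst-case pursuer: turn at full rate toward the i-th controllable
    AMR, i.e. omega_u = omega_max * sgn(wrap(phi_i - theta_u)), where
    phi_i is the bearing from the uncontrollable to the controllable AMR."""
    phi = math.atan2(y_i - y_u, x_i - x_u)
    err = wrap(phi - theta_u)
    return omega_max if err >= 0 else -omega_max

# Target due north of a pursuer facing east -> turn left at full rate.
w = pursuer_omega(0.0, 0.0, 0.0, 0.0, 1.0, omega_max=1.5)
```

Because the pursuer always saturates its turn rate toward the evader, this policy gives the most adversarial angular-velocity choice against which the barrier must remain feasible.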
For the system consisting of a controllable AMR and an uncontrollable AMR, a CBF-based protective barrier can be designed on the basis of the kinematic model of the AMR to ensure that the i-th controllable AMR and the uncontrollable AMR cannot collide. To implement the CBF-based protective barrier, the safe state of the i-th controllable AMR is defined as:
$\|\Delta p_i\| \ge D_s,$
where $D_s$ is the safe distance. A continuously differentiable function $h_i$ is defined as:
$h_i = \Delta p_i^{\mathsf T} \Delta p_i - D_s^2.$
The first and second time derivatives of $h_i$ can be expressed as:
$\dot h_i = 2\,\Delta p_i^{\mathsf T} \Delta v_i, \qquad \ddot h_i = 2\,\Delta v_i^{\mathsf T} \Delta v_i + 2\,\Delta p_i^{\mathsf T} \Delta \dot v_i,$
where
$\Delta \dot v_i = v\,[-\omega_i \sin\theta_i + \omega_u \sin\theta_u,\; \omega_i \cos\theta_i - \omega_u \cos\theta_u]^{\mathsf T}.$
For this system the relative degree of $h_i$ is 2, so the ECBF condition can be expressed as:
$\ddot h_i + k_2\,\dot h_i + k_1\,h_i \ge 0,$
with $K_b = [k_1, k_2]$. For the system consisting of the i-th controllable AMR and the uncontrollable AMR, if there always exists an action $|\omega_i| \le \omega_{\max}$ that keeps the above expression nonnegative, then the barrier defined by $h_i$ is an effective protective barrier.
then, the following theorem is obtained: in a system consisting of a controllable AMR and an uncontrolled AMR of the same type, if the state of the controllable AMR is initially safe, there is always an action of the controllable AMR to keep the state of the controllable AMR in a safe set, depending on a properly designed control barrier function.
For the system, if the state of the i-th controllable AMR is within the safety set, then $h_i \ge 0$, i.e., $\Delta p_i^{\mathsf T}\Delta p_i \ge D_s^2$.
The term $\Delta v_i^{\mathsf T}\Delta v_i$ satisfies
$\Delta v_i^{\mathsf T}\Delta v_i = 2v^2\big(1 - \cos(\theta_i - \theta_u)\big) \in [0,\; 4v^2],$
where the upper equality holds if and only if $\theta_i - \theta_u = \pm\pi$, i.e., the two AMRs move in exactly opposite directions. Since $\Delta v_i^{\mathsf T}\Delta v_i \ge 0$, the worst case is governed by the term $\Delta p_i^{\mathsf T}\Delta\dot v_i$, which can be expanded as
$\Delta p_i^{\mathsf T}\Delta\dot v_i = v\,\omega_i\big(\Delta y_i\cos\theta_i - \Delta x_i\sin\theta_i\big) - v\,\omega_u\big(\Delta y_i\cos\theta_u - \Delta x_i\sin\theta_u\big),$
where $\Delta x_i$ and $\Delta y_i$ denote the two components of $\Delta p_i$. The i-th controllable AMR can select the action
$\omega_i = \omega_{\max}\,\mathrm{sgn}\big(\Delta y_i\cos\theta_i - \Delta x_i\sin\theta_i\big),$
so that $v\,\omega_i\big(\Delta y_i\cos\theta_i - \Delta x_i\sin\theta_i\big) \ge 0$, while in the worst case over $|\omega_u| \le \omega_{\max}$,
$-\,v\,\omega_u\big(\Delta y_i\cos\theta_u - \Delta x_i\sin\theta_u\big) \ge -\,v\,\omega_{\max}\,\|\Delta p_i\|,$
so that, with this choice of action,
$\ddot h_i \ge -\,2v\,\omega_{\max}\,\|\Delta p_i\|.$
Let $D_{\max}$ denote the maximum distance between the controllable AMR and the uncontrollable AMR, and let the parameters $k_1$ and $k_2$ of the CBF protective barrier be set so that
$\ddot h_i + k_2\,\dot h_i + k_1\,h_i \ge 0$
remains feasible for all $\|\Delta p_i\| \le D_{\max}$. It can then be found that there always exists an action satisfying both $|\omega_i| \le \omega_{\max}$ and the ECBF condition: if the initial state of the controllable AMR is in the safety set and the parameters of the CBF-based protective barrier satisfy the above feasibility condition, then there is always an action that can keep the state of the controllable AMR safe.
the security action of the ith controllable AMR is recorded asCan be expressed as:
wherein the method comprises the steps ofThe nominal actions of the controllable AMR are determined by multi-agent reinforcement learning; />A threshold value representing a safe distance; />Representing the difference in position of the ith controllable AMR and the ith uncontrollable AMR; />And->The translation speed of the ith functional AMR and the uncontrollable AMR are respectively represented; />Representing the difference between the speeds of the ith controllable AMR and the uncontrollable AMR; />And->Respectively representing the angular speeds of the ith controllable AMR and the uncontrollable AMR; />Representing the maximum possible angular velocity of the function AMR; () Representing the pose of the ith function AMR, (-I)>) Representing the pose of uncontrollable AMR;is a signed function; />Representation->,/>Representation->Representing the angle from the x-axis of the two-dimensional cartesian coordinate to the line connecting the uncontrollable AMR position with the i-th controllable AMR position in the counterclockwise direction.
Therefore, the safety movement method of the mobile robot based on reinforcement learning can ensure the safety of the mobile robot in the working process.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a pose representation of DDR in a two-dimensional Cartesian coordinate system of an embodiment of a reinforcement learning-based mobile robot safety motion planning method of the present invention;
FIG. 2 is a flow chart of a multi-agent reinforcement learning algorithm including a CBF-based protective barrier in an embodiment of a reinforcement learning-based mobile robot safety motion planning method of the present invention;
FIG. 3 is a simulated warehouse environment in a simulated environment in an embodiment of a mobile robot safety motion planning method based on reinforcement learning in accordance with the present invention;
FIG. 4 is a graph showing the average cumulative reward achieved by the controllable AMRs during training in an embodiment of a mobile robot safety motion planning method based on reinforcement learning in accordance with the present invention;
FIG. 5 is a diagram showing the number and proportion of safety rounds and crashed rounds in the training process in an embodiment of a mobile robot safety motion planning method based on reinforcement learning;
FIG. 6 is a simulation test round AMR trace in an embodiment of a mobile robot safety motion planning method based on reinforcement learning in accordance with the present invention;
FIG. 7 is a diagram of nominal and safety actions of a first controllable AMR of a simulation test run in an embodiment of a reinforcement learning based mobile robot safety motion planning method of the present invention;
FIG. 8 is a diagram of nominal and safety actions of a second controllable AMR of a simulation test run in an embodiment of a reinforcement learning based mobile robot safety motion planning method of the present invention;
FIG. 9 is a trajectory of a controllable AMR in a simulation environment to avoid uncontrollable AMR in an embodiment of a reinforcement learning based mobile robot safety motion planning method of the present invention;
FIG. 10 is a trajectory of a controllable AMR in a simulation environment for avoiding obstacles in an embodiment of a mobile robot safety motion planning method based on reinforcement learning in accordance with the present invention;
FIG. 11 is a trajectory of a physical test round AMR in an embodiment of a reinforcement learning based mobile robot safety motion planning method of the present invention;
FIG. 12 is a diagram showing nominal and safety actions of a first controllable AMR of a physical test round in an embodiment of a reinforcement learning based mobile robot safety motion planning method of the present invention;
FIG. 13 is a diagram showing nominal and safety actions of a second controllable AMR of one physical test round in an embodiment of a reinforcement learning based mobile robot safety motion planning method of the present invention;
FIG. 14 is a trajectory of a controllable AMR in real world avoiding uncontrollable AMR in an embodiment of a reinforcement learning based mobile robot safety motion planning method in accordance with the present invention;
FIG. 15 is a trajectory of a controllable AMR in real world obstacle avoidance in an embodiment of a reinforcement learning based mobile robot safety motion planning method in accordance with the present invention;
FIG. 16 is a large scale simulated warehouse environment in an embodiment of a mobile robot safety motion planning method based on reinforcement learning according to the present invention.
Detailed Description
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs.
Example 1
To verify the effectiveness and safety of the developed safe movement method, verification is performed using simulation and real objects based on a simulated warehouse environment, as shown in fig. 3.
Motion planning method based on reinforcement learning
AMRs in warehouses are multi-agent in nature, so it is appropriate to apply a multi-agent reinforcement learning algorithm to achieve reinforcement-learning-based motion. Combined with the CBF-based protective barrier, an algorithm with a CBF protective barrier can be built to learn a reinforcement-learning-based safe motion strategy.
to implement a reinforcement learning based motion strategy with a protective barrier, which is referred to as a protected strategy, the prize function of the controllable AMR can be defined as:
wherein, the liquid crystal display device comprises a liquid crystal display device,and->Representing the distance between the controllable AMR and the target position at time step t and time step t-1, respectively,/>A threshold value representing reaching the target position; />Representing the distance between AMR and the nearest entity at time step t, the nearest entity may be an obstacle or an AMR,/>A threshold value representing a safe distance; the reward function in the above equation includes two targets, namely, cargo transportation and collision avoidance;
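A sketch of such a two-objective reward follows; the weights, the progress term, and the bonus/penalty magnitudes are assumptions of this illustration, not values disclosed in the patent:

```python
def reward(d_prev, d_curr, d_entity, D_goal=0.1, D_s=0.5,
           w_progress=1.0, r_goal=10.0, r_collision=-10.0):
    """Two-objective reward: cargo transportation (progress toward the
    target plus a bonus on arrival) and collision avoidance (penalty
    when the nearest entity is within the safe distance)."""
    r = w_progress * (d_prev - d_curr)   # positive when moving closer
    if d_curr < D_goal:
        r += r_goal                      # reached the target position
    if d_entity < D_s:
        r += r_collision                 # too close to an obstacle or AMR
    return r

r_closer = reward(d_prev=2.0, d_curr=1.8, d_entity=3.0)  # safe progress
r_unsafe = reward(d_prev=2.0, d_curr=1.8, d_entity=0.3)  # progress, but unsafe
```

The shaping term rewards each step of progress toward the goal, so the agent receives a dense signal even before the sparse arrival bonus fires.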
to prove the effectiveness and safety of the motion strategy with protective barrier, it will be compared with two reference strategies; the first reference strategy is obtained by deleting the protective barrier of the motion planning strategy with the protective barrier, which is called a strategy without the protective barrier;
the second reference strategy is a strategy in which the reward function comprises a dynamic obstacle avoidance function, which is referred to as a strategy with a dynamic obstacle avoidance function, for a controllable AMR, the uncontrollable AMR is considered as a dynamic obstacle, the multi-objective reward function of the controllable AMR is defined as:
wherein the method comprises the steps ofAnd->Respectively representing the distance between the controllable AMR and the nearest AMR at the time step t-1 and the time step t; the rewarding function in the above formula comprises three targets, namely cargo transportation, collision avoidance and dynamic obstacle avoidance;
based on the setting described above, the invention trains three motion planning strategies based on reinforcement learning for controllable AMR, namely a strategy with a protective barrier, a strategy without a protective barrier and a strategy with a dynamic obstacle avoidance function. In training, uncontrolled AMR followsThe defined optimal pursuit strategy pursues the nearest controllable AMR. The learning rate is set to 3.00 x 10-4 and the discount factor is set to 0.99. The upper time step limit per round is set to 3000 and the number of rounds is set to 1000.
A round ends if 1) any controllable AMR collides; 2) any controllable AMR reaches the target position; or 3) the time step reaches 3000. The average cumulative rewards obtained by the controllable AMRs in training under the different reinforcement-learning-based motion planning methods are shown in FIG. 4, from which it can be seen that the strategy with a protective barrier converges within about 60 rounds, whereas the strategy without a protective barrier and the strategy with a dynamic obstacle avoidance function converge within about 150 and 330 rounds, respectively. The results indicate that at the end of training the strategy with a protective barrier achieves a higher average cumulative reward than the strategy without one: the CBF-based protective barrier protects the controllable AMRs from collisions and thereby reduces the penalties they incur.
Safety of motion planning strategy with protective barrier in training
The safety of the different motion methods in training is compared. If no AMR suffers any collision in a round, the round is considered a safe round; if any AMR collides in a round, that round is called a collision round.
The numbers and proportions of safe rounds and collision rounds are shown in FIG. 5. It can be seen from the figure that a controllable AMR following the strategy with a protective barrier collides in only 2.00% of cases. The results indicate that although the strategy with a protective barrier cannot always guarantee the safety of the controllable AMR when encountering an uncontrollable AMR, it is safer than the strategy without a protective barrier. Further analysis of the collisions shows that a controllable AMR following the strategy with a protective barrier collides only when it simultaneously approaches the uncontrollable AMR and at least one obstacle or another controllable AMR.
Testing uncontrolled AMR encountering similar dynamic obstacles in a simulation environment
Based on the simulation environment, the safety of the AMRs and the time consumption for completing tasks are compared when the AMRs follow the different trained motion planning strategies. In the test, the AMRs need to transport two items to a target location. The initial position of the cargo is (2.00, 2.00) (unit: m) and the target position of the cargo is (0.00, 0.00) (unit: m). To ensure that the initial pose of each controllable AMR is within the safety set, the initial pose of each AMR is randomly selected within a specific range, as shown in Table 1. The simulated warehouse environment includes three static obstacles, located at (1.50, 0.00), (0.00, 1.50) and (1.00, 1.00) (unit: m), respectively. One test represents a round with 5000 time steps, and the test is repeated 100 times. Different situations of time delay and disturbance that the controllable AMRs may suffer are considered in the test: if a controllable AMR suffers a time delay, it observes the environment state with a delay of 0.10 seconds; if a controllable AMR suffers a disturbance, its action is perturbed by Gaussian random noise with zero mean and unit variance.
Tables 2 and 3 respectively show the proportion of safe rounds and the average time consumption of the tasks completed by the controllable AMRs in a simulation environment containing an uncontrollable AMR acting like a dynamic obstacle, where $s_1$, $s_2$ and $s_u$ denote the pose states of the 1st controllable AMR, the 2nd controllable AMR and the uncontrollable AMR in two-dimensional Cartesian coordinates, respectively. When the uncontrollable AMR randomly selects a direction and keeps uniform motion as a dynamic obstacle, the strategy with a protective barrier is superior to the strategy with a dynamic obstacle avoidance function in terms of both safety and time consumption. The strategy with a protective barrier and the strategy with a dynamic obstacle avoidance function both have better safety than the strategy without a protective barrier.
TABLE 1 AMR initial pose Range (Unit: m, rad)
TABLE 2 ratio of safety rounds in simulation environments containing uncontrollable AMR like dynamic obstacles
TABLE 3 average time consumption (in seconds) for completing a task in a simulation environment containing uncontrollable AMR like dynamic barrier
Testing uncontrolled AMR subject to similar chasers in a simulation environment
In this test, the uncontrollable AMR acts as an aggressive chaser, following the optimal pursuit strategy defined above.
Tables 4 and 5 show the rate and average time consumption of the safe rounds of the controlled AMR completion tasks in a simulation environment containing uncontrolled AMR like a chaser, respectively. When the uncontrollable AMR follows the optimal pursuit strategy, the strategy with the protective barrier is better in terms of safety than the strategy without the protective barrier and the strategy with the dynamic obstacle avoidance function. Compared to strategies with dynamic obstacle avoidance functions, strategies with protective barriers perform slightly better in terms of time consumption.
TABLE 4 ratio of safety rounds in simulation environments containing uncontrolled AMR like chasers
TABLE 5 average time consumption (in seconds) for completing a task in a simulation environment containing an uncontrolled AMR like a chaser
Actions when controllable AMR follows strategy with protective barrier in simulation environment
To further demonstrate the effectiveness of the developed reinforcement-learning-based safe motion planning method, the invention analyzes the actions of the AMRs following the trained strategy with a protective barrier in the simulation environment. The AMRs need to transport two items to the target location; the AMRs suffer no time delay or disturbance, and the uncontrollable AMR acts as a chaser following the optimal pursuit strategy defined above.
the initial pose of the controllable AMR is set to (0.00, 1.00, 1.57) and (0.00, -1.00, 1.57) (unit: m, rad), and the initial pose of the uncontrollable AMR is set to (0.00, 1.57) (unit: m, rad); the motion profile of AMR is shown in fig. 6. It can be seen from the figure that two controllable AMR following the strategy with protective barrier can be moved between the initial position and the target position of the cargo to transport the cargo, which can evade collision when an uncontrollable AMR chases one of the controllable AMR.
The nominal actions determined by multi-agent reinforcement learning and the safety actions corrected by CBF-based protection barrier for the first and second controllable AMR are shown in fig. 7 and 8. Fig. 7 and 8 show that CBF-based protection barriers often modify the actions of two controllable AMR's to secure the safety of the two controllable AMR's.
To further investigate the actions of a controllable AMR following the strategy with a protective barrier, the trajectory of the controllable AMR evading the uncontrollable AMR and the trajectory of the controllable AMR avoiding an obstacle are shown in FIG. 9 and FIG. 10. FIG. 9 shows that the controllable AMR makes a sharp right turn to go around the uncontrollable AMR. FIG. 10 shows that, even while being pursued by the uncontrollable AMR, the controllable AMR can turn right to avoid an obstacle and then move left toward the initial position of the cargo to load it. The above analysis of the controllable AMR's actions verifies the effectiveness of the developed reinforcement-learning-based safe motion planning method in the face of uncontrollable AMRs and obstacles.
Controllable AMR acts in the real world following a policy with a protective barrier
The invention analyzes the actions of controllable AMRs following the strategy with a protective barrier in the real world, where the rounds performed in the real world have the same initial-condition settings as the rounds performed in the simulation environment described above. Because of the relatively poor safety of the strategy without a protective barrier and the strategy with a dynamic obstacle avoidance function, these two strategies were not deployed on real-world controllable AMRs.
The development trajectory of AMR in the real world is shown in fig. 11. It can be seen from the figure that following the strategy with protective barriers, two controllable AMR can be moved between the initial position of the cargo and the target position to transport the cargo and avoid uncontrollable AMR and obstacles.
The nominal actions of the first and second controllable AMR as determined by multi-agent reinforcement learning and the safety actions as modified by CBF-based protection barrier are shown in fig. 12 and 13. In the real world, CBF-based protection barriers often modify the actions of two controllable AMR's to ensure the safety of the two controllable AMR's.
To further investigate the actions of a controllable AMR following the strategy with a protective barrier in the real world, the trajectory of the controllable AMR avoiding the uncontrollable AMR is shown in FIG. 14: the controllable AMR turns sharply to avoid the uncontrollable AMR and moves along a curve of larger radius to the initial position of the cargo. The trajectory of the controllable AMR avoiding an obstacle is shown in FIG. 15: the controllable AMR turns left to avoid the obstacle and, after bypassing it, moves toward the target location of the cargo.
Testing in a large scale simulation environment
In order to verify the scalability of the developed reinforcement learning based safety motion planning method, the safety and effectiveness of the strategy with the protection barrier was studied in an environment with more AMR, as shown in fig. 16, in a large scale simulation environment, there are ten AMR, two of which are uncontrollable AMR. Uncontrollable AMR follows the optimal pursuit strategy, eight controllable AMR adopts a strategy with a protective barrier.
The ratio of safe rounds and the average time spent completing the task are shown in table 6. It can be seen from table 6 that the strategy with the protective barrier can be deployed in a large-scale environment containing eight controllable AMRs. Even when the two uncontrollable AMRs execute the optimal pursuit strategy, the effectiveness of the strategy with the protective barrier is not significantly reduced.
TABLE 6 ratio of safety rounds and average time spent completing a task in a large scale simulation environment
By integrating a CBF-based protective barrier into multi-agent reinforcement learning, the invention develops a multi-agent reinforcement learning algorithm containing a CBF-based protective barrier, designs the CBF-based protective barrier from the kinematic model of a given type of mobile robot, and provides a safe motion planning method. Experiments carried out in the simulation environment and in the real world demonstrate that the reinforcement-learning-based safe motion planning method can improve the safety of the controllable AMRs.
Therefore, the reinforcement-learning-based mobile robot safe motion planning method can ensure the safety of the mobile robot during operation.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the present invention may be modified or equivalently replaced without departing from its spirit and scope.

Claims (5)

1. A mobile robot safe motion method based on reinforcement learning, characterized in that the method comprises the following specific steps:
s1, constructing a motion equation and a nonlinear affine system of a mobile robot AMR;
s2, constructing a protective barrier based on a control barrier function CBF and a multi-agent reinforcement learning algorithm based on the CBF protective barrier so as to realize the safety movement of the mobile robot AMR based on reinforcement learning.
2. The reinforcement-learning-based safe motion method of a mobile robot according to claim 1, wherein in step S1 the specific operation of constructing the equation of motion is:
during operation, the uncontrollable AMRs can interfere with the controllable AMRs; all AMRs are assumed to use the same model with a known kinematic model, and each controllable AMR and each uncontrollable AMR observes the poses of nearby AMRs; the AMR is a differential-drive robot (DDR); let q = (x, y, θ) denote the pose state of the DDR in two-dimensional Cartesian coordinates, where x and y denote the position of the DDR and θ denotes its heading angle; the equations of motion governing the evolution of the DDR state are:

ẋ = v cos θ, ẏ = v sin θ, θ̇ = ω,

where ẋ, ẏ and θ̇ denote the time derivatives of x, y and θ respectively, v denotes the magnitude of the translational speed of the DDR, and the DDR is controlled by adjusting the angular velocity ω.
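As an illustrative sketch only (not part of the claimed method), the DDR kinematics above can be simulated with simple Euler integration; the function name and step size `dt` are assumptions for illustration:

```python
import math

def ddr_step(x, y, theta, v, omega, dt=0.05):
    """One Euler step of the DDR kinematics:
    x_dot = v*cos(theta), y_dot = v*sin(theta), theta_dot = omega."""
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + omega * dt)

# Example: drive straight along the x-axis for 1 s at v = 1 m/s.
pose = (0.0, 0.0, 0.0)
for _ in range(20):
    pose = ddr_step(*pose, v=1.0, omega=0.0)
```

With ω = 0 the heading stays fixed, so the robot advances v·t = 1 m along the x-axis.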
3. The reinforcement-learning-based mobile robot safe motion method of claim 2, wherein in step S1 the nonlinear affine system is:

ẋ = f(x) + g(x)u,

where x ∈ X denotes the state of the system, X ⊆ Rⁿ denotes the state space, and Rⁿ denotes the n-dimensional real space; u ∈ U denotes the control input, i.e. the action, U ⊆ Rᵐ denotes the action space, and Rᵐ denotes the m-dimensional real space; f and g are two locally Lipschitz functions; the safety of the system state is guaranteed by the forward invariance of a safety set; with a continuously differentiable function h(x), the set C is defined to satisfy:

C = {x : h(x) ≥ 0}, ∂C = {x : h(x) = 0}, Int(C) = {x : h(x) > 0};

the set C is referred to as the safety set, where ∂C denotes the boundary of C and Int(C) denotes the interior of C;

for the system and the set C, a function h with relative degree r is an exponentially controlled barrier function (ECBF) if there exists K_α ∈ R^{1×r} such that

sup_{u∈U} [ L_f^r h(x) + L_g L_f^{r−1} h(x) u + K_α η_b(x) ] ≥ 0,

where R^{1×r} denotes the (1×r)-dimensional real space, sup denotes the supremum, L_f^r h(x) denotes the r-th Lie derivative of h along f, L_g L_f^{r−1} h(x) denotes the Lie derivative along g of the (r−1)-th Lie derivative of h along f, x(t₀) denotes the state of the system at time t₀, x(t) denotes the state of the system at time t, K_α denotes a constant gain matrix, and e^{A_b t} denotes the matrix exponential of A_b t; under this condition, h(x(t₀)) ≥ 0 implies h(x(t)) ≥ C_b e^{A_b t} η_b(x(t₀)) ≥ 0, with C_b = [1, 0, …, 0];

wherein

η_b(x) = [h(x), ḣ(x), ḧ(x), …, h^{(r−1)}(x)]ᵀ,

where ḣ(x), ḧ(x) and h^{(r−1)}(x) denote the first time derivative, the second time derivative and the (r−1)-th time derivative of h(x), respectively.
4. The reinforcement-learning-based safe motion method of a mobile robot according to claim 3, wherein in step S2 the protective barrier is determined from a plurality of safety targets based on the CBF, wherein the k CBFs of the plurality of safety targets are denoted h₁(x), …, h_k(x), and the plurality of CBFs are fused by Boolean operations to form a composite CBF, denoted h(x);
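As a hedged illustration of Boolean fusion: one standard construction (the claim does not spell out its Boolean operators, so this is an assumption) composes CBFs for an AND of safety targets as a pointwise minimum; the helper name `conjunction` is hypothetical:

```python
def conjunction(hs):
    """Boolean AND of CBFs h_1..h_k as a pointwise minimum: the composite
    h(x) = min_i h_i(x) is nonnegative iff every h_i(x) is nonnegative,
    so the composite safety set is the intersection of the individual ones."""
    return lambda x: min(h(x) for h in hs)

# Two safety targets on a scalar state: stay right of x = -1 and left of x = 1.
h = conjunction([lambda x: x + 1.0, lambda x: 1.0 - x])
```

Disjunction (OR) would use a pointwise maximum in the same way.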
a CBF-based protective barrier is obtained from the composite CBF derived above; the CBF-based protective barrier modifies the action of an AMR only when the AMR tends to violate the safety condition.
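As a minimal sketch of such a protective barrier (not the patent's implementation): assuming single-integrator dynamics and a single distance-keeping CBF, the minimally modifying quadratic program has a closed-form projection, and the nominal action is returned unchanged whenever it already satisfies the constraint; all names and parameters here are illustrative:

```python
import numpy as np

def cbf_filter(u_nom, x, x_obs, d_safe, alpha=1.0):
    """Minimally modify u_nom so that h(x) = ||x - x_obs||^2 - d_safe^2
    stays nonnegative for single-integrator dynamics x_dot = u.
    Constraint: grad_h(x) . u + alpha * h(x) >= 0 (closed-form QP solution
    for a single linear constraint)."""
    diff = x - x_obs
    h = diff @ diff - d_safe**2
    a = 2.0 * diff                      # gradient of h
    slack = a @ u_nom + alpha * h       # constraint value at the nominal action
    if slack >= 0.0:
        return u_nom                    # nominal action is already safe
    return u_nom - slack * a / (a @ a)  # project onto the constraint boundary

# Nominal action drives straight at an obstacle; the filter slows the approach.
u = cbf_filter(np.array([1.0, 0.0]), np.array([0.0, 0.0]),
               np.array([1.0, 0.0]), d_safe=0.5)
# Far from the obstacle, the nominal action passes through unmodified.
u_far = cbf_filter(np.array([1.0, 0.0]), np.array([0.0, 0.0]),
                   np.array([5.0, 0.0]), d_safe=0.5)
```

This mirrors the claim's property that the barrier intervenes only when the safety condition tends to be violated.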
5. The reinforcement-learning-based safe motion method of a mobile robot according to claim 4, wherein: for the i-th controllable AMR, whose state is denoted s_i, when facing an uncontrollable AMR whose state is denoted s_u, an appropriate action a_i is selected such that the safety condition of the CBF-based protective barrier is satisfied; and the parameters of the CBF protective barrier are set such that, with D denoting the maximum distance between a controllable AMR and the uncontrollable AMR, the state of the i-th controllable AMR remains in the safety set, i.e. the safety of the controllable AMR is guaranteed;
the safety action of the i-th controllable AMR is denoted a_i^safe and is expressed in terms of the following quantities:

wherein a_i^nom denotes the nominal action of the controllable AMR determined by multi-agent reinforcement learning; d_s denotes the safe-distance threshold; Δp_i denotes the position difference between the i-th controllable AMR and the uncontrollable AMR; v_i and v_u denote the translational speeds of the i-th controllable AMR and the uncontrollable AMR, respectively; Δv_i denotes the difference between the velocities of the i-th controllable AMR and the uncontrollable AMR; ω_i and ω_u denote the angular velocities of the i-th controllable AMR and the uncontrollable AMR, respectively; ω_max denotes the maximum possible angular velocity of the controllable AMR; (x_i, y_i, θ_i) denotes the pose of the i-th controllable AMR and (x_u, y_u, θ_u) denotes the pose of the uncontrollable AMR; sgn(·) is the sign function, with sgn(z) = 1 for z ≥ 0 and sgn(z) = −1 for z < 0; φ denotes the angle, measured counterclockwise from the x-axis of the two-dimensional Cartesian coordinate system, to the line connecting the position of the uncontrollable AMR with the position of the i-th controllable AMR.
CN202310814855.7A 2023-07-05 2023-07-05 Mobile robot safe movement method based on reinforcement learning Active CN116540553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310814855.7A CN116540553B (en) 2023-07-05 2023-07-05 Mobile robot safe movement method based on reinforcement learning


Publications (2)

Publication Number Publication Date
CN116540553A true CN116540553A (en) 2023-08-04
CN116540553B CN116540553B (en) 2023-08-25

Family

ID=87456389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310814855.7A Active CN116540553B (en) 2023-07-05 2023-07-05 Mobile robot safe movement method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN116540553B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018198031A (en) * 2017-05-25 2018-12-13 日本電信電話株式会社 Mobile body control method, mobile body controller, and program
CN110908278A (en) * 2019-11-12 2020-03-24 北京航空航天大学 Dynamics modeling and stability control method of folding wing aircraft
CN110928189A (en) * 2019-12-10 2020-03-27 中山大学 Robust control method based on reinforcement learning and Lyapunov function
US20200164514A1 (en) * 2018-11-28 2020-05-28 Kabushiki Kaisha Toshiba Robot motion planning device, robotic system, and method
CN111552301A (en) * 2020-06-21 2020-08-18 南开大学 Hierarchical control method for salamander robot path tracking based on reinforcement learning
CN112506194A (en) * 2020-12-03 2021-03-16 中山大学 Distributed safety learning control method for mobile robot cluster
CN115933648A (en) * 2022-11-24 2023-04-07 清华大学深圳国际研究生院 Robot dynamic obstacle avoidance method and system
CN115933630A (en) * 2022-06-23 2023-04-07 天津大学 Composite obstacle avoidance control method and device based on reinforcement learning
CN116276966A (en) * 2023-01-09 2023-06-23 戴盟(深圳)机器人科技有限公司 Mobile operation robot whole body reaction planning control method based on quadratic programming
CN116339314A (en) * 2023-02-03 2023-06-27 江苏科技大学 Under-actuated unmanned ship track tracking control method based on self-adaptive sliding mode


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MOHIT SRINIVASAN ET AL: "Control of Mobile Robots Using Barrier Functions Under Temporal Logic Specifications", IEEE TRANSACTIONS ON ROBOTICS, vol. 37, no. 2, pages 363 - 373 *
ZHENG Fanzhou: "Research on robot motion state estimation based on deep reinforcement learning", Journal of Hunan University of Arts and Science (Natural Science Edition), vol. 35, no. 1, pages 34 - 39 *

Also Published As

Publication number Publication date
CN116540553B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
Pham et al. Optlayer-practical constrained optimization for deep reinforcement learning in the real world
Mukadam et al. Tactical decision making for lane changing with deep reinforcement learning
Shi et al. Adaptive image-based visual servoing with temporary loss of the visual signal
Kassaeiyan et al. Control of tractor-trailer wheeled robots considering self-collision effect and actuator saturation limitations
Du et al. Safe deep reinforcement learning-based adaptive control for USV interception mission
Huang et al. A multi-AUV cooperative hunting method in 3-D underwater environment with obstacle
CN101441736B (en) Path planning method of motor crane robot
You et al. Target tracking strategy using deep deterministic policy gradient
Liu et al. Two potential fields fused adaptive path planning system for autonomous vehicle under different velocities
CN113253738B (en) Multi-robot cooperation trapping method and device, electronic equipment and storage medium
Lu et al. Hierarchical reinforcement learning for autonomous decision making and motion planning of intelligent vehicles
Mishra et al. Design of mobile robot navigation controller using neuro-fuzzy logic system
Zhao et al. Robust zeroing neural network for fixed-time kinematic control of wheeled mobile robot in noise-polluted environment
CN111103881A (en) Multi-agent formation anti-collision control method and system
Sharma et al. Dmp based trajectory tracking for a nonholonomic mobile robot with automatic goal adaptation and obstacle avoidance
CN116540553B (en) Mobile robot safe movement method based on reinforcement learning
Kowalczyk et al. Leader-follower control and collision avoidance for the formation of differentially-driven mobile robots
CN113282093B (en) Robot navigation method, device, electronic equipment and storage medium
Cruz et al. Learning contextual affordances with an associative neural architecture.
Li et al. A model predictive obstacle avoidance method based on dynamic motion primitives and a Kalman filter
Luders et al. Probabilistically safe avoidance of dynamic obstacles with uncertain motion patterns
Bécsi et al. Highway environment model for reinforcement learning
Liu et al. Safe reinforcement learning of dynamic high-dimensional robotic tasks: navigation, manipulation, interaction
CN116776929A (en) Multi-agent task decision method based on PF-MADDPG
Rodríguez-Seda et al. Guaranteed safe motion of multiple lagrangian systems with limited actuation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant