CN114237235B - Mobile robot obstacle avoidance method based on deep reinforcement learning - Google Patents

Mobile robot obstacle avoidance method based on deep reinforcement learning

Info

Publication number
CN114237235B
CN114237235B (application CN202111460950.9A; earlier publication CN114237235A)
Authority
CN
China
Prior art keywords
robot
neural network
distance
rewards
pedestrian
Prior art date
Legal status
Active
Application number
CN202111460950.9A
Other languages
Chinese (zh)
Other versions
CN114237235A (en)
Inventor
穆宗昊
宋伟
廖建峰
周元海
金天磊
方伟
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202111460950.9A
Publication of CN114237235A
Application granted
Publication of CN114237235B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 - Control of position or course in two dimensions
    • G05D1/021 - Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 - Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223 - Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • G05D1/0231 - Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0246 - Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
    • G05D1/0257 - Control of position or course in two dimensions specially adapted to land vehicles using a radar

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Electromagnetism (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a mobile robot obstacle avoidance method based on deep reinforcement learning. Point cloud data are acquired by a laser radar and passed through convolutional feature extraction; the extracted features, together with the pedestrian positions, pedestrian speeds and the global path, are fed as inputs to a fully connected neural network; environmental rewards are defined, and robot actions are output by the PPO deep reinforcement learning algorithm. Compared with other planning- or learning-based navigation methods, the method requires neither pedestrian motion prediction nor sensor preprocessing, which simplifies the algorithm and makes it better suited to robot navigation in a multi-person environment. In addition, because the global path is added as an input, the applicable range of the algorithm is widened and its convergence time is shortened.

Description

Mobile robot obstacle avoidance method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of mobile robot navigation, in particular to a PPO-based mobile robot obstacle avoidance method in a multi-person environment.
Background
Autonomous navigation and obstacle avoidance are fundamental problems in robotics. A robot typically builds a two-dimensional grid map in which the gray value of each cell indicates whether that point is free space or an obstacle, and then runs a path planning algorithm on the map. Navigation usually combines global planning, commonly the conventional A* or Dijkstra algorithm, with local path planning, commonly the DWA or TEB algorithm. These graph-based planning methods are computationally cheap, but the generated trajectories often do not satisfy the dynamic constraints of the robot and handle dynamic obstacles such as people poorly.
Deep reinforcement learning has been one of the most actively studied directions in robot planning and navigation in recent years: like a human, a robot can find an optimal strategy through trial and error and thereby complete navigation tasks in complex environments. The PPO algorithm is currently the most widely used deep reinforcement learning algorithm; it is broadly applicable and can solve planning and navigation problems in continuous environments. For the obstacle avoidance problem in a multi-person environment, PPO therefore adapts to the environment better than conventional algorithms.
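For reference, the clipped surrogate objective that defines PPO (standard background, not reproduced in the original patent text) is:

```latex
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

where \hat{A}_t is the advantage estimate and \epsilon the clipping range; the clipping keeps policy updates small, which is what makes PPO stable in continuous control settings such as the one considered here.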
Existing obstacle avoidance algorithms based on deep reinforcement learning have made considerable progress, but their obstacle avoidance performance, training time and practical value still leave much room for improvement. The main reasons are that pedestrian motion models vary widely, that the final target provides little guidance when the path is long, and that unknown conditions in real use make the results unstable.
Therefore, a deep reinforcement learning obstacle avoidance method of high practical value is needed to meet the application requirement of strong adaptability to multiple dynamic pedestrian obstacles.
Disclosure of Invention
In order to overcome the defects of the prior art and give the mobile robot strong obstacle avoidance adaptability to multiple dynamic pedestrian obstacles, the invention adopts the following technical scheme:
a mobile robot obstacle avoidance method based on deep reinforcement learning comprises the following steps:
s1, acquiring sensor original data through a laser radar and a front camera carried by a mobile robot, and acquiring the position and speed of a pedestrian through a pedestrian positioning sensor;
s2, a two-dimensional or three-dimensional laser radar is adopted to map a scene;
s3, setting a navigation target point and acquiring a global planning path;
s4, designing an action space and a reward function of a PPO deep reinforcement learning algorithm, wherein the action space comprises a robot speed direction and a robot rewarding functionThe magnitude of the reward function comprises a reward R reaching the target point t (gold), reward R for path-extending walking t (path), penalty R for encountering obstacle t Punishment R of encountering pedestrian t (scope) penalty R for the step size used t (time) in which specific parameters can be adjusted to satisfy the prize R for reaching the target point as a whole t (gold) is much greater than other penalties, pedestrian and obstacle penalties progressively in distance;
s5, establishing an action neural network actor, acquiring laser radar data, front camera data, pedestrian information position and speed, and position information of the robot, a target point and a global planning path, and outputting speed information selected by the mobile robot;
s6, establishing a reward neural network critic, acquiring laser radar data, front camera data, pedestrian information position and speed, and position information of a robot, a target point and a global planning path, and outputting the maximum reward which can be obtained in the current state;
s7, in the constructed simulation scene and the actual test scene, a PPO deep reinforcement learning algorithm is adopted, real rewards of the output of each step of action neural network actor of the robot in the simulation environment are compared with rewards predicted by rewards neural network critic, iterative training is alternately carried out until obstacle avoidance training is completed, and the trained action neural network actor and rewards neural network critic are used in the actual scene.
Further, after the simulation training in step S7 is completed, the network parameters are recorded and applied to the actual scene, the environmental state quantity, the action space and the rewarding function are the same as those in the simulation environment, the neural network and the training method are the same as those in the simulation environment, the actual scene data are obtained through training, the network is corrected, and the actual training time can be accelerated through the training method combining simulation and real objects.
Further, the robot is an omni-wheel robot; the speed direction takes values from -180 degrees to 180 degrees and the speed magnitude takes values from 0 to 1 m/s.
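As an illustration only (the function name and clamping behavior below are assumptions, not part of the patent), the described action space maps to planar velocity commands for an omni-wheel base roughly as follows:

```python
import math

def action_to_velocity(direction_deg: float, speed: float) -> tuple:
    """Map an action (direction in degrees, speed magnitude in m/s) to
    planar velocity components for an omni-wheel base.

    direction_deg is clamped to [-180, 180] and speed to [0, 1] m/s,
    matching the action space described above."""
    direction_deg = max(-180.0, min(180.0, direction_deg))
    speed = max(0.0, min(1.0, speed))
    theta = math.radians(direction_deg)
    vx = speed * math.cos(theta)   # forward component in the robot frame
    vy = speed * math.sin(theta)   # lateral component in the robot frame
    return vx, vy
```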
Further, the map built in step S2 is a gray-scale map, and places where the map is inaccurate are corrected according to the actual situation, including adding wall obstacles that the laser cannot scan and removing burr points introduced during mapping.
Further, a set of target points is set in S3 and classified as far, middle and near according to distance.
Further, in S3 a global path planning method is adopted to obtain the global planned path; the path is a sequence of two-dimensional coordinate points, and the spacing between coordinate points is usually 5 cm.
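A minimal sketch of how the distance and angle features with respect to the goal and the global planned path could be computed from such a coordinate point sequence is given below; the function and variable names are illustrative assumptions, not identifiers from the patent.

```python
import math

def path_features(robot_xy, goal_xy, global_path):
    """Compute the state features described above: distance/angle from the
    robot to the goal and to the closest point of the global planned path.

    global_path is a list of (x, y) waypoints spaced roughly 5 cm apart."""
    rx, ry = robot_xy
    gx, gy = goal_xy
    d_goal = math.hypot(gx - rx, gy - ry)
    a_goal = math.atan2(gy - ry, gx - rx)

    # nearest waypoint on the global planned path
    px, py = min(global_path, key=lambda p: math.hypot(p[0] - rx, p[1] - ry))
    d_path = math.hypot(px - rx, py - ry)
    a_path = math.atan2(py - ry, px - rx)
    return d_goal, a_goal, d_path, a_path
```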
Further, for the reward function in S4, the current reward value R_t is calculated as:
R_t = R_t(goal) + R_t(path) + R_t(obstacle) + R_t(people) + R_t(time)
The target point reward R_t(goal) is calculated as follows: when the distance d_t^goal between the robot position p_t at time t and the target point position p_g is less than the target threshold D_g, the reward R_g is obtained; otherwise the reward is 0:
R_t(goal) = R_g, if d_t^goal < D_g; 0, otherwise.
The path reward R_t(path) is calculated as follows: let d_t^path be the distance between the robot position at time t and the closest point of the global path. When d_t^path is less than the corresponding distance d_{t-1}^path at time t-1, the reward is the difference between the two multiplied by the approach coefficient ω_1; otherwise the reward is the difference multiplied by the departure coefficient ω_2. This term reflects whether the robot moves along the planned global path:
R_t(path) = ω_1 (d_{t-1}^path - d_t^path), if d_t^path < d_{t-1}^path; ω_2 (d_{t-1}^path - d_t^path), otherwise.
The obstacle penalty R_t(obstacle) is calculated as follows: let d_t^obs be the distance between the robot position at time t and the nearest obstacle. When d_t^obs is less than the collision distance D_near, the penalty -R_near is obtained; when the distance lies between the collision distance D_near and the penalty distance D_far, the penalty decreases in magnitude as the distance grows, with α denoting the obstacle collision penalty coefficient; otherwise the penalty is 0:
R_t(obstacle) = -R_near, if d_t^obs < D_near; -α (D_far - d_t^obs), if D_near ≤ d_t^obs < D_far; 0, otherwise.
The pedestrian penalty R_t(people) is calculated in the same way: let d_t^ped be the distance between the robot position at time t and the nearest pedestrian. When d_t^ped is less than the collision distance D_near, the penalty -R_near is obtained; when the distance lies between D_near and D_far, the penalty decreases in magnitude as the distance grows, with α here denoting the pedestrian collision penalty coefficient; otherwise the penalty is 0:
R_t(people) = -R_near, if d_t^ped < D_near; -α (D_far - d_t^ped), if D_near ≤ d_t^ped < D_far; 0, otherwise.
The time penalty R_t(time) is the elapsed time t multiplied by a parameter β, where β denotes the time penalty coefficient:
R_t(time) = -β * t.
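As a concrete illustration, the piecewise reward defined above can be written as the following Python sketch. All thresholds and coefficients (R_g, D_g, ω_1, ω_2, R_near, D_near, D_far, α, β) are free parameters of the method, the numerical defaults below are placeholders, and the linear taper between D_near and D_far is one consistent reading of the description that the penalty decreases with distance.

```python
def total_reward(d_goal, d_path, d_path_prev, d_obstacle, d_person, t,
                 R_g=10.0, D_g=0.3,          # goal reward and arrival threshold
                 w1=2.0, w2=1.0,             # approach / departure path coefficients
                 R_near=10.0, D_near=0.4, D_far=1.5,
                 alpha_obs=1.0, alpha_ped=2.0, beta=0.01):
    # R_t(goal): fixed reward once the robot is within D_g of the target
    r_goal = R_g if d_goal < D_g else 0.0

    # R_t(path): reward for moving toward the global path, scaled by w1 or w2
    diff = d_path_prev - d_path
    r_path = w1 * diff if d_path < d_path_prev else w2 * diff

    # piecewise obstacle / pedestrian penalty that tapers between D_near and D_far
    def proximity_penalty(d, coeff):
        if d < D_near:
            return -R_near
        if d < D_far:
            return -coeff * (D_far - d)
        return 0.0

    r_obstacle = proximity_penalty(d_obstacle, alpha_obs)
    r_people = proximity_penalty(d_person, alpha_ped)

    # R_t(time): step penalty proportional to elapsed steps
    r_time = -beta * t
    return r_goal + r_path + r_obstacle + r_people + r_time
```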
further, the action neural network actor established in the S5 is a CNN convolutional neural network, the point cloud data lidar_data of the laser radar and the image data picture_data of the front camera are input, and the point cloud feature value lidar_feature and the image feature value picture_feature are extracted through convolution layer, pooling layer and normalization, the point cloud feature value lidar_feature, the image feature value picture_feature, pedestrian position picture_pos and speed picture_v and the distance between the robot and a target point are obtainedAnd angle->Position of closest point to global planned pathAnd angle->The input dimension is the sum of the dimensions of all data, and the output is a mobile machine after the calculation of the 5-layer fully-connected neural networkHuman-selected speed information robot_v, which includes speed magnitude and direction.
Further, the reward neural network critic established in S6 likewise uses CNN convolutional feature extraction: the laser radar point cloud data lidar_data and the image data picture_data of the front camera are input and passed through convolution layers, pooling layers and normalization to extract the laser point cloud feature value lidar_feature and the image feature value picture_feature. The laser point cloud feature value lidar_feature, the image feature value picture_feature, the pedestrian position people_pos and speed people_v, the distance and angle from the robot to the target point, and the distance and angle to the closest point of the global planned path are taken as input of a 5-layer fully connected neural network, which outputs the maximum reward reward_max obtainable in the current state.
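One possible PyTorch realization of the actor and critic described above is sketched below: convolutional feature extraction for the lidar scan and camera image, followed by a 5-layer fully connected head. The layer sizes, the 360-beam lidar, and the shared feature extractor are illustrative assumptions (the patent describes the actor and critic as separate networks); this is a sketch under those assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """1-D CNN over lidar ranges and 2-D CNN over camera images, each followed
    by pooling and normalization, as described above (sizes are placeholders)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.lidar_net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.BatchNorm1d(16), nn.AdaptiveAvgPool1d(8), nn.Flatten(),
            nn.Linear(16 * 8, feat_dim))
        self.image_net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.BatchNorm2d(16), nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim))

    def forward(self, lidar, image):
        return torch.cat([self.lidar_net(lidar), self.image_net(image)], dim=-1)

def mlp_head(in_dim, out_dim, hidden=256):
    """5-layer fully connected head (4 hidden layers plus the output layer)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim))

class ActorCritic(nn.Module):
    def __init__(self, extra_dim=8, feat_dim=64):
        super().__init__()
        self.features = FeatureExtractor(feat_dim=feat_dim)
        # extra_dim: pedestrian pos/vel + goal distance/angle + path distance/angle
        in_dim = 2 * feat_dim + extra_dim
        self.actor = mlp_head(in_dim, 2)    # speed direction and magnitude (robot_v)
        self.critic = mlp_head(in_dim, 1)   # predicted obtainable reward (reward_max)

    def forward(self, lidar, image, extra):
        x = torch.cat([self.features(lidar, image), extra], dim=-1)
        return self.actor(x), self.critic(x)
```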
Further, the training method in step S7 includes the steps of:
s71, acquiring laser radar point cloud data lidar_data, image data picture_data of a front camera, pedestrian position peoples, pedestrian speed peoples and distances between a robot and a target point from a simulation environmentAnd angle->Position of closest point to global planned path +.>And angle ofAs an environmental state quantity state;
s72, according to the action space, the rewarding function, the action neural network actor and the rewarding neural network critic, inputting an environmental state quantity state into the action neural network actor to obtain robot speed information robot_v, and inputting the robot speed information robot_v into a simulation environment to obtain rewarding report_now at the current moment;
s73, recording actions and rewards of each step in the training process, and updating network parameters according to the difference value of actual rewards and rewards predicted by the rewards neural network critic after a certain training time is reached, so that the output of an action neural network actor is updated towards the maximum rewards direction, and the output of the rewards neural network critic is updated towards a true value;
s73, repeating iteration until training is completed, and enabling the robot to avoid pedestrians in the simulation environment and reach a target point.
The invention has the advantages that:
In robot navigation tasks in a multi-person environment, the method addresses the difficulty of avoiding dynamic pedestrian obstacles. Compared with conventional algorithms, the neural network accounts for both predictive avoidance of dynamic pedestrian obstacles and the smoothness of the robot's motion trajectory. Compared with conventional reinforcement learning methods, the position and angle of the closest point on the global planned path are added as input states, and the change in distance to the path between consecutive moments is used as a reward; this gives the robot a local target and improves generality across different final target points. In addition, training first in simulation and then on the physical robot accelerates training and increases practical value. The obstacle avoidance method therefore has high general applicability and practical value, and is independent of the robot's hardware type.
Drawings
Fig. 1 is a flow chart of the present invention.
Fig. 2 is a schematic view of a scenario of the present invention.
Fig. 3 is a diagram showing the structure of the action neural network according to the present invention.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
A mobile robot obstacle avoidance method based on PPO in a multi-person environment is shown in figure 1, and comprises the following steps:
step one: as shown in fig. 2, a simulation scene and an actual test site are built, sensor raw data are obtained through a laser radar and a front camera carried by a mobile robot, and pedestrian positions and speeds are obtained through a pedestrian positioning sensor.
Step two: the scene is mapped with a two-dimensional or three-dimensional laser radar; the map is a gray-scale map, and places where it is inaccurate are corrected according to the actual situation, including adding wall obstacles that the laser cannot scan and removing burr points introduced during mapping.
Step three: navigation target points are set, generally several target points classified as far, middle and near according to distance; the A* global path planning method is adopted to obtain the global planned path, which is a sequence of two-dimensional coordinate points with a spacing of usually 5 cm between points.
Step four: the action space and reward function of the PPO deep reinforcement learning algorithm are designed. The action space comprises the speed direction and magnitude of the robot; since an omni-wheel robot is used, the speed direction takes values from -180 degrees to 180 degrees and the speed magnitude takes values from 0 to 1 m/s.
The reward function includes a reward R_t(goal) for reaching the target point, a reward R_t(path) for walking along the path, a penalty R_t(obstacle) for encountering obstacles, a penalty R_t(people) for encountering pedestrians, and a penalty R_t(time) for the number of steps used. The specific parameters can be adjusted so that, overall, the reward R_t(goal) for reaching the target point is much greater than the other penalties, and the pedestrian and obstacle penalties grow progressively as the distance decreases.
The current reward value R_t is calculated as:
R_t = R_t(goal) + R_t(path) + R_t(obstacle) + R_t(people) + R_t(time)
The target point reward R_t(goal): when the distance d_t^goal between the robot position p_t and the target point position p_g is less than the target threshold D_g, the reward R_g is obtained; otherwise the reward is 0. The specific formula is:
R_t(goal) = R_g, if d_t^goal < D_g; 0, otherwise.
The path reward R_t(path): when the distance d_t^path between the robot position at time t and the closest point of the global path is less than the corresponding distance d_{t-1}^path at time t-1, the reward is the difference between the two multiplied by the approach coefficient ω_1; otherwise it is the difference multiplied by the departure coefficient ω_2, reflecting whether the robot moves along the planned global path. The specific formula is:
R_t(path) = ω_1 (d_{t-1}^path - d_t^path), if d_t^path < d_{t-1}^path; ω_2 (d_{t-1}^path - d_t^path), otherwise.
The obstacle penalty R_t(obstacle): when the distance d_t^obs between the robot position and the nearest obstacle is less than the collision distance D_near, the penalty -R_near is obtained; when the distance lies between the collision distance D_near and the penalty distance D_far, the penalty decreases in magnitude with distance, with α the obstacle collision penalty coefficient; otherwise the penalty is 0. The specific formula is:
R_t(obstacle) = -R_near, if d_t^obs < D_near; -α (D_far - d_t^obs), if D_near ≤ d_t^obs < D_far; 0, otherwise.
The pedestrian penalty R_t(people): when the distance d_t^ped between the robot position and the nearest pedestrian is less than the collision distance D_near, the penalty -R_near is obtained; when the distance lies between D_near and D_far, the penalty decreases in magnitude with distance, with α here the pedestrian collision penalty coefficient; otherwise the penalty is 0. The specific formula is:
R_t(people) = -R_near, if d_t^ped < D_near; -α (D_far - d_t^ped), if D_near ≤ d_t^ped < D_far; 0, otherwise.
The time penalty R_t(time) is the elapsed time t multiplied by a parameter β, where β denotes the time penalty coefficient. The specific formula is:
R_t(time) = -β * t
step five: an action neural network actor is established, and the network structure is shown in figure 3. The input of the convolutional neural network CNN is that the laser radar is point cloud data lidar_data and image data picture_data of the front camera, and the point cloud characteristic value lidar_feature and the image characteristic value picture_feature are extracted through a convolutional layer, a pooling layer and normalization. Lidar_feature and picture_feature and pedestrian position peoples and speed peoples v, distance of robot from target pointAnd angle->Position of closest point to global planned path +.>And angle->The input of the 5-layer fully-connected neural network is taken as the input, and the input dimension is the sum of the dimensions of all data. And outputting the speed robot_v selected for the mobile robot through calculation of the 5-layer fully connected neural network.
Step six: the reward neural network critic is established; its structure is shown in figure 3. The input of the convolutional neural network CNN is the point cloud data lidar_data of the laser radar and the image data picture_data of the front camera, from which the point cloud feature value lidar_feature and the image feature value picture_feature are extracted through convolution layers, pooling layers and normalization. lidar_feature and picture_feature, together with the pedestrian position people_pos and speed people_v, the distance and angle from the robot to the target point, and the distance and angle to the closest point of the global planned path, are taken as the input of the 5-layer fully connected neural network; the input dimension is the sum of the dimensions of all data. After the computation of the 5-layer fully connected neural network, the maximum reward reward_max obtainable in the current state is output.
Step seven: and in the constructed simulation scene and the actual test site, adopting a PPO deep reinforcement learning algorithm.
First, the simulation and physical environments are built. The robot model is built with the gazebo simulation software; it is an omni-wheel robot, and parameters such as the laser radar, the front camera, the mass and the height are the same as those of the actual robot. Pedestrians follow a social force model and wander in the scene.
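For reference, a basic social force update for a single pedestrian (goal attraction plus exponential repulsion from the robot and other pedestrians) can be sketched as follows; the coefficients are illustrative and not specified in the patent.

```python
import numpy as np

def social_force_step(pos, vel, goal, others, dt=0.1,
                      desired_speed=1.2, tau=0.5, A=2.0, B=0.3):
    """One Euler step of a basic social force model for one pedestrian.

    pos, vel, goal: (2,) arrays; others: list of (2,) positions of the robot
    and other pedestrians that this pedestrian is repelled from."""
    # attractive force toward the goal at the desired walking speed
    direction = goal - pos
    direction = direction / (np.linalg.norm(direction) + 1e-6)
    f_goal = (desired_speed * direction - vel) / tau

    # exponential repulsion from nearby agents
    f_rep = np.zeros(2)
    for other in others:
        diff = pos - other
        dist = np.linalg.norm(diff) + 1e-6
        f_rep += A * np.exp(-dist / B) * diff / dist

    vel = vel + (f_goal + f_rep) * dt
    pos = pos + vel * dt
    return pos, vel
```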
The training proceeds as follows. First, the laser radar point cloud data lidar_data, the image data picture_data of the front camera, the pedestrian positions people_pos and speeds people_v, the distance and angle from the robot to the target point, and the distance and angle to the closest point of the global planned path are acquired from the simulation environment as the environmental state quantity state. The action space and reward function are constructed according to the method of step four, and the action neural network actor and the reward neural network critic are constructed according to the methods of steps five and six. The state is input into the actor network to obtain the robot speed robot_v, and robot_v is input into the simulation environment to obtain the reward reward_now at the current moment. The action and reward of every step are recorded during training; after a certain number of training steps, the network parameters are updated according to the difference between the actual reward and the predicted reward, so that the output of the action neural network actor is updated toward the direction of maximum reward and the output of the reward neural network critic is updated toward the true value. The iteration is repeated until the robot can avoid pedestrians in the simulation environment and reach the target point.
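The interaction loop just described (state in, velocity out, reward recorded, update after a fixed number of steps) might look like the following sketch. The env object is a hypothetical wrapper around the gazebo simulation exposing reset() and step(); it is not an interface defined by the patent, and the ActorCritic model is the sketch given earlier.

```python
import torch

def collect_rollout(env, actor_critic, steps=2048, gamma=0.99):
    """Collect one batch of experience: feed the state to the actor, send the
    sampled velocity to the simulator, and record the per-step reward.
    `env` is a hypothetical wrapper exposing reset() -> state and
    step(action) -> (state, reward, done)."""
    states, actions, logps, rewards, dones = [], [], [], [], []
    state = env.reset()
    for _ in range(steps):
        lidar, image, extra = state
        with torch.no_grad():
            mean, _ = actor_critic(lidar, image, extra)
            dist = torch.distributions.Normal(mean, 0.1)
            action = dist.sample()
            logp = dist.log_prob(action).sum(-1)
        next_state, reward, done = env.step(action)   # reward_now at this step
        states.append(state); actions.append(action)
        logps.append(logp); rewards.append(reward); dones.append(done)
        state = env.reset() if done else next_state

    # discounted returns, used as the target the critic is pushed toward
    returns, running = [], 0.0
    for r, d in zip(reversed(rewards), reversed(dones)):
        running = r + gamma * running * (0.0 if d else 1.0)
        returns.insert(0, running)
    return states, actions, logps, returns
```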
After the simulation training is completed, the network parameters are recorded and applied to the actual scene. The physical setup uses an architecture in which the robot collects data and executes actions while a cloud backend receives the data, trains, and issues commands. UWB base stations are installed on site, pedestrians wear UWB tags, and the pedestrian position and speed parameters are obtained from the backend. The robot carries a laser radar and a front camera and uploads the data directly to the cloud.
The environmental state quantities, action space, reward function, neural networks and training method are the same as in the simulation environment; actual scene data are obtained through training and used to correct the networks. This training method, which combines simulation and the real robot, shortens the actual training time.
The above embodiments are only intended to illustrate the technical solution of the invention and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications and substitutions do not depart from the spirit of the technical solutions of the embodiments of the invention.

Claims (9)

1. The mobile robot obstacle avoidance method based on deep reinforcement learning is characterized by comprising the following steps of:
s1, acquiring sensor original data through a laser radar and a camera carried by a mobile robot, and acquiring the position and the speed of a pedestrian through a pedestrian positioning sensor;
s2, constructing a scene by using a laser radar;
s3, setting a navigation target point and acquiring a global planning path;
s4, designing an action space of a deep reinforcement learning algorithm and a reward function, wherein the action space comprises the speed direction and the size of the robot, and the reward function comprises rewards R reaching target points t (gold), reward R for path-extending walking t (path), penalty R for encountering obstacle t Punishment R of encountering pedestrian t (scope) penalty R for the step size used t (time) in which the prize R to the target point is satisfied as a whole t (gold) is greater than other penalties, pedestrian and obstacle penalties progressively by distance; current prize value R t The calculation method is as follows:
R t =R t (goal)+R t (path)+R t (obstacle)+R t (people)+R t (time)
wherein the target point rewards R t (gold) calculation method, is the robot position at time tDistance between the target point position pg>Less than threshold D from the target point g When get rewards R g Otherwise, the prize is 0, and the specific formula is as follows:
wherein the target point rewards R t (path) calculation method, which is the robot position at time tAnd +.about.nearest point position from global path>Distance between->Distance less than t-1 moment +.>When get rewards R t (path) is the difference between the two times the approach coefficient ω 1 Otherwise the reward is the difference between the two multiplied by the distance coefficient omega 2 The specific formula is shown as follows:
wherein penalty R for obstacles t The (obstacle) calculating method is that the robot position is at the moment tAnd the nearest obstacle locationDistance between->Less than the collision distance D near When penalty-R is obtained near When the distance is at the collision distance D near Distance from punishment D far And when the penalty is reduced along with the distance, alpha represents an object collision penalty coefficient, the rest penalty is 0, and the specific formula is shown as follows:
wherein penalty R for pedestrians t (scope) calculation method, which is the robot position at time tAnd nearest pedestrian position->Distance between->Less than the collision distance D near When penalty-R is obtained near When the distance is at the collision distance D near Distance from punishment D far When the penalty is increased along with the distance, alpha represents a pedestrian collision penalty coefficient, the rest penalty is 0, and the specific formula is shown as follows:
wherein penalty of time R t The (time) calculating method is that the time t is multiplied by a parameter beta, and beta represents a time penalty coefficient, and the specific formula is shown as follows:
R t (time)=-β*t
s5, an action neural network actor is established, laser radar data, camera data, pedestrian information positions and speeds and position information of a robot, a target point and a global planning path are obtained, and speed information selected by the mobile robot is output;
s6, establishing a reward neural network critic, acquiring laser radar data, camera data, pedestrian information position and speed, and position information of a robot, a target point and a global planning path, and outputting the maximum reward which can be obtained in the current state;
s7, constructing a simulation scene, adopting a deep reinforcement learning algorithm, comparing the real rewards of the output of each step of action neural network actor of the robot in the simulation environment with rewards predicted by rewards neural network critic, alternately performing iterative training until obstacle avoidance training is completed, and using the trained action neural network actor and rewards neural network critic in the actual scene.
2. The method for avoiding the obstacle of the mobile robot based on the deep reinforcement learning of claim 1, wherein after the simulation training in the step S7 is completed, the network parameters are recorded and applied to the actual scene, and the actual scene data is obtained through training to correct the network.
3. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein the robot is an omni-wheel robot, the speed direction takes values from -180 degrees to 180 degrees, and the speed magnitude takes values from 0 to 1 m/s.
4. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein the map built in S2 is a gray-scale map, and places where the map is inaccurate are corrected according to the actual situation.
5. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein a set of target points is set in step S3 and classified as far, middle and near according to distance.
6. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein a global path planning method is adopted in step S3 to obtain the global planned path, and the path is a sequence of two-dimensional coordinate points.
7. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein the action neural network actor established in step S5 is a CNN convolutional neural network whose input is the point cloud data lidar_data of the laser radar and the image data picture_data of the camera, from which the laser point cloud feature value lidar_feature and the image feature value picture_feature are extracted through convolution layers, pooling layers and normalization; the laser point cloud feature value lidar_feature, the image feature value picture_feature, the pedestrian position people_pos and speed people_v, the distance and angle from the robot to the target point, and the distance and angle to the closest point of the global planned path are taken as input of the fully connected neural network, whose computation outputs the speed information robot_v selected for the mobile robot, robot_v comprising the speed magnitude and direction.
8. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein the reward neural network critic established in S6 is a CNN convolutional neural network whose input is the laser radar point cloud data lidar_data and the image data picture_data of the camera, from which the laser point cloud feature value lidar_feature and the image feature value picture_feature are extracted through convolution layers, pooling layers and normalization; the laser point cloud feature value lidar_feature, the image feature value picture_feature, the pedestrian position people_pos and speed people_v, the distance and angle from the robot to the target point, and the distance and angle to the closest point of the global planned path are taken as input of the fully connected neural network, whose computation outputs the maximum reward reward_max obtainable in the current state.
9. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein the training method in step S7 comprises the steps of:
s71, acquiring laser radar point cloud data lidar_data, image data picture_data of a camera, pedestrian position peoples, pedestrian speed peoples and distance between a robot and a target point from a simulation environmentAnd angle->Position of closest point to global planned path +.>And angle->As an environmental state quantity state;
s72, according to the action space, the rewarding function, the action neural network actor and the rewarding neural network critic, inputting an environmental state quantity state into the action neural network actor to obtain robot speed information robot_v, and inputting the robot speed information robot_v into a simulation environment to obtain rewarding report_now at the current moment;
s73, recording actions and rewards of each step in the training process, and updating network parameters according to the difference value of actual rewards and rewards predicted by the rewards neural network critic, so that the output of an action neural network actor is updated to the maximum rewards direction, and the output of the rewards neural network critic is updated to a true value;
s73, repeating iteration until training is completed.
CN202111460950.9A 2021-12-02 2021-12-02 Mobile robot obstacle avoidance method based on deep reinforcement learning Active CN114237235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111460950.9A CN114237235B (en) 2021-12-02 2021-12-02 Mobile robot obstacle avoidance method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111460950.9A CN114237235B (en) 2021-12-02 2021-12-02 Mobile robot obstacle avoidance method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114237235A (en) 2022-03-25
CN114237235B (en) 2024-01-19

Family

ID=80752987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111460950.9A Active CN114237235B (en) 2021-12-02 2021-12-02 Mobile robot obstacle avoidance method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114237235B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114779772B (en) * 2022-04-13 2023-08-08 泉州装备制造研究所 Path planning method and device integrating global algorithm and local algorithm
CN115291616B (en) * 2022-07-25 2023-05-26 江苏海洋大学 AUV dynamic obstacle avoidance method based on near-end strategy optimization algorithm
CN115790608B (en) * 2023-01-31 2023-05-30 天津大学 AUV path planning algorithm and device based on reinforcement learning
CN117873089A (en) * 2024-01-10 2024-04-12 南京理工大学 Multi-mobile robot cooperation path planning method based on clustering PPO algorithm

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109407676A (en) * 2018-12-20 2019-03-01 哈尔滨工业大学 The moving robot obstacle avoiding method learnt based on DoubleDQN network and deeply
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110750096A (en) * 2019-10-09 2020-02-04 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN111897316A (en) * 2020-06-22 2020-11-06 北京航空航天大学 Multi-aircraft autonomous decision-making method under scene fast-changing condition
CN113359717A (en) * 2021-05-26 2021-09-07 浙江工业大学 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111578940B (en) * 2020-04-24 2021-05-11 哈尔滨工业大学 Indoor monocular navigation method and system based on cross-sensor transfer learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109407676A (en) * 2018-12-20 2019-03-01 哈尔滨工业大学 The moving robot obstacle avoiding method learnt based on DoubleDQN network and deeply
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110750096A (en) * 2019-10-09 2020-02-04 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN111897316A (en) * 2020-06-22 2020-11-06 北京航空航天大学 Multi-aircraft autonomous decision-making method under scene fast-changing condition
CN113359717A (en) * 2021-05-26 2021-09-07 浙江工业大学 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Deep Q-network-based path planning algorithm for unmanned surface vehicles; 随博文; 黄志坚; 姜宝祥; 郑欢; 温家一; Journal of Shanghai Maritime University (Issue 03); full text *
Research on neural-network-based reinforcement learning for service robot navigation; 陈双; 李龙; 罗海南; Modern Computer (Issue 12); full text *

Also Published As

Publication number Publication date
CN114237235A (en) 2022-03-25

Similar Documents

Publication Publication Date Title
CN114237235B (en) Mobile robot obstacle avoidance method based on deep reinforcement learning
CN110703747B (en) Robot autonomous exploration method based on simplified generalized Voronoi diagram
WO2021135554A1 (en) Method and device for planning global path of unmanned vehicle
CN114384920B (en) Dynamic obstacle avoidance method based on real-time construction of local grid map
CN112097769B (en) Homing pigeon brain-hippocampus-imitated unmanned aerial vehicle simultaneous positioning and mapping navigation system and method
CN110181508B (en) Three-dimensional route planning method and system for underwater robot
CN116540731B (en) Path planning method and system integrating LSTM and SAC algorithms
Li et al. Learning view and target invariant visual servoing for navigation
CN111258311A (en) Obstacle avoidance method of underground mobile robot based on intelligent vision
CN111739066B (en) Visual positioning method, system and storage medium based on Gaussian process
Kojima et al. To learn or not to learn: Analyzing the role of learning for navigation in virtual environments
CN113593035A (en) Motion control decision generation method and device, electronic equipment and storage medium
CN116300909A (en) Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning
CN114185339A (en) Mobile robot path planning method in dynamic environment
CN112857370A (en) Robot map-free navigation method based on time sequence information modeling
Petrazzini et al. Proximal policy optimization with continuous bounded action space via the beta distribution
CN113689502B (en) Multi-information fusion obstacle measurement method
CN106127119A (en) Joint probabilistic data association method based on coloured image and depth image multiple features
CN110333513B (en) Particle filter SLAM method fusing least square method
CN113064422A (en) Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN117029846A (en) Generalized laser ranging path planning algorithm for mobile robot in complex environment
CN116774247A (en) SLAM front-end strategy based on multi-source information fusion of EKF
Xue et al. Real-time 3D grid map building for autonomous driving in dynamic environment
CN115950414A (en) Adaptive multi-fusion SLAM method for different sensor data
CN114153216B (en) Lunar surface path planning system and method based on deep reinforcement learning and block planning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant