CN114237235B - Mobile robot obstacle avoidance method based on deep reinforcement learning - Google Patents

Mobile robot obstacle avoidance method based on deep reinforcement learning

Info

Publication number
CN114237235B
CN114237235B (application CN202111460950.9A; earlier publication CN114237235A)
Authority
CN
China
Prior art keywords
robot
neural network
distance
rewards
pedestrian
Prior art date
Legal status
Active
Application number
CN202111460950.9A
Other languages
Chinese (zh)
Other versions
CN114237235A (en)
Inventor
穆宗昊
宋伟
廖建峰
周元海
金天磊
方伟
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202111460950.9A
Publication of CN114237235A
Application granted
Publication of CN114237235B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 - Control of position or course in two dimensions
    • G05D1/021 - Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 - Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223 - Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • G05D1/0231 - Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0246 - Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
    • G05D1/0257 - Control of position or course in two dimensions specially adapted to land vehicles using a radar

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Electromagnetism (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a mobile robot obstacle avoidance method based on deep reinforcement learning. Point cloud data are acquired by a laser radar and passed through convolutional feature extraction; the extracted features, together with the pedestrian positions, pedestrian speeds and the global path, are fed as inputs to a fully connected neural network; environmental rewards are defined, and robot actions are output by the PPO deep reinforcement learning algorithm. Compared with other planning- or learning-based navigation methods, the method requires neither pedestrian motion prediction nor sensor preprocessing, which simplifies the algorithm and makes it better suited to robot navigation in a multi-person environment. In addition, because the global path is added as an input, the applicable range of the algorithm is widened and its convergence time is shortened.

Description

Mobile robot obstacle avoidance method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of mobile robot navigation, in particular to a PPO-based mobile robot obstacle avoidance method in a multi-person environment.
Background
Autonomous navigation and obstacle avoidance are fundamental problems in robotics. A robot typically builds a two-dimensional grid map in which the gray value of each cell indicates whether that point is free space or an obstacle, and then runs a path planning algorithm on the map. Navigation usually combines global planning, commonly the conventional A* or Dijkstra algorithm, with local path planning, commonly the DWA or TEB algorithm. These graph-based planning methods are computationally cheap, but the generated trajectories often do not satisfy the dynamic constraints of the robot and handle dynamic obstacles such as people poorly.
Deep reinforcement learning has been one of the most actively studied directions in robot planning and navigation in recent years: like a human, a robot can find an optimal strategy through trial and error and thereby complete navigation tasks in complex environments. The PPO algorithm is currently the most widely used deep reinforcement learning algorithm; it is broadly applicable and can solve planning and navigation problems in continuous environments. For the obstacle avoidance problem in a multi-person environment, PPO therefore adapts to the environment better than conventional algorithms.
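For reference, the clipped surrogate objective that defines PPO (standard background, not reproduced in the original patent text) is:

```latex
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

where \hat{A}_t is the advantage estimate and \epsilon the clipping range; the clipping keeps policy updates small, which is what makes PPO stable in continuous control settings such as the one considered here.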
Existing obstacle avoidance algorithms based on deep reinforcement learning have made considerable progress, but their obstacle avoidance performance, training time and practical value still leave much room for improvement. The main reasons are that pedestrian motion models vary widely, that the final target provides little guidance when the path is long, and that unknown conditions in real use make the results unstable.
Therefore, a deep reinforcement learning obstacle avoidance method of high practical value is needed to meet the application requirement of strong adaptability to multiple dynamic pedestrian obstacles.
Disclosure of Invention
In order to overcome the defects of the prior art and give the mobile robot strong obstacle avoidance adaptability to multiple dynamic pedestrian obstacles, the invention adopts the following technical scheme:
a mobile robot obstacle avoidance method based on deep reinforcement learning comprises the following steps:
s1, acquiring sensor original data through a laser radar and a front camera carried by a mobile robot, and acquiring the position and speed of a pedestrian through a pedestrian positioning sensor;
s2, a two-dimensional or three-dimensional laser radar is adopted to map a scene;
s3, setting a navigation target point and acquiring a global planning path;
s4, designing an action space and a reward function of a PPO deep reinforcement learning algorithm, wherein the action space comprises a robot speed direction and a robot rewarding functionThe magnitude of the reward function comprises a reward R reaching the target point t (gold), reward R for path-extending walking t (path), penalty R for encountering obstacle t Punishment R of encountering pedestrian t (scope) penalty R for the step size used t (time) in which specific parameters can be adjusted to satisfy the prize R for reaching the target point as a whole t (gold) is much greater than other penalties, pedestrian and obstacle penalties progressively in distance;
s5, establishing an action neural network actor, acquiring laser radar data, front camera data, pedestrian information position and speed, and position information of the robot, a target point and a global planning path, and outputting speed information selected by the mobile robot;
s6, establishing a reward neural network critic, acquiring laser radar data, front camera data, pedestrian information position and speed, and position information of a robot, a target point and a global planning path, and outputting the maximum reward which can be obtained in the current state;
s7, in the constructed simulation scene and the actual test scene, a PPO deep reinforcement learning algorithm is adopted, real rewards of the output of each step of action neural network actor of the robot in the simulation environment are compared with rewards predicted by rewards neural network critic, iterative training is alternately carried out until obstacle avoidance training is completed, and the trained action neural network actor and rewards neural network critic are used in the actual scene.
Further, after the simulation training in step S7 is completed, the network parameters are recorded and applied to the actual scene, the environmental state quantity, the action space and the rewarding function are the same as those in the simulation environment, the neural network and the training method are the same as those in the simulation environment, the actual scene data are obtained through training, the network is corrected, and the actual training time can be accelerated through the training method combining simulation and real objects.
Further, the robot is an omni-wheel robot; the speed direction takes values from -180 degrees to 180 degrees and the speed magnitude takes values from 0 to 1 m/s.
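As an illustration only (the function name and clamping behavior below are assumptions, not part of the patent), the described action space maps to planar velocity commands for an omni-wheel base roughly as follows:

```python
import math

def action_to_velocity(direction_deg: float, speed: float) -> tuple:
    """Map an action (direction in degrees, speed magnitude in m/s) to
    planar velocity components for an omni-wheel base.

    direction_deg is clamped to [-180, 180] and speed to [0, 1] m/s,
    matching the action space described above."""
    direction_deg = max(-180.0, min(180.0, direction_deg))
    speed = max(0.0, min(1.0, speed))
    theta = math.radians(direction_deg)
    vx = speed * math.cos(theta)   # forward component in the robot frame
    vy = speed * math.sin(theta)   # lateral component in the robot frame
    return vx, vy
```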
Further, the map built in step S2 is a gray-scale map, and places where the map is inaccurate are corrected according to the actual situation, including adding wall obstacles that the laser cannot scan and removing burr points introduced during mapping.
Further, a set of target points is set in S3 and classified as far, middle and near according to distance.
Further, in S3 a global path planning method is adopted to obtain the global planned path; the path is a sequence of two-dimensional coordinate points, and the spacing between coordinate points is usually 5 cm.
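A minimal sketch of how the distance and angle features with respect to the goal and the global planned path could be computed from such a coordinate point sequence is given below; the function and variable names are illustrative assumptions, not identifiers from the patent.

```python
import math

def path_features(robot_xy, goal_xy, global_path):
    """Compute the state features described above: distance/angle from the
    robot to the goal and to the closest point of the global planned path.

    global_path is a list of (x, y) waypoints spaced roughly 5 cm apart."""
    rx, ry = robot_xy
    gx, gy = goal_xy
    d_goal = math.hypot(gx - rx, gy - ry)
    a_goal = math.atan2(gy - ry, gx - rx)

    # nearest waypoint on the global planned path
    px, py = min(global_path, key=lambda p: math.hypot(p[0] - rx, p[1] - ry))
    d_path = math.hypot(px - rx, py - ry)
    a_path = math.atan2(py - ry, px - rx)
    return d_goal, a_goal, d_path, a_path
```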
Further, for the reward function in S4, the current reward value R_t is calculated as:
R_t = R_t(goal) + R_t(path) + R_t(obstacle) + R_t(people) + R_t(time)
The target point reward R_t(goal) is calculated as follows: when the distance d_t^goal between the robot position p_t at time t and the target point position p_g is less than the target threshold D_g, the reward R_g is obtained; otherwise the reward is 0:
R_t(goal) = R_g, if d_t^goal < D_g; 0, otherwise.
The path reward R_t(path) is calculated as follows: let d_t^path be the distance between the robot position at time t and the closest point of the global path. When d_t^path is less than the corresponding distance d_{t-1}^path at time t-1, the reward is the difference between the two multiplied by the approach coefficient ω_1; otherwise the reward is the difference multiplied by the departure coefficient ω_2. This term reflects whether the robot moves along the planned global path:
R_t(path) = ω_1 (d_{t-1}^path - d_t^path), if d_t^path < d_{t-1}^path; ω_2 (d_{t-1}^path - d_t^path), otherwise.
The obstacle penalty R_t(obstacle) is calculated as follows: let d_t^obs be the distance between the robot position at time t and the nearest obstacle. When d_t^obs is less than the collision distance D_near, the penalty -R_near is obtained; when the distance lies between the collision distance D_near and the penalty distance D_far, the penalty decreases in magnitude as the distance grows, with α denoting the obstacle collision penalty coefficient; otherwise the penalty is 0:
R_t(obstacle) = -R_near, if d_t^obs < D_near; -α (D_far - d_t^obs), if D_near ≤ d_t^obs < D_far; 0, otherwise.
The pedestrian penalty R_t(people) is calculated in the same way: let d_t^ped be the distance between the robot position at time t and the nearest pedestrian. When d_t^ped is less than the collision distance D_near, the penalty -R_near is obtained; when the distance lies between D_near and D_far, the penalty decreases in magnitude as the distance grows, with α here denoting the pedestrian collision penalty coefficient; otherwise the penalty is 0:
R_t(people) = -R_near, if d_t^ped < D_near; -α (D_far - d_t^ped), if D_near ≤ d_t^ped < D_far; 0, otherwise.
The time penalty R_t(time) is the elapsed time t multiplied by a parameter β, where β denotes the time penalty coefficient:
R_t(time) = -β * t.
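As a concrete illustration, the piecewise reward defined above can be written as the following Python sketch. All thresholds and coefficients (R_g, D_g, ω_1, ω_2, R_near, D_near, D_far, α, β) are free parameters of the method, the numerical defaults below are placeholders, and the linear taper between D_near and D_far is one consistent reading of the description that the penalty decreases with distance.

```python
def total_reward(d_goal, d_path, d_path_prev, d_obstacle, d_person, t,
                 R_g=10.0, D_g=0.3,          # goal reward and arrival threshold
                 w1=2.0, w2=1.0,             # approach / departure path coefficients
                 R_near=10.0, D_near=0.4, D_far=1.5,
                 alpha_obs=1.0, alpha_ped=2.0, beta=0.01):
    # R_t(goal): fixed reward once the robot is within D_g of the target
    r_goal = R_g if d_goal < D_g else 0.0

    # R_t(path): reward for moving toward the global path, scaled by w1 or w2
    diff = d_path_prev - d_path
    r_path = w1 * diff if d_path < d_path_prev else w2 * diff

    # piecewise obstacle / pedestrian penalty that tapers between D_near and D_far
    def proximity_penalty(d, coeff):
        if d < D_near:
            return -R_near
        if d < D_far:
            return -coeff * (D_far - d)
        return 0.0

    r_obstacle = proximity_penalty(d_obstacle, alpha_obs)
    r_people = proximity_penalty(d_person, alpha_ped)

    # R_t(time): step penalty proportional to elapsed steps
    r_time = -beta * t
    return r_goal + r_path + r_obstacle + r_people + r_time
```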
further, the action neural network actor established in the S5 is a CNN convolutional neural network, the point cloud data lidar_data of the laser radar and the image data picture_data of the front camera are input, and the point cloud feature value lidar_feature and the image feature value picture_feature are extracted through convolution layer, pooling layer and normalization, the point cloud feature value lidar_feature, the image feature value picture_feature, pedestrian position picture_pos and speed picture_v and the distance between the robot and a target point are obtainedAnd angle->Position of closest point to global planned pathAnd angle->The input dimension is the sum of the dimensions of all data, and the output is a mobile machine after the calculation of the 5-layer fully-connected neural networkHuman-selected speed information robot_v, which includes speed magnitude and direction.
Further, the reward neural network critic established in S6 likewise uses CNN convolutional feature extraction: the laser radar point cloud data lidar_data and the image data picture_data of the front camera are input and passed through convolution layers, pooling layers and normalization to extract the laser point cloud feature value lidar_feature and the image feature value picture_feature. The laser point cloud feature value lidar_feature, the image feature value picture_feature, the pedestrian position people_pos and speed people_v, the distance and angle from the robot to the target point, and the distance and angle to the closest point of the global planned path are taken as input of a 5-layer fully connected neural network, which outputs the maximum reward reward_max obtainable in the current state.
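One possible PyTorch realization of the actor and critic described above is sketched below: convolutional feature extraction for the lidar scan and camera image, followed by a 5-layer fully connected head. The layer sizes, the 360-beam lidar, and the shared feature extractor are illustrative assumptions (the patent describes the actor and critic as separate networks); this is a sketch under those assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """1-D CNN over lidar ranges and 2-D CNN over camera images, each followed
    by pooling and normalization, as described above (sizes are placeholders)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.lidar_net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.BatchNorm1d(16), nn.AdaptiveAvgPool1d(8), nn.Flatten(),
            nn.Linear(16 * 8, feat_dim))
        self.image_net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.BatchNorm2d(16), nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim))

    def forward(self, lidar, image):
        return torch.cat([self.lidar_net(lidar), self.image_net(image)], dim=-1)

def mlp_head(in_dim, out_dim, hidden=256):
    """5-layer fully connected head (4 hidden layers plus the output layer)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim))

class ActorCritic(nn.Module):
    def __init__(self, extra_dim=8, feat_dim=64):
        super().__init__()
        self.features = FeatureExtractor(feat_dim=feat_dim)
        # extra_dim: pedestrian pos/vel + goal distance/angle + path distance/angle
        in_dim = 2 * feat_dim + extra_dim
        self.actor = mlp_head(in_dim, 2)    # speed direction and magnitude (robot_v)
        self.critic = mlp_head(in_dim, 1)   # predicted obtainable reward (reward_max)

    def forward(self, lidar, image, extra):
        x = torch.cat([self.features(lidar, image), extra], dim=-1)
        return self.actor(x), self.critic(x)
```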
Further, the training method in step S7 includes the steps of:
s71, acquiring laser radar point cloud data lidar_data, image data picture_data of a front camera, pedestrian position peoples, pedestrian speed peoples and distances between a robot and a target point from a simulation environmentAnd angle->Position of closest point to global planned path +.>And angle ofAs an environmental state quantity state;
s72, according to the action space, the rewarding function, the action neural network actor and the rewarding neural network critic, inputting an environmental state quantity state into the action neural network actor to obtain robot speed information robot_v, and inputting the robot speed information robot_v into a simulation environment to obtain rewarding report_now at the current moment;
s73, recording actions and rewards of each step in the training process, and updating network parameters according to the difference value of actual rewards and rewards predicted by the rewards neural network critic after a certain training time is reached, so that the output of an action neural network actor is updated towards the maximum rewards direction, and the output of the rewards neural network critic is updated towards a true value;
s73, repeating iteration until training is completed, and enabling the robot to avoid pedestrians in the simulation environment and reach a target point.
The invention has the advantages that:
In robot navigation tasks in a multi-person environment, the method addresses the difficulty of avoiding dynamic pedestrian obstacles. Compared with conventional algorithms, the neural network accounts for both predictive avoidance of dynamic pedestrian obstacles and the smoothness of the robot's motion trajectory. Compared with conventional reinforcement learning methods, the position and angle of the closest point on the global planned path are added as input states, and the change in distance to the path between consecutive moments is used as a reward; this gives the robot a local target and improves generality across different final target points. In addition, training first in simulation and then on the physical robot accelerates training and increases practical value. The obstacle avoidance method therefore has high general applicability and practical value, and is independent of the robot's hardware type.
Drawings
Fig. 1 is a flow chart of the present invention.
Fig. 2 is a schematic view of a scenario of the present invention.
Fig. 3 is a diagram showing the structure of the action neural network according to the present invention.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
A mobile robot obstacle avoidance method based on PPO in a multi-person environment is shown in figure 1, and comprises the following steps:
step one: as shown in fig. 2, a simulation scene and an actual test site are built, sensor raw data are obtained through a laser radar and a front camera carried by a mobile robot, and pedestrian positions and speeds are obtained through a pedestrian positioning sensor.
Step two: the scene is mapped with a two-dimensional or three-dimensional laser radar; the map is a gray-scale map, and places where it is inaccurate are corrected according to the actual situation, including adding wall obstacles that the laser cannot scan and removing burr points introduced during mapping.
Step three: navigation target points are set, generally several target points classified as far, middle and near according to distance; the A* global path planning method is adopted to obtain the global planned path, which is a sequence of two-dimensional coordinate points with a spacing of usually 5 cm between points.
Step four: the action space and reward function of the PPO deep reinforcement learning algorithm are designed. The action space comprises the speed direction and magnitude of the robot; since an omni-wheel robot is used, the speed direction takes values from -180 degrees to 180 degrees and the speed magnitude takes values from 0 to 1 m/s.
The reward function includes a reward R_t(goal) for reaching the target point, a reward R_t(path) for walking along the path, a penalty R_t(obstacle) for encountering obstacles, a penalty R_t(people) for encountering pedestrians, and a penalty R_t(time) for the number of steps used. The specific parameters can be adjusted so that, overall, the reward R_t(goal) for reaching the target point is much greater than the other penalties, and the pedestrian and obstacle penalties grow progressively as the distance decreases.
The current reward value R_t is calculated as:
R_t = R_t(goal) + R_t(path) + R_t(obstacle) + R_t(people) + R_t(time)
The target point reward R_t(goal): when the distance d_t^goal between the robot position p_t and the target point position p_g is less than the target threshold D_g, the reward R_g is obtained; otherwise the reward is 0. The specific formula is:
R_t(goal) = R_g, if d_t^goal < D_g; 0, otherwise.
The path reward R_t(path): when the distance d_t^path between the robot position at time t and the closest point of the global path is less than the corresponding distance d_{t-1}^path at time t-1, the reward is the difference between the two multiplied by the approach coefficient ω_1; otherwise it is the difference multiplied by the departure coefficient ω_2, reflecting whether the robot moves along the planned global path. The specific formula is:
R_t(path) = ω_1 (d_{t-1}^path - d_t^path), if d_t^path < d_{t-1}^path; ω_2 (d_{t-1}^path - d_t^path), otherwise.
The obstacle penalty R_t(obstacle): when the distance d_t^obs between the robot position and the nearest obstacle is less than the collision distance D_near, the penalty -R_near is obtained; when the distance lies between the collision distance D_near and the penalty distance D_far, the penalty decreases in magnitude with distance, with α the obstacle collision penalty coefficient; otherwise the penalty is 0. The specific formula is:
R_t(obstacle) = -R_near, if d_t^obs < D_near; -α (D_far - d_t^obs), if D_near ≤ d_t^obs < D_far; 0, otherwise.
The pedestrian penalty R_t(people): when the distance d_t^ped between the robot position and the nearest pedestrian is less than the collision distance D_near, the penalty -R_near is obtained; when the distance lies between D_near and D_far, the penalty decreases in magnitude with distance, with α here the pedestrian collision penalty coefficient; otherwise the penalty is 0. The specific formula is:
R_t(people) = -R_near, if d_t^ped < D_near; -α (D_far - d_t^ped), if D_near ≤ d_t^ped < D_far; 0, otherwise.
The time penalty R_t(time) is the elapsed time t multiplied by a parameter β, where β denotes the time penalty coefficient. The specific formula is:
R_t(time) = -β * t
step five: an action neural network actor is established, and the network structure is shown in figure 3. The input of the convolutional neural network CNN is that the laser radar is point cloud data lidar_data and image data picture_data of the front camera, and the point cloud characteristic value lidar_feature and the image characteristic value picture_feature are extracted through a convolutional layer, a pooling layer and normalization. Lidar_feature and picture_feature and pedestrian position peoples and speed peoples v, distance of robot from target pointAnd angle->Position of closest point to global planned path +.>And angle->The input of the 5-layer fully-connected neural network is taken as the input, and the input dimension is the sum of the dimensions of all data. And outputting the speed robot_v selected for the mobile robot through calculation of the 5-layer fully connected neural network.
Step six: the reward neural network critic is established; its structure is shown in figure 3. The input of the convolutional neural network CNN is the point cloud data lidar_data of the laser radar and the image data picture_data of the front camera, from which the point cloud feature value lidar_feature and the image feature value picture_feature are extracted through convolution layers, pooling layers and normalization. lidar_feature and picture_feature, together with the pedestrian position people_pos and speed people_v, the distance and angle from the robot to the target point, and the distance and angle to the closest point of the global planned path, are taken as the input of the 5-layer fully connected neural network; the input dimension is the sum of the dimensions of all data. After the computation of the 5-layer fully connected neural network, the maximum reward reward_max obtainable in the current state is output.
Step seven: and in the constructed simulation scene and the actual test site, adopting a PPO deep reinforcement learning algorithm.
First, the simulation and physical environments are built. The robot model is built with the gazebo simulation software; it is an omni-wheel robot, and parameters such as the laser radar, the front camera, the mass and the height are the same as those of the actual robot. Pedestrians follow a social force model and wander in the scene.
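For reference, a basic social force update for a single pedestrian (goal attraction plus exponential repulsion from the robot and other pedestrians) can be sketched as follows; the coefficients are illustrative and not specified in the patent.

```python
import numpy as np

def social_force_step(pos, vel, goal, others, dt=0.1,
                      desired_speed=1.2, tau=0.5, A=2.0, B=0.3):
    """One Euler step of a basic social force model for one pedestrian.

    pos, vel, goal: (2,) arrays; others: list of (2,) positions of the robot
    and other pedestrians that this pedestrian is repelled from."""
    # attractive force toward the goal at the desired walking speed
    direction = goal - pos
    direction = direction / (np.linalg.norm(direction) + 1e-6)
    f_goal = (desired_speed * direction - vel) / tau

    # exponential repulsion from nearby agents
    f_rep = np.zeros(2)
    for other in others:
        diff = pos - other
        dist = np.linalg.norm(diff) + 1e-6
        f_rep += A * np.exp(-dist / B) * diff / dist

    vel = vel + (f_goal + f_rep) * dt
    pos = pos + vel * dt
    return pos, vel
```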
The training proceeds as follows. First, the laser radar point cloud data lidar_data, the image data picture_data of the front camera, the pedestrian positions people_pos and speeds people_v, the distance and angle from the robot to the target point, and the distance and angle to the closest point of the global planned path are acquired from the simulation environment as the environmental state quantity state. The action space and reward function are constructed according to the method of step four, and the action neural network actor and the reward neural network critic are constructed according to the methods of steps five and six. The state is input into the actor network to obtain the robot speed robot_v, and robot_v is input into the simulation environment to obtain the reward reward_now at the current moment. The action and reward of every step are recorded during training; after a certain number of training steps, the network parameters are updated according to the difference between the actual reward and the predicted reward, so that the output of the action neural network actor is updated toward the direction of maximum reward and the output of the reward neural network critic is updated toward the true value. The iteration is repeated until the robot can avoid pedestrians in the simulation environment and reach the target point.
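The interaction loop just described (state in, velocity out, reward recorded, update after a fixed number of steps) might look like the following sketch. The env object is a hypothetical wrapper around the gazebo simulation exposing reset() and step(); it is not an interface defined by the patent, and the ActorCritic model is the sketch given earlier.

```python
import torch

def collect_rollout(env, actor_critic, steps=2048, gamma=0.99):
    """Collect one batch of experience: feed the state to the actor, send the
    sampled velocity to the simulator, and record the per-step reward.
    `env` is a hypothetical wrapper exposing reset() -> state and
    step(action) -> (state, reward, done)."""
    states, actions, logps, rewards, dones = [], [], [], [], []
    state = env.reset()
    for _ in range(steps):
        lidar, image, extra = state
        with torch.no_grad():
            mean, _ = actor_critic(lidar, image, extra)
            dist = torch.distributions.Normal(mean, 0.1)
            action = dist.sample()
            logp = dist.log_prob(action).sum(-1)
        next_state, reward, done = env.step(action)   # reward_now at this step
        states.append(state); actions.append(action)
        logps.append(logp); rewards.append(reward); dones.append(done)
        state = env.reset() if done else next_state

    # discounted returns, used as the target the critic is pushed toward
    returns, running = [], 0.0
    for r, d in zip(reversed(rewards), reversed(dones)):
        running = r + gamma * running * (0.0 if d else 1.0)
        returns.insert(0, running)
    return states, actions, logps, returns
```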
After the simulation training is completed, the network parameters are recorded and applied to the actual scene. The physical setup uses an architecture in which the robot collects data and executes actions while a cloud backend receives the data, trains, and issues commands. UWB base stations are installed on site, pedestrians wear UWB tags, and the pedestrian position and speed parameters are obtained from the backend. The robot carries a laser radar and a front camera and uploads the data directly to the cloud.
The environmental state quantities, action space, reward function, neural networks and training method are the same as in the simulation environment; actual scene data are obtained through training and used to correct the networks. This training method, which combines simulation and the real robot, shortens the actual training time.
The above embodiments are only intended to illustrate the technical solution of the invention and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications and substitutions do not depart from the spirit of the technical solutions of the embodiments of the invention.

Claims (9)

1. The mobile robot obstacle avoidance method based on deep reinforcement learning is characterized by comprising the following steps of:
s1, acquiring sensor original data through a laser radar and a camera carried by a mobile robot, and acquiring the position and the speed of a pedestrian through a pedestrian positioning sensor;
s2, constructing a scene by using a laser radar;
s3, setting a navigation target point and acquiring a global planning path;
s4, designing an action space of a deep reinforcement learning algorithm and a reward function, wherein the action space comprises the speed direction and the size of the robot, and the reward function comprises rewards R reaching target points t (gold), reward R for path-extending walking t (path), penalty R for encountering obstacle t Punishment R of encountering pedestrian t (scope) penalty R for the step size used t (time) in which the prize R to the target point is satisfied as a whole t (gold) is greater than other penalties, pedestrian and obstacle penalties progressively by distance; current prize value R t The calculation method is as follows:
R t =R t (goal)+R t (path)+R t (obstacle)+R t (people)+R t (time)
wherein the target point rewards R t (gold) calculation method, is the robot position at time tDistance between the target point position pg>Less than threshold D from the target point g When get rewards R g Otherwise, the prize is 0, and the specific formula is as follows:
wherein the target point rewards R t (path) calculation method, which is the robot position at time tAnd +.about.nearest point position from global path>Distance between->Distance less than t-1 moment +.>When get rewards R t (path) is the difference between the two times the approach coefficient ω 1 Otherwise the reward is the difference between the two multiplied by the distance coefficient omega 2 The specific formula is shown as follows:
wherein penalty R for obstacles t The (obstacle) calculating method is that the robot position is at the moment tAnd the nearest obstacle locationDistance between->Less than the collision distance D near When penalty-R is obtained near When the distance is at the collision distance D near Distance from punishment D far And when the penalty is reduced along with the distance, alpha represents an object collision penalty coefficient, the rest penalty is 0, and the specific formula is shown as follows:
wherein penalty R for pedestrians t (scope) calculation method, which is the robot position at time tAnd nearest pedestrian position->Distance between->Less than the collision distance D near When penalty-R is obtained near When the distance is at the collision distance D near Distance from punishment D far When the penalty is increased along with the distance, alpha represents a pedestrian collision penalty coefficient, the rest penalty is 0, and the specific formula is shown as follows:
wherein penalty of time R t The (time) calculating method is that the time t is multiplied by a parameter beta, and beta represents a time penalty coefficient, and the specific formula is shown as follows:
R t (time)=-β*t
s5, an action neural network actor is established, laser radar data, camera data, pedestrian information positions and speeds and position information of a robot, a target point and a global planning path are obtained, and speed information selected by the mobile robot is output;
s6, establishing a reward neural network critic, acquiring laser radar data, camera data, pedestrian information position and speed, and position information of a robot, a target point and a global planning path, and outputting the maximum reward which can be obtained in the current state;
s7, constructing a simulation scene, adopting a deep reinforcement learning algorithm, comparing the real rewards of the output of each step of action neural network actor of the robot in the simulation environment with rewards predicted by rewards neural network critic, alternately performing iterative training until obstacle avoidance training is completed, and using the trained action neural network actor and rewards neural network critic in the actual scene.
2. The method for avoiding the obstacle of the mobile robot based on the deep reinforcement learning of claim 1, wherein after the simulation training in the step S7 is completed, the network parameters are recorded and applied to the actual scene, and the actual scene data is obtained through training to correct the network.
3. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein the robot is an omni-wheel robot, the speed direction takes values from -180 degrees to 180 degrees, and the speed magnitude takes values from 0 to 1 m/s.
4. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein the map built in S2 is a gray-scale map, and places where the map is inaccurate are corrected according to the actual situation.
5. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein a set of target points is set in step S3 and classified as far, middle and near according to distance.
6. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein a global path planning method is adopted in step S3 to obtain the global planned path, and the path is a sequence of two-dimensional coordinate points.
7. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein the action neural network actor established in step S5 is a CNN convolutional neural network whose input is the point cloud data lidar_data of the laser radar and the image data picture_data of the camera, from which the laser point cloud feature value lidar_feature and the image feature value picture_feature are extracted through convolution layers, pooling layers and normalization; the laser point cloud feature value lidar_feature, the image feature value picture_feature, the pedestrian position people_pos and speed people_v, the distance and angle from the robot to the target point, and the distance and angle to the closest point of the global planned path are taken as input of the fully connected neural network, whose computation outputs the speed information robot_v selected for the mobile robot, robot_v comprising the speed magnitude and direction.
8. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein the reward neural network critic established in S6 is a CNN convolutional neural network whose input is the laser radar point cloud data lidar_data and the image data picture_data of the camera, from which the laser point cloud feature value lidar_feature and the image feature value picture_feature are extracted through convolution layers, pooling layers and normalization; the laser point cloud feature value lidar_feature, the image feature value picture_feature, the pedestrian position people_pos and speed people_v, the distance and angle from the robot to the target point, and the distance and angle to the closest point of the global planned path are taken as input of the fully connected neural network, whose computation outputs the maximum reward reward_max obtainable in the current state.
9. The mobile robot obstacle avoidance method based on deep reinforcement learning according to claim 1, wherein the training method in step S7 comprises the steps of:
s71, acquiring laser radar point cloud data lidar_data, image data picture_data of a camera, pedestrian position peoples, pedestrian speed peoples and distance between a robot and a target point from a simulation environmentAnd angle->Position of closest point to global planned path +.>And angle->As an environmental state quantity state;
s72, according to the action space, the rewarding function, the action neural network actor and the rewarding neural network critic, inputting an environmental state quantity state into the action neural network actor to obtain robot speed information robot_v, and inputting the robot speed information robot_v into a simulation environment to obtain rewarding report_now at the current moment;
s73, recording actions and rewards of each step in the training process, and updating network parameters according to the difference value of actual rewards and rewards predicted by the rewards neural network critic, so that the output of an action neural network actor is updated to the maximum rewards direction, and the output of the rewards neural network critic is updated to a true value;
s73, repeating iteration until training is completed.
CN202111460950.9A 2021-12-02 2021-12-02 Mobile robot obstacle avoidance method based on deep reinforcement learning Active CN114237235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111460950.9A CN114237235B (en) 2021-12-02 2021-12-02 Mobile robot obstacle avoidance method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111460950.9A CN114237235B (en) 2021-12-02 2021-12-02 Mobile robot obstacle avoidance method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114237235A (en) 2022-03-25
CN114237235B (en) 2024-01-19

Family

ID=80752987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111460950.9A Active CN114237235B (en) 2021-12-02 2021-12-02 Mobile robot obstacle avoidance method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114237235B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114779772B (en) * 2022-04-13 2023-08-08 泉州装备制造研究所 Path planning method and device integrating global algorithm and local algorithm
CN115291616B (en) * 2022-07-25 2023-05-26 江苏海洋大学 AUV dynamic obstacle avoidance method based on near-end strategy optimization algorithm
CN115790608B (en) * 2023-01-31 2023-05-30 天津大学 AUV path planning algorithm and device based on reinforcement learning
CN117873089A (en) * 2024-01-10 2024-04-12 南京理工大学 Multi-mobile robot cooperation path planning method based on clustering PPO algorithm

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109407676A (en) * 2018-12-20 2019-03-01 哈尔滨工业大学 The moving robot obstacle avoiding method learnt based on DoubleDQN network and deeply
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110750096A (en) * 2019-10-09 2020-02-04 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN111897316A (en) * 2020-06-22 2020-11-06 北京航空航天大学 Multi-aircraft autonomous decision-making method under scene fast-changing condition
CN113359717A (en) * 2021-05-26 2021-09-07 浙江工业大学 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111578940B (en) * 2020-04-24 2021-05-11 哈尔滨工业大学 Indoor monocular navigation method and system based on cross-sensor transfer learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109407676A (en) * 2018-12-20 2019-03-01 哈尔滨工业大学 The moving robot obstacle avoiding method learnt based on DoubleDQN network and deeply
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110750096A (en) * 2019-10-09 2020-02-04 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN111897316A (en) * 2020-06-22 2020-11-06 北京航空航天大学 Multi-aircraft autonomous decision-making method under scene fast-changing condition
CN113359717A (en) * 2021-05-26 2021-09-07 浙江工业大学 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Deep Q-network-based path planning algorithm for unmanned surface vehicles; 随博文; 黄志坚; 姜宝祥; 郑欢; 温家一; Journal of Shanghai Maritime University (Issue 03); full text *
Research on neural-network-based reinforcement learning for service robot navigation; 陈双; 李龙; 罗海南; Modern Computer (Issue 12); full text *

Also Published As

Publication number Publication date
CN114237235A (en) 2022-03-25

Similar Documents

Publication Publication Date Title
CN114237235B (en) Mobile robot obstacle avoidance method based on deep reinforcement learning
CN110703747B (en) Robot autonomous exploration method based on simplified generalized Voronoi diagram
WO2021135554A1 (en) Method and device for planning global path of unmanned vehicle
CN114384920B (en) Dynamic obstacle avoidance method based on real-time construction of local grid map
CN112097769B (en) Homing pigeon brain-hippocampus-imitated unmanned aerial vehicle simultaneous positioning and mapping navigation system and method
CN110181508B (en) Three-dimensional route planning method and system for underwater robot
CN116540731B (en) Path planning method and system integrating LSTM and SAC algorithms
Li et al. Learning view and target invariant visual servoing for navigation
CN111258311A (en) Obstacle avoidance method of underground mobile robot based on intelligent vision
CN111739066B (en) Visual positioning method, system and storage medium based on Gaussian process
Kojima et al. To learn or not to learn: Analyzing the role of learning for navigation in virtual environments
CN113593035A (en) Motion control decision generation method and device, electronic equipment and storage medium
CN116300909A (en) Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning
CN114185339A (en) Mobile robot path planning method in dynamic environment
CN112857370A (en) Robot map-free navigation method based on time sequence information modeling
Petrazzini et al. Proximal policy optimization with continuous bounded action space via the beta distribution
CN113689502B (en) Multi-information fusion obstacle measurement method
CN106127119A (en) Joint probabilistic data association method based on coloured image and depth image multiple features
CN110333513B (en) Particle filter SLAM method fusing least square method
CN113064422A (en) Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN117029846A (en) Generalized laser ranging path planning algorithm for mobile robot in complex environment
CN116774247A (en) SLAM front-end strategy based on multi-source information fusion of EKF
Xue et al. Real-time 3D grid map building for autonomous driving in dynamic environment
CN115950414A (en) Adaptive multi-fusion SLAM method for different sensor data
CN114153216B (en) Lunar surface path planning system and method based on deep reinforcement learning and block planning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant