WO2022160430A1 - Method for obstacle avoidance of robot in the complex indoor scene based on monocular camera - Google Patents
- Publication number
- WO2022160430A1 (PCT/CN2021/081649, CN2021081649W)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0231—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
- G05D1/0246—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
- G05D1/0253—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means extracting relative motion information from a plurality of images taken successively, e.g. visual odometry, optical flow
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/778—Active pattern-learning, e.g. online learning of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
- G06T2207/20132—Image cropping
Definitions
- The disclosure belongs to the field of robot navigation and obstacle avoidance; its concrete result is autonomous navigation and obstacle avoidance for robots, and it particularly relates to a method for fully and effectively perceiving complex obstacles.
- In the robot obstacle avoidance task, the robot must navigate autonomously to the target point in a complex scene without colliding with any obstacle, which has great practical application value.
- Robot obstacle-avoidance applications, such as sweeping robots, autonomous driving, smart warehouses, and smart logistics, have achieved significant performance improvements.
- Some recent work abandons the Lidar sensor, uses the depth and color maps captured by an RGB-D camera as input, maps them directly to actions, and trains reinforcement learning end to end.
- Compared with laser data, images carry rich semantic information, but also much redundant information that does not help obstacle avoidance, which makes reinforcement learning algorithms hard to converge and train, widens the gap between virtual and reality, and makes strategies hard to transfer.
- In addition, depth cameras produce heavy noise in sunlit indoor environments and can fail almost completely.
- The traditional method of mapping the depth map to a point cloud to remove ground interference cannot perceive low obstacles on the ground, such as clothing or swimming pools. Therefore, methods based on RGB-D end-to-end learning also have many problems: they cannot fully perceive a complex indoor environment and may even fail to navigate and avoid obstacles safely.
- Therefore, based on investigation and analysis of existing obstacle avoidance navigation technology, the present disclosure combines the advantages of Lidar and RGB cameras while discarding their disadvantages, and constructs "pseudo laser" data, thereby realizing the autonomous navigation and obstacle avoidance task in complex scenes.
- The input of the method is the image taken by the monocular RGB camera on the robot platform, and the output is the action the robot should take, consisting of linear and angular velocity. The method can effectively perceive different types of complex obstacles in indoor scenes, helping the reinforcement learning module learn and decide efficiently.
- The purpose of the present disclosure is to realize an efficient robot obstacle avoidance method by mapping monocular RGB images in complex scenes to "pseudo laser" data.
- the method includes an environment perception stage and a control decision stage.
- the environment perception stage includes a depth prediction module, a semantic segmentation module, and a depth slicing module;
- the control decision stage includes a feature extraction guidance module and a reinforcement learning module.
- the method of the present disclosure is suitable for complex obstacles of various shapes and sizes.
- A method for obstacle avoidance of a robot in complex indoor scenes based on a monocular camera includes the following steps:
- Step 1 Loading robot simulation model and building a training and testing simulation environment
- Step 2 Getting semantic depth map
- Obtaining an RGB image from the monocular camera carried by the TurtleBot-ROS robot, and inputting it into the Fastdepth depth prediction network to obtain the depth map under the current field of view; selecting the lower half of the depth map as an intermediate result. Ground pixel information in this intermediate result would interfere with obstacle avoidance and cause failures, so the RGB image is also input into the CCNet semantic segmentation model to obtain a two-class semantic segmentation mask, where 0 represents a ground pixel and 1 represents the background.
- The semantic segmentation mask and the depth map are multiplied pixel by pixel to obtain the semantic depth map, where the value of each pixel is the depth distance under the current viewing angle; the interfering ground depth values are removed at the same time;
- Step 3 Depth slicing and data enhancement module
- The pooling window size is (240, 1) and the step size is 1. Each pooling operation selects the minimum value in the window as output; performing the pooling over every column of the image yields the "pseudo laser" data;
- Step 4 Controlling decision stage
- The deep reinforcement learning module adopts the PPO algorithm, and the network structure is composed of 3 convolutional layers and 3 fully connected layers. To make the experimental robot reach the target position steadily and safely, the state input includes three parts: observation data, target-point distance, and speed. The observation data is the "pseudo laser" data obtained in step 3; the distance and speed of the target point are obtained from the robot's onboard odometer. A feature extraction guidance layer is proposed: the data features of the three modalities are extracted and fused by three convolutional layers, a feature mask is obtained through sigmoid activation and multiplied with the "pseudo laser" observation data, and the result is sent to the deep reinforcement learning module. Information helpful to the obstacle avoidance strategy is thus extracted from the multi-modal data and combined with the "pseudo laser" observation data, making the subsequent feature extraction more targeted and speeding up convergence of the network;
- Step 5 Forming a monocular obstacle avoidance navigation network and output decision results
- Connecting steps 2, 3, and 4: the input image from the monocular RGB camera is processed to obtain the depth map and the semantic segmentation mask, which are multiplied pixel by pixel and cropped. After the dynamic minimum pooling operation, the "pseudo laser" observation data is obtained. Three consecutive frames of "pseudo laser" observation data are input into the deep reinforcement learning module together with the distance and speed of the target point. The feature extraction guidance layer assigns different attention to each dimension of the "pseudo laser" observation data. After multi-layer convolution, pooling, and full connection, an LSTM layer adds temporal correlation over the entire path, and finally the robot's decision action at the current moment is output, achieving autonomous obstacle avoidance and navigation.
- The disclosure addresses the difficulty of fully perceiving complex obstacles (non-convex irregular obstacles, ferrous metals, complex ground obstacles) in indoor robot obstacle avoidance, a difficulty that otherwise leads to obstacle avoidance failure, and helps the robot use the semantic information of the environment to remove interference from redundant pixels, enabling efficient reinforcement learning training and decision-making.
- The present disclosure proposes a reinforcement learning mapping from a single RGB image directly to the robot's obstacle avoidance navigation action. The method relies on "pseudo laser" data and makes efficient decisions by encoding semantic information into the laser data. Its accuracy is demonstrated through comparative experiments, in which the method achieved the best performance on both commonly used indicators, average success rate and average time, and it has great advantages in complex scenarios.
- The disclosure is suitable for obstacle avoidance and navigation tasks in different complex indoor scenes: (a) scenes containing non-convex irregular obstacles; (b) scenes containing black, smooth metal obstacles; (c) scenes containing ground obstacles such as scattered clothing, glass, and swimming pools. The effectiveness and applicability of the method in these different scenarios are demonstrated.
- Figure 1 is the network structure of the present disclosure.
- Figure 2 is the visualization result of the experiment of the embodiment of the present disclosure.
- The state includes the "pseudo laser" data, the distance to the target point, and the velocity at the previous moment. The action consists of the linear and angular velocity of the wheeled robot. The reward function depends on the distance to the target at each moment (moving closer yields a positive return, moving away a negative one); a collision yields -15, and reaching the target point yields 15.
- The robot is encouraged not to take too large an action at each step; that is, the angular velocity cannot exceed 1.7 times the angular velocity at the previous moment.
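As a concrete illustration, the reward and the action constraint described in the two items above can be sketched as follows. The weight on the distance term is an assumption: the text fixes only its sign convention and the -15 / +15 terminal rewards.

```python
import numpy as np

R_ARRIVAL = 15.0     # reward for reaching the target point
R_COLLISION = -15.0  # reward for a collision

def step_reward(prev_dist, curr_dist, collided, reached, dist_weight=2.5):
    """Positive return when moving toward the target, negative otherwise.
    dist_weight is a hypothetical scaling, not stated in the text."""
    if reached:
        return R_ARRIVAL
    if collided:
        return R_COLLISION
    return dist_weight * (prev_dist - curr_dist)

def clip_angular(omega_cmd, omega_prev, factor=1.7):
    """The commanded angular velocity may not exceed 1.7x the previous one."""
    limit = factor * abs(omega_prev)
    return float(np.clip(omega_cmd, -limit, limit))
```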
- The reinforcement learning algorithm is implemented in PyTorch. Stochastic gradient descent is used for the reinforcement learning network, with a momentum of 0.9, a weight decay of 1e-4, a learning rate of 5e-5, a decay factor of 0.99, a KL divergence parameter of 15e-4, and a maximum of 150 steps.
- The learning process is terminated after 1.5 million training paths; training the strategy takes about 40 hours on a computer equipped with an i7-7700 CPU and an NVIDIA GTX 1080Ti GPU.
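The stated training setup can be written down as a PyTorch configuration sketch. The policy network here is only a placeholder, and mapping "attenuation factor" to the PPO discount and "maximum step size" to the per-episode step limit are assumptions.

```python
import torch
import torch.nn as nn

# Placeholder policy; the real network is 3 conv + 3 FC layers with an LSTM.
policy = nn.Sequential(nn.Linear(646, 128), nn.ReLU(), nn.Linear(128, 2))

# Hyperparameters as stated in the text.
optimizer = torch.optim.SGD(
    policy.parameters(),
    lr=5e-5,            # learning rate
    momentum=0.9,       # momentum value
    weight_decay=1e-4,  # "weight attenuation"
)

GAMMA = 0.99              # "attenuation factor" (assumed: PPO discount)
KL_COEF = 15e-4           # KL divergence parameter
MAX_EPISODE_STEPS = 150   # "maximum step size" (assumed: steps per episode)
TOTAL_PATHS = 1_500_000   # training terminates after 1.5 million paths
```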
- The method is compared with the traditional method ORCA and a recent learned multi-robot distributed obstacle avoidance strategy to verify its effectiveness, and ablation experiments are performed on all proposed modules to prove the effectiveness of each part.
- Figure 1 is the network structure of the monocular obstacle avoidance navigation network.
- the network is composed of an environment perception stage and a control decision stage, which specifically includes a depth prediction module, a semantic mask module, a depth slicing module, a feature extraction guidance module, a reinforcement learning module, and data enhancement.
- the network takes monocular RGB images as input, and after obtaining the semantic depth map, it performs a dynamic minimization operation to obtain "pseudo-laser" data, which is used as the state input of reinforcement learning to generate the final robot decision-making action.
- Figure 2 shows the process visualization of the monocular visual obstacle avoidance navigation framework, in which column (A) is a chair obstacle scene; (B) a table obstacle scene; (C) a clothing obstacle scene; and (D) a glass obstacle scene.
- the monocular camera on the robot platform captures the RGB image, predicts the semantic depth map, and then slices it to generate "pseudo laser” data.
- Comparison of the last two rows, "pseudo laser" data versus Lidar data, shows that the "pseudo laser" data captures more complete environmental information, enabling efficient reinforcement learning training and better interaction with the environment.
- A method for obstacle avoidance of a robot in complex indoor scenes based on a monocular camera includes the following steps:
- Step 1 Load the robot simulation model and build a training and testing simulation environment
- The URDF model of the TurtleBot-ROS robot is used as the experimental robot; Block, Crossing, and Passing in ROS-Stage are used as the training environments, and 24 identical TurtleBot-ROS robots are deployed for distributed training of the control decision module; the cafe environment in ROS-Gazebo is used as the background of the test scene, and complex obstacles (tables, chairs, wardrobes, moving pedestrians, etc.) are added manually in Gazebo to test the effectiveness of the entire vision system;
- Step 2 Get semantic depth map
- the RGB image is input into the CCNet semantic segmentation model to obtain a two-class semantic segmentation mask, where 0 represents the ground pixel and 1 represents the background.
- The semantic segmentation mask and the depth map are multiplied pixel by pixel to obtain the semantic depth map, where the value of each pixel is the depth distance under the current viewing angle; the interfering ground depth values are removed at the same time;
- Step 3 Depth slicing and data enhancement module
- The pooling window size is (240, 1) and the step size is 1; each pooling operation selects the minimum value in the window as output. After pooling 640 times (once per image column), data of size (1, 640) is obtained: the "pseudo laser" data.
- "Pseudo-laser” not only retains the advantages of simple, easy-to-learn, and easy-to-transfer Lidar data, but also retains the semantic information in the visual image. Since the data is obtained from a two-dimensional image through a minimum pooling operation, it can fully perceive the complex obstacles in the environment, encode semantics into the laser of each dimension, and support the subsequent efficient reinforcement learning and the implementation of safe obstacle avoidance strategies.
- Sensor data obtained in a virtual environment is often perfect, but in a real environment, when one object partially occludes another, the observation usually has large errors near the object boundary; such noise reduces the accuracy of the algorithm or even makes it fail. Therefore, a data enhancement method is introduced: noise interference is applied to the observation data of the virtual environment during training.
- To identify noise boundaries in the training laser measurements, a boundary is assumed wherever the difference between two adjacent values in the vector exceeds a threshold of 0.5; the values around the two adjacent endpoints are then replaced by linear interpolation with a window size of (1, 8).
- Gaussian white noise with a variance of 0.08 is adaptively added to all laser observation data. This data enhancement allows a strategy trained in a virtual environment to be transferred directly to real scenes full of noise.
- Step 4 Control decision stage
- the deep reinforcement learning module adopts the PPO algorithm, and the network structure is composed of 3 layers of convolutional layers and 3 layers of fully connected layers.
- the state input includes three parts: observation data, target distance, and speed.
- the observation data is the "pseudo laser" data obtained in step 3, and the distance and speed of the target point are obtained by the onboard odometer of the robot.
- Direct fusion and indirect fusion are the two commonly used approaches to combining the multi-modal inputs.
- Direct fusion in the channel dimension is not conducive to learning obstacle avoidance strategies.
- Blind indirect extraction ignores useful information in the observation data and captures useless information.
- a feature extraction guidance layer is proposed.
- The data features of the three modalities are extracted and fused by three convolutional layers; a feature mask is then obtained through sigmoid activation and multiplied with the "pseudo laser" observation data, and the result is sent to the deep reinforcement learning module. This combines the advantages of both previous methods.
- the information that is helpful for obstacle avoidance strategies is extracted from the multi-modal data, and then it is combined with the observation data so that the subsequent feature extraction process is more targeted and the convergence of the network is accelerated.
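A minimal PyTorch sketch of the feature extraction guidance layer described above. Channel counts, kernel sizes, and the broadcasting of the target and velocity inputs to the laser's length are assumptions; the text fixes only the three-conv fusion, the sigmoid mask, and the multiplication with the pseudo-laser observation.

```python
import torch
import torch.nn as nn

class FeatureGuidance(nn.Module):
    """Fuses three input modalities with three conv layers, produces a
    sigmoid mask, and reweights the pseudo-laser observation with it.
    Channel sizes and kernel widths are illustrative assumptions."""
    def __init__(self, in_ch=3, hidden=32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, pseudo_laser, target_info, velocity):
        # Stack the three modalities as channels: (B, 3, N). Broadcasting
        # target_info and velocity (each (B, 1)) to the laser's length is
        # one plausible way to align the modalities.
        n = pseudo_laser.shape[-1]
        x = torch.stack([
            pseudo_laser,
            target_info.expand(-1, n),
            velocity.expand(-1, n),
        ], dim=1)
        mask = torch.sigmoid(self.fuse(x))       # (B, 1, N), values in (0, 1)
        return pseudo_laser.unsqueeze(1) * mask  # reweighted observation
```

Each laser dimension is thus scaled by a learned attention value in (0, 1) before entering the reinforcement learning network.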
- Because a monocular RGB camera is used as the sensor, the robot has a narrow forward field of view of 60°. Therefore, the second fully connected layer of the reinforcement learning network is replaced with an LSTM layer, adding temporal correlation to the reinforcement learning module so that the robot can make decisions based on all observations along the entire path.
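The resulting policy network, with the second fully connected layer replaced by an LSTM, might look like the following sketch; all layer widths, kernel sizes, and strides are illustrative assumptions, since the text specifies only "3 conv layers + 3 FC layers" and the LSTM substitution.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """3 conv layers + 3 FC layers, with the second FC replaced by an LSTM
    so decisions can depend on the whole path. Sizes are illustrative."""
    def __init__(self, laser_dim=640, aux_dim=3, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, 7, 3), nn.ReLU(),
            nn.Conv1d(16, 32, 5, 2), nn.ReLU(),
            nn.Conv1d(32, 32, 3, 2), nn.ReLU(),
            nn.Flatten(),
        )
        # Dry run to compute the flattened conv output size.
        conv_out = self.conv(torch.zeros(1, 1, laser_dim)).shape[-1]
        self.fc1 = nn.Linear(conv_out + aux_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)  # replaces FC2
        self.fc3 = nn.Linear(hidden, 2)  # linear and angular velocity

    def forward(self, laser, aux, state=None):
        # laser: (B, 640) pseudo-laser; aux: (B, 3) target distance/angle, speed
        feat = self.conv(laser.unsqueeze(1))
        x = torch.relu(self.fc1(torch.cat([feat, aux], dim=-1)))
        x, state = self.lstm(x.unsqueeze(1), state)  # carry state along the path
        return self.fc3(x.squeeze(1)), state
```

Passing the returned `state` back in on the next step is what lets the LSTM accumulate observations over the whole trajectory.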
- Step 5 Form a monocular obstacle avoidance navigation network and output decision results
- Steps 2, 3, and 4 are connected: the input image from the monocular RGB camera is processed to obtain the depth map and semantic mask, which are multiplied pixel by pixel and then cropped.
- After the dynamic minimum pooling operation, the "pseudo laser" data is obtained.
- The "pseudo laser" data of three consecutive frames is input into the reinforcement learning network together with the distance and speed of the target point.
- The feature extraction guidance layer applies a different degree of attention to each dimension of the "pseudo laser" data.
- The LSTM layer adds temporal correlation over the entire path, and finally the robot's decision action at the current moment is output, achieving autonomous obstacle avoidance and navigation.
Abstract
The invention discloses a monocular-camera-based method for robot obstacle avoidance in complex indoor scenes, belonging to the field of robot navigation and obstacle avoidance. The monocular obstacle avoidance navigation network of the present invention is composed of an environment perception stage and a control decision-making stage, and specifically includes a depth prediction module, a semantic mask module, a depth slicing module, a feature extraction guidance module, a reinforcement learning module, and data enhancement. The network takes monocular RGB images as input and, after obtaining the semantic depth map, performs a dynamic minimization operation to obtain "pseudo laser" data, which is used as the state input of reinforcement learning to generate the final robot decision-making action. The invention solves the difficulty of fully perceiving complex obstacles in indoor robot obstacle avoidance, which otherwise leads to obstacle avoidance failure, and helps the robot use the semantic information of the environment to remove interference from redundant pixels, thereby performing efficient reinforcement learning training and decision-making, with validity and applicability in different scenarios.
Description
The disclosure belongs to the field of robot navigation and obstacle avoidance; its concrete result is autonomous navigation and obstacle avoidance for robots, and it particularly relates to a method for fully and effectively perceiving complex obstacles.
In the robot obstacle avoidance task, the robot must navigate autonomously to the target point in a complex scene without colliding with any obstacle, which has great practical application value. With the rapid development of artificial intelligence technology, robot obstacle-avoidance applications, such as sweeping robots, autonomous driving, smart warehouses, and smart logistics, have achieved significant performance improvements.
However, indoor obstacle avoidance scenes often contain complex obstacles, such as non-convex irregular objects like tables and chairs, black metal objects, and obstacles such as clothing lying flat on the ground. These objects seriously affect traditional one-dimensional Lidar obstacle avoidance strategies, yet no related research deals with such objects. Their presence prevents the Lidar system from fully perceiving the environment and thus renders the navigation and obstacle avoidance system ineffective. Specifically, for irregular objects such as tables, one-dimensional Lidar can only perceive the table legs, which makes the robot mistakenly believe the gap between the legs is passable, so a taller robot will collide with the tabletop. Black metal objects seriously interfere with the Lidar by absorbing the emitted laser light, rendering it completely ineffective. For complex ground obstacles, traditional methods cannot perceive low obstacles, even ones embedded in the ground, such as swimming pools, which are nonetheless impassable. Therefore, fully and efficiently perceiving complex obstacles is an urgent task in the field of robot obstacle avoidance.
Most existing robot obstacle avoidance navigation methods use deep reinforcement learning, which is popular because it learns autonomously without manually collected labeled data sets. Reinforcement learning is a "trial and error" process: policies are usually learned in a virtual environment and then transferred to real scenes. To narrow the gap between virtual and reality, Lidar data, with its simple format and ease of learning, is usually used. However, Lidar data cannot fully perceive complex obstacles and cannot support efficient obstacle avoidance strategies.
Some recent work abandons the Lidar sensor, uses the depth and color maps captured by an RGB-D camera as input, maps them directly to actions, and trains reinforcement learning end to end. Compared with laser data, images carry rich semantic information, but also much redundant information that does not help obstacle avoidance, which makes reinforcement learning algorithms hard to converge and train, widens the gap between virtual and reality, and makes strategies hard to transfer. In addition, depth cameras produce heavy noise in sunlit indoor environments and can fail almost completely. The traditional method of mapping the depth map to a point cloud to remove ground interference cannot perceive low obstacles on the ground, such as clothing or swimming pools. Therefore, methods based on RGB-D end-to-end learning also have many problems: they cannot fully perceive a complex indoor environment and may even fail to navigate and avoid obstacles safely.
Therefore, based on investigation and analysis of existing obstacle avoidance navigation technology, the present disclosure combines the advantages of Lidar and RGB cameras while discarding their disadvantages, and constructs "pseudo laser" data, thereby realizing the autonomous navigation and obstacle avoidance task in complex scenes. The input of the method is the image taken by the monocular RGB camera on the robot platform, and the output is the action the robot should take, consisting of linear and angular velocity. The method can effectively perceive different types of complex obstacles in indoor scenes, helping the reinforcement learning module learn and decide efficiently.
Summary of the Disclosure
The purpose of the present disclosure is to realize an efficient robot obstacle avoidance method by mapping monocular RGB images in complex scenes to "pseudo laser" data. The method includes an environment perception stage and a control decision stage. The environment perception stage includes a depth prediction module, a semantic segmentation module, and a depth slicing module; the control decision stage includes a feature extraction guidance module and a reinforcement learning module. The method of the present disclosure is suitable for complex obstacles of various shapes and sizes.
The technical solution of the present disclosure is:
A method for obstacle avoidance of robot in the complex indoor scene based on monocular camera, the method includes the following steps:
Step 1, Loading robot simulation model and building a training and testing simulation environment
In order to solve the obstacle avoidance problem in complex scenes, using the URDF model of the TurtleBot-ROS robot as the experimental robot; using Block, Crossing, and Passing in ROS-Stage as the training environment, and deploying 24 identical TurtleBot-ROS robots for distributed control decision module training; using the cafe environment in ROS-Gazebo as the background of the test scene, and manually adding complex obstacles in Gazebo to test the effectiveness of the entire visual system;
Step 2, Getting the semantic depth map
Obtaining an RGB image from the monocular camera carried by the TurtleBot-ROS robot, and inputting the RGB image into the Fastdepth depth prediction network to obtain the depth map under the current field of view; selecting the lower half of the depth map as an intermediate result; the ground pixel information in the intermediate result will interfere with obstacle avoidance, resulting in obstacle avoidance failure, so inputting the RGB image into the CCNet semantic segmentation model to obtain a two-class semantic segmentation mask, where 0 represents a ground pixel and 1 represents the background; multiplying the semantic segmentation mask and the depth map pixel by pixel to obtain the semantic depth map, where the value of each pixel in the semantic depth map is the depth distance at the current viewing angle, while at the same time removing the disturbing ground depth values;
Step 3, Depth slicing and data enhancement
Performing a dynamic minimum pooling operation on the depth value pixels in the semantic depth map, with a pooling window size of (240, 1) and a step size of 1; selecting the minimum value in the window of each pooling operation as the output, and performing the pooling operation on each column of the image; the result obtained is the "pseudo laser" data;
By introducing a data enhancement method, applying noise interference to the observation data of the virtual environment during training; in order to identify noise boundaries from the training laser measurements, assuming that a boundary exists wherever the difference between two adjacent values in the vector is greater than a threshold of 0.5; replacing the values around the two adjacent endpoints by linear interpolation with a window size of (1, 8); and at the same time, for all laser observation data, adaptively adding Gaussian white noise with a variance of 0.08;
Step 4, Controlling decision stage
After obtaining the "pseudo laser" data, placing the "pseudo laser" data for three consecutive moments in three channels, and using the formed tensor as the input of the deep reinforcement learning module, so that the experimental robot can effectively perceive the optical-flow effect of dynamic obstacles over a short period of time and thus make correct decisions about dynamic obstacles;
The deep reinforcement learning module adopts the PPO algorithm, and its network structure is composed of 3 convolutional layers and 3 fully connected layers; in order for the experimental robot to reach the target position steadily and safely, the state input includes three parts: the observation data, the target point distance, and the speed; the observation data is the "pseudo laser" data obtained in step 3, while the distance to the target point and the speed are obtained from the onboard odometer of the robot; proposing a feature extraction guidance layer, extracting and fusing the data features of the three modes through three convolutional layers, then obtaining a feature mask through sigmoid activation, multiplying it with the "pseudo laser" observation data, and sending the result to the deep reinforcement learning module; extracting information helpful to the obstacle avoidance strategy from the multi-modal data and combining it with the "pseudo laser" observation data makes the subsequent feature extraction process more targeted and speeds up the convergence of the network;
Modifying the second fully connected layer of the deep reinforcement learning module into an LSTM layer, increasing the temporal correlation of the deep reinforcement learning module and enabling the experimental robot to make decisions based on all observations along the entire path;
Step 5, Forming a monocular obstacle avoidance navigation network and outputting decision results
Splicing steps 2, 3, and 4: obtaining the input image from the monocular RGB camera; after processing, obtaining the depth map and the semantic segmentation mask, multiplying them pixel by pixel and cropping; after the dynamic minimum pooling operation, obtaining the "pseudo laser" observation data; inputting three consecutive frames of "pseudo laser" observation data into the deep reinforcement learning module together with the distance and speed of the target point; after the feature extraction guidance layer, different attention is paid to each dimension of the "pseudo laser" observation data; after multi-layer convolution, pooling, and full connection, the LSTM layer is used to increase the temporal correlation over the entire path, and finally the decision-making action of the robot at the current moment is output, so as to achieve autonomous obstacle avoidance and navigation.
The beneficial effects of the present disclosure:
(1) Obstacle avoidance test results and efficiency
The disclosure solves the difficulty of fully perceiving complex obstacles (non-convex irregular obstacles, ferrous metals, complex ground obstacles) in the obstacle avoidance task of a robot in an indoor environment, a difficulty that otherwise leads to obstacle avoidance failure, and helps the robot use the semantic information of the environment to remove the interference of redundant pixels, enabling efficient reinforcement learning training and decision-making. The present disclosure proposes a reinforcement learning mapping method from a single RGB image directly to the robot's obstacle avoidance navigation action. The method relies on "pseudo laser" data and performs efficient decision-making by encoding semantic information into the laser data, and its accuracy is demonstrated through comparative experiments. In the comparative experiments, the method obtained the best performance on both commonly used indicators, average success rate and average time, and has great advantages in complex scenarios.
(2) Broad applicability
The disclosure is suitable for obstacle avoidance and navigation tasks in different complex indoor scenes: (a) scenes containing non-convex irregular obstacles; (b) scenes containing obstacles of black, smooth metallic material; (c) scenes containing messy clothing on the ground and obstacles such as glass and swimming pools. The effectiveness and applicability of the method in these different scenarios are demonstrated.
Description of the Drawings
Figure 1 is the network structure of the present disclosure.
Figure 2 is the visualization result of the experiment of the embodiment of the present disclosure.
The specific embodiments of the present disclosure will be further described below in conjunction with the drawings and technical solutions.
This method uses PPO as the framework of deep reinforcement learning. The state includes the "pseudo laser" data, the distance to the target point, and the velocity at the previous moment; the action is composed of the linear velocity and angular velocity of the wheeled robot; the reward function includes a distance term at each moment (a positive return the closer the robot gets to the target, and vice versa), a reward of -15 if there is a collision, and a reward of 15 if the target point is reached. The robot is also discouraged from taking overly large actions at each step: its angular velocity cannot exceed 1.7 times the angular velocity at the previous moment.
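The reward terms above can be collected into a single shaping function. This is a minimal sketch: the collision and goal rewards (-15 and +15) are stated in the disclosure, but the scale of the distance-progress term (`shaping` below) is an assumption, since the text only says that moving closer yields a positive return and moving away a negative one.

```python
def reward(dist_prev, dist_now, collided, reached,
           goal_reward=15.0, collision_penalty=-15.0, shaping=2.5):
    """Per-step reward for the obstacle avoidance policy.

    dist_prev / dist_now : distance to the target at the previous and
                           current moment (from the onboard odometer)
    collided / reached   : terminal-event flags for this step

    The `shaping` scale is an illustrative assumption, not a value
    given in the disclosure.
    """
    if reached:
        return goal_reward
    if collided:
        return collision_penalty
    # Positive when the robot moved closer to the target, negative otherwise.
    return shaping * (dist_prev - dist_now)
```

The angular-velocity limit (no more than 1.7 times the previous step's value) would typically be enforced as an action clip before the command is sent to the robot, rather than inside the reward.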
The reinforcement learning algorithm is implemented in PyTorch. Stochastic gradient descent is used to train the reinforcement learning network, with a momentum of 0.9, a weight decay of 1e-4, a learning rate of 5e-5, a decay factor of 0.99, a KL divergence parameter of 15e-4, and a maximum step size of 150. In the embodiment of the present disclosure, the learning process is terminated after 1.5 million training paths, and it takes about 40 hours to train the strategy on a computer equipped with an i7-7700 CPU and an NVIDIA GTX 1080Ti GPU. To verify the effectiveness of the network, it is compared with the traditional method ORCA and with a recent learning-based multi-robot distributed obstacle avoidance strategy. Ablation experiments are also performed on all the modules proposed in the network to prove the effectiveness of each part.
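The optimizer settings listed above translate directly into a PyTorch configuration. In this sketch `policy_net` is a stand-in module for illustration only; the actual PPO network (3 convolutional plus 3 fully connected layers) is described in step 4.

```python
import torch

# Stand-in for the PPO policy network described in the disclosure;
# the real network has 3 convolutional and 3 fully connected layers.
policy_net = torch.nn.Linear(640, 2)

# SGD with the hyperparameters given in the embodiment:
# momentum 0.9, weight decay 1e-4, learning rate 5e-5.
optimizer = torch.optim.SGD(policy_net.parameters(),
                            lr=5e-5, momentum=0.9, weight_decay=1e-4)
```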
Figure 1 is the network structure of the monocular obstacle avoidance navigation network. The network is composed of an environment perception stage and a control decision stage, which specifically includes a depth prediction module, a semantic mask module, a depth slicing module, a feature extraction guidance module, a reinforcement learning module, and data enhancement. The network takes monocular RGB images as input and, after obtaining the semantic depth map, performs a dynamic minimum pooling operation to obtain the "pseudo laser" data, which is used as the state input of reinforcement learning to generate the final robot decision-making action.
Figure 2 shows the process visualization results of the monocular visual obstacle avoidance navigation framework, in which column (A) is a chair obstacle scene; column (B) is a table obstacle scene; column (C) is a clothing obstacle scene; and column (D) is a glass obstacle scene. The monocular camera on the robot platform captures the RGB image, predicts the semantic depth map, and then slices it to generate the "pseudo laser" data. The comparison between the last two rows of "pseudo laser" data and Lidar data shows that the "pseudo laser" data can capture more complete environmental information, allowing efficient reinforcement learning training and better environmental interaction.
A method for obstacle avoidance of robot in the complex indoor scene based on monocular camera, the method includes the following steps:
Step 1 Load the robot simulation model and build a training and testing simulation environment
In order to solve the obstacle avoidance problem in complex scenes, the URDF model of the TurtleBot-ROS robot is used as the experimental robot; Block, Crossing, and Passing in ROS-Stage are used as the training environment, and 24 identical TurtleBot-ROS robots are deployed for distributed training of the control decision module; the cafe environment in ROS-Gazebo is used as the background of the test scene, and complex obstacles (tables, chairs, wardrobes, moving pedestrians, etc.) are manually added in Gazebo to test the effectiveness of the entire vision system;
Step 2 Get the semantic depth map
The RGB image is obtained from the monocular camera carried by the TurtleBot-ROS robot and input into the Fastdepth depth prediction network to obtain the depth map under the current field of view; the lower half of the depth map is selected as an intermediate result; the ground pixel information in this result will interfere with obstacle avoidance, resulting in obstacle avoidance failure, so the RGB image is input into the CCNet semantic segmentation model to obtain a two-class semantic segmentation mask, where 0 represents a ground pixel and 1 represents the background. The semantic segmentation mask and the depth map are multiplied pixel by pixel to obtain the semantic depth map, where the value of each pixel in the semantic depth map is the depth distance at the current viewing angle; at the same time, the disturbing ground depth values are removed;
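The combination of depth map and segmentation mask can be sketched in a few lines of NumPy. The network names (Fastdepth, CCNet) and the lower-half crop come from the disclosure; the helper below is an illustrative sketch with an assumed 480x640 input resolution, not the patented implementation.

```python
import numpy as np

def semantic_depth_map(depth, ground_mask):
    """Combine a predicted depth map with a binary segmentation mask.

    depth       : (H, W) float array from the depth-prediction network
    ground_mask : (H, W) array, 0 for ground pixels, 1 for background
    Returns the lower half of the masked depth map, with ground
    depth values zeroed out by the pixel-wise product.
    """
    h = depth.shape[0]
    # Keep only the lower half of the view, where nearby obstacles appear.
    depth_lower = depth[h // 2:]
    mask_lower = ground_mask[h // 2:]
    # Pixel-wise multiplication removes the disturbing ground depths.
    return depth_lower * mask_lower
```

For a 480x640 depth map this yields the (240, 640) semantic depth map that the dynamic minimum pooling of step 3 operates on.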
Step 3 Depth slicing and data enhancement
A dynamic minimum pooling operation is performed on the depth value pixels in the semantic depth map with a pooling window size of (240, 1) and a step size of 1; each pooling operation selects the minimum value in the window as the output. After pooling 640 times, data with a size of (1, 640) is obtained, which is the "pseudo laser" data. The "pseudo laser" not only retains the advantages of Lidar data (simple, easy to learn, and easy to transfer) but also retains the semantic information of the visual image. Since the data is obtained from a two-dimensional image through a minimum pooling operation, it can fully perceive the complex obstacles in the environment, encode semantics into each laser dimension, and support the subsequent efficient reinforcement learning and the implementation of a safe obstacle avoidance strategy.
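The column-wise minimum pooling above reduces each (240, 1) column to a single range value. One detail the disclosure leaves implicit is how the zeroed-out ground pixels interact with the minimum: this sketch assumes they are treated as "no return" so they cannot masquerade as zero-distance obstacles.

```python
import numpy as np

def pseudo_laser(semantic_depth):
    """Dynamic minimum pooling with window (240, 1) and stride 1.

    For a (240, 640) semantic depth map this yields the (640,)
    "pseudo laser" scan: one ray per image column.  Treating masked
    ground pixels (value 0) as "no return" is an assumption of this
    sketch, not a step stated in the disclosure.
    """
    d = semantic_depth.astype(float).copy()
    d[d == 0] = np.inf          # ignore masked-out ground pixels
    scan = d.min(axis=0)        # min over each column = one laser ray
    scan[np.isinf(scan)] = 0.0  # columns that were entirely ground
    return scan
```

Each entry of the scan is the depth of the closest non-ground pixel in that column, which is why low obstacles such as clothing still register even though a real planar Lidar would miss them.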
The sensor data obtained in a virtual environment is often perfect, but in a real environment, if part of one object occludes another object, the observation usually contains an error near the object boundary. Large noise will reduce the accuracy of the algorithm or even cause it to fail. Therefore, a data enhancement method is introduced, and noise interference is applied to the observation data of the virtual environment during training. In order to identify noise boundaries from the training laser measurements, it is assumed that a boundary exists wherever the difference between two adjacent values in the vector is greater than a threshold of 0.5, and the values around the two adjacent endpoints are replaced by linear interpolation with a window size of (1, 8). At the same time, Gaussian white noise with a variance of 0.08 is adaptively added to all laser observation data. This data enhancement method enables the strategy, although trained in a virtual environment, to be directly transferred and adapted to real scenes full of noise.
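The two augmentation steps (boundary smoothing and Gaussian noise) can be sketched as follows. The 0.5 threshold, the window size of 8, and the 0.08 variance come from the disclosure; how exactly the interpolation window is centred on a boundary is an assumption of this sketch.

```python
import numpy as np

def augment_scan(scan, boundary_thresh=0.5, window=8, noise_var=0.08,
                 rng=None):
    """Noise augmentation for simulated "pseudo laser" scans.

    1. Wherever two adjacent values differ by more than
       `boundary_thresh`, an occlusion boundary is assumed and the
       values in a window of size `window` around it are replaced by
       linear interpolation (centring the window on the boundary is
       an assumption of this sketch).
    2. Gaussian white noise with variance `noise_var` is added to
       every ray.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = scan.astype(float).copy()
    n = len(out)
    for i in np.flatnonzero(np.abs(np.diff(out)) > boundary_thresh):
        lo = max(0, i - window // 2)
        hi = min(n - 1, i + 1 + window // 2)
        # Replace the window around the boundary with a linear ramp.
        out[lo:hi + 1] = np.linspace(out[lo], out[hi], hi - lo + 1)
    out += rng.normal(0.0, np.sqrt(noise_var), size=n)
    return out
```

Applied to every training observation, this lets the policy see boundary artifacts and sensor noise during simulation, which is what allows direct transfer to real scenes.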
Step 4 Control decision stage
After obtaining the "pseudo laser" data, the "pseudo laser" data at three consecutive moments are placed in three channels to form a tensor with a size of (3, 640) as the input of the deep reinforcement learning module, which enables the robot to effectively perceive the optical-flow effect of dynamic obstacles over a short period of time and thus make correct decisions about dynamic obstacles.
The deep reinforcement learning module adopts the PPO algorithm, and the network structure is composed of 3 convolutional layers and 3 fully connected layers. In order for the robot to reach the target position steadily and safely, the state input includes three parts: the observation data, the target distance, and the speed. The observation data is the "pseudo laser" data obtained in step 3, and the distance to the target point and the speed are obtained from the onboard odometer of the robot. Currently, there are two commonly used fusion methods: direct fusion and indirect fusion. However, because the information comes from different modes, direct fusion in the channel dimension is not conducive to learning obstacle avoidance strategies, while blind indirect extraction ignores useful information in the observation data and captures useless information. To this end, a feature extraction guidance layer is proposed. The data features of the three modalities are extracted and fused by three convolutional layers, a feature mask is then obtained by sigmoid activation and multiplied with the "pseudo laser" observation data, and the result is sent to the deep reinforcement learning module. This combines the advantages of both previous methods: the information helpful to the obstacle avoidance strategy is extracted from the multi-modal data and then combined with the observation data, so that the subsequent feature extraction process is more targeted and the convergence of the network is accelerated.
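A PyTorch sketch of the feature extraction guidance layer follows. The disclosure fixes the three convolutional layers, the sigmoid gating, and the multiplication with the "pseudo laser" observation; the channel widths, kernel sizes, and the way the low-dimensional goal/velocity state is broadcast along the scan are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureGuidance(nn.Module):
    """Sketch of the feature extraction guidance layer.

    Inputs: the stacked pseudo-laser tensor (B, 3, 640), i.e. three
    consecutive scans in three channels, plus a low-dimensional state
    vector (target distance/angle and velocity).  Channel widths and
    kernel sizes are assumptions; only the three convolutions and the
    sigmoid mask are fixed by the disclosure.
    """
    def __init__(self, scan_len=640, state_dim=4):
        super().__init__()
        # Broadcast the state along the scan as one extra channel.
        self.state_proj = nn.Linear(state_dim, scan_len)
        self.convs = nn.Sequential(
            nn.Conv1d(3 + 1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=5, padding=2),
        )

    def forward(self, scans, state):
        s = self.state_proj(state).unsqueeze(1)           # (B, 1, 640)
        fused = torch.cat([scans, s], dim=1)              # (B, 4, 640)
        mask = torch.sigmoid(self.convs(fused))           # (B, 1, 640)
        # Gate each pseudo-laser dimension by the learned attention mask.
        return scans * mask
```

The gated output keeps the (3, 640) shape, so it drops into the PPO network in place of the raw observation; because the mask lies in (0, 1), the layer can only attenuate rays, focusing the subsequent feature extraction on the dimensions relevant to obstacle avoidance.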
Because a monocular RGB camera is used as the sensor, the robot has a small forward field of view of 60°. Therefore, the second fully connected layer of the reinforcement learning network is modified into an LSTM layer to increase the temporal correlation of the reinforcement learning module, so that the robot can make decisions based on all observations along the entire path.
Step 5 Form a monocular obstacle avoidance navigation network and output decision results
Claims (1)
- A method for obstacle avoidance of a robot in a complex indoor scene based on a monocular camera, wherein the method includes the following steps:
step 1, loading a robot simulation model and building a training and testing simulation environment: in order to solve the obstacle avoidance problem in complex scenes, using the URDF model of the TurtleBot-ROS robot as the experimental robot; using Block, Crossing, and Passing in ROS-Stage as the training environment, and deploying 24 identical TurtleBot-ROS robots for distributed control decision module training; using the cafe environment in ROS-Gazebo as the background of the test scene, and manually adding complex obstacles in Gazebo to test the effectiveness of the entire visual system;
step 2, getting the semantic depth map: obtaining an RGB image from the monocular camera carried by the TurtleBot-ROS robot, and inputting the RGB image into the Fastdepth depth prediction network to obtain the depth map under the current field of view; selecting the lower half of the depth map as an intermediate result; the ground pixel information in the intermediate result will interfere with obstacle avoidance, resulting in obstacle avoidance failure, so inputting the RGB image into the CCNet semantic segmentation model to obtain a two-class semantic segmentation mask, where 0 represents a ground pixel and 1 represents the background; multiplying the semantic segmentation mask and the depth map pixel by pixel to obtain the semantic depth map, where the value of each pixel in the semantic depth map is the depth distance at the current viewing angle, while at the same time removing the disturbing ground depth values;
step 3, depth slicing and data enhancement: performing a dynamic minimum pooling operation on the depth value pixels in the semantic depth map, with a pooling window size of (240, 1) and a step size of 1; selecting the minimum value in the window of each pooling operation as the output, and performing the pooling operation on each column of the image, the result obtained being the "pseudo laser" data; by introducing a data enhancement method, applying noise interference to the observation data of the virtual environment during training; in order to identify noise boundaries from the training laser measurements, assuming that a boundary exists wherever the difference between two adjacent values in the vector is greater than a threshold of 0.5; replacing the values around the two adjacent endpoints by linear interpolation with a window size of (1, 8); and at the same time, for all laser observation data, adaptively adding Gaussian white noise with a variance of 0.08;
step 4, controlling the decision stage: after obtaining the "pseudo laser" data, placing the "pseudo laser" data for three consecutive moments in three channels, and using the formed tensor as the input of the deep reinforcement learning module, so that the experimental robot can effectively perceive the optical-flow effect of dynamic obstacles over a short period of time and make correct decisions about dynamic obstacles; the deep reinforcement learning module adopts the PPO algorithm, and its network structure is composed of 3 convolutional layers and 3 fully connected layers; in order for the experimental robot to reach the target position steadily and safely, the state input includes three parts: the observation data, the target point distance, and the speed; the observation data is the "pseudo laser" data obtained in step 3, and the distance to the target point and the speed are obtained from the onboard odometer of the robot; proposing a feature extraction guidance layer, extracting and fusing the data features of the three modes through three convolutional layers, then obtaining a feature mask through sigmoid activation, multiplying it with the "pseudo laser" observation data, and sending the result to the deep reinforcement learning module; extracting information helpful to the obstacle avoidance strategy from the multi-modal data and combining it with the "pseudo laser" observation data to make the subsequent feature extraction process more targeted and speed up the convergence of the network; modifying the second fully connected layer of the deep reinforcement learning module into an LSTM layer, increasing the temporal correlation of the deep reinforcement learning module, and enabling the experimental robot to make decisions based on all observations along the entire path;
step 5, forming a monocular obstacle avoidance navigation network and outputting decision results: splicing steps 2, 3, and 4; obtaining the input image from the monocular RGB camera; after processing, obtaining the depth map and the semantic segmentation mask, multiplying them pixel by pixel and cropping; after the dynamic minimum pooling operation, obtaining the "pseudo laser" observation data; inputting three consecutive frames of "pseudo laser" observation data into the deep reinforcement learning module together with the distance and speed of the target point; after the feature extraction guidance layer, different attention is paid to each dimension of the "pseudo laser" observation data; after multi-layer convolution, pooling, and full connection, the LSTM layer is used to increase the temporal correlation over the entire path, and finally the decision-making action of the robot at the current moment is output, so as to achieve autonomous obstacle avoidance and navigation.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110106801.6 | 2021-01-27 | ||
CN202110106801.6A CN112767373B (en) | 2021-01-27 | 2021-01-27 | Robot indoor complex scene obstacle avoidance method based on monocular camera |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022160430A1 true WO2022160430A1 (en) | 2022-08-04 |
Family
ID=75705880
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/081649 WO2022160430A1 (en) | 2021-01-27 | 2021-03-19 | Method for obstacle avoidance of robot in the complex indoor scene based on monocular camera |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112767373B (en) |
WO (1) | WO2022160430A1 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115416047A (en) * | 2022-09-02 | 2022-12-02 | 北京化工大学 | Blind assisting system and method based on multi-sensor quadruped robot |
CN116089798A (en) * | 2023-02-07 | 2023-05-09 | 华东理工大学 | Decoding method and device for finger movement |
CN117593517A (en) * | 2024-01-19 | 2024-02-23 | 南京信息工程大学 | Camouflage target detection method based on complementary perception cross-view fusion network |
CN117670184A (en) * | 2024-01-31 | 2024-03-08 | 埃罗德智能科技(辽宁)有限公司 | Robot scene simulation method and system applied to digital robot industrial chain |
CN117707204A (en) * | 2024-01-30 | 2024-03-15 | 清华大学 | Unmanned aerial vehicle high-speed obstacle avoidance system and method based on photoelectric end-to-end network |
CN117697769A (en) * | 2024-02-06 | 2024-03-15 | 成都威世通智能科技有限公司 | Robot control system and method based on deep learning |
CN117830991A (en) * | 2024-03-04 | 2024-04-05 | 山东大学 | Multimode fusion-based four-foot robot complex scene sensing method and system |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113240723A (en) * | 2021-05-18 | 2021-08-10 | 中德(珠海)人工智能研究院有限公司 | Monocular depth estimation method and device and depth evaluation equipment |
CN113593035A (en) * | 2021-07-09 | 2021-11-02 | 清华大学 | Motion control decision generation method and device, electronic equipment and storage medium |
CN114037050B (en) * | 2021-10-21 | 2022-08-16 | 大连理工大学 | Robot degradation environment obstacle avoidance method based on internal plasticity of pulse neural network |
CN114581684B (en) * | 2022-01-14 | 2024-06-18 | 山东大学 | Active target tracking method, system and equipment based on semantic space-time representation learning |
CN114526738B (en) * | 2022-01-25 | 2023-06-16 | 中国科学院深圳先进技术研究院 | Mobile robot visual navigation method and device based on deep reinforcement learning |
CN115805595B (en) * | 2023-02-09 | 2023-12-26 | 白杨时代(北京)科技有限公司 | Robot navigation method and device and sundry cleaning robot |
CN116382267B (en) * | 2023-03-09 | 2023-09-05 | 大连理工大学 | Robot dynamic obstacle avoidance method based on multi-mode pulse neural network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107553490A (en) * | 2017-09-08 | 2018-01-09 | 深圳市唯特视科技有限公司 | A kind of monocular vision barrier-avoiding method based on deep learning |
CN109871011A (en) * | 2019-01-15 | 2019-06-11 | 哈尔滨工业大学(深圳) | A kind of robot navigation method based on pretreatment layer and deeply study |
WO2020056299A1 (en) * | 2018-09-14 | 2020-03-19 | Google Llc | Deep reinforcement learning-based techniques for end to end robot navigation |
US20200166896A1 (en) * | 2018-11-26 | 2020-05-28 | Uber Technologies, Inc. | Deep reinforcement learning based models for hard-exploration problems |
CN111367282A (en) * | 2020-03-09 | 2020-07-03 | 山东大学 | Robot navigation method and system based on multimode perception and reinforcement learning |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104299244B (en) * | 2014-09-26 | 2017-07-25 | 东软集团股份有限公司 | Obstacle detection method and device based on monocular camera |
CN107478220B (en) * | 2017-07-26 | 2021-01-15 | 中国科学院深圳先进技术研究院 | Unmanned aerial vehicle indoor navigation method and device, unmanned aerial vehicle and storage medium |
CN111611869B (en) * | 2020-04-25 | 2021-06-01 | 哈尔滨理工大学 | End-to-end monocular vision obstacle avoidance method based on serial deep neural network |
CN111311685B (en) * | 2020-05-12 | 2020-08-07 | 中国人民解放军国防科技大学 | Motion scene reconstruction unsupervised method based on IMU and monocular image |
CN111949032A (en) * | 2020-08-18 | 2020-11-17 | 中国科学技术大学 | 3D obstacle avoidance navigation system and method based on reinforcement learning |
-
2021
- 2021-01-27 CN CN202110106801.6A patent/CN112767373B/en active Active
- 2021-03-19 WO PCT/CN2021/081649 patent/WO2022160430A1/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107553490A (en) * | 2017-09-08 | 2018-01-09 | 深圳市唯特视科技有限公司 | A kind of monocular vision barrier-avoiding method based on deep learning |
WO2020056299A1 (en) * | 2018-09-14 | 2020-03-19 | Google Llc | Deep reinforcement learning-based techniques for end to end robot navigation |
US20200166896A1 (en) * | 2018-11-26 | 2020-05-28 | Uber Technologies, Inc. | Deep reinforcement learning based models for hard-exploration problems |
CN109871011A (en) * | 2019-01-15 | 2019-06-11 | 哈尔滨工业大学(深圳) | A kind of robot navigation method based on pretreatment layer and deeply study |
CN111367282A (en) * | 2020-03-09 | 2020-07-03 | 山东大学 | Robot navigation method and system based on multimode perception and reinforcement learning |
Non-Patent Citations (1)
Title |
---|
LINGPING GAO: "Agent Environment Perception and Control Decision Based on Deep Reinforcement Learning", MASTER THESIS, 13 May 2020 (2020-05-13), pages 1 - 60, XP055957664, DOI: 10.26991/d.cnki.gdllu.2020.002070 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115416047A (en) * | 2022-09-02 | 2022-12-02 | 北京化工大学 | Blind assisting system and method based on multi-sensor quadruped robot |
CN116089798A (en) * | 2023-02-07 | 2023-05-09 | 华东理工大学 | Decoding method and device for finger movement |
CN117593517A (en) * | 2024-01-19 | 2024-02-23 | 南京信息工程大学 | Camouflage target detection method based on complementary perception cross-view fusion network |
CN117593517B (en) * | 2024-01-19 | 2024-04-16 | 南京信息工程大学 | Camouflage target detection method based on complementary perception cross-view fusion network |
CN117707204A (en) * | 2024-01-30 | 2024-03-15 | 清华大学 | Unmanned aerial vehicle high-speed obstacle avoidance system and method based on photoelectric end-to-end network |
CN117670184A (en) * | 2024-01-31 | 2024-03-08 | 埃罗德智能科技(辽宁)有限公司 | Robot scene simulation method and system applied to digital robot industrial chain |
CN117670184B (en) * | 2024-01-31 | 2024-05-03 | 埃罗德智能科技(辽宁)有限公司 | Robot scene simulation method and system applied to digital robot industrial chain |
CN117697769A (en) * | 2024-02-06 | 2024-03-15 | 成都威世通智能科技有限公司 | Robot control system and method based on deep learning |
CN117697769B (en) * | 2024-02-06 | 2024-04-30 | 成都威世通智能科技有限公司 | Robot control system and method based on deep learning |
CN117830991A (en) * | 2024-03-04 | 2024-04-05 | 山东大学 | Multimode fusion-based four-foot robot complex scene sensing method and system |
CN117830991B (en) * | 2024-03-04 | 2024-05-24 | 山东大学 | Multimode fusion-based four-foot robot complex scene sensing method and system |
Also Published As
Publication number | Publication date |
---|---|
CN112767373B (en) | 2022-09-02 |
CN112767373A (en) | 2021-05-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022160430A1 (en) | Method for obstacle avoidance of robot in the complex indoor scene based on monocular camera | |
Ruan et al. | Mobile robot navigation based on deep reinforcement learning | |
US10824947B2 (en) | Learning method for supporting safer autonomous driving without danger of accident by estimating motions of surrounding objects through fusion of information from multiple sources, learning device, testing method and testing device using the same | |
EP3690727B1 (en) | Learning method and learning device for sensor fusion to integrate information acquired by radar capable of distance estimation and information acquired by camera to thereby improve neural network for supporting autonomous driving, and testing method and testing device using the same | |
Tai et al. | Towards cognitive exploration through deep reinforcement learning for mobile robots | |
JP7112752B2 (en) | Method for detecting pseudo 3D bounding box, method for testing the same, device using method for detecting pseudo 3D bounding box, and device for testing the same | |
Ou et al. | Autonomous quadrotor obstacle avoidance based on dueling double deep recurrent Q-learning with monocular vision | |
CN109964237A (en) | Picture depth prediction neural network | |
CN108230361A (en) | Enhance target tracking method and system with unmanned plane detector and tracker fusion | |
Sales et al. | Adaptive finite state machine based visual autonomous navigation system | |
Han et al. | Deep reinforcement learning for robot collision avoidance with self-state-attention and sensor fusion | |
Asadi et al. | Building an integrated mobile robotic system for real-time applications in construction | |
EP3686776B1 (en) | Method for detecting pseudo-3d bounding box to be used for military purpose, smart phone or virtual driving based on cnn capable of converting modes according to conditions of objects | |
CN105014675A (en) | Intelligent mobile robot visual navigation system and method in narrow space | |
Dionísio et al. | Nereon-an underwater dataset for monocular depth estimation | |
Zhu et al. | Autonomous reinforcement control of visual underwater vehicles: Real-time experiments using computer vision | |
Mishra et al. | A review on vision based control of autonomous vehicles using artificial intelligence techniques | |
Li et al. | Driver drowsiness behavior detection and analysis using vision-based multimodal features for driving safety | |
Khalil et al. | Integration of motion prediction with end-to-end latent RL for self-driving vehicles | |
Souza et al. | Template-based autonomous navigation in urban environments | |
CN114326821A (en) | Unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning | |
Wang et al. | Autonomous docking of the USV using deep reinforcement learning combine with observation enhanced | |
Adiuku et al. | Mobile robot obstacle detection and avoidance with NAV-YOLO | |
Xie et al. | Multiple autonomous robotic fish collaboration | |
Dong | Visual Guidance for Unmanned Aerial Vehicles with Deep Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21922029 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21922029 Country of ref document: EP Kind code of ref document: A1 |