CN111367282A - Robot navigation method and system based on multimode perception and reinforcement learning - Google Patents

Robot navigation method and system based on multimode perception and reinforcement learning

Info

Publication number
CN111367282A
Authority
CN
China
Prior art keywords
robot
network
reinforcement learning
perception
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010157337.9A
Other languages
Chinese (zh)
Other versions
CN111367282B (en)
Inventor
邓寒
黄学钦
张伟
宋然
李贻斌
顾建军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010157337.9A priority Critical patent/CN111367282B/en
Publication of CN111367282A publication Critical patent/CN111367282A/en
Application granted granted Critical
Publication of CN111367282B publication Critical patent/CN111367282B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0257Control of position or course in two dimensions specially adapted to land vehicles using a radar
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Electromagnetism (AREA)
  • Traffic Control Systems (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a robot navigation method and system based on multimode perception and reinforcement learning, comprising the following steps: obtaining an RGB picture of the scene observed by the robot at a set moment, and converting the RGB picture into a binary segmentation map using a trained segmentation network; collecting the laser radar data and the speed measurement data of the robot at the set moment; and inputting the binary segmentation map, the laser radar data and the speed measurement data of the robot into a trained multimode fusion deep network model to obtain an optimal operation strategy of the robot. The invention adopts a multimode mechanism to ensure a more complete perception of the environment, and the RL-based method can directly learn a navigation strategy optimized for the surrounding environment in an infinite search space through online interaction, thereby generating flexible actions and improving the ability to avoid collisions.

Description

Robot navigation method and system based on multimode perception and reinforcement learning
Technical Field
The invention relates to the technical field of robot navigation, in particular to a robot navigation method and system based on multimode perception and reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Autonomous navigation is a very important function of mobile robots, and some existing methods have demonstrated good performance in structured environments. However, it remains challenging to design a robot navigation system that is reliable for unstructured real-world environments, which often contain dynamic obstacles with unpredictable trajectories. This requires the robot to intelligently handle various interactions with obstacles in real time.
There are some efforts that rely on deep learning (DL) to address the challenges of robot navigation in complex environments. However, DL-based approaches typically focus more on perception of the environment, without explicitly learning navigation strategies. A few DL-based approaches use offline annotations to directly learn strategies in actual structured environments, but such annotations are not only time-consuming and laborious to produce at scale, especially in unstructured environments, but are also constrained to a fixed, finite, and discrete set of action states. Therefore, in the dynamic, complex environments of the real world, the learned strategy may not meet the requirements of navigation.
In contrast, Reinforcement Learning (RL) directly learns the optimal strategy for the current environment through a reward mechanism. In fact, this is more consistent with human decision making, where a policy is formulated by interacting with the surrounding environment and the policy model is directly modified by trial and error based on the immediate response of the environment. Furthermore, RL does not require supervised learning based on policy annotations provided by human subjects, as it finds the best policy by maximizing the expected long-term reward.
The inventors have found that the prior art uses radar data in RL to learn obstacle avoidance strategies. However, the sparse point cloud of the radar can only sense information at a specific height; it cannot handle complex environments containing obstacles of arbitrary height and shape, and is not sufficient for training a robot navigation strategy model in an actual complex environment.
The prior art also studies vision-based RL as an alternative. However, the image obtained from a vision sensor cannot provide depth information unambiguously, and in vision-based navigation work the gap between the simulated environment and the real environment is unavoidable. No matter how powerful the simulation engine is, the rendered image cannot perfectly reproduce the real world; therefore, a navigation system trained in a simulated environment may not perform as well in the real world, especially when it contains dynamic obstacles such as vehicles and pedestrians.
Disclosure of Invention
In view of the above, the invention provides a robot navigation method and system based on multimode perception and reinforcement learning, which can realize reliable navigation and collision avoidance in highly dynamic and crowded real-world environments by fusing knowledge obtained from RGB images and radar data through a deep reinforcement learning framework.
In order to achieve the above purpose, in some embodiments, the following technical solutions are adopted:
a robot navigation method based on multimode perception and reinforcement learning comprises the following steps:
obtaining an RGB picture of the scene observed by the robot at a set moment, and converting the RGB picture into a binary (road and non-road) segmentation map using a trained segmentation network;
respectively collecting the laser radar data at the set moment and the speed measurement data of the robot;
and inputting the binary segmentation map, the laser radar data and the speed measurement data of the robot into a trained multimode fusion deep network model to obtain an optimal operation strategy of the robot, thereby realizing the navigation of the robot.
The invention takes the segmentation map as an intermediate representation of the RGB image; the segmentation map ignores the disturbance of low-level image details and remains highly consistent between simulation and real environments. The segmentation map of the RGB image and the radar data are fused as input feature data, enabling reliable navigation and collision avoidance in highly dynamic and crowded real-world environments.
In other embodiments, the following technical solutions are adopted:
a robot navigation system based on multi-mode perception and reinforcement learning comprises:
the device is used for acquiring RGB pictures of a scene observed by the robot at a set moment and converting the RGB pictures into binary segmentation pictures by adopting a trained segmentation network;
the device is used for respectively acquiring the laser radar data at the set moment and the speed measurement data of the robot;
and the device is used for inputting the binary segmentation map, the laser radar data and the speed measurement data of the robot into the trained multimode fusion deep network model to obtain the optimal operation strategy of the robot.
In other embodiments, the following technical solutions are adopted:
a robot, comprising a robot body and a controller, wherein the controller is configured to execute the above robot autonomous navigation method based on multimode perception and deep reinforcement learning, realizing navigation of the robot's travel path.
A computer-readable storage medium storing a plurality of instructions suitable for being loaded by a processor of a terminal device to execute the above robot autonomous navigation method based on multimode perception and deep reinforcement learning.
Compared with the prior art, the invention has the beneficial effects that:
(1) Completeness of perception. The present invention employs a multimode mechanism to ensure a more complete perception of the environment than a single modality can provide, since image and radar data are complementary across various scenarios. This is crucial for the RL-based policy module to learn a correct navigation policy in a complex environment, since its learning process relies only on online perception.
(2) Model portability. Using RGB images directly may encounter the problem of transferring models learned in a simulation environment composed of non-photorealistic renderings to the real-world environment. The present invention uses the segmentation map as an intermediate representation of the RGB image, which has a consistent appearance in both simulated and real scenes. Thus, the model can easily be transferred from simulation to the real world without additional fine-tuning.
(3) Strategy optimization. The DL-based approach essentially predicts a potentially suitable strategy through offline training, which may not be the best choice for the current environment, since the action-state set of the searched strategy is limited. In contrast, the RL-based approach of the present invention can directly learn a navigation strategy optimized for the surrounding environment in an infinite search space through online interaction, thereby generating flexible actions and improving its ability to avoid collisions.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a schematic view of a navigation model based on multi-modal perception and deep reinforcement learning according to an embodiment of the present invention;
FIG. 2 is a schematic view of a navigation framework based on multi-modal perception and deep reinforcement learning according to an embodiment of the present invention;
FIG. 3 is a flowchart of a robot autonomous navigation method based on multi-mode sensing and deep reinforcement learning according to an embodiment of the present invention;
FIGS. 4(a) - (d) are schematic diagrams of examples of simulation and real scenes, respectively, in an embodiment of the present invention;
FIG. 5 shows semantic segmentation results of a simulation scene and a real scene according to an embodiment of the present invention;
FIGS. 6(a) - (b) are the average rewards in a two-stage training scenario, respectively, in an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular is intended to include the plural unless the context clearly dictates otherwise, and it should be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
In one or more embodiments, a robot navigation method based on multi-modal perception and reinforcement learning is disclosed, as shown in fig. 2 and 3, and includes the following steps:
(1) obtaining an RGB picture of the scene observed by the robot at a set moment, and converting the RGB picture into a binary segmentation map using a trained segmentation network;
(2) respectively collecting the laser radar data at the set moment and the speed measurement data of the robot;
(3) and inputting the binary segmentation map, the laser radar data and the speed measurement data of the robot into a trained multimode fusion deep network model to obtain an optimal operation strategy of the robot.
Specifically, the embodiment of the invention designs a deep reinforcement learning framework that fuses knowledge obtained from RGB images and radar data; referring to fig. 1, the method of the embodiment is trained in a simulated environment and can reliably navigate and avoid collisions in highly dynamic and crowded real-world environments.
The method of the present embodiment will be described in detail below.
The implementation process of the method of the embodiment is divided into two parts: a perception part and a policy part.
The perception part converts the RGB image into a semantic segmentation map and normalizes the radar data, which, together with the measurements of the robot's speed, form the input of the reinforcement-learning-based policy part. The final outputs are the linear and angular velocities, which are fed to the robot controller to navigate the robot's travel path.
(1) First, reinforcement learning will be explained:
1) Problem formulation: the reinforcement learning problem and the notation used in the rest of this embodiment are first defined. The optimization of the navigation strategy is formulated as a partially observable Markov decision process (POMDP), since the robot observes only a limited field of view. At each discrete time step t, the RL robot observes the current state $s_t \in S$ and takes an action $a_t \in A$; after one time step, the robot receives a reward $r(s_t, a_t)$ and transitions to a new state $s_{t+1} \in S$. The task is defined as an episodic problem with an episode length of T time steps; the process therefore continues until T time steps have elapsed or an early termination signal is encountered, such as a collision or driving onto a sidewalk. The goal is to find the optimal policy $\pi(a_t \mid s_t; \theta_\pi)$ that maximizes the expected discounted return:
$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right] \qquad (1)$$

where the discount factor satisfies $0 < \gamma < 1$.
There are several classical deep reinforcement learning algorithms, such as DQN, DDPG, A3C and PPO. As a popular deep reinforcement learning algorithm for handling continuous actions in complex tasks, proximal policy optimization (PPO) searches for the best policy by maximizing a surrogate objective function:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right] \qquad (2)$$

$$\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \qquad (3)$$

where $\hat{A}_t$ is an estimator of the advantage function and $\epsilon$ is a hyperparameter. In this work, we use the parallel PPO algorithm, integrated with the multimode fusion paradigm, to train the robot.
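For illustration, the clipped surrogate of Eqs. (2)-(3) can be sketched in a few lines of PyTorch. This is a minimal sketch rather than the patent's implementation; the per-step log-probabilities and advantage estimates are assumed to be computed elsewhere in the training loop.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.1):
    """Clipped PPO surrogate of Eqs. (2)-(3); returns a scalar loss to minimize."""
    ratio = torch.exp(log_probs_new - log_probs_old)        # rho_t(theta), Eq. (3)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes the surrogate, so its negation is minimized.
    return -torch.min(unclipped, clipped).mean()
```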
2) Reinforcement learning settings:
State space. The state $s_t$ consists of three parts: the segmentation map $s_{t1}$, the radar state $s_{t2}$, and the measurement state $s_{t3}$. The segmentation map is generated by the perception module. The radar state consists of the last three consecutive frames of radar data, while the measurement state contains the current linear velocity v and angular velocity w.
Action space. Using discrete actions could make training easier; however, the resulting non-uniformity in speed and steering makes real-world motion control impractical. Therefore, continuous values are used: $v \in [0, 2]$ m/s and $w \in [-1, 1]$ rad/s.
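As one possible realization of this continuous action space (an assumption for illustration; the patent does not specify how network outputs are bounded), unbounded policy outputs can be squashed into the stated ranges:

```python
import torch

def to_action(raw_v, raw_w):
    """Map unbounded network outputs to v in [0, 2] m/s and w in [-1, 1] rad/s."""
    v = 2.0 * torch.sigmoid(raw_v)   # sigmoid gives (0, 1), scaled to (0, 2)
    w = torch.tanh(raw_w)            # tanh gives (-1, 1)
    return v, w
```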
Reward design. Our goal is to avoid collisions while navigating in crowded environments (containing various static and dynamic objects) and to minimize the number of times the robot moves onto other lanes or sidewalks. The robot is required to follow the rule of driving on the right; other traffic rules are ignored.
To guide the robot to achieve this goal, the following six reward functions are designed.
First, to ensure that the robot drives according to the right-hand driving rule, a negative reward is given when the robot enters the opposite lane:

$r_{lane} = -1$ if on the opposite lane, otherwise $r_{lane} = 0$  (4)
Second, to encourage the robot to travel as quickly as possible, the reward $r_v$ is set proportional to the robot's driving speed, with factor $c_v = 1.8$. To improve driving smoothness, the reward $r_w$ is set proportional to the square of the angular velocity with negative factor $c_w = -0.5$, so that large turns during driving are heavily penalized:

$r_v = c_v \times v, \quad r_w = c_w \times w^2$  (5)
Then, when the robot collides with any static or moving object in the environment (e.g., roadblocks, pedestrians, or vehicles), it receives a penalty $r_c$:

$r_c = -10$ on collision, otherwise $r_c = 0$  (6)

Furthermore, once the robot drives onto a sidewalk, a large negative reward $r_{off}$ is applied:

$r_{off} = -10$ on the sidewalk, otherwise $r_{off} = 0$  (7)
Finally, to prevent the robot from getting stuck, a small constant penalty $r_{time} = -0.1$ is applied at each decision step.
Thus, the total reward $r(s_t, a_t)$ is defined as the sum of the above six terms:

$r(s_t, a_t) = r_{lane} + r_v + r_w + r_c + r_{off} + r_{time}$  (8)
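Putting Eqs. (4)-(8) together, the per-step reward can be sketched as below. This is a minimal illustration using the constants stated in this section; the event flags and velocities are assumed to be reported by the simulator.

```python
def total_reward(v, w, on_opposite_lane, collided, on_sidewalk,
                 c_v=1.8, c_w=-0.5):
    """Total reward of Eq. (8): r_lane + r_v + r_w + r_c + r_off + r_time."""
    r_lane = -1.0 if on_opposite_lane else 0.0   # Eq. (4): right-hand driving
    r_v = c_v * v                                # Eq. (5): speed incentive
    r_w = c_w * (w ** 2)                         # Eq. (5): turning penalty
    r_c = -10.0 if collided else 0.0             # Eq. (6): collision penalty
    r_off = -10.0 if on_sidewalk else 0.0        # Eq. (7): sidewalk penalty
    r_time = -0.1                                # constant per-step penalty
    return r_lane + r_v + r_w + r_c + r_off + r_time
```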
and (4) terminating the conditions. There are three termination conditions: firstly, the robot collides with any obstacle; secondly, the robot runs to a sidewalk; and thirdly, the time step T of the current Episode is accumulated to 2000.
(2) Perception part
The perception module aims to perceive the surrounding environment and alleviate the gap between the simulation environment and the real world. The main reason why transferring learned strategy modules from simulation to reality often fails is that certain real-world factors, such as texture, lighting and sensor noise, are difficult to simulate accurately. These factors result in large differences in image detail between the simulated and real-world environments. To address this issue, the present embodiment takes the segmentation map as a mid-level visual representation and uses it as an input to the policy module, because the segmentation map ignores perturbations of low-level image details and maintains high consistency between the simulation and real environments. Thus, a segmentation network is trained in a supervised manner to convert the original RGB image into a binary segmentation map, i.e., road and non-road. This can be expressed as

$s_{t1} = f_{seg}(O_t; \theta_{seg})$

where $f_{seg}$ denotes the segmentation model, $\theta_{seg}$ its parameters, and $O_t$ the RGB picture of the scene observed by the robot at time t.
In principle, the segmentation model $f_{seg}$ calls for a deep segmentation network such as GSCNN. However, such networks typically cannot run in real time on the robot's onboard computing resources, which real-time navigation requires. An alternative is a lightweight segmentation network such as ERFNet; however, such lightweight networks are difficult to generalize to complex environments due to the lack of sufficient training data. Therefore, a teacher-student model is adopted to embed GSCNN into the training process of ERFNet to solve this problem.
In the implementation, the teacher network GSCNN is first trained on the public Cityscapes dataset. Unlabeled RGB images collected in the real-world environment are then fed into this network to generate segmentation maps. The segmentation maps generated by GSCNN serve as labels for the unlabeled RGB images, which are combined with other labeled data to train the student network ERFNet. Finally, ERFNet performs onboard image segmentation in both the simulated and real-world environments.
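In outline, this distillation step amounts to pseudo-labeling. The sketch below illustrates the idea under assumed model, optimizer, and tensor conventions; it is not the patent's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(teacher, unlabeled_images):
    """Use the trained teacher (e.g., GSCNN) to label unlabeled RGB images."""
    teacher.eval()
    logits = teacher(unlabeled_images)   # assumed shape (N, 2, H, W): road / non-road
    return logits.argmax(dim=1)          # hard labels for the student

def train_student_step(student, optimizer, images, labels):
    """One supervised step on mixed labeled and pseudo-labeled data."""
    loss = F.cross_entropy(student(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```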
Of course, in the present embodiment, the teacher network GSCNN and the student network ERFNet are only examples; those skilled in the art may select other segmentation networks as needed.
Semantic segmentation relying only on RGB images cannot provide the accurate depth information that is important for autonomous navigation. Therefore, the laser radar data $s_{t2}$ is introduced into the perception module. Compared with RGB images, the information obtained from radar is relatively robust to the differences between simulated and real environments, since it is not sensitive to texture and illumination. The radar data is normalized, and the three most recent historical frames are used as input to the policy module. To obtain real-time feedback from the robot, measurements comprising the agent's linear velocity v and angular velocity w are introduced as another sensory input $s_{t3}$. Finally, the perception module composes the state $s_t = (s_{t1}, s_{t2}, s_{t3})$, consisting of the segmentation map, radar data and measurements, and outputs it to the policy module.
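One plausible way to assemble the state $s_t = (s_{t1}, s_{t2}, s_{t3})$, with normalized radar scans and a three-frame history, is sketched below; the maximum radar range and array shapes are assumptions, not values disclosed in the patent.

```python
from collections import deque
import numpy as np

class StateBuffer:
    """Builds s_t = (segmentation map, 3 normalized radar frames, (v, w))."""
    def __init__(self, max_range=10.0, history=3):
        self.max_range = max_range
        self.frames = deque(maxlen=history)

    def update(self, seg_map, radar_scan, v, w):
        scan = np.clip(radar_scan, 0.0, self.max_range) / self.max_range
        self.frames.append(scan)
        while len(self.frames) < self.frames.maxlen:   # pad at episode start
            self.frames.append(scan)
        s_t1 = seg_map                              # (H, W) binary road map
        s_t2 = np.stack(self.frames)                # (3, num_beams) radar history
        s_t3 = np.array([v, w], dtype=np.float32)   # measurement state
        return s_t1, s_t2, s_t3
```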
In essence, the multimodal perception module extracts a structured representation of the complex unstructured environment through image segmentation, which helps the reinforcement learning robot better understand the high-level semantics of the environment and accelerates the search for the optimal strategy. In addition, lidar provides accurate depth information about the environment; the combination of segmentation maps and radar data complements the representation of the environment, ensuring a complete perception with richer information and thus significantly benefiting the policy module.
(3) Policy part
The policy module aims to find the optimal navigation strategy through reinforcement learning. In many cases, strategies learned from single-modality data are not robust enough due to the inherent limitations of each sensing technique. Thus, as shown in fig. 2, multimodal data is utilized, and the learned multimodal features are fused in a deep network. First, the policy module takes the segmentation map and radar data provided by the perception module as input and extracts features from each of them. These features are then fused with the measurements, and the policy π is output through a fully connected network. Finally, the policy π is optimized using the parallel PPO algorithm. This can be expressed as

$\pi = f_{fus}(F_{t1}, F_{t2}, s_{t3}; \theta_f)$

where $F_{t1}$ and $F_{t2}$ denote the outputs of the two branches of the policy module that process the segmentation map and the radar data, respectively, and $f_{fus}$ denotes the fully connected and ReLU activation layers, with learnable parameters $\theta_f$, that fuse them to output the policy π.
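A minimal PyTorch sketch of such a fusion head is given below. The feature dimensions and layer sizes are illustrative assumptions; the patent does not disclose the exact architecture.

```python
import torch
import torch.nn as nn

class FusionPolicy(nn.Module):
    """Fuses segmentation features F_t1, radar features F_t2 and measurements s_t3."""
    def __init__(self, seg_dim=256, radar_dim=128, meas_dim=2, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(seg_dim + radar_dim + meas_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # raw linear and angular velocity outputs
        )

    def forward(self, f_seg, f_radar, meas):
        x = torch.cat([f_seg, f_radar, meas], dim=-1)
        return self.fuse(x)   # squashed downstream to v in [0,2], w in [-1,1]
```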
Notably, a policy network based on the multimodal fusion scheme is difficult to learn due to the expansion of the state space. To solve this problem, the present embodiment proposes a modality-separation learning method as an auxiliary training tool. We note that radar feature extraction in a simulated environment is highly consistent with the real world.
Therefore, an obstacle avoidance network is introduced, and a radar-based obstacle avoidance strategy is trained in a simulation environment. In this embodiment, the reinforcement learning robot is trained on the Stage simulator using a radar-based network model disclosed in the prior art. As shown in fig. 4(a), a new training scene is built on the simulator. In this simulation environment, the robot is first trained with the radar-based obstacle avoidance network. Then, the network parameters of the corresponding layers that extract radar data features are migrated to the multimode fusion model and fixed. In this way, the size of the feature space to be learned is greatly reduced.
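In PyTorch terms, migrating and fixing the pretrained radar branch could look like the following sketch; the `radar_encoder` attribute and the checkpoint layout are assumptions for illustration.

```python
import torch

def transfer_radar_branch(fusion_model, ckpt_path="radar_avoidance.pt"):
    """Copy pretrained radar-feature weights into the fusion model and freeze them."""
    state = torch.load(ckpt_path, map_location="cpu")
    fusion_model.radar_encoder.load_state_dict(state["radar_encoder"])
    for p in fusion_model.radar_encoder.parameters():
        p.requires_grad = False   # fixed, so only the remaining layers are learned
```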
In addition, the present embodiment divides the navigation task into two subtasks: obeying traffic rules and avoiding obstacles. Because it is very difficult for the reinforcement learning robot to directly learn a driving strategy in a complex environment containing a large amount of information (e.g., with static and dynamic objects placed at the same time), the present embodiment adopts a simple-to-complex curriculum learning training paradigm to ensure that the multimode fusion strategy can effectively learn the above tasks.
Such a paradigm enables the robot to learn quickly through relatively simple driving tasks first.
In the simulation environment, no obstacles are added on the road at first; applying the multimode perception and deep reinforcement learning navigation model to data collected in real time, such as images and laser radar, the robot quickly learns simple driving tasks through continuous trial and error, for example, how to drive along roads and obey traffic rules. After the robot achieves reliable performance, training on the simple task is stopped and a complex-task phase begins, in which a large number of vehicles, pedestrians and roadblocks are added. The reinforcement learning robot is trained continuously to learn the optimal driving strategy so that it can avoid potential collisions during driving.
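Schematically, the two-stage curriculum can be expressed as follows; `env`, `agent`, and the performance threshold are placeholders, not interfaces disclosed in the patent.

```python
def curriculum_training(env, agent, simple_threshold, complex_episodes):
    """Simple-to-complex curriculum: driving rules first, then obstacle avoidance."""
    env.configure(obstacles=False)              # stage 1: empty roads
    while agent.average_reward() < simple_threshold:
        agent.train_one_episode(env)

    env.configure(obstacles=True)               # stage 2: vehicles, pedestrians,
    for _ in range(complex_episodes):           # and roadblocks are added
        agent.train_one_episode(env)
```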
(4) Results of the experiment
The embodiment performs robot navigation experiments in simulation and real-world environments to prove the superiority of the method of the embodiment over the baseline method.
1) Details of the Experimental setup and implementation
① Simulation environment. The Stage simulator is an open-source mobile robot simulator providing a virtual world of mobile robots and sensors. The radar-based obstacle avoidance strategy was first trained in Stage, using robots 2.3 meters long and 1.4 meters wide. Eight robots were trained simultaneously, as shown in fig. 4(a), where each laser-bearing rectangle represents one robot; training ran for 6000 episodes, taking 12 hours in total. The RL framework with the multimode fusion scheme proposed in this embodiment was then trained in CARLA, an open-source urban scene simulator for autonomous driving based on the Unreal game engine. CARLA currently provides seven highly realistic complex urban scenes, as shown in fig. 4(b). In our experiments, the Town 2 scene under clear daytime conditions was used for strategy training, while the Town 1 scene served as an environment unseen by the robot for evaluation. To train the perception network, RGB images were collected in CARLA by driving the robot under remote control at a frequency of 5 Hz.
② Real environment. Fig. 4(c) and 4(d) show the real outdoor and indoor environments, respectively. For the real-world experiments, the teacher-student model is used to convert images of the real-world scenes into semantic segmentation labels. A Raspberry Pi-based RGB camera, mounted at a height of 1.2 meters, tilted downward by 12° and with a 60° field of view, is installed on the robot; 1.9K raw RGB images were collected via a remote controller at a frequency of 3 Hz.
Table 1: hyper-parameter settings
Parameter(s) Value of
Discount(γ) 0.99
GAE parameter(λ) 0.95
Clipping(ε) 0.1
Horizon(T) 256
Entropy coeff 0.01
Minibatch size 256 (simple phase), 1024 (complex phase)
Num.epochs 4
Learning rate 3e-4
Implementation details of the perception part. The goal of this embodiment is to obtain "road" and "non-road" regions, so the semantic segmentation labels are divided into two classes: road and non-road. The public Cityscapes dataset and the simulation data are then combined with the real-scene data to train ERFNet. All image data are resized to 84 × 84, the batch size is set to 48, and the model is trained for 500 iterations. The Adam optimizer is used with an initial learning rate of 0.001, reduced to 0.0001 after 150 iterations; weight decay is set to 0.0002. Fig. 5 shows the semantic segmentation results for the simulation and real scenes, where the first row of fig. 5 is the simulation scene in CARLA, the second row is a real-world indoor scene, and the third row is a real-world outdoor scene.
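The optimization schedule stated above corresponds to a setup like the following sketch; the model object is a placeholder.

```python
import torch

def make_segmentation_optimizer(model):
    """Adam, lr 0.001 dropping to 0.0001 after 150 iterations, weight decay 0.0002."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=2e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[150], gamma=0.1)
    return opt, sched
```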
2) Baseline method and assessment index
① Baseline methods. In the experiments, the method of this embodiment is compared with the following four baseline methods:
DroNet: this is an existing DL-based approach, using only RGB image predictive control strategies. Although it has been applied to the task of robot navigation in real-world environments, it suffers from the problem of hesitation since motion is usually terminated when an obstacle is encountered.
DroNet + LiDAR: the LiDAR method is currently the best RL-based obstacle avoidance method relying only on radar data, and ensures that a moving robot can avoid colliding with obstacles. We therefore provide a baseline combining DroNet and LiDAR that can achieve autonomous navigation and obstacle avoidance in real-world environments.
SEG + LiDAR-ini: an ablated version of the method of this embodiment in which the radar feature extraction part is trained from scratch, in contrast to the full model, where it is pre-trained by the obstacle avoidance network and fixed.
SEG-only: another ablated version of the method of this embodiment, which uses only the segmentation map as network input, without radar data. It is therefore essentially a single-modality RL method.
② Evaluation metrics. The robot starts at a random location in each test episode, with three termination conditions: 1) any collision; 2) the maximum time is reached (100 seconds for simulation scenes; 150 seconds for outdoor scenes, 50 seconds for indoor scenes); and 3) the robot reaches a designated area.
The distance metric records the length the robot drives in an episode. The total time records how long the robot drives in the episode. The average speed of the robot is also reported, which reflects whether it can effectively bypass obstacles. Furthermore, the off-lane time is recorded, defined as the percentage of time the robot spends on other lanes or sidewalks during the entire drive.
3) Evaluation in a simulation environment
① Ablation study of the training procedure:

Simple stage without obstacles. Fig. 6(a) shows the learning curves of three versions of the method of the present embodiment in the simple phase of the simple-to-complex learning paradigm. At this stage, since the robot's surroundings contain no obstacles, learning from the segmentation map that divides the RGB image into "road" and "non-road" areas provides the most important cue for navigation. Thus, in fig. 6(a), the SEG-only version is observed to perform well. The full version of the method of this embodiment achieves comparable performance to SEG-only. This is because, in this version, the layers that learn deep features from the LiDAR data have fixed parameters, so the whole network is ultimately trained to learn from the segmentation map as much as possible. In contrast, training SEG + LiDAR-ini yields a network that is not only suboptimal at learning the segmentation map but whose LiDAR feature learning is also impaired. Thus, the control strategy learned by the SEG + LiDAR-ini network cannot achieve a high average return even in an obstacle-free environment.
Complex stages with static and/or dynamic obstacles.
Fig. 6(b) shows the learning curves for the complex phase, where the robot frequently encounters static and/or dynamic obstacles in the scene. We observe that the full version of the method of this embodiment performs significantly better than SEG + LiDAR-ini in terms of training speed and average reward. These results also demonstrate the benefit of the modality-separated learning scheme, in which a collision avoidance network is first pre-trained based only on LiDAR data and its learned parameters are then transferred into the multimodal fusion model. In practice, navigation provided by SEG + LiDAR-ini exhibits a series of problems, including moving to the left side of the road, an uneven travel path, and unexpected turns. The full version of the method of this embodiment is also superior to the SEG-only ablation in terms of training speed. In fact, the robot velocity v output by SEG-only is much worse, as it follows a normal distribution with a mean of 0 and a variance of 9. This means the robot must frequently stop and/or slow down significantly when encountering obstacles. In contrast, the v output distribution of the full multimode-fusion method of this embodiment has a mean of 2 and a variance of 0.1, indicating near-ideal navigation: the robot can quickly bypass obstacles and safely avoid potential collisions.
② Quantitative comparison with baselines:
the suggested method is compared to a baseline method by a city driving simulator. To test whether the autonomous navigation system can handle crowded scenarios involving static and dynamic objects, different tasks were devised to evaluate the robot's reaction patterns in different scenarios.
Tasks: 1) There are no obstacles in the scene; in this task, the robot must drive on the right and follow traffic rules. 2) The scene contains various static obstacles, including stationary vehicles, boxes, bins and trash cans; in this task, the robot must avoid collisions while driving. 3) The scene contains various dynamic obstacles, including vehicles and pedestrians. The starting positions are the same for all three tasks. Each experiment was run 3 times and the average results are reported.
Results: As shown in Table 2, both the method of this embodiment and the baselines can travel long distances in the obstacle-free training environment (i.e., Task 1), although the driving strategies of DroNet and DroNet + LiDAR are relatively conservative, as reflected by their low average speeds. In the unseen test environment, the driving speed of the robot using DroNet is very slow, as it typically takes time to recognize shadows on the road (e.g., shadows of street lights, buildings, and the robot itself). Furthermore, since DroNet tends to overfit the training site, the time the robot spends on other lanes increases greatly when tested in the unseen environment.
TABLE 2 quantitative comparison in a simulation Environment
In Task 2, the method of the present embodiment avoids collisions with static obstacles better in both the training and testing environments, while DroNet is largely unable to avoid collisions when encountering obstacles, typically coming to a standstill or colliding accidentally. While DroNet + LiDAR improves obstacle avoidance performance, it is still not as effective as the method of this embodiment. Accordingly, the method of this embodiment achieves a longer travel distance and a higher average speed.
In Task 3, the robot does not always remain stuck, because the vehicles and pedestrians are dynamic. For the DroNet method, the average distance of robot motion is longer than in Task 2. However, since the dynamic scene is more complicated, the possibility of collision increases, and the average driving time of the robot becomes shorter.
4) Assessment in real environments
Here, the strategy model trained using only the CARLA simulator is deployed directly in the real world to verify its robustness and generalization capability.
① Quantitative comparison in outdoor scenes:
the methods of DroNet and DroNet + LiDAR are compared across multiple campus lanes of a challenging surrounding environment (including tight turns). Some of the test protocols are shown at the bottom of figure 5. As shown in table 3, in the results of table 3, the maximum linear velocity of the actual robot was 1 m/s; let the robot run 150 seconds in outdoor scenes and 50 seconds in indoor scenes, respectively, and report the navigation distance that the robot eventually covered.
The method of this embodiment and both baselines can accomplish the task in simple scenes with few dynamic obstacles, while the method of this embodiment outperforms the baselines in navigation distance, i.e., the distance the robot can travel in a fixed time. In crowded environments, neither DroNet nor DroNet + LiDAR can drive as long a path as the method of this embodiment. This is mainly because the method of this embodiment can find a safe navigation path in scenes with highly dynamic objects, while a robot using either baseline moves slowly due to the high probability of collision.
TABLE 3 quantitative comparison in real world
② Quantitative comparison in indoor scenes:
to demonstrate that the multi-modal approach can separate the perception part from the strategy part, experiments were performed in an indoor environment, and the middle row of fig. 5 shows some test scenarios. This is challenging because the training scenario is based on a simulated outdoor environment cara. In implementation, the ERFNet segmentation model is retrained using the indoor corridor images. When testing is performed in an indoor environment, the semantic segmentation model used in the above experiment is replaced with a retrained model, and the strategy model remains unchanged. The results show that although the indoor environment is very different from the scenario used to train the policy model, the multimodal policy model of the present embodiment is still superior to the baseline in the indoor environment, especially in crowded hallways where space is limited.
In summary, the present embodiment proposes a multimode fusion scheme in the policy part to utilize both image and radar data. However, due to the larger state space, multimodal strategy learning is more difficult than single-modality strategy learning. Moreover, RL has difficulty learning effective strategies in real-world environments with various dynamic obstacles. Therefore, the difficulty of multimodal strategy learning is reduced in three ways.
First, the policy part learns from the image and radar data separately. A radar-based obstacle avoidance strategy is trained first; the component that extracts radar features is then transferred into the multimode policy module, while the semantic segmentation map derived from the RGB image is fed directly into the multimode policy module.
Second, training proceeds from simple to complex. In the simple phase, training takes place in an environment without obstacles, so the RL robot can quickly and efficiently learn various driving tasks and traffic rules. Then, in the complex phase, the robot focuses on learning reliable collision avoidance strategies in crowded scenes (including static and dynamic obstacles).
Thirdly, six reward functions are designed based on traffic rules, collision punishment and speed smoothness.
Considering that the robot is equipped with limited computing resources, a lightweight neural network is employed for the perception part, which can reliably divide the RGB image into road and non-road regions. The generated segmentation map can be viewed as a mid-level visual feature of the scene and shows a consistent appearance in both the simulated and real scenes. However, training such a network requires a sufficiently large dataset; otherwise it cannot generalize well given the diversity of real-world environments. Therefore, a teacher-student model is adopted to distill segmentation knowledge and improve the generalization ability of the network.
Example two
In one or more embodiments, disclosed is a multi-modal perception and reinforcement learning-based robot navigation system, comprising:
the device is used for acquiring RGB pictures of a scene observed by the robot at a set moment and converting the RGB pictures into binary segmentation pictures by adopting a trained segmentation network;
the device is used for respectively acquiring the laser radar data at the set moment and the speed measurement data of the robot;
and the device is used for inputting the binary segmentation map, the laser radar data and the speed measurement data of the robot into the trained multimode fusion deep network model to obtain the optimal operation strategy of the robot.
The specific implementation manner of the device adopts the method disclosed in the first embodiment, and details are not described again.
EXAMPLE III
In one or more embodiments, a terminal device is disclosed, which includes a server comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the robot autonomous navigation method based on multimode perception and deep reinforcement learning disclosed in the first embodiment is implemented, which is not repeated here for brevity.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. The memory may also store information of the device type, for example.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor. The software modules may reside in ram, flash, rom, prom, or eprom, registers, among other storage media that are well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor. To avoid repetition, it is not described in detail here.
Those of ordinary skill in the art will appreciate that the elements of the various examples, i.e., the algorithm steps, described in connection with the embodiments disclosed herein may be implemented as electronic hardware or in combination with computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they are not intended to limit the scope of the present invention, and it should be understood that various modifications and variations can be made by those skilled in the art, based on the technical solution of the present invention, without inventive effort.

Claims (10)

1. A robot navigation method based on multimode perception and reinforcement learning is characterized by comprising the following steps:
the method comprises the steps of obtaining RGB pictures of a scene observed by a robot at a set moment, and converting the RGB pictures into binary segmentation pictures by adopting a trained segmentation network;
respectively collecting the laser radar data at the set moment and the speed measurement data of the robot;
and inputting the binary segmentation map, the laser radar data and the speed measurement data of the robot into a trained multimode fusion deep network model to obtain an optimal operation strategy of the robot, thereby realizing the navigation of the robot.
2. The robot navigation method based on multi-modal perception and reinforcement learning as claimed in claim 1, wherein the training process for the segmentation network specifically comprises:
training a teacher network using the published semantically segmented data set;
feeding unlabeled RGB images collected in a real environment into a trained teacher network to generate a binary segmentation map;
training a student network by using the generated binary segmentation graph as a label of the unmarked RGB image and adding a data set collected by a simulation environment and a public semantic segmentation data set;
and taking the trained student network as a final segmentation network to segment the RGB picture.
3. The robot navigation method based on multi-mode perception and reinforcement learning of claim 2, wherein the teacher network selects a GSCNN network, and the student network selects an ERFNet network.
4. The robot navigation method based on multi-mode perception and reinforcement learning as claimed in claim 1, wherein the training process of the multi-mode fusion deep network model specifically comprises:
introducing a radar obstacle avoidance network, and training an obstacle avoidance strategy based on laser radar data in a simulation environment;
migrating the trained obstacle avoidance network parameters to the multi-mode fusion depth network model and fixing;
in a simulation environment, a multi-mode fusion deep network model is trained by adopting a simple to complex training process.
5. The robot navigation method based on multi-mode perception and reinforcement learning as claimed in claim 4, wherein in a simulation environment, a simple to complex training process is adopted to train the multi-mode fusion deep network model, and the specific process is as follows:
in a simulation environment, no barrier is added on a road, and the robot continuously tries and mistakes by a reinforcement learning method to quickly learn a simple driving task;
after the robot reaches the set performance, dynamic and static interference factors are added in the simulation environment, and the robot is continuously trained to learn the optimal driving strategy to avoid potential collision.
6. The multi-modal awareness and reinforcement learning-based robot navigation method of claim 5, wherein the simple driving task comprises: travel along roads and understand traffic regulations.
7. The robot navigation method based on multimode perception and reinforcement learning as claimed in claim 1, wherein the speed metric data of the robot comprises: linear and angular velocities.
8. A robot navigation system based on multimode perception and reinforcement learning is characterized by comprising:
the device is used for acquiring RGB pictures of a scene observed by the robot at a set moment and converting the RGB pictures into binary segmentation pictures by adopting a trained segmentation network;
the device is used for respectively acquiring the laser radar data at the set moment and the speed measurement data of the robot;
and the device is used for inputting the binary segmentation map, the laser radar data and the speed measurement data of the robot into the trained multimode fusion deep network model to obtain the optimal operation strategy of the robot.
9. A robot, comprising: the robot body and the controller are characterized in that the controller is configured to execute the robot autonomous navigation method based on the multimode perception and the deep reinforcement learning according to any one of claims 1-7, and realize navigation of a robot running path.
10. A computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor of a terminal device and to execute the method for robot autonomous navigation based on multi-modal perception and deep reinforcement learning of any one of claims 1-7.
CN202010157337.9A 2020-03-09 2020-03-09 Robot navigation method and system based on multimode perception and reinforcement learning Active CN111367282B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010157337.9A CN111367282B (en) 2020-03-09 2020-03-09 Robot navigation method and system based on multimode perception and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010157337.9A CN111367282B (en) 2020-03-09 2020-03-09 Robot navigation method and system based on multimode perception and reinforcement learning

Publications (2)

Publication Number Publication Date
CN111367282A true CN111367282A (en) 2020-07-03
CN111367282B CN111367282B (en) 2022-06-07

Family

ID=71208662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010157337.9A Active CN111367282B (en) 2020-03-09 2020-03-09 Robot navigation method and system based on multimode perception and reinforcement learning

Country Status (1)

Country Link
CN (1) CN111367282B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111988220A (en) * 2020-08-14 2020-11-24 山东大学 Multi-target disaster backup method and system among data centers based on reinforcement learning
CN111975769A (en) * 2020-07-16 2020-11-24 华南理工大学 Mobile robot obstacle avoidance method based on meta-learning
CN112114592A (en) * 2020-09-10 2020-12-22 南京大学 Method for realizing autonomous crossing of movable frame-shaped barrier by unmanned aerial vehicle
CN112130570A (en) * 2020-09-27 2020-12-25 重庆大学 Blind guiding robot of optimal output feedback controller based on reinforcement learning
CN112270306A (en) * 2020-11-17 2021-01-26 中国人民解放军军事科学院国防科技创新研究院 Unmanned vehicle track prediction and navigation method based on topological road network
CN112304314A (en) * 2020-08-27 2021-02-02 中国科学技术大学 Distributed multi-robot navigation method
CN112965081A (en) * 2021-02-05 2021-06-15 浙江大学 Simulated learning social navigation method based on feature map fused with pedestrian information
CN112966591A (en) * 2021-03-03 2021-06-15 河北工业职业技术学院 Knowledge map deep reinforcement learning migration system for mechanical arm grabbing task
CN113093779A (en) * 2021-03-25 2021-07-09 山东大学 Robot motion control method and system based on deep reinforcement learning
CN113848750A (en) * 2021-09-14 2021-12-28 清华大学 Two-wheeled robot simulation system and robot system
WO2022160430A1 (en) * 2021-01-27 2022-08-04 Dalian University Of Technology Method for obstacle avoidance of robot in the complex indoor scene based on monocular camera
CN114859940A (en) * 2022-07-05 2022-08-05 北京建筑大学 Robot movement control method, device, equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024019107A1 (en) * 2022-07-22 2024-01-25 ソニーグループ株式会社 Multiple-robot control method, multiple-robot control device, and multiple-robot control system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105157697A (en) * 2015-07-31 2015-12-16 天津大学 Indoor mobile robot pose measurement system and measurement method based on optoelectronic scanning
CN109087303A (en) * 2018-08-15 2018-12-25 中山大学 The frame of semantic segmentation modelling effect is promoted based on transfer learning
CN109506658A (en) * 2018-12-26 2019-03-22 广州市申迪计算机系统有限公司 Robot autonomous localization method and system
CN109764876A (en) * 2019-02-21 2019-05-17 北京大学 The multi-modal fusion localization method of unmanned platform
CN110006435A (en) * 2019-04-23 2019-07-12 西南科技大学 A kind of Intelligent Mobile Robot vision navigation system method based on residual error network
CN110245567A (en) * 2019-05-16 2019-09-17 深圳前海达闼云端智能科技有限公司 Barrier-avoiding method, device, storage medium and electronic equipment
CN110243370A (en) * 2019-05-16 2019-09-17 西安理工大学 A kind of three-dimensional semantic map constructing method of the indoor environment based on deep learning
CN110320883A (en) * 2018-03-28 2019-10-11 上海汽车集团股份有限公司 A kind of Vehicular automatic driving control method and device based on nitrification enhancement
CN110781976A (en) * 2019-10-31 2020-02-11 重庆紫光华山智安科技有限公司 Extension method of training image, training method and related device
CN110795821A (en) * 2019-09-25 2020-02-14 的卢技术有限公司 Deep reinforcement learning training method and system based on scene differentiation

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105157697A (en) * 2015-07-31 2015-12-16 天津大学 Indoor mobile robot pose measurement system and measurement method based on optoelectronic scanning
CN110320883A (en) * 2018-03-28 2019-10-11 上海汽车集团股份有限公司 A kind of Vehicular automatic driving control method and device based on nitrification enhancement
CN109087303A (en) * 2018-08-15 2018-12-25 中山大学 The frame of semantic segmentation modelling effect is promoted based on transfer learning
CN109506658A (en) * 2018-12-26 2019-03-22 广州市申迪计算机系统有限公司 Robot autonomous localization method and system
CN109764876A (en) * 2019-02-21 2019-05-17 北京大学 The multi-modal fusion localization method of unmanned platform
CN110006435A (en) * 2019-04-23 2019-07-12 西南科技大学 A kind of Intelligent Mobile Robot vision navigation system method based on residual error network
CN110245567A (en) * 2019-05-16 2019-09-17 深圳前海达闼云端智能科技有限公司 Barrier-avoiding method, device, storage medium and electronic equipment
CN110243370A (en) * 2019-05-16 2019-09-17 西安理工大学 A kind of three-dimensional semantic map constructing method of the indoor environment based on deep learning
CN110795821A (en) * 2019-09-25 2020-02-14 的卢技术有限公司 Deep reinforcement learning training method and system based on scene differentiation
CN110781976A (en) * 2019-10-31 2020-02-11 重庆紫光华山智安科技有限公司 Extension method of training image, training method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王大方: "基于深度强化学习的机器人导航研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111975769A (en) * 2020-07-16 2020-11-24 华南理工大学 Mobile robot obstacle avoidance method based on meta-learning
CN111988220B (en) * 2020-08-14 2021-05-28 山东大学 Multi-target disaster backup method and system among data centers based on reinforcement learning
CN111988220A (en) * 2020-08-14 2020-11-24 山东大学 Multi-target disaster backup method and system among data centers based on reinforcement learning
CN112304314A (en) * 2020-08-27 2021-02-02 中国科学技术大学 Distributed multi-robot navigation method
CN112114592B (en) * 2020-09-10 2021-12-17 南京大学 Method for realizing autonomous crossing of movable frame-shaped barrier by unmanned aerial vehicle
CN112114592A (en) * 2020-09-10 2020-12-22 南京大学 Method for realizing autonomous crossing of movable frame-shaped barrier by unmanned aerial vehicle
CN112130570A (en) * 2020-09-27 2020-12-25 重庆大学 Blind guiding robot of optimal output feedback controller based on reinforcement learning
CN112130570B (en) * 2020-09-27 2023-03-28 重庆大学 Blind guiding robot of optimal output feedback controller based on reinforcement learning
CN112270306A (en) * 2020-11-17 2021-01-26 中国人民解放军军事科学院国防科技创新研究院 Unmanned vehicle track prediction and navigation method based on topological road network
CN112270306B (en) * 2020-11-17 2022-09-30 中国人民解放军军事科学院国防科技创新研究院 Unmanned vehicle track prediction and navigation method based on topological road network
WO2022160430A1 (en) * 2021-01-27 2022-08-04 Dalian University Of Technology Method for obstacle avoidance of robot in the complex indoor scene based on monocular camera
CN112965081A (en) * 2021-02-05 2021-06-15 浙江大学 Simulated learning social navigation method based on feature map fused with pedestrian information
CN112965081B (en) * 2021-02-05 2023-08-01 浙江大学 Simulated learning social navigation method based on feature map fused with pedestrian information
CN112966591A (en) * 2021-03-03 2021-06-15 河北工业职业技术学院 Knowledge map deep reinforcement learning migration system for mechanical arm grabbing task
CN113093779A (en) * 2021-03-25 2021-07-09 山东大学 Robot motion control method and system based on deep reinforcement learning
CN113848750A (en) * 2021-09-14 2021-12-28 清华大学 Two-wheeled robot simulation system and robot system
CN114859940A (en) * 2022-07-05 2022-08-05 北京建筑大学 Robot movement control method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111367282B (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN111367282B (en) Robot navigation method and system based on multimode perception and reinforcement learning
US20220212693A1 (en) Method and apparatus for trajectory prediction, device and storage medium
Ma et al. Artificial intelligence applications in the development of autonomous vehicles: A survey
Le Mero et al. A survey on imitation learning techniques for end-to-end autonomous vehicles
Sauer et al. Conditional affordance learning for driving in urban environments
Van Brummelen et al. Autonomous vehicle perception: The technology of today and tomorrow
Hu et al. Safe local motion planning with self-supervised freespace forecasting
KR20210074366A (en) Autonomous vehicle planning and forecasting
Haavaldsen et al. Autonomous vehicle control: End-to-end learning in simulated urban environments
Sharma et al. Pedestrian intention prediction for autonomous vehicles: A comprehensive survey
Zhao et al. Autonomous driving system: A comprehensive survey
Wang et al. Imitation learning of hierarchical driving model: from continuous intention to continuous trajectory
US11556126B2 (en) Online agent predictions using semantic maps
Zhu et al. Learning autonomous control policy for intersection navigation with pedestrian interaction
Youssef et al. Comparative study of end-to-end deep learning methods for self-driving car
CN116448134B (en) Vehicle path planning method and device based on risk field and uncertain analysis
Chen Extracting cognition out of images for the purpose of autonomous driving
Tippannavar et al. SDR–Self Driving Car Implemented using Reinforcement Learning & Behavioural Cloning
Souza et al. Template-based autonomous navigation and obstacle avoidance in urban environments
Souza et al. Vision-based autonomous navigation using neural networks and templates in urban environments
EP4124995A1 (en) Training method for training an agent for controlling a controlled device, control method for controlling the controlled device, computer program(s), computer readable medium, training system and control system
US20210383213A1 (en) Prediction device, prediction method, computer program product, and vehicle control system
CN111975775B (en) Autonomous robot navigation method and system based on multi-angle visual perception
Li et al. RDDRL: a recurrent deduction deep reinforcement learning model for multimodal vision-robot navigation
Schörner et al. Towards Multi-Modal Risk Assessment

Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant