CN115469663A - End-to-end navigation obstacle avoidance method facing automatic driving and based on deep reinforcement learning - Google Patents
End-to-end navigation obstacle avoidance method facing automatic driving and based on deep reinforcement learning
- Publication number
- CN115469663A (application CN202211119904.7A)
- Authority
- CN
- China
- Prior art keywords
- agent
- environment
- model
- obstacle
- vehicle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention relates to an end-to-end navigation obstacle avoidance method for automatic driving based on deep reinforcement learning. A lidar samples the laser point cloud to generate a one-dimensional laser point cloud of the environment, which is converted by a cost-map conversion algorithm into an obstacle map representing the dynamic and static obstacles of the environment; environment obstacle maps are collected through closed-loop interaction with a simulation environment, generating a navigation environment data set. A heterogeneous-agent self-game obstacle avoidance model is constructed, the heterogeneous multi-agent self-game model comprising: an observation space, an action space, a reward function and a neural network. Multi-stage parallel curriculum learning is performed so that the neural network reaches a better local optimum faster and the learning process is accelerated, finally yielding a trained heterogeneous-agent self-game obstacle avoidance model. The trained model is then deployed on an actual vehicle, realizing navigation and obstacle avoidance in the real world.
Description
Technical Field
The invention relates to an end-to-end navigation obstacle avoidance method based on deep reinforcement learning and oriented to automatic driving, and belongs to the technical field of automatic driving.
Background
In recent years, with the wide application of automatic driving, navigation and obstacle avoidance methods have emerged one after another. Deep reinforcement learning (DRL) methods have been applied to intelligent navigation and achieve good results, where the autonomous vehicle follows an Ackermann dynamics model. However, developing a deep-reinforcement-learning-based collision avoidance method for Ackermann-steered vehicles in scenarios with unknown moving obstacles remains challenging. The trained network must not only generate collision-free trajectories that satisfy the Ackermann kinematic constraints, but also handle multiple moving obstacles in various environments, which forces deep reinforcement learning to search for feasible, collision-free and dynamically constrained trajectories in a larger search space. Enabling Ackermann-steered vehicles to find collision-free and kinematically feasible paths to their targets in environments with both static and moving obstacles is one of the major challenges of autonomous driving [1]. In an autonomous driving scenario, common static and dynamic obstacles mainly include static scene elements, moving vehicles, pedestrians, bicycles, and the like. The problem of realizing autonomous driving in static and dynamic obstacle scenarios can therefore be described as a multi-agent navigation and obstacle avoidance problem under heterogeneous dynamics models.
Although many collision avoidance approaches have been proposed, they share some limitations in practice. For example, the assumptions underlying such approaches may not hold in all environments [2].
Furthermore, collision avoidance among multiple agents with different shapes and different kinematic constraints in distributed, non-communicating scenarios is an even more challenging task. Existing methods such as optimal reciprocal collision avoidance (ORCA) [3] and bicycle reciprocal collision avoidance (B-ORCA) [4] provide sufficient conditions for agents to avoid mutual collisions, but they require all agents to obey the same kinematic constraints. Generalized reciprocal collision avoidance extends the approach to agents with different kinematic constraints, such as differential drive and Ackermann steering.
To overcome these limitations, deep reinforcement learning (DRL) methods have been applied to the collision avoidance problem and achieve good results. However, common deep reinforcement learning approaches rarely consider multi-agent collision avoidance under different kinematic constraints, which limits their application. For example, in a complex scene containing both cars and pedestrians, conventional deep reinforcement learning methods cannot achieve satisfactory navigation.
[1] Dong Y, Zhang Y, Ai J. Experimental test of unmanned ground vehicle delivering goods using RRT path planning algorithm[J]. Unmanned Systems, 2017, 5(01): 45-57.
[2] Zhang W, Wei S, Teng Y, et al. Dynamic obstacle avoidance for unmanned underwater vehicles based on an improved velocity obstacle method[J]. Sensors, 2017, 17(12): 2742.
[3] van den Berg J, Guy S J, Lin M, et al. Reciprocal n-body collision avoidance[M]// Robotics Research. Springer, Berlin, Heidelberg, 2011: 3-19.
[4] Alonso-Mora J, Breitenmoser A, Beardsley P, et al. Reciprocal collision avoidance for multiple car-like robots[C]// 2012 IEEE International Conference on Robotics and Automation. IEEE, 2012: 360-366.
Disclosure of Invention
The invention addresses these problems: it overcomes the shortcomings of the prior art, reduces the consumption of computing resources, increases the inference speed of autonomous navigation, and strengthens the obstacle avoidance capability of the autonomous vehicle in complex multi-agent scenes, achieving a higher arrival rate and better safety when dealing with hard-to-predict pedestrian scenes. In tests, a 98% task completion rate is achieved in real-world scenes.
The technical scheme of the invention is as follows: an end-to-end navigation obstacle avoidance method facing automatic driving and based on deep reinforcement learning comprises the following implementation steps:
step 1: sampling the lidar point cloud to generate a one-dimensional laser point cloud of the environment, and converting it with a cost-map conversion algorithm to obtain an obstacle map representing the dynamic and static obstacles of the environment; then continuously collecting environment obstacle maps through closed-loop interaction with a simulation environment to generate a navigation environment data set, the data set comprising a training set and a test set;
step 2: constructing a heterogeneous-agent self-game obstacle avoidance model, the heterogeneous multi-agent self-game model comprising: an observation space, an action space, a reward function and a neural network;
the observation space is a complete description of the state of the world; apart from this state the world carries no other information, and a particular value of the observation space represents the observation data of the current agent;
an action space defining a range of actions that an agent can take;
reward function, which constrains and guides the behavior of the agent to realize safe navigation; different reward functions r_r and r_v are set for pedestrians and vehicles respectively, a reward shaping method is used, and r_w and r_d are introduced into r_v as penalties for the warning and danger zones;
the neural network is used for learning the navigation strategy;
the neural network, combined with the reward function, plans a target behavior that lies in the action space; the planned target behavior is given to the agent, the agent executes this behavior command, and the obstacle map of the current environment, which lies in the observation space, is output; the behavior executed by the agent and the resulting obstacle map are fed back into the neural network, forming closed-loop training;
the observation space, action space, reward function and neural network are combined to learn the obstacle avoidance strategy in a simulation environment, the simulation environment being an abstract description of the real world; the obstacles comprise vehicle agents based on the Ackermann model and pedestrian agents based on the differential model; following the behavior patterns of dynamic obstacles in real environments, the action space of an Ackermann vehicle is described as its desired driving trajectory, and the action space of a pedestrian is described as its direction and speed of motion; based on the simulation environment and the heterogeneous-agent self-game obstacle avoidance model, game learning between Ackermann vehicles, between Ackermann vehicles and differential pedestrians, and between differential pedestrians is finally realized with the same network structure and parameters, so that the multiple agents avoid obstacles flexibly in a dynamic environment;
step 3: based on the training set of step 1, performing multi-stage parallel curriculum learning on the heterogeneous-agent self-game obstacle avoidance model constructed in step 2, so that the neural network reaches a better local optimum faster and the learning process is accelerated, finally obtaining the trained heterogeneous-agent self-game obstacle avoidance model;
step 4: deploying the trained heterogeneous-agent self-game obstacle avoidance model on an actual vehicle to realize navigation and obstacle avoidance in the real world.
In the heterogeneous multi-agent self-gaming model constructed in the step 2:
observation space: the observation consists of three parts, namely the obstacle map of the environment in which the agent is located, the agent's current state, and the target point; the obstacle map is specified as an egocentric local grid map; it represents the environmental information surrounding the agent, including the shape of the agent and the appearance of the observable obstacles;
an action space: first, a continuous action space is set for the pedestrian agent based on the differential model, namely v_t ∈ [0, 0.6] and ω_t ∈ [-0.9, 0.9], where v_t is the speed adopted at time t, with 0 and 0.6 the minimum and maximum speeds of the agent (0 m/s and 0.6 m/s, unit: m/s), and ω_t is the rotation angle adopted at time t (unit: rad), with -0.9 and 0.9 bounding the agent's rotation range; second, a continuous action space is set for the vehicle agent based on the Ackermann model, namely c_t ∈ [-1.43, 1.43] and δ_t ∈ [-11.25, 11.25], where c_t is the curvature of the trajectory the agent desires to drive at time t and δ_t is the acceleration corresponding to that desired trajectory;
the reward function: different reward functions r_r and r_v are set for pedestrians and vehicles respectively, a reward shaping method is used, and r_w and r_d are introduced into r_v as the penalty terms for the warning and danger zones, specifically as follows:
reward function of the pedestrian:
r_r = r_g + r_c + r_s,
reward function of the vehicle, where r_g, r_c, r_s are the same as in the pedestrian reward:
r_v = r_g + r_c + r_s + r_w + r_d
in the formula for r_r, r_arr > 0, p_t denotes the position of the agent at the current time t, ε is a hyperparameter of the neural network, and r_g denotes the reward for reaching the goal and the penalty for moving away from it; r_col < 0 and r_c denotes the penalty for a collision; finally, a constant negative penalty r_s < 0 is applied over the whole learning period to encourage shortest paths;
in the formula for r_v, r_warn, r_danger < 0, r_w denotes the penalty when there is an obstacle within the warning zone of the vehicle, and r_d the penalty when there is an obstacle within the danger zone of the vehicle; in the implementation, r_arr = 500, ε = 10, r_col = -500, r_s = -5, r_warn = -20 and r_danger = -10 are set;
A neural network: the outputs of the policy network are interpreted respectively as the behaviors of the Ackermann-model vehicle agent and of the differential-model pedestrian agent, realizing self-game learning among the heterogeneous agents; the input is the observation data of each heterogeneous agent, including the agent's obstacle map, current state and target point. The final output layers of the network are modified so that the network can simultaneously output the behavior strategies of the heterogeneous agents, where the modified output layers use Gaussian sampling to obtain the linear and angular velocities as the pedestrian's direction and speed of motion, and the curvature and acceleration of the vehicle's desired driving trajectory.
This representation allows a single network to predict the behaviors of different heterogeneous agents and of several agents at once. After training, the neural network attains high navigation performance while simulating the pedestrian-vehicle game process of real environments as closely as possible, which strengthens the safety of the navigation strategy.
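The concrete architecture is given only in FIG. 2, so the following PyTorch sketch is merely illustrative of the idea described above: one shared encoder over the egocentric obstacle map and the agent state, with two Gaussian output heads, one producing the pedestrian command (v, ω) and one the vehicle command (curvature, acceleration). The class name, layer sizes and state dimension are assumptions.

```python
import torch
import torch.nn as nn

class SelfGamePolicy(nn.Module):
    """Shared policy for heterogeneous agents: one encoder, two Gaussian heads."""

    def __init__(self, map_channels=3, state_dim=6):
        super().__init__()
        # Convolutional encoder over the egocentric obstacle map
        self.encoder = nn.Sequential(
            nn.Conv2d(map_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.state_fc = nn.Linear(state_dim, 64)        # own state + goal point
        self.trunk = nn.Sequential(nn.LazyLinear(256), nn.ReLU())
        self.ped_head = nn.Linear(256, 2)               # mean of (v, omega)
        self.veh_head = nn.Linear(256, 2)               # mean of (curvature, acceleration)
        self.log_std = nn.Parameter(torch.zeros(2))     # learned exploration noise

    def forward(self, obstacle_map, state, agent_type):
        feat = torch.cat([self.encoder(obstacle_map),
                          torch.relu(self.state_fc(state))], dim=-1)
        h = self.trunk(feat)
        mean = self.ped_head(h) if agent_type == "pedestrian" else self.veh_head(h)
        # Gaussian action distribution; sample() yields the raw continuous action
        return torch.distributions.Normal(mean, self.log_std.exp())
```

Selecting one head per agent type is what lets a single set of parameters drive every heterogeneous agent in the scene.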
Compared with the prior art, the invention has the advantages that:
(1) The invention accomplishes fast, collision-free navigation in scenes containing moving obstacles with several heterogeneous dynamics models and several kinds of static obstacles. By collecting data and iterating the strategy thousands of times in the simulation environment with an end-to-end deep reinforcement learning method, it can cover most scenes met in practical applications. Compared with traditional navigation algorithms, training is faster and inference is faster. A navigation and obstacle avoidance algorithm built on deep reinforcement learning exploits both autonomous learning and the high learning efficiency available in simulation, so a near-optimal navigation and obstacle avoidance strategy can be learned quickly.
(2) Self-game cooperative training improves the adaptive capacity of the autonomous vehicle in such scenes. In traditional navigation strategies, the behavior strategies of dynamic obstacles and the shape characteristics of static obstacles in the environment are simple, so the learned strategy is not robust and adapts poorly to new scenes. After training, the invention can adapt to static scenes ranging from simple to complex, to various dynamic scenes, and to mixed scenes. To achieve this, the heterogeneous multi-agent self-game method lets the agents improve each other's obstacle avoidance and adaptation abilities through self-game cooperative training under multiple dynamics models and multiple static obstacle scenes.
(3) The invention outputs the trajectory of the autonomous vehicle and the motion of the pedestrian through the deep reinforcement learning algorithm, which improves deployment efficiency and obstacle avoidance capability during actual deployment. Compared with the traditional practice of outputting low-level vehicle control commands, high-level trajectory output better guarantees the safety and navigation capability of the vehicle. Introducing the trajectory strategy into the heterogeneous multi-agent self-game reduces the jitter problem, and because the trajectory output carries temporal information, possible future dangers can be anticipated and avoided in time.
(4) Because a dynamic environment with many agents under complex heterogeneous dynamics models is too complex, the model cannot converge when trained on it directly. The agents are therefore placed in several environments of progressive difficulty, realizing multi-environment parallel learning and simple-to-complex curriculum learning; compared with single-environment serial training, multi-environment parallelism accelerates learning, and the progressive multi-environment curriculum reduces training difficulty.
Drawings
FIG. 1 is a flow chart of a method implementation of the present invention;
FIG. 2 is a diagram of a heterogeneous agent self-gaming obstacle avoidance neural network structure in the present invention;
FIG. 3 shows the multi-stage parallel curriculum learning of the present invention: the left panel is parallel curriculum learning in a random static scene, and the right panel is parallel curriculum learning in a circular scene;
FIG. 4 shows the simulated training environments: the left panel is the random training environment, and the right panel is the circular training environment;
FIG. 5 is a physical vehicle deployment and test chart.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
Autonomous driving scenes are complex and changeable. For a vehicle to navigate and avoid obstacles efficiently and safely in such scenes, the navigation strategy must flexibly perceive dynamic obstacles with different dynamics models as well as various static obstacles, and plan a reasonable collision-free trajectory. The invention provides an end-to-end navigation obstacle avoidance method for automatic driving based on deep reinforcement learning, used for collision avoidance of Ackermann-steered vehicles; the vehicle can handle multiple moving obstacles with heterogeneous dynamics and multiple kinds of static obstacles. Specifically, a convolutional neural network is introduced that maps the vehicle's observations to kinematically feasible trajectories, and the network is trained in a simulated environment in which the moving obstacles follow the same strategy as the Ackermann-steered vehicle. Dedicated reward shaping and multi-stage parallel curriculum learning strategies are introduced to accelerate and stabilize learning. The trained model is then deployed on an actual vehicle to realize navigation and obstacle avoidance in the real world. The method is evaluated in various simulated and real-world scenes, showing that the vehicle avoids various moving obstacles with a high success rate.
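Because the vehicle action is a desired trajectory rather than a raw control command, the (curvature, acceleration) output has to be unrolled into way-points before it can be tracked. The patent does not spell out this step, so the sketch below assumes a simple constant-curvature, constant-acceleration kinematic rollout; the horizon and time step are illustrative values.

```python
import math

def rollout_trajectory(x, y, yaw, v, curvature, accel, horizon=2.0, dt=0.1):
    """Integrate a constant-curvature, constant-acceleration arc forward in time.

    Returns a list of (x, y, yaw, v) way-points that a low-level controller
    could track; the yaw rate follows the kinematic relation yaw_dot = v * curvature.
    """
    trajectory = []
    for _ in range(int(horizon / dt)):
        x += v * math.cos(yaw) * dt
        y += v * math.sin(yaw) * dt
        yaw += v * curvature * dt
        v = max(0.0, v + accel * dt)      # speed never goes negative
        trajectory.append((x, y, yaw, v))
    return trajectory
```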
Finding a trajectory that is collision-free and satisfies the constraints of the dynamics model in a complex autonomous driving scene is a challenging task. Existing methods generally perceive and model the environment with a perception module, generate the navigation map required for navigation, and combine it with a navigation obstacle avoidance algorithm to navigate in complex scenes. They also generally assume that the surrounding dynamic obstacles move according to a specific behavior strategy, for example that a dynamic Ackermann vehicle uses bicycle reciprocal collision avoidance (B-ORCA) and that pedestrians use optimal reciprocal collision avoidance (ORCA). In real autonomous driving scenes, however, the behavior strategies of dynamic obstacles are complex and changeable, so dynamic scenes simulated with such traditional algorithms are highly repetitive, and further exploration is needed to improve performance in dynamic scenes.
As shown in fig. 1, the method of the present invention comprises the following steps:
step 1: construct the heterogeneous-agent self-game obstacle avoidance model, collect data through interaction with the simulation environment module, preprocess the data and convert it into the input required by the model, for use in the subsequent training and evaluation process. The simulation environment is a strategy-level simulator designed for automatic driving, customized on the basis of OpenCV; the self-developed simulation environment is extended with pedestrians and vehicles that can be driven by the neural network. The simulation environment is shown in fig. 4: the left panel is an example scene with several kinds of static obstacles and several dynamic obstacles, in which squares represent the autonomous vehicles and circles represent pedestrians; attributes such as the sizes of pedestrians and vehicles can be adjusted through the environment settings. The corresponding static obstacles are squares, rectangles or circles whose sizes can also be adjusted. The right panel is the dynamic obstacle scene used, with progressive difficulty, in the subsequent steps; its elements are as described above.
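A minimal sketch of the closed-loop data collection described in this step is given below; `env`, `policy.act` and the (obs, action, reward, next_obs, done) sample layout are assumed placeholders rather than the simulator's actual interface.

```python
import random

def collect_navigation_dataset(env, policy, episodes=1000, train_ratio=0.8):
    """Closed-loop interaction with the simulator: observe, act, record.

    Each recorded sample is (obs, action, reward, next_obs, done), where obs
    bundles the obstacle map, the agent state and the goal point.
    """
    samples = []
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            action = policy.act(obs)                    # behavior planned by the network
            next_obs, reward, done, info = env.step(action)
            samples.append((obs, action, reward, next_obs, done))
            obs = next_obs
    random.shuffle(samples)
    split = int(train_ratio * len(samples))
    return samples[:split], samples[split:]             # training set, test set
```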
step 2: the dynamics models of the different agents are modeled as a heterogeneous-agent self-game model; the same network outputs the actions of agents with different dynamics models, and when several agents are present in the environment the network predicts the behaviors of all of them simultaneously.
Heterogeneous agent self-gaming module: fig. 2 shows a heterogeneous agent self-gaming obstacle avoidance model structure. The heterogeneous multi-agent self-gaming module mainly comprises: observation space, action space, reward function, and neural network structure. The observation space is responsible for carrying out complete perception and preprocessing on the environment, and the action space defines the effective control command types and the size and the range of the control quantity of the heterogeneous multi-agent. The reward function is used to define a quantitative assessment formula for how well the heterogeneous agent takes action. The neural network structure is used for learning navigation and obstacle avoidance strategies in complex and changeable scenes. The contents involved in the heterogeneous agent self-gaming model are respectively described as follows:
observation space: the observation consists of three parts, namely an obstacle map of the environment where the intelligent agent is located, the current self state and a target point; the obstacle map is specified by a self-centered local grid map; as shown in fig. 4, the obstacle map represents environmental information around the agent, including the shape of the agent and the appearance of observable obstacles; the obstacle map is constructed from cost maps generated from various sensor data (e.g., output of a 2D laser, a 3D laser, or a depth camera).
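The text does not detail how a scan becomes the egocentric grid, so the following sketch shows one straightforward rasterization of a single 2D scan; the grid size, resolution and maximum range are illustrative values, not parameters from the patent.

```python
import numpy as np

def scan_to_obstacle_map(ranges, angle_min, angle_increment,
                         map_size=60, resolution=0.1, max_range=6.0):
    """Project a 1-D lidar scan into an egocentric occupancy grid.

    ranges          : measured distances (m), one per beam
    angle_min       : angle of the first beam (rad) in the robot frame
    angle_increment : angular step between beams (rad)
    map_size        : grid is map_size x map_size cells, robot at the center
    resolution      : cell edge length (m)
    """
    grid = np.zeros((map_size, map_size), dtype=np.uint8)
    center = map_size // 2
    angles = angle_min + angle_increment * np.arange(len(ranges))
    for r, a in zip(ranges, angles):
        if not np.isfinite(r) or r >= max_range:
            continue  # no return within range: leave the cells free
        x, y = r * np.cos(a), r * np.sin(a)        # beam endpoint, robot frame
        col = center + int(round(x / resolution))
        row = center - int(round(y / resolution))
        if 0 <= row < map_size and 0 <= col < map_size:
            grid[row, col] = 255                   # mark the hit cell as occupied
    return grid
```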
An action space: the self-game process of the heterogeneous agents is realized by giving each type of agent its own behavior strategy. As shown in fig. 2, a continuous action is first set for the pedestrian agent based on the differential model, namely v_t ∈ [0, 0.6] and ω_t ∈ [-0.9, 0.9]; a continuous action is then set for the vehicle agent based on the Ackermann model, namely c_t ∈ [-1.43, 1.43] and δ_t ∈ [-11.25, 11.25]. Through this design of the action space, the heterogeneous agents can learn and train with the same set of networks.
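The bounds quoted above can be collected into a small helper; the dictionary layout and the `clip_action` function are illustrative, only the numeric ranges come from the text.

```python
import numpy as np

PEDESTRIAN_BOUNDS = {          # differential-drive pedestrian agent
    "v":     (0.0, 0.6),       # linear speed, m/s
    "omega": (-0.9, 0.9),      # rotation command, rad (unit as given in the text)
}
VEHICLE_BOUNDS = {             # Ackermann vehicle agent
    "curvature":    (-1.43, 1.43),    # curvature of the desired trajectory
    "acceleration": (-11.25, 11.25),  # acceleration along that trajectory
}

def clip_action(raw, bounds):
    """Clamp a raw network output into the legal continuous action range."""
    return {k: float(np.clip(raw[k], lo, hi)) for k, (lo, hi) in bounds.items()}
```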
The reward function: the self-game of the heterogeneous agents is realized by modeling the behavior of vehicles and pedestrians in complex dynamic scenes; at the same time, to realize safe navigation, the navigation strategy aims to minimize the number of collisions between vehicles and pedestrians and to minimize the agents' collision-free arrival time. In the invention, a dedicated judgment zone is introduced and combined with a reward shaping strategy, and different reward functions r_r and r_v are set for pedestrians and vehicles respectively.
Reward function of the pedestrian:
r_r = r_g + r_c + r_s,
Reward function of the vehicle, where r_g, r_c, r_s are the same as in the pedestrian reward:
r_v = r_g + r_c + r_s + r_w + r_d
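A hedged sketch of these reward terms follows, using the constants listed above (r_arr = 500, ε = 10, r_col = -500, r_s = -5, r_warn = -20, r_danger = -10); how r_g uses ε for distance shaping is not spelled out here, so that line is an assumption.

```python
# Constants from the text above; the per-term shaping (how epsilon enters r_g)
# is not spelled out, so that part is an assumption.
R_ARR, EPSILON = 500.0, 10.0
R_COL, R_STEP = -500.0, -5.0
R_WARN, R_DANGER = -20.0, -10.0

def pedestrian_reward(reached_goal, collided, dist_prev, dist_now):
    """r_r = r_g + r_c + r_s for the differential-model pedestrian agent."""
    r_g = R_ARR if reached_goal else EPSILON * (dist_prev - dist_now)  # assumed shaping
    r_c = R_COL if collided else 0.0
    r_s = R_STEP                     # constant step penalty -> prefer shortest paths
    return r_g + r_c + r_s

def vehicle_reward(reached_goal, collided, dist_prev, dist_now,
                   in_warning_zone, in_danger_zone):
    """r_v = r_g + r_c + r_s + r_w + r_d for the Ackermann vehicle agent."""
    r_w = R_WARN if in_warning_zone else 0.0
    r_d = R_DANGER if in_danger_zone else 0.0
    return pedestrian_reward(reached_goal, collided, dist_prev, dist_now) + r_w + r_d
```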
a neural network: the convolutional network structure used for the navigation strategy is shown in fig. 2. The input of the network is the agent's observation data. The network output is sampled from a Gaussian distribution, a weighted average of the sampled actions is computed, and this average is output as the agent's behavior strategy: the pedestrian agent based on the differential model is controlled with a linear velocity and an angular velocity, and the vehicle agent based on the Ackermann model is controlled with a trajectory.
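The "weighted average of the sampled actions" is not defined precisely in the text; the sketch below shows one plausible reading in which several samples drawn from the Gaussian head are blended with weights proportional to their probability density.

```python
import torch

def weighted_action(dist, num_samples=8):
    """Blend several samples from the Gaussian policy head into one action.

    `dist` is a torch.distributions.Normal over the action dimensions; the
    density-weighted average below is one plausible reading of the
    'weighted average of the actions' described in the text.
    """
    samples = dist.sample((num_samples,))                 # [K, action_dim]
    log_w = dist.log_prob(samples).sum(dim=-1)            # per-sample log density
    weights = torch.softmax(log_w, dim=0).unsqueeze(-1)   # normalized weights, [K, 1]
    return (weights * samples).sum(dim=0)                 # blended action
```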
step 3: multi-stage curriculum learning and parallel learning are combined to accelerate the training and convergence of the model in the simulation environment. Curriculum learning has been shown to be an effective strategy that helps find a better local minimum and speeds up training; its main idea is to decompose a hard learning task into several simple tasks and to increase the difficulty gradually, starting from the simplest task. In the invention, learning efficiency is low when the scene difficulty is high in both the static and the dynamic scenes, so parallel learning is introduced to accelerate and guide training; since the static and dynamic scenes form a progression of difficulty, a multi-stage curriculum learning strategy is introduced between them to guide the agents, forming a complete multi-stage parallel curriculum learning strategy that trains the network more efficiently. After the model converges it is evaluated and tested to verify its effectiveness, and the target model is finally deployed in the real world to complete the target navigation task.
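A minimal sketch of the staged, parallel training loop follows; `make_env`, `collect_rollout` and `policy.update` are assumed placeholders for the simulator and the reinforcement-learning update, not interfaces defined by the patent.

```python
def train_with_curriculum(make_env, policy, stage_builders,
                          envs_per_stage=8, iterations=500):
    """Multi-stage parallel curriculum (schematic).

    `stage_builders` is an ordered list of scene generators, easiest first,
    e.g. [random_scene, circular_scene].
    """
    for stage, build_scene in enumerate(stage_builders):
        # Several copies of the same difficulty level run in parallel
        envs = [make_env(build_scene) for _ in range(envs_per_stage)]
        for _ in range(iterations):
            rollouts = [collect_rollout(env, policy) for env in envs]
            policy.update(rollouts)          # one policy-gradient style update
        print(f"curriculum stage {stage} finished")
    return policy
```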
In this work, scenes with heterogeneous multi-agents and static obstacles are built in the simulation environment, and the Ackermann-model vehicle agents, the differential-model pedestrian agents and the obstacles are trained together, improving the ability of the vehicle agents to cope with complex scenes. FIG. 4 shows two example training scenes in the simulation environment, and FIG. 3 shows the multi-stage parallel curriculum learning: the left panel is parallel curriculum learning in a random static scene and the right panel is parallel curriculum learning in a circular scene. Parallel curriculum learning is first carried out in the random obstacle scene (also called the random scene), which helps the Ackermann-model vehicle agents avoid random obstacles. The trained model then continues parallel curriculum learning in the circular scene, which improves the vehicle agents' obstacle avoidance in complex dynamic obstacle scenes. Self-game training in the random scene first enables the policy network to cope with complex static obstacle scenes and simple dynamic pedestrian scenes. Specifically, the environment in the random scene contains 3 Ackermann-model vehicle agents, 3 differential-model pedestrian agents and 5 static obstacles; the obstacle positions are chosen at random, and the start and target positions of the vehicle and pedestrian agents are also random within a certain range. Both the vehicle agents and the pedestrian agents in the environment are driven by the policy network; several heterogeneous agents sharing the same navigation strategy coexist and must also avoid each other, which helps the network learn multi-agent obstacle avoidance. Combined with the reward shaping strategy, the Ackermann-model vehicle agents learn to actively avoid other dynamic and static obstacles and to keep a certain safety distance. The environment in the circular scene contains 3 Ackermann-model vehicle agents and 3 differential-model pedestrian agents placed at random on circles of random radius; training in the circular scene lets the vehicle and pedestrian agents effectively learn navigation in complex dynamic scenes.
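The two scene configurations can be sketched as follows; the arena size, circle radii and the choice of goals on the opposite side of the circle are assumptions for illustration, while the agent and obstacle counts come from the text.

```python
import math
import random

def random_scene(area=20.0):
    """Random scene: 3 vehicle agents, 3 pedestrian agents, 5 static obstacles,
    all positions drawn uniformly; the arena size is an assumed parameter."""
    def place():
        return (random.uniform(0.0, area), random.uniform(0.0, area))
    return {
        "vehicles":    [{"start": place(), "goal": place()} for _ in range(3)],
        "pedestrians": [{"start": place(), "goal": place()} for _ in range(3)],
        "obstacles":   [place() for _ in range(5)],
    }

def circular_scene(min_radius=4.0, max_radius=10.0):
    """Circular scene: 3 vehicle and 3 pedestrian agents on a circle of random
    radius; sending each agent to the opposite side is an assumed choice that
    makes their paths cross near the center."""
    radius = random.uniform(min_radius, max_radius)
    agents = []
    for i in range(6):
        theta = 2.0 * math.pi * i / 6 + random.uniform(-0.2, 0.2)  # jittered spacing
        start = (radius * math.cos(theta), radius * math.sin(theta))
        agents.append({
            "type": "vehicle" if i < 3 else "pedestrian",
            "start": start,
            "goal": (-start[0], -start[1]),
        })
    return {"agents": agents, "obstacles": []}
```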
Meanwhile, as the scene complexity increases during training, the multi-stage parallel curriculum learning framework is introduced. This accelerates the training of the heterogeneous-agent self-game obstacle avoidance model, effectively guides the learning of the Ackermann-model vehicle agents in complex dynamic scenes, and increases the training speed.
As shown in fig. 4, the simulated training environments are as follows. The first is the random training environment: it contains several Ackermann-model vehicle agents and differential-model pedestrian agents with random initial positions, together with static obstacles of arbitrary shape and several dynamic obstacles, and the figure also shows the observation data of each agent in that environment. Every agent in the simulation is driven by the neural network, and through a long self-game learning process each agent acquires strong obstacle avoidance in static scenes. The second is the circular training environment: it contains several Ackermann-model vehicle agents and differential-model pedestrian agents whose initial positions are random within a specified circular region, with no static obstacles, and the figure again shows the observation data of each agent. Every agent is driven by the neural network, and through a long self-game learning process each agent acquires strong obstacle avoidance in dynamic scenes.
The strategy model trained in the simulation environment is deployed on a physical Ackermann-model vehicle platform to test its obstacle avoidance capability in complex real-world scenes.
As shown in FIG. 5, for the physical vehicle deployment and tests, the Ackermann-model vehicle is built on a Songling mobile robot chassis, carries a Tengling 16-line lidar for environment perception, and includes an RTK-GPS inertial navigation system for positioning. The vehicle is equipped with an industrial computer with an RTX 2080Ti as the core computing unit, which is mainly used to generate the environment perception data and to run the trained policy network. A random pedestrian scene and a circular scene corresponding to the training scenes are reproduced in the real environment, and physical tests are also carried out in a mixed static-dynamic scene.
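A schematic of the on-vehicle inference loop follows; `lidar.read`, `localization.pose`, `policy.act` and `controller.track` are assumed interfaces, not the names of any real driver, and the map and rollout helpers refer to the earlier sketches.

```python
import time

def run_on_vehicle(lidar, localization, policy, controller, rate_hz=10):
    """Schematic on-vehicle loop: scan -> obstacle map -> policy -> trajectory.

    All hardware interfaces here are placeholders; scan_to_obstacle_map and
    rollout_trajectory are the helper sketches shown earlier.
    """
    while True:
        ranges, angle_min, angle_inc = lidar.read()
        obstacle_map = scan_to_obstacle_map(ranges, angle_min, angle_inc)
        x, y, yaw, v = localization.pose()                 # RTK-GPS / INS pose
        curvature, accel = policy.act(obstacle_map, (x, y, yaw, v),
                                      agent_type="vehicle")
        trajectory = rollout_trajectory(x, y, yaw, v, curvature, accel)
        controller.track(trajectory)                       # low-level tracking
        time.sleep(1.0 / rate_hz)
```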
In complex static scenes, mixed static-dynamic scenes and dynamic pedestrian scenes, the Ackermann-model vehicle reaches the target end point quickly and without collision, completing the navigation task safely. The map-based, self-game deep reinforcement learning algorithm effectively improves the obstacle avoidance capability of the Ackermann-model vehicle, so that it performs well in complex static and dynamic scenes. Using the trajectory as the strategy's action also makes physical deployment faster, and the effect can be improved by fine-tuning a few simple parameters.
The method is used for navigation and obstacle avoidance in complex autonomous driving scenes. The observation space expresses the environment information around the agent. The dynamics and shape characteristics of the heterogeneous agents are modeled, their behavior strategies are trained with the neural network, and complex dynamic scenes are explored through the self-game process, improving the Ackermann vehicle's ability to avoid obstacles in complex dynamic scenes. The method is evaluated in a variety of scenes, in both simulated environments and real scenes; the results show that with this method the Ackermann-model vehicle copes well with navigation in complex static and complex dynamic pedestrian scenes.
Although particular embodiments of the present invention have been described above, it will be appreciated by those skilled in the art that these are merely examples and that many variations or modifications may be made to these embodiments without departing from the principles and implementations of the invention, the scope of which is therefore defined by the appended claims.
Claims (2)
1. An end-to-end navigation obstacle avoidance method facing automatic driving and based on deep reinforcement learning is characterized by comprising the following implementation steps:
step 1: sampling the lidar point cloud to generate a one-dimensional laser point cloud of the environment, and converting it with a cost-map conversion algorithm to obtain an obstacle map representing the dynamic and static obstacles of the environment; performing closed-loop interaction with a simulation environment, continuously collecting environment obstacle maps, and generating a navigation environment data set, the data set comprising a training set and a test set;
step 2: constructing a heterogeneous-agent self-game obstacle avoidance model, the heterogeneous multi-agent self-game model comprising: an observation space, an action space, a reward function and a neural network;
an observation space representing observation data of a current agent;
an action space that defines the scope of actions taken by the agent;
a reward function for constraining and guiding the behavior of the agent to realize safe navigation, with different reward functions r_r and r_v set for pedestrians and vehicles respectively, a reward shaping method being used, and r_w and r_d being introduced into r_v as penalties for the warning and danger zones;
a neural network for learning of a navigation strategy;
the neural network, combined with the reward function, plans a target behavior that lies in the action space; the planned target behavior is given to the agent, the agent executes this behavior command, and the obstacle map of the current environment, which lies in the observation space, is output; the behavior executed by the agent and the resulting obstacle map are fed back into the neural network, forming closed-loop training;
combining the observation space, action space, reward function and neural network to learn the obstacle avoidance strategy in a simulation environment, the simulation environment being an abstract description of the real world, the obstacles comprising vehicle agents based on the Ackermann model and pedestrian agents based on the differential model; following the behavior patterns of dynamic obstacles in real environments, the action space of an Ackermann vehicle is described as its desired driving trajectory and the action space of a pedestrian is described as its direction and speed of motion; based on the simulation environment and the heterogeneous-agent self-game obstacle avoidance model, the same network structure and parameters are finally used to realize game learning between Ackermann vehicles, between Ackermann vehicles and differential pedestrians, and between differential pedestrians, so that the multiple agents avoid obstacles flexibly in a dynamic environment;
step 3: based on the training set of step 1, performing multi-stage parallel curriculum learning on the heterogeneous-agent self-game obstacle avoidance model constructed in step 2, so that the neural network reaches a better local optimum faster and the learning process is accelerated, finally obtaining the trained heterogeneous-agent self-game obstacle avoidance model;
step 4: deploying the trained heterogeneous-agent self-game obstacle avoidance model on an actual vehicle to realize navigation and obstacle avoidance in the real world.
2. The end-to-end navigation obstacle avoidance method facing automatic driving and based on deep reinforcement learning as recited in claim 1, wherein: in the heterogeneous multi-agent self-gaming model constructed in the step 2:
observation space: the observation consists of three parts, namely the obstacle map of the environment in which the agent is located, the agent's current state, and the target point; the obstacle map is specified as an egocentric local grid map; it represents the environmental information surrounding the agent, including the shape of the agent and the appearance of the observable obstacles;
an action space: first, a continuous action space is set for the pedestrian agent based on the differential model, namely v_t ∈ [0, 0.6] and ω_t ∈ [-0.9, 0.9], where v_t is the speed adopted at time t, with 0 and 0.6 the minimum and maximum speeds of the agent (0 m/s and 0.6 m/s, unit: m/s), and ω_t is the rotation angle adopted at time t (unit: rad), with -0.9 and 0.9 bounding the agent's rotation range; second, a continuous action space is set for the vehicle agent based on the Ackermann model, namely c_t ∈ [-1.43, 1.43] and δ_t ∈ [-11.25, 11.25], where c_t is the curvature of the trajectory the agent desires to drive at time t and δ_t is the acceleration corresponding to that desired trajectory;
the reward function: different reward functions r_r and r_v are set for pedestrians and vehicles respectively, a reward shaping method is used, and r_w and r_d are introduced into r_v as the penalty mechanisms for the warning and danger zones, specifically as follows:
reward function of the pedestrian:
r_r = r_g + r_c + r_s,
reward function of the vehicle, where r_g, r_c, r_s are the same as in the pedestrian reward:
r_v = r_g + r_c + r_s + r_w + r_d
in the formula for r_r, r_arr > 0, p_t denotes the position of the agent at the current time t, ε is a hyperparameter of the neural network, and r_g denotes the reward for reaching the goal and the penalty for leaving it; r_col < 0 and r_c denotes the penalty for a collision; finally, a constant negative penalty r_s < 0 is applied over the whole learning period to encourage shortest paths;
in the formula for r_v, r_warn, r_danger < 0, r_w denotes the penalty when there is an obstacle within the warning zone of the vehicle, and r_d the penalty when there is an obstacle within the danger zone of the vehicle; in the implementation, r_arr = 500, ε = 10, r_col = -500, r_s = -5, r_warn = -20 and r_danger = -10 are set;
a neural network: the outputs of the policy network are interpreted respectively as the behaviors of the Ackermann-model vehicle agent and of the differential-model pedestrian agent, realizing self-game learning among the heterogeneous agents; the input is the observation data of each heterogeneous agent, including the agent's obstacle map, current state and target point; the final output layers of the network are modified so that the network can simultaneously output the behavior strategies of the heterogeneous agents, where the modified output layers use Gaussian sampling to obtain the linear and angular velocities as the pedestrian's direction and speed of motion, and the curvature and acceleration of the vehicle's desired driving trajectory;
this representation allows a single network to predict the behaviors of different heterogeneous agents and of several agents at once; after training, the neural network attains high navigation performance while simulating the pedestrian-vehicle game process of real environments as closely as possible, strengthening the safety of the navigation strategy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211119904.7A CN115469663B (en) | 2022-09-15 | 2022-09-15 | End-to-end navigation obstacle avoidance method based on deep reinforcement learning and oriented to automatic driving |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211119904.7A CN115469663B (en) | 2022-09-15 | 2022-09-15 | End-to-end navigation obstacle avoidance method based on deep reinforcement learning and oriented to automatic driving |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115469663A true CN115469663A (en) | 2022-12-13 |
CN115469663B CN115469663B (en) | 2024-06-14 |
Family
ID=84333111
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211119904.7A Active CN115469663B (en) | 2022-09-15 | 2022-09-15 | End-to-end navigation obstacle avoidance method based on deep reinforcement learning and oriented to automatic driving |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115469663B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116501086A (en) * | 2023-04-27 | 2023-07-28 | 天津大学 | Aircraft autonomous avoidance decision method based on reinforcement learning |
CN116540602A (en) * | 2023-04-28 | 2023-08-04 | 金陵科技学院 | Vehicle unmanned method based on road section safety level DQN |
CN116755329A (en) * | 2023-05-12 | 2023-09-15 | 江南大学 | Multi-agent danger avoiding and escaping method and device based on deep reinforcement learning |
CN117227763A (en) * | 2023-11-10 | 2023-12-15 | 新石器慧通(北京)科技有限公司 | Automatic driving behavior decision method and device based on game theory and reinforcement learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102019120880A1 (en) * | 2018-08-03 | 2020-02-06 | Ford Global Technologies, Llc | END-TO-END-DEEP-GENERATIVE MODEL FOR SIMULTANEOUS LOCALIZATION AND IMAGE |
CN113341958A (en) * | 2021-05-21 | 2021-09-03 | 西北工业大学 | Multi-agent reinforcement learning movement planning method with mixed experience |
CN114626505A (en) * | 2022-03-04 | 2022-06-14 | 北京理工大学 | Mobile robot deep reinforcement learning control method |
- 2022-09-15 CN CN202211119904.7A patent/CN115469663B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102019120880A1 (en) * | 2018-08-03 | 2020-02-06 | Ford Global Technologies, Llc | END-TO-END-DEEP-GENERATIVE MODEL FOR SIMULTANEOUS LOCALIZATION AND IMAGE |
CN113341958A (en) * | 2021-05-21 | 2021-09-03 | 西北工业大学 | Multi-agent reinforcement learning movement planning method with mixed experience |
CN114626505A (en) * | 2022-03-04 | 2022-06-14 | 北京理工大学 | Mobile robot deep reinforcement learning control method |
Non-Patent Citations (1)
Title |
---|
张海霞; 李腆腆; 李东阳; 刘文杰: "Research on key technologies of intelligent Internet of Vehicles based on vehicle behavior analysis" (基于车辆行为分析的智能车联网关键技术研究), Journal of Electronics & Information Technology (电子与信息学报), no. 01, 15 January 2020 (2020-01-15) *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116501086A (en) * | 2023-04-27 | 2023-07-28 | 天津大学 | Aircraft autonomous avoidance decision method based on reinforcement learning |
CN116501086B (en) * | 2023-04-27 | 2024-03-26 | 天津大学 | Aircraft autonomous avoidance decision method based on reinforcement learning |
CN116540602A (en) * | 2023-04-28 | 2023-08-04 | 金陵科技学院 | Vehicle unmanned method based on road section safety level DQN |
CN116540602B (en) * | 2023-04-28 | 2024-02-23 | 金陵科技学院 | Vehicle unmanned method based on road section safety level DQN |
CN116755329A (en) * | 2023-05-12 | 2023-09-15 | 江南大学 | Multi-agent danger avoiding and escaping method and device based on deep reinforcement learning |
CN116755329B (en) * | 2023-05-12 | 2024-05-24 | 江南大学 | Multi-agent danger avoiding and escaping method and device based on deep reinforcement learning |
CN117227763A (en) * | 2023-11-10 | 2023-12-15 | 新石器慧通(北京)科技有限公司 | Automatic driving behavior decision method and device based on game theory and reinforcement learning |
CN117227763B (en) * | 2023-11-10 | 2024-02-20 | 新石器慧通(北京)科技有限公司 | Automatic driving behavior decision method and device based on game theory and reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN115469663B (en) | 2024-06-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110136481B (en) | Parking strategy based on deep reinforcement learning | |
CN115469663B (en) | End-to-end navigation obstacle avoidance method based on deep reinforcement learning and oriented to automatic driving | |
Zhang et al. | Human-like autonomous vehicle speed control by deep reinforcement learning with double Q-learning | |
Choi et al. | Deep reinforcement learning of navigation in a complex and crowded environment with a limited field of view | |
CN112433525A (en) | Mobile robot navigation method based on simulation learning and deep reinforcement learning | |
CN114846425A (en) | Prediction and planning of mobile robots | |
Mouhagir et al. | A markov decision process-based approach for trajectory planning with clothoid tentacles | |
Grando et al. | Deep reinforcement learning for mapless navigation of unmanned aerial vehicles | |
CN114792148A (en) | Method and device for predicting motion trail | |
Schütt et al. | An application of scenario exploration to find new scenarios for the development and testing of automated driving systems in urban scenarios | |
Khaitan et al. | State dropout-based curriculum reinforcement learning for self-driving at unsignalized intersections | |
Chen et al. | Deep reinforcement learning of map-based obstacle avoidance for mobile robot navigation | |
Elallid et al. | Deep Reinforcement Learning for Autonomous Vehicle Intersection Navigation | |
Fang et al. | Ontology-based reasoning approach for long-term behavior prediction of road users | |
Gutiérrez-Moreno et al. | Hybrid decision making for autonomous driving in complex urban scenarios | |
CN117553798A (en) | Safe navigation method, equipment and medium for mobile robot in complex crowd scene | |
CN117371895A (en) | Multi-ground unmanned vehicle path planning method, system and medium in unknown environment | |
Li et al. | DDPG-Based Path Planning Approach for Autonomous Driving | |
CN116127853A (en) | Unmanned driving overtaking decision method based on DDPG (distributed data base) with time sequence information fused | |
Cutrone et al. | A framework for identifying and simulating worst-case animal-vehicle interactions | |
Guo et al. | Modeling, learning and prediction of longitudinal behaviors of human-driven vehicles by incorporating internal human DecisionMaking process using inverse model predictive control | |
Zhao et al. | Fluids: A first-order local urban intersection driving simulator | |
Imam et al. | Autonomous driving system using proximal policy optimization in deep reinforcement learning | |
CN114167856A (en) | Service robot local path planning method based on artificial emotion | |
Ha et al. | Vehicle control with prediction model based Monte-Carlo tree search |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |