CN110750096B - Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment - Google Patents

Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment

Info

Publication number
CN110750096B
CN110750096B (application CN201910953377.1A)
Authority
CN
China
Prior art keywords
mobile robot
obstacle
reward
network
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910953377.1A
Other languages
Chinese (zh)
Other versions
CN110750096A (en)
Inventor
王宏健
何姗姗
张宏瀚
袁建亚
于丹
贺巨义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201910953377.1A priority Critical patent/CN110750096B/en
Publication of CN110750096A publication Critical patent/CN110750096A/en
Application granted granted Critical
Publication of CN110750096B publication Critical patent/CN110750096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention belongs to the technical field of mobile robot navigation, and specifically relates to a mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment. In the method, a laser range finder collects raw data, the processed data serve as the state S of the A3C algorithm, an A3C-LSTM neural network is constructed with the state S as its input, and the network outputs the parameters of a normal distribution from which the action executed by the mobile robot at each step is sampled. The method requires no model of the environment, and through the deep reinforcement learning algorithm the mobile robot can successfully avoid obstacles in a complex static obstacle environment. The invention designs a continuous action space model with a heading-turning constraint and adopts multi-threaded asynchronous learning, which greatly shortens training time, reduces sample correlation, makes full use of the exploration space, enriches the diversity of exploration strategies, and, compared with common deep reinforcement learning methods, improves the convergence, stability and obstacle avoidance success rate of the algorithm.

Description

Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
Technical Field
The invention belongs to the technical field of mobile robot navigation, and particularly relates to a mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment.
Background
The application of mobile robots has spread into many fields of everyday life. On industrial production lines, robots can take over heavy, highly repetitive work from humans, improving production efficiency and freeing human labour.
Obstacle avoidance is a key problem in robotics research when robots execute tasks. It is a typical dynamic, real-time decision problem in an uncertain environment, and it requires the robot to adapt quickly and autonomously to changes in its surroundings. Behaviour-based reactive control methods and rule-based control strategies are often used for obstacle avoidance. Both are simple and highly deterministic and give the robot good responsiveness to the environment, but when the task and the environment become complex the robot can lose control, leading to erroneous decisions and deadlocks. Moreover, designers of these methods must have considerable experience and domain knowledge to build an integrated control strategy, and it is difficult to describe exhaustively every situation that may occur or to design rules for domains that are completely or partially unknown. Making the robot learn by itself is therefore an effective way to improve its adaptability and robustness, and deep reinforcement learning in particular has received extensive attention.
Reinforcement learning and deep reinforcement learning perform well on problems such as mobile robot obstacle avoidance and navigation. In general, traditional obstacle avoidance methods are quite limited and are particularly unsuited to complex, dynamic, unknown environments; intelligent obstacle avoidance algorithms, especially the recently popular algorithms that combine deep learning with reinforcement learning, have great advantages in continuous, high-dimensional, complex, dynamic, unknown environments.
In obstacle avoidance research in complex environments, obstacles are generally divided into two categories: convex obstacles and concave obstacles. 回-shaped (square-ring, trap-like) obstacles are a kind of concave obstacle with a particularly complicated structure. For complex static obstacles, and especially for 回-shaped obstacles, it is difficult for a mobile robot to escape from such an environment without a suitable collision avoidance algorithm. The mobile robot static obstacle avoidance method based on deep reinforcement learning proposed here is suited to continuous-action-space obstacle avoidance among unknown complex static obstacles; it removes sample correlation through an asynchronous learning mechanism, saves communication and time cost, improves the diversity of exploration strategies, improves the stability of the algorithm, and supports online learning. Modelling a complex environment consumes a great deal of time and cost, whereas this method needs no environment model and can better solve the obstacle avoidance problem of a mobile robot in a complex static obstacle environment, in particular a 回-shaped obstacle environment. The present work concentrates on learning obstacle avoidance behaviour, solves the obstacle avoidance problem of the mobile robot in complex static obstacle environments, improves obstacle avoidance efficiency, and thereby serves the autonomous safe navigation of mobile robots.
Disclosure of Invention
The aim of the invention is to provide a method that is safe, effective and reliable, ensures that the mobile robot can safely and effectively avoid 回-shaped obstacles during operation, effectively solves the trap problem caused by 回-shaped obstacles, escapes efficiently from 回-shaped or any other concave obstacles, and completes its operational task smoothly.
The purpose of the invention is realized as follows:
a mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment comprises the following steps:
Step one: acquire raw data with the laser range finder carried by the mobile robot and obtain obstacle information;
Step two: process the raw data from step one and combine them with the relevant position and angle information in the global coordinate system; the processed data form the state S of the A3C algorithm;
Step three: design the action space and reward function of the A3C algorithm for this environment; through the reward function, the mobile robot receives a corresponding reward or penalty after executing each action;
Step four: design the complex static obstacle environment, the initial position of the mobile robot and the virtual target;
Step five: build the A3C-LSTM neural network and use the state S as the input layer of the network;
Step six: train the built A3C-LSTM neural network with the A3C algorithm; when training is finished, apply the saved network model to obstacle avoidance of the mobile robot.
In step one, raw data are acquired by the laser range finder carried by the mobile robot. The obstacle information is obtained by simulating the real data collected by a laser range finder: the raw data collected by the laser range finder are a series of discrete data points, given in polar-coordinate form in the local coordinate system, each containing distance and azimuth information, from which the distance and bearing of obstacles in the local coordinate system are obtained.
In step two, the raw data from step one are processed and combined with the relevant position and angle information in the global coordinate system, and the processed data form the state S of the A3C algorithm. The local coordinate system X_mO_mY_m takes the mobile robot as its origin, the motion direction of the mobile robot as the positive X axis, and the positive Y axis perpendicular to the X axis so as to satisfy the right-hand rule; the global coordinate system X_eO_eY_e is the geodetic coordinate system. The raw data detected by the laser range finder in step one are bearing information in the local coordinate system; they must be converted by a coordinate transformation into bearing information in the global coordinate system and then processed together with the position and angle information of the mobile robot and the virtual target in the global coordinate system to form the state S of the A3C algorithm.
In step three, an action space and a reward function of the A3C algorithm are designed for this environment, and the mobile robot receives a corresponding reward or penalty through the reward function after executing each action. The reward function consists of three parts: a penalty on the distance to the target, a penalty on the distance to the obstacle, and a penalty on the number of steps used. These represent, respectively, the penalty on the distance between the nearest obstacle and the current position of the mobile robot, the penalty on the distance between the target and the mobile robot, and the penalty on the number of steps used in the whole episode. Each time the mobile robot executes an action in a step, the reward function rewards or penalises the quality of the action selected in the current state.
In step four, in designing the complex static obstacle environment, the initial position of the mobile robot and the virtual target, static obstacle environments of different types and a 回-shaped obstacle environment are set.
In step five, the A3C-LSTM neural network is built and the state S is used as the input layer of the network. The whole network framework is divided into a global network and local networks; because of parallel asynchronous training there are 4 local networks. Both the global network and the local networks contain Actor and Critic sub-networks. The output layer of the Actor sub-network in a local network outputs the parameters of a normal distribution used to select the action executed by the mobile robot, namely the heading-angle change, while the Critic sub-network outputs an evaluation of the selected action in that state. Each sub-network consists of an input layer, an LSTM hidden layer, two fully connected hidden layers and an output layer; the two sub-networks differ in the number of neurons in the hidden layers, the activation functions and the outputs.
In step six, the built A3C-LSTM neural network is trained with the A3C algorithm, and after training the saved network model is applied to obstacle avoidance of the mobile robot. The whole network runs with 4 threads in parallel. During training the network parameters are continuously updated in the direction that maximises the reward; training ends when a fixed number of episodes has been reached and the total reward obtained per episode has become stable, yielding a trained A3C-LSTM network that is applied directly to obstacle avoidance of the mobile robot.
The invention has the beneficial effects that:
1. compared with the traditional obstacle avoidance method and the depth reinforcement learning method, on the premise of ensuring the safety distance, the obstacle avoidance efficiency and the success rate of the mobile robot are improved, the obstacle avoidance track is smoother, and the time cost consumption in the obstacle avoidance process is reduced.
2. Considering the structural characteristics of complex 回-shaped obstacles and the heading constraints on the motion of the mobile robot, the method specifically addresses the obstacle avoidance problem and the local-trap escape problem posed by obstacles with this special structure. By applying reinforcement learning and the long-term memory of the LSTM network, the method is more intelligent and more portable, and the experimental results show that the trap-escape behaviour is efficient and safe even when obstacle information is only locally known.
3. Compared with common reinforcement learning methods such as the Q-learning algorithm, which are only suitable for discrete action spaces, the method can be applied to high-dimensional continuous action spaces as well as discrete action spaces, so it can solve a wider range of problems. It supports online learning, training and learning at the same time, and adopts multi-threaded parallel learning, which reduces communication cost, improves training efficiency, enriches exploration strategies and enlarges the explored area. The method is more intelligent, adapts well to static complex obstacle environments, in particular 回-shaped obstacle environments, executes obstacle avoidance efficiently, and is well suited to the technical field of mobile robot navigation.
Drawings
FIG. 1 is a flow chart of a static environment obstacle avoidance algorithm;
FIG. 2 is a view of the 回-shaped obstacle environment;
FIG. 3 is a laser sensor detection model;
FIG. 4 is a mobile robot coordinate system;
FIG. 5 is a general static obstacle environment;
FIG. 6 shows a simulation result of obstacle avoidance for a general static obstacle environment;
FIG. 7 is a curve of a mobile robot heading angle adjustment process;
FIG. 8 is a simulation result of obstacle avoidance in the 回-shaped obstacle environment;
FIG. 9 is a schematic diagram of the A3C-LSTM neural network structure.
Detailed Description
The invention is further described below with reference to the accompanying drawings and examples.
The invention discloses a mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment, which enables a mobile robot to avoid obstacles effectively when working in a complex static obstacle environment containing 回-shaped obstacles. The raw data collected by a laser range finder are processed, the processed data serve as the state S of the A3C algorithm, a corresponding A3C-LSTM neural network is constructed with the state S as its input, the network outputs the corresponding parameters via the A3C algorithm, and the action executed by the mobile robot at each step is sampled from a normal distribution defined by those parameters; the flow chart of the overall obstacle avoidance algorithm is shown in FIG. 1. Because the environment is complex and varied, and the shape of 回-shaped obstacles in particular is complex and varied, the mobile robot can easily become trapped inside such an obstacle and be unable to escape, a situation in which traditional obstacle avoidance methods struggle to obtain good results. The proposed algorithm needs no model of the environment: it is a deep reinforcement learning algorithm based on a trial-and-error mechanism that maximises the reward, so only the reward or penalty for each action of the mobile robot needs to be given, and ultimately the mobile robot can successfully avoid obstacles in the complex and varied environment, in particular escape from 回-shaped obstacles and reach the virtual target. Considering kinematic constraints of the mobile robot, such as the maximum turning rate, a continuous action space model with a heading-turning constraint is designed and multi-threaded asynchronous learning is adopted. Simulation results show that for static complex obstacle environments, in particular 回-shaped obstacle environments, the proposed algorithm adapts well to the environment, executes obstacle avoidance efficiently and can successfully escape from obstacle traps, making it well suited to autonomous safe navigation of mobile robots.
The invention comprises the following steps:
Step one: obtain raw data with the laser range finder carried by the mobile robot, and obtain obstacle information:
the invention simulates real laser range finder to collect data, the laser range finder equipped by a wheeled mobile robot 'traveler No. 4' is a master model, the detection range is 80 meters, the open angle is 180 degrees, the resolution is 1 degree, the original data collected by the laser range finder is a series of discrete data points, the data are given in a polar coordinate mode under a local coordinate system and all contain distance information and azimuth information, and the distance and azimuth information of an obstacle under the local coordinate system can be obtained through the information. The data detected back by 180 beams from the laser rangefinder is shown in fig. 3.
Step two: process the raw data from step one and combine them with the relevant position and angle information in the global coordinate system to form the state S of the A3C algorithm:
local coordinate system X m O m Y m The method is characterized in that the mobile robot is taken as the origin of coordinates, the motion direction of the mobile robot is taken as the positive direction of an X axis, and the positive direction of a Y axis is vertical to the X axis and meets the right-hand rule; global coordinate system X e O e Y e The geodetic coordinate system. In the first step, the original data detected by the laser range finder is orientation information in a local coordinate system, and the orientation information needs to be converted into orientation information in a global coordinate system through coordinate transformation, and then the orientation information is correspondingly processed with the angle and position information related to the mobile robot and the virtual target in the global coordinate system to be used as the state S of the A3C algorithm. Suppose that the mobile robot has a position coordinate of (x) in the global coordinate system at time t t ,y t ) In a sampling period T s And if the mobile robot does uniform linear motion, the kinematic model of the mobile robot is shown as the following formula:
x_(t+1) = x_t + v_t·T_s·cos ψ
y_(t+1) = y_t + v_t·T_s·sin ψ
where (x_(t+1), y_(t+1)) are the position coordinates of the mobile robot in the global coordinate system at time t+1, v_t is the speed of the mobile robot during the sampling period T_s, and ψ is the angle between the robot's heading and the positive direction of the X_e axis.
The position of an obstacle given in polar-coordinate form in the local coordinate system is converted into rectangular-coordinate form as follows:
x_o = l·sin α
y_o = l·cos α
where (x_o, y_o) is the position of the obstacle in rectangular coordinates in the local coordinate system, (l, α) is the raw polar-coordinate reading of the obstacle detected by the laser range finder, l is the straight-line distance from the obstacle to the origin O_m, and α is the angle between the line connecting the obstacle to O_m and the positive direction of the Y_m axis.
Finally, all local information is converted into information in the global coordinate system; the coordinates of an obstacle in the local coordinate system are converted into rectangular coordinates (x_e, y_e) in the global coordinate system as follows:
x_e = x_t + x_o·cos ψ − y_o·sin ψ
y_e = y_t + x_o·sin ψ + y_o·cos ψ
where ψ is the heading angle of the mobile robot, i.e. the angle between its heading direction and the positive direction of the global X_e axis, and (x_o, y_o) is the obstacle position in the local coordinate system obtained above. The coordinate systems of the mobile robot used throughout this transformation are shown in FIG. 4.
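As an illustration of the transformations above, the following Python sketch implements the kinematic update and the local-to-global conversion in the form reconstructed here; the function names and the exact sign conventions are assumptions of this sketch, not text taken from the patent.

```python
import math

def step_position(x_t, y_t, v_t, psi, T_s=1.0):
    """Kinematic update for uniform straight-line motion over one sampling
    period T_s; psi is the heading angle measured from the positive X_e axis
    (radians)."""
    x_next = x_t + v_t * T_s * math.cos(psi)
    y_next = y_t + v_t * T_s * math.sin(psi)
    return x_next, y_next

def polar_to_local(l, alpha):
    """Convert a laser return (l, alpha) to rectangular local coordinates.
    alpha is measured from the positive Y_m axis, so the component along the
    motion direction X_m is l*sin(alpha)."""
    x_o = l * math.sin(alpha)
    y_o = l * math.cos(alpha)
    return x_o, y_o

def local_to_global(x_o, y_o, x_t, y_t, psi):
    """Rotate by the heading angle psi and translate by the robot position to
    obtain the obstacle position in the global frame X_eO_eY_e."""
    x_e = x_t + x_o * math.cos(psi) - y_o * math.sin(psi)
    y_e = y_t + x_o * math.sin(psi) + y_o * math.cos(psi)
    return x_e, y_e
```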
Step three: design the action space and reward function of the A3C algorithm for this environment; through the reward function, the mobile robot receives a corresponding reward or penalty after executing each action:
the designed motion space is a continuous motion space with the change range of each step of the heading angle of the mobile robot within the range of [ -10 degrees, +10 degrees ], wherein one step represents 1s, the speed of the mobile robot in the whole obstacle avoidance process is constant and is 1m/s, the initial heading angle is set to be 0 degree by taking the north direction as the reference, and the anticlockwise direction is positive. The reward function comprises three parts of penalty for the distance from the target, penalty for the distance from the obstacle and penalty for the used step length, and represents the penalty for the distance from the nearest obstacle to the current mobile robot; punishment on the distance between the target and the mobile robot; and punishment on the number of steps used in the whole round, wherein when the mobile robot executes one action in each step length, the reward function carries out corresponding punishment on the quality of the action selected in the current state.
The reward function R is shown as follows:
R(s,a)=p·tar_dis+q·obs_dis+k·step_count
where p·tar_dis is R_1, the penalty of the reward function on the distance between the mobile robot and the target; tar_dis is the distance between the mobile robot and the virtual target at the current moment, and p is the target reward coefficient. Since the final aim is to reach the target, p is set to a negative value, so the larger tar_dis is, the larger the penalty given by the reward function; conversely, the smaller the penalty, which in effect is a reward.
q·obs_dis is R_2, the penalty of the reward function on the distance between the obstacle and the current position of the mobile robot, where the obstacle concerned is the nearest one; obs_dis is the distance between the current position of the mobile robot and the nearest obstacle, and q is the obstacle reward coefficient. Since the primary goal is to avoid obstacles successfully, a corresponding safety distance is set and q is set to a positive value: the larger the distance between an obstacle outside the safety distance and the mobile robot, the better, so the larger obs_dis is, the larger the reward given by the reward function; conversely, the smaller the reward, which in effect is a penalty.
k·step_count is R_3, where k is the time penalty coefficient and step_count is the current accumulated number of steps. One of the important indicators for evaluating an obstacle avoidance algorithm is its cost function, and step_count directly represents the cost of obstacle avoidance for the whole episode, so it must be rewarded or penalised accordingly to keep the obstacle avoidance path as short and efficient as possible. R_3 represents the penalty on the time spent, i.e. the number of steps used for obstacle avoidance; k is taken as a negative number, so the larger step_count is, the larger the penalty given by the reward function; conversely, the smaller the penalty, which in effect is a reward.
Using the reward-maximising property of reinforcement learning, each action of the mobile robot is further rewarded or penalised according to whether the robot successfully avoids obstacles, whether it reaches the virtual target and whether the cost is low, so that the mobile robot finally achieves the desired goals of successfully avoiding obstacles, reaching the virtual target, producing a smooth obstacle avoidance path and using a low cost, i.e. a short obstacle avoidance path. For obstacle avoidance, two cases are distinguished according to whether or not an obstacle lies within the detection range of the laser range finder. If there is no obstacle within the detection range, a corresponding reward is given for the robot's heading angle; if the mobile robot moves towards the target, the reward is given as follows:
(formula given as an image in the original; not reproduced here)
where tar_det is, in the global coordinate system X_eO_eY_e, the difference between the robot's heading angle and the angle formed by the line from the robot to the target and the positive direction of the X_e axis. If an obstacle exists within the detection range, the collision state of the mobile robot is penalised, as follows:
(formula given as an image in the original; not reproduced here)
where obs_dis is the distance between the current position of the mobile robot and the nearest obstacle, and dis_crash is the collision distance, taken as the safety distance: if obs_dis <= dis_crash, a collision is deemed to have occurred, the corresponding penalty is applied, the current episode ends and the next episode begins. The reward or penalty for whether the mobile robot reaches the target is given as follows:
(formula given as an image in the original; not reproduced here)
where tar_dis is the distance between the mobile robot and the virtual target at the current moment, and dis_reach is the set arrival distance: if tar_dis <= dis_reach, the mobile robot is deemed to have reached the target and is rewarded. In order to make the mobile robot reach the target more reliably and consistently, a continuous-arrival reward R_1' is set, as follows:
(formula given as an image in the original; not reproduced here)
and if the number of the rounds which are not reached is not reached, the ep _ count is cleared.
Step four: design the complex static obstacle environment containing a 回-shaped obstacle environment, the initial position of the mobile robot and the virtual target:
the size of the simulation environment is 800 multiplied by 600, the initial position of the mobile robot is set to be (50,100), the initial heading angle is defined to be 0 degrees by taking the positive north direction as a reference, namely the positive direction of the y axis of the global coordinate system, the dark gray square is a virtual target, and the side length is 30. The device comprises two static obstacle environments, namely a general static obstacle environment and a square obstacle environment, which are different in type. The general static obstacle environment is shown in fig. 5, and includes various types of static obstacles, which are all represented by black areas, a dark gray rectangular area is a virtual target, and a light gray curve is a motion trajectory of the mobile robot. The speed v of the mobile robot is 1m/s, and in consideration of reality, in order to make the simulation more real, the mobile robot is provided with a rotation angle limit, so that the change of the motion direction angle of the mobile robot is more reasonable. Fig. 6 shows the obstacle avoidance simulation result of the mobile robot in the general static obstacle environment. The corresponding curve of the mobile robot heading angle adjustment process of fig. 6 is shown in fig. 7. Fig. 8 shows a simulation result of obstacle avoidance for a circle-shaped obstacle environment.
Step five: establishing an A3C-LSTM neural network, taking the state S as an input layer of the network:
the whole network framework is divided into a global network part and a local network part, as shown in fig. 9. Due to parallel asynchronous training, 4 local networks are set. Both the global network and the local network contain two sub-network structures of Actor and Critic, but the global network only plays a role in storing network related parameters and in time pulling the related parameters to the local network. The output layer of the Actor sub-network in the local network outputs normally distributed parameters for selecting the action executed by the mobile robot, namely heading angle variation, and the criticic sub-network outputs evaluation values of the selected action in the state.
In the A3C-LSTM neural network, the input of the Actor network is the 8-dimensional state information obtained after data processing, and there are three hidden layers. The input layer is connected to an LSTM layer containing 64 LSTM memory units, with batch_size 1 and the initial state all zeros. The output of the LSTM layer is fed to the second hidden layer, a nonlinear fully connected layer with the RELU6 activation function; its output is fed to the last hidden layer, which is also a nonlinear fully connected layer with RELU6. Finally comes the output layer, with dimension 2 and activation functions tanh and softplus respectively, connected in a nonlinear fully connected manner; the final outputs are σ and μ. Because the action space is continuous, these output parameters determine a normal distribution, and the selected action is obtained by sampling directly from that distribution according to its probability. The Critic network has essentially the same structure as the Actor network, except for the number of neurons in the hidden layers and the activation functions; the output layer of the Critic network is a single layer of dimension 1, and its final output is V(s_t), the evaluation of the selected action. An advantage function is used because, compared with a baseline, it states more clearly how good or bad the selected action is in the current state and gives that quality a quantitative measure, which makes the evaluation more accurate and helps the Actor network grasp more precisely the direction in which it should keep updating to maximise the reward. In addition, the network update is driven by the Critic network; the LSTM layer in the Actor network is the same as that in the Critic network and does not update the memory module itself but is applied directly.
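A minimal sketch of the Actor branch described above, written here in PyTorch purely for illustration, is given below. The widths of the fully connected layers, the scaling of μ onto the ±10° action range and the sampling helper are assumptions of this sketch; the Critic branch (same structure with a single-value output V(s_t)) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorLSTM(nn.Module):
    """Actor sub-network sketch: 8-dim state -> LSTM(64) -> two ReLU6 fully
    connected layers -> (mu, sigma) of a normal distribution over the
    heading-angle change."""

    def __init__(self, state_dim=8, lstm_units=64, fc_units=128, max_turn=10.0):
        super().__init__()
        self.max_turn = max_turn                      # degrees per step
        self.lstm = nn.LSTM(state_dim, lstm_units, batch_first=True)
        self.fc1 = nn.Linear(lstm_units, fc_units)
        self.fc2 = nn.Linear(fc_units, fc_units)
        self.mu_head = nn.Linear(fc_units, 1)
        self.sigma_head = nn.Linear(fc_units, 1)

    def forward(self, state_seq, hidden=None):
        # state_seq: (batch=1, seq_len, state_dim); hidden=None starts from zeros.
        out, hidden = self.lstm(state_seq, hidden)
        x = F.relu6(self.fc1(out[:, -1, :]))
        x = F.relu6(self.fc2(x))
        mu = torch.tanh(self.mu_head(x)) * self.max_turn    # centre of the action
        sigma = F.softplus(self.sigma_head(x)) + 1e-4        # positive std. dev.
        return mu, sigma, hidden

def select_action(net, state_seq, hidden=None):
    """Sample a heading-angle change from the output normal distribution and
    clip it to the [-10, +10] degree action space."""
    mu, sigma, hidden = net(state_seq, hidden)
    dist = torch.distributions.Normal(mu, sigma)
    action = dist.sample().clamp(-net.max_turn, net.max_turn)
    return action.item(), hidden
```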
Step six: training and learning the established A3C-LSTM neural network by using an A3C algorithm, finishing the training, and applying the stored network model to the mobile robot to avoid the obstacle:
the whole network works by 4 threads in parallel, namely 4 local networks exist, in the training process, each thread can adopt different exploration strategies, correspondingly obtain different states, explore different areas of the environment, and after 4 threads finish one-step action selection, the global network stores corresponding parameters of local networks with best performance and pulls the parameters to each local network to update the local networks. Therefore, more strategies can be explored while the correlation is removed, the exploration area is expanded, finally, the strategies are continuously updated in the direction of maximizing the reward, the training is finished until the number of fixed rounds is reached and the total reward obtained by each round of the mobile robot in the final training process tends to be stable, a trained A3C-LSTM network is obtained, and the network can be directly applied to obstacle avoidance of the mobile robot.
In order to solve the local-trap problem of complex static obstacles, in particular the difficulty that a 回-shaped obstacle in an actual unknown working environment poses for a mobile robot, and to improve the efficiency with which the mobile robot avoids complex 回-shaped obstacles, the invention provides a mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment.
To verify the effect of the proposed obstacle avoidance algorithm in escaping from a complex static obstacle environment containing a 回-shaped obstacle, a complete experiment in which the mobile robot avoids the obstacle and successfully reaches the virtual target was designed. The initial position of the mobile robot is set to (360, 320), the size of the simulation environment is 800 × 600, the heading angle is referenced to the positive Y_e axis of the global coordinate system, with due north defined as 0°, and the initial heading angle is 0°. The 回-shaped obstacle area is drawn in black, and the virtual target to be reached after the mobile robot escapes from the 回-shaped obstacle is set to (680, 320). In the end the mobile robot can safely escape from the 回-shaped obstacle and reach the designated virtual target. When the distance between the mobile robot and the virtual target is 15 m, the target-reached flag becomes true, marking that the robot has reached the target and successfully completed the obstacle avoidance task; the training episode then ends and a new episode begins, until the preset number of episodes is reached.
Using the obstacle avoidance algorithm designed for 回-shaped obstacles, a concave obstacle environment consisting of a complex 回-shaped obstacle in a two-dimensional environment is set up, as shown in FIG. 8. The light grey curve in the figure is the motion trajectory of the mobile robot. It can be seen that the mobile robot selects a reasonable obstacle avoidance path and successfully reaches the virtual target without incurring obstacle avoidance risk, while keeping the required safe obstacle avoidance distance; the simulation results demonstrate the reasonableness and feasibility of the proposed method.
In summary, for the obstacle avoidance problem of complex static obstacles, in particular 回-shaped obstacle environments, the invention provides a mobile robot obstacle avoidance planning method based on deep reinforcement learning in a static environment.

Claims (6)

1. A mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment is characterized by comprising the following steps:
step one: acquiring raw data with a laser range finder carried by the mobile robot to obtain obstacle information;
step two: after data processing is carried out on the original data in the step one, corresponding processing is carried out by combining the original data with relevant position and angle information in a global coordinate system, and the processed data is the state S of the A3C algorithm;
step three: designing an action space and a reward function of an A3C algorithm under the environment, and performing corresponding reward and punishment after the mobile robot performs each step of action through the reward function;
the designed action space is a continuous action space with the change range of each step of the heading angle of the mobile robot within the range of [ -10 degrees, +10 degrees ], wherein one step represents 1s, the speed of the mobile robot in the whole obstacle avoidance process is constant and is 1m/s, the initial heading angle is set to be 0 degree by taking the north direction as the reference, and the anticlockwise direction is positive;
the reward function comprises three parts: a penalty on the distance to the target, a penalty on the distance to the obstacle, and a penalty on the number of steps used, representing respectively the penalty on the distance between the nearest obstacle and the current position of the mobile robot, the penalty on the distance between the target and the mobile robot, and the penalty on the number of steps used in the whole episode; each time the mobile robot finishes executing an action in a step, the reward function rewards or penalises the quality of the action selected in the current state;
the reward function R is shown as follows:
R(s,a)=p·tar_dis+q·obs_dis+k·step_count
wherein p·tar_dis is R_1, representing the penalty of the reward function on the distance between the mobile robot and the target; tar_dis represents the distance between the mobile robot and the virtual target at the current moment; p is the target reward coefficient, and p is set to a negative value;
q·obs_dis is R_2, representing the penalty of the reward function on the distance between the obstacle and the current position of the mobile robot, wherein the obstacle is the nearest obstacle; obs_dis represents the distance between the current position of the mobile robot and the nearest obstacle; q is the obstacle reward coefficient; since the main purpose is to avoid obstacles successfully, a corresponding safety distance is set and q is set to a positive value, so that the larger the distance between an obstacle outside the safety distance and the mobile robot the better, and the larger obs_dis is, the larger the reward given by the reward function;
k·step_count is R_3; k is the time penalty coefficient and step_count is the current accumulated number of steps; R_3 represents the penalty on the number of steps spent avoiding obstacles, and k is taken as a negative number so that the larger step_count is, the larger the penalty given by the reward function;
by utilising the reward-maximising property of reinforcement learning, each action of the mobile robot is further rewarded or penalised according to whether the mobile robot successfully avoids obstacles, whether it reaches the virtual target and whether the cost is low, so that in the complex static obstacle environment containing a 回-shaped obstacle the mobile robot finally achieves the desired goals of successfully avoiding obstacles, reaching the virtual target, obtaining a smooth obstacle avoidance path and using a low cost, i.e. a short obstacle avoidance path; for obstacle avoidance, two cases are distinguished according to whether or not an obstacle exists within the detection range of the laser range finder:
if no obstacle exists in the detection range, corresponding reward is given to the heading angle of the robot, and if the mobile robot moves towards the target direction, the reward is given as shown in the following formula:
(formula given as an image in the original; not reproduced here)
wherein tar_det is, in the global coordinate system X_eO_eY_e, the difference between the robot's heading angle and the angle formed by the line connecting the robot and the target and the positive direction of the X_e axis;
if an obstacle exists in the detection range, punishing the collision state of the mobile robot, as shown in the following formula:
(formula given as an image in the original; not reproduced here)
wherein obs_dis represents the distance between the current position of the mobile robot and the nearest obstacle; dis_crash is the set collision distance, taken as the safety distance: if obs_dis <= dis_crash, a collision is deemed by default to have occurred, the corresponding penalty is applied, the current episode ends and the next episode begins;
the reward and punishment on whether the mobile robot reaches the target is shown as follows:
(formula given as an image in the original; not reproduced here)
wherein tar_dis represents the distance between the mobile robot and the virtual target at the current moment; dis_reach is the set arrival distance, i.e. when tar_dis <= dis_reach the mobile robot is deemed by default to have reached the target and is rewarded; in order to make the mobile robot reach the target more reliably and consistently, a continuous-arrival reward R_1' is set, as shown in the following formula:
(formula given as an image in the original; not reproduced here)
wherein k is the continuous-arrival reward coefficient; ep_count is the cumulative number of consecutive episodes in which the mobile robot reaches the target, and if an episode occurs in which the target is not reached, ep_count is reset;
step four: designing a complex static obstacle environment, an initial position of a mobile robot and a virtual target;
step five: establishing an A3C-LSTM neural network, and taking the state S as an input layer of the network;
step six: and (3) training and learning the established A3C-LSTM neural network by using an A3C algorithm, finishing training, and applying the stored network model to the obstacle avoidance of the mobile robot.
2. The mobile robot collision avoidance planning method based on deep reinforcement learning in the static environment according to claim 1, characterized in that: in step one, raw data are acquired by the laser range finder carried by the mobile robot; in obtaining the obstacle information, the real data collected by a laser range finder are simulated, the raw data collected by the laser range finder being a series of discrete data points given in polar-coordinate form in the local coordinate system, each containing distance and azimuth information, from which the distance and bearing information of obstacles in the local coordinate system are obtained.
3. The mobile robot collision avoidance planning method based on deep reinforcement learning in the static environment according to claim 1, characterized in that: in step two, the raw data from step one are processed and combined with the relevant position and angle information in the global coordinate system, the processed data forming the state S of the A3C algorithm, wherein the local coordinate system X_mO_mY_m takes the mobile robot as its origin, the motion direction of the mobile robot as the positive X axis, and the positive Y axis perpendicular to the X axis so as to satisfy the right-hand rule; the global coordinate system X_eO_eY_e is the geodetic coordinate system; the raw data detected by the laser range finder in step one are bearing information in the local coordinate system, which must be converted by a coordinate transformation into bearing information in the global coordinate system and then processed together with the position and angle information of the mobile robot and the virtual target in the global coordinate system to form the state S of the A3C algorithm.
4. The mobile robot collision avoidance planning method based on deep reinforcement learning in the static environment according to claim 1, characterized in that: in step four, in designing the complex static obstacle environment, the initial position of the mobile robot and the virtual target, the complex static obstacle environment is a 回-shaped obstacle environment.
5. The mobile robot collision avoidance planning method based on deep reinforcement learning in the static environment according to claim 1, characterized in that: in step five, the A3C-LSTM neural network is built and the state S is used as the input layer of the network, wherein the whole network framework is divided into a global network and local networks, with 4 local networks owing to parallel asynchronous training; both the global network and the local networks contain Actor and Critic sub-networks, wherein the output layer of the Actor sub-network in a local network outputs the parameters of a normal distribution used to select the action executed by the mobile robot, namely the heading-angle change, and the Critic sub-network outputs an evaluation of the selected action in that state; each sub-network consists of an input layer, an LSTM hidden layer, two fully connected hidden layers and an output layer, the two sub-networks differing in the number of neurons in the hidden layers, the activation functions and the outputs.
6. The mobile robot collision avoidance planning method based on deep reinforcement learning in the static environment according to claim 1, characterized in that: in step six, the built A3C-LSTM neural network is trained with the A3C algorithm and, after training, the saved network model is applied to obstacle avoidance of the mobile robot, wherein the whole network runs with 4 threads in parallel; during training the network parameters are continuously updated in the direction that maximises the reward, and training ends when a fixed number of episodes has been reached and the total reward obtained per episode has become stable, yielding a trained A3C-LSTM network that is applied directly to obstacle avoidance of the mobile robot.
CN201910953377.1A 2019-10-09 2019-10-09 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment Active CN110750096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910953377.1A CN110750096B (en) 2019-10-09 2019-10-09 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910953377.1A CN110750096B (en) 2019-10-09 2019-10-09 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment

Publications (2)

Publication Number Publication Date
CN110750096A CN110750096A (en) 2020-02-04
CN110750096B true CN110750096B (en) 2022-08-02

Family

ID=69277811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910953377.1A Active CN110750096B (en) 2019-10-09 2019-10-09 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment

Country Status (1)

Country Link
CN (1) CN110750096B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582441B (en) * 2020-04-16 2021-07-30 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
CN112084700A (en) * 2020-08-06 2020-12-15 南京航空航天大学 Hybrid power system energy management method based on A3C algorithm
CN112304314A (en) * 2020-08-27 2021-02-02 中国科学技术大学 Distributed multi-robot navigation method
CN111880549B (en) * 2020-09-14 2024-06-04 大连海事大学 Deep reinforcement learning rewarding function optimization method for unmanned ship path planning
CN112801272A (en) * 2021-01-27 2021-05-14 北京航空航天大学 Fault diagnosis model self-learning method based on asynchronous parallel reinforcement learning
CN113093727B (en) * 2021-03-08 2023-03-28 哈尔滨工业大学(深圳) Robot map-free navigation method based on deep security reinforcement learning
CN112991544A (en) * 2021-04-20 2021-06-18 山东新一代信息产业技术研究院有限公司 Group evacuation behavior simulation method based on panoramic image modeling
CN113156959B (en) * 2021-04-27 2024-06-04 东莞理工学院 Self-supervision learning and navigation method for autonomous mobile robot in complex scene
CN113392584B (en) * 2021-06-08 2022-12-16 华南理工大学 Visual navigation method based on deep reinforcement learning and direction estimation
CN114237235B (en) * 2021-12-02 2024-01-19 之江实验室 Mobile robot obstacle avoidance method based on deep reinforcement learning
CN114526738B (en) * 2022-01-25 2023-06-16 中国科学院深圳先进技术研究院 Mobile robot visual navigation method and device based on deep reinforcement learning
CN114708527A (en) * 2022-03-09 2022-07-05 中国石油大学(华东) Polar coordinate representation-based digital curling strategy value extraction method
CN115857556B (en) * 2023-01-30 2023-07-14 中国人民解放军96901部队 Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190272465A1 (en) * 2018-03-01 2019-09-05 International Business Machines Corporation Reward estimation via state prediction using expert demonstrations

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107092254A (en) * 2017-04-27 2017-08-25 北京航空航天大学 A kind of design method for the Household floor-sweeping machine device people for strengthening study based on depth
CN108255182A (en) * 2018-01-30 2018-07-06 上海交通大学 A kind of service robot pedestrian based on deeply study perceives barrier-avoiding method
CN109540151A (en) * 2018-03-25 2019-03-29 哈尔滨工程大学 A kind of AUV three-dimensional path planning method based on intensified learning
CN109063823A (en) * 2018-07-24 2018-12-21 北京工业大学 A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D
CN110209152A (en) * 2019-06-14 2019-09-06 哈尔滨工程大学 The deeply learning control method that Intelligent Underwater Robot vertical plane path follows

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Deep Robust Reinforcement Learning for Practical Algorithmic Trading; Yang Li et al.; IEEE Access; 2019-08-02; Vol. 7; 108014-108022 *
Research on robot path planning in unknown environments based on deep reinforcement learning; Bu Xiangjin; China Master's Theses Full-text Database, Information Science and Technology; 2019-01-15 (No. 01, 2019); I140-1640 *
Research on mobile robot control methods based on deep reinforcement learning in complex scenes; Zhou Neng; China Master's Theses Full-text Database, Information Science and Technology; 2019-08-15 (No. 08, 2019); I140-340 *
Research on obstacle avoidance strategies for intelligent logistics vehicles; Wang Lei et al.; Journal of Highway and Transportation Research and Development; 2018-12-31 (No. 4); 309-313 *

Also Published As

Publication number Publication date
CN110750096A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN110750096B (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
Zhu et al. Deep reinforcement learning based mobile robot navigation: A review
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
Wen et al. Path planning for active SLAM based on deep reinforcement learning under unknown environments
CN108036790B (en) Robot path planning method and system based on ant-bee algorithm in obstacle environment
Cao et al. Target search control of AUV in underwater environment with deep reinforcement learning
Xiang et al. Continuous control with deep reinforcement learning for mobile robot navigation
CN114625151A (en) Underwater robot obstacle avoidance path planning method based on reinforcement learning
Wei et al. Recurrent MADDPG for object detection and assignment in combat tasks
CN114083539B (en) Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning
CN112356031B (en) On-line planning method based on Kernel sampling strategy under uncertain environment
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
Guo et al. Research on multi-sensor information fusion and intelligent optimization algorithm and related topics of mobile robots
Ma et al. Multi-AUV collaborative operation based on time-varying navigation map and dynamic grid model
Guo et al. Local path planning of mobile robot based on long short-term memory neural network
CN114596360A (en) Double-stage active instant positioning and graph building algorithm based on graph topology
CN114815801A (en) Adaptive environment path planning method based on strategy-value network and MCTS
Chen et al. Deep reinforcement learning-based robot exploration for constructing map of unknown environment
Shi et al. Research on Path Planning Strategy of Rescue Robot Based on Reinforcement Learning
CN115562258A (en) Robot social self-adaptive path planning method and system based on neural network
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Toan et al. Environment exploration for mapless navigation based on deep reinforcement learning
Wang et al. Research on SLAM road sign observation based on particle filter
Wang et al. Path planning model of mobile robots in the context of crowds
Meyer-Delius et al. Grid-based models for dynamic environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant