CN110750096B - Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment - Google Patents

Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment

Info

Publication number
CN110750096B
CN110750096B (application CN201910953377.1A)
Authority
CN
China
Prior art keywords
mobile robot
obstacle
reward
network
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910953377.1A
Other languages
Chinese (zh)
Other versions
CN110750096A (en)
Inventor
王宏健
何姗姗
张宏瀚
袁建亚
于丹
贺巨义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201910953377.1A priority Critical patent/CN110750096B/en
Publication of CN110750096A publication Critical patent/CN110750096A/en
Application granted granted Critical
Publication of CN110750096B publication Critical patent/CN110750096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention belongs to the technical field of mobile robot navigation, and specifically relates to a mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment. In the method, a laser range finder collects raw data, the processed data serve as the state S of the A3C algorithm, an A3C-LSTM neural network is constructed with the state S as its input, and the network outputs the parameters of a normal distribution from which the action executed by the mobile robot at each step is sampled. The method requires no model of the environment, and through the deep reinforcement learning algorithm the mobile robot can successfully avoid obstacles in a complex static obstacle environment. The invention designs a continuous action space model with a heading-turning constraint and adopts multi-threaded asynchronous learning, which greatly shortens training time, reduces sample correlation, makes full use of the exploration space, enriches the diversity of exploration strategies, and, compared with common deep reinforcement learning methods, improves the convergence, stability and obstacle avoidance success rate of the algorithm.

Description

Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
Technical Field
The invention belongs to the technical field of mobile robot navigation, and particularly relates to a mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment.
Background
The application of mobile robots has spread into many fields of everyday life. On industrial production lines, robots can take over heavy, highly repetitive work from humans, improving production efficiency and freeing human labour.
Obstacle avoidance is a key problem in robotics research when robots execute tasks. It is a typical dynamic, real-time decision problem in an uncertain environment, and it requires the robot to adapt quickly and autonomously to changes in its surroundings. Behaviour-based reactive control methods and rule-based control strategies are often used for obstacle avoidance. Both are simple and highly deterministic and give the robot good responsiveness to the environment, but when the task and the environment become complex the robot can lose control, leading to erroneous decisions and deadlocks. Moreover, designers of these methods must have considerable experience and domain knowledge to build an integrated control strategy, and it is difficult to describe exhaustively every situation that may occur or to design rules for domains that are completely or partially unknown. Making the robot learn by itself is therefore an effective way to improve its adaptability and robustness, and deep reinforcement learning in particular has received extensive attention.
Reinforcement learning and deep reinforcement learning perform well on problems such as mobile robot obstacle avoidance and navigation. In general, traditional obstacle avoidance methods are quite limited and are particularly unsuited to complex, dynamic, unknown environments; intelligent obstacle avoidance algorithms, especially the recently popular algorithms that combine deep learning with reinforcement learning, have great advantages in continuous, high-dimensional, complex, dynamic, unknown environments.
In obstacle avoidance research in complex environments, obstacles are generally divided into two categories: convex obstacles and concave obstacles. 回-shaped (square-ring, trap-like) obstacles are a kind of concave obstacle with a particularly complicated structure. For complex static obstacles, and especially for 回-shaped obstacles, it is difficult for a mobile robot to escape from such an environment without a suitable collision avoidance algorithm. The mobile robot static obstacle avoidance method based on deep reinforcement learning proposed here is suited to continuous-action-space obstacle avoidance among unknown complex static obstacles; it removes sample correlation through an asynchronous learning mechanism, saves communication and time cost, improves the diversity of exploration strategies, improves the stability of the algorithm, and supports online learning. Modelling a complex environment consumes a great deal of time and cost, whereas this method needs no environment model and can better solve the obstacle avoidance problem of a mobile robot in a complex static obstacle environment, in particular a 回-shaped obstacle environment. The present work concentrates on learning obstacle avoidance behaviour, solves the obstacle avoidance problem of the mobile robot in complex static obstacle environments, improves obstacle avoidance efficiency, and thereby serves the autonomous safe navigation of mobile robots.
Disclosure of Invention
The aim of the invention is to provide a method that is safe, effective and reliable, ensures that the mobile robot can safely and effectively avoid 回-shaped obstacles during operation, effectively solves the trap problem caused by 回-shaped obstacles, escapes efficiently from 回-shaped or any other concave obstacles, and completes its operational task smoothly.
The purpose of the invention is realized as follows:
a mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment comprises the following steps:
Step one: acquire raw data with the laser range finder carried by the mobile robot and obtain obstacle information;
Step two: process the raw data from step one and combine them with the relevant position and angle information in the global coordinate system; the processed data form the state S of the A3C algorithm;
Step three: design the action space and reward function of the A3C algorithm for this environment; through the reward function, the mobile robot receives a corresponding reward or penalty after executing each action;
Step four: design the complex static obstacle environment, the initial position of the mobile robot and the virtual target;
Step five: build the A3C-LSTM neural network and use the state S as the input layer of the network;
Step six: train the built A3C-LSTM neural network with the A3C algorithm; when training is finished, apply the saved network model to obstacle avoidance of the mobile robot.
In step one, raw data are acquired by the laser range finder carried by the mobile robot. The obstacle information is obtained by simulating the real data collected by a laser range finder: the raw data collected by the laser range finder are a series of discrete data points, given in polar-coordinate form in the local coordinate system, each containing distance and azimuth information, from which the distance and bearing of obstacles in the local coordinate system are obtained.
In step two, the raw data from step one are processed and combined with the relevant position and angle information in the global coordinate system, and the processed data form the state S of the A3C algorithm. The local coordinate system X_mO_mY_m takes the mobile robot as its origin, the motion direction of the mobile robot as the positive X axis, and the positive Y axis perpendicular to the X axis so as to satisfy the right-hand rule; the global coordinate system X_eO_eY_e is the geodetic coordinate system. The raw data detected by the laser range finder in step one are bearing information in the local coordinate system; they must be converted by a coordinate transformation into bearing information in the global coordinate system and then processed together with the position and angle information of the mobile robot and the virtual target in the global coordinate system to form the state S of the A3C algorithm.
In step three, an action space and a reward function of the A3C algorithm are designed for this environment, and the mobile robot receives a corresponding reward or penalty through the reward function after executing each action. The reward function consists of three parts: a penalty on the distance to the target, a penalty on the distance to the obstacle, and a penalty on the number of steps used. These represent, respectively, the penalty on the distance between the nearest obstacle and the current position of the mobile robot, the penalty on the distance between the target and the mobile robot, and the penalty on the number of steps used in the whole episode. Each time the mobile robot executes an action in a step, the reward function rewards or penalises the quality of the action selected in the current state.
In step four, in designing the complex static obstacle environment, the initial position of the mobile robot and the virtual target, static obstacle environments of different types and a 回-shaped obstacle environment are set.
In step five, the A3C-LSTM neural network is built and the state S is used as the input layer of the network. The whole network framework is divided into a global network and local networks; because of parallel asynchronous training there are 4 local networks. Both the global network and the local networks contain Actor and Critic sub-networks. The output layer of the Actor sub-network in a local network outputs the parameters of a normal distribution used to select the action executed by the mobile robot, namely the heading-angle change, while the Critic sub-network outputs an evaluation of the selected action in that state. Each sub-network consists of an input layer, an LSTM hidden layer, two fully connected hidden layers and an output layer; the two sub-networks differ in the number of neurons in the hidden layers, the activation functions and the outputs.
In step six, the built A3C-LSTM neural network is trained with the A3C algorithm, and after training the saved network model is applied to obstacle avoidance of the mobile robot. The whole network runs with 4 threads in parallel. During training the network parameters are continuously updated in the direction that maximises the reward; training ends when a fixed number of episodes has been reached and the total reward obtained per episode has become stable, yielding a trained A3C-LSTM network that is applied directly to obstacle avoidance of the mobile robot.
The invention has the beneficial effects that:
1. compared with the traditional obstacle avoidance method and the depth reinforcement learning method, on the premise of ensuring the safety distance, the obstacle avoidance efficiency and the success rate of the mobile robot are improved, the obstacle avoidance track is smoother, and the time cost consumption in the obstacle avoidance process is reduced.
2. Considering the structural characteristics of complex 回-shaped obstacles and the heading constraints on the motion of the mobile robot, the method specifically addresses the obstacle avoidance problem and the local-trap escape problem posed by obstacles with this special structure. By applying reinforcement learning and the long-term memory of the LSTM network, the method is more intelligent and more portable, and the experimental results show that the trap-escape behaviour is efficient and safe even when obstacle information is only locally known.
3. Compared with common reinforcement learning methods such as the Q-learning algorithm, which are only suitable for discrete action spaces, the method can be applied to high-dimensional continuous action spaces as well as discrete action spaces, so it can solve a wider range of problems. It supports online learning, training and learning at the same time, and adopts multi-threaded parallel learning, which reduces communication cost, improves training efficiency, enriches exploration strategies and enlarges the explored area. The method is more intelligent, adapts well to static complex obstacle environments, in particular 回-shaped obstacle environments, executes obstacle avoidance efficiently, and is well suited to the technical field of mobile robot navigation.
Drawings
FIG. 1 is a flow chart of a static environment obstacle avoidance algorithm;
FIG. 2 is a view of the 回-shaped obstacle environment;
FIG. 3 is a laser sensor detection model;
FIG. 4 is a mobile robot coordinate system;
FIG. 5 is a general static obstacle environment;
FIG. 6 shows a simulation result of obstacle avoidance for a general static obstacle environment;
FIG. 7 is a curve of a mobile robot heading angle adjustment process;
FIG. 8 is a simulation result of obstacle avoidance in the 回-shaped obstacle environment;
FIG. 9 is a schematic diagram of the A3C-LSTM neural network structure.
Detailed Description
The invention is further described below with reference to the accompanying drawings and examples.
The invention discloses a mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment, which enables a mobile robot to avoid obstacles effectively when working in a complex static obstacle environment containing 回-shaped obstacles. The raw data collected by a laser range finder are processed, the processed data serve as the state S of the A3C algorithm, a corresponding A3C-LSTM neural network is constructed with the state S as its input, the network outputs the corresponding parameters via the A3C algorithm, and the action executed by the mobile robot at each step is sampled from a normal distribution defined by those parameters; the flow chart of the overall obstacle avoidance algorithm is shown in FIG. 1. Because the environment is complex and varied, and the shape of 回-shaped obstacles in particular is complex and varied, the mobile robot can easily become trapped inside such an obstacle and be unable to escape, a situation in which traditional obstacle avoidance methods struggle to obtain good results. The proposed algorithm needs no model of the environment: it is a deep reinforcement learning algorithm based on a trial-and-error mechanism that maximises the reward, so only the reward or penalty for each action of the mobile robot needs to be given, and ultimately the mobile robot can successfully avoid obstacles in the complex and varied environment, in particular escape from 回-shaped obstacles and reach the virtual target. Considering kinematic constraints of the mobile robot, such as the maximum turning rate, a continuous action space model with a heading-turning constraint is designed and multi-threaded asynchronous learning is adopted. Simulation results show that for static complex obstacle environments, in particular 回-shaped obstacle environments, the proposed algorithm adapts well to the environment, executes obstacle avoidance efficiently and can successfully escape from obstacle traps, making it well suited to autonomous safe navigation of mobile robots.
The invention comprises the following steps:
Step one: obtain raw data with the laser range finder carried by the mobile robot, and obtain obstacle information:
the invention simulates real laser range finder to collect data, the laser range finder equipped by a wheeled mobile robot 'traveler No. 4' is a master model, the detection range is 80 meters, the open angle is 180 degrees, the resolution is 1 degree, the original data collected by the laser range finder is a series of discrete data points, the data are given in a polar coordinate mode under a local coordinate system and all contain distance information and azimuth information, and the distance and azimuth information of an obstacle under the local coordinate system can be obtained through the information. The data detected back by 180 beams from the laser rangefinder is shown in fig. 3.
Step two: process the raw data from step one and combine them with the relevant position and angle information in the global coordinate system to form the state S of the A3C algorithm:
local coordinate system X m O m Y m The method is characterized in that the mobile robot is taken as the origin of coordinates, the motion direction of the mobile robot is taken as the positive direction of an X axis, and the positive direction of a Y axis is vertical to the X axis and meets the right-hand rule; global coordinate system X e O e Y e The geodetic coordinate system. In the first step, the original data detected by the laser range finder is orientation information in a local coordinate system, and the orientation information needs to be converted into orientation information in a global coordinate system through coordinate transformation, and then the orientation information is correspondingly processed with the angle and position information related to the mobile robot and the virtual target in the global coordinate system to be used as the state S of the A3C algorithm. Suppose that the mobile robot has a position coordinate of (x) in the global coordinate system at time t t ,y t ) In a sampling period T s And if the mobile robot does uniform linear motion, the kinematic model of the mobile robot is shown as the following formula:
x_(t+1) = x_t + v_t·T_s·cos ψ
y_(t+1) = y_t + v_t·T_s·sin ψ
where (x_(t+1), y_(t+1)) are the position coordinates of the mobile robot in the global coordinate system at time t+1, v_t is the speed of the mobile robot during the sampling period T_s, and ψ is the angle between the robot's heading and the positive direction of the X_e axis.
The position of an obstacle given in polar-coordinate form in the local coordinate system is converted into rectangular-coordinate form as follows:
x_o = l·sin α
y_o = l·cos α
where (x_o, y_o) is the position of the obstacle in rectangular coordinates in the local coordinate system, (l, α) is the raw polar-coordinate reading of the obstacle detected by the laser range finder, l is the straight-line distance from the obstacle to the origin O_m, and α is the angle between the line connecting the obstacle to O_m and the positive direction of the Y_m axis.
Finally, all local information is converted into information in the global coordinate system; the coordinates of an obstacle in the local coordinate system are converted into rectangular coordinates (x_e, y_e) in the global coordinate system as follows:
x_e = x_t + x_o·cos ψ − y_o·sin ψ
y_e = y_t + x_o·sin ψ + y_o·cos ψ
where ψ is the heading angle of the mobile robot, i.e. the angle between its heading direction and the positive direction of the global X_e axis, and (x_o, y_o) is the obstacle position in the local coordinate system obtained above. The coordinate systems of the mobile robot used throughout this transformation are shown in FIG. 4.
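As an illustration of the transformations above, the following Python sketch implements the kinematic update and the local-to-global conversion in the form reconstructed here; the function names and the exact sign conventions are assumptions of this sketch, not text taken from the patent.

```python
import math

def step_position(x_t, y_t, v_t, psi, T_s=1.0):
    """Kinematic update for uniform straight-line motion over one sampling
    period T_s; psi is the heading angle measured from the positive X_e axis
    (radians)."""
    x_next = x_t + v_t * T_s * math.cos(psi)
    y_next = y_t + v_t * T_s * math.sin(psi)
    return x_next, y_next

def polar_to_local(l, alpha):
    """Convert a laser return (l, alpha) to rectangular local coordinates.
    alpha is measured from the positive Y_m axis, so the component along the
    motion direction X_m is l*sin(alpha)."""
    x_o = l * math.sin(alpha)
    y_o = l * math.cos(alpha)
    return x_o, y_o

def local_to_global(x_o, y_o, x_t, y_t, psi):
    """Rotate by the heading angle psi and translate by the robot position to
    obtain the obstacle position in the global frame X_eO_eY_e."""
    x_e = x_t + x_o * math.cos(psi) - y_o * math.sin(psi)
    y_e = y_t + x_o * math.sin(psi) + y_o * math.cos(psi)
    return x_e, y_e
```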
Step three: design the action space and reward function of the A3C algorithm for this environment; through the reward function, the mobile robot receives a corresponding reward or penalty after executing each action:
the designed motion space is a continuous motion space with the change range of each step of the heading angle of the mobile robot within the range of [ -10 degrees, +10 degrees ], wherein one step represents 1s, the speed of the mobile robot in the whole obstacle avoidance process is constant and is 1m/s, the initial heading angle is set to be 0 degree by taking the north direction as the reference, and the anticlockwise direction is positive. The reward function comprises three parts of penalty for the distance from the target, penalty for the distance from the obstacle and penalty for the used step length, and represents the penalty for the distance from the nearest obstacle to the current mobile robot; punishment on the distance between the target and the mobile robot; and punishment on the number of steps used in the whole round, wherein when the mobile robot executes one action in each step length, the reward function carries out corresponding punishment on the quality of the action selected in the current state.
The reward function R is shown as follows:
R(s,a)=p·tar_dis+q·obs_dis+k·step_count
where p·tar_dis is R_1, the penalty of the reward function on the distance between the mobile robot and the target; tar_dis is the distance between the mobile robot and the virtual target at the current moment, and p is the target reward coefficient. Since the final aim is to reach the target, p is set to a negative value, so the larger tar_dis is, the larger the penalty given by the reward function; conversely, the smaller the penalty, which in effect is a reward.
q·obs_dis is R_2, the penalty of the reward function on the distance between the obstacle and the current position of the mobile robot, where the obstacle concerned is the nearest one; obs_dis is the distance between the current position of the mobile robot and the nearest obstacle, and q is the obstacle reward coefficient. Since the primary goal is to avoid obstacles successfully, a corresponding safety distance is set and q is set to a positive value: the larger the distance between an obstacle outside the safety distance and the mobile robot, the better, so the larger obs_dis is, the larger the reward given by the reward function; conversely, the smaller the reward, which in effect is a penalty.
k·step_count is R_3, where k is the time penalty coefficient and step_count is the current accumulated number of steps. One of the important indicators for evaluating an obstacle avoidance algorithm is its cost function, and step_count directly represents the cost of obstacle avoidance for the whole episode, so it must be rewarded or penalised accordingly to keep the obstacle avoidance path as short and efficient as possible. R_3 represents the penalty on the time spent, i.e. the number of steps used for obstacle avoidance; k is taken as a negative number, so the larger step_count is, the larger the penalty given by the reward function; conversely, the smaller the penalty, which in effect is a reward.
Using the reward-maximising property of reinforcement learning, each action of the mobile robot is further rewarded or penalised according to whether the robot successfully avoids obstacles, whether it reaches the virtual target and whether the cost is low, so that the mobile robot finally achieves the desired goals of successfully avoiding obstacles, reaching the virtual target, producing a smooth obstacle avoidance path and using a low cost, i.e. a short obstacle avoidance path. For obstacle avoidance, two cases are distinguished according to whether or not an obstacle lies within the detection range of the laser range finder. If there is no obstacle within the detection range, a corresponding reward is given for the robot's heading angle; if the mobile robot moves towards the target, the reward is given as follows:
(formula given as an image in the original; not reproduced here)
where tar_det is, in the global coordinate system X_eO_eY_e, the difference between the robot's heading angle and the angle formed by the line from the robot to the target and the positive direction of the X_e axis. If an obstacle exists within the detection range, the collision state of the mobile robot is penalised, as follows:
(formula given as an image in the original; not reproduced here)
where obs_dis is the distance between the current position of the mobile robot and the nearest obstacle, and dis_crash is the collision distance, taken as the safety distance: if obs_dis <= dis_crash, a collision is deemed to have occurred, the corresponding penalty is applied, the current episode ends and the next episode begins. The reward or penalty for whether the mobile robot reaches the target is given as follows:
(formula given as an image in the original; not reproduced here)
where tar_dis is the distance between the mobile robot and the virtual target at the current moment, and dis_reach is the set arrival distance: if tar_dis <= dis_reach, the mobile robot is deemed to have reached the target and is rewarded. In order to make the mobile robot reach the target more reliably and consistently, a continuous-arrival reward R_1' is set, as follows:
(formula given as an image in the original; not reproduced here)
and if the number of the rounds which are not reached is not reached, the ep _ count is cleared.
Step four: design the complex static obstacle environment containing a 回-shaped obstacle environment, the initial position of the mobile robot and the virtual target:
the size of the simulation environment is 800 multiplied by 600, the initial position of the mobile robot is set to be (50,100), the initial heading angle is defined to be 0 degrees by taking the positive north direction as a reference, namely the positive direction of the y axis of the global coordinate system, the dark gray square is a virtual target, and the side length is 30. The device comprises two static obstacle environments, namely a general static obstacle environment and a square obstacle environment, which are different in type. The general static obstacle environment is shown in fig. 5, and includes various types of static obstacles, which are all represented by black areas, a dark gray rectangular area is a virtual target, and a light gray curve is a motion trajectory of the mobile robot. The speed v of the mobile robot is 1m/s, and in consideration of reality, in order to make the simulation more real, the mobile robot is provided with a rotation angle limit, so that the change of the motion direction angle of the mobile robot is more reasonable. Fig. 6 shows the obstacle avoidance simulation result of the mobile robot in the general static obstacle environment. The corresponding curve of the mobile robot heading angle adjustment process of fig. 6 is shown in fig. 7. Fig. 8 shows a simulation result of obstacle avoidance for a circle-shaped obstacle environment.
Step five: establishing an A3C-LSTM neural network, taking the state S as an input layer of the network:
the whole network framework is divided into a global network part and a local network part, as shown in fig. 9. Due to parallel asynchronous training, 4 local networks are set. Both the global network and the local network contain two sub-network structures of Actor and Critic, but the global network only plays a role in storing network related parameters and in time pulling the related parameters to the local network. The output layer of the Actor sub-network in the local network outputs normally distributed parameters for selecting the action executed by the mobile robot, namely heading angle variation, and the criticic sub-network outputs evaluation values of the selected action in the state.
In the A3C-LSTM neural network, the input of the Actor network is the 8-dimensional state information obtained after data processing, and there are three hidden layers. The input layer is connected to an LSTM layer containing 64 LSTM memory units, with batch_size 1 and the initial state all zeros. The output of the LSTM layer is fed to the second hidden layer, a nonlinear fully connected layer with the RELU6 activation function; its output is fed to the last hidden layer, which is also a nonlinear fully connected layer with RELU6. Finally comes the output layer, with dimension 2 and activation functions tanh and softplus respectively, connected in a nonlinear fully connected manner; the final outputs are σ and μ. Because the action space is continuous, these output parameters determine a normal distribution, and the selected action is obtained by sampling directly from that distribution according to its probability. The Critic network has essentially the same structure as the Actor network, except for the number of neurons in the hidden layers and the activation functions; the output layer of the Critic network is a single layer of dimension 1, and its final output is V(s_t), the evaluation of the selected action. An advantage function is used because, compared with a baseline, it states more clearly how good or bad the selected action is in the current state and gives that quality a quantitative measure, which makes the evaluation more accurate and helps the Actor network grasp more precisely the direction in which it should keep updating to maximise the reward. In addition, the network update is driven by the Critic network; the LSTM layer in the Actor network is the same as that in the Critic network and does not update the memory module itself but is applied directly.
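A minimal sketch of the Actor branch described above, written here in PyTorch purely for illustration, is given below. The widths of the fully connected layers, the scaling of μ onto the ±10° action range and the sampling helper are assumptions of this sketch; the Critic branch (same structure with a single-value output V(s_t)) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorLSTM(nn.Module):
    """Actor sub-network sketch: 8-dim state -> LSTM(64) -> two ReLU6 fully
    connected layers -> (mu, sigma) of a normal distribution over the
    heading-angle change."""

    def __init__(self, state_dim=8, lstm_units=64, fc_units=128, max_turn=10.0):
        super().__init__()
        self.max_turn = max_turn                      # degrees per step
        self.lstm = nn.LSTM(state_dim, lstm_units, batch_first=True)
        self.fc1 = nn.Linear(lstm_units, fc_units)
        self.fc2 = nn.Linear(fc_units, fc_units)
        self.mu_head = nn.Linear(fc_units, 1)
        self.sigma_head = nn.Linear(fc_units, 1)

    def forward(self, state_seq, hidden=None):
        # state_seq: (batch=1, seq_len, state_dim); hidden=None starts from zeros.
        out, hidden = self.lstm(state_seq, hidden)
        x = F.relu6(self.fc1(out[:, -1, :]))
        x = F.relu6(self.fc2(x))
        mu = torch.tanh(self.mu_head(x)) * self.max_turn    # centre of the action
        sigma = F.softplus(self.sigma_head(x)) + 1e-4        # positive std. dev.
        return mu, sigma, hidden

def select_action(net, state_seq, hidden=None):
    """Sample a heading-angle change from the output normal distribution and
    clip it to the [-10, +10] degree action space."""
    mu, sigma, hidden = net(state_seq, hidden)
    dist = torch.distributions.Normal(mu, sigma)
    action = dist.sample().clamp(-net.max_turn, net.max_turn)
    return action.item(), hidden
```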
Step six: training and learning the established A3C-LSTM neural network by using an A3C algorithm, finishing the training, and applying the stored network model to the mobile robot to avoid the obstacle:
the whole network works by 4 threads in parallel, namely 4 local networks exist, in the training process, each thread can adopt different exploration strategies, correspondingly obtain different states, explore different areas of the environment, and after 4 threads finish one-step action selection, the global network stores corresponding parameters of local networks with best performance and pulls the parameters to each local network to update the local networks. Therefore, more strategies can be explored while the correlation is removed, the exploration area is expanded, finally, the strategies are continuously updated in the direction of maximizing the reward, the training is finished until the number of fixed rounds is reached and the total reward obtained by each round of the mobile robot in the final training process tends to be stable, a trained A3C-LSTM network is obtained, and the network can be directly applied to obstacle avoidance of the mobile robot.
In order to solve the local-trap problem of complex static obstacles, in particular the difficulty that a 回-shaped obstacle in an actual unknown working environment poses for a mobile robot, and to improve the efficiency with which the mobile robot avoids complex 回-shaped obstacles, the invention provides a mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment.
To verify the effect of the proposed obstacle avoidance algorithm in escaping from a complex static obstacle environment containing a 回-shaped obstacle, a complete experiment in which the mobile robot avoids the obstacle and successfully reaches the virtual target was designed. The initial position of the mobile robot is set to (360, 320), the size of the simulation environment is 800 × 600, the heading angle is referenced to the positive Y_e axis of the global coordinate system, with due north defined as 0°, and the initial heading angle is 0°. The 回-shaped obstacle area is drawn in black, and the virtual target to be reached after the mobile robot escapes from the 回-shaped obstacle is set to (680, 320). In the end the mobile robot can safely escape from the 回-shaped obstacle and reach the designated virtual target. When the distance between the mobile robot and the virtual target is 15 m, the target-reached flag becomes true, marking that the robot has reached the target and successfully completed the obstacle avoidance task; the training episode then ends and a new episode begins, until the preset number of episodes is reached.
Using the obstacle avoidance algorithm designed for 回-shaped obstacles, a concave obstacle environment consisting of a complex 回-shaped obstacle in a two-dimensional environment is set up, as shown in FIG. 8. The light grey curve in the figure is the motion trajectory of the mobile robot. It can be seen that the mobile robot selects a reasonable obstacle avoidance path and successfully reaches the virtual target without incurring obstacle avoidance risk, while keeping the required safe obstacle avoidance distance; the simulation results demonstrate the reasonableness and feasibility of the proposed method.
In summary, for the obstacle avoidance problem of complex static obstacles, in particular 回-shaped obstacle environments, the invention provides a mobile robot obstacle avoidance planning method based on deep reinforcement learning in a static environment.

Claims (6)

1. A mobile robot collision avoidance planning method based on deep reinforcement learning in a static environment is characterized by comprising the following steps:
step one: acquiring raw data with a laser range finder carried by the mobile robot to obtain obstacle information;
step two: after data processing is carried out on the original data in the step one, corresponding processing is carried out by combining the original data with relevant position and angle information in a global coordinate system, and the processed data is the state S of the A3C algorithm;
step three: designing an action space and a reward function of an A3C algorithm under the environment, and performing corresponding reward and punishment after the mobile robot performs each step of action through the reward function;
the designed action space is a continuous action space with the change range of each step of the heading angle of the mobile robot within the range of [ -10 degrees, +10 degrees ], wherein one step represents 1s, the speed of the mobile robot in the whole obstacle avoidance process is constant and is 1m/s, the initial heading angle is set to be 0 degree by taking the north direction as the reference, and the anticlockwise direction is positive;
the reward function comprises three parts: a penalty on the distance to the target, a penalty on the distance to the obstacle, and a penalty on the number of steps used, representing respectively the penalty on the distance between the nearest obstacle and the current position of the mobile robot, the penalty on the distance between the target and the mobile robot, and the penalty on the number of steps used in the whole episode; each time the mobile robot finishes executing an action in a step, the reward function rewards or penalises the quality of the action selected in the current state;
the reward function R is shown as follows:
R(s,a)=p·tar_dis+q·obs_dis+k·step_count
wherein p·tar_dis is R_1, representing the penalty of the reward function on the distance between the mobile robot and the target; tar_dis represents the distance between the mobile robot and the virtual target at the current moment; p is the target reward coefficient, and p is set to a negative value;
q·obs_dis is R_2, representing the penalty of the reward function on the distance between the obstacle and the current position of the mobile robot, wherein the obstacle is the nearest obstacle; obs_dis represents the distance between the current position of the mobile robot and the nearest obstacle; q is the obstacle reward coefficient; since the main purpose is to avoid obstacles successfully, a corresponding safety distance is set and q is set to a positive value, so that the larger the distance between an obstacle outside the safety distance and the mobile robot the better, and the larger obs_dis is, the larger the reward given by the reward function;
k·step_count is R_3; k is the time penalty coefficient and step_count is the current accumulated number of steps; R_3 represents the penalty on the number of steps spent avoiding obstacles, and k is taken as a negative number so that the larger step_count is, the larger the penalty given by the reward function;
by utilising the reward-maximising property of reinforcement learning, each action of the mobile robot is further rewarded or penalised according to whether the mobile robot successfully avoids obstacles, whether it reaches the virtual target and whether the cost is low, so that in the complex static obstacle environment containing a 回-shaped obstacle the mobile robot finally achieves the desired goals of successfully avoiding obstacles, reaching the virtual target, obtaining a smooth obstacle avoidance path and using a low cost, i.e. a short obstacle avoidance path; for obstacle avoidance, two cases are distinguished according to whether or not an obstacle exists within the detection range of the laser range finder:
if no obstacle exists in the detection range, corresponding reward is given to the heading angle of the robot, and if the mobile robot moves towards the target direction, the reward is given as shown in the following formula:
(formula given as an image in the original; not reproduced here)
wherein tar_det is, in the global coordinate system X_eO_eY_e, the difference between the robot's heading angle and the angle formed by the line connecting the robot and the target and the positive direction of the X_e axis;
if an obstacle exists in the detection range, punishing the collision state of the mobile robot, as shown in the following formula:
(formula given as an image in the original; not reproduced here)
wherein obs_dis represents the distance between the current position of the mobile robot and the nearest obstacle; dis_crash is the set collision distance, taken as the safety distance: if obs_dis <= dis_crash, a collision is deemed by default to have occurred, the corresponding penalty is applied, the current episode ends and the next episode begins;
the reward and punishment on whether the mobile robot reaches the target is shown as follows:
(formula given as an image in the original; not reproduced here)
wherein tar_dis represents the distance between the mobile robot and the virtual target at the current moment; dis_reach is the set arrival distance, i.e. when tar_dis <= dis_reach the mobile robot is deemed by default to have reached the target and is rewarded; in order to make the mobile robot reach the target more reliably and consistently, a continuous-arrival reward R_1' is set, as shown in the following formula:
(formula given as an image in the original; not reproduced here)
wherein k is the continuous-arrival reward coefficient; ep_count is the cumulative number of consecutive episodes in which the mobile robot reaches the target, and if an episode occurs in which the target is not reached, ep_count is reset;
step four: designing a complex static obstacle environment, an initial position of a mobile robot and a virtual target;
step five: establishing an A3C-LSTM neural network, and taking the state S as an input layer of the network;
step six: and (3) training and learning the established A3C-LSTM neural network by using an A3C algorithm, finishing training, and applying the stored network model to the obstacle avoidance of the mobile robot.
2. The mobile robot collision avoidance planning method based on deep reinforcement learning in the static environment according to claim 1, characterized in that: in step one, raw data are acquired by the laser range finder carried by the mobile robot; in obtaining the obstacle information, the real data collected by a laser range finder are simulated, the raw data collected by the laser range finder being a series of discrete data points given in polar-coordinate form in the local coordinate system, each containing distance and azimuth information, from which the distance and bearing information of obstacles in the local coordinate system are obtained.
3. The mobile robot collision avoidance planning method based on deep reinforcement learning in the static environment according to claim 1, characterized in that: in step two, the raw data from step one are processed and combined with the relevant position and angle information in the global coordinate system, the processed data forming the state S of the A3C algorithm, wherein the local coordinate system X_mO_mY_m takes the mobile robot as its origin, the motion direction of the mobile robot as the positive X axis, and the positive Y axis perpendicular to the X axis so as to satisfy the right-hand rule; the global coordinate system X_eO_eY_e is the geodetic coordinate system; the raw data detected by the laser range finder in step one are bearing information in the local coordinate system, which must be converted by a coordinate transformation into bearing information in the global coordinate system and then processed together with the position and angle information of the mobile robot and the virtual target in the global coordinate system to form the state S of the A3C algorithm.
4. The mobile robot collision avoidance planning method based on deep reinforcement learning in the static environment according to claim 1, characterized in that: in step four, in designing the complex static obstacle environment, the initial position of the mobile robot and the virtual target, the complex static obstacle environment is a 回-shaped obstacle environment.
5. The mobile robot collision avoidance planning method based on deep reinforcement learning in the static environment according to claim 1, characterized in that: in step five, the A3C-LSTM neural network is built and the state S is used as the input layer of the network, wherein the whole network framework is divided into a global network and local networks, with 4 local networks owing to parallel asynchronous training; both the global network and the local networks contain Actor and Critic sub-networks, wherein the output layer of the Actor sub-network in a local network outputs the parameters of a normal distribution used to select the action executed by the mobile robot, namely the heading-angle change, and the Critic sub-network outputs an evaluation of the selected action in that state; each sub-network consists of an input layer, an LSTM hidden layer, two fully connected hidden layers and an output layer, the two sub-networks differing in the number of neurons in the hidden layers, the activation functions and the outputs.
6. The mobile robot collision avoidance planning method based on deep reinforcement learning in the static environment according to claim 1, characterized in that: in step six, the built A3C-LSTM neural network is trained with the A3C algorithm and, after training, the saved network model is applied to obstacle avoidance of the mobile robot, wherein the whole network runs with 4 threads in parallel; during training the network parameters are continuously updated in the direction that maximises the reward, and training ends when a fixed number of episodes has been reached and the total reward obtained per episode has become stable, yielding a trained A3C-LSTM network that is applied directly to obstacle avoidance of the mobile robot.
CN201910953377.1A 2019-10-09 2019-10-09 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment Active CN110750096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910953377.1A CN110750096B (en) 2019-10-09 2019-10-09 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910953377.1A CN110750096B (en) 2019-10-09 2019-10-09 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment

Publications (2)

Publication Number Publication Date
CN110750096A CN110750096A (en) 2020-02-04
CN110750096B true CN110750096B (en) 2022-08-02

Family

ID=69277811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910953377.1A Active CN110750096B (en) 2019-10-09 2019-10-09 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment

Country Status (1)

Country Link
CN (1) CN110750096B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582441B (en) * 2020-04-16 2021-07-30 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
CN112084700A (en) * 2020-08-06 2020-12-15 南京航空航天大学 Hybrid power system energy management method based on A3C algorithm
CN112304314A (en) * 2020-08-27 2021-02-02 中国科学技术大学 Distributed multi-robot navigation method
CN111880549B (en) * 2020-09-14 2024-06-04 大连海事大学 Deep reinforcement learning rewarding function optimization method for unmanned ship path planning
CN112801272A (en) * 2021-01-27 2021-05-14 北京航空航天大学 Fault diagnosis model self-learning method based on asynchronous parallel reinforcement learning
CN113093727B (en) * 2021-03-08 2023-03-28 哈尔滨工业大学(深圳) Robot map-free navigation method based on deep security reinforcement learning
CN112991544A (en) * 2021-04-20 2021-06-18 山东新一代信息产业技术研究院有限公司 Group evacuation behavior simulation method based on panoramic image modeling
CN113156959B (en) * 2021-04-27 2024-06-04 东莞理工学院 Self-supervision learning and navigation method for autonomous mobile robot in complex scene
CN113392584B (en) * 2021-06-08 2022-12-16 华南理工大学 Visual navigation method based on deep reinforcement learning and direction estimation
CN114237235B (en) * 2021-12-02 2024-01-19 之江实验室 Mobile robot obstacle avoidance method based on deep reinforcement learning
CN114526738B (en) * 2022-01-25 2023-06-16 中国科学院深圳先进技术研究院 Mobile robot visual navigation method and device based on deep reinforcement learning
CN114708527A (en) * 2022-03-09 2022-07-05 中国石油大学(华东) Polar coordinate representation-based digital curling strategy value extraction method
CN115857556B (en) * 2023-01-30 2023-07-14 中国人民解放军96901部队 Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190272465A1 (en) * 2018-03-01 2019-09-05 International Business Machines Corporation Reward estimation via state prediction using expert demonstrations

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107092254A (en) * 2017-04-27 2017-08-25 北京航空航天大学 A kind of design method for the Household floor-sweeping machine device people for strengthening study based on depth
CN108255182A (en) * 2018-01-30 2018-07-06 上海交通大学 A kind of service robot pedestrian based on deeply study perceives barrier-avoiding method
CN109540151A (en) * 2018-03-25 2019-03-29 哈尔滨工程大学 A kind of AUV three-dimensional path planning method based on intensified learning
CN109063823A (en) * 2018-07-24 2018-12-21 北京工业大学 A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D
CN110209152A (en) * 2019-06-14 2019-09-06 哈尔滨工程大学 The deeply learning control method that Intelligent Underwater Robot vertical plane path follows

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Deep Robust Reinforcement Learning for Practical Algorithmic Trading; Yang Li et al.; IEEE Access; 2019-08-02; Vol. 7; 108014-108022 *
Research on robot path planning in unknown environments based on deep reinforcement learning; Bu Xiangjin; China Master's Theses Full-text Database, Information Science and Technology; 2019-01-15 (No. 01, 2019); I140-1640 *
Research on mobile robot control methods based on deep reinforcement learning in complex scenes; Zhou Neng; China Master's Theses Full-text Database, Information Science and Technology; 2019-08-15 (No. 08, 2019); I140-340 *
Research on obstacle avoidance strategies for intelligent logistics vehicles; Wang Lei et al.; Journal of Highway and Transportation Research and Development; 2018-12-31 (No. 4); 309-313 *

Also Published As

Publication number Publication date
CN110750096A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN110750096B (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
Zhu et al. Deep reinforcement learning based mobile robot navigation: A review
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
Wen et al. Path planning for active SLAM based on deep reinforcement learning under unknown environments
CN108036790B (en) Robot path planning method and system based on ant-bee algorithm in obstacle environment
Cao et al. Target search control of AUV in underwater environment with deep reinforcement learning
Xiang et al. Continuous control with deep reinforcement learning for mobile robot navigation
CN114625151A (en) Underwater robot obstacle avoidance path planning method based on reinforcement learning
Wei et al. Recurrent MADDPG for object detection and assignment in combat tasks
CN114083539B (en) Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning
CN112356031B (en) On-line planning method based on Kernel sampling strategy under uncertain environment
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
Guo et al. Research on multi-sensor information fusion and intelligent optimization algorithm and related topics of mobile robots
Ma et al. Multi-AUV collaborative operation based on time-varying navigation map and dynamic grid model
Guo et al. Local path planning of mobile robot based on long short-term memory neural network
CN114596360A (en) Double-stage active instant positioning and graph building algorithm based on graph topology
CN114815801A (en) Adaptive environment path planning method based on strategy-value network and MCTS
Chen et al. Deep reinforcement learning-based robot exploration for constructing map of unknown environment
Shi et al. Research on Path Planning Strategy of Rescue Robot Based on Reinforcement Learning
CN115562258A (en) Robot social self-adaptive path planning method and system based on neural network
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Toan et al. Environment exploration for mapless navigation based on deep reinforcement learning
Wang et al. Research on SLAM road sign observation based on particle filter
Wang et al. Path planning model of mobile robots in the context of crowds
Meyer-Delius et al. Grid-based models for dynamic environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant