CN112799386B - Robot path planning method based on artificial potential field and reinforcement learning

Robot path planning method based on artificial potential field and reinforcement learning

Info

Publication number
CN112799386B
Authority
CN
China
Prior art keywords
potential field
reinforcement learning
field
path planning
obstacle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911020333.XA
Other languages
Chinese (zh)
Other versions
CN112799386A
Inventor
么庆丰
郑泽宇
赵明
潘怡君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Institute of Automation of CAS
Original Assignee
Shenyang Institute of Automation of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Institute of Automation of CAS filed Critical Shenyang Institute of Automation of CAS
Priority to CN201911020333.XA
Publication of CN112799386A
Application granted
Publication of CN112799386B
Legal status: Active

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02: Control of position or course in two dimensions
    • G05D1/021: Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G05D1/0276: Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle

Abstract

The invention discloses a robot path planning method based on an artificial potential field and reinforcement learning, and belongs to the field of path planning. First, a map is constructed using the artificial potential field method. Second, a small-range, strong-acting-force domain potential field is added for the multi-target case. Finally, path planning for multiple agents under multi-target conditions is realized using reinforcement learning and distributed curriculum learning. The method combines the artificial potential field method with reinforcement learning, effectively models the multi-target environment, reduces the occurrence of local stable points, uses reinforcement learning to learn to avoid the local stable points that remain, and improves the success rate of path planning. The invention provides high reliability for path planning.

Description

Robot path planning method based on artificial potential field and reinforcement learning
Technical Field
The invention belongs to the field of path planning, and particularly relates to a path planning method based on an artificial potential field that utilizes temporal-difference learning and reinforcement learning.
Background
With the continuous development of intelligent agents and artificial intelligence theory, autonomous mobile agent technology has become increasingly mature and is widely applied in fields such as industry, the military, medical treatment, and services. At the same time, the tasks assigned to agents have become more complex, and the environment has shifted from the original single-agent, deterministic setting to a multi-agent, uncertain one. Therefore, in recent years, research on autonomous intelligent control of agents in complex systems has gained wide attention in academia and industry, and path planning and navigation, as key technologies, have become one of the current research hotspots.
Current path planning techniques fall into two broad categories: global planning based on a known environment and local planning based on sensed information. The former performs path planning in a static, known environment and is also called static path planning; commonly used methods include the greedy algorithm, Dijkstra's algorithm, and the A* algorithm. The latter performs real-time path planning from sensor input when environmental information is unknown; mainstream methods include the artificial potential field method, neural network methods, and fuzzy logic methods.
The artificial potential field method is a virtual force field method: it models the motion of an agent in the environment as motion in an artificial force field, in which the target point generates an attractive force and obstacles generate repulsive forces, and the resultant of the attraction and repulsion controls the motion of the robot. Because of its simple mathematical analysis, small computational cost, and smooth paths, the algorithm is widely applied to real-time obstacle avoidance and path planning.
Disclosure of Invention
The invention provides a new path planning method that adds a small-range, strong-acting-force domain field to the original artificial potential field method and further applies a reinforcement learning algorithm within the domain potential field to solve multi-target-point navigation and obstacle avoidance.
The technical scheme adopted by the invention for realizing the purpose is as follows:
the robot path planning method based on the artificial potential field and the reinforcement learning comprises the following steps:
Step one: construct an artificial potential field formed by superposing an attractive potential field and a repulsive potential field; the target point provides attraction to the agent, forming the attractive potential field; the obstacle provides repulsion to the agent, forming the repulsive potential field;
Step two: pre-train the reinforcement learning in the domain-artificial potential field to obtain a reinforcement learning strategy; the agent avoids obstacles and searches for the target point according to this strategy.
The artificial potential field path planning method further includes an intelligent-algorithm optimization for non-convex obstacles: the agent that has learned the preliminary strategy in step two is further trained on the specific local-stable-point cases, learning to handle environments with complex conditions.
The construction process of the potential field in the first step is as follows:
1) according to the positions of the obstacle and the target point, respectively construct the obstacle's repulsive field and the target point's attractive field; the attractive field is:
U_att(q) = (1/2)·k_att·‖q − q_g‖²

where U_att(q) is the attractive field generated by the target point at position q, k_att is the attraction coefficient of the target point (the larger the coefficient, the stronger the attraction), q is the position coordinate, and q_g is the coordinate of the target point, so the potential at q_g is 0;
2) construct the repulsive field of the obstacle:

U_rep(q) = (1/2)·k_rep·(1/‖q − q_0‖ − 1/p_0)² when ‖q − q_0‖ ≤ p_0, and U_rep(q) = 0 when ‖q − q_0‖ > p_0

where U_rep(q) is the repulsive field generated by the obstacle at position q, k_rep is the repulsion coefficient of the obstacle (the larger the coefficient, the stronger the repulsion around the obstacle), ‖q − q_0‖ is the distance between the current position coordinate and the obstacle, and p_0 is the range of the obstacle's repulsive field; beyond this range, the robot is not subject to the repulsive force of the obstacle.
The method further comprises constructing a domain potential field for the local-stable-point case:

U_str(q) = (1/2)·k_str·‖q − q_g‖² when ‖q − q_g‖ ≤ p_s, and U_str(q) = 0 when ‖q − q_g‖ > p_s

where U_str(q) is the domain potential field, k_str is the strong attraction coefficient, which is greater than k_att, ‖q − q_g‖ is the distance between the current position coordinate and the target point, and p_s is the domain range within which the strong attraction of the target point can be sensed (see the sketch below).
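The following Python sketch illustrates how the three fields can be superposed at a given position. It is a minimal sketch assuming the quadratic attractive and domain potentials and the inverse-distance repulsive potential written above; the coefficient values and the numpy-based helper names are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def attractive_potential(q, q_goal, k_att=1.0):
    """U_att(q): quadratic attraction toward the target point q_g (zero at the goal)."""
    return 0.5 * k_att * np.linalg.norm(q - q_goal) ** 2

def repulsive_potential(q, q_obs, k_rep=1.0, p0=2.0):
    """U_rep(q): repulsion from one obstacle, active only within range p0."""
    d = np.linalg.norm(q - q_obs)
    if d > p0 or d == 0.0:
        return 0.0
    return 0.5 * k_rep * (1.0 / d - 1.0 / p0) ** 2

def domain_potential(q, q_goal, k_str=5.0, ps=1.0):
    """U_str(q): small-range, strong attraction domain field (k_str > k_att)."""
    d = np.linalg.norm(q - q_goal)
    if d > ps:
        return 0.0
    return 0.5 * k_str * d ** 2

def total_potential(q, q_goal, obstacles):
    """Superposition of the attractive, repulsive, and domain fields at position q."""
    u = attractive_potential(q, q_goal) + domain_potential(q, q_goal)
    u += sum(repulsive_potential(q, q_obs) for q_obs in obstacles)
    return u
```

The agent would then descend the negative gradient of `total_potential` (approximated numerically, for example) to move toward the target point while keeping away from obstacles.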
In the second step, the pre-training of reinforcement learning in the domain-artificial potential field is carried out to obtain a strategy for reinforcement learning, and the steps are as follows:
1) establish a Q function to calculate the reward value; the agent obtains a reward when it avoids obstacles and reaches the target point. Under the current action and state, the Q function predicts the total reward obtained by following the current policy until the end of the iteration; the agent obtains the reward value as:
Q^π(s, a) = E[r | s_t = s, a_t = a, π]
where Q^π is the Q function of policy π, s is the current state of the agent (i.e., the current potential field), a is the action taken by the agent, E is the mathematical expectation, r is the obtained reward value, s_t is the state of the agent at time t, a_t is the action taken by the agent at time t, and π is the policy currently adopted by the agent;
2) approximate the Q function with a deep neural network: using the deep Q-learning method, the neural network expresses the target Q value, and the value function is learned in combination with a temporal-difference method:
Y_i = r + γ·max_a′ Q(s′, a′ | θ_i)

where Y_i is the temporal-difference target, γ is the decay (discount) rate, max_a′ Q(s′, a′ | θ_i) takes the action a′ that maximizes Q, s′ is the state of the agent at the next time step, a′ is the action taken by the agent at the next time step, and θ_i are the policy parameters adopted by the agent at the i-th iteration;
training was performed using the following loss function:
L(θ_i) = E_{s,a,r,s′}[(Y_i − Q(s, a | θ_i))²]
where L(θ_i) is the loss function and E_{s,a,r,s′} is the expectation over transitions in which the current state is s, the action taken is a, the reward obtained is r, and the next state is s′;
the parameters θ_i of the deep neural network are updated by gradient descent on the loss function, completing the pre-training;
3) obtain the reward value from the real-time action and state of the agent; the action corresponding to the maximum reward value gives the reinforcement learning strategy.
The inputs of the deep neural network are the action a and the state s, and the output is the predicted reward (Q) value.
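As an illustration of steps 1) to 3), the sketch below builds a small Q network and computes the temporal-difference target Y_i and the squared-error loss L(θ_i) defined above. It is a minimal sketch assuming a PyTorch implementation, a nine-action discretization, and a state-in/Q-per-action network layout (rather than feeding a and s jointly as described above); the network sizes and the replay-batch handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

N_ACTIONS = 9  # front, back, left, right, four diagonals, stationary (assumed discretization)

class QNetwork(nn.Module):
    """Approximates Q(s, a | theta_i); input is the state, output is one Q value per action."""
    def __init__(self, state_dim, n_actions=N_ACTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def td_loss(q_net, batch, gamma=0.99):
    """L(theta_i) = E[(Y_i - Q(s, a | theta_i))^2] with Y_i = r + gamma * max_a' Q(s', a')."""
    s, a, r, s_next, done = batch          # tensors: states, actions, rewards, next states, terminal flags
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)              # Q(s, a | theta_i)
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values  # TD target Y_i
    return ((y - q_sa) ** 2).mean()

# one gradient-descent update of theta_i (pre-training step), e.g.:
# q_net = QNetwork(state_dim=4); optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
# loss = td_loss(q_net, batch); optimizer.zero_grad(); loss.backward(); optimizer.step()
```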
Training also uses an ε-greedy method: each time the agent selects a new behavior, a random action is chosen with probability equal to the greedy selection coefficient ε, and with probability 1 − ε the best behavior at the current moment, i.e., the action corresponding to the maximum reward value, is chosen.
For the local-stable-point case, the potential field is the superposition of the attractive field, the repulsive field, and the domain potential field.
The method is used for path planning of the industrial intelligent warehousing robot.
The invention has the following beneficial effects and advantages:
1. The invention makes use of reinforcement learning's balance of exploration and exploitation: where a traditional algorithm falls into a local optimum, reinforcement learning can learn how to escape the local stable point.
2. The method adds a domain field for the multi-target-point case to prevent local stable points from appearing, which also helps the reinforcement learning converge.
3. The attractive potential field emitted by the corresponding target point is controlled by the domain information, which avoids the resource waste caused by several warehousing robots advancing toward the same target at the same time and improves the working efficiency of the multi-robot warehouse.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention;
fig. 2 is a block diagram of a robot training flow of the method of the present invention.
Detailed Description
The invention provides a new path planning method that adds a small-range, strong-acting-force domain field to the original artificial potential field method and further applies a reinforcement learning algorithm within the domain potential field to solve multi-target-point navigation and obstacle avoidance. The warehousing robot can automatically adapt to the environment under a partially observable Markov decision process, and uses basic environmental information, through Q-learning and temporal-difference learning, to solve the local-stable-point problem caused by superposing the potential fields of multiple target points. In addition, a distributed curriculum learning method is used to strengthen training on complex problems and help the warehousing robot handle complex environments. Finally, in the multi-robot case, the domain signal is used as a communication signal to help control the range of action of the corresponding target point's potential field, enhancing the working efficiency of the multiple robots.
Embodiment:
the intelligent storage robot adopts intelligent operating system, through the system instruction, removes required goods shelves to operating personnel in the front for selecting, realizes the novel mode of "people is looked for to goods, people is looked for to the goods shelves": through advanced automatic weighing photographing, multi-layer conveying, cross sorting and other systems, productivity doubling can be achieved. The intelligent warehousing robot has the characteristics of stability, flexibility, high efficiency and intelligence. The intelligent storage system is connected by a wireless network, is provided with radar scanning, automatic searching and positioning, automatic charging, can work continuously for 24 hours, and is intelligently adaptive to various storage modes by utilizing big data analysis.
The use of intelligent warehousing robots helps reduce the cost of logistics sorting and handling, reduces staffing requirements, improves logistics management, lowers the probability of goods being damaged in handling, can improve the sorting efficiency of modern logistics, and promotes the development of the logistics industry. It is therefore valuable to enhance the ability of intelligent warehousing robots to cooperate with each other and to improve their robustness in different situations.
The following detailed description of the steps for carrying out the present invention is provided in conjunction with specific procedures:
As shown in fig. 1, the multi-agent path planning method based on the artificial potential field and reinforcement learning mainly adopts path planning based on reinforcement learning and the artificial potential field together with a curriculum learning path optimization method, and includes the following steps:
the method comprises the following steps: constructing an improved artificial potential field, constructing a virtual potential field in an environment, wherein the potential field is formed by superposing two potential fields, and a target point provides the attraction force for the warehousing robot to form an attraction potential field; the obstacle provides repulsive force to form a repulsive force field. Under the drive of the potential field resultant force, the warehousing robot reaches a target point along a collision-free path.
Step two: the warehousing robot uses reinforcement learning to learn the improved artificial potential field, which is free of local stable points, so as to avoid conventional obstacles and search for target points.
Step three: algorithm optimization of the warehousing robot for non-convex obstacles; the warehousing robot that learned the preliminary strategy in step two is further trained on the specific local-stable-point cases, learning to handle environments with complex conditions.
The potential field construction process in the first step is as follows:
1) according to the positions of the obstacle and the target point, respectively construct the obstacle's repulsive field and the target point's attractive field; the attractive field is:
U_att(q) = (1/2)·k_att·‖q − q_g‖²

where U_att(q) is the attractive field generated by the target point at position q, k_att is the attraction coefficient of the target point (the larger the coefficient, the stronger the attraction), q is the position coordinate, and q_g is the coordinate of the target point, so the potential at q_g is 0;
2) construct the repulsive field of the obstacle:

U_rep(q) = (1/2)·k_rep·(1/‖q − q_0‖ − 1/p_0)² when ‖q − q_0‖ ≤ p_0, and U_rep(q) = 0 when ‖q − q_0‖ > p_0

where U_rep(q) is the repulsive field generated by the obstacle at position q, k_rep is the repulsion coefficient of the obstacle (the larger the coefficient, the stronger the repulsion around the obstacle), ‖q − q_0‖ is the distance between the current coordinate and the obstacle, and p_0 is the range of the obstacle's repulsive field; beyond this range, the robot is not subject to the repulsive force of the obstacle;
3) construct a domain potential field with small range and strong acting force for the local-stable-point case caused by multiple target points:

U_str(q) = (1/2)·k_str·‖q − q_g‖² when ‖q − q_g‖ ≤ p_s, and U_str(q) = 0 when ‖q − q_g‖ > p_s

where U_str(q) is the domain potential field, k_str is the strong attraction coefficient, which is greater than k_att, ‖q − q_g‖ is the distance between the current coordinate and the target point, and p_s is the domain range within which the strong attraction of the target point can be sensed. A local stable point is a point where the resultant of the obstacle repulsion and the target-point attraction experienced by the agent is 0 (see the sketch below).
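Given this definition, a local stable point can be flagged numerically as a position where the resultant force (the negative gradient of the total potential) vanishes although the target has not been reached. The sketch below is a minimal illustration under that definition; the central-difference gradient, the tolerance values, and the callable `potential` interface (for example a closure over the `total_potential` sketch given earlier) are assumptions.

```python
import numpy as np

def numerical_force(q, potential, eps=1e-4):
    """Resultant force = negative gradient of the total potential, by central differences."""
    f = np.zeros_like(q, dtype=float)
    for i in range(len(q)):
        dq = np.zeros_like(q, dtype=float)
        dq[i] = eps
        f[i] = -(potential(q + dq) - potential(q - dq)) / (2 * eps)
    return f

def is_local_stable_point(q, q_goal, potential, force_tol=1e-3, goal_tol=1e-2):
    """True if the resultant force vanishes while the agent is still away from the target."""
    at_goal = np.linalg.norm(q - q_goal) < goal_tol
    return (not at_goal) and np.linalg.norm(numerical_force(q, potential)) < force_tol
```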
As shown in fig. 2, the pre-training of the reinforcement learning in the domain artificial potential field in step two is as follows:
1) establish a Q function to calculate the accumulated reward value; given the current action and state, the Q function predicts the total return obtained by following the current policy until the end of the iteration, which is the accumulated return the agent receives:
Q^π(s, a) = E[r | s_t = s, a_t = a, π]
where Q^π is the Q function of policy π, s is the current state of the warehousing robot, i.e., the current potential field condition, a is the action taken by the warehousing robot (front, back, left, right, front-left, back-left, front-right, back-right, or stationary), E is the mathematical expectation, r is the obtained reward value (used to judge whether the warehousing robot avoids obstacles and reaches the target point), and π is the policy currently adopted by the agent.
2) The traditional approach solves the Q function with an iterative Bellman equation, but this is difficult to realize when the state space is large, so a deep neural network is used to approximate the Q function: with the deep Q-learning method, the neural network expresses the target Q value, and the value function is learned in combination with a temporal-difference method:
Y_i = r + γ·max_a′ Q(s′, a′ | θ_i)

where Y_i is the temporal-difference target, γ is the decay (discount) rate, max_a′ Q(s′, a′ | θ_i) takes the action a′ that maximizes Q, s′ is the state of the agent at the next time step, a′ is the action taken by the agent at the next time step, and θ_i are the policy parameters adopted by the agent at the i-th iteration;
training was performed using the following loss function:
L(θ_i) = E_{s,a,r,s′}[(Y_i − Q(s, a | θ_i))²]
where L(θ_i) is the loss function and E_{s,a,r,s′} is the expectation over transitions in which the current state is s, the action taken is a, the reward obtained is r, and the next state is s′.
At the same time, an ε-greedy method is used: when the warehousing robot selects a new behavior, a random action is chosen with probability ε and the current best action with probability 1 − ε; the value of ε decreases as training time increases.
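A minimal ε-greedy selector matching this description might look as follows; the multiplicative decay schedule with a lower bound is an illustrative assumption.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the action with the largest Q value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay_epsilon(epsilon, rate=0.995, min_epsilon=0.05):
    """Reduce epsilon as training time increases, as described above."""
    return max(min_epsilon, epsilon * rate)
```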
The warehousing robot algorithm is optimized for non-convex obstacles (obstacles whose projections on the horizontal ground are U-shaped or L-shaped):
1) apply the reinforcement learning algorithm to pre-train the warehousing robot as in step two, so that it learns preliminary obstacle avoidance and target navigation capabilities.
2) since obstacles of different shapes easily cause the robot to fall into local stable points, once the warehousing robot can avoid square obstacles it is further trained using the step-three algorithm, as sketched below.
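The staged training in steps 1) and 2) amounts to a curriculum: pre-train against simple convex (square) obstacles, then continue training in environments containing L-shaped and U-shaped obstacles. The sketch below only outlines that progression; the `agent`, `train`, and environment objects are assumed interfaces left to the caller.

```python
def curriculum_training(agent, train, envs, episodes_per_stage=1000):
    """Train on progressively harder environments, e.g. square -> L-shaped -> U-shaped obstacles.

    `agent`, `train(agent, env, episodes)`, and the entries of `envs` are assumed interfaces;
    ordering `envs` from simple to complex is what implements the curriculum.
    """
    for env in envs:
        train(agent, env, episodes_per_stage)
    return agent
```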
The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations of the present invention are within the spirit of the invention and the scope of the claims.

Claims (7)

1. The robot path planning method based on the artificial potential field and the reinforcement learning is characterized by comprising the following steps of:
Step one: construct an artificial potential field formed by superposing an attractive potential field and a repulsive potential field; the target point provides attraction to the agent, forming the attractive potential field; the obstacle provides repulsion to the agent, forming the repulsive potential field;
the construction process of the potential field in the first step is as follows:
1) according to the positions of the obstacle and the target point, respectively construct the obstacle's repulsive field and the target point's attractive field; the attractive field is:
U_att(q) = (1/2)·k_att·‖q − q_g‖²

where U_att(q) is the attractive field generated by the target point at position q, k_att is the attraction coefficient of the target point (the larger the coefficient, the stronger the attraction), q is the position coordinate, and q_g is the coordinate of the target point, so the potential at q_g is 0;
2) construct the repulsive field of the obstacle:

U_rep(q) = (1/2)·k_rep·(1/‖q − q_0‖ − 1/p_0)² when ‖q − q_0‖ ≤ p_0, and U_rep(q) = 0 when ‖q − q_0‖ > p_0

where U_rep(q) is the repulsive field generated by the obstacle at position q, k_rep is the repulsion coefficient of the obstacle (the larger the coefficient, the stronger the repulsion around the obstacle), ‖q − q_0‖ is the distance between the current position coordinate and the obstacle, and p_0 is the range of the obstacle's repulsive field; beyond this range, the robot is not subject to the repulsive force of the obstacle;
further comprising:
constructing a domain potential field for a locally stable point condition
U_str(q) = (1/2)·k_str·‖q − q_g‖² when ‖q − q_g‖ ≤ p_s, and U_str(q) = 0 when ‖q − q_g‖ > p_s

where U_str(q) is the domain potential field, k_str is the strong attraction coefficient, which is greater than k_att, ‖q − q_g‖ is the distance between the current position coordinate and the target point, and p_s is the domain range within which the strong attraction of the target point can be sensed;
Step two: pre-train the reinforcement learning in the domain-artificial potential field to obtain a reinforcement learning strategy; the agent avoids obstacles and searches for the target point according to this strategy.
2. The robot path planning method based on artificial potential field and reinforcement learning of claim 1, characterized by further comprising an intelligent-algorithm optimization for non-convex obstacles: the agent that has learned the preliminary strategy in step two is further trained on the specific local-stable-point cases, learning to handle environments with complex conditions.
3. The method for robot path planning based on artificial potential field and reinforcement learning of claim 1, wherein in the second step, the reinforcement learning is pre-trained in a domain-artificial potential field to obtain a strategy for reinforcement learning, and the steps are as follows:
1) establish a Q function to calculate the reward value; the agent obtains a reward when it avoids obstacles and reaches the target point; under the current action and state, the Q function predicts the total reward obtained by following the current policy until the end of the iteration; the agent obtains the reward value as:
Q^π(s, a) = E[r | s_t = s, a_t = a, π]
where Q^π is the Q function of policy π, s is the current state of the agent (i.e., the current potential field), a is the action taken by the agent, E is the mathematical expectation, r is the obtained reward value, s_t is the state of the agent at time t, a_t is the action taken by the agent at time t, and π is the policy currently adopted by the agent;
2) approximate the Q function with a deep neural network: using the deep Q-learning method, the neural network expresses the target Q value, and the value function is learned in combination with a temporal-difference method:
Y_i = r + γ·max_a′ Q(s′, a′ | θ_i)

where Y_i is the temporal-difference target, γ is the decay (discount) rate, max_a′ Q(s′, a′ | θ_i) takes the action a′ that maximizes Q, s′ is the state of the agent at the next time step, a′ is the action taken by the agent at the next time step, and θ_i are the policy parameters adopted by the agent at the i-th iteration;
training was performed using the following loss function:
L(θ_i) = E_{s,a,r,s′}[(Y_i − Q(s, a | θ_i))²]
where L(θ_i) is the loss function and E_{s,a,r,s′} is the expectation over transitions in which the current state is s, the action taken is a, the reward obtained is r, and the next state is s′;
the parameters θ_i of the deep neural network are updated by gradient descent on the loss function, completing the pre-training;
3) obtain the reward value from the real-time action and state of the agent; the action corresponding to the maximum reward value gives the reinforcement learning strategy.
4. The method of claim 3, wherein the inputs of the deep neural network are a and s, and the output is the reward (Q) value.
5. The method for robot path planning based on artificial potential field and reinforcement learning of claim 3, wherein training also uses an ε-greedy method: each time the agent selects a new behavior, a random action is chosen with probability equal to the greedy selection coefficient ε, and with probability 1 − ε the best behavior at the current moment, i.e., the action corresponding to the maximum reward value, is chosen.
6. The method for robot path planning based on artificial potential field and reinforcement learning of claim 1, wherein for a local stable point situation, the potential field is a superposition of an attraction field, a repulsion field and a domain potential field.
7. The robot path planning method based on the artificial potential field and the reinforcement learning of any one of claims 1 to 6, which is used for path planning of industrial intelligent storage robots.
CN201911020333.XA 2019-10-25 2019-10-25 Robot path planning method based on artificial potential field and reinforcement learning Active CN112799386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911020333.XA CN112799386B (en) 2019-10-25 2019-10-25 Robot path planning method based on artificial potential field and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911020333.XA CN112799386B (en) 2019-10-25 2019-10-25 Robot path planning method based on artificial potential field and reinforcement learning

Publications (2)

Publication Number Publication Date
CN112799386A CN112799386A (en) 2021-05-14
CN112799386B true CN112799386B (en) 2021-11-23

Family

ID=75802949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911020333.XA Active CN112799386B (en) 2019-10-25 2019-10-25 Robot path planning method based on artificial potential field and reinforcement learning

Country Status (1)

Country Link
CN (1) CN112799386B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113341958B (en) * 2021-05-21 2022-02-25 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience
CN113128657B (en) * 2021-06-17 2021-09-14 中国科学院自动化研究所 Multi-agent behavior decision method and device, electronic equipment and storage medium
CN113778097B (en) * 2021-09-15 2023-05-19 龙岩学院 Intelligent warehouse logistics robot path planning method based on L-shaped path trend improved A-STAR algorithm
CN113534669B (en) * 2021-09-17 2021-11-30 中国人民解放军国防科技大学 Unmanned vehicle control method and device based on data driving and computer equipment
CN114055471B (en) * 2021-11-30 2022-05-10 哈尔滨工业大学 Mechanical arm online motion planning method combining neural motion planning algorithm and artificial potential field method
CN114442630B (en) * 2022-01-25 2023-12-05 浙江大学 Intelligent vehicle planning control method based on reinforcement learning and model prediction
CN114518770A (en) * 2022-03-01 2022-05-20 西安交通大学 Unmanned aerial vehicle path planning method integrating potential field and deep reinforcement learning
CN115294698B (en) * 2022-08-05 2023-06-06 东风悦享科技有限公司 Automatic tool delivery and recovery system and method based on unmanned tool vehicle
CN117093010B (en) * 2023-10-20 2024-01-19 清华大学 Underwater multi-agent path planning method, device, computer equipment and medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1883887A (en) * 2006-07-07 2006-12-27 中国科学院力学研究所 Robot obstacle-avoiding route planning method based on virtual scene
CN102799179A (en) * 2012-07-06 2012-11-28 山东大学 Mobile robot path planning algorithm based on single-chain sequential backtracking Q-learning
CN102819264A (en) * 2012-07-30 2012-12-12 山东大学 Path planning Q-learning initial method of mobile robot
WO2016045615A1 (en) * 2014-09-25 2016-03-31 科沃斯机器人有限公司 Robot static path planning method
WO2018176594A1 (en) * 2017-03-31 2018-10-04 深圳市靖洲科技有限公司 Artificial potential field path planning method for unmanned bicycle
CN108762281A (en) * 2018-06-08 2018-11-06 哈尔滨工程大学 It is a kind of that intelligent robot decision-making technique under the embedded Real-time Water of intensified learning is associated with based on memory
CN109726866A (en) * 2018-12-27 2019-05-07 浙江农林大学 Unmanned boat paths planning method based on Q learning neural network
CN109670270A (en) * 2019-01-11 2019-04-23 山东师范大学 Crowd evacuation emulation method and system based on the study of multiple agent deeply
CN110083165A (en) * 2019-05-21 2019-08-02 大连大学 A kind of robot paths planning method under complicated narrow environment
CN110345948A (en) * 2019-08-16 2019-10-18 重庆邮智机器人研究院有限公司 Dynamic obstacle avoidance method based on neural network in conjunction with Q learning algorithm

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A multi-agent path planning algorithm based on hierarchical reinforcement learning and artificial potential field;Yanbin Zheng;《2015 11th International Conference on Natural Computation (ICNC)》;20160111;第363-368页 *
Multi-Agent path planning method based on hierarchical reinforcement learning and artificial potential field; Zheng Yanbin; Journal of Computer Applications; 2015-12-31; vol. 35, no. 12; pp. 3491-3496 *
Reinforcement learning path planning algorithm incorporating potential field and trap search; Dong Peifang; Computer Engineering and Applications; 2018-08-31; vol. 54, no. 16; pp. 129-134 *

Also Published As

Publication number Publication date
CN112799386A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112799386B (en) Robot path planning method based on artificial potential field and reinforcement learning
Mohanan et al. A survey of robotic motion planning in dynamic environments
CN113485380B (en) AGV path planning method and system based on reinforcement learning
Liu et al. Mapper: Multi-agent path planning with evolutionary reinforcement learning in mixed dynamic environments
CN102819264B (en) Path planning Q-learning initial method of mobile robot
CN113110509B (en) Warehousing system multi-robot path planning method based on deep reinforcement learning
Xia et al. Neural inverse reinforcement learning in autonomous navigation
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
CN111780777A (en) Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
US11561544B2 (en) Indoor monocular navigation method based on cross-sensor transfer learning and system thereof
Zhao et al. The experience-memory Q-learning algorithm for robot path planning in unknown environment
CN110442129B (en) Control method and system for multi-agent formation
CN112362066A (en) Path planning method based on improved deep reinforcement learning
CN114003059B (en) UAV path planning method based on deep reinforcement learning under kinematic constraint condition
CN112344945B (en) Indoor distribution robot path planning method and system and indoor distribution robot
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
Ma et al. State-chain sequential feedback reinforcement learning for path planning of autonomous mobile robots
CN114020013B (en) Unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning
Mokhtari et al. Safe deep q-network for autonomous vehicles at unsignalized intersection
CN112799385A (en) Intelligent agent path planning method based on artificial potential field of guide domain
Liang et al. Multi-UAV autonomous collision avoidance based on PPO-GIC algorithm with CNN–LSTM fusion network
Chen et al. Deep reinforcement learning-based robot exploration for constructing map of unknown environment
CN116551703B (en) Motion planning method based on machine learning in complex environment
Zhang et al. Visual navigation of mobile robots in complex environments based on distributed deep reinforcement learning
CN113959446A (en) Robot autonomous logistics transportation navigation method based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant