Intelligent vehicle path planning method based on artificial potential field method and reinforcement learning
Technical Field
The application relates to the technical field of automatic driving, and in particular to an intelligent vehicle path planning method based on an artificial potential field method and reinforcement learning.
Background
Path planning is one of the key technologies of automatic driving and a necessary condition for the safe operation of an autonomous vehicle. The artificial potential field (APF) method is a relatively mature, real-time planning method in the path planning field. Its working principle is to abstract the environment information into an attractive potential field function and a repulsive potential field function, and to plan a collision-free path from the starting point to the target point through the combined force field function. The traditional artificial potential field method cannot guarantee the efficiency of path planning because of two major defects: the vehicle may fall into a local minimum, and under certain conditions the target may be unreachable.
Aiming at the problems of the traditional artificial potential field method in intelligent vehicle path planning, existing work establishes a road boundary repulsive potential field to limit the driving area of the vehicle, improves the repulsive potential field function of the obstacle to solve the problems of local optima and unreachable targets, and establishes a velocity repulsive potential field to describe the obstacle information more accurately, thereby addressing collisions between the vehicle and obstacles. Bei Shaoyi et al. invented an unmanned vehicle local path planning method based on an improved artificial potential field method (CN 202111673751.6): a distance threshold is set in the attractive force function and a repulsive field adjusting factor is added to the repulsive force function; when the unmanned vehicle falls into a local optimum, a strategy of jumping out of the local optimum based on a smaller steering angle is adopted on the premise of satisfying the steering constraint of the vehicle, but this increases the computation and complexity of the algorithm. Another invention discloses an intelligent vehicle path planning method based on an improved artificial potential field algorithm (CN 202011307835.3), which automatically adjusts the step length and moves the trajectory out of the influence area according to the size of the influence area of the local minimum point, so that the vehicle automatically escapes when it falls into a local minimum; however, the computation of the overall trajectory planning is excessive, and the planned path is not smooth enough.
The reinforcement learning algorithm is a learning method that does not require a priori knowledge. It emphasizes interactive learning with the environment: the agent accumulates rewards by continually attempting actions and receiving environmental feedback, then selects the best action to complete the path planning task. The most widely used reinforcement learning algorithm in the path planning field is the Q-learning algorithm, a value-based reinforcement learning algorithm that attempts an action in a particular state and updates the Q value based on the immediate reward or penalty received and its estimate of the value of the resulting state. By repeatedly trying all actions in all states, the algorithm can then select the action that obtains the maximum benefit according to the Q value.
Disclosure of Invention
The application provides an intelligent vehicle path planning method based on an artificial potential field method and reinforcement learning, in which a reinforcement learning algorithm is added to the traditional artificial potential field method and APF weighting is used to guide Q-learning, so that problems such as local minima in the traditional potential field method are overcome, the convergence speed of classical Q-learning is accelerated, and the problem of path planning for an intelligent vehicle in a complex environment is solved.
In order to achieve the above purpose, the application adopts the following technical scheme: an intelligent vehicle path planning method based on an artificial potential field method and reinforcement learning comprises the following steps:
step S1: acquiring an environment image and establishing a grid map;
step S2, defining a reinforcement learning state space and an action space;
Step S3: initializing algorithm parameters;
step S4: randomly selecting an initial state from a state set;
Step S5: constructing an artificial potential field, wherein the potential field is formed by overlapping an attractive potential field and a repulsive potential field;
Step S6: in the action space, selecting actions by adopting an APF weighting or epsilon-greedy strategy;
step S7: executing the current action in the state s_t to obtain a new state s_t+1 and a reward r;
Step S8: updating the Q value;
step S9: selecting the action with the maximum Q value in each step to obtain an optimal path;
step S10: repeating steps S6, S7 and S8 until a certain number of steps or a certain convergence condition is reached;
step S11: a path from the starting point q_0 to the target point q_f is generated; the finally generated path is transmitted to the control center of the intelligent vehicle, and the intelligent vehicle drives according to this path.
Further, the specific operation of step S1 is as follows: an environment image is obtained by the intelligent vehicle camera, the image is divided into 25 × 25 grids, and an environment model is built using the grid method. If an obstacle is found in a grid, the grid is defined as an obstacle position through which the intelligent vehicle cannot pass; if the target point is found in a grid, that grid is set as the target position, which is the final position the intelligent vehicle is to reach; all other grids are defined as obstacle-free grids through which the intelligent vehicle can pass.
Further, the specific operation of step S2 is as follows: the state space of reinforcement learning is defined as the current position coordinate and the previous position coordinate of the agent; the action space consists of actions in the four directions up, down, left and right, and the agent moves one grid in the corresponding direction after each action is executed.
Further, the algorithm parameters in step S3 include a learning rate α ∈ (0, 1), a discount factor γ ∈ (0, 1), a greedy factor ζ ∈ (0, 1), a maximum number of iterations, and a reward function r; all Q values are initialized to 0.
Further, the specific operation of step S5 is as follows: constructing the attractive potential field of the target point according to the positions of the obstacle and the target point:

U_att(q) = (1/2) k_att ‖q − q_f‖²

wherein U_att(q) is the attractive field generated by the target point at the position q, k_att is the attraction coefficient of the target point, q is the position coordinate, and q_f is the coordinate of the target point.
Constructing the repulsive potential field of the obstacle:

U_rep(q) = (1/2) k_rep (1/ρ_0 − 1/ρ)², if ρ_0 ≤ ρ;  U_rep(q) = 0, if ρ_0 > ρ

wherein U_rep(q) is the repulsive potential field generated by the obstacle, k_rep is the repulsion coefficient of the obstacle, ρ represents the limit distance of influence of the repulsive potential field, and ρ_0 represents the shortest distance from the current position to the obstacle.
The total artificial potential field comprises two terms, an attraction potential function and a repulsion potential function. The total artificial potential field U (q) is then the sum of these two potential functions:
U(q) = U_att(q) + U_rep(q)
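As an illustration, the two potential functions and their superposition can be sketched in Python as follows (a minimal sketch; the coefficients k_att and k_rep, the influence distance rho, and two-dimensional tuple coordinates are illustrative assumptions, not values fixed by the method):

```python
import math

def attractive_potential(q, q_f, k_att=1.0):
    # U_att(q) = (1/2) * k_att * ||q - q_f||^2
    return 0.5 * k_att * ((q[0] - q_f[0]) ** 2 + (q[1] - q_f[1]) ** 2)

def repulsive_potential(q, q_obs, k_rep=1.0, rho=3.0):
    # rho   : limit distance of influence of the repulsive field
    # rho_0 : shortest distance from the current position to the obstacle
    rho_0 = math.hypot(q[0] - q_obs[0], q[1] - q_obs[1])
    if rho_0 > rho:
        return 0.0  # outside the influence range of the obstacle
    return 0.5 * k_rep * (1.0 / max(rho_0, 1e-6) - 1.0 / rho) ** 2

def total_potential(q, q_f, obstacles, **kw):
    # U(q) = U_att(q) + sum of U_rep(q) over all obstacles
    return attractive_potential(q, q_f) + sum(
        repulsive_potential(q, ob, **kw) for ob in obstacles)
```

A cell far from every obstacle is thus scored by the attractive term alone, while cells inside an obstacle's influence range accumulate an additional repulsive term.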
Further, the specific operation of step S6 is as follows: a uniform random number is generated. If the uniform random number is greater than the decision rate ζ ∈ (0, 1), the APF weighting process is performed; otherwise, the exploration-exploitation method is performed.
For APF weighting, a Moore neighborhood is employed, which is defined on a two-dimensional square lattice and consists of a central cell (here, the current intelligent vehicle state s_t) and the eight cells surrounding it.
The probability assigned to each neighbor cell of s_t is inversely related to its total artificial potential field: the neighbor cell with the lowest total artificial potential field has the highest probability of being assigned to s_t+1, while the neighbor cell with the highest total artificial potential field has the lowest probability.
The probability p_i (i = 1, ..., k) of selecting each neighbor cell is then calculated, where k is the number of neighboring cells.
A standard (unit) APF weighting function σ: R^k → (0, 1)^k is then computed over p = (p_1, ..., p_k) ∈ R^k:

σ(p)_i = e^(p_i) / Σ_{j=1}^{k} e^(p_j), i = 1, ..., k
The cumulative probabilities obtained from the APF weighting function are used to select action a_t in state s_t. First, the probabilities contained in p are sorted. Then, a random number between zero and one is generated. Starting from the top of the list, the first neighbor cell whose cumulative probability is greater than the random number is selected as state s_t+1.
If the APF weighting process is not performed, the exploration-exploitation method is performed: a uniform random number is generated; if it is greater than the decision rate ζ, an exploitation step is performed; otherwise, an exploration step is performed.
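The APF-weighted selection described above can be sketched as follows (a minimal sketch; it assumes a softmax-style normalization of the negated neighbor potentials as the weighting function σ, with roulette-wheel selection on the cumulative probabilities):

```python
import math
import random

def apf_weighted_choice(potentials, rand=None):
    """Pick a neighbor-cell index: lower total potential -> higher probability."""
    # sigma: normalize the negated potentials into probabilities in (0, 1)
    scores = [-u for u in potentials]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    sigma = [e / total for e in exps]
    # roulette-wheel selection: first cell whose cumulative probability
    # exceeds a uniform random number in [0, 1)
    r = random.random() if rand is None else rand
    cumulative = 0.0
    for i, p in enumerate(sigma):
        cumulative += p
        if cumulative > r:
            return i
    return len(sigma) - 1
```

Cells with a low total artificial potential field dominate the cumulative distribution, so the intelligent vehicle is steered toward the target and away from obstacles while retaining randomness.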
Further, the specific operation of step S7 is as follows: once action a_t is selected by the APF weighting, exploitation, or exploration process, action a_t is executed and the reward r is received, reaching a new state s_t+1.
Further, the specific operation of updating the Q value in step S8 is as follows: according to step S6, an action a is selected based on the state s, the state s′ is reached, and the immediate reward r is obtained; the update of the Q value can be expressed as:

Q(s, a) ← Q(s, a) + α(r + γ max_{a′} Q(s′, a′) − Q(s, a))
wherein the learning rate α is empirically set to 0.3, the discount factor γ is set to 0.8, the decision rate ζ is set to 0.2, and the reward function is set as follows:
Further, the specific operation of step S10 is as follows:
The new state s_t+1 is assigned to the current state s_t to continue the iterative process. If s_t+1 is the end state, the current round of iteration is completed; otherwise, go to step S6 until the target point is reached.
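For illustration, the Q-value update of steps S7-S8 can be sketched as follows (a minimal sketch using a dictionary-based Q table; the state and action encodings are illustrative assumptions):

```python
def q_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.8):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]
```

With the parameters above (α = 0.3, γ = 0.8), a transition with reward 1 into a state whose best Q value is 1 moves Q(s, a) from 0 to 0.3 × (1 + 0.8 × 1) = 0.54.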
Further, the method for generating the path in the step S11 is as follows:
Initializing the index i of the target point to 0, assigning the starting point q_0 to the current state s_t, and then performing the iterative process of path generation. While the next state has not reached the target point, the index i is increased by 1. Environment verification is further performed; if the environment changes, the environment information Q_G is updated. The best action a_t in the current state s_t is selected using the learning value Q_{m×n}. Once action a_t has been selected, it is executed. Subsequently, the new state s_t+1 is calculated using the equation s_t+1 = f(s_t, a_t). Finally, the new state s_t+1 is assigned to the current state s_t, and the current state is assigned to the target point q_i. If the current state s_t is finally equal to the target point q_f, the algorithm returns the collision-free path between the starting point q_0 and the target point q_f.
The application provides an intelligent vehicle path planning method based on an artificial potential field method and reinforcement learning. Aiming at the "minimum trap" problem of the traditional artificial potential field method, the prior environment is virtualized into a total artificial potential field through APF weighting, and a suitable learning value is calculated. The algorithm successfully solves the path planning problem in both known and unknown environments. By combining the Q-learning method with the APF method, defects of the classical Q-learning method such as slow learning speed and long time consumption are overcome, and the phenomenon of the traditional artificial potential field method falling into local optima is effectively avoided. Simulation results show that the APF reinforcement learning algorithm improves the learning speed, path length, and path smoothness of path planning, and can avoid moving obstacles and smoothly reach the destination.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
The disclosure may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is an attractive potential field diagram of the artificial potential field;
FIG. 3 is a repulsive potential field diagram of an artificial potential field;
FIG. 4 is a total potential field diagram of an artificial potential field;
FIG. 5 is an overall frame diagram of the proposed new algorithm of the present invention;
FIG. 6 is a graph of a path planned by the improved algorithm of the present invention;
FIG. 7 is a step chart of each iteration of the algorithm.
Detailed Description
The following description of the embodiments of the present application is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, but not all, embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort are intended to be within the scope of the application.
Example 1
Referring to fig. 1, the invention relates to an intelligent vehicle path planning method based on an artificial potential field method and reinforcement learning, which comprises the following specific implementation steps:
step S1: acquiring an environment image and establishing a grid map;
step S2, defining a reinforcement learning state space and an action space;
Step S3: initializing algorithm parameters;
step S4: randomly selecting an initial state from a state set;
Step S5: constructing an artificial potential field, wherein the potential field is formed by overlapping an attractive potential field and a repulsive potential field;
Step S6: in the action space, selecting actions by adopting an APF weighting or epsilon-greedy strategy;
step S7: executing the current action in the state s to obtain a new state st+1 and a reward r;
Step S8: updating the Q value;
step S9: selecting the action with the maximum Q value in each step to obtain an optimal path;
step S10: repeating steps S6, S7 and S8 until a certain number of steps or a certain convergence condition is reached;
step S11: a path from the starting point q_0 to the target point q_f is generated; the finally generated path is transmitted to the control center of the intelligent vehicle, and the intelligent vehicle drives according to this path.
Wherein, step S1 is specifically as follows: an environment image is obtained by the intelligent vehicle camera, the image is divided into 25 × 25 grids, and an environment model is built using the grid method. If an obstacle is found in a grid, the grid is defined as an obstacle position through which the intelligent vehicle cannot pass; if the target point is found in a grid, that grid is set as the target position, which is the final position the intelligent vehicle is to reach; all other grids are defined as obstacle-free grids through which the intelligent vehicle can pass.
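As an illustration, the 25 × 25 grid model of step S1 can be sketched as follows (a minimal sketch; the 0/1/2 cell encoding and the obstacle/target inputs are illustrative assumptions):

```python
FREE, OBSTACLE, TARGET = 0, 1, 2  # illustrative cell encoding

def build_grid_map(obstacles, target, size=25):
    """Build a size x size grid: obstacle cells are impassable, the target
    cell is the final position the intelligent vehicle is to reach."""
    grid = [[FREE] * size for _ in range(size)]
    for row, col in obstacles:
        grid[row][col] = OBSTACLE
    t_row, t_col = target
    grid[t_row][t_col] = TARGET
    return grid
```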
Step S2 defines the state space of reinforcement learning as the current position coordinate and the previous position coordinate of the agent, and the action space as actions in the four directions up, down, left and right; the agent moves one grid in the corresponding direction after each action is executed.
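The four-direction action space can be sketched as follows (a minimal sketch; states are assumed to be (row, column) grid coordinates):

```python
# up, down, left, right: each action moves the agent one grid cell
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step_state(state, action):
    """Apply an action to a (row, column) grid state."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    return (row + d_row, col + d_col)
```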
The algorithm parameters in step S3 include a learning rate α ∈ (0, 1), a discount factor γ ∈ (0, 1), a greedy factor ζ ∈ (0, 1), a maximum number of iterations, and a reward function r; all Q values are initialized to 0.
The specific operation of step S5 is as follows: the main idea of the artificial potential field (APF) method is to establish an attractive potential field around the target point and a repulsive potential field around each obstacle. With this idea, the APF method employs attraction and repulsion components to draw the intelligent vehicle toward its target while keeping it away from obstacles.
Thus, the total artificial potential field comprises two terms, an attraction potential function and a repulsion potential function. The total artificial potential field U (q) is then the sum of these two potential functions:
U(q) = U_att(q) + U_rep(q)
As shown in FIG. 2, the attractive potential field of the target point is constructed according to the positions of the obstacle and the target point:

U_att(q) = (1/2) k_att ‖q − q_f‖²

wherein U_att(q) is the attractive field generated by the target point at the position q, k_att is the attraction coefficient of the target point, q is the position coordinate, and q_f is the coordinate of the target point.
As shown in FIG. 3, the repulsive potential field of the obstacle is constructed:

U_rep(q) = (1/2) k_rep (1/ρ_0 − 1/ρ)², if ρ_0 ≤ ρ;  U_rep(q) = 0, if ρ_0 > ρ

wherein U_rep(q) is the repulsive potential field generated by the obstacle, k_rep is the repulsion coefficient of the obstacle, ρ represents the limit distance of influence of the repulsive potential field, and ρ_0 represents the shortest distance from the current position to the obstacle.
As shown in FIG. 4, the total artificial potential field is formed by superimposing the attractive potential field and the repulsive potential field.
As shown in FIG. 5, the action selection of the improved algorithm includes the following steps:
A uniform random number is generated. If the uniform random number is greater than the decision rate ζ ∈ (0, 1), the APF weighting process is performed; otherwise, the exploration-exploitation method is performed.
For APF weighting, a Moore neighborhood is employed, which is defined on a two-dimensional square lattice and consists of a central cell (here, the current intelligent vehicle state s_t) and the eight cells surrounding it.
The probability assigned to each neighbor cell of s_t is inversely related to its total artificial potential field: the neighbor cell with the lowest total artificial potential field has the highest probability of being assigned to s_t+1, while the neighbor cell with the highest total artificial potential field has the lowest probability.
The probability p_i (i = 1, ..., k) of selecting each neighbor cell is then calculated, where k is the number of neighboring cells.
A standard (unit) APF weighting function σ: R^k → (0, 1)^k is then computed over p = (p_1, ..., p_k) ∈ R^k:

σ(p)_i = e^(p_i) / Σ_{j=1}^{k} e^(p_j), i = 1, ..., k
The cumulative probabilities obtained from the APF weighting function are used to select action a_t in state s_t. First, the probabilities contained in p are sorted. Then, a random number between zero and one is generated. Starting from the top of the list, the first neighbor cell whose cumulative probability is greater than the random number is selected as state s_t+1.
If the APF weighting process is not performed, the exploration-exploitation method is performed: a uniform random number is generated; if it is greater than the decision rate ζ, an exploitation step is performed; otherwise, an exploration step is performed.
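The two-level random decision described above (APF weighting versus exploration-exploitation) can be sketched as follows (a minimal sketch; the dictionary Q table and the precomputed APF neighbor probabilities are illustrative assumptions):

```python
import random

def choose_action(Q, s, actions, zeta=0.2, apf_probs=None):
    """One action-selection step: APF weighting when the uniform random
    number exceeds the decision rate zeta, otherwise epsilon-greedy."""
    if apf_probs is not None and random.random() > zeta:
        # APF weighting: roulette selection over precomputed probabilities
        return random.choices(actions, weights=apf_probs)[0]
    if random.random() > zeta:
        # exploitation: best learned action in state s
        return max(actions, key=lambda a: Q[s][a])
    # exploration: uniformly random action
    return random.choice(actions)
```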
Then, once action a_t is selected by the APF weighting, exploitation, or exploration process, action a_t is executed and the reward r is received, reaching a new state s_t+1. The Q table is updated according to the formula Q(s, a) ← Q(s, a) + α(r + γ max_{a′} Q(s′, a′) − Q(s, a)).
Wherein the learning rate α is empirically set to 0.3, the discount factor γ is set to 0.8, the decision rate ζ is set to 0.2, and the reward function is set as follows:
The new state s_t+1 is assigned to the current state s_t to continue the iterative process. If s_t+1 is the end state, the current round of iteration is completed; otherwise, go to step S6 until the target point is reached.
The method for generating the path in step S11 is as follows:
Initializing the index i of the target point to 0, assigning the starting point q_0 to the current state s_t, and then performing the iterative process of path generation. While the next state has not reached the target point, the index i is increased by 1. Environment verification is further performed; if the environment changes, the environment information Q_G is updated. The best action a_t in the current state s_t is selected using the learning value Q_{m×n}. Once action a_t has been selected, it is executed. Subsequently, the new state s_t+1 is calculated using the equation s_t+1 = f(s_t, a_t). Finally, the new state s_t+1 is assigned to the current state s_t, and the current state is assigned to the target point q_i. If the current state s_t is finally equal to the target point q_f, the algorithm returns the collision-free path between the starting point q_0 and the target point q_f.
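For illustration, the greedy path-generation loop of step S11 can be sketched as follows (a minimal sketch; the transition function step(s, a), i.e. s_t+1 = f(s_t, a_t), and the dictionary Q table are illustrative assumptions, and the environment-update check is omitted):

```python
def extract_path(Q, start, goal, step, max_len=1000):
    """Greedy rollout: in each state take the action with the largest Q value
    until the target point is reached (max_len guards against loops)."""
    path, s = [start], start
    while s != goal and len(path) < max_len:
        a = max(Q[s], key=Q[s].get)      # best learned action in state s
        s = step(s, a)                   # s_{t+1} = f(s_t, a_t)
        path.append(s)
    return path
```

On a converged Q table this rollout reproduces the collision-free path between q_0 and q_f; the returned state sequence is what would be handed to the vehicle control center.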
Using the method with the above parameter settings, the optimal driving path of the intelligent vehicle is obtained for this example, as shown in FIG. 6.
FIG. 7 shows that, through training, the number of steps the intelligent vehicle requires to find the optimal path decreases, so the efficiency becomes higher and higher.
The artificial potential field combined reinforcement learning algorithm provided by the invention not only effectively remedies the defects of the traditional artificial potential field method, but also accelerates the convergence rate of the algorithm and improves its efficiency.