CN116382299A - Path planning method, path planning device, electronic equipment and storage medium


Info

Publication number
CN116382299A
Authority
CN
China
Prior art keywords: robot, time, target, distance, function value
Prior art date
Legal status: Pending
Application number
CN202310511820.6A
Other languages
Chinese (zh)
Inventor
张国林
陆颖骅
吴腾阳
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310511820.6A
Publication of CN116382299A


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0219Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)
  • Numerical Control (AREA)
  • Feedback Control In General (AREA)

Abstract

The embodiment of the application provides a path planning method, a path planning device, an electronic device and a storage medium, relating to the technical field of artificial intelligence. The path planning method comprises the following steps: receiving a path planning request for a robot to travel from a target starting point to a target end point; simulating, according to the path planning request, the movement of the robot from the target starting point to the target end point to obtain N candidate paths, wherein, in the M-th simulation process, the movement position of the robot at time t+1 is updated according to the distance between the robot and the target end point at time t, the shortest distance between the robot and surrounding obstacles, and the strategy neural network model corresponding to the M-th simulation process; and determining, from the N candidate paths, a target path of the robot from the target starting point to the target end point. Path planning is performed for the robot by a neural network model based on a strategy gradient algorithm, the movement behavior of the robot is decomposed, and a corresponding reward function is designed for each decomposed movement behavior, so that the path planning efficiency and accuracy are effectively improved.

Description

Path planning method, path planning device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a path planning method, a path planning device, an electronic device, and a storage medium.
Background
In order to better serve clients and improve service efficiency, intelligent service robots (hereinafter referred to as robots) are introduced into many banks, and in order to avoid collision between the robots and obstacles in the moving process, path planning needs to be performed on the robots.
Path planning is the search for an optimal, or at least good, collision-free path from a starting point to an end point according to set evaluation criteria such as the shortest path or the shortest planning time.
Most traditional path planning methods construct the interaction state of the robot and obstacles with a mathematical model or a physical model and then complete the path planning task with a classical search algorithm such as a genetic algorithm; different parameters need to be set for different scenes, so the path planning efficiency is low.
Disclosure of Invention
The embodiment of the application provides a path planning method, a path planning device, electronic equipment and a storage medium, which can improve the path planning efficiency.
In a first aspect, an embodiment of the present application provides a path planning method, including:
receiving a path planning request from a target starting point to a target destination of the robot;
simulating the movement of the robot from the target starting point to the target end point according to the path planning request to obtain N candidate paths, wherein N is a positive integer, the N candidate paths are generated through N simulation processes, and each candidate path is composed of positions at a plurality of moments; in the M-th simulation process, updating the moving position of the robot at the time t+1 according to the distance between the robot and the target end point at the time t and the shortest distance between the robot and the peripheral barrier and a strategy neural network model corresponding to the M-th simulation process; t is greater than or equal to 0, and M is less than or equal to N;
And determining a target path from a target starting point to a target destination of the robot in the N candidate paths.
Optionally, the updating the movement position of the robot at time t+1 according to the distance between the robot and the target end point at time t and the shortest distance between the robot and the peripheral obstacle, and the strategy neural network model corresponding to the M-th simulation process includes:
inputting the state of the robot at time t into the strategy neural network model corresponding to the M-th simulation process, and acquiring a plurality of candidate movement positions of the robot at time t+1 and the probability of each candidate movement position; the state at time t is used for indicating the distance between the robot and the target end position at time t and the shortest distance between the robot and the peripheral obstacle;
and taking the candidate movement position with the highest probability as the movement position of the robot at time t+1.
Optionally, after updating the movement position of the robot at time t+1, the method further includes:
simulating the movement of the robot according to the movement position at the time t+1, and acquiring the state of the robot at the time t+1;
acquiring, according to the state at time t and the state at time t+1, a first reward function value corresponding to the behavior of the robot moving towards the target end point at time t+1 and a second reward function value corresponding to the obstacle avoidance behavior;
acquiring a total reward function value of the robot at time t+1 according to the first reward function value and the second reward function value;
acquiring a cumulative discounted reward of the robot at time t+1 according to the total reward function value;
and updating the strategy neural network model corresponding to the M-th simulation process according to the cumulative discounted reward at time t+1 and the probability corresponding to the movement position at time t+1, so as to obtain the strategy neural network model corresponding to the (M+1)-th simulation process.
Optionally, the acquiring a first reward function value corresponding to the movement behavior of the robot to the target end point at the time t+1 and a second reward function value corresponding to the obstacle avoidance behavior includes:
acquiring the attractive force U_1 exerted on the robot by the target end position at time t and the attractive force U_2 exerted on the robot by the target end position at time t+1;
acquiring the first reward function value according to the distance between the robot and the target end position at time t+1, U_1 and U_2;
and acquiring the second reward function value according to the shortest distance between the robot and the obstacle at time t and at time t+1 and a preset safety distance.
Optionally, the acquiring the attractive force U_1 exerted on the robot by the target end position at time t includes:
acquiring U_1 according to the distance between the robot and the target end point position at time t and an attraction gain coefficient.
Optionally, the acquiring the first reward function value according to the distance between the robot and the target end position at time t+1, U_1 and U_2 includes:
if the distance between the robot and the target end position at time t+1 is not within a preset distance interval, acquiring the first reward function value according to U_1, U_2 and a reward value adjustment coefficient;
and if the distance between the robot and the target end position at time t+1 is within the preset distance interval, setting the first reward function value to a first preset value.
Optionally, the preset safety distance includes a maximum safety distance and a minimum safety distance;
the acquiring the second reward function value according to the shortest distance between the robot and the obstacle at time t and at time t+1 and the preset safety distance includes:
if the shortest distance between the robot and the obstacle at time t+1 is greater than or equal to the maximum safety distance, setting the second reward function value to a second preset value;
if the shortest distance between the robot and the obstacle at time t+1 is greater than the minimum safety distance and less than the maximum safety distance, and the difference between the shortest distance at time t+1 and the shortest distance at time t is greater than or equal to a default value, setting the second reward function value to a third preset value;
if the shortest distance between the robot and the obstacle at time t+1 is greater than the minimum safety distance and less than the maximum safety distance, and the difference between the shortest distance at time t+1 and the shortest distance at time t is less than the default value, setting the second reward function value to a fourth preset value;
if the shortest distance between the robot and the obstacle at time t+1 is less than or equal to the minimum safety distance, setting the second reward function value to a fifth preset value; wherein the second preset value, the third preset value, the fourth preset value and the fifth preset value decrease in sequence.
Optionally, the acquiring the total reward function value at time t+1 according to the first reward function value and the second reward function value includes:
acquiring a weight of the second reward function value, wherein the weight is inversely related to the shortest distance between the robot and the obstacle;
and acquiring the total reward function value according to the weight of the second reward function value, the second reward function value and the first reward function value.
In a second aspect, an embodiment of the present application provides a path planning apparatus, including:
the receiving module is used for receiving a path planning request from a target starting point to a target destination of the robot;
the planning module is used for simulating the movement of the robot from the target starting point to the target end point according to the path planning request to obtain N candidate paths, wherein N is a positive integer, the N candidate paths are generated through N simulation processes, and each candidate path is composed of positions at a plurality of moments; in the M-th simulation process, updating the moving position of the robot at the time t+1 according to the distance between the robot and the target end point at the time t and the shortest distance between the robot and the peripheral barrier and a strategy neural network model corresponding to the M-th simulation process; t is greater than or equal to 0, and M is less than or equal to N;
And the determining module is used for determining a target path from a target starting point to a target destination of the robot in the N candidate paths.
In a third aspect, the present application provides an electronic device, comprising: a memory and a processor;
the memory is used for storing computer instructions; the processor is configured to execute the computer instructions stored in the memory to implement the method of any one of the first aspects.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program for execution by a processor to implement the method of any one of the first aspects.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the method of any one of the first aspects.
According to the path planning method, the path planning device, the electronic device and the storage medium provided by the embodiments of the application, a path planning request for the robot to travel from a target starting point to a target end point is received, and the movement of the robot from the target starting point to the target end point is simulated according to the path planning request to obtain N candidate paths, wherein the N candidate paths are generated through N simulation processes and each candidate path is composed of positions at a plurality of moments; in the M-th simulation process, the movement position of the robot at time t+1 is updated according to the distance between the robot and the target end point at time t, the shortest distance between the robot and the surrounding obstacles, and the strategy neural network model corresponding to the M-th simulation process; and a target path of the robot from the target starting point to the target end point is determined from the N candidate paths. Path planning is performed for the robot by a neural network model based on a strategy gradient algorithm, the movement behavior of the robot is decomposed, and a corresponding reward function is designed for each decomposed movement behavior, so that the path planning efficiency and accuracy are effectively improved.
Drawings
Fig. 1 is a schematic view of a scenario provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of a path planning method according to an embodiment of the present application;
fig. 3 is a second flow chart of the path planning method provided in the embodiment of the present application;
fig. 4 is a schematic structural diagram of a policy neural network model according to an embodiment of the present application;
Fig. 5 is a schematic diagram of the decomposition of movement behaviors provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a simulated path planning process provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a path planning apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In the embodiments of the present application, the words "first", "second", etc. are used to distinguish identical items or similar items having substantially the same function and action, and the order of them is not limited. It will be appreciated by those of skill in the art that the words "first," "second," and the like do not limit the amount and order of execution, and that the words "first," "second," and the like do not necessarily differ.
It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to denote examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In order to better serve clients and improve service efficiency, robots (hereinafter referred to as robots for short) are introduced into many banks, and in order to avoid collision between the robots and obstacles in the moving process, path planning is performed on the robots, wherein the path planning is to search an optimal or better collision-free path from a starting point to an end point according to set evaluation criteria such as a shortest path, a shortest planning time and the like.
Traditional path planning methods construct the interaction state of the robot and pedestrians with a mathematical model or a physical model and then complete the path planning task with a classical search algorithm such as a genetic algorithm; they have limited generalization capability for unfamiliar scenes, and their path planning efficiency is low.
With the development of machine learning, data-driven methods have become a popular research direction for robot path planning in pedestrian environments. For example, performing robot path planning through reinforcement learning greatly improves scene adaptability, but it also faces the problem that a single reward function cannot accurately describe the movement of the robot, so the path planning accuracy is low.
In view of this, the embodiments of the present application provide a path planning method, apparatus, electronic device, and storage medium, which perform path planning on a robot through a neural network model based on a policy gradient algorithm, decompose a movement behavior of the robot, and design a corresponding reward function for the decomposed movement behavior, thereby effectively improving path planning efficiency and accuracy.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be implemented independently or combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 1 is a schematic view of an application scenario in an embodiment of the present application, as shown in fig. 1, including a target start point 101, a target end point 102, a plurality of obstacles 103, a robot 104, and a server 105.
The server 105 communicates with the robot 104, receives the path planning request sent by the robot 104, and can simulate the movement of the robot from the target starting point to the target end point according to the target starting point 101, the target end point 102 and the positions of the plurality of obstacles 103, so as to obtain a plurality of candidate movement paths for the robot 104.
The server 105 may select, from the candidate movement paths of the robot 104 and according to a preset path planning condition, a path meeting the requirement as the target path for the movement of the robot 104.
The server 105 may issue the selected target path to the robot 104 so that the robot 104 can move according to the target path.
Alternatively, the server 105 may be a local server, or may be a server deployed in the cloud. The server 105 may also be a data analysis platform with computing capabilities, and the embodiments of the present application do not limit the type of server 105.
The application scenario of the present application is briefly described above, and a path planning method provided by an embodiment of the present application is described below by taking a server applied in fig. 1 as an example.
Fig. 2 is a flow chart of a path planning method according to an embodiment of the present application, as shown in fig. 2, including the following steps:
s201, receiving a path planning request from a target starting point to a target destination of the robot.
In this embodiment, the path planning request may include a target start point, a target end point, and position coordinates of different target obstacles in the environment where the robot is located.
For example, when the robot needs to move from the target starting point to the target ending point, the position coordinates of different target obstacles in the environment can be acquired through the sensor of the robot. The position coordinates of the target start point and the target end point may be input externally to the robot, for example, the user transmits the position coordinates of the target start point and the target end point to the robot through the electronic device.
Alternatively, in one possible implementation, the position coordinates of the target start point, the target end point, and the different target obstacles in the environment of the robot may be externally input to the robot.
The server may obtain a path planning request sent by the robot through interaction with the robot.
S202, simulating the movement of the robot from the target starting point to the target ending point according to the path planning request to obtain N candidate paths.
The N candidate paths are generated through N simulation processes, and each candidate path is composed of positions at a plurality of moments; in the M-th simulation process, updating the moving position of the robot at the time t+1 according to the distance between the robot and the target end point at the time t and the shortest distance between the robot and the peripheral barrier and a strategy neural network model corresponding to the M-th simulation process; and t is greater than or equal to 0, and M is less than or equal to N.
The strategy neural network model may be, for example, a neural network model based on a strategy gradient algorithm. The strategy gradient algorithm is a typical learning algorithm based on strategy iteration: a strategy neural network is constructed whose input is the state observed by the agent and whose output is a probability distribution over actions; an action is selected from the action set and applied to the environment; the environment updates its state according to the selected action and feeds back a reward value. The strategy gradient algorithm then calculates the loss function and its gradient from the reward value and updates the parameters of the strategy network by gradient descent or gradient ascent.
In an exemplary process of path planning by a simulation, taking the moment t as an example, the server can input the distance between the robot and the target destination at the moment t and the shortest distance between the robot and the peripheral obstacle into the strategy neural network model to obtain the movement position of the robot at the moment t+1 output by the strategy neural network model. The distance between the robot and the target destination at the time t may be a euclidean distance between the robot and the target destination at the time t, and the shortest distance between the robot and the peripheral obstacle may be a minimum distance among euclidean distances between the robot and the plurality of obstacles.
The server can simulate the movement of the robot from the position at time t to the position at time t+1 according to the movement position of the robot at time t+1. If the position of the robot at time t+1 is not the target end position, the parameters of the strategy neural network model can be updated according to the loss function of the strategy neural network model, and the simulation then proceeds to the next time step. One simulation of path planning finishes when the simulated robot reaches the target end point, and a candidate path is output.
The server may repeat the simulation process N times, outputting N candidate paths.
S203, determining a target path from a target starting point to a target destination of the robot in the N candidate paths.
In this embodiment of the present application, the server may determine, according to a preset screening condition, a target path from a target start point to a target end point of the robot from the N candidate paths.
For example, the preset screening conditions may be the shortest path, the shortest planning time, etc., and the embodiment of the present application does not limit the preset screening conditions.
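As an illustrative, non-limiting sketch of this selection step, the following Python snippet picks the candidate path with the shortest total length; the function names and the use of path length as the screening condition are assumptions made for illustration, not details fixed by the embodiment.

```python
import math

# Hypothetical sketch: choose the target path from the N candidate paths using
# the "shortest path" screening condition. A path is a list of (x, y) positions.

def path_length(path):
    """Total Euclidean length of a candidate path."""
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))

def select_target_path(candidate_paths):
    """Return the candidate path with the smallest total length."""
    return min(candidate_paths, key=path_length)
```

Other screening conditions, such as the shortest planning time, could be substituted by changing the key function.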
According to the path planning method provided by the embodiment of the application, a path planning request for the robot to travel from a target starting point to a target end point is received, and the movement of the robot from the target starting point to the target end point is simulated according to the path planning request to obtain N candidate paths, wherein the N candidate paths are generated through N simulation processes and each candidate path is composed of positions at a plurality of moments; in the M-th simulation process, the movement position of the robot at time t+1 is updated according to the distance between the robot and the target end point at time t, the shortest distance between the robot and the surrounding obstacles, and the strategy neural network model corresponding to the M-th simulation process; and a target path of the robot from the target starting point to the target end point is determined from the N candidate paths. Path planning is performed for the robot by a neural network model based on a strategy gradient algorithm, the movement behavior of the robot is decomposed, and a corresponding reward function is designed for each decomposed movement behavior, so that the path planning efficiency and accuracy are effectively improved.
Fig. 3 is a second flow chart of the path planning method provided in the embodiment of the present application, and further illustrates, based on the embodiment shown in fig. 2, the path planning method in a primary simulation process, as shown in fig. 3, including:
s301, inputting the state of the robot at the time t into a strategy neural network model corresponding to the M-th simulation process, and obtaining candidate movement positions of a plurality of robots at the time t+1 and the probability of each candidate movement position.
In the embodiment of the application, the state at the time t is used for indicating the distance between the robot and the target end position at the time t and the shortest distance between the robot and the peripheral obstacle. That is, the state at time t may be used to characterize the position of the robot at time t.
Illustratively, as shown in fig. 4, the policy neural network model provided in the embodiment of the present application includes an input layer, an implicit layer, and an output layer. The state of the robot at the time t is input into the strategy neural network model, and a plurality of candidate mobile positions at the time t+1 and the probability corresponding to each candidate mobile position output by the strategy neural network model can be obtained. Wherein each candidate movement position corresponds to an action to be performed by the robot. When the robot executes actions to be executed corresponding to each candidate movement position, the robot can travel to the candidate movement position at the time t+1.
In the embodiment of the application, the actions that the robot may execute at the next moment are divided according to direction.
Illustratively, the set of actions a_i to be performed may be:
A = {a_i, i = 1, 2, 3, 4, 5, 6, 7, 8} = {a_up, a_down, a_left, a_right, a_ur, a_dr, a_ul, a_dl}
wherein a_up, a_down, a_left, a_right, a_ur, a_dr, a_ul and a_dl respectively denote the robot moving in the eight directions up, down, left, right, upper-right, lower-right, upper-left and lower-left.
It can be understood that the probability of each candidate movement position is equivalent to the probability that the corresponding action is the action to be performed.
In one possible implementation, each action a_i (i = 1, 2, ..., 8) corresponds to a probability P_i, where
P_1 + P_2 + P_3 + P_4 + P_5 + P_6 + P_7 + P_8 = 1.
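As a concrete but hypothetical illustration of such a strategy neural network, the following Python sketch maps the two-dimensional state (distance to the target end point and shortest distance to the surrounding obstacles) to a probability for each of the eight movement actions; the hidden-layer size and the use of PyTorch are assumptions made for the example and are not specified by the embodiment.

```python
import torch

# Hypothetical sketch of the strategy neural network: input layer (2-D state),
# one hidden layer, output layer producing probabilities P_1..P_8 over the
# eight movement actions a_1..a_8. Layer sizes are illustrative assumptions.

ACTIONS = ["up", "down", "left", "right",
           "upper_right", "lower_right", "upper_left", "lower_left"]

policy_net = torch.nn.Sequential(
    torch.nn.Linear(2, 16),             # input layer -> hidden layer
    torch.nn.Tanh(),
    torch.nn.Linear(16, len(ACTIONS)),  # hidden layer -> output layer (8 logits)
)

def action_probabilities(state_t):
    """State [d_goal(t), d_o(t)] -> probabilities P_1..P_8 (summing to 1)."""
    logits = policy_net(torch.as_tensor(state_t, dtype=torch.float32))
    return torch.softmax(logits, dim=-1)

probs = action_probabilities([5.0, 1.2])          # example state values
best_action = ACTIONS[int(torch.argmax(probs))]   # highest-probability move
```

The last two lines already illustrate step S302 below: the candidate movement position with the highest probability is taken as the movement position at time t+1.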
S302, taking the candidate movement position with the highest probability of the candidate movement position as the movement position of the robot at the time t+1.
S303, the simulation robot moves according to the moving position at the time t+1, and the state of the robot at the time t+1 is acquired.
In this embodiment of the present application, when the server obtains the probabilities of the candidate mobile positions, the candidate position corresponding to the maximum probability among the probabilities of the candidate mobile positions may be used as the mobile position of the robot at time t+1.
The server may simulate the movement of the robot from its position at time t to its position at time t+1 according to the movement position of the robot at time t+1, acquire the Euclidean distance between the position of the robot at time t+1 and the target end point according to the position coordinates at time t+1 and the coordinates of the target end point, and acquire the shortest Euclidean distance between the position of the robot at time t+1 and the obstacles according to the coordinates of the robot and of each obstacle. The state of the robot at time t+1 is obtained from the Euclidean distance to the target end point and the shortest Euclidean distance to the obstacles.
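A minimal sketch of this state computation is given below, assuming two-dimensional coordinates; the function and variable names are illustrative only.

```python
import numpy as np

def state_of(robot_pos, goal_pos, obstacle_positions):
    """Return the state [Euclidean distance to the target end point,
    shortest Euclidean distance to any obstacle] for the given robot position."""
    robot = np.asarray(robot_pos, dtype=float)
    d_goal = np.linalg.norm(np.asarray(goal_pos, dtype=float) - robot)
    d_obstacle = min(np.linalg.norm(np.asarray(o, dtype=float) - robot)
                     for o in obstacle_positions)
    return np.array([d_goal, d_obstacle])

# Example: state at time t+1 after the simulated move.
state_t1 = state_of(robot_pos=(2.0, 3.0), goal_pos=(8.0, 9.0),
                    obstacle_positions=[(4.0, 3.5), (6.0, 7.0)])
```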
S304, according to the state at time t and the state at time t+1, acquiring a first reward function value corresponding to the behavior of the robot moving towards the target end point at time t+1 and a second reward function value corresponding to the obstacle avoidance behavior.
In the embodiment of the present application, in order to accurately describe the movement behavior of the robot, the movement behavior of the robot from the time t to the time t+1 is decomposed, as shown in fig. 5, in the embodiment of the present application, the movement behavior of the robot is the obstacle avoidance behavior and the movement behavior towards the target destination, the priorities of different behaviors of the mobile robot are different, the obstacle avoidance behavior has the highest priority, and the guiding target behavior has a lower priority. When no obstacle is found in the path planning process, the guiding target behavior is mainly performed. Otherwise, the obstacle avoidance behavior is preferentially executed, and then the guiding target behavior is executed.
Corresponding reward functions are respectively designed for the obstacle avoidance behavior and the target destination movement behavior, and a first reward function value corresponding to the target destination movement behavior of the robot at the time t+1 and a second reward function value corresponding to the obstacle avoidance behavior can be determined according to the distance between the robot and the acquired target destination position and the shortest distance between the robot and an obstacle.
Illustratively, the attractive force U_1 exerted on the robot by the target end position at time t and the attractive force U_2 exerted on the robot by the target end position at time t+1 are obtained; the first reward function value is acquired according to the distance between the robot and the target end position at time t+1, U_1 and U_2; and the second reward function value is acquired according to the shortest distance between the robot and the obstacle at time t and at time t+1 and a preset safety distance.
In the embodiment of the application, the movement of the robot is approximated as movement in a virtual force field: the robot is subject to an attractive force from the target end point and moves towards the end point under this attraction. Thus, the first reward function value may be determined from the attraction exerted on the robot by the target end point at time t and at time t+1.
During the movement of the robot, obstacle avoidance is the behavior that is preferentially executed; thus, the second reward function value may be determined according to the distance to each obstacle.
The attractive force U_1 exerted on the robot by the target end position at time t and the attractive force U_2 exerted on the robot by the target end position at time t+1 can be obtained as follows.
Illustratively, U_1 is obtained according to the distance between the robot and the target end point position at time t and an attraction gain coefficient.
The attractive force U_1 exerted on the robot by the target end position at time t may satisfy the following formula:
U_1 = k · s_goal
wherein k is the attraction gain coefficient and s_goal is the Euclidean distance between the mobile robot and the target end point; k may be determined from empirical data.
It will be appreciated that the attractive force U_2 exerted on the robot by the target end position at time t+1 is obtained in the same way as U_1 at time t, and the description is not repeated here.
Having obtained the attractive force U_1 at time t and the attractive force U_2 at time t+1, the server can determine, according to U_1 and U_2, the first reward function value for the robot moving from its position at time t to its position at time t+1.
Illustratively, if the distance between the robot and the target end position at time t+1 is not within a preset distance interval, the first reward function value is acquired according to U_1, U_2 and a reward value adjustment coefficient; and if the distance between the robot and the target end position at time t+1 is within the preset distance interval, the first reward function value is set to a first preset value.
For example, taking the case where the preset distance interval contains only the point at which the Euclidean distance between the robot and the target end point at time t+1 is zero, the first reward function may satisfy a piecewise formula (given as an image in the original publication) defined in terms of β, U_1, U_2 and d_g(t+1),
wherein r_g is the first reward function, β is the reward adjustment parameter, and d_g(t+1) is the Euclidean distance between the robot and the target end point at time t+1; β may be determined from empirical data.
It can be appreciated that the preset distance interval may be set according to actual requirements, which is not limited in the embodiment of the present application. When the distance between the robot and the target end point position at time t+1 is within the preset distance interval, the robot can be considered to have reached the target end point.
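Because the exact expression of r_g appears only as an image in the original, the following Python sketch is an assumption consistent with the surrounding text: the attraction follows U = k · s_goal, the non-goal branch uses β · (U_1 − U_2), and the reward of 10 when the goal is reached stands in for the unspecified first preset value.

```python
def attraction(d_goal, k=1.0):
    """U = k * s_goal, with attraction gain coefficient k chosen empirically."""
    return k * d_goal

def first_reward(d_goal_t, d_goal_t1, k=1.0, beta=0.5, goal_reward=10.0):
    """Hedged sketch of r_g; beta*(U_1 - U_2) and goal_reward are assumptions."""
    u1 = attraction(d_goal_t, k)     # attraction at time t
    u2 = attraction(d_goal_t1, k)    # attraction at time t+1
    if d_goal_t1 == 0.0:             # preset distance interval: goal reached
        return goal_reward           # stands in for the first preset value
    return beta * (u1 - u2)          # positive when the robot moved closer
```

Under this sketch the reward is positive whenever the move reduces the distance to the target end point, which matches the guiding-target behavior described above.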
Illustratively, the second reward function value may be obtained as follows:
if the shortest distance between the robot and the obstacle at time t+1 is greater than or equal to the maximum safe distance, the second reward function value is set to a second preset value;
if the shortest distance between the robot and the obstacle at time t+1 is greater than the minimum safe distance and less than the maximum safe distance, and the difference between the shortest distance at time t+1 and the shortest distance at time t is greater than or equal to a default value, the second reward function value is set to a third preset value;
if the shortest distance between the robot and the obstacle at time t+1 is greater than the minimum safe distance and less than the maximum safe distance, and the difference between the shortest distance at time t+1 and the shortest distance at time t is less than the default value, the second reward function value is set to a fourth preset value;
if the shortest distance between the robot and the obstacle at time t+1 is less than or equal to the minimum safe distance, the second reward function value is set to a fifth preset value; the second preset value, the third preset value, the fourth preset value and the fifth preset value decrease in sequence.
Illustratively, the second reward function may satisfy a piecewise formula (given as an image in the original publication) defined in terms of d_max, d_min, d_o(t) and d_o(t+1),
wherein d_max is the maximum safe distance; d_min is the minimum safe distance; d_o is the shortest Euclidean distance between the mobile robot and the obstacles; d_o(t+1) is the shortest Euclidean distance between the mobile robot and the obstacle points at time t+1; and d_o(t) is the shortest Euclidean distance between the mobile robot and the obstacle points at the previous time. The sign of d_o(t+1) − d_o(t) indicates whether the mobile robot is approaching or moving away from the obstacle at time t: if d_o(t+1) − d_o(t) ≥ 0, the robot moves away from the obstacle in going from its position at time t to its position at time t+1.
If d_o(t+1) ≤ d_min, the robot collides with the obstacle at time t+1, i.e. the mobile robot fails to avoid the obstacle. To discourage collisions, a larger penalty is given to the collision behavior, and the second reward function value may be set to −10.
If d_o(t+1) ≥ d_max, there is no obstacle around the mobile robot at this time, and only the guiding-target behavior needs to be performed.
If d_min < d_o(t+1) < d_max, an obstacle exists near the mobile robot at this time, and the obstacle avoidance behavior needs to be performed. If an action is performed such that the mobile robot keeps approaching the obstacle, i.e. d_o(t+1) − d_o(t) < 0, the second reward function value may be set to a larger penalty value of −5; otherwise, a smaller reward value of 1 is given.
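The piecewise logic above can be sketched as follows; the exact formula was an image in the original, so the value 0 used when no obstacle is nearby (standing in for the second preset value) and the safety-distance thresholds are assumptions, while −10, −5 and 1 come from the text.

```python
def second_reward(d_o_t, d_o_t1, d_min=0.5, d_max=3.0):
    """Hedged sketch of r_o; d_min, d_max and the 0.0 branch are assumptions."""
    if d_o_t1 >= d_max:          # no obstacle nearby: obstacle avoidance inactive
        return 0.0               # stands in for the second preset value
    if d_o_t1 <= d_min:          # collision: obstacle avoidance failed
        return -10.0             # fifth preset value
    if d_o_t1 - d_o_t >= 0:      # moving away from the obstacle
        return 1.0               # third preset value
    return -5.0                  # fourth preset value: still approaching
```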
S305, acquiring a total reward function value of the robot at time t+1 according to the first reward function value and the second reward function value.
In this embodiment, the total reward function value of the robot at time t+1 may be a weighted sum of the first reward function value and the second reward function value.
Illustratively, a weight of the second reward function value is acquired, wherein the weight is inversely related to the shortest distance between the robot and the obstacle; and the total reward function value is obtained according to the weight of the second reward function value, the second reward function value and the first reward function value.
Illustratively, the total reward function value may satisfy the following formula:
r_{t+1} = ξ · r_o + (1 − ξ) · r_g
wherein ξ ∈ [0, 1] is the weight of the second reward function r_o. The farther the robot is from the nearest obstacle, the closer the value of ξ is to 0.
In one possible implementation, the value of ξ may also be set based on the maximum safe distance: when d_o(t+1) < d_max, the robot preferentially performs obstacle avoidance and the second reward function carries a higher weight, so ξ may be set to 0.7; when d_o(t+1) ≥ d_max, there is no obstacle around the robot, the obstacle avoidance behavior carries no weight, and ξ may be set to 0.
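A short sketch of this weighted combination, using the d_max-based choice of ξ just described (0.7 near an obstacle, 0 otherwise); the threshold value itself is an assumption carried over from the earlier sketch.

```python
def total_reward(r_g, r_o, d_o_t1, d_max=3.0):
    """r_{t+1} = xi * r_o + (1 - xi) * r_g, with xi chosen from d_o(t+1)."""
    xi = 0.7 if d_o_t1 < d_max else 0.0   # weight of the second reward function
    return xi * r_o + (1.0 - xi) * r_g
```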
S306, acquiring a cumulative discounted reward of the robot at time t+1 according to the total reward function value.
In the embodiment of the application, when the movement of the robot is simulated, actions that brought more reward in historical interactions are favored, although better decisions may be hidden in the actions that were not selected. Setting a reward discount reduces the problem of the robot falling into a local optimum when selecting actions, and the cumulative discounted reward indicates the total reward accumulated along the positions the robot has moved through up to the current moment.
Illustratively, the cumulative discounted reward at time t+1 may satisfy the following formula:
v_{t+1} = γ · r_t + r_{t+1}
wherein r_t is the total reward function value at time t, r_{t+1} is the total reward function value at time t+1, and γ is the discount factor, which can be determined from empirical values.
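For completeness, the discounting step is a one-liner; the discount factor of 0.9 is an illustrative assumption.

```python
def cumulative_discounted_reward(r_t, r_t1, gamma=0.9):
    """v_{t+1} = gamma * r_t + r_{t+1}."""
    return gamma * r_t + r_t1
```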
S307, updating the strategy neural network model corresponding to the M-th simulation process according to the cumulative discounted reward at time t+1 and the probability corresponding to the movement position at time t+1, so as to obtain the strategy neural network model corresponding to the (M+1)-th simulation process.
In the embodiment of the application, having determined the cumulative discounted reward at time t+1, if the position at time t+1 is not the target end position that the robot needs to reach, the server needs to update the strategy neural network model and then enter the next iteration of the loop.
For example, the loss function value of the strategy neural network model may be calculated according to the cumulative discounted reward at time t+1 and the probability corresponding to the movement position at time t+1, and the parameter θ of the strategy neural network model may be updated according to the loss function value in a gradient descent manner.
By way of example, the loss function may satisfy a formula (given as an image in the original publication) defined in terms of the learning rate α and the gradient of the strategy with respect to the parameter θ at time t+1, wherein π_θ(s_{t+1}, a_{t+1}) denotes the probability, under the network parameter θ, of taking action a_{t+1} in state s_{t+1}.
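Since the update formula itself appears only as an image, the following sketch assumes a standard REINFORCE-style loss, L(θ) = −v_{t+1} · log π_θ(s_{t+1}, a_{t+1}), minimised by gradient descent with learning rate α; it reuses policy_net from the earlier network sketch, and the learning rate value is an assumption.

```python
import torch

optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.01)  # alpha = 0.01 (assumed)

def update_strategy_network(state_t1, action_idx, v_t1):
    """One gradient step using the cumulative discounted reward v_{t+1};
    the REINFORCE-style loss is an assumption, not the patent's exact formula."""
    logits = policy_net(torch.as_tensor(state_t1, dtype=torch.float32))
    log_prob = torch.log_softmax(logits, dim=-1)[action_idx]  # log pi_theta(s_{t+1}, a_{t+1})
    loss = -v_t1 * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```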
Alternatively, in one possible implementation, when the simulated path plannings are completed, the server may take the path with the largest cumulative discounted reward as the target path.
To sum up, as shown in fig. 6, in one simulation process the server may determine the state at time t according to the distance between the robot's position at time t and the target end point and the shortest distance to the obstacles, input the state at time t into the strategy neural network model, determine the action to be executed at time t+1 according to the output of the model, simulate the robot executing the action and update the state at time t+1, calculate the corresponding reward function values and the cumulative discounted reward for the movement, calculate the loss function, and update the strategy neural network model according to the loss function. This process is repeated until the simulated robot moves to the target end point.
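Tying the earlier sketches together, one simulation (rollout) could look roughly as follows; the grid-based moves, the step size, the termination test and the loop bound are illustrative assumptions, and the helper functions are the hypothetical ones defined in the sketches above.

```python
import torch

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0),
         "upper_right": (1, 1), "lower_right": (1, -1),
         "upper_left": (-1, 1), "lower_left": (-1, -1)}

def simulate_one_path(start, goal, obstacles, max_steps=200):
    """One simulated path from start to goal, updating the strategy network each step."""
    pos, path, r_prev = list(start), [tuple(start)], 0.0
    for _ in range(max_steps):
        s_t = state_of(pos, goal, obstacles)                 # state at time t
        probs = action_probabilities(s_t)
        a_idx = int(torch.argmax(probs))                     # highest-probability action (S302)
        dx, dy = MOVES[ACTIONS[a_idx]]
        pos = [pos[0] + dx, pos[1] + dy]                     # movement position at time t+1
        s_t1 = state_of(pos, goal, obstacles)                # state at time t+1 (S303)
        r_g = first_reward(s_t[0], s_t1[0])                  # guiding-target reward (S304)
        r_o = second_reward(s_t[1], s_t1[1])                 # obstacle avoidance reward (S304)
        r_t1 = total_reward(r_g, r_o, s_t1[1])               # total reward (S305)
        v_t1 = cumulative_discounted_reward(r_prev, r_t1)    # cumulative discounted reward (S306)
        update_strategy_network(s_t1, a_idx, v_t1)           # strategy network update (S307)
        path.append(tuple(pos))
        r_prev = r_t1
        if s_t1[0] == 0.0:                                   # target end point reached
            break
    return path

# Repeating the simulation N times yields the N candidate paths of S202, from
# which select_target_path() (see the earlier sketch) picks the target path:
# candidate_paths = [simulate_one_path(start, goal, obstacles) for _ in range(N)]
```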
According to the path planning method, the path planning is carried out on the robot through the neural network model based on the strategy gradient algorithm, the movement behavior of the robot is decomposed into the obstacle avoidance behavior and the movement behavior towards the target end point, and the corresponding reward function is designed aiming at the decomposed movement behavior, so that the learning convergence speed of the strategy neural network model can be improved, and the path planning efficiency and accuracy are effectively improved.
The embodiment of the application also provides a path planning device.
Fig. 7 is a schematic structural diagram of a path planning apparatus 700 according to an embodiment of the present application, as shown in fig. 7, including:
a receiving module 701, configured to receive a path planning request from a target start point to a target end point of the robot.
The planning module 702 is configured to simulate, according to the path planning request, a movement of the robot from the target start point to the target end point to obtain N candidate paths, where N is a positive integer, the N candidate paths are generated through N simulation processes, and each candidate path is formed by positions at multiple moments; in the M-th simulation process, updating the moving position of the robot at the time t+1 according to the distance between the robot and the target end point at the time t and the shortest distance between the robot and the peripheral barrier and a strategy neural network model corresponding to the M-th simulation process; and t is greater than or equal to 0, and M is less than or equal to N.
A determining module 703, configured to determine a target path from a target start point to a target end point of the robot from the N candidate paths.
Optionally, the planning module 702 is further configured to input the state of the robot at time t into the strategy neural network model corresponding to the M-th simulation process, and obtain a plurality of candidate movement positions of the robot at time t+1 and the probability of each candidate movement position, wherein the state at time t is used for indicating the distance between the robot and the target end position at time t and the shortest distance between the robot and the peripheral obstacle; and take the candidate movement position with the highest probability as the movement position of the robot at time t+1.
Optionally, the planning module 702 is further configured to simulate the movement of the robot according to the movement position at time t+1 and acquire the state of the robot at time t+1; acquire, according to the state at time t and the state at time t+1, a first reward function value corresponding to the behavior of the robot moving towards the target end point at time t+1 and a second reward function value corresponding to the obstacle avoidance behavior; acquire a total reward function value of the robot at time t+1 according to the first reward function value and the second reward function value; acquire a cumulative discounted reward of the robot at time t+1 according to the total reward function value; and update the strategy neural network model corresponding to the M-th simulation process according to the cumulative discounted reward at time t+1 and the probability corresponding to the movement position at time t+1, so as to obtain the strategy neural network model corresponding to the (M+1)-th simulation process.
Optionally, the planning module 702 is further configured to obtain the attractive force U_1 exerted on the robot by the target end position at time t and the attractive force U_2 exerted on the robot by the target end position at time t+1; acquire the first reward function value according to the distance between the robot and the target end position at time t+1, U_1 and U_2; and acquire the second reward function value according to the shortest distance between the robot and the obstacle at time t and at time t+1 and a preset safety distance.
Optionally, the planning module 702 is further configured to obtain U_1 according to the distance between the robot and the target end point position at time t and an attraction gain coefficient.
Optionally, the planning module 702 is further configured to, if the distance between the robot and the target end position at time t+1 is not within a preset distance interval, acquire the first reward function value according to U_1, U_2 and a reward value adjustment coefficient; and if the distance between the robot and the target end position at time t+1 is within the preset distance interval, set the first reward function value to a first preset value.
Optionally, the planning module 702 is further configured to set the second reward function value to a second preset value if the shortest distance between the robot and the obstacle at time t+1 is greater than or equal to the maximum safe distance; set the second reward function value to a third preset value if the shortest distance between the robot and the obstacle at time t+1 is greater than the minimum safe distance and less than the maximum safe distance and the difference between the shortest distance at time t+1 and the shortest distance at time t is greater than or equal to a default value; set the second reward function value to a fourth preset value if the shortest distance between the robot and the obstacle at time t+1 is greater than the minimum safe distance and less than the maximum safe distance and the difference between the shortest distance at time t+1 and the shortest distance at time t is less than the default value; and set the second reward function value to a fifth preset value if the shortest distance between the robot and the obstacle at time t+1 is less than or equal to the minimum safe distance; the second preset value, the third preset value, the fourth preset value and the fifth preset value decrease in sequence.
Optionally, the planning module 702 is further configured to obtain a weight of the second reward function value, wherein the weight is inversely related to the shortest distance between the robot and the obstacle; and obtain the total reward function value according to the weight of the second reward function value, the second reward function value and the first reward function value.
The path planning device provided in the embodiment of the present application may execute the path planning method provided in any of the foregoing embodiments, and the principle and technical effects of the path planning device are similar and are not repeated herein.
The embodiment of the application also provides electronic equipment.
Fig. 8 is a schematic structural diagram of an electronic device 800 according to an embodiment of the present application. As shown in fig. 8, the electronic device 800 may include: at least one processor 801 and a memory 802.
The memory 802 is used for storing a program. In particular, the program may include program code including computer operating instructions.
Memory 802 may comprise high-speed RAM memory or may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor 801 is configured to execute computer-executable instructions stored in the memory 802 to implement the actions of the path planning method described in the foregoing method embodiment. The processor 801 may be a central processing unit (Central Processing Unit, abbreviated as CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or one or more integrated circuits configured to implement embodiments of the present application.
Optionally, the electronic device 800 may also include a communication interface 803.
In a specific implementation, if the communication interface 803, the memory 802, and the processor 801 are implemented independently, the communication interface 803, the memory 802, and the processor 801 may be connected to each other and communicate with each other through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. Buses may be divided into address buses, data buses, control buses, and so on, but this does not mean that there is only one bus or only one type of bus.
Alternatively, in a specific implementation, if the communication interface 803, the memory 802, and the processor 801 are implemented on a single chip, the communication interface 803, the memory 802, and the processor 801 may complete communication through internal interfaces.
The embodiment of the present application further provides a computer readable storage medium, on which a computer program is stored, where the computer program implements the technical solution of the path planning method embodiment described above when executed by a processor, and the implementation principle and the technical effect are similar, and are not repeated herein.
In one possible implementation, the computer readable medium may include random access Memory (Random Access Memory, RAM), read-Only Memory (ROM), compact disk (compact disc Read-Only Memory, CD-ROM) or other optical disk Memory, magnetic disk Memory or other magnetic storage device, or any other medium targeted for carrying or storing the desired program code in the form of instructions or data structures, and accessible by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (Digital Subscriber Line, DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes optical disc, laser disc, optical disc, digital versatile disc (Digital Versatile Disc, DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The embodiment of the application further provides a computer program product, which comprises a computer program, and the computer program when executed by a processor realizes the technical scheme of the path planning method embodiment, and the implementation principle and the technical effect are similar, and are not repeated here.
In the specific implementation of the terminal device or the server, it should be understood that the processor may be a central processing unit (Central Processing Unit, abbreviated as CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, abbreviated as DSP), an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or executed by a combination of hardware and software modules in the processor.
Those skilled in the art will appreciate that all or part of the steps of any of the method embodiments described above may be accomplished by hardware associated with program instructions. The foregoing program may be stored in a computer-readable storage medium; when the program is executed, it performs all or part of the steps of the method embodiments described above.
If the technical solution of the present application is implemented in the form of software and sold or used as a product, it may be stored in a computer-readable storage medium. Based on such understanding, all or part of the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium and comprises a computer program or several instructions. The computer software product causes a computer device (which may be a personal computer, a server, a network device, or a similar electronic device) to perform all or part of the steps of the methods described in the embodiments of the present application.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of the technical features thereof can be replaced by equivalents; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. A method of path planning, comprising:
receiving a path planning request of a robot from a target starting point to a target destination;
simulating the movement of the robot from the target starting point to the target destination according to the path planning request to obtain N candidate paths, wherein N is a positive integer, the N candidate paths are generated through N simulation processes, and each candidate path is composed of positions at a plurality of moments; in the M-th simulation process, updating the movement position of the robot at time t+1 according to the distance between the robot and the target destination at time t, the shortest distance between the robot and a peripheral obstacle, and a strategy neural network model corresponding to the M-th simulation process; wherein t is greater than or equal to 0, and M is less than or equal to N;
and determining, from the N candidate paths, a target path of the robot from the target starting point to the target destination.
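For orientation, a minimal Python sketch of the outer loop of claim 1 is given below. It is an illustration only, not the patented implementation: the helper names run_simulation and path_length, and the choice of the shortest candidate as the target path, are assumptions introduced here; the claim itself does not fix how the target path is selected among the N candidates.

import math

def path_length(path):
    # Sum of Euclidean step lengths along a candidate path given as a list of (x, y) positions.
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def plan_path(start, goal, n_simulations, run_simulation):
    # run_simulation(start, goal, m) is assumed to perform the M-th simulation process
    # and return one candidate path as a list of positions over time.
    candidates = [run_simulation(start, goal, m) for m in range(1, n_simulations + 1)]
    # Selecting the shortest candidate is an assumed rule for this sketch.
    return min(candidates, key=path_length)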
2. The method according to claim 1, wherein the updating the movement position of the robot at time t+1 according to the distance between the robot and the target destination at time t, the shortest distance between the robot and the peripheral obstacle, and the strategy neural network model corresponding to the M-th simulation process comprises:
inputting the state of the robot at time t into the strategy neural network model corresponding to the M-th simulation process, and acquiring a plurality of candidate movement positions of the robot at time t+1 and the probability of each candidate movement position; the state at time t is used for indicating the distance between the robot and the target destination at time t and the shortest distance between the robot and the peripheral obstacle;
and taking the candidate movement position with the highest probability as the movement position of the robot at time t+1.
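A minimal sketch of the selection rule in claim 2 follows, written with PyTorch for concreteness. The two-element state vector, the eight-neighbour set of candidate moves, the network architecture, and the names StrategyNet and next_position are all assumptions; the claim only requires that the state at time t be fed to the strategy neural network model and that the most probable candidate movement position be taken.

import torch
import torch.nn as nn

# Assumed discrete action set: the 8 unit moves around the current grid position.
CANDIDATE_MOVES = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

class StrategyNet(nn.Module):
    # Illustrative stand-in for the "strategy neural network model" of claim 2.
    def __init__(self, state_dim=2, hidden=64, n_actions=len(CANDIDATE_MOVES)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)

def next_position(model, position, dist_to_goal, dist_to_obstacle):
    # State at time t: distance to the target destination and shortest obstacle distance.
    state = torch.tensor([dist_to_goal, dist_to_obstacle], dtype=torch.float32)
    probs = model(state)                       # probability of each candidate move
    best = int(torch.argmax(probs))            # claim 2: take the most probable move
    dx, dy = CANDIDATE_MOVES[best]
    return (position[0] + dx, position[1] + dy), probs[best]

During training one would typically sample a move from the distribution rather than always taking the argmax; the argmax here simply mirrors the wording of the claim.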
3. The method of claim 2, wherein after updating the movement position of the robot at time t+1, the method further comprises:
simulating the movement of the robot according to the movement position at the time t+1, and acquiring the state of the robot at the time t+1;
acquiring, according to the state at time t and the state at time t+1, a first reward function value corresponding to a movement behavior of the robot toward the target destination at time t+1 and a second reward function value corresponding to an obstacle avoidance behavior;
acquiring a total reward function value of the robot at time t+1 according to the first reward function value and the second reward function value;
acquiring a cumulative discounted reward of the robot at time t+1 according to the total reward function value;
and updating the strategy neural network model corresponding to the M-th simulation process according to the cumulative discounted reward at time t+1 and the probability corresponding to the movement position at time t+1, so as to obtain the strategy neural network model corresponding to the (M+1)-th simulation process.
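Claim 3 does not prescribe a particular update rule for the strategy neural network model. As one common realisation of a strategy (policy) gradient update, the sketch below assumes a REINFORCE-style step over the probabilities of the moves actually taken; the discount factor gamma, the loss form, and the optimizer choice are assumptions, not part of the claim.

import torch

def discounted_return(rewards, gamma=0.99):
    # Cumulative discounted reward G_t = sum_k gamma**k * r_{t+k} (gamma is assumed).
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

def update_strategy_net(model, optimizer, step_probs, rewards, gamma=0.99):
    # step_probs: probabilities (tensors from the model) of the moves taken at each
    # time step of the M-th simulation; rewards: total reward function values R_{t+1}.
    returns = torch.tensor(discounted_return(rewards, gamma))
    log_probs = torch.stack([p.log() for p in step_probs])
    loss = -(log_probs * returns).sum()   # policy-gradient objective (assumed form)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

The optimizer could be, for example, torch.optim.Adam(model.parameters(), lr=1e-3); the learning rate is likewise an assumption.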
4. The method of claim 3, wherein the acquiring a first reward function value corresponding to a movement behavior of the robot toward the target destination at time t+1 and a second reward function value corresponding to an obstacle avoidance behavior comprises:
acquiring an attractive force U1 exerted on the robot by the target destination at time t, and an attractive force U2 exerted on the robot by the target destination at time t+1;
acquiring the first reward function value according to the distance between the robot and the target destination at time t+1, the U1, and the U2;
and acquiring the second reward function value according to the shortest distance between the robot and the obstacle at time t and at time t+1, and a preset safety distance.
5. The method of claim 4, wherein the acquiring an attractive force U1 exerted on the robot by the target destination at time t comprises:
acquiring the U1 according to the distance between the robot and the target destination at time t and an attraction gain coefficient.
6. The method according to claim 5, wherein the acquiring the first reward function value according to the distance between the robot and the target destination at time t+1, the U1, and the U2 comprises:
if the distance between the robot and the target destination at time t+1 is not within a preset distance interval, acquiring the first reward function value according to the U1, the U2, and a reward value adjustment coefficient;
and if the distance between the robot and the target destination at time t+1 is within the preset distance interval, setting the first reward function value to a first preset value.
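A sketch of one possible reading of claims 4 through 6 follows. The quadratic potential-field form of the attractive force and every numeric value (k_att, goal_interval, c_adjust, r_goal) are assumptions; the claims only state that U1 and U2 depend on the distance to the target destination and an attraction gain coefficient, and that the first reward function value is either derived from U1, U2 and a reward value adjustment coefficient or set to a first preset value near the destination.

def attraction(distance, k_att=1.0):
    # Attractive potential toward the target destination (claim 5 sketch).
    # The quadratic form 0.5 * k * d**2 and the gain k_att are assumptions.
    return 0.5 * k_att * distance ** 2

def first_reward(dist_t, dist_t1, goal_interval=(0.0, 0.5), c_adjust=1.0, r_goal=100.0):
    # dist_t / dist_t1: distance to the target destination at times t and t+1.
    # goal_interval, c_adjust and r_goal stand in for the "preset distance interval",
    # the "reward value adjustment coefficient" and the "first preset value";
    # none of these numbers comes from the patent.
    lo, hi = goal_interval
    if lo <= dist_t1 <= hi:            # claim 6: close enough to the destination
        return r_goal
    u1 = attraction(dist_t)            # U1 at time t   (claim 5)
    u2 = attraction(dist_t1)           # U2 at time t+1 (claim 4)
    return c_adjust * (u1 - u2)        # assumed form: reward the drop in potential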
7. The method of claim 6, wherein the preset safety distance comprises a maximum safety distance and a minimum safety distance;
the acquiring the second reward function value according to the shortest distance between the robot and the obstacle at time t and at time t+1 and the preset safety distance comprises:
if the shortest distance between the robot and the obstacle at time t+1 is greater than or equal to the maximum safety distance, setting the second reward function value to a second preset value;
if the shortest distance between the robot and the obstacle at time t+1 is greater than the minimum safety distance and less than the maximum safety distance, and the difference between the shortest distance between the robot and the obstacle at time t+1 and the shortest distance between the robot and the obstacle at time t is greater than or equal to a default value, setting the second reward function value to a third preset value;
if the shortest distance between the robot and the obstacle at time t+1 is greater than the minimum safety distance and less than the maximum safety distance, and the difference between the shortest distance between the robot and the obstacle at time t+1 and the shortest distance between the robot and the obstacle at time t is less than the default value, setting the second reward function value to a fourth preset value;
and if the shortest distance between the robot and the obstacle at time t+1 is less than or equal to the minimum safety distance, setting the second reward function value to a fifth preset value; wherein the second preset value, the third preset value, the fourth preset value, and the fifth preset value decrease in sequence.
8. The method of claim 7, wherein the acquiring the total reward function value at time t+1 according to the first reward function value and the second reward function value comprises:
acquiring a weight of the second reward function value, wherein the weight is inversely related to the shortest distance between the robot and the obstacle;
and acquiring the total reward function value according to the weight of the second reward function value, the second reward function value, and the first reward function value.
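The tiered obstacle-avoidance reward of claim 7 and the weighted total of claim 8 can be read as in the following sketch. All numeric values and the specific inverse-distance form of the weight are assumptions; the claims fix only the ordering of the four preset values and the inverse relation between the weight and the shortest robot-obstacle distance.

def second_reward(obst_t, obst_t1, d_min=0.2, d_max=1.0, delta=0.0,
                  r_safe=10.0, r_away=5.0, r_closer=-5.0, r_danger=-100.0):
    # Second (obstacle-avoidance) reward of claim 7; all numeric values are assumptions.
    # obst_t / obst_t1: shortest robot-obstacle distance at times t and t+1;
    # d_min / d_max: minimum and maximum safety distances; delta: the "default value".
    if obst_t1 >= d_max:
        return r_safe                      # second preset value: well clear of the obstacle
    if obst_t1 <= d_min:
        return r_danger                    # fifth preset value: inside the minimum safety distance
    if obst_t1 - obst_t >= delta:
        return r_away                      # third preset value: moving away from the obstacle
    return r_closer                        # fourth preset value: moving toward the obstacle

def total_reward(r_first, r_second, obst_t1, d_max=1.0):
    # Total reward of claim 8: the weight of the second reward grows as the obstacle
    # gets closer; the capped 1/distance form of the weight is an assumption.
    weight = min(5.0, d_max / max(obst_t1, 1e-6))
    return r_first + weight * r_second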
9. A path planning apparatus, comprising:
the receiving module is used for receiving a path planning request of a robot from a target starting point to a target destination;
the planning module is used for simulating the movement of the robot from the target starting point to the target destination according to the path planning request to obtain N candidate paths, wherein N is a positive integer, the N candidate paths are generated through N simulation processes, and each candidate path is composed of positions at a plurality of moments; in the M-th simulation process, updating the movement position of the robot at time t+1 according to the distance between the robot and the target destination at time t, the shortest distance between the robot and a peripheral obstacle, and a strategy neural network model corresponding to the M-th simulation process; wherein t is greater than or equal to 0, and M is less than or equal to N;
and the determining module is used for determining, from the N candidate paths, a target path of the robot from the target starting point to the target destination.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the method of any one of claims 1-8.
11. A computer readable storage medium, having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-8.
CN202310511820.6A 2023-05-08 2023-05-08 Path planning method, path planning device, electronic equipment and storage medium Pending CN116382299A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310511820.6A CN116382299A (en) 2023-05-08 2023-05-08 Path planning method, path planning device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310511820.6A CN116382299A (en) 2023-05-08 2023-05-08 Path planning method, path planning device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116382299A true CN116382299A (en) 2023-07-04

Family

ID=86975184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310511820.6A Pending CN116382299A (en) 2023-05-08 2023-05-08 Path planning method, path planning device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116382299A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117970935A (en) * 2024-04-02 2024-05-03 博创联动科技股份有限公司 Automatic obstacle avoidance method and system for agricultural machinery based on digital village
CN117970935B (en) * 2024-04-02 2024-06-11 博创联动科技股份有限公司 Automatic obstacle avoidance method and system for agricultural machinery based on digital village


Similar Documents

Publication Publication Date Title
CN110882542B (en) Training method, training device, training equipment and training storage medium for game intelligent agent
US20210124353A1 (en) Combined prediction and path planning for autonomous objects using neural networks
CN109690576A (en) The training machine learning model in multiple machine learning tasks
Sichkar Reinforcement learning algorithms in global path planning for mobile robot
CN111381600B (en) UUV path planning method based on particle swarm optimization
CN111462131A (en) Method and equipment for attention-driven image segmentation
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN113052253A (en) Hyper-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN114162146A (en) Driving strategy model training method and automatic driving control method
CN113359859A (en) Combined navigation obstacle avoidance method and system, terminal device and storage medium
CN116339349A (en) Path planning method, path planning device, electronic equipment and storage medium
CN116362289A (en) Improved MATD3 multi-robot collaborative trapping method based on BiGRU structure
CN113110101B (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
KR20240008386A (en) Method and system for determining action of device for given state using model trained based on risk measure parameter
CN116382299A (en) Path planning method, path planning device, electronic equipment and storage medium
WO2023136020A1 (en) Pathfinding apparatus, pathfinding method, and non-transitory computer-readable storage medium
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
US20210252711A1 (en) Method and device for controlling a robot
CN114611664A (en) Multi-agent learning method, device and equipment
Godoy et al. Online learning for multi-agent local navigation
Shiltagh et al. A comparative study: Modified particle swarm optimization and modified genetic algorithm for global mobile robot navigation
JP7459238B2 (en) Autonomous driving optimization method and system based on reinforcement learning based on user preferences
CN116772886B (en) Navigation method, device, equipment and storage medium for virtual characters in virtual scene
KR102617418B1 (en) Method, computer system, and computer program for reinforcement learning-based navigation adaptable to sensor configuration and robot shape
WO2023063020A1 (en) Route planning system, route planning method, roadmap constructing device, model generating device, and model generating method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination