CN113894787B - Heuristic reward function design method for mechanical arm reinforcement learning motion planning - Google Patents

Heuristic reward function design method for mechanical arm reinforcement learning motion planning Download PDF

Info

Publication number
CN113894787B
Authority
CN
China
Prior art keywords
heuristic
mechanical arm
function
motion planning
planning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111278998.8A
Other languages
Chinese (zh)
Other versions
CN113894787A (en)
Inventor
白成超
张家维
郭继峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202111278998.8A priority Critical patent/CN113894787B/en
Publication of CN113894787A publication Critical patent/CN113894787A/en
Application granted granted Critical
Publication of CN113894787B publication Critical patent/CN113894787B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1656 - Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 - Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • B25J9/1666 - Avoiding collision or forbidden zones
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J18/00 - Arms
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1679 - Programme controls characterised by the tasks executed

Abstract

The invention discloses a design method for a heuristic reward function used in mechanical arm reinforcement learning motion planning, and relates to the technical field of robot motion planning and intelligent control. The method aims to solve the problem that reward functions for reinforcement-learning-based mechanical arm motion planning algorithms are usually designed by experience, without a unified guiding method. The invention comprises: establishing a heuristic function for the mechanical arm motion planning problem; constructing a heuristic reward function for mechanical arm motion planning from the heuristic function; determining the parameter values in the heuristic reward function; and training a neural network motion planner for mechanical arm motion planning with the constructed heuristic reward function. The heuristic reward function significantly improves the success rate of motion planning and accelerates convergence. The method is applicable to the field of mechanical arm motion planning and intelligent control.

Description

Heuristic reward function design method for mechanical arm reinforcement learning motion planning
Technical Field
The invention relates to a heuristic reward function design method in mechanical arm reinforcement learning motion planning, and belongs to the technical field of robot motion planning and intelligent control.
Background
Compared with the motion planning of unmanned vehicles or unmanned aerial vehicles, mechanical arm motion planning is usually performed in an abstract high-dimensional joint space (configuration space), which makes some classical planning algorithms difficult to apply to mechanical arm systems because the collision-free configuration space of a mechanical arm is difficult to obtain explicitly. Commonly used mechanical arm motion planning algorithms can be divided into sampling-based motion planning algorithms, trajectory optimization algorithms, artificial potential field methods and graph-search motion planning algorithms. Traditional mechanical arm motion planning algorithms have difficulty achieving fast planning in high-dimensional, complex environments, and in recent years motion planning algorithms based on reinforcement learning have attracted the attention of many researchers. With the continued progress of deep reinforcement learning on tasks with multi-dimensional continuous action spaces, reinforcement-learning-based motion planning algorithms have the potential to simultaneously offer high-dimensional adaptability, adaptability to complex environments, and planning speed. At present, research on the mechanical arm reinforcement learning motion planning problem focuses mainly on neural network design and auxiliary learning strategies, while research on reward function design for the motion planning problem is scarce. Existing work includes directional reward function design, switching strategies between dense and sparse rewards, and the like, and the reward functions for existing mechanical arm motion planning problems are usually designed by experience, without the guidance of a unified design method. Therefore, the prior art does not provide a way to design a heuristic reward function for mechanical arm reinforcement learning motion planning.
The heuristic function is an important concept in informed search algorithms; in a graph-search algorithm, the heuristic function represents the estimated cost of the lowest-cost path from the current node to the target node. With carefully designed heuristic functions, graph-search motion planning achieves high success rates in very complex environments, which is difficult for other planning algorithms to match. The design of the heuristic function has a great influence on the performance of graph-search motion planning, and its meaning is similar to that of the reinforcement learning reward function.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the invention provides a heuristic reward function design method used in mechanical arm reinforcement learning motion planning, and aims to solve the problem that convergence speed and success rate of mechanical arm reinforcement learning motion planning are affected because the existing reward function design based on the reinforcement learning mechanical arm motion planning algorithm is designed by means of experience because a unified guidance method is not available.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A design method for a heuristic reward function in mechanical arm reinforcement learning motion planning comprises the following steps:
step one: establishing a heuristic function h(n) for the mechanical arm motion planning problem;
step two: constructing a heuristic reward function for mechanical arm motion planning from the heuristic function;
step three: determining parameter values in a heuristic reward function;
step four: and training the neural network motion planner for mechanical arm motion planning by using the constructed heuristic reward function.
In step one, the heuristic function h(n) is constructed in one of the following three ways:
1) when the motion planning problem has explicit constraints, a relaxed problem whose optimal solution is easy to compute is obtained by removing some constraints of the original problem, and the solution of the relaxed problem is used as the heuristic function of the original problem;
2) when the motion planning problem has sub-problems with a clear structure, the cost of the solution of a sub-problem of the original problem is used as the heuristic of the original problem;
3) when the heuristic function is difficult to construct by 1) or 2), the heuristic function is constructed by directly learning from and generalizing over experience.
The process of constructing the heuristic function h(n) by way 1) is as follows (the invention constructs two heuristic functions for mechanical arm motion planning):
Step 1-1: remove constraints from the original motion planning problem to relax it, using either relaxation mode one or relaxation mode two;
Relaxation mode one: when the obstacles are dense and the environment is complex in the mechanical arm motion planning scene, the kinematic constraints of the mechanical arm in the original problem are removed, and the original problem is relaxed into the problem of moving the end effector of the mechanical arm to the target position without collision;
Relaxation mode two: when the mechanical arm motion planning scene is simple, the kinematic constraints and the collision-free constraint between the end effector and the obstacles are removed, and the original problem is relaxed into the problem of moving the end effector directly to the target position without considering collisions.
Step 1-2: take the solution of the relaxed problem as the heuristic function of the original problem:
the heuristic function corresponding to relaxation mode one is: an estimate of the shortest collision-free path length in the three-dimensional workspace for a sphere representing the end effector of the mechanical arm to reach the target position;
the heuristic function corresponding to relaxation mode two is: an estimate of the straight-line distance between the end effector and the target position.
In step 1-2, the heuristic function corresponding to relaxation mode one has the following specific form:
h_1(s_t) = Σ_{i=1}^{N-1} ||P(i+1) - P(i)||
In the above formula, P represents the sphere motion path calculated by the RRT-Connect motion planning algorithm and consists of N path points, P(i) represents the position of the i-th path point, and h_1(s_t) is called the RRT heuristic function.
The specific form of the heuristic function corresponding to relaxation mode two is as follows:
h_2(s_t) = ||p(s_t) - p_goal||
In the above formula, p(s_t) denotes the end-effector position of the mechanical arm at time t, p_goal denotes the target position of the mechanical arm motion plan, and h_2(s_t) is called the straight-line heuristic function.
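The two heuristic functions can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch: the RRT-Connect planner itself is not implemented here, the waypoints argument is assumed to already be the collision-free sphere path P(1..N) returned by such a planner, and the function names and coordinates are made up for the example.

import numpy as np

def rrt_heuristic(waypoints: np.ndarray) -> float:
    """h_1(s_t): length of the collision-free sphere path P(1..N).

    `waypoints` is an (N, 3) array of Cartesian path points assumed to come
    from an RRT-Connect planner run on the relaxed (end-effector-only) problem.
    """
    segments = np.diff(waypoints, axis=0)               # P(i+1) - P(i)
    return float(np.linalg.norm(segments, axis=1).sum())

def line_heuristic(p_end: np.ndarray, p_goal: np.ndarray) -> float:
    """h_2(s_t): straight-line distance from the end effector to the goal."""
    return float(np.linalg.norm(np.asarray(p_goal) - np.asarray(p_end)))

# Illustrative coordinates only (not from the patent's experiments):
path = np.array([[0.30, 0.00, 0.40],
                 [0.35, 0.10, 0.45],
                 [0.40, 0.20, 0.50]])
print(rrt_heuristic(path))                  # relaxed-problem path length
print(line_heuristic(path[0], path[-1]))    # lower bound that ignores obstacles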
In step two, the heuristic reward function for mechanical arm motion planning is constructed as follows:
r_t = 1, if ||p(s_{t+1}) - p_goal|| ≤ ε (planning success)
r_t = -1, if a collision occurs
r_t = f(h(s_{t+1})), otherwise
In the above equation, ε denotes the distance threshold for judging whether planning is successful, and f(h(s_{t+1})) is defined as follows:
f(h(s_{t+1})) = λ_1 + λ_2(λ_3 - h(s_{t+1}))
The above formula consists of two parts: the first term λ_1 is a time penalty intended to make the mechanical arm move to the target position as quickly as possible; the second term scales h(s_{t+1}) to a suitable range, where λ_3 is a constant adjusting the sign of h(s_{t+1}) and λ_2 is a constant adjusting its magnitude.
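A minimal sketch of evaluating this reward at a single environment step is given below. The +1 success reward and -1 collision penalty are assumptions inferred from the parameter analysis in step three (which bounds the cumulative shaping reward between -1 and 1); the function names and the way success and collision are detected are illustrative, not prescribed by the invention.

def shaping_term(h_next: float, lam1: float, lam2: float, lam3: float) -> float:
    """f(h(s_{t+1})) = lambda_1 + lambda_2 * (lambda_3 - h(s_{t+1}))."""
    return lam1 + lam2 * (lam3 - h_next)

def heuristic_reward(dist_to_goal: float, collided: bool, h_next: float,
                     eps: float, lam1: float, lam2: float, lam3: float) -> float:
    """Per-step reward: terminal values on success/collision, f(h) otherwise."""
    if dist_to_goal <= eps:   # end effector within the success threshold epsilon
        return 1.0            # terminal success reward (assumed +1)
    if collided:              # arm hit an obstacle
        return -1.0           # terminal collision penalty (assumed -1)
    return shaping_term(h_next, lam1, lam2, lam3)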
In step three, the parameter values in the heuristic reward function are determined as follows:
The values of the parameters are adjusted according to the magnitude of the heuristic function and designed according to the constraint relation of the following formula,
-1 < Σ_{t=1}^{T_end} γ^t f(h(s_{t+1})) < 1
In the above formula, T_end represents the total number of interaction steps in a training round, t represents the t-th interaction, and γ^t represents the value discount coefficient of the t-th interaction. If the value of the above expression is less than -1, the agent may choose to actively collide with an obstacle to end the current round in order to increase the cumulative reward; if the value is greater than 1, the agent may linger around the goal state until the training round ends. When the value lies between -1 and 1, the agent is prevented from learning a wrong strategy and the reward plays its intended guiding role.
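The constraint can be checked numerically before training, as sketched below. The discount factor and the decreasing heuristic trace are placeholder values chosen only to show how the bound would be verified; the λ values are the ones used later in the embodiment.

def shaping_return(h_values, gamma, lam1, lam2, lam3):
    """Discounted sum of f(h(s_{t+1})) over the T_end steps of one episode."""
    return sum(gamma ** t * (lam1 + lam2 * (lam3 - h))
               for t, h in enumerate(h_values, start=1))

# Placeholder episode: heuristic value shrinking from 0.6 m toward the goal.
h_trace = [0.6 - 0.005 * t for t in range(100)]
value = shaping_return(h_trace, gamma=0.99, lam1=-0.001, lam2=-0.02, lam3=0.2)
# Parameters are acceptable when the cumulative shaping reward stays in (-1, 1).
assert -1.0 < value < 1.0, "re-tune lambda_1..lambda_3 for this heuristic scale"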
The invention has at least the following beneficial technical effects:
The heuristic function for mechanical arm motion planning is designed based on the heuristic function design methods used in graph-search motion planning algorithms; the heuristic function is then used to design the reward function for mechanical arm reinforcement learning motion planning, and the resulting reward function is called the heuristic reward function.
The invention provides a novel motion planning reward function design framework based on the heuristic function design methods used in graph-search motion planning algorithms, and experiments verify that the two heuristic functions designed with this framework accelerate the reinforcement learning training process and improve the motion planning success rate. The invention solves the problem that reward functions for reinforcement-learning-based mechanical arm motion planning algorithms are usually designed by experience, without a unified guiding method. The invention first establishes a heuristic function through the construction methods of relaxing the original problem, solving a sub-problem, and learning and generalization, and on this basis provides a method for constructing a reward function from the established heuristic function.
Verification shows that the heuristic reward function designed by the invention significantly improves the success rate of motion planning and accelerates convergence. The method is applicable to the field of mechanical arm motion planning and intelligent control.
Drawings
FIG. 1 is a simulation scenario;
FIG. 2 is a graph of the success rate of the training process in a desktop scenario;
FIG. 3 is a graph of the success rate of the training process in a wall obstacle scenario;
fig. 4 is a graph of the success rate of the training process in a cabinet scenario.
Detailed Description
Specific embodiment one:
The heuristic reward function design framework for mechanical arm motion planning in this embodiment comprises the following steps:
Step one: establish a heuristic function h(n) for the mechanical arm motion planning problem; the heuristic function can be constructed in the following three ways:
1) When the motion planning problem has explicit constraints, a relaxed problem whose optimal solution is easy to compute is obtained by removing some constraints of the original problem, and the solution of the relaxed problem is used as the heuristic function of the original problem.
2) When the motion planning problem has sub-problems with a clear structure, the cost of the solution of a sub-problem of the original problem is used as the heuristic of the original problem.
3) When the heuristic function is difficult to construct in either of the above ways, it is constructed by directly learning from and generalizing over experience.
Taking way 1) as an example, the invention constructs two heuristic functions for mechanical arm motion planning according to the following steps:
Step 1-1: remove constraints from the original motion planning problem to relax it.
Relaxation mode one: when the obstacles are dense and the environment is complex in the mechanical arm motion planning scene, relaxation mode one removes the kinematic constraints of the mechanical arm in the original problem, and the original problem becomes the problem of moving the end effector of the mechanical arm to the target position without collision.
Relaxation mode two: when the mechanical arm motion planning scene is simple, the kinematic constraints and the collision-free constraint between the end effector and the obstacles are removed, and the original problem is relaxed into the problem of moving the end effector directly to the target position without considering collisions.
Step 1-2: take the solution of the relaxed problem as the heuristic function of the original problem.
The heuristic function corresponding to relaxation mode one is: an estimate of the shortest collision-free path length in the three-dimensional workspace for a sphere representing the end effector of the mechanical arm to reach the target position. The invention adopts the sampling-based RRT-Connect motion planning algorithm to calculate the collision-free path of the sphere in three-dimensional Euclidean space, and the heuristic function is:
h_1(s_t) = Σ_{i=1}^{N-1} ||P(i+1) - P(i)||
In the above formula, P represents the sphere motion path calculated by the RRT-Connect motion planning algorithm and consists of N path points, P(i) represents the position of the i-th path point, and h_1(s_t) is called the RRT heuristic function.
The heuristic function corresponding to relaxation mode two is: an estimate of the straight-line distance between the end effector and the target position, obtained by directly computing that distance, and the heuristic function is:
h_2(s_t) = ||p(s_t) - p_goal||
In the above formula, p(s_t) denotes the end-effector position of the mechanical arm at time t, p_goal denotes the target position of the mechanical arm motion plan, and h_2(s_t) is called the straight-line heuristic function.
Step two: and constructing a heuristic reward function of the mechanical arm movement plan according to the heuristic function:
r_t = 1, if ||p(s_{t+1}) - p_goal|| ≤ ε (planning success)
r_t = -1, if a collision occurs
r_t = f(h(s_{t+1})), otherwise
In the above equation, ε represents the distance threshold for determining whether planning is successful, and f(h(s_{t+1})) is defined as follows:
f(h(s_{t+1})) = λ_1 + λ_2(λ_3 - h(s_{t+1}))
The above formula consists of two parts: the first term λ_1 is a time penalty intended to make the mechanical arm move to the target position as quickly as possible; the second term scales h(s_{t+1}) to a suitable range, where λ_3 adjusts its sign and λ_2 adjusts its magnitude.
Step three: and determining parameter values in the heuristic reward function, wherein the value of each parameter needs to be adjusted according to the size of the heuristic function and can be designed according to the constraint relation of the following formula.
-1 < Σ_{t=1}^{T_end} γ^t f(h(s_{t+1})) < 1
In the above formula, T_end represents the total number of interaction steps in a training round. If the value of the above expression is less than -1, the agent may choose to actively collide with an obstacle to end the current round in order to increase the cumulative reward; if the value is greater than 1, the agent may linger around the goal state until the training round ends. When the value lies between -1 and 1, the agent is prevented from learning a wrong strategy and the reward plays its guiding role.
Step four: and training the neural network motion planner for mechanical arm motion planning by using the constructed heuristic reward function.
The beneficial effects of the present invention are demonstrated with the following examples:
example (b):
1) experimental setup
In the invention, the values of λ_1, λ_2 and λ_3 are chosen as -0.001, -0.02 and 0.2, respectively. To highlight the role of the heuristic function in the reward function, a sparse reward function without any heuristic term is set up for comparison:
[Equation: sparse reward function without the heuristic term, used for comparison.]
the invention is based on a jaco2 cooperative mechanical arm with 7 degrees of freedom for training and testing. In order to fully test the effect of the heuristic reward function, 3 experimental scenes with increasing difficulty are set, namely a desktop scene, a wall obstacle scene and a cabinet scene, as shown in fig. 1. Each scene establishes a corresponding training environment in MuJoCo, and the task of the mechanical arm is to move from a random initial position of a three-dimensional working space to a random target position.
Experimental results and analysis:
The effectiveness of the proposed heuristic reward function is evaluated in terms of success rate and convergence speed. For success rate, after training, with noise perturbation and the stochastic policy disabled, each neural network motion planner performs 100 planning runs with new initial and target configurations, and the success rate over these 100 runs is used to evaluate the planner's real planning success rate. For convergence speed, the number of training rounds at which the planning success rate first reaches 80% during training is used as the index; if the success rate does not reach 80% before training ends, the training is considered to have failed.
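The two evaluation metrics can be sketched as follows; planner.plan and its boolean return value are hypothetical, and only the metric definitions (success rate over fresh planning runs, first training round reaching 80%) come from the text above.

def success_rate(planner, tasks):
    """Fraction of successful plans over fresh (start, goal) pairs (100 runs in the tests).

    `planner.plan(start, goal)` is assumed to return True when the plan reaches the goal.
    """
    outcomes = [planner.plan(start, goal) for start, goal in tasks]
    return sum(1 for ok in outcomes if ok) / len(outcomes)

def convergence_episode(success_curve, threshold=0.8):
    """First training round whose planning success rate reaches the threshold (80%).

    Returns None if the threshold is never reached, which the text above treats
    as a failed training run.
    """
    for episode, rate in enumerate(success_curve, start=1):
        if rate >= threshold:
            return episode
    return None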
Because the two heuristic functions h_1(s_t) and h_2(s_t) are identical in the desktop scene, only h_1(s_t) is tested there for ease of calculation. Two comparison experiments are run in the desktop scene, and three comparison experiments are run in each of the wall obstacle and cabinet scenes. Each group of experiments is repeated three times with the same training parameters; the specific meaning of each comparison experiment and the success rate curves of the training process are shown in FIGS. 2-4, where the solid line is the mean success rate over the three training runs and the shaded area behind it is the range of success rates over the three runs.
Table 1 Success rate test results
[Table data not reproduced in the text of this publication.]
Table 2 Convergence rate test results
[Table data not reproduced in the text of this publication.]
The success rate test results and convergence rate data after training are shown in tables 1 and 2, respectively. The experimental results show that the success rate of the heuristic reward functions is higher than that of the situation without the heuristic reward functions, and the straight line heuristic reward functions have higher success rate than the RRT heuristic reward functions. In the aspect of convergence speed, the convergence speed of the heuristic reward function is higher than that of the case without the heuristic function, and the RRT heuristic reward function has more advantages than the straight line heuristic function in the convergence speed.
According to the method provided by the invention, the training convergence speed and the motion planning success rate of the mechanical arm motion planning algorithm based on reinforcement learning can be improved, and guidance is provided for the design of a reward function.
The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it should be understood that various changes and modifications can be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A heuristic reward function design method for mechanical arm reinforcement learning motion planning is characterized in that: the method comprises the following steps:
step one: establishing a heuristic function h(n) of a mechanical arm motion planning problem;
step two: constructing a heuristic reward function for mechanical arm motion planning from the heuristic function; the heuristic reward function for mechanical arm motion planning is constructed as follows:
r_t = 1, if ||p(s_{t+1}) - p_goal|| ≤ ε (planning success)
r_t = -1, if a collision occurs
r_t = f(h(s_{t+1})), otherwise
in the above equation, ε denotes the distance threshold for judging whether planning is successful, and f(h(s_{t+1})) is defined as follows:
f(h(s_{t+1})) = λ_1 + λ_2(λ_3 - h(s_{t+1}))
the above formula consists of two parts: the first term λ_1 is a time penalty intended to make the mechanical arm move to the target position as quickly as possible; the second term scales h(s_{t+1}) to a suitable range, where λ_3 is a constant adjusting the sign of h(s_{t+1}) and λ_2 is a constant adjusting its magnitude; p(s_t) denotes the end-effector position of the mechanical arm at time t, and p_goal denotes the target position of the mechanical arm motion plan;
Step three: determining parameter values in a heuristic reward function;
step four: and training the neural network motion planner for mechanical arm motion planning by using the constructed heuristic reward function.
2. The heuristic reward function design method for mechanical arm reinforcement learning motion planning as claimed in claim 1, wherein the heuristic function h(n) in step one is constructed in one of the following three ways:
1) when the motion planning problem has explicit constraints, a relaxed problem whose optimal solution is easy to compute is obtained by removing some constraints of the original problem, and the solution of the relaxed problem is used as the heuristic function of the original problem;
2) when the motion planning problem has sub-problems with a clear structure, the cost of the solution of a sub-problem of the original problem is used as the heuristic of the original problem;
3) when the heuristic function is difficult to construct by 1) or 2), the heuristic function is constructed by directly learning from and generalizing over experience.
3. The heuristic reward function design method for mechanical arm reinforcement learning motion planning as claimed in claim 2, wherein the process of constructing the heuristic function h(n) by way 1) is as follows:
step 1-1: removing constraints from the original motion planning problem to relax it, using either relaxation mode one or relaxation mode two;
relaxation mode one: when the obstacles are dense in the mechanical arm motion planning scene, removing the kinematic constraints of the mechanical arm in the original problem, so that the original problem is relaxed into the problem of moving the end effector of the mechanical arm to the target position without collision;
relaxation mode two: when the mechanical arm motion planning scene is simple, removing the kinematic constraints and the collision-free constraint between the end effector and the obstacles, so that the original problem is relaxed into the problem of moving the end effector directly to the target position without considering collisions between the end effector and the obstacles;
step 1-2: taking the solution of the relaxed problem as the heuristic function of the original problem:
the heuristic function corresponding to relaxation mode one is: an estimate of the shortest collision-free path length in the three-dimensional workspace for a sphere representing the end effector of the mechanical arm to reach the target position;
the heuristic function corresponding to relaxation mode two is: an estimate of the straight-line distance between the end effector and the target position.
4. The heuristic reward function design method for mechanical arm reinforcement learning motion planning as claimed in claim 3, wherein the specific form of the heuristic function corresponding to relaxation mode one in step 1-2 is as follows:
h_1(s_t) = Σ_{i=1}^{N-1} ||P(i+1) - P(i)||
in the above formula, P represents the sphere motion path calculated by the RRT-Connect motion planning algorithm and consists of N path points, P(i) represents the position of the i-th path point, and h_1(s_t) is called the RRT heuristic function.
5. The heuristic reward function design method for mechanical arm reinforcement learning motion planning as claimed in claim 3, wherein the specific form of the heuristic function corresponding to relaxation mode two in step 1-2 is as follows:
h_2(s_t) = ||p(s_t) - p_goal||
in the above formula, p(s_t) denotes the end-effector position of the mechanical arm at time t, p_goal denotes the target position of the mechanical arm motion plan, and h_2(s_t) is called the straight-line heuristic function.
6. The heuristic reward function design method for mechanical arm reinforcement learning motion planning as claimed in claim 1, wherein the parameter values in the heuristic reward function in step three are determined as follows:
the values of the parameters are adjusted according to the size of the heuristic function and are designed according to the constraint relation of the following formula,
-1 < Σ_{m=1}^{T_end} γ^m f(h(s_{m+1})) < 1
in the above formula, T_end represents the total number of interaction steps of a training round, m represents the m-th interaction, and γ^m represents the value discount coefficient of the m-th interaction; if the value of the above expression is less than -1, the mechanical arm will choose to actively collide with an obstacle to end the current round in order to increase the cumulative reward, and if the value is greater than 1, the mechanical arm will linger around the target state until the training round ends; when the value lies between -1 and 1, the mechanical arm is prevented from learning a wrong strategy and the reward plays its guiding role.
7. A computer-readable storage medium, characterized in that: the computer readable storage medium stores a computer program configured to, when invoked by a processor, implement the steps of the heuristic reward function design method for robot arm reinforcement learning motion planning of any of claims 1-6.
CN202111278998.8A 2021-10-31 2021-10-31 Heuristic reward function design method for mechanical arm reinforcement learning motion planning Active CN113894787B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111278998.8A CN113894787B (en) 2021-10-31 2021-10-31 Heuristic reward function design method for mechanical arm reinforcement learning motion planning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111278998.8A CN113894787B (en) 2021-10-31 2021-10-31 Heuristic reward function design method for mechanical arm reinforcement learning motion planning

Publications (2)

Publication Number Publication Date
CN113894787A (en) 2022-01-07
CN113894787B (en) 2022-06-14

Family

ID=79027742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111278998.8A Active CN113894787B (en) 2021-10-31 2021-10-31 Heuristic reward function design method for mechanical arm reinforcement learning motion planning

Country Status (1)

Country Link
CN (1) CN113894787B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103295080A (en) * 2013-06-14 2013-09-11 西安工业大学 Three-dimensional path programming method based on elevation diagram and ant colony foraging
CN108363690A (en) * 2018-02-08 2018-08-03 北京十三科技有限公司 Dialog semantics Intention Anticipation method based on neural network and learning training method
KR20190106920A (en) * 2019-08-30 2019-09-18 엘지전자 주식회사 Robot system and Control method of the same
CN111531543A (en) * 2020-05-12 2020-08-14 中国科学院自动化研究所 Robot self-adaptive impedance control method based on biological heuristic neural network
CN112325897A (en) * 2020-11-19 2021-02-05 东北大学 Path planning method based on heuristic deep reinforcement learning


Also Published As

Publication number Publication date
CN113894787A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN110083165B (en) Path planning method of robot in complex narrow environment
CN107168324B (en) Robot path planning method based on ANFIS fuzzy neural network
CN109343345B (en) Mechanical arm polynomial interpolation track planning method based on QPSO algorithm
CN112947562A (en) Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
Hu et al. A dynamic adjusting reward function method for deep reinforcement learning with adjustable parameters
Kulchenko et al. First-exit model predictive control of fast discontinuous dynamics: Application to ball bouncing
CN113296496A (en) Multi-sampling-point-based gravitational adaptive step size bidirectional RRT path planning method
CN113721622B (en) Robot path planning method
Yan et al. Path Planning for Mobile Robot's Continuous Action Space Based on Deep Reinforcement Learning
CN113894787B (en) Heuristic reward function design method for mechanical arm reinforcement learning motion planning
Luo et al. Reinforcement learning in robotic motion planning by combined experience-based planning and self-imitation learning
Camuffo et al. Moving drones for wireless coverage in a three-dimensional grid analyzed via game theory
Zhang et al. Vehicle driving longitudinal control based on double deep Q network
Botteghi et al. Entropy-based exploration for mobile robot navigation: a learning-based approach
Liu et al. Path Planning for Mobile Robot Based on Deep Reinforcement Learning and Fuzzy Control
CN113515130A (en) Method and storage medium for agent path planning
Yu et al. An intelligent robot motion planning method and application via lppo in unknown environment
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Ali et al. Exploration of unknown environment using deep reinforcement learning
Ji et al. Research on Path Planning of Mobile Robot Based on Reinforcement Learning
Zhao et al. Robot Trajectory Planning Optimization Algorithm Based on Improved TD3 Algorithm
Liu et al. Improving learning from demonstrations by learning from experience
Yang et al. Robot path planning based on q-learning Algorithm
Qiu et al. Sub-optimal policy aided multi-agent reinforcement learning for flocking control
Li et al. Visual-Based Deep Reinforcement Learning for Robot Grasping with Pushing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant