CN115178944B - Narrow space robot operation planning method for safety reinforcement learning - Google Patents
- Publication number
- CN115178944B (application CN202210930544.2A)
- Authority
- CN
- China
- Prior art keywords
- action
- mechanical arm
- joint
- reinforcement learning
- acceleration
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B23—MACHINE TOOLS; METAL-WORKING NOT OTHERWISE PROVIDED FOR
- B23K—SOLDERING OR UNSOLDERING; WELDING; CLADDING OR PLATING BY SOLDERING OR WELDING; CUTTING BY APPLYING HEAT LOCALLY, e.g. FLAME CUTTING; WORKING BY LASER BEAM
- B23K37/00—Auxiliary devices or processes, not specially adapted to a procedure covered by only one of the preceding main groups
- B23K37/02—Carriages for supporting the welding or cutting element
- B23K37/0211—Carriages for supporting the welding or cutting element travelling on a guide member, e.g. rail, track
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B23—MACHINE TOOLS; METAL-WORKING NOT OTHERWISE PROVIDED FOR
- B23K—SOLDERING OR UNSOLDERING; WELDING; CLADDING OR PLATING BY SOLDERING OR WELDING; CUTTING BY APPLYING HEAT LOCALLY, e.g. FLAME CUTTING; WORKING BY LASER BEAM
- B23K37/00—Auxiliary devices or processes, not specially adapted to a procedure covered by only one of the preceding main groups
- B23K37/02—Carriages for supporting the welding or cutting element
- B23K37/0252—Steering means
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1628—Programme controls characterised by the control loop
- B25J9/163—Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1664—Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1664—Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
- B25J9/1666—Avoiding collision or forbidden zones
Landscapes
- Engineering & Computer Science (AREA)
- Mechanical Engineering (AREA)
- Robotics (AREA)
- Physics & Mathematics (AREA)
- Optics & Photonics (AREA)
- Feedback Control In General (AREA)
- Manipulator (AREA)
- Numerical Control (AREA)
Abstract
The invention discloses a narrow space robot operation planning method based on safe reinforcement learning, comprising the following steps: before the mechanical arm moves, a planning task and a target point are set; the expected acceleration is calculated from the current state information of the mechanical arm and the relevant kinematic constraints, and the braking acceleration is calculated at the same time; the expected joint acceleration is tested: if the mechanical arm neither collides nor violates the joint kinematic constraints after the action is executed, the expected acceleration is feasible and is executed as the substitute action; otherwise, the calculated braking acceleration is executed as the substitute action; the substitute actions of all joints of the mechanical arm form its feasible action space; a deep reinforcement learning algorithm then plans a motion trajectory for the mechanical arm in this action space and obtains an optimal strategy. By combining the idea of substitute actions and redesigning the action space used for reinforcement learning training, the invention further guarantees the safety of the planning results.
Description
Technical Field
The invention relates to the field of robot operation planning research, in particular to a narrow space robot operation planning method for safety reinforcement learning.
Background
In a confined environment with obstacles, the robot is required to move autonomously from its current position to a given position quickly and without collision. Given the start and end positions, a path that satisfies certain constraints, such as being collision-free and respecting kinematic conditions, must be found in the robot's workspace, and the path should also be as short as possible. In path planning, the obstacle space is first modeled, the robot is placed in that space, and a traditional planning algorithm such as a genetic algorithm or the artificial potential field method, or a deep reinforcement learning algorithm, is used for training. The computational complexity of planning with these algorithms grows exponentially in high-dimensional settings, which often makes real-time planning difficult. Safe reinforcement learning, a derivative of reinforcement learning, obeys safety constraints both during the learning stage and at deployment. When the environment consists of a controllable robot and static obstacles whose shapes and positions are known, constraints such as collision avoidance and the relevant kinematic limits are taken into account during training; applying the concept of substitute safe behaviors to reinforcement learning greatly improves the feasibility of the planning results, and the method is applicable to high-dimensional robot systems.
Disclosure of Invention
The invention aims to provide a narrow space robot operation planning method for safety reinforcement learning, which is used for further improving the safety of planning results.
In order to realize the tasks, the invention adopts the following technical scheme:
a narrow space robot operation planning method for safety reinforcement learning comprises the following steps:
Before the mechanical arm moves, setting a planning task and a target point;
According to the current state information of the mechanical arm and the relevant kinematic constraints, calculating the expected acceleration a_{t+1,N} and, at the same time, the braking acceleration a_{t+1,B}, thereby constructing the feasible action space of the mechanical arm, comprising:
Defining kinematic constraints of the joint;
Detecting, at discrete time points, the minimum distance between the robot and the obstacles and between the links of the mechanical arm itself to determine the collision condition; if the minimum distance is smaller than a preset safe-distance threshold, the configuration is regarded as a collision;
Acquiring the state information of the mechanical arm through sensors built into the pybullet environment;
Establishing a neural network as an action prediction network for predicting the action at the next moment; the state information of the joints is input into the action prediction network, the corresponding action scalar m_{t+1} ∈ [-1, 1] of each joint is predicted and then mapped onto the safe acceleration range to obtain the expected acceleration a_{t+1,N} of the joint, where a_{t+1,min} and a_{t+1,max} are respectively the minimum and maximum safe accelerations of the joint; knowing the expected acceleration, the velocity and position of the joint at the next time t+1 can be obtained;
Calculating the braking acceleration: when the joint velocity v_t at the current time t is greater than 0, taking m'_{t+1} = 2*m_{t+1} - 1, otherwise taking m'_{t+1} = 2*(1 - m_{t+1}) - 1, and using m'_{t+1} to calculate the braking acceleration a_{t+1,B};
Testing the expected acceleration a_{t+1,N} of the joint: if the mechanical arm does not collide after the action is executed and the defined kinematic constraints of the joint are not violated, the expected acceleration a_{t+1,N} is feasible and is executed as the substitute action; otherwise, the calculated braking acceleration a_{t+1,B} is executed as the substitute action; the expected acceleration a_{t+1,N} calculated for each joint is followed by braking; starting from the state information corresponding to the current time t, if no collision occurs after the corresponding action is executed, the behavior is safe, otherwise the motion is stopped;
The substitute actions of the joints of the mechanical arm form its feasible action space; a motion trajectory is then planned for the mechanical arm in this action space using a deep reinforcement learning algorithm, and an optimal strategy is obtained.
Further, the target point is a welding starting point, and the planning task is to plan a safe path so that the tail end of the mechanical arm moves to the welding starting point.
Further, the status information includes a position, a velocity, an acceleration, and a distance from the obstacle for each joint.
Further, in order to prevent oscillation during movement, take a_{t+1,max} = m'_{t+1} * (a_{t+1,max} - a_{t+1,min}) and a_{t+1,min} = a_{t+1,min} + (1 - m'_{t+1}) * (a_{t+1,max} - a_{t+1,min}).
Further, the deep reinforcement learning algorithm includes:
Setting an Actor network and a Critic network as the reinforcement learning networks, wherein the loss function used to update the Actor adopts an adaptive KL penalty coefficient, the Critic is updated with the TD-error, the hidden layers use swish as the activation function, and the output layer uses tanh as the activation function;
Training path planning under the action space;
setting a training ending condition: when the end of the mechanical arm reaches the preset target point several times in succession, planning is considered successful and training is stopped.
Further, the input of the deep reinforcement learning algorithm is the state information s_t of the mechanical arm, and an Actor network and a Critic network are set up for training. The network structure is 400 × 300 × 10 × 1; the hidden layers all use swish as the activation function, the output layer of the Actor network uses tanh as the activation function, and the output action range is [-1, 1].
Further, training of path planning is performed in this action space to obtain the expected action of the mechanical arm, i.e. the action in the action space that maximizes the Q value and is executed by the mechanical arm, where the Q value is the action-value in reinforcement learning and represents the expected cumulative reward up to the final state after the robot selects that action; each executed action yields a corresponding reward value, and when the reward becomes stable and converges the planning is considered successful and training is stopped, the strategy obtained by training being the optimal strategy.
Further, the reward function for deep reinforcement learning is R = R_target - R_action - R_adaptation - R_distance, comprising four terms: the first term R_target is a reward term based on the distance from the end of the mechanical arm to the target point, used to train the arm to approach the target point; the second term R_action is an action penalty term that discourages actions too close to the limits; the third term R_adaptation is a braking penalty term, equal to 1 when the action would collide and the braking action is executed, and 0 otherwise; the fourth term R_distance is a distance penalty term: if, after the substitute action is executed, the distance between the links of the mechanical arm or between any link and an obstacle is smaller than a certain threshold, a penalty of 1 is applied, otherwise it is 0.
Compared with the prior art, the invention has the following technical characteristics:
In the method, the trajectory for the current time interval is calculated from the robot's current motion state and the network prediction; if the predicted trajectory satisfies all safety constraints, the predicted motion is executed; otherwise, the braking trajectory calculated in the previous time interval is used as the substitute safe action, yielding the feasible action space. This design of the reinforcement learning action space ensures that all predicted trajectories respect the kinematic joint limits. Compared with existing deep reinforcement learning algorithms, the method combines the idea of substitute actions and redesigns the action space used for reinforcement learning training, further guaranteeing the safety of the planning results.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 shows the planning results of an embodiment of the present invention, with reward curves obtained by training with three algorithms; OURS is the present method, while PPO and DDPG are existing methods; as can be seen from the figure, training with the method of the present invention converges faster and is more stable;
FIG. 3 is a schematic flow chart of the method of the present invention.
Detailed Description
The scheme aims at planning a path in a narrow weld-seam space in an industrial welding scenario; the adopted robot is a six-degree-of-freedom industrial mechanical arm with a welding gun mounted on it for the welding operation.
Referring to the attached drawings, the narrow space robot operation planning method for safety reinforcement learning comprises the following steps:
Step 1, setting a planning task and a target point before the mechanical arm moves; the target point is a welding starting point, and the planning task is to plan a safe path so that the tail end of the mechanical arm moves to the welding starting point.
During training, the state information of the mechanical arm is acquired through interaction with the simulated welding environment; it comprises the kinematic information of the arm joints and the positional relation between the joints and the obstacles. The kinematic information comprises the position, velocity, acceleration and jerk of each arm joint.
Step 2, calculating the expected acceleration a_{t+1,N} according to the current state information of the mechanical arm and the relevant kinematic constraints, and simultaneously calculating the braking acceleration a_{t+1,B}; the method comprises the following steps:
Step 2.1, defining the kinematic constraints of the joints, i.e. the maximum and minimum values of joint position, velocity, acceleration and jerk.
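For illustration, the joint constraints of step 2.1 can be grouped as in the sketch below; the field names are assumptions, not terminology from the patent, and the numerical bounds would come from the robot datasheet.

```python
from dataclasses import dataclass

@dataclass
class JointLimits:
    """Kinematic constraints of one joint (step 2.1): bounds on position,
    velocity, acceleration and jerk."""
    pos_min: float
    pos_max: float
    vel_min: float
    vel_max: float
    acc_min: float
    acc_max: float
    jerk_min: float
    jerk_max: float

def within_limits(lim: JointLimits, pos: float, vel: float, acc: float, jerk: float) -> bool:
    """True when a joint state respects all of its kinematic constraints."""
    return (lim.pos_min <= pos <= lim.pos_max and
            lim.vel_min <= vel <= lim.vel_max and
            lim.acc_min <= acc <= lim.acc_max and
            lim.jerk_min <= jerk <= lim.jerk_max)
```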
Step 2.2, to avoid collision, obstacle-link pairs and link-link pairs are defined; the minimum distance d between the robot and the obstacles, and between the links of the mechanical arm themselves, is detected at discrete time points to determine the collision condition, and if the minimum distance d is smaller than a preset safe-distance threshold d_S, the configuration is regarded as a collision.
Step 2.3, the simulation is carried out in the pybullet environment, and the state information of the mechanical arm is obtained through the sensors built into pybullet; the state comprises, for each joint, its position, velocity, acceleration and distance to the obstacles.
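A minimal sketch of the distance-based collision check of steps 2.2-2.3 using the pybullet API is given below; the body and link identifiers, the candidate pairs and the numerical threshold are illustrative, and a connected physics server with loaded bodies is assumed.

```python
import pybullet as p

SAFE_DISTANCE = 0.02  # d_S, preset safe-distance threshold in metres (illustrative value)

def min_link_distance(body_a: int, body_b: int, link_a: int = -1, link_b: int = -1) -> float:
    """Minimum distance d between two links/bodies at the current discrete time point;
    pt[8] is the contact distance returned by pybullet."""
    pts = p.getClosestPoints(bodyA=body_a, bodyB=body_b, distance=10.0,
                             linkIndexA=link_a, linkIndexB=link_b)
    return min((pt[8] for pt in pts), default=float("inf"))

def in_collision(robot_id: int, obstacle_link_pairs, self_link_pairs) -> bool:
    """Collision condition of step 2.2: any obstacle-link or link-link distance below d_S."""
    for obstacle_id, link in obstacle_link_pairs:      # obstacle-link pairs
        if min_link_distance(robot_id, obstacle_id, link_a=link) < SAFE_DISTANCE:
            return True
    for link_a, link_b in self_link_pairs:             # link-link pairs of the arm itself
        if min_link_distance(robot_id, robot_id, link_a=link_a, link_b=link_b) < SAFE_DISTANCE:
            return True
    return False
```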
A neural network is established as the action prediction network for predicting the action at the next moment; its hidden layers use SELU as the activation function, the first hidden layer having 256 units and the second 128 units. The state information of the joints is input into the action prediction network, the corresponding action scalar m_{t+1} ∈ [-1, 1] of each joint is predicted and then mapped onto the safe acceleration range to obtain the expected acceleration a_{t+1,N} of the joint, where a_{t+1,min} and a_{t+1,max} are respectively the minimum and maximum safe accelerations of the joint; knowing the acceleration, the velocity and position of the joint at the next time t+1 can be obtained.
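As an illustration, a sketch of the action prediction network and of one possible mapping from the action scalar to the expected acceleration follows; the tanh output squashing and the linear form of the mapping are assumptions, since the exact formula is not reproduced in the text above.

```python
import torch
import torch.nn as nn

class ActionPredictionNet(nn.Module):
    """Predicts the action scalar m_{t+1} in [-1, 1] for each joint from the joint states.
    Hidden sizes (256, 128) and SELU activations follow the description; the tanh output
    layer is an assumption used to keep m in [-1, 1]."""
    def __init__(self, state_dim: int, n_joints: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.SELU(),
            nn.Linear(256, 128), nn.SELU(),
            nn.Linear(128, n_joints), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def expected_acceleration(m: float, a_min: float, a_max: float) -> float:
    """Assumed linear mapping of m_{t+1} in [-1, 1] onto the safe range
    [a_{t+1,min}, a_{t+1,max}]."""
    return a_min + 0.5 * (m + 1.0) * (a_max - a_min)
```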
Step 2.4, calculating the braking acceleration: when the joint velocity v_t at the current time t is greater than 0, take m'_{t+1} = 2*m_{t+1} - 1, otherwise take m'_{t+1} = 2*(1 - m_{t+1}) - 1, and use m'_{t+1} to compute the braking acceleration a_{t+1,B}; at the same time, to prevent oscillation during the movement, take a_{t+1,max} = m'_{t+1} * (a_{t+1,max} - a_{t+1,min}) and a_{t+1,min} = a_{t+1,min} + (1 - m'_{t+1}) * (a_{t+1,max} - a_{t+1,min}).
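The sketch below transcribes the m'_{t+1} rule and the anti-oscillation bound adjustment of step 2.4; how the braking acceleration a_{t+1,B} is finally obtained from m'_{t+1} is not spelled out above, so only these intermediate quantities are shown.

```python
def braking_scalar(m_next: float, v_t: float) -> float:
    """m'_{t+1}: 2*m_{t+1} - 1 when the current joint velocity v_t > 0,
    otherwise 2*(1 - m_{t+1}) - 1."""
    return 2.0 * m_next - 1.0 if v_t > 0 else 2.0 * (1.0 - m_next) - 1.0

def adjusted_bounds(m_prime: float, a_min: float, a_max: float) -> tuple[float, float]:
    """Anti-oscillation adjustment of the safe acceleration bounds, transcribing the
    relation given in step 2.4; the braking acceleration a_{t+1,B} is then computed
    from m'_{t+1} within these bounds."""
    new_max = m_prime * (a_max - a_min)
    new_min = a_min + (1.0 - m_prime) * (a_max - a_min)
    return new_min, new_max
```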
Step 3, testing the expected acceleration a_{t+1,N} of the joint: if the mechanical arm does not collide after the action is executed and the kinematic constraints of the joint defined in step 2.1 are not violated, the expected acceleration a_{t+1,N} is feasible and is executed as the substitute action; otherwise, the calculated braking acceleration a_{t+1,B} is executed as the substitute action.
To ensure that a safe and feasible action exists at the next time t+1, the calculated expected acceleration a_{t+1,N} of each joint is followed by braking; starting from the state information corresponding to the current time t, i.e. the position, velocity and acceleration of the arm joints at the current moment, if no collision occurs after the corresponding action is executed, the action is safe; otherwise the motion is stopped and the method returns to step 2 for re-prediction.
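A compact sketch of the substitute-action filter of step 3, assuming helper callables for the one-step simulation, the collision check and the joint-limit check (their names are placeholders):

```python
def substitute_action(a_desired, a_braking, sim_step, in_collision, violates_limits):
    """Safety filter of step 3: try the desired acceleration a_{t+1,N} in simulation first;
    keep it only if the resulting state is collision-free and within the joint limits,
    otherwise fall back to the braking acceleration a_{t+1,B}."""
    next_state = sim_step(a_desired)                     # roll the candidate action forward
    if not in_collision(next_state) and not violates_limits(next_state):
        return a_desired                                 # desired acceleration is feasible
    return a_braking                                     # substitute the braking action
```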
Step 4, through steps 2 to 3, the substitute action of each joint of the mechanical arm is obtained, and these substitute actions form the feasible action space of the mechanical arm. A motion trajectory is then planned for the mechanical arm in this action space using a deep reinforcement learning algorithm, and an optimal strategy is obtained. During reinforcement learning training, one action is selected from the action space and executed, and the quality of the action is reflected by the value of the reward.
The deep reinforcement learning algorithm comprises:
Step 4.1, setting an Actor network and a Critic network as the reinforcement learning networks, wherein the loss function used to update the Actor adopts an adaptive KL penalty coefficient, the Critic is updated with the TD-error, the hidden layers use swish as the activation function, and the output layer uses tanh as the activation function.
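A sketch of the update rules named in step 4.1, assuming the standard forms of the adaptive-KL-penalty actor objective and the TD-error; the KL target and adaptation factors are assumptions, not values from the patent.

```python
import torch

def actor_loss_adaptive_kl(log_prob_new, log_prob_old, advantage, kl, beta):
    """Surrogate policy loss with an adaptive KL penalty coefficient beta."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    return -(ratio * advantage).mean() + beta * kl

def update_beta(kl: float, beta: float, kl_target: float = 0.01) -> float:
    """Common adaptation rule for the penalty coefficient (assumed target and factors)."""
    if kl > 1.5 * kl_target:
        return beta * 2.0
    if kl < kl_target / 1.5:
        return beta / 2.0
    return beta

def critic_td_error(value_t, reward_t, value_t1, gamma: float = 0.99):
    """TD-error used to update the Critic: delta = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward_t + gamma * value_t1 - value_t
```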
Step 4.2, training path planning under the action space;
Step 4.3, setting a training ending condition: training is stopped when the end of the mechanical arm reaches the preset target point several times in succession.
In this embodiment, the input of the deep reinforcement learning algorithm is the state information s_t of the mechanical arm, and an Actor network and a Critic network are set up for training. The network structure is 400 × 300 × 10 × 1; the hidden layers all use swish as the activation function, the output layer of the Actor network uses tanh as the activation function, and the output action range is [-1, 1].
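One possible reading of the 400 × 300 × 10 × 1 structure with swish hidden layers and a tanh-bounded actor output is sketched below; the exact split between Actor and Critic and the observation size are assumptions.

```python
import torch.nn as nn

def mlp(sizes, out_act=None):
    """Stack of fully connected layers with swish (SiLU) hidden activations."""
    layers = []
    for i in range(len(sizes) - 2):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.SiLU()]
    layers.append(nn.Linear(sizes[-2], sizes[-1]))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

# One reading of 400 x 300 x 10 x 1: 400-, 300- and 10-unit hidden layers and a single output;
# the actor output is squashed to [-1, 1] with tanh as stated in the text.
state_dim = 26                                       # illustrative observation size
actor = mlp([state_dim, 400, 300, 10, 1], out_act=nn.Tanh())
critic = mlp([state_dim, 400, 300, 10, 1])           # value estimate, no output activation
```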
Training of path planning is carried out in the redesigned action space: the current joint positions, velocities and accelerations of the mechanical arm and the distance between the end of the arm and the target point are input into the reinforcement learning network, yielding the expected action of the mechanical arm, i.e. the action in the action space that maximizes the Q value and is executed by the arm; the Q value is the action-value in reinforcement learning, used to evaluate the action, and represents the expected cumulative reward up to the final state after the agent selects that action. Each executed action yields a corresponding reward value, and when the reward becomes stable and converges the planning is considered successful and training is stopped, the strategy obtained by training being the optimal strategy.
The reward function for deep reinforcement learning is R = R_target - R_action - R_adaptation - R_distance, comprising four terms: the first term R_target is a reward term based on the distance from the end of the mechanical arm to the target point, used to train the arm to approach the target point; the second term R_action is an action penalty term that discourages actions too close to the limits; the third term R_adaptation is a braking penalty term, equal to 1 when the action would collide and the braking action is executed, and 0 otherwise; the fourth term R_distance is a distance penalty term: if, after the substitute action is executed, the distance between the links of the mechanical arm or between any link and an obstacle is smaller than a certain threshold, a penalty of 1 is applied, otherwise it is 0.
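A sketch of the four-term reward follows; the functional forms of the target-distance and action-penalty terms and the numerical thresholds are not specified above, so the ones below are illustrative assumptions.

```python
def reward(dist_to_target: float, action: float, braking_executed: bool,
           min_clearance: float, action_limit: float = 1.0,
           clearance_threshold: float = 0.02) -> float:
    """R = R_target - R_action - R_adaptation - R_distance (assumed term forms)."""
    r_target = -dist_to_target                       # reward approaching the welding start point
    r_action = (abs(action) / action_limit) ** 2     # penalise actions close to the limits
    r_adaptation = 1.0 if braking_executed else 0.0  # brake penalty when the braking action replaces the desired one
    r_distance = 1.0 if min_clearance < clearance_threshold else 0.0  # clearance penalty
    return r_target - r_action - r_adaptation - r_distance
```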
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.
Claims (6)
1. The narrow space robot operation planning method for safety reinforcement learning is characterized by comprising the following steps of:
Before the mechanical arm moves, setting a planning task and a target point;
According to the current state information of the mechanical arm and the relevant kinematic constraints, calculating the expected acceleration a_{t+1,N} and, at the same time, the braking acceleration a_{t+1,B}, thereby constructing the feasible action space of the mechanical arm, comprising:
Defining kinematic constraints of the joint;
Detecting, at discrete time points, the minimum distance between the robot and the obstacles and between the links of the mechanical arm itself to determine the collision condition; if the minimum distance is smaller than a preset safe-distance threshold, the configuration is regarded as a collision;
Acquiring the state information of the mechanical arm through sensors built into the pybullet environment;
Establishing a neural network as an action prediction network for predicting the action at the next moment; the state information of the joints is input into the action prediction network, the corresponding action scalar m_{t+1} ∈ [-1, 1] of each joint is predicted and then mapped onto the safe acceleration range to obtain the expected acceleration a_{t+1,N} of the joint, where a_{t+1,min} and a_{t+1,max} are respectively the minimum and maximum safe accelerations of the joint; knowing the expected acceleration, the velocity and position of the joint at the next time t+1 can be obtained;
Calculating the braking acceleration: when the joint velocity v_t at the current time t is greater than 0, taking m'_{t+1} = 2*m_{t+1} - 1, otherwise taking m'_{t+1} = 2*(1 - m_{t+1}) - 1, and using m'_{t+1} to calculate the braking acceleration a_{t+1,B};
Testing the expected acceleration a_{t+1,N} of the joint: if the mechanical arm does not collide after the action is executed and the defined kinematic constraints of the joint are not violated, the expected acceleration a_{t+1,N} is feasible and is executed as the substitute action; otherwise, the calculated braking acceleration a_{t+1,B} is executed as the substitute action; the expected acceleration a_{t+1,N} calculated for each joint is followed by braking; starting from the state information corresponding to the current time t, if no collision occurs after the corresponding action is executed, the behavior is safe, otherwise the motion is stopped;
The substitute actions of the joints of the mechanical arm form its feasible action space; a motion trajectory is planned for the mechanical arm in this action space using a deep reinforcement learning algorithm, and an optimal strategy is obtained;
The deep reinforcement learning algorithm comprises:
Setting an Actor network and a Critic network as the reinforcement learning networks, wherein the loss function used to update the Actor adopts an adaptive KL penalty coefficient, the Critic is updated with the TD-error, the hidden layers use swish as the activation function, and the output layer uses tanh as the activation function;
Training path planning under the action space;
setting a training ending condition: training is stopped when the end of the mechanical arm reaches the preset target point several times in succession;
Training path planning in this action space yields the expected action of the mechanical arm, i.e. the action in the action space that maximizes the Q value and is executed by the mechanical arm, where the Q value is the action-value in reinforcement learning and represents the expected cumulative reward up to the final state after the robot selects that action; each executed action yields a corresponding reward value, and when the reward becomes stable and converges the planning is considered successful and training is stopped, the strategy obtained by training being the optimal strategy.
2. The method of claim 1, wherein the target point is a welding start point and the planning task is to plan a safe path so that the end of the arm moves to the welding start point.
3. The method of claim 1, wherein the status information includes a position, a speed, an acceleration, and a distance from an obstacle for each joint.
4. The narrow space robot operation planning method for safety reinforcement learning according to claim 1, wherein, to prevent oscillation during the movement, a_{t+1,max} = m'_{t+1} * (a_{t+1,max} - a_{t+1,min}) and a_{t+1,min} = a_{t+1,min} + (1 - m'_{t+1}) * (a_{t+1,max} - a_{t+1,min}) are taken.
5. The narrow space robot operation planning method for safety reinforcement learning according to claim 1, wherein the input of the deep reinforcement learning algorithm is the state information of the mechanical arm, and an Actor network and a Critic network are set up for training; the network structure is 400 × 300 × 10 × 1, the hidden layers all use swish as the activation function, the output layer of the Actor network uses tanh as the activation function, and the output action range is [-1, 1].
6. The narrow space robot operation planning method according to claim 1, wherein the reward function for the deep reinforcement learning is R = R_target - R_action - R_adaptation - R_distance, comprising four terms: the first term R_target is a reward term based on the distance from the end of the mechanical arm to the target point, used to train the arm to approach the target point; the second term R_action is an action penalty term that discourages actions too close to the limits; the third term R_adaptation is a braking penalty term, equal to 1 when the action would collide and the braking action is executed, and 0 otherwise; the fourth term R_distance is a distance penalty term: if, after the substitute action is executed, the distance between the links of the mechanical arm or between any link and an obstacle is smaller than a certain threshold, a penalty of 1 is applied, otherwise it is 0.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210930544.2A CN115178944B (en) | 2022-08-04 | 2022-08-04 | Narrow space robot operation planning method for safety reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210930544.2A CN115178944B (en) | 2022-08-04 | 2022-08-04 | Narrow space robot operation planning method for safety reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115178944A CN115178944A (en) | 2022-10-14 |
CN115178944B true CN115178944B (en) | 2024-05-24 |
Family
ID=83520672
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210930544.2A Active CN115178944B (en) | 2022-08-04 | 2022-08-04 | Narrow space robot operation planning method for safety reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115178944B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116551703B (en) * | 2023-07-12 | 2023-09-12 | 长春工业大学 | Motion planning method based on machine learning in complex environment |
CN116834018B (en) * | 2023-08-07 | 2024-08-27 | 南京云创大数据科技股份有限公司 | Training method and training device for multi-mechanical arm multi-target searching |
CN116900539B (en) * | 2023-09-14 | 2023-12-19 | 天津大学 | Multi-robot task planning method based on graph neural network and reinforcement learning |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106457565A (en) * | 2014-06-03 | 2017-02-22 | 阿蒂迈兹机器人技术有限公司 | Method and system for programming a robot |
CN110315258A (en) * | 2019-07-24 | 2019-10-11 | 广东工业大学 | A kind of welding method based on intensified learning and ant group algorithm |
CN110333739A (en) * | 2019-08-21 | 2019-10-15 | 哈尔滨工程大学 | A kind of AUV conduct programming and method of controlling operation based on intensified learning |
CN110370317A (en) * | 2019-07-24 | 2019-10-25 | 广东工业大学 | Robot restorative procedure and device |
CN112454333A (en) * | 2020-11-26 | 2021-03-09 | 青岛理工大学 | Robot teaching system and method based on image segmentation and surface electromyogram signals |
CN113163332A (en) * | 2021-04-25 | 2021-07-23 | 北京邮电大学 | Road sign graph coloring unmanned aerial vehicle energy-saving endurance data collection method based on metric learning |
CN114708293A (en) * | 2022-03-22 | 2022-07-05 | 广东工业大学 | Robot motion estimation method based on deep learning point-line feature and IMU tight coupling |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220105629A1 (en) * | 2021-12-16 | 2022-04-07 | Venkat Natarajan | Failure rate estimation and reinforcement learning safety factor systems |
- 2022-08-04: CN application CN202210930544.2A filed, granted as patent CN115178944B (active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106457565A (en) * | 2014-06-03 | 2017-02-22 | 阿蒂迈兹机器人技术有限公司 | Method and system for programming a robot |
CN110315258A (en) * | 2019-07-24 | 2019-10-11 | 广东工业大学 | A kind of welding method based on intensified learning and ant group algorithm |
CN110370317A (en) * | 2019-07-24 | 2019-10-25 | 广东工业大学 | Robot restorative procedure and device |
CN110333739A (en) * | 2019-08-21 | 2019-10-15 | 哈尔滨工程大学 | A kind of AUV conduct programming and method of controlling operation based on intensified learning |
CN112454333A (en) * | 2020-11-26 | 2021-03-09 | 青岛理工大学 | Robot teaching system and method based on image segmentation and surface electromyogram signals |
CN113163332A (en) * | 2021-04-25 | 2021-07-23 | 北京邮电大学 | Road sign graph coloring unmanned aerial vehicle energy-saving endurance data collection method based on metric learning |
CN114708293A (en) * | 2022-03-22 | 2022-07-05 | 广东工业大学 | Robot motion estimation method based on deep learning point-line feature and IMU tight coupling |
Also Published As
Publication number | Publication date |
---|---|
CN115178944A (en) | 2022-10-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115178944B (en) | Narrow space robot operation planning method for safety reinforcement learning | |
CN108621165B (en) | Optimal trajectory planning method for industrial robot dynamics performance in obstacle environment | |
WO2022088593A1 (en) | Robotic arm control method and device, and human-machine cooperation model training method | |
Liu et al. | Algorithmic safety measures for intelligent industrial co-robots | |
Nicolis et al. | Human intention estimation based on neural networks for enhanced collaboration with robots | |
CN113485323B (en) | Flexible formation method for cascading multiple mobile robots | |
CN115091469B (en) | Depth reinforcement learning mechanical arm motion planning method based on maximum entropy frame | |
CN111702766B (en) | Mechanical arm self-adaptive door opening screwing method based on force sense guidance | |
Bejar et al. | Reverse parking a car-like mobile robot with deep reinforcement learning and preview control | |
Sehgal et al. | Automatic parameter optimization using genetic algorithm in deep reinforcement learning for robotic manipulation tasks | |
Chen et al. | New approach to intelligent control systems with self-exploring process | |
CN115542733A (en) | Self-adaptive dynamic window method based on deep reinforcement learning | |
Brandao et al. | Multi-controller multi-objective locomotion planning for legged robots | |
CN116551703B (en) | Motion planning method based on machine learning in complex environment | |
Samsonov et al. | Using Reinforcement Learning for Optimization of a Workpiece Clamping Position in a Machine Tool. | |
CN111984000A (en) | Method and device for automatically influencing an actuator | |
Tzafestas et al. | Fuzzy reinforcement learning control for compliance tasks of robotic manipulators | |
CN114115341B (en) | Intelligent agent cluster cooperative motion method and system | |
CN114055479A (en) | Dragging teaching spraying robot collision early warning method, medium and equipment | |
CN113467465A (en) | Human-in-loop decision modeling and control method for robot system | |
CN113189986A (en) | Two-stage self-adaptive behavior planning method and system for autonomous robot | |
Chen et al. | Mitigating Imminent Collision for Multi-robot Navigation: A TTC-force Reward Shaping Approach | |
Li et al. | Manipulator Motion Planning based on Actor-Critic Reinforcement Learning | |
Young et al. | Enhancing Robotic Navigation: An Evaluation of Single and Multi-Objective Reinforcement Learning Strategies | |
Wawrzyński | Autonomous reinforcement learning with experience replay for humanoid gait optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |