CN115178944A - Narrow space robot operation planning method for safety reinforcement learning - Google Patents
Narrow space robot operation planning method for safety reinforcement learning
- Publication number
- CN115178944A (application CN202210930544.2A)
- Authority
- CN
- China
- Prior art keywords
- action
- mechanical arm
- acceleration
- reinforcement learning
- motion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B23—MACHINE TOOLS; METAL-WORKING NOT OTHERWISE PROVIDED FOR
- B23K—SOLDERING OR UNSOLDERING; WELDING; CLADDING OR PLATING BY SOLDERING OR WELDING; CUTTING BY APPLYING HEAT LOCALLY, e.g. FLAME CUTTING; WORKING BY LASER BEAM
- B23K37/00—Auxiliary devices or processes, not specially adapted to a procedure covered by only one of the preceding main groups
- B23K37/02—Carriages for supporting the welding or cutting element
- B23K37/0211—Carriages for supporting the welding or cutting element travelling on a guide member, e.g. rail, track
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B23—MACHINE TOOLS; METAL-WORKING NOT OTHERWISE PROVIDED FOR
- B23K—SOLDERING OR UNSOLDERING; WELDING; CLADDING OR PLATING BY SOLDERING OR WELDING; CUTTING BY APPLYING HEAT LOCALLY, e.g. FLAME CUTTING; WORKING BY LASER BEAM
- B23K37/00—Auxiliary devices or processes, not specially adapted to a procedure covered by only one of the preceding main groups
- B23K37/02—Carriages for supporting the welding or cutting element
- B23K37/0252—Steering means
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1628—Programme controls characterised by the control loop
- B25J9/163—Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1664—Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1664—Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
- B25J9/1666—Avoiding collision or forbidden zones
Landscapes
- Engineering & Computer Science (AREA)
- Mechanical Engineering (AREA)
- Robotics (AREA)
- Physics & Mathematics (AREA)
- Optics & Photonics (AREA)
- Manipulator (AREA)
- Feedback Control In General (AREA)
- Numerical Control (AREA)
Abstract
The invention discloses a narrow space robot operation planning method for safety reinforcement learning, which comprises the following steps: setting a planning task and a target point before the mechanical arm moves; calculating the expected acceleration and the braking acceleration according to the current state information of the mechanical arm and the relevant kinematic constraints; testing the expected joint acceleration: if the mechanical arm neither collides nor violates the joint kinematic constraints after the action is executed, the expected acceleration is feasible and is executed as the substitute action, otherwise the calculated braking acceleration is executed as the substitute action; the substitute actions of the joints of the mechanical arm form the feasible action space of the mechanical arm; and a deep reinforcement learning algorithm plans a motion trajectory for the mechanical arm in this action space and obtains the optimal strategy. The invention incorporates the idea of substitute actions and redesigns the action space used for reinforcement learning training, further guaranteeing the safety of the planning result.
Description
Technical Field
The invention relates to the field of robot operation planning research, in particular to a narrow space robot operation planning method for safety reinforcement learning.
Background
In an environment constrained by obstacles, the robot must move autonomously, quickly and without collision from its current position to a given position. Given a start position and an end position, a path satisfying certain constraints must be found in the robot's workspace: it must be collision-free, satisfy the kinematic conditions, and be as short as possible. When planning a path, a model of the obstacle space is first built, the robot is placed in that space, and a traditional planning algorithm such as a genetic algorithm or the artificial potential field method, or a deep reinforcement learning algorithm, is used for training. The planning complexity of these algorithms grows exponentially in high-dimensional cases, and real-time planning is often difficult to achieve. Safe reinforcement learning, a derivative of reinforcement learning, observes safety constraints both during the learning stage and during deployment. When the environment consists of a controllable robot and static obstacles whose shapes and positions are known, constraints such as collision avoidance and the relevant kinematic limits are considered during training, and the concept of substitute safe behaviors is applied to reinforcement learning; this greatly improves the feasibility of the planning result and is suitable for high-dimensional robot systems.
Disclosure of Invention
The invention aims to provide a narrow space robot operation planning method for safety reinforcement learning, which is used for further improving the safety of a planning result.
In order to realize the task, the invention adopts the following technical scheme:
a narrow space robot operation planning method for safety reinforcement learning comprises the following steps:
setting a planning task and a target point before the movement of the mechanical arm;
calculating the expected acceleration a_{t+1,N} and the braking acceleration a_{t+1,B} according to the current state information of the mechanical arm and the relevant kinematic constraints, thereby constructing a feasible action space of the mechanical arm, comprising the following steps:
defining kinematic constraints for the joints;
detecting, at discrete time points, the minimum distance between the robot and the obstacles and between the links of the mechanical arm to determine the collision condition, and determining that a collision occurs if the minimum distance is smaller than a preset safety distance threshold;
acquiring state information of the mechanical arm through the sensors built into the pybullet environment;
establishing a neural network as a motion prediction network for predicting the motion at the next moment, inputting the state information of the joints into the motion prediction network, and predicting the motion scalar m_{t+1} ∈ [-1, 1] corresponding to each joint; the expected acceleration a_{t+1,N} of the joint is then obtained from the formula a_{t+1,N} = a_{t+1,min} + (1 + m_{t+1})/2 · (a_{t+1,max} - a_{t+1,min}), where a_{t+1,min} and a_{t+1,max} are respectively the minimum and maximum safe accelerations of the joint; knowing the expected acceleration, the velocity and position of the joint at the next time t+1 can be obtained;
calculating the braking acceleration: when the joint velocity v_t corresponding to the current time t satisfies v_t > 0, take m'_{t+1} = 2·m_{t+1} - 1, otherwise take m'_{t+1} = 2·(1 - m_{t+1}) - 1; substituting m'_{t+1} into a_{t+1,B} = a_{t+1,min} + (1 + m'_{t+1})/2 · (a_{t+1,max} - a_{t+1,min}) gives the braking acceleration;
testing the expected acceleration a_{t+1,N} of the joint: if the mechanical arm does not collide and does not violate the defined joint kinematic constraints after the action is executed, the expected acceleration a_{t+1,N} is feasible and is executed as the substitute action; otherwise the calculated braking acceleration a_{t+1,B} is executed as the substitute action; braking is performed after the expected acceleration a_{t+1,N} calculated for each joint; starting from the state information corresponding to the current time t, if no collision occurs after the corresponding action is executed, the behavior is safe, otherwise the motion is stopped;
the substitute actions of the joints of the mechanical arm form the feasible action space of the mechanical arm; and planning a motion trajectory for the mechanical arm in this action space by using a deep reinforcement learning algorithm and obtaining the optimal strategy.
Further, the target point is a welding starting point, and the planning task is to plan a safe path so that the tail end of the mechanical arm moves to the welding starting point.
Further, the state information includes a position, a velocity, an acceleration, and a distance from the obstacle of each joint.
Further, in order to prevent oscillation during the motion, a_{t+1,max} = m'_{t+1} · (a_{t+1,max} - a_{t+1,min}) and a_{t+1,min} = a_{t+1,min} + (1 - m'_{t+1}) · (a_{t+1,max} - a_{t+1,min}) are taken.
Further, the deep reinforcement learning algorithm includes:
setting an Actor network and a Critic network as the reinforcement learning networks, wherein the Actor is updated with a loss function using an adaptive KL penalty coefficient, the Critic is updated with TD-error, the hidden layers use swish as the activation function, and the output layer uses tanh as the activation function;
performing path planning training in the action space;
and setting a training termination condition: when the end of the mechanical arm reaches the preset target point several consecutive times, the planning is regarded as successful and the training is stopped.
Further, the input of the deep reinforcement learning algorithm is the state information s_t of the mechanical arm, and an Actor network and a Critic network are set up for training. The network structure is 400 × 300 × 10 × 1; all hidden layers use swish as the activation function, the output layer of the Actor network uses tanh as the activation function, and the output action range is [-1, 1].
Further, the path planning training is performed in the action space to obtain the expected action of the mechanical arm, namely the action in the action space that maximizes the Q value; the Q value is the action-value function in reinforcement learning and represents the expected sum of rewards from selecting the action until the final state is reached; each executed action yields a corresponding reward value, and when the reward converges stably, the planning is considered successful and the training is stopped, the strategy obtained through the training being the optimal strategy.
Further, the reward function for deep reinforcement learning is R = R_target - R_action - R_adaptation - R_distance, comprising four terms. The first term R_target is a reward term based on the distance from the end of the mechanical arm to the target point, used to train the mechanical arm to approach the target point; the second term R_action is an action penalty term used to prevent the action from getting too close to its limit values; the third term R_adaptation is a braking penalty term, equal to 1 when the action leads to a collision and the braking action is executed, and 0 otherwise; the fourth term R_distance is a distance penalty term: after the substitute action is executed, if the distance between the links of the mechanical arm or the distance between any link and an obstacle is smaller than a certain threshold, a penalty of 1 is applied, otherwise it is 0.
Compared with the prior art, the invention has the following technical characteristics:
the method comprises the steps of calculating the track of the current time interval according to the current motion state and network prediction of the robot, and predicting motion if the predicted track conforms to all safety constraints; otherwise, the braking trajectory calculated in the previous time interval is taken as a substitute safety action, and a feasible action space is obtained. The design of the reinforcement learning motion space ensures that all predicted trajectories meet the kinematic joint limits. Compared with the existing deep reinforcement learning algorithm, the method combines the thought of replacing actions, redesigns the action space for reinforcement learning training, and further ensures the safety of the planning result.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 shows the planning results of an embodiment of the present invention, namely the reward curves obtained by training with three algorithms; OURS is the present method, while PPO and DDPG are existing methods; as can be seen from the figure, training with the method of the invention converges more quickly and is more stable;
FIG. 3 is a schematic flow chart of the method of the present invention.
Detailed Description
The technical scheme aims at path planning in a narrow weld-seam space in an industrial welding scene; the adopted robot is a six-degree-of-freedom industrial mechanical arm fitted with a welding gun for the welding operation.
Referring to the attached drawings, the narrow space robot operation planning method for safety reinforcement learning comprises the following steps:
Step 1, obtaining the state information of the mechanical arm from the simulated welding environment during the training process, including the kinematic information of the mechanical arm joints and the positional relation to the obstacles. The state information is obtained through the interaction between the robot and the simulated welding environment during training. The kinematic information includes the position, velocity, acceleration and jerk of the joints of the mechanical arm.
Step 2, calculating the expected acceleration a_{t+1,N} and the braking acceleration a_{t+1,B} according to the current state information of the mechanical arm and the relevant kinematic constraints; this comprises the following steps:
and 2.1, defining kinematic constraints of the joint, wherein the kinematic constraints comprise the constraints of the position, the speed, the acceleration and the jerk speed of the joint, and the constraints comprise the maximum value and the minimum value of the position, the speed, the acceleration and the jerk speed.
Step 2.2, in order to avoid collisions, defining obstacle-link pairs and link-link pairs, and detecting at discrete time points the minimum distance d between the robot and the obstacles and between the links of the mechanical arm to determine the collision condition; if the minimum distance d is smaller than a preset safety distance threshold d_S, a collision is deemed to have occurred.
Step 2.3, simulating in the pybullet environment and acquiring the state information of the mechanical arm through the sensors built into pybullet; the state information includes the position, velocity and acceleration of each joint and its distance to the obstacles.
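For illustration, a minimal sketch of how the joint states and the minimum obstacle distance of step 2.2 could be queried in pybullet is given below; the body identifiers, the link handling (only arm-obstacle pairs are shown, link-link pairs are analogous) and the numeric value of the safety threshold d_S are assumptions, not values fixed by this description.

```python
import pybullet as p

D_SAFE = 0.05  # assumed safety distance threshold d_S, in metres

def get_arm_state(robot_id, joint_indices):
    """Read joint positions and velocities (acceleration is not returned directly)."""
    states = [p.getJointState(robot_id, j) for j in joint_indices]
    positions = [s[0] for s in states]
    velocities = [s[1] for s in states]
    return positions, velocities

def min_obstacle_distance(robot_id, obstacle_ids, query_radius=1.0):
    """Minimum distance between the arm and any obstacle within query_radius (step 2.2)."""
    d_min = float("inf")
    for obs in obstacle_ids:
        for pt in p.getClosestPoints(robot_id, obs, query_radius):
            d_min = min(d_min, pt[8])  # index 8 of a closest-point tuple is the distance
    return d_min

def collision_detected(robot_id, obstacle_ids):
    """A collision is deemed to have occurred when the minimum distance falls below d_S."""
    return min_obstacle_distance(robot_id, obstacle_ids) < D_SAFE
```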
Establishing a neural network as a motion prediction network for predicting the motion at the next moment, where the hidden layers use SELU as the activation function, the first hidden layer has size 256 and the second hidden layer has size 128. The state information of the joints is input into the motion prediction network, which predicts the motion scalar m_{t+1} ∈ [-1, 1] corresponding to each joint; the expected acceleration a_{t+1,N} of the joint is then obtained from the formula a_{t+1,N} = a_{t+1,min} + (1 + m_{t+1})/2 · (a_{t+1,max} - a_{t+1,min}), where a_{t+1,min} and a_{t+1,max} are respectively the minimum and maximum safe accelerations of the joint. Knowing the acceleration, the velocity and position of the joint at the next time t+1 can be determined.
Step 2.4, calculating the braking acceleration: when the joint velocity v_t corresponding to the current time t satisfies v_t > 0, take m'_{t+1} = 2·m_{t+1} - 1, otherwise take m'_{t+1} = 2·(1 - m_{t+1}) - 1; substituting m'_{t+1} into a_{t+1,B} = a_{t+1,min} + (1 + m'_{t+1})/2 · (a_{t+1,max} - a_{t+1,min}) gives the braking acceleration. Meanwhile, in order to prevent oscillation during the motion, a_{t+1,max} = m'_{t+1} · (a_{t+1,max} - a_{t+1,min}) and a_{t+1,min} = a_{t+1,min} + (1 - m'_{t+1}) · (a_{t+1,max} - a_{t+1,min}) are taken.
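The acceleration formulas of steps 2.3 and 2.4 translate directly into code; the following sketch only illustrates that arithmetic and assumes the per-joint safe acceleration bounds a_{t+1,min} and a_{t+1,max} are already available.

```python
def expected_acceleration(m_next, a_min, a_max):
    """a_{t+1,N} from the motion scalar m_{t+1} in [-1, 1]."""
    return a_min + (1.0 + m_next) / 2.0 * (a_max - a_min)

def braking_acceleration(m_next, v_t, a_min, a_max):
    """a_{t+1,B}: mirror the scalar depending on the sign of the joint velocity."""
    if v_t > 0:
        m_brake = 2.0 * m_next - 1.0
    else:
        m_brake = 2.0 * (1.0 - m_next) - 1.0
    return a_min + (1.0 + m_brake) / 2.0 * (a_max - a_min), m_brake

def anti_oscillation_bounds(m_brake, a_min, a_max):
    """Tightened acceleration bounds used to suppress oscillation during the motion."""
    new_max = m_brake * (a_max - a_min)
    new_min = a_min + (1.0 - m_brake) * (a_max - a_min)
    return new_min, new_max
```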
Step 3, testing the expected acceleration a_{t+1,N} of the joint: if, after the action is executed, the mechanical arm has not collided and the joint kinematic constraints defined in step 2.1 have not been violated, the expected acceleration a_{t+1,N} is feasible and is executed as the substitute action; otherwise the calculated braking acceleration a_{t+1,B} is executed as the substitute action.
To ensure that a safe and feasible motion exists at the next time t+1, braking is performed after the expected acceleration a_{t+1,N} calculated for each joint. Starting from the state information corresponding to the current time t, namely the position, velocity and acceleration of the joints of the mechanical arm at the current moment, if no collision occurs after the corresponding action is executed, the behavior is safe; otherwise the motion is stopped and the procedure returns to step 2 for a new prediction.
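A compact sketch of the substitute-action selection of step 3 is shown below, reusing the acceleration helpers from the previous sketch; is_safe() stands in for the one-step collision and joint-limit check described above and is an assumed helper, not part of this description.

```python
def substitute_action(m_next, v_t, a_min, a_max, is_safe):
    """Return the action actually executed: the expected acceleration if it is safe,
    otherwise the precomputed braking acceleration (step 3)."""
    a_expected = expected_acceleration(m_next, a_min, a_max)
    a_braking, _ = braking_acceleration(m_next, v_t, a_min, a_max)
    # is_safe(a) simulates one step ahead and checks collisions and joint limits.
    if is_safe(a_expected):
        return a_expected
    return a_braking
```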
Step 4, through steps 2 and 3 the substitute action of each joint of the mechanical arm is obtained, and these substitute actions form the feasible action space of the mechanical arm. A deep reinforcement learning algorithm then plans a motion trajectory for the mechanical arm in this action space and obtains the optimal strategy. During reinforcement learning training, an action is selected from the action space and executed, and its quality is reflected by the reward value.
The deep reinforcement learning algorithm comprises the following steps:
and 4.1, setting an Actor network and a Critic network as reinforcement learning networks, wherein the loss function used by the Actor update adopts a loss function of a self-adaptive KL penalty coefficient, the Critic adopts TD-error update, the hidden layer uses swish as an activation function, and the output layer uses tanh as an activation function.
Step 4.2, training path planning is carried out in the action space;
and 4.3, setting a training ending condition, and taking the planning success to stop training when the tail end of the mechanical arm continuously reaches a preset target point for multiple times.
In this embodiment, the input of the deep reinforcement learning algorithm is the state information s_t of the mechanical arm, and an Actor network and a Critic network are set up for training. The network structure is 400 × 300 × 10 × 1; all hidden layers use swish as the activation function, the output layer of the Actor network uses tanh as the activation function, and the output action range is [-1, 1].
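One possible realisation of such networks is sketched below in PyTorch; the 400 × 300 × 10 × 1 layer sizes, the swish (SiLU) hidden activations and the tanh output of the Actor follow this embodiment, while the state dimension and the use of a state-value Critic are assumptions.

```python
import torch
import torch.nn as nn

STATE_DIM = 24  # assumed size of the state vector s_t; not specified here

class Actor(nn.Module):
    """Actor with the 400 x 300 x 10 x 1 structure, swish (SiLU) hidden layers
    and a tanh output so that the action scalar lies in [-1, 1]."""
    def __init__(self, state_dim: int = STATE_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.SiLU(),
            nn.Linear(400, 300), nn.SiLU(),
            nn.Linear(300, 10), nn.SiLU(),
            nn.Linear(10, 1), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class Critic(nn.Module):
    """Critic with the same hidden structure, updated by TD-error
    (a state-value critic is one plausible reading of the description)."""
    def __init__(self, state_dim: int = STATE_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.SiLU(),
            nn.Linear(400, 300), nn.SiLU(),
            nn.Linear(300, 10), nn.SiLU(),
            nn.Linear(10, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```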
Path planning training is carried out in the redesigned action space: the current joint positions, velocities and accelerations of the mechanical arm and the distance between the end of the mechanical arm and the target point are input into the reinforcement learning network to obtain the expected action of the mechanical arm, namely the action in the action space that maximizes the Q value. The Q value is the action-value function in reinforcement learning; it evaluates the value of an action and represents the expected sum of rewards until the final state after the agent selects that action. Each executed action yields a corresponding reward value; when the reward converges stably, the planning is considered successful and the training is stopped, and the strategy obtained through training is the optimal strategy.
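A high-level sketch of such a training loop follows; env, agent and the convergence criterion are assumed interfaces standing in for the simulated welding environment and the Actor-Critic learner, and are not defined in this description.

```python
def reward_converged(rewards, window=50, tol=1.0):
    """Assumed convergence test: the last `window` episode rewards vary by less than tol."""
    if len(rewards) < window:
        return False
    recent = rewards[-window:]
    return max(recent) - min(recent) < tol

def train(env, agent, max_episodes=2000):
    """Outline of the planning training loop: predict, substitute, execute, learn."""
    episode_rewards = []
    for _ in range(max_episodes):
        state = env.reset()  # joint positions, velocities, accelerations, distance to target
        total_reward, done = 0.0, False
        while not done:
            m_next = agent.act(state)                # motion scalar in [-1, 1]
            action = env.substitute_action(m_next)   # step 3: safe substitute action
            next_state, reward, done = env.step(action)
            agent.update(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
        episode_rewards.append(total_reward)
        if reward_converged(episode_rewards):        # stop once the reward has converged
            break
    return agent
```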
The reward function for deep reinforcement learning is R = R_target - R_action - R_adaptation - R_distance, comprising four terms. The first term R_target is a reward term based on the distance from the end of the mechanical arm to the target point, used to train the mechanical arm to approach the target point; the second term R_action is an action penalty term used to prevent the action from getting too close to its limit values; the third term R_adaptation is a braking penalty term, equal to 1 when the action leads to a collision and the braking action is executed, and 0 otherwise; the fourth term R_distance is a distance penalty term: after the substitute action is executed, if the distance between the links of the mechanical arm or the distance between any link and an obstacle is smaller than a certain threshold, a penalty of 1 is applied, otherwise it is 0.
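The composite reward could be assembled as in the sketch below; only the sign structure R = R_target - R_action - R_adaptation - R_distance and the 0/1 penalty terms come from this description, while the concrete forms of R_target and R_action and the numeric threshold are illustrative assumptions.

```python
def compute_reward(dist_to_target, action, action_limit, braked, min_link_dist,
                   dist_threshold=0.05):
    """R = R_target - R_action - R_adaptation - R_distance (structure per the description)."""
    r_target = -dist_to_target                      # assumed form: closer to the target is better
    r_action = (abs(action) / action_limit) ** 2    # assumed form: penalise actions near their limit
    r_adaptation = 1.0 if braked else 0.0           # 1 when the braking substitute action was executed
    r_distance = 1.0 if min_link_dist < dist_threshold else 0.0  # links too close to each other / obstacles
    return r_target - r_action - r_adaptation - r_distance
```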
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.
Claims (8)
1. A narrow space robot operation planning method for safety reinforcement learning is characterized by comprising the following steps:
setting a planning task and a target point before the movement of the mechanical arm;
calculating the expected acceleration a_{t+1,N} and the braking acceleration a_{t+1,B} according to the current state information of the mechanical arm and the relevant kinematic constraints, thereby constructing a feasible action space of the mechanical arm, comprising the following steps:
defining kinematic constraints for the joints;
detecting, at discrete time points, the minimum distance between the robot and the obstacles and between the links of the mechanical arm to determine the collision condition, and determining that a collision occurs if the minimum distance is smaller than a preset safety distance threshold;
acquiring state information of the mechanical arm through a built-in sensor in a pybullet environment;
establishing a neural network as a motion prediction network for predicting the motion at the next moment, inputting the state information of the joints into the motion prediction network, and predicting the motion scalar m_{t+1} ∈ [-1, 1] corresponding to each joint; the expected acceleration a_{t+1,N} of the joint is then obtained from the formula a_{t+1,N} = a_{t+1,min} + (1 + m_{t+1})/2 · (a_{t+1,max} - a_{t+1,min}), where a_{t+1,min} and a_{t+1,max} are respectively the minimum and maximum safe accelerations of the joint; knowing the expected acceleration, the velocity and position of the joint at the next time t+1 can be obtained;
calculating the braking acceleration: when the joint velocity v_t corresponding to the current time t satisfies v_t > 0, take m'_{t+1} = 2·m_{t+1} - 1, otherwise take m'_{t+1} = 2·(1 - m_{t+1}) - 1; substituting m'_{t+1} into a_{t+1,B} = a_{t+1,min} + (1 + m'_{t+1})/2 · (a_{t+1,max} - a_{t+1,min}) gives the braking acceleration;
testing the expected acceleration a_{t+1,N} of the joint: if the mechanical arm does not collide and does not violate the defined joint kinematic constraints after the action is executed, the expected acceleration a_{t+1,N} is feasible and is executed as the substitute action; otherwise the calculated braking acceleration a_{t+1,B} is executed as the substitute action; braking is performed after the expected acceleration a_{t+1,N} calculated for each joint; starting from the state information corresponding to the current time t, if no collision occurs after the corresponding action is executed, the behavior is safe, otherwise the motion is stopped;
the substitute actions of the joints of the mechanical arm form the feasible action space of the mechanical arm; and planning a motion trajectory for the mechanical arm in this action space by using a deep reinforcement learning algorithm and obtaining the optimal strategy.
2. The safe reinforcement learning narrow space robot operation planning method according to claim 1, wherein the target point is a welding start point, and the planning task is to plan a safe path so that the end of the mechanical arm moves to the welding start point.
3. The safety-enhanced learning narrow space robot operation planning method according to claim 1, wherein the state information includes a position, a velocity, an acceleration, and a distance to an obstacle of each joint.
4. The safe reinforcement learning narrow space robot operation planning method according to claim 1, wherein, in order to prevent oscillation during the motion, a_{t+1,max} = m'_{t+1} · (a_{t+1,max} - a_{t+1,min}) and a_{t+1,min} = a_{t+1,min} + (1 - m'_{t+1}) · (a_{t+1,max} - a_{t+1,min}) are taken.
5. The safety-reinforcement-learning narrow-space robot work planning method according to claim 1, wherein the deep reinforcement learning algorithm comprises:
setting an Actor network and a Critic network as the reinforcement learning networks, wherein the Actor is updated with a loss function using an adaptive KL penalty coefficient, the Critic is updated with TD-error, the hidden layers use swish as the activation function, and the output layer uses tanh as the activation function;
performing path planning training in the action space;
and setting a training termination condition: when the end of the mechanical arm reaches the preset target point several consecutive times, the planning is regarded as successful and the training is stopped.
6. The narrow space robot operation planning method for safety reinforcement learning according to claim 1, wherein the input of the deep reinforcement learning algorithm is the state information of the mechanical arm, and an Actor network and a Critic network are set up for training; the network structure is 400 × 300 × 10 × 1, all hidden layers use swish as the activation function, the output layer of the Actor network uses tanh as the activation function, and the output action range is [-1, 1].
7. The narrow space robot operation planning method for safety reinforcement learning according to claim 1, wherein the path planning training is performed in the action space to obtain the expected action of the mechanical arm, namely the action in the action space that maximizes the Q value; the Q value is the action-value function in reinforcement learning and represents the expected sum of rewards from selecting the action until the final state is reached; each executed action yields a corresponding reward value, and when the reward converges stably, the planning is considered successful and the training is stopped, the strategy obtained through the training being the optimal strategy.
8. The narrow space robot operation planning method for safety reinforcement learning according to claim 1, wherein the reward function for deep reinforcement learning is R = R_target - R_action - R_adaptation - R_distance, comprising four terms: the first term R_target is a reward term based on the distance from the end of the mechanical arm to the target point, used to train the mechanical arm to approach the target point; the second term R_action is an action penalty term used to prevent the action from getting too close to its limit values; the third term R_adaptation is a braking penalty term, equal to 1 when the action leads to a collision and the braking action is executed, and 0 otherwise; the fourth term R_distance is a distance penalty term, whereby after the substitute action is executed, if the distance between the links of the mechanical arm or the distance between any link and an obstacle is smaller than a certain threshold, a penalty of 1 is applied, otherwise it is 0.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210930544.2A CN115178944B (en) | 2022-08-04 | 2022-08-04 | Narrow space robot operation planning method for safety reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210930544.2A CN115178944B (en) | 2022-08-04 | 2022-08-04 | Narrow space robot operation planning method for safety reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115178944A true CN115178944A (en) | 2022-10-14 |
CN115178944B CN115178944B (en) | 2024-05-24 |
Family
ID=83520672
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210930544.2A Active CN115178944B (en) | 2022-08-04 | 2022-08-04 | Narrow space robot operation planning method for safety reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115178944B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116551703A (en) * | 2023-07-12 | 2023-08-08 | 长春工业大学 | Motion planning method based on machine learning in complex environment |
CN116834018A (en) * | 2023-08-07 | 2023-10-03 | 南京云创大数据科技股份有限公司 | Training method and training device for multi-mechanical arm multi-target searching |
CN116900539A (en) * | 2023-09-14 | 2023-10-20 | 天津大学 | Multi-robot task planning method based on graph neural network and reinforcement learning |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106457565A (en) * | 2014-06-03 | 2017-02-22 | 阿蒂迈兹机器人技术有限公司 | Method and system for programming a robot |
CN110315258A (en) * | 2019-07-24 | 2019-10-11 | 广东工业大学 | A kind of welding method based on intensified learning and ant group algorithm |
CN110333739A (en) * | 2019-08-21 | 2019-10-15 | 哈尔滨工程大学 | A kind of AUV conduct programming and method of controlling operation based on intensified learning |
CN110370317A (en) * | 2019-07-24 | 2019-10-25 | 广东工业大学 | Robot restorative procedure and device |
CN112454333A (en) * | 2020-11-26 | 2021-03-09 | 青岛理工大学 | Robot teaching system and method based on image segmentation and surface electromyogram signals |
CN113163332A (en) * | 2021-04-25 | 2021-07-23 | 北京邮电大学 | Road sign graph coloring unmanned aerial vehicle energy-saving endurance data collection method based on metric learning |
US20220105629A1 (en) * | 2021-12-16 | 2022-04-07 | Venkat Natarajan | Failure rate estimation and reinforcement learning safety factor systems |
CN114708293A (en) * | 2022-03-22 | 2022-07-05 | 广东工业大学 | Robot motion estimation method based on deep learning point-line feature and IMU tight coupling |
-
2022
- 2022-08-04 CN CN202210930544.2A patent/CN115178944B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106457565A (en) * | 2014-06-03 | 2017-02-22 | 阿蒂迈兹机器人技术有限公司 | Method and system for programming a robot |
CN110315258A (en) * | 2019-07-24 | 2019-10-11 | 广东工业大学 | A kind of welding method based on intensified learning and ant group algorithm |
CN110370317A (en) * | 2019-07-24 | 2019-10-25 | 广东工业大学 | Robot restorative procedure and device |
CN110333739A (en) * | 2019-08-21 | 2019-10-15 | 哈尔滨工程大学 | A kind of AUV conduct programming and method of controlling operation based on intensified learning |
CN112454333A (en) * | 2020-11-26 | 2021-03-09 | 青岛理工大学 | Robot teaching system and method based on image segmentation and surface electromyogram signals |
CN113163332A (en) * | 2021-04-25 | 2021-07-23 | 北京邮电大学 | Road sign graph coloring unmanned aerial vehicle energy-saving endurance data collection method based on metric learning |
US20220105629A1 (en) * | 2021-12-16 | 2022-04-07 | Venkat Natarajan | Failure rate estimation and reinforcement learning safety factor systems |
CN114708293A (en) * | 2022-03-22 | 2022-07-05 | 广东工业大学 | Robot motion estimation method based on deep learning point-line feature and IMU tight coupling |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116551703A (en) * | 2023-07-12 | 2023-08-08 | 长春工业大学 | Motion planning method based on machine learning in complex environment |
CN116551703B (en) * | 2023-07-12 | 2023-09-12 | 长春工业大学 | Motion planning method based on machine learning in complex environment |
CN116834018A (en) * | 2023-08-07 | 2023-10-03 | 南京云创大数据科技股份有限公司 | Training method and training device for multi-mechanical arm multi-target searching |
CN116900539A (en) * | 2023-09-14 | 2023-10-20 | 天津大学 | Multi-robot task planning method based on graph neural network and reinforcement learning |
CN116900539B (en) * | 2023-09-14 | 2023-12-19 | 天津大学 | Multi-robot task planning method based on graph neural network and reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN115178944B (en) | 2024-05-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115178944A (en) | Narrow space robot operation planning method for safety reinforcement learning | |
CN108621165B (en) | Optimal trajectory planning method for industrial robot dynamics performance in obstacle environment | |
CN114603564A (en) | Mechanical arm navigation obstacle avoidance method and system, computer equipment and storage medium | |
CN111856925B (en) | State trajectory-based confrontation type imitation learning method and device | |
CN112140101A (en) | Trajectory planning method, device and system | |
CN113687659B (en) | Optimal trajectory generation method and system based on digital twinning | |
CN115091469B (en) | Depth reinforcement learning mechanical arm motion planning method based on maximum entropy frame | |
CN110014428A (en) | A kind of sequential logic mission planning method based on intensified learning | |
CN113485323B (en) | Flexible formation method for cascading multiple mobile robots | |
CN115542733A (en) | Self-adaptive dynamic window method based on deep reinforcement learning | |
CN118201742A (en) | Multi-robot coordination using a graph neural network | |
CN117207186A (en) | Assembly line double-mechanical-arm collaborative grabbing method based on reinforcement learning | |
CN116551703B (en) | Motion planning method based on machine learning in complex environment | |
CN111984000A (en) | Method and device for automatically influencing an actuator | |
CN114055479A (en) | Dragging teaching spraying robot collision early warning method, medium and equipment | |
Paudel | Learning for robot decision making under distribution shift: A survey | |
CN115081612A (en) | Apparatus and method to improve robot strategy learning | |
KR20190088093A (en) | Learning method for robot | |
Chen et al. | Mitigating Imminent Collision for Multi-robot Navigation: A TTC-force Reward Shaping Approach | |
Jiang et al. | Motion sequence learning for robot walking based on pose optimization | |
Young et al. | Enhancing Robotic Navigation: An Evaluation of Single and Multi-Objective Reinforcement Learning Strategies | |
Li et al. | Manipulator Motion Planning based on Actor-Critic Reinforcement Learning | |
KR102719462B1 (en) | Method, apparatus and computer program for forming kinematic model for actuation of articulated robot | |
TWI811156B (en) | Transition method of locomotion gait of robot | |
CN115648213A (en) | Mechanical arm autonomous assembling method and system suitable for unstructured environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |