CN115042185A - Mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning - Google Patents

Mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning

Info

Publication number
CN115042185A
CN115042185A (application CN202210788006.4A)
Authority
CN
China
Prior art keywords
stage
task
training
mechanical arm
obstacle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210788006.4A
Other languages
Chinese (zh)
Inventor
蔡尚雷
林志赟
王博
韩志敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210788006.4A priority Critical patent/CN115042185A/en
Publication of CN115042185A publication Critical patent/CN115042185A/en
Pending legal-status Critical Current

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • B25J9/1674 Programme controls characterised by safety, monitoring, diagnostic
    • B25J9/1676 Avoiding collision or forbidden zones

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The invention relates to a mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning, comprising the following steps: executing the first-stage task, and moving on to the second-stage task once the reward obtained in a training period reaches a threshold value and the difference between the rewards obtained in successive training periods is within a threshold value; executing the second-stage task, and moving on to the third-stage task once the same reward conditions are met; and executing the third-stage task, in which obstacle positions are randomly generated, and finishing training once the reward conditions are again satisfied. By setting up the environment and task to combine grasping and obstacle avoidance for a realistic industrial environment, the invention provides a more effective state representation and reward design and thereby improves the robot's learning of the task.

Description

Mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning
Technical Field
The invention relates to the technical field of intelligent learning, in particular to a mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning.
Background
Currently, robotic arms can be used for different tasks, such as assembly, pick and place, food cutting, etc.
To complete a pick-and-place task, the robotic arm must grasp the target object and place it at the target location. The whole process includes grasping the object and planning a motion to reach the target point. Current research focuses on how to grasp objects of different shapes, or how to pick objects out of clutter. However, even with improved grasping, a pick-and-place task in a real industrial environment must also take obstacles into account, which may be boxes placed near the work area or the operator. Another body of work therefore focuses on obstacle avoidance for the mechanical arm. Part of this research adopts reinforcement learning (RL) to solve the obstacle avoidance problem, for example by combining the Artificial Potential Field Method (APFM) with RL to handle dynamic obstacles. Although RL has achieved significant results in grasping and in obstacle avoidance separately, how to combine these two tasks to accomplish a pick-and-place task in an industrial environment remains an open problem.
Specifically, for a long-horizon composite task, the training efficiency of an RL algorithm is low, either because the observation horizon is long or because the network architecture has insufficient feature extraction capability. Researchers have addressed these problems by making algorithms more sample-efficient, or by building networks that extract features more effectively. For some long-horizon tasks, most work relies on carefully tuned shaped rewards, which both convey the goal to the agent and alleviate the exploration problem. However, it is very difficult to tune an appropriate reward without degrading the resulting solution. One line of work uses an off-policy RL algorithm with demonstration trajectories to quickly bootstrap challenging long-horizon motion tasks, such as various insertion-type tasks; manual demonstrations are studied as a replacement for shaped reward functions that are difficult to tune. Another line of work uses a geometry-aware operational space controller for contact-rich manipulation, combined with the PPO algorithm, to complete three tasks: trajectory following, block pushing, and the long-horizon task of door opening. Similar research has designed new experience replay mechanisms or effective reward shaping methods to address the difficulty of training long-horizon tasks. Some studies have considered long-horizon tasks, but work that solves them in practical industrial scenarios is still lacking. In addition, RL training efficiency on long-horizon tasks remains poor, and how to train such tasks effectively is still a challenge.
The long-horizon grasp-and-obstacle-avoidance problem is defined as a Markov Decision Process (MDP): at time t, given the state s_t ∈ S, the agent (i.e., the robot) executes an action a_t ∈ A(s_t) according to a policy π, and receives a reward r_t from the designed reward function R(s_t, a_t). The goal is to find an optimal policy π* that maximizes the expected discounted sum of future rewards, i.e., π* = argmax_π E[ Σ_{k=0}^{∞} γ^k r_{t+k} ], the γ-discounted sum of all rewards from time t to ∞.
In this problem, RL is used to find the optimal policy π*, optimizing the grasping process, reducing the collision probability, and maximizing the task success rate. In addition, training efficiency needs to be optimized so that the mechanical arm can effectively learn how to complete the designed long-horizon task.
In summary, there is a need for a mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning that, for a realistic industrial environment, sets up the environment and task by combining grasping and obstacle avoidance, provides a more effective state representation and reward design, and improves the robot's learning effect.
Disclosure of Invention
The invention aims to provide a mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning that, for a realistic industrial environment, combines grasping and obstacle avoidance in the environment and task setup, has a more effective state representation and reward design, and improves the robot's learning effect.
In order to achieve the purpose, the invention adopts the technical scheme that: a mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning comprises the following steps:
(1) sequentially executing training tasks, wherein the training tasks at least comprise a first-stage training task, a second-stage training task and a third-stage training task with sequentially increasing difficulty;
(2) acquiring an obstacle avoidance grabbing model, wherein the obstacle avoidance grabbing model is obtained by performing deep learning on an execution training task;
(3) inputting the task to be executed into the acquired obstacle avoidance grabbing model, thereby realizing obstacle avoidance grabbing by the mechanical arm.
Further, obtaining the obstacle avoidance grabbing model comprises:
acquiring a first-stage task training model, wherein the first-stage task training model is obtained by deep learning while executing the first-stage training task; the first-stage training task is completed when the reward obtained in a training period reaches a threshold value and the difference between the rewards obtained in successive training periods is within a threshold value;
acquiring a second-stage task training model, wherein the second-stage task training model is obtained by deep learning while executing the second-stage training task, the second-stage training task being executed by the first-stage task training model; the second-stage training task is completed when the reward obtained in a training period reaches a threshold value and the difference between the rewards obtained in successive training periods is within a threshold value;
and acquiring a third-stage task training model, wherein the third-stage task training model is obtained by deep learning while executing the third-stage training task, the third-stage training task being executed by the second-stage task training model; training is completed when the reward obtained in a training period reaches a threshold value and the difference between the rewards obtained in successive training periods is within a threshold value.
Further, the first-stage training task is a target object picking task provided with a first-stage obstacle, and the position of the first-stage obstacle is fixed;
the second-stage training task is a target object picking task provided with second-stage obstacles, and the second-stage obstacles are fixed in position and are more than the first-stage obstacles in number;
the third-stage training task is a target object picking task provided with a third-stage obstacle, the position of the third-stage obstacle is randomly generated, and the number of the obstacles is equal to that of the second-stage obstacles.
Further, performing the training task includes:
acquiring a state, wherein the state acquisition is used for identifying the relative positions of the mechanical arm, the target object and the obstacle, and comprises the step of acquiring the minimum distance between the mechanical arm and the obstacle;
a motion control for controlling the motion of the robotic arm, including converting the robotic arm motion to a position control in a Cartesian coordinate system;
reward acquisition, wherein the reward acquisition is a record of successful completion of a target action; the target actions include reach, grip, lift, and hover.
Further, the state acquisition includes: regarding a plurality of connecting arms, the end effector and the obstacles on the mechanical arm as line segments, dividing each line segment into a plurality of points, and taking the distances between points on the mechanical arm and points on an obstacle as the distance between the mechanical arm and that obstacle. The minimum distance d_min between the mechanical arm and the obstacles is the first part of the state, the positions y_f of the two fingers of the gripper along the y-axis of the end effector coordinate system are the second part, the position p_o of the target object in three-dimensional space is the third part, and the position p_e of the end effector in three-dimensional space is the fourth part:
S_1 = {d_min, y_f, p_o, p_e}
The range of each state component is as follows:
d_min1, d_min2 ∈ [0, 1.3]
y_left ∈ [0, 0.04], y_right ∈ [-0.04, 0]
x_o ∈ [-2, 2], y_o ∈ [-2, 2], z_o ∈ [-2, 2]
x_e ∈ [-1.35, 0.35], y_e ∈ [-0.98, 0.72], z_e ∈ [0.52, 1.82]
where d_min1 and d_min2 denote the minimum distances between the mechanical arm and the two obstacles, respectively; y_left and y_right denote the positions of the left and right fingers of the gripper along the y-axis of the end effector coordinate system; x_o, y_o, z_o denote the position of the target object on the x, y and z axes of the end effector coordinate system; and x_e, y_e, z_e denote the position of the end effector on the x, y and z axes of the world coordinate system.
Further, the motion control includes: converting the motion of the mechanical arm into position control in a Cartesian coordinate system using operational space control (OSC), with the action space a of the mechanical arm being:
a = (Δx, Δy, Δz, Δg),  Δx, Δy, Δz, Δg ∈ [-1, 1]
where Δx, Δy and Δz denote the offsets of the end effector along the x, y and z axes of the Cartesian coordinate system, and Δg is the opening/closing state of the gripper (the clamp on the mechanical arm): if Δg < 0 the clamp closes; if Δg = 0 the clamp remains unchanged; and if Δg > 0 the clamp opens.
Further, the reward acquisition includes rewards for the four stages of reaching, grasping, lifting and hovering:
r_reach = α*(1 - tanh(λ*d_1))
r_grasp = β
r_lift = r_grasp + (γ - β)*(1 - tanh[η*(z_set - z_object)])
r_hover = r_hover1 if the lifted object lies within the x-y range of the target box, and r_hover2 otherwise (the full piecewise expression is given as an image in the original publication)
r_obstacle = k
r_time = τ
where r_reach, r_grasp, r_lift and r_hover denote the rewards for reaching, grasping, lifting and hovering, respectively; r_obstacle denotes the penalty for a collision between the mechanical arm and an obstacle; r_time denotes the time penalty incurred at every step until the task is completed; d_1 denotes the Cartesian distance from the gripper to the target object; z_set and z_object denote the set height to be reached and the height of the object, respectively; d_2 denotes the Cartesian distance from the target object to the target point; r_hover1 and r_hover2 denote the hover rewards in the two cases; α, β, γ, μ, λ and η denote the coefficients set in the reaching, grasping, lifting and hovering reward formulas; k denotes the collision penalty coefficient; and τ denotes the time penalty coefficient.
Further, there is one of the first stage obstacles; there are two second-stage obstacles and two third-stage obstacles.
Further, the deep learning is executed using the Soft Actor-Critic (SAC) algorithm.
The invention has the advantages that:
1) The mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning sets up the environment and task by combining grasping and obstacle avoidance for realistic industrial environments, such as waste battery recycling scenarios, and has a more effective state representation and reward design, thereby improving the robot's learning of the task.
2) According to the mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning, the training tasks are trained stage by stage in order of difficulty and corresponding training models are obtained; each lower-stage training model then executes the next, harder training task to obtain the corresponding higher-stage training model, so that the picking task converges faster and higher training efficiency is obtained.
Drawings
Fig. 1 is a flow chart of a mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning.
Fig. 2 is a schematic diagram of a training environment of the first-stage training task of the mechanical arm obstacle avoidance grasping method based on continuous reinforcement learning.
Fig. 3 is a schematic diagram of a training environment of the second-stage training task of the mechanical arm obstacle avoidance gripping method based on continuous reinforcement learning according to the present invention.
Fig. 4 is a schematic diagram of a training environment of the third-stage training task of the mechanical arm obstacle avoidance grasping method based on continuous reinforcement learning.
Fig. 5 is a schematic view of a continuous learning mode of the mechanical arm obstacle avoidance grasping method based on continuous reinforcement learning.
In the figures: 01, target object; 02, obstacle; 03, target frame.
Detailed Description
The invention will be further illustrated with reference to specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Furthermore, it should be understood that various changes and modifications can be made by those skilled in the art after reading the disclosure of the present invention, and equivalents fall within the scope of the appended claims.
Example 1
The invention provides a mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning, which is shown in the attached drawing 1, wherein the attached drawing 1 is a flow chart of the mechanical arm obstacle avoidance grabbing method based on the continuous reinforcement learning, and the method at least comprises the following steps: step S10-step S20:
step S10: sequentially executing training tasks, wherein the training tasks at least comprise a first-stage training task, a second-stage training task and a third-stage training task with sequentially increasing difficulty;
step S20: and acquiring an obstacle avoidance grabbing model, wherein the obstacle avoidance grabbing model is obtained by performing deep learning on the execution of the training task.
It should be noted that, in the mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning provided by the invention, the training tasks are trained stage by stage in order of difficulty and corresponding training models are obtained; each lower-stage training model then executes the next, harder training task to obtain the corresponding higher-stage training model, so that the picking task converges faster and higher training efficiency is obtained.
Specifically, the step of obtaining the obstacle avoidance grabbing model includes steps S22 to S26:
Step S22: acquiring a first-stage task training model, wherein the first-stage task training model is obtained by deep learning while executing the first-stage training task; the first-stage training task is completed when the reward obtained in a training period reaches a threshold value and the difference between the rewards obtained in successive training periods is within a threshold value.
Referring to fig. 2, fig. 2 is a schematic diagram of the training environment of the first-stage training task (the mechanical arm is not shown) of the mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning according to the present invention. The first-stage training task is a target object picking task provided with a first-stage obstacle, and the position of the first-stage obstacle is fixed.
Step S24: acquiring a second-stage task training model, wherein the second-stage task training model is obtained by deep learning while executing the second-stage training task, the second-stage training task being executed by the first-stage task training model; the second-stage training task is completed when the reward obtained in a training period reaches a threshold value and the difference between the rewards obtained in successive training periods is within a threshold value.
Referring to fig. 3, fig. 3 is a schematic diagram of the training environment of the second-stage training task (the mechanical arm is not shown) of the mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning according to the present invention. The second-stage training task is a target object picking task provided with second-stage obstacles; the second-stage obstacles are fixed in position and greater in number than the first-stage obstacles.
Step S26: acquiring a third-stage task training model, wherein the third-stage task training model is obtained by deep learning while executing the third-stage training task, the third-stage training task being executed by the second-stage task training model; training is completed when the reward obtained in a training period reaches a threshold value and the difference between the rewards obtained in successive training periods is within a threshold value.
Step S30: performing obstacle avoidance grabbing with the mechanical arm based on the acquired model.
Referring to fig. 4, fig. 4 is a schematic diagram of the training environment of the third-stage training task (the mechanical arm is not shown) of the mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning according to the present invention. The third-stage training task is a target object picking task provided with third-stage obstacles, and the positions of the third-stage obstacles are randomly generated.
According to the mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning, the training difficulty increases at each training stage: after the lower-stage training model is built, it is further trained on the next, more difficult task until the final complete obstacle avoidance grabbing model is obtained. Decomposing the training process in this way greatly improves training efficiency.
It should be noted that the deep learning strategy and the execution method adopted in the present embodiment are as follows:
A. SAC
Soft Actor-Critic (SAC) [21] is an off-policy algorithm for optimizing a stochastic policy, combining the advantages of stochastic policy optimization with off-policy methods such as DDPG. In addition to learning a policy that maximizes the cumulative reward, SAC modifies the DRL objective so that the policy also maximizes the entropy of the actions it outputs.
SAC learns a policy π_θ and two Q-functions Q_φ1 and Q_φ2, using neural networks to approximate both the policy and the Q-functions. The policy network parameters and the Q network parameters are updated by gradient descent. First, the algorithm samples data from the replay buffer R in order to compute the Q-function targets. The target is calculated as follows:
y(r, s', d) = r + γ(1 - d)( min_{i=1,2} Q_φtarg,i(s', ã') - α·log π_θ(ã'|s') ),  ã' ~ π_θ(·|s')
where ã' ~ π_θ(·|s') indicates that the action is sampled from the policy. SAC learns the value functions using a clipped double-Q technique similar to TD3, taking the minimum of the two Q-values. The parameters of the Q-functions Q_φ1 and Q_φ2 are updated using the gradient:
∇_φi (1/|B|) Σ_{(s,a,r,s',d)∈B} ( Q_φi(s, a) - y(r, s', d) )²,  i = 1, 2
The policy parameters θ are updated by ascending:
∇_θ (1/|B|) Σ_{s∈B} ( min_{i=1,2} Q_φi(s, ã_θ(s)) - α·log π_θ(ã_θ(s)|s) )
where ã_θ(s) denotes an action sampled from π_θ(·|s) via the reparameterization trick, B denotes a mini-batch drawn from the replay buffer, α denotes the entropy temperature coefficient, and d denotes the episode-termination flag.
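As an illustration of the update rules above, the following is a minimal sketch (not the patent's implementation) of the SAC critic target, critic loss, policy loss, and a tanh-squashed Gaussian policy in PyTorch; the network sizes, entropy coefficient α = 0.2 and discount γ = 0.99 are assumptions for the example only.

```python
# Minimal SAC sketch (illustrative only; hyperparameters and network sizes are assumptions).
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q_phi(s, a): state-action value approximator."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

class GaussianPolicy(nn.Module):
    """pi_theta(.|s): tanh-squashed Gaussian policy with reparameterized sampling."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, action_dim)
        self.log_std_head = nn.Linear(hidden, action_dim)

    def forward(self, s):
        h = self.body(s)
        mu, log_std = self.mu_head(h), self.log_std_head(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                      # reparameterization trick
        a = torch.tanh(u)                       # squash to [-1, 1], matching (dx, dy, dz, dg)
        # log-probability with the tanh change-of-variables correction
        logp = dist.log_prob(u).sum(-1) - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
        return a, logp

def critic_target(r, s2, done, policy, q1_targ, q2_targ, gamma=0.99, alpha=0.2):
    """y = r + gamma*(1-d)*(min_i Q_targ,i(s', a') - alpha*log pi(a'|s')), a' ~ pi(.|s')."""
    with torch.no_grad():
        a2, logp2 = policy(s2)
        q_min = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))   # clipped double Q
        return r + gamma * (1.0 - done) * (q_min - alpha * logp2)

def critic_loss(batch, q1, q2, policy, q1_targ, q2_targ):
    s, a, r, s2, done = batch                                  # tensors sampled from the replay buffer
    y = critic_target(r, s2, done, policy, q1_targ, q2_targ)
    return ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()

def policy_loss(states, policy, q1, q2, alpha=0.2):
    a, logp = policy(states)
    q_min = torch.min(q1(states, a), q2(states, a))
    return (alpha * logp - q_min).mean()        # minimizing this ascends (min Q - alpha*log pi)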
B. State representation, action representation, and reward
1) State representation
The state s_t is defined in several parts. For the robot, only links6, links7, links8 and the end effector are considered (links6, links7 and links8 are the parts of the mechanical arm that lie entirely within the workspace). Links6, links7, links8, the end effector and the obstacles are treated as line segments, and each line segment is divided into 10 points. The distances between the points on the mechanical arm and the points on an obstacle represent the distance between the mechanical arm and that obstacle. The minimum distance d_min between the mechanical arm and the obstacles is the first part of the state, the positions y_f of the two fingers of the gripper along the y-axis of the end effector coordinate system are the second part, the position p_o of the target object in three-dimensional space is the third part, and the position p_e of the end effector in three-dimensional space is the fourth part:
S_1 = {d_min, y_f, p_o, p_e}
The range of each state component is as follows:
d_min1, d_min2 ∈ [0, 1.3]
y_left ∈ [0, 0.04], y_right ∈ [-0.04, 0]
x_o ∈ [-2, 2], y_o ∈ [-2, 2], z_o ∈ [-2, 2]
x_e ∈ [-1.35, 0.35], y_e ∈ [-0.98, 0.72], z_e ∈ [0.52, 1.82]
where d_min1 and d_min2 denote the minimum distances between the mechanical arm and the two obstacles, respectively; y_left and y_right denote the positions of the left and right fingers of the gripper along the y-axis of the end effector coordinate system; x_o, y_o, z_o denote the position of the target object on the x, y and z axes of the end effector coordinate system; and x_e, y_e, z_e denote the position of the end effector on the x, y and z axes of the world coordinate system.
In summary, the positions of the two gripper fingers along the y-axis of the end effector coordinate system serve as the second part of the state, the position of the target object as the third part, and the position of the end effector as the fourth part.
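To make the state construction concrete, the following is a small sketch (an illustration, not the patent's code) of how the minimum arm-obstacle distances and the state vector S_1 could be assembled, assuming each link, the end effector and each obstacle are available as 3D line segments that are sampled into 10 points as described above; all function and parameter names are assumptions.

```python
# Illustrative state construction for S_1 = {d_min, y_f, p_o, p_e} (names are assumptions).
import numpy as np

def sample_segment(p_start, p_end, n_points=10):
    """Discretize a 3D line segment into n_points points."""
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    return (1.0 - t) * np.asarray(p_start) + t * np.asarray(p_end)

def min_distance(arm_segments, obstacle_segment):
    """Minimum distance between sampled arm points and sampled obstacle points."""
    obstacle_pts = sample_segment(*obstacle_segment)
    d_min = np.inf
    for seg in arm_segments:                       # e.g. links6, links7, links8, end effector
        arm_pts = sample_segment(*seg)
        # pairwise distances between the 10 arm points and the 10 obstacle points
        d = np.linalg.norm(arm_pts[:, None, :] - obstacle_pts[None, :, :], axis=-1)
        d_min = min(d_min, d.min())
    return d_min

def build_state(arm_segments, obstacle_segments, y_left, y_right, p_object, p_ee):
    """Concatenate [d_min1, d_min2, y_left, y_right, p_o (3), p_e (3)] into one state vector."""
    d_mins = [min_distance(arm_segments, obs) for obs in obstacle_segments]
    return np.concatenate([d_mins, [y_left, y_right], p_object, p_ee]).astype(np.float32)
```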
2) Motion representation
The controller determines the high-level control type for a given robotic arm. The motion of the manipulator is converted into position control in a Cartesian coordinate system using operational space control (OSC). The action space of the mechanical arm is as follows:
a = (Δx, Δy, Δz, Δg),  Δx, Δy, Δz, Δg ∈ [-1, 1]
Δx, Δy and Δz denote the offsets of the end effector along the x, y and z axes, and Δg denotes the gripper opening/closing action; positive values correspond to motion in the positive direction and negative values to the negative direction. The inputs Δx, Δy and Δz lie within the input value range and are scaled by a factor of 0.05 to obtain the actual output displacement: an x input of 1 moves the end effector 0.05 m in the positive x-axis direction, with positive and negative values corresponding to the positive and negative directions of motion. The gripper opening/closing action Δg is discrete in nature, comprising only opening and closing; in order to use an RL algorithm with a continuous action space, the opening/closing action is treated as continuous: if Δg < 0, the clamp closes; if Δg = 0, the clamp remains unchanged; and if Δg > 0, the clamp opens.
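The mapping from a normalized action to an end effector command could look like the following sketch (an assumption-labeled illustration, not the patent's controller code); the 0.05 m scaling factor and the Δg thresholds follow the description above.

```python
# Illustrative conversion of a normalized action a = (dx, dy, dz, dg) in [-1, 1]^4
# into an OSC position offset (metres) and a discrete gripper command.
import numpy as np

ACTION_SCALE = 0.05  # a normalized input of 1.0 corresponds to a 0.05 m displacement

def action_to_command(action):
    dx, dy, dz, dg = np.clip(np.asarray(action, dtype=np.float64), -1.0, 1.0)
    delta_pos = np.array([dx, dy, dz]) * ACTION_SCALE   # Cartesian offset for the OSC controller
    if dg < 0.0:
        gripper = "close"
    elif dg > 0.0:
        gripper = "open"
    else:
        gripper = "hold"                                 # clamp remains unchanged
    return delta_pos, gripper

# Example: move 0.05 m along +x, 0.025 m along -y, keep the height, close the gripper.
delta_pos, gripper = action_to_command([1.0, -0.5, 0.0, -0.3])
```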
3) Reward shaping
In reinforcement learning, the design of the reward directly affects the training effect. In this embodiment, the grabbing task is divided into four phases: reach, grab, lift, and hover. The reward settings are as follows:
r_reach = α*(1 - tanh(λ*d_1))
r_grasp = β
r_lift = r_grasp + (γ - β)*(1 - tanh[η*(z_set - z_object)])
r_hover = r_hover1 if the lifted object lies within the x-y range of the target box, and r_hover2 otherwise (the full piecewise expression is given as an image in the original publication)
r_obstacle = k
r_time = τ
where r_reach, r_grasp, r_lift, r_hover1 and r_hover2 denote the rewards for reaching, grasping, lifting and the two hover cases, respectively; r_obstacle denotes the penalty for a collision between the mechanical arm and an obstacle; r_time denotes the time penalty incurred at every step until the task is completed; d_1 denotes the Cartesian distance from the gripper to the target object; z_set and z_object denote the set height to be reached and the height of the object, respectively; d_2 denotes the Cartesian distance from the target object to the target point; α, β, γ, μ, λ and η denote the coefficients set in the reaching, grasping, lifting and hovering reward formulas; k denotes the collision penalty coefficient; and τ denotes the time penalty coefficient. In this embodiment, α = 0.1, β = 0.35, γ = 0.5, μ = 0.75, k = -5, λ = 15, η = 10, and τ = -0.01.
At the start of a task, the mechanical arm is rewarded for approaching the target object. When both the left and right jaws of the gripper are in contact with the target object, the manipulator is considered to have grasped the object and receives the larger reward r_grasp. Once the grasp reward r_grasp is greater than 0, the arm receives the lifting reward r_lift, which encourages it to lift the object. The reward r_hover for bringing the lifted object close to the target box is divided into two cases: when the lifted object is within the x and y range of the target box, the arm receives the reward r_hover1, while in other cases it receives the comparatively small reward r_hover2. For obstacle avoidance, the penalty r_obstacle is set. In this training task, only four parts of the robot are considered for collisions with obstacles, since only three links of the mechanical arm and the end effector lie within the workspace. When any of these components collides with an obstacle, the robot receives the collision penalty. Meanwhile, as long as the mechanical arm has not completed the task, i.e., the target object has not been placed into the target frame, the robot receives the time penalty r_time at every step.
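The staged reward terms described above could be sketched as follows (illustrative only): the coefficients are those of the embodiment, while the hover branch follows the textual description because its exact formula is given as an image, and the default values r_hover1 = 1.0 and r_hover2 = 0.6 are placeholders, not values taken from the patent.

```python
# Illustrative implementation of the staged reward terms (coefficients as in the embodiment;
# r_hover1/r_hover2 defaults are placeholders because the exact hover equation is an image).
import numpy as np

ALPHA, BETA, GAMMA_L, LAMBDA, ETA = 0.1, 0.35, 0.5, 15.0, 10.0   # reach/grasp/lift coefficients
K_COLLISION, TAU_TIME = -5.0, -0.01                               # r_obstacle = k, r_time = tau

def r_reach(d1):
    """Reward for approaching the object; d1 is the gripper-to-object Cartesian distance."""
    return ALPHA * (1.0 - np.tanh(LAMBDA * d1))

def r_lift(z_set, z_object):
    """Reward for lifting toward the set height; includes the grasp reward r_grasp = beta."""
    return BETA + (GAMMA_L - BETA) * (1.0 - np.tanh(ETA * (z_set - z_object)))

def r_hover(object_xy, box_xy_min, box_xy_max, r_hover1=1.0, r_hover2=0.6):
    """Larger reward when the lifted object lies within the x-y footprint of the target box."""
    in_box = (box_xy_min[0] <= object_xy[0] <= box_xy_max[0] and
              box_xy_min[1] <= object_xy[1] <= box_xy_max[1])
    return r_hover1 if in_box else r_hover2
```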
4) Continuous learning
Referring to fig. 5, fig. 5 is a schematic view of a continuous learning mode of the mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning according to the present invention.
Continuous learning allows auxiliary tasks to be provided to the agent so that it can work up from a simple task to a difficult one. Three task environments of different difficulty are set up in the above embodiment: one with a single fixed-position obstacle, one with two fixed-position obstacles, and one with two randomly generated obstacles. For the current task, when the reward obtained by the robot in an epoch reaches the expected value and the training result has stabilized, i.e., the current training task is completed, the model is migrated to the next environment for training. Training continues in this way until the robot completes the last task.
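The stage-by-stage migration could be organized as in the sketch below (an illustration under assumed interfaces: `train_one_epoch`, the reward threshold and the stability test are stand-ins, not the patent's code); the same agent object, and therefore the learned weights, is carried from one environment to the next.

```python
# Illustrative continual-learning loop: train in each environment until the epoch reward
# reaches a threshold and stabilizes, then carry the same agent into the next environment.
from typing import Callable, Sequence

def train_curriculum(envs: Sequence[object],
                     agent: object,
                     train_one_epoch: Callable[[object, object], float],
                     reward_threshold: float,
                     stability_window: int = 10,
                     stability_tol: float = 5.0,
                     max_epochs: int = 4000) -> object:
    """Sequentially train `agent` on `envs` (easy to hard), reusing the learned weights."""
    for stage, env in enumerate(envs):
        history = []
        for epoch in range(max_epochs):
            episode_reward = train_one_epoch(agent, env)   # one training period / epoch
            history.append(episode_reward)
            recent = history[-stability_window:]
            stable = (len(recent) == stability_window and
                      max(recent) - min(recent) <= stability_tol)
            if episode_reward >= reward_threshold and stable:
                break                                       # stage task completed; migrate model
    return agent
```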
Example 2
A specific task environment is constructed based on the MuJoCo physics engine and Robosuite (an illustrative toolchain sketch is given after the training details below):
1) environment: a table with the length of 0.25m, the width of 0.18m and the height of 0.8m is placed in the environment. A piece of bread, the target object 01, is placed randomly on the left half of the table and a cylinder, the obstacle 02, is placed on the right side of the table. A four-pane frame is placed on the right side of the desktop, one of which is the target frame 03. A virtual object is placed on the target frame 03, and the position of the virtual object is the position of the target point. In this environment, the training task of the robot is to grab and place the target object 01 in the target frame 03 without colliding with the obstacle 02. Each epoch and training period of the task comprises 500 steps. In the context of the first stage task, a cylindrical barrier is placed in a fixed position on the right side of the table. In the environment of the second stage task, two cylindrical barriers are placed at fixed positions on the right side of the table, and the other positions are the same as the environment of the first stage task; in the third stage task environment, two obstacles are randomly generated on the right side of the table, and the rest are consistent with the first stage task environment and the second stage task environment.
2) Network structure: in the DRL algorithm used, the actor and critic networks each use two fully connected layers of size 256.
3) Hyper-parameters: the hyper-parameter settings of the reinforcement learning algorithm are shown in Table 1; the best results are obtained when N = 5 and M = 25.
TABLE 1
(hyper-parameter settings of the reinforcement learning algorithm; the table is provided as an image in the original publication)
4) Training details: 4000 epochs were trained on a system running Ubuntu 16.04 with an Intel Xeon Gold 5118 CPU, NVIDIA Tesla V100 SXM2 NVLINK graphics cards, and CUDA 10.
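As a hedged illustration of the MuJoCo/Robosuite toolchain only, a stock Robosuite pick-and-place task with an OSC position controller could be instantiated roughly as follows; the API names reflect Robosuite 1.x and may differ across versions, and this is not the patent's custom table/bread/obstacle scene.

```python
# Hedged sketch of a stock Robosuite pick-and-place environment with an OSC position controller
# (Robosuite 1.x API; the embodiment's custom scene with cylinder obstacles is not reproduced here).
import numpy as np
import robosuite as suite
from robosuite import load_controller_config

controller_config = load_controller_config(default_controller="OSC_POSITION")  # (dx, dy, dz) + gripper

env = suite.make(
    env_name="PickPlaceBread",        # stock single-object pick-and-place task
    robots="Panda",
    controller_configs=controller_config,
    has_renderer=False,
    has_offscreen_renderer=False,
    use_camera_obs=False,
    reward_shaping=True,
    horizon=500,                      # 500 steps per episode, as in the embodiment
    control_freq=20,
)

obs = env.reset()
action = np.zeros(env.action_dim)     # 3 position offsets + 1 gripper command, all in [-1, 1]
obs, reward, done, info = env.step(action)
env.close()
```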
Training results are as follows:
1) state validity:
some studies have considered the minimum distance between the surface of the manipulator and the surface of the obstacle. However, such a representation can result in a significant amount of computation and affect the time to complete the entire training. In order to reduce the amount of calculation, some work expresses the relationship between the robot arm and the obstacle by the distance between the joint and the obstacle. This state is represented as follows:
S 2 ={d joint6 ,d joint7 ,d joint8 ,y f ,p o ,p e }
d joint6 、d joint7 、d joint8 j representing a robot arm oint6 、j oint7 、j oint8 Distance from the obstacle. Two different states are compared, representing the effectiveness of the training tasks in the first stage task environment at S1, S2.
The results show that the first state representation achieves better training results; it is therefore used as the state representation in subsequent experiments.
2) Data efficiency
The training results of the model migrated using the CL method are compared with those of a model trained from scratch in the third-stage task environment. Model 1 refers to the model that was trained in the first environment, migrated to the second environment, and then migrated again from the second environment to the third environment (two migrations in total). Model 2 is the model trained from scratch.
The results show a large difference in efficiency between the twice-migrated model trained in the third environment and the model trained directly from scratch: Model 1 converges to a good result before 1×10^6 training steps, whereas Model 2 remains in a low-reward state.
3) Final performance
The training effects of the two methods are compared for the same training time. Here, Model 1 is defined to start training in the first-stage task environment; after it reaches a high reward (e.g., an epoch cumulative reward exceeding 600), the model is migrated to the next environment for training, and so on. Model 2 begins training directly in the third environment. The training duration is the number of steps Model 1 takes to progress from the first-stage task environment to the third-stage task environment.
The results show that, although in the early stage the robot spends some time training on the simple task before moving to the difficult one, the picking task then converges faster and the robot trains better.
Furthermore, the model migrated using the CL method is compared with a model trained from scratch over a shorter episode length, i.e., with the total number of steps per epoch reduced from 500 to 300.
The results show that the migrated model not only achieves higher training efficiency, but also completes the task more quickly and obtains a higher reward within the shorter episode. It should be noted that, because each episode is shortened from 500 steps to 300 steps, the reward obtained by the robot is lower than in the first three environments.
It should be noted that the mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning optimizes the training efficiency of reinforcement learning on long-horizon tasks, significantly reduces the required training time, enables the robot model to quickly adapt to the next, more difficult task after each migration, and reduces the deployment cost of the task in practical industrial applications.
At the same time, a more effective state representation is provided for the picking-type training task, giving a good starting point for training the long-horizon task. To address the difficulty of training long-horizon tasks with reinforcement learning, a continual learning method is adopted to optimize training efficiency; the migrated model achieves a better training effect and adapts more quickly to a new working environment.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and additions can be made without departing from the principle of the present invention, and these should also be considered as the protection scope of the present invention.

Claims (9)

1. A mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning is characterized by comprising the following steps:
(1) sequentially executing training tasks, wherein the training tasks at least comprise a first-stage training task, a second-stage training task and a third-stage training task with sequentially increasing difficulty;
(2) acquiring an obstacle avoidance grabbing model, wherein the obstacle avoidance grabbing model is obtained by performing deep learning on an execution training task;
(3) inputting the task to be executed into the acquired obstacle avoidance grabbing model, thereby realizing obstacle avoidance grabbing by the mechanical arm.
2. The mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning as claimed in claim 1, wherein the obtaining of the obstacle avoidance grabbing model comprises:
acquiring a first-stage task training model, wherein the first-stage task training model is obtained by deep learning while executing the first-stage training task; the first-stage training task is completed when the reward obtained in a training period reaches a threshold value and the difference between the rewards obtained in successive training periods is within a threshold value;
acquiring a second-stage task training model, wherein the second-stage task training model is obtained by deep learning while executing the second-stage training task, the second-stage training task being executed by the first-stage task training model; the second-stage training task is completed when the reward obtained in a training period reaches a threshold value and the difference between the rewards obtained in successive training periods is within a threshold value;
and acquiring a third-stage task training model, wherein the third-stage task training model is obtained by deep learning while executing the third-stage training task, the third-stage training task being executed by the second-stage task training model; training is finished when the reward obtained in a training period reaches a threshold value and the difference between the rewards obtained in successive training periods is within a threshold value.
3. The mechanical arm obstacle avoidance grasping method based on the continuous reinforcement learning as claimed in claim 2,
the first-stage training task is a target object picking task provided with a first-stage obstacle; and the first-stage barrier is fixed in position;
the second-stage training task is a target object picking task provided with second-stage obstacles, and the second-stage obstacles are fixed in position and are more than the first-stage obstacles in number;
the third-stage training task is a target object picking task provided with a third-stage obstacle, the position of the third-stage obstacle is randomly generated, and the number of the obstacles is equal to that of the second-stage obstacles.
4. The mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning as claimed in claim 1, wherein the executing of the training task comprises:
acquiring a state, wherein the state acquisition is used for identifying the relative positions of the mechanical arm, the target object and the obstacle, and comprises the step of acquiring the minimum distance between the mechanical arm and the obstacle;
a motion control for controlling the motion of the robotic arm, including converting the robotic arm motion to a position control in a Cartesian coordinate system;
reward acquisition, wherein the reward acquisition is a record of successful completion of a target action.
5. The mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning as claimed in claim 4, wherein the state acquisition includes: regarding a plurality of connecting arms, the end effector and the obstacles on the mechanical arm as line segments, dividing each line segment into a plurality of points, and taking the distances between points on the mechanical arm and points on an obstacle as the distance between the mechanical arm and that obstacle; the minimum distance d_min between the mechanical arm and the obstacles is the first part of the state, the positions y_f of the two fingers of the gripper along the y-axis of the end effector coordinate system are the second part, the position p_o of the target object in three-dimensional space is the third part, and the position p_e of the end effector in three-dimensional space is the fourth part:
S_1 = {d_min, y_f, p_o, p_e}
The range of each state component is as follows:
d_min1, d_min2 ∈ [0, 1.3]
y_left ∈ [0, 0.04], y_right ∈ [-0.04, 0]
x_o ∈ [-2, 2], y_o ∈ [-2, 2], z_o ∈ [-2, 2]
x_e ∈ [-1.35, 0.35], y_e ∈ [-0.98, 0.72], z_e ∈ [0.52, 1.82]
where d_min1 and d_min2 denote the minimum distances between the mechanical arm and the two obstacles, respectively; y_left and y_right denote the positions of the left and right fingers of the gripper along the y-axis of the end effector coordinate system; x_o, y_o, z_o denote the position of the target object on the x, y and z axes of the end effector coordinate system; and x_e, y_e, z_e denote the position of the end effector on the x, y and z axes of the world coordinate system.
6. The mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning as claimed in claim 4, wherein the motion control includes: converting the motion of the mechanical arm into position control in a Cartesian coordinate system using operational space control (OSC), with the action space a of the mechanical arm being:
a = (Δx, Δy, Δz, Δg),  Δx, Δy, Δz, Δg ∈ [-1, 1]
where Δx, Δy and Δz denote the offsets of the end effector along the x, y and z axes of the Cartesian coordinate system, and Δg is the opening/closing state of the gripper (the clamp on the mechanical arm): if Δg < 0 the clamp closes; if Δg = 0 the clamp remains unchanged; and if Δg > 0 the clamp opens.
7. The mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning of claim 4, wherein the reward acquisition includes rewards for the four stages of reaching, grasping, lifting and hovering:
r_reach = α*(1 - tanh(λ*d_1))
r_grasp = β
r_lift = r_grasp + (γ - β)*(1 - tanh[η*(z_set - z_object)])
r_hover = r_hover1 if the lifted object lies within the x-y range of the target box, and r_hover2 otherwise (the full piecewise expression is given as an image in the original publication)
r_obstacle = k
r_time = τ
where r_reach, r_grasp, r_lift and r_hover denote the rewards for reaching, grasping, lifting and hovering, respectively; r_obstacle denotes the penalty for a collision between the mechanical arm and an obstacle; r_time denotes the time penalty incurred at every step until the task is completed; d_1 denotes the Cartesian distance from the gripper to the target object; z_set and z_object denote the set height to be reached and the height of the object, respectively; d_2 denotes the Cartesian distance from the target object to the target point; r_hover1 and r_hover2 denote the hover rewards in the two cases; α, β, γ, μ, λ and η denote the coefficients set in the reaching, grasping, lifting and hovering reward formulas; k denotes the collision penalty coefficient; and τ denotes the time penalty coefficient.
8. The mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning as claimed in claim 3, wherein there is one obstacle in the first stage; there are two second-stage obstacles and two third-stage obstacles.
9. The mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning as claimed in claim 1, wherein the deep learning algorithm is executed using a Soft Actor-Critic algorithm.
CN202210788006.4A 2022-07-04 2022-07-04 Mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning Pending CN115042185A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210788006.4A CN115042185A (en) 2022-07-04 2022-07-04 Mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210788006.4A CN115042185A (en) 2022-07-04 2022-07-04 Mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning

Publications (1)

Publication Number Publication Date
CN115042185A true CN115042185A (en) 2022-09-13

Family

ID=83164480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210788006.4A Pending CN115042185A (en) 2022-07-04 2022-07-04 Mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning

Country Status (1)

Country Link
CN (1) CN115042185A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116330290A (en) * 2023-04-10 2023-06-27 大连理工大学 Multi-agent deep reinforcement learning-based five-finger smart robot control method
CN116330290B (en) * 2023-04-10 2023-08-18 大连理工大学 Multi-agent deep reinforcement learning-based five-finger smart robot control method

Similar Documents

Publication Publication Date Title
Sadeghi et al. Sim2real viewpoint invariant visual servoing by recurrent control
Vahrenkamp et al. Humanoid motion planning for dual-arm manipulation and re-grasping tasks
Sadeghi et al. Sim2real view invariant visual servoing by recurrent control
Li et al. Motion planning of six-dof arm robot based on improved DDPG algorithm
CN113524186B (en) Deep reinforcement learning double-arm robot control method and system based on demonstration examples
Balakrishna et al. On-policy robot imitation learning from a converging supervisor
CN115042185A (en) Mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning
CN112975977A (en) Efficient mechanical arm grabbing depth reinforcement learning reward training method and system
Xu et al. Accelerating integrated task and motion planning with neural feasibility checking
CN111223141B (en) Automatic pipeline work efficiency optimization system and method based on reinforcement learning
CN115990891A (en) Robot reinforcement learning assembly method based on visual teaching and virtual-actual migration
Saito et al. Wiping 3D-objects using deep learning model based on image/force/joint information
Tu et al. Moving object flexible grasping based on deep reinforcement learning
CN114055471B (en) Mechanical arm online motion planning method combining neural motion planning algorithm and artificial potential field method
CN114800488B (en) Redundant mechanical arm operability optimization method and device based on deep reinforcement learning
Qi et al. Reinforcement learning control for robot arm grasping based on improved DDPG
CN113967909B (en) Direction rewarding-based intelligent control method for mechanical arm
CN114454160A (en) Mechanical arm grabbing control method and system based on kernel least square soft Bellman residual reinforcement learning
Gai et al. Model-driven reinforcement learning and action dimension extension method for efficient asymmetric assembly
Liu et al. Industrial insert robotic assembly based on model-based meta-reinforcement learning
Yao et al. Robotic grasping training using deep reinforcement learning with policy guidance mechanism
Tabata et al. Casting manipulation of unknown string by robot arm
CN117140527B (en) Mechanical arm control method and system based on deep reinforcement learning algorithm
Jia et al. Research and implementation of complex task based on DMP
CN116330290B (en) Multi-agent deep reinforcement learning-based five-finger smart robot control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination