CN116945180A - Mechanical arm dynamic object grabbing method based on reinforcement learning - Google Patents

Mechanical arm dynamic object grabbing method based on reinforcement learning Download PDF

Info

Publication number
CN116945180A
CN116945180A CN202310991331.5A CN202310991331A
Authority
CN
China
Prior art keywords
stage
training
reinforcement learning
task
mechanical arm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310991331.5A
Other languages
Chinese (zh)
Inventor
张诗笛
毕运波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310991331.5A priority Critical patent/CN116945180A/en
Publication of CN116945180A publication Critical patent/CN116945180A/en
Pending legal-status Critical Current

Links

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed

Abstract

The invention discloses a reinforcement-learning-based method for grabbing dynamic objects with a mechanical arm. The method decomposes a long-horizon complex task into five training tasks of sequentially increasing difficulty; by executing the training task of each stage in turn, an intelligent model for grabbing dynamic objects with the mechanical arm is obtained, enabling the arm to grab moving objects effectively. Targeting real industrial environments and the needs of long-horizon complex tasks, the invention provides an effective state representation and reward function design, achieving faster convergence and higher training efficiency. In addition, the invention performs dynamics optimization on the robot motion, ensuring stability and accuracy during movement.

Description

Mechanical arm dynamic object grabbing method based on reinforcement learning
Technical Field
The invention relates to a mechanical arm dynamic object grabbing method in the field of intelligent control of mechanical arms, in particular to a mechanical arm dynamic object grabbing method based on reinforcement learning.
Background
Currently, mechanical arms are widely used in a plurality of fields such as manufacturing, mechanical assembly, parts handling, man-machine cooperation, and the like.
In a grabbing task, the mechanical arm identifies the object to be grabbed, moves to the grabbing position, grabs the target object, and then places it at the target position to complete the grabbing process. The most primitive implementation is manual teaching: the grabbing position is identified manually and the robot is driven to it through a teach pendant. This consumes considerable human effort, works only for static objects, and is hardly feasible in dynamic grabbing scenarios. Grabbing tasks in real industrial environments, however, often involve moving objects, for example objects on a conveyor belt or objects handed over during human-robot interaction. Another conventional control approach is motion planning: once the grabbing target position is obtained, a feasible path to it is planned and the arm follows it to grab. This approach cannot adapt to unstructured environments or workplaces with many uncertain factors; in particular, in dynamic grabbing scenarios a new feasible path must be recomputed by the motion planning algorithm whenever the object position changes, which increases computation and time cost. In recent years, reinforcement learning (Reinforcement Learning, RL) has been introduced into mechanical arm control and has achieved great success in artificial-intelligence gaming, robot control and related fields. Researchers have applied reinforcement learning to problems such as obstacle avoidance and static object grabbing with good results. However, grabbing a dynamic object with a mechanical arm is a long-horizon complex task, and how to make the arm intelligently grab an object moving at some speed in an industrial scene remains an open problem.
Specifically, a long-horizon complex task usually contains several sub-tasks. For dynamic object grabbing, the clamping jaw pose must be adjusted according to the state of the target object, and the arm position must be adjusted in real time as the object moves, which places very high demands on the arm's intelligent model. In high-dimensional control scenarios, the numbers of states and actions of the learning task grow, increasing computation and memory consumption; because of the long observation horizon or insufficient feature-extraction capability of the network architecture, the arm often has difficulty perceiving changes of the environment state accurately; and the complexity of the task also makes the reward function hard to define. For these reasons, reinforcement learning algorithms train inefficiently in long-horizon complex task scenarios and can hardly produce a satisfactory intelligent model.
Such a long-horizon complex task can be formulated as a Markov Decision Process (MDP): at any time t, given the state s_t ∈ S, the agent (i.e., the robot) executes an action a_t ∈ A according to the policy π over the state space S and action space A; the environment then transitions to the next state, and the agent obtains the corresponding reward r_t according to the designed reward function. The goal of the grabbing problem is to find an optimal policy π* that maximizes the expected sum of future rewards, i.e., the sum of all rewards from time t to +∞.
In this problem, reinforcement learning is used to find the optimal policy π* that optimizes the grabbing process: optimizing the grabbing pose and tracking the moving target object to achieve dynamic grabbing. In addition, training efficiency must be optimized so that the mechanical arm learns efficiently and quickly how to complete the long-horizon complex task.
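To make the MDP formulation concrete, the following minimal Python sketch (env and policy are hypothetical stand-ins for the grabbing environment and the policy π; gamma is a discount factor) shows how the cumulative reward that an optimal policy π* should maximize is accumulated over one training round.

```python
# Minimal sketch of the MDP interaction loop described above.
# `env` and `policy` are hypothetical stand-ins for the grabbing
# environment and the learned policy pi; gamma is the discount factor.
def run_episode(env, policy, gamma=0.99):
    state = env.reset()
    episode_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy.act(state)                  # a_t sampled from pi(. | s_t)
        state, reward, done, _ = env.step(action)   # environment returns r_t and s_{t+1}
        episode_return += discount * reward         # accumulate discounted reward
        discount *= gamma
    return episode_return                           # quantity an optimal pi* maximizes in expectation
```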
Disclosure of Invention
In order to solve the problems in the background art, the invention provides a mechanical arm dynamic object grabbing method based on reinforcement learning. Aiming at the requirement of the mechanical arm for grabbing the dynamic object in the actual industrial environment, the invention combines the mechanical arm motion planning and grabbing planning in the dynamic object grabbing scene, has more effective state representation and rewarding design, realizes the stable and efficient grabbing of the dynamic object of the robot, and improves the learning effect of the robot.
In order to achieve the above purpose, the invention adopts the following technical scheme:
S1: constructing a multi-stage training task;
S2: training the reinforcement learning network model stage by stage according to the multi-stage training task until training is completed, to obtain a mechanical arm dynamic object grabbing intelligent model;
S3: inputting the target task into the mechanical arm dynamic object grabbing intelligent model to realize grabbing of the target object by the mechanical arm.
In the step S1, the multi-stage training task comprises a first stage training task, a second stage training task, a third stage training task, a fourth stage training task and a fifth stage training task; the task difficulty of the training tasks of the first stage, the second stage, the third stage, the fourth stage and the fifth stage is sequentially increased.
Among the multi-stage training tasks, the first stage training task is a target grabbing task with a static object to be grabbed and a fixed position;
the second stage training task is a target grabbing task of which the grabbed object is static and the position of the grabbed object is randomly generated;
the training task in the third stage is a target grabbing task of which the grabbed object is static and the position is randomly generated;
the training task in the fourth stage is a target grabbing task with a dynamic object to be grabbed and a fixed position;
the fifth stage training task is a target grabbing task of which the grabbed object is dynamic and the position of the grabbed object is randomly generated.
The step S2 is specifically as follows:
and training the reinforcement learning network model according to the multi-stage training tasks in sequence, taking the reinforcement learning network model trained by the previous stage training task as an initial reinforcement learning network model of the next stage training task until all the stage training tasks are completed, and taking the final reinforcement learning network model as a mechanical arm dynamic object grabbing intelligent model.
In the training process of each stage of the reinforcement learning network model, the training task of the current stage is completed when the reward obtained within a training round reaches the reward threshold and the difference between the rewards obtained in successive training rounds stays within a threshold.
The input state quantity S_1 of the reinforcement learning network model comprises six components: the position P_o of the grabbed object in the robot three-dimensional coordinate system, the position P_g of the clamping jaw in the robot three-dimensional coordinate system, the clamping jaw action a_g, the contact condition C_og between the clamping jaw and the grabbed object, the time information t, and the mechanical arm joint information θ_j, satisfying S_1 = {P_o, P_g, a_g, C_og, t, θ_j}.
In the training process of each stage of the reinforcement learning network model, the reward function of each stage contains the basic reward; the formulas are respectively:
R_1 = R_2 = r_base
R_3 = r_base + Σ_{j=1}^{j_max} ( r_limit,j + r_smooth,j )
R_4 = r_base'
R_5 = r_base''
where R_1, R_2, R_3, R_4 and R_5 are the reward function values of the first to fifth stages respectively, r_base is the first basic reward value, j is the joint angle index, r_limit,j is the joint angle limitation reward of joint j (a negative reward given when the joint angle θ_j exceeds its allowed range θ_j_limit), r_smooth,j is the smooth control reward of joint j (a negative reward given when the joint angular acceleration θ''_j exceeds its limit), θ_j is the joint angle of the robot, θ_j_limit is the allowed range of the robot joint angle, θ''_j is the joint angular acceleration of the robot, j_max is the maximum value of the joint angle index, r_base' is the second basic reward value, r_base'' is the third basic reward value, and the first adjustment coefficient and the second adjustment coefficient ω are the magnitudes of these two negative rewards respectively.
The formula of the basic reward is as follows:
r_base = r_reach + r_lift + r_t
r_reach = r_dog + r_go
r_lift = r_gc + r_cog + r_zo
where r_base is the first basic reward value, r_reach is the reward value of the approaching process, r_lift is the reward value of the lifting process, r_t is the time reward value, r_dog is the distance reward value, r_go is the clamping jaw opening reward value, r_dog_max is the maximum distance between the clamping jaw and the target object, p is the reward sensitivity parameter, r_gc is the clamping jaw closing reward value, r_cog is the reward value for contact between the two fingers of the clamping jaw and the object, r_zo is the target object lifting reward value, z_o is the real-time height of the object, z_o_initial is the height of the object at scene initialization, t_l is the first time limit, and α, β, γ, δ, ε are the third to seventh adjustment coefficients respectively;
the second basic reward value r_base' is calculated by replacing the first time limit t_l in the basic reward with the second time limit t_l4, where t_l4 satisfies t_l4 = ∈·t_l and ∈ is the eighth adjustment coefficient;
the third basic reward value r_base'' is calculated by replacing the first time limit t_l in the basic reward with the third time limit t_l5, where t_l5 satisfies t_l5 = μ·t_l, μ is the ninth adjustment coefficient, and ∈ < μ.
The reinforcement learning network model adopts an Actor-Critic proximal policy optimization (PPO) algorithm.
The beneficial effects of the invention are as follows:
1) The reinforcement-learning-based mechanical arm dynamic object grabbing method provides a more effective state representation and reward function design for actual industrial environments, such as grabbing moving objects on a conveyor belt or receiving objects handed over during human-machine interaction, combined with the task requirements, thereby improving the learning effect of the mechanical arm on this task.
2) According to the mechanical arm dynamic object grabbing method based on reinforcement learning, the multi-stage training task with increasing difficulty is designed by decomposing the long-period complex task in the mechanical arm control problem, and a targeted reward function is designed according to the characteristics of each stage. After each training task is completed, a staged reinforcement learning network model is obtained, and then the staged reinforcement learning network model is put into the training task of the next stage, so that the network model has good effect on completing long-period complex tasks from realizing simple static object grabbing to realizing dynamic object grabbing, and faster convergence speed and higher training efficiency are realized.
3) According to the mechanical arm dynamic object grabbing method based on reinforcement learning, dynamics optimization is carried out on the movement of a robot in the execution process, a targeted reward function design is carried out on joint speed and joint acceleration in the reinforcement learning process, unnecessary impact and vibration in the running process of the robot are reduced, and smoothness, stability and accuracy of the robot in complex task training are ensured. In addition, the optimization strategy provides powerful support for migration of the agent model between the simulation environment and the real industrial scene.
Drawings
Fig. 1 is a flowchart of a method for capturing dynamic objects by a mechanical arm based on reinforcement learning.
Fig. 2 is a schematic diagram of the first-stage training task and its environment according to the reinforcement learning-based mechanical arm dynamic object grabbing method of the present invention.
Fig. 3 is a schematic diagram of the training task and the environment of the second stage of the robot arm dynamic object grabbing method based on reinforcement learning.
Fig. 4 is a schematic diagram of the training task and the environment of the third stage of the reinforcement learning-based mechanical arm dynamic object grabbing method according to the present invention.
Fig. 5 is a schematic diagram of the training task and the environment of the fourth stage of the mechanical arm dynamic object grabbing method based on reinforcement learning.
Fig. 6 is a schematic diagram of the training task and the environment of the fifth stage of the robot arm dynamic object grabbing method based on reinforcement learning.
Fig. 7 is a diagram of a model training and obtaining process in the training process of the mechanical arm dynamic object grabbing method based on reinforcement learning.
In the figure: a six-axis mechanical arm 1; a clamping jaw 2; an experiment table 3; a target object 4; a target object random position set 5; the target object moves in direction 6.
Detailed Description
The invention is further described below with reference to the drawings and examples of implementation.
It should be noted that the invention provides a mechanical arm dynamic object grabbing method based on reinforcement learning, which divides training tasks into different difficulty grades, obtains corresponding reinforcement learning network models in the training tasks of each grade, utilizes the reinforcement learning network models of low grade to complete the training tasks of high grade, and enables the capacity of the models to complete the tasks to be continuously improved, and finally obtains a mechanical arm dynamic object grabbing intelligent model, so that the convergence speed of the mechanical arm dynamic object grabbing tasks can be accelerated, and the training efficiency is improved.
As shown in fig. 1, the method comprises the steps of:
s1: constructing a multi-stage training task;
in S1, the multi-stage training task comprises a first stage, a second stage, a third stage, a fourth stage and a fifth stage training task; the task difficulty of the training tasks of the first stage, the second stage, the third stage, the fourth stage and the fifth stage is sequentially increased.
Among the multi-stage training tasks, the first-stage training task is a target grabbing task in which the grabbed object is static and its position is fixed;
the second stage training task is a target grabbing task of which the grabbed object is static and the position of the grabbed object is randomly generated;
the training task in the third stage is a target grabbing task of which the grabbed object is static and the position is randomly generated; the second stage training task and the third stage training task have the same scene, but in the training process of the reinforcement learning network model, the training targets are different, and the corresponding reward functions are also different.
The fourth stage training task is a target grabbing task of which the grabbed object is dynamic (namely has a certain moving speed) and the position of the grabbed object is fixed;
the fifth stage training task is a target grabbing task of which the grabbed object is dynamic and the position of the grabbed object is randomly generated.
S2: training the reinforcement learning network model in sequence according to the multi-stage training task until training is completed, and obtaining a mechanical arm dynamic object grabbing intelligent model;
s2 specifically comprises the following steps:
and training the reinforcement learning network model according to the multi-stage training tasks in sequence, taking the reinforcement learning network model trained by the previous stage training task as an initial reinforcement learning network model of the next stage training task until all the stage training tasks are completed, and taking the final reinforcement learning network model as a mechanical arm dynamic object grabbing intelligent model.
In the training process of each stage of the reinforcement learning network model, the training task of the current stage is completed when the reward obtained within a training round reaches the reward threshold and the difference between the rewards obtained in successive training rounds stays within a threshold.
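A minimal sketch of how this staged training and the stage-completion criterion could be combined is given below (see also the stage descriptions that follow); make_stage_env and train_one_round are hypothetical helpers, and the reward threshold, window size and difference threshold are illustrative values.

```python
def stage_converged(recent_rewards, reward_threshold, diff_threshold):
    # Stage is complete when episode rewards reach the threshold and the
    # spread of recent episode rewards stays within a small band.
    return (min(recent_rewards) >= reward_threshold and
            max(recent_rewards) - min(recent_rewards) <= diff_threshold)

def train_multi_stage(model, num_stages=5, window=20,
                      reward_threshold=1000.0, diff_threshold=50.0):
    for stage in range(1, num_stages + 1):
        env = make_stage_env(stage)            # hypothetical: builds the stage-k task scene
        rewards = []
        while True:
            episode_reward = train_one_round(model, env)   # hypothetical: one training round/update
            rewards.append(episode_reward)
            if len(rewards) >= window and stage_converged(
                    rewards[-window:], reward_threshold, diff_threshold):
                break                          # the model trained on stage k initializes stage k+1
    return model                               # final model: the dynamic object grabbing agent
```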
The mechanical arm dynamic object grabbing intelligent model is obtained specifically as follows:
the method comprises the steps of obtaining a first-stage reinforcement learning network model, wherein the first-stage reinforcement learning network model is obtained by performing a first-stage training task through reinforcement learning after an intelligent model is initialized. When the reward obtained in the training round reaches a threshold and the difference in the rewards obtained in each training round is within the threshold while the first stage training task is being performed, the first stage training task is considered to be completed.
As shown in fig. 2, in the training task environment of the first stage of the mechanical arm dynamic object grabbing method based on reinforcement learning according to the present invention, the target object 4 is kept stationary on the test stand 3. In the reinforcement learning training process, the six-axis mechanical arm 1 drives the clamping jaw 2 to approach the target through joint movement and completes the grabbing task.
And obtaining a second-stage reinforcement learning network model, wherein the second-stage reinforcement learning network model is obtained by reinforcement learning through performing a second-stage training task by the first-stage reinforcement learning network model. When the reward obtained in the training round reaches a threshold and the difference in the rewards obtained in each training round is within the threshold while the second stage training task is being performed, the second stage training task is deemed complete.
As shown in fig. 3, in the training task environment of the second stage of the mechanical arm dynamic object grabbing method based on reinforcement learning according to the present invention, the target object 4 is kept stationary on the test stand 3, the initial position is randomly generated, and the initial position generation area is defined in the target object random position set 5. In the reinforcement learning training process, the six-axis mechanical arm 1 drives the clamping jaw 2 to approach the target through joint movement and completes the grabbing task.
And acquiring a third-stage reinforcement learning network model, wherein the third-stage reinforcement learning network model is obtained by reinforcement learning through performing a third-stage training task by the second-stage reinforcement learning network model. When the reward obtained in the training round reaches a threshold and the difference in the rewards obtained in each training round is within the threshold while the third stage training task is being performed, the third stage training task is deemed complete.
As shown in fig. 4, in the third stage training task environment of the reinforcement learning-based mechanical arm dynamic object grabbing method, since the training task aims at performing dynamics optimization for the mechanical arm motion process, the scene setting is the same as the second stage training task environment. The target object 4 is held stationary on the test bed 3, its initial position is randomly generated, and the initial position generation area is defined in the target object random position set 5. In the reinforcement learning training process, the six-axis mechanical arm 1 drives the clamping jaw 2 to approach the target through joint movement and completes the grabbing task.
And acquiring a fourth-stage reinforcement learning network model, wherein the fourth-stage reinforcement learning network model is obtained by reinforcement learning through performing a fourth-stage training task by the third-stage reinforcement learning network model. When the reward obtained in the training round reaches the threshold and the difference in the rewards obtained in each training round is within the threshold while the fourth stage training task is being performed, the fourth stage training task is considered to be completed.
As shown in fig. 5, in the training task environment of the fourth stage of the mechanical arm dynamic object grabbing method based on reinforcement learning, the object 4 has a certain moving speed on the test bed 3, so as to simulate the scene of the object transported by the conveyor belt in real industrial production. In the reinforcement learning training process, the six-axis mechanical arm 1 drives the clamping jaw 2 to approach the target through joint movement and completes the grabbing task.
And acquiring a fifth-stage reinforcement learning network model, wherein the fifth-stage reinforcement learning network model is obtained by reinforcement learning through performing a fifth-stage training task by the fourth-stage reinforcement learning network model. When the reward obtained in the training round number reaches the threshold value and the difference value of the rewards obtained in each training round number is within the threshold value when the training task of the fifth stage is executed, the training task of the fifth stage is regarded as being completed. The model is the final intelligent model for grabbing the dynamic object of the mechanical arm.
As shown in fig. 6, in the training task environment of the fifth stage of the reinforcement learning-based mechanical arm dynamic object grabbing method, the object 4 has a certain moving speed on the test bed 3, so as to simulate the scene of the object transportation of the conveyor belt in real industrial production. Further, the initial positions of the objects 4 are randomly generated, and the initial position generation area is defined within the target object random position set 5. In the reinforcement learning training process, the six-axis mechanical arm 1 drives the clamping jaw 2 to approach the target through joint movement and completes the grabbing task.
Performing training tasks for the reinforcement learning network model includes:
and the state acquisition is used for identifying the state information of the mechanical arm and the state information of the grabbed object.
And the motion control is used for controlling the movement of the mechanical arm and the opening and closing of the clamping jaw and comprises the step of converting the movement of the mechanical arm into position control in a Cartesian coordinate system.
And the reward acquisition comprises positive rewards and negative rewards and is used for evaluating the advantages and disadvantages of the mechanical arm for completing a certain action. The positive rewards are used for encouraging the mechanical arm to successfully grab the target object or realize the preset task, and the negative rewards are used for indicating the bad behaviors or the false actions of the mechanical arm.
S3: and inputting the target task into a dynamic object grabbing intelligent model of the mechanical arm, so as to realize grabbing of the target object by the mechanical arm, wherein the target object is dynamic or static.
As shown in fig. 7, in the training process of the mechanical arm dynamic object grabbing method based on reinforcement learning, the basic model is trained by tasks of a first stage, a second stage, a third stage, a fourth stage and a fifth stage with increasing difficulty, the adaptability and the intelligence of the grabbing task are continuously improved, the mechanical arm dynamic object grabbing intelligent model is obtained finally, and the mechanical arm dynamic object grabbing intelligent model has good dynamic characteristics.
The steps of the embodiment of the invention are as follows:
1) System environment configuration
Under the Ubuntu 20.04 system, the reinforcement learning training experiments are carried out on a laptop with an Nvidia RTX 3080 Laptop GPU and an AMD Ryzen 9 5900HX processor.
2) Simulation environment construction
Building the reinforcement learning training environment based on the PyBullet physics engine: a test stand 1 m long, 0.32 m wide and 0.6 m high is placed in the environment for holding the target object. A robot base is placed in the environment for mounting the UR5e collaborative robot, and a Robotiq clamping jaw is mounted at the end of the UR5e. In addition, the initial state of the target object is set according to the requirements of the different tasks: in the second-stage and third-stage training tasks, a 0.28 m × 0.24 m random position set of the target object is defined, and the object appears at a random position in this set when a training round is initialized; in the fifth-stage training task, a 0.56 m × 0.24 m random position set of the target object is defined, and the object likewise appears at a random position in this set at round initialization. It should be noted that, because the target object in the fifth-stage training task moves during the round, its random position set is configured differently from the random position sets used in the second-stage and third-stage training task scenes.
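The following sketch illustrates, under assumptions, how such a PyBullet scene could be assembled; the combined UR5e/Robotiq URDF file name, the robot base height and the placement of the random-position regions are assumptions rather than details taken from the embodiment.

```python
# Sketch of the PyBullet training scene described above. The arm URDF file
# name is an assumed local asset, not a file shipped with PyBullet.
import random
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                                   # headless simulation
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
p.loadURDF("plane.urdf")

# Test stand: 1 m x 0.32 m x 0.6 m box, top surface at z = 0.6 m.
bench = p.createCollisionShape(p.GEOM_BOX, halfExtents=[0.5, 0.16, 0.3])
p.createMultiBody(baseMass=0, baseCollisionShapeIndex=bench, basePosition=[0.6, 0.0, 0.3])

# UR5e arm with Robotiq jaw on its base (assumed combined URDF, assumed base height).
robot = p.loadURDF("ur5e_robotiq.urdf", basePosition=[0.0, 0.0, 0.6], useFixedBase=True)

def sample_object_position(stage):
    # Random initial-position set: 0.28 m x 0.24 m for stages 2-3,
    # 0.56 m x 0.24 m for stage 5 (region centre and orientation are illustrative).
    dx, dy = (0.28, 0.24) if stage in (2, 3) else (0.56, 0.24)
    x = random.uniform(0.6 - dx / 2, 0.6 + dx / 2)
    y = random.uniform(-dy / 2, dy / 2)
    return [x, y, 0.6]                                # on top of the test stand

target = p.loadURDF("cube_small.urdf", basePosition=sample_object_position(stage=2))
```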
3) Reinforcement learning algorithm architecture construction
In this embodiment, the reinforcement learning tasks are trained with an Actor-Critic proximal policy optimization algorithm (Proximal Policy Optimization, PPO) as the concrete execution algorithm. PPO is a policy-gradient reinforcement learning algorithm that aims to maximize the cumulative reward by optimizing the policy. The Actor and the Critic in the PPO algorithm are updated iteratively to gradually improve the accuracy of the policy and of the value function. When updating the policy parameters, PPO uses the policy gradient and applies a clipping function to limit the update magnitude so that it does not become excessive; the objective function is:
L_CLIP(θ) = E_t[ min( r_t(θ)·A_t, clip(r_t(θ), 1−ε, 1+ε)·A_t ) ],  with r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
where π_θ is the current policy, π_θ_old is the old policy, A_t is the advantage function of the action, clip(·) is the clipping function and ε is the clipping amplitude.
When updating the value function, PPO uses the regression loss of the value function to update the value-function parameters; the objective function is:
L_V = E_t[ (V(s_t) − G_t)^2 ]
where V(s_t) is the predicted state value and G_t is the corresponding return target.
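A compact PyTorch sketch of the two objectives above is given below; the rollout tensors (log-probabilities, advantages, value predictions and return targets) are assumed to have been collected beforehand.

```python
# Sketch of the PPO objectives stated above (PyTorch). Tensors are assumed
# to be pre-collected rollout batches; eps is the clipping amplitude.
import torch

def ppo_losses(log_prob_new, log_prob_old, advantages, values_pred, returns, eps=0.2):
    ratio = torch.exp(log_prob_new - log_prob_old)           # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()      # maximize the clipped surrogate
    value_loss = torch.mean((values_pred - returns) ** 2)    # value-function regression loss
    return policy_loss, value_loss
```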
4) Input state definition
The input state quantity S_1 of the reinforcement learning network model comprises six components: the position P_o of the grabbed object in the robot three-dimensional coordinate system, the position P_g of the clamping jaw (i.e., the mechanical arm end effector) in the robot three-dimensional coordinate system, the clamping jaw action a_g, the contact condition C_og between the clamping jaw and the grabbed object, the time information t, and the mechanical arm joint information θ_j, satisfying S_1 = {P_o, P_g, a_g, C_og, t, θ_j}.
The constraints of each state are as follows:
X_o ∈ [0.45, 0.75], Y_o ∈ [−0.4, 0.4], Z_o ∈ [0.6, 0.9]
X_g ∈ [0.4, 0.8], Y_g ∈ [−0.45, 0.45], Z_g ∈ [0.5, 1]
a_gl ∈ {0, 0.04}, a_gr ∈ {0, 0.04}
C_og ∈ [0, 1]
t ∈ [0, 60]
θ_1 ∈ [−180°, 180°], θ_2 ∈ [−180°, 180°], θ_3 ∈ [−180°, 180°]
θ_4 ∈ [−180°, 180°], θ_5 ∈ [−180°, 180°], θ_6 ∈ [−180°, 180°]
where X_o, Y_o and Z_o are the positions of the grabbed object along the x, y and z axes of the robot three-dimensional coordinate system; X_g, Y_g and Z_g are the positions of the clamping jaw along the x, y and z axes of the robot three-dimensional coordinate system; a_gl and a_gr are the positions of the left and right fingers of the clamping jaw along the y-axis direction of the end-effector coordinate system, representing the open/closed state of the clamping jaw; C_og represents the distance relationship between the clamping jaw and the grabbed object; t is the time elapsed from the start of the current training round; and θ_1, θ_2, θ_3, θ_4, θ_5 and θ_6 are the position information of the six joint angles of the mechanical arm.
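As an illustration, the six components could be assembled into one flat observation vector as in the sketch below; the sim accessor methods are hypothetical wrappers around the simulator, and clipping to the ranges above is one possible way of enforcing the constraints.

```python
import numpy as np

def build_state(sim):
    # `sim` is a hypothetical wrapper exposing simulator readings.
    p_o = np.clip(sim.object_position(), [0.45, -0.4, 0.6], [0.75, 0.4, 0.9])    # P_o
    p_g = np.clip(sim.gripper_position(), [0.4, -0.45, 0.5], [0.8, 0.45, 1.0])   # P_g
    a_g = np.array([sim.left_finger_offset(), sim.right_finger_offset()])        # a_gl, a_gr in {0, 0.04}
    c_og = np.array([sim.contact_measure()])                                     # C_og in [0, 1]
    t = np.array([sim.episode_time()])                                           # t in [0, 60]
    theta = np.clip(sim.joint_angles_deg(), -180.0, 180.0)                       # six joint angles
    return np.concatenate([p_o, p_g, a_g, c_og, t, theta])   # S_1 = {P_o, P_g, a_g, C_og, t, theta_j}
```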
5) Neural network structure setting
In this embodiment, both the policy network and the value network are implemented as MLPs. The policy network uses a three-layer MLP structure with a hidden layer size of 64 and rectified linear unit (ReLU) activation functions. The value network adopts a three-layer MLP architecture with 128 hidden units and ReLU activation functions.
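A sketch of these two networks in PyTorch follows, assuming the state defined above is flattened into a 16-dimensional vector (3 + 3 + 2 + 1 + 1 + 6) and the action is the 4-dimensional vector a = (Δx, Δy, Δz, Δg) described later; the Gaussian action-distribution head used by PPO is omitted.

```python
import torch.nn as nn

# Three-layer MLPs with ReLU, as described above; the 16-dim input is the
# flattened state and the 4-dim policy output matches the action space below.
policy_net = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),
)
value_net = nn.Sequential(
    nn.Linear(16, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),       # scalar state value
)
```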
Super parameter selection
The super-parameter settings of the reinforcement learning training process of this embodiment are shown in table 1:
table 1 is a hyper-parameter table for reinforcement learning training process
6) Bonus function setting
In the training process of each stage of the reinforcement learning network model, the reward function of each stage contains the basic reward; the formulas are respectively:
R_1 = R_2 = r_base
R_3 = r_base + Σ_{j=1}^{j_max} ( r_limit,j + r_smooth,j )
R_4 = r_base'
R_5 = r_base''
where R_1, R_2, R_3, R_4 and R_5 are the reward function values of the first to fifth stages respectively, r_base is the first basic reward value, j is the joint angle index, r_limit,j is the joint angle limitation reward of joint j (a negative reward given when the joint angle θ_j exceeds its allowed range θ_j_limit), r_smooth,j is the smooth control reward of joint j (a negative reward given when the joint angular acceleration θ''_j exceeds its limit), θ_j is the joint angle of the robot, θ_j_limit is the allowed range of the robot joint angle, θ''_j is the joint angular acceleration of the robot, j_max is the maximum value of the joint angle index (i.e., the number of joint angles), r_base' is the second basic reward value, r_base'' is the third basic reward value, and the first adjustment coefficient and the second adjustment coefficient ω are the magnitudes of these two negative rewards respectively.
The grabbing tasks of the first and second stages are relatively simple to train, and both adopt the basic reward function r_base as the reward function of the stage. In the third training task stage, in order to prepare for the more complex training tasks of the fourth and fifth stages, the reward function design is used in advance to perform dynamics optimization of the robot motion, so that the robot can move in a stable and smooth manner during complex task training. On the one hand, a negative reward is designed for the joint limit problem: once a joint angle θ_j of the robot exceeds the joint limit θ_j_limit, a negative reward equal to minus the first adjustment coefficient is given, which guides the robot to move its joints in a suitable way and avoid exceeding the range the joints can bear. On the other hand, a negative reward is designed for the joint acceleration problem: once the joint angular acceleration θ''_j of the robot exceeds the joint angular acceleration limit, the negative reward −ω is given, which guides the robot to move smoothly and reduces unnecessary impacts and vibrations during operation. This reward design ensures that the trained agent runs in a stable and controllable manner, and enhances its adaptability when handling complex training tasks. In the fourth and fifth stages, for the training tasks of gradually increasing difficulty, the round time limit in the basic reward r_base is increased to improve fault tolerance in complex task training and to guide the mechanical arm to complete the grabbing task quickly.
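The two stage-3 penalty terms can be sketched as follows; phi is a stand-in name for the first adjustment coefficient, and the default values and the acceleration limit are illustrative assumptions (only ω = 0.5 is given in the embodiment below).

```python
# Sketch of the stage-3 dynamics penalties described above. `phi` stands in
# for the first adjustment coefficient; default values and accel_limit are
# illustrative assumptions, not values from the embodiment (except omega).
def dynamics_penalty(joint_angles, joint_accels, joint_limits, accel_limit,
                     phi=1.0, omega=0.5):
    penalty = 0.0
    for theta, (lo, hi) in zip(joint_angles, joint_limits):
        if theta < lo or theta > hi:
            penalty -= phi            # joint angle limitation reward (negative)
    for accel in joint_accels:
        if abs(accel) > accel_limit:
            penalty -= omega          # smooth control reward (negative)
    return penalty
```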
The formula of the basic reward is as follows:
r_base = r_reach + r_lift + r_t
r_reach = r_dog + r_go
r_lift = r_gc + r_cog + r_zo
where r_base is the first basic reward value, r_reach is the reward value of the approaching process, r_lift is the reward value of the lifting process, r_t is the time reward value, r_dog is the distance reward value, r_go is the clamping jaw opening reward value, r_dog_max is the maximum distance between the clamping jaw and the target object, p is the reward sensitivity parameter, r_gc is the clamping jaw closing reward value, r_cog is the reward value for contact between the clamping jaw and the object, r_zo is the target object lifting reward value, z_o is the real-time height of the object, z_o_initial is the height of the object at scene initialization, t_l is the first time limit, and α, β, γ, δ, ε are the third to seventh adjustment coefficients respectively;
the second basic reward value r_base' is calculated by replacing the first time limit t_l in the basic reward with the second time limit t_l4, i.e., the second time limit t_l4 corrects the first time limit t_l, where t_l4 satisfies t_l4 = ∈·t_l and ∈ is the eighth adjustment coefficient;
the third basic reward value r_base'' is calculated by replacing the first time limit t_l in the basic reward with the third time limit t_l5, i.e., the third time limit t_l5 corrects the first time limit t_l, where t_l5 satisfies t_l5 = μ·t_l, μ is the ninth adjustment coefficient, and ∈ < μ.
Here r_reach denotes the reward of the approaching process and r_lift the reward of the lifting phase; the two processes are distinguished by the distance d_og between the clamping jaw and the target object: when d_og ≥ d_ogl, where d_ogl is a distance threshold, the arm is in the approaching process, and otherwise it is in the lifting phase. r_t is the time reward, a negative (punitive) reward: if the mechanical arm has not completed the task within the time limit t_l, it obtains a negative reward, which encourages the robot to complete the task more quickly. The specific reward design is as follows:
during the approach, r dog The distance rewards are awarded, i.e. the rewards are given on the basis of the distance between the clamping jaw and the target object. Specifically, the distance rewards are designed to be based on distance r dog The smaller the distance, the higher the reward and the smaller the distance the faster the reward finger increases, the more effectively the reward function design will promote the jaw to move toward the target location. Wherein r is dog_max The maximum distance between the clamping jaw and the target object is the maximum distance, and the rewarding sensitivity parameter p can regulate and control the speed of rewarding growth; and r is go The reward for jaw opening encourages jaw opening by giving the jaw opening reward during the approximation process ready for subsequent grasping of the target object.
During the lifting process, r_gc is the clamping jaw closing reward, which encourages the jaw to close and grab the object during lifting; r_cog is the reward for contact between the clamping jaw and the object, given when both fingers of the clamping jaw are in contact with the target object at the same time, since simultaneous two-finger contact increases the probability that the jaw grabs the target object, and this design aims to improve the probability of a successful grab; r_zo is the target object lifting reward, given when the object has been lifted by a certain distance, which indicates that the object has been successfully grabbed and the lifting task is completed.
The time reward is a negative, i.e., punitive, reward: if the mechanical arm does not complete the task within the time limit t_l, it obtains the negative reward −ε; this design is intended to prompt the robot to complete the task faster.
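A heavily hedged sketch of the basic reward follows: the overall structure (the approach/lift split at d_ogl and the time penalty −ε) comes from the description above, but the specific distance-shaping curve, the use of α, β, γ, δ as bonus magnitudes and the 0.1 m lift distance are assumptions.

```python
# Sketch of the basic reward r_base = r_reach + r_lift + r_t. Only the
# structure and qualitative behaviour come from the text; the shaping curve,
# the role of alpha..delta as bonus magnitudes, and the lift distance are assumptions.
def base_reward(d_og, d_ogl, jaw_open, jaw_closed, both_fingers_contact,
                z_o, z_o_initial, t, t_l,
                r_dog_max=1.0, p=5.0,
                alpha=2.0, beta=1.0, gamma=1.0, delta=3.0, eps=0.5):
    if d_og >= d_ogl:                                                # approaching process
        r_dog = (1.0 - min(d_og, r_dog_max) / r_dog_max) ** p        # smaller distance -> higher, faster-growing reward
        r_go = alpha if jaw_open else 0.0                            # encourage opening the clamping jaw
        r_reach, r_lift = r_dog + r_go, 0.0
    else:                                                            # lifting phase
        r_gc = beta if jaw_closed else 0.0                           # encourage closing the clamping jaw
        r_cog = gamma if both_fingers_contact else 0.0               # both fingers touch the object
        r_zo = delta if (z_o - z_o_initial) > 0.1 else 0.0           # lift-success bonus (0.1 m is illustrative)
        r_reach, r_lift = 0.0, r_gc + r_cog + r_zo
    r_t = -eps if t > t_l else 0.0                                   # time penalty when the round exceeds t_l
    return r_reach + r_lift + r_t
```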
In the present embodiment, the adjustment coefficients are selected as ω = 0.5, α = 2, β = 1, γ = 1, δ = 3, ε = 0.5, ∈ = 1.5 and μ = 2; the first time limit is selected as t_l = 40 s; and the allowed ranges θ_j_limit of the robot joint angles are selected as follows:
θ_1_limit ∈ [−150°, 150°], θ_2_limit ∈ [−120°, 120°],
θ_3_limit ∈ [−120°, 120°], θ_4_limit ∈ [−120°, 120°],
θ_5_limit ∈ [0°, 250°], θ_6_limit ∈ [−180°, 180°].
7) Mechanical arm motion control mode
The manipulator motion is converted into position control in a Cartesian coordinate system by using Operation Space Control (OSC), and the action space a of the manipulator is as follows:
a = (Δx, Δy, Δz, Δg), Δx, Δy, Δz, Δg ∈ [−1, 1]
where Δx, Δy and Δz represent the offsets of the mechanical arm end effector along the x, y and z axes of the Cartesian coordinate system, and Δg is the clamping jaw opening state: Δg < 0 represents jaw closed and Δg > 0 represents jaw open.
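As a sketch of how this action could be applied in the PyBullet scene at each control step, the snippet below uses inverse-kinematics position control as an approximation of the operational space controller; the end-effector link index, arm and finger joint indices, step scale and finger positions are assumptions.

```python
import pybullet as p

ACTION_SCALE = 0.05   # metres per step for the Cartesian offsets (illustrative)

def apply_action(robot, action, ee_link=6, arm_joints=(0, 1, 2, 3, 4, 5),
                 finger_joints=(7, 8)):
    dx, dy, dz, dg = action                                  # each in [-1, 1]
    ee_pos = p.getLinkState(robot, ee_link)[0]               # current end-effector position
    target = [ee_pos[0] + ACTION_SCALE * dx,
              ee_pos[1] + ACTION_SCALE * dy,
              ee_pos[2] + ACTION_SCALE * dz]
    joint_targets = p.calculateInverseKinematics(robot, ee_link, target)
    # Assumes the first six movable joints returned by IK are the arm joints.
    p.setJointMotorControlArray(robot, list(arm_joints), p.POSITION_CONTROL,
                                targetPositions=joint_targets[:len(arm_joints)])
    finger_pos = 0.0 if dg < 0 else 0.04                     # dg < 0 closes, dg > 0 opens the jaw (assumed mapping)
    p.setJointMotorControlArray(robot, list(finger_joints), p.POSITION_CONTROL,
                                targetPositions=[finger_pos] * len(finger_joints))
```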
8) Model training
As shown in fig. 7, after the preparation work is completed, multi-stage reinforcement learning training with sequentially increasing difficulty is formally performed, and the reinforcement learning network model is iterated for a plurality of times, so that a dynamic object grabbing intelligent model is finally obtained.
In the first stage training task, the target object is held stationary on the test bed. In the reinforcement learning training process, the six-axis mechanical arm drives the clamping jaw to approach the target through joint movement and completes the grabbing task. The initialized reinforcement learning network model is used for executing a first-stage training task, and when rewards obtained in the training round number reach a threshold value and a rewards difference value obtained in each training round number is within the threshold value, the first-stage training task is regarded as being completed and the first-stage reinforcement learning network model is obtained.
In the second-stage training task, the target object is kept stationary on the test bed, its initial position is randomly generated, and the initial position generation area is limited to the target object random position set. In the reinforcement learning training process, the six-axis mechanical arm drives the clamping jaw to approach the target through joint movement and completes the grabbing task. The first-stage reinforcement learning network model executes the second-stage training task; when the reward obtained within a training round reaches the threshold and the difference between the rewards obtained in successive rounds is within the threshold, the second-stage training task is regarded as completed and the second-stage reinforcement learning network model is obtained.
In the third-stage training task, the target object is kept stationary on the test bed, its initial position is randomly generated, and the initial position generation area is limited to the target object random position set. In the reinforcement learning training process, the six-axis mechanical arm drives the clamping jaw to approach the target through joint movement and completes the grabbing task. In addition, since the third-stage training task aims at dynamics optimization of the mechanical arm motion process, its scene setting is the same as the second-stage training task environment. The reinforcement learning network model obtained in the previous stage executes the third-stage training task; when the reward obtained within a training round reaches the threshold and the difference between the rewards obtained in successive rounds is within the threshold, the third-stage training task is regarded as completed and the third-stage reinforcement learning network model is obtained.
In the fourth-stage training task, the target object moves on the test bed at a certain speed. In the reinforcement learning training process, the six-axis mechanical arm drives the clamping jaw to approach the target through joint movement and completes the grabbing task. The reinforcement learning network model obtained in the previous stage executes the fourth-stage training task; when the reward obtained within a training round reaches the threshold and the difference between the rewards obtained in successive rounds is within the threshold, the fourth-stage training task is regarded as completed and the fourth-stage reinforcement learning network model is obtained.
In the training task of the fifth stage, the target object has a certain moving speed on the test bed, and the initial position of the target object is randomly generated. In the reinforcement learning training process, the six-axis mechanical arm drives the clamping jaw to approach the target through joint movement and completes the grabbing task. When the reinforcement learning network model obtained in the previous stage executes the training task in the fifth stage, and when rewards obtained in the training round number reach a threshold value and the rewards obtained in each training round number are within the threshold value, the training task in the fifth stage is regarded as being completed and the final mechanical arm dynamic object grabbing intelligent model is obtained.
Dynamic object grabbing intelligent model effect
The threshold for the reward during training is set to 1000, and when the reward reaches the threshold and fluctuates around the threshold, this training phase is indicated as being completed. The method of the present invention is compared with training results of the direct training generation model from 0. Agent1 represents the model obtained by training of the present invention, and Agent0 represents the model generated by training directly from 0.
Comparing the experimental results, Agent1 completes all training (converging in the fifth training stage) within a total of no more than 4000 training rounds, while the reward curve of Agent0 is still far from convergence. Especially in the dynamic object grabbing training scenario, the difference between the reward curves of Agent0 and Agent1 is obvious: the Agent0 curve fluctuates strongly and has difficulty rising steadily, while Agent1 is relatively stable and gradually climbs to a higher level. The method therefore improves training efficiency and convergence speed and performs well on the long-horizon complex task.
In addition, observation of the joint velocity and joint acceleration curves during Agent1 training shows that the joint velocity curve is smooth and the joint acceleration has no abrupt changes; compared with the Agent0 curves this is a clear advantage. Joint vibration is significantly reduced, the stability of the robot in complex tasks is improved, and strong support is provided for migrating the intelligent model between the simulation environment and real industrial scenes.
The foregoing detailed description is provided to illustrate the present invention and not to limit the invention, and any modifications and changes made to the present invention within the spirit of the present invention and the scope of the appended claims fall within the scope of the present invention.

Claims (9)

1. The mechanical arm dynamic object grabbing method based on reinforcement learning is characterized by comprising the following steps of:
S1: constructing a multi-stage training task;
S2: training the reinforcement learning network model stage by stage according to the multi-stage training task until training is completed, to obtain a mechanical arm dynamic object grabbing intelligent model;
S3: inputting the target task into the mechanical arm dynamic object grabbing intelligent model to realize grabbing of the target object by the mechanical arm.
2. The method for capturing dynamic objects by using a mechanical arm based on reinforcement learning according to claim 1, wherein in S1, the multi-stage training task includes a first stage, a second stage, a third stage, a fourth stage and a fifth stage training task; the task difficulty of the training tasks of the first stage, the second stage, the third stage, the fourth stage and the fifth stage is sequentially increased.
3. The method for capturing dynamic objects by using a mechanical arm based on reinforcement learning according to claim 1, wherein, among the multi-stage training tasks, the first stage training task is a target grabbing task in which the grabbed object is static and its position is fixed;
the second stage training task is a target grabbing task of which the grabbed object is static and the position of the grabbed object is randomly generated;
the training task in the third stage is a target grabbing task of which the grabbed object is static and the position is randomly generated;
the training task in the fourth stage is a target grabbing task with a dynamic object to be grabbed and a fixed position;
the fifth stage training task is a target grabbing task of which the grabbed object is dynamic and the position of the grabbed object is randomly generated.
4. The method for capturing dynamic objects by using a mechanical arm based on reinforcement learning according to claim 1, wherein the step S2 is specifically:
and training the reinforcement learning network model according to the multi-stage training tasks in sequence, taking the reinforcement learning network model trained by the previous stage training task as an initial reinforcement learning network model of the next stage training task until all the stage training tasks are completed, and taking the final reinforcement learning network model as a mechanical arm dynamic object grabbing intelligent model.
5. The method for capturing dynamic objects by using a mechanical arm based on reinforcement learning according to claim 1, wherein during each stage of training of the reinforcement learning network model, when rewards obtained in the training round number reach a rewards threshold value, and the rewards obtained in each training round number are within the threshold value, the training task of the current stage is completed.
6. The method for capturing dynamic objects by using a mechanical arm based on reinforcement learning according to claim 1, wherein the input state quantity S_1 of the reinforcement learning network model comprises six components: the position P_o of the grabbed object in the robot three-dimensional coordinate system, the position P_g of the clamping jaw in the robot three-dimensional coordinate system, the clamping jaw action a_g, the contact condition C_og between the clamping jaw and the grabbed object, the time information t, and the mechanical arm joint information θ_j, satisfying S_1 = {P_o, P_g, a_g, C_og, t, θ_j}.
7. The method for capturing dynamic objects by using a mechanical arm based on reinforcement learning according to claim 1, wherein, in the training process of each stage of the reinforcement learning network model, the reward function of each stage contains the basic reward; the formulas are respectively:
R_1 = R_2 = r_base
R_3 = r_base + Σ_{j=1}^{j_max} ( r_limit,j + r_smooth,j )
R_4 = r_base'
R_5 = r_base''
where R_1, R_2, R_3, R_4 and R_5 are the reward function values of the first to fifth stages respectively, r_base is the first basic reward value, j is the joint angle index, r_limit,j is the joint angle limitation reward of joint j (a negative reward given when the joint angle θ_j exceeds its allowed range θ_j_limit), r_smooth,j is the smooth control reward of joint j (a negative reward given when the joint angular acceleration θ''_j exceeds its limit), θ_j is the joint angle of the robot, θ_j_limit is the allowed range of the robot joint angle, θ''_j is the joint angular acceleration of the robot, j_max is the maximum value of the joint angle index, r_base' is the second basic reward value, r_base'' is the third basic reward value, and the first adjustment coefficient and the second adjustment coefficient ω are the magnitudes of these two negative rewards respectively.
8. The method for capturing dynamic objects by using a mechanical arm based on reinforcement learning according to claim 7, wherein the formula of the basic reward is as follows:
r_base = r_reach + r_lift + r_t
r_reach = r_dog + r_go
r_lift = r_gc + r_cog + r_zo
where r_base is the first basic reward value, r_reach is the reward value of the approaching process, r_lift is the reward value of the lifting process, r_t is the time reward value, r_dog is the distance reward value, r_go is the clamping jaw opening reward value, r_dog_max is the maximum distance between the clamping jaw and the target object, p is the reward sensitivity parameter, r_gc is the clamping jaw closing reward value, r_cog is the reward value for contact between the two fingers of the clamping jaw and the object, r_zo is the target object lifting reward value, z_o is the real-time height of the object, z_o_initial is the height of the object at scene initialization, t_l is the first time limit, and α, β, γ, δ, ε are the third to seventh adjustment coefficients respectively;
the second basic reward value r_base' is calculated by replacing the first time limit t_l in the basic reward with the second time limit t_l4, where t_l4 satisfies t_l4 = ∈·t_l and ∈ is the eighth adjustment coefficient;
the third basic reward value r_base'' is calculated by replacing the first time limit t_l in the basic reward with the third time limit t_l5, where t_l5 satisfies t_l5 = μ·t_l, μ is the ninth adjustment coefficient, and ∈ < μ.
9. The method for capturing dynamic objects by using a mechanical arm based on reinforcement learning according to claim 1, wherein the reinforcement learning network model adopts an Actor-Critic proximal policy optimization (PPO) algorithm.
CN202310991331.5A 2023-08-08 2023-08-08 Mechanical arm dynamic object grabbing method based on reinforcement learning Pending CN116945180A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310991331.5A CN116945180A (en) 2023-08-08 2023-08-08 Mechanical arm dynamic object grabbing method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310991331.5A CN116945180A (en) 2023-08-08 2023-08-08 Mechanical arm dynamic object grabbing method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN116945180A true CN116945180A (en) 2023-10-27

Family

ID=88458323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310991331.5A Pending CN116945180A (en) 2023-08-08 2023-08-08 Mechanical arm dynamic object grabbing method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN116945180A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117766099A (en) * 2024-02-21 2024-03-26 北京万物成理科技有限公司 Training task providing method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
Li et al. Robot skill acquisition in assembly process using deep reinforcement learning
CN110238839B (en) Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction
CN116945180A (en) Mechanical arm dynamic object grabbing method based on reinforcement learning
CN109397285B (en) Assembly method, assembly device and assembly equipment
Laezza et al. Learning shape control of elastoplastic deformable linear objects
Li et al. Motion planning of six-dof arm robot based on improved DDPG algorithm
CN107414825A (en) Industrial robot smoothly captures the motion planning system and method for mobile object
CN115990891A (en) Robot reinforcement learning assembly method based on visual teaching and virtual-actual migration
Kadalagere Sampath et al. Review on human‐like robot manipulation using dexterous hands
Jin et al. Robot skill generalization: Feature-selected adaptation transfer for peg-in-hole assembly
Gielniak et al. Stylized motion generalization through adaptation of velocity profiles
CN112959330A (en) Robot double-arm motion man-machine corresponding device and method based on master-slave dynamic motion elements
Dargazany DRL: Deep Reinforcement Learning for Intelligent Robot Control--Concept, Literature, and Future
CN115042185A (en) Mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning
Nugroho et al. Handling Four DOF Robot to Move Objects Based on Color and Weight using Fuzzy Logic Control
CN113967909B (en) Direction rewarding-based intelligent control method for mechanical arm
Qi et al. Reinforcement learning control for robot arm grasping based on improved DDPG
Beltran-Hernandez et al. Learning to grasp with primitive shaped object policies
Tu et al. Moving object flexible grasping based on deep reinforcement learning
CN111015676B (en) Grabbing learning control method, system, robot and medium based on hand-free eye calibration
CN110497404B (en) Bionic intelligent decision making system of robot
CN116330290B (en) Multi-agent deep reinforcement learning-based five-finger smart robot control method
Liu et al. Industrial insert robotic assembly based on model-based meta-reinforcement learning
Varedi-Koulaei et al. Trajectory tracking solution of a robotic arm based on optimized ANN
Liu et al. Reactive execution of learned tasks with real-time collision avoidance in a dynamic environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination