CN114800488B - Redundant mechanical arm operability optimization method and device based on deep reinforcement learning - Google Patents


Info

Publication number
CN114800488B
CN114800488B (application CN202210272600.8A)
Authority
CN
China
Prior art keywords
operability
mechanical arm
reinforcement learning
redundant
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210272600.8A
Other languages
Chinese (zh)
Other versions
CN114800488A (en)
Inventor
梁斌
王学谦
杨皓强
孟得山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202210272600.8A
Publication of CN114800488A
Application granted
Publication of CN114800488B
Legal status: Active
Anticipated expiration

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1656 Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/1643 Programme controls characterised by the control loop redundant control
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/1651 Programme controls characterised by the control loop acceleration, rate control
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a redundant mechanical arm operability optimization method based on deep reinforcement learning. The method comprises: completing approach training of the redundant mechanical arm toward random targets under a fixed reset mechanism using a reinforcement learning method; continuing the approach training toward random targets under a random reset mechanism using a reinforcement learning method, where "random reset" means placing the mechanical arm in a random state; adding an "operability" term to the reward function, increasing the coefficient of the operability term, and completing the optimization of the operability of the redundant mechanical arm using the reinforcement learning method again; and controlling the redundant mechanical arm with the optimized algorithm. According to the invention, the mechanical arm is trained for the first time with a reinforcement learning method that includes an operability reward, so that the arm acquires the ability to automatically optimize its operability while retaining end-trajectory tracking capability; the method has good generality and can be used to train robots with various complex structures.

Description

Redundant mechanical arm operability optimization method and device based on deep reinforcement learning
Technical Field
The invention relates to the technical field of redundant mechanical arm control, in particular to a redundant mechanical arm operability optimization method and device based on deep reinforcement learning.
Background
Redundant mechanical arms have extra spatial degrees of freedom, which gives them great advantages in obstacle avoidance and motion planning and has made them a hot topic in robotics research. However, an important difficulty in redundant-arm control is the singularity problem in motion planning. Although a redundant arm is highly flexible, singular arm configurations can still be encountered in practice: when the arm approaches a singular state, a small displacement of the end effector can cause severe joint jitter, leading to joint damage and sensor failure. To address this problem, many researchers optimize an operational performance index of the robot (such as operability) during motion planning to ensure the dexterity of the robot's motion, so that the robot stays as far from singular states as possible during movement.
In dexterous robot control, the usual practice is based on conventional control methods: when planning a path, the gradient of the operability w with respect to the joint angle q, ∂w/∂q, is added in the joint null space so that the arm configuration moves toward higher operability during planning. However, this processing brings complex matrix differentiation and matrix inversion operations and is inconvenient for real-time computation. Reinforcement learning is a branch of machine learning that studies how an agent can learn an execution strategy so as to obtain the largest reward from its environment. For example, Chinese patent CN201710042360.1 proposes a motion planning method for optimizing the operability of a redundant manipulator, which comprises: setting an optimized motion performance index that maximizes the operability derivative of the redundant manipulator, together with the constraint relation corresponding to that performance index; converting the performance index and its constraints into a quadratic programming problem; solving the quadratic program with a quadratic programming solver; and controlling the manipulator to move according to the solution. However, that patent suffers from several drawbacks: a) its operability optimization is based on conventional Jacobian-matrix optimization and requires many iterative computations, which adds great time complexity to the trajectory planning process and makes the computation slow; b) the operability optimization requires mathematical transformations tailored to the structure of each robot, and the formulas are complex, so the method is hard to extend to robots with more complex structures.
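For reference, the null-space gradient scheme mentioned at the start of this section is commonly written in resolved-rate form (the gain k and the pseudoinverse notation are supplied here for illustration and are not taken from the patent):

    q̇ = J⁺(q)·ẋ_d + k·(I − J⁺(q)J(q))·∂w/∂q,

where J⁺ is the Moore-Penrose pseudoinverse of the velocity Jacobian and ẋ_d is the desired end velocity. Evaluating J⁺ and ∂w/∂q at every control step is precisely the matrix differentiation and inversion burden that motivates the learning-based approach of the invention.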
Disclosure of Invention
The invention aims to solve the technical problems of the prior art, namely poor real-time performance, slow computation, and complex structure-specific mathematical transformations when optimizing operability during trajectory planning, and provides a redundant mechanical arm operability optimization method and device based on deep reinforcement learning.
The invention provides a redundant mechanical arm operability optimization method based on deep reinforcement learning, comprising the following steps:
S1, completing the approach training of the redundant mechanical arm toward random targets using a reinforcement learning method under a fixed reset mechanism;
S2, continuing the approach training of the redundant mechanical arm toward random targets using a reinforcement learning method under a random reset mechanism, where "random reset" means placing the mechanical arm in a random state;
S3, adding an "operability" term to the reward function, increasing the coefficient of the operability term, and completing the optimization of the operability of the redundant mechanical arm using the reinforcement learning method again;
and S4, controlling the redundant mechanical arm with the optimized algorithm.
In some embodiments, the fixed reset in step S1 is that the mechanical arm is in a horizontally straightened state.
In some embodiments, in step S3, the algorithm is allowed to converge normally by adjusting the coefficients of the "operability" term.
In some embodiments, the random target approach task under the fixed reset mechanism of the redundant robotic arm is accomplished using a TD3 algorithm in reinforcement learning.
In some embodiments, in step S1, the arm is in a horizontal straightening state at the beginning of each round, and then the end of the arm reaches a randomly set target point, and is fixedly reset to the horizontal straightening state after each round is completed.
In some embodiments, the value ranges of the input state and the output action are symmetrically processed, so that the symmetrical distribution characteristics of the input state and the output action are guaranteed.
In some embodiments, the reward is set as the negative of the Euclidean distance between the arm end position and the target point.
In some embodiments, the discount factor γ is taken to be 0 to eliminate the interference of the next action value Q (s, a).
In some embodiments, the value of k_{w1} is chosen so that k_{w1}/w_{t+1} and d_{t+1} are of a similar order of magnitude, so that both the end-approach task and the operability-increase task are taken into account during training, where k_{w1} is an adjustable hyperparameter, d_{t+1} is the Euclidean distance between the arm end position and the target point, and the subscripts t and t+1 denote the state variables at times t and t+1.
The invention also provides a redundant mechanical arm control device, comprising at least one memory and at least one processor;
the memory includes at least one executable program stored therein;
the executable program, when executed by the processor, implements the method.
According to the redundant mechanical arm operability optimization method based on deep reinforcement learning of the invention, the mechanical arm is, for the first time, trained with a reinforcement learning method that includes an operability reward: an operability index is added to the reward function of the reinforcement learning method, so that the trained arm automatically increases its operability while its end follows the desired trajectory. No complex kinematic solving or iterative computation is needed and the real-time performance is higher, which solves the poor real-time performance of conventional methods; the arm acquires the ability to automatically optimize its operability while retaining end-trajectory tracking capability, and the method has good generality and can be used to train robots with various complex structures.
In addition, according to the redundant mechanical arm operability optimization method based on deep reinforcement learning of the invention, through step-by-step optimization from easy to difficult, the "operability" term is gradually added to the reward function and its coefficient is gradually increased, which allows the training to converge.
Drawings
FIG. 1 is a schematic flow chart of a redundant manipulator operability optimization method based on deep reinforcement learning according to an embodiment of the present invention;
FIG. 2 is a diagram of a 6-joint 12-degree-of-freedom super-redundant manipulator in a mujoco simulation engine according to an embodiment of the present invention;
FIG. 3 is a graph of success rate versus training round during evaluation for different γ under the fixed reset mechanism in an embodiment of the present invention;
FIG. 4 is a graph of return versus training round during evaluation for different γ under the fixed reset mechanism in an embodiment of the present invention;
FIG. 5 is a graph of success rate versus training round during evaluation for different γ under the random reset mechanism in an embodiment of the present invention;
FIG. 6 is a graph of return versus training round during evaluation for different γ under the random reset mechanism in an embodiment of the present invention;
FIG. 7 is a graph of success rate versus training round during evaluation for different k_{w1} in an embodiment of the present invention;
FIG. 8 is a graph of success rate versus training round during evaluation for different k_{w1} in an embodiment of the present invention;
FIG. 9 is a graph of success rate versus training round during evaluation for different k_{w1} in an embodiment of the present invention;
FIG. 10 is a graph of operability versus the number of mechanical arm motion steps during circular trajectory tracking in evaluation, for different k_{w1}, in an embodiment of the present invention;
FIG. 11 is a graph of operability versus the number of mechanical arm motion steps during straight-line trajectory tracking in evaluation, for different k_{w1};
FIG. 12 is a graph of operability versus the number of mechanical arm motion steps during evaluation on the mixed line-segment-and-circle trajectory, including the k_{w1} = 0 curve, in an embodiment of the present invention;
FIG. 13 is a graph of operability versus the number of mechanical arm motion steps during evaluation on the mixed line-segment-and-circle trajectory, with the k_{w1} = 0 curve removed, in an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Fig. 1 is a schematic flow chart of the redundant manipulator operability optimization method based on deep reinforcement learning provided by an embodiment of the invention, which includes the following steps:
S1, completing the approach training of the redundant mechanical arm toward random targets using a reinforcement learning method under a fixed reset mechanism;
S2, continuing the approach training of the redundant mechanical arm toward random targets using a reinforcement learning method under a random reset mechanism, where "random reset" means placing the mechanical arm in a random state;
S3, adding an "operability" term to the reward function, increasing the coefficient of the operability term, and completing the optimization of the operability of the redundant mechanical arm using the reinforcement learning method again;
and S4, controlling the redundant mechanical arm with the optimized algorithm.
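A minimal sketch of this staged training schedule (steps S1 to S3) is given below, assuming a Gym-style environment wrapper and a TD3 agent; the `env` and `agent` objects and their methods are hypothetical placeholders, not an API defined by the patent.

```python
def train_stage(env, agent, episodes):
    """Generic TD3-style training loop for one curriculum stage (sketch).

    `env` is assumed to follow a Gym-like reset/step interface and `agent`
    to expose act/remember/update methods; both are hypothetical here.
    """
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            act = agent.act(obs, explore=True)        # actor output plus exploration noise
            next_obs, rew, done, info = env.step(act)
            agent.remember(obs, act, rew, next_obs, done)
            agent.update()                            # TD3 critic/actor updates
            obs = next_obs

# Stage S1: fixed reset (horizontally straightened arm each round), reward = -distance.
# Stage S2: switch the environment to random initial joint angles, keep the same reward.
# Stage S3: switch the reward to -d - k_w1 / w (equation (3-10) below) and keep training,
#           so the policy also learns to raise the operability w while tracking the end target.
```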
In one embodiment of the invention, the TD3 algorithm in reinforcement learning is used to complete the random-target approach task of a 12-degree-of-freedom mechanical arm under the fixed reset mechanism (experimental hardware: a computer running a Linux system). Specifically, the arm is in a horizontally straightened state at the beginning of each round, the arm end then moves to reach a randomly placed target point, and the arm is reset to the horizontally straightened state after the round ends. This task is the basis for the subsequent random-reset task and the end-trajectory tracking task (i.e., no reset at all).
To reflect the super-redundancy of the arm, the 12-degree-of-freedom arm considers only the position of the arm end and not its orientation, so the 12 control inputs are super-redundant with respect to the 3-degree-of-freedom end-position information. It is worth mentioning that the concept of the invention can readily be extended to include end-orientation information.
The mechanical arm of the invention, shown in Fig. 2, has 6 joints, each with pitch and yaw degrees of freedom, for 12 degrees of freedom in total. Each arm link is 0.09 m long, and each joint and the end effector are represented by small spheres of diameter 0.01 m, so the whole arm is 0.7 m long. Judging from the actual robot conditions, the environment is fully observable and the state transitions satisfy the Markov property, so the motion process of the arm can be regarded as a Markov decision process. The Markov decision process is represented by the six-tuple (S, A, R, P, ρ_0, γ), where S is the state space, A is the action space, R is the reward space, P is the state-transition probability space, ρ_0 is the initial state distribution, and γ is the discount factor.
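As a data-structure sketch, the six-tuple can be written out directly; the field names below are illustrative and not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MDPSpec:
    """Six-tuple (S, A, R, P, rho_0, gamma) of the arm's Markov decision process (sketch)."""
    state_space: object           # S: joint angles/velocities, end position/velocity, target
    action_space: object          # A: actuator commands in the mujoco simulation
    reward_fn: Callable           # R: e.g. negative end-to-target distance, equation (3-5)
    transition_fn: Callable       # P: deterministic, given by the forward kinematics f(s, a)
    initial_state_dist: Callable  # rho_0: fixed or random reset plus a random target point
    gamma: float                  # discount factor (set to 0 in the experiments below)
```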
For convenience in the following description, the state space of the mechanical arm is denoted S^m; it comprises the arm joint angles, the arm joint angular velocities, the arm end position, and the arm end linear velocity. The state space of the target-point position is denoted S^g. Since the task of this section is the approach of random target points, it follows from the universal value function approximator method that the target-point information must be introduced as part of the state to help the reinforcement learning algorithm converge; that is, the state space S is the concatenation of the two parts. The input state s of the "actor" network and the "critic" network, shown in Table 1, consists of five parts: arm joint angles, arm joint angular velocities, arm end position, arm end linear velocity, and target coordinates. The action a is the value applied to the actuators in the mujoco simulation engine. Simple tests showed little difference between setting the joint drive mode in mujoco to velocity mode or position mode; in neither mode does the joint velocity or joint angle directly equal the commanded value, since both are regulated by PID control.
To help the neural networks converge, the value ranges of the input state and the output action are processed symmetrically, so that both take values with a symmetric distribution over [-X, X]. Because TD3 is a model-free algorithm, this training procedure can be extended to arms with more degrees of freedom.
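A sketch of assembling the five-part network input and scaling it to a symmetric range, under the assumption that each component is a numpy array with a known bound; the bounds and the function name are illustrative.

```python
import numpy as np

def build_observation(q, dq, ee_pos, ee_vel, target,
                      q_max, dq_max, pos_max, vel_max):
    """Concatenate the five state components and scale each to the symmetric range [-1, 1]."""
    obs = np.concatenate([
        q / q_max,          # joint angles
        dq / dq_max,        # joint angular velocities
        ee_pos / pos_max,   # end-effector position
        ee_vel / vel_max,   # end-effector linear velocity
        target / pos_max,   # target-point coordinates
    ])
    return np.clip(obs, -1.0, 1.0)
```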
Because the kinematics of the mechanical arm are fully determined, the state-transition probability P is also fully determined; its value p satisfies equation (3-4), where f(·) denotes the forward kinematics of the arm, Pr[·] denotes probability, S_t denotes the state variable at time t with value s, and S_{t+1} denotes the state variable at time t+1 with value s':

p(s' | s, a) = Pr[S_{t+1} = s' | S_t = s, A_t = a] = 1 if s' = f(s, a), and 0 otherwise.    (3-4)
The most important element of a reinforcement learning algorithm is the design of the reward R: a correct reward guides the agent to converge to the intended strategy. In general R_{t+1} is related to S_t, A_t, and S_{t+1}, but according to equation (3-4), S_{t+1} is uniquely determined by S_t and A_t. For simplicity, therefore, the reward R_{t+1} is set to the negative of the Euclidean distance d_{t+1} between the arm end position e_{t+1} in S_{t+1} and the target point g, satisfying equation (3-5); the notation R(S_t, A_t) indicates that the variable R_{t+1} is related only to S_t and A_t and not additionally to S_{t+1}:

R_{t+1} = R(S_t, A_t) = -d_{t+1}.    (3-5)

In this way the reward not only directly expresses the purpose of the task and correctly guides the agent's learning, but also has a sufficiently simple form.
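Equation (3-5) translates directly into code; a minimal sketch assuming numpy arrays for the end position and the target point:

```python
import numpy as np

def reward_distance(ee_pos_next, target):
    """Reward (3-5): the negative Euclidean distance between the arm end and the target."""
    d = np.linalg.norm(np.asarray(ee_pos_next) - np.asarray(target))
    return -d
```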
Because the state space S consists of the two parts S^m and S^g, the initial state distribution ρ_0 is also described in two parts. In this section the arm is reset in a fixed manner: at the beginning of each round the joint angular velocities and joint angles of the arm are all 0 (the end position and linear velocity are then determined by the joint angles and angular velocities), and this arm state is denoted s_0^m. The target-point position g is selected at random within the workspace. The initial state distribution ρ_0 therefore satisfies equation (3-6), where |S^g| is the number of all points in the target-point position space S^g and Pr[S_0 = s] denotes the probability that the initial state variable S_0 takes the value s:

ρ_0(s) = Pr[S_0 = s] = 1/|S^g| if s = (s_0^m, g) with g ∈ S^g, and 0 otherwise.    (3-6)
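A sketch of sampling from ρ_0 under the fixed reset mechanism: the arm always starts from the all-zero joint configuration and only the target point is drawn at random. The workspace bounds used here are illustrative placeholders, not values from the patent.

```python
import numpy as np

def reset_fixed(n_joints=12, rng=None):
    """Fixed reset: zero joint angles and velocities; target drawn uniformly from the workspace."""
    rng = np.random.default_rng() if rng is None else rng
    q0 = np.zeros(n_joints)           # horizontally straightened arm
    dq0 = np.zeros(n_joints)
    # Illustrative box around the reachable region in front of the arm.
    target = rng.uniform(low=[0.3, -0.3, -0.3], high=[0.7, 0.3, 0.3])
    return q0, dq0, target
```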
The discount factor γ ∈ [0, 1]. In the TD3 algorithm this parameter enters through the update of the "critic" network and expresses how much weight is given to the next action value Q(s', a'): the larger γ is, the more the next step is emphasized, as reflected in the critic's target equation.
A round ends when d_{t+1} ≤ d_threshold = 0.02 or when the number of arm motion steps reaches 100.
Table 1. Input states of the "actor" network and the "critic" network.
a. The actuators in mujoco each take a single control input; if velocity mode is set, the actuator does not reach the commanded velocity directly but is regulated by PID control, so a certain dwell time is required.
b. This value is the output of the "actor" network.
c. Adding this quantity to the state variable was tried, but it was found to have no significant effect on the improvement.
Table 2. Hyperparameters of the "actor" network and the "critic" network.
The discount factor γ is an important hyperparameter that affects reinforcement learning training. This section studies the effect of the value of γ on training under different random seeds. As shown in Figs. 3 and 4, each point has the following meaning: an evaluation is performed every 40 rounds during 12000 rounds of training, and the value of each point is the mean success rate and the mean return over the last 10 evaluations; the solid line is the mean of the results under 3 different random seeds, and the shaded area is the resulting 95% confidence interval. As can be seen from Figs. 3 and 4, the larger the γ value, the worse the effect, so γ is best set to 0. The reason is that with the reward set in the form of equation (3-5), the action value Q(s, a) already describes well the value of the current action A_t for the current state S_t; considering in addition the influence of the next-step action value Q(s', a') on top of this reward only adds interference and hinders convergence of the "critic" network.
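The role of γ in the critic update can be seen from the standard TD3 target below; this is a generic sketch, not code from the patent, showing that with γ = 0 the target collapses to the immediate reward, consistent with the analysis above.

```python
import numpy as np

def td3_target(reward, done, next_q1, next_q2, gamma):
    """Clipped double-Q target: y = r + gamma * (1 - done) * min(Q1', Q2')."""
    return reward + gamma * (1.0 - done) * np.minimum(next_q1, next_q2)

# With gamma = 0 the bootstrapped term vanishes and the target equals the reward itself.
y = td3_target(reward=-0.15, done=0.0, next_q1=-1.2, next_q2=-1.1, gamma=0.0)
print(y)  # -0.15
```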
Random target approach task under random reset mechanism
The previous section showed that the TD3 algorithm converges the arm well to a target strategy and completes the random-target approach task under the fixed reset mechanism. This section further randomizes the initial state: on the basis of equation (3-6), the initial joint angles of the arm are also randomized, while the joint angular velocities remain 0. The initial state distribution under the random reset mechanism satisfies equation (3-7), where |S^m| is the size of the state space of the mechanical arm:

ρ_0(s) = Pr[S_0 = s] = 1/(|S^m|·|S^g|) if s = (s^m, g) with s^m ∈ S^m and g ∈ S^g, and 0 otherwise.    (3-7)
The hyperparameter settings are exactly the same as in Table 2. Under the random reset mechanism the arm is trained for 20000 rounds, again with an evaluation every 40 rounds; the mean success-rate and return curves are shown in Figs. 5 and 6. Comparing Figs. 3-4 with Figs. 5-6 shows that random reset is harder to converge than fixed reset, and, likewise, the smaller γ is, the better the convergence.
Operability optimized end trajectory tracking
Operability is the most commonly used index in robotics for describing robot performance; it generally represents the dexterity of the robot, and the greater the operability, the more dexterous the robot. Specifically, the operability w is defined from the velocity Jacobian matrix J(θ) of the robot and is calculated by equation (3-8), where σ_i are the singular values of J(θ):

w = σ_1 σ_2 ⋯ σ_m.    (3-8)
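Equation (3-8) can be evaluated numerically from any velocity Jacobian; the sketch below (assuming numpy and an illustrative random Jacobian) uses the fact that the product of singular values equals sqrt(det(J Jᵀ)) when J has full row rank.

```python
import numpy as np

def operability(J):
    """Operability w: product of the singular values of the velocity Jacobian J."""
    sigma = np.linalg.svd(J, compute_uv=False)
    return float(np.prod(sigma))

# Example: an illustrative random 3x12 velocity Jacobian for the 12-DOF arm.
J = np.random.default_rng(0).normal(size=(3, 12))
print(operability(J), np.sqrt(np.linalg.det(J @ J.T)))  # the two values agree
```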
Because the smaller the operability, the closer the arm is to a singular state, many researchers at home and abroad optimize the arm's operability during motion planning in order to avoid singular states and guarantee dexterity throughout the motion. The problems generally encountered with conventional control methods and neural-network solving methods are poor real-time performance, complicated solving, inability to transfer to other types of arms, and poor generality; it is therefore worthwhile to train with a reinforcement learning method so that the arm can automatically optimize its operability during motion.
The velocity Jacobian matrix of the arm can be derived from the arm's DH parameters, and from it the operability expression; the derivation is not detailed in this section. The operability of the arm varies with time and is denoted w_t. This section focuses on how to add the operability to the reward function for reinforcement learning training.
For the 12-degree-of-freedom arm of the invention, the distance d_t is on the order of 10^-2 to 10^-1, and the operability w_t is generally on the order of 10^-2. Putting the operability into the reward function must satisfy two requirements at once: first, the arm should learn a strategy that makes the operability as large as possible; second, the main end-approach task must not be drowned out. Equation (3-9) satisfies the first requirement (the greater the operability, the greater the reward) but not the second, because the sign in front of the operability term is positive: this leads the arm to keep adjusting its configuration near the target point to harvest positive reward instead of completing the target-approach task:

R_{t+1} = -d_{t+1} + w_{t+1}.    (3-9)
Combining the two requirements, many suitable rewards can be designed. Equation (3-10) is one possible reward, where k_{w1} is an adjustable hyperparameter:

R_{t+1} = -d_{t+1} - k_{w1}/w_{t+1}.    (3-10)

A value of k_{w1} on the order of 10^-4 to 10^-3 works better: on the one hand k_{w1}/w_{t+1} does not exceed d_{t+1}, which guarantees that the learned strategy can still complete the end-approach task; on the other hand k_{w1}/w_{t+1} is not so small as to be ignored during training. Figs. 7-9 show the effect of different k_{w1} on the success rate of the approach task, with the remaining hyperparameters the same as in Table 2 and γ = 0. It can be seen that the end-approach task works well when k_{w1} is in the range 10^-4 to 10^-1, but when k_{w1} exceeds 10^-1 the end-approach task is no longer easily accomplished, because the order of magnitude of the operability term then far exceeds that of the Euclidean distance, and the reinforcement learning algorithm treats the operability-adjustment task as more important than the end-approach task.
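Equation (3-10) as a reward function; k_w1 is the adjustable hyperparameter discussed above, and the small epsilon guard is an implementation detail added here, not part of the patent formula.

```python
import numpy as np

def reward_with_operability(ee_pos_next, target, w_next, k_w1=1e-3, eps=1e-8):
    """Reward (3-10): R = -d - k_w1 / w, so larger operability w means a smaller penalty."""
    d = np.linalg.norm(np.asarray(ee_pos_next) - np.asarray(target))
    return -d - k_w1 / (w_next + eps)
```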
Next, after 20000 rounds of training on the same random seed (which ensures that the random target points generated during training are identical), the end-trajectory tracking performance of arms trained with different k_{w1} is compared. Because the algorithm is difficult to converge correctly when k_{w1} exceeds 10^-1, training and testing with the TD3 algorithm are limited to k_{w1} ∈ [0, 10^-1], from which 5 values are selected: k_{w1} = 0, 10^-4, 10^-3, 10^-2, 10^-1. In the tests, the initial state of the arm is the horizontally straightened state, and the task is to track the following three different paths:
1. Circle. The arm end is required to track a circular trajectory centered at (0.6, 0, 0) with radius 0.1.
2. Line segment. The arm end is required to track a line-segment trajectory from start point (0.55, -0.1, 0) to end point (0.65, 0.2, 0).
3. Line segment + circle. The arm end is required first to track a line-segment trajectory from start point (0.8, 0, 0) to end point (0.7, 0, 0), and then to track a circular trajectory centered at (0.6, 0, 0) with radius 0.1.
FIGS. 10-11 and 12-13 show the trajectory-tracking test results after training the arm with the TD3 algorithm; k_{w1} = 0 means that no operability term is added to the reward, and k_{w1} ≠ 0 means that the operability term is added. Plotting the operability curve during motion shows that the operability of the arm trained with the operability term is clearly higher during motion than that of the arm trained without it, so the arm indeed becomes more dexterous and stays away from singular states.
Comparing the performance of arms trained with the five different rewards on the three path-tracking tasks yields the following three observations:
1. The arm trained with the k_{w1} = 0 reward may need more time steps to complete a given task, and sometimes cannot complete it at all; for example, it cannot complete task three.
2. The arm trained with k_{w1} = 0 generally has lower operability values during motion than the other arms trained with the operability reward; in particular, its final operability is half that of the other arms.
3. Among all arms trained with the operability reward, k_{w1} = 10^-3 performs best: it not only maximizes the operability but also minimizes the number of time steps required for the motion.
The conclusion from these three observations is that adding the operability reward enables the TD3 algorithm to train the arm better for the end-trajectory tracking task: it improves the operability (representing dexterity) during arm motion and shortens the control step count of the motion. k_{w1} = 10^-3 performs best because k_{w1}/w_{t+1} and d_{t+1} are then of similar order of magnitude, so both the end-approach task and the operability-increase task can be taken into account during training.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (10)

1. A redundant mechanical arm operability optimization method based on deep reinforcement learning, characterized by comprising the following steps:
S1, completing the approach training of the redundant mechanical arm toward random targets using a reinforcement learning method under a fixed reset mechanism;
S2, continuing the approach training of the redundant mechanical arm toward random targets using a reinforcement learning method under a random reset mechanism, wherein "random reset" means placing the mechanical arm in a random state;
S3, adding an "operability" term to the reward function, increasing the coefficient of the operability term, and completing the optimization of the operability of the redundant mechanical arm using the reinforcement learning method again;
wherein the operability is added to the reward function for reinforcement-learning training, and equation (3-10) is one possible reward:
R_{t+1} = -d_{t+1} - k_{w1}/w_{t+1}    (3-10)
wherein R_{t+1} is the reward, d_{t+1} is the Euclidean distance between the arm end position and the target point, k_{w1} is an adjustable hyperparameter, and w_{t+1} is the operability;
and S4, controlling the redundant mechanical arm with the optimized algorithm.
2. The method for optimizing the operability of a redundant manipulator based on deep reinforcement learning of claim 1, wherein said fixed reset in step S1 is a horizontal straightening state of the manipulator.
3. The redundant manipulator operability optimization method based on deep reinforcement learning according to claim 1, wherein in step S3, the algorithm is allowed to converge normally by adjusting the coefficient of the "operability" term; wherein training and testing using the TD3 algorithm are limited to k_{w1} ∈ [0, 10^-1].
4. The method for optimizing the operability of the redundant manipulator based on deep reinforcement learning according to claim 1, wherein a TD3 algorithm in reinforcement learning is used to complete the random-target approach task under the fixed reset mechanism of the redundant manipulator; the motion process of the manipulator can be regarded as a Markov decision process; and the Markov decision process can be represented by the six-tuple (S, A, R, P, ρ_0, γ), where S is the state space, A is the action space, R is the reward space, P is the state-transition probability space, ρ_0 is the initial state distribution, and γ is the discount factor.
5. The method for optimizing the operability of a redundant manipulator based on deep reinforcement learning according to claim 1, wherein in the step S1, the manipulator is in a horizontal straightening state at the beginning of each round, and then the manipulator end reaches a randomly set target point, and is fixedly reset to the horizontal straightening state after the end of each round.
6. The method for optimizing the operability of a redundant manipulator based on deep reinforcement learning of claim 4, wherein the value ranges of the input state and the output action are symmetrically processed, and the symmetric distribution characteristics of the input state and the output action are guaranteed.
7. The method for optimizing the operability of a redundant manipulator based on deep reinforcement learning of claim 4, wherein the reward is set to the negative of the Euclidean distance between the manipulator end position and the target point.
8. The redundant manipulator operability optimization method based on deep reinforcement learning according to claim 4, wherein the discount factor γ is set to 0 to eliminate the interference of the next action value Q(s, a), where the next action value is the value of taking the action A_{t+1} at the next moment in the next state S_{t+1}.
9. The redundant manipulator operability optimization method based on deep reinforcement learning of claim 1, wherein the value of k_{w1} is chosen so that k_{w1}/w_{t+1} and d_{t+1} are of a similar order of magnitude, so that both the end-approach task and the operability-increase task are taken into account during training, where k_{w1} is an adjustable hyperparameter, d_{t+1} is the Euclidean distance between the manipulator end position and the target point, and the subscripts t and t+1 denote the state variables at times t and t+1.
10. A redundant robot arm control apparatus, characterized by comprising at least one memory and at least one processor;
the memory includes at least one executable program stored therein;
the executable program, when executed by the processor, implements the method of any one of claims 1 to 9.
CN202210272600.8A 2022-03-18 2022-03-18 Redundant mechanical arm operability optimization method and device based on deep reinforcement learning Active CN114800488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210272600.8A CN114800488B (en) 2022-03-18 2022-03-18 Redundant mechanical arm operability optimization method and device based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210272600.8A CN114800488B (en) 2022-03-18 2022-03-18 Redundant mechanical arm operability optimization method and device based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114800488A CN114800488A (en) 2022-07-29
CN114800488B (en) 2023-06-20

Family

ID=82530104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210272600.8A Active CN114800488B (en) 2022-03-18 2022-03-18 Redundant mechanical arm operability optimization method and device based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114800488B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115272541B (en) * 2022-09-26 2023-01-03 成都市谛视无限科技有限公司 Gesture generation method for driving intelligent agent to reach multiple target points

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956297B (en) * 2016-05-09 2022-09-13 金陵科技学院 Comprehensive evaluation and optimization method for redundant robot motion flexibility performance
CN108326844B (en) * 2017-01-20 2020-10-16 香港理工大学深圳研究院 Motion planning method and device for optimizing operability of redundant manipulator
CN106842907B (en) * 2017-02-16 2020-03-27 香港理工大学深圳研究院 Cooperative control method and device for multi-redundancy mechanical arm system
CN110333739B (en) * 2019-08-21 2020-07-31 哈尔滨工程大学 AUV (autonomous Underwater vehicle) behavior planning and action control method based on reinforcement learning
CN111923039B (en) * 2020-07-14 2022-07-05 西北工业大学 Redundant mechanical arm path planning method based on reinforcement learning
CN112528552A (en) * 2020-10-23 2021-03-19 洛阳银杏科技有限公司 Mechanical arm control model construction method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN114800488A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN109960880B (en) Industrial robot obstacle avoidance path planning method based on machine learning
US20180036882A1 (en) Layout setting method and layout setting apparatus
CN109901397B (en) Mechanical arm inverse kinematics method using particle swarm optimization algorithm
Thakar et al. Accounting for part pose estimation uncertainties during trajectory generation for part pick-up using mobile manipulators
CN114800488B (en) Redundant mechanical arm operability optimization method and device based on deep reinforcement learning
CN106965171A (en) Possesses the robot device of learning functionality
CN112847235B (en) Robot step force guiding assembly method and system based on deep reinforcement learning
CN116533249A (en) Mechanical arm control method based on deep reinforcement learning
Laezza et al. Reform: A robot learning sandbox for deformable linear object manipulation
CN115091469B (en) Depth reinforcement learning mechanical arm motion planning method based on maximum entropy frame
CN113664829A (en) Space manipulator obstacle avoidance path planning system and method, computer equipment and storage medium
Hebecker et al. Towards real-world force-sensitive robotic assembly through deep reinforcement learning in simulations
Ranjbar et al. Residual feedback learning for contact-rich manipulation tasks with uncertainty
CN116803635A (en) Near-end strategy optimization training acceleration method based on Gaussian kernel loss function
Lämmle et al. Simulation-based learning of the peg-in-hole process using robot-skills
CN113967909B (en) Direction rewarding-based intelligent control method for mechanical arm
CN110114195B (en) Action transfer device, action transfer method, and non-transitory computer-readable medium storing action transfer program
CN115042185A (en) Mechanical arm obstacle avoidance grabbing method based on continuous reinforcement learning
Yovchev Finding the optimal parameters for robotic manipulator applications of the bounded error algorithm for iterative learning control
CN117140527B (en) Mechanical arm control method and system based on deep reinforcement learning algorithm
CN113290557A (en) Snake-shaped robot control method based on data driving
US11921492B2 (en) Transfer between tasks in different domains
Liu et al. Optimizing Non-diagonal Stiffness Matrix of Compliance Control for Robotic Assembly Using Deep Reinforcement Learning
Flageat et al. Incorporating Human Priors into Deep Reinforcement Learning for Robotic Control.
US20230195843A1 (en) Machine learning device, machine learning method, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant