CN114378820B - Robot impedance learning method based on safety reinforcement learning - Google Patents


Info

Publication number
CN114378820B
Authority
CN
China
Prior art keywords
robot
learning
impedance
information
input
Prior art date
Legal status
Active
Application number
CN202210055753.7A
Other languages
Chinese (zh)
Other versions
CN114378820A
Inventor
潘永平
冯晓欣
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210055753.7A
Publication of CN114378820A
Application granted
Publication of CN114378820B


Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1628Programme controls characterised by the control loop
    • B25J9/163Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • B25J9/161Hardware, e.g. neural networks, fuzzy logic, interfaces, processor

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a robot impedance learning method based on safety reinforcement learning, which comprises the following steps: the controller outputs control moment, and calculates the position information and the speed information of the Cartesian space of the robot according to a robot dynamics equation; constructing an input item according to the position information, the speed information and the return information of a learning algorithm; determining a decision action according to the input item, and further determining an impedance parameter; taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force; and taking the impedance parameter and the environment interaction force as input of an admittance model, and determining a reference position and a reference speed according to the admittance model as target input of a controller. The invention has high stability, improves the feasibility of admittance control, and can be widely applied to the technical field of artificial intelligence.

Description

Robot impedance learning method based on safety reinforcement learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a robot impedance learning method based on safety reinforcement learning.
Background
Impedance control is an effective method for controlling the interaction force of a robot. Desired impedance parameters usually need to be given so that an impedance controller can be designed to control the robot to achieve the desired interaction force. However, due to the unknowns and uncertainties of the external environment, it is often necessary to introduce virtual compliance into the system to ensure the safety of the interaction process.
The existing robot impedance learning method has the following defects:
1. Traditional optimization methods (e.g., gradient descent optimization) require the environment model and the robot model to be known in order to learn the impedance parameters.
2. Existing reinforcement learning algorithms applied to impedance learning, such as the Deep Deterministic Policy Gradient (DDPG) algorithm, have certain drawbacks: the Actor network in DDPG is updated without delay and no noise is added to its output action, which easily makes the algorithm unstable, so it is not the most suitable algorithm for impedance learning in the robot shaft hole assembly task.
3. The existing Probabilistic Inference for Learning COntrol (PILCO) algorithm for impedance learning is a model-based reinforcement learning algorithm, which first needs to model and predict future states with a Gaussian Process (GP), and such modeling carries a certain model error.
4. General reinforcement learning introduces random exploration into the solution process, and exploration that is not safety-limited is likely to present significant risks. If reinforcement learning is applied directly to real-world tasks and the agent is allowed to explore by trial and error, the decisions made may drive the system into a dangerous state.
Disclosure of Invention
In view of the above, the embodiment of the invention provides a robot impedance learning method with high stability based on safety reinforcement learning, so as to improve the feasibility of variable admittance control.
An aspect of the present invention provides a robot impedance learning method based on safety reinforcement learning, including:
the controller outputs control moment, and calculates the position information and the speed information of the Cartesian space of the robot according to a robot dynamics equation;
constructing an input item according to the position information, the speed information and the return information of a learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environment interaction force as input of an admittance model, and determining a reference position and a reference speed as target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot motion.
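For illustration only, the following Python-style sketch outlines one iteration of the above control and learning loop; the callables controller, robot_dynamics, agent, environment and admittance are hypothetical placeholders and are not part of the claimed method:

def learning_control_iteration(controller, robot_dynamics, agent, environment, admittance,
                               reference, last_action, last_reward):
    # 1. The controller outputs a control torque for the current reference trajectory,
    #    and the robot dynamics return the Cartesian position and speed of the end effector.
    tau = controller(reference)
    x, x_dot = robot_dynamics(tau)

    # 2./3. Position, speed and the return of the learning algorithm form the input item;
    #       the decision action of the agent is used as the impedance parameter K.
    K = agent((x, x_dot, last_action, last_reward))

    # 4. The environment module maps the end-effector motion to the interaction force F_e.
    F_e = environment(x, x_dot)

    # 5. The admittance model maps (K, F_e) to a reference position and speed,
    #    which become the target input of the controller for the next step.
    x_r, x_r_dot = admittance(K, F_e)
    return (x_r, x_r_dot), K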
Optionally, the calculating the position information and the speed information of the cartesian space of the robot according to the robot dynamics equation specifically includes:
calculating the position information and the speed information through a robot dynamics equation;
the expression of the robot dynamics equation is as follows:
M(q)q̈ + C(q, q̇)q̇ + G(q) + f(q, q̇) = τ + τ_e;

wherein τ is the joint torque; M(q) is the joint space inertia matrix; C(q, q̇) is the Coriolis force and centripetal force coupling matrix; G(q) is the gravity term; f(q, q̇) is the non-rigid body dynamic term such as friction; τ_e is the moment exerted by the environment in the joint space; q and q̇ are the joint space position and speed information of the robot;

the position information is Cartesian space position information, and the calculation formula of the Cartesian space position information is as follows:

x = ψ(q);

the calculation formula of the speed information is as follows:

ẋ = J(q)q̇;

wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics; J(q) is the Jacobian matrix.
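As a purely illustrative numerical example (not part of the disclosure), the sketch below evaluates x = ψ(q) and ẋ = J(q)q̇ for an assumed planar two-link arm; the link lengths and joint values are arbitrary:

import numpy as np

# Assumed two-link planar arm standing in for psi(.) and J(q).
l1, l2 = 0.4, 0.3

def psi(q):
    """Forward kinematics x = psi(q) of the assumed two-link arm."""
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def jacobian(q):
    """Geometric Jacobian J(q) of the same arm."""
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

q = np.array([0.3, 0.5])         # joint positions
q_dot = np.array([0.1, -0.2])    # joint speeds
x = psi(q)                       # Cartesian position
x_dot = jacobian(q) @ q_dot      # Cartesian speed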
Optionally, the determining a decision action according to the input item, and further determining an impedance parameter, includes:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain decision actions;
and taking the decision action as an impedance parameter.
Optionally, the calculation formula of the environment interaction force is:
F_e = F_d + M_d(ẍ - ẍ_d) + D_d(ẋ - ẋ_d) + K_d(x - x_d);

wherein F_e represents the environment interaction force; D_d, K_d and M_d are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; F_d is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian space position, speed and acceleration of the robot; x_d, ẋ_d and ẍ_d are respectively the desired position, desired speed and desired acceleration of the robot in Cartesian space.
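A minimal sketch of the environment interaction force computation reconstructed above is given below; the diagonal matrices M_d, D_d, K_d and all trajectory values are assumed inputs:

import numpy as np

def interaction_force(M_d, D_d, K_d, F_d, x, x_dot, x_ddot, x_d, x_d_dot, x_d_ddot):
    # F_e = F_d + M_d (xdd - xdd_d) + D_d (xd - xd_d) + K_d (x - x_d)
    return (F_d
            + M_d @ (x_ddot - x_d_ddot)
            + D_d @ (x_dot - x_d_dot)
            + K_d @ (x - x_d))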
Optionally, the admittance model has the expression:
M_d(ẍ_r - ẍ_d) + D_d(ẋ_r - ẋ_d) + K_d(x_r - x_d) = F_e - F_d;

wherein x_r, ẋ_r and ẍ_r are respectively the Cartesian space reference position, reference speed and reference acceleration of the robot constrained after the robot is subjected to the external force.
Optionally, the evaluating and optimizing the processed input item through the Critic network and the Actor network to obtain a decision action includes:
respectively inputting state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises the position information and the speed information of the end effector of the current mechanical arm;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
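By way of illustration, a minimal Actor network of the kind described above could map the state to a bounded impedance parameter as sketched below; the layer sizes and stiffness bounds are assumptions, not values from the disclosure:

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the state (position, speed, previous action) to a bounded impedance parameter K."""
    def __init__(self, state_dim, action_dim, k_min=10.0, k_max=1000.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.k_min, self.k_max = k_min, k_max

    def forward(self, state):
        a = self.net(state)                                  # decision action in (-1, 1)
        # Scale the decision action into an assumed stiffness range used as the impedance parameter.
        return self.k_min + 0.5 * (a + 1.0) * (self.k_max - self.k_min)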
Optionally, the method further comprises the steps of: safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the method specifically comprises the following steps:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
according to the actual task, optimizing and adjusting the loss function;
and performing safety reinforcement learning according to the adjusted loss function.
Another aspect of the embodiments of the present invention further provides a robot impedance learning device based on safety reinforcement learning, including:
the first module is used for outputting control moment by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to the dynamics equation of the robot;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
the third module is used for determining a decision action according to the input item so as to determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module, and calculating to obtain the environment interaction force;
a fifth module, configured to take the impedance parameter and the environmental interaction force as input of an admittance model, and determine a reference position and a reference speed according to the admittance model as target input of a controller;
wherein the target input is for causing the controller to control the robot motion.
Another aspect of the embodiment of the invention also provides an electronic device, which includes a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the embodiments of the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing method.
The controller of the embodiment of the invention outputs control moment, and calculates the position information and the speed information of the Cartesian space of the robot according to a robot dynamics equation; constructing an input item according to the position information, the speed information and the return information of a learning algorithm; determining a decision action according to the input item, and further determining an impedance parameter; taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force; and taking the impedance parameter and the environment interaction force as input of an admittance model, and determining a reference position and a reference speed according to the admittance model as target input of a controller. The invention has high stability and improves the feasibility of admittance control.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a learning control framework provided by an embodiment of the present invention;
fig. 2 is an assembly schematic diagram of a shaft hole of a robot according to an embodiment of the present invention;
FIG. 3 is a network architecture diagram of the twin delayed deep deterministic policy gradient (Twin Delayed Deep Deterministic Policy Gradient, TD3) algorithm provided by an embodiment of the invention;
FIG. 4 is a schematic diagram of a constrained Markov decision process framework provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
First, the technical terms related to the present invention are described:
admittance control: the compliance control is a robot interaction control method, and the compliance control is not intended to independently control the position and interaction force of the robot, but rather is intended to realize a specified dynamic relationship between the interaction force and the position error, namely, to carry out impedance or admittance shaping on the robot, and virtual compliance is introduced into the system in a control mode so as to ensure the safety of the interaction process. The compliant control is divided into impedance control based on force control and impedance control based on position control, the latter is abbreviated as admittance control, and the idea is to modulate the robot in the interaction process into a second-order admittance model by a control mode, wherein the model comprises three impedance parameters of inertia, damping and rigidity.
Variable admittance control: a fixed admittance model means that the inertia, damping and stiffness parameters in the admittance model are fixed. Under many conditions, interaction control with a fixed admittance model cannot achieve the expected effect of adapting to the environment and task, so the concept of variable admittance control is introduced: the impedance parameters are adjusted according to the specific environment and task, so that the robot complies better with the environmental force and achieves compliant operation in an unknown dynamic environment.
Impedance learning: the process of adjusting the impedance parameters is commonly referred to as impedance learning. Common impedance adjustment methods include imitation learning, iterative learning, gradient descent optimization, neural networks, reinforcement learning, and the like.
Reinforcement learning: reinforcement learning can overcome the limitation that traditional optimal control algorithms cannot fully model the environment, finding the optimal solution through interaction with the environment. In robotic applications, one of the main objectives of reinforcement learning is to have the robot interact with the environment fully autonomously; an important feature is learning the optimal behavior without human involvement and without requiring models of the robot and the environment system. In the robot variable impedance learning task, the main purpose of reinforcement learning is to autonomously learn and adjust the impedance parameters of the robot so as to exhibit more appropriate compliance.
Constrained Markov Decision Process (CMDP): the Markov Decision Process (MDP) is a mathematical model of sequential decision making, used to describe the stochastic policies and rewards achievable by an agent in an environment where the system state has the Markov property; almost all reinforcement learning problems can be cast as MDPs, and the MDP is used to model the reinforcement learning problem. By means of dynamic programming, random sampling, and the like, the MDP can solve for the agent policy that maximizes the return. The Constrained Markov Decision Process (CMDP) additionally introduces a loss function and constraints, and the objective of the CMDP problem is to maximize the long-term return while satisfying all constraint conditions.
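For reference, the standard textbook form of the CMDP objective can be written as follows (a general formulation; the discount factor γ and the per-step cost c are generic symbols, not quoted from this description):

\max_{\pi}\; J(\pi)=\mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\Big]
\quad\text{s.t.}\quad
J_{C}(\pi)=\mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}\,c(s_t,a_t)\Big]\le d .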
Aiming at the problems existing in the prior art, the embodiment of the invention provides a robot impedance learning method based on safety reinforcement learning, which comprises the following steps:
the controller outputs control moment, and calculates the position information and the speed information of the Cartesian space of the robot according to a robot dynamics equation;
constructing an input item according to the position information, the speed information and the return information of a learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environment interaction force as input of an admittance model, and determining a reference position and a reference speed as target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot motion.
Optionally, the calculating the position information and the speed information of the cartesian space of the robot according to the robot dynamics equation specifically includes:
calculating the position information and the speed information through a robot dynamics equation;
the expression of the robot dynamics equation is as follows:
M(q)q̈ + C(q, q̇)q̇ + G(q) + f(q, q̇) = τ + τ_e;

wherein τ is the joint torque; M(q) is the joint space inertia matrix; C(q, q̇) is the Coriolis force and centripetal force coupling matrix; G(q) is the gravity term; f(q, q̇) is the non-rigid body dynamic term such as friction; τ_e is the moment exerted by the environment in the joint space; q and q̇ are the joint space position and speed information of the robot;

the position information is Cartesian space position information, and the calculation formula of the Cartesian space position information is as follows:

x = ψ(q);

the calculation formula of the speed information is as follows:

ẋ = J(q)q̇;

wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics; J(q) is the Jacobian matrix.
Optionally, the determining a decision action according to the input item, and further determining an impedance parameter, includes:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain decision actions;
and taking the decision action as an impedance parameter.
Optionally, the calculation formula of the environment interaction force is:
F_e = F_d + M_d(ẍ - ẍ_d) + D_d(ẋ - ẋ_d) + K_d(x - x_d);

wherein F_e represents the environment interaction force; D_d, K_d and M_d are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; F_d is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian space position, speed and acceleration of the robot; x_d, ẋ_d and ẍ_d are respectively the desired position, desired speed and desired acceleration of the robot in Cartesian space.
Optionally, the admittance model has the expression:
M_d(ẍ_r - ẍ_d) + D_d(ẋ_r - ẋ_d) + K_d(x_r - x_d) = F_e - F_d;

wherein x_r, ẋ_r and ẍ_r are respectively the Cartesian space reference position, reference speed and reference acceleration of the robot constrained after the robot is subjected to the external force.
Optionally, the evaluating and optimizing the processed input item through the Critic network and the Actor network to obtain a decision action includes:
respectively inputting state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises the position information and the speed information of the end effector of the current mechanical arm;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
Optionally, the method further comprises the steps of: safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the method specifically comprises the following steps:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
according to the actual task, optimizing and adjusting the loss function;
and performing safety reinforcement learning according to the adjusted loss function.
Another aspect of the embodiments of the present invention further provides a robot impedance learning device based on safety reinforcement learning, including:
the first module is used for outputting control moment by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to the dynamics equation of the robot;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
the third module is used for determining a decision action according to the input item so as to determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module, and calculating to obtain the environment interaction force;
a fifth module, configured to take the impedance parameter and the environmental interaction force as input of an admittance model, and determine a reference position and a reference speed according to the admittance model as target input of a controller;
wherein the target input is for causing the controller to control the robot motion.
Another aspect of the embodiment of the invention also provides an electronic device, which includes a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the embodiments of the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing method.
The following describes the specific implementation of the present invention in detail with reference to the drawings of the specification:
as shown in fig. 1, fig. 1 is a learning control frame diagram according to an embodiment of the present invention.
Specifically, the overall learning control flow is as follows:
1) The inner loop is the control loop, which makes the robot system with unknown dynamic characteristics exhibit the behavior of a specified admittance model; that is, the controller outputs a control torque τ, and the actual Cartesian space position x and speed ẋ of the robot are calculated according to the robot dynamics equation

M(q)q̈ + C(q, q̇)q̇ + G(q) + f(q, q̇) = τ + τ_e,

wherein τ is the joint torque, M(q) is the joint space inertia matrix, C(q, q̇) is the Coriolis force and centripetal force coupling matrix, G(q) is the gravity term, f(q, q̇) is the non-rigid body dynamic term such as friction, τ_e is the moment exerted by the environment in the joint space, and q, q̇ are the joint space position and speed information of the robot. The corresponding Cartesian space position and speed of the robot are obtained through the conversion model, with the specific formulas:

x = ψ(q),

ẋ = J(q)q̇,

wherein ψ(·) is the robot forward kinematics and J(q) is the Jacobian matrix.
2) The actual position x and speed ẋ, together with the return r, are used as input items of the reinforcement learning algorithm. After the input state information is processed, the decision action, namely the impedance parameter K, is output through Critic network evaluation and the Actor network; the specific algorithm implementation process is shown in Fig. 3.
Fig. 3 shows the network architecture of the twin delayed deep deterministic policy gradient (TD3) algorithm.
In this embodiment, the impedance parameter is learned using the twin delayed deep deterministic policy gradient (Twin Delayed Deep Deterministic Policy Gradient, TD3) algorithm, wherein the state S comprises the current position x and speed ẋ of the robot end effector and the action K output at the previous moment; A represents the policy output by the Actor network; A' represents the policy output by the Actor target network; Q1 represents the value function computed by Critic network 1; Q2 represents the value function computed by Critic network 2; Q' represents the target value function; R represents the immediate reward; td_error1 represents the error obtained by subtracting Q1 from the weighted sum of R and Q'; td_error2 represents the error obtained by subtracting Q2 from the weighted sum of R and Q'; Target represents the weighted sum of R and Q'. The difference between the Actor network and the Actor target network is that the Actor network is updated from the experience pool at every step, while the Actor target network copies the Actor network's parameters at intervals to realize its update. Critic network 1 and Critic network 2 independently update their network parameters using the same target value function. The difference between Critic network 1 and Critic target network 1 is that Critic network 1 is updated from the experience pool at every step, while Critic target network 1 copies the parameters of Critic network 1 at intervals to realize its update.
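For illustration, one TD3 update step consistent with the structure of Fig. 3 could be sketched as follows; the hyperparameters are assumed, and the soft target update shown is the common TD3 variant of the periodic parameter copying described above:

import torch
import torch.nn.functional as F

def td3_update(batch, actor, actor_t, critic1, critic2, critic1_t, critic2_t,
               actor_opt, critic_opt, step, gamma=0.99, tau_soft=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    s, a, r, s_next, done = batch

    with torch.no_grad():
        # Target policy smoothing: clipped noise is added to the target actor's action A'
        # (clipping A' to the action range is omitted here for brevity).
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = actor_t(s_next) + noise
        # Clipped double-Q: Q' is the minimum of the two target critics.
        q_next = torch.min(critic1_t(s_next, a_next), critic2_t(s_next, a_next))
        target = r + gamma * (1.0 - done) * q_next          # Target = R + gamma * Q'

    # TD errors of the two critics (td_error1 / td_error2 in Fig. 3).
    critic_loss = F.mse_loss(critic1(s, a), target) + F.mse_loss(critic2(s, a), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Delayed policy update: the actor and the target networks update only every policy_delay steps.
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for net, net_t in [(actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)]:
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau_soft).add_(tau_soft * p.data)   # soft target update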
The return function defined in this embodiment is:

r = -a‖F_e - F_d‖² - b‖x - x_d‖² - c‖x - x_obj‖² + r_final;

wherein x_obj represents the target position and r_final is a positive integer.
The return function comprises four items in total. The first three items represent the instant return of each step and are respectively used for penalizing actions that generate large interaction forces, deviate from the desired trajectory, or move away from the target position; the last item indicates that the task is completed within the specified time, i.e., a reward is given when the target position is reached. The purpose of the return function is therefore to encourage movement towards the hole while suppressing behavior that would create large interaction forces.
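As an illustrative sketch only, the return function above could be computed as follows; the weights a, b, c and the value of r_final are unspecified in the description and are assumed here:

import numpy as np

def reward(F_e, F_d, x, x_d, x_obj, reached_goal, a=1.0, b=1.0, c=1.0, r_final=100.0):
    r = (-a * np.linalg.norm(F_e - F_d) ** 2
         - b * np.linalg.norm(x - x_d) ** 2
         - c * np.linalg.norm(x - x_obj) ** 2)
    if reached_goal:            # task completed within the allotted time
        r += r_final
    return r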
3) In addition, the actual position x and speed ẋ of the mechanical arm end effector are used as the input of the environment module, and the environment interaction force F_e is calculated. The environment interaction force is designed as follows:

F_e = F_d + M_d(ẍ - ẍ_d) + D_d(ẋ - ẋ_d) + K_d(x - x_d),

wherein D_d, K_d and M_d are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model, F_d is the desired interaction force, x, ẋ and ẍ are respectively the Cartesian space position, speed and acceleration of the robot, and x_d, ẋ_d and ẍ_d are respectively the desired position, desired speed and desired acceleration of the robot in Cartesian space.
4) The impedance parameter K and the environment interaction force F_e are used as the input of the admittance model, and the reference position x_r and reference speed ẋ_r are calculated from the admittance model and used as the input of the controller. The admittance model is as follows:

M_d(ẍ_r - ẍ_d) + D_d(ẋ_r - ẋ_d) + K(x_r - x_d) = F_e - F_d,

wherein x_r, ẋ_r and ẍ_r are respectively the Cartesian space reference position, reference speed and reference acceleration of the robot constrained after the robot is subjected to the external force, and the stiffness matrix K is the impedance parameter output by the learning algorithm. The reference position x_r and reference speed ẋ_r are specifically calculated from the admittance model by an integration method: the reference speed ẋ_r(t - τ_s) and reference position x_r(t - τ_s) of the previous moment t - τ_s are used to calculate the reference acceleration ẍ_r(t) of the current moment t, and then the reference speed ẋ_r(t) and reference position x_r(t) of the current moment are obtained through integration. The formulas are as follows:

ẍ_r(t) = ẍ_d + M_d⁻¹[(F_e - F_d) - D_d(ẋ_r(t - τ_s) - ẋ_d) - K(x_r(t - τ_s) - x_d)],

ẋ_r(t) = ẋ_r(t - τ_s) + ẍ_r(t)τ_s,

x_r(t) = x_r(t - τ_s) + ẋ_r(t)τ_s.
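A minimal sketch of this integration scheme is given below; M_d, D_d and the sampling period τ_s are assumed inputs, and K is the learned impedance parameter:

import numpy as np

def admittance_step(M_d, D_d, K, F_e, F_d, x_d, x_d_dot, x_d_ddot,
                    x_r_prev, x_r_dot_prev, tau_s):
    # Reference acceleration from the admittance model at the current moment.
    x_r_ddot = x_d_ddot + np.linalg.solve(
        M_d, (F_e - F_d) - D_d @ (x_r_dot_prev - x_d_dot) - K @ (x_r_prev - x_d))
    # Forward-Euler integration over one sample period tau_s.
    x_r_dot = x_r_dot_prev + x_r_ddot * tau_s
    x_r = x_r_prev + x_r_dot * tau_s
    return x_r, x_r_dot, x_r_ddot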
The final purpose of this scheme is to learn, by means of reinforcement learning, the impedance parameters used by the robot during task execution, so as to strengthen the compliant behavior of the robot and ensure the safety of both the robot and the environment. The learning curve converges after several hundred training episodes, and the optimal motion trajectory of the robot and the variation curve of the environment interaction force can be obtained through simulation, thereby completing the whole task.
Fig. 2 is a schematic diagram of robot shaft hole assembly, with the robot shaft hole assembly task as the environment background. The initial position of the mechanical arm end is x_0; the desired trajectory goes first from x_0 to x_1 and then from x_1 to x_2. When deviating from the desired trajectory, the mechanical arm experiences the environment interaction force generated by collision with the wall, and a compliant effect is produced by adjusting the impedance parameter, so that the environment interaction force is reduced.
FIG. 4 is a schematic diagram of a constrained Markov decision process framework in combination with which the present invention implements secure reinforcement learning.
Compared with a general Markov decision process, the constrained Markov decision process additionally introduces a loss function c and sets a constraint threshold d. Let J_C(π) denote the long-term loss of policy π under the constraint; the set of feasible solutions is defined as Π_C = {π ∈ Π : J_C(π) ≤ d}. Then, under the condition that the constraint is satisfied, the policy maximizing the long-term return is searched: π* = argmax_{π∈Π_C} J(π). The loss function is designed as c = w‖F_e - F_d‖², wherein w is a parameter for adjusting the weight. The purpose of designing the loss function in this way is to constrain the environment interaction force within a safe range, and then realize learning of the impedance parameters within that safe range.
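For illustration, the loss function and one common (assumed here, not prescribed by this description) Lagrangian-style way of enforcing the constraint J_C(π) ≤ d can be sketched as:

import numpy as np

def safety_cost(F_e, F_d, w=1.0):
    # Per-step loss c = w * ||F_e - F_d||^2 penalizing unsafe interaction forces.
    return w * np.linalg.norm(F_e - F_d) ** 2

def penalized_reward(r, cost, lam):
    # Reward shaped by the constraint multiplier lambda.
    return r - lam * cost

def update_multiplier(lam, episode_cost, d, lr=1e-3):
    # Increase lambda when the long-term loss exceeds the threshold d, decrease it otherwise.
    return max(0.0, lam + lr * (episode_cost - d))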
In summary, the TD3 algorithm is applied to the impedance learning of the Panda robot for the first time, and the feasibility of performing the impedance learning by the Panda robot based on the reinforcement learning mode and further realizing the admittance variation control is verified; the invention also applies the safety reinforcement learning idea to impedance learning, combines the CMDP with the TD3 algorithm, and applies the safety reinforcement learning idea to the impedance learning task in the robot shaft hole assembly process, thereby guaranteeing the safety of the shaft hole assembly task.
Compared with the prior art, the invention has the following advantages:
Firstly, impedance learning based on the TD3 algorithm is applied to a Panda robot simulation platform for the first time, verifying the feasibility of the Panda robot performing impedance learning in a reinforcement learning manner and thereby realizing variable admittance control. Secondly, the deep reinforcement learning (TD3) algorithm has higher performance and higher stability, learns the optimal impedance parameters faster, and is more suitable for the impedance learning task in the robot shaft hole assembly process. Finally, the safety reinforcement learning idea is introduced and combined with the CMDP, so that a safety guarantee function is realized and the robot shaft hole assembly process is made safer.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments described above, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and these equivalent modifications or substitutions are included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. The robot impedance learning method based on the safety reinforcement learning is characterized by comprising the following steps of:
the controller outputs control moment, and calculates the position information and the speed information of the Cartesian space of the robot according to a robot dynamics equation;
constructing an input item according to the position information, the speed information and the return information of a learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environment interaction force as input of an admittance model, and determining a reference position and a reference speed as target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot motion.
2. The robot impedance learning method based on the safety reinforcement learning according to claim 1, wherein the calculation of the position information and the speed information of the cartesian space of the robot according to the robot dynamics equation is specifically as follows:
calculating the position information and the speed information through a robot dynamics equation;
the expression of the robot dynamics equation is as follows:
M(q)q̈ + C(q, q̇)q̇ + G(q) + f(q, q̇) = τ + τ_e;

wherein τ is the joint torque; M(q) is the joint space inertia matrix; C(q, q̇) is the Coriolis force and centripetal force coupling matrix; G(q) is the gravity term; f(q, q̇) is the non-rigid body dynamic term; τ_e is the moment exerted by the environment in the joint space; q and q̇ are the joint space position and speed information of the robot;

the position information is Cartesian space position information, and the calculation formula of the Cartesian space position information is as follows:

x = ψ(q);

the calculation formula of the speed information is as follows:

ẋ = J(q)q̇;

wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics; J(q) is the Jacobian matrix.
3. The method for learning impedance of a robot based on safety reinforcement learning of claim 1, wherein determining a decision action based on the input term, and thus determining an impedance parameter, comprises:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain decision actions;
and taking the decision action as an impedance parameter.
4. The robot impedance learning method based on safety reinforcement learning according to claim 1, wherein the calculation formula of the environment interaction force is:
F_e = F_d + M_d(ẍ - ẍ_d) + D_d(ẋ - ẋ_d) + K_d(x - x_d);

wherein F_e represents the environment interaction force; D_d, K_d and M_d are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; F_d is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian space position, speed and acceleration of the robot; x_d, ẋ_d and ẍ_d are respectively the desired position, desired speed and desired acceleration of the robot in Cartesian space.
5. The robot impedance learning method based on safety reinforcement learning of claim 4, wherein the admittance model has the expression:
M_d(ẍ_r - ẍ_d) + D_d(ẋ_r - ẋ_d) + K_d(x_r - x_d) = F_e - F_d;

wherein x_r, ẋ_r and ẍ_r are respectively the Cartesian space reference position, reference speed and reference acceleration of the robot constrained after the robot is subjected to the external force.
6. The method for learning impedance of a robot based on safety reinforcement learning according to claim 3, wherein the evaluating and optimizing the processed input item through a Critic network and an Actor network to obtain a decision action comprises:
respectively inputting state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises the position information and the speed information of the end effector of the current mechanical arm;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
7. The robot impedance learning method based on safety reinforcement learning of claim 6, further comprising the steps of: safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the method specifically comprises the following steps:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
according to the actual task, optimizing and adjusting the loss function;
and performing safety reinforcement learning according to the adjusted loss function.
8. The utility model provides a robot impedance learning device based on safe reinforcement study which characterized in that includes:
the first module is used for outputting control moment by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to the dynamics equation of the robot;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
the third module is used for determining a decision action according to the input item so as to determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module, and calculating to obtain the environment interaction force;
a fifth module, configured to take the impedance parameter and the environmental interaction force as input of an admittance model, and determine a reference position and a reference speed according to the admittance model as target input of a controller;
wherein the target input is for causing the controller to control the robot motion.
9. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executing the program implements the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the storage medium stores a program that is executed by a processor to implement the method of any one of claims 1 to 7.
CN202210055753.7A 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning Active CN114378820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210055753.7A CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210055753.7A CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Publications (2)

Publication Number Publication Date
CN114378820A CN114378820A (en) 2022-04-22
CN114378820B true CN114378820B (en) 2023-06-06

Family

ID=81203767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210055753.7A Active CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Country Status (1)

Country Link
CN (1) CN114378820B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421387B (en) * 2022-09-22 2023-04-14 中国科学院自动化研究所 Variable impedance control system and control method based on inverse reinforcement learning

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153153B (en) * 2017-12-19 2020-09-11 哈尔滨工程大学 Learning variable impedance control system and control method
DE102019006725B4 (en) * 2018-10-02 2023-06-01 Fanuc Corporation control device and control system
CN112847235B (en) * 2020-12-25 2022-09-09 山东大学 Robot step force guiding assembly method and system based on deep reinforcement learning
CN112757344B (en) * 2021-01-20 2022-03-11 清华大学 Robot interference shaft hole assembling method and device based on force position state mapping model
CN113341706B (en) * 2021-05-06 2022-12-06 东华大学 Man-machine cooperation assembly line system based on deep reinforcement learning
CN113352322B (en) * 2021-05-19 2022-10-04 浙江工业大学 Adaptive man-machine cooperation control method based on optimal admittance parameters
CN113510704A (en) * 2021-06-25 2021-10-19 青岛博晟优控智能科技有限公司 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Also Published As

Publication number Publication date
CN114378820A (en) 2022-04-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant