CN115421387A - Variable impedance control system and control method based on inverse reinforcement learning - Google Patents
- Publication number
- CN115421387A (application CN202211161566.3A)
- Authority
- CN
- China
- Prior art keywords
- variable impedance
- mechanical arm
- track
- reinforcement learning
- strategy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/02—Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]
Abstract
The present disclosure relates to a variable impedance control system, a control method, an electronic device, and a storage medium based on inverse reinforcement learning. The system includes a variable impedance controller, an impedance gain controller, a variable impedance control strategy module, and an inverse reinforcement learning algorithm module. By introducing a variable impedance gain action space, the method improves the transferability of the reward function across task settings and achieves a generalized representation of variable impedance skills; it enables hierarchical impedance control of the mechanical arm, accomplishes relatively complex physical interaction, and guarantees the motion precision of the mechanical arm in a dynamic environment, thereby improving the safety of mechanical arm control.
Description
Technical Field
The present disclosure relates to the field of mechanical arms and automatic control, and in particular, to a variable impedance control system, a control method, an electronic device, and a storage medium based on inverse reinforcement learning.
Background
Robotic systems are increasingly used in various unstructured environments, such as hospitals, factories, and homes, where the robot needs to perform complex operational tasks, adjusting its impedance according to different task phases and environmental constraints while interacting with an unknown environment in a safe and stable manner. Impedance control, which establishes mass-spring-damper contact dynamics, has been widely used in these robotic systems to ensure safe physical interaction. In addition, many complex operating tasks require the robot to change its impedance according to the task phase, and flexibility and robustness have become important indicators in developing surgical robot controllers for physical interaction. However, conventional impedance control schemes do not account for the actual surgical scenario, including complex physical interactions on the robotic arm, resulting in a loss of precision; in practice, achieving such tasks requires variable impedance skills.
The existing learning-based method for obtaining variable impedance skills mainly includes the following categories:
The first type is the teaching-learning (learning-from-demonstration) approach. A human expert controls the robot through a haptic interface and a hand-held impedance control interface, which is based on a linear spring-return potentiometer that maps button positions to robot arm stiffness. This arrangement allows the human expert to adjust the compliance of the robot according to the given task requirements; the demonstrated motion and stiffness trajectories are encoded using dynamic motion primitives and learned using locally weighted regression. If the demonstrated trajectory has high variance, the impedance should be low, and if it has low variance, the impedance should be high. Such a strategy provides a good solution for many manipulation tasks, and has the advantage that no separate demonstration of the impedance is required. However, in some interactive tasks, such as a sliding task in a groove, low trajectory variability does not necessarily correspond to high impedance.
The second type is the deep reinforcement learning approach with a variable impedance action space. When reinforcement learning is used to control robot motion, an important challenge is the parameterization of the strategy. Parameters with relevant nonlinear features are usually extracted from a set of motion demonstrations following the teaching-learning paradigm using Gaussian mixture regression; the final parameterization takes the form of a nonlinear time-invariant dynamic system, which is used as the parameterized strategy for a variant of the PI2 policy search algorithm, so that the time-invariant motion is ultimately represented through PI2. However, this approach has certain drawbacks. First, it is rather idealized, assuming that there is no noise in the system other than measurement noise, which means that disturbances encountered while sampling the trajectory have a negative impact on learning and cannot be exploited to improve the strategy. Second, it was originally designed to learn a trajectory from a particular initial state, and using it to learn trajectories from multiple initial states increases the number of deployments required. While many inverse reinforcement learning algorithms employ entropy regularization to prevent simple imitation of the expert strategy, most previous efforts have not focused on the impact of the action space selection on prior knowledge.
While many methods based on deep reinforcement learning and teaching learning have been proposed to obtain variable impedance skills for a rich set of operating tasks, these skills are typically task-specific and may be sensitive to changes in the task setting; task-specific impedance skills obtained by teaching-learning methods may fail when the task changes. Furthermore, designing suitable reward functions is challenging for reinforcement learning, and the transferability of such skills is therefore limited.
Accordingly, there is a need for one or more methods to address the above-mentioned problems.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the present disclosure is to provide a variable impedance control system, a control method, an electronic device, and a storage medium based on inverse reinforcement learning, thereby overcoming, at least to some extent, one or more of the problems due to the limitations and disadvantages of the related art.
According to an aspect of the present disclosure, there is provided an inverse reinforcement learning-based variable impedance control system, the system including a variable impedance controller, an impedance gain controller, wherein:
the variable impedance controller is used for generating, based on the acquired target stiffness and damping coefficient, a mechanical arm end desired position increment for correcting the trajectory according to the first feedback force and a desired trajectory;
the impedance gain controller is used for generating a second feedback force for controlling the movement of the mechanical arm according to the desired position increment of the mechanical arm end, the mechanical arm control being completed based on the second feedback force.
Preferably, the variable impedance control system further comprises an inverse reinforcement learning algorithm module and a variable impedance control strategy module, wherein:
the inverse reinforcement learning algorithm module is used for distinguishing motion trajectories from expert trajectories using a discriminator and calculating a loss function based on the expert strategy and the reward function, updating the discriminator by minimizing the loss function, and updating the variable impedance control strategy by maximizing the reward function;
the variable impedance control strategy module is used for calculating the target stiffness and damping coefficient according to the mechanical arm end position and the second feedback force based on the existing variable impedance control strategy, and sending the target stiffness and damping coefficient to the variable impedance controller.
Preferably, the variable impedance controller is based on the second-order impedance model

$$M_d(t)\,\Delta\ddot{X} + B_d(t)\,\Delta\dot{X} + K_d(t)\,\Delta X = E$$

and generates the mechanical arm end desired position increment $\Delta X$ for correcting the trajectory as

$$\Delta X(n) = w_1^{-1}\left\{T^2\left[E(n) + 2E(n-1) + E(n-2)\right] - w_2\,\Delta X(n-1) - w_3\,\Delta X(n-2)\right\}$$

wherein $M_d(t)$, $B_d(t)$, $K_d(t)$ respectively represent the time-varying target inertia matrix, target damping matrix, and target stiffness matrix in the impedance model; $\ddot{x}$, $\dot{x}$, $x$ are respectively the actual acceleration, velocity, and position of the robot end in Cartesian space; $\ddot{x}_d$, $\dot{x}_d$, $x_d$ are respectively the desired acceleration, velocity, and position of the robot end; $F_d$ and $F$ are respectively the desired and actual contact force between the robot end and the environment; $E(n)$ is the contact force error; $T$ is the control period; and $w_1$, $w_2$, $w_3$ are intermediate variables:

$$w_1 = 4M_d(t) + 2B_d(t)T + K_d(t)T^2$$
$$w_2 = -8M_d(t) + 2K_d(t)T^2$$
$$w_3 = 4M_d(t) - 2B_d(t)T + K_d(t)T^2$$
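As an illustration, the discretized impedance update defined by the coefficients $w_1$, $w_2$, $w_3$ above can be sketched as a single-axis Python routine. This is a minimal sketch that assumes scalar $M_d$, $B_d$, $K_d$; the patent's time-varying matrices would use matrix inverses instead of scalar division.

```python
class VariableImpedanceFilter:
    """Scalar sketch of the discretized second-order impedance model.

    Generates the desired position increment dX(n) from the contact
    force error E(n) using the bilinear-transform coefficients w1..w3.
    """

    def __init__(self, Md, Bd, Kd, T):
        self.T = T
        self.set_gains(Md, Bd, Kd)
        self.E_hist = [0.0, 0.0]   # E(n-1), E(n-2)
        self.dX_hist = [0.0, 0.0]  # dX(n-1), dX(n-2)

    def set_gains(self, Md, Bd, Kd):
        # Time-varying target inertia/damping/stiffness -> w1, w2, w3.
        T = self.T
        self.w1 = 4 * Md + 2 * Bd * T + Kd * T**2
        self.w2 = -8 * Md + 2 * Kd * T**2
        self.w3 = 4 * Md - 2 * Bd * T + Kd * T**2

    def step(self, E):
        # Difference equation of the discretized impedance controller.
        T = self.T
        num = T**2 * (E + 2 * self.E_hist[0] + self.E_hist[1])
        num -= self.w2 * self.dX_hist[0] + self.w3 * self.dX_hist[1]
        dX = num / self.w1
        self.E_hist = [E, self.E_hist[0]]
        self.dX_hist = [dX, self.dX_hist[0]]
        return dX
```

A quick sanity check on the coefficients: since $w_1 + w_2 + w_3 = 4K_d T^2$, a constant force error $E$ drives the increment to the static compliance value $E/K_d$.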
Preferably, the impedance gain controller is based on the dynamical model of the robot in Cartesian space:

$$M(x)\ddot{x} + C(x, \dot{x})\dot{x} + G(x) = F + F_{ext}$$

and the kinetic equation:

$$M_d\ddot{e} + B_d\dot{e} + K_d e = -F_{ext}$$

The feedforward term of the impedance control law is generated as:

$$F_{ff} = M(x)\ddot{x}_d + C(x, \dot{x})\dot{x} + G(x)$$

and the second feedback force is:

$$F_{fb} = K_d e + B_d\dot{e}$$

wherein $M(x)$ is the mass inertia matrix, $C(x, \dot{x})$ is the Coriolis force matrix, $G(x)$ is the gravity vector, $\ddot{x}$, $\dot{x}$, and $x$ are respectively the Cartesian acceleration, velocity, and position of the end effector, $J$ is the Jacobian matrix, $\tau$ and $F_{ext}$ are respectively the joint-space motor torque input and the external force, $M_d$, $B_d$, $K_d$ are the desired mass, damping, and stiffness matrices, and $e$ and $\dot{e}$ are the trajectory position error and trajectory velocity error.
Preferably, the variable impedance control strategy module, according to the Cartesian-space position tracking error, generates the variable impedance control strategy as a function of the distance of the mechanical arm from the target position, wherein $e_1$ and $e_2$ are two gain change points of 0.4 m and 0.2 m, respectively, at which the impedance gains are switched.
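For illustration, a distance-based gain schedule with the two change points named above might look as follows. The patent specifies only the change points $e_1 = 0.4$ m and $e_2 = 0.2$ m; the stiffness/damping levels below, and the choice of stiffening near the target, are hypothetical placeholders.

```python
def impedance_gains(distance, e1=0.4, e2=0.2):
    """Piecewise variable impedance strategy over distance-to-target.

    Gain change points e1, e2 follow the patent text (0.4 m, 0.2 m);
    the stiffness/damping levels are illustrative placeholders only.
    """
    if distance > e1:        # far from target: compliant motion
        Kd, Bd = 100.0, 20.0
    elif distance > e2:      # approach phase: intermediate gains
        Kd, Bd = 300.0, 35.0
    else:                    # near target: stiffer, precise tracking
        Kd, Bd = 600.0, 50.0
    return Kd, Bd
```

Whether the gains should rise or fall near the target is task-dependent; increasing precision near the goal is merely the assumption made here.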
Preferably, the inverse reinforcement learning algorithm module is used for, based on the expert strategy and a reward function built from the trajectory distances, wherein $d_{i,t}$ is the distance between the $i$-th mixed trajectory point and the desired point at time $t$, $d_{i,t+1}$ is the corresponding distance at time $t+1$, and $\gamma$ is a proportionality coefficient,

distinguishing motion trajectories from expert trajectories using a discriminator

$$D_\theta(o, a) = \frac{\exp\left(r_\theta(o, a)\right)}{\exp\left(r_\theta(o, a)\right) + \pi(a \mid o)}$$

and calculating the loss function

$$L(\theta) = -\,\mathbb{E}_{\tau_E}\left[\log D_\theta(o, a)\right] - \mathbb{E}_{\pi}\left[\log\left(1 - D_\theta(o, a)\right)\right]$$

wherein $r_\theta(o, a)$ is the reward function to be learned and $\pi(a \mid o)$ is the probability of taking action $a$ when the observation is $o$ under the current strategy $\pi$;

the discriminator is updated by minimizing the loss function, and the variable impedance control strategy is updated by maximizing the reward function.
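A minimal NumPy sketch of this discriminator and its cross-entropy loss is given below. The plain arrays of rewards and action probabilities stand in for the learned $r_\theta$ and the strategy $\pi$, which in the patent would be function approximators.

```python
import numpy as np

def discriminator(r, pi):
    """D(o,a) = exp(r_theta(o,a)) / (exp(r_theta(o,a)) + pi(a|o))."""
    er = np.exp(r)
    return er / (er + pi)

def discriminator_loss(r_expert, pi_expert, r_policy, pi_policy):
    """Cross-entropy loss: expert pairs labelled 1, policy pairs 0.

    Minimizing this loss trains r_theta to score expert actions above
    the current policy's actions.
    """
    d_e = discriminator(r_expert, pi_expert)
    d_p = discriminator(r_policy, pi_policy)
    return -np.mean(np.log(d_e)) - np.mean(np.log(1.0 - d_p))
```

The loss decreases as the learned reward rises on expert pairs and falls on generator pairs, which is exactly the update direction the claim describes.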
Preferably, the proportionality coefficient in the inverse reinforcement learning algorithm module takes values in the range 0 to 1.
In one aspect of the present disclosure, there is provided a variable impedance control method based on inverse reinforcement learning, the method including:
initializing the target stiffness and damping coefficient as mechanical arm control parameters, acquiring the end position and a first feedback force of the mechanical arm, and generating, by the variable impedance controller based on the target stiffness and damping coefficient, a mechanical arm end desired position increment for correcting the trajectory according to the end position, the first feedback force, and a desired trajectory of the mechanical arm;
and the impedance gain controller generates a second feedback force for controlling the movement of the mechanical arm according to the expected position increment of the tail end of the mechanical arm, and completes mechanical arm control based on the second feedback force.
Preferably, the method further comprises:
the inverse reinforcement learning algorithm module, based on the expert strategy and the reward function in the inverse reinforcement learning algorithm, distinguishes motion trajectories from expert trajectories using a discriminator and calculates a loss function, updates the discriminator by minimizing the loss function, and updates the variable impedance control strategy by maximizing the reward function;
and the variable impedance control strategy module calculates a target rigidity and a damping coefficient according to the tail end position of the mechanical arm and the second feedback force based on the variable impedance control strategy sent by the inverse reinforcement learning algorithm module, and sends the target rigidity and the damping coefficient to the variable impedance controller.
Preferably, the inverse reinforcement learning algorithm in the inverse reinforcement learning algorithm module comprises:
collecting the force and torque exerted by a human expert on the mechanical arm end effector in the expert trajectory so that the mechanical arm end completes the desired trajectory, and designing a reward function r(o, a);
initializing a first impedance gain strategy by using random weight;
collecting a first trace under the first impedance gain strategy;
exploring to obtain a second impedance gain strategy by using an inverse reinforcement learning algorithm based on the first track;
collecting a second trace according to the second impedance gain strategy;
and distinguishing the second trajectory from the expert trajectory with the discriminator, calculating the loss function, updating the discriminator by minimizing the loss function, repeating the inverse reinforcement learning algorithm, and determining and generating the optimal variable impedance control strategy based on the reward function.
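The steps above can be sketched as an alternating training loop. Everything below — the callable signatures, the random gain initialization — is schematic scaffolding under assumed interfaces, not the patent's TRPO-based implementation.

```python
import random

def train_variable_impedance_irl(collect_trajectory, expert_trajs,
                                 update_discriminator, update_policy,
                                 n_iters=10):
    """Skeleton of the claimed adversarial IRL loop.

    collect_trajectory(policy) -> trajectory under current gain policy
    update_discriminator(trajs, expert_trajs) -> minimize the loss
    update_policy(trajs) -> improved impedance gain policy
    """
    # Initialize the first impedance gain strategy with random weights.
    policy = {"Kd": random.uniform(50, 150),
              "Bd": random.uniform(10, 30)}
    for _ in range(n_iters):
        trajs = [collect_trajectory(policy)]       # roll out policy
        update_discriminator(trajs, expert_trajs)  # minimize loss
        policy = update_policy(trajs)              # maximize reward
    return policy
```

The loop mirrors the claim: collect a trajectory under the current gain strategy, update the discriminator against the expert trajectories, then improve the strategy, repeating until the reward-based stopping criterion is met.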
In one aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory having computer readable instructions stored thereon which, when executed by the processor, implement a method according to any of the above.
In an aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, realizes the method according to any one of the above.
Exemplary embodiments of the present disclosure provide a variable impedance control system, a control method, an electronic device, and a storage medium based on inverse reinforcement learning. The system includes a variable impedance controller, an impedance gain controller, a variable impedance control strategy module, and an inverse reinforcement learning algorithm module. By introducing a variable impedance gain action space, the method improves the transferability of the reward function across task settings and achieves a generalized representation of variable impedance skills; it enables hierarchical impedance control of the mechanical arm, accomplishes complex physical interaction, and guarantees the motion precision of the mechanical arm in a dynamic environment, thereby improving the safety of mechanical arm control.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 illustrates a system block diagram of an inverse reinforcement learning based variable impedance control system according to an exemplary embodiment of the present disclosure;
FIG. 2 illustrates a controller design schematic of an inverse reinforcement learning based variable impedance control system according to an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a flowchart of an inverse reinforcement learning algorithm for an inverse reinforcement learning based variable impedance control system according to an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a flow chart of a variable impedance control method based on inverse reinforcement learning according to an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a block diagram of an electronic device according to an exemplary embodiment of the present disclosure; and
fig. 6 schematically illustrates a schematic diagram of a computer-readable storage medium according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and/or processor devices and/or microcontroller devices.
In the present exemplary embodiment, there is first provided an inverse reinforcement learning-based variable impedance control system; referring to fig. 1, the variable impedance control system based on inverse reinforcement learning includes a variable impedance controller, an impedance gain controller, wherein:
the variable impedance controller is used for generating, based on the acquired target stiffness and damping coefficient, a mechanical arm end desired position increment for correcting the trajectory according to the first feedback force and a desired trajectory;
the impedance gain controller is used for generating a second feedback force for controlling the movement of the mechanical arm according to the desired position increment of the mechanical arm end, the mechanical arm control being completed based on the second feedback force.
Exemplary embodiments of the present disclosure thus provide a variable impedance control system, a control method, an electronic device, and a storage medium based on inverse reinforcement learning. The system includes a variable impedance controller, an impedance gain controller, a variable impedance control strategy module, and an inverse reinforcement learning algorithm module. By introducing a variable impedance gain action space, the method improves the transferability of the reward function across task settings and achieves a generalized representation of variable impedance skills; it enables hierarchical impedance control of the mechanical arm, accomplishes relatively complex physical interaction, and guarantees the motion precision of the mechanical arm in a dynamic environment, thereby improving the safety of mechanical arm control.
Next, a variable impedance control system based on inverse reinforcement learning in the present exemplary embodiment will be further described.
In the present example embodiment, the variable impedance strategy and the reward function are recovered from demonstrations by an inverse-reinforcement-learning-based method; new variable impedance strategies are generated for different task settings by a reinforcement learning algorithm that maximizes the reward function, and different action spaces of the reward function are explored to realize a generalized representation of the variable impedance skill. The method mainly comprises the following three parts:
in the embodiment of the present example, the Cartesian space impedance control design section
Consider the dynamical model of a robot in Cartesian space:

$$M(x)\ddot{x} + C(x, \dot{x})\dot{x} + G(x) = F + F_{ext} \quad (1)$$

where $M(x)$ is the mass inertia matrix, $C(x, \dot{x})$ is the Coriolis force matrix, $G(x)$ is the gravity vector, $\ddot{x}$, $\dot{x}$, and $x$ are respectively the Cartesian acceleration, velocity, and position of the end effector, $J$ is the Jacobian matrix, and $\tau$ and $F_{ext}$ are respectively the joint-space motor torque input and the external force. Under the impedance control law, the robot behaves as a mass-spring-damper system following the kinetic equation:

$$M_d\ddot{e} + B_d\dot{e} + K_d e = -F_{ext} \quad (2)$$

with tracking error $e = x_d - x$, where $M_d$, $B_d$, $K_d$ are the desired mass, damping, and stiffness matrices. By solving (1), (2) and setting $M_d = M(x)$, the impedance control law can be written as:

$$\tau = J^T F$$

The impedance control law can be further divided into two parts: a feedforward term $F_{ff}$ to cancel the nonlinear robot dynamics, and a feedback term $F_{fb}$ to track the desired trajectory:

$$F = F_{ff} + F_{fb}, \qquad F_{ff} = M(x)\ddot{x}_d + C(x, \dot{x})\dot{x} + G(x), \qquad F_{fb} = K_d e + B_d\dot{e} \quad (3)$$

where $e$ and $\dot{e}$ are the tracking error and tracking error velocity. The stiffness matrix $K_d$ and damping matrix $B_d$ are also called impedance gain matrices, because they map the tracking error and velocity to the feedback force $F_{fb}$.
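As a sketch of this feedforward/feedback split, the torque computation can be written in a few lines of NumPy. The toy two-axis matrices in the check below are illustrative only, not a real robot model.

```python
import numpy as np

def impedance_control_torque(J, M, C, G, xdd_des, x_dot, Kd, Bd, e, e_dot):
    """tau = J^T (F_ff + F_fb) for the Cartesian impedance law.

    F_ff cancels the nonlinear dynamics along the desired motion;
    F_fb maps tracking error and its velocity through the gains.
    """
    F_ff = M @ xdd_des + C @ x_dot + G   # feedforward term
    F_fb = Kd @ e + Bd @ e_dot           # feedback term
    return J.T @ (F_ff + F_fb)
```

With identity inertia, zero Coriolis and gravity terms, and a pure position error, the torque reduces to the stiffness term $K_d e$, which is a convenient sanity check.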
In the present exemplary embodiment, the controller design of the adversarial inverse reinforcement learning variable impedance skills part is depicted in FIG. 1. In this method, the observations of the robot and the environment are the tracking error $e$ and the tracking error velocity $\dot{e}$. The adopted strategy accepts the observation and outputs either the impedance gains $K$ and $B$ or the feedback force $F_{fb}$, depending on the action space design. The impedance gain controller then calculates the control input and controls the robot using equation (3); the expert strategy and reward function are learned using adversarial inverse reinforcement learning, with the training process detailed in the algorithm.
In the present invention, an inverse reinforcement learning algorithm is used to learn the expert strategy and reward function. In this adversarial training setting, the discriminator that separates generator trajectories from expert trajectories is defined as:

$$D_\theta(o, a) = \frac{\exp\left(r_\theta(o, a)\right)}{\exp\left(r_\theta(o, a)\right) + \pi(a \mid o)}$$

where $r_\theta(o, a)$ is the reward function to be learned and $\pi(a \mid o)$ is the probability of taking action $a$ when the observation is $o$ under the current strategy. The discriminator is updated to minimize the loss:

$$L(\theta) = -\,\mathbb{E}_{\tau_E}\left[\log D_\theta(o, a)\right] - \mathbb{E}_{\pi}\left[\log\left(1 - D_\theta(o, a)\right)\right]$$

The generator is the variable impedance strategy. During training, the strategy is updated to maximize the trajectory reward, as evaluated by the reward function; the strategy update is performed by trust region policy optimization (TRPO), a policy-gradient-based reinforcement learning method. Because the environment dynamics are unknown, new strategies are re-optimized in different task settings by applying reinforcement learning, so as to test the performance of the learned reward function. In the reinforcement learning training process, the strategy update is the same as in the inverse reinforcement learning method, but with a fixed learned reward function. The detailed training process of the algorithm is shown in FIG. 3.
In the present example embodiment, the method application part is as follows. When the method is put into use, expert data is first collected by a human expert on the real robot, and the learned strategy is then transferred to the real robot for performance evaluation.
1. Task setting. The real-world experimental setup consists of a host computer, a target computer, an F/T sensor, and the robot. A Cartesian variable impedance control algorithm written on the host PC controls the real robot system, which is connected to the target PC through Simulink Real-Time. Model parameters of the real robot, such as the mass inertia matrix M(x), the Coriolis force matrix C(x, ẋ), and the gravity vector G(x), are obtained by the Euler-Lagrange method.
2. Collecting human expert data. During data collection, a human expert applies forces and torques on the end effector to make the mechanical arm end complete the desired trajectory. The 6-dimensional Cartesian-space forces and torques are measured by the F/T sensor, and the control inputs are then calculated using equation (3). The tracking state (e, ė) and the force applied by the human expert are recorded as human expert data, and the expert gains are estimated during data processing.
3. Gain estimation using a sliding-window method. To recover the expert gain strategy, a short sliding window is used to estimate the stiffness and damping from the recorded forces. Each time window contains ten state-force pairs, and the expert gains are estimated by solving equation (5) with least squares. The strategy and reward function are then learned in a simulated environment using adversarial inverse reinforcement learning with the real-world human expert data.
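The sliding-window estimate can be sketched per axis as an ordinary least-squares fit of $F_{fb} = K e + B \dot{e}$ over each window of ten state-force pairs; the scalar, per-axis treatment is a simplifying assumption of this sketch.

```python
import numpy as np

def estimate_gains(errors, error_rates, forces, window=10):
    """Sliding-window least-squares estimate of stiffness K and damping B.

    Solves F_fb = K*e + B*e_dot over each window of ten state-force
    pairs; scalar, per-axis sketch of the expert gain estimation.
    """
    gains = []
    for i in range(len(forces) - window + 1):
        A = np.column_stack([errors[i:i + window],
                             error_rates[i:i + window]])
        b = np.asarray(forces[i:i + window])
        (K, B), *_ = np.linalg.lstsq(A, b, rcond=None)
        gains.append((K, B))
    return gains
```

On noise-free synthetic data generated with known gains, every window should recover those gains exactly, which makes the routine easy to validate before feeding it recorded expert data.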
The variable impedance control system based on the inverse reinforcement learning comprises a variable impedance controller, an impedance gain controller, a variable impedance control strategy module and an inverse reinforcement learning algorithm module, wherein:
the variable impedance controller is used for generating a mechanical arm tail end expected position increment for correcting the track according to the first feedback force and the expected track based on the target rigidity and the damping coefficient generated and sent by the variable impedance control strategy module.
In the exemplary embodiment, the variable impedance controller in the system is based on a second order impedance model
The mechanical arm tip desired position increment for correcting the trajectory is generated as:
wherein M_d(t), B_d(t) and K_d(t) respectively represent the time-varying target inertia matrix, target damping matrix and target stiffness matrix in the impedance model; ẍ, ẋ and x are respectively the actual acceleration, velocity and position of the robot end in Cartesian space; ẍ_d, ẋ_d and x_d are respectively the desired acceleration, velocity and position of the robot end; F_d and F are respectively the expected and actual contact forces between the robot end and the environment; and E(n) is the contact force error.
In the present exemplary embodiment, to achieve the desired dynamic behavior of the tip, a second order impedance model is used:
wherein M_d(t), B_d(t) and K_d(t) respectively represent the time-varying target inertia matrix, target damping matrix and target stiffness matrix in the impedance model; ẍ, ẋ and x are respectively the actual acceleration, velocity and position of the robot end in Cartesian space; ẍ_d, ẋ_d and x_d are respectively the desired acceleration, velocity and position of the robot end; and F_d and F are respectively the expected and actual contact forces between the robot end and the environment.
To obtain the corrected desired position increment, the second-order impedance model is Laplace-transformed and discretized using the bilinear transformation s = 2T⁻¹(z − 1)(z + 1)⁻¹, giving:
w_1 = 4M_d(t) + 2B_d(t)T + K_d(t)T² (3)
w_2 = −8M_d(t) + 2K_d(t)T² (4)
w_3 = 4M_d(t) − 2B_d(t)T + K_d(t)T² (5)
where T is the control period. The difference equation of the impedance controller, i.e., the desired position increment of the end, is:
to simplify the calculation, the target inertia matrix is set to a constant M d (t) = I, so the variable impedance controller requires a time-varying target stiffness K d (t) damping coefficient B d (t) adjusting the desired position with the contact force error E (n).
The impedance gain controller is used for generating a second feedback force for controlling the movement of the mechanical arm according to the expected position increment of the mechanical arm tail end.
In an embodiment of the present example, the impedance gain controller in the system is based on a dynamical model of the robot in cartesian space:
and the kinetic equation:
the feed forward term to generate the impedance control law is:
the second feedback force is:
wherein M(x) is the mass inertia matrix, C(x, ẋ) is the Coriolis force matrix, G(x) is the gravity vector; ẍ, ẋ and x are the Cartesian acceleration, velocity and position of the end effector; J is the Jacobian matrix; τ and F_ext are respectively the joint-space motor torque input and the external force; M_d, B_d and K_d are the desired mass, damping and stiffness matrices; and e and ė are the tracking position error and tracking velocity error.
In the exemplary embodiment, the mass inertia matrix M(x), the Coriolis force matrix and the gravity vector are model parameters automatically calculated using a MuJoCo simulation model.
Constructing a dynamic model of the robot in a Cartesian space:
wherein M(x) is the mass inertia matrix, C(x, ẋ) is the Coriolis force matrix, G(x) is the gravity vector; ẍ, ẋ and x are the Cartesian acceleration, velocity and position of the end effector; J is the Jacobian matrix; and τ and F_ext are respectively the joint-space motor torque input and the external force.
Under the law of impedance control, the robot will behave as a mass-spring-damper system, which follows the kinetic equation:
wherein M_d, B_d and K_d are the desired mass, damping and stiffness matrices. By solving (1) and (2) and setting M_d = M(x), the impedance control law can be written as:
τ = Jᵀ F
the impedance control law can be further divided into two parts: a feedforward term F_ff, which cancels the nonlinear robot dynamics, and a feedback term F_fb, which tracks the desired trajectory:
wherein e and ė are the tracking position error and tracking velocity error. The stiffness matrix K_d and damping matrix B_d are also referred to as impedance gain matrices, because they map the tracking position error and tracking velocity error to the feedback force F_fb. To simplify the notation, K (stiffness) and B (damping) are used to denote K_d and B_d in the rest of the text. Fig. 2 depicts the controller design.
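The split of the impedance control law into a feedforward and a feedback term can be sketched as follows; the already-evaluated Coriolis and gravity terms c_vec and g_vec, and all names and shapes, are illustrative assumptions:

```python
import numpy as np

def impedance_control(M, c_vec, g_vec, J, x_dd_des, e, e_dot, K, B):
    """Impedance control law split into a feedforward term F_ff, which
    cancels the nonlinear robot dynamics, and a feedback term
    F_fb = K e + B e_dot, which tracks the desired trajectory; the joint
    torque is tau = J^T (F_ff + F_fb)."""
    F_ff = M @ x_dd_des + c_vec + g_vec   # dynamics compensation
    F_fb = K @ e + B @ e_dot              # impedance gains map errors to force
    tau = J.T @ (F_ff + F_fb)
    return tau, F_ff, F_fb

# Static example: a pure position error maps through the stiffness gain.
tau, F_ff, F_fb = impedance_control(
    M=np.eye(2), c_vec=np.zeros(2), g_vec=np.zeros(2), J=np.eye(2),
    x_dd_des=np.zeros(2), e=np.array([0.1, 0.0]), e_dot=np.zeros(2),
    K=100 * np.eye(2), B=10 * np.eye(2))
```

With zero desired acceleration and no gravity or Coriolis load, the torque reduces to the feedback force alone, which makes the role of the impedance gain matrices explicit.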
The variable impedance control strategy module is used for calculating target rigidity and a damping coefficient according to the tail end position of the mechanical arm and the second feedback force based on a preset variable impedance control strategy, and sending the target rigidity and the damping coefficient to the variable impedance controller.
In this exemplary embodiment, the variable impedance control strategy module in the system tracks errors based on cartesian spatial position:
the variable impedance control strategy generated according to the distance of the mechanical arm close to the target position is as follows:
wherein e_1 and e_2 are two gain change points of 0.4 m and 0.2 m respectively.
In the embodiment of the present example, (1) observation space: the tracking error e and tracking velocity ė together serve as the observation space for the task, in which the end effector is positioned over the cup. In addition, because a single pair of tracking error e and tracking velocity ė provides no acceleration information and therefore cannot fully represent the system dynamics, a history observation is used, comprising the values of e and ė from the previous five time steps.
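Such a history observation can be sketched as a fixed-length buffer of (e, ė) pairs; the dimensions and zero initialization are assumptions:

```python
import numpy as np
from collections import deque

class HistoryObservation:
    """Builds the observation by concatenating the (e, e_dot) pairs of the
    previous five time steps, since a single pair carries no acceleration
    information."""
    def __init__(self, dim=6, steps=5):
        # Pre-fill with zeros so the observation has a fixed size from step 0.
        self.buf = deque([np.zeros(2 * dim)] * steps, maxlen=steps)

    def update(self, e, e_dot):
        self.buf.append(np.concatenate([e, e_dot]))
        return np.concatenate(self.buf)

# With dim=2 the observation stacks five (e, e_dot) pairs of length 4.
h = HistoryObservation(dim=2, steps=5)
obs = h.update(np.array([1.0, 2.0]), np.array([3.0, 4.0]))
```

The deque's maxlen keeps only the most recent five pairs, so the observation dimension stays constant throughout an episode.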
(2) An action space: for the impedance gain action space, the strategy outputs the impedance gain, and the control input is obtained by equation (11).
To reduce the dimension of the gain action space, the stiffness matrix K and the damping matrix B are assumed to be diagonal. Furthermore, by forcing the diagonal elements to be positive, the stiffness matrix K and the damping matrix B are guaranteed to be positive definite. To extend the method to the full-matrix case, a Cholesky decomposition can be utilized to ensure K, B > 0. For the cup-placing task, the tracking velocity is large and the damping term can affect performance. Thus, the output of the policy is now [K_1, K_2, K_3, K_4, K_5, K_6, d], containing an additional damping factor d; the stiffness and damping matrices can then be obtained by:
a 1-dimensional damping factor is used instead of another 6-dimensional damping to reduce the dimensions.
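The mapping from the 7-dimensional policy output to the gain matrices is shown only as an image in the original; the sketch below assumes the damping-ratio-style mapping B = d·√K, which is a common choice but not confirmed by the source:

```python
import numpy as np

def gains_from_action(action):
    """Builds diagonal, positive-definite stiffness K and damping B from
    the 7-dimensional policy output [K_1..K_6, d]. The mapping
    B = d * sqrt(K) is an assumption."""
    k = np.abs(np.asarray(action[:6], dtype=float))  # force positive diagonal
    d = abs(float(action[6]))                        # 1-D damping factor
    K = np.diag(k)
    B = d * np.sqrt(K)   # elementwise sqrt is exact for a diagonal matrix
    return K, B

K, B = gains_from_action([100.0] * 6 + [2.0])
```

Forcing the diagonal entries positive guarantees positive definiteness for the diagonal case; the single damping factor d keeps the action space at 7 dimensions instead of 12.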
(3) Variable impedance control strategy:
cartesian spatial position tracking error:
the variable impedance control strategy comprises three phases:
e_1 and e_2 are two gain change points of 0.4 m and 0.2 m. The expert control law selects the maximum gain during the acceleration phase, and generally switches to a smaller gain during the switching phase. In the arrival phase, the robotic arm approaches the board at a minimum speed to ensure safety.
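The three-phase expert gain schedule can be sketched as a piecewise function of the tracking-error norm; the specific gain values are illustrative assumptions, only the change points 0.4 m and 0.2 m come from the text:

```python
import numpy as np

def expert_gain_schedule(e, k_max=1000.0, k_mid=500.0, k_min=100.0,
                         e1=0.4, e2=0.2):
    """Three-phase expert gain schedule keyed on the Cartesian tracking
    error norm: maximum gain in the acceleration phase (||e|| > e1), a
    smaller gain in the switching phase (e2 < ||e|| <= e1), and the
    minimum gain in the arrival phase (||e|| <= e2)."""
    err = float(np.linalg.norm(e))
    if err > e1:
        return k_max      # acceleration phase
    if err > e2:
        return k_mid      # switching phase
    return k_min          # arrival phase

gain = expert_gain_schedule(np.array([0.5, 0.0, 0.0]))
```

The low arrival-phase gain makes the arm compliant near the target, which is what provides the safety margin described above.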
The inverse reinforcement learning algorithm module is used for distinguishing a motion track and an expert track by using a discriminator and calculating a loss function based on an expert strategy and a reward function, updating the discriminator through a minimized loss function and updating the variable impedance control strategy through a maximized reward function.
In the exemplary embodiment, the inverse reinforcement learning algorithm module of the system is configured to base expert strategies and reward functions
wherein d_{i,t} is the distance between the ith mixed track point and the desired point at time t, d_{i,t+1} is the distance between the ith mixed track point and the desired point at time t+1, and γ is a proportionality coefficient;
discriminating motion trajectories from expert trajectories using a discriminator and calculating a loss function
wherein r_θ(o, a) is the reward function to be learned, and π(a|o) is the probability of taking action a when the observation is o under the current strategy π;
the discriminator is updated by minimizing a loss function and the variable impedance control strategy is updated by maximizing a reward function.
In the embodiment of the example, the scale factor in the inverse reinforcement learning algorithm module of the system has a value range of 0-1;
further, a scaling factor value in the inverse reinforcement learning algorithm module of the system is 0.95.
In the present exemplary embodiment, an inverse reinforcement learning algorithm is employed to learn the expert strategy and reward function. The input of the method is a mixture of expert trajectories and robot-generated trajectories, and the output is the impedance controller target stiffness K_d(t) and damping coefficient B_d(t).
First, a reward function is designed according to the states of the observation space and the action space. Because the expert trajectories and the robot-generated trajectories are not separated when the reward function is designed, the reward function is designed as follows:
wherein d_{i,t} is the distance between the ith mixed track point and the desired point at time t, d_{i,t+1} is the distance between the ith mixed track point and the desired point at time t+1, and γ is a proportionality coefficient with a value between 0 and 1, generally taken as 0.95.
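The reward formula itself is rendered as an image in the original; the sketch below assumes a distance-progress form r_t = d_t − γ·d_{t+1}, which is consistent with the quantities defined above but not confirmed by the source:

```python
import numpy as np

def trajectory_reward(distances, gamma=0.95):
    """Per-step rewards from the distances d_t between each mixed track
    point and the desired point, using the assumed progress form
    r_t = d_t - gamma * d_{t+1}, positive whenever the point moves
    toward the goal."""
    d = np.asarray(distances, dtype=float)
    r = d[:-1] - gamma * d[1:]
    return r, float(r.sum())

# A trajectory that closes in on the desired point earns positive reward.
r, total = trajectory_reward([1.0, 0.5, 0.0], gamma=0.95)
```

A progress-shaped reward of this kind needs no expert/generated labels, matching the statement that the two trajectory types are not separated at reward-design time.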
Then, the discriminator is used to distinguish the robot-generated trajectory from the expert trajectory. The distinguishing process is as follows: the whole trajectory is divided into 50 track points, and the discriminator takes as input the reward values obtained by evaluating the track points with the reward function together with the state-action transition probabilities, giving the following formula:
wherein r_θ(o, a) is the reward function to be learned, and π(a|o) is the probability of taking action a when the observation is o under the current strategy π. The discriminator then uses a many-to-one LSTM model, taking the per-time-step elements as input and producing a scalar as output, as follows:
where F represents the fused features of all track points in the trajectory (i.e., F = [f_0, f_1, ..., f_49]), f_i is the fusion feature vector of the ith track point, W is the weight matrix of the LSTM model, and h is the scalar output of the LSTM model.
Binary classification (expert or generation) of scalar outputs using a unit dense layer with sigmoid activation function:
O_d = D_bc(h; W_bc) (20)
wherein D_bc is a unit dense layer with a sigmoid function for binary classification, W_bc is its weight matrix, and O_d is the final output of the discriminator for the expert trajectory and the robot-generated trajectory.
Updating the discriminator by minimizing the loss:
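The many-to-one LSTM discriminator and its loss-minimizing update can be sketched as follows in PyTorch; the feature dimension, hidden size, batch size and optimizer settings are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TrajectoryDiscriminator(nn.Module):
    """Many-to-one LSTM over the 50 fused track-point features f_0..f_49,
    followed by a single dense unit with sigmoid (D_bc) that classifies a
    trajectory as expert or generated."""
    def __init__(self, feat_dim=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.dense = nn.Linear(hidden, 1)      # unit dense layer D_bc

    def forward(self, F):                      # F: (batch, 50, feat_dim)
        _, (h, _) = self.lstm(F)               # final hidden state of the LSTM
        return torch.sigmoid(self.dense(h[-1]))   # O_d in (0, 1)

# One discriminator update: minimize binary cross-entropy on a batch of
# expert trajectories (label 1) and generated trajectories (label 0).
disc = TrajectoryDiscriminator()
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
expert_F = torch.randn(4, 50, 8)               # placeholder fused features
gen_F = torch.randn(4, 50, 8)
out = torch.cat([disc(expert_F), disc(gen_F)])
labels = torch.cat([torch.ones(4, 1), torch.zeros(4, 1)])
loss = nn.functional.binary_cross_entropy(out, labels)
opt.zero_grad(); loss.backward(); opt.step()
```

Only the last hidden state feeds the dense layer, which is what makes the model many-to-one: the whole 50-point trajectory is compressed into one scalar probability.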
during training, the strategy is updated to maximize the trajectory reward, evaluated by the reward function. The strategy update is the same as in the inverse reinforcement learning method but with a fixed, learned reward function; it is carried out using the policy-gradient-based Trust Region Policy Optimization (TRPO) algorithm to obtain the impedance controller target stiffness K_d(t) and damping coefficient B_d(t) that maximize r(o, a).
The TRPO algorithm is used as follows. First, several functions are defined: the action-value function Q_π(s_t, a_t), the state-value function V_π(s_t), and the advantage function A_π(s, a):
A_π(s, a) = Q_π(s, a) − V_π(s) (24)
wherein s_t, a_t and s_{t+1} denote the state (position and velocity) and action of the robot at time t, and the state at time t+1. The action-value function evaluates the quality of a state-action pair, the state-value function evaluates the quality of a state, and the advantage function evaluates a relative quantity, namely how good an action is relative to the other actions available in the same state. The strategy π to be learned is often represented by a neural network, whose input is the state and whose output is the action. Assume the parameter of the neural network is θ; then
The goal now becomes finding a θ such that the expected return η(π_θ) corresponding to the strategy π_θ is maximized. The expectation on the right-hand side of equation (25) is taken by sampling according to π_θ(a_t|s_t), a process occurring in the real world that cannot easily be computed, so a surrogate function is needed.
To rewrite the formula, first, define
ρ_π(s) = P(s_0 = s) + γP(s_1 = s) + γ²P(s_2 = s) + … (27)
ρ_π(s) depends on π and represents the γ-discounted frequency with which each state is visited. Note that the following equation holds:
A_π(s_t, a_t) involves only a single pair (s, a); taking the expectation counts the occurrence probability of each s_t, a_t separately, which yields ρ_π(s), and then
Regarding π in the above formula as the old strategy and the candidate as the new strategy, one only needs to find a new strategy for which the surrogate objective increases; such a new strategy is then guaranteed to improve the accumulated reward η. When no candidate satisfies this condition, the original strategy is optimal. Through the above procedure, the impedance controller target stiffness K_d(t) and damping coefficient B_d(t) that maximize the reward are obtained, thereby achieving real-time adjustment of the robot trajectory.
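The advantage A_π(s, a) = Q_π(s, a) − V_π(s) used by TRPO can be estimated from a rollout as follows; this is a minimal Monte-Carlo sketch (practical TRPO implementations typically use generalized advantage estimation instead):

```python
import numpy as np

def advantages(rewards, values, gamma=0.95):
    """Monte-Carlo advantage estimate A(s_t, a_t) = Q(s_t, a_t) - V(s_t):
    the discounted return from step t stands in for the action-value Q,
    and a supplied baseline stands in for the state-value V."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running                 # Q-estimate at step t
    return returns - np.asarray(values, dtype=float)

# With a zero baseline the advantage equals the discounted return.
adv = advantages([1.0, 1.0], [0.0, 0.0], gamma=0.5)
```

A positive advantage means the taken action was better than the baseline's estimate for that state, which is exactly the quantity the surrogate objective weights by the policy ratio.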
It should be noted that although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order or that all of the depicted steps must be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Further, in the present exemplary embodiment, there is also provided a variable impedance control method based on inverse reinforcement learning. Referring to fig. 4, the variable impedance control method based on inverse reinforcement learning includes:
s110, initializing target rigidity and a damping coefficient as mechanical arm control parameters, acquiring the tail end position and a first feedback force of the mechanical arm, and generating a mechanical arm tail end expected position increment for correcting a track according to the tail end position, the first feedback force and an expected track of the mechanical arm by a variable impedance controller based on the target rigidity and the damping coefficient;
and S120, generating a second feedback force for controlling the motion of the mechanical arm by the impedance gain controller according to the expected position increment of the tail end of the mechanical arm, and finishing mechanical arm control based on the second feedback force.
S130, the reverse reinforcement learning algorithm module distinguishes the motion track and the expert track and calculates the loss function by using a discriminator based on the expert strategy and the reward function in the reverse reinforcement learning algorithm, updates the discriminator by the minimized loss function and updates the variable impedance control strategy by the maximized reward function;
and S140, calculating a target rigidity and a damping coefficient by the variable impedance control strategy module based on the variable impedance control strategy sent by the inverse reinforcement learning algorithm module according to the tail end position of the mechanical arm and the second feedback force, and sending the target rigidity and the damping coefficient to the variable impedance controller.
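One pass of the closed loop formed by steps S110 to S140 can be sketched as follows; every callable and dictionary key here is a hypothetical stand-in for the corresponding module:

```python
def control_step(state, policy, variable_impedance_controller,
                 impedance_gain_controller):
    """One pass of the S110-S140 loop: the variable impedance control
    strategy module (policy) maps the arm tip position and second feedback
    force to the target stiffness and damping; the variable impedance
    controller turns the first feedback force and desired trajectory into
    a position increment; the impedance gain controller turns that
    increment into the second feedback force that drives the arm."""
    K_d, B_d = policy(state["x_tip"], state["F_fb2"])
    dx = variable_impedance_controller(K_d, B_d, state["F_fb1"],
                                       state["x_desired"])
    state["F_fb2"] = impedance_gain_controller(dx)
    return state["F_fb2"]

# Scalar stand-ins make the data flow between the modules visible.
result = control_step(
    {"x_tip": 0.0, "F_fb2": 0.0, "F_fb1": 3.0, "x_desired": 4.0},
    policy=lambda x, f: (1.0, 2.0),
    variable_impedance_controller=lambda K, B, F1, xd: K + B + F1 + xd,
    impedance_gain_controller=lambda dx: 2.0 * dx)
```

Writing the second feedback force back into the state closes the loop: it is both the control output of one pass and a policy input of the next.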
In the embodiment of the present example, the inverse reinforcement learning algorithm in the control method further includes:
collecting the force and torque applied to the mechanical arm end effector by a specialist in the specialist track to enable the mechanical arm end to complete the expected track, and designing a reward function r (o, a);
initializing a first impedance gain strategy by using random weight;
collecting a first track under the first impedance gain strategy;
a second impedance gain strategy is obtained by using an inverse reinforcement learning algorithm based on the first track;
collecting a second trace according to the second impedance gain strategy;
and distinguishing the second track and the expert track based on the discriminator, calculating a loss function, updating the discriminator by minimizing the loss function, repeating the inverse reinforcement learning algorithm, and judging and generating an optimal variable impedance control strategy based on a reward function.
In the present exemplary embodiment, as shown in fig. 3, the inverse reinforcement learning algorithm of the present invention mainly includes the following steps:
1) Gather the forces and torques applied by a human expert on the end effector to make the end of the robotic arm complete a desired trajectory, or trajectories collected by a designed variable impedance controller performing the task, and design a reward function r(o, a);
2) Initializing an impedance gain strategy pi by using a random weight;
3) Collect trajectories τ_i under the strategy π;
4) An optimal impedance gain strategy pi (theta) is obtained by using an inverse reinforcement learning algorithm;
5) Set the policy π* ← π(θ), and apply the policy to the system to collect a new trajectory;
6) Repeat steps 3) to 5) until a satisfactory control strategy is learned.
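Steps 2) to 6) above form an outer loop that can be sketched as follows; all callables are hypothetical stand-ins for the modules described:

```python
def train_variable_impedance(expert_trajs, init_policy, collect,
                             irl_update, satisfied, max_iters=100):
    """Outer loop of the inverse reinforcement learning procedure:
    starting from a randomly initialized gain strategy, alternately
    collect trajectories under the current strategy and update it with
    the IRL step against the expert trajectories, stopping once the
    strategy is satisfactory."""
    policy = init_policy
    for _ in range(max_iters):
        trajs = collect(policy)                           # step 3): rollouts
        policy = irl_update(policy, trajs, expert_trajs)  # step 4): IRL step
        if satisfied(policy):                             # step 6): stop test
            break
    return policy

# Numeric stand-ins: each "IRL update" improves the policy by one unit,
# and the loop stops once the stopping criterion is met.
learned = train_variable_impedance(
    expert_trajs=[], init_policy=0,
    collect=lambda p: [],
    irl_update=lambda p, t, e: p + 1,
    satisfied=lambda p: p >= 3)
```

The explicit iteration cap plays the role of a safety bound in case the satisfaction criterion is never reached.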
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module" or "system."
An electronic device 500 according to such an embodiment of the invention is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 may include, but are not limited to: the at least one processing unit 510, the at least one memory unit 520, a bus 530 connecting various system components (including the memory unit 520 and the processing unit 510), and a display unit 540.
Wherein the storage unit stores program code that is executable by the processing unit 510 to cause the processing unit 510 to perform steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of the present specification. For example, the processing unit 510 may perform steps S110 to S140 as shown in fig. 1.
The memory unit 520 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) 5201 and/or a cache memory unit 5202, and may further include a read only memory unit (ROM) 5203.
The electronic device 500 may also communicate with one or more external devices 570 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 500 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 550. Also, the electronic device 500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 560. As shown, the network adapter 560 communicates with the other modules of the electronic device 500 over a bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when said program product is run on the terminal device.
Referring to fig. 6, a program product 600 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily appreciated that the processes illustrated in the above figures are not intended to indicate or limit the temporal order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.
Claims (12)
1. A variable impedance control system based on inverse reinforcement learning, the system comprising a variable impedance controller, an impedance gain controller, wherein:
the variable impedance controller is used for generating a mechanical arm tail end expected position increment for correcting a track according to the first feedback force and an expected track based on the acquired target rigidity and damping coefficient;
the impedance gain controller is used for generating a second feedback force for controlling the movement of the mechanical arm according to the expected position increment of the tail end of the mechanical arm, and the mechanical arm control is completed based on the second feedback force.
2. The system of claim 1, wherein the variable impedance control system further comprises an inverse reinforcement learning algorithm module and a variable impedance control strategy module, wherein:
the reverse reinforcement learning algorithm module is used for distinguishing a motion track and an expert track by using a discriminator and calculating a loss function based on an expert strategy and a reward function, updating the discriminator through a minimized loss function and updating a variable impedance control strategy through a maximized reward function;
the variable impedance control strategy module is used for calculating target rigidity and damping coefficient according to the tail end position of the mechanical arm and the second feedback force based on the existing variable impedance control strategy, and sending the target rigidity and damping coefficient to the variable impedance controller.
3. The system of claim 2, wherein the variable impedance controller is based on a second order impedance model
The robot arm tip desired position increment for the revised trajectory is generated as:
wherein M_d(t), B_d(t) and K_d(t) respectively represent the time-varying target inertia matrix, target damping matrix and target stiffness matrix in the impedance model; ẍ, ẋ and x are respectively the actual acceleration, velocity and position of the robot end in Cartesian space; ẍ_d, ẋ_d and x_d are respectively the desired acceleration, velocity and position of the robot end; F_d and F are respectively the expected and actual contact forces between the robot end and the environment; E(n) is the contact force error; T is the control period; and w_1, w_2, w_3 are intermediate variables;
w_1 = 4M_d(t) + 2B_d(t)T + K_d(t)T²
w_2 = −8M_d(t) + 2K_d(t)T²
w_3 = 4M_d(t) − 2B_d(t)T + K_d(t)T².
4. the system of claim 2, wherein the impedance gain controller is based on a model of the dynamics of the robot in cartesian space:
and a kinetic equation:
the feed forward term to generate the impedance control law is:
the second feedback force is:
wherein M(x) is the mass inertia matrix, C(x, ẋ) is the Coriolis force matrix, G(x) is the gravity vector; ẍ, ẋ and x are respectively the Cartesian acceleration, velocity and position of the end effector; J is the Jacobian matrix; τ and F_ext are respectively the joint-space motor torque input and the external force; M_d, B_d and K_d are the desired mass, damping and stiffness matrices; and e and ė are the tracking position error and tracking velocity error.
5. The system of claim 2, wherein the variable impedance control strategy module tracks errors based on cartesian spatial locations:
the variable impedance control strategy generated according to the distance of the mechanical arm close to the target position is as follows:
wherein e_1 and e_2 are two gain change points of 0.4 m and 0.2 m, respectively.
6. The system of claim 2, wherein the inverse reinforcement learning algorithm module is to base expert strategies and reward functions on
wherein d_{i,t} is the distance between the ith mixed track point and the desired point at time t, d_{i,t+1} is the distance between the ith mixed track point and the desired point at time t+1, and γ is a proportionality coefficient;
discriminating motion trajectories from expert trajectories using a discriminator and calculating a loss function
wherein r_θ(o, a) is the reward function to be learned, and π(a|o) is the probability of taking action a when the observation is o under the current strategy π;
the discriminator is updated by minimizing a penalty function and the variable impedance control strategy is updated by maximizing a reward function.
7. The system of claim 6, wherein the scale factor in the inverse reinforcement learning algorithm module ranges from 0 to 1.
8. A variable impedance control method based on inverse reinforcement learning, the method comprising:
initializing a target stiffness and a damping coefficient as the mechanical arm control parameters, acquiring the end position of the mechanical arm and a first feedback force, and generating, by a variable impedance controller based on the target stiffness and damping coefficient, a desired end-position increment that corrects the trajectory according to the end position, the first feedback force and the desired trajectory of the mechanical arm;
and generating, by an impedance gain controller, a second feedback force for controlling the movement of the mechanical arm according to the desired end-position increment of the mechanical arm, and completing the mechanical arm control based on the second feedback force.
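The first step of claim 8, in which the variable impedance controller turns the feedback force into a desired end-position increment, resembles an admittance update. The sketch below is a minimal numpy version; the gain values, the control-law form and the time step are illustrative assumptions, not taken from the patent.

```python
import numpy as np

# Illustrative target stiffness and damping (assumed values)
K_d = np.diag([100.0, 100.0, 100.0])  # target stiffness matrix [N/m]
B_d = np.diag([10.0, 10.0, 10.0])     # target damping matrix [N*s/m]

def position_increment(f_ext, x_dot, dt=0.001):
    """Admittance-style sketch: the measured feedback force, filtered
    through the target stiffness and damping, yields the small desired
    end-position increment that corrects the commanded trajectory."""
    return np.linalg.solve(K_d, f_ext - B_d @ x_dot) * dt

dx = position_increment(np.array([1.0, 0.0, 0.0]), np.zeros(3))
print(dx)  # small correction along the direction of the applied force
```

A softer K_d yields a larger increment for the same contact force, i.e. a more compliant arm, which is why the learned stiffness and damping are the tuned quantities.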
9. The control method of claim 8, wherein the method further comprises:
the inverse reinforcement learning algorithm module, based on an expert strategy and a reward function in the inverse reinforcement learning algorithm, uses a discriminator to distinguish the motion trajectory from the expert trajectory and calculates a loss function; the discriminator is updated by minimizing the loss function, and the variable impedance control strategy is updated by maximizing the reward function;
and the variable impedance control strategy module calculates the target stiffness and damping coefficient from the end position of the mechanical arm and the second feedback force, based on the variable impedance control strategy sent by the inverse reinforcement learning algorithm module, and sends the target stiffness and damping coefficient to the variable impedance controller.
10. The control method of claim 9, wherein the inverse reinforcement learning algorithm in the inverse reinforcement learning algorithm module comprises:
collecting, in the expert trajectory, the force and torque exerted by an expert on the mechanical arm end effector to make the end of the mechanical arm complete the desired trajectory, and designing a reward function r(o, a);
initializing a first impedance gain strategy with random weights;
collecting a first trajectory under the first impedance gain strategy;
exploring a second impedance gain strategy by using the inverse reinforcement learning algorithm based on the first trajectory;
collecting a second trajectory according to the second impedance gain strategy;
and distinguishing the second trajectory from the expert trajectory with the discriminator, calculating the loss function, updating the discriminator by minimizing the loss function, repeating the inverse reinforcement learning algorithm, and generating the optimal variable impedance control strategy as judged by the reward function.
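The loop structure of claim 10 (roll out under the current gain strategy, compare with the expert demonstration, improve the strategy, repeat) can be exercised on a deliberately tiny toy problem. Everything below is invented for illustration: a scalar stiffness stands in for the impedance-gain policy, and a simple trajectory-gap signal replaces the trained discriminator.

```python
import numpy as np

# Toy 1-D stand-in for the loop in claim 10. The "environment" shrinks
# the distance to the target at a rate set by a scalar stiffness gain;
# an expert gain of 2.0 produces the demonstration trajectory.
def rollout(stiffness, d0=1.0, steps=20, dt=0.05):
    d, traj = d0, []
    for _ in range(steps):
        d = max(d - stiffness * d * dt, 0.0)
        traj.append(d)
    return np.array(traj)

expert_traj = rollout(stiffness=2.0)   # expert demonstration

stiffness = 0.5                        # initial impedance gain (arbitrary)
for _ in range(50):
    traj = rollout(stiffness)                 # collect trajectory under policy
    gap = float(np.mean(traj - expert_traj))  # surrogate critic signal
    stiffness += 0.5 * gap                    # improve policy: close the gap

print(round(stiffness, 3))  # approaches the expert gain of 2.0
```

In the patent's scheme the gap signal would instead come from the learned reward of an adversarial discriminator, but the alternation of rollout, comparison with the expert trajectory, and policy improvement is the same.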
11. An electronic device, comprising:
a processor; and
a memory having computer-readable instructions stored thereon that, when executed by the processor, implement the method of any one of claims 8-10.
12. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the method according to any one of claims 8-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211161566.3A CN115421387B (en) | 2022-09-22 | 2022-09-22 | Variable impedance control system and control method based on inverse reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115421387A true CN115421387A (en) | 2022-12-02 |
CN115421387B CN115421387B (en) | 2023-04-14 |
Family
ID=84203645
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211161566.3A Active CN115421387B (en) | 2022-09-22 | 2022-09-22 | Variable impedance control system and control method based on inverse reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115421387B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108153153A (en) * | 2017-12-19 | 2018-06-12 | 哈尔滨工程大学 | A kind of study impedance control system and control method |
WO2020118730A1 (en) * | 2018-12-14 | 2020-06-18 | 中国科学院深圳先进技术研究院 | Compliance control method and apparatus for robot, device, and storage medium |
US20210122037A1 (en) * | 2019-10-25 | 2021-04-29 | Robert Bosch Gmbh | Method for controlling a robot and robot controller |
CN114378820A (en) * | 2022-01-18 | 2022-04-22 | 中山大学 | Robot impedance learning method based on safety reinforcement learning |
CN114800489A (en) * | 2022-03-22 | 2022-07-29 | 华南理工大学 | Mechanical arm compliance control method based on combination of definite learning and composite learning, storage medium and robot |
CN114851193A (en) * | 2022-04-26 | 2022-08-05 | 北京航空航天大学 | Intelligent flexible control method for contact process of space manipulator and unknown environment |
CN115256401A (en) * | 2022-08-29 | 2022-11-01 | 南京理工大学 | Space manipulator shaft hole assembly variable impedance control method based on reinforcement learning |
Non-Patent Citations (2)
Title |
---|
Zhang Gang; Bu Ting; Jiao Wentan; Wang Bo: "Dynamic tracking variable impedance control of flexible robots" *
Li Chao: "Learning variable impedance control based on reinforcement learning" *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116643501A (en) * | 2023-07-18 | 2023-08-25 | 湖南大学 | Variable impedance control method and system for aerial working robot under stability constraint |
CN116643501B (en) * | 2023-07-18 | 2023-10-24 | 湖南大学 | Variable impedance control method and system for aerial working robot under stability constraint |
Also Published As
Publication number | Publication date |
---|---|
CN115421387B (en) | 2023-04-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114502335B (en) | Method and system for trajectory optimization for non-linear robotic systems with geometric constraints | |
Peters et al. | Reinforcement learning by reward-weighted regression for operational space control | |
EP3788549B1 (en) | Stacked convolutional long short-term memory for model-free reinforcement learning | |
Argall et al. | Learning robot motion control with demonstration and advice-operators | |
JP7301034B2 (en) | System and Method for Policy Optimization Using Quasi-Newton Trust Region Method | |
Qi et al. | Stable indirect adaptive control based on discrete-time T–S fuzzy model | |
CN114761966A (en) | System and method for robust optimization for trajectory-centric model-based reinforcement learning | |
Nguyen et al. | Adaptive chattering free neural network based sliding mode control for trajectory tracking of redundant parallel manipulators | |
Dong et al. | Learning and recognition of hybrid manipulation motions in variable environments using probabilistic flow tubes | |
Li et al. | Kinematic control of redundant robot arms using neural networks | |
Khansari-Zadeh et al. | Learning to play minigolf: A dynamical system-based approach | |
CN115351780A (en) | Method for controlling a robotic device | |
CN115421387B (en) | Variable impedance control system and control method based on inverse reinforcement learning | |
Zhang et al. | Model‐Free Attitude Control of Spacecraft Based on PID‐Guide TD3 Algorithm | |
Vinogradska et al. | Numerical quadrature for probabilistic policy search | |
Jiang et al. | Bioinspired control design using cerebellar model articulation controller network for omnidirectional mobile robots | |
Veselic et al. | Human-robot interaction with robust prediction of movement intention surpasses manual control | |
Lin et al. | Objective learning from human demonstrations | |
Nohooji et al. | Actor–critic learning based PID control for robotic manipulators | |
Feng et al. | Adaptive neural network tracking control of an omnidirectional mobile robot | |
Langsfeld | Learning task models for robotic manipulation of nonrigid objects | |
US20220410380A1 (en) | Learning robotic skills with imitation and reinforcement at scale | |
Yin et al. | Learning cost function and trajectory for robotic writing motion | |
Gams et al. | Manipulation learning on humanoid robots | |
Afzali et al. | A Modified Convergence DDPG Algorithm for Robotic Manipulation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||