CN115421387A - Variable impedance control system and control method based on inverse reinforcement learning - Google Patents

Variable impedance control system and control method based on inverse reinforcement learning

Info

Publication number
CN115421387A
Authority
CN
China
Prior art keywords
variable impedance
mechanical arm
track
reinforcement learning
strategy
Prior art date
Legal status
Granted
Application number
CN202211161566.3A
Other languages
Chinese (zh)
Other versions
CN115421387B (en)
Inventor
边桂彬
李桢
钱琛
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202211161566.3A priority Critical patent/CN115421387B/en
Publication of CN115421387A publication Critical patent/CN115421387A/en
Application granted granted Critical
Publication of CN115421387B publication Critical patent/CN115421387B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 - Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The present disclosure relates to a variable impedance control system, a control method, an electronic device, and a storage medium based on inverse reinforcement learning. The system includes a variable impedance controller, an impedance gain controller, a variable impedance control strategy module, and an inverse reinforcement learning algorithm module. By introducing a variable impedance gain action space, the method improves the transferability of the reward function across task settings and realizes a generalized representation of variable impedance skills; it enables hierarchical impedance control of the mechanical arm, completes relatively complex physical interaction, and guarantees the motion precision of the mechanical arm in a dynamic environment, thereby improving the safety of mechanical arm control.

Description

Variable impedance control system and control method based on inverse reinforcement learning
Technical Field
The present disclosure relates to the field of mechanical arms and automatic control, and in particular, to a variable impedance control system, a control method, an electronic device, and a storage medium based on inverse reinforcement learning.
Background
Robotic systems are increasingly used in various unstructured environments such as hospitals, factories, and homes, where the robot must perform complex operational tasks, adjust its impedance according to different task phases and environmental constraints, and interact with an unknown environment in a safe and stable manner. Impedance control, which establishes mass-spring-damper contact dynamics, has been widely used in these robotic systems to ensure safe physical interaction. In addition, many complex operating tasks require the robot to change its impedance according to the task phase, and flexibility and robustness have become important criteria for developing surgical robot controllers for physical interaction. However, conventional impedance control schemes do not account for the actual surgical scenario, including the complex physical interactions on the robotic arm, which results in a loss of precision; in practice, accomplishing such tasks requires variable impedance skills.
The existing learning-based method for obtaining variable impedance skills mainly includes the following categories:
the first type is a teaching-learning-based approach (learning from demonstration). A human expert controls the robot through a haptic interface and a hand-held impedance control interface, which is based on a linear spring-reset potentiometer that maps button position to robot arm stiffness. This arrangement allows the human expert to adjust the compliance of the robot according to the given task requirements; the demonstrated motion and stiffness trajectories are encoded using dynamic motion primitives and learned using locally weighted regression. If the demonstrated trajectory has high variance, the impedance should be low; if it has low variance, the impedance should be high. Such a strategy provides a good solution for many manipulation tasks, with the advantage that no separate demonstration of the impedance is required. However, in some interaction tasks, such as sliding in a groove, low trajectory variability does not necessarily correspond to high impedance.
The second type is based on deep reinforcement learning with a variable impedance action space. When reinforcement learning is used to control robot motion, an important challenge is the parameterization of the policy. Parameters with relevant nonlinear features are usually extracted from a set of motion demonstrations, following the teaching-learning paradigm, using Gaussian mixture regression; the final parameterization takes the form of a nonlinear time-invariant dynamical system, which is used as the parameterized policy of a variant of the PI2 policy search algorithm, so that the time-invariant motion is ultimately represented through PI2 policy search. However, this approach has certain drawbacks. First, it is rather idealized, assuming that there is no noise in the system other than detection noise, which means that disturbances encountered while sampling trajectories have a negative impact on learning and cannot be exploited to improve the policy. Second, it was originally designed to learn a trajectory from a particular initial state, and using it to learn trajectories from multiple initial states increases the number of rollouts required. Moreover, while many inverse reinforcement learning algorithms employ entropy regularization to prevent simple imitation of the expert strategy, most previous work has not focused on the impact of the action-space choice on prior knowledge.
While many methods based on deep reinforcement learning and teaching learning have been proposed to obtain variable impedance skills for contact-rich operating tasks, these skills are typically task-specific and may be sensitive to changes in the task setting; task-specific impedance skills obtained by teaching-learning methods may fail when the task changes. Furthermore, designing suitable reward functions for reinforcement learning is challenging, so the transferability of these skills is limited.
Accordingly, there is a need for one or more methods to address the above-mentioned problems.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the present disclosure is to provide a variable impedance control system, a control method, an electronic device, and a storage medium based on inverse reinforcement learning, thereby overcoming, at least to some extent, one or more of the problems due to the limitations and disadvantages of the related art.
According to an aspect of the present disclosure, there is provided an inverse reinforcement learning-based variable impedance control system, the system including a variable impedance controller, an impedance gain controller, wherein:
the variable impedance controller is used for generating a mechanical arm tail end expected position increment for correcting a track according to the first feedback force and an expected track based on the acquired target rigidity and damping coefficient;
the impedance gain controller is used for generating a second feedback force for controlling the movement of the mechanical arm according to the expected position increment of the tail end of the mechanical arm, and the mechanical arm control is completed based on the second feedback force.
Preferably, the variable impedance control system further comprises an inverse reinforcement learning algorithm module and a variable impedance control strategy module, wherein:
the inverse reinforcement learning algorithm module is used for distinguishing the motion track from the expert track by using a discriminator and calculating a loss function based on the expert strategy and the reward function, updating the discriminator by minimizing the loss function, and updating the variable impedance control strategy by maximizing the reward function;
the variable impedance control strategy module is used for calculating target rigidity and damping coefficient according to the tail end position of the mechanical arm and the second feedback force based on the existing variable impedance control strategy, and sending the target rigidity and damping coefficient to the variable impedance controller.
Preferably, the variable impedance controller is based on a second-order impedance model

M_d(t)(ẍ_d - ẍ) + B_d(t)(ẋ_d - ẋ) + K_d(t)(x_d - x) = F_d - F = E(t)

and generates the mechanical arm tail end expected position increment for correcting the track as:

Δx(n) = [T^2(E(n) + 2E(n-1) + E(n-2)) - w_2·Δx(n-1) - w_3·Δx(n-2)] / w_1

wherein M_d(t), B_d(t), K_d(t) respectively represent the time-varying target inertia matrix, target damping matrix and target stiffness matrix in the impedance model; ẍ, ẋ, x are respectively the actual acceleration, velocity and position of the robot end in Cartesian space; ẍ_d, ẋ_d, x_d are respectively the desired acceleration, velocity and position of the robot end; F_d and F are respectively the expected contact force and the actual contact force between the robot end and the environment; E(n) is the contact force error; T is the control period; and w_1, w_2, w_3 are intermediate variables:

w_1 = 4M_d(t) + 2B_d(t)T + K_d(t)T^2

w_2 = -8M_d(t) + 2K_d(t)T^2

w_3 = 4M_d(t) - 2B_d(t)T + K_d(t)T^2
preferably, the impedance gain controller is based on the dynamical model of the robot in Cartesian space:

M(x)ẍ + C(x, ẋ)ẋ + G(x) = F + F_ext,   τ = Jᵀ F

and the kinetic equation:

M_d ë + B_d ė + K_d e = -F_ext

The feedforward term of the impedance control law is generated as:

F_ff = M(x)ẍ_d + C(x, ẋ)ẋ + G(x)

and the second feedback force is:

F_fb = K_d e + B_d ė

wherein M(x) is the mass inertia matrix, C(x, ẋ) is the Coriolis force matrix, G(x) is the gravity vector, ẍ, ẋ and x are the Cartesian acceleration, velocity and position of the end effector, J is the Jacobian matrix, and τ, F_ext are respectively the joint-space motor torque input and the external force; M_d, B_d, K_d are the desired mass, damping and stiffness matrices; e and ė are the tracking position error and the tracking velocity error.
Preferably, the variable impedance control strategy module generates the variable impedance control strategy from the Cartesian space position tracking error, i.e., the distance of the mechanical arm end from the target position; the strategy switches the impedance gain across three stages as the mechanical arm approaches the target position, wherein e_1 and e_2 are two gain change points of 0.4 m and 0.2 m, respectively.
Preferably, the inverse reinforcement learning algorithm module is used for, based on the expert strategy and a reward function r(o, a) constructed from d_{i,t}, the distance between the i-th mixed track point and the desired point at time t, d_{i,t+1}, the distance between the i-th mixed track point and the desired point at time t+1, and a proportionality coefficient γ,

distinguishing the motion track from the expert track using the discriminator

D_θ(o, a) = exp(r_θ(o, a)) / (exp(r_θ(o, a)) + π(a|o))

and calculating the loss function

L_D = -E_{τ_E}[log D_θ(o, a)] - E_{τ_π}[log(1 - D_θ(o, a))]

wherein r_θ(o, a) is the reward function to be learned, and π(a|o) is the probability of taking action a when the observed value is o under the current strategy π;

the discriminator is updated by minimizing the loss function, and the variable impedance control strategy is updated by maximizing the reward function.

Preferably, the proportionality coefficient γ in the inverse reinforcement learning algorithm module takes values in the range 0 to 1.
In one aspect of the present disclosure, there is provided a variable impedance control method based on inverse reinforcement learning, the method including:
initializing target rigidity and a damping coefficient as mechanical arm control parameters, acquiring the tail end position and a first feedback force of the mechanical arm, and generating a mechanical arm tail end expected position increment for correcting a track by a variable impedance controller according to the tail end position, the first feedback force and an expected track of the mechanical arm on the basis of the target rigidity and the damping coefficient;
and the impedance gain controller generates a second feedback force for controlling the movement of the mechanical arm according to the expected position increment of the tail end of the mechanical arm, and completes mechanical arm control based on the second feedback force.
Preferably, the method further comprises:
the inverse reinforcement learning algorithm module, based on the expert strategy and the reward function in the inverse reinforcement learning algorithm, uses a discriminator to distinguish the motion track from the expert track and calculate a loss function, updates the discriminator by minimizing the loss function, and updates the variable impedance control strategy by maximizing the reward function;
and the variable impedance control strategy module calculates a target rigidity and a damping coefficient according to the tail end position of the mechanical arm and the second feedback force based on the variable impedance control strategy sent by the inverse reinforcement learning algorithm module, and sends the target rigidity and the damping coefficient to the variable impedance controller.
Preferably, the inverse reinforcement learning algorithm in the inverse reinforcement learning algorithm module comprises:
collecting the force and torque exerted by a specialist on the mechanical arm end effector in the specialist track to enable the mechanical arm end to complete the expected track, and designing a reward function r (o, a);
initializing a first impedance gain strategy by using random weight;
collecting a first trace under the first impedance gain strategy;
exploring to obtain a second impedance gain strategy by using an inverse reinforcement learning algorithm based on the first track;
collecting a second trace according to the second impedance gain strategy;
and distinguishing the second track and the expert track based on the discriminator, calculating a loss function, updating the discriminator through the minimized loss function, repeating the inverse reinforcement learning algorithm, and judging and generating the optimal variable impedance control strategy based on the reward function.
In one aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory having computer readable instructions stored thereon which, when executed by the processor, implement a method according to any of the above.
In an aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, realizes the method according to any one of the above.
Exemplary embodiments of the present disclosure provide an inverse reinforcement learning-based variable impedance control system, a control method, an electronic device, and a storage medium. The system includes a variable impedance controller, an impedance gain controller, a variable impedance control strategy module, and an inverse reinforcement learning algorithm module. By introducing a variable impedance gain action space, the method improves the transferability of the reward function across task settings, realizes a generalized representation of variable impedance skills, enables hierarchical impedance control of the mechanical arm, completes complex physical interaction, and guarantees the motion precision of the mechanical arm in a dynamic environment, thereby improving the safety of mechanical arm control.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 illustrates a system block diagram of an inverse reinforcement learning based variable impedance control system according to an exemplary embodiment of the present disclosure;
FIG. 2 illustrates a controller design schematic of an inverse reinforcement learning based variable impedance control system according to an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a flowchart of an inverse reinforcement learning algorithm for an inverse reinforcement learning based variable impedance control system according to an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a flow chart of a variable impedance control method based on inverse reinforcement learning according to an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a block diagram of an electronic device according to an exemplary embodiment of the present disclosure; and
fig. 6 schematically illustrates a schematic diagram of a computer-readable storage medium according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In the present exemplary embodiment, there is first provided an inverse reinforcement learning-based variable impedance control system; referring to fig. 1, the variable impedance control system based on inverse reinforcement learning includes a variable impedance controller, an impedance gain controller, wherein:
the variable impedance controller is used for generating a mechanical arm tail end expected position increment for correcting a track according to the first feedback force and an expected track based on the acquired target rigidity and damping coefficient;
the impedance gain controller is used for generating a second feedback force for controlling the movement of the mechanical arm according to the expected position increment of the tail end of the mechanical arm, and the mechanical arm control is completed based on the second feedback force.
Exemplary embodiments of the present disclosure provide an inverse reinforcement learning-based variable impedance control system, a control method, an electronic device, and a storage medium. The system includes a variable impedance controller, an impedance gain controller, a variable impedance control strategy module, and an inverse reinforcement learning algorithm module. By introducing a variable impedance gain action space, the method improves the transferability of the reward function across task settings, realizes a generalized representation of variable impedance skills, enables hierarchical impedance control of the mechanical arm, completes relatively complex physical interaction, and guarantees the motion precision of the mechanical arm in a dynamic environment, thereby improving the safety of mechanical arm control.
Next, a variable impedance control system based on inverse reinforcement learning in the present exemplary embodiment will be further described.
In the embodiment of the present example, the variable impedance strategy and the reward function are recovered from demonstrations using an inverse reinforcement learning method; a reinforcement learning algorithm then generates new variable impedance strategies for different task settings by maximizing the learned reward function, and different action spaces of the reward function are explored to realize a generalized representation of the variable impedance skill. The method mainly comprises the following three parts:
in the embodiment of the present example, the Cartesian space impedance control design section
Consider a kinetic model of a robot in cartesian space:
M(x)ẍ + C(x, ẋ)ẋ + G(x) = F + F_ext     (1)

where M(x) is the mass inertia matrix, C(x, ẋ) is the Coriolis force matrix, G(x) is the gravity vector, ẍ, ẋ and x are respectively the Cartesian acceleration, velocity and position of the end effector, J is the Jacobian matrix, and τ, F_ext respectively represent the joint-space motor torque input and the external force, with the Cartesian control force F related to the joint torque by τ = Jᵀ F. Under the impedance control law, the robot behaves as a mass-spring-damper system that follows the kinetic equation:

M_d ë + B_d ė + K_d e = -F_ext     (2)

where M_d, B_d, K_d are the required mass, damping and stiffness matrices and e = x_d - x is the tracking error. By solving (1), (2) and setting M_d = M(x), the impedance control law can be written as:

τ = Jᵀ F,   F = F_ff + F_fb     (3)

The impedance control law can be further divided into two parts: a feedforward term F_ff that cancels the nonlinear robot dynamics and a feedback term F_fb that tracks the required trajectory:

F_ff = M(x)ẍ_d + C(x, ẋ)ẋ + G(x)     (4)

F_fb = K_d e + B_d ė     (5)

where e and ė are the tracking error and the tracking velocity. The stiffness matrix K_d and the damping matrix B_d are also called impedance gain matrices, because they map tracking errors and velocities to the feedback force F_fb.
In the present exemplary embodiment, the controller design for the adversarial inverse reinforcement learning variable impedance skills part is depicted in FIG. 1. In the method, the observed values of the robot and the environment are the tracking error e and the tracking velocity ė. The adopted strategy accepts the observation and outputs either the impedance gains K, B or the feedback force F_fb, depending on the action space design. The impedance gain controller then calculates the control input and controls the robot using equation (3); expert strategies and reward functions are learned using adversarial inverse reinforcement learning, and the training process is detailed in the algorithm.
In the present invention, an inverse reinforcement learning algorithm is used to learn the expert strategy and the reward function. In this adversarial training setting, the discriminator that separates the generator trajectory from the expert trajectory is defined as:

D_θ(o, a) = exp(r_θ(o, a)) / (exp(r_θ(o, a)) + π(a|o))

where r_θ(o, a) is the reward function that needs to be learned, and π(a|o) is the probability of taking action a when the observation is o under the current strategy. The discriminator is updated to minimize the loss:

L_D = -E_{τ_E}[log D_θ(o, a)] - E_{τ_π}[log(1 - D_θ(o, a))]

The generator is the variable impedance strategy. During training, the strategy is updated to maximize the trajectory reward, as evaluated by the reward function; the update is performed with trust region policy optimization (TRPO), a policy-gradient-based reinforcement learning method. Because the environment dynamics are unknown, new strategies are re-optimized in different task settings by applying reinforcement learning, in order to test the performance of the learned reward function. In the reinforcement learning training process, the strategy update is the same as in the inverse reinforcement learning method, but with a fixed learned reward function. The detailed training procedure corresponds to the inverse reinforcement learning algorithm described below and shown in FIG. 3.
in the embodiment of the present example, the method applies the part; when the method is put into use, expert data is collected by human experts of real robots, and then learned strategies are transferred to the real robots for performance evaluation.
1. Task setup. The real-world experimental setup consists of a host computer, a target computer, an F/T sensor, and a robot. A Cartesian variable impedance control algorithm written on the host PC controls the real robot system, which is connected to the target PC through Simulink Real-Time. Model parameters of the real robot, such as the mass inertia matrix M(x), the Coriolis force matrix C(x, ẋ), and the gravity vector G(x), are obtained by the Euler-Lagrange method.
2. Human expert data collection. During data collection, a human expert applies forces and torques on the end effector to make the end of the robotic arm complete a desired trajectory. The 6-dimensional Cartesian space forces and torques are measured by the F/T sensor, and the control inputs are then calculated using equation (3). The tracking states (e, ė) and the forces applied by the human expert are recorded as the human expert data, and the gains of the human expert are estimated during data processing.
3. Gain estimation using a sliding window method. To recover the expert gain strategy, a short sliding window is used to estimate the stiffness and damping from the recorded forces. Each time window contains ten state-force pairs, and the expert gains are estimated by solving equation (5) with least squares. Strategies and reward functions are then learned in a simulated environment, using adversarial inverse reinforcement learning with the real-world human expert data.
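For illustration, a minimal sketch of the sliding-window gain estimation is given below; the ten-sample window follows the description above, while the array layout, the per-axis (diagonal-gain) treatment, and the helper name are assumptions rather than the patent's implementation.

```python
import numpy as np

def estimate_gains(e, e_dot, f_fb, window=10):
    """Estimate per-axis stiffness K and damping B from recorded expert data.

    e, e_dot : (N, 6) arrays of tracking position / velocity errors
    f_fb     : (N, 6) array of feedback forces applied by the expert
    Solves F_fb = K*e + B*e_dot (equation (5)) in the least-squares sense
    over a sliding window of ten state-force pairs.
    """
    n, dim = e.shape
    K_hat = np.zeros((n - window + 1, dim))
    B_hat = np.zeros((n - window + 1, dim))
    for start in range(n - window + 1):
        sl = slice(start, start + window)
        for j in range(dim):
            # Regressor columns [e_j, e_dot_j]; unknowns [K_j, B_j].
            A = np.stack([e[sl, j], e_dot[sl, j]], axis=1)   # (window, 2)
            b = f_fb[sl, j]                                   # (window,)
            kb, *_ = np.linalg.lstsq(A, b, rcond=None)
            K_hat[start, j], B_hat[start, j] = kb
    return K_hat, B_hat
```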
The variable impedance control system based on the inverse reinforcement learning comprises a variable impedance controller, an impedance gain controller, a variable impedance control strategy module and an inverse reinforcement learning algorithm module, wherein:
the variable impedance controller is used for generating a mechanical arm tail end expected position increment for correcting the track according to the first feedback force and the expected track based on the target rigidity and the damping coefficient generated and sent by the variable impedance control strategy module.
In the exemplary embodiment, the variable impedance controller in the system is based on the second-order impedance model

M_d(t)(ẍ_d - ẍ) + B_d(t)(ẋ_d - ẋ) + K_d(t)(x_d - x) = F_d - F = E(t)

and generates the mechanical arm tip desired position increment for correcting the trajectory as:

Δx(n) = [T^2(E(n) + 2E(n-1) + E(n-2)) - w_2·Δx(n-1) - w_3·Δx(n-2)] / w_1

wherein M_d(t), B_d(t), K_d(t) respectively represent the time-varying target inertia matrix, target damping matrix and target stiffness matrix in the impedance model; ẍ, ẋ, x are respectively the actual acceleration, velocity and position of the robot end in Cartesian space; ẍ_d, ẋ_d, x_d are respectively the desired acceleration, velocity and position of the robot end; F_d and F are respectively the expected and actual contact forces between the robot end and the environment; and E(n) is the contact force error.
In the present exemplary embodiment, to achieve the desired dynamic behavior of the tip, a second-order impedance model is used:

M_d(t)(ẍ_d - ẍ) + B_d(t)(ẋ_d - ẋ) + K_d(t)(x_d - x) = F_d - F = E(t)     (1)

where M_d(t), B_d(t), K_d(t) respectively represent the time-varying target inertia matrix, target damping matrix and target stiffness matrix in the impedance model; ẍ, ẋ, x are respectively the actual acceleration, velocity and position of the robot end in Cartesian space; ẍ_d, ẋ_d, x_d are respectively the desired acceleration, velocity and position of the robot end; and F_d and F are respectively the expected and actual contact forces between the robot end and the environment.
To obtain the corrected desired position increment, the second-order impedance model is Laplace-transformed and discretized using the bilinear transformation s = 2T^-1(z - 1)(z + 1)^-1, which gives:

(w_1·z^2 + w_2·z + w_3)·ΔX(z) = T^2(z^2 + 2z + 1)·E(z)     (2)

w_1 = 4M_d(t) + 2B_d(t)T + K_d(t)T^2     (3)

w_2 = -8M_d(t) + 2K_d(t)T^2     (4)

w_3 = 4M_d(t) - 2B_d(t)T + K_d(t)T^2     (5)

where T is the control period. The difference equation of the impedance controller, i.e., the expected position increment of the terminal, is:

Δx(n) = [T^2(E(n) + 2E(n-1) + E(n-2)) - w_2·Δx(n-1) - w_3·Δx(n-2)] / w_1     (6)
to simplify the calculation, the target inertia matrix is set to a constant M d (t) = I, so the variable impedance controller requires a time-varying target stiffness K d (t) damping coefficient B d (t) adjusting the desired position with the contact force error E (n).
The impedance gain controller is used for generating a second feedback force for controlling the movement of the mechanical arm according to the expected position increment of the mechanical arm tail end.
In an embodiment of the present example, the impedance gain controller in the system is based on the dynamical model of the robot in Cartesian space:

M(x)ẍ + C(x, ẋ)ẋ + G(x) = F + F_ext,   τ = Jᵀ F

and the kinetic equation:

M_d ë + B_d ė + K_d e = -F_ext

The feedforward term of the impedance control law is:

F_ff = M(x)ẍ_d + C(x, ẋ)ẋ + G(x)

and the second feedback force is:

F_fb = K_d e + B_d ė

where M(x) is the mass inertia matrix, C(x, ẋ) is the Coriolis force matrix, G(x) is the gravity vector, ẍ, ẋ and x are the Cartesian acceleration, velocity and position of the end effector, J is the Jacobian matrix, and τ, F_ext are respectively the joint-space motor torque input and the external force; M_d, B_d, K_d are the desired mass, damping and stiffness matrices; e and ė are the tracking position error and the tracking velocity error.
In the exemplary embodiment, the model parameters, namely the mass inertia matrix M(x), the Coriolis force matrix C(x, ẋ) and the gravity vector G(x), are automatically calculated using a MuJoCo simulation model.
A dynamical model of the robot in Cartesian space is constructed:

M(x)ẍ + C(x, ẋ)ẋ + G(x) = F + F_ext     (1)

where M(x) is the mass inertia matrix, C(x, ẋ) is the Coriolis force matrix, G(x) is the gravity vector, ẍ, ẋ and x are respectively the Cartesian acceleration, velocity and position of the end effector, J is the Jacobian matrix, and τ, F_ext are respectively the joint-space motor torque input and the external force, with the Cartesian control force F related to the joint torque by τ = Jᵀ F.

Under the impedance control law, the robot behaves as a mass-spring-damper system that follows the kinetic equation:

M_d ë + B_d ė + K_d e = -F_ext     (2)

where M_d, B_d, K_d are the desired mass, damping and stiffness matrices and e = x_d - x is the tracking position error. By solving (1), (2) and setting M_d = M(x), the impedance control law can be written as:

τ = Jᵀ F,   F = F_ff + F_fb     (3)

The impedance control law can be further divided into two parts: a feedforward term F_ff that cancels the nonlinear robot dynamics and a feedback term F_fb that tracks the required trajectory:

F_ff = M(x)ẍ_d + C(x, ẋ)ẋ + G(x)     (4)

F_fb = K_d e + B_d ė     (5)

where e and ė are the tracking position error and the tracking velocity error. The stiffness matrix K_d and the damping matrix B_d are also called impedance gain matrices, because they map the tracking position error and the tracking velocity error to the feedback force F_fb. To simplify the notation, K (stiffness) and B (damping) are used to denote K_d and B_d in the rest of the text. FIG. 2 depicts the controller design.
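For illustration, the control law of equations (3) to (5) can be sketched as follows; the model terms M(x), C(x, ẋ), G(x) and the Jacobian J are assumed to be provided by the dynamics model (e.g., the MuJoCo simulation mentioned above), and the function name and argument layout are illustrative only.

```python
import numpy as np

def impedance_control(M, C, G, J, x, dx, x_des, dx_des, ddx_des, K, B):
    """Cartesian impedance control law tau = J^T (F_ff + F_fb), equations (3)-(5).

    M, C, G : mass inertia matrix, Coriolis matrix, gravity vector at the current state
    J       : end-effector Jacobian
    x, dx   : actual Cartesian position and velocity of the end effector
    x_des, dx_des, ddx_des : desired Cartesian position, velocity and acceleration
    K, B    : stiffness and damping (impedance gain) matrices from the policy
    """
    e = x_des - x                    # tracking position error
    de = dx_des - dx                 # tracking velocity error
    F_ff = M @ ddx_des + C @ dx + G  # feedforward term: cancels the nonlinear dynamics
    F_fb = K @ e + B @ de            # feedback term: impedance gains map errors to force
    return J.T @ (F_ff + F_fb)       # joint-space motor torque tau
```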
The variable impedance control strategy module is used for calculating target rigidity and a damping coefficient according to the tail end position of the mechanical arm and the second feedback force based on a preset variable impedance control strategy, and sending the target rigidity and the damping coefficient to the variable impedance controller.
In this exemplary embodiment, the variable impedance control strategy module in the system generates the variable impedance control strategy from the Cartesian space position tracking error, i.e., the distance of the mechanical arm end from the target position. The strategy switches the impedance gain across three stages as the mechanical arm approaches the target position, where e_1 and e_2 are two gain change points of 0.4 m and 0.2 m, respectively.
In the embodiment of the present example, (1) observation space: the tracking error e and the tracking velocity ė together serve as the observation space for the task, in which the end effector is located on the cup. In addition, since a single pair of tracking error e and tracking velocity ė provides no acceleration information and therefore cannot fully represent the system dynamics, a history observation is used, consisting of the values of e and ė from the preceding five time steps.

(2) Action space: for the impedance gain action space, the strategy outputs the impedance gains, and the control input is obtained by equation (11). To reduce the dimension of the gain action space, the stiffness matrix K and the damping matrix B are assumed to be diagonal. Furthermore, by forcing the diagonal elements to be positive, it is ensured that K and B are positive definite. To extend the method to the full-matrix case, a Cholesky decomposition can be utilized to ensure K, B > 0. For the cup task, the tracking velocity is large and the damping term can affect performance. Thus, the output of the policy is now [K_1, K_2, K_3, K_4, K_5, K_6, d], containing an additional damping factor d; the stiffness matrix is then K = diag(K_1, …, K_6), and the damping matrix is derived from K and the scalar damping factor d. The 1-dimensional damping factor is used instead of another 6-dimensional damping vector to reduce the dimension.
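A sketch of how the gain matrices could be built from the 7-dimensional policy output is given below. The diagonal form of K follows the text; the original formula that maps the scalar damping factor d to the damping matrix is not reproduced here, so the critical-damping-style rule B = d·sqrt(K) used in the sketch is an assumption for illustration only.

```python
import numpy as np

def gains_from_action(action):
    """Map the policy output [K_1..K_6, d] to diagonal gain matrices.

    The stiffness matrix is diagonal with positive entries; the damping
    matrix is derived from K and the scalar damping factor d.  The rule
    B = d * sqrt(K) is an assumed mapping, not the formula of the original
    disclosure.
    """
    k_diag = np.abs(action[:6])          # force the diagonal stiffness entries to be positive
    d = abs(action[6])                   # scalar damping factor
    K = np.diag(k_diag)
    B = d * np.diag(np.sqrt(k_diag))     # assumed mapping from d to the damping matrix
    return K, B
```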
(3) Variable impedance control strategy: based on the Cartesian space position tracking error e, the variable impedance control strategy comprises three phases separated by the two gain change points e_1 = 0.4 m and e_2 = 0.2 m. In the acceleration phase (e > e_1), the expert control law selects the maximum gain to accelerate; in the switching phase (e_2 < e ≤ e_1), it generally switches to a smaller gain; in the arrival phase (e ≤ e_2), the robotic arm approaches the board at minimum speed to ensure safety.
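The three-stage expert gain schedule can be written as a simple threshold rule, sketched below; the gain values K_acc, K_switch and K_arrive are placeholders, since only the two change points (0.4 m and 0.2 m) and the ordering of the gains are stated above.

```python
import numpy as np

def expert_gain_schedule(x, x_d, K_acc, K_switch, K_arrive, e1=0.4, e2=0.2):
    """Select the stiffness gain from the distance to the target position.

    K_acc > K_switch > K_arrive are placeholder gain matrices for the
    acceleration, switching and arrival phases, respectively.
    """
    e = np.linalg.norm(x_d - x)       # Cartesian position tracking error (distance to target)
    if e > e1:                        # acceleration phase: maximum gain
        return K_acc
    elif e > e2:                      # switching phase: smaller gain
        return K_switch
    else:                             # arrival phase: minimum gain, approach slowly
        return K_arrive
```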
The inverse reinforcement learning algorithm module is used for distinguishing the motion track from the expert track by using a discriminator and calculating a loss function based on the expert strategy and the reward function, updating the discriminator by minimizing the loss function, and updating the variable impedance control strategy by maximizing the reward function.
In the exemplary embodiment, the inverse reinforcement learning algorithm module of the system is configured to, based on the expert strategy and a reward function r(o, a) constructed from d_{i,t}, the distance between the i-th mixed track point and the desired point at time t, d_{i,t+1}, the distance between the i-th mixed track point and the desired point at time t+1, and a proportionality coefficient γ,

distinguish the motion track from the expert track using the discriminator

D_θ(o, a) = exp(r_θ(o, a)) / (exp(r_θ(o, a)) + π(a|o))

and calculate the loss function

L_D = -E_{τ_E}[log D_θ(o, a)] - E_{τ_π}[log(1 - D_θ(o, a))]

where r_θ(o, a) is the reward function to be learned, and π(a|o) is the probability of taking action a when the observation is o under the current strategy π;

the discriminator is updated by minimizing the loss function, and the variable impedance control strategy is updated by maximizing the reward function.
In the embodiment of the example, the proportionality coefficient γ in the inverse reinforcement learning algorithm module of the system takes values in the range 0 to 1;
further, the proportionality coefficient γ in the inverse reinforcement learning algorithm module of the system is taken as 0.95.
In the present exemplary embodiment, an inverse reinforcement learning algorithm is employed to learn the expert strategy and the reward function. The input of the method is the mixed trajectory consisting of the expert trajectory and the robot-generated trajectory, and the output is the target stiffness K_d(t) and damping coefficient B_d(t) of the impedance controller.
First, a reward function is designed according to the states of the observation space and the action space. Because the expert trajectory and the robot-generated trajectory are not yet separated when the reward function is designed, the reward function r(o, a) is constructed from d_{i,t}, the distance between the i-th mixed trajectory point and the desired point at time t, d_{i,t+1}, the distance between the i-th mixed trajectory point and the desired point at time t+1, and a proportionality coefficient γ, where γ lies between 0 and 1 and is generally taken as 0.95.
Then, the robot-generated trajectory is distinguished from the expert trajectory using the discriminator. The discrimination process is as follows: the whole trajectory is divided into 50 trajectory points, and the discriminator takes as input the reward values computed for these trajectory points by the reward function, together with the state-action transition probabilities, giving:

D_θ(o, a) = exp(r_θ(o, a)) / (exp(r_θ(o, a)) + π(a|o))     (18)

where r_θ(o, a) is the reward function that needs to be learned, and π(a|o) is the probability of taking action a when the observation is o under the current strategy π. The discriminator then uses a many-to-one LSTM model, with the per-time-step elements as input and a scalar as output:

h = LSTM(F; W_lstm)     (19)

where F represents the fused features of all trajectory points in the trajectory (i.e., F = [f_0, f_1, …, f_49]), f_i is the fused feature vector of the i-th trajectory point, W_lstm is the weight matrix of the LSTM model, and h is the output scalar of the LSTM model.

The scalar output is binary-classified (expert or generated) using a unit dense layer with a sigmoid activation function:

O_d = D_bc(h; W_bc)     (20)

where D_bc is the unit dense layer with sigmoid function for binary classification, W_bc is its weight matrix, and O_d is the final output of the discriminator for the expert trajectory and the robot-generated trajectory.

The discriminator is updated by minimizing the loss:

L_D = -E_{τ_E}[log D_θ(o, a)] - E_{τ_π}[log(1 - D_θ(o, a))]     (21)
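A minimal PyTorch sketch of the LSTM trajectory discriminator of equations (19) to (21) is given below for illustration; it uses the plain binary-classification head described here rather than the structured form of equation (18), and the feature dimensionality, hidden size, and the way rewards and transition probabilities are fused into f_i are assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryDiscriminator(nn.Module):
    """Many-to-one LSTM discriminator over 50 fused trajectory-point features."""

    def __init__(self, feat_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)        # unit dense layer

    def forward(self, F_seq):
        # F_seq: (batch, 50, feat_dim) fused features [f_0, ..., f_49]
        _, (h_n, _) = self.lstm(F_seq)
        h = h_n[-1]                                  # trajectory summary from the last time step
        return torch.sigmoid(self.head(h))           # O_d in (0, 1): expert vs. generated

def discriminator_loss(disc, expert_feats, policy_feats):
    """Binary cross-entropy loss of equation (21), minimized w.r.t. the discriminator."""
    eps = 1e-8
    d_expert = disc(expert_feats)
    d_policy = disc(policy_feats)
    return -(torch.log(d_expert + eps).mean()
             + torch.log(1.0 - d_policy + eps).mean())
```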
during training, the strategy is updated to maximize the track reward, evaluated by a reward function, the strategy update is the same as the inverse reinforcement learning method but with a fixed learning reward function, the strategy update is carried out by adopting a reinforcement learning method confidence domain strategy optimization algorithm (TRPO) based on strategy gradient to obtain the target stiffness K of the impedance controller which maximizes r (o, a) d (t) and damping coefficient B d (t)。
The TRPO algorithm is used as follows. First, several functions are defined: the action value function Q_π(s_t, a_t), the state value function V_π(s_t), and the advantage function A_π(s, a):

Q_π(s_t, a_t) = E_{s_{t+1}, a_{t+1}, …}[Σ_{l=0}^∞ γ^l r(s_{t+l})]     (22)

V_π(s_t) = E_{a_t, s_{t+1}, …}[Σ_{l=0}^∞ γ^l r(s_{t+l})]     (23)

A_π(s, a) = Q_π(s, a) - V_π(s)     (24)
Wherein s is t ,a t ,s t+1 The state (position and velocity) and the motion of the robot at time t, and the state at time t +1 are shown. The action value function evaluates the quality of a state action pair, the state value function evaluates the quality of a state, and the dominance function evaluates a relative concept, namely the quality of the action relative to other actions in the same state. The strategy to learn pi often represents a neural network, with inputs being states and outputs being actions. Let us assume that the parameter of the neural network is θ, then
Figure RE-GDA0003905660750000191
Now the goal is translated into finding a theta, which corresponds to the strategy pi θ Corresponding eta (pi) θ ) Maximum expectation of right in (25) formula
Figure RE-GDA0003905660750000192
Is according to pi θ (a t |s t ) Sampling is a process occurring in the real world and cannot be easily calculated, and a substitute function is needed
Figure RE-GDA0003905660750000193
To rewrite the formula, first, define
p π (s)=P(s 0 =s)+γP(s 1 =s)+γ 2 p(s 2 =s)+… (27)
ρ π (s) is related to pi, which represents the frequency that each state may be visited, with a gamma discount. Note that the following equation holds true:
Figure RE-GDA0003905660750000194
A π (s t ,a t ) Only with a single s, a, but is expected
Figure RE-GDA0003905660750000195
Separately count each s t ,a t The probability of occurrence is obtained
Figure RE-GDA0003905660750000196
Then is provided with
Figure RE-GDA0003905660750000197
Regarding pi of the above formula as an old strategy, let
Figure RE-GDA0003905660750000198
As new policies are considered, only one new policy needs to be found so that
Figure RE-GDA0003905660750000199
Then this new strategy will certainly allow the accumulated reward η to be boosted. When all of
Figure RE-GDA00039056607500001910
This condition is not satisfied, which means that the original strategy is optimal. Thus, by the above-described procedure, the impedance controller target rigidity K is obtained such that the reward obtains the maximum value d (t) and damping coefficient B d And (t), thereby realizing the aim of adjusting the track of the robot in real time.
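For illustration, the policy-improvement check can be estimated from sampled trajectories as sketched below; the use of importance sampling to estimate the expected advantage of the new strategy is a standard implementation choice, not a detail taken from the original text.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.95):
    """Monte-Carlo returns used as Q-estimates (cf. equation (22))."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

def surrogate_advantage(returns, values, logp_new, logp_old):
    """Importance-sampled expected advantage of the new strategy over the old one.

    A positive value indicates the candidate strategy improves the accumulated
    reward eta; TRPO additionally constrains the KL divergence between the two
    strategies (not shown here).
    """
    advantages = returns - values                 # A_pi(s, a) estimates (cf. equation (24))
    ratios = np.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
    return float(np.mean(ratios * advantages))
```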
It should be noted that although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order or that all of the depicted steps must be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Further, in the present exemplary embodiment, there is also provided a variable impedance control method based on inverse reinforcement learning. Referring to fig. 4, the variable impedance control method based on inverse reinforcement learning includes:
s110, initializing target rigidity and a damping coefficient as mechanical arm control parameters, acquiring the tail end position and a first feedback force of the mechanical arm, and generating a mechanical arm tail end expected position increment for correcting a track according to the tail end position, the first feedback force and an expected track of the mechanical arm by a variable impedance controller based on the target rigidity and the damping coefficient;
and S120, generating a second feedback force for controlling the motion of the mechanical arm by the impedance gain controller according to the expected position increment of the tail end of the mechanical arm, and finishing mechanical arm control based on the second feedback force.
S130, the inverse reinforcement learning algorithm module, based on the expert strategy and the reward function in the inverse reinforcement learning algorithm, uses a discriminator to distinguish the motion track from the expert track and calculate the loss function, updates the discriminator by minimizing the loss function, and updates the variable impedance control strategy by maximizing the reward function;
and S140, calculating a target rigidity and a damping coefficient by the variable impedance control strategy module based on the variable impedance control strategy sent by the inverse reinforcement learning algorithm module according to the tail end position of the mechanical arm and the second feedback force, and sending the target rigidity and the damping coefficient to the variable impedance controller.
In the embodiment of the present example, the inverse reinforcement learning algorithm in the control method further includes:
collecting the force and torque applied to the mechanical arm end effector by a specialist in the specialist track to enable the mechanical arm end to complete the expected track, and designing a reward function r (o, a);
initializing a first impedance gain strategy by using random weight;
collecting a first track under the first impedance gain strategy;
a second impedance gain strategy is obtained by using an inverse reinforcement learning algorithm based on the first track;
collecting a second trace according to the second impedance gain strategy;
and distinguishing the second track and the expert track based on the discriminator, calculating a loss function, updating the discriminator by minimizing the loss function, repeating the inverse reinforcement learning algorithm, and judging and generating an optimal variable impedance control strategy based on a reward function.
In the present exemplary embodiment, as shown in fig. 3, the inverse reinforcement learning algorithm of the present invention mainly comprises the following steps:
1) Gather the forces and torques applied by a human expert on the end effector to make the end of the robotic arm complete a desired trajectory, yielding the expert trajectories (or trajectories collected by a designed variable impedance controller performing the task), and design a reward function r(o, a);
2) Initialize an impedance gain strategy π with random weights;
3) Collect trajectories τ_i under the strategy π;
4) Obtain an optimal impedance gain strategy π(θ) using the inverse reinforcement learning algorithm;
5) Set the strategy π* ← π(θ) and apply it to the system to collect new trajectories;
6) Repeat steps 3)-5) until a satisfactory control strategy is learned.
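The steps above can be condensed into a schematic training loop, sketched below; all function arguments are placeholders for the components described in steps 1) to 6), not an API of the original disclosure.

```python
def train_variable_impedance_policy(
    collect_trajectories,   # callable: policy -> list of trajectories (steps 3 and 5)
    update_discriminator,   # callable: (expert_trajs, policy_trajs) -> None, minimizes the loss
    update_policy,          # callable: (policy, policy_trajs, reward_fn) -> improved policy (TRPO)
    init_policy,            # randomly initialized impedance gain strategy (step 2)
    reward_fn,              # designed reward function r(o, a) (step 1)
    expert_trajs,           # expert demonstrations of forces and torques (step 1)
    n_iters=100,
):
    """Schematic adversarial inverse reinforcement learning loop over steps 3)-5)."""
    policy = init_policy
    for _ in range(n_iters):
        policy_trajs = collect_trajectories(policy)               # roll out the current gain strategy
        update_discriminator(expert_trajs, policy_trajs)           # distinguish expert vs. generated
        policy = update_policy(policy, policy_trajs, reward_fn)    # maximize the learned reward
    return policy
```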
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 500 according to such an embodiment of the invention is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 may include, but are not limited to: the at least one processing unit 510, the at least one memory unit 520, a bus 530 connecting various system components (including the memory unit 520 and the processing unit 510), and a display unit 540.
Wherein the storage unit stores program code that is executable by the processing unit 510 to cause the processing unit 510 to perform steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of the present specification. For example, the processing unit 510 may perform steps S110 to S140 as shown in fig. 1.
The memory unit 520 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) 5201 and/or a cache memory unit 5202, and may further include a read only memory unit (ROM) 5203.
Storage unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 530 may be a local bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices 570 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 500 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 550. Also, the electronic device 500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 560. As shown, the network adapter 560 communicates with the other modules of the electronic device 500 over a bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when said program product is run on the terminal device.
Referring to fig. 6, a program product 600 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily appreciated that the processes illustrated in the above figures are not intended to indicate or limit the temporal order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (12)

1. A variable impedance control system based on inverse reinforcement learning, the system comprising a variable impedance controller and an impedance gain controller, wherein:
the variable impedance controller is used for generating, based on the acquired target stiffness and damping coefficient, a desired position increment of the mechanical arm end for correcting the trajectory according to a first feedback force and a desired trajectory;
the impedance gain controller is used for generating a second feedback force for controlling the movement of the mechanical arm according to the desired position increment of the mechanical arm end, and mechanical arm control is completed based on the second feedback force.
2. The system of claim 1, wherein the variable impedance control system further comprises an inverse reinforcement learning algorithm module and a variable impedance control strategy module, wherein:
the inverse reinforcement learning algorithm module is used for distinguishing the motion trajectory from the expert trajectory using a discriminator and calculating a loss function based on an expert strategy and a reward function, updating the discriminator by minimizing the loss function, and updating the variable impedance control strategy by maximizing the reward function;
the variable impedance control strategy module is used for calculating the target stiffness and damping coefficient according to the mechanical arm end position and the second feedback force based on the current variable impedance control strategy, and sending the target stiffness and damping coefficient to the variable impedance controller.
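Purely as an illustration of the data flow among the four modules recited in claims 1 and 2, the following Python sketch shows one control cycle; all object interfaces and method names here are hypothetical and are not part of the claimed system.

def control_cycle(strategy, impedance_ctrl, gain_ctrl, x_end, f1, x_desired, f2_prev):
    """One control period of the system in claims 1-2 (illustrative only)."""
    # Variable impedance control strategy module: target stiffness and damping
    # from the arm end position and the previous second feedback force.
    K_d, B_d = strategy.gains(x_end, f2_prev)

    # Variable impedance controller: desired end-position increment that corrects
    # the trajectory, from the first feedback force and the desired trajectory.
    dx = impedance_ctrl.position_increment(K_d, B_d, f1, x_desired, x_end)

    # Impedance gain controller: second feedback force that drives the arm.
    f2 = gain_ctrl.feedback_force(dx, x_end)
    return dx, f2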
3. The system of claim 2, wherein the variable impedance controller is based on the second-order impedance model

$M_d(t)\,(\ddot{x}-\ddot{x}_d) + B_d(t)\,(\dot{x}-\dot{x}_d) + K_d(t)\,(x-x_d) = F - F_d = E$

and generates the desired position increment of the mechanical arm end for correcting the trajectory as

$\Delta x(n) = \frac{T^2\,(E(n) + 2E(n-1) + E(n-2)) - w_2\,\Delta x(n-1) - w_3\,\Delta x(n-2)}{w_1}$

wherein $M_d(t)$, $B_d(t)$, $K_d(t)$ respectively denote the time-varying target inertia matrix, target damping matrix and target stiffness matrix of the impedance model; $\ddot{x}$, $\dot{x}$, $x$ are respectively the actual acceleration, velocity and position of the mechanical arm end in Cartesian space; $\ddot{x}_d$, $\dot{x}_d$, $x_d$ are respectively the desired acceleration, velocity and position of the mechanical arm end; $F_d$ and $F$ are respectively the desired contact force and the actual contact force between the mechanical arm end and the environment; $E(n)$ is the contact force error at the $n$-th control step; $T$ is the control period; and $w_1$, $w_2$, $w_3$ are all intermediate variables:

$w_1 = 4M_d(t) + 2B_d(t)T + K_d(t)T^2$
$w_2 = -8M_d(t) + 2K_d(t)T^2$
$w_3 = 4M_d(t) - 2B_d(t)T + K_d(t)T^2$
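A minimal Python sketch of the discretized impedance relation above, written per Cartesian axis with scalar gains (the function name and the scalar simplification are assumptions for illustration, not taken from the patent):

def position_increment(E, E1, E2, dx1, dx2, M_d, B_d, K_d, T):
    """Discretized second-order impedance filter for one Cartesian axis.

    E, E1, E2     : contact-force error at steps n, n-1, n-2
    dx1, dx2      : position increments at steps n-1, n-2
    M_d, B_d, K_d : target inertia, damping, stiffness (scalars here)
    T             : control period
    """
    w1 = 4 * M_d + 2 * B_d * T + K_d * T**2
    w2 = -8 * M_d + 2 * K_d * T**2
    w3 = 4 * M_d - 2 * B_d * T + K_d * T**2
    # Bilinear-transform discretization of the second-order impedance model,
    # matching the w1, w2, w3 intermediate variables of claim 3.
    return (T**2 * (E + 2 * E1 + E2) - w2 * dx1 - w3 * dx2) / w1

At each control period the two previous force errors and position increments are fed back in, matching the recursion over E(n-1), E(n-2), Δx(n-1) and Δx(n-2).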
4. The system of claim 2, wherein the impedance gain controller is based on the dynamics model of the robot in Cartesian space

$M(x)\,\ddot{x} + C(x,\dot{x})\,\dot{x} + G(x) = J^{-T}\tau + F_{ext}$

and the dynamic equation of the target impedance

$M_d\,\ddot{e} + B_d\,\dot{e} + K_d\,e = F_{ext}$,

the feedforward term used to generate the impedance control law is

$M(x)\,\ddot{x}_d + C(x,\dot{x})\,\dot{x} + G(x) - F_{ext}$

and the second feedback force is

$M(x)\,M_d^{-1}\,(B_d\,\dot{e} + K_d\,e - F_{ext})$,

wherein $M(x)$ is the mass inertia matrix, $C(x,\dot{x})$ is the Coriolis force matrix, $G(x)$ is the gravity vector, $\ddot{x}$, $\dot{x}$, $x$ are respectively the Cartesian acceleration, velocity and position of the end effector, $J$ is the Jacobian matrix, $\tau$ and $F_{ext}$ are respectively the motor input torque in joint space and the external force; $M_d$, $B_d$, $K_d$ are the desired mass, damping and stiffness matrices; and $e$ and $\dot{e}$ are the tracking position error and tracking velocity error.
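Because the control law of claim 4 is available in the source only as images, the sketch below implements one common Cartesian impedance control law consistent with the quantities listed in the claim (dynamics-compensating feedforward plus an impedance-shaping feedback force, mapped to joint torques through the Jacobian). It is an assumption for illustration, not a reproduction of the patented law.

import numpy as np

def impedance_control_torque(M, C, G, J, x_dot, x_dd_des, e, e_dot, F_ext, M_d, B_d, K_d):
    """Sketch of a Cartesian impedance control law.

    M, C          : Cartesian inertia and Coriolis matrices (6x6)
    G             : gravity vector (6,)
    J             : Jacobian (6 x n_joints)
    x_dot         : actual Cartesian velocity (6,)
    x_dd_des      : desired Cartesian acceleration (6,)
    e, e_dot      : tracking position and velocity errors (6,)
    F_ext         : measured external force (6,)
    M_d, B_d, K_d : desired mass, damping, stiffness matrices (6x6)
    """
    # Feedforward: dynamics compensation along the desired motion.
    F_ff = M @ x_dd_des + C @ x_dot + G - F_ext
    # Feedback ("second feedback force"): target-impedance shaping term.
    F_fb = M @ np.linalg.solve(M_d, B_d @ e_dot + K_d @ e - F_ext)
    # Map the Cartesian command force to joint torques.
    tau = J.T @ (F_ff + F_fb)
    return tau, F_fb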
5. The system of claim 2, wherein the variable impedance control strategy module is based on the Cartesian-space position tracking error

$e = x_d - x$

and the variable impedance control strategy generated according to how close the mechanical arm end is to the target position is a piecewise gain schedule that switches the impedance gains as $\lVert e \rVert$ crosses $e_1$ and $e_2$, wherein $e_1$ and $e_2$ are two gain change points of 0.4 m and 0.2 m, respectively.
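An illustrative gain schedule matching the structure of claim 5: the two gain change points 0.4 m and 0.2 m come from the claim, while the three stiffness levels (and the choice to stiffen near the target) are placeholder assumptions.

import numpy as np

E1, E2 = 0.4, 0.2                              # gain change points [m], from claim 5
K_FAR, K_MID, K_NEAR = 300.0, 600.0, 1000.0    # example stiffness levels [N/m], assumed

def scheduled_stiffness(x_desired, x_actual):
    """Piecewise stiffness as a function of distance to the target position."""
    dist = np.linalg.norm(np.asarray(x_desired) - np.asarray(x_actual))
    if dist > E1:
        return K_FAR
    elif dist > E2:
        return K_MID
    else:
        return K_NEAR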
6. The system of claim 2, wherein the inverse reinforcement learning algorithm module designs, based on the expert strategy, the reward function as the scaled reduction of the distance to the desired point,

$r_{i,t} = \gamma\,(d_{i,t} - d_{i,t+1})$,

wherein $d_{i,t}$ and $d_{i,t+1}$ are respectively the distances between the $i$-th mixed trajectory point and the desired point at the $t$-th and $(t+1)$-th moments, and $\gamma$ is a proportionality coefficient;
the module distinguishes the motion trajectory from the expert trajectory using a discriminator and calculates the loss function

$L(\theta) = -\mathbb{E}_{\tau_E}[\log D_\theta(o,a)] - \mathbb{E}_{\tau_\pi}[\log(1 - D_\theta(o,a))]$, with $D_\theta(o,a) = \frac{\exp(r_\theta(o,a))}{\exp(r_\theta(o,a)) + \pi(a\mid o)}$,

wherein $r_\theta(o,a)$ is the reward function to be learned and $\pi(a\mid o)$ is the probability of taking action $a$ when the observed value is $o$ under the current strategy $\pi$;
the discriminator is updated by minimizing the loss function and the variable impedance control strategy is updated by maximizing the reward function.
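The discriminator in claim 6 combines a learned reward r_θ(o,a) with the policy probability π(a|o); one standard realization of such a discriminator is the AIRL-style form sketched below in Python with NumPy. Whether the patent uses exactly this form, and the use of log-probabilities for numerical stability, are assumptions.

import numpy as np

def discriminator_prob(r_theta, log_pi_a):
    """D(o,a) = exp(r_theta) / (exp(r_theta) + pi(a|o)), i.e. sigmoid(r_theta - log pi)."""
    return 1.0 / (1.0 + np.exp(-(r_theta - log_pi_a)))

def discriminator_loss(r_expert, log_pi_expert, r_policy, log_pi_policy):
    """Binary cross-entropy: expert pairs labelled 1, policy pairs labelled 0."""
    d_e = discriminator_prob(r_expert, log_pi_expert)
    d_p = discriminator_prob(r_policy, log_pi_policy)
    return -np.mean(np.log(d_e + 1e-12)) - np.mean(np.log(1.0 - d_p + 1e-12))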
7. The system of claim 6, wherein the proportionality coefficient γ in the inverse reinforcement learning algorithm module ranges from 0 to 1.
8. A variable impedance control method based on inverse reinforcement learning, the method comprising:
initializing a target stiffness and a damping coefficient as mechanical arm control parameters, acquiring the end position of the mechanical arm and a first feedback force, and generating, by a variable impedance controller based on the target stiffness and damping coefficient, a desired position increment of the mechanical arm end for correcting the trajectory according to the end position of the mechanical arm, the first feedback force and a desired trajectory;
and generating, by an impedance gain controller, a second feedback force for controlling the movement of the mechanical arm according to the desired position increment of the mechanical arm end, and completing mechanical arm control based on the second feedback force.
9. The control method of claim 8, wherein the method further comprises:
the inverse reinforcement learning algorithm module, based on an expert strategy and a reward function in the inverse reinforcement learning algorithm, distinguishes the motion trajectory from the expert trajectory using a discriminator and calculates a loss function, updates the discriminator by minimizing the loss function, and updates the variable impedance control strategy by maximizing the reward function;
and the variable impedance control strategy module calculates the target stiffness and damping coefficient according to the mechanical arm end position and the second feedback force based on the variable impedance control strategy sent by the inverse reinforcement learning algorithm module, and sends the target stiffness and damping coefficient to the variable impedance controller.
10. The control method of claim 9, wherein the inverse reinforcement learning algorithm in the inverse reinforcement learning algorithm module comprises:
collecting the force and torque exerted by an expert on the mechanical arm end effector in the expert trajectory so that the mechanical arm end completes the desired trajectory, and designing a reward function r(o, a);
initializing a first impedance gain strategy with random weights;
collecting a first trajectory under the first impedance gain strategy;
exploring to obtain a second impedance gain strategy using the inverse reinforcement learning algorithm based on the first trajectory;
collecting a second trajectory according to the second impedance gain strategy;
and distinguishing the second trajectory from the expert trajectory using the discriminator, calculating the loss function, updating the discriminator by minimizing the loss function, repeating the inverse reinforcement learning algorithm, and determining and generating the optimal variable impedance control strategy based on the reward function.
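A hypothetical outline of the procedure in claim 10, with every helper (expert-demonstration collection, rollout, discriminator and policy updates) passed in as a user-supplied callable, since none of these interfaces is specified by the patent.

def train_variable_impedance_policy(env, collect_expert_demos, init_policy,
                                    init_discriminator, rollout, update_policy,
                                    update_discriminator, n_iters=100):
    """Sketch of the training loop in claim 10; all callables are supplied by the user."""
    expert_traj = collect_expert_demos(env)   # expert forces/torques on the end effector
    policy = init_policy()                    # first impedance gain strategy, random weights
    traj = rollout(env, policy)               # first trajectory under that strategy
    disc = init_discriminator()
    for _ in range(n_iters):
        policy = update_policy(policy, disc, traj)            # explore: maximize learned reward
        traj = rollout(env, policy)                           # second trajectory
        disc = update_discriminator(disc, traj, expert_traj)  # minimize discriminator loss
    return policy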
11. An electronic device, comprising:
a processor; and
a memory having computer-readable instructions stored thereon that, when executed by the processor, implement the method of any of claims 8-10.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 8-10.
CN202211161566.3A 2022-09-22 2022-09-22 Variable impedance control system and control method based on inverse reinforcement learning Active CN115421387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211161566.3A CN115421387B (en) 2022-09-22 2022-09-22 Variable impedance control system and control method based on inverse reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211161566.3A CN115421387B (en) 2022-09-22 2022-09-22 Variable impedance control system and control method based on inverse reinforcement learning

Publications (2)

Publication Number Publication Date
CN115421387A true CN115421387A (en) 2022-12-02
CN115421387B CN115421387B (en) 2023-04-14

Family

ID=84203645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211161566.3A Active CN115421387B (en) 2022-09-22 2022-09-22 Variable impedance control system and control method based on inverse reinforcement learning

Country Status (1)

Country Link
CN (1) CN115421387B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116643501A (en) * 2023-07-18 2023-08-25 湖南大学 Variable impedance control method and system for aerial working robot under stability constraint

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153153A (en) * 2017-12-19 2018-06-12 哈尔滨工程大学 A kind of study impedance control system and control method
WO2020118730A1 (en) * 2018-12-14 2020-06-18 中国科学院深圳先进技术研究院 Compliance control method and apparatus for robot, device, and storage medium
US20210122037A1 (en) * 2019-10-25 2021-04-29 Robert Bosch Gmbh Method for controlling a robot and robot controller
CN114378820A (en) * 2022-01-18 2022-04-22 中山大学 Robot impedance learning method based on safety reinforcement learning
CN114800489A (en) * 2022-03-22 2022-07-29 华南理工大学 Mechanical arm compliance control method based on combination of definite learning and composite learning, storage medium and robot
CN114851193A (en) * 2022-04-26 2022-08-05 北京航空航天大学 Intelligent flexible control method for contact process of space manipulator and unknown environment
CN115256401A (en) * 2022-08-29 2022-11-01 南京理工大学 Space manipulator shaft hole assembly variable impedance control method based on reinforcement learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153153A (en) * 2017-12-19 2018-06-12 哈尔滨工程大学 A kind of study impedance control system and control method
WO2020118730A1 (en) * 2018-12-14 2020-06-18 中国科学院深圳先进技术研究院 Compliance control method and apparatus for robot, device, and storage medium
US20210122037A1 (en) * 2019-10-25 2021-04-29 Robert Bosch Gmbh Method for controlling a robot and robot controller
CN114378820A (en) * 2022-01-18 2022-04-22 中山大学 Robot impedance learning method based on safety reinforcement learning
CN114800489A (en) * 2022-03-22 2022-07-29 华南理工大学 Mechanical arm compliance control method based on combination of definite learning and composite learning, storage medium and robot
CN114851193A (en) * 2022-04-26 2022-08-05 北京航空航天大学 Intelligent flexible control method for contact process of space manipulator and unknown environment
CN115256401A (en) * 2022-08-29 2022-11-01 南京理工大学 Space manipulator shaft hole assembly variable impedance control method based on reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张刚; 布挺; 焦文潭; 王波: "Variable impedance control for dynamics tracking of flexible robots" *
李超: "Learning variable impedance control based on reinforcement learning" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116643501A (en) * 2023-07-18 2023-08-25 湖南大学 Variable impedance control method and system for aerial working robot under stability constraint
CN116643501B (en) * 2023-07-18 2023-10-24 湖南大学 Variable impedance control method and system for aerial working robot under stability constraint

Also Published As

Publication number Publication date
CN115421387B (en) 2023-04-14

Similar Documents

Publication Publication Date Title
CN114502335B (en) Method and system for trajectory optimization for non-linear robotic systems with geometric constraints
Peters et al. Reinforcement learning by reward-weighted regression for operational space control
EP3788549B1 (en) Stacked convolutional long short-term memory for model-free reinforcement learning
Argall et al. Learning robot motion control with demonstration and advice-operators
JP7301034B2 (en) System and Method for Policy Optimization Using Quasi-Newton Trust Region Method
Qi et al. Stable indirect adaptive control based on discrete-time T–S fuzzy model
CN114761966A (en) System and method for robust optimization for trajectory-centric model-based reinforcement learning
Nguyen et al. Adaptive chattering free neural network based sliding mode control for trajectory tracking of redundant parallel manipulators
Dong et al. Learning and recognition of hybrid manipulation motions in variable environments using probabilistic flow tubes
Li et al. Kinematic control of redundant robot arms using neural networks
Khansari-Zadeh et al. Learning to play minigolf: A dynamical system-based approach
CN115351780A (en) Method for controlling a robotic device
CN115421387B (en) Variable impedance control system and control method based on inverse reinforcement learning
Zhang et al. Model‐Free Attitude Control of Spacecraft Based on PID‐Guide TD3 Algorithm
Vinogradska et al. Numerical quadrature for probabilistic policy search
Jiang et al. Bioinspired control design using cerebellar model articulation controller network for omnidirectional mobile robots
Veselic et al. Human-robot interaction with robust prediction of movement intention surpasses manual control
Lin et al. Objective learning from human demonstrations
Nohooji et al. Actor–critic learning based PID control for robotic manipulators
Feng et al. Adaptive neural network tracking control of an omnidirectional mobile robot
Langsfeld Learning task models for robotic manipulation of nonrigid objects
US20220410380A1 (en) Learning robotic skills with imitation and reinforcement at scale
Yin et al. Learning cost function and trajectory for robotic writing motion
Gams et al. Manipulation learning on humanoid robots
Afzali et al. A Modified Convergence DDPG Algorithm for Robotic Manipulation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant