CN115421387A

CN115421387A - Variable impedance control system and control method based on inverse reinforcement learning

Info

Publication number: CN115421387A
Application number: CN202211161566.3A
Authority: CN
Inventors: 边桂彬; 李桢; 钱琛
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2022-09-22
Filing date: 2022-09-22
Publication date: 2022-12-02
Anticipated expiration: 2042-09-22
Also published as: CN115421387B

Abstract

The present disclosure relates to a variable impedance control system, a control method, an electronic device, and a storage medium based on inverse reinforcement learning. Wherein, this system includes: the device comprises a variable impedance controller, an impedance gain controller, a variable impedance control strategy module and an inverse reinforcement learning algorithm module. According to the method, the variable impedance gain action space is introduced, the transfer performance of the reward function is improved in task setting, the generalized representation of the variable impedance technology is realized, the layered impedance control of the mechanical arm can be realized, the relatively complex physical interaction is completed, the motion precision of the mechanical arm is guaranteed in a dynamic environment, and therefore the safety of mechanical arm control is improved.

Description

Variable impedance control system and control method based on inverse reinforcement learning

Technical Field

The present disclosure relates to the field of mechanical arms and automatic control, and in particular, to a variable impedance control system, a control method, an electronic device, and a storage medium based on inverse reinforcement learning.

Background

Robotic systems are increasingly used in various unstructured environments, such as hospitals, factories, houses, etc., where the robot needs to perform complex operational tasks, adjust the impedance according to different task phases and environmental constraints, while interacting with an unknown environment in a safe and stable manner. Impedance control, which establishes mass-spring-damped contact dynamics, has been widely used in these robotic systems to ensure safe physical interaction. In addition, many complex operating tasks require the robot to change impedance according to the task phase, and flexibility and robustness have become one of the important indicators for developing surgical robot controllers for physical interaction. However, conventional impedance control schemes do not understand the actual surgical scenario, including complex physical interactions on the robotic arm, resulting in a loss of precision, and in practice, achieving such tasks requires variable impedance skills.

The existing learning-based method for obtaining variable impedance skills mainly includes the following categories:

the first type is a teaching learning-based approach. The human expert controls the robot through a haptic interface and a hand-held impedance control interface, which is based on a linear spring-reset potentiometer that maps button positions to robot arm stiffness. This arrangement allows a human expert to adjust the compliance of the robot according to given task requirements, encode the demonstrated motion and stiffness trajectories using dynamic motion primitives, and learn using local weight regression. If the illustrated trajectory has a high variance, the impedance should be low, and if the illustrated trajectory has a low variance, the impedance should be high. Such a strategy may provide a very good solution for many manipulation tasks. The advantage is that no separate demonstration of the impedance is required. However, in some interactive tasks, such as a sliding task in a groove, low trajectory variability does not necessarily correspond to high impedance.

The second type is based on a deep reinforcement learning method with a variable impedance motion space. When using reinforcement learning to control robot motion, an important challenge is the parameterized selection of strategies. Parameters with relevant nonlinear features are usually extracted from a set of motion demonstrations following the teaching learning paradigm using gaussian mixture regression, the final parameterization takes the form of a nonlinear time-invariant dynamic system, this time-invariant dynamic system is used as a parameterization strategy for a variant of the PI2 strategy search algorithm, and finally the time-invariant motion is represented by the PI2 strategy search algorithm. However, this approach has certain drawbacks, and in the first place it is more idealized, assuming that there is no noise in the system other than the detection noise, which means that disturbances encountered during sampling the trajectory have a negative impact on learning and cannot be considered to improve the strategy. Second, it is initially designed to learn a trajectory from a particular initial state, and using it to learn a trajectory from multiple initial states increases the number of deployments required. While many inverse reinforcement learning algorithms employ entropy regularization to prevent simple emulation of expert strategies, most previous efforts have not focused on the impact of action space selection on a priori knowledge.

While many methods based on deep reinforcement learning and teaching learning have been proposed to obtain variable impedance skills that are exposed to rich operating tasks, these skills are typically task-specific based and may be sensitive to changes in task settings, and the task-specific impedance skills obtained by teaching learning methods may fail when a task changes. Furthermore, designing suitable reward functions is challenging for reinforcement learning, and therefore their skill transferability is limited.

Accordingly, there is a need for one or more methods to address the above-mentioned problems.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

An object of the present disclosure is to provide a variable impedance control system, a control method, an electronic device, and a storage medium based on inverse reinforcement learning, thereby overcoming, at least to some extent, one or more of the problems due to the limitations and disadvantages of the related art.

According to an aspect of the present disclosure, there is provided an inverse reinforcement learning-based variable impedance control system, the system including a variable impedance controller, an impedance gain controller, wherein:

the variable impedance controller is used for generating a mechanical arm tail end expected position increment for correcting a track according to the first feedback force and an expected track based on the acquired target rigidity and damping coefficient;

the impedance gain controller is used for generating a second feedback force for controlling the movement of the mechanical arm according to the expected position increment of the tail end of the mechanical arm, and the mechanical arm control is completed based on the second feedback force.

Preferably, the variable impedance control system further comprises an inverse reinforcement learning algorithm module and a variable impedance control strategy module, wherein:

the reverse reinforcement learning algorithm module is used for distinguishing a motion track and an expert track by using a discriminator and calculating a loss function based on an expert strategy and a reward function, updating the discriminator through a minimized loss function and updating a variable impedance control strategy through a maximized reward function;

the variable impedance control strategy module is used for calculating target rigidity and damping coefficient according to the tail end position of the mechanical arm and the second feedback force based on the existing variable impedance control strategy, and sending the target rigidity and damping coefficient to the variable impedance controller.

Preferably, the variable impedance controller is based on a second order impedance model

The robot arm tip desired position increment for the revised trajectory is generated as:

wherein, M _d (t)、B _d (t)、K _d (t) respectively representing a time-varying target inertia matrix, a target damping matrix and a target stiffness matrix in the impedance model,

x is the actual acceleration, velocity and position of the robot end in cartesian space respectively,

x _d respectively the expected acceleration and speed of the robot endAnd position, fd and F are respectively expected contact force and actual contact force between the robot end and the environment, E (n) is contact force error, T is control period, w ₁ ，w ₂ ，w ₃ Are all intermediate variables;

w ₁ ＝4M _d (t)+2B _d (t)T+K _d (t)T ²

w ₂ ＝-8M _d (t)+2K _d (t)T ²

w ₃ ＝4M _d (t)-2B _d (t)T+K _d (t)T ² 。

preferably, the impedance gain controller is based on a dynamical model of the robot in cartesian space:

and the kinetic equation:

the feed forward terms that generate the impedance control law are:

the second feedback force is:

wherein M (x) is a mass inertia matrix,

in the form of a matrix of the coriolis forces,

as a vector of the gravity force,

and x is the Cartesian acceleration, velocity and position relative to the end effector, J is the Jacobian matrix, τ, F _ext Respectively inputting torque and external force of a joint space motor; m _d 、B _d 、K _d A desired mass, damping and stiffness matrix; e and

to track position errors and track velocity errors.

Preferably, the variable impedance control strategy module tracks errors according to cartesian spatial locations:

the variable impedance control strategy generated according to the distance of the mechanical arm close to the target position is as follows:

wherein e is ₁ 、e ₂ Two gain change points of 0.4m and 0.2m, respectively.

Preferably, the inverse reinforcement learning algorithm module is used for basing on expert strategy and reward function

Wherein the content of the first and second substances,

d _i，t the distances between the ith mixed track point and the expected point at the t moment respectively,

d _i，t+1 at the t +1 th time,The distance between the ith mixed track point and the expected point, wherein gamma is a proportionality coefficient;

discriminating motion trajectories from expert trajectories using a discriminator and calculating a loss function

Wherein r is _θ (o, a) is a reward function to be learned, and pi (a | o) is the probability of taking action a when the observed value is o under the current strategy pi;

the discriminator is updated by minimizing a penalty function and the variable impedance control strategy is updated by maximizing a reward function.

Preferably, the scale factor value range in the inverse reinforcement learning algorithm module is 0-1.

In one aspect of the present disclosure, there is provided a variable impedance control method based on inverse reinforcement learning, the method including:

initializing target rigidity and a damping coefficient as mechanical arm control parameters, acquiring the tail end position and a first feedback force of the mechanical arm, and generating a mechanical arm tail end expected position increment for correcting a track by a variable impedance controller according to the tail end position, the first feedback force and an expected track of the mechanical arm on the basis of the target rigidity and the damping coefficient;

and the impedance gain controller generates a second feedback force for controlling the movement of the mechanical arm according to the expected position increment of the tail end of the mechanical arm, and completes mechanical arm control based on the second feedback force.

Preferably, the method further comprises:

the reverse reinforcement learning algorithm module is based on an expert strategy and an incentive function in a reverse reinforcement learning algorithm, a discriminator is used for distinguishing a motion track and the expert track and calculating a loss function, the discriminator is updated through a minimized loss function, and a variable impedance control strategy is updated through a maximized incentive function;

and the variable impedance control strategy module calculates a target rigidity and a damping coefficient according to the tail end position of the mechanical arm and the second feedback force based on the variable impedance control strategy sent by the inverse reinforcement learning algorithm module, and sends the target rigidity and the damping coefficient to the variable impedance controller.

Preferably, the inverse reinforcement learning algorithm in the inverse reinforcement learning algorithm module comprises:

collecting the force and torque exerted by a specialist on the mechanical arm end effector in the specialist track to enable the mechanical arm end to complete the expected track, and designing a reward function r (o, a);

initializing a first impedance gain strategy by using random weight;

collecting a first trace under the first impedance gain strategy;

exploring to obtain a second impedance gain strategy by using an inverse reinforcement learning algorithm based on the first track;

collecting a second trace according to the second impedance gain strategy;

and distinguishing the second track and the expert track based on the discriminator, calculating a loss function, updating the discriminator through the minimized loss function, repeating the inverse reinforcement learning algorithm, and judging and generating the optimal variable impedance control strategy based on the reward function.

In one aspect of the present disclosure, there is provided an electronic device including:

a processor; and

a memory having computer readable instructions stored thereon which, when executed by the processor, implement a method according to any of the above.

In an aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, realizes the method according to any one of the above.

An inverse reinforcement learning-based variable impedance control system, a control method, an electronic device, and a storage medium in exemplary embodiments of the present disclosure. Wherein, this system includes: the device comprises a variable impedance controller, an impedance gain controller, a variable impedance control strategy module and an inverse reinforcement learning algorithm module. According to the method, the variable impedance gain action space is introduced, the transfer performance of a reward function is improved in task setting, generalized representation of variable impedance skills is achieved, layered impedance control of the mechanical arm can be achieved, complex physical interaction is completed, the movement precision of the mechanical arm is guaranteed in a dynamic environment, and therefore the safety of mechanical arm control is improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The above and other features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.

FIG. 1 illustrates a system block diagram of an inverse reinforcement learning based variable impedance control system according to an exemplary embodiment of the present disclosure;

FIG. 2 illustrates a controller design schematic of an inverse reinforcement learning based variable impedance control system according to an exemplary embodiment of the present disclosure;

FIG. 3 illustrates a flowchart of an inverse reinforcement learning algorithm for an inverse reinforcement learning based variable impedance control system according to an exemplary embodiment of the present disclosure;

FIG. 4 illustrates a flow chart of a variable impedance control method based on inverse reinforcement learning according to an exemplary embodiment of the present disclosure;

FIG. 5 schematically illustrates a block diagram of an electronic device according to an exemplary embodiment of the present disclosure; and

fig. 6 schematically illustrates a schematic diagram of a computer-readable storage medium according to an exemplary embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in the form of software, or in one or more software-hardened modules, or in different networks and/or processor devices and/or microcontroller devices.

In the present exemplary embodiment, there is first provided an inverse reinforcement learning-based variable impedance control system; referring to fig. 1, the variable impedance control system based on inverse reinforcement learning includes a variable impedance controller, an impedance gain controller, wherein:

An inverse reinforcement learning-based variable impedance control system, a control method, an electronic device, and a storage medium in exemplary embodiments of the present disclosure. Wherein, this system includes: the device comprises a variable impedance controller, an impedance gain controller, a variable impedance control strategy module and an inverse reinforcement learning algorithm module. According to the method, the variable impedance gain action space is introduced, the transfer performance of the reward function is improved in task setting, the generalized representation of the variable impedance technology is realized, the layered impedance control of the mechanical arm can be realized, the relatively complex physical interaction is completed, the motion precision of the mechanical arm is guaranteed in a dynamic environment, and therefore the safety of mechanical arm control is improved.

Next, a variable impedance control system based on inverse reinforcement learning in the present exemplary embodiment will be further described.

In the embodiment of the example, the variable impedance strategy and the reward function are recovered from the teaching based on the method of inverse reinforcement learning, the reward function is maximized by generating a new variable impedance strategy for different task settings by using a reinforcement learning algorithm, and different action spaces of the reward function are explored to realize generalized representation of the variable impedance skill. The method mainly comprises the following three parts:

in the embodiment of the present example, the Cartesian space impedance control design section

Consider a kinetic model of a robot in cartesian space:

where M (x) is the mass inertia matrix,

a matrix of the coriolis force is represented,

is the vector of the force of gravity,

and x is the Cartesian acceleration, velocity and position, respectively, relative to the end effector, J is the Jacobian matrix, τ, F _ext Respectively representing the torque input and the external force of the joint space motor. Under the law of impedance control, the robot will behave as a mass-spring-damper system, which followsFollowing the kinetic equation:

wherein M is _d 、B _d 、K _d Is the required mass, damping and stiffness matrix. By solving (1), (2) and setting M _d = M (x), the impedance control law can be written as:

τ＝J ^T F

the impedance control law can be further divided into two parts: feed forward term F _ff To eliminate nonlinear robot dynamics and feedback terms F _fb Tracking a required track:

wherein e and

are the tracking error and the tracking speed. Rigidity matrix K _d And a damping matrix B _d Also called impedance gain matrices, because they map tracking errors and velocities to feedback forces F _fb 。

In the present exemplary embodiment, the controller design is depicted in FIG. 1 by the antagonistic inverse reinforcement learning variable impedance skills section. In the method, the observed values of the robot and the environment are a tracking error e and a tracking speed

The strategy adopted accepts observation and outputs impedance gainK. B or feedback force F _fb Depending on the motion space design. The impedance gain controller then calculates the control input and controls the robot using equation (3), learning expert strategies and reward functions using antagonistic inverse reinforcement learning, the training process being detailed in the algorithm.

In the present invention, an inverse reinforcement learning algorithm is used to learn expert strategies and reward functions, and in this antagonism training environment, the discriminator that separates the generator trajectory from the expert trajectory is defined as:

wherein r is _θ (o, a) is the reward function that needs to be learned, and π (a | o) is the probability of taking action a when observed as o under the current strategy. Updating the discriminator to minimize the loss:

the generator is a variable impedance strategy. During training, the strategy is updated to maximize the trajectory reward, evaluated by a reward function, and strategy updating is performed by a strategy gradient-based reinforcement learning method confidence domain strategy optimization algorithm (TRPO). Because the environment dynamics are unknown, new strategies are re-optimized in different task settings by applying reinforcement learning to test the performance of the learning reward function. In the reinforcement learning training process, strategy updating is the same as an inverse reinforcement learning method, but a fixed learning reward function is provided. The detailed process of algorithm training is as follows:

in the embodiment of the present example, the method applies the part; when the method is put into use, expert data is collected by human experts of real robots, and then learned strategies are transferred to the real robots for performance evaluation.

1. And setting a task. The real world experiment device consists of a host computer, a target computer, an F/T sensor and a robot. A cartesian variable impedance control algorithm is written on the host PC that controls the Real robotic system connected to the target PC through Simulink Real-Time. Model parameters of real robots, such as mass inertia matrix M (x), coriolis force matrix

And the gravity vector G (x) is obtained by the euler-lagrange method.

2. Human expert data is collected. During data collection, a human expert applies forces and torques on the end effector to cause the end of the robotic arm to complete a desired trajectory. The 6-dimensional Cartesian space forces and torques are measured by the F/T sensors, and then the control inputs are calculated using equation (3), which will track the state

And the force adopted by the human expert is recorded as human expert data, and the income of the human expert is estimated in data processing.

3. The gain estimation is performed using a sliding window method. To recover the expert gain strategy, a short sliding window is used to estimate the stiffness and damping of the force. Each time window contains ten state-force pairs and the expert gain estimates the test by solving equation (5) with least squares. Strategies and reward functions are learned in a simulated environment using antagonistic reverse reinforcement learning with real-world human expert data.

The variable impedance control system based on the inverse reinforcement learning comprises a variable impedance controller, an impedance gain controller, a variable impedance control strategy module and an inverse reinforcement learning algorithm module, wherein:

the variable impedance controller is used for generating a mechanical arm tail end expected position increment for correcting the track according to the first feedback force and the expected track based on the target rigidity and the damping coefficient generated and sent by the variable impedance control strategy module.

In the exemplary embodiment, the variable impedance controller in the system is based on a second order impedance model

The mechanical arm tip desired position increment for correcting the trajectory is generated as:

wherein M is _d (t)、B _d (t)、K _d (t) respectively representing a time-varying target inertia matrix, a target damping matrix and a target stiffness matrix in the impedance model,

x _d respectively the desired acceleration, velocity and position of the robot tip, F _d And F the expected and actual contact force between the robot tip and the environment, respectively, and E (n) the contact force error.

In the present exemplary embodiment, to achieve the desired dynamic behavior of the tip, a second order impedance model is used:

x _d respectively the desired acceleration, velocity and position of the robot tip, F _d And F is the expected and actual contact forces between the robot tip and the environment, respectively.

To obtain the corrected desired position increment, the second order impedance model is lagrangian transformed and a bilinear transformation s =2T is used ^-1 (z-1)(z+1) ^-1 Discretizing to obtain:

w ₁ ＝4M _d (t)+2B _d (t)T+K _d (t)T ² (3)

w ₂ ＝-8M _d (t)+2K _d (t)T ² (4)

w ₃ ＝4M _d (t)-2B _d (t)T+K _d (t)T ² (5)

where T is the control period, the difference equation of the impedance controller, i.e., the expected position increment of the terminal, is:

to simplify the calculation, the target inertia matrix is set to a constant M _d (t) = I, so the variable impedance controller requires a time-varying target stiffness K _d (t) damping coefficient B _d (t) adjusting the desired position with the contact force error E (n).

The impedance gain controller is used for generating a second feedback force for controlling the movement of the mechanical arm according to the expected position increment of the mechanical arm tail end.

In an embodiment of the present example, the impedance gain controller in the system is based on a dynamical model of the robot in cartesian space:

and the kinetic equation:

the feed forward term to generate the impedance control law is:

the second feedback force is:

wherein M (x) is a mass inertia matrix,

in the form of a matrix of the coriolis forces,

as a vector of the gravity force,

and x is the Cartesian acceleration, velocity and position relative to the end effector, J is the Jacobian matrix, τ, F _ext Respectively inputting torque and external force of a joint space motor; m is a group of _d 、B _d 、K _d A desired mass, damping and stiffness matrix; e and

to track position errors and track velocity errors.

In the exemplary embodiment, the mass inertia matrix M (x), the Coriolis matrix

And the gravity vector

The model parameters are automatically calculated using a Mujoco simulation model.

Constructing a dynamic model of the robot in a Cartesian space:

wherein M (x) is a mass inertia matrix,

is a Coriolis force matrix, G (x) is a gravity vector,

and x is the Cartesian acceleration, velocity and position relative to the end effector, J is the Jacobian matrix, τ, F _ext Respectively a joint space motor torque input and an external force.

Under the law of impedance control, the robot will behave as a mass-spring-damper system, which follows the kinetic equation:

wherein M is _d 、B _d 、K _d Is the desired mass, damping and stiffness matrix. By solving (1), (2) and setting M _d = M (x), the impedance control law can be written as:

τ＝J ^T F

the impedance control law can be further divided into two parts: feedforward term F _ff To eliminate nonlinear robot dynamics and feedback term F _fb Tracking a required track:

wherein e and

are a tracking position error and a tracking velocity error. Rigidity matrix K _d And damping matrix B _d Also referred to as impedance gain matrices, because they map tracking position errors and tracking velocity errors to feedback forces F _fb . To simplify the notation, we use K (stiffness) and B (damping) to denote K in the rest of the text _d And B _d . Fig. 2 depicts a controller design.

The variable impedance control strategy module is used for calculating target rigidity and a damping coefficient according to the tail end position of the mechanical arm and the second feedback force based on a preset variable impedance control strategy, and sending the target rigidity and the damping coefficient to the variable impedance controller.

In this exemplary embodiment, the variable impedance control strategy module in the system tracks errors based on cartesian spatial position:

wherein e is ₁ 、e ₂ Two gain change points of 0.4m and 0.2 m.

In the embodiment of the present example, (1) observation space: will track error e and tracking speed

Together serve as a viewing space for the task for which the end effector is located on the cup. In addition, due to a pair of tracking error e and tracking speed

Without providing acceleration information, it is not possible to fully represent the system dynamics, for which a preamble observation is used, comprising e and e from the first five time steps

Is evaluated.

(2) An action space: for the impedance gain action space, the strategy outputs the impedance gain, and the control input is obtained by equation (11).

To reduce the dimension of the gain action space, it is assumed that the stiffness matrix K and the damping matrix B are diagonal. Furthermore, by forcing the diagonal elements to be positive, it is ensured that the stiffness matrix K and the damping matrix B are positive-definite. To extend the method to the full matrix case, a Cholesky decomposition can be utilized to ensure K, B > 0. For cup set tasks, tracking speed is large and damping terms can affect performance. Thus, the output of the policy is now [ K ] ₁ ，K ₂ ，K ₃ ，K ₄ ，K ₅ ，K ₆ ，d]Containing an additional damping factor d, the stiffness and damping matrix can then be obtained by:

a 1-dimensional damping factor is used instead of another 6-dimensional damping to reduce the dimensions.

(3) Variable impedance control strategy:

cartesian spatial position tracking error:

the variable impedance control strategy comprises three phases:

e ₁ 、e ₂ two gain change points of 0.4m and 0.2 m. The expert control law selects the maximum gain for acceleration during the acceleration phase, and generally switches to a smaller gain during the switching phase. At the arrival stage, the robotic arm approaches the board at a minimum speed to ensure safety.

The inverse reinforcement learning algorithm module is used for distinguishing a motion track and an expert track by using a discriminator and calculating a loss function based on an expert strategy and a reward function, updating the discriminator through a minimized loss function and updating the variable impedance control strategy through a maximized reward function.

In the exemplary embodiment, the inverse reinforcement learning algorithm module of the system is configured to base expert strategies and reward functions

Wherein the content of the first and second substances,

d _i，t the distances between the ith mixing track point and the expected point at the t-th moment respectively,

d _i，t+1 the distances between the ith mixed track point and the desired point at the t +1 th time point, and gamma is a ratioExample coefficients;

Wherein r is _θ (o, a) is the reward function that needs to be learned, and π (a | o) is the probability of taking action a with an observation of o under the current strategy π;

the discriminator is updated by minimizing a loss function and the variable impedance control strategy is updated by maximizing a reward function.

In the embodiment of the example, the scale factor in the inverse reinforcement learning algorithm module of the system has a value range of 0-1;

further, a scaling factor value in the inverse reinforcement learning algorithm module of the system is 0.95.

In the present exemplary embodiment, an inverse reinforcement learning algorithm is employed to learn the expert strategy and reward function. The input of the method is a mixed track of an expert track and a robot generation track, and the output is a target rigidity K of an impedance controller _d (t) and damping coefficient B _d (t)。

Firstly, a reward function is designed according to the states of the observation space and the action space, because the expert track and the robot generation track are not separated when the reward function is designed, the reward function is designed as follows:

wherein the content of the first and second substances,

d _i，t+1 the distances between the ith mixing track point and the expected point at the t +1 th moment are respectively, gamma is a proportionality coefficient, the magnitude of gamma is between 0 and 1, and the magnitude of gamma is generally 0.95.

Then, distinguishing the track generated by the robot from the expert track by using an identifier, wherein the distinguishing process comprises the following steps: the integral track is divided into 50 track points, and the identifier takes reward values and state action transition probabilities obtained by calculating the track points through a reward function as input to obtain the following formula:

wherein r is _θ (o, a) is the reward function that needs to be learned, and π (a | o) is the probability of taking action a with an observation of o under the current strategy π, which discriminator then uses many-to-one LSTM model with elements with time steps as input, and a scalar as output, the formula is as follows:

where F represents the fused features of all trace points in the trace (i.e., F = [ F ]) ₀ ，f ₁ ，...，f ₄₉ ]) Wherein f is _i Is the fusion feature vector of the ith trace point,

is the weight matrix of the LSTM model, and h is the output scalar of the LSTM model.

Binary classification (expert or generation) of scalar outputs using a unit dense layer with sigmoid activation function:

O _d ＝D ^be (h；W _bc ) (20)

wherein D ^bc Is a unit dense layer with sigmoid function for binary classification, W _bc Is its weight matrix, O _d Is the final output of the arbiter, which generates a trajectory for the expert trajectory and the robot.

Updating the discriminator by minimizing the loss:

during training, the strategy is updated to maximize the track reward, evaluated by a reward function, the strategy update is the same as the inverse reinforcement learning method but with a fixed learning reward function, the strategy update is carried out by adopting a reinforcement learning method confidence domain strategy optimization algorithm (TRPO) based on strategy gradient to obtain the target stiffness K of the impedance controller which maximizes r (o, a) _d (t) and damping coefficient B _d (t)。

The TRPO algorithm is used as follows: first, several functions are defined, each of which is an action value function Q _π (a _t ，s _t ) Function of state value V _π (s _t ) And a merit function A _π (s，a)：

A _π (s，a)＝Q _π (s，a)-V _π (s) (24)

Wherein s is _t ，a _t ，s _t+1 The state (position and velocity) and the motion of the robot at time t, and the state at time t +1 are shown. The action value function evaluates the quality of a state action pair, the state value function evaluates the quality of a state, and the dominance function evaluates a relative concept, namely the quality of the action relative to other actions in the same state. The strategy to learn pi often represents a neural network, with inputs being states and outputs being actions. Let us assume that the parameter of the neural network is θ, then

Now the goal is translated into finding a theta, which corresponds to the strategy pi _θ Corresponding eta (pi) _θ ) Maximum expectation of right in (25) formula

Is according to pi _θ (a _t |s _t ) Sampling is a process occurring in the real world and cannot be easily calculated, and a substitute function is needed

To rewrite the formula, first, define

p _π (s)＝P(s ₀ ＝s)+γP(s ₁ ＝s)+γ ² p(s ₂ ＝s)+… (27)

ρ _π (s) is related to pi, which represents the frequency that each state may be visited, with a gamma discount. Note that the following equation holds true:

A _π (s _t ，a _t ) Only with a single s, a, but is expected

Separately count each s _t ，a _t The probability of occurrence is obtained

Then is provided with

Regarding pi of the above formula as an old strategy, let

As new policies are considered, only one new policy needs to be found so that

Then this new strategy will certainly allow the accumulated reward η to be boosted. When all of

This condition is not satisfied, which means that the original strategy is optimal. Thus, by the above-described procedure, the impedance controller target rigidity K is obtained such that the reward obtains the maximum value _d (t) and damping coefficient B _d And (t), thereby realizing the aim of adjusting the track of the robot in real time.

It should be noted that although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order or that all of the depicted steps must be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

Further, in the present exemplary embodiment, there is also provided a variable impedance control method based on inverse reinforcement learning. Referring to fig. 4, the variable impedance control method based on inverse reinforcement learning includes:

s110, initializing target rigidity and a damping coefficient as mechanical arm control parameters, acquiring the tail end position and a first feedback force of the mechanical arm, and generating a mechanical arm tail end expected position increment for correcting a track according to the tail end position, the first feedback force and an expected track of the mechanical arm by a variable impedance controller based on the target rigidity and the damping coefficient;

and S120, generating a second feedback force for controlling the motion of the mechanical arm by the impedance gain controller according to the expected position increment of the tail end of the mechanical arm, and finishing mechanical arm control based on the second feedback force.

S130, the reverse reinforcement learning algorithm module distinguishes the motion track and the expert track and calculates the loss function by using a discriminator based on the expert strategy and the reward function in the reverse reinforcement learning algorithm, updates the discriminator by the minimized loss function and updates the variable impedance control strategy by the maximized reward function;

and S140, calculating a target rigidity and a damping coefficient by the variable impedance control strategy module based on the variable impedance control strategy sent by the inverse reinforcement learning algorithm module according to the tail end position of the mechanical arm and the second feedback force, and sending the target rigidity and the damping coefficient to the variable impedance controller.

In the embodiment of the present example, the inverse reinforcement learning algorithm in the control method further includes:

collecting the force and torque applied to the mechanical arm end effector by a specialist in the specialist track to enable the mechanical arm end to complete the expected track, and designing a reward function r (o, a);

initializing a first impedance gain strategy by using random weight;

collecting a first track under the first impedance gain strategy;

a second impedance gain strategy is obtained by using an inverse reinforcement learning algorithm based on the first track;

collecting a second trace according to the second impedance gain strategy;

and distinguishing the second track and the expert track based on the discriminator, calculating a loss function, updating the discriminator by minimizing the loss function, repeating the inverse reinforcement learning algorithm, and judging and generating an optimal variable impedance control strategy based on a reward function.

In the present exemplary embodiment, as shown in fig. 3, the inverse reinforcement learning algorithm of the present invention mainly includes five steps:

1) Gathering forces and torques applied by a human expert on an end effector to cause the end of a robotic arm to complete a desired trajectory

Or traces collected by a designed variable impedance controller performing the task

Designing a reward function r (o, a);

2) Initializing an impedance gain strategy pi by using a random weight;

3) Trace tau under collection strategy pi _i ；

4) An optimal impedance gain strategy pi (theta) is obtained by using an inverse reinforcement learning algorithm;

5) Setting the policy pi x ← pi (theta), and applying the policy to the system to collect a new track;

6) And (5) repeating the steps (3-5) until a satisfactory control strategy is obtained through learning.

In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.

An electronic device 500 according to such an embodiment of the invention is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 may include, but are not limited to: the at least one processing unit 510, the at least one memory unit 520, a bus 530 connecting various system components (including the memory unit 520 and the processing unit 510), and a display unit 540.

Wherein the storage unit stores program code that is executable by the processing unit 510 to cause the processing unit 510 to perform steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of the present specification. For example, the processing unit 510 may perform steps S110 to S140 as shown in fig. 1.

The memory unit 520 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) 5201 and/or a cache memory unit 5202, and may further include a read only memory unit (ROM) 5203.

Storage unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 530 may be a local bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or any of a variety of bus architectures.

The electronic device 500 may also communicate with one or more external devices 570 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 500 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 550. Also, the electronic device 500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 560. As shown, the network adapter 560 communicates with the other modules of the electronic device 500 over a bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when said program product is run on the terminal device.

Referring to fig. 6, a program product 600 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily appreciated that the processes illustrated in the above figures are not intended to indicate or limit the temporal order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims

1. A variable impedance control system based on inverse reinforcement learning, the system comprising a variable impedance controller, an impedance gain controller, wherein:

2. The system of claim 1, wherein the variable impedance control system further comprises an inverse reinforcement learning algorithm module and a variable impedance control strategy module, wherein:

3. The system of claim 2, wherein the variable impedance controller is based on a second order impedance model

x _d respectively the desired acceleration, velocity and position of the robot tip, F _d And F is the expected contact force and the actual contact force between the robot end and the environment, respectively, E (n) is the contact force error, T is the control period, w ₁ ，w ₂ ，w ₃ Are all intermediate variables;

w ₁ ＝4M _d (t)+2B _d (t)T+K _d (t)T ²

w ₂ ＝-8M _d (t)+2K _d (t)T ²

w ₃ ＝4M _d (t)-2B _d (t)T+K _d (t)T ² 。

4. the system of claim 2, wherein the impedance gain controller is based on a model of the dynamics of the robot in cartesian space:

and a kinetic equation:

the feed forward term to generate the impedance control law is:

the second feedback force is:

wherein M (x) is a mass inertia matrix,

is a Coriolis force matrix, G (x) is a gravity vector,

and x is the Cartesian acceleration, velocity and position, respectively, relative to the end effector, J is the Jacobian matrix, τ, F _ext Respectively inputting torque and external force of a motor in a joint space; m is a group of _d 、B _d 、K _d A desired mass, damping and stiffness matrix; e and

to track position errors and track velocity errors.

5. The system of claim 2, wherein the variable impedance control strategy module tracks errors based on cartesian spatial locations:

6. The system of claim 2, wherein the inverse reinforcement learning algorithm module is to base expert strategies and reward functions on

Wherein the content of the first and second substances,

d _i，t+1 respectively at the t +1 th moment, the ith mixed track point and the expected pointGamma is a proportionality coefficient;

Wherein r is _θ (o, a) is a reward function needing to be learned, and pi (a | o) is the probability of taking action a when the observed value is o under the current strategy pi;

7. The system of claim 6, wherein the scale factor in the inverse reinforcement learning algorithm module ranges from 0 to 1.

8. A variable impedance control method based on inverse reinforcement learning, the method comprising:

9. The control method of claim 8, wherein the method further comprises:

10. The control method of claim 9, wherein the inverse reinforcement learning algorithm in the inverse reinforcement learning algorithm module comprises:

initializing a first impedance gain strategy with random weights;

collecting a first trace under the first impedance gain strategy;

collecting a second trace according to the second impedance gain strategy;

11. An electronic device, comprising

A processor; and

a memory having computer-readable instructions stored thereon that, when executed by the processor, implement the method of any of claims 8-10.

12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 8-10.