CN112894809B

CN112894809B - Impedance controller design method and system based on reinforcement learning

Info

Publication number: CN112894809B
Application number: CN202110061914.9A
Authority: CN
Inventors: 赵兴炜; 陶波; 韩世博; 丁汉
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2021-01-18
Filing date: 2021-01-18
Publication date: 2022-08-02
Anticipated expiration: 2041-01-18
Also published as: CN112894809A

Abstract

The invention discloses an impedance controller design method and system based on reinforcement learning, and belongs to the field of robot control. The method comprehensively considers the control input, the position and the speed of the controlled system and the influence of the external force, designs an effective reward function and a cost function by utilizing the direct proportional relation between the external force and the position of the controlled system, can design an optimal impedance controller by reinforcement learning under the condition that a system model and an environment model are unknown, and can modify the response characteristic of the system by adjusting parameters to generate an ideal robot impedance controller. The method of the invention defines the form of the cost function, greatly reduces the number of undetermined coefficients, does not need a complex deep network to fit the cost function, and greatly accelerates the learning process.

Description

Impedance controller design method and system based on reinforcement learning

Technical Field

The invention belongs to the field of robot control, and particularly relates to an impedance controller design method and system based on reinforcement learning.

Background

With the advent of compliant operation and human-computer interaction scenarios, the goal of robot control is no longer to reduce positional error singly, and its compliance is receiving more and more attention. Impedance control is a very effective robot compliance control method, when external force exists, balance is automatically kept between the external force and a target position, rigid collision and overlarge contact force are avoided, and a robot, a workpiece and a user are protected; when no external force exists, higher position precision can be realized, and the requirements of various kinds of work are met.

Patent CN202010771033.1 proposes a method for controlling adaptive impedance of a mechanical arm based on RBF neural network, but it needs to obtain a nominal dynamical model to design an impedance controller and an error compensation controller, and the structure is complicated. The CN201910352004.9 patent needs to identify the kinetic parameters of the robot through preprocessing. The method for optimizing the multi-axis hole assembly control of the non-model robot by using environment prediction, which is proposed by patent CN201910287227.1, uses reinforcement learning for robot assembly, but the adopted deep reinforcement learning method has low convergence rate, long training time and certain limitation in use.

Disclosure of Invention

In view of the above drawbacks and needs of the prior art, the present invention provides a method and system for designing an impedance controller based on reinforcement learning, which aims to obtain an optimal impedance controller quickly without knowing a system dynamics model.

To achieve the above object, according to one aspect of the present invention, there is provided an impedance controller design method based on reinforcement learning, including:

s1, designing a reward function and a cost function; set up as a reward function

Set as a cost function

Wherein the content of the first and second substances,

respectively representing the current position, speed and external force of the controlled system; q _f ，Q _x ，Q _v Weights for external force, position, velocity in the impedance controller design objective, respectively; and u is KX is the control input of the system; k is the impedance control parameter to be designed and optimized,

representing the kronecker product of the matrix; theta is a cost function parameter;

and S2, estimating theta by adopting a reinforcement learning method based on the reward function and the cost function to obtain an optimal impedance control parameter K, and finishing the design of the impedance controller.

Further, step S1 specifically includes:

s101, regarding an external force f borne by a controlled system as a part of a system stateObtaining an augmented state vector of

Respectively representing the current position, speed and external force of the controlled system; the form of the impedance controller is set as that u is KX, u is the control input of the system, and K is the impedance control parameter to be designed and optimized;

s102, inputting control u, the current position q and the speed of a controlled system

And the received external force f is regarded as the cost of the control system, and the cost function is set as:

Q ₁ ，Q ₂ ，Q ₃ all are positive and real numbers for controlling the weight;

s103, changing F to F _e Substituting q into the cost function to obtain:

in the form of a matrix

S104, designing the reward function as the inverse number of the cost function:

the cost function being designed as an accumulation of reward functions

Further, the order of arrangement of the elements in the augmented state vector X is arbitrary, K, c _k 、r、Q _v (X,u)、

The specific form of θ varies depending on the order of arrangement of the elements in X.

Further, step S2 is specifically:

s201, learning process initialization: setting K as zero vector and theta as zero vector, and setting updating period i _update ，i _update Is a positive integer;

s202, learning period initialization: setting a controlled system to an initial state X _ΔT Setting a learning parameter P ═ δ H, wherein δ is a positive integer, and H is a unit matrix of n × n;

s203, calculating control input u ═ KX _iΔT + σ Rand, Rand being a random number, σ being a weighting factor; x _iΔT The system state of the current control period is 1, 2, 3 …, and Δ T is the control period of the controlled system;

s204, calculating a reward function

S205, obtaining the system state X of the next control cycle _(i+1)ΔT Updating a system value function parameter theta and a learning parameter P:

θ＝θ+gradient

gradient is intermediate quantity, gamma is a prediction factor, and gamma is more than 0 and less than 1;

s206, updating an impedance control parameter K: when i is i _update When the number of the H is multiple, the elements of theta are sequentially arranged into a matrix of n x n, and H is partitioned to obtain the H

Wherein H ₂₁ Is a matrix with the same dimension as K; let K _updated ＝K-l*(H ₂₁ +KH ₂₂ )，K＝K _updated (ii) a l is an update weight;

s207, judging termination of the learning period: if i Δ T is not less than T _max If not, making i equal to i +1, returning to S203; t is _max Is a maximum learning cycle length;

s208, learning termination judgment: the control laws before and after the K-th learning period are respectively K _k X and u ═ K _k-1 X, if max (abs (K) _k -K _k-1 ) Is less than or equal to epsilon, the learning process is terminated and the obtained impedance controller is K _k X, otherwise, returning to S202; ε is the termination determination threshold.

Corresponding to the implementation process of the method, the invention also provides an impedance controller design system based on reinforcement learning, which comprises the following steps:

the control target design module is used for designing a reward function and a value function; set up as a reward function

A cost function set to

Wherein the content of the first and second substances,

respectively representing the current position, speed and external force of the controlled system; q _f ，Q _x ，Q _v Are respectively outsideThe weight of force, position, velocity in the impedance controller design objective; and u is KX is the control input of the system; k is the impedance control parameter to be designed and optimized,

and the impedance control parameter optimization module is used for estimating theta by adopting a reinforcement learning method based on the reward function and the value function to obtain an optimal impedance control parameter K and complete the design of the impedance controller.

In general, the above technical solutions contemplated by the present invention can achieve the following advantageous effects compared to the prior art.

(1) The method comprehensively considers the control input, the position and the speed of the controlled system and the influence of the external force, designs an effective reward function and a cost function by utilizing the direct proportional relation between the external force and the position of the controlled system, can design an optimal impedance controller by reinforcement learning under the condition that a system model and an environment model are unknown, and can modify the response characteristic of the system by adjusting parameters to generate an ideal robot impedance controller.

(2) The method of the invention defines the form of the cost function, greatly reduces the number of undetermined coefficients, does not need a complex deep network to fit the cost function, and greatly accelerates the learning process.

Drawings

FIG. 1 is a flow chart of a method for designing an impedance controller based on reinforcement learning according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Referring to fig. 1, the method for designing an impedance controller based on reinforcement learning according to the present invention includes:

s1, designing a reward function and a value function;

s101, regarding an external force f borne by a controlled system as a part of a system state, and obtaining an augmented state vector

Respectively representing the current position, speed and external force of the controlled system; the impedance controller is configured in the form of u-KX, where u is the control input to the system and K is the impedance control parameter to be designed and optimized.

S102, inputting control u, the position q and the speed of a controlled system

Q ₁ ，Q ₂ ，Q ₃ for controlling the weights, all are positive real numbers.

S103, the external force F borne by the controlled system is a direct proportional function of the position of the controlled system, namely F is F _e And q, substituting the cost function to obtain:

it is written in matrix form:

s104, designing the reward function as the inverse number of the control function:

the merit function being designed as an accumulation of reward functions, i.e.

Set the cost function to

Representing the kronecker product of the matrix. Theta is the control weight to be estimated as Q ₁ ，Q ₂ ，Q ₃ A time-value function parameter.

The reward function represents the design target of the impedance controller, the reward function provided by the invention comprehensively considers the control input, the position and the speed of the controlled system and the influence of the external force, and the form of the reward function has sufficient physical meaning by utilizing the direct proportional relation between the external force and the position of the controlled system, so that the stability and the convergence of the learning process can be ensured in the actual use. At the same time, Q _f ，Q _x ，Q _v And the weights of the external force, the position and the speed in the design target of the impedance controller are respectively, so that a user can flexibly adjust the design target according to the application scene requirement to obtain the optimal impedance controller under the scene.

In the reinforcement learning process, a real value function is finally obtained through continuous learning iteration. A wrong cost function form will lead to learning failure, and a complex cost function form (such as a deep neural network) will make the learning process tedious. The value function form provided by the invention greatly reduces the number of undetermined coefficients, does not need a complex deep network to fit the value function, and greatly accelerates the learning process.

And S2, estimating theta by adopting a reinforcement learning method based on the reward function and the value function to obtain an optimal impedance control parameter K, and finishing the design of the impedance controller.

The method for finishing the design of the controller by adopting the reinforcement learning method comprises the following steps:

s201, initializing a learning process, setting K as a zero vector, setting theta as a zero vector, and setting an updating period i _update ，i _update Is a positive integer.

S202, initializing a learning period, and setting a controlled system to be in an initial state X _ΔT And setting a learning parameter P as δ H, wherein δ is a positive integer, and H is an identity matrix of n by n. Recording system stateIs X _iΔT Where i is 1, 2, 3 … and Δ T is the control period of the controlled system. When i is 1.

S203, calculating control input u ═ KX _iΔT + σ Rand; rand is a random number and σ is a weighting factor.

S204, calculating a reward function

S205, obtaining a system state X _(i+1)ΔT Updating a system value function parameter theta and a learning parameter P:

θ＝θ+gradient

the method for updating the parameters of the value function has small calculation amount, can update the value function in real time in the learning process, and improves the convergence speed of the value function.

S206, updating an impedance control parameter K: when i is i _update When the multiple of the number of the elements is multiple, the elements of theta are sequentially arranged into a matrix of n x n, and H is partitioned to obtain the multiple of the number of the elements of theta

Wherein H ₂₁ Is a matrix with the same dimension as K; let K _updated ＝K-l*(H ₂₁ +KH ₂₂ )，K＝K _updated (ii) a l is the update weight.

The method provides an analytic control parameter updating strategy, and is low in calculation complexity and high in convergence speed. At the same time, i can be adjusted _update Adjusting control parameter update frequencyAnd adjusting the updating speed of the control parameters by adjusting the I, realizing effective control on the learning process, and effectively avoiding the situations of redundant learning process caused by too slow learning and incapability of convergence of the control parameters caused by too fast learning.

S207, judging termination of the learning period: if i Δ T is not less than T _max If not, i is made to be i +1, and the process returns to S203. T is _max Is the maximum length of a learning cycle.

S208, learning termination judgment: the control laws recorded before and after the K-th learning period are respectively u-K _k X and u ═ K _k-1 X, if max (abs (K) _k -K _k-1 ) Is less than or equal to epsilon, the learning process is terminated and the obtained impedance controller is K _k And X, otherwise, returning to S202.ε is the termination decision threshold.

In order to verify the effectiveness of the method, simulation and experimental verification are carried out according to the method, and the result shows that by adopting the method, the kinetic parameters of a controlled system do not need to be obtained, and the learning length is T after 10 learning lengths _max After a learning period of 250 Δ T, an optimal impedance controller can be generated, which has obvious advantages compared with deep reinforcement learning which often requires thousands of training.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. An impedance controller design method based on reinforcement learning is characterized by comprising the following steps:

s1, designing a reward function and a value function; set up as a reward function

Set as a cost function

Wherein the content of the first and second substances,

q,

f respectively represents the current position, speed and external force of the controlled system; q _f ，Q _x ，Q _v Weights for external force, position, velocity in the impedance controller design objective, respectively; and u is KX is the control input of the system; k is the impedance control parameter to be designed and optimized,

representing a kronecker product of the matrix; theta is a cost function parameter;

s2, estimating theta by adopting a reinforcement learning method based on the reward function and the value function to obtain an optimal impedance control parameter K, and finishing the design of the impedance controller; step S2 specifically includes:

s203, calculating control input u ═ KX _iΔT + σ Rand, Rand being a random number, σ being a weight factor; x _iΔT The system state of the current control period is 1, 2, 3 …, and Δ T is the control period of the controlled system;

s204, calculating a reward function

gradient is the median, gamma is the predictor, 0<γ<1；

s207, learning period termination judgment: if i Δ T is not less than T _max If not, making i equal to i +1, returning to S203; t is _max Is a maximum learning cycle length;

s208, learning termination judgment: the control laws before and after the K-th learning period are respectively K _k X and u ═ K _k-1 X, if max (abs (K) _k -K _k-1 ) Is less than or equal to epsilon, the learning process is terminated and the obtained impedance controller is K _k X, otherwise, returning to S202; ε is the termination decision threshold.

2. The method as claimed in claim 1, wherein the step S1 specifically includes:

s101, regarding an external force f borne by a controlled system as one of system statesIn part, obtaining an augmented state vector of

q,

f respectively represents the current position, speed and external force of the controlled system; the form of the impedance controller is set as that u is KX, u is the control input of the system, and K is the impedance control parameter to be designed and optimized;

Q ₁ ，Q ₂ ，Q ₃ for controlling the weight, the values are positive and real;

s103, changing F to F _e Substituting q into the cost function to obtain:

in the form of a matrix of

S104, designing the reward function as the inverse number of the cost function:

the cost function being designed as an accumulation of reward functions

3. The method of claim 2, wherein the element in the augmented state vector X is arranged in an arbitrary order, K, c _k 、r、Q _v (X,u)、

4. An impedance controller design system based on reinforcement learning, comprising:

Set as a cost function

Wherein the content of the first and second substances,

q,

the impedance control parameter optimization module is used for estimating theta by adopting a reinforcement learning method based on the reward function and the value function to obtain an optimal impedance control parameter K and complete the design of the impedance controller; the implementation process of the impedance control parameter optimization module specifically comprises the following steps:

s204, calculating a reward function

θ＝θ+gradient

gradient is the median, gamma is the predictor, 0<γ<1；

5. The reinforcement learning-based impedance controller design system according to claim 4, wherein the control target design module is implemented by:

the external force f borne by the controlled system is regarded as a part of the system state, and an augmented state vector is obtained

q,

inputting control u, current position q and speed of the controlled system

changing F to F _e Substituting q into the cost function to obtain:

in the form of a matrix

The reward function is designed as the inverse of the cost function:

the cost function being designed as an accumulation of reward functions

6. The reinforcement learning-based impedance controller design system of claim 5, wherein the arrangement order of the elements in the augmented state vector X is arbitrary, K, c _k 、r、Q _v (X,u)、