CN114378820B - Robot impedance learning method based on safety reinforcement learning - Google Patents


Info

Publication number
CN114378820B
Authority
CN
China
Prior art keywords
robot
learning
impedance
information
input
Prior art date
Legal status
Active
Application number
CN202210055753.7A
Other languages
Chinese (zh)
Other versions
CN114378820A
Inventor
潘永平
冯晓欣
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210055753.7A
Publication of CN114378820A
Application granted
Publication of CN114378820B


Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1628Programme controls characterised by the control loop
    • B25J9/163Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • B25J9/161Hardware, e.g. neural networks, fuzzy logic, interfaces, processor

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a robot impedance learning method based on safety reinforcement learning, which comprises the following steps: the controller outputs control moment, and calculates the position information and the speed information of the Cartesian space of the robot according to a robot dynamics equation; constructing an input item according to the position information, the speed information and the return information of a learning algorithm; determining a decision action according to the input item, and further determining an impedance parameter; taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force; and taking the impedance parameter and the environment interaction force as input of an admittance model, and determining a reference position and a reference speed according to the admittance model as target input of a controller. The invention has high stability, improves the feasibility of admittance control, and can be widely applied to the technical field of artificial intelligence.

Description

Robot impedance learning method based on safety reinforcement learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a robot impedance learning method based on safety reinforcement learning.
Background
Impedance control is an effective method for controlling the interaction force of a robot. Desired impedance parameters usually need to be given so that an impedance controller can be designed to control the robot to achieve the desired interaction force. However, due to the unknowns and uncertainties of the external environment, it is often necessary to introduce virtual compliance into the system to ensure the safety of the interaction process.
The existing robot impedance learning method has the following defects:
1. Traditional optimization methods (e.g., gradient descent optimization) require the environment model and the robot model to be known in order to learn the impedance parameters.
2. Existing reinforcement learning algorithms applied to impedance learning, such as the Deep Deterministic Policy Gradient (DDPG) algorithm, have certain drawbacks: the Actor network in DDPG is updated without delay and no noise is added to its output action, which easily makes the algorithm unstable, so it is not the most suitable algorithm for impedance learning in the robot shaft hole assembly task.
3. The existing Probabilistic Inference for Learning COntrol (PILCO) algorithm for impedance learning is a model-based reinforcement learning algorithm, which first needs to model and predict future states with a Gaussian Process (GP), and such modeling carries a certain model error.
4. General reinforcement learning introduces random exploration into the solution process, and exploration that is not safety-limited is likely to present significant risks. If reinforcement learning is applied directly to real-world tasks and the agent is allowed to explore by trial and error, the decisions made may drive the system into a dangerous state.
Disclosure of Invention
In view of the above, the embodiment of the invention provides a robot impedance learning method with high stability based on safety reinforcement learning, so as to improve the feasibility of variable admittance control.
An aspect of the present invention provides a robot impedance learning method based on safety reinforcement learning, including:
the controller outputs control moment, and calculates the position information and the speed information of the Cartesian space of the robot according to a robot dynamics equation;
constructing an input item according to the position information, the speed information and the return information of a learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environment interaction force as input of an admittance model, and determining a reference position and a reference speed as target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot motion.
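For illustration only, the following Python-style sketch outlines one iteration of the above control and learning loop; the callables controller, robot_dynamics, agent, environment and admittance are hypothetical placeholders and are not part of the claimed method:

def learning_control_iteration(controller, robot_dynamics, agent, environment, admittance,
                               reference, last_action, last_reward):
    # 1. The controller outputs a control torque for the current reference trajectory,
    #    and the robot dynamics return the Cartesian position and speed of the end effector.
    tau = controller(reference)
    x, x_dot = robot_dynamics(tau)

    # 2./3. Position, speed and the return of the learning algorithm form the input item;
    #       the decision action of the agent is used as the impedance parameter K.
    K = agent((x, x_dot, last_action, last_reward))

    # 4. The environment module maps the end-effector motion to the interaction force F_e.
    F_e = environment(x, x_dot)

    # 5. The admittance model maps (K, F_e) to a reference position and speed,
    #    which become the target input of the controller for the next step.
    x_r, x_r_dot = admittance(K, F_e)
    return (x_r, x_r_dot), K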
Optionally, the calculating the position information and the speed information of the cartesian space of the robot according to the robot dynamics equation specifically includes:
calculating the position information and the speed information through a robot dynamics equation;
the expression of the robot dynamics equation is as follows:
M(q)q̈ + C(q, q̇)q̇ + G(q) + f(q, q̇) = τ + τ_e;

wherein τ is the joint torque; M(q) is the joint space inertia matrix; C(q, q̇) is the Coriolis force and centripetal force coupling matrix; G(q) is the gravity term; f(q, q̇) is the non-rigid body dynamic term such as friction; τ_e is the moment exerted by the environment in the joint space; q and q̇ are the joint space position and speed information of the robot;

the position information is Cartesian space position information, and the calculation formula of the Cartesian space position information is as follows:

x = ψ(q);

the calculation formula of the speed information is as follows:

ẋ = J(q)q̇;

wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics; J(q) is the Jacobian matrix.
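As a purely illustrative numerical example (not part of the disclosure), the sketch below evaluates x = ψ(q) and ẋ = J(q)q̇ for an assumed planar two-link arm; the link lengths and joint values are arbitrary:

import numpy as np

# Assumed two-link planar arm standing in for psi(.) and J(q).
l1, l2 = 0.4, 0.3

def psi(q):
    """Forward kinematics x = psi(q) of the assumed two-link arm."""
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def jacobian(q):
    """Geometric Jacobian J(q) of the same arm."""
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

q = np.array([0.3, 0.5])         # joint positions
q_dot = np.array([0.1, -0.2])    # joint speeds
x = psi(q)                       # Cartesian position
x_dot = jacobian(q) @ q_dot      # Cartesian speed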
Optionally, the determining a decision action according to the input item, and further determining an impedance parameter, includes:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain decision actions;
and taking the decision action as an impedance parameter.
Optionally, the calculation formula of the environment interaction force is:
F_e = F_d + M_d(ẍ - ẍ_d) + D_d(ẋ - ẋ_d) + K_d(x - x_d);

wherein F_e represents the environment interaction force; D_d, K_d and M_d are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; F_d is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian space position, speed and acceleration of the robot; x_d, ẋ_d and ẍ_d are respectively the desired position, desired speed and desired acceleration of the robot in Cartesian space.
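A minimal sketch of the environment interaction force computation reconstructed above is given below; the diagonal matrices M_d, D_d, K_d and all trajectory values are assumed inputs:

import numpy as np

def interaction_force(M_d, D_d, K_d, F_d, x, x_dot, x_ddot, x_d, x_d_dot, x_d_ddot):
    # F_e = F_d + M_d (xdd - xdd_d) + D_d (xd - xd_d) + K_d (x - x_d)
    return (F_d
            + M_d @ (x_ddot - x_d_ddot)
            + D_d @ (x_dot - x_d_dot)
            + K_d @ (x - x_d))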
Optionally, the admittance model has the expression:
M_d(ẍ_r - ẍ_d) + D_d(ẋ_r - ẋ_d) + K_d(x_r - x_d) = F_e - F_d;

wherein x_r, ẋ_r and ẍ_r are respectively the Cartesian space reference position, reference speed and reference acceleration of the robot constrained after the robot is subjected to the external force.
Optionally, the evaluating and optimizing the processed input item through the Critic network and the Actor network to obtain a decision action includes:
respectively inputting state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises the position information and the speed information of the end effector of the current mechanical arm;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
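By way of illustration, a minimal Actor network of the kind described above could map the state to a bounded impedance parameter as sketched below; the layer sizes and stiffness bounds are assumptions, not values from the disclosure:

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the state (position, speed, previous action) to a bounded impedance parameter K."""
    def __init__(self, state_dim, action_dim, k_min=10.0, k_max=1000.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.k_min, self.k_max = k_min, k_max

    def forward(self, state):
        a = self.net(state)                                  # decision action in (-1, 1)
        # Scale the decision action into an assumed stiffness range used as the impedance parameter.
        return self.k_min + 0.5 * (a + 1.0) * (self.k_max - self.k_min)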
Optionally, the method further comprises the steps of: safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the method specifically comprises the following steps:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
according to the actual task, optimizing and adjusting the loss function;
and performing safety reinforcement learning according to the adjusted loss function.
Another aspect of the embodiments of the present invention further provides a robot impedance learning device based on safety reinforcement learning, including:
the first module is used for outputting control moment by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to the dynamics equation of the robot;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
the third module is used for determining a decision action according to the input item so as to determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module, and calculating to obtain the environment interaction force;
a fifth module, configured to take the impedance parameter and the environmental interaction force as input of an admittance model, and determine a reference position and a reference speed according to the admittance model as target input of a controller;
wherein the target input is for causing the controller to control the robot motion.
Another aspect of the embodiment of the invention also provides an electronic device, which includes a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the embodiments of the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing method.
The controller of the embodiment of the invention outputs control moment, and calculates the position information and the speed information of the Cartesian space of the robot according to a robot dynamics equation; constructing an input item according to the position information, the speed information and the return information of a learning algorithm; determining a decision action according to the input item, and further determining an impedance parameter; taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force; and taking the impedance parameter and the environment interaction force as input of an admittance model, and determining a reference position and a reference speed according to the admittance model as target input of a controller. The invention has high stability and improves the feasibility of admittance control.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a learning control framework provided by an embodiment of the present invention;
fig. 2 is an assembly schematic diagram of a shaft hole of a robot according to an embodiment of the present invention;
FIG. 3 is a network architecture diagram of the twin delayed deep deterministic policy gradient (Twin Delayed Deep Deterministic Policy Gradient, TD3) algorithm provided by an embodiment of the invention;
FIG. 4 is a schematic diagram of a constrained Markov decision process framework provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
First, the technical terms related to the present invention are described:
admittance control: the compliance control is a robot interaction control method, and the compliance control is not intended to independently control the position and interaction force of the robot, but rather is intended to realize a specified dynamic relationship between the interaction force and the position error, namely, to carry out impedance or admittance shaping on the robot, and virtual compliance is introduced into the system in a control mode so as to ensure the safety of the interaction process. The compliant control is divided into impedance control based on force control and impedance control based on position control, the latter is abbreviated as admittance control, and the idea is to modulate the robot in the interaction process into a second-order admittance model by a control mode, wherein the model comprises three impedance parameters of inertia, damping and rigidity.
Variable admittance control: a fixed admittance model means that the inertia, damping and stiffness parameters in the admittance model are fixed. Under many conditions, interaction control with a fixed admittance model cannot achieve the expected effect of adapting to the environment and task, so the concept of variable admittance control is introduced: the impedance parameters are adjusted according to the specific environment and task, so that the robot complies better with the environmental force and achieves compliant operation in an unknown dynamic environment.
Impedance learning: the process of adjusting the impedance parameters is commonly referred to as impedance learning. Common impedance adjustment methods include imitation learning, iterative learning, gradient descent optimization, neural networks, reinforcement learning, and the like.
Reinforcement learning: reinforcement learning can overcome the limitation that traditional optimal control algorithms cannot fully model the environment, finding the optimal solution through interaction with the environment. In robotic applications, one of the main objectives of reinforcement learning is to have the robot interact with the environment fully autonomously; an important feature is learning the optimal behavior without human involvement and without requiring models of the robot and the environment system. In the robot variable impedance learning task, the main purpose of reinforcement learning is to autonomously learn and adjust the impedance parameters of the robot so as to exhibit more appropriate compliance.
Constrained Markov Decision Process (CMDP): the Markov Decision Process (MDP) is a mathematical model of sequential decision making, used to describe the stochastic policies and rewards achievable by an agent in an environment where the system state has the Markov property; almost all reinforcement learning problems can be cast as MDPs, and the MDP is used to model the reinforcement learning problem. By means of dynamic programming, random sampling, and the like, the MDP can solve for the agent policy that maximizes the return. The Constrained Markov Decision Process (CMDP) additionally introduces a loss function and constraints, and the objective of the CMDP problem is to maximize the long-term return while satisfying all constraint conditions.
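For reference, the standard textbook form of the CMDP objective can be written as follows (a general formulation; the discount factor γ and the per-step cost c are generic symbols, not quoted from this description):

\max_{\pi}\; J(\pi)=\mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\Big]
\quad\text{s.t.}\quad
J_{C}(\pi)=\mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}\,c(s_t,a_t)\Big]\le d .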
Aiming at the problems existing in the prior art, the embodiment of the invention provides a robot impedance learning method based on safety reinforcement learning, which comprises the following steps:
the controller outputs control moment, and calculates the position information and the speed information of the Cartesian space of the robot according to a robot dynamics equation;
constructing an input item according to the position information, the speed information and the return information of a learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environment interaction force as input of an admittance model, and determining a reference position and a reference speed as target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot motion.
Optionally, the calculating the position information and the speed information of the cartesian space of the robot according to the robot dynamics equation specifically includes:
calculating the position information and the speed information through a robot dynamics equation;
the expression of the robot dynamics equation is as follows:
M(q)q̈ + C(q, q̇)q̇ + G(q) + f(q, q̇) = τ + τ_e;

wherein τ is the joint torque; M(q) is the joint space inertia matrix; C(q, q̇) is the Coriolis force and centripetal force coupling matrix; G(q) is the gravity term; f(q, q̇) is the non-rigid body dynamic term such as friction; τ_e is the moment exerted by the environment in the joint space; q and q̇ are the joint space position and speed information of the robot;

the position information is Cartesian space position information, and the calculation formula of the Cartesian space position information is as follows:

x = ψ(q);

the calculation formula of the speed information is as follows:

ẋ = J(q)q̇;

wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics; J(q) is the Jacobian matrix.
Optionally, the determining a decision action according to the input item, and further determining an impedance parameter, includes:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain decision actions;
and taking the decision action as an impedance parameter.
Optionally, the calculation formula of the environment interaction force is:
F_e = F_d + M_d(ẍ - ẍ_d) + D_d(ẋ - ẋ_d) + K_d(x - x_d);

wherein F_e represents the environment interaction force; D_d, K_d and M_d are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; F_d is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian space position, speed and acceleration of the robot; x_d, ẋ_d and ẍ_d are respectively the desired position, desired speed and desired acceleration of the robot in Cartesian space.
Optionally, the admittance model has the expression:
M_d(ẍ_r - ẍ_d) + D_d(ẋ_r - ẋ_d) + K_d(x_r - x_d) = F_e - F_d;

wherein x_r, ẋ_r and ẍ_r are respectively the Cartesian space reference position, reference speed and reference acceleration of the robot constrained after the robot is subjected to the external force.
Optionally, the evaluating and optimizing the processed input item through the Critic network and the Actor network to obtain a decision action includes:
respectively inputting state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises the position information and the speed information of the end effector of the current mechanical arm;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
Optionally, the method further comprises the steps of: safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the method specifically comprises the following steps:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
according to the actual task, optimizing and adjusting the loss function;
and performing safety reinforcement learning according to the adjusted loss function.
Another aspect of the embodiments of the present invention further provides a robot impedance learning device based on safety reinforcement learning, including:
the first module is used for outputting control moment by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to the dynamics equation of the robot;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
the third module is used for determining a decision action according to the input item so as to determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module, and calculating to obtain the environment interaction force;
a fifth module, configured to take the impedance parameter and the environmental interaction force as input of an admittance model, and determine a reference position and a reference speed according to the admittance model as target input of a controller;
wherein the target input is for causing the controller to control the robot motion.
Another aspect of the embodiment of the invention also provides an electronic device, which includes a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the embodiments of the present invention also provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing method.
The following describes the specific implementation of the present invention in detail with reference to the drawings of the specification:
as shown in fig. 1, fig. 1 is a learning control frame diagram according to an embodiment of the present invention.
Specifically, the overall learning control flow is as follows:
1) The inner loop is the control loop, which makes the robot system with unknown dynamic characteristics exhibit the behavior of a specified admittance model; that is, the controller outputs a control torque τ, and the actual Cartesian space position x and speed ẋ of the robot are calculated according to the robot dynamics equation

M(q)q̈ + C(q, q̇)q̇ + G(q) + f(q, q̇) = τ + τ_e,

wherein τ is the joint torque, M(q) is the joint space inertia matrix, C(q, q̇) is the Coriolis force and centripetal force coupling matrix, G(q) is the gravity term, f(q, q̇) is the non-rigid body dynamic term such as friction, τ_e is the moment exerted by the environment in the joint space, and q, q̇ are the joint space position and speed information of the robot. The corresponding Cartesian space position and speed of the robot are obtained through the conversion model, with the specific formulas:

x = ψ(q),

ẋ = J(q)q̇,

wherein ψ(·) is the robot forward kinematics and J(q) is the Jacobian matrix.
2) The actual position x and speed ẋ, together with the return r, are used as input items of the reinforcement learning algorithm. After the input state information is processed, the decision action, namely the impedance parameter K, is output through Critic network evaluation and the Actor network; the specific algorithm implementation process is shown in Fig. 3.
Fig. 3 shows the network architecture of the twin delayed deep deterministic policy gradient (TD3) algorithm.
In this embodiment, the impedance parameter is learned using the twin delayed deep deterministic policy gradient (Twin Delayed Deep Deterministic Policy Gradient, TD3) algorithm, wherein the state S comprises the current position x and speed ẋ of the robot end effector and the action K output at the previous moment; A represents the policy output by the Actor network; A' represents the policy output by the Actor target network; Q1 represents the value function computed by Critic network 1; Q2 represents the value function computed by Critic network 2; Q' represents the target value function; R represents the immediate reward; td_error1 represents the error obtained by subtracting Q1 from the weighted sum of R and Q'; td_error2 represents the error obtained by subtracting Q2 from the weighted sum of R and Q'; Target represents the weighted sum of R and Q'. The difference between the Actor network and the Actor target network is that the Actor network is updated from the experience pool at every step, while the Actor target network copies the Actor network's parameters at intervals to realize its update. Critic network 1 and Critic network 2 independently update their network parameters using the same target value function. The difference between Critic network 1 and Critic target network 1 is that Critic network 1 is updated from the experience pool at every step, while Critic target network 1 copies the parameters of Critic network 1 at intervals to realize its update.
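For illustration, one TD3 update step consistent with the structure of Fig. 3 could be sketched as follows; the hyperparameters are assumed, and the soft target update shown is the common TD3 variant of the periodic parameter copying described above:

import torch
import torch.nn.functional as F

def td3_update(batch, actor, actor_t, critic1, critic2, critic1_t, critic2_t,
               actor_opt, critic_opt, step, gamma=0.99, tau_soft=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    s, a, r, s_next, done = batch

    with torch.no_grad():
        # Target policy smoothing: clipped noise is added to the target actor's action A'
        # (clipping A' to the action range is omitted here for brevity).
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = actor_t(s_next) + noise
        # Clipped double-Q: Q' is the minimum of the two target critics.
        q_next = torch.min(critic1_t(s_next, a_next), critic2_t(s_next, a_next))
        target = r + gamma * (1.0 - done) * q_next          # Target = R + gamma * Q'

    # TD errors of the two critics (td_error1 / td_error2 in Fig. 3).
    critic_loss = F.mse_loss(critic1(s, a), target) + F.mse_loss(critic2(s, a), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Delayed policy update: the actor and the target networks update only every policy_delay steps.
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for net, net_t in [(actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)]:
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau_soft).add_(tau_soft * p.data)   # soft target update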
The return function defined in this embodiment is:

r = -a‖F_e - F_d‖² - b‖x - x_d‖² - c‖x - x_obj‖² + r_final;

wherein x_obj represents the target position and r_final is a positive integer.
The return function comprises four items in total. The first three items represent the instant return of each step and are respectively used for penalizing actions that generate large interaction forces, deviate from the desired trajectory, or move away from the target position; the last item indicates that the task is completed within the specified time, i.e., a reward is given when the target position is reached. The purpose of the return function is therefore to encourage movement towards the hole while suppressing behavior that would create large interaction forces.
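As an illustrative sketch only, the return function above could be computed as follows; the weights a, b, c and the value of r_final are unspecified in the description and are assumed here:

import numpy as np

def reward(F_e, F_d, x, x_d, x_obj, reached_goal, a=1.0, b=1.0, c=1.0, r_final=100.0):
    r = (-a * np.linalg.norm(F_e - F_d) ** 2
         - b * np.linalg.norm(x - x_d) ** 2
         - c * np.linalg.norm(x - x_obj) ** 2)
    if reached_goal:            # task completed within the allotted time
        r += r_final
    return r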
3) In addition, the actual position x and speed ẋ of the mechanical arm end effector are used as the input of the environment module, and the environment interaction force F_e is calculated. The environment interaction force is designed as follows:

F_e = F_d + M_d(ẍ - ẍ_d) + D_d(ẋ - ẋ_d) + K_d(x - x_d),

wherein D_d, K_d and M_d are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model, F_d is the desired interaction force, x, ẋ and ẍ are respectively the Cartesian space position, speed and acceleration of the robot, and x_d, ẋ_d and ẍ_d are respectively the desired position, desired speed and desired acceleration of the robot in Cartesian space.
4) The impedance parameter K and the environment interaction force F_e are used as the input of the admittance model, and the reference position x_r and reference speed ẋ_r are calculated from the admittance model and used as the input of the controller. The admittance model is as follows:

M_d(ẍ_r - ẍ_d) + D_d(ẋ_r - ẋ_d) + K(x_r - x_d) = F_e - F_d,

wherein x_r, ẋ_r and ẍ_r are respectively the Cartesian space reference position, reference speed and reference acceleration of the robot constrained after the robot is subjected to the external force, and the stiffness matrix K is the impedance parameter output by the learning algorithm. The reference position x_r and reference speed ẋ_r are specifically calculated from the admittance model by an integration method: the reference speed ẋ_r(t - τ_s) and reference position x_r(t - τ_s) of the previous moment t - τ_s are used to calculate the reference acceleration ẍ_r(t) of the current moment t, and then the reference speed ẋ_r(t) and reference position x_r(t) of the current moment are obtained through integration. The formulas are as follows:

ẍ_r(t) = ẍ_d + M_d⁻¹[(F_e - F_d) - D_d(ẋ_r(t - τ_s) - ẋ_d) - K(x_r(t - τ_s) - x_d)],

ẋ_r(t) = ẋ_r(t - τ_s) + ẍ_r(t)τ_s,

x_r(t) = x_r(t - τ_s) + ẋ_r(t)τ_s.
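A minimal sketch of this integration scheme is given below; M_d, D_d and the sampling period τ_s are assumed inputs, and K is the learned impedance parameter:

import numpy as np

def admittance_step(M_d, D_d, K, F_e, F_d, x_d, x_d_dot, x_d_ddot,
                    x_r_prev, x_r_dot_prev, tau_s):
    # Reference acceleration from the admittance model at the current moment.
    x_r_ddot = x_d_ddot + np.linalg.solve(
        M_d, (F_e - F_d) - D_d @ (x_r_dot_prev - x_d_dot) - K @ (x_r_prev - x_d))
    # Forward-Euler integration over one sample period tau_s.
    x_r_dot = x_r_dot_prev + x_r_ddot * tau_s
    x_r = x_r_prev + x_r_dot * tau_s
    return x_r, x_r_dot, x_r_ddot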
The final purpose of this scheme is to learn, by means of reinforcement learning, the impedance parameters used by the robot during task execution, so as to strengthen the compliant behavior of the robot and ensure the safety of both the robot and the environment. The learning curve converges after several hundred training episodes, and the optimal motion trajectory of the robot and the variation curve of the environment interaction force can be obtained through simulation, thereby completing the whole task.
Fig. 2 is a schematic diagram of robot shaft hole assembly, with the robot shaft hole assembly task as the environment background. The initial position of the mechanical arm end is x_0; the desired trajectory goes first from x_0 to x_1 and then from x_1 to x_2. When deviating from the desired trajectory, the mechanical arm experiences the environment interaction force generated by collision with the wall, and a compliant effect is produced by adjusting the impedance parameter, so that the environment interaction force is reduced.
FIG. 4 is a schematic diagram of a constrained Markov decision process framework in combination with which the present invention implements secure reinforcement learning.
Compared with a general Markov decision process, the constrained Markov decision process additionally introduces a loss function c and sets a constraint threshold d. Let J_C(π) denote the long-term loss of policy π under the constraint; the set of feasible solutions is defined as Π_C = {π ∈ Π : J_C(π) ≤ d}. Then, under the condition that the constraint is satisfied, the policy maximizing the long-term return is searched: π* = argmax_{π∈Π_C} J(π). The loss function is designed as c = w‖F_e - F_d‖², wherein w is a parameter for adjusting the weight. The purpose of designing the loss function in this way is to constrain the environment interaction force within a safe range, and then realize learning of the impedance parameters within that safe range.
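For illustration, the loss function and one common (assumed here, not prescribed by this description) Lagrangian-style way of enforcing the constraint J_C(π) ≤ d can be sketched as:

import numpy as np

def safety_cost(F_e, F_d, w=1.0):
    # Per-step loss c = w * ||F_e - F_d||^2 penalizing unsafe interaction forces.
    return w * np.linalg.norm(F_e - F_d) ** 2

def penalized_reward(r, cost, lam):
    # Reward shaped by the constraint multiplier lambda.
    return r - lam * cost

def update_multiplier(lam, episode_cost, d, lr=1e-3):
    # Increase lambda when the long-term loss exceeds the threshold d, decrease it otherwise.
    return max(0.0, lam + lr * (episode_cost - d))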
In summary, the TD3 algorithm is applied to the impedance learning of the Panda robot for the first time, and the feasibility of performing the impedance learning by the Panda robot based on the reinforcement learning mode and further realizing the admittance variation control is verified; the invention also applies the safety reinforcement learning idea to impedance learning, combines the CMDP with the TD3 algorithm, and applies the safety reinforcement learning idea to the impedance learning task in the robot shaft hole assembly process, thereby guaranteeing the safety of the shaft hole assembly task.
Compared with the prior art, the invention has the following advantages:
Firstly, impedance learning based on the TD3 algorithm is applied to a Panda robot simulation platform for the first time, verifying the feasibility of the Panda robot performing impedance learning in a reinforcement learning manner and thereby realizing variable admittance control. Secondly, the deep reinforcement learning (TD3) algorithm has higher performance and higher stability, learns the optimal impedance parameters faster, and is more suitable for the impedance learning task in the robot shaft hole assembly process. Finally, the safety reinforcement learning idea is introduced and combined with the CMDP, so that a safety guarantee function is realized and the robot shaft hole assembly process is made safer.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments described above, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and these equivalent modifications or substitutions are included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. The robot impedance learning method based on the safety reinforcement learning is characterized by comprising the following steps of:
the controller outputs control moment, and calculates the position information and the speed information of the Cartesian space of the robot according to a robot dynamics equation;
constructing an input item according to the position information, the speed information and the return information of a learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environment interaction force as input of an admittance model, and determining a reference position and a reference speed as target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot motion.
2. The robot impedance learning method based on the safety reinforcement learning according to claim 1, wherein the calculation of the position information and the speed information of the cartesian space of the robot according to the robot dynamics equation is specifically as follows:
calculating the position information and the speed information through a robot dynamics equation;
the expression of the robot dynamics equation is as follows:
M(q)q̈ + C(q, q̇)q̇ + G(q) + f(q, q̇) = τ + τ_e;

wherein τ is the joint torque; M(q) is the joint space inertia matrix; C(q, q̇) is the Coriolis force and centripetal force coupling matrix; G(q) is the gravity term; f(q, q̇) is the non-rigid body dynamic term; τ_e is the moment exerted by the environment in the joint space; q and q̇ are the joint space position and speed information of the robot;

the position information is Cartesian space position information, and the calculation formula of the Cartesian space position information is as follows:

x = ψ(q);

the calculation formula of the speed information is as follows:

ẋ = J(q)q̇;

wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics; J(q) is the Jacobian matrix.
3. The method for learning impedance of a robot based on safety reinforcement learning of claim 1, wherein determining a decision action based on the input term, and thus determining an impedance parameter, comprises:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain decision actions;
and taking the decision action as an impedance parameter.
4. The robot impedance learning method based on safety reinforcement learning according to claim 1, wherein the calculation formula of the environment interaction force is:
F_e = F_d + M_d(ẍ - ẍ_d) + D_d(ẋ - ẋ_d) + K_d(x - x_d);

wherein F_e represents the environment interaction force; D_d, K_d and M_d are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; F_d is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian space position, speed and acceleration of the robot; x_d, ẋ_d and ẍ_d are respectively the desired position, desired speed and desired acceleration of the robot in Cartesian space.
5. The robot impedance learning method based on safety reinforcement learning of claim 4, wherein the admittance model has the expression:
M_d(ẍ_r - ẍ_d) + D_d(ẋ_r - ẋ_d) + K_d(x_r - x_d) = F_e - F_d;

wherein x_r, ẋ_r and ẍ_r are respectively the Cartesian space reference position, reference speed and reference acceleration of the robot constrained after the robot is subjected to the external force.
6. The method for learning impedance of a robot based on safety reinforcement learning according to claim 3, wherein the evaluating and optimizing the processed input item through a Critic network and an Actor network to obtain a decision action comprises:
respectively inputting state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises the position information and the speed information of the end effector of the current mechanical arm;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
7. The robot impedance learning method based on safety reinforcement learning of claim 6, further comprising the steps of: safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the method specifically comprises the following steps:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
according to the actual task, optimizing and adjusting the loss function;
and performing safety reinforcement learning according to the adjusted loss function.
8. The utility model provides a robot impedance learning device based on safe reinforcement study which characterized in that includes:
the first module is used for outputting control moment by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to the dynamics equation of the robot;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
the third module is used for determining a decision action according to the input item so as to determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module, and calculating to obtain the environment interaction force;
a fifth module, configured to take the impedance parameter and the environmental interaction force as input of an admittance model, and determine a reference position and a reference speed according to the admittance model as target input of a controller;
wherein the target input is for causing the controller to control the robot motion.
9. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executing the program implements the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the storage medium stores a program that is executed by a processor to implement the method of any one of claims 1 to 7.
CN202210055753.7A 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning Active CN114378820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210055753.7A CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210055753.7A CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Publications (2)

Publication Number Publication Date
CN114378820A CN114378820A (en) 2022-04-22
CN114378820B true CN114378820B (en) 2023-06-06

Family

ID=81203767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210055753.7A Active CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Country Status (1)

Country Link
CN (1) CN114378820B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421387B (en) * 2022-09-22 2023-04-14 中国科学院自动化研究所 Variable impedance control system and control method based on inverse reinforcement learning

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153153B (en) * 2017-12-19 2020-09-11 哈尔滨工程大学 Learning variable impedance control system and control method
DE102019006725B4 (en) * 2018-10-02 2023-06-01 Fanuc Corporation control device and control system
CN112847235B (en) * 2020-12-25 2022-09-09 山东大学 Robot step force guiding assembly method and system based on deep reinforcement learning
CN112757344B (en) * 2021-01-20 2022-03-11 清华大学 Robot interference shaft hole assembling method and device based on force position state mapping model
CN113341706B (en) * 2021-05-06 2022-12-06 东华大学 Man-machine cooperation assembly line system based on deep reinforcement learning
CN113352322B (en) * 2021-05-19 2022-10-04 浙江工业大学 Adaptive man-machine cooperation control method based on optimal admittance parameters
CN113510704A (en) * 2021-06-25 2021-10-19 青岛博晟优控智能科技有限公司 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Also Published As

Publication number Publication date
CN114378820A (en) 2022-04-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant