CN116460860B

CN116460860B - Model-based robot offline reinforcement learning control method

Info

Publication number: CN116460860B
Application number: CN202310725865.3A
Authority: CN
Inventors: 尚伟伟; 李想; 丛爽
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2023-06-19
Filing date: 2023-06-19
Publication date: 2023-10-20
Anticipated expiration: 2043-06-19
Also published as: CN116460860A

Abstract

The invention discloses a model-based robot offline reinforcement learning control method, and belongs to the field of robot control. The method comprises the following steps: deep learning respectively establishes a depth kinematic model based on jacobian and a depth dynamics model based on Lagrange, which are corresponding to the mechanical arm of the robot, a depth transfer model is established by the depth kinematic model based on jacobian and the depth dynamics model based on Lagrange, the depth transfer model is used for establishing a Markov decision process model of a mechanical arm track tracking task, offline reinforcement learning based on the model is carried out through a Soft Actor-Critic reinforcement learning algorithm to obtain a control strategy, and the mechanical arm is controlled by combining a traditional calculation moment controller. The method greatly reduces the sample complexity of the reinforcement learning control of the robot, improves the precision of the track tracking task, and has stronger generalization and robustness.

Description

Model-based robot offline reinforcement learning control method

Technical Field

The invention relates to the field of high-precision track tracking tasks of robots, in particular to a model-based robot offline reinforcement learning control method.

Background

Reinforcement learning algorithms provide a powerful framework to solve sequential decision problems, and recent deep learning techniques have accelerated the development of model-free reinforcement learning algorithms. However, these algorithms are rarely applied directly to real world physical systems, especially robotic systems, because they have high sample complexity and intermediate strategies in the training process may be detrimental to robotic systems and the environment.

In contrast, the model-based reinforcement learning algorithm performs trajectory simulation and planning through a learning system and an environmental state transition model, so that the sample complexity of the algorithm is reduced. For the track tracking control task of the robot, the traditional dynamic control method based on the model, such as the augmented PD and the calculated moment control, has the advantages of higher tracking precision, lower control energy consumption and the like. The existing robot control method based on the model needs to accurately obtain the kinematic model and the dynamic model of the robot first, and parameters of the controller need to be manually adjusted according to experience, but as the robot becomes more and more complex, how to obtain the accurate kinematic model and the dynamic model corresponding to the robot and automatically obtain the parameters of the controller, so that the problem to be solved is to realize high-precision robot control.

In view of this, the present invention has been made.

Disclosure of Invention

The invention aims to provide a model-based robot offline reinforcement learning control method, which can obtain a kinematic model and a dynamic model corresponding to a robot through deep learning and automatically obtain parameters of a controller, and is combined with a traditional calculation moment controller to realize a high-precision track tracking task of the robot in a joint space and an operation space, so that the technical problems in the prior art are well solved.

The invention aims at realizing the following technical scheme:

the model-based robot offline reinforcement learning control method is characterized by comprising the following steps of:

step S1, respectively establishing a depth kinematic model based on jacobian and a depth kinematic model based on Lagrange, which correspond to the mechanical arm, through deep learning; the depth kinematic model is used for predicting the pose of the end effector of the mechanical arm and calculating a jacobian matrix corresponding to the mechanical arm; the depth dynamics model is used for predicting joint angles, angular velocities and angular accelerations of the mechanical arm to obtain state changes of joint spaces of the mechanical arm, and obtaining control moment of the calculation moment controller when the mechanical arm performs track tracking tasks;

Step S2, a random excitation track model described by finite Fourier series is established, a random excitation track given by the random excitation track model is used as an expected motion track to control the mechanical arm, the actual motion track of the mechanical arm is measured and collected to be used as a training data set, and the depth kinematic model based on the jacobian and the depth dynamics model based on the Lagrange established in the step S1 are trained respectively;

step S3, a Markov decision process model of a mechanical arm track tracking task is established, and a state transfer model in the Markov decision process model is a depth transfer model which is built by combining a trained depth kinematic model based on jacobian and a depth kinematic model based on Lagrange and simulates the mechanical arm motion track;

step S4, according to the Markov decision process model, offline learning is carried out through a Soft Actor-Critic reinforcement learning algorithm to calculate control parameters of a moment controller as a control strategy, mechanical arm simulation motion track data of offline interaction of the depth transfer model and the control strategy is collected, and an Actor network and a criticizer network of the Soft Actor-Critic reinforcement learning algorithm are updated until an optimal control strategy is obtained;

And S5, calculating the specific control moment of the mechanical arm by the calculated moment controller according to the optimal control strategy obtained in the step S4 to control the mechanical arm.

Compared with the prior art, the model-based robot offline reinforcement learning control method provided by the invention has the beneficial effects that:

the deep kinematic model and the deep dynamic model corresponding to the mechanical arm serving as the robot are established through deep learning, and offline learning of a software-Critic reinforcement learning algorithm is matched, so that the combination of the model and a traditional calculation moment controller is realized, and the mechanical arm is controlled to carry out high-precision track tracking tasks in joint space and operation space. The method greatly reduces the sample complexity of the reinforcement learning control of the robot, improves the precision of the track tracking task, and has stronger generalization and robustness.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic flow chart of a model-based offline reinforcement learning control method for a robot according to an embodiment of the present invention.

Fig. 2 is a specific flowchart of a model-based offline reinforcement learning control method for a robot according to an embodiment of the present invention.

Detailed Description

The technical scheme in the embodiment of the invention is clearly and completely described below in combination with the specific content of the invention; it will be apparent that the described embodiments are only some embodiments of the invention, but not all embodiments, which do not constitute limitations of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

Terms that may be used herein will be described first as follows:

the term "and/or" is intended to mean that either or both may be implemented, e.g., X and/or Y are intended to include both the cases of "X" or "Y" and the cases of "X and Y".

The terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.

The term "consisting of … …" is meant to exclude any technical feature element not explicitly listed. If such term is used in a claim, the term will cause the claim to be closed, such that it does not include technical features other than those specifically listed, except for conventional impurities associated therewith. If the term is intended to appear in only a clause of a claim, it is intended to limit only the elements explicitly recited in that clause, and the elements recited in other clauses are not excluded from the overall claim.

The terms "mounted," "connected," "secured," and the like are to be construed broadly as including, for example: the connecting device can be fixedly connected, detachably connected or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the terms herein above will be understood by those of ordinary skill in the art as the case may be.

The terms "center," "longitudinal," "transverse," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," etc. refer to an orientation or positional relationship based on that shown in the drawings, merely for ease of description and to simplify the description, and do not explicitly or implicitly indicate that the apparatus or element in question must have a particular orientation, be constructed and operated in a particular orientation, and therefore should not be construed as limiting the present disclosure.

The model-based robot offline reinforcement learning control method provided by the invention is described in detail. What is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. The specific conditions are not noted in the examples of the present invention and are carried out according to the conditions conventional in the art or suggested by the manufacturer. The reagents or apparatus used in the examples of the present invention were conventional products commercially available without the manufacturer's knowledge.

As shown in fig. 1 and 2, an embodiment of the present invention provides a model-based robot offline reinforcement learning control method for controlling a mechanical arm as a robot, including the following steps:

Preferably, in step S1 of the above method, the depth kinematic model based on jacobian corresponding to the mechanical arm is established through deep learning in the following manner, including:

the forward kinematics model of the mechanical arm is determined as follows:； wherein ,/>Is the pose of the end effector of the mechanical arm in the operating space, < >>Is the degree of freedom of the operating space of the mechanical arm; />Is the joint angle of the mechanical arm, < >>Is the degree of freedom of the joint space of the mechanical arm;

the velocity relationship between the joints of the robot and the end effector of the robot is expressed as the differential of the kinematic equation f with respect to time； wherein ,/>Is the derivative of the position of the mechanical arm end effector in the operation space relative to time;is the jacobian matrix actually corresponding to the mechanical arm,>is the degree of freedom of the operating space of the mechanical arm, +.>Is the degree of freedom of the joint space of the mechanical arm; />Is the angular velocity of the joint of the mechanical arm;

determining the pose of the mechanical arm end effector in an operation space asWherein p is the position of the robotic end effector; / >Is a quaternion representing the pose of the robotic arm end effector,, wherein />A rotation axis representing the quaternion,>、/> and />Three components of the rotation axis, respectively +.>Indicating the angle of rotation about the axis of rotation +.>Satisfy constraint->The method comprises the steps of carrying out a first treatment on the surface of the Quaternion->The relation between the derivative with respect to time and the angular velocity ω of the manipulator end-effector in the operating space is +.>Namely, the relationship is:

；

in the above-mentioned relational expression, the term "about",、/> and />The components of the angular velocity of the mechanical arm end effector in the x axis, the y axis and the z axis are respectively;

the fully-connected deep neural network is established to learn a forward kinematics model of the mechanical arm, and the deep kinematics model corresponding to the mechanical arm is obtained by the following steps:, wherein ,/>Network parameters of the depth kinematic model; />Is a kinematic equation obtained by deep learning; q is the joint angle of the mechanical arm;

applying a chain law to the derivatives of each layer in the depth kinematic model, iteratively calculating the derivatives of the output of the depth neural network relative to the input to recover and learn a jacobian matrix, the jacobian matrixExpressed as calculated by:

；

wherein ,network parameters of the depth kinematic model; />The deviation of the kinematic equation obtained for the deep learning on the joint angle; / >An i-th layer network layer which is a deep kinematic model,>，/>；，/>，/>the weight input by a previous network layer of the deep neural network is applied to the deep kinematic model; />Is the deviation; /> and />The nonlinear activation function and its derivative of the deep neural network, respectively.

Preferably, in the above method, the training data set of the depth kinematic modelWherein q is the joint angle of the mechanical arm; />The joint angular velocity of the mechanical arm and x are the pose of the mechanical arm end effector in an operation space; />A derivative of the position of the mechanical arm end effector in the operation space relative to time;

the loss function during the training of the depth kinematic model is as follows:

；

wherein ,network parameters of the depth kinematic model; />A training data set which is a depth kinematic model;is the size of the training dataset; />Is the generalized inverse of the jacobian matrix; />Is the size of the training dataset; first loss term->Is the mean square error of the actual pose and the predicted pose of the manipulator operation space; second loss term->Is the mean square error of the actual speed and the predicted speed of the operation space of the mechanical arm; last loss item->The depth neural network is enabled to be more accurate by fitting the angular velocity of the joint of the mechanical arm.

Preferably, in step S1 of the above method, the deep learning is performed to build a lagrangian-based deep dynamics model corresponding to the manipulator in the following manner, including:

the forward dynamics model and the reverse dynamics model of the mechanical arm are determined as follows:

；

wherein ,the joint angle, the joint angular velocity and the joint angular acceleration of the mechanical arm are respectively; />Is the force and moment acting on the joints of the mechanical arm;

selecting generalized coordinates of the mechanical arm as joint angles of the mechanical arm, and defining a Lagrangian function as, wherein ,/>Is the kinetic energy of the mechanical arm; />Is the potential energy of the mechanical arm; />Is the mass matrix of the mechanical arm; the superscript T denotes the transposed matrix;

to ensure the quality matrix of the mechanical armIs positive symmetry of +.>Performing Chelesky decomposition to obtain +.>, wherein ,/>For the lower triangular matrix of the non-negative diagonal, the superscript T represents the transposed matrix, and the lower triangular matrix of the non-negative diagonal is learned by using the deep neural network +.>And potential energy of mechanical arm->Fitting Lagrangian function +.>In combination with Euler-Lagrangian equation->Wherein F is generalized force and moment, and a depth forward dynamics model and a depth reverse dynamics model which form a depth dynamics model corresponding to the mechanical arm are respectively:

；

wherein ,、/> and />Respectively the joint angle, the joint angular velocity and the joint angular acceleration of the mechanical arm; />Is a mass matrix of the mechanical arm; />The derivative of the mass matrix of the mechanical arm with respect to time;coriolis force and centripetal force; />Is a conservative force comprising gravity and spring force; />The torque is output for the joints of the mechanical arm; />The joint friction force of the mechanical arm is obtained through an induced joint priori friction force model of the mechanical arm consisting of coulomb friction, viscous friction and Stribeck friction, and the joint priori friction force model of the mechanical arm is as follows:

；

wherein ,is coulomb friction; />Is a viscous friction coefficient; />Is the maximum static friction force; the v and the delta are the correlation coefficients of the Stribeck friction model respectively; />Is the angular velocity of the ith joint of the mechanical arm.

Preferably, in the above method, the training data set of the depth dynamics modelWherein q is the joint angle of the mechanical arm; />Is the joint angular velocity of the mechanical arm; />The joint angular acceleration of the mechanical arm; />The joint moment of the mechanical arm;

the loss function of the depth dynamics model is as follows:

；

wherein ,a training data set for a depth dynamics model; / >The size of the training data set for the depth dynamics model; />Is the angular acceleration of the joint of the mechanical arm>Regression loss (I)>、/>Respectively a predicted value and an actual value of the joint angular acceleration of the mechanical arm at the time t; />For the +.>Regression loss (I)>、/>Respectively a predicted value and an actual value of the joint moment of the mechanical arm at the moment t; />Is a multi-step predictive loss by numerical integration, < >>For the predicted total number of steps +.>、/>The actual values at time t+1 of the joint angle and the joint angular velocity of the mechanical arm, respectively,/-> and />Respectively the predicted values of the joint angle and the joint angular velocity of the mechanical arm at the time t+1.

Preferably, in step S2 of the above method, the random excitation trajectory model described by the finite fourier series is built in the following manner, including:

for joint i of mechanical arm at t moment, randomly exciting track modelThe definition is as follows:

；

wherein , and />The amplitude of cosine and the amplitude of sine respectively; />The frequency of cosine and the frequency of sine;the offset of the joint angle of the mechanical arm, and the offset of the amplitude, the cosine amplitude and the frequency of the sine and the joint angle are randomly selected within a safe range; / >For the number of Fourier series, randomly select +.1 to 3>T is the time value at time t, which is the variable when accumulated.

Preferably, in step S3 of the above method, a markov decision process model of a robot arm track tracking task is established in the following manner, and a state transition model of the markov decision process model is a depth transition model which is constructed by combining a trained depth kinematic model based on jacobian and a depth kinematic model based on lagrangian and simulates the robot arm motion track, and includes:

modeling a trajectory tracking task of a robotic arm as a finite time discounted discrete Markov decision process； wherein ,/>For the state space +.>A state at time t; />Is an action space; />The action is at time t; />Is a deep transfer model as a state transfer model; />Is a reward function; />Is a discount factor;

the method for constructing the depth transfer model simulating the motion trail of the mechanical arm by combining the trained depth kinematic model based on jacobian and the depth kinematic model based on Lagrange comprises the following steps:

according to the joint angle and the joint angular velocity of the mechanical arm at the time tJoint moment with mechanical arm- >The fourth-order Longgan-Kutta numerical integration method is utilized to combine with a Lagrange-based depth dynamics model to obtain a predicted value +.10 of the joint angle and the joint angular velocity of the mechanical arm at the time t+1>；

Predicted values of joint angle and joint angular speed of mechanical arm at time t+1 by using depth kinematic model based on jacobianObtaining predicted values of pose and speed of end effector of mechanical arm at time t+1 for input；

According to the obtained predicted values of the joint angle and joint angular velocity of the mechanical arm at the time t+1 and the pose and the velocity of the end effectorCalculating an error value required for completing a track tracking task by combining an expected track of the mechanical arm at the time t+1 to form a state of the track tracking task of the mechanical arm at the time t+1>；

Tracking the state of a task by utilizing the locus of the mechanical arm at the time t+1Construction of depth transfer model->In the depth transfer model +.> and />The state at the moment t and the state at the moment t+1 of the track tracking task of the mechanical arm are respectively +.>And controlling the action of strategy output for the time t.

Preferably, in step S4 of the above method, according to the markov decision process model, control parameters of a moment controller are calculated as a control strategy by offline learning through a Soft Actor-Critic reinforcement learning algorithm, mechanical arm simulation motion track data of offline interaction of the depth transfer model with the control strategy is collected, and an Actor network and a criticizer network of the Soft Actor-Critic reinforcement learning algorithm are updated until an optimal control strategy is obtained, including:

Respectively setting a state space and an action space of the Markov decision process model according to a track tracking task performed by the mechanical arm in a joint space and a track tracking task performed by an operation space, performing offline reinforcement learning based on the model through a Soft Actor-Critic reinforcement learning algorithm, and outputting control parameters of a calculation torque controller as a control strategy;

collecting mechanical arm simulation motion trail data of offline and control strategy interaction of the depth transfer modelAnd updating the Actor network and the Critic network of the Soft Actor-Critic reinforcement learning algorithm until an optimal control strategy is obtained.

Preferably, in the above method, the setting of the state space and the action space of the markov decision process model according to the trajectory tracking task performed by the mechanical arm in the joint space and the trajectory tracking task performed by the operation space includes:

setting the state of a mechanical arm joint space for track tracking task at t timeThe method comprises the following steps:

；

wherein ,、/>respectively a joint angle error and a joint angular velocity error of the mechanical arm; />Is the accumulated value of the joint angle error of the mechanical arm; />、/> and />Respectively the joint angle, the joint angular velocity and the joint angular acceleration of the expected track of the mechanical arm;

Actions used for track following task in mechanical arm joint spaceThe torque controller is designed as follows:

；

wherein , and />The joint angle and the joint angular velocity of the mechanical arm are respectively; />Is a mass matrix of the mechanical arm; />Coriolis force and centripetal force; />Is a conservative force comprising gravity and spring force; />The joint friction force of the mechanical arm; />、/> and />Respectively accumulating values of joint angle error, joint angular speed error and joint angle error of the mechanical arm; />、/>Respectively calculating control parameters of the moment controller;

setting the state of a track tracking task of an operation space of a mechanical arm at the time tThe method comprises the following steps:

；

wherein ,、/>the position and orientation error and the speed error vector of the mechanical arm end effector are respectively; />、/>The position error and the linear speed error of the mechanical arm end effector are respectively; />、/>Respectively an attitude error and an angular velocity error of the mechanical arm end effector; />The accumulated value of the pose errors of the mechanical arm end effector is obtained;、/> and />Respectively mechanical armsThe pose, velocity, and acceleration of the desired trajectory of the end effector; /> and />The position and the linear speed of the mechanical arm end effector are respectively; />、/> and />The position, the linear speed and the linear acceleration of the expected track of the mechanical arm end effector are respectively; / >Quaternion for representing the attitude of the end effector of a robotic arm> and />The rotation axis of the quaternion and the rotation angle around the rotation axis are respectively; />Quaternion for the pose of the desired trajectory of the manipulator end effector, +.> and />The quaternion rotation axis and the rotation angle around the rotation axis are respectively; />Andrespectively the machinesThe first derivative and the second derivative of the quaternion of the gesture of the desired trajectory of the manipulator end effector with respect to time; and />The angular speed of the mechanical arm end effector and the angular speed of the expected track are respectively carried out; the subscript t of each parameter indicates that the parameter is a parameter at time t;

the mechanical arm operation space performs the action used for track following taskAn acceleration-based operating space calculation torque controller designed as follows:

；

wherein , and />The joint angle and the joint angular velocity of the mechanical arm are respectively; />Is a mass matrix of the mechanical arm;coriolis force and centripetal force; />Is a conservative force comprising gravity and spring force; />The joint friction force of the mechanical arm; />Mechanical arm reference for operating spaceThe acceleration rate of the vehicle is calculated,、/>the position and orientation error and the speed error vector of the mechanical arm end effector are respectively; />Network parameters of the depth kinematic model; / >The generalized inverse matrix is a mechanical arm jacobian matrix; />、/>Selecting control moment of zero space for calculating control parameters of moment controller>The form of (2) is:

；

wherein , and />The joint angle and the joint angular velocity of the mechanical arm are respectively; />The initial value of the joint angle of the mechanical arm; />Is a unit matrix; />Network parameters of the depth kinematic model; />A generalized inverse matrix of a jacobian matrix of the mechanical arm; />Is a jacobian matrix of the mechanical arm; />、/>Respectively calculating control parameters of the moment controller;

the set reward function is a segment reward function, and the segment reward function is:

；

wherein , and />Weights with different control precision respectively, including +.>The term is used for the action of making strategy exploration error quickly reduce, contain +>The term is used for enabling the strategy learning to improve the precision; beta is a value for adjusting the proportion of different weights, and the value of beta is 0.75; />、/> and />The track tracking task of the joint space of the mechanical arm is respectively the accumulated values of joint angle errors, joint angular velocity errors and joint angle errors of the mechanical arm, and the track tracking task of the operation space of the mechanical arm is respectively the accumulated values of pose errors, velocity errors and pose errors of the end effector of the mechanical arm;

Updating the Actor network and the critics network of the Soft Actor-Critic reinforcement learning algorithm in the following manner comprises:

the Soft Actor-Critic reinforcement learning algorithm fits a state-action value function through a Critic networkTo evaluate the strategy, the error of the minimized bellman equation is updated by: />Wherein->For the state of the arm track tracking task at time t+1, < > the following point is given>Controlling the action of strategy output for the time t+1; />A state-action value function at time t; />To control policy->Is the expected value under the condition; t represents the time t; />For the discount factor, superscript t represents the t power of the discount factor; />Is a reward function; />The expected value under the condition of the state and the action at the time t+1; />A state-action value function at time t+1;

updating the Actor network of the Soft Actor-Critic reinforcement learning algorithm by the KL divergence of the following minimization strategy:

；

wherein ,is a sample in the control strategy distribution; pi is the distribution of control strategies represented by the actor's network;KL divergence to minimize policy; />Tracking the state of the task for the track of the mechanical arm at the moment t; />Is a quaternion representing the pose of the robotic arm end effector; / >Is a control strategy before updating; />A state-action value function before update; />Is a distribution function for normalizing the distribution;

the optimal control strategy obtained by the Soft Actor-Critic reinforcement learning algorithm is to introduce the maximum entropy target while maximizing the expected rewards, and the optimal control strategyThe method comprises the following steps:

；

wherein T is the length of the motion trail of the mechanical arm;is the expected value under the condition of the state and the action at the moment t; />Tracking the state of the task for the track of the mechanical arm at the moment t; />Controlling the action of strategy output for the time t; />Is a reward function; />Is a discount factor; />Entropy of control strategy; />The regularization coefficient is used for adjusting the proportion of entropy in the objective function in an adaptive updating mode to control the randomness of the strategy.

Preferably, in step S5 of the above method, the control parameters of the calculated torque controller are output by the calculated torque controller according to the optimal control strategy obtained in step S4And calculating by combining with a calculation torque controller to obtain the specific control torque of the mechanical arm.

In the control method of the embodiment of the invention, the depth kinematic model and the depth dynamics model obtained by the deep learning are both gray box models by introducing priori knowledge, so that the generalization of the models is improved, and the interpretable depth kinematic model and the interpretable depth dynamics model are obtained and are used for simulating the motion trail of the robot and optimizing the control moment of the robot. The invention greatly reduces the sample complexity of the robot reinforcement learning control, improves the track tracking precision, and has stronger generalization and robustness.

In order to clearly demonstrate the technical scheme and the technical effects provided by the invention, the model-based offline reinforcement learning control method of the robot provided by the embodiment of the invention is described in detail below by using specific embodiments.

Example 1

As shown in fig. 1 and fig. 2, the embodiment of the invention provides a model-based robot offline reinforcement learning control method, which respectively establishes a deep kinematic model and a deep dynamics model corresponding to a mechanical arm serving as a robot through deep learning, and realizes the combination with a traditional calculation moment controller, thereby completing the high-precision track tracking task of the mechanical arm in joint space and operation space. The method comprises the following steps:

first, a jacobian-based depth kinematic model corresponding to a robotic arm is established by deep learning, comprising:

the forward kinematics model of the mechanical arm is as follows:

；

wherein ,is the position and the gesture, namely the pose,is the joint angle of the mechanical arm, < >>Is a mechanical arm operation spaceFreedom of (A)>Is the degree of freedom of the joint space of the mechanical arm;

the differential of the kinematic equation f over time describes the velocity relationship between the joints of the robotic arm and the end effector of the robotic arm as:

；

wherein ,is the derivative of the position of the mechanical arm end effector in the operation space relative to time;is the jacobian matrix actually corresponding to the mechanical arm,>is the degree of freedom of the operating space of the mechanical arm, +.>Is the degree of freedom of the joint space of the mechanical arm; />Is the angular velocity of the joint of the mechanical arm. Quaternion using numerical robustness>To represent the pose of the end effector; quaternion is denoted +.>And meet the constraint, wherein />A rotation axis representing the quaternion,>、/> and />Three components of the rotation axis, respectively +.>Indicating the angle of rotation about the axis of rotation. Thus the pose of the end effector of the manipulator in the operating space can be expressed as +.>Where p is the position of the robotic end effector. Quaternion->The relationship between the derivative with respect to time and the operating spatial angular velocity ω is +.>The method comprises the following steps: />

；

the fully-connected deep neural network is established to learn a forward kinematics model of the mechanical arm, and the deep kinematics model corresponding to the mechanical arm is obtained as follows:, wherein ,/>Network parameters of the depth kinematic model; />Is a kinematic equation obtained by deep learning; q is the joint angle of the mechanical arm;

And applying a chain rule to the derivative of each layer of the obtained depth kinematic model, and iteratively calculating the derivative of the output of the depth neural network relative to the input to recover and learn the jacobian matrix. Learned jacobian matrixExpressed as calculated as follows:

；

wherein ,network parameters of the depth kinematic model; />The deviation of the kinematic equation obtained for the deep learning on the joint angle; />An i-th layer network layer which is a deep kinematic model,>，；/>，/>，/>the weight input by a previous network layer of the deep neural network is applied to the deep kinematic model; />Is the deviation; /> and />The nonlinear activation function and its derivative of the deep neural network, respectively.

Training data set of the depth kinematic modelThe training data set consists of the joint angle q and the joint angular velocity of the mechanical arm>Pose x of the end effector in the operating space and its derivative with respect to time>Composition is prepared.

Specifically, the training data set of the depth kinematic model is a training data set obtained by firstly establishing a random excitation track model described by a finite Fourier series, controlling the mechanical arm by taking a random excitation track given by the random excitation track model as an expected motion track, and measuring and collecting an actual motion track of the mechanical arm.

Preferably, the random excitation trajectory model described by the finite fourier series is built in the following manner, including:

；/>

wherein , and />The amplitude of cosine and the amplitude of sine respectively; />The frequency of cosine and the frequency of sine;the offset of the joint angle of the mechanical arm, and the offset of the amplitude, the cosine amplitude and the frequency of the sine and the joint angle are randomly selected within a safe range; />For the number of Fourier series, randomly select +.1 to 3>T is the time value at time t, which is the variable when accumulated.

The loss function of the deep kinematic model training is set as follows:

；

wherein ,network parameters of the depth kinematic model; />A training data set which is a depth kinematic model;is the size of the training dataset; />Is the generalized inverse of the jacobian matrix; />Is the size of the training dataset; first loss term->Is the mean square error of the actual pose and the predicted pose of the manipulator operation space; second loss term->Is the mean square error of the actual speed and the predicted speed of the operation space of the mechanical arm; last loss item- >The depth neural network is enabled to be more accurate by fitting the angular velocity of the joint of the mechanical arm.

Secondly, establishing a depth dynamics model corresponding to the mechanical arm through deep learning, wherein the method comprises the following steps:

the forward dynamics model and the reverse dynamics model of the mechanical arm are respectively expressed as:

；

wherein ,joint angle, angular velocity and angular acceleration of the robotic arm, respectively +.>Are forces and moments acting on the joints of the robot arm.

And deducing a dynamics model of the mechanical arm based on Lagrangian mechanics so as to establish a depth dynamics model.

The Lagrangian function is defined as, wherein />Is kinetic energy of mechanical arm->Is the potential energy of the mechanical arm->Is the mass matrix of the mechanical arm; the superscript T denotes the transposed matrix;

to ensure the quality matrix of the mechanical armIs positive symmetry of +.>Performing Chelesky decomposition to obtain +.>, wherein ,/>A lower triangular matrix which is a non-negative diagonal, and the superscript T represents a transposed matrix;

lower triangular matrix for learning non-negative diagonals by using deep neural networkAnd potential energy of mechanical arm->Fitting Lagrangian function +.>In combination with Euler-Lagrangian equation->Wherein F is generalized force and moment to obtain the component machineThe depth forward dynamics model and the depth reverse dynamics model of the depth dynamics model corresponding to the arm are respectively as follows: / >

；

wherein ,、/> and />Respectively the joint angle, the joint angular velocity and the joint angular acceleration of the mechanical arm; />Is a mass matrix of the mechanical arm; />The derivative of the mass matrix of the mechanical arm with respect to time;coriolis force and centripetal force; />Is a conservative force comprising gravity and spring force; />The torque is output for the joints of the mechanical arm; />The joint friction force of the mechanical arm is obtained by an induced joint priori friction force model of the mechanical arm consisting of coulomb friction, viscous friction and Stribeck frictionThe joint priori friction model is:

；

Training dataset of the depth dynamics modelThe training data set consists of the joint angle q and the joint angular velocity of the mechanical arm>Angular acceleration of joint->And joint moment->Composition is prepared.

Specifically, the training data set of the depth dynamics model is a training data set obtained by firstly establishing a random excitation track model described by a finite Fourier series, controlling the mechanical arm by taking a random excitation track given by the random excitation track model as an expected motion track, and measuring and collecting an actual motion track of the mechanical arm. The way of constructing the random excitation trajectory model described in the finite fourier series is the same as mentioned in the previous deep kinematic model and is not repeated here.

The loss function of the depth dynamics model is set as follows:

；

wherein ,a training data set for a depth dynamics model; />The size of the training data set for the depth dynamics model; />Is the angular acceleration of the joint of the mechanical arm>Regression loss (I)>、/>Respectively a predicted value and an actual value of the joint angular acceleration of the mechanical arm at the time t; />For the +.>Regression loss (I)>、/>Respectively a predicted value and an actual value of the joint moment of the mechanical arm at the moment t; />Is a multi-step predictive loss obtained by numerical integration, the loss term can beImproving depth dynamics model multi-step prediction accuracy for strategy optimization process in subsequent reinforcement learning>For the predicted total number of steps +.>、/>The actual values at time t+1 of the joint angle and the joint angular velocity of the mechanical arm, respectively,/-> and />Respectively the predicted values of the joint angle and the joint angular velocity of the mechanical arm at the time t+1.

Then, a Markov decision process model of a mechanical arm track tracking task is established, and a state transfer model in the Markov decision process model is a depth transfer model which is built by combining a trained depth kinematic model based on jacobian and a depth kinematic model based on Lagrange and simulates the mechanical arm motion track, and the method comprises the following steps:

Modeling a trajectory tracking task of a robotic arm as a finite time discounted discrete Markov decision process； wherein ,/>For the state space +.>Is a movement space->For the state at time t>At time tAction of (a)> Is a deep transition model as a state transition model, < +.>For rewarding function->Is a discount factor;

according to the joint angle and the joint angular velocity of the mechanical arm at the time tJoint moment with mechanical arm->The fourth-order Longgan-Kutta numerical integration method is utilized to combine with a Lagrange-based depth dynamics model to obtain a predicted value +.10 of the joint angle and the joint angular velocity of the mechanical arm at the time t+1>；

According to the obtained predicted values of the joint angle and joint angular velocity of the mechanical arm at the time t+1 and the pose and the velocity of the end effector Calculating an error value required for completing a track tracking task by combining an expected track of the mechanical arm at the time t+1 to form a state of the track tracking task of the mechanical arm at the time t+1>；

Tracking the state of a task by utilizing the locus of the mechanical arm at the time t+1Construction of depth transfer model->In the depth transfer model +.> and />The state of the mechanical arm track tracking task at the time t and the state at the time t+1 are respectively,and controlling the action of strategy output for the time t.

The depth transfer model is used for simulating the motion trail of the mechanical arm and providing offline interaction data for a model-based reinforcement learning method to optimize a control strategy.

Finally, according to a Markov decision process model of the mechanical arm track tracking task, an off-line reinforcement learning method based on the model comprises the following steps:

the state space, the action space and the rewarding function of the Markov decision process model are respectively set, and the control parameters of the calculation torque controller are obtained as an optimization control strategy by combining a Soft Actor-Critic (SAC) reinforcement learning method to realize an offline reinforcement learning method based on the model.

The state of the track tracking task of the joint space of the mechanical arm at the time t is as follows:

；/>

wherein ,、/>respectively a joint angle error and a joint angular velocity error of the mechanical arm; / >Is the accumulated value of the joint angle error of the mechanical arm; />、/> and />The joint angle, the joint angular velocity and the joint angular acceleration of the expected track of the mechanical arm are respectively.

The motion of the track tracking task of the joint space of the mechanical arm is solved by a calculation torque controller, and the calculation torque controller is in the form of:

；

wherein , and />The joint angle and the joint angular velocity of the mechanical arm are respectively; />Is a machineA mass matrix of the arm; />Coriolis force and centripetal force; />Is a conservative force comprising gravity and spring force; />The joint friction force of the mechanical arm; />、/> and />Respectively accumulating values of joint angle error, joint angular speed error and joint angle error of the mechanical arm; />、/>Respectively calculating control parameters of the moment controller;

the state of the track tracking task of the operation space of the mechanical arm at the time t is as follows:

；

wherein ,、/>the position and orientation error and the speed error vector of the mechanical arm end effector are respectively; />、/>The position error and the linear speed error of the mechanical arm end effector are respectively; />、/>Respectively an attitude error and an angular velocity error of the mechanical arm end effector; />The accumulated value of the pose errors of the mechanical arm end effector is obtained;、/> and />The pose, the speed and the acceleration of the expected track of the mechanical arm end effector are respectively; / > and />The position and the linear speed of the mechanical arm end effector are respectively; />、/> and />The position, the linear speed and the linear acceleration of the expected track of the mechanical arm end effector are respectively; />Quaternion for representing the attitude of the end effector of a robotic arm> and />The rotation axis of the quaternion and the rotation angle around the rotation axis are respectively; />Quaternion for the pose of the desired trajectory of the manipulator end effector, +.> and />The quaternion rotation axis and the rotation angle around the rotation axis are respectively; /> and />The first derivative and the second derivative of the quaternion of the gesture of the expected track of the mechanical arm end effector relative to time are respectively; /> and />The angular speed of the mechanical arm end effector and the angular speed of the expected track are respectively carried out; the subscript t of each parameter indicates that the parameter is a parameter at time t.

The motion used by the mechanical arm operation space for track tracking task is designed as an acceleration-based operation space calculation moment controller in the following form:

；

wherein , and />The joint angle and the joint angular velocity of the mechanical arm are respectively; />Is a mass matrix of the mechanical arm; />Coriolis force and centripetal force; />Is a conservative force comprising gravity and spring force; />The joint friction force of the mechanical arm; / >For the manipulator reference acceleration of the operating space,、/>the position and orientation error and the speed error vector of the mechanical arm end effector are respectively; />Network parameters of the depth kinematic model; />The generalized inverse matrix is a mechanical arm jacobian matrix; />、/>Selecting control moment of zero space for calculating control parameters of moment controller>The form of (2) is:

；

the strategy learning output calculates control parameters of the torque controller instead of the specific joint control torque, as in the trajectory tracking task setup of the joint space.

；

wherein , and />Weights with different control precision respectively, including +.>The term is used for the action of quickly reducing strategy exploration errors, and when the errors are larger, the term mainly comprises +.>The term providing a reward value, an action that causes the strategy exploration error to decrease rapidly; contain- >The term is used for action of improving policy learning accuracy, and when error is small, the term is mainly composed of +.>The term provides a reward value, such that the policy learns actions that further improve accuracy; beta is a value for adjusting the proportion of different weights, the value of beta is 0.75, and the high-precision tracking task of the expected track is finally completed by adjusting the proportion of different weights through beta; />、/>Andthe track tracking task of the joint space of the mechanical arm is respectively the accumulated values of joint angle errors, joint angular velocity errors and joint angle errors of the mechanical arm, and the track tracking task of the operation space of the mechanical arm is respectively the accumulated values of pose errors, velocity errors and pose errors of the end effector of the mechanical arm.

Using Soft Actor-Critic reinforcement learning algorithm, i.e. SAC algorithmAs a strategy optimization algorithm of the invention, the strategy output calculates the control parameters of the moment controllerAnd calculating by combining with a calculation torque controller to obtain a final control torque, and controlling the mechanical arm according to the final control torque.

The SAC algorithm introduces a maximum entropy target while maximizing expected rewards, is used for balanced exploration and optimization, improves the performance and robustness of strategies, and has the optimal strategy that:

；

wherein T is the length of the motion trail of the mechanical arm; Is the expected value under the condition of the state and the action at the moment t; />Tracking the state of the task for the track of the mechanical arm at the moment t; />Controlling the action of strategy output for the time t; />Is a reward function; />Is a discount factor; />Entropy of control strategy; />The regularization coefficient is used for adjusting the proportion of entropy in the objective function in an adaptive updating mode to control the randomness of the strategy.

SAC algorithm uses a commentator network to fit state-actionsValue functionFor evaluating the strategy and updating by minimizing the error of the bellman equation by: />, wherein ,/>For the state of the arm track tracking task at time t+1, < > the following point is given>Controlling the action of strategy output for the time t+1; />A state-action value function at time t; />To control policy->Is the expected value under the condition; t represents the time t; />For the discount factor, superscript t represents the t power of the discount factor; />Is a reward function; />The expected value under the condition of the state and the action at the time t+1; />Is a state-action value function at time t+1.

The actor network of the SAC algorithm is updated by the KL divergence of the following minimization strategy:

；

wherein ,is a sample in the control strategy distribution; pi is the distribution of control strategies represented by the actor's network; KL divergence to minimize policy; />Tracking the state of the task for the track of the mechanical arm at the moment t; />Is a quaternion representing the pose of the robotic arm end effector; />Is a control strategy before updating; />A state-action value function before update; />Is a distribution function for normalizing the distribution.

The specific implementation process of the method comprises the following steps:

firstly, collecting actual motion trail data of a mechanical arm for deep learning, and respectively establishing a deep kinematic model and a deep dynamic model learning corresponding to the mechanical arm; the depth kinematic model is used for predicting the pose of the end effector of the mechanical arm and calculating a jacobian matrix corresponding to the mechanical arm; the depth dynamics model is used for predicting joint angles, angular velocities and angular accelerations of the mechanical arm to obtain state changes of joint spaces of the mechanical arm, and obtaining control moment of the calculation moment controller when the mechanical arm performs track tracking tasks;

then, a random excitation track model described by a finite Fourier series is established, a random excitation track given by the random excitation track model is used as an expected motion track to control the mechanical arm, an actual motion track of the mechanical arm is measured and collected to be used as a training data set, and the depth kinematic model based on the jacobian and the depth dynamics model based on the Lagrange established in the step S1 are trained respectively;

Secondly, establishing a Markov decision process model of a mechanical arm track tracking task, wherein a state transition model in the Markov decision process model is a depth transition model which is built by combining a trained depth kinematic model based on jacobian and a depth kinetic model based on Lagrange and simulates the mechanical arm motion track;

then, according to the Markov decision process model, offline learning is carried out through a Soft Actor-Critic reinforcement learning algorithm to calculate control parameters of a moment controller as a control strategy, mechanical arm simulation motion track data of offline interaction of the depth transfer model and the control strategy is collected, and an Actor network and a criticizer network of the Soft Actor-Critic reinforcement learning algorithm are updated until an optimal control strategy is obtained;

and finally, calculating the specific control moment of the mechanical arm by the calculated moment controller according to the optimal control strategy obtained in the step S4 to control the mechanical arm.

Compared with the prior art, the invention has the beneficial effects that:

(1) The control method of the invention is a learning method of a depth kinematic model of a robot based on jacobian and a depth kinematic model based on Lagrange. Unlike conventional methods, this method introduces a priori knowledge, such as: the depth kinematic model introduces priori knowledge of the Jacobian model on the velocity relation of the mechanical arm; the depth dynamics model is a priori knowledge of physical constraints such as energy conservation and motion equation constraint of a mechanical arm system and the like, which are introduced with a dynamics model structure (comprising a priori friction model), but not information of a specific robot, and the depth dynamics model corresponding to the mechanical arm are learned through a data-driven network model, so that a complex modeling process is not required, and the model precision and generalization are high. The depth kinematic model is used for predicting the pose of the end effector of the mechanical arm and calculating a jacobian matrix in real time. The deep dynamics model learns the components of the model in an unsupervised manner for predicting state changes in the robot joint space. The depth transfer model formed by combining the depth kinematic model and the depth dynamic model can be used for simulating the motion trail of the robot, is subsequently used for optimizing a control strategy, and is combined with the traditional calculation moment control law so as to achieve higher control accuracy.

(2) The control method of the invention is a model-based offline reinforcement learning method suitable for high-precision track tracking of robots. The method can complete high-precision track tracking tasks in joint space and operation space. Through the design of the state space, the action space and the rewarding function and the combination of the traditional calculation moment controller, the rapid convergence and high-precision performance of the off-line track tracking strategy are realized, and meanwhile, the stability and the safety of the control strategy are ensured.

Those of ordinary skill in the art will appreciate that: all or part of the flow of the method implementing the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and the program may include the flow of the embodiment of each method as described above when executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.

The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims. The information disclosed in the background section herein is only for enhancement of understanding of the general background of the invention and is not to be taken as an admission or any form of suggestion that this information forms the prior art already known to those of ordinary skill in the art.

Claims

1. The model-based robot offline reinforcement learning control method is characterized by comprising the following steps of:

2. The model-based robot offline reinforcement learning control method according to claim 1, wherein in the step S1, the jacobian-based deep kinematics model corresponding to the robot arm is established through deep learning in the following manner, comprising:

The forward kinematics model of the mechanical arm is determined as follows:； wherein ,/>Is the pose of the end effector of the mechanical arm in the operating space, < >>Is the degree of freedom of the operating space of the mechanical arm; />Is the joint angle of the mechanical arm,is the degree of freedom of the joint space of the mechanical arm;

determining the pose of the mechanical arm end effector in an operation space asWherein p is the position of the robotic end effector; />Is a quaternion representing the pose of the robotic arm end effector,, wherein />A rotation axis representing the quaternion,>、/> and />Three components of the rotation axis, respectively +.>Indicating the angle of rotation about the axis of rotation +.>Satisfy constraint->The method comprises the steps of carrying out a first treatment on the surface of the Quaternion->Relation between derivative for time and angular velocity ω of the manipulator end-effector in the operating space +. >Is thatI.e. the relationship->The method comprises the following steps:

；

wherein , 、/> and />The components of the angular velocity of the mechanical arm end effector in the x axis, the y axis and the z axis are respectively;

applying a chain rule to the derivative of each layer in the depth kinematic model, iteratively calculating the derivative of the output of the depth neural network relative to the input to recover and learn the jacobian matrix, and learning the obtained jacobian matrixExpressed as calculated by:

；

wherein ,network parameters of the depth kinematic model; />The deviation of the kinematic equation obtained for the deep learning on the joint angle; />An i-th layer network layer which is a deep kinematic model,>，/>；，/>，/>application to depth for depth kinematic modelsThe weight input by the network layer before the neural network; />Is the deviation; /> and />The nonlinear activation function and its derivative of the deep neural network, respectively.

3. The model-based robot offline reinforcement learning control method of claim 2, wherein the training dataset of the deep kinematic model Wherein q is the joint angle of the mechanical arm; />The joint angular velocity of the mechanical arm and x are the pose of the mechanical arm end effector in an operation space; />A derivative of the position of the mechanical arm end effector in the operation space relative to time;

；

wherein ,network parameters of the depth kinematic model; />Training data for depth kinematic modelsA collection; />Is the size of the training dataset; />Is the generalized inverse of the jacobian matrix; first loss term->Is the mean square error of the actual pose and the predicted pose of the manipulator operation space; second loss term->Is the mean square error of the actual speed and the predicted speed of the operation space of the mechanical arm; last loss item->The depth neural network is enabled to be more accurate by fitting the angular velocity of the joint of the mechanical arm.

4. A model-based robot offline reinforcement learning control method according to any one of claims 1-3, characterized in that in step S1, a lagrangian-based depth dynamics model corresponding to the robot arm is built by deep learning in the following manner, comprising:

；

wherein ,respectively machinesJoint angle, joint angular velocity, and joint angular acceleration of the arm; />Is a moment acting on the joints of the mechanical arm;

to ensure the quality matrix of the mechanical armIs positive symmetry of +.>Performing Chelesky decomposition to obtain +.>, wherein ,/>For the lower triangular matrix of the non-negative diagonal, the superscript T represents the transposed matrix, and the lower triangular matrix of the non-negative diagonal is learned by using the deep neural network +.>And potential energy of mechanical arm->Fitting Lagrangian functionIn combination with Euler-Lagrangian equation->Wherein F is generalized force and moment, and a depth forward dynamics model and a depth reverse dynamics model which form a depth dynamics model corresponding to the mechanical arm are respectively:

；

wherein ,、/> and />Respectively the joint angle, the joint angular velocity and the joint angular acceleration of the mechanical arm; />Is a mass matrix of the mechanical arm; />The derivative of the mass matrix of the mechanical arm with respect to time;coriolis force and centripetal force; / >Is a conservative force comprising gravity and spring force; />The torque is output for the joints of the mechanical arm; />The joint friction force of the mechanical arm is obtained through an induced joint priori friction force model of the mechanical arm consisting of coulomb friction, viscous friction and Stribeck friction, and the joint priori friction force model of the mechanical arm is as follows:

；

5. The model-based robot offline reinforcement learning control method of claim 4, wherein the training dataset of the deep dynamics modelWherein q is the joint angle of the mechanical arm; />Is the joint angular velocity of the mechanical arm; />The joint angular acceleration of the mechanical arm; />The joint moment of the mechanical arm;

the loss function of the depth dynamics model is as follows:

；

wherein ,a training data set for a depth dynamics model;the size of the training data set for the depth dynamics model;for angular acceleration of joints of mechanical armsThe regression loss is given by the fact that, 、Respectively a predicted value and an actual value of the joint angular acceleration of the mechanical arm at the time t;for joint momentThe regression loss is given by the fact that,、respectively a predicted value and an actual value of the joint moment of the mechanical arm at the moment t;is a multi-step predictive loss obtained by numerical integration,in order to predict the total number of steps,、the actual values of the joint angle and the joint angular velocity of the mechanical arm at the time t+1 are respectively,andrespectively the predicted values of the joint angle and the joint angular velocity of the mechanical arm at the time t+1.

6. A method according to any one of claims 1-3, wherein in step S2, a random excitation trajectory model described by a finite fourier series is built up in the following manner, comprising:

；

wherein , and />The amplitude of cosine and the amplitude of sine respectively; />The frequency of cosine and the frequency of sine; />The offset of the joint angle of the mechanical arm, and the offset of the amplitude, the cosine amplitude and the frequency of the sine and the joint angle are randomly selected within a safe range; / >For the number of Fourier series, randomly select +.1 to 3>T is the time value at time t, which is the variable when accumulated.

7. The model-based robot offline reinforcement learning control method according to any one of claims 1 to 3, wherein in the step S3, a markov decision process model of a robot arm track tracking task is established in the following manner, and a state transition model of the markov decision process model is a depth transition model which is constructed by combining a trained depth kinematic model based on jacobian and a depth kinematic model based on lagrangian and simulates the robot arm motion track, and the method comprises the following steps:

modeling a trajectory tracking task of a robotic arm as a finite time discounted discrete Markov decision process； wherein ,/>For the state space +.>A state at time t; />Is an action space;the action is at time t; />Is a deep transfer model as a state transfer model; />Is a reward function;is a discount factor;

Predicted values of joint angle and joint angular speed of mechanical arm at time t+1 by using depth kinematic model based on jacobianPredicted values of the pose and the speed of the end effector of the mechanical arm at time t+1 are obtained for input +.>；

Tracking the state of a task by utilizing the locus of the mechanical arm at the time t+1Construction of depth transfer model->In the depth transfer model +.> and />The state at the moment t and the state at the moment t+1 of the track tracking task of the mechanical arm are respectively +.>Control strategy for time tSlightly outputting the motion.

8. The method according to any one of claims 1 to 3, wherein in the step S4, according to the markov decision process model, control parameters of a moment controller are calculated as a control strategy by offline learning through a Soft Actor-Critic reinforcement learning algorithm, mechanical arm simulation motion track data of offline interactions of the deep transfer model with the control strategy are collected, and Actor networks and critics networks of the Soft Actor-Critic reinforcement learning algorithm are updated until an optimal control strategy is obtained, which comprises:

9. The model-based robot offline reinforcement learning control method according to claim 8, wherein the method sets the state space and the action space of the markov decision process model according to the trajectory tracking task performed in the joint space and the trajectory tracking task performed in the operation space by the robot arm, respectively, in the following manner, comprising:

；

wherein ,、/>respectively a joint angle error and a joint angular velocity error of the mechanical arm; Is the accumulated value of the joint angle error of the mechanical arm; />、/> and />Respectively the joint angle, the joint angular velocity and the joint angular acceleration of the expected track of the mechanical arm;

；

wherein , and />Joint angles of the mechanical arms respectivelyAnd joint angular velocity; />Is a mass matrix of the mechanical arm;coriolis force and centripetal force; />Is a conservative force comprising gravity and spring force; />The joint friction force of the mechanical arm; />、/> and />Respectively accumulating values of joint angle error, joint angular speed error and joint angle error of the mechanical arm; />、/>Respectively calculating control parameters of the moment controller;

；

wherein ,、/>the position and orientation error and the speed error vector of the mechanical arm end effector are respectively; />、/>The position error and the linear speed error of the mechanical arm end effector are respectively; />、/>Respectively an attitude error and an angular velocity error of the mechanical arm end effector; />The accumulated value of the pose errors of the mechanical arm end effector is obtained;、/> and />The pose, the speed and the acceleration of the expected track of the mechanical arm end effector are respectively; / > and />The position and the linear speed of the mechanical arm end effector are respectively; />、/> and />The position, the linear speed and the linear acceleration of the expected track of the mechanical arm end effector are respectively; />Quaternion for representing the attitude of the end effector of a robotic arm> and />The rotation axis of the quaternion and the rotation angle around the rotation axis are respectively; />Quaternion for the pose of the desired trajectory of the manipulator end effector, +.> and />The quaternion rotation axis and the rotation angle around the rotation axis are respectively; /> and />The first derivative and the second derivative of the quaternion of the gesture of the expected track of the mechanical arm end effector relative to time are respectively; /> and />The angular speed of the mechanical arm end effector and the angular speed of the expected track are respectively carried out; the subscript t of each parameter indicates that the parameter is a parameter at time t;

；

wherein , and />The joint angle and the joint angular velocity of the mechanical arm are respectively; />Is a mass matrix of the mechanical arm;coriolis force and centripetal force; />Is a conservative force comprising gravity and spring force; />The joint friction force of the mechanical arm; / >For the manipulator reference acceleration of the operating space,、/>the position and orientation error and the speed error vector of the mechanical arm end effector are respectively; />Network parameters of the depth kinematic model; />The generalized inverse matrix is a mechanical arm jacobian matrix; />、Selecting control moment of zero space for calculating control parameters of moment controller>The form of (2) is:

；

the Soft Actor-Critic reinforcement learning algorithm fits a state-action value function through a Critic networkTo evaluate the strategy, the error of the minimized bellman equation is updated by: />, wherein ,/>For the state of the arm track tracking task at time t+1, < > the following point is given>Controlling the action of strategy output for the time t+1; />A state-action value function at time t; />To control policy->Is the expected value under the condition; t represents the time t; />For the discount factor, the superscript t represents the power t of the discount factor, and t is the time value at the moment t; />Is a reward function; />The expected value under the condition of the state and the action at the time t+1; />A state-action value function at time t+1;

；

wherein ,is a sample in the control strategy distribution; pi is the distribution of control strategies represented by the actor's network; />KL divergence to minimize policy; />Tracking the state of the task for the track of the mechanical arm at the moment t; />Is a quaternion representing the pose of the robotic arm end effector; / >Is a control strategy before updating; />A state-action value function before update;is a distribution function for normalizing the distribution;

；

wherein T is the length of the motion trail of the mechanical arm;is the expected value under the condition of the state and the action at the moment t; />Tracking the state of the task for the track of the mechanical arm at the moment t; />Controlling the action of strategy output for the time t; />Is a reward function; />A discount factor at time t; />Entropy of control strategy; />The regularization coefficient is used for adjusting the proportion of entropy in the objective function in an adaptive updating mode to control the randomness of the strategy.

10. The model-based robot offline reinforcement learning control method according to claim 8, wherein in step S5, the control parameters of the calculated torque controller outputted by the calculated torque controller according to the optimal control strategy obtained in step S4And calculating by combining with a calculation torque controller to obtain the specific control torque of the mechanical arm.