CN111783250A - Flexible robot end arrival control method, electronic device, and storage medium - Google Patents


Info

Publication number
CN111783250A
Authority
CN
China
Prior art keywords: training, state, flexible robot, neural network, deep neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010635603.4A
Other languages
Chinese (zh)
Other versions
CN111783250B (en)
Inventor
孙俊
武海雷
韩飞
孙玥
刘超镇
阳光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Aerospace Control Technology Institute
Original Assignee
Shanghai Aerospace Control Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Aerospace Control Technology Institute filed Critical Shanghai Aerospace Control Technology Institute
Priority to CN202010635603.4A
Publication of CN111783250A
Application granted
Publication of CN111783250B
Legal status: Active
Anticipated expiration

Classifications

    • G06F30/17 — Computer-aided design [CAD]; Geometric CAD; Mechanical parametric or variational design
    • G06F30/27 — Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • B25J9/1633 — Programme controls characterised by the control loop; compliant, force, torque control, e.g. combined with position control
    • B25J9/1635 — Programme controls characterised by the control loop; flexible-arm control
    • B25J9/1676 — Programme controls characterised by safety, monitoring, diagnostic; avoiding collision or forbidden zones
    • G06N3/045 — Neural networks; architecture; combinations of networks


Abstract

The invention discloses a flexible robot end arrival control method, an electronic device, and a storage medium. The method comprises the following steps: establishing a dynamic model of the flexible robot; establishing a deep neural network according to the dynamic model, wherein the deep neural network is used to fit the dynamic model; performing a first training of the flexible robot end arrival process on the deep neural network to obtain initial parameters of the deep neural network; and performing a second training of the flexible robot end arrival process on the deep neural network to obtain final parameters of the deep neural network. The invention reduces the influence of dynamic model uncertainty and external disturbances on the control system and improves the end control precision of the flexible robot.

Description

Flexible robot end arrival control method, electronic device, and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a flexible robot end arrival control method based on deep reinforcement learning, an electronic device, and a storage medium.
Background
The control objective of a flexible robot driven by variable-length cables is to bring its end accurately to the target while avoiding collision with the surrounding environment. The safe end arrival control process faces the following problems: first, because the flexible robot control system is complex, strongly nonlinear, time-varying, and uncertain, an accurate dynamic model is difficult to establish; second, during motion of the flexible robot, external disturbances such as friction exist between the cable, the reel, and the disc. Classical feedback control methods, which require a fully known model, therefore have difficulty suppressing the trajectory-tracking inaccuracy caused by model uncertainty and external disturbances, and may even render the control system unstable.
Disclosure of Invention
The invention aims to provide a flexible robot end arrival control method based on deep reinforcement learning, an electronic device, and a storage medium, in order to reduce the influence of dynamic model uncertainty and external disturbances on the control system and to improve the end control precision of the flexible robot.
In order to achieve the above purpose, the invention is realized by the following technical scheme:
A flexible robot end arrival control method, comprising:
step S1, establishing a dynamic model of the flexible robot;
step S2, establishing a deep neural network according to the dynamic model, wherein the deep neural network is used to fit the dynamic model;
step S3, performing a first training of the flexible robot end arrival process on the deep neural network to obtain initial parameters of the deep neural network;
and step S4, performing a second training of the flexible robot end arrival process on the deep neural network to obtain final parameters of the deep neural network.
Preferably, the step S1 includes:
regarding the flexible robot as a continuous model with the arc coordinate as the independent variable, and regarding the spatial pose of the flexible robot as the rotation or translation of a cross section about the centre line;
establishing the dynamic model of the flexible robot based on the Cosserat rod model;
the dynamic model is represented by formula (1) of the detailed description, wherein F is the internal force on the cross section; M is the principal moment on the cross section; f is the distributed force on a single-section rod of the flexible robot; m is the distributed moment on a single-section rod of the flexible robot; J(s,t) is the inertia tensor of the rod per unit length; ρ is the density of the flexible robot rod per unit length; S is the cross-sectional area of the flexible robot rod per unit length; and ω is the angular velocity, expressed in the cross-section principal-axis coordinate system P-xyz, of the point P relative to the inertial coordinate system with respect to the time variable t.
Preferably, the step S2 includes: acquiring random trajectory data of the flexible robot during the end arrival process in real time by using a calibrated measurement camera external to the robot in the laboratory;
converting the random trajectory data into training data according to the dynamic model to obtain a random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T); wherein s_t represents the state of the flexible robot at the current time t; a_t represents the action of the flexible robot at time t; r_t represents the reward predicted from the environment at time t; and t = 1, 2, ..., T;
taking the state s_t and the action a_t at the current time t as inputs, the state transition prediction model P(s_{t+1} | s_t, a_t), which predicts the state s_{t+1} at the next time, is expressed as follows:
s_{t+1} ~ P(s_{t+1} | s_t, a_t)
taking the state s_t and the action a_t at the current time t as inputs, the reward prediction model R(r_{t+1} | s_t, a_t), which predicts the environment reward r_{t+1} at the next time, is expressed as follows:
r_{t+1} ~ R(r_{t+1} | s_t, a_t)
according to the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t), the random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T) is converted into a density estimation model training set and a regression model training set, each containing T-1 groups of training samples;
the density estimation model training set is expressed as follows:
(s_1, a_1) → s_2, (s_2, a_2) → s_3, ..., (s_{T-1}, a_{T-1}) → s_T
the regression model training set is expressed as follows:
(s_1, a_1) → r_2, (s_2, a_2) → r_3, ..., (s_{T-1}, a_{T-1}) → r_T
preferably, step S3.1, a data set D of optimal control trajectories is presetRLWhen it is in the empty set state; and randomly initializing a first action value function based on the deep neural network with model reinforcement learning training
Figure BDA0002568789240000031
Figure BDA0002568789240000032
The parameters of the deep neural network corresponding to the first action value function are represented, and the step S3.2 is carried out;
step S3.2, presetting a first training round number M1Recording the current first training round number m1Judging the current first training round number m1Whether or not it is less than the preset first training round number M1If yes, go to step S3.3; if not, entering step S3.6;
step S3.3, judging the optimal control track data set DRLIf it is an empty set, go to step S3.3.1; if not, go to step S3.3.2;
step S3.3.1, based on the random trajectory data set DrandApplying a random gradient descent method to make the loss function
Figure BDA0002568789240000033
The minimum value is reached, and the minimum value,
Figure BDA0002568789240000034
in the formula, D1Represents DrandMiddle(s)t,at,st+1) A set of constructs; stAnd atRespectively representing the state and the action when the current time is t; st+1Represents a state in which the subsequent time is t + 1;
then using the loss function
Figure BDA0002568789240000041
The random trajectory data set D when the minimum is reachedrandThe set of data in (a) determines the parameters of the deep neural network
Figure BDA0002568789240000042
A first action value function of the deep neural network at this time
Figure BDA0002568789240000043
If the state is known, go to step S3.4;
step S3.3.2, based on the optimal control trajectory data set DRLApplying a random gradient descent method to make the loss function
Figure BDA0002568789240000044
The minimum value is reached, and the minimum value,
Figure BDA0002568789240000045
in the formula stAnd atRespectively representing the state and the action of the current moment; st+1Indicating the state at the subsequent time;
then using the loss function
Figure BDA0002568789240000046
The optimal control trajectory data set D when the minimum is reachedRLIs used for solving the parameters of the deep neural network
Figure BDA0002568789240000047
A first action value function of the deep neural network at this time
Figure BDA0002568789240000048
If the state is known, go to step S3.4;
s3.4, judging that the training sample group number T is equal to the random track data set D when the training times is less than the training sample group number TrandThe total number T of training data contained in (a); step S3.5 is executed;
step S3.5, obtaining the corresponding number of times of current trainingThe state s of the flexible robot at the current time ttGo to step S3.5.1;
step S3.5.1, using a first action value function of the deep neural network
Figure BDA0002568789240000049
An optimal action sequence containing T actions is estimated
Figure BDA00025687892400000411
Figure BDA00025687892400000410
In the formula
Figure BDA00025687892400000412
T is an integer;
proceed to step S3.5.2; step S3.5.2, executing the optimal sequence of actions
Figure BDA00025687892400000510
First action a in (1)tThe first action atThe state s of the flexible robot at the current moment t corresponding to the current training timestCombining to obtain the optimal control track(s)t,at) Go to step S3.5.3;
step S3.5.3, obtaining the optimal control track(s)t,at) Adding to the optimal control trajectory data set DRLPerforming the following steps; proceed to step S3.5.4;
step S3.5.4, judging whether the training times is equal to the training sample group number T; if not, returning to the step S3.5; if yes, returning to the step S3.2;
s3.6, finishing training; obtaining a final first action value function of the deep neural network
Figure BDA0002568789240000051
And the initial parameters corresponding thereto
Figure BDA0002568789240000052
Preferably, the step S4 includes:
step S4.1, initializing the state transition prediction model P(s_{t+1} | s_t, a_t), the reward prediction model R(r_{t+1} | s_t, a_t), and the parameter θ corresponding to the second action value function of the deep neural network for model-free reinforcement learning training, letting θ = 0 at this time; proceeding to step S4.2;
step S4.2, starting a trial from the initial state s_0, and initializing the parameter corresponding to the first action value function of the deep neural network; proceeding to step S4.3;
step S4.3, presetting an eligibility trace z and letting z = 0; proceeding to step S4.4;
step S4.4, for the initial state s_0 of each training round, executing one model-based reinforcement learning training simulation and updating the first action value function, obtaining the updated first action value function and its initial parameters; proceeding to step S4.5;
step S4.5, based on the state s_t at the current time t, combining the first action value function and the second action value function to obtain a joint action value function, and selecting an action a_t using the ε-greedy method; proceeding to step S4.6;
step S4.6, if the error s_err = ||s_t - s_q|| between the current state s_t and the known desired terminal state s_q is greater than the constant value Δ, proceeding to step S4.6.1; otherwise, returning to step S4.2;
step S4.6.1, executing the action a_t selected in step S4.5; obtaining the subsequent state s_{t+1} based on the state transition prediction model P(s_{t+1} | s_t, a_t) and receiving a reward r based on the reward prediction model R(r_{t+1} | s_t, a_t); using the subsequent state s_{t+1}, the action a_t, and the reward r to update the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t); proceeding to step S4.6.2;
step S4.6.2, using the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t) obtained in step S4.6.1, performing one model-based reinforcement learning training simulation starting from the subsequent state s_{t+1}, updating the first action value function, and obtaining its corresponding parameters; proceeding to step S4.6.3;
step S4.6.3, based on the subsequent state s_{t+1} and the joint action value function, selecting the action a_{t+1} actually to be executed next by means of a greedy method; proceeding to step S4.6.4;
step S4.6.4, based on the model-free reinforcement learning training simulation, obtaining the deviation δ of the corresponding second action value function, and using this deviation to update the second action value function corresponding to the model-free reinforcement learning training: θ ← θ + αδz, where α denotes the learning rate, a constant value between 0 and 1;
step S4.6.5, updating the eligibility trace z, where λ denotes a discount factor, a constant value between 0 and 1; proceeding to step S4.6.6;
step S4.6.6, transferring the acquired state of the flexible robot to the subsequent state, i.e. s_t = s_{t+1}, a_t = a_{t+1}; proceeding to step S4.6.7;
step S4.6.7, presetting a second number of training rounds M_2 and recording the current second training round number m_2; judging whether the current second training round number m_2 is less than the preset second number of training rounds M_2; if yes, returning to step S4.2; if not, proceeding to step S4.7;
step S4.7, ending the training; obtaining the final second action value function of the deep neural network and its corresponding final parameters.
In another aspect, the present invention also provides an electronic device comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the method as described above.
In a further aspect, the present invention also provides a readable storage medium in which a computer program is stored, the computer program, when executed by a processor, implementing the method as described above.
Compared with the prior art, the invention has the following advantages:
the invention discloses a flexible robot tail end arrival control method, which comprises the following steps: step S1, establishing a dynamic model of the flexible robot; step S2, establishing a deep neural network according to the dynamic model, wherein the deep neural network is used for fitting the dynamic model; step S3, carrying out primary training of the terminal arrival process of the flexible robot for the first time on the deep neural network to obtain initial parameters of the deep neural network; and step S4, performing primary training of the flexible robot arrival process for the second time on the deep neural network to obtain final parameters of the deep neural network. Therefore, the dynamic model of the flexible robot is firstly established, the dynamic model is not very accurate, and then a deep neural network is established by combining the dynamic model, the deep neural network has the effect equivalent to the dynamic model, and the principle is similar to that the dynamic model is replaced by the deep neural network; the initial parameters of the deep neural network can be obtained through the first training, so that the deep neural network becomes known, but the precision of the deep neural network is still insufficient at the moment, and therefore the known deep neural network is trained for the second time and used for improving the precision of the deep neural network. According to the method, an accurate mathematical model (a dynamic model) of the flexible robot is not required to be established, adaptive control is realized by performing reward and punishment feedback on the operation process, inherent control errors caused by unknown or inaccurate dynamic model and control errors caused by dimensionality reduction and simplification of the dynamic model can be eliminated or weakened, the tail end control precision of the flexible robot is improved, and technical support is provided for operation tasks such as on-orbit module replacement, sailboard auxiliary expansion and the like of a failure target.
Drawings
Fig. 1 is a flowchart of a flexible robot end arrival control method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of various coordinate systems according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a simple structure of an electronic apparatus according to an embodiment of the invention.
Detailed Description
The flexible robot end arrival control method, electronic device, and storage medium according to the present invention will be described in detail below with reference to fig. 1 to 3 and the following detailed description. The advantages and features of the present invention will become more apparent from this description. It is noted that the drawings are in a greatly simplified form and are not drawn to precise scale; they are used only to facilitate and clarify the description of the embodiments of the present invention. The structures, ratios, sizes, and the like shown in the drawings and described in the specification are only intended to accompany the disclosure so that it can be understood and read by those skilled in the art, and are not intended to limit the conditions under which the present invention can be implemented; accordingly, any structural modification, change of proportion, or adjustment of size that does not affect the efficacy or purpose achievable by the invention still falls within the scope of the present invention.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As shown in fig. 1, the present embodiment provides a flexible robot end arrival control method, comprising:
step S1, establishing a dynamic model of the flexible robot;
step S2, establishing a deep neural network according to the dynamic model, wherein the deep neural network is used to fit the dynamic model;
step S3, performing a first training of the flexible robot end arrival process on the deep neural network to obtain initial parameters of the deep neural network;
and step S4, performing a second training of the flexible robot end arrival process on the deep neural network to obtain final parameters of the deep neural network.
Further, as shown in fig. 2, the step S1 includes: regarding the flexible robot as a continuous model with the arc coordinate as the independent variable, and regarding the spatial pose of the flexible robot as the rotation or translation of a cross section about the centre line;
establishing the dynamic model of the flexible robot based on the Cosserat rod model;
the dynamic model is represented by formula (1) below, wherein F is the internal force on the cross section; M is the principal moment on the cross section; f is the distributed force on a single-section rod of the flexible robot; m is the distributed moment on a single-section rod of the flexible robot; J(s,t) is the inertia tensor of the rod per unit length; ρ is the density of the flexible robot rod per unit length; S is the cross-sectional area of the flexible robot rod per unit length; and ω is the angular velocity, expressed in the cross-section principal-axis coordinate system P-xyz, of the point P relative to the inertial coordinate system with respect to the time variable t.
The specific establishment process is as follows: in view of the high flexibility, high degree of freedom, and strong nonlinearity of the flexible robot, a dynamic equation based on the Cosserat rod model is established. The basic idea is to regard the flexible robot as a continuous model with the arc coordinate as the independent variable, so that the spatial pose of the robot can be regarded as the rotation or translation of a cross section about the centre line.
Continuing with fig. 2, a spatial geometry description based on the Frenet coordinate system is used: the T axis of the Frenet coordinate system (P-NBT) is the tangent vector of the cable centre line at point P, the N axis is the normal vector of the cable centre line at point P, and the B axis is the binormal vector of the cable centre line at point P (B = T × N); the three coordinate axes N, B, T are pairwise orthogonal. An auxiliary vector ω_F is introduced, defined as
ω_F(s) = κ(s)B + τ(s)T (2)
where κ(s) and τ(s) are the curvature and torsion of the centre line, and ω_F(s) is called the Darboux vector of the curve; its physical meaning is the angular velocity of rotation of the Frenet coordinate system relative to the inertial coordinate system when the point P moves along the curve C in the forward direction of the arc coordinate s with unit speed.
The variation of the vectors N, B, and T with the arc coordinate s is determined by the Frenet–Serret differential equations (3).
Since the tangent vector T is the derivative of the position vector with respect to the arc coordinate (formula (4)), the flexible robot centre-line curve r(s) is obtained by integration along the arc coordinate (formula (5)).
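The original formula images for equations (3)–(5) are not reproduced in this text. As an assumption consistent with the definitions of N, B, T, and the Darboux vector given above, the standard Frenet–Serret relations that equation (3) refers to can be sketched in LaTeX as follows; the exact notation in the patent may differ.

```latex
% Assumed reconstruction of equations (3)-(5): Frenet-Serret formulas written with
% the Darboux vector \omega_F(s) = \kappa(s)\,\mathbf{B} + \tau(s)\,\mathbf{T}.
\begin{aligned}
\frac{\mathrm{d}\mathbf{T}}{\mathrm{d}s} &= \omega_F \times \mathbf{T} = \kappa\,\mathbf{N},\\
\frac{\mathrm{d}\mathbf{N}}{\mathrm{d}s} &= \omega_F \times \mathbf{N} = -\kappa\,\mathbf{T} + \tau\,\mathbf{B},\\
\frac{\mathrm{d}\mathbf{B}}{\mathrm{d}s} &= \omega_F \times \mathbf{B} = -\tau\,\mathbf{N},
\end{aligned}
\qquad
\frac{\partial \mathbf{r}}{\partial s} = \mathbf{T},
\qquad
\mathbf{r}(s) = \mathbf{r}(0) + \int_{0}^{s} \mathbf{T}(\sigma)\,\mathrm{d}\sigma .
```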
On the basis of the centre-line deformation, and further taking the size of the cross section into account, the torsional deformation about the z axis at the axis of the flexible robot is included, and a cross-section principal-axis coordinate system (P-xyz) is established. According to this geometric configuration, the kinematic equations are obtained, where the tilde denotes the local derivative with respect to the cross-section principal-axis coordinate system (P-xyz), and ω is the angular velocity of the point P relative to the inertial coordinate system with respect to the time variable t.
The discrete-element concept is then adopted to discretize the continuous manipulator into infinitesimal segments. According to Newton's law and the theorem of moment of momentum about the centroid, the cable dynamic equation for a segment is obtained as formula (6), where ρ is the density of the flexible robot rod per unit length; S is the cross-sectional area of the flexible robot rod per unit length; J(s,t) is the inertia tensor of the rod per unit length; F is the internal force on the cross section; M is the principal moment (internal moment) on the cross section; f is the distributed force on the rod; m is the distributed moment on a single-section rod of the flexible robot; Δs denotes the arc length of an infinitesimal segment; and ΔF and ΔM denote the changes of the internal force and of the principal moment, respectively, over the change Δs of the cross-section position.
Dividing both sides of formula (6) by Δs yields formula (1), whereby the dynamic model of a single-section rod of the flexible manipulator (the dynamic model of the flexible robot) is established.
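The formula images for equations (1) and (6) are not reproduced in this text. As an assumption consistent with the variable definitions above (F, M, f, m, J(s,t), ρ, S, r, ω), the per-unit-length Cosserat rod dynamic equations that formula (1) typically takes are sketched below in LaTeX; the exact form used in the patent may differ.

```latex
% Assumed reconstruction of formula (1): balance of linear and angular momentum
% for a unit-length element of the flexible rod (Cosserat model).
\begin{aligned}
\rho S \,\frac{\partial^{2} \mathbf{r}}{\partial t^{2}}
  &= \frac{\partial \mathbf{F}}{\partial s} + \mathbf{f}, \\[4pt]
\rho \mathbf{J}(s,t)\,\frac{\partial \boldsymbol{\omega}}{\partial t}
 + \boldsymbol{\omega} \times \bigl(\rho \mathbf{J}(s,t)\, \boldsymbol{\omega}\bigr)
  &= \frac{\partial \mathbf{M}}{\partial s}
   + \frac{\partial \mathbf{r}}{\partial s} \times \mathbf{F} + \mathbf{m}.
\end{aligned}
```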
Further, the step S2 includes: acquiring random trajectory data of the flexible robot during the end arrival process in real time by using a calibrated measurement camera external to the robot in the laboratory;
converting the random trajectory data into training data according to the dynamic model to obtain a random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T); wherein s_t represents the state of the flexible robot at the current time t; a_t represents the action of the flexible robot at time t; r_t represents the reward predicted from the environment at time t; and t = 1, 2, ..., T;
taking the state s_t and the action a_t at the current time t as inputs, the state transition prediction model P(s_{t+1} | s_t, a_t), which predicts the state s_{t+1} at the next time, is expressed as follows:
s_{t+1} ~ P(s_{t+1} | s_t, a_t) (7)
taking the state s_t and the action a_t at the current time t as inputs, the reward prediction model R(r_{t+1} | s_t, a_t), which predicts the environment reward r_{t+1} at the next time, is expressed as follows:
r_{t+1} ~ R(r_{t+1} | s_t, a_t) (8)
according to the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t), the random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T) is converted into a density estimation model training set and a regression model training set, each containing T-1 groups of training samples;
the density estimation model training set (corresponding to the state transition prediction model P(s_{t+1} | s_t, a_t)) is expressed as follows:
(s_1, a_1) → s_2, (s_2, a_2) → s_3, ..., (s_{T-1}, a_{T-1}) → s_T (9)
the regression model training set (corresponding to the reward prediction model R(r_{t+1} | s_t, a_t)) is expressed as follows:
(s_1, a_1) → r_2, (s_2, a_2) → r_3, ..., (s_{T-1}, a_{T-1}) → r_T (10).
the basic idea of step S2 is to fit a dynamic model of the flexible robot by using a deep neural network, apply a model-based reinforcement learning method (model reinforcement learning training), use a learned neural network model in a model predictive control framework, before selecting an actual execution action, the flexible robot first performs a simulation based on the dynamic model from a current state, the simulation simulates a completed trajectory, so as to evaluate a current action value function, and implement an initial training of the flexible robot in the arrival process.
Further, the step S3 includes: step S3.1, presetting an optimal control trajectory data set D_RL, which is initially the empty set; and randomly initializing the first action value function of the deep neural network for model-based reinforcement learning training, together with the deep neural network parameters corresponding to the first action value function; then proceeding to step S3.2.
Step S3.2, presetting a first number of training rounds M_1 and recording the current first training round number m_1; judging whether the current first training round number m_1 is less than the preset first number of training rounds M_1; if yes, proceed to step S3.3; if not, proceed to step S3.6.
Step S3.3, judging whether the optimal control trajectory data set D_RL is the empty set; if yes, proceed to step S3.3.1; if not, proceed to step S3.3.2.
Step S3.3.1, based on the random trajectory data set D_rand, applying the stochastic gradient descent method so that the loss function reaches its minimum, where D_1 represents the set formed by the tuples (s_t, a_t, s_{t+1}) in D_rand; s_t and a_t respectively represent the state and the action at the current time t; and s_{t+1} represents the state at the subsequent time t+1.
The data of the random trajectory data set D_rand for which the loss function reaches its minimum are then used to determine the parameters of the deep neural network, so that the first action value function of the deep neural network becomes known; proceed to step S3.4.
Step S3.3.2, based on the optimal control trajectory data set D_RL, applying the stochastic gradient descent method so that the loss function reaches its minimum, where s_t and a_t respectively represent the state and the action at the current time, and s_{t+1} represents the state at the subsequent time.
The data of the optimal control trajectory data set D_RL for which the loss function reaches its minimum are then used to solve for the parameters of the deep neural network, so that the first action value function of the deep neural network becomes known; proceed to step S3.4.
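The exact loss function minimised in steps S3.3.1 and S3.3.2 is contained in formula images that are not reproduced here. A minimal PyTorch sketch of fitting a neural network to the transition tuples (s_t, a_t) → s_{t+1} by stochastic gradient descent is given below, under the assumption of a mean-squared prediction error; the network size, optimiser settings, and loss form are illustrative, not the patent's.

```python
import torch
import torch.nn as nn

def fit_dynamics_network(inputs, targets, epochs=200, lr=1e-3, batch_size=64):
    """Fit a network s_hat_{t+1} = f(s_t, a_t) to tuples from D_rand or D_RL
    by stochastic gradient descent on a mean-squared prediction loss (assumed)."""
    net = nn.Sequential(
        nn.Linear(inputs.shape[1], 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, targets.shape[1]),
    )
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    dataset = torch.utils.data.TensorDataset(inputs, targets)
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = nn.functional.mse_loss(net(x), y)  # assumed loss; the patent's exact form is not reproduced
            loss.backward()
            opt.step()
    return net

# Usage with the (dens_x, dens_y) arrays produced by the build_training_sets sketch above:
# dynamics_net = fit_dynamics_network(torch.tensor(dens_x, dtype=torch.float32),
#                                     torch.tensor(dens_y, dtype=torch.float32))
```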
Step S3.4, while the training count is less than the number of training sample groups T, which equals the total number T of training data groups contained in the random trajectory data set D_rand, step S3.5 is executed.
Step S3.5, obtaining the state s_t of the flexible robot at the current time t corresponding to the current training count; proceed to step S3.5.1.
Step S3.5.1, using the first action value function of the deep neural network, estimating an optimal action sequence containing T actions (T being an integer); proceed to step S3.5.2.
Step S3.5.2, executing the first action a_t of the optimal action sequence, and combining the first action a_t with the state s_t of the flexible robot at the current time t corresponding to the current training count to obtain the optimal control trajectory (s_t, a_t); proceed to step S3.5.3.
Step S3.5.3, adding the obtained optimal control trajectory (s_t, a_t) to the optimal control trajectory data set D_RL; proceed to step S3.5.4.
Step S3.5.4, judging whether the training count equals the number of training sample groups T; if not, return to step S3.5; if yes, return to step S3.2.
Step S3.6, ending the training; obtaining the final first action value function of the deep neural network and its corresponding initial parameters.
It can be seen that the basic idea of step S3 is to define, for the model-based flexible-arm end arrival reinforcement learning task, a reward function that encodes the task: a reward is given when the trajectory arrives near the desired end position and follows the desired trajectory. A model predictive controller (MPC) executes only the first-ranked action a_t of the optimal action sequence obtained in step S3.5.1 and adds the corresponding state–action pair to the optimal control trajectory data set, thereby increasing the robustness of the embodiment.
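A minimal sketch of the model-predictive selection used in step S3.5 is given below: candidate action sequences are rolled out through the learned dynamics network, scored with a task reward, and only the first action of the best sequence is executed. Random-shooting optimisation and the simple distance-based reward are assumptions made for illustration; the patent does not fix either choice.

```python
import torch

def mpc_select_action(dynamics_net, state, horizon, action_dim, desired_state,
                      n_candidates=256):
    """Evaluate random candidate action sequences through the learned model and
    return the first action of the best-scoring sequence (assumed random shooting)."""
    best_return, best_first_action = -float("inf"), None
    for _ in range(n_candidates):
        seq = torch.rand(horizon, action_dim) * 2.0 - 1.0   # candidate actions in [-1, 1]
        s, total_reward = state.clone(), 0.0
        for a in seq:
            s = dynamics_net(torch.cat([s, a]).unsqueeze(0)).squeeze(0)  # predicted s_{t+1}
            total_reward += -torch.norm(s - desired_state).item()        # assumed reward: closeness to goal
        if total_reward > best_return:
            best_return, best_first_action = total_reward, seq[0]
    return best_first_action

# Usage: a_t = mpc_select_action(dynamics_net, s_t, horizon=10, action_dim=2, desired_state=s_q);
# the pair (s_t, a_t) is then appended to D_RL as in step S3.5.3.
```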
Further, the step S4 includes: step S4.1, initializing the state transition prediction model P(s_{t+1} | s_t, a_t), the reward prediction model R(r_{t+1} | s_t, a_t), and the parameter θ corresponding to the second action value function of the deep neural network for model-free reinforcement learning training, letting θ = 0 at this time; proceed to step S4.2.
Step S4.2, starting a trial from the initial state s_0, and initializing the parameter corresponding to the first action value function of the deep neural network; proceed to step S4.3.
Step S4.3, presetting an eligibility trace z and letting z = 0; proceed to step S4.4.
Step S4.4, for the initial state s_0 of each training round, executing one model-based reinforcement learning training simulation and updating the first action value function, obtaining the updated first action value function and its initial parameters; proceed to step S4.5. In this embodiment, step S4.4 can be understood as calculating these parameters by the method of step S3 described above, starting from the initial state s_0.
Step S4.5, based on the state s_t at the current time t, combining the first action value function and the second action value function to obtain a joint action value function, and selecting an action a_t using the ε-greedy method; proceed to step S4.6.
Step S4.6, if the error s_err = ||s_t - s_q|| between the current state s_t and the known desired terminal state s_q is greater than the constant value Δ, proceed to step S4.6.1; otherwise, return to step S4.2.
Step S4.6.1, executing the action a_t selected in step S4.5; obtaining the subsequent state s_{t+1} based on the state transition prediction model P(s_{t+1} | s_t, a_t) and receiving a reward r based on the reward prediction model R(r_{t+1} | s_t, a_t); using the subsequent state s_{t+1}, the action a_t, and the reward r to update the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t); proceed to step S4.6.2.
Step S4.6.2, using the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t) obtained in step S4.6.1, performing one model-based reinforcement learning training simulation starting from the subsequent state s_{t+1}, updating the first action value function, and obtaining its corresponding parameters; proceed to step S4.6.3. Step S4.6.2 can be understood as calculating these parameters by the method of step S3.
Step S4.6.3, based on the subsequent state s_{t+1} and the joint action value function, selecting the action a_{t+1} actually to be executed next by means of a greedy method; proceed to step S4.6.4.
Step S4.6.4, based on the model-free reinforcement learning training simulation, obtaining the deviation δ of the corresponding second action value function, and using this deviation to update the second action value function corresponding to the model-free reinforcement learning training: θ ← θ + αδz, where α denotes the learning rate, a constant value between 0 and 1; proceed to step S4.6.5.
Step S4.6.5, updating the eligibility trace z, where λ denotes a discount factor, a constant value between 0 and 1; proceed to step S4.6.6.
Step S4.6.6, transferring the acquired state of the flexible robot to the subsequent state, i.e. s_t = s_{t+1}, a_t = a_{t+1}; proceed to step S4.6.7.
Step S4.6.7, presetting a second number of training rounds M_2 and recording the current second training round number m_2; judging whether the current second training round number m_2 is less than the preset second number of training rounds M_2; if yes, return to step S4.2; if not, proceed to step S4.7.
Step S4.7, ending the training; obtaining the final second action value function of the deep neural network and its corresponding final parameters.
It can be seen that the basic idea of step S4 is that, before selecting the action actually to be executed, the flexible robot first performs a model-based simulation starting from the current state to evaluate the current action value function, and then combines the action value function obtained from the simulation with the action value function obtained from real experience to jointly select the action a_t actually to be executed.
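A compact sketch of the joint model-based/model-free update in step S4 is given below, using a linear second action value function with an eligibility trace; the feature map, the TD(λ)-style update θ ← θ + αδz, and the ε-greedy selection over the summed value functions are assumptions made for illustration, since the corresponding formulas are formula images not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def features(state, action):
    # Hypothetical feature map for the linear second action value function Q2 = theta . phi(s, a).
    return np.concatenate([state, [action, 1.0]])

def q_joint(q1, theta, state, action):
    # Joint action value: model-based Q1 plus model-free linear Q2 (assumed additive combination).
    return q1(state, action) + theta @ features(state, action)

def epsilon_greedy(q1, theta, state, actions, eps=0.1):
    if rng.random() < eps:
        return rng.choice(actions)
    values = [q_joint(q1, theta, state, a) for a in actions]
    return actions[int(np.argmax(values))]

def model_free_update(theta, z, state, action, reward, next_state, next_action,
                      q1, alpha=0.05, gamma=0.95, lam=0.9):
    """One TD(lambda)-style update of the second action value function (assumed form)."""
    delta = (reward + gamma * q_joint(q1, theta, next_state, next_action)
             - q_joint(q1, theta, state, action))     # deviation of the second action value function
    z = gamma * lam * z + features(state, action)      # eligibility trace update
    theta = theta + alpha * delta * z                  # theta <- theta + alpha * delta * z
    return theta, z

# Toy usage: q1 stands in for the model-based first action value function.
q1 = lambda s, a: 0.0
theta = np.zeros(4)                                    # state_dim 2 + action 1 + bias 1
z = np.zeros_like(theta)
s, actions = np.array([0.2, -0.1]), [-1.0, 0.0, 1.0]
a = epsilon_greedy(q1, theta, s, actions)
s_next = s + 0.1 * a                                   # stand-in for the learned transition model
a_next = epsilon_greedy(q1, theta, s_next, actions)
theta, z = model_free_update(theta, z, s, a, -np.linalg.norm(s_next), s_next, a_next, q1)
print(theta)
```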
Therefore, in this embodiment, a Cosserat-based dynamic model of the flexible robot is established first; the dynamic model is then fitted with a deep neural network, and the initial training of the flexible robot end arrival process is carried out with a model-based reinforcement learning method. By combining the model-based reinforcement learning method with a model-free reinforcement learning method, the end arrival process can be trained and the optimisation of the flexible robot's arrival action sequence completed, so that the end of the flexible robot arrives safely, providing technical support for tasks such as on-orbit module replacement and auxiliary deployment of solar panels. In this way, the embodiment addresses the model uncertainty and external disturbances that arise in dynamic modelling because the flexible robot, unlike a traditional articulated robot, is highly flexible, has many degrees of freedom, and is strongly nonlinear.
In still another aspect, based on the same inventive concept, the present invention further provides an electronic device, as shown in fig. 3, the electronic device includes a processor 301 and a memory 303, the memory 303 stores a computer program thereon, and the computer program, when executed by the processor 301, implements the flexible robot end arrival control method as described above.
The electronic device provided by this embodiment can address the problems of model uncertainty and external disturbance in dynamic modelling that arise because the flexible robot differs in structure from a traditional articulated robot composed of rigid joints and links, and has characteristics such as high flexibility, high degree of freedom, and strong nonlinearity.
With continued reference to fig. 3, the electronic device further comprises a communication interface 302 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 are communicated with each other through the communication bus 304. The communication bus 304 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 304 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication interface 302 is used for communication between the electronic device and other devices.
The processor 301 in this embodiment may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor 301 is the control centre of the electronic device and connects the various parts of the whole electronic device through various interfaces and lines.
The memory 303 may be used for storing the computer program, and the processor 301 implements various functions of the electronic device by running or executing the computer program stored in the memory 303 and calling data stored in the memory 303.
The memory 303 may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
In other aspects, based on the same inventive concept, the present invention also provides a readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, can implement the flexible robot end arrival control method as described above.
The readable storage medium provided by this embodiment can address the problems of model uncertainty and external disturbance in dynamic modelling that arise because the flexible robot has characteristics such as high flexibility, high degree of freedom, and strong nonlinearity, and differs in structure from a traditional articulated robot composed of rigid joints and links.
The readable storage medium provided by this embodiment may take any combination of one or more computer-readable media. The readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this context, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
In this embodiment, computer program code for carrying out the operations of the embodiments may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that the apparatuses and methods disclosed in the embodiments herein can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments herein. In this regard, each block in the flowchart or block diagrams may represent a module, a program, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments herein may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In summary, the present invention provides a flexible robot end arrival control method, comprising: step S1, establishing a dynamic model of the flexible robot; step S2, establishing a deep neural network according to the dynamic model, the deep neural network being used to fit the dynamic model; step S3, performing a first training of the flexible robot end arrival process on the deep neural network to obtain initial parameters of the deep neural network; and step S4, performing a second training of the flexible robot end arrival process on the deep neural network to obtain final parameters of the deep neural network. A dynamic model of the flexible robot is thus established first; this model is not very accurate, so a deep neural network is then built from it. The deep neural network plays the same role as the dynamic model, in effect replacing it. The first training yields the initial parameters of the deep neural network, making the network known, but its precision is still insufficient at that point, so the known deep neural network is trained a second time to improve its precision. The method does not require an accurate mathematical model (dynamic model) of the flexible robot to be established; adaptive control is achieved through reward and punishment feedback on the operation process. Inherent control errors caused by an unknown or inaccurate dynamic model, as well as control errors caused by dimensionality reduction and simplification of the dynamic model, can be eliminated or weakened, the end control precision of the flexible robot is improved, and technical support is provided for operation tasks on a failed target such as on-orbit module replacement and auxiliary deployment of solar panels.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (7)

1. A flexible robot end arrival control method, characterized by comprising the following steps:
step S1, establishing a dynamic model of the flexible robot;
step S2, establishing a deep neural network according to the dynamic model, wherein the deep neural network is used to fit the dynamic model;
step S3, performing a first training of the flexible robot end arrival process on the deep neural network to obtain initial parameters of the deep neural network;
and step S4, performing a second training of the flexible robot end arrival process on the deep neural network to obtain final parameters of the deep neural network.
2. The flexible robot end arrival control method according to claim 1, wherein the step S1 includes:
regarding the flexible robot as a continuous model with the arc coordinate as the independent variable, and regarding the spatial pose of the flexible robot as the rotation or translation of a cross section about the centre line;
establishing the dynamic model of the flexible robot based on the Cosserat rod model;
the dynamic model is represented by formula (1) of the description, wherein F is the internal force on the cross section; M is the principal moment on the cross section; f is the distributed force on a single-section rod of the flexible robot; m is the distributed moment on a single-section rod of the flexible robot; J(s,t) is the inertia tensor of the rod per unit length; ρ is the density of the flexible robot rod per unit length; S is the cross-sectional area of the flexible robot rod per unit length; and ω is the angular velocity, expressed in the cross-section principal-axis coordinate system P-xyz, of the point P relative to the inertial coordinate system with respect to the time variable t.
3. The flexible robot end arrival control method according to claim 2, wherein the step S2 includes: acquiring random trajectory data of the flexible robot during the end arrival process in real time by using a calibrated measurement camera external to the robot in the laboratory;
converting the random trajectory data into training data according to the dynamic model to obtain a random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T); wherein s_t represents the state of the flexible robot at the current time t; a_t represents the action of the flexible robot at time t; r_t represents the reward predicted from the environment at time t; and t = 1, 2, ..., T;
taking the state s_t and the action a_t at the current time t as inputs, the state transition prediction model P(s_{t+1} | s_t, a_t), which predicts the state s_{t+1} at the next time, is expressed as follows:
s_{t+1} ~ P(s_{t+1} | s_t, a_t)
taking the state s_t and the action a_t at the current time t as inputs, the reward prediction model R(r_{t+1} | s_t, a_t), which predicts the environment reward r_{t+1} at the next time, is expressed as follows:
r_{t+1} ~ R(r_{t+1} | s_t, a_t)
according to the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t), the random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T) is converted into a density estimation model training set and a regression model training set, each containing T-1 groups of training samples;
the density estimation model training set is expressed as follows:
(s_1, a_1) → s_2, (s_2, a_2) → s_3, ..., (s_{T-1}, a_{T-1}) → s_T
the regression model training set is expressed as follows:
(s_1, a_1) → r_2, (s_2, a_2) → r_3, ..., (s_{T-1}, a_{T-1}) → r_T
4. The flexible robot end arrival control method according to claim 3, wherein the step S3 includes:
step S3.1, presetting an optimal control trajectory data set D_RL, which at this time is an empty set; and randomly initializing a first action value function Q_1(s_t, a_t | θ_1) of the deep neural network used for model-based reinforcement learning training, where θ_1 represents the parameters of the deep neural network corresponding to the first action value function; proceeding to step S3.2;
step S3.2, presetting a first training round number M_1, recording the current first training round number m_1, and judging whether the current first training round number m_1 is less than the preset first training round number M_1; if yes, proceeding to step S3.3; if not, proceeding to step S3.6;
step S3.3, judging whether the optimal control trajectory data set D_RL is an empty set; if yes, proceeding to step S3.3.1; if not, proceeding to step S3.3.2;
step S3.3.1, based on the random trajectory data set D_rand, applying a stochastic gradient descent method so that a loss function L(θ_1), defined over the set D_1, reaches its minimum;
in the formula, D_1 represents the set formed by the triples (s_t, a_t, s_{t+1}) in D_rand; s_t and a_t respectively represent the state and the action at the current time t; s_{t+1} represents the state at the subsequent time t+1;
then, the parameters θ_1 of the deep neural network are determined from the set of data in the random trajectory data set D_rand at which the loss function L(θ_1) reaches its minimum, so that the first action value function Q_1(s_t, a_t | θ_1) of the deep neural network is then known; proceeding to step S3.4;
step S3.3.2, based on the optimal control trajectory data set D_RL, applying a stochastic gradient descent method so that the loss function L(θ_1), now defined over the triples (s_t, a_t, s_{t+1}) in D_RL, reaches its minimum;
in the formula, s_t and a_t respectively represent the state and the action at the current time; s_{t+1} represents the state at the subsequent time;
then, the parameters θ_1 of the deep neural network are determined from the set of data in the optimal control trajectory data set D_RL at which the loss function L(θ_1) reaches its minimum, so that the first action value function Q_1(s_t, a_t | θ_1) of the deep neural network is then known; proceeding to step S3.4;
step S3.4, when the number of training times is less than the training sample group number T, judging that the training sample group number T is equal to the total number T of training data contained in the random trajectory data set D_rand; executing step S3.5;
step S3.5, obtaining the state s_t of the flexible robot at the current time t corresponding to the current training time; proceeding to step S3.5.1;
step S3.5.1, using the first action value function Q_1(s_t, a_t | θ_1) of the deep neural network, estimating an optimal action sequence containing T actions, where T is an integer; proceeding to step S3.5.2;
step S3.5.2, executing the first action a_t in the optimal action sequence, and combining the first action a_t with the state s_t of the flexible robot at the current time t corresponding to the current training time to obtain the optimal control trajectory (s_t, a_t); proceeding to step S3.5.3;
step S3.5.3, adding the obtained optimal control trajectory (s_t, a_t) to the optimal control trajectory data set D_RL; proceeding to step S3.5.4;
step S3.5.4, judging whether the number of training times is equal to the training sample group number T; if not, returning to step S3.5; if yes, returning to step S3.2;
step S3.6, finishing training; obtaining the final first action value function Q_1(s_t, a_t | θ_1) of the deep neural network and the initial parameters θ_1 corresponding thereto.
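A minimal, non-authoritative sketch of the first-stage loop in claim 4 follows. It stands in for the deep neural network with a linear value function, uses a squared-error surrogate loss and a one-step action search, and picks arbitrary hyper-parameter values; all of these are assumptions made for illustration, since the claim gives the loss function and the optimal-action-sequence formula only as images.

import numpy as np

rng = np.random.default_rng(0)

def q1(theta1, s, a):
    # Linear stand-in for the first action value function Q_1(s, a | theta_1).
    return theta1 @ np.concatenate([s, a])

def fit_q1(theta1, dataset, lr=1e-2, iters=200):
    # Steps S3.3.1 / S3.3.2: stochastic gradient descent on a squared-error surrogate loss.
    for _ in range(iters):
        s, a, target = dataset[rng.integers(len(dataset))]
        x = np.concatenate([s, a])
        theta1 = theta1 - lr * 2.0 * (theta1 @ x - target) * x
    return theta1

def first_optimal_action(theta1, s, candidates):
    # Stand-in for steps S3.5.1 / S3.5.2: pick the action that maximizes Q_1 for the current state.
    return max(candidates, key=lambda a: q1(theta1, s, a))

# D_rand as (state, action, scalar target) triples standing in for the converted training data.
D_rand = [(rng.normal(size=3), rng.normal(size=1), rng.normal()) for _ in range(20)]
D_RL, theta1 = [], np.zeros(4)
candidates = [np.array([u]) for u in np.linspace(-1.0, 1.0, 11)]

M1 = 3                                   # first training round number (illustrative value)
for m1 in range(M1):                     # step S3.2
    data = D_RL if D_RL else D_rand      # step S3.3: switch to D_RL once it is non-empty
    theta1 = fit_q1(theta1, data)        # steps S3.3.1 / S3.3.2
    for s, _, r in D_rand:               # steps S3.4 / S3.5: one pass over the T sample groups
        a = first_optimal_action(theta1, s, candidates)
        D_RL.append((s, a, r))           # step S3.5.3: grow the optimal control trajectory set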
5. The flexible robot end arrival control method according to claim 4, wherein the step S4 includes:
step S4.1, initializing the state transition prediction model P(s_{t+1} | s_t, a_t), the reward prediction model R(r_{t+1} | s_t, a_t), and a second action value function Q_2(s_t, a_t | θ) of the deep neural network used for model-free reinforcement learning training with its corresponding parameters θ, and then letting the parameters θ = 0; proceeding to step S4.2;
step S4.2, starting a trial from the initial state s_0, and initializing the parameters θ_1 corresponding to the first action value function Q_1(s_t, a_t | θ_1) of the deep neural network by letting θ_1 take its initial value; proceeding to step S4.3;
step S4.3, presetting an eligibility trace z and letting z = 0; proceeding to step S4.4;
step S4.4, for the initial state s_0 of each training round, executing one model-based reinforcement learning training simulation and updating the first action value function, the updated first action value function being Q_1(s_t, a_t | θ_1) with the obtained initial parameters θ_1; proceeding to step S4.5;
step S4.5, based on the state s_t at the current time t, combining the first action value function and the second action value function to obtain a combined action value function Q(s_t, a_t), and selecting an action a_t by using the ε-greedy method; proceeding to step S4.6;
step S4.6, if the error s_err = ||s_t − s_q|| between the current state s_t and the known desired terminal state s_q is greater than the constant value Δ, proceeding to step S4.6.1; otherwise, returning to step S4.2;
step S4.6.1, executing the action a_t selected in step S4.5; obtaining the subsequent state s_{t+1} based on the state transition prediction model P(s_{t+1} | s_t, a_t), and receiving a reward r based on the reward prediction model R(r_{t+1} | s_t, a_t); updating the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t) by using the subsequent state s_{t+1}, the action a_t and the reward r; proceeding to step S4.6.2;
step S4.6.2, using the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t) obtained in step S4.6.1, performing one model-based reinforcement learning training simulation starting from the subsequent state s_{t+1}, updating the first action value function Q_1(s_t, a_t | θ_1), and obtaining its corresponding parameters θ_1; proceeding to step S4.6.3;
step S4.6.3, based on the subsequent state s_{t+1} and the combined action value function Q(s_{t+1}, a_{t+1}), selecting the action a_{t+1} to be actually executed next by using a greedy method; proceeding to step S4.6.4;
step S4.6.4, obtaining the deviation of the second action value function based on the model-free reinforcement learning training simulation, and updating the second action value function corresponding to the model-free reinforcement learning training by using the deviation of the second action value function according to θ ← θ + αz, where α denotes the learning rate, a constant between 0 and 1;
step S4.6.5, updating the eligibility trace z, wherein λ represents a discount factor, a constant between 0 and 1; proceeding to step S4.6.6;
step S4.6.6, transferring the acquired state of the flexible robot to the subsequent state, i.e. s_t = s_{t+1}, a_t = a_{t+1}; proceeding to step S4.6.7;
step S4.6.7, presetting a second training round number M_2, recording the current second training round number m_2, and judging whether the current second training round number m_2 is less than the preset second training round number M_2; if yes, returning to step S4.2; if not, proceeding to step S4.7;
step S4.7, finishing training; obtaining the final second action value function Q_2(s_t, a_t | θ) of the deep neural network and the final parameters θ corresponding thereto.
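The model-free half of claim 5 (steps S4.6.4 and S4.6.5) updates the second action value function with a deviation, an eligibility trace z, a learning rate α and a discount factor λ. The sketch below shows one such update for a linear value function; the TD-error form of the deviation, the accumulating trace rule, the discount γ and the feature vectors are assumptions introduced for illustration, because the claim gives these formulas only as images.

import numpy as np

def model_free_update(theta, z, phi_t, phi_next, reward, alpha=0.1, lam=0.9, gamma=0.95):
    # One SARSA(lambda)-style update of a linear second action value function
    # Q_2(s, a | theta) = theta . phi(s, a). The deviation and trace forms are assumed.
    delta = reward + gamma * (theta @ phi_next) - (theta @ phi_t)  # deviation of Q_2
    z = lam * gamma * z + phi_t                                    # step S4.6.5: trace update (assumed form)
    theta = theta + alpha * delta * z                              # step S4.6.4: theta <- theta + alpha*z, scaled by the deviation
    return theta, z

# Toy usage with 4-dimensional state-action features.
theta, z = np.zeros(4), np.zeros(4)
theta, z = model_free_update(theta, z,
                             phi_t=np.array([1.0, 0.0, 0.5, 0.0]),
                             phi_next=np.array([0.0, 1.0, 0.0, 0.5]),
                             reward=1.0)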
6. An electronic device comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the method of any of claims 1 to 5.
7. A readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 5.
CN202010635603.4A 2020-07-03 2020-07-03 Flexible robot end arrival control method, electronic device and storage medium Active CN111783250B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010635603.4A CN111783250B (en) 2020-07-03 2020-07-03 Flexible robot end arrival control method, electronic device and storage medium


Publications (2)

Publication Number Publication Date
CN111783250A true CN111783250A (en) 2020-10-16
CN111783250B CN111783250B (en) 2024-09-10

Family

ID=72758726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010635603.4A Active CN111783250B (en) 2020-07-03 2020-07-03 Flexible robot end arrival control method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN111783250B (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130131868A1 (en) * 2010-07-08 2013-05-23 Vanderbilt University Continuum robots and control thereof
US20160016319A1 (en) * 2010-07-08 2016-01-21 Vanderbilt University Continuum devices and control methods thereof
US20150100530A1 (en) * 2013-10-08 2015-04-09 Google Inc. Methods and apparatus for reinforcement learning
WO2019241680A1 (en) * 2018-06-15 2019-12-19 Google Llc Deep reinforcement learning for robotic manipulation
CN109726866A (en) * 2018-12-27 2019-05-07 浙江农林大学 Unmanned boat paths planning method based on Q learning neural network
CN110321666A (en) * 2019-08-09 2019-10-11 重庆理工大学 Multi-robots Path Planning Method based on priori knowledge Yu DQN algorithm
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN111190429A (en) * 2020-01-13 2020-05-22 南京航空航天大学 Unmanned aerial vehicle active fault-tolerant control method based on reinforcement learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LI HUI et al.: "A robot path planning method based on deep reinforcement learning in complex environments", Application Research of Computers, vol. 37, no. 1, 30 June 2020 (2020-06-30), pages 129-131 *
WANG FALIN et al.: "Physical characteristic modeling and deformation simulation of flexible cables based on the exact Cosserat model", Journal of Computer-Aided Design & Computer Graphics, vol. 29, no. 07, 15 July 2017 (2017-07-15), pages 1343-1355 *
WANG GUIHONG: "Research on deep reinforcement learning in cooperative multi-agent systems", China Master's Theses Full-text Database, Information Science and Technology, no. 01, 15 January 2020 (2020-01-15), pages 140-323 *
YUAN WENTING: "Octopus-inspired modeling and control of a flexible arm", China Master's Theses Full-text Database, Information Science and Technology, no. 04, 15 April 2017 (2017-04-15), pages 140-149 *
ZHAO HUI: "Research on manipulator trajectory planning based on the Q-learning algorithm", China Master's Theses Full-text Database, Information Science and Technology, no. 12, 15 December 2013 (2013-12-15), pages 140-43 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112372637A (en) * 2020-10-27 2021-02-19 东方红卫星移动通信有限公司 Adaptive impedance compliance control method, module and system for low-orbit satellite space manipulator
CN112372637B (en) * 2020-10-27 2022-05-06 东方红卫星移动通信有限公司 Adaptive impedance compliance control method, module and system for low-orbit satellite space manipulator
CN112540620A (en) * 2020-12-03 2021-03-23 西湖大学 Reinforced learning method and device for foot type robot and electronic equipment
CN112540620B (en) * 2020-12-03 2022-10-14 西湖大学 Reinforced learning method and device for foot type robot and electronic equipment
CN113267993A (en) * 2021-04-22 2021-08-17 上海大学 Network training method and device based on collaborative learning
CN113848711A (en) * 2021-09-18 2021-12-28 内蒙古工业大学 Data center refrigeration control algorithm based on safety model reinforcement learning
CN113848711B (en) * 2021-09-18 2023-07-14 内蒙古工业大学 Data center refrigeration control algorithm based on safety model reinforcement learning
CN115935553A (en) * 2022-12-29 2023-04-07 深圳技术大学 Linear flexible body deformation state analysis method and related device
CN115935553B (en) * 2022-12-29 2024-02-09 深圳技术大学 Linear flexible body deformation state analysis method and related device
CN115946131A (en) * 2023-03-14 2023-04-11 之江实验室 Flexible joint mechanical arm motion control simulation calculation method and device
CN116038773A (en) * 2023-03-29 2023-05-02 之江实验室 Vibration characteristic analysis method and device for flexible joint mechanical arm

Also Published As

Publication number Publication date
CN111783250B (en) 2024-09-10

Similar Documents

Publication Publication Date Title
CN111783250A (en) Flexible robot end arrival control method, electronic device, and storage medium
Song et al. Indirect neuroadaptive control of unknown MIMO systems tracking uncertain target under sensor failures
CN104950678A (en) Neural network inversion control method for flexible manipulator system
JPH10133703A (en) Adaptive robust controller
US11759947B2 (en) Method for controlling a robot device and robot device controller
CN112077839B (en) Motion control method and device for mechanical arm
US20210107144A1 (en) Learning method, learning apparatus, and learning system
CN112571420B (en) Dual-function model prediction control method under unknown parameters
CN112959326B (en) Method and device for solving positive kinematics of robot, readable storage medium and robot
De Stefano et al. Reproducing physical dynamics with hardware-in-the-loop simulators: A passive and explicit discrete integrator
Shen et al. Cascade predictor for a class of mechanical systems under large uncertain measurement delays
Meyes et al. Continuous motion planning for industrial robots based on direct sensory input
Zhang et al. Time delay compensation of a robotic arm based on multiple sensors for indirect teaching
Yang et al. Model-free control of underwater vehicle-manipulator system interacting with unknown environments
CN116150934A (en) Ship maneuvering Gaussian process regression online non-parameter identification modeling method
WO2021186500A1 (en) Learning device, learning method, and recording medium
CN113219842B (en) Mechanical arm optimal tracking control method, system, processing equipment and storage medium based on self-adaptive dynamic programming
CN110703595B (en) Master satellite attitude forecasting method and system of satellite-arm coupling system
CN115170666A (en) Robot navigation method and system based on external memory
Guo et al. Robot path planning via deep reinforcement learning with improved reward function
Yovchev et al. Iterative learning control of hard constrained robotic manipulators
CN110515299B (en) Master satellite attitude decoupling forecasting method and system of satellite-arm coupling system
Dai et al. A robust optimal control by grey wolf optimizer for underwater vehicle-manipulator system
Poddighe Comparing FABRIK and neural networks to traditional methods in solving Inverse Kinematics
Yovchev et al. Genetic Algorithm with Iterative Learning Control for Estimation of the Parameters of Robot Dynamics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant