CN111783250A - Flexible robot end arrival control method, electronic device, and storage medium - Google Patents


Info

Publication number
CN111783250A
Authority
CN
China
Prior art keywords: training, state, flexible robot, neural network, deep neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010635603.4A
Other languages
Chinese (zh)
Other versions
CN111783250B (en)
Inventor
孙俊
武海雷
韩飞
孙玥
刘超镇
阳光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Aerospace Control Technology Institute
Original Assignee
Shanghai Aerospace Control Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Aerospace Control Technology Institute filed Critical Shanghai Aerospace Control Technology Institute
Priority to CN202010635603.4A
Publication of CN111783250A
Application granted
Publication of CN111783250B
Legal status: Active
Anticipated expiration

Classifications

    • G06F30/17 — Computer-aided design [CAD]; Geometric CAD; Mechanical parametric or variational design
    • G06F30/27 — Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • B25J9/1633 — Programme controls characterised by the control loop; compliant, force, torque control, e.g. combined with position control
    • B25J9/1635 — Programme controls characterised by the control loop; flexible-arm control
    • B25J9/1676 — Programme controls characterised by safety, monitoring, diagnostic; avoiding collision or forbidden zones
    • G06N3/045 — Neural networks; architecture; combinations of networks


Abstract

The invention discloses a flexible robot end arrival control method, an electronic device, and a storage medium. The method comprises the following steps: establishing a dynamic model of the flexible robot; establishing a deep neural network according to the dynamic model, wherein the deep neural network is used to fit the dynamic model; performing a first training of the flexible robot end arrival process on the deep neural network to obtain initial parameters of the deep neural network; and performing a second training of the flexible robot end arrival process on the deep neural network to obtain final parameters of the deep neural network. The invention reduces the influence of dynamic model uncertainty and external disturbances on the control system and improves the end control precision of the flexible robot.

Description

Flexible robot end arrival control method, electronic device, and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a flexible robot end arrival control method based on deep reinforcement learning, an electronic device, and a storage medium.
Background
The control objective of a flexible robot driven by variable-length cables is to bring its end accurately to the target while avoiding collision with the surrounding environment. The safe end arrival control process faces the following problems: first, because the flexible robot control system is complex, strongly nonlinear, time-varying, and uncertain, an accurate dynamic model is difficult to establish; second, during motion of the flexible robot, external disturbances such as friction exist between the cable, the reel, and the disc. Classical feedback control methods, which require a fully known model, therefore have difficulty suppressing the trajectory-tracking inaccuracy caused by model uncertainty and external disturbances, and may even render the control system unstable.
Disclosure of Invention
The invention aims to provide a flexible robot end arrival control method based on deep reinforcement learning, an electronic device, and a storage medium, in order to reduce the influence of dynamic model uncertainty and external disturbances on the control system and to improve the end control precision of the flexible robot.
In order to achieve the above purpose, the invention is realized by the following technical scheme:
A flexible robot end arrival control method, comprising:
step S1, establishing a dynamic model of the flexible robot;
step S2, establishing a deep neural network according to the dynamic model, wherein the deep neural network is used to fit the dynamic model;
step S3, performing a first training of the flexible robot end arrival process on the deep neural network to obtain initial parameters of the deep neural network;
and step S4, performing a second training of the flexible robot end arrival process on the deep neural network to obtain final parameters of the deep neural network.
Preferably, the step S1 includes:
regarding the flexible robot as a continuous model with the arc coordinate as the independent variable, and regarding the spatial pose of the flexible robot as the rotation or translation of a cross section about the centre line;
establishing the dynamic model of the flexible robot based on the Cosserat rod model;
the dynamic model is represented by formula (1) of the detailed description, wherein F is the internal force on the cross section; M is the principal moment on the cross section; f is the distributed force on a single-section rod of the flexible robot; m is the distributed moment on a single-section rod of the flexible robot; J(s,t) is the inertia tensor of the rod per unit length; ρ is the density of the flexible robot rod per unit length; S is the cross-sectional area of the flexible robot rod per unit length; and ω is the angular velocity, expressed in the cross-section principal-axis coordinate system P-xyz, of the point P relative to the inertial coordinate system with respect to the time variable t.
Preferably, the step S2 includes: acquiring random trajectory data of the flexible robot during the end arrival process in real time by using a calibrated measurement camera external to the robot in the laboratory;
converting the random trajectory data into training data according to the dynamic model to obtain a random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T); wherein s_t represents the state of the flexible robot at the current time t; a_t represents the action of the flexible robot at time t; r_t represents the reward predicted from the environment at time t; and t = 1, 2, ..., T;
taking the state s_t and the action a_t at the current time t as inputs, the state transition prediction model P(s_{t+1} | s_t, a_t), which predicts the state s_{t+1} at the next time, is expressed as follows:
s_{t+1} ~ P(s_{t+1} | s_t, a_t)
taking the state s_t and the action a_t at the current time t as inputs, the reward prediction model R(r_{t+1} | s_t, a_t), which predicts the environment reward r_{t+1} at the next time, is expressed as follows:
r_{t+1} ~ R(r_{t+1} | s_t, a_t)
according to the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t), the random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T) is converted into a density estimation model training set and a regression model training set, each containing T-1 groups of training samples;
the density estimation model training set is expressed as follows:
(s_1, a_1) → s_2, (s_2, a_2) → s_3, ..., (s_{T-1}, a_{T-1}) → s_T
the regression model training set is expressed as follows:
(s_1, a_1) → r_2, (s_2, a_2) → r_3, ..., (s_{T-1}, a_{T-1}) → r_T
preferably, step S3.1, a data set D of optimal control trajectories is presetRLWhen it is in the empty set state; and randomly initializing a first action value function based on the deep neural network with model reinforcement learning training
Figure BDA0002568789240000031
Figure BDA0002568789240000032
The parameters of the deep neural network corresponding to the first action value function are represented, and the step S3.2 is carried out;
step S3.2, presetting a first training round number M1Recording the current first training round number m1Judging the current first training round number m1Whether or not it is less than the preset first training round number M1If yes, go to step S3.3; if not, entering step S3.6;
step S3.3, judging the optimal control track data set DRLIf it is an empty set, go to step S3.3.1; if not, go to step S3.3.2;
step S3.3.1, based on the random trajectory data set DrandApplying a random gradient descent method to make the loss function
Figure BDA0002568789240000033
The minimum value is reached, and the minimum value,
Figure BDA0002568789240000034
in the formula, D1Represents DrandMiddle(s)t,at,st+1) A set of constructs; stAnd atRespectively representing the state and the action when the current time is t; st+1Represents a state in which the subsequent time is t + 1;
then using the loss function
Figure BDA0002568789240000041
The random trajectory data set D when the minimum is reachedrandThe set of data in (a) determines the parameters of the deep neural network
Figure BDA0002568789240000042
A first action value function of the deep neural network at this time
Figure BDA0002568789240000043
If the state is known, go to step S3.4;
step S3.3.2, based on the optimal control trajectory data set DRLApplying a random gradient descent method to make the loss function
Figure BDA0002568789240000044
The minimum value is reached, and the minimum value,
Figure BDA0002568789240000045
in the formula stAnd atRespectively representing the state and the action of the current moment; st+1Indicating the state at the subsequent time;
then using the loss function
Figure BDA0002568789240000046
The optimal control trajectory data set D when the minimum is reachedRLIs used for solving the parameters of the deep neural network
Figure BDA0002568789240000047
A first action value function of the deep neural network at this time
Figure BDA0002568789240000048
If the state is known, go to step S3.4;
s3.4, judging that the training sample group number T is equal to the random track data set D when the training times is less than the training sample group number TrandThe total number T of training data contained in (a); step S3.5 is executed;
step S3.5, obtaining the corresponding number of times of current trainingThe state s of the flexible robot at the current time ttGo to step S3.5.1;
step S3.5.1, using a first action value function of the deep neural network
Figure BDA0002568789240000049
An optimal action sequence containing T actions is estimated
Figure BDA00025687892400000411
Figure BDA00025687892400000410
In the formula
Figure BDA00025687892400000412
T is an integer;
proceed to step S3.5.2; step S3.5.2, executing the optimal sequence of actions
Figure BDA00025687892400000510
First action a in (1)tThe first action atThe state s of the flexible robot at the current moment t corresponding to the current training timestCombining to obtain the optimal control track(s)t,at) Go to step S3.5.3;
step S3.5.3, obtaining the optimal control track(s)t,at) Adding to the optimal control trajectory data set DRLPerforming the following steps; proceed to step S3.5.4;
step S3.5.4, judging whether the training times is equal to the training sample group number T; if not, returning to the step S3.5; if yes, returning to the step S3.2;
s3.6, finishing training; obtaining a final first action value function of the deep neural network
Figure BDA0002568789240000051
And the initial parameters corresponding thereto
Figure BDA0002568789240000052
Preferably, the step S4 includes:
step S4.1, initializing the state transition prediction model P(s_{t+1} | s_t, a_t), the reward prediction model R(r_{t+1} | s_t, a_t), and the parameter θ corresponding to the second action value function of the deep neural network for model-free reinforcement learning training, letting θ = 0 at this time; proceeding to step S4.2;
step S4.2, starting a trial from the initial state s_0, and initializing the parameter corresponding to the first action value function of the deep neural network; proceeding to step S4.3;
step S4.3, presetting an eligibility trace z and letting z = 0; proceeding to step S4.4;
step S4.4, for the initial state s_0 of each training round, executing one model-based reinforcement learning training simulation and updating the first action value function, obtaining the updated first action value function and its initial parameters; proceeding to step S4.5;
step S4.5, based on the state s_t at the current time t, combining the first action value function and the second action value function to obtain a joint action value function, and selecting an action a_t using the ε-greedy method; proceeding to step S4.6;
step S4.6, if the error s_err = ||s_t - s_q|| between the current state s_t and the known desired terminal state s_q is greater than the constant value Δ, proceeding to step S4.6.1; otherwise, returning to step S4.2;
step S4.6.1, executing the action a_t selected in step S4.5; obtaining the subsequent state s_{t+1} based on the state transition prediction model P(s_{t+1} | s_t, a_t) and receiving a reward r based on the reward prediction model R(r_{t+1} | s_t, a_t); using the subsequent state s_{t+1}, the action a_t, and the reward r to update the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t); proceeding to step S4.6.2;
step S4.6.2, using the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t) obtained in step S4.6.1, performing one model-based reinforcement learning training simulation starting from the subsequent state s_{t+1}, updating the first action value function, and obtaining its corresponding parameters; proceeding to step S4.6.3;
step S4.6.3, based on the subsequent state s_{t+1} and the joint action value function, selecting the action a_{t+1} actually to be executed next by means of a greedy method; proceeding to step S4.6.4;
step S4.6.4, based on the model-free reinforcement learning training simulation, obtaining the deviation δ of the corresponding second action value function, and using this deviation to update the second action value function corresponding to the model-free reinforcement learning training: θ ← θ + αδz, where α denotes the learning rate, a constant value between 0 and 1;
step S4.6.5, updating the eligibility trace z, where λ denotes a discount factor, a constant value between 0 and 1; proceeding to step S4.6.6;
step S4.6.6, transferring the acquired state of the flexible robot to the subsequent state, i.e. s_t = s_{t+1}, a_t = a_{t+1}; proceeding to step S4.6.7;
step S4.6.7, presetting a second number of training rounds M_2 and recording the current second training round number m_2; judging whether the current second training round number m_2 is less than the preset second number of training rounds M_2; if yes, returning to step S4.2; if not, proceeding to step S4.7;
step S4.7, ending the training; obtaining the final second action value function of the deep neural network and its corresponding final parameters.
In another aspect, the present invention also provides an electronic device comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the method as described above.
In a further aspect, the present invention also provides a readable storage medium in which a computer program is stored, the computer program, when executed by a processor, implementing the method as described above.
Compared with the prior art, the invention has the following advantages:
the invention discloses a flexible robot tail end arrival control method, which comprises the following steps: step S1, establishing a dynamic model of the flexible robot; step S2, establishing a deep neural network according to the dynamic model, wherein the deep neural network is used for fitting the dynamic model; step S3, carrying out primary training of the terminal arrival process of the flexible robot for the first time on the deep neural network to obtain initial parameters of the deep neural network; and step S4, performing primary training of the flexible robot arrival process for the second time on the deep neural network to obtain final parameters of the deep neural network. Therefore, the dynamic model of the flexible robot is firstly established, the dynamic model is not very accurate, and then a deep neural network is established by combining the dynamic model, the deep neural network has the effect equivalent to the dynamic model, and the principle is similar to that the dynamic model is replaced by the deep neural network; the initial parameters of the deep neural network can be obtained through the first training, so that the deep neural network becomes known, but the precision of the deep neural network is still insufficient at the moment, and therefore the known deep neural network is trained for the second time and used for improving the precision of the deep neural network. According to the method, an accurate mathematical model (a dynamic model) of the flexible robot is not required to be established, adaptive control is realized by performing reward and punishment feedback on the operation process, inherent control errors caused by unknown or inaccurate dynamic model and control errors caused by dimensionality reduction and simplification of the dynamic model can be eliminated or weakened, the tail end control precision of the flexible robot is improved, and technical support is provided for operation tasks such as on-orbit module replacement, sailboard auxiliary expansion and the like of a failure target.
Drawings
Fig. 1 is a flowchart of a flexible robot end arrival control method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of various coordinate systems according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a simple structure of an electronic apparatus according to an embodiment of the invention.
Detailed Description
The flexible robot end arrival control method, electronic device, and storage medium according to the present invention will be described in detail below with reference to fig. 1 to 3 and the following detailed description. The advantages and features of the present invention will become more apparent from this description. It is noted that the drawings are in a greatly simplified form and are not drawn to precise scale; they are used only to facilitate and clarify the description of the embodiments of the present invention. The structures, ratios, sizes, and the like shown in the drawings and described in the specification are only intended to accompany the disclosure so that it can be understood and read by those skilled in the art, and are not intended to limit the conditions under which the present invention can be implemented; accordingly, any structural modification, change of proportion, or adjustment of size that does not affect the efficacy or purpose achievable by the invention still falls within the scope of the present invention.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As shown in fig. 1, the present embodiment provides a flexible robot end arrival control method, comprising:
step S1, establishing a dynamic model of the flexible robot;
step S2, establishing a deep neural network according to the dynamic model, wherein the deep neural network is used to fit the dynamic model;
step S3, performing a first training of the flexible robot end arrival process on the deep neural network to obtain initial parameters of the deep neural network;
and step S4, performing a second training of the flexible robot end arrival process on the deep neural network to obtain final parameters of the deep neural network.
Further, as shown in fig. 2, the step S1 includes: regarding the flexible robot as a continuous model with the arc coordinate as the independent variable, and regarding the spatial pose of the flexible robot as the rotation or translation of a cross section about the centre line;
establishing the dynamic model of the flexible robot based on the Cosserat rod model;
the dynamic model is represented by formula (1) below, wherein F is the internal force on the cross section; M is the principal moment on the cross section; f is the distributed force on a single-section rod of the flexible robot; m is the distributed moment on a single-section rod of the flexible robot; J(s,t) is the inertia tensor of the rod per unit length; ρ is the density of the flexible robot rod per unit length; S is the cross-sectional area of the flexible robot rod per unit length; and ω is the angular velocity, expressed in the cross-section principal-axis coordinate system P-xyz, of the point P relative to the inertial coordinate system with respect to the time variable t.
The specific establishment process is as follows: in view of the high flexibility, high degree of freedom, and strong nonlinearity of the flexible robot, a dynamic equation based on the Cosserat rod model is established. The basic idea is to regard the flexible robot as a continuous model with the arc coordinate as the independent variable, so that the spatial pose of the robot can be regarded as the rotation or translation of a cross section about the centre line.
Continuing with fig. 2, a spatial geometry description based on the Frenet coordinate system is used: the T axis of the Frenet coordinate system (P-NBT) is the tangent vector of the cable centre line at point P, the N axis is the normal vector of the cable centre line at point P, and the B axis is the binormal vector of the cable centre line at point P (B = T × N); the three coordinate axes N, B, T are pairwise orthogonal. An auxiliary vector ω_F is introduced, defined as
ω_F(s) = κ(s)B + τ(s)T (2)
where κ(s) and τ(s) are the curvature and torsion of the centre line, and ω_F(s) is called the Darboux vector of the curve; its physical meaning is the angular velocity of rotation of the Frenet coordinate system relative to the inertial coordinate system when the point P moves along the curve C in the forward direction of the arc coordinate s with unit speed.
The variation of the vectors N, B, and T with the arc coordinate s is determined by the Frenet–Serret differential equations (3).
Since the tangent vector T is the derivative of the position vector with respect to the arc coordinate (formula (4)), the flexible robot centre-line curve r(s) is obtained by integration along the arc coordinate (formula (5)).
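The original formula images for equations (3)–(5) are not reproduced in this text. As an assumption consistent with the definitions of N, B, T, and the Darboux vector given above, the standard Frenet–Serret relations that equation (3) refers to can be sketched in LaTeX as follows; the exact notation in the patent may differ.

```latex
% Assumed reconstruction of equations (3)-(5): Frenet-Serret formulas written with
% the Darboux vector \omega_F(s) = \kappa(s)\,\mathbf{B} + \tau(s)\,\mathbf{T}.
\begin{aligned}
\frac{\mathrm{d}\mathbf{T}}{\mathrm{d}s} &= \omega_F \times \mathbf{T} = \kappa\,\mathbf{N},\\
\frac{\mathrm{d}\mathbf{N}}{\mathrm{d}s} &= \omega_F \times \mathbf{N} = -\kappa\,\mathbf{T} + \tau\,\mathbf{B},\\
\frac{\mathrm{d}\mathbf{B}}{\mathrm{d}s} &= \omega_F \times \mathbf{B} = -\tau\,\mathbf{N},
\end{aligned}
\qquad
\frac{\partial \mathbf{r}}{\partial s} = \mathbf{T},
\qquad
\mathbf{r}(s) = \mathbf{r}(0) + \int_{0}^{s} \mathbf{T}(\sigma)\,\mathrm{d}\sigma .
```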
On the basis of the centre-line deformation, and further taking the size of the cross section into account, the torsional deformation about the z axis at the axis of the flexible robot is included, and a cross-section principal-axis coordinate system (P-xyz) is established. According to this geometric configuration, the kinematic equations are obtained, where the tilde denotes the local derivative with respect to the cross-section principal-axis coordinate system (P-xyz), and ω is the angular velocity of the point P relative to the inertial coordinate system with respect to the time variable t.
The discrete-element concept is then adopted to discretize the continuous manipulator into infinitesimal segments. According to Newton's law and the theorem of moment of momentum about the centroid, the cable dynamic equation for a segment is obtained as formula (6), where ρ is the density of the flexible robot rod per unit length; S is the cross-sectional area of the flexible robot rod per unit length; J(s,t) is the inertia tensor of the rod per unit length; F is the internal force on the cross section; M is the principal moment (internal moment) on the cross section; f is the distributed force on the rod; m is the distributed moment on a single-section rod of the flexible robot; Δs denotes the arc length of an infinitesimal segment; and ΔF and ΔM denote the changes of the internal force and of the principal moment, respectively, over the change Δs of the cross-section position.
Dividing both sides of formula (6) by Δs yields formula (1), whereby the dynamic model of a single-section rod of the flexible manipulator (the dynamic model of the flexible robot) is established.
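The formula images for equations (1) and (6) are not reproduced in this text. As an assumption consistent with the variable definitions above (F, M, f, m, J(s,t), ρ, S, r, ω), the per-unit-length Cosserat rod dynamic equations that formula (1) typically takes are sketched below in LaTeX; the exact form used in the patent may differ.

```latex
% Assumed reconstruction of formula (1): balance of linear and angular momentum
% for a unit-length element of the flexible rod (Cosserat model).
\begin{aligned}
\rho S \,\frac{\partial^{2} \mathbf{r}}{\partial t^{2}}
  &= \frac{\partial \mathbf{F}}{\partial s} + \mathbf{f}, \\[4pt]
\rho \mathbf{J}(s,t)\,\frac{\partial \boldsymbol{\omega}}{\partial t}
 + \boldsymbol{\omega} \times \bigl(\rho \mathbf{J}(s,t)\, \boldsymbol{\omega}\bigr)
  &= \frac{\partial \mathbf{M}}{\partial s}
   + \frac{\partial \mathbf{r}}{\partial s} \times \mathbf{F} + \mathbf{m}.
\end{aligned}
```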
Further, the step S2 includes: acquiring random trajectory data of the flexible robot during the end arrival process in real time by using a calibrated measurement camera external to the robot in the laboratory;
converting the random trajectory data into training data according to the dynamic model to obtain a random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T); wherein s_t represents the state of the flexible robot at the current time t; a_t represents the action of the flexible robot at time t; r_t represents the reward predicted from the environment at time t; and t = 1, 2, ..., T;
taking the state s_t and the action a_t at the current time t as inputs, the state transition prediction model P(s_{t+1} | s_t, a_t), which predicts the state s_{t+1} at the next time, is expressed as follows:
s_{t+1} ~ P(s_{t+1} | s_t, a_t) (7)
taking the state s_t and the action a_t at the current time t as inputs, the reward prediction model R(r_{t+1} | s_t, a_t), which predicts the environment reward r_{t+1} at the next time, is expressed as follows:
r_{t+1} ~ R(r_{t+1} | s_t, a_t) (8)
according to the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t), the random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T) is converted into a density estimation model training set and a regression model training set, each containing T-1 groups of training samples;
the density estimation model training set (corresponding to the state transition prediction model P(s_{t+1} | s_t, a_t)) is expressed as follows:
(s_1, a_1) → s_2, (s_2, a_2) → s_3, ..., (s_{T-1}, a_{T-1}) → s_T (9)
the regression model training set (corresponding to the reward prediction model R(r_{t+1} | s_t, a_t)) is expressed as follows:
(s_1, a_1) → r_2, (s_2, a_2) → r_3, ..., (s_{T-1}, a_{T-1}) → r_T (10).
the basic idea of step S2 is to fit a dynamic model of the flexible robot by using a deep neural network, apply a model-based reinforcement learning method (model reinforcement learning training), use a learned neural network model in a model predictive control framework, before selecting an actual execution action, the flexible robot first performs a simulation based on the dynamic model from a current state, the simulation simulates a completed trajectory, so as to evaluate a current action value function, and implement an initial training of the flexible robot in the arrival process.
Further, the step S3 includes: step S3.1, presetting an optimal control trajectory data set D_RL, which is initially the empty set; and randomly initializing the first action value function of the deep neural network for model-based reinforcement learning training, together with the deep neural network parameters corresponding to the first action value function; then proceeding to step S3.2.
Step S3.2, presetting a first number of training rounds M_1 and recording the current first training round number m_1; judging whether the current first training round number m_1 is less than the preset first number of training rounds M_1; if yes, proceed to step S3.3; if not, proceed to step S3.6.
Step S3.3, judging whether the optimal control trajectory data set D_RL is the empty set; if yes, proceed to step S3.3.1; if not, proceed to step S3.3.2.
Step S3.3.1, based on the random trajectory data set D_rand, applying the stochastic gradient descent method so that the loss function reaches its minimum, where D_1 represents the set formed by the tuples (s_t, a_t, s_{t+1}) in D_rand; s_t and a_t respectively represent the state and the action at the current time t; and s_{t+1} represents the state at the subsequent time t+1.
The data of the random trajectory data set D_rand for which the loss function reaches its minimum are then used to determine the parameters of the deep neural network, so that the first action value function of the deep neural network becomes known; proceed to step S3.4.
Step S3.3.2, based on the optimal control trajectory data set D_RL, applying the stochastic gradient descent method so that the loss function reaches its minimum, where s_t and a_t respectively represent the state and the action at the current time, and s_{t+1} represents the state at the subsequent time.
The data of the optimal control trajectory data set D_RL for which the loss function reaches its minimum are then used to solve for the parameters of the deep neural network, so that the first action value function of the deep neural network becomes known; proceed to step S3.4.
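The exact loss function minimised in steps S3.3.1 and S3.3.2 is contained in formula images that are not reproduced here. A minimal PyTorch sketch of fitting a neural network to the transition tuples (s_t, a_t) → s_{t+1} by stochastic gradient descent is given below, under the assumption of a mean-squared prediction error; the network size, optimiser settings, and loss form are illustrative, not the patent's.

```python
import torch
import torch.nn as nn

def fit_dynamics_network(inputs, targets, epochs=200, lr=1e-3, batch_size=64):
    """Fit a network s_hat_{t+1} = f(s_t, a_t) to tuples from D_rand or D_RL
    by stochastic gradient descent on a mean-squared prediction loss (assumed)."""
    net = nn.Sequential(
        nn.Linear(inputs.shape[1], 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, targets.shape[1]),
    )
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    dataset = torch.utils.data.TensorDataset(inputs, targets)
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = nn.functional.mse_loss(net(x), y)  # assumed loss; the patent's exact form is not reproduced
            loss.backward()
            opt.step()
    return net

# Usage with the (dens_x, dens_y) arrays produced by the build_training_sets sketch above:
# dynamics_net = fit_dynamics_network(torch.tensor(dens_x, dtype=torch.float32),
#                                     torch.tensor(dens_y, dtype=torch.float32))
```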
Step S3.4, while the training count is less than the number of training sample groups T, which equals the total number T of training data groups contained in the random trajectory data set D_rand, step S3.5 is executed.
Step S3.5, obtaining the state s_t of the flexible robot at the current time t corresponding to the current training count; proceed to step S3.5.1.
Step S3.5.1, using the first action value function of the deep neural network, estimating an optimal action sequence containing T actions (T being an integer); proceed to step S3.5.2.
Step S3.5.2, executing the first action a_t of the optimal action sequence, and combining the first action a_t with the state s_t of the flexible robot at the current time t corresponding to the current training count to obtain the optimal control trajectory (s_t, a_t); proceed to step S3.5.3.
Step S3.5.3, adding the obtained optimal control trajectory (s_t, a_t) to the optimal control trajectory data set D_RL; proceed to step S3.5.4.
Step S3.5.4, judging whether the training count equals the number of training sample groups T; if not, return to step S3.5; if yes, return to step S3.2.
Step S3.6, ending the training; obtaining the final first action value function of the deep neural network and its corresponding initial parameters.
It can be seen that the basic idea of step S3 is to define, for the model-based flexible-arm end arrival reinforcement learning task, a reward function that encodes the task: a reward is given when the trajectory arrives near the desired end position and follows the desired trajectory. A model predictive controller (MPC) executes only the first-ranked action a_t of the optimal action sequence obtained in step S3.5.1 and adds the corresponding state–action pair to the optimal control trajectory data set, thereby increasing the robustness of the embodiment.
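A minimal sketch of the model-predictive selection used in step S3.5 is given below: candidate action sequences are rolled out through the learned dynamics network, scored with a task reward, and only the first action of the best sequence is executed. Random-shooting optimisation and the simple distance-based reward are assumptions made for illustration; the patent does not fix either choice.

```python
import torch

def mpc_select_action(dynamics_net, state, horizon, action_dim, desired_state,
                      n_candidates=256):
    """Evaluate random candidate action sequences through the learned model and
    return the first action of the best-scoring sequence (assumed random shooting)."""
    best_return, best_first_action = -float("inf"), None
    for _ in range(n_candidates):
        seq = torch.rand(horizon, action_dim) * 2.0 - 1.0   # candidate actions in [-1, 1]
        s, total_reward = state.clone(), 0.0
        for a in seq:
            s = dynamics_net(torch.cat([s, a]).unsqueeze(0)).squeeze(0)  # predicted s_{t+1}
            total_reward += -torch.norm(s - desired_state).item()        # assumed reward: closeness to goal
        if total_reward > best_return:
            best_return, best_first_action = total_reward, seq[0]
    return best_first_action

# Usage: a_t = mpc_select_action(dynamics_net, s_t, horizon=10, action_dim=2, desired_state=s_q);
# the pair (s_t, a_t) is then appended to D_RL as in step S3.5.3.
```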
Further, the step S4 includes: step S4.1, initializing the state transition prediction model P(s_{t+1} | s_t, a_t), the reward prediction model R(r_{t+1} | s_t, a_t), and the parameter θ corresponding to the second action value function of the deep neural network for model-free reinforcement learning training, letting θ = 0 at this time; proceed to step S4.2.
Step S4.2, starting a trial from the initial state s_0, and initializing the parameter corresponding to the first action value function of the deep neural network; proceed to step S4.3.
Step S4.3, presetting an eligibility trace z and letting z = 0; proceed to step S4.4.
Step S4.4, for the initial state s_0 of each training round, executing one model-based reinforcement learning training simulation and updating the first action value function, obtaining the updated first action value function and its initial parameters; proceed to step S4.5. In this embodiment, step S4.4 can be understood as calculating these parameters by the method of step S3 described above, starting from the initial state s_0.
Step S4.5, based on the state s_t at the current time t, combining the first action value function and the second action value function to obtain a joint action value function, and selecting an action a_t using the ε-greedy method; proceed to step S4.6.
Step S4.6, if the error s_err = ||s_t - s_q|| between the current state s_t and the known desired terminal state s_q is greater than the constant value Δ, proceed to step S4.6.1; otherwise, return to step S4.2.
Step S4.6.1, executing the action a_t selected in step S4.5; obtaining the subsequent state s_{t+1} based on the state transition prediction model P(s_{t+1} | s_t, a_t) and receiving a reward r based on the reward prediction model R(r_{t+1} | s_t, a_t); using the subsequent state s_{t+1}, the action a_t, and the reward r to update the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t); proceed to step S4.6.2.
Step S4.6.2, using the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t) obtained in step S4.6.1, performing one model-based reinforcement learning training simulation starting from the subsequent state s_{t+1}, updating the first action value function, and obtaining its corresponding parameters; proceed to step S4.6.3. Step S4.6.2 can be understood as calculating these parameters by the method of step S3.
Step S4.6.3, based on the subsequent state s_{t+1} and the joint action value function, selecting the action a_{t+1} actually to be executed next by means of a greedy method; proceed to step S4.6.4.
Step S4.6.4, based on the model-free reinforcement learning training simulation, obtaining the deviation δ of the corresponding second action value function, and using this deviation to update the second action value function corresponding to the model-free reinforcement learning training: θ ← θ + αδz, where α denotes the learning rate, a constant value between 0 and 1; proceed to step S4.6.5.
Step S4.6.5, updating the eligibility trace z, where λ denotes a discount factor, a constant value between 0 and 1; proceed to step S4.6.6.
Step S4.6.6, transferring the acquired state of the flexible robot to the subsequent state, i.e. s_t = s_{t+1}, a_t = a_{t+1}; proceed to step S4.6.7.
Step S4.6.7, presetting a second number of training rounds M_2 and recording the current second training round number m_2; judging whether the current second training round number m_2 is less than the preset second number of training rounds M_2; if yes, return to step S4.2; if not, proceed to step S4.7.
Step S4.7, ending the training; obtaining the final second action value function of the deep neural network and its corresponding final parameters.
It can be seen that the basic idea of step S4 is that, before selecting the action actually to be executed, the flexible robot first performs a model-based simulation starting from the current state to evaluate the current action value function, and then combines the action value function obtained from the simulation with the action value function obtained from real experience to jointly select the action a_t actually to be executed.
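A compact sketch of the joint model-based/model-free update in step S4 is given below, using a linear second action value function with an eligibility trace; the feature map, the TD(λ)-style update θ ← θ + αδz, and the ε-greedy selection over the summed value functions are assumptions made for illustration, since the corresponding formulas are formula images not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def features(state, action):
    # Hypothetical feature map for the linear second action value function Q2 = theta . phi(s, a).
    return np.concatenate([state, [action, 1.0]])

def q_joint(q1, theta, state, action):
    # Joint action value: model-based Q1 plus model-free linear Q2 (assumed additive combination).
    return q1(state, action) + theta @ features(state, action)

def epsilon_greedy(q1, theta, state, actions, eps=0.1):
    if rng.random() < eps:
        return rng.choice(actions)
    values = [q_joint(q1, theta, state, a) for a in actions]
    return actions[int(np.argmax(values))]

def model_free_update(theta, z, state, action, reward, next_state, next_action,
                      q1, alpha=0.05, gamma=0.95, lam=0.9):
    """One TD(lambda)-style update of the second action value function (assumed form)."""
    delta = (reward + gamma * q_joint(q1, theta, next_state, next_action)
             - q_joint(q1, theta, state, action))     # deviation of the second action value function
    z = gamma * lam * z + features(state, action)      # eligibility trace update
    theta = theta + alpha * delta * z                  # theta <- theta + alpha * delta * z
    return theta, z

# Toy usage: q1 stands in for the model-based first action value function.
q1 = lambda s, a: 0.0
theta = np.zeros(4)                                    # state_dim 2 + action 1 + bias 1
z = np.zeros_like(theta)
s, actions = np.array([0.2, -0.1]), [-1.0, 0.0, 1.0]
a = epsilon_greedy(q1, theta, s, actions)
s_next = s + 0.1 * a                                   # stand-in for the learned transition model
a_next = epsilon_greedy(q1, theta, s_next, actions)
theta, z = model_free_update(theta, z, s, a, -np.linalg.norm(s_next), s_next, a_next, q1)
print(theta)
```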
Therefore, in this embodiment, a Cosserat-based dynamic model of the flexible robot is established first; the dynamic model is then fitted with a deep neural network, and the initial training of the flexible robot end arrival process is carried out with a model-based reinforcement learning method. By combining the model-based reinforcement learning method with a model-free reinforcement learning method, the end arrival process can be trained and the optimisation of the flexible robot's arrival action sequence completed, so that the end of the flexible robot arrives safely, providing technical support for tasks such as on-orbit module replacement and auxiliary deployment of solar panels. In this way, the embodiment addresses the model uncertainty and external disturbances that arise in dynamic modelling because the flexible robot, unlike a traditional articulated robot, is highly flexible, has many degrees of freedom, and is strongly nonlinear.
In still another aspect, based on the same inventive concept, the present invention further provides an electronic device, as shown in fig. 3, the electronic device includes a processor 301 and a memory 303, the memory 303 stores a computer program thereon, and the computer program, when executed by the processor 301, implements the flexible robot end arrival control method as described above.
The electronic device provided by this embodiment can address the problems of model uncertainty and external disturbance in dynamic modelling that arise because the flexible robot differs in structure from a traditional articulated robot composed of rigid joints and links, and has characteristics such as high flexibility, high degree of freedom, and strong nonlinearity.
With continued reference to fig. 3, the electronic device further comprises a communication interface 302 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 are communicated with each other through the communication bus 304. The communication bus 304 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 304 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication interface 302 is used for communication between the electronic device and other devices.
The processor 301 in this embodiment may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor 301 is the control centre of the electronic device and connects the various parts of the whole electronic device through various interfaces and lines.
The memory 303 may be used for storing the computer program, and the processor 301 implements various functions of the electronic device by running or executing the computer program stored in the memory 303 and calling data stored in the memory 303.
The memory 303 may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
In other aspects, based on the same inventive concept, the present invention also provides a readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, can implement the flexible robot end arrival control method as described above.
The readable storage medium provided by this embodiment can address the problems of model uncertainty and external disturbance in dynamic modelling that arise because the flexible robot has characteristics such as high flexibility, high degree of freedom, and strong nonlinearity, and differs in structure from a traditional articulated robot composed of rigid joints and links.
The readable storage medium provided by this embodiment may take any combination of one or more computer-readable media. The readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this context, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
In this embodiment, computer program code for carrying out the operations of the embodiments may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that the apparatuses and methods disclosed in the embodiments herein can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments herein. In this regard, each block in the flowchart or block diagrams may represent a module, a program, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments herein may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In summary, the present invention provides a flexible robot end arrival control method, comprising: step S1, establishing a dynamic model of the flexible robot; step S2, establishing a deep neural network according to the dynamic model, the deep neural network being used to fit the dynamic model; step S3, performing a first training of the flexible robot end arrival process on the deep neural network to obtain initial parameters of the deep neural network; and step S4, performing a second training of the flexible robot end arrival process on the deep neural network to obtain final parameters of the deep neural network. A dynamic model of the flexible robot is thus established first; this model is not very accurate, so a deep neural network is then built from it. The deep neural network plays the same role as the dynamic model, in effect replacing it. The first training yields the initial parameters of the deep neural network, making the network known, but its precision is still insufficient at that point, so the known deep neural network is trained a second time to improve its precision. The method does not require an accurate mathematical model (dynamic model) of the flexible robot to be established; adaptive control is achieved through reward and punishment feedback on the operation process. Inherent control errors caused by an unknown or inaccurate dynamic model, as well as control errors caused by dimensionality reduction and simplification of the dynamic model, can be eliminated or weakened, the end control precision of the flexible robot is improved, and technical support is provided for operation tasks on a failed target such as on-orbit module replacement and auxiliary deployment of solar panels.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (7)

1. A flexible robot end arrival control method, characterized by comprising the following steps:
step S1, establishing a dynamic model of the flexible robot;
step S2, establishing a deep neural network according to the dynamic model, wherein the deep neural network is used to fit the dynamic model;
step S3, performing a first training of the flexible robot end arrival process on the deep neural network to obtain initial parameters of the deep neural network;
and step S4, performing a second training of the flexible robot end arrival process on the deep neural network to obtain final parameters of the deep neural network.
2. The flexible robot end arrival control method according to claim 1, wherein the step S1 includes:
regarding the flexible robot as a continuous model with the arc coordinate as the independent variable, and regarding the spatial pose of the flexible robot as the rotation or translation of a cross section about the centre line;
establishing the dynamic model of the flexible robot based on the Cosserat rod model;
the dynamic model is represented by formula (1) of the description, wherein F is the internal force on the cross section; M is the principal moment on the cross section; f is the distributed force on a single-section rod of the flexible robot; m is the distributed moment on a single-section rod of the flexible robot; J(s,t) is the inertia tensor of the rod per unit length; ρ is the density of the flexible robot rod per unit length; S is the cross-sectional area of the flexible robot rod per unit length; and ω is the angular velocity, expressed in the cross-section principal-axis coordinate system P-xyz, of the point P relative to the inertial coordinate system with respect to the time variable t.
3. The flexible robot end arrival control method according to claim 2, wherein the step S2 includes: acquiring random trajectory data of the flexible robot during the end arrival process in real time by using a calibrated measurement camera external to the robot in the laboratory;
converting the random trajectory data into training data according to the dynamic model to obtain a random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T); wherein s_t represents the state of the flexible robot at the current time t; a_t represents the action of the flexible robot at time t; r_t represents the reward predicted from the environment at time t; and t = 1, 2, ..., T;
taking the state s_t and the action a_t at the current time t as inputs, the state transition prediction model P(s_{t+1} | s_t, a_t), which predicts the state s_{t+1} at the next time, is expressed as follows:
s_{t+1} ~ P(s_{t+1} | s_t, a_t)
taking the state s_t and the action a_t at the current time t as inputs, the reward prediction model R(r_{t+1} | s_t, a_t), which predicts the environment reward r_{t+1} at the next time, is expressed as follows:
r_{t+1} ~ R(r_{t+1} | s_t, a_t)
according to the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t), the random trajectory data set D_rand = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T) is converted into a density estimation model training set and a regression model training set, each containing T-1 groups of training samples;
the density estimation model training set is expressed as follows:
(s_1, a_1) → s_2, (s_2, a_2) → s_3, ..., (s_{T-1}, a_{T-1}) → s_T
the regression model training set is expressed as follows:
(s_1, a_1) → r_2, (s_2, a_2) → r_3, ..., (s_{T-1}, a_{T-1}) → r_T
4. The flexible robot end arrival control method according to claim 3, wherein the step S3 includes:
step S3.1, presetting an optimal control trajectory data set D_RL, which at this time is an empty set; and randomly initializing a first action value function Q_1(s_t, a_t | θ_1) of the deep neural network used for model-based reinforcement learning training, where θ_1 represents the parameters of the deep neural network corresponding to the first action value function; proceeding to step S3.2;
step S3.2, presetting a first training round number M_1, recording the current first training round number m_1, and judging whether the current first training round number m_1 is less than the preset first training round number M_1; if yes, proceeding to step S3.3; if not, proceeding to step S3.6;
step S3.3, judging whether the optimal control trajectory data set D_RL is an empty set; if yes, proceeding to step S3.3.1; if not, proceeding to step S3.3.2;
step S3.3.1, based on the random trajectory data set D_rand, applying a stochastic gradient descent method so that a loss function L(θ_1), defined over the set D_1, reaches its minimum;
in the formula, D_1 represents the set formed by the triples (s_t, a_t, s_{t+1}) in D_rand; s_t and a_t respectively represent the state and the action at the current time t; s_{t+1} represents the state at the subsequent time t+1;
then, the parameters θ_1 of the deep neural network are determined from the set of data in the random trajectory data set D_rand at which the loss function L(θ_1) reaches its minimum, so that the first action value function Q_1(s_t, a_t | θ_1) of the deep neural network is then known; proceeding to step S3.4;
step S3.3.2, based on the optimal control trajectory data set D_RL, applying a stochastic gradient descent method so that the loss function L(θ_1), now defined over the triples (s_t, a_t, s_{t+1}) in D_RL, reaches its minimum;
in the formula, s_t and a_t respectively represent the state and the action at the current time; s_{t+1} represents the state at the subsequent time;
then, the parameters θ_1 of the deep neural network are determined from the set of data in the optimal control trajectory data set D_RL at which the loss function L(θ_1) reaches its minimum, so that the first action value function Q_1(s_t, a_t | θ_1) of the deep neural network is then known; proceeding to step S3.4;
step S3.4, when the number of training times is less than the training sample group number T, judging that the training sample group number T is equal to the total number T of training data contained in the random trajectory data set D_rand; executing step S3.5;
step S3.5, obtaining the state s_t of the flexible robot at the current time t corresponding to the current training time; proceeding to step S3.5.1;
step S3.5.1, using the first action value function Q_1(s_t, a_t | θ_1) of the deep neural network, estimating an optimal action sequence containing T actions, where T is an integer; proceeding to step S3.5.2;
step S3.5.2, executing the first action a_t in the optimal action sequence, and combining the first action a_t with the state s_t of the flexible robot at the current time t corresponding to the current training time to obtain the optimal control trajectory (s_t, a_t); proceeding to step S3.5.3;
step S3.5.3, adding the obtained optimal control trajectory (s_t, a_t) to the optimal control trajectory data set D_RL; proceeding to step S3.5.4;
step S3.5.4, judging whether the number of training times is equal to the training sample group number T; if not, returning to step S3.5; if yes, returning to step S3.2;
step S3.6, finishing training; obtaining the final first action value function Q_1(s_t, a_t | θ_1) of the deep neural network and the initial parameters θ_1 corresponding thereto.
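A minimal, non-authoritative sketch of the first-stage loop in claim 4 follows. It stands in for the deep neural network with a linear value function, uses a squared-error surrogate loss and a one-step action search, and picks arbitrary hyper-parameter values; all of these are assumptions made for illustration, since the claim gives the loss function and the optimal-action-sequence formula only as images.

import numpy as np

rng = np.random.default_rng(0)

def q1(theta1, s, a):
    # Linear stand-in for the first action value function Q_1(s, a | theta_1).
    return theta1 @ np.concatenate([s, a])

def fit_q1(theta1, dataset, lr=1e-2, iters=200):
    # Steps S3.3.1 / S3.3.2: stochastic gradient descent on a squared-error surrogate loss.
    for _ in range(iters):
        s, a, target = dataset[rng.integers(len(dataset))]
        x = np.concatenate([s, a])
        theta1 = theta1 - lr * 2.0 * (theta1 @ x - target) * x
    return theta1

def first_optimal_action(theta1, s, candidates):
    # Stand-in for steps S3.5.1 / S3.5.2: pick the action that maximizes Q_1 for the current state.
    return max(candidates, key=lambda a: q1(theta1, s, a))

# D_rand as (state, action, scalar target) triples standing in for the converted training data.
D_rand = [(rng.normal(size=3), rng.normal(size=1), rng.normal()) for _ in range(20)]
D_RL, theta1 = [], np.zeros(4)
candidates = [np.array([u]) for u in np.linspace(-1.0, 1.0, 11)]

M1 = 3                                   # first training round number (illustrative value)
for m1 in range(M1):                     # step S3.2
    data = D_RL if D_RL else D_rand      # step S3.3: switch to D_RL once it is non-empty
    theta1 = fit_q1(theta1, data)        # steps S3.3.1 / S3.3.2
    for s, _, r in D_rand:               # steps S3.4 / S3.5: one pass over the T sample groups
        a = first_optimal_action(theta1, s, candidates)
        D_RL.append((s, a, r))           # step S3.5.3: grow the optimal control trajectory set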
5. The flexible robot end arrival control method according to claim 4, wherein the step S4 includes:
step S4.1, initializing the state transition prediction model P(s_{t+1} | s_t, a_t), the reward prediction model R(r_{t+1} | s_t, a_t), and a second action value function Q_2(s_t, a_t | θ) of the deep neural network used for model-free reinforcement learning training with its corresponding parameters θ, and then letting the parameters θ = 0; proceeding to step S4.2;
step S4.2, starting a trial from the initial state s_0, and initializing the parameters θ_1 corresponding to the first action value function Q_1(s_t, a_t | θ_1) of the deep neural network by letting θ_1 take its initial value; proceeding to step S4.3;
step S4.3, presetting an eligibility trace z and letting z = 0; proceeding to step S4.4;
step S4.4, for the initial state s_0 of each training round, executing one model-based reinforcement learning training simulation and updating the first action value function, the updated first action value function being Q_1(s_t, a_t | θ_1) with the obtained initial parameters θ_1; proceeding to step S4.5;
step S4.5, based on the state s_t at the current time t, combining the first action value function and the second action value function to obtain a combined action value function Q(s_t, a_t), and selecting an action a_t by using the ε-greedy method; proceeding to step S4.6;
step S4.6, if the error s_err = ||s_t − s_q|| between the current state s_t and the known desired terminal state s_q is greater than the constant value Δ, proceeding to step S4.6.1; otherwise, returning to step S4.2;
step S4.6.1, executing the action a_t selected in step S4.5; obtaining the subsequent state s_{t+1} based on the state transition prediction model P(s_{t+1} | s_t, a_t), and receiving a reward r based on the reward prediction model R(r_{t+1} | s_t, a_t); updating the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t) by using the subsequent state s_{t+1}, the action a_t and the reward r; proceeding to step S4.6.2;
step S4.6.2, using the state transition prediction model P(s_{t+1} | s_t, a_t) and the reward prediction model R(r_{t+1} | s_t, a_t) obtained in step S4.6.1, performing one model-based reinforcement learning training simulation starting from the subsequent state s_{t+1}, updating the first action value function Q_1(s_t, a_t | θ_1), and obtaining its corresponding parameters θ_1; proceeding to step S4.6.3;
step S4.6.3, based on the subsequent state s_{t+1} and the combined action value function Q(s_{t+1}, a_{t+1}), selecting the action a_{t+1} to be actually executed next by using a greedy method; proceeding to step S4.6.4;
step S4.6.4, obtaining the deviation of the second action value function based on the model-free reinforcement learning training simulation, and updating the second action value function corresponding to the model-free reinforcement learning training by using the deviation of the second action value function according to θ ← θ + αz, where α denotes the learning rate, a constant between 0 and 1;
step S4.6.5, updating the eligibility trace z, wherein λ represents a discount factor, a constant between 0 and 1; proceeding to step S4.6.6;
step S4.6.6, transferring the acquired state of the flexible robot to the subsequent state, i.e. s_t = s_{t+1}, a_t = a_{t+1}; proceeding to step S4.6.7;
step S4.6.7, presetting a second training round number M_2, recording the current second training round number m_2, and judging whether the current second training round number m_2 is less than the preset second training round number M_2; if yes, returning to step S4.2; if not, proceeding to step S4.7;
step S4.7, finishing training; obtaining the final second action value function Q_2(s_t, a_t | θ) of the deep neural network and the final parameters θ corresponding thereto.
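The model-free half of claim 5 (steps S4.6.4 and S4.6.5) updates the second action value function with a deviation, an eligibility trace z, a learning rate α and a discount factor λ. The sketch below shows one such update for a linear value function; the TD-error form of the deviation, the accumulating trace rule, the discount γ and the feature vectors are assumptions introduced for illustration, because the claim gives these formulas only as images.

import numpy as np

def model_free_update(theta, z, phi_t, phi_next, reward, alpha=0.1, lam=0.9, gamma=0.95):
    # One SARSA(lambda)-style update of a linear second action value function
    # Q_2(s, a | theta) = theta . phi(s, a). The deviation and trace forms are assumed.
    delta = reward + gamma * (theta @ phi_next) - (theta @ phi_t)  # deviation of Q_2
    z = lam * gamma * z + phi_t                                    # step S4.6.5: trace update (assumed form)
    theta = theta + alpha * delta * z                              # step S4.6.4: theta <- theta + alpha*z, scaled by the deviation
    return theta, z

# Toy usage with 4-dimensional state-action features.
theta, z = np.zeros(4), np.zeros(4)
theta, z = model_free_update(theta, z,
                             phi_t=np.array([1.0, 0.0, 0.5, 0.0]),
                             phi_next=np.array([0.0, 1.0, 0.0, 0.5]),
                             reward=1.0)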
6. An electronic device comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the method of any of claims 1 to 5.
7. A readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 5.
CN202010635603.4A 2020-07-03 2020-07-03 Flexible robot end arrival control method, electronic device and storage medium Active CN111783250B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010635603.4A CN111783250B (en) 2020-07-03 2020-07-03 Flexible robot end arrival control method, electronic device and storage medium


Publications (2)

Publication Number Publication Date
CN111783250A true CN111783250A (en) 2020-10-16
CN111783250B CN111783250B (en) 2024-09-10

Family

ID=72758726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010635603.4A Active CN111783250B (en) 2020-07-03 2020-07-03 Flexible robot end arrival control method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN111783250B (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130131868A1 (en) * 2010-07-08 2013-05-23 Vanderbilt University Continuum robots and control thereof
US20160016319A1 (en) * 2010-07-08 2016-01-21 Vanderbilt University Continuum devices and control methods thereof
US20150100530A1 (en) * 2013-10-08 2015-04-09 Google Inc. Methods and apparatus for reinforcement learning
WO2019241680A1 (en) * 2018-06-15 2019-12-19 Google Llc Deep reinforcement learning for robotic manipulation
CN109726866A (en) * 2018-12-27 2019-05-07 浙江农林大学 Unmanned boat paths planning method based on Q learning neural network
CN110321666A (en) * 2019-08-09 2019-10-11 重庆理工大学 Multi-robots Path Planning Method based on priori knowledge Yu DQN algorithm
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN111190429A (en) * 2020-01-13 2020-05-22 南京航空航天大学 Unmanned aerial vehicle active fault-tolerant control method based on reinforcement learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LI HUI et al.: "A robot path planning method based on deep reinforcement learning in complex environments", Application Research of Computers, vol. 37, no. 1, 30 June 2020 (2020-06-30), pages 129-131 *
WANG FALIN et al.: "Physical characteristic modeling and deformation simulation of flexible cables based on the exact Cosserat model", Journal of Computer-Aided Design & Computer Graphics, vol. 29, no. 07, 15 July 2017 (2017-07-15), pages 1343-1355 *
WANG GUIHONG: "Research on deep reinforcement learning in cooperative multi-agent systems", China Master's Theses Full-text Database, Information Science and Technology, no. 01, 15 January 2020 (2020-01-15), pages 140-323 *
YUAN WENTING: "Octopus-inspired modeling and control of a flexible arm", China Master's Theses Full-text Database, Information Science and Technology, no. 04, 15 April 2017 (2017-04-15), pages 140-149 *
ZHAO HUI: "Research on manipulator trajectory planning based on the Q-learning algorithm", China Master's Theses Full-text Database, Information Science and Technology, no. 12, 15 December 2013 (2013-12-15), pages 140-43 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112372637A (en) * 2020-10-27 2021-02-19 东方红卫星移动通信有限公司 Adaptive impedance compliance control method, module and system for low-orbit satellite space manipulator
CN112372637B (en) * 2020-10-27 2022-05-06 东方红卫星移动通信有限公司 Adaptive impedance compliance control method, module and system for low-orbit satellite space manipulator
CN112540620A (en) * 2020-12-03 2021-03-23 西湖大学 Reinforced learning method and device for foot type robot and electronic equipment
CN112540620B (en) * 2020-12-03 2022-10-14 西湖大学 Reinforced learning method and device for foot type robot and electronic equipment
CN113267993A (en) * 2021-04-22 2021-08-17 上海大学 Network training method and device based on collaborative learning
CN113848711A (en) * 2021-09-18 2021-12-28 内蒙古工业大学 Data center refrigeration control algorithm based on safety model reinforcement learning
CN113848711B (en) * 2021-09-18 2023-07-14 内蒙古工业大学 Data center refrigeration control algorithm based on safety model reinforcement learning
CN115935553A (en) * 2022-12-29 2023-04-07 深圳技术大学 Linear flexible body deformation state analysis method and related device
CN115935553B (en) * 2022-12-29 2024-02-09 深圳技术大学 Linear flexible body deformation state analysis method and related device
CN115946131A (en) * 2023-03-14 2023-04-11 之江实验室 Flexible joint mechanical arm motion control simulation calculation method and device
CN116038773A (en) * 2023-03-29 2023-05-02 之江实验室 Vibration characteristic analysis method and device for flexible joint mechanical arm

Also Published As

Publication number Publication date
CN111783250B (en) 2024-09-10

Similar Documents

Publication Publication Date Title
CN111783250A (en) Flexible robot end arrival control method, electronic device, and storage medium
Song et al. Indirect neuroadaptive control of unknown MIMO systems tracking uncertain target under sensor failures
CN104950678A (en) Neural network inversion control method for flexible manipulator system
JPH10133703A (en) Adaptive robust controller
US11759947B2 (en) Method for controlling a robot device and robot device controller
CN112077839B (en) Motion control method and device for mechanical arm
US20210107144A1 (en) Learning method, learning apparatus, and learning system
CN112571420B (en) Dual-function model prediction control method under unknown parameters
CN112959326B (en) Method and device for solving positive kinematics of robot, readable storage medium and robot
De Stefano et al. Reproducing physical dynamics with hardware-in-the-loop simulators: A passive and explicit discrete integrator
Shen et al. Cascade predictor for a class of mechanical systems under large uncertain measurement delays
Meyes et al. Continuous motion planning for industrial robots based on direct sensory input
Zhang et al. Time delay compensation of a robotic arm based on multiple sensors for indirect teaching
Yang et al. Model-free control of underwater vehicle-manipulator system interacting with unknown environments
CN116150934A (en) Ship maneuvering Gaussian process regression online non-parameter identification modeling method
WO2021186500A1 (en) Learning device, learning method, and recording medium
CN113219842B (en) Mechanical arm optimal tracking control method, system, processing equipment and storage medium based on self-adaptive dynamic programming
CN110703595B (en) Master satellite attitude forecasting method and system of satellite-arm coupling system
CN115170666A (en) Robot navigation method and system based on external memory
Guo et al. Robot path planning via deep reinforcement learning with improved reward function
Yovchev et al. Iterative learning control of hard constrained robotic manipulators
CN110515299B (en) Master satellite attitude decoupling forecasting method and system of satellite-arm coupling system
Dai et al. A robust optimal control by grey wolf optimizer for underwater vehicle-manipulator system
Poddighe Comparing FABRIK and neural networks to traditional methods in solving Inverse Kinematics
Yovchev et al. Genetic Algorithm with Iterative Learning Control for Estimation of the Parameters of Robot Dynamics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant