CN115256401A - Space manipulator shaft hole assembly variable impedance control method based on reinforcement learning


Info

Publication number
CN115256401A
Authority
CN
China
Prior art keywords
impedance
space manipulator
reinforcement learning
training
shaft hole
Prior art date
Legal status
Pending
Application number
CN202211038250.5A
Other languages
Chinese (zh)
Inventor
詹腾达
高鼎峰
余朝宝
周宇航
许铭轩
郭毓
Current Assignee
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Science and Technology
Priority to CN202211038250.5A
Publication of CN115256401A

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1656 Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • B25J9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 Vision controlled systems

Abstract

The invention discloses a reinforcement-learning-based variable impedance control method for space manipulator shaft hole assembly. The scheme performs variable impedance control of the space manipulator shaft hole assembly based on reinforcement learning; the control can track a dynamic force, the dynamic error is smaller than that of traditional fixed impedance control, the response speed is higher, the influence of uncertain factors in the environment is effectively weakened, and the tracking accuracy is better than that of traditional fixed impedance control.

Description

Space manipulator shaft hole assembly variable impedance control method based on reinforcement learning
Technical Field
The invention belongs to the field of space manipulator control, and particularly relates to a space manipulator shaft hole assembly variable impedance control method based on reinforcement learning.
Background
With the progress and development of space technology, the application of spacecraft and space stations has greatly affected human production and life. In the space environment, characterized by vacuum and weightlessness, a large amount of space debris and garbage floats in the space surrounding the Earth and seriously threatens the safety of on-orbit spacecraft and space stations; at the same time, as service time increases, various space facilities inevitably face problems such as equipment aging and faults, so the maintenance of space facilities is very necessary.
In the course of completing service tasks such as on-orbit assembly, the space manipulator inevitably comes into contact with the external environment and generates contact forces, which places high requirements on the contact force control of the space manipulator; meanwhile, in the space environment there are various external disturbances such as gravity gradient torque and friction, so the influence of external interference must be overcome. Compliance control can adjust the motion of the manipulator according to changes in the external environment and can effectively improve the control accuracy and stability of the assembly operation.
To coordinate the contact force between the manipulator and the environment, Hogan first proposed impedance control, which realizes compliant contact between the robot and the environment by establishing an ideal dynamic relation between the contact force at the manipulator end and the deviation between the expected and actual trajectories; however, constant impedance control can hardly maintain a stable contact force when the geometric and stiffness parameters of the environment are uncertain. The environment in which the space manipulator executes its tasks is complex and changeable, the environmental information is difficult to identify accurately, and nonlinear time-varying factors exist in the target environment, so the target task is hard to accomplish with a fixed-parameter impedance control method. If the impedance control parameters can be adjusted dynamically in real time according to changes in the task and the environment, better control performance can be obtained.
Disclosure of Invention
Based on the above problems, the present invention aims to provide a space manipulator shaft hole assembly variable impedance control method based on reinforcement learning, which can update the parameters of an impedance controller in the interaction with a complex environment, ensure the rapidity of static force response and the accuracy of dynamic force tracking, and realize the flexible control of the space manipulator assembly operation.
The technical scheme for realizing the purpose of the invention is as follows:
a space manipulator shaft hole assembly variable impedance control method based on reinforcement learning comprises the following steps:
step 1, constructing a space manipulator model based on a DH parameter method;
step 2, constructing a conversion model of the joint angle state and the terminal pose of the space manipulator based on a forward and inverse kinematics algorithm;
step 3, initializing internal and external parameters of a binocular camera, acquiring images by using the binocular camera, and acquiring position information of the assembly holes;
step 4, constructing an impedance controller based on reinforcement learning, and setting an impedance parameter action table, a reward function and a suspension condition in the training process according to an expected target;
step 5, training an impedance controller based on a neural network;
and 6, inputting real-time information of the tail end of the mechanical arm, updating the impedance parameters of the impedance controller, outputting the position correction quantity of the tail end of the mechanical arm, and finishing the variable impedance control of the shaft hole assembly of the space mechanical arm.
Compared with the prior art, the invention has the remarkable advantages that:
(1) The technical scheme performs variable impedance control of the space manipulator shaft hole assembly based on reinforcement learning; the control can track a dynamic force, the dynamic error is smaller than that of traditional fixed impedance control, the response speed is higher, and the tracking accuracy is better than that of traditional fixed impedance control;
(2) The technical scheme realizes variable impedance control in the shaft hole assembly of the manipulator based on reinforcement learning, can effectively weaken the influence of uncertain factors in the environment, and improves the accuracy and rapidity of the end force control of the space manipulator.
The present invention is described in further detail below with reference to the attached drawing figures.
Drawings
Fig. 1 is a flowchart of the steps of the reinforcement-learning-based variable impedance control method for space manipulator shaft hole assembly according to the invention.
Fig. 2 is a schematic structural diagram of the reinforcement-learning-based impedance controller of the invention.
Fig. 3 is a schematic flow chart of training the impedance controller based on a neural network according to the invention.
Fig. 4 is a schematic diagram of the fully-connected neural network structure of the invention.
Fig. 5 is a schematic view of the space manipulator assembly in an embodiment of the invention.
Fig. 6 is a schematic diagram of the space manipulator simulation in an embodiment of the invention.
Fig. 7 is a trace diagram of the simulated impedance parameters in an embodiment of the invention.
Fig. 8 is a schematic diagram of the simulated end position trajectory of the space manipulator in an embodiment of the invention.
Fig. 9 is a schematic diagram of the simulated end velocity trajectory of the space manipulator in an embodiment of the invention.
Fig. 10 is a schematic diagram of the simulated static force tracking trajectory of the space manipulator in an embodiment of the invention.
Fig. 11 is a schematic diagram of the simulated dynamic force tracking trajectory of the space manipulator in an embodiment of the invention.
Detailed Description
A space manipulator shaft hole assembly variable impedance control method based on reinforcement learning comprises the following steps:
step 1, constructing a space manipulator model based on a DH parameter method;
step 2, constructing a conversion model of the joint angle state and the tail end pose of the space manipulator based on a forward and inverse kinematics algorithm;
step 3, initializing internal and external parameters of a binocular camera, acquiring images by using the binocular camera, and acquiring position information of the assembly holes;
step 4, constructing an impedance controller based on reinforcement learning, and setting an impedance parameter action table, a reward function and a suspension condition in the training process according to an expected target, wherein the impedance parameter action table, the reward function and the suspension condition are specifically as follows:
step 4-1, constructing an impedance controller:
the impedance control strategy aims at realizing an ideal dynamic relation between the tail end position of the space robot and the tail end contact force, the relation between a mechanical arm tail end tooling device and an assembly plane is simplified into a spring-mass block-damping model, and the mathematical model is as follows:
Figure BDA0003819517640000031
wherein, x d Respectively representing the actual motion trail and the expected motion trail of the tail end of the space manipulator, F e Representing the force of the end of the arm against the external environment, M d ,K d ,C d Respectively corresponding to an expected inertia matrix, an expected rigidity matrix and an expected damping matrix of the impedance controller;
Figure BDA0003819517640000032
respectively representing the actual acceleration, the expected acceleration, the actual speed and the expected speed of the tail end of the space manipulator, and selecting K from the impedance controller d ,C d As a control quantity, M d Set to a constant value of 1;
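Purely as an illustration, the impedance relation above can be discretized so that the end position correction is computed from the measured contact force at each control cycle; the explicit-Euler integration and the 0.005 s sampling period in the Python sketch below are assumptions of this sketch, not requirements of the method.

```python
def impedance_correction(F_e, e, de, K_d, C_d, M_d=1.0, dt=0.005):
    """One discrete step of M_d*dde + C_d*de + K_d*e = F_e, with e = x - x_d.

    Returns the updated position error e (the end position correction added to
    the expected trajectory) and velocity error de. The explicit-Euler scheme
    and the sampling period dt are illustrative assumptions.
    """
    dde = (F_e - C_d * de - K_d * e) / M_d  # acceleration error from the impedance model
    de = de + dde * dt                      # integrate to the velocity error
    e = e + de * dt                         # integrate to the position error
    return e, de


# Example: accumulate the correction over time and command x_d + e.
e, de = 0.0, 0.0
e, de = impedance_correction(F_e=2.0, e=e, de=de, K_d=100.0, C_d=40.0)
```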
step 4-2, the control target of the impedance controller in this application is to track the expected force quickly so that the speed of the manipulator end quickly approaches 0, while optimizing the overshoot in the static force tracking process (the overshoot refers to the deviation between the maximum actual force of the system and the expected force, i.e. the deviation between the force peak and the expected force);
for this purpose, corresponding rewards and punishments are given to the state of the manipulator end during training: a corresponding positive reward is given when the end state reaches the expected target, so as to find the optimal control parameters, and the reward function is set as:
(reward function given as an equation image in the original document)
wherein T represents the duration of a single training episode, v represents the end speed of the space manipulator, and E_f is the error between the expected force and the force at the current moment; the reward function is set as above so that the speed value can quickly approach the range 0-0.2.
This function gives a greater reward for a smaller steady-state force error and a greater penalty the more the speed deviates from 0.
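The reward formula itself appears only as an equation image in the original document. Purely to illustrate the behaviour described above (a larger reward for a smaller force error E_f, a larger penalty the further the end speed |v| strays from the 0-0.2 band, with the episode duration T entering the weighting), one possible form is sketched below; the exact expression and the weights are assumptions, not the patent's formula.

```python
def reward(E_f, v, t, T, v_band=0.2):
    """Illustrative reward only; the patent's actual formula is an image.

    E_f: error between the expected force and the force at the current moment.
    v:   end speed of the space manipulator.
    t:   elapsed time within the episode; T: duration of a single training episode.
    """
    force_term = 1.0 / (1.0 + abs(E_f))       # grows as the force error shrinks
    speed_term = -max(abs(v) - v_band, 0.0)   # penalty once |v| leaves the 0-0.2 band
    return (t / T) * force_term + speed_term
```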
step 4-3, considering that if a single change of the impedance parameters is too small, the impedance control of the manipulator end position can hardly achieve a noticeable effect, while if the range of change of the impedance parameters is too large, the stability of the end impedance control is reduced, an impedance parameter action table for reinforcement learning is set:
δC_d ∈ {±2, ±1, 0}, δK_d ∈ {±5, ±4, ±3, ±2, ±1, 0}
wherein δ denotes the set incremental correction, δC_d is the correction to the damping coefficient and δK_d is the correction to the stiffness coefficient; a corresponding action is selected in each sampling period, and the optimal action strategy is obtained after repeated training.
In addition, the training suspension condition is set as: the number of training episodes reaches the set threshold.
Or, when the error between the expected force and the force at the current moment during training is greater than the set threshold, or the error between the maximum force reached by the system during training and the expected force exceeds the set threshold, the strategy of this training episode is judged to be developing in a divergent direction; the parameters are then restored to their initially set values and the training is repeated.
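For concreteness, the discrete action set of step 4-3 and the suspension check described above can be encoded as follows; the 55-action table follows directly from the listed increments, while the numeric thresholds in the check are placeholders, since the patent only states that set thresholds are used.

```python
from itertools import product

# Every pairing of a damping increment dC_d with a stiffness increment dK_d
# from step 4-3 gives one discrete action (5 x 11 = 55 actions in total).
DELTA_C = [-2, -1, 0, 1, 2]
DELTA_K = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]
ACTIONS = list(product(DELTA_C, DELTA_K))


def training_suspended(episode, force_error, peak_force_error,
                       max_episodes=500, error_limit=30.0, peak_limit=30.0):
    """Suspension check sketch; the three threshold values are assumptions."""
    return (episode >= max_episodes
            or abs(force_error) > error_limit
            or abs(peak_force_error) > peak_limit)
```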
Step 5, training the impedance controller based on the neural network, specifically:
the Q-learning (Q-learning) algorithm is essentially a markov decision process, which performs actions in the current state to find the reward value of the next state, and continuously updates the Q table, and the specific formula is as follows:
Q(s t ,a t )←Q(s t ,a t )+α[r t +γmaxQ(s t+1 ,a)-(s t ,a t )]
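A tabular form of this update, written out for reference; the dictionary-based Q table and the values of α and γ are illustrative.

```python
def q_update(Q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update: Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)].

    Q is a dict keyed by (state, action); unseen entries default to 0.
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in range(n_actions))
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```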
The traditional Q-learning method updates the current state according to the Q value of the next state and relies on a Q-value table, so an excessive number of system states wastes a large amount of memory. DDQN instead uses a fully-connected neural network, the "strategy network", to predict the Q value of the current state: the manipulator end state information is input into the "strategy network" to obtain the Q value at the current moment, a "target network" is introduced to predict the state at the next moment, the mean square error of the difference between the prediction results of the two neural networks is used as the loss function of the model, and the "strategy network" is finally updated by back-propagating the network parameters according to the formula given below.
Specifically: the total number of training episodes is set first, and the experience table collected from the space manipulator in a single training episode is placed into an experience pool (a queue with a maximum storage length; once the maximum length is exceeded, the worst-performing experience is popped out). At fixed intervals, the higher-reward experiences in the pool, together with experiences randomly extracted from the pool, are input into the strategy network, and the strategy network is updated through the residual between the predicted values of the strategy network and the target network. An update time is set; once it is exceeded, the target network is replaced by the strategy network, realizing the target network update, and the action with the highest score is output through the feedback of the target network in the environment. This cycle is repeated until the number of completed training episodes exceeds the set value, and training ends.
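The training procedure just described can be summarised by the following skeleton. The callables passed in (environment reset/step, action selection, network update and synchronisation) are hypothetical stand-ins for components defined elsewhere in the method, and the pool size, batch size and synchronisation interval are assumptions of this sketch.

```python
import random
from collections import deque


def train(env_reset, env_step, choose_action, update_strategy_net, sync_target_net,
          total_episodes=500, pool_size=10000, batch_size=32, sync_every=20):
    """Skeleton of the described DDQN-style training loop (a sketch, not the
    patent's exact procedure)."""
    pool = deque(maxlen=pool_size)                 # experience pool (bounded queue)
    for episode in range(total_episodes):
        state, done = env_reset(), False
        while not done:
            action = choose_action(state)          # highest-scoring action from the strategy net
            next_state, reward, done = env_step(action)
            pool.append((state, action, reward, next_state))
            state = next_state
        if len(pool) >= batch_size:
            update_strategy_net(random.sample(list(pool), batch_size))  # residual-based update
        if episode % sync_every == 0:
            sync_target_net()                      # copy the strategy net into the target net
```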
Further, the strategy network predicts the Q value at the current moment in the reinforcement-learning-based impedance controller, the target network predicts the Q value at the next moment, and the mean square error of the difference between the two is taken as the loss function:
L = Mse(Q(s_t, a) - r - γQ(s_{t+1}, a))
wherein Mse represents the mean square error, Q(s_t, a) represents the Q value at time t, γ ∈ (0, 1) represents the decay rate in the learning process, and α ∈ (0, 1) represents the learning rate of the model.
Further, the strategy network adopts a fully-connected neural network structure: the position, speed, acceleration and force-error information of the manipulator end are used as the network input, the number of neurons in the hidden layer is set to 400, the ReLU function is selected as the activation function, and the Q value of each action at the current moment is output.
The target network likewise adopts a fully-connected neural network structure: the position, speed, acceleration and force-error information of the manipulator end are used as the network input, the number of neurons in the hidden layer is set to 400, the ReLU function is selected as the activation function, and the Q value of each action at the next moment is output.
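Since the embodiment mentions a TensorFlow 2.0 implementation, the two 400-unit fully-connected networks and the mean-square-error loss can be sketched as below. The state dimension, the action count of 55, and the double-DQN action-selection detail are assumptions of this sketch rather than statements of the patent.

```python
import tensorflow as tf


def build_q_network(state_dim: int, n_actions: int) -> tf.keras.Model:
    """Fully-connected network as described: end position/speed/acceleration/
    force-error as input, one 400-unit ReLU hidden layer, one Q value per action."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(400, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(n_actions),
    ])


strategy_net = build_q_network(state_dim=4, n_actions=55)  # "strategy network"
target_net = build_q_network(state_dim=4, n_actions=55)    # "target network"
target_net.set_weights(strategy_net.get_weights())

mse = tf.keras.losses.MeanSquaredError()


def ddqn_loss(states, actions, rewards, next_states, gamma=0.9):
    """Loss L = Mse(Q(s_t, a) - r - gamma*Q(s_{t+1}, a)), with the next-state Q
    value taken from the target network (DDQN-style action selection assumed)."""
    q_taken = tf.gather(strategy_net(states), actions, batch_dims=1)
    next_actions = tf.argmax(strategy_net(next_states), axis=1)
    q_next = tf.gather(target_net(next_states), next_actions, batch_dims=1)
    target = tf.stop_gradient(rewards + gamma * q_next)
    return mse(target, q_taken)
```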
The invention improves the updating process of the experience pool: the optimal state information generated during training is marked and stored, and input into the experience pool once every training period; this high-reward state information speeds up convergence of the DDQN model.
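The improved experience-pool update can be pictured as a bounded queue plus a small store of marked high-reward transitions that is mixed back in periodically; the class below is a sketch of that idea, and the capacity, the number of kept entries and the mixing ratio are chosen arbitrarily.

```python
import random
from collections import deque


class ExperiencePool:
    """Bounded experience pool with marked high-reward experiences (a sketch)."""

    def __init__(self, capacity=10000, n_best=50):
        self.pool = deque(maxlen=capacity)  # ordinary experiences; oldest dropped when full
        self.best = []                      # marked optimal state information
        self.n_best = n_best

    def add(self, transition, episode_reward):
        self.pool.append(transition)
        self.best.append((episode_reward, transition))
        self.best.sort(key=lambda item: item[0], reverse=True)
        del self.best[self.n_best:]         # keep only the highest-reward entries

    def sample(self, batch_size, mix_best=True):
        batch = random.sample(list(self.pool), min(batch_size, len(self.pool)))
        if mix_best and self.best:
            batch += [t for _, t in self.best[: max(1, batch_size // 4)]]
        return batch
```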
And 6, inputting real-time information of the tail end of the mechanical arm, updating the impedance parameters of the impedance controller, outputting the position correction quantity of the tail end of the mechanical arm, and finishing the variable impedance control of the shaft hole assembly of the space mechanical arm.
A space manipulator shaft hole assembly variable impedance control system based on reinforcement learning comprises the following modules:
Space manipulator model construction module: used for constructing a space manipulator model based on the DH parameter method;
End pose conversion model construction module: used for constructing a conversion model between the joint angle state and the end pose of the space manipulator based on forward and inverse kinematics algorithms;
Assembly hole position information acquisition module: used for initializing the internal and external parameters of the binocular camera, acquiring images with the binocular camera, and obtaining the position information of the assembly hole;
Impedance controller construction module: used for constructing an impedance controller based on reinforcement learning, and setting the impedance parameter action table, reward function and suspension condition of the training process according to the expected target;
Training module: used for training the impedance controller based on a neural network;
Space manipulator shaft hole assembly variable impedance control module: used for inputting real-time information of the manipulator end, updating the impedance parameters of the impedance controller, outputting the position correction of the manipulator end, and completing the variable impedance control of the space manipulator shaft hole assembly.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
step 1, constructing a space manipulator model based on a DH parameter method;
step 2, constructing a conversion model of the joint angle state and the tail end pose of the space manipulator based on a forward and inverse kinematics algorithm;
step 3, initializing internal and external parameters of a binocular camera, acquiring images by using the binocular camera, and acquiring position information of the assembly holes;
step 4, constructing an impedance controller based on reinforcement learning, and setting an impedance parameter action table, a reward function and a suspension condition in the training process according to an expected target;
step 5, training an impedance controller based on a neural network;
and 6, inputting real-time information of the tail end of the mechanical arm, updating the impedance parameters of the impedance controller, outputting the position correction quantity of the tail end of the mechanical arm, and finishing the variable impedance control of the shaft hole assembly of the space mechanical arm.
A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the following steps:
step 1, constructing a space manipulator model based on a DH parameter method;
step 2, constructing a conversion model of the joint angle state and the terminal pose of the space manipulator based on a forward and inverse kinematics algorithm;
step 3, initializing internal and external parameters of a binocular camera, acquiring images by using the binocular camera, and acquiring position information of the assembly holes;
step 4, constructing an impedance controller based on reinforcement learning, and setting an impedance parameter action table, a reward function and a suspension condition in the training process according to an expected target;
step 5, training an impedance controller based on a neural network;
and 6, inputting real-time information of the tail end of the mechanical arm, updating the impedance parameters of the impedance controller, outputting the position correction quantity of the tail end of the mechanical arm, and finishing the variable impedance control of the shaft hole assembly of the space mechanical arm.
The present invention will be further described with reference to the following examples.
Examples
With reference to fig. 1, a space manipulator shaft hole assembly variable impedance control method based on reinforcement learning includes the following steps:
step 1, constructing a space manipulator model based on a DH parameter method;
step 2, constructing a conversion model of the joint angle state and the terminal pose of the space manipulator based on a forward and inverse kinematics algorithm;
step 3, initializing internal and external parameters of a binocular camera, acquiring images by using the binocular camera, and acquiring position information of the assembly holes;
step 4, constructing an impedance controller based on reinforcement learning, and setting an impedance parameter action table, a reward function and a suspension condition in the training process according to an expected target, wherein the impedance parameter action table, the reward function and the suspension condition are specifically as follows:
step 4-1, constructing an impedance controller:
with reference to fig. 2 and 3, the objective of the impedance control strategy is to achieve an ideal dynamic relation between the end position of the space manipulator and the end contact force; this patent simplifies the relation between the manipulator end tooling device and the assembly plane into a spring-mass-damper model, whose mathematical model is:
M_d(ẍ - ẍ_d) + C_d(ẋ - ẋ_d) + K_d(x - x_d) = F_e
wherein x and x_d respectively represent the actual and expected motion trajectories of the space manipulator end, F_e represents the force between the manipulator end and the external environment, M_d, K_d and C_d are respectively the expected inertia, stiffness and damping matrices of the impedance controller, and ẍ, ẍ_d, ẋ, ẋ_d respectively represent the actual acceleration, expected acceleration, actual speed and expected speed of the manipulator end; the impedance controller selects K_d and C_d as the control quantities, and M_d is set to the constant value 1;
step 4-2, the control target of the impedance controller in this application is to track the expected force quickly so that the speed of the manipulator end quickly approaches 0, while optimizing the overshoot in the static force tracking process (the overshoot refers to the deviation between the maximum actual force of the system and the expected force, i.e. the deviation between the force peak and the expected force);
for this purpose, corresponding rewards and punishments are given to the state of the manipulator end during training: a corresponding positive reward is given when the end state reaches the expected target, so as to find the optimal control parameters, and the reward function is set as:
(reward function given as an equation image in the original document)
wherein T represents the duration of a single training episode and E_f is the error between the expected force and the force at the current moment; the reward function is set as above so that the speed value can quickly approach the range 0-0.2.
This function gives a greater reward for a smaller steady-state force error and a greater penalty the more the speed deviates from 0.
step 4-3, considering that if a single change of the impedance parameters is too small, the impedance control of the manipulator end position can hardly achieve a noticeable effect, while if the range of change of the impedance parameters is too large, the stability of the end impedance control is reduced, an impedance parameter action table for reinforcement learning is set:
δC_d ∈ {±2, ±1, 0}, δK_d ∈ {±5, ±4, ±3, ±2, ±1, 0}
A corresponding action is selected in each sampling period, and the optimal action strategy is obtained after repeated training.
In addition, the training suspension condition is set as: the number of training episodes reaches the set threshold.
Or, when the error between the expected force and the force at the current moment during training is greater than the set threshold, or the error between the maximum force reached by the system during training and the expected force exceeds the set threshold, the strategy of this training episode is judged to be developing in a divergent direction; the parameters are then restored to their initially set values and the training is repeated.
Step 5, training the impedance controller based on the neural network, specifically:
the Q-learning (Q-learning) algorithm is essentially a markov decision process, which performs actions in the current state to find the reward value of the next state, and continuously updates the Q table, and the specific formula is as follows:
Q(s t ,a t )←Q(s t ,a t )+α[r t +γmaxQ(s t+1 ,a)-(s t ,a t )]
The traditional Q-learning method updates the current state according to the Q value of the next state and relies on a Q-value table, so an excessive number of system states wastes a large amount of memory. DDQN instead uses a fully-connected neural network, the "strategy network", to predict the Q value of the current state: the manipulator end state information is input into the "strategy network" to obtain the Q value at the current moment, a "target network" is introduced to predict the state at the next moment, the mean square error of the difference between the prediction results of the two neural networks is used as the loss function of the model, and the "strategy network" is finally updated by back-propagating the network parameters according to the formula given below.
Specifically: the total number of training episodes is set first, and the experience table collected from the space manipulator in a single training episode is placed into an experience pool (a queue with a maximum storage length; once the maximum length is exceeded, the worst-performing experience is popped out). At fixed intervals, the higher-reward experiences in the pool, together with experiences randomly extracted from the pool, are input into the strategy network, and the strategy network is updated through the residual between the predicted values of the strategy network and the target network. An update time is set; once it is exceeded, the target network is replaced by the strategy network, realizing the target network update, and the action with the highest score is output through the feedback of the target network in the environment. This cycle is repeated until the number of completed training episodes exceeds the set value, and training ends.
Further, the strategy network predicts the Q value at the current moment in the reinforcement-learning-based impedance controller, the target network predicts the Q value at the next moment, and the mean square error of the difference between the two is taken as the loss function:
L = Mse(Q(s_t, a) - r - γQ(s_{t+1}, a))
wherein Mse represents the mean square error, Q(s_t, a) represents the Q value at time t, γ ∈ (0, 1) represents the decay rate in the learning process, and α ∈ (0, 1) represents the learning rate of the model.
Further, with reference to fig. 4, the strategy network adopts a fully-connected neural network structure: the position, speed, acceleration and force-error information of the manipulator end are used as the network input, the number of neurons in the hidden layer is set to 400, the ReLU function is selected as the activation function, and the Q value of each action at the current moment is output.
The target network likewise adopts a fully-connected neural network structure: the position, speed, acceleration and force-error information of the manipulator end are used as the network input, the number of neurons in the hidden layer is set to 400, the ReLU function is selected as the activation function, and the Q value of each action at the next moment is output.
The invention improves the updating process of the experience pool: the optimal state information generated during training is marked and stored, and input into the experience pool once every training period; this high-reward state information speeds up convergence of the DDQN model.
And 6, inputting real-time information of the tail end of the mechanical arm, updating the impedance parameters of the impedance controller, outputting the position correction quantity of the tail end of the mechanical arm, and finishing the variable impedance control of the shaft hole assembly of the space mechanical arm.
A schematic diagram of the space manipulator assembly is shown in Fig. 5. In this embodiment, the simulation of the impedance control of the manipulator end is implemented by combining the Robotics Toolbox in MATLAB with Python TensorFlow 2.0, and the UR5 manipulator simulation environment of Fig. 6 is created using the Robotics Toolbox.
When the manipulator is hindered by the environment while moving along the expected trajectory, the environment generally exhibits rigid properties, and the manipulator end then generates an interaction force F_e with the external environment. The force/position relation between the robot and the environment can be regarded as a spring model, as follows:
F_e = K_e(x - x_e)    (6)
wherein K_e represents the environment stiffness and x_e represents the environment position offset. K_e is set to 500 N/m.
In the simulation, the manipulator is set to move downward along the Z axis, and the expected force is set to [F_x, F_y, F_z] = [0, 0, 15 N], i.e. only the force information in the Z-axis direction is considered; the initial state of the manipulator end is set to [x, v, a] = [0, -0.5 m/s, 0].
The simulation time is t ∈ (0, 2) s, the simulation period is T = 0.005 s, and the expected position x_d is 0.2 m. The final simulation result is shown in Fig. 7, which gives the optimal impedance parameters selected after reinforcement learning. The simulation is divided into three stages, as shown in Fig. 7:
1) In the first stage, there is a large error between the manipulator end and the expected position, so a high-stiffness, high-damping strategy is selected to make the impedance controller respond quickly.
2) In the second stage, after the manipulator reaches the target plane, a stiffness-reducing strategy is adopted, which gradually reduces the force error of the system.
3) In the third stage, once the overshoot at the manipulator end has fallen to 0, the position error and speed of the manipulator end are small, so a low-stiffness, low-damping strategy is adopted, making the static force error of the system approach 0.
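To make the embodiment's numbers concrete, the sketch below runs a 1-D (Z-axis) contact simulation with the stated environment stiffness, expected force, simulation period and expected position, but with fixed, illustrative K_d and C_d values, an assumed surface location coinciding with the expected position, and the force error F_d - F_e as the driving term (a common force-tracking form); in the invention the impedance parameters are instead adjusted every sampling period by the trained agent, which is exactly what removes the steady-state force error visible here.

```python
# 1-D sketch of the contact simulation with fixed impedance parameters.
# Assumptions of this sketch (not stated in the embodiment): the surface sits at
# the expected position 0.2 m, the relation is driven by F_d - F_e, and K_d/C_d
# are fixed illustrative values rather than the learned, time-varying ones.
dt, t_end = 0.005, 2.0             # simulation period and duration [s]
K_e = 500.0                        # environment stiffness [N/m]
x_env = 0.2                        # assumed environment surface position [m]
x_d, F_d = 0.2, 15.0               # expected position [m] and expected Z force [N]
M_d, C_d, K_d = 1.0, 40.0, 100.0   # illustrative fixed impedance parameters

x, v = 0.0, 0.5                    # initial end state, moving toward the surface
for _ in range(int(t_end / dt)):
    F_e = K_e * max(x - x_env, 0.0)                    # spring contact model, eq. (6)
    a = (F_d - F_e - C_d * v - K_d * (x - x_d)) / M_d  # force-tracking impedance relation
    v += a * dt
    x += v * dt

F_ss = K_e * max(x - x_env, 0.0)
print(f"steady-state contact force {F_ss:.1f} N vs expected {F_d} N")
# With fixed parameters the force settles at K_e/(K_d + K_e) of F_d, about
# 12.5 N here; shrinking K_d, as the learned strategy does in stages 2-3 above,
# drives this steady-state error toward zero.
```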
Fig. 8 and Fig. 10 show the control effect for the manipulator end position and the static force. Comparing the two shows that the variable impedance method provided by the invention responds faster to the force error at the initial moment and tracks the target force with a small overshoot, and its static error is smaller than that of traditional constant impedance control.
Fig. 9 is a tracking simulation of the manipulator end velocity; in the simulation results the end velocity stays within the set threshold |v| < 0.2, and the velocity of the method of the invention reaches 0 faster than that of the traditional method.
Fig. 11 shows the tracking curve of the manipulator end for a dynamic force; the dynamic error of the proposed variable impedance control in tracking the dynamic force is smaller than that of traditional impedance control, the response speed is faster, and the tracking accuracy is better than that of traditional constant impedance control.
The foregoing embodiments illustrate and describe the general principles and principal features of the invention. It will be understood by those skilled in the art that the invention is not limited to the above embodiments; the above embodiments and description only illustrate the principle of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention.

Claims (10)

1. A space manipulator shaft hole assembly variable impedance control method based on reinforcement learning is characterized by comprising the following steps:
step 1, constructing a space manipulator model based on a DH parameter method;
step 2, constructing a conversion model of the joint angle state and the terminal pose of the space manipulator based on a forward and inverse kinematics algorithm;
step 3, initializing internal and external parameters of a binocular camera, acquiring images by using the binocular camera, and acquiring position information of the assembly holes;
step 4, constructing an impedance controller based on reinforcement learning, and setting an impedance parameter action table, a reward function and a suspension condition in the training process according to an expected target;
step 5, training an impedance controller based on a neural network;
and 6, inputting real-time information of the tail end of the mechanical arm, updating the impedance parameters of the impedance controller, outputting the position correction quantity of the tail end of the mechanical arm, and finishing the variable impedance control of the shaft hole assembly of the space mechanical arm.
2. The reinforcement learning-based variable impedance control method for the shaft hole assembly of the space manipulator of claim 1, wherein the step 4 of constructing the reinforcement learning-based impedance controller comprises the following steps:
step 4-1, constructing an impedance controller:
M_d(ẍ - ẍ_d) + C_d(ẋ - ẋ_d) + K_d(x - x_d) = F_e
wherein x and x_d respectively represent the actual and expected motion trajectories of the space manipulator end, F_e represents the force between the manipulator end and the external environment, M_d, K_d and C_d are respectively the expected inertia, stiffness and damping matrices of the impedance controller, and ẍ, ẍ_d, ẋ, ẋ_d respectively represent the actual acceleration, expected acceleration, actual speed and expected speed of the manipulator end; the impedance controller selects K_d and C_d as the control quantities;
step 4-2, setting a reward function:
(reward function given as an equation image in the original document)
wherein T represents the duration of a single training episode and E_f is the error between the expected force and the force at the current moment;
step 4-3, setting an impedance parameter action table for reinforcement learning:
δC_d ∈ {±2, ±1, 0}, δK_d ∈ {±5, ±4, ±3, ±2, ±1, 0}
wherein δ is the set incremental correction.
3. The reinforcement learning-based space manipulator shaft hole assembly variable impedance control method according to claim 2, wherein the training suspension condition is set as:
the training times reach the set threshold value.
4. The space manipulator shaft hole assembly variable impedance control method based on reinforcement learning of claim 2, wherein the impedance controller trained based on the neural network in the step 5 is specifically:
the method comprises the steps of firstly setting the total times of training, collecting an experience table of the space manipulator in single training, placing the experience table in an experience pool, inputting higher-rewarded experiences in the experience pool into a strategy network at intervals together with experiences randomly extracted from the experience pool, updating the strategy network through a residual error between a predicted value in the strategy network and a target network, setting updating time, replacing the target network with the strategy network once the time is exceeded, updating the target network, finally outputting an action with the highest score through the feedback of the target network in an environment, circulating in sequence until the finally set total times of training is larger than a set value, and finishing training.
5. The reinforcement learning-based space manipulator shaft hole assembly variable impedance control method according to claim 4, wherein the strategy network predicts the Q value of the current time in the reinforcement learning-based impedance controller, predicts the Q value of the next time in the reinforcement learning-based impedance controller based on the target network, and takes the mean square error of the difference between the two times as a loss function:
L = Mse(Q(s_t, a) - r - γQ(s_{t+1}, a))
wherein Mse represents the mean square error, Q(s_t, a) represents the Q value at time t, γ ∈ (0, 1) represents the decay rate in the learning process, and α ∈ (0, 1) represents the learning rate of the model.
6. The reinforcement learning-based variable impedance control method for space manipulator shaft hole assembly according to claim 5, wherein the strategy network adopts a fully-connected neural network structure, the position, speed, acceleration and force-error information of the manipulator end are used as the network input, the number of hidden-layer neurons is set to 400, the ReLU function is selected as the activation function, and the Q value of each action at the current moment is output.
7. The reinforcement learning-based variable impedance control method for space manipulator shaft hole assembly according to claim 5, wherein the target network adopts a fully-connected neural network structure, the position, speed, acceleration and force-error information of the manipulator end are used as the network input, the number of hidden-layer neurons is set to 400, the ReLU function is selected as the activation function, and the Q value of each action at the next moment is output.
8. A reinforcement learning-based variable impedance control system for space manipulator shaft hole assembly, characterized by comprising the following modules:
Space manipulator model construction module: used for constructing a space manipulator model based on the DH parameter method;
End pose conversion model construction module: used for constructing a conversion model between the joint angle state and the end pose of the space manipulator based on forward and inverse kinematics algorithms;
Assembly hole position information acquisition module: used for initializing the internal and external parameters of the binocular camera, acquiring images with the binocular camera, and obtaining the position information of the assembly hole;
Impedance controller construction module: used for constructing an impedance controller based on reinforcement learning, and setting the impedance parameter action table, reward function and suspension condition of the training process according to the expected target;
Training module: used for training the impedance controller based on a neural network;
Space manipulator shaft hole assembly variable impedance control module: used for inputting real-time information of the manipulator end, updating the impedance parameters of the impedance controller, outputting the position correction of the manipulator end, and completing the variable impedance control of the space manipulator shaft hole assembly.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, carries out the steps of the method according to any one of claims 1-7.
CN202211038250.5A 2022-08-29 2022-08-29 Space manipulator shaft hole assembly variable impedance control method based on reinforcement learning Pending CN115256401A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211038250.5A CN115256401A (en) 2022-08-29 2022-08-29 Space manipulator shaft hole assembly variable impedance control method based on reinforcement learning


Publications (1)

Publication Number Publication Date
CN115256401A true CN115256401A (en) 2022-11-01

Family

ID=83755665



Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421387A (en) * 2022-09-22 2022-12-02 中国科学院自动化研究所 Variable impedance control system and control method based on inverse reinforcement learning
CN115421387B (en) * 2022-09-22 2023-04-14 中国科学院自动化研究所 Variable impedance control system and control method based on inverse reinforcement learning
CN116619383A (en) * 2023-06-21 2023-08-22 山东大学 Mechanical arm PID control method and system based on definite learning
CN116619383B (en) * 2023-06-21 2024-02-20 山东大学 Mechanical arm PID control method and system based on definite learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination