CN117226613A - Robot constant force control polishing method based on PPO reinforcement learning - Google Patents

Robot constant force control polishing method based on PPO reinforcement learning

Info

Publication number: CN117226613A
Application number: CN202311444136.7A
Authority: CN
Other languages: Chinese (zh)
Prior art keywords: robot, polishing, reinforcement learning, constant force, force
Inventors: 彭芳瑜, 郑周义, 王宇, 陈晨, 闫蓉, 唐小卫
Current and original assignee: Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology; priority and filing date 2023-10-31
Publication date: 2023-12-15
Legal status: Pending

Landscapes

  • Manipulator (AREA)
  • Finish Polishing, Edge Sharpening, And Grinding By Specific Grinding Devices (AREA)

Abstract

The invention belongs to the technical field of robot grinding and polishing, and discloses a robot constant-force-controlled polishing method based on PPO reinforcement learning. The method comprises the following steps: S1, obtaining the original grinding-and-polishing trajectory for machining a workpiece from its three-dimensional model or point cloud model; S2, selecting an impedance control mode for constant-force polishing control, and constructing an impedance controller containing unknown parameters, together with corresponding constraint conditions, according to this control mode; S3, calculating the environmental stiffness and the position of the robot end in real time, computing a normal control command for the robot from the calculated stiffness and position, and adjusting the normal displacement of the original trajectory in real time according to this command so that the actual polishing force equals the preset desired polishing force; S4, solving the unknown parameters in the impedance controller to fully determine it, and performing constant-force polishing with the robot accordingly. The invention solves the problem of realizing constant-force control of the polishing force during polishing.

Description

Robot constant force control polishing method based on PPO reinforcement learning
Technical Field
The invention belongs to the technical field of robot polishing, and particularly relates to a robot constant force control polishing method based on PPO reinforcement learning.
Background
In order to achieve stable interaction between a robot and its environment, stable force control at the robot end is increasingly required. A position-based impedance controller receives the contact force signal in order to track a constant desired force. The dynamic parameters of most robots are difficult to identify, and for safety reasons industrial robots are not very open: they generally do not expose a low-level control interface, provide only a position control mode, and give no direct access to joint currents. To perform end force control on such a robot, its mechanical impedance characteristics must be controlled by generating a reference trajectory for the existing position controller, i.e., position-based impedance control.
Traditional impedance control has a simple structure, is widely applied in the field of robot force control, and is commonly used to realize compliant control of the robot. However, lacking accurate environmental parameter information, it yields poor contact-force control when in contact with an unknown environment. To reduce the steady-state error of the contact force, the surface position and stiffness of the environment must be known in advance. A method is therefore needed to achieve steady-state control of the contact force during polishing.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the invention provides a robot constant-force-controlled polishing method based on PPO reinforcement learning, which solves the problem of realizing constant-force control of the polishing force during polishing.
To achieve the above object, according to one aspect of the present invention, there is provided a robot constant force control polishing method based on PPO reinforcement learning, the method comprising the steps of:
S1. Obtain the original grinding-and-polishing trajectory for machining the workpiece from its three-dimensional model or point cloud model;
S2. Select an impedance control mode for constant-force polishing control, and construct an impedance controller containing unknown parameters, together with corresponding constraint conditions, according to this control mode;
S3. Calculate the environmental stiffness and the position of the robot end in real time, compute a normal control command for the robot from the calculated stiffness and position, and adjust the normal displacement of the original trajectory in real time according to this command so that the actual polishing force equals the preset desired polishing force;
S4. Solve the unknown parameters in the impedance controller to fully determine it, and perform constant-force polishing with the robot according to the impedance controller.
Further preferably, in step S2, the impedance controller follows the equation

$$m\,\Delta\ddot{x} + b\,\Delta\dot{x} + k\,\Delta x = \Delta f + K_p\,\Delta f + K_i \int_0^t \Delta f\,d\tau$$

where $m$ is the inertia coefficient, $b$ is the damping coefficient, $k$ is the stiffness coefficient, $\Delta x$ is the error between the actual end position $x$ of the robot and the desired position $x_d$, $\Delta\dot{x}$ and $\Delta\ddot{x}$ are the first and second derivatives of $\Delta x$, $\Delta f$ is the error between the desired contact force $f_d$ and the actual contact force $f_e$, $K_p$ is the proportional coefficient, and $K_i$ is the integral coefficient.
Further preferably, in step S2, the constraint conditions are obtained as follows:
S21. Calculate the initial stiffness and damping of the environment during polishing, where the environment refers to the polishing tool and the workpiece taken as a whole;
S22. Build the constraint conditions using the initial stiffness and damping obtained.
Further preferably, in step S21, the initial stiffness and damping are recovered from the identified model parameters, where $\omega_1, \omega_2$ are parameters obtained by recursive augmented least-squares iteration, $\hat{b}_e(0)$ and $\hat{k}_e(0)$ are the identified initial damping and initial stiffness respectively, and $T$ is the sampling period.
Further preferably, in step S22, the constraint conditions are built on the identified parameters, where $m, b, k$ are the coefficients of the impedance equation, $\kappa$ and $\zeta_t$ are intermediate variables, and $\hat{k}_e$ is the environmental stiffness.
Further preferably, in step S3, the normal control command of the robot is

$$x_d = \hat{x}_e + \frac{f_d}{\hat{k}_e}$$

where $x_d$ is the desired position, $f_d$ is the desired force, $\hat{x}_e$ is the estimated environment position, and $\hat{k}_e$ is the estimated environmental stiffness.
Further preferably, the environmental stiffness and position are estimated on-line, where $k_e(0)$ is the initial environmental stiffness, $x_e(0)$ is the initial environment position, $f_e$ is the measured contact force, $\hat{k}_e$ is the estimated environmental stiffness, $\hat{x}_e(t)$ and $\hat{x}_e(t-1)$ are the estimated environment positions at times $t$ and $t-1$, $x$ is the robot end position at time $t-1$, $t$ is the motion time of the robot, $\alpha, \beta, \gamma$ are constants satisfying $\alpha\beta < \gamma^2$, and $\dot{x}_r$ is the derivative of the original trajectory generated by the robot.
Further preferably, in step S4, the solution of the unknown parameters employs a reinforcement learning method.
Further preferably, the reinforcement learning method is performed according to the following steps:
S41. Construct the reinforcement learning reward function and set the action space and state space;
S42. Taking the parameters of the state space as input and the parameters of the action space as output, construct the reinforcement learning policy neural network; give the unknown parameters initial values, stop training once the training reward value has stably converged, and take the currently corresponding unknown parameter values as the required values.
Further preferably, in step S41, the reward function penalizes the force error and the robot tip speed;
the action space is
$a = [K_p, K_i]$
the state space is
$s = [\Delta f, \dot{x}]$
where $\Delta f$ is the deviation of the actual force from the desired force, $\dot{x}$ is the robot tip speed, and $\alpha_1, \alpha_2, \alpha_3, \alpha_4$ are positive numbers (the weights of the reward function).
In general, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
1. The impedance controller constructed by the invention contains the terms $K_p\,\Delta f + K_i \int \Delta f\,dt$; compared with a traditional impedance controller, closed-loop force feedback is added to the impedance control, and PI control guarantees higher contact-force control accuracy;
2. In step S3 of the invention, the normal displacement of the original grinding-and-polishing trajectory is compensated through the estimation of the environment position and stiffness, so that the actual polishing force equals the preset desired polishing force. This overcomes the inherent deficiency of the impedance controller, compensates the actual polishing force, and improves polishing accuracy;
3. The $K_p$ and $K_i$ obtained by solving the unknown parameters in step S4 realize closed-loop control of the polishing force: the contact force monitored in real time is fed back to the control system, which adjusts accordingly, reducing fluctuation of the polishing force during polishing;
4. The invention applies PPO reinforcement learning to robot constant-force polishing control. The environment position and stiffness are estimated on-line using the Lyapunov stability method, the robot reference trajectory is adjusted, and the steady-state error of force tracking is reduced. With the reinforcement learning method and the added force closed loop, no prior model of the control parameters and polishing-force error needs to be established, and the robustness of constant-force tracking is improved;
5. To improve force-tracking performance, the invention adopts reinforcement learning for adjustment, requiring neither expert knowledge nor prior understanding of the complex underlying world; the optimal behavior can be found autonomously through constant, repeated interaction with the environment. The proposed method combines force control with RL so as to learn the contact constant-force polishing task when using a position-controlled robot.
Drawings
FIG. 1 is a flow chart of a robotic constant force control sanding method based on PPO reinforcement learning constructed in accordance with a preferred embodiment of the present invention;
FIG. 2 is a block diagram of a workpiece to be processed according to a preferred embodiment of the invention, wherein (a) is a bevel workpiece and (b) is a curved workpiece;
FIG. 3 shows the initial stiffness and damping of the environment identified by the recursive least-squares method with variable forgetting factor, according to a preferred embodiment of the present invention, where (a) is the estimated stiffness and (b) is the estimated damping;
FIG. 4 is a diagram of a reinforcement learning cost function network structure and a strategy network structure according to a preferred embodiment of the present invention, wherein (a) is a schematic diagram of the reinforcement learning cost function network structure and (b) is a diagram of the strategy network structure;
FIG. 5 is a block diagram of a robot based reinforcement learning polishing constant force control constructed in accordance with a preferred embodiment of the present invention;
FIG. 6 is an image of the reward values over 100 reinforcement learning training episodes according to a preferred embodiment of the invention;
FIG. 7 shows actual grinding-force traces of the preferred embodiment of the present invention, where (a) is the ramp grinding-force trace after robot reinforcement learning training and (b) is the curved-surface grinding-force trace after robot reinforcement learning training.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
As shown in fig. 1, a robot constant force control polishing method based on PPO reinforcement learning specifically includes the following steps:
s1, acquiring the original grinding and polishing track data of the robot through a three-dimensional model or a point cloud model of the workpiece.
And generating a polishing initial track of the robot through a three-dimensional model or point cloud of the workpiece and a track generation method.
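As an illustration only, a minimal sketch of assembling such an initial trajectory is given below, assuming the point-cloud model already provides sampled surface points with outward normals along the machining passes; the function and variable names are illustrative, not from the patent.

```python
import numpy as np

def initial_polish_path(points, normals):
    """Assemble an initial grinding-and-polishing trajectory from a point-cloud
    model: one waypoint per sampled surface point, with the tool axis pressed
    anti-parallel to the outward surface normal."""
    path = []
    for p, n in zip(points, normals):
        n = n / np.linalg.norm(n)      # guard against non-unit normals
        path.append((p, -n))           # tool z-axis points into the surface
    return path

# Example: a flat 0.2 m pass sampled every 5 mm, normal +z everywhere.
pts = np.stack([np.linspace(0, 0.2, 41), np.zeros(41), np.zeros(41)], axis=1)
nrm = np.tile([0.0, 0.0, 1.0], (41, 1))
trajectory = initial_polish_path(pts, nrm)
```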
S2. Estimate the initial stiffness and damping of the environment by the recursive augmented least-squares method, design the robot impedance controller, and select suitable impedance parameters. In this embodiment, the environment refers to the polishing tool and the workpiece taken as a whole.
S21. Use the recursive augmented least-squares (RELS) method for an initial estimate of the equivalent stiffness between the robot end and the workpiece, with a variable forgetting factor $\lambda$, where $\beta$ is the attenuation coefficient, $\varepsilon$ is the difference between the current output and the previous output, and $\lambda_{min}$ is the minimum forgetting factor.
The contact dynamics between the polishing tool and the workpiece are set as

$$f_e = b_e\,\dot{x} + k_e\,(x - x_e) + \xi$$

Discretizing through the z-transform gives $Y(k) = \omega(k)^T \varphi(k)$, where $f_e$ denotes the contact force with the environment, $x$ the robot end position, $x_e$ the position of the environment, $b_e, k_e$ the damping and stiffness of the environment respectively, $\xi$ the sensor noise, $\omega, \varphi, Y$ intermediate variables, $k$ the time index, and $\delta x$ the difference between the robot position at the current and previous instants.
The parameters to be estimated follow the RELS recurrence

$$\hat{\omega}(k+1) = \hat{\omega}(k) + L(k+1)\left[Y(k+1) - \varphi(k+1)^T\hat{\omega}(k)\right]$$

where the gain vector $L(k+1)$ is calculated as

$$L(k+1) = \frac{P(k)\,\varphi(k+1)}{\lambda + \varphi(k+1)^T P(k)\,\varphi(k+1)}, \qquad P(k+1) = \frac{1}{\lambda}\left[I - L(k+1)\,\varphi(k+1)^T\right]P(k)$$

Finally, the initial stiffness and damping of the environment are obtained.
the final experimental results are shown in FIG. 3.
Wherein the method comprises the steps ofRepresenting estimated environmental damping->Representation ofEstimated environmental stiffness.
S22. Adopt the impedance control mode for constant-force polishing control. The control equation of conventional impedance control is

$$m\,\Delta\ddot{x} + b\,\Delta\dot{x} + k\,\Delta x = \Delta f$$

where $\Delta f$ is the difference between the desired force $f_d$ and the actual contact force $f_e$ ($f_e$ is measured in real time by a sensor), $m$ is the inertia coefficient, $b$ the damping coefficient, $k$ the stiffness coefficient, $x$ the actual end position of the robot, and $x_d$ the desired position. Using the estimated initial parameters of the environment, constraint conditions on the impedance parameters are constructed.
The impedance parameters are selected within this range: the $m$ and $k$ parameters are chosen first, and $b$ is then calculated from the constraint.
S23. Construct the impedance controller

$$m\,\Delta\ddot{x} + b\,\Delta\dot{x} + k\,\Delta x = \Delta f + K_p\,\Delta f + K_i \int_0^t \Delta f\,d\tau$$

Closed-loop force feedback is added to the impedance control, and PI control guarantees higher contact-force control accuracy. Here $\Delta x$ denotes the error between the actual end position $x$ and the desired position $x_d$, $\Delta\dot{x}$ and $\Delta\ddot{x}$ the first and second derivatives of $\Delta x$, $\Delta f$ the error between the desired contact force $f_d$ and the actual contact force $f_e$, $K_p$ the proportional coefficient, and $K_i$ the integral coefficient.
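A minimal discrete-time sketch of this controller is given below, integrating the reconstructed equation with explicit Euler steps; the $(1 + K_p)$ grouping on the right-hand side follows the reconstruction above and should be read as an assumption rather than the patent's verbatim form.

```python
def impedance_step(dx, dx_dot, f_int, f_err, m, b, k, Kp, Ki, dt):
    """One Euler step of  m*ddx + b*dx_dot + k*dx = (1+Kp)*f_err + Ki*int(f_err),
    with dx = x - x_d and f_err = f_d - f_e.  Returns the updated trajectory
    offset so the caller can shift the position controller's reference."""
    f_int += f_err * dt                                   # integral of force error
    ddx = ((1.0 + Kp) * f_err + Ki * f_int - b * dx_dot - k * dx) / m
    dx_dot += ddx * dt
    dx += dx_dot * dt
    return dx, dx_dot, f_int
```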
S3, estimating the environmental position and the rigidity parameter by adopting a Lyapunov stability method, adjusting the original grinding and polishing track of the robot, and reducing steady-state errors.
The environmental stiffness and position are estimated by the Lyapunov stability method, where $k_e(0)$ is the initial value of the environmental stiffness, $x_e(0)$ is the initial value of the environment position, $f_e$ denotes the measured contact force, $\hat{k}_e$ the estimated environmental stiffness, $\hat{x}_e(t)$ and $\hat{x}_e(t-1)$ the estimated environment positions at times $t$ and $t-1$, $x$ the robot end position at time $t-1$, and $t$ the motion time of the robot; $\alpha, \beta, \gamma$ are small constants with $\alpha\beta < \gamma^2$, and $\dot{x}_r$ is the derivative of the original trajectory generated by the robot. The estimated environmental stiffness and position are thus obtained in real time. The normal displacement of the robot's original trajectory is then adjusted, i.e., the normal control command of the robot is set to $x_d = \hat{x}_e + f_d/\hat{k}_e$; the polishing position of the tool is adjusted according to this command, $\Delta f$ is reduced, and the steady-state error of robot force tracking can be reduced. A sketch follows.
S4. Calculate $K_p$ and $K_i$.
S41, analyzing influence factors of constant force control, constructing a reinforcement learning reward function, and setting an action space and a state space.
The goal of robot training is to make the error between the actual contact force at the robot end and the desired contact force as small as possible, while keeping the velocity normal to the robot end small.
The reward function penalizes the deviation $\Delta f$ of the actual force from the desired force and the robot tip speed $\dot{x}$, weighted by the positive numbers $\alpha_1, \alpha_2, \alpha_3, \alpha_4$.
The action space of reinforcement learning is set to $a = [K_p, K_i]$, where $K_p$ is the proportional coefficient varying in real time and $K_i$ the integral coefficient varying in real time, i.e., the values output by the reinforcement learning model. The real-time $K_p$, $K_i$ are related to $\Delta f$ and $\dot{\Delta f}$; considering that the actual force measured by a force sensor is often quite noisy, it is undesirable to differentiate the force error directly, whereas the desired force $f_d$ and the environmental parameters can be regarded as constant over a short time, so $\dot{\Delta f} = -k_e\,\dot{x}$ and the end velocity can stand in for the force-error derivative. The state space is therefore set to $s = [\Delta f, \dot{x}]$, as sketched below.
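The reward formula itself is an image in the source; the sketch below therefore uses a penalty on the force error and tip speed with the four positive weights named in the text, an assumption consistent with the stated training goal (weight values are placeholders).

```python
import numpy as np

def make_state(delta_f, x_dot):
    # State space s = [Δf, ẋ]: force error and normal end velocity.
    return np.array([delta_f, x_dot], dtype=np.float32)

def reward(delta_f, x_dot, a1=1.0, a2=0.1, a3=1.0, a4=0.1):
    """Assumed reward shape: penalize force error and tip speed with the
    positive weights alpha_1..alpha_4 named in the text."""
    return -(a1 * abs(delta_f) + a2 * delta_f**2
             + a3 * abs(x_dot) + a4 * x_dot**2)
```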
S42, constructing a reinforcement learning strategy neural network, training based on a PPO reinforcement learning method, and performing robot constant force polishing control by using a trained model.
As shown in fig. 4, the deep neural network design in the reinforcement learning training process comprises the design of the policy network and of the cost-function (value) network. Because the training parameters are not complex, a three-layer neural network structure is adopted, with 128 nodes per layer; the activation function between the hidden layers is Tanh, and the output of the policy network is a sample from a Gaussian distribution. A sketch follows.
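A PyTorch sketch of the two networks follows, reading "three-layer structure with 128 nodes per layer" as three hidden layers; the state/action dimensions and the state-independent log-std parameter are assumptions.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Policy network: three 128-node hidden layers with Tanh activations,
    producing the mean of a Gaussian over the action a = [Kp, Ki]."""
    def __init__(self, state_dim=2, action_dim=2, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu = nn.Linear(hidden, action_dim)           # Gaussian mean
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, s):
        h = self.body(s)
        return self.mu(h), self.log_std.exp()             # mean, std

class ValueNet(nn.Module):
    """Cost-function (critic) network with the same hidden structure."""
    def __init__(self, state_dim=2, hidden=128):
        super().__init__()
        self.v = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.v(s)
```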
The training data are obtained by setting initial values of $K_p$, $K_i$ and having the robot cyclically polish the workpiece, while recording the contact force and robot end position during polishing; these are fed to the model as training data, and the $K_p$, $K_i$ parameters to be trained are then updated repeatedly.
The training data are normalized: the input and output state quantities of the neural network are divided by their corresponding upper limit values, so that the network's input and output ranges are $[-1, 1]$. The control command obtained from the impedance equation is the pose $[p_x, p_y, p_z, o_x, o_y, o_z, o_w]$, and the robot joint angles are controlled through inverse kinematics. To avoid excessive contact force during training on a real robot, the contact force $f_e$ and the joint increment $\delta\theta$ of the inverse solution are monitored at all times; when the contact force or joint angle changes excessively, the robot moves directly to a safe position and training is terminated. A sketch follows.
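A sketch of the normalization and the safety monitor follows; the bounds are illustrative assumptions, except the 30 N force threshold stated later in the text.

```python
import numpy as np

F_LIMIT = 30.0   # maximum grinding-and-polishing force threshold (N), per text
V_MAX = 0.1      # assumed tip-speed upper bound used for normalization (m/s)

def normalize_state(delta_f, x_dot, f_max=F_LIMIT, v_max=V_MAX):
    # Divide each state quantity by its upper limit so inputs lie in [-1, 1].
    return np.clip([delta_f / f_max, x_dot / v_max], -1.0, 1.0)

def unsafe(f_e, dtheta, f_limit=F_LIMIT, dtheta_limit=0.05):
    """True if the contact force or the inverse-kinematics joint increment
    changes excessively; the caller then retreats to a safe pose and
    terminates training (dtheta_limit in rad is an assumption)."""
    return abs(f_e) > f_limit or float(np.max(np.abs(dtheta))) > dtheta_limit
```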
In this embodiment, gravity compensation is further performed on the measured six-dimensional force-sensor data. The polishing tool is mounted at the robot end, with the sensor between the robot end and the tool; the sensor reading contains both the gravity of the polishing tool and the contact force with the workpiece during polishing, and gravity compensation subtracts the tool's gravity to obtain the contact force between tool and workpiece. Specifically: the mass centre $[x, y, z]$ of the end tool, its gravity $[g_x, g_y, g_z]$, and the sensor zero drift $[F_{x0}, F_{y0}, F_{z0}, M_{x0}, M_{y0}, M_{z0}]$ are calculated by least squares. Through coordinate transformation, feed-forward compensation is applied to the measured force data, eliminating the influence of the end gravity and yielding the actual contact force at the robot end, as sketched below. FIG. 2 shows the workpieces used during the experiment.
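A sketch of this feed-forward compensation follows; frame conventions and argument names are assumptions, with the identified quantities (centroid, gravity, zero drift) coming from the least-squares step described above.

```python
import numpy as np

def compensate_gravity(F_meas, R_sensor, centroid, g_tool, zero_drift):
    """Feed-forward gravity compensation for a wrist-mounted 6-axis F/T sensor.
    F_meas:     measured wrench [Fx, Fy, Fz, Mx, My, Mz]
    R_sensor:   3x3 rotation of the sensor frame expressed in the base frame
    centroid:   tool mass centre [x, y, z] in the sensor frame (least squares)
    g_tool:     tool gravity vector in the base frame, e.g. [0, 0, -m*g]
    zero_drift: identified sensor offset [Fx0, Fy0, Fz0, Mx0, My0, Mz0]"""
    F = np.asarray(F_meas, float) - np.asarray(zero_drift, float)
    g_s = R_sensor.T @ np.asarray(g_tool, float)        # gravity in sensor frame
    F[:3] -= g_s                                        # remove tool weight
    F[3:] -= np.cross(np.asarray(centroid, float), g_s) # remove gravity torque
    return F                                            # contact wrench only
```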
The PPO reinforcement learning method is adopted for robot polishing constant-force control. The machining process comprises 750 control periods in total, with a control frequency of 50 Hz and a force-sensor frequency of 125 Hz; the first 200 control periods are the robot's approach phase.
As shown in fig. 5, the overall robot control steps are as follows:
setting an initial track and expected force, and initializing a PPO reinforcement learning algorithm.
Estimate the stiffness and position of the environment, adjust the robot's reference motion trajectory, move the robot along the reference trajectory, and calculate the force error $\Delta f = f_d - f_e$, where $f_d$ denotes the desired force and $f_e$ the measured environmental contact force.
Select an action (the adjustment of the control parameters $K_p$, $K_i$): $a_t \sim N(f_\theta(s_t), \sigma)$, where $N(f_\theta(s_t), \sigma)$ denotes the Gaussian distribution that the action selection follows, with mean $f_\theta(s_t)$ and standard deviation $\sigma$. Substitute the adjusted control parameters $K_p$, $K_i$ into the impedance equation to calculate a joint displacement command $q$, and move the robot according to it;
Calculate the reward value corresponding to the current $K_p$, $K_i$ according to the reward function, acquire the state of the robot end, collect the tuples $(s_t, a_t, r_t, s_{t+1})$, and update the actor and critic neural networks in the PPO reinforcement learning model every $n$ steps, finally evaluating the average reward and performance of the trained model. Here $s_t$ denotes the reinforcement learning state space at the current instant, $a_t$ the action at the current instant, $r_t$ the reward value at the current instant, and $s_{t+1}$ the state space of the reinforcement learning model at the next instant. Training ends when the policy network has converged to a stable state. As shown in fig. 6, the reward values over 100 reinforcement learning training episodes are plotted for this embodiment. After the PPO reinforcement learning training model converges, the trained model can be used directly as the controller for polishing constant-force control. In the actual experiments the robot used the following parameters: grinding-disc rotation speed 2000 rpm, desired force 20 N, feed speed 0.035 m/s, and maximum grinding-and-polishing force threshold 30 N. As shown in fig. 7, the contact-force variation on the bevel workpiece and the curved workpiece after training with reinforcement learning is compared, where CAC (Constant Admittance Control) denotes constant impedance control and R-AC (Reinforcement Learning Applied in Admittance Control) denotes the method described herein. A schematic sketch of this training loop follows.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A robot constant force control polishing method based on PPO reinforcement learning, characterized by comprising the following steps:
S1. Obtain the original grinding-and-polishing trajectory for machining the workpiece from its three-dimensional model or point cloud model;
S2. Perform constant-force polishing control in an impedance control mode, and construct an impedance controller containing unknown parameters, together with corresponding constraint conditions, according to this control;
S3. Calculate the environmental stiffness and the position of the robot end in real time, compute a normal control command for the robot from the calculated stiffness and position, and adjust the normal displacement of the original trajectory in real time according to this command so that the actual polishing force equals the preset desired polishing force;
S4. Solve the unknown parameters in the impedance controller to fully determine it, and perform constant-force polishing with the robot according to the impedance controller.
2. A robot constant force control polishing method based on PPO reinforcement learning as set forth in claim 1, wherein in step S2 the impedance controller follows the equation

$$m\,\Delta\ddot{x} + b\,\Delta\dot{x} + k\,\Delta x = \Delta f + K_p\,\Delta f + K_i \int_0^t \Delta f\,d\tau$$

wherein $m$ is the inertia coefficient, $b$ is the damping coefficient, $k$ is the stiffness coefficient, $\Delta x$ is the error between the actual end position $x$ of the robot and the desired position $x_d$, $\Delta\dot{x}$ and $\Delta\ddot{x}$ are the first and second derivatives of $\Delta x$, $\Delta f$ is the error between the desired contact force $f_d$ and the actual contact force $f_e$, $K_p$ is the proportional coefficient, and $K_i$ is the integral coefficient.
3. A robot constant force control polishing method based on PPO reinforcement learning as set forth in claim 2, wherein in step S2 the constraint conditions are obtained as follows:
S21. Calculate the initial stiffness and damping of the environment during polishing, where the environment refers to the polishing tool and the workpiece taken as a whole;
S22. Build the constraint conditions using the initial stiffness and damping obtained.
4. A robot constant force control polishing method based on PPO reinforcement learning as recited in claim 3, wherein in step S21 the initial stiffness and damping are recovered from the identified model parameters, wherein $\omega_1, \omega_2$ are parameters obtained by recursive augmented least-squares iteration, $\hat{b}_e(0)$ and $\hat{k}_e(0)$ are the identified initial damping and initial stiffness respectively, and $T$ is the sampling period.
5. A robot constant force control polishing method based on PPO reinforcement learning as set forth in claim 3, wherein in step S22 the constraint conditions are built on the identified parameters, wherein $m, b, k$ are the coefficients of the impedance equation, $\kappa$ and $\zeta_t$ are intermediate variables, and $\hat{k}_e$ is the environmental stiffness.
6. A robot constant force control polishing method based on PPO reinforcement learning as set forth in claim 1, wherein in step S3 the normal control command of the robot is

$$x_d = \hat{x}_e + \frac{f_d}{\hat{k}_e}$$

wherein $x_d$ is the desired position, $f_d$ is the desired force, $\hat{x}_e$ is the estimated environment position, and $\hat{k}_e$ is the estimated environmental stiffness.
7. A robot constant force control polishing method based on PPO reinforcement learning as recited in claim 1 or 6, wherein the environmental stiffness and position are estimated on-line, wherein $k_e(0)$ is the initial environmental stiffness, $x_e(0)$ is the initial environment position, $f_e$ is the measured contact force, $\hat{k}_e$ is the estimated environmental stiffness, $\hat{x}_e(t)$ and $\hat{x}_e(t-1)$ are the estimated environment positions at times $t$ and $t-1$, $x$ is the robot end position at time $t-1$, $t$ is the motion time of the robot, $\alpha, \beta, \gamma$ are constants satisfying $\alpha\beta < \gamma^2$, and $\dot{x}_r$ is the derivative of the original trajectory generated by the robot.
8. A robot constant force control polishing method based on PPO reinforcement learning as set forth in claim 1, wherein in step S4 the solution of the unknown parameters adopts a reinforcement learning method.
9. A robot constant force control polishing method based on PPO reinforcement learning as recited in claim 8, wherein the reinforcement learning method is performed according to the following steps:
s41, constructing a reinforcement learning reward function, and setting an action space and a state space;
S42. Taking the parameters of the state space as input and the parameters of the action space as output, construct the reinforcement learning policy neural network; give the unknown parameters initial values, stop training once the training reward value has stably converged, and take the currently corresponding unknown parameter values as the required values, wherein the action space is $a = [K_p, K_i]$ and the state space is $s = [\Delta f, \dot{x}]$, $K_p$ being the proportional coefficient and $K_i$ the integral coefficient.
10. A robot constant force control polishing method based on PPO reinforcement learning as set forth in claim 9, wherein in step S41 the reward function penalizes the force error and the robot tip speed;
the action space is
$a = [K_p, K_i]$
the state space is
$s = [\Delta f, \dot{x}]$
wherein $\Delta f$ is the deviation of the actual force from the desired force, $\dot{x}$ is the robot tip speed, and $\alpha_1, \alpha_2, \alpha_3, \alpha_4$ are positive numbers (the weights of the reward function).
CN202311444136.7A 2023-10-31 2023-10-31 Robot constant force control polishing method based on PPO reinforcement learning Pending CN117226613A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311444136.7A (CN117226613A) | 2023-10-31 | 2023-10-31 | Robot constant force control polishing method based on PPO reinforcement learning

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311444136.7A (CN117226613A) | 2023-10-31 | 2023-10-31 | Robot constant force control polishing method based on PPO reinforcement learning

Publications (1)

Publication Number | Publication Date
CN117226613A | 2023-12-15

Family

ID=89091450

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311444136.7A (CN117226613A, pending) | Robot constant force control polishing method based on PPO reinforcement learning | 2023-10-31 | 2023-10-31

Country Status (1)

Country Link
CN (1) CN117226613A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117817674A (en) * 2024-03-05 2024-04-05 纳博特控制技术(苏州)有限公司 Self-adaptive impedance control method for robot
CN118123642A (en) * 2024-05-07 2024-06-04 华中科技大学 Two-degree-of-freedom rigid-flexible coupling polishing force control device and decoupling control method thereof


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination