CN113510709B - Industrial robot pose precision online compensation method based on deep reinforcement learning


Info

Publication number
CN113510709B
CN113510709B (application CN202110856844.6A)
Authority
CN
China
Prior art keywords: robot, pose, reinforcement learning, deep reinforcement, coordinate system
Prior art date
Legal status
Active
Application number
CN202110856844.6A
Other languages
Chinese (zh)
Other versions
CN113510709A (en)
Inventor
肖文磊
孙子惠
姚开然
吴少宇
张鹏飞
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202110856844.6A priority Critical patent/CN113510709B/en
Publication of CN113510709A publication Critical patent/CN113510709A/en
Application granted granted Critical
Publication of CN113510709B publication Critical patent/CN113510709B/en

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control

Abstract

The invention discloses an industrial robot pose accuracy online compensation method based on deep reinforcement learning, which comprises the following steps: operating the robot in different running states, acquiring the actual pose of the robot, and computing the error between the actual pose and the theoretical pose as a training set; constructing a deep reinforcement learning network model and determining the input and output layers of the learning network; pre-training the deep reinforcement learning network model to obtain the network model parameters; and predicting the pose deviation of the robot online with the trained deep reinforcement learning network model, closing the loop for real-time error compensation, and compensating non-systematic errors online. The method uses two networks with different functions to jointly realize interactive learning between the robot model and the current environment, dynamically adjusts the control parameters, and solves the problem of non-systematic pose error compensation for industrial robots.

Description

Industrial robot pose precision online compensation method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of industrial robot pose precision online compensation, in particular to an industrial robot pose precision online compensation method based on deep reinforcement learning.
Background
As the domestic high-precision manufacturing industry develops toward automation and intelligence, industrial robots, with their high efficiency, high quality, and good environmental adaptability, are applied ever more widely in automated production such as spraying, welding, handling, and assembly, and demand for them grows daily. To achieve technical innovation in high-precision manufacturing and substantially improve processing quality and production efficiency, the high-precision positioning of industrial robots is a problem that must be solved. The operating accuracy of a robot directly influences its operating effect; in particular, when higher requirements are placed on certain performance indexes during operation, higher requirements are likewise placed on improving the robot's operating accuracy.
Current robot precision compensation methods fall mainly into two types: error prediction compensation and error calibration compensation. Error prediction compensation has a high production cost; moreover, long-term robot motion wears the mechanical structure and the resulting errors cannot be avoided, so the method is seldom applied in practice. Error calibration compensation mainly follows the idea of modeling the system error to obtain a mathematical model of the non-systematic error, thereby realizing dynamic error feedback. However, for the serial structure of an industrial robot the dynamic solution is very complex, and once the influence of temperature and of load changes under different postures is introduced, the error model inevitably becomes very large and complex and extremely difficult to solve. Meanwhile, because the influence of non-systematic errors in a typical industrial application environment is far smaller than that of systematic errors, no relatively unified model for online compensation of non-systematic errors has yet been formed in industry. In addition, existing pose accuracy compensation methods cannot compensate the robot target pose online in real time; offline pose compensation can improve the absolute position accuracy and the attitude accuracy of the robot simultaneously, but cannot be performed online. For example, patent CN107351089A discloses an optimized selection method for robot kinematic parameter calibration poses, but the algorithm's convergence time is affected by the number of iterations, the number of parameters to be identified, and the number of pose points, and convergence is not easy. Patent CN108608425A discloses an offline programming method for six-axis industrial robot milling, which needs to construct a complex one-dimensional robot pose optimization model; it is difficult to ensure similarity between the mathematical model and the actual robot cutting process, which lowers the practical upper limit of the compensation effect. Patent CN112450820A discloses a pose optimization method, a mobile robot, and a storage medium, but cannot realize prediction and compensation of robot attitude errors. Patent CN112536797A discloses a comprehensive compensation method for position and attitude errors of an industrial robot that needs no complex motion error model and improves the absolute position accuracy and attitude accuracy of the industrial robot simultaneously, but the interpretability of its error prediction process is weak, and it cannot realize online prediction and compensation of non-systematic errors under different working environments.
Disclosure of Invention
To solve these problems, the invention provides an industrial robot pose accuracy online compensation method based on deep reinforcement learning. It does not depend on a mathematical model of the industrial robot; instead, two networks with different functions jointly realize interactive learning between the robot model and the current environment, dynamically adjusting the control parameters and solving the problem of non-systematic pose error compensation for industrial robots. The invention adopts the following technical scheme:
an industrial robot pose accuracy online compensation method based on deep reinforcement learning comprises the following steps:
step 1, operating the robot in different running states, acquiring an actual pose of the robot, and performing error operation on the actual pose and a theoretical pose to serve as a training set;
step 2, constructing a deep reinforcement learning network model, and determining an input and output layer of the deep reinforcement learning network;
step 3, completing the pre-training of the deep reinforcement learning network model to obtain network model parameters;
and 4, predicting the pose deviation of the robot online by using the trained deep reinforcement learning network model, closing the loop for real-time error compensation, and compensating non-systematic errors online.
Further, in step 1 the actual pose of the robot is measured with a laser tracker, and the measurement coordinate system of the laser tracker and the base coordinate system of the robot are related through the coordinate system conversion matrix ^L_B T, composed in homogeneous form of a rotation block and a translation column:

^L_B T = [R, Q; 0, 1]
wherein R is the rotation matrix:

R = (n_C3, n_C1 × n_C3, n_C1)

in the formula, n_C1 is the normal direction of trajectory circle C_1 and n_C3 is the normal direction of trajectory circle C_3;
Q is the displacement vector, obtained as follows:

trajectory circles C_1 and C_6 intersect at the point P_T, i.e. the target-ball position at the zero position of the robot, and circle C_1 has radius R_1. From the robot's own readings, the coordinate P_0 = [X_0, Y_0, Z_0]^T of the default tool center point in the robot base coordinate system is obtained. Defining the offset vector of point P_T relative to point P_0 as Δ = (ΔX, ΔY, ΔZ), the vector O_6 O_B can be expressed in the base coordinate system in terms of P_0, Δ, and ΔY_0 (the equation is given only as an image in the original), where ΔY_0 = O_6P_0 · n_C3 and the coordinate vector of the center O_6 of trajectory circle C_6 in the laser tracker measurement coordinate system is denoted ^L O_6.

Further, the displacement vector Q′ is obtained:

Q′ = ^L O_6 + R · (O_6 O_B)
To ensure that the error of the displacement vector is as small as possible, ten points P_i are also randomly sampled in the robot space; ^B P_i is the coordinate vector of the target ball in the base coordinate system and ^C P_i is the coordinate vector of the target ball in the robot default tool coordinate system, and a displacement vector Q″ is calculated based on a least-squares fitting method (the equation is given only as an image in the original).

The displacement vector errors ΔE of Q′ and Q″ are respectively calculated by formula (given only as an image in the original), and the displacement vector with the smaller error is selected as the displacement vector Q of the coordinate system conversion matrix ^L_B T:

Q = argmin { ΔE(Q_i), Q_i ∈ {Q′, Q″} }
Furthermore, the deep reinforcement learning network model is an Actor-Critic network model. The Actor neural network computes a strategy from the current environment state S and generates specific joint motion actions as the input for robot motion, interacting with the environment; the Critic neural network evaluates the strategic joint-action output generated by the Actor network in state S, judges whether the current situation is good or bad, measures it by a value, and returns the measured value to the Actor neural network for learning and parameter optimization, so that the cost function converges to the global optimum.
Further, the end execution position TCP pose, stiffness k, temperature change T, load η, time signal t, and time-signal functions sin(t) and ln(t) of the robot are used as the input of the deep reinforcement learning network, wherein the end execution position TCP pose consists of the coordinate position (x, y, z) and the Euler angle orientation (α, β, γ); and the joint angle corrections Δjoint_angle(a1, a2, a3, a4, a5, a6) of the robot are used as the output of the deep reinforcement learning network.
Further, the step 3 specifically includes the following steps:
(1) taking the state characteristics of the industrial robot acquired in step 1 and the corresponding massive pose-error parameter data set as training samples and inputting them into the robot simulation interaction software; at the start of each training episode, the actual position is the actual pose from the robot sample data set and the target position is the theoretical pose from the robot sample data set;
(2) the Actor-Critic network obtains the robot's current TCP pose, stiffness k, temperature change T, and load η state values, together with the time signal and time functions, from the robot simulation interaction environment, calculates the current angle correction of each joint, and sends it back to the robot simulation interaction software;
(3) after receiving the joint angle corrections, the robot simulation interaction software performs a joint-limit calculation for the robot and judges whether each joint is within its limits; if so, it executes the joint motion correction, and if some robot joint is out of its limits, it ends the current episode and reports this to the Actor-Critic network;
(4) obtaining the current robot pose and the target position and calculating the reward value to obtain the reward function R; if the R value is too low, the current episode ends; if the R value is normal, the current episode continues and R is returned to the Actor-Critic network for further learning;
and repeating the steps, and training to obtain the structure parameters of the Actor-Critic network model.
Further, the reward function R is calculated from the theoretical pose and the actual pose of the robot as the negative Mahalanobis distance:

D_M(P, P_0) = sqrt( (P − P_0)^T Σ^(−1) (P − P_0) )

R = η · D_M(P, P_0)

wherein P is the current pose, P_0 is the target pose, Σ is the covariance matrix of P and P_0, and η < 0.
compared with the prior art, the invention has the following beneficial effects:
(1) The method does not depend on a mathematical model of the industrial robot; instead, it uses a reinforcement learning algorithm to find the optimal control strategy through continuous exploration and trial-and-error learning, realizes online compensation of non-systematic errors such as temperature change and stiffness, and solves the problem of non-systematic errors caused by factors such as temperature and dynamic load change during mechanical-arm motion.
(2) The invention uses two networks with different functions, an Actor neural network and a Critic neural network, to jointly realize interactive learning between the robot model and the current environment. The Actor neural network computes a robot motion strategy from the current environment state S (comprising TCP pose P, stiffness k, temperature change T, and load η), generates specific joint motion actions as the output of the robot motion, and interacts with the environment. The Critic neural network evaluates the strategic joint-action output generated by the Actor network in state S, judges whether the current situation is good or bad, measures it by a value, and returns the measured value to the Actor neural network for learning and parameter optimization, so that the cost function converges to the global optimum.
Drawings
FIG. 1 is a flow chart of an industrial robot pose accuracy online compensation method based on deep reinforcement learning;
FIG. 2 is a schematic diagram of an experimental platform for acquiring terminal pose position information and online pose accuracy compensation of an industrial robot;
FIG. 3 is a schematic diagram of a robot body and a coordinate system;
FIG. 4 is a diagram of the logic structure of an Actor-Critic network;
FIG. 5 is a flowchart of an algorithm for performing deep reinforcement learning network training in interaction with a robot in a robot simulation scenario.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and examples, but the embodiments of the present invention are not limited thereto.
In a field working environment, robot positioning is affected by external factors such as complex and variable loads, dynamics, and temperature changes; the action form of the systematic error changes and non-systematic errors are introduced. The invention therefore provides an industrial robot pose accuracy online compensation method based on deep reinforcement learning which, as shown in fig. 1, comprises the following steps:
step 1: the robot is operated under different running states (load and temperature), the actual pose is measured and error operation is carried out on the actual pose and the theoretical pose, and all data are collected to be used as a training set. The method comprises the following specific steps:
the invention discloses an experimental platform for realizing acquisition of pose position information and precision compensation of the tail end of a mechanical arm, which comprises an industrial robot and a control cabinet thereof, a pose position measuring system device (a laser tracker and a pose measuring target) and a movable workstation, wherein the industrial robot is of a six-degree-of-freedom open-chain structure, the tail end of the robot is provided with a tail end executor, and the absolute positioning precision is 2-3mm, as shown in figure 2. The position of the robot is monitored in real time through a laser tracker, and the position is transmitted to a TwinCAT master station in real time based on an EtherCAT bus, so that a full closed loop is realized; the robot end effector six-degree-of-freedom pose information from the laser tracker and the motion control information from the industrial robot are acquired in real time, and the robot-laser tracker system state machine can be analyzed and controlled in real time.
For subsequent error calculation, the coordinate systems need to be unified: the laser tracker measurement coordinate system and the industrial robot base coordinate system are related to each other, pose data in the industrial robot coordinate system are converted into the laser tracker coordinate system, and the coordinate origin of the base coordinate system is calculated by a method combining axis measurement with multi-point fitting, giving the conversion matrix. The displacement vector Q is calculated by the multi-point fitting method, which ensures its calculation accuracy, and the rotation matrix R is calculated by axis vector measurement. The transformation matrix ^L_B T from the robot base coordinate system B to the laser tracker measurement coordinate system L is then, in homogeneous form,

^L_B T = [R, Q; 0, 1]

Specifically, as shown in fig. 3, the robot is moved to the HOME position, the target ball of the laser tracker is placed on the target holder of the end effector, and the A1, A3, and A6 axes of the robot are rotated independently to obtain the trajectory circles C_1, C_3, and C_6 with centers O_1, O_3, and O_6. The normal directions n_C1 and n_C3 of circles C_1 and C_3 give the Z and Y directions of the base coordinate system respectively, and the rotation matrix R is obtained as:

R = (n_C3, n_C1 × n_C3, n_C1)
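As an illustrative sketch only (not code from the patent), the construction of R from the two measured circle normals can be written in Python with numpy; the re-orthogonalization step is an added assumption, since normals fitted from noisy tracker measurements are never exactly perpendicular:

```python
import numpy as np

def rotation_from_circle_normals(n_c1, n_c3):
    """R = (n_C3, n_C1 x n_C3, n_C1): X from the A3 circle normal, Z from the
    A1 circle normal, Y completing the right-handed frame. The measured
    normals are re-orthogonalized so that R is a proper rotation matrix."""
    z = np.asarray(n_c1, dtype=float)
    z /= np.linalg.norm(z)
    x = np.asarray(n_c3, dtype=float)
    x -= np.dot(x, z) * z          # project out any tilt of n_C3 toward n_C1
    x /= np.linalg.norm(x)
    y = np.cross(z, x)             # equals n_C1 x n_C3 for orthonormal inputs
    return np.column_stack((x, y, z))
```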
Trajectory circles C_1 and C_6 intersect at the point P_T, i.e. the target-ball position at the zero position of the robot, and circle C_1 has radius R_1. From the robot's own readings, the coordinate P_0 = [X_0, Y_0, Z_0]^T of the default tool center point (defined at the center of the sixth-axis flange plate of the robot) in the robot base coordinate system is obtained. Defining the offset vector of point P_T relative to point P_0 as Δ = (ΔX, ΔY, ΔZ), the vector O_6 O_B can be expressed in the base coordinate system in terms of P_0, Δ, and ΔY_0 (the equation is given only as an image in the original), where ΔY_0 = O_6P_0 · n_C3 and the coordinate vector of the center O_6 of trajectory circle C_6 in the laser tracker measurement coordinate system is denoted ^L O_6.

Further, the displacement vector Q′ can be obtained:

Q′ = ^L O_6 + R · (O_6 O_B)
To ensure that the error of the displacement vector is as small as possible, ten points P_i are also randomly sampled in the robot space; ^B P_i is the coordinate vector of the target ball in the base coordinate system and ^C P_i is the coordinate vector of the target ball in the robot default tool coordinate system, and a displacement vector Q″ is calculated based on a least-squares fitting method (the equation is given only as an image in the original).

The displacement vector errors ΔE of Q′ and Q″ are respectively calculated by formula (given only as an image in the original), and the displacement vector with the smaller error is selected as the displacement vector Q of the coordinate system conversion matrix ^L_B T:

Q = argmin { ΔE(Q_i), Q_i ∈ {Q′, Q″} }
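A numpy sketch of the multi-point branch and the final selection (illustrative; the patent gives the Q″ and ΔE formulas only as images, so the mean-residual error measure and the frame convention ^L p = R·^B p + Q used below are assumptions):

```python
import numpy as np

def displacement_ls(R, P_base, P_meas):
    """Least-squares displacement Q'' from N paired target-ball positions,
    assuming the frames are related by P_meas_i = R @ P_base_i + Q.
    P_base, P_meas: (N, 3) arrays of the ten sampled points."""
    return (P_meas - P_base @ R.T).mean(axis=0)

def displacement_error(Q, R, P_base, P_meas):
    """Mean residual norm, standing in for the patent's error measure dE(Q)."""
    return np.linalg.norm(P_meas - (P_base @ R.T + Q), axis=1).mean()

def select_displacement(R, P_base, P_meas, Q_axis):
    """Q = argmin dE(Q_i) over the axis-measurement Q' and least-squares Q''."""
    candidates = [Q_axis, displacement_ls(R, P_base, P_meas)]
    errors = [displacement_error(q, R, P_base, P_meas) for q in candidates]
    return candidates[int(np.argmin(errors))]
```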
for non-systematic errors, they are generated during the robot use and will vary with factors such as working temperature, running time and motion attitude. The method comprises the steps of operating the industrial robot to move under different running states (rigidity, load and temperature), measuring the actual position of the industrial robot by using the laser tracker, further converting actual data measured by the laser tracker through coordinate system conversion matrix operation, converting the actual data to a robot coordinate system from the laser tracker coordinate system, and performing error operation on the actual pose and the theoretical pose of the robot to obtain a robot pose error.
The data samples are stored in the format <pose error, robot running state (stiffness, load, temperature)>, and a large-sample robot motion error data set is constructed through experimental acquisition.
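A sketch of one such record (the field names are illustrative, not from the patent):

```python
from dataclasses import dataclass

@dataclass
class ErrorSample:
    """One <pose error, robot running state> record of the training set."""
    pose_error: tuple      # (dx, dy, dz, d_alpha, d_beta, d_gamma): measured
                           # pose mapped into the base frame minus theoretical
    stiffness: float       # k
    load: float            # eta
    temperature: float     # temperature change T
    time: float            # running time t, also used for sin(t) and ln(t)
```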
Step 2: constructing the deep reinforcement learning network model and determining the input and output layers of the learning network.
Fig. 4 is the logical structure diagram of the Actor-Critic network, which provides the deep reinforcement learning network design framework: two networks with different functions, an Actor neural network and a Critic neural network, jointly realize interactive learning between the robot model and the current environment. The Actor neural network is essentially a DPG (deterministic policy gradient) network; it generates a strategy from the current environment state S (comprising TCP pose P, stiffness k, temperature change T, and load η) and produces specific joint motion actions as the robot motion input, interacting with the environment. The Critic neural network evaluates the strategic joint-action output generated by the Actor network in state S, judges whether the current situation is good or bad, measures it by a value, and returns the measured value to the Actor neural network for learning and parameter optimization, so that the cost function converges to the global optimum.
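As a sketch of what the two networks could look like in PyTorch (the layer sizes and activations are assumptions; the patent does not specify the architecture):

```python
import torch.nn as nn

STATE_DIM = 12   # (x, y, z, alpha, beta, gamma, k, T, eta, t, sin t, ln t)
ACTION_DIM = 6   # joint-angle corrections a1..a6

class Actor(nn.Module):
    """Maps the environment state S to joint-angle corrections (the policy)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, ACTION_DIM), nn.Tanh())   # bounded corrections

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Estimates the state value V(S) used to score the Actor's actions."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s):
        return self.net(s)
```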
The end execution position TCP pose of the robot, together with the state values of stiffness k, temperature change T, and load η, forms the input layer of the network; the TCP pose consists of the coordinate position (x, y, z) and the Euler angle orientation (α, β, γ). However, the motion deviation of the robot is generally extremely small: if the error value relative to the theoretical position were taken as the output layer, the output and input of the network would be extremely similar, which raises the learning difficulty and prevents a correct learning result. Therefore, to keep the input and output of the network as distinct as possible and establish a nonlinear relationship between them, the joint angle corrections of the robot, denoted Δjoint_angle(a1, a2, a3, a4, a5, a6), are used as the output of the network; the TCP pose of the robot can then be obtained from the joint angles by forward kinematics.
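For illustration, a minimal forward-kinematics sketch using standard Denavit-Hartenberg transforms; the DH parameters are robot-specific and not given in the patent:

```python
import numpy as np

def dh_transform(a, alpha, d, theta):
    """Standard Denavit-Hartenberg homogeneous transform for one link."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    ct, st = np.cos(theta), np.sin(theta)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def forward_kinematics(joint_angles, dh_params):
    """TCP pose as a 4x4 homogeneous matrix from the six joint angles.
    dh_params: one (a, alpha, d) tuple per link, robot-specific."""
    T = np.eye(4)
    for (a, alpha, d), theta in zip(dh_params, joint_angles):
        T = T @ dh_transform(a, alpha, d, theta)
    return T
```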
The non-systematic errors of the robot, such as stiffness, temperature change, and load, change only slightly over a short period and are functions of time. If these influence-factor data were used directly as network inputs, then, because they hardly ever change, the connected neuron parameters would be judged of low learning value during the many gradient-based network parameter updates, their values would be pushed small and quickly frozen, and the non-systematic error factors would effectively be ignored. Therefore the time signal t and the time-signal functions sin(t) and ln(t) are used as network inputs; because the influence factors are periodic in, or logarithmically related to, the time-varying signal, this reduces the number of neurons the reinforcement learning network needs and lets it learn the feature information faster. The final network inputs and outputs are shown in Table 1.
Table 1 Robot deep reinforcement learning network inputs and outputs

Inputs: TCP pose (x, y, z, α, β, γ); stiffness k; temperature change T; load η; time signal t; sin(t); ln(t)
Output: joint angle corrections Δjoint_angle(a1, a2, a3, a4, a5, a6)
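A sketch of assembling this input vector (the small offset guarding ln(t) at t = 0 is an added assumption):

```python
import numpy as np

def build_state(tcp_pose, k, T, eta, t):
    """Network input of Table 1: TCP pose (x, y, z, alpha, beta, gamma),
    stiffness k, temperature change T, load eta, and the time features
    t, sin(t), ln(t)."""
    return np.concatenate([np.asarray(tcp_pose, dtype=float),
                           [k, T, eta, t, np.sin(t), np.log(t + 1e-6)]])
```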
And step 3: completing the pre-training of the reinforcement learning network model, and training to obtain the network model parameters.
A virtual training scene for the reinforcement learning network model is built in the robot simulation interaction software and communicates with Python through the UDP protocol, so that the deep reinforcement learning training network and the robot simulation interaction scene are trained interactively. As shown in fig. 5, the process of training the deep reinforcement learning network is as follows:
(1) The state characteristic dimensions S of the industrial robot acquired in step 1 (comprising TCP pose P, stiffness k, temperature change T, and load η) and the corresponding massive pose-error parameter data set are input into the robot simulation interaction software as training samples; at the start of each training episode, the actual position is the actual pose from the robot sample data set and the target position is the theoretical pose from the robot sample data set.
(2) The Actor-Critic network obtains the robot's current TCP pose, stiffness k, temperature change T, and load η state values, together with the time signal and time functions, from the robot simulation interaction environment, and initializes the system state S. Using S as input, the Actor network computes and outputs the current motion angle correction of each joint, A = {Δjoint_angle(a1, a2, a3, a4, a5, a6)}, and this value is sent back to the robot simulation interaction software.
(3) After receiving the joint angle corrections, the robot simulation interaction software performs a limit calculation for each joint of the robot and judges whether the corrections are within the limit ranges; if so, it executes the motion correction of each joint to obtain a new state S′. If some robot joint is out of its limits, the current episode ends and this message is passed to the reinforcement learning network.
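A simulation-side sketch of this exchange (illustrative: the patent only states that the simulation software and Python communicate over UDP, so the port, the JSON message layout, and the joint limits below are all assumptions):

```python
import json
import socket

# Placeholder joint limits in degrees; real values are robot-specific.
JOINT_LIMITS = [(-185, 185), (-140, 140), (-120, 168),
                (-350, 350), (-125, 125), (-350, 350)]

def within_limits(angles):
    """True only if every corrected joint angle stays inside its limit range."""
    return all(lo <= a <= hi for a, (lo, hi) in zip(angles, JOINT_LIMITS))

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 9000))       # assumed port for the Python trainer
current_joints = [0.0] * 6           # each episode starts from the zero pose

while True:
    data, addr = sock.recvfrom(4096)
    delta = json.loads(data)["delta_joint_angle"]       # a1..a6 corrections
    corrected = [a + d for a, d in zip(current_joints, delta)]
    if within_limits(corrected):
        current_joints = corrected                      # execute the motion
        sock.sendto(json.dumps({"joints": corrected}).encode(), addr)
    else:
        # Out of limits: end the current episode and notify the network
        sock.sendto(json.dumps({"done": True}).encode(), addr)
```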
(4) The Critic network takes S and S′ as inputs and outputs the values V(S) and V(S′); the TD error δ is then calculated, with step size α, decay factor γ, and exploration rate ε:

δ = R + γV(S′) − V(S)

The Critic network parameter ω is updated by a gradient step on the mean square error loss Σ(R + γV(S′) − V(S, ω))², and the Actor network strategy parameter θ is updated as

θ ← θ + α ∇_θ log π_θ(S, A) · δ

For the Actor's score function ∇_θ log π_θ(S, A), a softmax or Gaussian score function may be selected.
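A compact PyTorch sketch of one such update, reusing the Actor and Critic modules sketched above; the learning rates, γ, and the Gaussian policy standard deviation are assumed values:

```python
import torch

GAMMA, SIGMA = 0.9, 0.05              # assumed decay factor and policy std

actor, critic = Actor(), Critic()
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

def td_update(S, A, R, S_next, done):
    """One TD(0) Actor-Critic step: delta = R + gamma*V(S') - V(S)."""
    v_s = critic(S).squeeze()
    with torch.no_grad():
        v_next = 0.0 if done else critic(S_next).squeeze()
        target = R + GAMMA * v_next
    delta = (target - v_s).detach()

    # Critic: gradient step on the squared TD error (R + gamma*V(S') - V(S, w))^2
    critic_loss = (target - v_s) ** 2
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # Actor: theta <- theta + alpha * grad log pi_theta(S, A) * delta,
    # here with a Gaussian score function around the network's mean action
    log_prob = torch.distributions.Normal(actor(S), SIGMA).log_prob(A).sum()
    actor_loss = -log_prob * delta
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```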
(5) The current robot pose and the target position are obtained and the reward function R is calculated as the negative Mahalanobis distance:

D_M(P, P_0) = sqrt( (P − P_0)^T Σ^(−1) (P − P_0) )

R = η · D_M(P, P_0)

where P is the current pose, P_0 is the target pose, Σ is the covariance matrix of P and P_0, and η < 0.
if the R value is too low, the current game is also ended, because the low R value indicates that the network output correction value is abnormal and is useless, and the game is ended to prevent the network from memorizing wrong operation data and carrying out learning. If the value of R is normal, continuing to check the game at present and returning R to the reinforcement learning network for continuous learning.
These steps are repeated to train and obtain the structural parameters of the Actor-Critic network model.
Step 4: the current pose deviation for the current robot state is calculated online through the trained Actor-Critic network model to obtain a real-time pose error compensation value, closing the loop for real-time error compensation and compensating non-systematic errors online, thereby realizing online compensation of the robot pose positioning accuracy.
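A sketch of the online loop (read_state, send_correction, and stop are hypothetical interface callbacks standing in for the EtherCAT/TwinCAT plumbing described above):

```python
import torch

def compensate_online(actor, read_state, send_correction, stop):
    """Closed-loop use of the trained Actor: each cycle, read the current
    state vector, predict the joint-angle corrections, and feed them back
    to the robot controller."""
    while not stop():
        s = torch.as_tensor(read_state(), dtype=torch.float32)
        with torch.no_grad():
            delta = actor(s).numpy()      # predicted joint-angle corrections
        send_correction(delta)            # superimposed on the commanded joints
```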
The pose positioning accuracy online compensation scheme provided by the invention targets the non-systematic errors in the online input trajectory. Through the online error reinforcement learning method it realizes online compensation of non-systematic errors such as stiffness, temperature variation, and load, can improve the absolute positioning accuracy of the industrial robot, and realizes real-time compensation and control of the robot motion pose. The compensation method needs no robot kinematic model, is fast to compute, and is universal, providing a guarantee for subsequent real-time online calibration of the robot and for improving the accuracy and speed of online calibration.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made without departing from the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. An industrial robot pose accuracy online compensation method based on deep reinforcement learning is characterized by comprising the following steps:
step 1, operating a robot in different running states, acquiring an actual pose of the robot, and performing error operation on the actual pose and a theoretical pose to serve as a training set;
step 2, constructing a deep reinforcement learning network model, and determining an input and output layer of the deep reinforcement learning network;
step 3, completing the pre-training of the deep reinforcement learning network model to obtain network model parameters;
step 4, predicting the pose deviation of the robot online by using the trained deep reinforcement learning network model, closing the loop for real-time error compensation, and compensating non-systematic errors online;
the robot is a six-degree-of-freedom open-chain structure; the actual pose of the robot is measured by using a laser tracker, and the measurement coordinate system of the laser tracker and the base coordinate system of the robot are related through the coordinate system conversion matrix ^L_B T, composed in homogeneous form of a rotation block and a translation column:

^L_B T = [R, Q; 0, 1]
wherein R is the rotation matrix:

R = (n_C3, n_C1 × n_C3, n_C1)

in the formula, the trajectory circles C_1, C_3, and C_6, with centers O_1, O_3, and O_6, are obtained by fitting from independently rotating the A1, A3, and A6 axes of the robot; n_C1 is the normal direction of trajectory circle C_1 and n_C3 is the normal direction of trajectory circle C_3;
Q is the displacement vector, obtained as follows:

trajectory circles C_1 and C_6 intersect at the point P_T, i.e. the target-ball position at the zero position of the robot, and circle C_1 has radius R_1; from the robot's own readings, the coordinate P_0 = [X_0, Y_0, Z_0]^T of the default tool center point in the robot base coordinate system is obtained; defining the offset vector of point P_T relative to point P_0 as Δ = (ΔX, ΔY, ΔZ), the vector O_6 O_B can be expressed in the base coordinate system in terms of P_0, Δ, and ΔY_0 (the equation is given only as an image in the original), where O_B is the origin of the robot base coordinate system, ΔY_0 = O_6P_0 · n_C3, and the coordinate vector of the center O_6 of trajectory circle C_6 in the laser tracker measurement coordinate system is denoted ^L O_6;

further, the displacement vector Q′:

Q′ = ^L O_6 + R · (O_6 O_B)
in order to ensure that the error of the displacement vector is as small as possible, ten points P_i are also randomly sampled in the robot space; ^B P_i is the coordinate vector of the target ball in the base coordinate system and ^C P_i is the coordinate vector of the target ball in the robot default tool coordinate system, and a displacement vector Q″ is calculated based on a least-squares fitting method (the equation is given only as an image in the original);

the displacement vector errors ΔE of Q′ and Q″ are respectively calculated by formula (given only as an image in the original), and the displacement vector with the smaller error is selected as the displacement vector Q of the coordinate system conversion matrix ^L_B T:

Q = argmin { ΔE(Q_i), Q_i ∈ {Q′, Q″} }
2. The method according to claim 1, wherein the deep reinforcement learning network model is an Actor-Critic network model; the Actor neural network generates a strategy from the current environment state S, produces specific joint motion actions as the input for robot motion, and interacts with the environment; the Critic neural network evaluates the strategic joint-action output generated by the Actor network in state S, judges whether the current situation is good or bad, measures it by a value, and returns the measured value to the Actor neural network for learning and parameter optimization, so that the cost function converges to the global optimum.
3. The method according to claim 2, wherein the end execution position TCP pose of the robot, consisting of the coordinate position (x, y, z) and the Euler angle orientation (α, β, γ), together with the stiffness k, the temperature change T, the load η, the time signal t, and the time-signal functions sin(t) and ln(t), is used as the input of the deep reinforcement learning network; and the joint angle corrections Δjoint_angle(a1, a2, a3, a4, a5, a6) of the robot are used as the output of the deep reinforcement learning network.
4. The method according to claim 2 or 3, wherein the step 3 specifically comprises the steps of:
(1) taking the state characteristics of the industrial robot acquired in step 1 and the corresponding massive pose-error parameter data set as training samples and inputting them into the robot simulation interaction software; at the start of each training episode, the actual position is the actual pose from the robot sample data set and the target position is the theoretical pose from the robot sample data set;
(2) the Actor-Critic network obtains the robot's current TCP pose, stiffness k, temperature change T, and load η state values, together with the time signal and time functions, from the robot simulation interaction environment, calculates the current angle correction of each joint, and sends it back to the robot simulation interaction software;
(3) after receiving the joint angle corrections, the robot simulation interaction software performs a joint-limit calculation for the robot and judges whether each joint is within its limits; if so, it executes the joint motion correction, and if some robot joint is out of its limits, it ends the current episode and reports this to the Actor-Critic network;
(4) acquiring the current robot pose and the target position and calculating the reward value to obtain the reward function R; if the R value is too low, the current episode ends, while if the R value is normal, the current episode continues and R is returned to the Actor-Critic network for further learning;
and repeating the steps, and training to obtain the structure parameters of the Actor-Critic network model.
5. The method according to claim 4, wherein the reward function R is calculated from the theoretical pose and the actual pose of the robot as the negative Mahalanobis distance:

D_M(P, P_0) = sqrt( (P − P_0)^T Σ^(−1) (P − P_0) )

R = η · D_M(P, P_0)

wherein P is the current pose, P_0 is the target pose, Σ is the covariance matrix of P and P_0, and η < 0.
CN202110856844.6A 2021-07-28 2021-07-28 Industrial robot pose precision online compensation method based on deep reinforcement learning Active CN113510709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110856844.6A CN113510709B (en) 2021-07-28 2021-07-28 Industrial robot pose precision online compensation method based on deep reinforcement learning


Publications (2)

Publication Number Publication Date
CN113510709A CN113510709A (en) 2021-10-19
CN113510709B true CN113510709B (en) 2022-08-19

Family

ID=78068761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110856844.6A Active CN113510709B (en) 2021-07-28 2021-07-28 Industrial robot pose precision online compensation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113510709B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113977429A (en) * 2021-11-17 2022-01-28 长春理工大学 Robot constant-force polishing system based on deep learning and polishing control method
CN114310873A (en) * 2021-12-17 2022-04-12 上海术航机器人有限公司 Pose conversion model generation method, control method, system, device and medium
CN114952849B (en) * 2022-06-01 2023-05-16 浙江大学 Robot track tracking controller design method based on reinforcement learning and dynamics feedforward fusion
CN115673596B (en) * 2022-12-28 2023-03-17 苏芯物联技术(南京)有限公司 Welding abnormity real-time diagnosis method based on Actor-Critic reinforcement learning model
CN116663204B (en) * 2023-07-31 2023-10-17 南京航空航天大学 Offline programming method, system and equipment for robot milling
CN117331342B (en) * 2023-12-01 2024-02-02 北京航空航天大学 FFRLS algorithm-based machine tool feed shaft parameter identification method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06114768A (en) * 1992-09-29 1994-04-26 Toyoda Mach Works Ltd Robot control device
US5566275A (en) * 1991-08-14 1996-10-15 Kabushiki Kaisha Toshiba Control method and apparatus using two neural networks
CN107421442A (en) * 2017-05-22 2017-12-01 天津大学 A kind of robot localization error online compensation method of externally measured auxiliary
CN108052004A (en) * 2017-12-06 2018-05-18 湖北工业大学 Industrial machinery arm autocontrol method based on depth enhancing study
CN110967042A (en) * 2019-12-23 2020-04-07 襄阳华中科技大学先进制造工程研究院 Industrial robot positioning precision calibration method, device and system
CN112497216A (en) * 2020-12-01 2021-03-16 南京航空航天大学 Industrial robot pose precision compensation method based on deep learning


Also Published As

Publication number Publication date
CN113510709A (en) 2021-10-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant