CN116494247A - Mechanical arm path planning method and system based on depth deterministic strategy gradient

Mechanical arm path planning method and system based on depth deterministic strategy gradient

Info

Publication number
CN116494247A
CN116494247A
Authority
CN
China
Prior art keywords
mechanical arm
value
experience
network
target
Prior art date
Legal status
Pending
Application number
CN202310703629.1A
Other languages
Chinese (zh)
Inventor
安玲玲
谢振
万波
张慧锋
罗贤涛
Current Assignee
Shenzhen Readline Biotechnology Co ltd
Xidian University
Guangzhou Institute of Technology of Xidian University
Original Assignee
Shenzhen Readline Biotechnology Co ltd
Xidian University
Guangzhou Institute of Technology of Xidian University
Priority date
Filing date
Publication date
Application filed by Shenzhen Readline Biotechnology Co ltd, Xidian University, Guangzhou Institute of Technology of Xidian University
Priority to CN202310703629.1A
Publication of CN116494247A
Legal status: Pending

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a mechanical arm path planning method and system based on depth deterministic strategy gradient. The method comprises the following steps: constructing a mechanical arm multitasking motion model, a depth deterministic strategy gradient algorithm network model, a layered reward function for end tracking of the mechanical arm, and an experience sample pool; training the depth deterministic strategy gradient algorithm network model based on the layered reward function and the weights of the experience samples to obtain a trained depth deterministic strategy gradient algorithm network model; and deploying the trained model to the mechanical arm multitasking motion model to plan the path of the mechanical arm. The invention improves the utilization rate of the mechanical arm training samples and the training speed, and can be widely applied in the technical field of robot mechanical arm path planning.

Description

Mechanical arm path planning method and system based on depth deterministic strategy gradient
Technical Field
The invention relates to the technical field of robot mechanical arm path planning, in particular to a mechanical arm path planning method and system based on depth deterministic strategy gradients.
Background
With the continuous development of artificial intelligence technology and the growing demands of engineering machinery, traditional artificial neural networks have achieved wide success in fields such as pattern recognition, automatic control, signal processing and decision support. As industrialization advances and the information age arrives, development plans for a new generation of artificial intelligence place further emphasis on intelligent robotics: the intelligent robot industry is an important indicator of a country's capacity for technological innovation and high-end manufacturing, and its development is receiving close attention from countries around the world. Mechanical arms are the most common type of industrial robot and are found in many fields such as logistics, medical treatment and construction. Traditional mechanical arm operation often requires manual assistance and depends on manually provided task instructions and operating modes, which brings many limitations, such as the inability to complete tasks autonomously and high technical requirements for operators. The development of deep learning and reinforcement learning provides new ideas for making mechanical arms intelligent: some researchers help a mechanical arm perform complex actions by building an artificial neural network that lets the arm learn imitation features, while others apply reinforcement learning algorithms to the mechanical arm, training it as an agent so that the trained arm can complete tasks such as path planning or object grasping. These studies open new directions for the intelligentization of mechanical arms. In recent years, the rise of fields such as intelligent robots, autonomous driving and artificial intelligence has brought great convenience to human life, and in these fields the path planning problem has always been one of the research hotspots: a suitable path planning algorithm can improve the efficiency of movement and reduce time complexity. In the robot field, a path planning algorithm provides a feasible, efficient and safe motion route for the robot. Existing schemes that plan a mechanical arm grasping path with machine vision, and improvements to traditional mechanical arm path planning, have greatly improved production efficiency, but their algorithms are not highly effective, are time-consuming and have high complexity. How to use effective visual cues to realize autonomous obstacle avoidance and path planning for a mechanical arm is therefore an urgent problem to be solved.
Disclosure of Invention
In order to solve the above technical problems, the invention aims to provide a mechanical arm path planning method and system based on depth deterministic strategy gradient. During training of the mechanical arm, a multi-layer reward mechanism strengthens the guiding role of the reward function, the time difference error is used as the weight of each experience sample, and the data structure properties of the binary heap are applied to the experience pool replacement method, thereby further improving the utilization rate of the mechanical arm training samples and the training speed.
The first technical scheme adopted by the invention is as follows: a mechanical arm path planning method based on depth deterministic strategy gradient comprises the following steps:
taking the three-dimensional space motion characteristics of the mechanical arm into consideration, constructing a mechanical arm multitasking motion model, wherein the mechanical arm multitasking motion comprises tail end tracking of the mechanical arm, pushing of the mechanical arm and grabbing of the mechanical arm;
based on an Actor-Critic network structure, establishing a depth deterministic strategy gradient algorithm network model;
introducing a preset rewarding rule, and constructing a layered rewarding function for the tail end tracking of the mechanical arm;
introducing a priority experience playback mechanism, accumulating an experience sample pool, and obtaining the weight of each experience sample;
training the depth deterministic strategy gradient algorithm network model based on the layering reward function and the weight of the experience sample, and obtaining a trained depth deterministic strategy gradient algorithm network model;
And deploying the trained depth deterministic strategy gradient algorithm network model to a mechanical arm multitasking motion model, and planning a path of the mechanical arm.
Further, the step of constructing the multi-task motion model of the mechanical arm by considering the three-dimensional space motion characteristics of the mechanical arm specifically comprises the following steps:
drawing a three-dimensional motion model of the mechanical arm;
initializing a virtual environment, and setting a three-dimensional motion model of the mechanical arm based on a coordinate system O-XYZ in the virtual environment;
defining the action space, the observation state information and the action information of the mechanical arm, and setting the update step of the mechanical arm, as illustrated by the sketch below.
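For illustration only, a minimal Python sketch of how such a multitask motion environment could be skeletonized is given below; the class name ArmReachEnv, the observation/action dimensions, the joint-increment bounds and the placeholder kinematics are assumptions made for the sketch and are not taken from the patent.

```python
import numpy as np

class ArmReachEnv:
    """Hypothetical end-tracking task in an O-XYZ workspace (all dimensions are illustrative)."""

    def __init__(self, max_steps=300):
        self.action_dim = 4          # assumed: bounded increments for 4 controllable joints
        self.obs_dim = 13            # assumed: joint angles + end position + target + offset
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.steps = 0
        self.target = np.random.uniform(-0.5, 0.5, size=3)    # target position in O-XYZ
        self.joint_angles = np.zeros(self.action_dim)
        return self._observe()

    def _forward_kinematics(self):
        # Placeholder: a real model maps the joint angles to the end-effector position.
        return np.tanh(self.joint_angles[:3])

    def _observe(self):
        end = self._forward_kinematics()
        return np.concatenate([self.joint_angles, end, self.target, end - self.target])

    def step(self, action):
        self.steps += 1
        self.joint_angles += np.clip(action, -0.05, 0.05)     # bounded joint increment per update step
        end = self._forward_kinematics()
        reached = np.linalg.norm(end - self.target) < 0.05
        done = reached or self.steps >= self.max_steps
        reward = 0.0    # to be supplied by the layered reward function described later
        return self._observe(), reward, done, {"reached": reached}
```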
Further, the step of establishing a depth deterministic strategy gradient algorithm network model based on the Actor-Critic network structure specifically comprises the following steps:
the depth deterministic strategy gradient algorithm network model updates the deterministic strategy parameters by maximizing the cumulative reward value and outputs a deterministic action in the action space of the mechanical arm; the model comprises a main network and a target network, the main network comprises an Actor network and a Critic network, and the target network comprises a Target Actor network and a Target Critic network;
the Actor network adopts a deterministic strategy: given an input state, it integrates over the state distribution and outputs a determined action, and the deterministic strategy gradient function is used as the method for updating the Actor network parameters;
the Critic network is an evaluation network used to calculate a Q value that evaluates the quality of the strategy adopted by the Actor network, and the state-action value function is used as the method for updating the Critic network parameters;
the target network is used to calculate the target Q value and assist in updating the parameters of the main network.
Further, the expression of the depth deterministic strategy gradient algorithm network model is specifically as follows:
$J(\mu_\omega) = \mathbb{E}_{s\sim\rho^{\mu}}\!\left[r(s,\mu_\omega(s))\right] = \int_{S}\rho^{\mu}(s)\,r(s,\mu_\omega(s))\,ds$
In the above, $\mu_\omega$ represents the deterministic strategy, $J(\cdot)$ represents the training objective and the network is trained by maximizing $J(\cdot)$, $r$ represents the reward obtained after executing the strategy $\mu_\omega$, $s$ represents the state fed back by the environment, $\rho^{\mu}$ represents the state distribution function, and $\int_{S}(\cdot)\,ds$ represents integration over the state space.
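For illustration, a minimal PyTorch sketch of the four-network structure described above (Actor, Critic and their Target copies) is given below; the layer sizes, the tanh output activation, the soft-update rate tau and the dimensions are assumptions rather than values specified by the patent.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy network: state -> action in [-1, 1] (layer sizes are illustrative)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q network: (state, action) -> scalar value."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def soft_update(target, source, tau=0.005):
    """Target networks slowly track the main networks: theta_t <- tau*theta + (1-tau)*theta_t."""
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)

obs_dim, act_dim = 13, 4                                   # assumed dimensions
actor, critic = Actor(obs_dim, act_dim), Critic(obs_dim, act_dim)
target_actor, target_critic = Actor(obs_dim, act_dim), Critic(obs_dim, act_dim)
target_actor.load_state_dict(actor.state_dict())           # target networks start as copies
target_critic.load_state_dict(critic.state_dict())
```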
Further, the preset reward rule specifically includes:
pre-issuing a training task to the mechanical arm multi-task motion model;
considering whether the mechanical arm completes the task: a reward of the preset target value is issued to a mechanical arm that completes the task, and no reward value is issued to a mechanical arm that does not complete the task;
considering the number of steps the mechanical arm stays in the target area: for a mechanical arm whose end stays in the target area for more than the preset number of steps, a one-time reward is issued whose value is the preset bonus minus the number of steps consumed to complete the target, multiplied by a preset coefficient; no such reward is issued when the stay is shorter than the preset number of steps;
considering the number of steps taken to complete the task: the number of steps the mechanical arm consumes to complete the pre-issued training task is counted, and a negative reward equal to the consumed step count multiplied by a preset proportional coefficient is issued;
considering the distance between the end of the mechanical arm and the target: a preset reward value is given to the mechanical arm when the distance between its end and the target is smaller than a set value.
Further, the step of introducing a priority experience playback mechanism and accumulating an experience sample pool specifically comprises the following steps:
observing the state of the current mechanical arm, inputting the state into an Actor network to obtain a corresponding mechanical arm action output result, and storing the mechanical arm action output result into an experience sample pool;
the mechanical arm multitask motion model executes a mechanical arm action output result, updates the state of the mechanical arm to obtain the observation state of the mechanical arm at the next moment, and calculates to obtain a corresponding rewarding value;
inputting the current state of the mechanical arm and the output result of the mechanical arm action into a Critic network to obtain a Q estimated value;
inputting the observation state of the mechanical arm at the next moment into a Target Actor network, and obtaining the action output result of the mechanical arm at the next moment;
inputting the observation state of the mechanical arm at the next moment and the action output result of the mechanical arm at the next moment into a Target Critic network to acquire a Q Target value;
Performing difference calculation processing on the Q estimation value and the Q target value to obtain a TD-error value, wherein the larger the TD-error value is, the larger the potential of experience learning is represented, and the higher the priority is;
the execution steps of the main network and the target network are cycled to accumulate the experience sample pool, and the TD-error value is set as the weight of the corresponding experience sample; the data stored in the experience sample pool take the form of a five-tuple $(s_t, a_t, r_t, s_{t+1}, done)$, where $s_t$ represents the current observation state of the mechanical arm, $a_t$ represents the current action output of the mechanical arm, $r_t$ represents the current reward value, $s_{t+1}$ represents the observation state of the mechanical arm at the next moment, and $done$ represents the task completion status.
Further, the calculation expression of the TD-error is as follows:
$y_t = r_{t+1} + \gamma\,Q(s_{t+1}), \qquad \delta_t = r_{t+1} + \gamma\,Q(s_{t+1}) - Q(s_t)$
In the above, $r_{t+1}$ represents the reward value at time $t+1$, $y_t$ represents the TD-target (temporal difference target), $\delta_t$ represents the temporal difference error at time $t$, $\gamma$ represents a discount factor used to balance the importance of current and future rewards, and $Q(s_t)$ represents the value function estimate of the current state.
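A minimal sketch of how this TD-error could be computed and used as the sample weight, reusing the hypothetical network objects from the earlier sketch; the inputs are batched float tensors, and the buffer API in the trailing comment and the discount factor value are assumptions.

```python
import torch

GAMMA = 0.99   # assumed discount factor

@torch.no_grad()
def td_error(critic, target_actor, target_critic, s, a, r, s_next, done, gamma=GAMMA):
    """delta = r + gamma * Q_target(s', mu_target(s')) - Q(s, a); |delta| serves as the sample weight."""
    q_estimate = critic(s, a)                                             # Q estimate from the main Critic
    a_next = target_actor(s_next)                                         # next action from the Target Actor
    q_target = r + gamma * (1.0 - done) * target_critic(s_next, a_next)   # Q target value
    return (q_target - q_estimate).squeeze(-1)

# Storing one transition together with its priority (the buffer API is an assumption):
#   delta = td_error(critic, target_actor, target_critic, s, a, r, s_next, done)
#   pool.add(priority=abs(delta.item()), experience=(s, a, r, s_next, done))
```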
The method also comprises storing and processing the data in the experience sample pool based on a SumTree binary tree combined with the weights of the experience samples, specifically as follows (an illustrative sketch of the structure follows this list):
introducing a SumTree binary tree data structure comprising leaf nodes and parent nodes, wherein the leaf nodes store the five-tuple data of the experience samples and the corresponding TD-error values, and each parent node keeps the sum of the TD-error values of its children, with the root node holding the total sum;
uniformly sampling a value within the total sum held by the root node as the preset sampling value;
comparing the sampling value with the left child of the current node, starting from the root node: if the sampling value is larger than the value of the left child, performing a difference calculation between the sampling value and the value of the left child, taking the difference as the sampling value for the next step and entering the right child; otherwise entering the left child;
repeating the comparison with the left child of the new node until a leaf node is reached; the five-tuple of the experience sample stored in that leaf node is the sample retrieved for training.
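A compact sketch of such a sum-tree store is given below (the sampling descent itself is sketched later, alongside the FIG. 5 walk-through); the array-backed layout and the simple overwrite policy in add() are implementation assumptions.

```python
import numpy as np

class SumTree:
    """Array-backed sum tree: leaves hold |TD-error| priorities, internal nodes hold child sums."""

    def __init__(self, capacity):
        self.capacity = capacity                 # number of leaves (experience slots)
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes followed by leaves
        self.data = [None] * capacity            # the (s, a, r, s', done) tuples
        self.write = 0

    def add(self, priority, experience):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = experience
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity   # simple overwrite policy for this sketch

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                         # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def total(self):
        return self.tree[0]                      # root = sum of all priorities
```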
The method also comprises the steps of introducing a binary tree data structure of a minimum heap, and carrying out substitution processing on data in an experience sample pool based on the weight of the experience sample, wherein the substitution processing is specifically as follows:
when the experience pool reaches the maximum sample number, a new experience sample is used for replacing the experience sample with the minimum weight;
deleting the root node through the minimum heap, covering the experience sample of the root node with the experience sample of the last node in the binary tree, and performing top-down recursion adjustment, wherein the adjustment target is to enable the whole binary tree to meet the data structure of the minimum heap again;
The new experience sample is inserted after the last node of the binary tree, and the experience sample with the smallest weight is moved to the root node by recursion upwards.
The second technical scheme adopted by the invention is as follows: a depth deterministic strategy gradient-based robotic arm path planning system, comprising:
the construction module is used for constructing a mechanical arm multitasking motion model, a depth deterministic strategy gradient algorithm network model, a layering rewarding function of tail end tracking of the mechanical arm and an experience sample pool;
the training module is used for training the depth deterministic strategy gradient algorithm network model based on the layering rewarding function and the weight of the experience sample, and obtaining a trained depth deterministic strategy gradient algorithm network model;
and the planning module is used for deploying the trained depth deterministic strategy gradient algorithm network model to the mechanical arm multitasking motion model and planning the path of the mechanical arm.
The method and the system have the following beneficial effects. During the action output training of the mechanical arm, the invention uses the feedback of the environmental reward value to evaluate the value of an action, learns the cost the action produces, optimizes the strategy network and the value function in the depth deterministic strategy gradient algorithm, and adjusts the next movement, so that the reward value is maximized. For the reward sparsity problem present in the complex movement modes of the mechanical arm and in the depth deterministic strategy gradient algorithm, a multi-level reward function is designed: reward values are given from multiple dimensions of the mechanical arm's movement, the feedback mechanism of the environment is strengthened, the early exploration capability of the mechanical arm is improved, and the training speed of the mechanical arm is accelerated. An experience playback mechanism is further introduced so that data are repeatedly sampled from an experience pool; a suitable metric is adopted to determine the weight of each sample, and by considering the weight of each experience sample in the experience sample pool, the sampling probability of high-weight experience samples is improved. Finally, the data structure properties of the binary heap are used in the experience pool replacement method, so that high-weight experience samples are retained and low-weight experience samples are removed.
Drawings
FIG. 1 is a flow chart of steps of a robotic arm path planning method based on depth deterministic strategy gradients of the present invention;
FIG. 2 is a block diagram of a robotic path planning system based on depth deterministic strategy gradients in accordance with the present invention;
FIG. 3 is a schematic flow chart of steps of a conventional intelligent robot path planning intelligent algorithm;
FIG. 4 is a schematic flow chart of an algorithm for planning a path of a mechanical arm based on a depth deterministic strategy gradient;
FIG. 5 is a schematic diagram of the structure of a SumTree binary tree of the present invention;
FIG. 6 is a schematic diagram of a binary tree data structure of a minimum heap of the present invention;
FIG. 7 is a schematic flow chart of a first round of empirical sample substitution in accordance with an embodiment of the present invention;
FIG. 8 is a schematic flow chart of a second round of empirical sample substitution in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of the completion rate index result of the binary heap based PER algorithm of the present invention compared to existing algorithms;
FIG. 10 is a graph of average return indicator results of a binary heap based PER algorithm of the present invention compared to existing algorithms;
FIG. 11 is a graph of the average step index result of the binary heap based PER algorithm of the present invention compared to existing algorithms.
Detailed Description
The invention will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In recent years, the rise of fields such as intelligent robots, autonomous driving and artificial intelligence has brought great convenience to human life. In research in these fields, the path planning problem has always been one of the hot topics: a suitable path planning algorithm can improve the efficiency of movement and reduce time complexity, so that in the robot field a path planning algorithm provides a feasible, efficient and safe motion route for the robot. Algorithms for the path planning problem fall mainly into two categories: traditional path planning algorithms, based mainly on graph properties and on sampling, and intelligent algorithms, as shown in FIG. 3. In actual work a mechanical arm often faces a complex environment with multiple obstacles, so research on mechanical arm control mainly proceeds from two aspects: acquisition of obstacle information under visual cues, and obstacle-avoiding path planning. Reliable visual cues can effectively guide the movement of the mechanical arm so that it operates efficiently, safely and stably along an effective path. In 2016, Chen disclosed a scheme that reduces unnecessary sweeping motion of a flexible surgical manipulator; the method is a new three-dimensional neuro-dynamics model that obtains a safety-enhanced trajectory of the manipulator workspace by considering a minimum scanning area. In 2017, in <Robot arm motion planning using robust constrained control>, Zanchettin disclosed a new motion allocation and reactive execution algorithm that combines conventional trajectory generation techniques and an optimal strategy into a unified synchronous motion planning and control framework, effectively solving the motion trajectory planning problem of the robot arm. In 2018, in <Dynamic obstacle avoidance by distance calculation and discrete detection>, Han proposed a Gilbert-Johnson-Keerthi algorithm for the mechanical arm and improved it by calculating the nearest distance, so as to solve the optimal path planning problem of the mechanical arm, thereby avoiding dynamic obstacles while the arm executes a manufacturing task. In 2019, in <Research based on ROS tracking recognition and control method>, Wang proposed ROS-based tracking recognition, enabling a robot arm with machine vision to recognize and grasp objects and complete their classification, using techniques that realize autonomous robot grasping from vision. In 2020, Liu disclosed prediction of multiple feature points of an object based on a neural network to obtain the transformation relation between object coordinates and image coordinates, so that the obtained three-dimensional pose accurately describes the three-dimensional position and orientation of the object. Zhang disclosed a method that optimizes the Informed-RRT algorithm based on a greedy algorithm and a changed search object, in order to solve the problems of poor goal-directedness, slow convergence and low path optimization efficiency of the Informed-RRT algorithm in robot path planning. In <A collision-free inverse kinematics solution completeness method based on a reachability database>, Xu disclosed effective and robust collection of objects stored in different trays by a mobile mechanical arm: a collision-free inverse kinematics completeness method based on a reachability database was proposed to determine a complete set of solutions for feasible base positions, approximating a set of typical IK solutions, which is particularly useful when handling IK and checking collisions separately. In <Study of an industrial robot positioning system based on machine vision>, Ma disclosed the study and construction of a machine-vision-based positioning system for lithium battery current-carrying sheets handled by an industrial robot, in which multi-angle depth images are segmented by a convolutional neural network, the segmented depth images are matched with preset target object images, and path planning is performed with the mechanical arm. In 2021, in <Indoor environment visual positioning using RGB-D images and improved local aggregation descriptor vectors>, Zhang proposed a visual positioning method for intelligent mobile devices based on RGB-D images, which computes the pose of each image in a training set by feature extraction and description, image registration and pose graph optimization, and then clusters the training set and the query set in an image retrieval stage to generate a local aggregation description vector. To avoid the time consumed by kinematic inversion and the high calibration complexity of the vision system in automatic mechanical arm grasping methods, Chen et al. provided a mechanical arm grasping method based on a combined Gaussian process regression and kernel ridge regression model. Planning the mechanical arm grasping path with machine vision and the improvements to traditional mechanical arm path planning schemes greatly improve production efficiency, but the algorithms are not highly effective, are relatively time-consuming and have relatively high complexity;
In recent years, with the development of deep learning and reinforcement learning, more and more researchers have applied deep reinforcement learning to path planning. Deep reinforcement learning algorithms fall mainly into two classes: value-function-based algorithms, including DQN (Deep Q-Network), Dueling DQN (Dueling Deep Q-Network) and Double DQN (Double Deep Q-Network), and policy-based methods, including DPG (Deterministic Policy Gradient), DDPG (Deep Deterministic Policy Gradient), A3C (Asynchronous Advantage Actor-Critic) and PPO (Proximal Policy Optimization). Deep reinforcement learning can overcome the poor adaptation to dynamic environments of conventional path planning algorithms: it combines deep learning with reinforcement learning, where deep learning provides the perception capability and reinforcement learning provides the decision capability, so that an agent can autonomously learn path selection through interaction with the environment and the corresponding feedback mechanism, achieving obstacle avoidance and path optimization. In <A robot exploration strategy based on Q-learning network>, Tai verified by experiment the path planning capability of a robot based on the deep Q-network algorithm in a maze environment: with images as input and the robot's motion direction as output, the robot could complete functions such as automatic obstacle avoidance and tracking by maximizing a reward value function. In <Reinforcement learning with unsupervised auxiliary tasks>, Jaderberg disclosed enhancing the A3C algorithm by adding auxiliary task rewards, then tested an agent based on the algorithm on trajectories in a maze environment to judge its performance; experiments showed that the agent could perform path finding well in the maze environment. In <Learning to drive in a day>, Kendall disclosed first using the DDPG algorithm for autonomous driving, interacting with the surrounding environment to realize path planning automatically. Path planning algorithms that use deep reinforcement learning can improve the autonomous learning ability of a mechanical arm, but reinforcement learning algorithms generally face the problem of sparse rewards; for this problem, the difficulty of obtaining rewards can be reduced by designing the reinforcement learning reward function and the experience playback strategy;
In summary, taking the intelligent mechanical arm algorithm in an intelligent biological colony capturing process as the research purpose, the invention designs a mechanical arm path planning algorithm based on the depth deterministic strategy gradient algorithm. By analyzing the motion modes of the mechanical arm and the basic principle of the depth deterministic strategy gradient algorithm, the research is expanded from the reward function and the experience playback strategy of the algorithm, specifically:
(1) A hierarchical reward function based on the depth deterministic strategy gradient algorithm is presented. In the training process, the mechanical arm uses the feedback of environmental reward values to evaluate the value of actions, learns the cost the actions produce, optimizes the strategy network and the value function in the depth deterministic strategy gradient algorithm, adjusts the next movement, and maximizes the reward value. For the reward sparsity problem in the complex movement modes of the mechanical arm and in the depth deterministic strategy gradient algorithm, the project analyzes the joint angles and movement modes of the mechanical arm, combines the characteristics of mechanical movement in real three-dimensional space, builds a multi-layer task movement model covering end tracking, pushing and grasping of the mechanical arm, and designs a multi-layer reward function that gives reward values from multiple dimensions of the arm's movement, strengthening the feedback mechanism of the environment, improving the early exploration capability of the mechanical arm and accelerating its training speed;
(2) An experience playback strategy based on the depth deterministic strategy gradient algorithm is presented. First, for the problems that in a simulation environment with sparse reward values and high exploration difficulty the mechanical arm has few opportunities to learn effective movements and its training efficiency is low, an experience playback mechanism is introduced: an experience sample pool is opened up to store the training experience samples of single movements, and data are repeatedly sampled from the experience pool, improving the experience sample utilization rate. Second, since the random sampling strategy in the experience playback mechanism makes poor use of high-weight experience samples, the project adopts a suitable metric to determine the weight of each sample, considers the weight of every experience sample in the pool, and designs a suitable storage scheme that improves the sampling probability of high-weight experience samples. Finally, for the problems of the capacity limit of the experience sample pool and the low efficiency of the experience sample replacement algorithm, the project applies the data structure properties of the binary heap to the experience pool replacement method, retaining high-weight experience samples, removing low-weight experience samples, and improving the utilization rate of high-weight experience samples.
Referring to fig. 1 and 4, the present invention provides a method for planning a path of a manipulator based on a depth deterministic strategy gradient, the method comprising the steps of:
S1, constructing a depth deterministic strategy gradient algorithm;
Specifically, in a conventional Actor-Critic model the strategy adopted is a stochastic strategy. A stochastic strategy outputs a probability distribution over the action space, that is, the probability of each action, with probabilities between 0 and 1; a method based on a stochastic strategy therefore has to sample the whole action space at every output, and for a high-dimensional space such large-batch sampling degrades the performance of the algorithm. The deterministic policy gradient algorithm was proposed for this problem. Unlike the stochastic strategy, the action output by a deterministic strategy algorithm is determined, so there is no need to integrate over the action space, which improves efficiency. The deterministic strategy parameters are updated by maximizing the cumulative reward value, and the mathematical expression of the objective function of the deterministic strategy is as follows:
$J(\mu_\omega) = \mathbb{E}_{s\sim\rho^{\mu}}\!\left[r(s,\mu_\omega(s))\right] = \int_{S}\rho^{\mu}(s)\,r(s,\mu_\omega(s))\,ds$
In the above, $\mu_\omega$ represents the deterministic strategy, $J(\cdot)$ represents the training objective and the network is trained by maximizing $J(\cdot)$, $r$ represents the reward obtained after executing the strategy $\mu_\omega$, $s$ represents the state fed back by the environment, $\rho^{\mu}$ represents the state distribution function, and $\int_{S}(\cdot)\,ds$ represents integration over the state space;
Further, the deterministic strategy parameters are updated through the maximum cumulative reward value and the action of the mechanical arm in the action space is output; the discounted state distribution under the strategy is expressed as follows:
$\rho^{\mu}(s) = \sum_{k=0}^{\infty} \gamma^{k}\, p(s_0 \rightarrow s, k, \mu_\omega)$
In the above, $\rho^{\mu}$ represents the probability distribution induced by the strategy, $\gamma^{k}$ represents the discount factor used to weight the importance of current and future rewards, $p(s_0 \rightarrow s, k, \mu_\omega)$ represents the probability of transitioning from state $s_0$ to state $s$ over $k$ time steps, and $k$ represents the number of time steps;
the mathematical expression of the deterministic strategy gradient is as follows:
$\nabla_{\omega} J(\mu_\omega) = \mathbb{E}_{s\sim\rho^{\mu}}\!\left[ \nabla_{\omega}\,\mu_\omega(s)\, \nabla_{a} Q^{\mu}(s,a)\big|_{a=\mu_\omega(s)} \right]$
In the above, $\nabla_{\omega}$ represents the gradient with respect to the strategy parameter $\omega$, $\nabla_{a}$ represents the gradient with respect to the action $a$, $\mu_\omega(s)$ represents the action $a$ taken by the strategy $\mu_\omega$ in state $s$, $Q^{\mu}(s,a)$ represents the state-action value function, and $a$ represents an action;
In the Actor-Critic structure, the state-action value function is used as the method for updating the Critic neural network parameters, and the deterministic strategy gradient function as the method for updating the Actor neural network parameters. The structure is similar to Q-learning in that the network parameters are updated with the TD-error of the value function, but it differs in that Q-learning outputs actions with a greedy algorithm whereas here actions are output by the deterministic strategy gradient. The mathematical expression of the TD-error of the value function is as follows:
$\delta_t = r(s,a) + \gamma\,Q\!\left(s', \mu_\omega(s')\right) - Q\!\left(s, \mu_\omega(s)\right)$
In the above, $\delta_t$ represents the time difference error at time $t$, $r(s,a)$ represents the reward function for state $s$ and action $a$, $Q(s', \mu_\omega(s'))$ represents the value function at time $t+1$, and $Q(s, \mu_\omega(s))$ represents the value function at time $t$;
the Actor network is designed based on deterministic policies so its action output is deterministic. Compared with the strategy gradient algorithm, the deterministic strategy gradient algorithm does not need to integrate the action, but only needs to integrate the state, so that the sampling of the action is reduced, and the efficiency is improved. The deterministic strategy gradient mathematical expression adopted by the Actor network is as follows:
$J_{\beta}(\mu_\omega) = \int_{S} \rho^{\beta}(s)\, Q^{\mu}\!\left(s, \mu_\omega(s)\right) ds, \qquad \nabla_{\omega} J_{\beta}(\mu_\omega) = \mathbb{E}_{s\sim\rho^{\beta}}\!\left[ \nabla_{\omega}\,\mu_\omega(s)\, \nabla_{a} Q^{\mu}(s,a)\big|_{a=\mu_\omega(s)} \right]$
In the above, $J_{\beta}(\mu_\omega)$ represents the performance index of the strategy $\mu_\omega$, which measures the quality of the strategy and is used as the optimization target for adjusting the strategy parameters, and $\rho^{\beta}$ represents the state distribution of the behaviour (sampling) strategy;
this formulation ensures that the output of the agent's action does not depend on updating the action-value function over the action space, and it guarantees that the algorithm achieves local convergence.
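Putting the above together, one possible Actor-Critic update step under these formulas might look as follows; the optimizer choice, learning rates, gamma and tau are assumptions rather than values given by the patent, and the networks are the hypothetical ones sketched earlier.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One update on a sampled mini-batch; s, a, r, s_next, done are batched float tensors."""
    s, a, r, s_next, done = batch

    # Critic: regress Q(s, a) onto the TD-target built with the target networks.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient -- ascend Q(s, mu_omega(s)) by minimizing its negative.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Target networks slowly track the main networks (soft update).
    for tgt, src in ((target_actor, actor), (target_critic, critic)):
        for tp, p in zip(tgt.parameters(), src.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
    return critic_loss.item(), actor_loss.item()
```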
S2, a layered rewarding function based on a depth deterministic strategy gradient algorithm;
Specifically, in a reinforcement learning algorithm the agent always uses the feedback of the environmental reward value to judge whether an action is good or bad, and thereby continuously optimizes the strategy network or value function in reinforcement learning. A suitable, well-designed reward function can greatly improve the training speed of the agent and the effectiveness of the reinforcement learning algorithm. At the beginning of training, the actions output by the reinforcement learning algorithm can only obtain reward values through continuous exploration: an output action acts in the environment, the environment feeds back an evaluation of the value of that action, and by learning the value of the action the agent outputs the next action, moving in the direction that maximizes the reward value. A suitable reward value function is therefore particularly important in a reinforcement learning algorithm;
To train the motion control of the mechanical arm with a deep reinforcement learning algorithm, a suitable reward function must first be set; during training the environment is observed by acquiring the reward value of the reward function, so as to adjust the joint angles and mode of the arm's next movement and obtain the maximum reward value, until the expected training goal is reached. The movement of the mechanical arm is more complex than that of an ordinary agent. First, the mechanical arm is a multi-joint controlled robot, and its movement mode must be considered when building the simulation environment for training. Second, in practical applications the arm moves in three-dimensional space, so the characteristics of its three-dimensional motion must also be considered. Finally, in practical applications the mechanical arm's work can be divided into multiple layers of tasks, including end tracking, pushing and grasping of the mechanical arm, and the reward function needs to differ for different tasks;
In setting the reward function of a reinforcement learning algorithm, a positive reward value is usually given on the condition that the agent completes the task, for example completes one episode of the task; to improve the agent's efficiency, a negative reward value is often attached to the number of movement steps. This form of reward function is common and effective in a simple simulation environment with a single task target, but for the deep reinforcement learning reward function of a mechanical arm it cannot achieve a good training effect, because the arm moves in three-dimensional space and rewarding only the completion state of the task leads to sparse feedback, which is currently one of the more troublesome problems of reinforcement learning. For these problems, the invention fully considers the movement mode and characteristics of the mechanical arm and proposes a layered reward function for the end-tracking movement of the arm. The layered reward function divides the reward function into several layers, and under the excitation of the multi-layer reward function the end of the mechanical arm can complete the task in the minimum number of steps;
S21, rewarding rules;
Specifically, a reward value is given according to task completion: when the task is completed a reward with value 1 is given, and no reward is given otherwise. The expression is as follows:
$r_1 = \begin{cases} 1, & \text{target completed} \\ 0, & \text{otherwise} \end{cases}$
In the above, target represents completion of the task;
Meanwhile, to ensure that the mechanical arm truly achieves the target, when the end of the arm stays in the target area for more than 50 steps the algorithm gives a one-time reward of 200; to raise the probability that the arm completes the target early in training, the number of steps consumed to complete the target is subtracted and the result is multiplied by a coefficient of 0.5. The reward value is expressed as follows:
$r_2 = \begin{cases} 0.5 \times (200 - steps), & gold > 50 \\ 0, & \text{otherwise} \end{cases}$
In the above, steps represents the number of steps the mechanical arm needs to complete the target, and gold represents the number of steps the end of the mechanical arm has stayed in the target area;
To prevent the mechanical arm from counting the task as completed merely because it accidentally touched the target area during training, at a moment when it has not actually achieved the content of the task, the invention sets a preset stay step count: only when the number of steps the arm stays in the target area exceeds the preset value is the task counted as completed. This effectively avoids treating an accidental touch as task completion;
Secondly, to encourage training efficiency and reflect the movement cost of the mechanical arm, a certain negative reward value is given for every additional movement step, and the closer the step count is to the maximum, the larger the penalty, ensuring that the arm achieves the target in the minimum number of steps. The coefficient -0.01 is determined according to the maximum number of steps per episode of the mechanical arm in the simulation environment and the reward value after the task is completed. The expression of this reward value is as follows:
$r_3 = -0.01 \times steps$
Finally, to increase the training speed of the mechanical arm, the reward function also sets a positive reward value for the motion of the arm. In the two horizontal dimensions the coordinate information of the object is unchanged and its coordinates are $(X_B, Y_B)$; when the end of the arm reaches the target's $X_O$ coordinate or its $Y_O$ coordinate, a positive reward should be given so that the arm learns the azimuth information of the target. The mathematical expression is as follows:
$r_4 = 0.5 \times \mathbb{1}\!\left(x_{end}\ \text{reaches}\ X_O\right) + 0.5 \times \mathbb{1}\!\left(y_{end}\ \text{reaches}\ Y_O\right)$
In the above, $r_4$ fully considers the distance between the end of the mechanical arm and the target, together with the rewards for the end reaching the target in the $X_O$ axis direction and in the $Y_O$ axis direction. Such a reward setting both ensures that the arm's end moves toward the target and guides the arm in the correct direction through the rewards on the $X_O$ and $Y_O$ axes. The value 0.5 is set according to the reward of 1 given when the target is reached, so that after reaching the target area the arm obtains rewards in the two directions whose total is consistent with the task-completion reward;
the reward function of the final deep reinforcement learning algorithm is obtained by compounding the reward functions of the layers, and the mathematical expression is as follows:
$r = r_1 + r_2 + r_3 + r_4$
In the above, $r$ represents the total reward function, and $r_1$, $r_2$, $r_3$ and $r_4$ each represent the conditional rule of a different reward layer.
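A sketch of the composite layered reward described in this section might look as follows; the per-axis tolerance `tol` and the argument names are assumptions, while the coefficients 1, 200, 0.5, 50 and -0.01 are the values stated above.

```python
def hierarchical_reward(reached_target, stay_steps, steps, end_xy, target_xy, tol=0.01):
    """Composite reward r = r1 + r2 + r3 + r4 following the layered rules described above."""
    r1 = 1.0 if reached_target else 0.0                       # task completion
    r2 = 0.5 * (200 - steps) if stay_steps > 50 else 0.0      # stable stay in the target area
    r3 = -0.01 * steps                                        # per-step motion cost
    r4 = 0.0                                                  # axis-wise guidance toward the target
    if abs(end_xy[0] - target_xy[0]) < tol:                   # end reaches the target X coordinate
        r4 += 0.5
    if abs(end_xy[1] - target_xy[1]) < tol:                   # end reaches the target Y coordinate
        r4 += 0.5
    return r1 + r2 + r3 + r4
```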
S3, experience playback strategy based on depth deterministic strategy gradient algorithm.
Specifically, in a simulation environment based on reinforcement learning, the agent is trained mainly through the strategy of acquiring the reward value of every step of action. In a simulation environment with a simple scene or a well-designed reward value, the agent can quickly learn the target and the task; but in a simulation environment with sparse reward values and high exploration difficulty, the agent has few opportunities to learn correct operations and training efficiency is low. In some extreme simulation environments the agent obtains a reward only for a correct operation in the final state, which is extremely unfavorable for training. For this problem, off-policy reinforcement learning algorithms introduce an experience playback mechanism: the training experience of every step is stored in an experience pool, and the data in the experience pool are then repeatedly sampled for training, improving the sample utilization rate;
An off-policy reinforcement learning algorithm is one whose learning process can improve a strategy other than the one currently being executed, rather than improving only on the basis of the existing strategy. In an off-policy algorithm the agent learns how to evaluate a group of different strategies and can adjust them automatically over time; unlike an on-policy algorithm, an off-policy algorithm does not need to keep to one specific strategy applied to the problem during learning. Common off-policy reinforcement learning algorithms include Q-learning and the Actor-Critic methods of deep reinforcement learning;
In the initial experience pool the stored data are treated uniformly, without weights or priorities, and playback samples are drawn at random. This equal-probability playback merely repeats training on experiences; it cannot reflect the effect of experience samples with large influence, the network training efficiency is low, and the influence on the training of the strategy network is weak. For this problem, a suitable metric is adopted to determine the weight of each experience sample, and the weight of every experience in the pool is considered, so that experiences with larger weights have a higher probability of being sampled;
The core of the PER algorithm is to define the weight of each experience. It holds that the magnitude of an experience's weight should be determined by how much the agent can learn from that experience. Although this cannot be represented directly, reinforcement learning of the value function commonly updates the value function with the TD-error, which is the prediction error of the current state with respect to the next state; its magnitude represents the difference between the current value function and the target value function. For an experience, a larger TD-error means the experience has more learning potential and its priority is higher; a smaller TD-error means the experience is more ordinary and its priority should be reduced when it is stored into the experience pool. For the simulation environment of the invention, an experience sample is defined as the five-tuple $(s_t, a_t, r_t, s_{t+1}, done)$, where $s_t$ represents the current observation state of the mechanical arm, $a_t$ represents the current action output of the mechanical arm, $r_t$ represents the current reward value, $s_{t+1}$ represents the observation state of the mechanical arm at the next moment, and $done$ represents the task completion status;
wherein, the TD-error refers to Temporal Difference error (time sequence difference error), which is an important measure in the reinforcement learning TD (Temporal Difference) algorithm, and represents the difference between the actual return and the predicted return in the current time step, and the TD algorithm gradually approximates the actual cost function by continuously updating the estimated value of the cost function. In each time step, the TD algorithm estimates the current return according to the current state and the value of the estimated value function, and then compares the current return with the actual return to calculate TD-error;
Further, the mathematical expression of TD-error is as follows:
$y_t = r_{t+1} + \gamma\,Q(s_{t+1}), \qquad \delta_t = r_{t+1} + \gamma\,Q(s_{t+1}) - Q(s_t)$
In the above, $r_{t+1}$ represents the reward value at time $t+1$, $y_t$ represents the TD-target (temporal difference target), $\delta_t$ represents the temporal difference error at time $t$, $\gamma$ represents a discount factor that balances the importance of current and future rewards, and $Q(s_t)$ represents the value function estimate of the current state.
The TD-target refers to Temporal Difference target (time sequence differential target), which is also an important concept in the reinforcement learning TD algorithm, and the TD algorithm gradually approximates to the real cost function by continuously updating the estimated value of the cost function. At each time step, the TD algorithm estimates the current return according to the current state and the value of the estimated value function, then compares the current return with the actual return, calculates the TD-error, and then uses the TD-error to update the value of the estimated value function, where TD-target is the target value required by the TD algorithm when updating the estimated value function, and is generally defined as the return of the current time step plus the discount factor multiplied by the estimated value of the value function of the next state.
S31, storing and processing data in an experience sample pool based on a SumPree binary tree and combining weights of the experience samples;
Specifically, after the weight calculation method of the experience samples is determined, the PER algorithm stores the samples in the form of a SumTree binary tree. In the SumTree, in addition to the five-tuple of the experience sample, each leaf node carries the value of its priority (the gray nodes in FIG. 5), while the other nodes store no experience samples and only keep the sum of the values of their child nodes;
Taking the PER procedure in FIG. 5 as an example: PER first samples uniformly from (0, 38). Suppose the sampled value is 25. Starting from the root node, it is compared with the left child; since 25 is greater than 15, the search enters the right child and the skipped left-subtree sum 15 is subtracted, leaving a sampling value of 10 with the node of value 23 now acting as the root. With 23 as the root, the left child is again compared with the sampling value; since 10 is greater than 5, the right child 18 is selected as the root of the next round and the sampling value becomes 5. Since 18 is a leaf node, the five-tuple of the experience sample stored in that leaf is taken as the training sample. This storage scheme ensures that experience samples with larger weights have a larger probability of being selected when the experience pool is sampled, because a larger weight corresponds to a larger selection interval: every value drawn uniformly from the experience pool's range has the same probability of being drawn, the interval represented by an experience sample is its range of selectable values, and the larger that range, the greater the probability of being selected.
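The descent described above can be sketched as a prefix-sum search over the sum tree from the earlier sketch; the function below is illustrative, and the commented numbers simply mirror the FIG. 5 walk-through (root sum 38, draw 25, leaf 18).

```python
def sample(tree, value):
    """Walk down the sum tree with a prefix-sum `value` drawn uniformly from [0, total)."""
    idx = 0
    while True:
        left, right = 2 * idx + 1, 2 * idx + 2
        if left >= len(tree.tree):          # no children: idx is a leaf
            break
        if value <= tree.tree[left]:
            idx = left                      # descend left, keep the remaining value
        else:
            value -= tree.tree[left]        # skip the left subtree's total weight
            idx = right
    data_idx = idx - (tree.capacity - 1)    # map the leaf index back to its data slot
    return tree.data[data_idx], tree.tree[idx]

# Mirroring the walk-through above: with a root sum of 38 and a draw of 25, the search skips
# a left subtree of weight 15, continues with 10, then with 5, and ends at the leaf of weight 18:
#   experience, priority = sample(sum_tree, random.uniform(0, sum_tree.total()))
```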
S32, introducing a binary tree data structure of a minimum heap, and performing replacement processing on data in an experience sample pool based on the weight of the experience sample;
Specifically, the data structure of the minimum heap is a binary tree whose ordering only requires the head node to be the minimum value, so the time complexity of insertion and deletion is O(log n), better than that of ordinary sorting. If the SumTree replaces experience samples in a first-in-first-out manner, experience samples with high weights may be replaced; the binary heap instead ensures that the experience sample with the smallest weight is replaced each time, which is more beneficial for training the network;
The PER algorithm realizes preferential playback of the experience samples with larger weights in the experience pool through the TD-error weight calculation and the SumTree sample storage scheme, but it does not limit the size of the experience pool: everything is kept, which guarantees the diversity of experience but inevitably affects memory and computational efficiency for a large number of training samples. The traditional solution is to set an upper limit on the experience pool; once the limit is reached, old experience samples must be replaced with new ones. The replacement method usually follows the First-In-First-Out (FIFO) idea, replacing samples in the order in which they entered the pool, so that the newest samples are stored and every sample is kept for the same length of time. However, the weights of the samples differ, and a FIFO algorithm may replace high-weight samples that have a strong learning effect, which is unfavorable for training: high-weight experience samples should be kept in the pool as far as possible, and low-weight samples should be replaced by the newest samples, so that the samples in the pool are of higher quality. For this problem, the invention introduces the data structure of the minimum heap, a complete binary tree in which the value of every node is not greater than the values of its child nodes, so that the root node is the minimum of the whole tree, as shown in FIG. 6;
Referring to FIG. 7 and FIG. 8, each node represents the weight of an experience sample and also stores the index of that sample, and the root node is the minimum value of the sample space. When the experience pool reaches its maximum number of samples, a new experience sample replaces the sample with the smallest weight: the minimum heap first deletes the root node, covers the root with the experience sample of the last node in the binary tree, and then adjusts recursively from top to bottom so that the whole tree again satisfies the minimum-heap property. After the adjustment the lowest-weight experience sample has been removed; the new experience sample is inserted after the last node of the binary tree and adjusted upward recursively, so that the smallest-weight sample again sits at the root node. The gray nodes are the newly inserted experience samples. Combined with the properties of the minimum heap, the replacement efficiency is higher than that of ordinary sorting methods; at the same time, high-weight experience samples are preserved during replacement, and low-weight samples are replaced by the newest experience samples.
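A minimal sketch of such a minimum-heap replacement policy, using Python's heapq module as a stand-in for the binary heap; the entry layout and the tie-breaking counter are implementation assumptions.

```python
import heapq
import itertools

class MinHeapPool:
    """Bounded experience pool that evicts the lowest-weight sample when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []                      # entries: (weight, tie_breaker, experience)
        self.counter = itertools.count()    # avoids comparing experience tuples on weight ties

    def add(self, weight, experience):
        entry = (weight, next(self.counter), experience)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)      # insert at the end and sift up
        else:
            # Pool is full: pop the root (the minimum-weight sample), insert the new
            # sample and restore the heap property, as described above.
            heapq.heapreplace(self.heap, entry)

    def min_weight(self):
        return self.heap[0][0] if self.heap else None
```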
Further, the binary-heap-based PER algorithm was compared with existing algorithms in simulation experiments under three groups of different indexes, as shown in fig. 9, 10 and 11. Under all three groups of indexes the binary-heap-based PER algorithm is clearly superior to the other two methods. When the experience pool is full, it keeps the experience samples with larger weight during replacement, so that when the deep deterministic policy gradient algorithm draws samples from the pool to update the Actor network and Critic network parameters it has a larger probability of sampling high-weight experience samples, while samples of little value to network training are replaced in the pool by new experience samples; this maximizes the utilization of high-weight experience samples. The FIFO-based PER algorithm in turn trains clearly better than the random sampling algorithm, which shows that sampling with an experience replay strategy that uses TD-error as the weight criterion is more beneficial to training of the network. In the interval of 2000-3000 episodes the average reward of the FIFO-based PER algorithm fluctuates strongly and is at times higher than that of the binary-heap-based PER algorithm; as noted in the earlier introduction of the PER algorithm principle, although the experience samples are weighted according to the TD-error criterion and stored in a SumTree binary tree, there is still a probability of sampling low-weight experience samples, so a certain experimental randomness arises, which is normal. As can be seen from the multiple groups of experiments in fig. 9, 10 and 11, the binary-heap-based prioritized experience replay strategy proposed by the invention works well with the deep deterministic policy gradient algorithm, improves the utilization of the mechanical arm training samples, and increases the training speed.
Referring to fig. 2, a depth deterministic strategy gradient based robotic arm path planning system, comprising:
the construction module is used for constructing a mechanical arm multitasking motion model, a depth deterministic strategy gradient algorithm network model, a layering rewarding function of tail end tracking of the mechanical arm and an experience sample pool;
the training module is used for training the depth deterministic strategy gradient algorithm network model based on the layering rewarding function and the weight of the experience sample, and obtaining a trained depth deterministic strategy gradient algorithm network model;
and the planning module is used for deploying the trained depth deterministic strategy gradient algorithm network model to the mechanical arm multitasking motion model and planning the path of the mechanical arm.
The content of the method embodiment is applicable to the system embodiment; the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the beneficial effects achieved are also the same as those of the method embodiment.
While the preferred embodiment of the present invention has been described in detail, the invention is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the invention, and these modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. The mechanical arm path planning method based on depth deterministic strategy gradient is characterized by comprising the following steps:
taking the three-dimensional space motion characteristics of the mechanical arm into consideration, constructing a mechanical arm multitasking motion model, wherein the mechanical arm multitasking motion comprises tail end tracking of the mechanical arm, pushing of the mechanical arm and grabbing of the mechanical arm;
based on an Actor-Critic network structure, establishing a depth deterministic strategy gradient algorithm network model;
introducing a preset rewarding rule, and constructing a layered rewarding function for the tail end tracking of the mechanical arm;
leading in a priority experience playback mechanism, accumulating an experience sample pool, and acquiring the weight of an experience sample;
training the depth deterministic strategy gradient algorithm network model based on the layering reward function and the weight of the experience sample, and obtaining a trained depth deterministic strategy gradient algorithm network model;
and deploying the trained depth deterministic strategy gradient algorithm network model to a mechanical arm multitasking motion model, and planning a path of the mechanical arm.
2. The method for planning a path of a manipulator based on a depth deterministic strategy gradient according to claim 1, wherein the step of constructing a multi-task motion model of the manipulator by considering three-dimensional spatial motion characteristics of the manipulator specifically comprises:
Drawing a three-dimensional motion model of the mechanical arm;
initializing a virtual environment, and setting a three-dimensional motion model of the mechanical arm based on a coordinate system O-XYZ in the virtual environment;
defining the motion space, observation state information and motion information of the mechanical arm, and setting the updating step of the mechanical arm.
3. The method for planning a path of a manipulator based on a depth deterministic strategy gradient according to claim 2, wherein the step of establishing a depth deterministic strategy gradient algorithm network model based on an Actor-Critic network structure specifically comprises the steps of:
the depth deterministic strategy gradient algorithm network model updates deterministic strategy parameters through the maximum accumulated rewards value and outputs the probability of the action space of the mechanical arm, the depth deterministic strategy gradient algorithm network model comprises a main network and a Target network, the main network comprises an Actor network and a Critic network, and the Target network comprises a Target Actor network and a Target Critic network;
the Actor network adopts a deterministic strategy: given an input state, it integrates the state information and outputs a determined action, and a deterministic strategy gradient function is taken as the method for updating the Actor network parameters;
The Critic network is an evaluation network and is used for calculating a Q value to evaluate the quality of a strategy adopted by the Actor network, and a state-action value function is used as a method for updating Critic network parameters;
the target network is used for calculating a target Q value and assisting in updating parameters of the main network.
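As an illustration of the Actor-Critic and target-network structure described in claim 3, a minimal PyTorch sketch might look as follows; the layer sizes, the state and action dimensions, and the soft-update rate tau are assumptions of this sketch rather than values specified by the patent.

```python
# Illustrative Actor/Critic main networks and their target copies.
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),   # deterministic action in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),                    # Q(s, a) value
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

# main networks and their target copies (dimensions are assumed for illustration)
actor, critic = Actor(10, 4), Critic(10, 4)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)

def soft_update(target, source, tau=0.005):
    # let the target network slowly track the main network so the target Q value stays stable
    with torch.no_grad():
        for t_param, param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * param)
```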
4. The mechanical arm path planning method based on depth deterministic strategy gradient according to claim 3, wherein the expression of the depth deterministic strategy gradient algorithm network model is specifically as follows:
J(μ_ω) = ∫_S ρ^μ(s) r(s, μ_ω(s)) ds

in the above formula, μ_ω represents the deterministic strategy, J(·) represents the training objective and the network is trained by maximizing J(·), r represents the expected reward obtained after executing the strategy μ_ω, s represents the state fed back by the environment, ρ^μ represents the distribution function of the state, and ∫_S(·)ds represents integration over the state space S.
5. The depth deterministic strategy gradient-based robotic arm path planning method according to claim 4, wherein the preset rewarding rule specifically comprises:
pre-launching training tasks to the mechanical arm multi-task motion model;
considering the task completion condition of the mechanical arm: a mechanical arm that completes the task is given a preset target reward value, and no reward value is sent to a mechanical arm that does not complete the task;
considering the number of steps for which the mechanical arm stays after completing the task: a mechanical arm that stays in the target area for more than the preset number of steps is sent a one-time reward equal to the completion reward minus the number of steps consumed multiplied by a preset coefficient, and no reward value is sent to a mechanical arm that stays in the target area for fewer than the preset number of steps;
considering the number of steps the mechanical arm uses to complete the task: the number of steps consumed by the mechanical arm to complete the pre-started training task is calculated, and a negative reward value equal to a preset proportional value multiplied by the number of consumed steps is sent;
considering the distance between the tail end of the mechanical arm and the target: if the distance between the tail end of the mechanical arm and the target is smaller than a set value, a preset reward value is given to the mechanical arm.
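For illustration only, the layered reward rules of claim 5 could be sketched as the following Python function; every constant used here is an assumed placeholder rather than a value given in the patent.

```python
# Illustrative layered reward sketch; all constants are assumed values.
R_DONE, R_STAY = 10.0, 5.0       # one-time rewards for finishing / staying on target
STAY_STEPS, STEP_COEF = 5, 0.01  # required stay steps and per-step penalty coefficient
STEP_RATIO, DIST_EPS, R_NEAR = 0.001, 0.05, 1.0

def layered_reward(done, stay_steps, used_steps, dist_to_target):
    reward = 0.0
    if done:                               # rule 1: task completed
        reward += R_DONE
    if stay_steps > STAY_STEPS:            # rule 2: stayed long enough in the target area
        reward += R_STAY - STEP_COEF * used_steps
    reward -= STEP_RATIO * used_steps      # rule 3: penalize the number of steps consumed
    if dist_to_target < DIST_EPS:          # rule 4: end effector close to the target
        reward += R_NEAR
    return reward
```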
6. The method for planning a path of a manipulator based on a depth deterministic strategy gradient according to claim 5, wherein the step of introducing a preferential experience playback mechanism and accumulating an experience sample pool comprises the steps of:
observing the state of the current mechanical arm, inputting the state into an Actor network to obtain a corresponding mechanical arm action output result, and storing the mechanical arm action output result into an experience sample pool;
the mechanical arm multitask motion model executes a mechanical arm action output result, updates the state of the mechanical arm to obtain the observation state of the mechanical arm at the next moment, and calculates to obtain a corresponding rewarding value;
Inputting the current state of the mechanical arm and the output result of the mechanical arm action into a Critic network to obtain a Q estimated value;
inputting the observation state of the mechanical arm at the next moment into a Target Actor network, and obtaining the action output result of the mechanical arm at the next moment;
inputting the observation state of the mechanical arm at the next moment and the action output result of the mechanical arm at the next moment into a Target Critic network to acquire a Q Target value;
performing difference calculation processing on the Q estimation value and the Q target value to obtain a TD-error value, wherein the larger the TD-error value is, the larger the potential of experience learning is represented, and the higher the priority is;
the execution steps of the main network and the target network are circulated, an experience sample pool is accumulated, the TD-error value is set as the weight of the corresponding experience sample, and the data stored in the experience sample pool take the form of the five-tuple (s_t, a_t, r_t, s_{t+1}, done), wherein s_t represents the current observation state of the mechanical arm, a_t represents the current mechanical arm action output result, r_t represents the current corresponding reward value, s_{t+1} represents the observation state of the mechanical arm at the next moment, and done represents the task completion condition.
7. The depth deterministic strategy gradient-based robotic arm path planning method according to claim 6, wherein the calculation expression of TD-error is:
δ_t = r_{t+1} + γQ(s_{t+1}) − Q(s_t)

in the above formula, r_{t+1} represents the reward value at time t+1, r_{t+1} + γQ(s_{t+1}) represents the TD-target (temporal-difference target), δ_t represents the temporal-difference error at time t, γ represents a discount factor used to balance the importance of current and future rewards, and Q(s_t) represents the value function estimate of the current state.
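A minimal sketch of how the TD-error of claims 6 and 7 could be computed from the main and target networks is shown below; the batch layout, the discount factor, and the network interfaces are assumptions of this sketch, not the patent's implementation.

```python
# Illustrative TD-error (priority weight) computation for a sampled batch.
import torch

def td_error(critic, target_actor, target_critic, batch, gamma=0.99):
    # batch: tensors (s, a, r, s_next, done) sampled from the experience pool,
    # with r and done of shape [batch_size] and done encoded as 0.0 / 1.0
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = target_actor(s_next)                       # Target Actor: next action
        q_next = target_critic(s_next, a_next).squeeze(-1)  # Target Critic: next Q value
        q_target = r + gamma * (1.0 - done) * q_next        # Q target (TD-target)
    q_est = critic(s, a).squeeze(-1)                        # Critic: Q estimate
    return q_target - q_est                                 # TD-error, used as the sample weight
```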
8. The mechanical arm path planning method based on depth deterministic strategy gradient according to claim 7, further comprising storing and processing data in an experience sample pool based on a SumTree binary tree in combination with weights of experience samples, specifically:
introducing a SumTree binary tree data structure comprising leaf nodes and root nodes, wherein the leaf nodes are used for storing the five-tuple data of the experience samples together with their corresponding TD-error values, and the parent and root nodes keep the sums of the TD-error values of the leaf nodes below them;
uniformly sampling within the SumTree binary tree data structure and presetting a sampling value;
comparing the sampling value with the left child node of the root node of the SumTree binary tree data structure; if the sampling value is larger than the value of the left child node, calculating the difference between the sampling value and the value of the left child node, taking the obtained difference as the sampling value for the next moment, and entering the right child node;
comparing the sampling value at the next moment with the left child node of the current subtree in the same manner, until a leaf node is reached, and taking the five-tuple data of the experience sample stored in that leaf node as the sampling result.
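For illustration, the SumTree storage and priority sampling walked through in claim 8 could be sketched as follows; the array-based tree layout and the names used are assumptions of this sketch rather than the patent's implementation.

```python
# Illustrative SumTree: leaves hold samples and TD-error priorities, internal nodes hold sums.
import random

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)   # internal nodes keep sums of priorities
        self.data = [None] * capacity             # leaves keep the five-tuple samples
        self.write = 0

    def add(self, priority, sample):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = sample
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                           # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def sample(self):
        v = random.uniform(0.0, self.tree[0])      # root holds the total priority
        idx = 0
        while True:
            left, right = 2 * idx + 1, 2 * idx + 2
            if left >= len(self.tree):             # reached a leaf node
                break
            if v <= self.tree[left]:               # value fits in the left subtree
                idx = left
            else:                                  # otherwise subtract and go right
                v -= self.tree[left]
                idx = right
        return self.data[idx - self.capacity + 1], self.tree[idx]
```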
9. The depth deterministic strategy gradient-based robotic arm path planning method of claim 8, further comprising introducing a binary tree data structure of a minimum heap, and performing permutation processing on data in the empirical sample pool based on weights of the empirical samples, specifically:
when the experience pool reaches the maximum sample number, a new experience sample is used for replacing the experience sample with the minimum weight;
deleting the root node through the minimum heap, covering the experience sample of the root node with the experience sample of the last node in the binary tree, and performing top-down recursion adjustment, wherein the adjustment target is to enable the whole binary tree to meet the data structure of the minimum heap again;
the new experience sample is inserted after the last node of the binary tree, and the experience sample with the smallest weight is moved to the root node by recursion upwards.
10. The mechanical arm path planning system based on the depth certainty strategy gradient is characterized by comprising the following modules:
The construction module is used for constructing a mechanical arm multitasking motion model, a depth deterministic strategy gradient algorithm network model, a layering rewarding function of tail end tracking of the mechanical arm and an experience sample pool;
the training module is used for training the depth deterministic strategy gradient algorithm network model based on the layering rewarding function and the weight of the experience sample, and obtaining a trained depth deterministic strategy gradient algorithm network model;
and the planning module is used for deploying the trained depth deterministic strategy gradient algorithm network model to the mechanical arm multitasking motion model and planning the path of the mechanical arm.
CN202310703629.1A 2023-06-14 2023-06-14 Mechanical arm path planning method and system based on depth deterministic strategy gradient Pending CN116494247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310703629.1A CN116494247A (en) 2023-06-14 2023-06-14 Mechanical arm path planning method and system based on depth deterministic strategy gradient

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310703629.1A CN116494247A (en) 2023-06-14 2023-06-14 Mechanical arm path planning method and system based on depth deterministic strategy gradient

Publications (1)

Publication Number Publication Date
CN116494247A true CN116494247A (en) 2023-07-28

Family

ID=87328593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310703629.1A Pending CN116494247A (en) 2023-06-14 2023-06-14 Mechanical arm path planning method and system based on depth deterministic strategy gradient

Country Status (1)

Country Link
CN (1) CN116494247A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151441A (en) * 2023-10-31 2023-12-01 长春工业大学 Replacement flow workshop scheduling method based on actor-critique algorithm
CN117151441B (en) * 2023-10-31 2024-01-30 长春工业大学 Replacement flow workshop scheduling method based on actor-critique algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination