CN115509233A - Robot path planning method and system based on a prioritized experience replay mechanism - Google Patents
Robot path planning method and system based on a prioritized experience replay mechanism
- Publication number
- CN115509233A (application CN202211199553.5A)
- Authority
- CN
- China
- Prior art keywords
- priority
- experience
- robot
- sample
- reward
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
- G05D1/0219—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
Abstract
The invention discloses a robot path planning method and system based on a prioritized experience replay mechanism. The method comprises the following steps: acquiring the current state and the target position of the path planning robot; inputting the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action. During training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence. The priority sequence is constructed as follows: the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward are calculated; their weights are determined using information entropy, the priority of each experience sample is computed by weighted summation, and the experience sample priority sequence is constructed.
Description
Technical Field
The invention relates to the technical field of robot path planning, and in particular to a robot path planning method and system based on a prioritized experience replay mechanism.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the deepening of research on robots and artificial intelligence, intelligent robots have become increasingly diverse and play ever more important roles in many industries. Path planning enables an intelligent robot to find a collision-free, safe path from a starting point to an end point within a specified area; it is the basis of intelligent robot motion and a current research hotspot. The process consists of sensing the surrounding environment through sensors, determining the robot's own pose, and searching for an optimal path from the current position to a designated position in the environment.
In recent years, deep reinforcement learning (DRL) has been widely applied in many fields, and path planning algorithms combined with deep reinforcement learning have become a research focus. Deep reinforcement learning does not require the robot to know the environment in advance; instead, the robot predicts the next action by perceiving the state of the surrounding environment, executes the action, and receives a reward fed back by the environment, which transfers it from the current state to the next state. This cycle repeats until the robot reaches the target point or the set maximum number of steps. DeepMind proposed the DDPG (Deep Deterministic Policy Gradient) algorithm, which adopts a deterministic policy gradient, combines the Actor-Critic framework with DQN (Deep Q-Network), and uses convolutional neural networks to approximate the policy function and the Q function; its output is a deterministic action value. DDPG solves the problem that deep reinforcement learning cannot be applied, or performs poorly, on high-dimensional or continuous action tasks, and is currently an effective path planning algorithm. However, because experience samples are insufficiently utilized, the environmental adaptability of the DDPG algorithm for robot path planning is poor, and problems such as a low success rate and slow convergence remain.
Traditional DDPG adopts a random experience replay (ER) mechanism: the experience [s_t, a_t, r_t, s_{t+1}] generated by the robot is stored in an experience pool, and experience samples are selected at random to train the neural network. Breaking the temporal correlation between experiences solves the problem that experience cannot be reused and accelerates the robot's learning process. However, ER uses a uniform random sampling strategy; it does not account for the fact that different experiences have different importance for robot learning, cannot make full use of highly important experience, and thus limits the training efficiency of the neural network.
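For reference, the conventional uniform experience replay described above can be sketched as follows (a minimal Python illustration; the class and method names are ours, not from the patent):

```python
import random
from collections import deque

class UniformReplayBuffer:
    """Conventional experience replay: store transitions, sample uniformly at random."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

    def store(self, state, action, reward, next_state):
        # One transition [s_t, a_t, r_t, s_{t+1}] as described in the text
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling: every transition is equally likely,
        # regardless of how informative it is for learning
        return random.sample(list(self.buffer), batch_size)
```

The prioritized mechanism described below replaces this uniform sampling step with priority-proportional sampling.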
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the invention provides a robot path planning method and system based on a prioritized experience replay mechanism. The invention proposes a prioritized experience replay mechanism with dynamic sample priorities, which comprehensively considers the TD-error, the Actor network loss function and the immediate reward of an experience, and sets the experience priority by a weighted summation of the three. When experience is sampled, experiences with reward greater than zero (positive experiences) are given higher priority, and the network parameters are preferentially updated using these experiences. After positive experience samples have been selected for training, their priorities are exponentially decayed in subsequent training rounds until they fall to the average value of the priority sequence. This increases the diversity of experience samples, improves their utilization rate, and alleviates the low success rate and slow convergence of DDPG-based path planning.
In a first aspect, the invention provides a robot path planning method based on a prioritized experience replay mechanism.
The robot path planning method based on the prioritized experience replay mechanism comprises the following steps:
acquiring the current state and the target position of the path planning robot; inputting the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the construction process of the experience sample priority sequence is as follows:
calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is greater than zero; if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; sampling experiences in order of priority from high to low, and updating the network parameters accordingly.
Further, the method further comprises:
after an experience has been selected to participate in training, its priority is decayed during the next training round; whether the average of all decayed priorities is smaller than a set threshold is judged; if so, the priority of the sample is increased, and if not, the decay continues until the average value of the priority sequence is reached.
In a second aspect, the invention provides a robot path planning system based on a prioritized experience replay mechanism.
The robot path planning system based on the prioritized experience replay mechanism comprises:
an acquisition module configured to: acquiring the current state and the target position of the path planning robot;
a path planning module configured to: input the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
wherein, during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the experience sample priority sequence is constructed as follows: calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is greater than zero; if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; sampling experiences in order of priority from high to low, and updating the network parameters accordingly.
In a third aspect, the present invention further provides an electronic device, including:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present invention also provides a storage medium storing non-transitory computer readable instructions, wherein the non-transitory computer readable instructions, when executed by a computer, perform the method of the first aspect.
In a fifth aspect, the invention also provides a computer program product comprising a computer program for implementing the method of the first aspect when run on one or more processors.
Compared with the prior art, the invention has the beneficial effects that:
aiming at the problems of single sample and low sample utilization rate of an experience playback mechanism based on a DDPG algorithm, the invention provides a priority experience playback mechanism of a dynamic sample priority sequence, which integrates a TD-error function, an Actor network loss function and an immediate reward, determines the weights of the TD-error function, the Actor network loss function and the Actor network loss function by using an information entropy, calculates the experience sample priority by weighted summation, and constructs the experience sample priority sequence; giving higher priority to positive experience, so that the positive experience can sample preferentially and network convergence is accelerated; considering the diversity of experience samples fully, when the positive experience is trained, the priority of each round is decayed exponentially in turn until the average value of the priority sequence is reached.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flowchart illustrating a process of adjusting the priority of empirical samples according to a first embodiment of the present invention;
fig. 2 is a flowchart of a preferred experience playback according to a first embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data in the embodiments are obtained and used lawfully, in compliance with laws and regulations and with user consent.
Interpretation of terms:
planning a robot path: and the intelligent robot finds a collision-free safe path from the starting point to the end point in the designated area.
Deep Reinforcement Learning (DRL): the method combines the perception capability of deep learning and the decision-making capability of reinforcement learning, and is an artificial intelligence method closer to a human thinking mode.
DDPG: algorithm for robot path planning
Empirical playback (ER): a technique for stabilizing the probability distribution of experience, which is part of the robot path planning training process, improves the stability of the training.
Preferential empirical Playback (PER): the method is improved on the basis of empirical playback, and non-uniform samples are used for replacing uniform samples of the empirical playback to extract the experience.
Information entropy: describing the uncertainty of each possible occurrence of the information.
DeepMind proposed the concept of prioritized experience replay (PER), in which the priority of an experience is measured by the absolute value of the time difference error (TD-error). The larger the TD-error, the more important the experience is for robot learning; conversely, the smaller the TD-error, the less important the experience. This lets the robot concentrate on highly important experience, further improving experience utilization and accelerating the robot's learning process. However, PER usually ignores the effect of the immediate reward and of experiences with small TD-error, leading to the problem of sample homogeneity.
Example one
The embodiment provides a robot path planning method based on a prioritized experience replay mechanism.
As shown in fig. 1, a robot path planning method based on a prioritized experience replay mechanism includes:
S101: acquiring the current state and the target position of the path planning robot;
S102: inputting the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the experience sample priority sequence is constructed as follows:
calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is greater than zero; if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; sampling experiences in order of priority from high to low, and updating the network parameters accordingly.
Further, the method further comprises:
after an experience has been selected to participate in training, its priority is decayed during the next training round; whether the average of all decayed priorities is smaller than a set threshold is judged; if so, the priority of the sample is increased, and if not, the decay continues until the average value of the priority sequence is reached.
Further, the network structure of the trained deep deterministic policy gradient network comprises:
an Actor module, an experience pool and a Critic module connected in sequence;
wherein the Actor module includes a current Actor network and a target Actor network connected in sequence;
wherein the Critic module includes a current Critic network and a target Critic network connected in sequence;
and the current Actor network and the current Critic network are connected with each other, as sketched below.
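A minimal sketch of this four-network bundle, with each network represented simply as a callable (the class and field names are illustrative, not from the patent; concrete architectures are not specified here):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DDPGNetworks:
    """The four networks of the deep deterministic policy gradient model."""
    actor: Callable          # current Actor network: state -> action
    target_actor: Callable   # target Actor network: state -> estimated action a'
    critic: Callable         # current Critic network: (state, action) -> Q value
    target_critic: Callable  # target Critic network: (state, action) -> Q' value
```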
Further, during training of the deep deterministic policy gradient network, the robot interacts with the environment as follows:
At each moment t, the current Actor network of the robot obtains an action a_t from the environment state s_t and applies it to the environment, obtaining an immediate reward r_t and the next environment state s_{t+1}; the current Critic network obtains the Q value Q(s_t, a_t) from the environment state s_t and the action a_t, and evaluates the action a_t.
The i-th experience [s_i, a_i, r_i, s_{i+1}] is sampled from the experience pool; the current Actor network adjusts the action policy according to the Q value Q(s_t, a_t), and the loss term of the current Actor network is ∇_aQ(s_i, a_i|θ^Q), where Q denotes the Q value produced by the current Critic network, s_i the state sampled from the experience pool, a_i the action sampled from the experience pool, θ^Q the parameters of the current Actor network, and Q(s_t, a_t) the value of state s_t and action a_t.
The target Actor network obtains an estimated action a' from the next environment state s_{t+1}.
The target Critic network obtains the Q' value Q'(s_{t+1}, a') from the next environment state s_{t+1} and the estimated action a'; Q'(s_{t+1}, a') denotes the value of state s_{t+1} and action a'.
The time difference error TD-error is obtained by calculating the difference between the Q value and the Q' value.
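A minimal numeric sketch of this computation, using the network bundle above; the discounted form δ = r_t + γ·Q'(s_{t+1}, a') − Q(s_t, a_t) is the standard DDPG TD-error and is assumed here, since the text only states that TD-error is the difference between the Q value and the Q' value:

```python
def td_error(transition, nets, gamma=0.99):
    """TD-error of one sampled transition [s_i, a_i, r_i, s_{i+1}].

    nets is a DDPGNetworks bundle; gamma is the (assumed) discount factor.
    """
    s, a, r, s_next = transition
    q = nets.critic(s, a)                        # Q(s_t, a_t) from the current Critic
    a_next = nets.target_actor(s_next)           # estimated action a' from the target Actor
    q_next = nets.target_critic(s_next, a_next)  # Q'(s_{t+1}, a') from the target Critic
    return r + gamma * q_next - q                # time difference error δ
```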
It should be understood that information entropy measures the uncertainty of discrete random events, i.e. the amount of information needed to remove the uncertainty of an event. The more information is required to remove the uncertainty of an event, the higher the information entropy, and vice versa.
The information entropy H(X) is calculated as shown in formula (1):
H(X) = -Σ_i p_i log₂(p_i)   (1)
where X denotes an unknown event and p_i denotes the probability of the i-th possible outcome of the unknown event.
The influence of the immediate reward, the TD-error and the Actor network loss function on the robot training process is considered comprehensively, and all three are introduced into the construction of the experience sample priority sequence. Simply adding the three together, however, cannot reflect how strongly each of them influences training: one factor may carry too much weight and the calculated experience sample priority would then be inaccurate. To eliminate this uncertainty among the three kinds of information, information entropy is introduced to calculate their weight factors.
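A small sketch of formula (1), used below to derive the weight factors; the helper names are ours:

```python
import math

def information_entropy(probs):
    """H(X) = -sum_i p_i * log2(p_i) over a discrete distribution (formula (1))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def binary_entropy(p):
    """Binary case used in the patent: an event occurs with probability p or not, with 1 - p."""
    return information_entropy([p, 1.0 - p])
```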
Further, calculating the priority of the time difference error, the priority of the current Actor network loss function, and the priority of the immediate reward specifically comprises:
p_im-reward = |r_i|   (2)
p_TD-error = |δ_i|   (3)
p_Actor-loss = |∇_aQ(s_i, a_i|θ^Q)|   (4)
where ε denotes a minimum constant, taking the value 0.05; r_i denotes the immediate reward; δ_i denotes the TD-error; ∇_aQ(s_i, a_i|θ^Q) denotes the Actor network loss term; Q denotes the Q value produced by the current Critic network; s_i denotes the state sampled from the experience pool; a_i denotes the action sampled from the experience pool; θ^Q denotes the parameters of the current Actor network; p_im-reward denotes the priority of the immediate reward; p_TD-error denotes the priority of the TD-error; and p_Actor-loss denotes the priority of the current Actor network loss function.
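A minimal sketch of formulas (2)-(4); how the TD-error and the Actor gradient term are obtained from the networks is outside this snippet:

```python
def priority_components(reward, td_err, actor_grad):
    """Per-sample priority components from formulas (2)-(4).

    reward is the immediate reward r_i, td_err the TD-error delta_i, and
    actor_grad the Actor loss term grad_a Q(s_i, a_i | theta^Q).
    """
    p_im_reward = abs(reward)       # formula (2)
    p_td_error = abs(td_err)        # formula (3)
    p_actor_loss = abs(actor_grad)  # formula (4)
    return p_im_reward, p_td_error, p_actor_loss
```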
Further, determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence specifically comprises:
H(TD) = -p_TD+ log₂(p_TD+) - p_TD- log₂(p_TD-)   (5)
H(TA) = -p_TA+ log₂(p_TA+) - p_TA- log₂(p_TA-)   (6)
H(r_i) = -p_reward>Ravg log₂(p_reward>Ravg) - p_reward<Ravg log₂(p_reward<Ravg)   (7)
During one training pass of the network model, if the reward obtained by the robot is greater than 0, the training is called positive training and the resulting experience sample is called a positive experience sample; otherwise the training is called negative training. In formulas (5) to (7), H(TD) denotes the TD-error information entropy, H(TA) denotes the information entropy of the current Actor network loss function, and H(r_i) denotes the immediate-reward information entropy; p_TD+ is the probability of the TD-error in positive-experience training and p_TD- the probability of the TD-error in negative-experience training; p_TA+ is the probability of the Actor network loss function in positive-experience training and p_TA- the probability of the Actor network loss function in negative training; p_reward>Ravg is the probability that the immediate reward is greater than the reward average, and p_reward<Ravg the probability that the immediate reward is less than the reward average.
The information entropies are used to determine the weight a of the immediate-reward priority, the weight β of the time-difference-error priority and the weight υ of the Actor network loss function priority, according to formulas (8) and (9) and
υ = 1 - a - β   (10)
In formulas (8) to (10), H(TD) denotes the TD-error information entropy, H(TA) denotes the information entropy of the current Actor network loss function, and H(r_i) denotes the immediate-reward information entropy.
Finally, according to the calculated weight coefficients and p_im-reward, p_TD-error and p_Actor-loss, the priority of each experience sample is calculated using formula (11), where ε denotes a minimum constant:
p_i = (a × p_im-reward + β × p_TD-error + υ × p_Actor-loss) + ε   (11)
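A sketch of formulas (10) and (11); since formulas (8) and (9) for the entropy-derived weights a and β are not reproduced above, the sketch simply takes them as inputs:

```python
def composite_priority(p_im_reward, p_td_error, p_actor_loss, a, beta, eps=0.05):
    """Weighted-sum priority of one experience sample, formula (11).

    a and beta are the entropy-derived weights of the immediate-reward and
    TD-error priorities; eps is the minimum constant (0.05 in the text).
    """
    upsilon = 1.0 - a - beta                 # formula (10)
    return (a * p_im_reward
            + beta * p_td_error
            + upsilon * p_actor_loss) + eps  # formula (11)
```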
Further, as shown in fig. 2, during experience sampling it is judged whether the reward is greater than zero; if so, the priority of the experience sample is increased, and if not, the priority of the experience sample is kept unchanged; experiences are then sampled in order of priority from high to low, and the network parameters are updated accordingly. Specifically:
During experience sampling, whether the reward is greater than zero is judged; if so, the experience sample is given a higher priority. For this adjustment, a parameter is set at the beginning of each training round as the priority weight of positive experiences (experiences with reward greater than zero), and the priority of a positive experience sample is adjusted on the basis of its experience sample priority p_i, yielding the adjusted priority p_i'.
The priority of an experience sample whose reward is less than or equal to zero is not adjusted and remains p_i.
The sampling probability P_i is calculated according to the priorities of the experience samples, and experience samples are drawn from the experience pool for training according to the sampling probability P_i.
The sampling probability is calculated as shown in formula (12):
P_i = p_i^α / Σ_{k=1}^{n} p_k^α   (12)
where α is a constant in the range [0, 1]; α = 0 corresponds to uniform sampling, and n is the total number of experience samples.
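A sketch of priority-proportional sampling in the spirit of formula (12) (the exact expression is reconstructed from the surrounding description as the standard prioritized-replay form; helper names are ours):

```python
import random

def sampling_probabilities(priorities, alpha=0.6):
    """P_i = p_i^alpha / sum_k p_k^alpha; alpha = 0 reduces to uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_indices(priorities, batch_size, alpha=0.6):
    """Draw batch_size indices according to the priority-proportional probabilities."""
    probs = sampling_probabilities(priorities, alpha)
    # random.choices samples with replacement using the given weights
    return random.choices(range(len(priorities)), weights=probs, k=batch_size)
```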
It should be understood that, in the experience pool, an experience sample with a large absolute TD-error generally means that the Q value of the current Critic network differs greatly from that of the target Critic network, and therefore has high learning potential. Replaying such experience samples preferentially can bring a large improvement in the robot's learning ability.
However, if only the TD-error is considered and the importance of the reward in the training process is neglected, edge experiences are easily over-used, leading to network overfitting.
Positive experiences, such as successful episodes or experiences with high reward, are likewise among the most important experiences for the robot to learn from; sampling them more often accelerates the convergence of the algorithm and effectively mitigates overfitting. Therefore, positive experience samples are given a higher replay priority.
Further, after an experience has been selected to participate in training, its priority is decayed in the next training round; whether the average of all decayed priorities is smaller than a set threshold is judged; if so, the priority of the sample is increased, and if not, the decay continues until the average value of the priority sequence is reached. Specifically:
Assume a positive experience sample j with priority p_j, and let the priority sequence of the sampled batch be p = (p_1, p_2, ···, p_j, ···, p_n).
Then, in the next round, the priority p_j decays exponentially according to the decay factor σ.
When the priority of the experience sample is greater than the set threshold Z, the decay is allowed; otherwise the decay stops.
The threshold Z is calculated as the average value of the current priority sequence.
In order to quickly learn the latest experience from the robot's interaction with the environment within a finite-capacity experience pool, the priorities of experience samples need to be updated in time.
The method provided by the invention is a DDPG algorithm based on priority-sequence sampling. It makes full use of the Actor-Critic framework of DDPG, combines the Actor network loss function, the TD-error and the immediate reward, determines the weighting coefficients from the information entropy, and constructs the experience sample priority sequence. The immediate reward of the robot is used to classify experience: samples with immediate reward greater than zero are regarded as positive experience and the rest as negative experience. Positive experience is given higher priority so that it is sampled more frequently, accelerating the training of the DDPG algorithm. Meanwhile, to preserve the diversity of experience samples, low-priority samples can still be sampled, and after a positive experience has been sampled its priority is exponentially decayed in the next training round until it is less than or equal to the set threshold. The specific flow of the algorithm is shown in fig. 2.
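Putting the pieces together, one round of the priority-sequence sampling scheme might look as follows; this composes the helper sketches above, and the multiplicative `boost` stands in for the positive-experience weight parameter whose symbol and exact adjustment rule are not reproduced in the text:

```python
def training_round(buffer, priorities, batch_size=64, boost=1.5, alpha=0.6, sigma=0.9):
    """One illustrative round: boost positive experiences, sample, train, decay.

    buffer holds transitions (state, action, reward, next_state); priorities is
    the matching experience sample priority sequence.
    """
    # 1. Give experiences with reward > 0 a higher priority before sampling.
    adjusted = [boost * p if buffer[i][2] > 0 else p for i, p in enumerate(priorities)]
    # 2. Sample a batch according to the priority-proportional probabilities.
    idx = sample_indices(adjusted, batch_size, alpha)
    batch = [buffer[i] for i in idx]
    # 3. The DDPG Actor/Critic update on `batch` would run here, after which the
    #    priorities of the sampled transitions are recomputed via formula (11).
    # 4. Decay the priorities of the positive samples that were just trained.
    positive = [i for i in idx if buffer[i][2] > 0]
    decay_priorities(priorities, positive, sigma)
    return batch
```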
The present invention performed experiments to evaluate the DDPG algorithm based on priority-sequence sampling. Simulation experiments were carried out with the Gazebo physics simulator, implemented in Python with the PyTorch framework, and messages were exchanged via the ROS operating system. An important indicator of algorithm performance is the reward the robot obtains from interacting with the environment in one training round under the algorithm: if the obtained reward value is high and relatively stable, the algorithm is considered to perform well in that environment.
The robot was trained in four different environments, the first without obstacles and the remaining three with obstacles of different sizes and arrangements. The training results show that in the obstacle-free environment the PER method is not sufficiently stable and its average reward value is low, floating mainly in the range 0-0.5, whereas the proposed method converges faster, stabilizes after about 50 rounds, and achieves a higher average reward value, floating mainly in the range 0.75-1.75.
In the environments with obstacles, the training curve of the PER method oscillates strongly and its robustness is poor, with the average reward value floating mainly between -1.0 and 1.0; the average reward value of the proposed method rises steadily and floats mainly in the range 0.5-1.0, achieving a good result.
To verify the success rate and effectiveness of the proposed method for robot path planning in unknown environments, the trained robot was subjected to 200 task tests in a new simulation environment. The results show that, in terms of success rate, the PER method achieves 86%, while the proposed algorithm reaches 90.5%, which is clearly higher; in terms of time consumption, the proposed algorithm is 13% faster than the PER algorithm, so the robot reaches the target point more quickly. The proposed algorithm also exhibits a lower trial-and-error rate and fewer collisions with obstacles during path planning.
The experimental results show that in the robot path planning task the average reward value of the PER method fluctuates strongly and its training process is unstable: PER evaluates the priority of an experience sample with the single TD-error indicator and ignores the effect of the immediate reward on network training, so edge experience samples are over-used and the training easily falls into local optima. The invention constructs a composite priority that comprehensively considers the immediate reward, the TD-error and the feedback of the Actor network, determines dynamic weighting coefficients using information entropy, and assigns reasonable priorities to experience samples, so that the training process stabilizes step by step and the average reward value is higher than that of PER; this shows that the dynamically adjusted priority satisfies the requirement for experience sample diversity during robot training. On this basis, a priority-sequence sampling mechanism is superimposed: positive and negative experience samples are distinguished, the priority of positive experience samples is adjusted, and after sampling and training their priorities are decayed, so that the robot can be trained quickly with a finite experience pool. Higher average reward values were obtained in comparison experiments over different numbers of training rounds, demonstrating the effectiveness of the improved algorithm.
Example two
The embodiment provides a robot path planning system based on a prioritized experience replay mechanism.
The robot path planning system based on the prioritized experience replay mechanism comprises:
an acquisition module configured to: acquiring the current state and the target position of the path planning robot;
a path planning module configured to: input the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
wherein, during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the experience sample priority sequence is constructed as follows: calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is greater than zero; if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; sampling experiences in order of priority from high to low, and updating the network parameters accordingly.
It should be noted here that the acquiring module and the path planning module correspond to steps S101 to S102 in the first embodiment, and the modules are the same as the examples and application scenarios realized by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software.
The method in the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may reside in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A robot path planning method based on a prioritized experience replay mechanism, characterized by comprising the following steps:
acquiring the current state and the target position of the path planning robot; inputting the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
wherein, during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the construction process of the experience sample priority sequence is as follows:
calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is larger than zero, if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; and sampling the experience according to the sequence of the priority from high to low, and further updating the network parameters.
2. The robot path planning method based on the prioritized experience replay mechanism according to claim 1, wherein, after the experiences are sampled in order of priority from high to low and the network parameters are updated, the method further comprises:
after the experience is selected to participate in the training, in the next round of training process, the priorities of the experience already participating in the training are attenuated, whether the average value of all attenuated priorities is smaller than a set threshold value or not is judged, if yes, the priority of the test sample is increased, and if not, the priority attenuation is continued until the average value of the priority sequence is reduced.
3. The method according to claim 1, characterized in that, during training, the robot interacts with the environment as follows:
at each moment t, the current Actor network of the robot obtains an action a_t from the environment state s_t and applies it to the environment, obtaining an immediate reward r_t and the next environment state s_{t+1}; the current Critic network obtains the Q value Q(s_t, a_t) from the environment state s_t and the action a_t, and evaluates the action a_t;
the i-th experience [s_i, a_i, r_i, s_{i+1}] is sampled from the experience pool; the current Actor network adjusts the action policy according to the Q value Q(s_t, a_t), and the loss function of the current Actor network is ∇_aQ(s_i, a_i|θ^Q), where Q denotes the Q value produced by the current Critic network, s_i the state sampled from the experience pool, a_i the action sampled from the experience pool, θ^Q the parameters of the current Actor network, and Q(s_t, a_t) the value of state s_t and action a_t;
the target Actor network obtains an estimated action a' from the next environment state s_{t+1};
the target Critic network obtains the Q' value Q'(s_{t+1}, a') from the next environment state s_{t+1} and the estimated action a'; Q'(s_{t+1}, a') denotes the value of state s_{t+1} and action a';
and the time difference error TD-error is obtained by calculating the difference between the Q value and the Q' value.
4. The method according to claim 1, wherein calculating the priority of the time difference error, the priority of the current Actor network loss function, and the priority of the immediate reward specifically comprises:
p_im-reward = |r_i|   (2)
p_TD-error = |δ_i|   (3)
p_Actor-loss = |∇_aQ(s_i, a_i|θ^Q)|   (4)
wherein r_i denotes the immediate reward, δ_i denotes the TD-error, ∇_aQ(s_i, a_i|θ^Q) denotes the Actor network loss function, Q denotes the Q value produced by the current Critic network, s_i denotes the state sampled from the experience pool, a_i denotes the action sampled from the experience pool, θ^Q denotes the parameters of the current Actor network, p_im-reward denotes the priority of the immediate reward, p_TD-error denotes the priority of the TD-error, and p_Actor-loss denotes the priority of the current Actor network loss function.
5. The method according to claim 4, wherein determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence specifically comprises:
H(TD) = -p_TD+ log₂(p_TD+) - p_TD- log₂(p_TD-)   (5)
H(TA) = -p_TA+ log₂(p_TA+) - p_TA- log₂(p_TA-)   (6)
H(r_i) = -p_reward>Ravg log₂(p_reward>Ravg) - p_reward<Ravg log₂(p_reward<Ravg)   (7)
wherein, during one training pass of the network model, if the reward obtained by the robot is greater than 0 the training is called positive training and the resulting experience sample is called a positive experience sample, and otherwise the training is called negative training; in formulas (5) to (7), H(TD) denotes the TD-error information entropy, H(TA) denotes the information entropy of the current Actor network loss function, H(r_i) denotes the immediate-reward information entropy, p_TD+ is the probability of the TD-error in positive-experience training, p_TD- is the probability of the TD-error in negative-experience training, p_TA+ is the probability of the Actor network loss function in positive-experience training, p_TA- is the probability of the Actor network loss function in negative training, p_reward>Ravg is the probability that the immediate reward is greater than the reward average, and p_reward<Ravg is the probability that the immediate reward is less than the reward average;
the information entropies are used to calculate the weight a of the immediate-reward priority, the weight β of the time-difference-error priority and the weight υ of the Actor network loss function priority, according to formulas (8) and (9) and
υ = 1 - a - β   (10)
wherein, in formulas (8) to (10), H(TD) denotes the TD-error information entropy, H(TA) denotes the information entropy of the current Actor network loss function, and H(r_i) denotes the immediate-reward information entropy;
finally, according to the calculated weight coefficients and p_im-reward, p_TD-error and p_Actor-loss, the priority of each experience sample is calculated using formula (11), where ∈ denotes a minimum constant:
p_i = (a × p_im-reward + β × p_TD-error + υ × p_Actor-loss) + ∈   (11).
6. The method according to claim 1, characterized in that, during experience sampling, it is judged whether the reward is greater than zero; if so, the priority of the experience sample is increased, and if not, the priority of the experience sample is kept unchanged; experiences are sampled in order of priority from high to low, and the network parameters are updated accordingly; specifically comprising:
during experience sampling, whether the reward is greater than zero is judged; if so, the experience sample is given a higher priority, a parameter being set at the beginning of each training round as the priority weight of experiences with reward greater than zero, and the priority of such an experience sample is adjusted on the basis of its experience sample priority p_i to obtain the adjusted priority p_i';
the priority of an experience sample whose reward is less than or equal to zero is not adjusted and remains p_i;
the sampling probability P_i is calculated according to the priorities of the experience samples, and experience samples are sampled from the experience pool for training according to the sampling probability P_i;
the sampling probability is calculated as shown in formula (12):
P_i = p_i^α / Σ_{k=1}^{n} p_k^α   (12)
wherein α is a constant in the range [0, 1], α = 0 corresponds to uniform sampling, and n is the total number of experience samples.
7. The method according to claim 2, characterized in that, after an experience has been selected to participate in training, its priority is decayed in the next training round; whether the average of all decayed priorities is smaller than a set threshold is judged; if so, the priority of the sample is increased, and if not, the decay continues until the average value of the priority sequence is reached; specifically comprising:
assume a positive experience sample j with priority p_j, and let the priority sequence of the sampled batch be p = (p_1, p_2, …, p_j, …, p_n);
then, in the next round, the priority p_j decays exponentially according to the decay factor σ;
when the priority of the experience sample is greater than the set threshold Z, the decay is allowed; otherwise the decay stops;
the threshold Z is calculated as the average value of the priority sequence.
8. A robot path planning system based on a prioritized experience replay mechanism, comprising:
an acquisition module configured to: acquiring the current state and the target position of the path planning robot;
a path planning module configured to: input the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
wherein, during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the experience sample priority sequence is constructed as follows: calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is greater than zero; if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; and sampling experiences in order of priority from high to low, and updating the network parameters accordingly.
9. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of any of claims 1-7.
10. A storage medium storing non-transitory computer-readable instructions, wherein the non-transitory computer-readable instructions, when executed by a computer, perform the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211199553.5A CN115509233B (en) | 2022-09-29 | 2022-09-29 | Robot path planning method and system based on priority experience playback mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211199553.5A CN115509233B (en) | 2022-09-29 | 2022-09-29 | Robot path planning method and system based on priority experience playback mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115509233A true CN115509233A (en) | 2022-12-23 |
CN115509233B CN115509233B (en) | 2024-09-06 |
Family
ID=84508874
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211199553.5A Active CN115509233B (en) | 2022-09-29 | 2022-09-29 | Robot path planning method and system based on priority experience playback mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115509233B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022083029A1 (en) * | 2020-10-19 | 2022-04-28 | 深圳大学 | Decision-making method based on deep reinforcement learning |
CN112734014A (en) * | 2021-01-12 | 2021-04-30 | 山东大学 | Experience playback sampling reinforcement learning method and system based on confidence upper bound thought |
CN113503885A (en) * | 2021-04-30 | 2021-10-15 | 山东师范大学 | Robot path navigation method and system based on sampling optimization DDPG algorithm |
Non-Patent Citations (2)
Title |
---|
WANG PENG et al.: "Prioritized experience replay in DDPG via multi-dimensional transition priorities calculation", Research Square, 14 November 2022 (2022-11-14) *
XU Nuo; YANG Zhenwei: "Multi-agent cooperation based on the MADDPG algorithm under sparse rewards", Modern Computer, no. 15, 25 May 2020 (2020-05-25) *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116523154A (en) * | 2023-03-22 | 2023-08-01 | 中国科学院西北生态环境资源研究院 | Model training method, route planning method and related devices |
CN116523154B (en) * | 2023-03-22 | 2024-03-29 | 中国科学院西北生态环境资源研究院 | Model training method, route planning method and related devices |
CN118508817A (en) * | 2024-07-18 | 2024-08-16 | 闽西职业技术学院 | Motor self-adaptive control method and system based on deep reinforcement learning |
CN118670400A (en) * | 2024-08-22 | 2024-09-20 | 西南交通大学 | Multi-agent route planning method and device based on artificial potential field and PPO |
Also Published As
Publication number | Publication date |
---|---|
CN115509233B (en) | 2024-09-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115509233A (en) | Robot path planning method and system based on prior experience playback mechanism | |
CN112052456B (en) | Multi-agent-based deep reinforcement learning strategy optimization defense method | |
CN111260027B (en) | Intelligent agent automatic decision-making method based on reinforcement learning | |
CN108319132B (en) | Decision-making system and method for unmanned aerial vehicle air countermeasure | |
CN112884131A (en) | Deep reinforcement learning strategy optimization defense method and device based on simulation learning | |
CN112884130A (en) | SeqGAN-based deep reinforcement learning data enhanced defense method and device | |
CN109242099B (en) | Training method and device of reinforcement learning network, training equipment and storage medium | |
CN113298252B (en) | Deep reinforcement learning-oriented strategy anomaly detection method and device | |
CN108830376B (en) | Multivalent value network deep reinforcement learning method for time-sensitive environment | |
CN110335466B (en) | Traffic flow prediction method and apparatus | |
CN111753300B (en) | Method and device for detecting and defending abnormal data for reinforcement learning | |
CN111416797A (en) | Intrusion detection method for optimizing regularization extreme learning machine by improving longicorn herd algorithm | |
CN109800517B (en) | Improved reverse modeling method for magnetorheological damper | |
CN105424043B (en) | It is a kind of based on judging motor-driven estimation method of motion state | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium | |
CN117008620A (en) | Unmanned self-adaptive path planning method, system, equipment and medium | |
CN117787384A (en) | Reinforced learning model training method for unmanned aerial vehicle air combat decision | |
CN118193978A (en) | Automobile roadblock avoiding method based on DQN deep reinforcement learning algorithm | |
CN110302539B (en) | Game strategy calculation method, device and system and readable storage medium | |
CN115542901B (en) | Deformable robot obstacle avoidance method based on near-end strategy training | |
CN116524316A (en) | Scene graph skeleton construction method under reinforcement learning framework | |
CN111144243A (en) | Household pattern recognition method and device based on counterstudy | |
CN113313236B (en) | Deep reinforcement learning model poisoning detection method and device based on time sequence neural pathway | |
Wang et al. | Reinforcement Learning using Reward Expectations in Scenarios with Aleatoric Uncertainties | |
CN116755046B (en) | Multifunctional radar interference decision-making method based on imperfect expert strategy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |