CN115509233A - Robot path planning method and system based on a prioritized experience replay mechanism - Google Patents
Robot path planning method and system based on a prioritized experience replay mechanism
- Publication number
- CN115509233A (application CN202211199553.5A)
- Authority
- CN
- China
- Prior art keywords
- priority
- experience
- robot
- sample
- reward
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
- G05D1/0219—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
Abstract
The invention discloses a robot path planning method and system based on a prioritized experience replay mechanism. The method comprises the following steps: acquiring the current state and the target position of the path planning robot; inputting the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action. During training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence. The priority sequence is constructed as follows: the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward are calculated; their weights are determined using information entropy, the priority of each experience sample is computed by weighted summation, and the experience sample priority sequence is constructed.
Description
Technical Field
The invention relates to the technical field of robot path planning, and in particular to a robot path planning method and system based on a prioritized experience replay mechanism.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the deepening of research on robots and artificial intelligence, intelligent robots have become increasingly diverse and play ever more important roles in many industries. Path planning enables an intelligent robot to find a collision-free, safe path from a starting point to an end point within a specified area; it is the basis of intelligent robot motion and a current research hotspot. The process consists of sensing the surrounding environment through sensors, determining the robot's own pose, and searching for an optimal path from the current position to a designated position in the environment.
In recent years, deep reinforcement learning (DRL) has been widely applied in many fields, and path planning algorithms combined with deep reinforcement learning have become a research focus. Deep reinforcement learning does not require the robot to know the environment in advance; instead, the robot predicts the next action by perceiving the state of the surrounding environment, executes the action, and receives a reward fed back by the environment, which transfers it from the current state to the next state. This cycle repeats until the robot reaches the target point or the set maximum number of steps. DeepMind proposed the DDPG (Deep Deterministic Policy Gradient) algorithm, which adopts a deterministic policy gradient, combines the Actor-Critic framework with DQN (Deep Q-Network), and uses convolutional neural networks to approximate the policy function and the Q function; its output is a deterministic action value. DDPG solves the problem that deep reinforcement learning cannot be applied, or performs poorly, on high-dimensional or continuous action tasks, and is currently an effective path planning algorithm. However, because experience samples are insufficiently utilized, the environmental adaptability of the DDPG algorithm for robot path planning is poor, and problems such as a low success rate and slow convergence remain.
Traditional DDPG adopts a random experience replay (ER) mechanism: the experience [s_t, a_t, r_t, s_{t+1}] generated by the robot is stored in an experience pool, and experience samples are selected at random to train the neural network. Breaking the temporal correlation between experiences solves the problem that experience cannot be reused and accelerates the robot's learning process. However, ER uses a uniform random sampling strategy; it does not account for the fact that different experiences have different importance for robot learning, cannot make full use of highly important experience, and thus limits the training efficiency of the neural network.
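For reference, the conventional uniform experience replay described above can be sketched as follows (a minimal Python illustration; the class and method names are ours, not from the patent):

```python
import random
from collections import deque

class UniformReplayBuffer:
    """Conventional experience replay: store transitions, sample uniformly at random."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

    def store(self, state, action, reward, next_state):
        # One transition [s_t, a_t, r_t, s_{t+1}] as described in the text
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling: every transition is equally likely,
        # regardless of how informative it is for learning
        return random.sample(list(self.buffer), batch_size)
```

The prioritized mechanism described below replaces this uniform sampling step with priority-proportional sampling.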
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the invention provides a robot path planning method and system based on a prioritized experience replay mechanism. The invention proposes a prioritized experience replay mechanism with dynamic sample priorities, which comprehensively considers the TD-error, the Actor network loss function and the immediate reward of an experience, and sets the experience priority by a weighted summation of the three. When experience is sampled, experiences with reward greater than zero (positive experiences) are given higher priority, and the network parameters are preferentially updated using these experiences. After positive experience samples have been selected for training, their priorities are exponentially decayed in subsequent training rounds until they fall to the average value of the priority sequence. This increases the diversity of experience samples, improves their utilization rate, and alleviates the low success rate and slow convergence of DDPG-based path planning.
In a first aspect, the invention provides a robot path planning method based on a prioritized experience replay mechanism.
The robot path planning method based on the prioritized experience replay mechanism comprises the following steps:
acquiring the current state and the target position of the path planning robot; inputting the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the construction process of the experience sample priority sequence is as follows:
calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is greater than zero; if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; sampling experiences in order of priority from high to low, and updating the network parameters accordingly.
Further, the method further comprises:
after an experience has been selected to participate in training, its priority is decayed during the next training round; whether the average of all decayed priorities is smaller than a set threshold is judged; if so, the priority of the sample is increased, and if not, the decay continues until the average value of the priority sequence is reached.
In a second aspect, the invention provides a robot path planning system based on a prioritized experience replay mechanism.
The robot path planning system based on the prioritized experience replay mechanism comprises:
an acquisition module configured to: acquiring the current state and the target position of the path planning robot;
a path planning module configured to: input the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
wherein, during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the experience sample priority sequence is constructed as follows: calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is greater than zero; if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; sampling experiences in order of priority from high to low, and updating the network parameters accordingly.
In a third aspect, the present invention further provides an electronic device, including:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present invention also provides a storage medium storing non-transitory computer readable instructions, wherein the non-transitory computer readable instructions, when executed by a computer, perform the method of the first aspect.
In a fifth aspect, the invention also provides a computer program product comprising a computer program for implementing the method of the first aspect when run on one or more processors.
Compared with the prior art, the invention has the beneficial effects that:
aiming at the problems of single sample and low sample utilization rate of an experience playback mechanism based on a DDPG algorithm, the invention provides a priority experience playback mechanism of a dynamic sample priority sequence, which integrates a TD-error function, an Actor network loss function and an immediate reward, determines the weights of the TD-error function, the Actor network loss function and the Actor network loss function by using an information entropy, calculates the experience sample priority by weighted summation, and constructs the experience sample priority sequence; giving higher priority to positive experience, so that the positive experience can sample preferentially and network convergence is accelerated; considering the diversity of experience samples fully, when the positive experience is trained, the priority of each round is decayed exponentially in turn until the average value of the priority sequence is reached.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flowchart illustrating a process of adjusting the priority of empirical samples according to a first embodiment of the present invention;
fig. 2 is a flowchart of a preferred experience playback according to a first embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data in the embodiments are obtained and used lawfully, in compliance with laws and regulations and with user consent.
Interpretation of terms:
planning a robot path: and the intelligent robot finds a collision-free safe path from the starting point to the end point in the designated area.
Deep Reinforcement Learning (DRL): the method combines the perception capability of deep learning and the decision-making capability of reinforcement learning, and is an artificial intelligence method closer to a human thinking mode.
DDPG: algorithm for robot path planning
Empirical playback (ER): a technique for stabilizing the probability distribution of experience, which is part of the robot path planning training process, improves the stability of the training.
Preferential empirical Playback (PER): the method is improved on the basis of empirical playback, and non-uniform samples are used for replacing uniform samples of the empirical playback to extract the experience.
Information entropy: describing the uncertainty of each possible occurrence of the information.
DeepMind proposed the concept of prioritized experience replay (PER), in which the priority of an experience is measured by the absolute value of the time difference error (TD-error). The larger the TD-error, the more important the experience is for robot learning; conversely, the smaller the TD-error, the less important the experience. This lets the robot concentrate on highly important experience, further improving experience utilization and accelerating the robot's learning process. However, PER usually ignores the effect of the immediate reward and of experiences with small TD-error, leading to the problem of sample homogeneity.
Example one
The embodiment provides a robot path planning method based on a prioritized experience replay mechanism.
As shown in fig. 1, a robot path planning method based on a prioritized experience replay mechanism includes:
S101: acquiring the current state and the target position of the path planning robot;
S102: inputting the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the experience sample priority sequence is constructed as follows:
calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is greater than zero; if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; sampling experiences in order of priority from high to low, and updating the network parameters accordingly.
Further, the method further comprises:
after an experience has been selected to participate in training, its priority is decayed during the next training round; whether the average of all decayed priorities is smaller than a set threshold is judged; if so, the priority of the sample is increased, and if not, the decay continues until the average value of the priority sequence is reached.
Further, the network structure of the trained deep deterministic policy gradient network comprises:
an Actor module, an experience pool and a Critic module connected in sequence;
wherein the Actor module includes a current Actor network and a target Actor network connected in sequence;
wherein the Critic module includes a current Critic network and a target Critic network connected in sequence;
and the current Actor network and the current Critic network are connected with each other, as sketched below.
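A minimal sketch of this four-network bundle, with each network represented simply as a callable (the class and field names are illustrative, not from the patent; concrete architectures are not specified here):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DDPGNetworks:
    """The four networks of the deep deterministic policy gradient model."""
    actor: Callable          # current Actor network: state -> action
    target_actor: Callable   # target Actor network: state -> estimated action a'
    critic: Callable         # current Critic network: (state, action) -> Q value
    target_critic: Callable  # target Critic network: (state, action) -> Q' value
```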
Further, during training of the deep deterministic policy gradient network, the robot interacts with the environment as follows:
At each moment t, the current Actor network of the robot obtains an action a_t from the environment state s_t and applies it to the environment, obtaining an immediate reward r_t and the next environment state s_{t+1}; the current Critic network obtains the Q value Q(s_t, a_t) from the environment state s_t and the action a_t, and evaluates the action a_t.
The i-th experience [s_i, a_i, r_i, s_{i+1}] is sampled from the experience pool; the current Actor network adjusts the action policy according to the Q value Q(s_t, a_t), and the loss term of the current Actor network is ∇_aQ(s_i, a_i|θ^Q), where Q denotes the Q value produced by the current Critic network, s_i the state sampled from the experience pool, a_i the action sampled from the experience pool, θ^Q the parameters of the current Actor network, and Q(s_t, a_t) the value of state s_t and action a_t.
The target Actor network obtains an estimated action a' from the next environment state s_{t+1}.
The target Critic network obtains the Q' value Q'(s_{t+1}, a') from the next environment state s_{t+1} and the estimated action a'; Q'(s_{t+1}, a') denotes the value of state s_{t+1} and action a'.
The time difference error TD-error is obtained by calculating the difference between the Q value and the Q' value.
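A minimal numeric sketch of this computation, using the network bundle above; the discounted form δ = r_t + γ·Q'(s_{t+1}, a') − Q(s_t, a_t) is the standard DDPG TD-error and is assumed here, since the text only states that TD-error is the difference between the Q value and the Q' value:

```python
def td_error(transition, nets, gamma=0.99):
    """TD-error of one sampled transition [s_i, a_i, r_i, s_{i+1}].

    nets is a DDPGNetworks bundle; gamma is the (assumed) discount factor.
    """
    s, a, r, s_next = transition
    q = nets.critic(s, a)                        # Q(s_t, a_t) from the current Critic
    a_next = nets.target_actor(s_next)           # estimated action a' from the target Actor
    q_next = nets.target_critic(s_next, a_next)  # Q'(s_{t+1}, a') from the target Critic
    return r + gamma * q_next - q                # time difference error δ
```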
It should be understood that information entropy measures the uncertainty of discrete random events, i.e. the amount of information needed to remove the uncertainty of an event. The more information is required to remove the uncertainty of an event, the higher the information entropy, and vice versa.
The information entropy H(X) is calculated as shown in formula (1):
H(X) = -Σ_i p_i log₂(p_i)   (1)
where X denotes an unknown event and p_i denotes the probability of the i-th possible outcome of the unknown event.
The influence of the immediate reward, the TD-error and the Actor network loss function on the robot training process is considered comprehensively, and all three are introduced into the construction of the experience sample priority sequence. Simply adding the three together, however, cannot reflect how strongly each of them influences training: one factor may carry too much weight and the calculated experience sample priority would then be inaccurate. To eliminate this uncertainty among the three kinds of information, information entropy is introduced to calculate their weight factors.
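A small sketch of formula (1), used below to derive the weight factors; the helper names are ours:

```python
import math

def information_entropy(probs):
    """H(X) = -sum_i p_i * log2(p_i) over a discrete distribution (formula (1))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def binary_entropy(p):
    """Binary case used in the patent: an event occurs with probability p or not, with 1 - p."""
    return information_entropy([p, 1.0 - p])
```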
Further, calculating the priority of the time difference error, the priority of the current Actor network loss function, and the priority of the immediate reward specifically comprises:
p_im-reward = |r_i|   (2)
p_TD-error = |δ_i|   (3)
p_Actor-loss = |∇_aQ(s_i, a_i|θ^Q)|   (4)
where ε denotes a minimum constant, taking the value 0.05; r_i denotes the immediate reward; δ_i denotes the TD-error; ∇_aQ(s_i, a_i|θ^Q) denotes the Actor network loss term; Q denotes the Q value produced by the current Critic network; s_i denotes the state sampled from the experience pool; a_i denotes the action sampled from the experience pool; θ^Q denotes the parameters of the current Actor network; p_im-reward denotes the priority of the immediate reward; p_TD-error denotes the priority of the TD-error; and p_Actor-loss denotes the priority of the current Actor network loss function.
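A minimal sketch of formulas (2)-(4); how the TD-error and the Actor gradient term are obtained from the networks is outside this snippet:

```python
def priority_components(reward, td_err, actor_grad):
    """Per-sample priority components from formulas (2)-(4).

    reward is the immediate reward r_i, td_err the TD-error delta_i, and
    actor_grad the Actor loss term grad_a Q(s_i, a_i | theta^Q).
    """
    p_im_reward = abs(reward)       # formula (2)
    p_td_error = abs(td_err)        # formula (3)
    p_actor_loss = abs(actor_grad)  # formula (4)
    return p_im_reward, p_td_error, p_actor_loss
```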
Further, determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence specifically comprises:
H(TD) = -p_TD+ log₂(p_TD+) - p_TD- log₂(p_TD-)   (5)
H(TA) = -p_TA+ log₂(p_TA+) - p_TA- log₂(p_TA-)   (6)
H(r_i) = -p_reward>Ravg log₂(p_reward>Ravg) - p_reward<Ravg log₂(p_reward<Ravg)   (7)
During one training pass of the network model, if the reward obtained by the robot is greater than 0, the training is called positive training and the resulting experience sample is called a positive experience sample; otherwise the training is called negative training. In formulas (5) to (7), H(TD) denotes the TD-error information entropy, H(TA) denotes the information entropy of the current Actor network loss function, and H(r_i) denotes the immediate-reward information entropy; p_TD+ is the probability of the TD-error in positive-experience training and p_TD- the probability of the TD-error in negative-experience training; p_TA+ is the probability of the Actor network loss function in positive-experience training and p_TA- the probability of the Actor network loss function in negative training; p_reward>Ravg is the probability that the immediate reward is greater than the reward average, and p_reward<Ravg the probability that the immediate reward is less than the reward average.
The information entropies are used to determine the weight a of the immediate-reward priority, the weight β of the time-difference-error priority and the weight υ of the Actor network loss function priority, according to formulas (8) and (9) and
υ = 1 - a - β   (10)
In formulas (8) to (10), H(TD) denotes the TD-error information entropy, H(TA) denotes the information entropy of the current Actor network loss function, and H(r_i) denotes the immediate-reward information entropy.
Finally, according to the calculated weight coefficients and p_im-reward, p_TD-error and p_Actor-loss, the priority of each experience sample is calculated using formula (11), where ε denotes a minimum constant:
p_i = (a × p_im-reward + β × p_TD-error + υ × p_Actor-loss) + ε   (11)
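A sketch of formulas (10) and (11); since formulas (8) and (9) for the entropy-derived weights a and β are not reproduced above, the sketch simply takes them as inputs:

```python
def composite_priority(p_im_reward, p_td_error, p_actor_loss, a, beta, eps=0.05):
    """Weighted-sum priority of one experience sample, formula (11).

    a and beta are the entropy-derived weights of the immediate-reward and
    TD-error priorities; eps is the minimum constant (0.05 in the text).
    """
    upsilon = 1.0 - a - beta                 # formula (10)
    return (a * p_im_reward
            + beta * p_td_error
            + upsilon * p_actor_loss) + eps  # formula (11)
```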
Further, as shown in fig. 2, during experience sampling it is judged whether the reward is greater than zero; if so, the priority of the experience sample is increased, and if not, the priority of the experience sample is kept unchanged; experiences are then sampled in order of priority from high to low, and the network parameters are updated accordingly. Specifically:
During experience sampling, whether the reward is greater than zero is judged; if so, the experience sample is given a higher priority. For this adjustment, a parameter is set at the beginning of each training round as the priority weight of positive experiences (experiences with reward greater than zero), and the priority of a positive experience sample is adjusted on the basis of its experience sample priority p_i, yielding the adjusted priority p_i'.
The priority of an experience sample whose reward is less than or equal to zero is not adjusted and remains p_i.
The sampling probability P_i is calculated according to the priorities of the experience samples, and experience samples are drawn from the experience pool for training according to the sampling probability P_i.
The sampling probability is calculated as shown in formula (12):
P_i = p_i^α / Σ_{k=1}^{n} p_k^α   (12)
where α is a constant in the range [0, 1]; α = 0 corresponds to uniform sampling, and n is the total number of experience samples.
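A sketch of priority-proportional sampling in the spirit of formula (12) (the exact expression is reconstructed from the surrounding description as the standard prioritized-replay form; helper names are ours):

```python
import random

def sampling_probabilities(priorities, alpha=0.6):
    """P_i = p_i^alpha / sum_k p_k^alpha; alpha = 0 reduces to uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_indices(priorities, batch_size, alpha=0.6):
    """Draw batch_size indices according to the priority-proportional probabilities."""
    probs = sampling_probabilities(priorities, alpha)
    # random.choices samples with replacement using the given weights
    return random.choices(range(len(priorities)), weights=probs, k=batch_size)
```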
It should be understood that, in the experience pool, an experience sample with a large absolute TD-error generally means that the Q value of the current Critic network differs greatly from that of the target Critic network, and therefore has high learning potential. Replaying such experience samples preferentially can bring a large improvement in the robot's learning ability.
However, if only the TD-error is considered and the importance of the reward in the training process is neglected, edge experiences are easily over-used, leading to network overfitting.
Positive experiences, such as successful episodes or experiences with high reward, are likewise among the most important experiences for the robot to learn from; sampling them more often accelerates the convergence of the algorithm and effectively mitigates overfitting. Therefore, positive experience samples are given a higher replay priority.
Further, after an experience has been selected to participate in training, its priority is decayed in the next training round; whether the average of all decayed priorities is smaller than a set threshold is judged; if so, the priority of the sample is increased, and if not, the decay continues until the average value of the priority sequence is reached. Specifically:
Assume a positive experience sample j with priority p_j, and let the priority sequence of the sampled batch be p = (p_1, p_2, ···, p_j, ···, p_n).
Then, in the next round, the priority p_j decays exponentially according to the decay factor σ.
When the priority of the experience sample is greater than the set threshold Z, the decay is allowed; otherwise the decay stops.
The threshold Z is calculated as the average value of the current priority sequence.
In order to quickly learn the latest experience from the robot's interaction with the environment within a finite-capacity experience pool, the priorities of experience samples need to be updated in time.
The method provided by the invention is a DDPG algorithm based on priority-sequence sampling. It makes full use of the Actor-Critic framework of DDPG, combines the Actor network loss function, the TD-error and the immediate reward, determines the weighting coefficients from the information entropy, and constructs the experience sample priority sequence. The immediate reward of the robot is used to classify experience: samples with immediate reward greater than zero are regarded as positive experience and the rest as negative experience. Positive experience is given higher priority so that it is sampled more frequently, accelerating the training of the DDPG algorithm. Meanwhile, to preserve the diversity of experience samples, low-priority samples can still be sampled, and after a positive experience has been sampled its priority is exponentially decayed in the next training round until it is less than or equal to the set threshold. The specific flow of the algorithm is shown in fig. 2.
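Putting the pieces together, one round of the priority-sequence sampling scheme might look as follows; this composes the helper sketches above, and the multiplicative `boost` stands in for the positive-experience weight parameter whose symbol and exact adjustment rule are not reproduced in the text:

```python
def training_round(buffer, priorities, batch_size=64, boost=1.5, alpha=0.6, sigma=0.9):
    """One illustrative round: boost positive experiences, sample, train, decay.

    buffer holds transitions (state, action, reward, next_state); priorities is
    the matching experience sample priority sequence.
    """
    # 1. Give experiences with reward > 0 a higher priority before sampling.
    adjusted = [boost * p if buffer[i][2] > 0 else p for i, p in enumerate(priorities)]
    # 2. Sample a batch according to the priority-proportional probabilities.
    idx = sample_indices(adjusted, batch_size, alpha)
    batch = [buffer[i] for i in idx]
    # 3. The DDPG Actor/Critic update on `batch` would run here, after which the
    #    priorities of the sampled transitions are recomputed via formula (11).
    # 4. Decay the priorities of the positive samples that were just trained.
    positive = [i for i in idx if buffer[i][2] > 0]
    decay_priorities(priorities, positive, sigma)
    return batch
```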
The present invention performed experiments to evaluate the DDPG algorithm based on priority-sequence sampling. Simulation experiments were carried out with the Gazebo physics simulator, implemented in Python with the PyTorch framework, and messages were exchanged via the ROS operating system. An important indicator of algorithm performance is the reward the robot obtains from interacting with the environment in one training round under the algorithm: if the obtained reward value is high and relatively stable, the algorithm is considered to perform well in that environment.
The robot was trained in four different environments, the first without obstacles and the remaining three with obstacles of different sizes and arrangements. The training results show that in the obstacle-free environment the PER method is not sufficiently stable and its average reward value is low, floating mainly in the range 0-0.5, whereas the proposed method converges faster, stabilizes after about 50 rounds, and achieves a higher average reward value, floating mainly in the range 0.75-1.75.
In the environments with obstacles, the training curve of the PER method oscillates strongly and its robustness is poor, with the average reward value floating mainly between -1.0 and 1.0; the average reward value of the proposed method rises steadily and floats mainly in the range 0.5-1.0, achieving a good result.
To verify the success rate and effectiveness of the proposed method for robot path planning in unknown environments, the trained robot was subjected to 200 task tests in a new simulation environment. The results show that, in terms of success rate, the PER method achieves 86%, while the proposed algorithm reaches 90.5%, which is clearly higher; in terms of time consumption, the proposed algorithm is 13% faster than the PER algorithm, so the robot reaches the target point more quickly. The proposed algorithm also exhibits a lower trial-and-error rate and fewer collisions with obstacles during path planning.
The experimental results show that in the robot path planning task the average reward value of the PER method fluctuates strongly and its training process is unstable: PER evaluates the priority of an experience sample with the single TD-error indicator and ignores the effect of the immediate reward on network training, so edge experience samples are over-used and the training easily falls into local optima. The invention constructs a composite priority that comprehensively considers the immediate reward, the TD-error and the feedback of the Actor network, determines dynamic weighting coefficients using information entropy, and assigns reasonable priorities to experience samples, so that the training process stabilizes step by step and the average reward value is higher than that of PER; this shows that the dynamically adjusted priority satisfies the requirement for experience sample diversity during robot training. On this basis, a priority-sequence sampling mechanism is superimposed: positive and negative experience samples are distinguished, the priority of positive experience samples is adjusted, and after sampling and training their priorities are decayed, so that the robot can be trained quickly with a finite experience pool. Higher average reward values were obtained in comparison experiments over different numbers of training rounds, demonstrating the effectiveness of the improved algorithm.
Example two
The embodiment provides a robot path planning system based on a prioritized experience replay mechanism.
The robot path planning system based on the prioritized experience replay mechanism comprises:
an acquisition module configured to: acquiring the current state and the target position of the path planning robot;
a path planning module configured to: input the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
wherein, during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the experience sample priority sequence is constructed as follows: calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is greater than zero; if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; sampling experiences in order of priority from high to low, and updating the network parameters accordingly.
It should be noted here that the acquiring module and the path planning module correspond to steps S101 to S102 in the first embodiment, and the modules are the same as the examples and application scenarios realized by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software.
The method in the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may reside in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A robot path planning method based on a prioritized experience replay mechanism, characterized by comprising the following steps:
acquiring the current state and the target position of the path planning robot; inputting the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
wherein, during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the construction process of the experience sample priority sequence is as follows:
calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is larger than zero, if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; and sampling the experience according to the sequence of the priority from high to low, and further updating the network parameters.
2. The robot path planning method based on the prioritized experience replay mechanism according to claim 1, wherein, after the experiences are sampled in order of priority from high to low and the network parameters are updated, the method further comprises:
after the experience is selected to participate in the training, in the next round of training process, the priorities of the experience already participating in the training are attenuated, whether the average value of all attenuated priorities is smaller than a set threshold value or not is judged, if yes, the priority of the test sample is increased, and if not, the priority attenuation is continued until the average value of the priority sequence is reduced.
3. The method according to claim 1, characterized in that, during training, the robot interacts with the environment as follows:
at each moment t, the current Actor network of the robot obtains an action a_t from the environment state s_t and applies it to the environment, obtaining an immediate reward r_t and the next environment state s_{t+1}; the current Critic network obtains the Q value Q(s_t, a_t) from the environment state s_t and the action a_t, and evaluates the action a_t;
the i-th experience [s_i, a_i, r_i, s_{i+1}] is sampled from the experience pool; the current Actor network adjusts the action policy according to the Q value Q(s_t, a_t), and the loss function of the current Actor network is ∇_aQ(s_i, a_i|θ^Q), where Q denotes the Q value produced by the current Critic network, s_i the state sampled from the experience pool, a_i the action sampled from the experience pool, θ^Q the parameters of the current Actor network, and Q(s_t, a_t) the value of state s_t and action a_t;
the target Actor network obtains an estimated action a' from the next environment state s_{t+1};
the target Critic network obtains the Q' value Q'(s_{t+1}, a') from the next environment state s_{t+1} and the estimated action a'; Q'(s_{t+1}, a') denotes the value of state s_{t+1} and action a';
and the time difference error TD-error is obtained by calculating the difference between the Q value and the Q' value.
4. The method according to claim 1, wherein calculating the priority of the time difference error, the priority of the current Actor network loss function, and the priority of the immediate reward specifically comprises:
p_im-reward = |r_i|   (2)
p_TD-error = |δ_i|   (3)
p_Actor-loss = |∇_aQ(s_i, a_i|θ^Q)|   (4)
wherein r_i denotes the immediate reward, δ_i denotes the TD-error, ∇_aQ(s_i, a_i|θ^Q) denotes the Actor network loss function, Q denotes the Q value produced by the current Critic network, s_i denotes the state sampled from the experience pool, a_i denotes the action sampled from the experience pool, θ^Q denotes the parameters of the current Actor network, p_im-reward denotes the priority of the immediate reward, p_TD-error denotes the priority of the TD-error, and p_Actor-loss denotes the priority of the current Actor network loss function.
5. The method according to claim 4, wherein determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence specifically comprises:
H(TD) = -p_TD+ log₂(p_TD+) - p_TD- log₂(p_TD-)   (5)
H(TA) = -p_TA+ log₂(p_TA+) - p_TA- log₂(p_TA-)   (6)
H(r_i) = -p_reward>Ravg log₂(p_reward>Ravg) - p_reward<Ravg log₂(p_reward<Ravg)   (7)
wherein, during one training pass of the network model, if the reward obtained by the robot is greater than 0 the training is called positive training and the resulting experience sample is called a positive experience sample, and otherwise the training is called negative training; in formulas (5) to (7), H(TD) denotes the TD-error information entropy, H(TA) denotes the information entropy of the current Actor network loss function, H(r_i) denotes the immediate-reward information entropy, p_TD+ is the probability of the TD-error in positive-experience training, p_TD- is the probability of the TD-error in negative-experience training, p_TA+ is the probability of the Actor network loss function in positive-experience training, p_TA- is the probability of the Actor network loss function in negative training, p_reward>Ravg is the probability that the immediate reward is greater than the reward average, and p_reward<Ravg is the probability that the immediate reward is less than the reward average;
the information entropies are used to calculate the weight a of the immediate-reward priority, the weight β of the time-difference-error priority and the weight υ of the Actor network loss function priority, according to formulas (8) and (9) and
υ = 1 - a - β   (10)
wherein, in formulas (8) to (10), H(TD) denotes the TD-error information entropy, H(TA) denotes the information entropy of the current Actor network loss function, and H(r_i) denotes the immediate-reward information entropy;
finally, according to the calculated weight coefficients and p_im-reward, p_TD-error and p_Actor-loss, the priority of each experience sample is calculated using formula (11), where ∈ denotes a minimum constant:
p_i = (a × p_im-reward + β × p_TD-error + υ × p_Actor-loss) + ∈   (11).
6. The method according to claim 1, characterized in that, during experience sampling, it is judged whether the reward is greater than zero; if so, the priority of the experience sample is increased, and if not, the priority of the experience sample is kept unchanged; experiences are sampled in order of priority from high to low, and the network parameters are updated accordingly; specifically comprising:
during experience sampling, whether the reward is greater than zero is judged; if so, the experience sample is given a higher priority, a parameter being set at the beginning of each training round as the priority weight of experiences with reward greater than zero, and the priority of such an experience sample is adjusted on the basis of its experience sample priority p_i to obtain the adjusted priority p_i';
the priority of an experience sample whose reward is less than or equal to zero is not adjusted and remains p_i;
the sampling probability P_i is calculated according to the priorities of the experience samples, and experience samples are sampled from the experience pool for training according to the sampling probability P_i;
the sampling probability is calculated as shown in formula (12):
P_i = p_i^α / Σ_{k=1}^{n} p_k^α   (12)
wherein α is a constant in the range [0, 1], α = 0 corresponds to uniform sampling, and n is the total number of experience samples.
7. The method according to claim 2, characterized in that, after an experience has been selected to participate in training, its priority is decayed in the next training round; whether the average of all decayed priorities is smaller than a set threshold is judged; if so, the priority of the sample is increased, and if not, the decay continues until the average value of the priority sequence is reached; specifically comprising:
assume a positive experience sample j with priority p_j, and let the priority sequence of the sampled batch be p = (p_1, p_2, …, p_j, …, p_n);
then, in the next round, the priority p_j decays exponentially according to the decay factor σ;
when the priority of the experience sample is greater than the set threshold Z, the decay is allowed; otherwise the decay stops;
the threshold Z is calculated as the average value of the priority sequence.
8. A robot path planning system based on a prioritized experience replay mechanism, comprising:
an acquisition module configured to: acquiring the current state and the target position of the path planning robot;
a path planning module configured to: input the current state and the target position of the path planning robot into a trained deep deterministic policy gradient network to obtain the robot action; the path planning robot completes path planning according to the obtained robot action;
wherein, during training of the deep deterministic policy gradient network, experience generated by the robot is stored in an experience pool according to an experience sample priority sequence;
the experience sample priority sequence is constructed as follows: calculating the priority of the time difference error, the priority of the current Actor network loss function and the priority of the immediate reward; determining the weights of the three using information entropy, calculating the priority of each experience sample by weighted summation, and constructing the experience sample priority sequence;
during experience sampling, judging whether the reward is greater than zero; if so, adjusting the priority of the experience sample, and if not, keeping the priority of the experience sample unchanged; and sampling experiences in order of priority from high to low, and updating the network parameters accordingly.
9. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of any of claims 1-7.
10. A storage medium storing non-transitory computer-readable instructions, wherein the non-transitory computer-readable instructions, when executed by a computer, perform the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211199553.5A CN115509233B (en) | 2022-09-29 | 2022-09-29 | Robot path planning method and system based on priority experience playback mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211199553.5A CN115509233B (en) | 2022-09-29 | 2022-09-29 | Robot path planning method and system based on priority experience playback mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115509233A true CN115509233A (en) | 2022-12-23 |
CN115509233B CN115509233B (en) | 2024-09-06 |
Family
ID=84508874
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211199553.5A Active CN115509233B (en) | 2022-09-29 | 2022-09-29 | Robot path planning method and system based on priority experience playback mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115509233B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022083029A1 (en) * | 2020-10-19 | 2022-04-28 | 深圳大学 | Decision-making method based on deep reinforcement learning |
CN112734014A (en) * | 2021-01-12 | 2021-04-30 | 山东大学 | Experience playback sampling reinforcement learning method and system based on confidence upper bound thought |
CN113503885A (en) * | 2021-04-30 | 2021-10-15 | 山东师范大学 | Robot path navigation method and system based on sampling optimization DDPG algorithm |
Non-Patent Citations (2)
Title |
---|
WANG PENG et al.: "Prioritized experience replay in DDPG via multi-dimensional transition priorities calculation", Research Square, 14 November 2022 (2022-11-14) *
XU Nuo; YANG Zhenwei: "Multi-agent cooperation based on the MADDPG algorithm under sparse rewards", Modern Computer, no. 15, 25 May 2020 (2020-05-25) *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116523154A (en) * | 2023-03-22 | 2023-08-01 | 中国科学院西北生态环境资源研究院 | Model training method, route planning method and related devices |
CN116523154B (en) * | 2023-03-22 | 2024-03-29 | 中国科学院西北生态环境资源研究院 | Model training method, route planning method and related devices |
CN118508817A (en) * | 2024-07-18 | 2024-08-16 | 闽西职业技术学院 | Motor self-adaptive control method and system based on deep reinforcement learning |
CN118670400A (en) * | 2024-08-22 | 2024-09-20 | 西南交通大学 | Multi-agent route planning method and device based on artificial potential field and PPO |
Also Published As
Publication number | Publication date |
---|---|
CN115509233B (en) | 2024-09-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115509233A (en) | Robot path planning method and system based on prior experience playback mechanism | |
CN112052456B (en) | Multi-agent-based deep reinforcement learning strategy optimization defense method | |
CN111260027B (en) | Intelligent agent automatic decision-making method based on reinforcement learning | |
CN108319132B (en) | Decision-making system and method for unmanned aerial vehicle air countermeasure | |
CN112884131A (en) | Deep reinforcement learning strategy optimization defense method and device based on simulation learning | |
CN112884130A (en) | SeqGAN-based deep reinforcement learning data enhanced defense method and device | |
CN109242099B (en) | Training method and device of reinforcement learning network, training equipment and storage medium | |
CN113298252B (en) | Deep reinforcement learning-oriented strategy anomaly detection method and device | |
CN108830376B (en) | Multivalent value network deep reinforcement learning method for time-sensitive environment | |
CN110335466B (en) | Traffic flow prediction method and apparatus | |
CN111753300B (en) | Method and device for detecting and defending abnormal data for reinforcement learning | |
CN111416797A (en) | Intrusion detection method for optimizing regularization extreme learning machine by improving longicorn herd algorithm | |
CN109800517B (en) | Improved reverse modeling method for magnetorheological damper | |
CN105424043B (en) | It is a kind of based on judging motor-driven estimation method of motion state | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium | |
CN117008620A (en) | Unmanned self-adaptive path planning method, system, equipment and medium | |
CN117787384A (en) | Reinforced learning model training method for unmanned aerial vehicle air combat decision | |
CN118193978A (en) | Automobile roadblock avoiding method based on DQN deep reinforcement learning algorithm | |
CN110302539B (en) | Game strategy calculation method, device and system and readable storage medium | |
CN115542901B (en) | Deformable robot obstacle avoidance method based on near-end strategy training | |
CN116524316A (en) | Scene graph skeleton construction method under reinforcement learning framework | |
CN111144243A (en) | Household pattern recognition method and device based on counterstudy | |
CN113313236B (en) | Deep reinforcement learning model poisoning detection method and device based on time sequence neural pathway | |
Wang et al. | Reinforcement Learning using Reward Expectations in Scenarios with Aleatoric Uncertainties | |
CN116755046B (en) | Multifunctional radar interference decision-making method based on imperfect expert strategy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |