CN116360454A - Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment - Google Patents
- Publication number
- CN116360454A (application number CN202310437715.2A)
- Authority
- CN
- China
- Prior art keywords
- robot
- pedestrian
- state
- action
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G05D1/0238 — Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors
- G05D1/024 — Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors in combination with a laser
- G05D1/0214 — Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
- G05D1/0221 — Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
- G05D1/0276 — Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment, belonging to the technical fields of deep learning and mobile robot navigation. The invention fuses several collision avoidance algorithms to generate a pedestrian trajectory data set composed of pedestrian states; an encoder-decoder network encodes and decodes these pedestrian states, and the resulting features, together with the robot's current state and observation obtained from a partially observable Markov tuple, are fed into a deep neural network to generate an estimated value of the robot's state-action value pair. An action neural network outputs the optimal action selected by the robot in the current state, an evaluation neural network scores the robot's state-action pairs, and the networks are trained iteratively based on reinforcement learning. In the execution stage, a lidar acquires surrounding environment data, realizing safe navigation of the mobile robot in a pedestrian environment; the method effectively alleviates the short-sightedness and occlusion problems of robot navigation in dynamic environments and improves the safety and timeliness of the robot's obstacle avoidance behavior.
Description
Technical Field
The invention belongs to the technical field of deep learning and mobile robot navigation, and particularly relates to a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment.
Background
With the rapid development of the robot industry in recent years, intelligent mobile robots are increasingly applied in transportation, logistics, social services and emergency rescue. An intelligent mobile robot senses and understands the external environment with its onboard sensors, makes real-time decisions according to task requirements, performs closed-loop control, operates in an autonomous or semi-autonomous mode, and has a certain self-learning and adaptive capacity in known or unknown environments. The most fundamental technologies of an intelligent mobile robot are navigation and obstacle avoidance. Navigation is a key problem that an intelligent mobile robot must solve to realize path planning: it refers to the process in which the mobile robot senses the environment and its own state through sensors and, through learning, moves autonomously to a target in an environment containing obstacles.
Robot path planning means that the robot senses the surrounding environment with various sensors and autonomously searches for a collision-free path from a starting point to an end point. Traditional navigation and obstacle avoidance methods mainly comprise search-based methods, sampling-based methods, the artificial potential field method, and so on. Search-based methods mainly comprise the Dijkstra algorithm and the A* algorithm. The Dijkstra algorithm adopts a greedy strategy to solve the single-source shortest path problem in a weighted directed graph; its main characteristic is that the node selected in each iteration is the nearest unexpanded node on the current frontier. To ensure that the final path is shortest, each iteration updates the shortest paths from the initial node to all traversed nodes, and the algorithm ends when the searched range covers the target point. The A* algorithm is a heuristic search algorithm: it establishes a heuristic rule during the search to measure the distance between the current search position and the target position, so that the search direction preferentially faces the target point, which improves search efficiency. Sampling-based methods include the probabilistic roadmap method, the rapidly exploring random tree method, and so on. The probabilistic roadmap method randomly samples the search space to form a roadmap on which paths are searched. The random tree method guides a path tree to grow through the search space via sampling points to form paths. The basic idea of the artificial potential field method is derived from the concept of "fields" in physics: the algorithm models the motion of the robot in the environment as motion in an abstract artificial "field".
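The greedy expansion described above can be illustrated with a minimal Dijkstra sketch (not part of the disclosure; the toy graph and function names are illustrative). A* would differ only in adding a heuristic estimate to the priority:

```python
import heapq

def dijkstra(graph, start, goal):
    """Shortest path on a weighted directed graph given as
    {node: [(neighbor, cost), ...]}. Returns (cost, path)."""
    frontier = [(0, start, [start])]            # (cost so far, node, path)
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)  # greedy: nearest unexpanded node
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(frontier, (cost + w, nxt, path + [nxt]))
    return float("inf"), []                     # goal unreachable

# A* would push (cost + heuristic(nxt), ...) instead of the bare cost.
grid = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 1)]}
print(dijkstra(grid, "S", "G"))  # (4, ['S', 'A', 'B', 'G'])
```

The search ends as soon as the goal node is popped, which is exactly the "range covers the target point" termination described above.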
Reinforcement learning (RL) learns a mapping from environmental states to actions, with the goal of letting the agent obtain the greatest cumulative reward during its interaction with the environment; a Markov decision process can be used to model RL problems. In robot navigation technology, the parts with the most obvious influence on the navigation effect are twofold: the environment perception part and the decision-control part. Deep learning and reinforcement learning can respectively address the perception-and-reasoning and decision-control problems in navigation, and deep reinforcement learning has achieved remarkable results on navigation decision problems in recent years.
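The goal of maximizing cumulative discounted reward on a Markov decision process can be sketched with value iteration on a hypothetical two-state problem (illustrative only; the MDP, state names and function are not from the disclosure):

```python
def value_iteration(states, P, R, gamma=0.9, tol=1e-6):
    """P[s][a] = list of (prob, next_state); R[s][a] = immediate reward.
    Returns the value of each state under the optimal policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not P[s]:                        # terminal state keeps value 0
                continue
            # Bellman optimality backup: best expected discounted return
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Hypothetical two-state MDP: from "s0", action "go" reaches the terminal
# "goal" state with reward 1, so V(s0) = 1 under any discount rate.
P = {"s0": {"go": [(1.0, "goal")]}, "goal": {}}
R = {"s0": {"go": 1.0}, "goal": {}}
V = value_iteration(["s0", "goal"], P, R)
print(V["s0"])  # 1.0
```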
However, in reinforcement learning the design of the reward function is the key to driving the agent to learn the optimal strategy, and an improper reward function leads the agent to learn an inefficient or even erroneous strategy. Existing algorithms use a single-step reward function, which, together with the inherent trial-and-error of reinforcement learning, easily produces an excessively high variance of the Q value; this high variance slows the convergence of the training return function.
In the application scenarios of intelligent mobile robots, the situation of the surrounding pedestrians must constantly be considered. Occlusion is very common when the environment is highly dynamic and only partially observable. Existing crowd navigation methods often fail to predict human motion trajectories and assume that complete environmental knowledge is provided. When deployed in the real world, these algorithms only consider avoiding detected or observed humans; therefore, when an occluded human suddenly appears on the robot's path, a collision may occur.
These challenges impose higher requirements on the navigation and obstacle avoidance of intelligent mobile robots: the robot needs a certain adaptability to unknown environments and must be able to navigate in pedestrian environments.
Disclosure of Invention
In order to solve the problems of slow convergence in training and insufficient recognition of dynamic pedestrians in the prior art, the invention provides a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment.
The technical scheme adopted for solving the technical problems is as follows:
a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment comprises the following steps:
s1: defining a state space according to the states of the robot and the person at the current moment; modeling the mutual observation between the robot and the pedestrian as a partially observable Markov tuple according to the state space;
s2: setting a multi-step agent stage reward function, wherein the multi-step agent stage reward function comprises a penalty for collision between the robot and a dynamic obstacle, a constraint on the distance between the robot and the dynamic obstacle, a penalty on the distance between the robot and the target point, and a time cost;
s3: randomly generating a pedestrian trajectory data set formed from pedestrian states based on collision avoidance algorithms according to the pedestrians' action space;
s4: inputting the pedestrian trajectory data set D into a codec network and extracting pedestrian motion features;
s5: estimating the robot's state-action value pair with a deep learning network according to the pedestrian motion features and the robot's current state and observation obtained from the partially observable Markov tuple;
s6: outputting the optimal action selected by the robot in the current state with the action neural network according to the estimated value of the robot's state-action value pair, scoring the robot's state-action with the evaluation neural network, and updating and optimizing the parameters of the codec network, the deep learning network, the action neural network and the evaluation neural network through reinforcement-learning iterative training combined with the multi-step agent stage reward function;
s7: in the path collision avoidance planning process, firstly, coding the pedestrian state by utilizing a trained codec network to obtain pedestrian motion characteristics; estimating the state-action value pair of the robot by using the trained deep learning network according to the pedestrian motion characteristics and the current state and observation of the robot obtained by the partially observable Markov tuple; and finally, outputting the optimal action in the current state through the trained action neural network, and realizing the safe navigation in the dynamic pedestrian environment.
Further, the state space is expressed as:
s = [d_g, v_pref, v_x, v_y, r]
h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r]
wherein s represents the robot state, d_g represents the distance between the robot and the target point, v_pref represents the preferred speed of the robot, v_x, v_y represent the speed of the robot along the x and y axes, and r represents the radius of the space occupied by the robot; h_i represents the state of the i-th pedestrian, p_x, p_y represent the position of the pedestrian, v_x, v_y represent the speed of the pedestrian along the x and y axes, r_i represents the radius of the space occupied by the i-th pedestrian, d_i represents the distance between the robot and the i-th pedestrian, and r_i + r represents the minimum safe distance between the robot and the person.
Further, the Markov tuple is represented as (S, A, P, R, Ω, O, γ), where S represents the state space, A represents the action space, P represents the state transition probability, R represents the reward, Ω represents the observation probability distribution, O represents the mapping from the state space to the observation space, and γ represents the discount rate.
Further, the multi-step agent stage reward function is as follows:
R_t = Σ_{k=0}^{N-1} γ^k · r_{t+k}
wherein N represents the total number of steps of the multi-step agent stage reward function, r_t represents the reward of step t, T represents the total number of decisions in the navigation process, γ represents the discount rate, and k represents the current step number.
Further, said r_t is as follows:
r_t(s_t^jn, a_t) = r_col if d_t < 0; α(d_t - 0.15) if 0 ≤ d_t < 0.15; r_goal(t) if the robot reaches the target position; 0 otherwise
wherein t represents the current time, s_t^jn represents the joint state at time t in robot navigation, a_t represents the action at time t, and d_t represents the distance between the robot and the pedestrian at time t; r_col < 0 is the collision penalty, α > 0 scales the proximity penalty, and r_goal(t) is a goal reward that decreases with the arrival time. When the distance between the robot and the pedestrian is less than 0, a collision has occurred and the robot is penalized; when the distance is less than 0.15, the robot is penalized according to the degree of proximity; when the robot reaches the target position within the prescribed time, it is rewarded, and the arrival time is inversely related to the reward.
Further, the pedestrian trajectory data set D in step S3 is generated by three collision avoidance algorithms: RVO, ORCA and SFM.
Further, the codec network is composed of an encoder and a decoder; the input of the encoder is the pedestrian states in the pedestrian trajectory data set, and the output of the decoder is the pedestrian motion feature F = {f_t, f_{t-1}, f_{t-2}}, where f_t represents the motion feature of the pedestrian at time t.
Further, the deep neural network is composed, in order, of a multi-layer perceptron network, a softmax activation function, a multi-layer perceptron network and a fully connected layer; the inputs of the deep neural network are the robot's current state s_t and observation o_t together with the pedestrian motion feature F, and its output is the estimated value of the robot's state-action value pair.
The beneficial effects of the invention are mainly as follows:
(1) The invention combines three collision avoidance algorithms, RVO, ORCA and SFM, to randomly generate pedestrian data, enhancing the robustness of the data and solving the problem that a single model is difficult to generalize to complex dynamic pedestrian environments.
(2) The invention uses the multi-step agent stage reward function, which reduces the variance of the value function and accelerates training convergence.
(3) Through the encoder-decoder network structure, the invention uses observed pedestrian behavior as an additional sensor measurement to estimate the positions of occluded pedestrians, and integrates this mechanism into deep reinforcement learning, effectively improving the robot's obstacle avoidance success rate and navigation efficiency in pedestrian environments, alleviating the short-sightedness and occlusion problems of robot navigation in dynamic environments, and improving the safety and timeliness of obstacle avoidance behavior.
Drawings
Fig. 1 is a flowchart of a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment.
Fig. 2 is a diagram of the model training network of the present invention.
Fig. 3 is a simulation result at 2.5s after the robot starts.
Fig. 4 is a simulation result at 7.75s after the robot starts.
Detailed Description
The invention is further described below with reference to the accompanying drawings and some examples.
The invention discloses a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment, whose flow chart is shown in fig. 1. The method comprises: defining a state space and a partially observable Markov tuple; setting a multi-step reward function that accounts for the influence of future multi-step robot actions on the current action; randomly generating a pedestrian trajectory data set D; building a local perception map of the robot from the data set and initializing the robot's state space; extracting pedestrian motion features with a multi-layer encoder-decoder network; and training an action-evaluation network through reinforcement learning to predict pedestrian trajectories in the simulation scene.
As shown in fig. 1-2, the specific implementation steps are as follows:
Step 1: Define the state space according to the states of the robot and the pedestrians at the current moment:
s = [d_g, v_pref, v_x, v_y, r] (1)
h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r] (2)
wherein s represents the robot state, d_g represents the distance between the robot and the target point, v_pref represents the preferred speed of the robot, v_x, v_y represent the speed of the robot along the x and y axes, and r represents the radius of the space occupied by the robot; h_i represents the state of the i-th pedestrian, p_x, p_y represent the position of the pedestrian, v_x, v_y represent the speed of the pedestrian along the x and y axes, r_i represents the radius of the space occupied by the i-th pedestrian, d_i represents the distance between the robot and the i-th pedestrian, and r_i + r represents the minimum safe distance between the robot and the person.
The mutual observation between the robot and the pedestrians is modeled as a partially observable Markov tuple (S, A, P, R, Ω, O, γ), where S represents the state space, A the action space, P the state transition probability, R the reward, Ω the observation probability distribution, O the mapping from the state space to the observation space, and γ the discount rate. Within each time step, the robot selects an action a ∈ A given an observation o ∈ O. By the Markov assumption, the next state s′ is determined only by the current state s; the observation probability distribution Ω(o, s′, a) and the state transition P(s, a, s′) are given by conditional probabilities.
Step 2: setting a multi-step proxy stage rewarding function, wherein the multi-step proxy stage rewarding function comprises punishment on collision of the robot and the dynamic obstacle, constraint on distance between the robot and the dynamic obstacle, punishment on distance between the robot and a target point and time cost;
in this embodiment, the multi-step proxy stage rewards function is defined as follows:
where N represents the total number of steps of the multi-step proxy stage rewarding function, r i Indicating the rewards of the T-th step, wherein T indicates the total decision number in the navigation process, gamma indicates the discount rate, and k indicates the current step number. Take the multi-step value N from the experience stored in the temporary replay buffer. For the training segments of T steps, the buffer is from the initial state s 0 To the end state s T Is T. The influence of future multi-step robot action selection on the current rewarding value is considered, so that the action of the robot is converged in a safe action interval, the variance of a cost function is reduced, and the training convergence speed is accelerated.
wherein r_t is expressed as follows:
r_t(s_t^jn, a_t) = r_col if d_t < 0; α(d_t - 0.15) if 0 ≤ d_t < 0.15; r_goal(t) if the robot reaches the target position; 0 otherwise
wherein t represents the current time, s_t^jn represents the joint state at time t in robot navigation, a_t represents the action at time t, and d_t represents the distance between the robot and the pedestrian at time t. The reward function comprises a penalty for collision with a dynamic obstacle, a constraint on the distance between the robot and the dynamic obstacle, a penalty on the distance to the target point, and a time cost. When the distance between the robot and the pedestrian is less than 0, a collision has occurred and the robot is penalized (the dynamic-obstacle collision penalty); when the distance is less than 0.15, the robot is penalized according to the degree of proximity (the distance constraint with the dynamic obstacle). When the robot reaches the target position within the prescribed time, it is rewarded, and the reward is inversely related to the arrival time (the target-distance penalty and time cost).
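The piecewise structure of r_t can be sketched as below. The specific constants (collision penalty -0.25, proximity scale 0.1, time horizon 100) are illustrative stand-ins, not values from the disclosure:

```python
def step_reward(d_t, reached_goal, t, t_max=100.0,
                r_collision=-0.25, r_goal=1.0):
    """Piecewise stage reward (constants are illustrative placeholders):
    collision penalty, proximity penalty inside the 0.15 m comfort zone,
    and a goal reward that shrinks as the arrival time t grows."""
    if d_t < 0:                       # overlap with a pedestrian: collision
        return r_collision
    if reached_goal:                  # earlier arrival -> larger reward
        return r_goal * (1.0 - t / t_max)
    if d_t < 0.15:                    # too close: penalty grows as d_t shrinks
        return -0.1 * (0.15 - d_t)
    return 0.0

print(step_reward(-0.01, False, 10))  # -0.25
print(step_reward(0.05, False, 10))   # small proximity penalty
```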
Step 3: random generation of pedestrian track data based on collision algorithm according to pedestrian action spaceA set D; in one embodiment of the invention, three RVO, ORCA, SFM collision avoidance algorithms are adopted in the action space of the pedestrians, and a pedestrian motion trail data set D is randomly generated according to the proportion of 1:1:1, wherein the pedestrian motion trail data set consists of a plurality of pedestrian states h i =[p x ,p y ,v x ,v y ,r i ,d i ,r i +r]The composition is formed.
Step 4: a surrounding grid map of the robot is generated for capturing the locations of surrounding dynamic pedestrians.
Step 5: the encoder and decoder are constructed in a three-layer structure, wherein each encoder layer comprises a feedforward network and a self-attention module, and each decoder layer comprises a self-attention module, a codec attention module and a feedforward network. The self-attention module is used for learning the relation between the current behavior of the pedestrian and the previous behavior, and the encoding and decoding attention module is used for learning the current behavior of the pedestrian and the motion characteristic F= { F of the pedestrian to be encoded t ,f t-1 ,f t-2 Relationships of }. As shown in fig. 2, the input of the encoder is the pedestrian state in the pedestrian track data set, and the output of the decoder is the pedestrian motion characteristic f= { F t ,f t-1 ,f t-2 And (f), where f t The motion characteristics of the pedestrian at the time t are shown.
Step 6: and estimating the state-action value pair of the robot by using the deep learning network.
The robot's current state s_t and observation o_t, together with the pedestrian motion feature F, are input into the deep neural network to estimate the robot's state-action value pair. In this embodiment, the deep neural network is composed, in order, of 3 multi-layer perceptron networks, a softmax activation function, 2 multi-layer perceptron networks and 1 fully connected layer.
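A forward pass through such a stack (perceptron layers, a softmax activation, then a fully connected output) can be sketched with a generic dense layer; the layer sizes, toy input and random weights below are placeholders, not trained values from the disclosure:

```python
import math
import random

def dense(x, W, b, act=None):
    """One fully connected layer: y = act(W x + b)."""
    y = [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    if act == "relu":
        y = [max(0.0, v) for v in y]
    elif act == "softmax":
        m = max(y)
        e = [math.exp(v - m) for v in y]
        s = sum(e)
        y = [v / s for v in e]
    return y

rng = random.Random(0)
def rand_layer(n_out, n_in):
    """Random placeholder weights; real values would come from training."""
    return ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

x = [0.5, -0.2, 0.1]                             # toy concat of s_t, o_t and F
h = dense(x, *rand_layer(4, 3), act="relu")      # perceptron layer
h = dense(h, *rand_layer(4, 4), act="softmax")   # softmax activation stage
q = dense(h, *rand_layer(2, 4))                  # final fully connected layer
print(len(q))  # 2 estimated state-action values
```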
Step 7: and outputting the optimal action selected by the robot in the current state by using an action neural network in the PPO algorithm, scoring the state-action of the robot by using an evaluation neural network in the PPO algorithm, and selecting the action with the largest rewarding value for updating. By using the course learning method, the number of pedestrians in the environment is gradually increased, the pedestrian speed is changed, and the like. And (5) carrying out iterative training, updating and optimizing network parameters, and selecting an optimal motion trail.
In one implementation of the invention, the simulation environment is built on the Gym platform; the scene is a square space with a side length of 12 meters, the initial positions of the pedestrians are randomly distributed in the space, and there are 8 dynamic pedestrians. In this environment, the trained reinforcement learning network selects the optimal collision avoidance path and completes the robot's safe navigation.
The simulation results of the robot path planning process are shown in fig. 3 and fig. 4, wherein the square unit is the robot, the round units numbered 0-7 represent pedestrians, the legend in the upper left corner shows the instantaneous speeds of pedestrians 0-7, the arrows indicate the speed directions, and the five-pointed star is the destination. The black area is the blind area of the lidar scan while the robot is moving.
As can be seen from fig. 3, at 2.5 s after the robot starts, pedestrians No. 2 and No. 5 are completely covered by pedestrians No. 3 and No. 7 and lie in the robot's blind area. Meanwhile, the position of pedestrian No. 6 causes local occlusion of pedestrian No. 0. In fig. 4, at 7.75 s, an occlusion relationship still exists between them. The invention outputs the pedestrian feature vector F = {f_t, f_{t-1}, f_{t-2}} through the encoder-decoder and, combined with the multi-step reward function, quickly captures pedestrian dynamics and selects the optimal behavior of the current robot to avoid collision. The simulation experiments show that the method can solve the obstacle avoidance and path planning problems of a robot in a real pedestrian environment, and demonstrate the safety of navigation and the effectiveness of obstacle avoidance in a multi-pedestrian environment.
The embodiments described in this specification are merely illustrative of the manner in which the inventive concepts may be implemented. The scope of the present invention should not be construed as limited to the specific forms set forth in the embodiments; it also covers the equivalents thereof that would occur to those skilled in the art based on the inventive concept.
Claims (8)
1. The robot path collision avoidance planning method based on deep reinforcement learning in the pedestrian environment is characterized by comprising the following steps of:
s1: defining a state space according to the states of the robot and the person at the current moment; modeling the mutual observation between the robot and the pedestrian as a partially observable Markov tuple according to the state space;
s2: setting a multi-step agent stage reward function, wherein the multi-step agent stage reward function comprises a penalty for collision between the robot and a dynamic obstacle, a constraint on the distance between the robot and the dynamic obstacle, a penalty on the distance between the robot and the target point, and a time cost;
s3: randomly generating a pedestrian trajectory data set formed from pedestrian states based on collision avoidance algorithms according to the pedestrians' action space;
s4: inputting the pedestrian trajectory data set into a codec network and extracting pedestrian motion features;
s5: estimating the robot's state-action value pair with a deep learning network according to the pedestrian motion features and the robot's current state and observation obtained from the partially observable Markov tuple;
s6: outputting the optimal action selected by the robot in the current state with the action neural network according to the estimated value of the robot's state-action value pair, scoring the robot's state-action with the evaluation neural network, and updating and optimizing the parameters of the codec network, the deep learning network, the action neural network and the evaluation neural network through reinforcement-learning iterative training combined with the multi-step agent stage reward function;
s7: in the robot path collision avoidance planning process, acquiring surrounding environment data, and firstly, coding the pedestrian state by utilizing a trained codec network to obtain pedestrian motion characteristics; estimating the state-action value pair of the robot by using the trained deep learning network according to the pedestrian motion characteristics and the current state and observation of the robot obtained by the partially observable Markov tuple; and finally, outputting the optimal action in the current state through the trained action neural network, and realizing the safe navigation in the dynamic pedestrian environment.
2. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the state space is expressed as:
s = [d_g, v_pref, v_x, v_y, r]
h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r]
wherein s represents the robot state, d_g represents the distance between the robot and the target point, v_pref represents the preferred speed of the robot, v_x, v_y represent the robot's speed along the x and y axes, and r represents the radius of the space occupied by the robot; h_i represents the state of the i-th pedestrian, p_x, p_y represent the pedestrian's position, v_x, v_y represent the pedestrian's speed along the x and y axes, r_i represents the radius of the space occupied by the i-th pedestrian, d_i represents the distance between the robot and the i-th pedestrian, and r_i + r represents the minimum safe distance between the robot and the pedestrian.
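Purely as an illustrative sketch (not part of the claims), the state vectors of claim 2 could be assembled as follows; the helper names are hypothetical:

```python
import math

def robot_state(px, py, gx, gy, v_pref, vx, vy, r):
    """Robot state s = [d_g, v_pref, v_x, v_y, r], where d_g is the
    distance from the robot position (px, py) to the goal (gx, gy)."""
    d_g = math.hypot(gx - px, gy - py)
    return [d_g, v_pref, vx, vy, r]

def pedestrian_state(px, py, vx, vy, r_i, rob_x, rob_y, rob_r):
    """Pedestrian state h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r],
    where d_i is the robot-pedestrian distance and r_i + r is the
    minimum safe separation."""
    d_i = math.hypot(px - rob_x, py - rob_y)
    return [px, py, vx, vy, r_i, d_i, r_i + rob_r]
```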
3. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the Markov tuple is expressed as (S, A, P, R, Ω, O, γ), wherein S represents the state space, A represents the action space, P represents the state transition probability, R represents the reward, Ω represents the observation probability distribution, O represents the mapping from the state space to the observation space, and γ represents the discount rate.
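As an illustrative aside (not part of the claims), the tuple of claim 3 maps naturally onto a small container type; all field names here are hypothetical:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class POMDP:
    """Partially observable Markov tuple (S, A, P, R, Omega, O, gamma)."""
    states: Any            # S: state space
    actions: Any           # A: action space
    transition: Callable   # P: state transition probability
    reward: Callable       # R: reward function
    obs_prob: Callable     # Omega: observation probability distribution
    obs_map: Callable      # O: mapping from state space to observation space
    gamma: float           # discount rate
```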
4. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the multi-step agent stage reward function is as follows:
wherein N represents the total number of steps of the multi-step agent stage reward function, r_t represents the reward at step t, T represents the total number of decisions in the navigation process, γ represents the discount rate, and k represents the current step number.
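The formula image of claim 4 did not survive extraction; from the variable definitions it appears to be a multi-step discounted return. Assuming the standard n-step discounted sum (an assumption, not the patent's verbatim formula), a minimal sketch:

```python
def multi_step_return(rewards, gamma, n):
    """Discounted sum of the next n step rewards, assumed to be
    R = sum_{k=0}^{n-1} gamma**k * r_{t+k}.
    `rewards` holds r_t, r_{t+1}, ...; n is capped at len(rewards)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[:n]))
```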
5. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: said r_t is as follows:
wherein t represents the current time, s_t represents the joint state at time t during robot navigation, a_t represents the action at time t, and d_t represents the distance between the robot and the pedestrian at time t; when d_t is smaller than 0, a collision has occurred and the robot is penalized; when the distance between the robot and the pedestrian is smaller than 0.15, a penalty is applied according to the degree of proximity; when the robot reaches the target position within the prescribed time, the robot is rewarded, the reward being inversely related to the arrival time.
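The piecewise formula for r_t is likewise missing from the extraction. A sketch of the cases described in claim 5 follows; the specific coefficients (collision penalty, goal reward, scaling factors) are hypothetical placeholders, not values from the patent:

```python
def step_reward(d_t, reached_goal, t, t_max,
                collision_penalty=-0.25, goal_reward=1.0,
                safe_dist=0.15):
    """Hypothetical per-step reward following claim 5's cases:
    collision (d_t < 0) is penalized; closeness below 0.15 is
    penalized in proportion to the intrusion; reaching the goal
    within the allowed time is rewarded, later arrivals less."""
    if d_t < 0:
        return collision_penalty
    if d_t < safe_dist:
        return -0.1 * (safe_dist - d_t) / safe_dist
    if reached_goal and t <= t_max:
        return goal_reward * (1.0 - t / (2.0 * t_max))
    return 0.0
```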
6. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the pedestrian trajectory data set D described in step S3 is generated by three collision avoidance algorithms: RVO, ORCA and SFM.
7. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the encoder-decoder network consists of an encoder and a decoder; the input of the encoder is the pedestrian states in the pedestrian trajectory data set, and the output of the decoder is the pedestrian motion features F = {f_t, f_{t-1}, f_{t-2}}, where f_t represents the motion feature of the pedestrian at time t.
8. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the deep neural network consists, in sequence, of a multi-layer perceptron network, a softmax activation function, a multi-layer perceptron network and a fully connected layer; the inputs of the deep neural network are the robot's current state s_t, its observation o_t and the pedestrian motion features F, and its output is the estimate of the robot's state-action value.
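As an illustrative sketch of the layer ordering in claim 8 (MLP, then softmax, then MLP, then a fully connected output layer) — not the patent's actual network, with hypothetical parameter names and toy dimensions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mlp(x, w, b):
    # One ReLU hidden layer standing in for a multi-layer perceptron
    return np.maximum(0.0, w @ x + b)

def q_estimate(s_t, o_t, feat, params):
    """Value-network sketch: concatenate robot state s_t, observation
    o_t and pedestrian motion features F, pass through MLP -> softmax
    (attention-style weighting) -> MLP -> fully connected layer, and
    return a scalar state-action value estimate. `params` holds the
    hypothetical weight matrices and biases."""
    x = np.concatenate([s_t, o_t, feat])
    h1 = mlp(x, params["w1"], params["b1"])
    a = softmax(h1)
    h2 = mlp(a, params["w2"], params["b2"])
    return float(params["w_out"] @ h2 + params["b_out"])
```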
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310437715.2A CN116360454A (en) | 2023-04-18 | 2023-04-18 | Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116360454A true CN116360454A (en) | 2023-06-30 |
Family
ID=86909203
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310437715.2A Pending CN116360454A (en) | 2023-04-18 | 2023-04-18 | Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116360454A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116562332A (en) * | 2023-07-10 | 2023-08-08 | 长春工业大学 | Robot social movement planning method in man-machine co-fusion environment |
CN116562332B (en) * | 2023-07-10 | 2023-09-12 | 长春工业大学 | Robot social movement planning method in man-machine co-fusion environment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chiang et al. | RL-RRT: Kinodynamic motion planning via learning reachability estimators from RL policies | |
CN113110592B (en) | Unmanned aerial vehicle obstacle avoidance and path planning method | |
Bai et al. | Intention-aware online POMDP planning for autonomous driving in a crowd | |
CN102819264B (en) | Path planning Q-learning initial method of mobile robot | |
CN102402712B (en) | Robot reinforced learning initialization method based on neural network | |
Cao et al. | Target search control of AUV in underwater environment with deep reinforcement learning | |
Wang et al. | A survey of learning‐based robot motion planning | |
Eiffert et al. | Path planning in dynamic environments using generative rnns and monte carlo tree search | |
CN112034887A (en) | Optimal path training method for unmanned aerial vehicle to avoid cylindrical barrier to reach target point | |
Kanezaki et al. | Goselo: Goal-directed obstacle and self-location map for robot navigation using reactive neural networks | |
CN116360454A (en) | Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment | |
CN113391633A (en) | Urban environment-oriented mobile robot fusion path planning method | |
Liu et al. | Reinforcement learning-based collision avoidance: Impact of reward function and knowledge transfer | |
US11911902B2 (en) | Method for obstacle avoidance in degraded environments of robots based on intrinsic plasticity of SNN | |
Martinez-Baselga et al. | Improving robot navigation in crowded environments using intrinsic rewards | |
Alcalde et al. | DA-SLAM: Deep Active SLAM based on Deep Reinforcement Learning | |
CN114396949B (en) | DDPG-based mobile robot apriori-free map navigation decision-making method | |
CN116202526A (en) | Crowd navigation method combining double convolution network and cyclic neural network under limited view field | |
CN116127853A (en) | Unmanned driving overtaking decision method based on DDPG (distributed data base) with time sequence information fused | |
CN115031753A (en) | Driving condition local path planning method based on safety potential field and DQN algorithm | |
Tang et al. | Reinforcement learning for robots path planning with rule-based shallow-trial | |
Martovytskyi et al. | Approach to building a global mobile agent way based on Q-learning | |
Xiaoxian et al. | Obstacle Avoidance Algorithm for Mobile Robot Based on Deep Reinforcement Learning in Dynamic Environments | |
CN113033893B (en) | Method for predicting running time of automatic guided vehicle of automatic container terminal | |
CN117606490B (en) | Collaborative search path planning method for autonomous underwater vehicle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||