CN116360454A - Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment - Google Patents

Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment

Info

Publication number
CN116360454A
CN116360454A
Authority
CN
China
Prior art keywords
robot
pedestrian
state
action
reinforcement learning
Prior art date
Legal status
Pending
Application number
CN202310437715.2A
Other languages
Chinese (zh)
Inventor
张建明
曹晋瑜
朱骞
朱科
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310437715.2A priority Critical patent/CN116360454A/en
Publication of CN116360454A publication Critical patent/CN116360454A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0238Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors
    • G05D1/024Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors in combination with a laser
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0276Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Optics & Photonics (AREA)
  • Electromagnetism (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment, and belongs to the technical field of deep learning and mobile robot navigation. The invention fuses several collision avoidance algorithms to generate a pedestrian trajectory data set composed of pedestrian states; a codec (encoder-decoder) network encodes and decodes these pedestrian states, and the resulting pedestrian motion features, together with the current state and observation of the robot obtained from a partially observable Markov tuple, are fed into a deep neural network to estimate the robot's state-action values. An action neural network outputs the optimal action of the robot in its current state, an evaluation neural network scores the robot's state-action pairs, and the networks are trained iteratively with reinforcement learning. In the execution stage, a laser radar acquires surrounding environment data, so that the mobile robot navigates safely in the pedestrian environment. The method effectively alleviates the short-sightedness and occlusion problems of robot navigation in dynamic environments and improves the safety and timeliness of the robot's obstacle avoidance behavior.

Description

Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment
Technical Field
The invention belongs to the technical field of deep learning and mobile robot navigation, and particularly relates to a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment.
Background
With the rapid development of the robot industry in recent years, intelligent mobile robots are increasingly applied in transportation, logistics, social services and emergency rescue. An intelligent mobile robot senses and understands the external environment by means of the sensors it carries, makes real-time decisions according to task requirements, performs closed-loop control, and operates autonomously or semi-autonomously, with a certain self-learning and adaptive capability in known or unknown environments. The most fundamental technologies of intelligent mobile robots are navigation and obstacle avoidance. Navigation is an important problem to be solved when an intelligent mobile robot performs path planning: it refers to the process in which the mobile robot senses the environment and its own state through sensors and learning, and moves autonomously to a target in an environment containing obstacles.
Robot path planning means that the robot senses the surrounding environment with various sensors and autonomously searches for a collision-free path from a starting point to an end point. Traditional navigation and obstacle avoidance methods mainly include search-based methods, sampling-based methods and the artificial potential field method. Search-based methods mainly include Dijkstra's algorithm and the A* algorithm. Dijkstra's algorithm works in a greedy manner and solves the shortest path problem from a single node to other nodes in a directed graph; its main characteristic is that the node selected in each iteration is the child node nearest to the current node. To guarantee that the finally searched path is shortest, the shortest paths from the initial node to all traversed nodes are updated in each iteration, and the algorithm terminates when the searched range covers the target point. The A* algorithm is a heuristic search algorithm: a heuristic rule is established during the search to measure the distance between the current search position and the target position, so that the search direction preferentially points toward the target point, which improves the search efficiency. Sampling-based methods include the probabilistic roadmap method, the random tree method and so on. The probabilistic roadmap method samples randomly in the path search space to form a roadmap and searches paths on it. The random tree method guides a path tree to grow in the search space through sampling points to form paths. The basic idea of the artificial potential field method is derived from the concept of a "field" in physics: the algorithm models the motion of the robot in the environment as motion in an abstract artificial field.
Reinforcement learning (RL) learns a mapping from environmental states to actions, with the goal of letting the agent obtain the greatest cumulative reward during its interaction with the environment; a Markov decision process is commonly used to model RL problems. In robot navigation, the components that most strongly influence the navigation result fall into two parts: environment perception and decision control. Deep learning and reinforcement learning can respectively solve the perception-reasoning and decision-control problems in navigation, and deep reinforcement learning has achieved remarkable results on navigation decision problems in recent years.
However, in reinforcement learning the design of the reward function is the key to driving the agent to learn the optimal strategy, and an improper reward function causes the agent to learn an inefficient or even erroneous strategy. Existing algorithms use a single-step reward function, which, combined with the inherent trial-and-error of reinforcement learning, easily leads to an excessively high variance of the Q value. The high variance of the Q value slows the convergence of the training return function.
In intelligent mobile robot application scenarios, the situation of surrounding pedestrians cannot always be fully observed. Occlusion is very common in highly dynamic, partially observable environments. Existing crowd navigation methods often fail to predict human motion trajectories and assume that complete environmental knowledge is available. When deployed in the real world, these algorithms only consider avoiding humans that have been detected or observed; therefore, when an occluded human suddenly appears on the robot's path, a collision may occur.
These factors place higher requirements on the navigation and obstacle avoidance of intelligent mobile robots: the robot needs a certain adaptability to unknown environments and must be able to navigate in pedestrian environments.
Disclosure of Invention
In order to solve the problems of slow convergence in training and insufficient recognition of dynamic pedestrians in the prior art, the invention provides a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment.
The technical scheme adopted for solving the technical problems is as follows:
a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment comprises the following steps:
s1: defining a state space according to the states of the robot and the person at the current moment; modeling the mutual observation between the robot and the pedestrian as a partially observable Markov tuple according to the state space;
s2: setting a multi-step proxy stage rewarding function, wherein the multi-step proxy stage rewarding function comprises punishment on collision of the robot and the dynamic obstacle, constraint on distance between the robot and the dynamic obstacle, punishment on distance between the robot and a target point and time cost;
s3: randomly generating a pedestrian track data set formed by pedestrian states based on collision avoidance algorithms according to the action space of the pedestrians;
s4: inputting the pedestrian track data set D into a codec network, and extracting pedestrian motion characteristics;
s5: estimating the state-action value pair of the robot by using a deep learning network according to the pedestrian motion characteristics and the current state and observation of the robot obtained from the partially observable Markov tuple;
s6: outputting the optimal action selected by the robot in the current state by utilizing the action neural network according to the estimated value of the state-action value pair of the robot, scoring the state-action of the robot by utilizing the evaluation neural network, and updating parameters of the optimized codec network, the deep learning network, the action neural network and the evaluation neural network by utilizing reinforcement learning iterative training and combining a multi-step agent stage rewarding function;
s7: in the path collision avoidance planning process, firstly, coding the pedestrian state by utilizing a trained codec network to obtain pedestrian motion characteristics; estimating the state-action value pair of the robot by using the trained deep learning network according to the pedestrian motion characteristics and the current state and observation of the robot obtained by the partially observable Markov tuple; and finally, outputting the optimal action in the current state through the trained action neural network, and realizing the safe navigation in the dynamic pedestrian environment.
Further, the state space is expressed as:
s = [d_g, v_pref, v_x, v_y, r]
h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r]
wherein s represents the robot state, d_g represents the distance between the robot and the target point, v_pref represents the preferred speed of the robot, v_x, v_y represent the speed of the robot along the x axis and the y axis, and r represents the radius of the space occupied by the robot; h_i represents the state of the i-th pedestrian, p_x, p_y represent the position of the pedestrian, v_x, v_y represent the speed of the pedestrian along the x axis and the y axis, r_i represents the radius of the space occupied by the i-th pedestrian, d_i represents the distance between the robot and the i-th pedestrian, and r_i + r represents the minimum safe distance between the robot and the person.
Further, the Markov tuple is represented as (S, A, P, R, Ω, O, γ), where S represents the state space, A represents the action space, P represents the transition state, R represents the reward, Ω represents the observation probability distribution, O represents the mapping from the state space to the observation space, and γ represents the discount rate.
Further, the multi-step proxy stage rewards function is as follows:
R_t = Σ_{k=0}^{N-1} γ^k r_{t+k}
where N represents the total number of steps of the multi-step proxy stage reward function, r_t represents the reward at step t, T represents the total number of decisions during navigation, γ represents the discount rate, and k represents the current step number.
Further, the reward r_t is a piecewise function of the joint state s_t^jn and the action a_t, wherein t represents the current time, s_t^jn represents the joint state of the robot and the pedestrians at time t during navigation, a_t represents the action at time t, and d_t represents the distance between the robot and the pedestrian at time t. When the distance between the robot and the pedestrian is smaller than 0, a collision occurs and the robot is penalized; when the distance between the robot and the pedestrian is smaller than 0.15, a penalty is applied according to the degree of proximity; when the robot reaches the target position within the prescribed time, the robot is rewarded, and the reward is inversely related to the arrival time.
Further, the pedestrian trajectory data set D in step S3 is generated by three collision avoidance algorithms RVO, ORCA, SFM.
Further, the codec network is composed of an encoder and a decoder; the input of the encoder is the pedestrian states in the pedestrian trajectory data set, and the output of the decoder is the pedestrian motion feature F = {f_t, f_{t-1}, f_{t-2}}, where f_t denotes the motion feature of the pedestrian at time t.
Further, the deep neural network is composed, in order, of a multi-layer perceptron network, a softmax activation function, a multi-layer perceptron network and a fully connected layer; the input of the deep neural network is the current state s_t and observation o_t of the robot together with the pedestrian motion feature F, and the output of the deep neural network is the estimated value of the robot's state-action value pair.
The beneficial effects of the invention are mainly as follows:
(1) The invention combines the three collision avoidance algorithms RVO, ORCA and SFM to randomly generate pedestrian data, which enhances the robustness of the data and solves the problem that a single model is difficult to generalize to complex dynamic pedestrian environments.
(2) The invention uses the multi-step agent stage rewarding function, reduces the variance of the cost function and accelerates the training convergence speed.
(3) Through the codec network structure, the invention uses observed pedestrian behavior as an additional sensor measurement to estimate the positions of occluded pedestrians and integrates this mechanism into deep reinforcement learning, which effectively improves the obstacle avoidance success rate and navigation efficiency of the robot in pedestrian environments, alleviates the short-sightedness and occlusion problems of robot navigation in dynamic environments, and improves the safety and timeliness of the obstacle avoidance behavior.
Drawings
Fig. 1 is a flowchart of a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment.
Fig. 2 is a diagram of the model training network of the present invention.
Fig. 3 is a simulation result at 2.5s after the robot starts.
Fig. 4 is a simulation result at 7.75s after the robot starts.
Detailed Description
The invention is further described below with reference to the accompanying drawings and some examples.
The invention discloses a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment, whose flow chart is shown in Fig. 1. The method defines the state space and the partially observable Markov tuple; sets a multi-step reward function that takes into account the influence of the robot's future multi-step actions on the current action; randomly generates a pedestrian trajectory data set D; builds a local perception map of the robot from the data set; initializes the state space of the robot; extracts pedestrian motion features with a multi-layer encoder-decoder network; and trains an action-evaluation network by reinforcement learning to predict pedestrian trajectories in the simulation scene.
As shown in fig. 1-2, the specific implementation steps are as follows:
step 1, defining a state space according to states of a robot and a person at the current moment:
s = [d_g, v_pref, v_x, v_y, r]    (1)
h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r]    (2)
wherein s represents the robot state, d_g represents the distance between the robot and the target point, v_pref represents the preferred speed of the robot, v_x, v_y represent the speed of the robot along the x axis and the y axis, and r represents the radius of the space occupied by the robot; h_i represents the state of the i-th pedestrian, p_x, p_y represent the position of the pedestrian, v_x, v_y represent the speed of the pedestrian along the x axis and the y axis, r_i represents the radius of the space occupied by the i-th pedestrian, d_i represents the distance between the robot and the i-th pedestrian, and r_i + r represents the minimum safe distance between the robot and the person.
The mutual observation between the robot and the pedestrians is modeled as a partially observable Markov tuple (S, A, P, R, Ω, O, γ), where S represents the state space, A represents the action space, P represents the transition state, R represents the reward, Ω represents the observation probability distribution, O represents the mapping from the state space to the observation space, and γ represents the discount rate. Within each time step, the robot selects an action a ∈ A given an observation o ∈ O. According to the Markov assumption, the next state s' is determined only by the current state s; the observation probability distribution Ω(o, s', a) and the state transition P(s, a, s') are determined by conditional probabilities.
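As a minimal sketch of the state representation defined above, the following Python code encodes the robot state s and a pedestrian state h_i as vectors; the class and field names are hypothetical, and only the vector layouts come from the definitions above.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class RobotState:
    d_g: float     # distance between the robot and the target point
    v_pref: float  # preferred speed of the robot
    v_x: float     # robot speed along the x axis
    v_y: float     # robot speed along the y axis
    r: float       # radius of the space occupied by the robot

    def to_vector(self) -> np.ndarray:
        # s = [d_g, v_pref, v_x, v_y, r]
        return np.array([self.d_g, self.v_pref, self.v_x, self.v_y, self.r])

@dataclass
class PedestrianState:
    p_x: float    # pedestrian position, x
    p_y: float    # pedestrian position, y
    v_x: float    # pedestrian speed along the x axis
    v_y: float    # pedestrian speed along the y axis
    r_i: float    # radius of the space occupied by the pedestrian
    d_i: float    # distance between the robot and this pedestrian
    r_sum: float  # r_i + r, minimum safe distance between robot and person

    def to_vector(self) -> np.ndarray:
        # h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r]
        return np.array([self.p_x, self.p_y, self.v_x, self.v_y,
                         self.r_i, self.d_i, self.r_sum])
```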
Step 2: setting a multi-step proxy stage rewarding function, wherein the multi-step proxy stage rewarding function comprises punishment on collision of the robot and the dynamic obstacle, constraint on distance between the robot and the dynamic obstacle, punishment on distance between the robot and a target point and time cost;
in this embodiment, the multi-step proxy stage rewards function is defined as follows:
R_t = Σ_{k=0}^{N-1} γ^k r_{t+k}
where N represents the total number of steps of the multi-step proxy stage reward function, r_t represents the reward at step t, T represents the total number of decisions during navigation, γ represents the discount rate, and k represents the current step number. The multi-step value N is taken from the experience stored in the temporary replay buffer. For a training episode of T steps, the buffer stores the trajectory from the initial state s_0 to the terminal state s_T, whose length is T. Taking into account the influence of the robot's future multi-step action selection on the current reward value makes the robot's actions converge within a safe action interval, reduces the variance of the cost function and accelerates training convergence.
The reward r_t is a piecewise function of the joint state s_t^jn and the action a_t, where t represents the current time, s_t^jn represents the joint state of the robot and the pedestrians at time t during navigation, a_t represents the action at time t, and d_t represents the distance between the robot and the pedestrian at time t. The reward function comprises a penalty for collision with a dynamic obstacle, a constraint on the distance between the robot and the dynamic obstacle, a penalty on the distance from the target point, and a time cost. When the distance between the robot and the pedestrian is smaller than 0, a collision occurs and the robot is penalized (the dynamic obstacle collision penalty); when the distance is smaller than 0.15, a penalty is applied according to the degree of proximity (the constraint on the distance between the robot and the dynamic obstacle); when the robot reaches the target position within the specified time, it is rewarded, and the reward is inversely related to the arrival time (the penalty on the distance from the target point and the time cost).
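A minimal sketch of the multi-step agent stage reward is given below. Only the thresholds 0 and 0.15 and the N-step discounted accumulation come from the description; the numerical constants and function names are illustrative assumptions.

```python
import numpy as np

def step_reward(d_t, reached_goal, t, t_limit,
                collision_penalty=-0.25, time_cost=-0.01):
    """Illustrative single-step reward r_t (constants are assumptions)."""
    if d_t < 0:                        # robot-pedestrian overlap -> collision
        return collision_penalty
    if d_t < 0.15:                     # too close -> penalty grows with proximity
        return -0.1 * (0.15 - d_t)
    if reached_goal and t <= t_limit:  # goal reward, smaller for later arrival
        return 1.0 * (1.0 - t / t_limit)
    return time_cost                   # otherwise, a small time cost

def n_step_return(rewards, gamma, N):
    """N-step discounted return R_t = sum_{k=0}^{N-1} gamma^k * r_{t+k},
    computed from rewards stored in a temporary replay buffer."""
    rewards = np.asarray(rewards[:N], dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))
```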
Step 3: random generation of pedestrian track data based on collision algorithm according to pedestrian action spaceA set D; in one embodiment of the invention, three RVO, ORCA, SFM collision avoidance algorithms are adopted in the action space of the pedestrians, and a pedestrian motion trail data set D is randomly generated according to the proportion of 1:1:1, wherein the pedestrian motion trail data set consists of a plurality of pedestrian states h i =[p x ,p y ,v x ,v y ,r i ,d i ,r i +r]The composition is formed.
Step 4: a surrounding grid map of the robot is generated for capturing the locations of surrounding dynamic pedestrians.
Step 5: the encoder and decoder are constructed in a three-layer structure, wherein each encoder layer comprises a feedforward network and a self-attention module, and each decoder layer comprises a self-attention module, a codec attention module and a feedforward network. The self-attention module is used for learning the relation between the current behavior of the pedestrian and the previous behavior, and the encoding and decoding attention module is used for learning the current behavior of the pedestrian and the motion characteristic F= { F of the pedestrian to be encoded t ,f t-1 ,f t-2 Relationships of }. As shown in fig. 2, the input of the encoder is the pedestrian state in the pedestrian track data set, and the output of the decoder is the pedestrian motion characteristic f= { F t ,f t-1 ,f t-2 And (f), where f t The motion characteristics of the pedestrian at the time t are shown.
Step 6: and estimating the state-action value pair of the robot by using the deep learning network.
The current state s_t and observation o_t of the robot, together with the pedestrian motion feature F, are input into the deep neural network to estimate the robot's state-action value pair. In this embodiment, the deep neural network consists, in order, of 3 multi-layer perceptron networks, a softmax activation function, 2 multi-layer perceptron networks and 1 fully connected layer.
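The following PyTorch sketch mirrors that layer ordering (3 MLPs, softmax, 2 MLPs, fully connected output); the layer widths and activation choices are assumptions.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Sketch of the state-action value estimator (widths are assumed)."""
    def __init__(self, input_dim, hidden=128):
        super().__init__()
        mlp = lambda i, o: nn.Sequential(nn.Linear(i, o), nn.ReLU())
        self.front = nn.Sequential(mlp(input_dim, hidden), mlp(hidden, hidden),
                                   mlp(hidden, hidden))        # 3 MLPs
        self.softmax = nn.Softmax(dim=-1)                      # softmax activation
        self.back = nn.Sequential(mlp(hidden, hidden), mlp(hidden, hidden))  # 2 MLPs
        self.out = nn.Linear(hidden, 1)                        # fully connected output

    def forward(self, robot_state, observation, ped_feature):
        # concatenate s_t, o_t and the pedestrian motion feature F
        x = torch.cat([robot_state, observation, ped_feature], dim=-1)
        x = self.front(x)
        x = self.softmax(x)
        x = self.back(x)
        return self.out(x)   # estimated state-action value
```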
Step 7: and outputting the optimal action selected by the robot in the current state by using an action neural network in the PPO algorithm, scoring the state-action of the robot by using an evaluation neural network in the PPO algorithm, and selecting the action with the largest rewarding value for updating. By using the course learning method, the number of pedestrians in the environment is gradually increased, the pedestrian speed is changed, and the like. And (5) carrying out iterative training, updating and optimizing network parameters, and selecting an optimal motion trail.
In one implementation of the invention, the simulation environment is built on the GYM platform; the scene is a square space with a side length of 12 meters, the initial positions of the pedestrians are randomly distributed in the space, and there are 8 dynamic pedestrians. In this environment, the trained reinforcement learning network selects the optimal collision avoidance path and completes the safe navigation of the robot.
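A hypothetical evaluation loop for the trained policy in such a scene is sketched below; the env and policy objects are placeholders following the classic gym step interface, not classes defined by the invention.

```python
def evaluate(env, policy, episodes=100):
    """Roll out the trained policy and report success and collision rates."""
    successes, collisions = 0, 0
    for _ in range(episodes):
        obs = env.reset()
        done, info = False, {}
        while not done:
            action = policy.select_action(obs)       # optimal action in current state
            obs, reward, done, info = env.step(action)
        if info.get("event") == "reach_goal":
            successes += 1
        elif info.get("event") == "collision":
            collisions += 1
    return successes / episodes, collisions / episodes
```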
The simulation results of the robot path planning process are shown in Fig. 3 and Fig. 4, in which the square marker is the robot, the circular markers numbered 0-7 represent the pedestrians, the legends in the upper left corner give the instantaneous speeds of pedestrians 0-7, the arrows indicate the velocity directions, and the five-pointed star is the destination. The black area is the blind zone of the laser radar scan while the robot is moving.
As can be seen from Fig. 3, 2.5 s after the robot starts, pedestrians No. 2 and No. 5 are completely covered by pedestrians No. 3 and No. 7 and lie in the robot's blind zone. Meanwhile, the position of pedestrian No. 6 causes local occlusion of pedestrian No. 0. In Fig. 4, at 7.75 s, an occlusion relationship still exists between them. The invention outputs the pedestrian feature vector F = {f_t, f_{t-1}, f_{t-2}} through the codec and, with the multi-step reward function, quickly captures the pedestrian dynamics and selects the optimal action of the current robot so as to avoid collision. The experimental simulation shows that the method can solve the obstacle avoidance and path planning problems of the robot in a real pedestrian environment, and demonstrates the safety of navigation and the effectiveness of obstacle avoidance in a multi-pedestrian environment.
The embodiments described in this specification are merely illustrative of the manner in which the inventive concepts may be implemented. The scope of the present invention should not be construed as being limited to the specific forms set forth in the embodiments; it also covers the equivalents thereof that would occur to one skilled in the art based on the inventive concept.

Claims (8)

1. The robot path collision avoidance planning method based on deep reinforcement learning in the pedestrian environment is characterized by comprising the following steps of:
s1: defining a state space according to the states of the robot and the person at the current moment; modeling the mutual observation between the robot and the pedestrian as a partially observable Markov tuple according to the state space;
s2: setting a multi-step proxy stage rewarding function, wherein the multi-step proxy stage rewarding function comprises punishment on collision of the robot and the dynamic obstacle, constraint on distance between the robot and the dynamic obstacle, punishment on distance between the robot and a target point and time cost;
s3: randomly generating a pedestrian track data set formed by pedestrian states based on collision avoidance algorithms according to the action space of the pedestrians;
s4: inputting the pedestrian track data set into a coder-decoder network, and extracting pedestrian motion characteristics;
s5: estimating the state-action value pair of the robot by using a deep learning network according to the pedestrian motion characteristics and the current state and observation of the robot obtained from the partially observable Markov tuple;
s6: outputting the optimal action selected by the robot in the current state by utilizing the action neural network according to the estimated value of the state-action value pair of the robot, scoring the state-action of the robot by utilizing the evaluation neural network, and updating parameters of the optimized codec network, the deep learning network, the action neural network and the evaluation neural network by utilizing reinforcement learning iterative training and combining a multi-step agent stage rewarding function;
s7: in the robot path collision avoidance planning process, acquiring surrounding environment data, and firstly, coding the pedestrian state by utilizing a trained codec network to obtain pedestrian motion characteristics; estimating the state-action value pair of the robot by using the trained deep learning network according to the pedestrian motion characteristics and the current state and observation of the robot obtained by the partially observable Markov tuple; and finally, outputting the optimal action in the current state through the trained action neural network, and realizing the safe navigation in the dynamic pedestrian environment.
2. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the state space is expressed as:
s = [d_g, v_pref, v_x, v_y, r]
h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r]
wherein s represents the robot state, d_g represents the distance between the robot and the target point, v_pref represents the preferred speed of the robot, v_x, v_y represent the speed of the robot along the x axis and the y axis, and r represents the radius of the space occupied by the robot; h_i represents the state of the i-th pedestrian, p_x, p_y represent the position of the pedestrian, v_x, v_y represent the speed of the pedestrian along the x axis and the y axis, r_i represents the radius of the space occupied by the i-th pedestrian, d_i represents the distance between the robot and the i-th pedestrian, and r_i + r represents the minimum safe distance between the robot and the person.
3. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the Markov tuple is expressed as (S, A, P, R, omega, O and gamma), wherein S represents a state space, A represents an action space, P represents a transition state, R represents a reward, omega represents an observation probability distribution, O represents a relation mapping from the state space to the observation space, and gamma represents a discount rate.
4. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the multi-step proxy stage rewards function is as follows:
R_t = Σ_{k=0}^{N-1} γ^k r_{t+k}
wherein N represents the total number of steps of the multi-step proxy stage reward function, r_t represents the reward at step t, T represents the total number of decisions during navigation, γ represents the discount rate, and k represents the current step number.
5. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the reward r_t is a piecewise function of the joint state s_t^jn and the action a_t, wherein t represents the current time, s_t^jn represents the joint state of the robot and the pedestrians at time t during navigation, a_t represents the action at time t, and d_t represents the distance between the robot and the pedestrian at time t; when the distance between the robot and the pedestrian is smaller than 0, a collision occurs and the robot is penalized; when the distance between the robot and the pedestrian is smaller than 0.15, a penalty is applied according to the degree of proximity; when the robot reaches the target position within the prescribed time, the robot is rewarded, and the reward is inversely related to the arrival time.
6. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the pedestrian trajectory data set D described in step S3 is generated by three collision avoidance algorithms RVO, ORCA, SFM.
7. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the codec network consists of an encoder and a decoder, the input of the encoder is the pedestrian states in the pedestrian trajectory data set, and the output of the decoder is the pedestrian motion feature F = {f_t, f_{t-1}, f_{t-2}}, where f_t denotes the motion feature of the pedestrian at time t.
8. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the deep neural network is composed, in order, of a multi-layer perceptron network, a softmax activation function, a multi-layer perceptron network and a fully connected layer; the input of the deep neural network is the current state s_t and observation o_t of the robot together with the pedestrian motion feature F, and the output of the deep neural network is the estimated value of the robot's state-action value pair.
CN202310437715.2A 2023-04-18 2023-04-18 Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment Pending CN116360454A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310437715.2A CN116360454A (en) 2023-04-18 2023-04-18 Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310437715.2A CN116360454A (en) 2023-04-18 2023-04-18 Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment

Publications (1)

Publication Number Publication Date
CN116360454A true CN116360454A (en) 2023-06-30

Family

ID=86909203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310437715.2A Pending CN116360454A (en) 2023-04-18 2023-04-18 Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment

Country Status (1)

Country Link
CN (1) CN116360454A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116562332A (en) * 2023-07-10 2023-08-08 长春工业大学 Robot social movement planning method in man-machine co-fusion environment
CN116562332B (en) * 2023-07-10 2023-09-12 长春工业大学 Robot social movement planning method in man-machine co-fusion environment

Similar Documents

Publication Publication Date Title
Chiang et al. RL-RRT: Kinodynamic motion planning via learning reachability estimators from RL policies
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
Bai et al. Intention-aware online POMDP planning for autonomous driving in a crowd
CN102819264B (en) Path planning Q-learning initial method of mobile robot
CN102402712B (en) Robot reinforced learning initialization method based on neural network
Cao et al. Target search control of AUV in underwater environment with deep reinforcement learning
Wang et al. A survey of learning‐based robot motion planning
Eiffert et al. Path planning in dynamic environments using generative rnns and monte carlo tree search
CN112034887A (en) Optimal path training method for unmanned aerial vehicle to avoid cylindrical barrier to reach target point
Kanezaki et al. Goselo: Goal-directed obstacle and self-location map for robot navigation using reactive neural networks
CN116360454A (en) Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
Liu et al. Reinforcement learning-based collision avoidance: Impact of reward function and knowledge transfer
US11911902B2 (en) Method for obstacle avoidance in degraded environments of robots based on intrinsic plasticity of SNN
Martinez-Baselga et al. Improving robot navigation in crowded environments using intrinsic rewards
Alcalde et al. DA-SLAM: Deep Active SLAM based on Deep Reinforcement Learning
CN114396949B (en) DDPG-based mobile robot apriori-free map navigation decision-making method
CN116202526A (en) Crowd navigation method combining double convolution network and cyclic neural network under limited view field
CN116127853A (en) Unmanned driving overtaking decision method based on DDPG (distributed data base) with time sequence information fused
CN115031753A (en) Driving condition local path planning method based on safety potential field and DQN algorithm
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Martovytskyi et al. Approach to building a global mobile agent way based on Q-learning
Xiaoxian et al. Obstacle Avoidance Algorithm for Mobile Robot Based on Deep Reinforcement Learning in Dynamic Environments
CN113033893B (en) Method for predicting running time of automatic guided vehicle of automatic container terminal
CN117606490B (en) Collaborative search path planning method for autonomous underwater vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination