CN116360454A - Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment - Google Patents
- Publication number
- CN116360454A (application number CN202310437715.2A)
- Authority
- CN
- China
- Prior art keywords
- robot
- pedestrian
- state
- action
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G05D1/0238 — Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors
- G05D1/024 — Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors in combination with a laser
- G05D1/0214 — Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
- G05D1/0221 — Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
- G05D1/0276 — Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment, belonging to the technical fields of deep learning and mobile robot navigation. The invention fuses several collision avoidance algorithms to generate a pedestrian trajectory data set composed of pedestrian states; an encoder-decoder network encodes and decodes these pedestrian states, and the resulting features, together with the robot's current state and observation obtained from a partially observable Markov tuple, are fed into a deep neural network to generate an estimated value of the robot's state-action value pair. An action neural network outputs the optimal action selected by the robot in the current state, an evaluation neural network scores the robot's state-action pairs, and the networks are trained iteratively based on reinforcement learning. In the execution stage, a lidar acquires surrounding environment data, realizing safe navigation of the mobile robot in a pedestrian environment; the method effectively alleviates the short-sightedness and occlusion problems of robot navigation in dynamic environments and improves the safety and timeliness of the robot's obstacle avoidance behavior.
Description
Technical Field
The invention belongs to the technical field of deep learning and mobile robot navigation, and particularly relates to a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment.
Background
With the rapid development of the robot industry in recent years, intelligent mobile robots are increasingly applied in transportation, logistics, social services and emergency rescue. An intelligent mobile robot senses and understands the external environment with its onboard sensors, makes real-time decisions according to task requirements, performs closed-loop control, operates in an autonomous or semi-autonomous mode, and has a certain self-learning and adaptive capacity in known or unknown environments. The most fundamental technologies of an intelligent mobile robot are navigation and obstacle avoidance. Navigation is a key problem that an intelligent mobile robot must solve to realize path planning: it refers to the process in which the mobile robot senses the environment and its own state through sensors and, through learning, moves autonomously to a target in an environment containing obstacles.
Robot path planning means that the robot senses the surrounding environment with various sensors and autonomously searches for a collision-free path from a starting point to an end point. Traditional navigation and obstacle avoidance methods mainly comprise search-based methods, sampling-based methods, the artificial potential field method, and so on. Search-based methods mainly comprise the Dijkstra algorithm and the A* algorithm. The Dijkstra algorithm adopts a greedy strategy to solve the single-source shortest path problem in a weighted directed graph; its main characteristic is that the node selected in each iteration is the nearest unexpanded node on the current frontier. To ensure that the final path is shortest, each iteration updates the shortest paths from the initial node to all traversed nodes, and the algorithm ends when the searched range covers the target point. The A* algorithm is a heuristic search algorithm: it establishes a heuristic rule during the search to measure the distance between the current search position and the target position, so that the search direction preferentially faces the target point, which improves search efficiency. Sampling-based methods include the probabilistic roadmap method, the rapidly exploring random tree method, and so on. The probabilistic roadmap method randomly samples the search space to form a roadmap on which paths are searched. The random tree method guides a path tree to grow through the search space via sampling points to form paths. The basic idea of the artificial potential field method is derived from the concept of "fields" in physics: the algorithm models the motion of the robot in the environment as motion in an abstract artificial "field".
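The greedy expansion described above can be illustrated with a minimal Dijkstra sketch (not part of the disclosure; the toy graph and function names are illustrative). A* would differ only in adding a heuristic estimate to the priority:

```python
import heapq

def dijkstra(graph, start, goal):
    """Shortest path on a weighted directed graph given as
    {node: [(neighbor, cost), ...]}. Returns (cost, path)."""
    frontier = [(0, start, [start])]            # (cost so far, node, path)
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)  # greedy: nearest unexpanded node
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(frontier, (cost + w, nxt, path + [nxt]))
    return float("inf"), []                     # goal unreachable

# A* would push (cost + heuristic(nxt), ...) instead of the bare cost.
grid = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 1)]}
print(dijkstra(grid, "S", "G"))  # (4, ['S', 'A', 'B', 'G'])
```

The search ends as soon as the goal node is popped, which is exactly the "range covers the target point" termination described above.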
Reinforcement learning (RL) learns a mapping from environmental states to actions, with the goal of letting the agent obtain the greatest cumulative reward during its interaction with the environment; a Markov decision process can be used to model RL problems. In robot navigation technology, the parts with the most obvious influence on the navigation effect are twofold: the environment perception part and the decision-control part. Deep learning and reinforcement learning can respectively address the perception-and-reasoning and decision-control problems in navigation, and deep reinforcement learning has achieved remarkable results on navigation decision problems in recent years.
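The goal of maximizing cumulative discounted reward on a Markov decision process can be sketched with value iteration on a hypothetical two-state problem (illustrative only; the MDP, state names and function are not from the disclosure):

```python
def value_iteration(states, P, R, gamma=0.9, tol=1e-6):
    """P[s][a] = list of (prob, next_state); R[s][a] = immediate reward.
    Returns the value of each state under the optimal policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not P[s]:                        # terminal state keeps value 0
                continue
            # Bellman optimality backup: best expected discounted return
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Hypothetical two-state MDP: from "s0", action "go" reaches the terminal
# "goal" state with reward 1, so V(s0) = 1 under any discount rate.
P = {"s0": {"go": [(1.0, "goal")]}, "goal": {}}
R = {"s0": {"go": 1.0}, "goal": {}}
V = value_iteration(["s0", "goal"], P, R)
print(V["s0"])  # 1.0
```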
However, in reinforcement learning the design of the reward function is the key to driving the agent to learn the optimal strategy, and an improper reward function leads the agent to learn an inefficient or even erroneous strategy. Existing algorithms use a single-step reward function, which, together with the inherent trial-and-error of reinforcement learning, easily produces an excessively high variance of the Q value; this high variance slows the convergence of the training return function.
In the application scenarios of intelligent mobile robots, the situation of the surrounding pedestrians must constantly be considered. Occlusion is very common when the environment is highly dynamic and only partially observable. Existing crowd navigation methods often fail to predict human motion trajectories and assume that complete environmental knowledge is provided. When deployed in the real world, these algorithms only consider avoiding detected or observed humans; therefore, when an occluded human suddenly appears on the robot's path, a collision may occur.
These challenges impose higher requirements on the navigation and obstacle avoidance of intelligent mobile robots: the robot needs a certain adaptability to unknown environments and must be able to navigate in pedestrian environments.
Disclosure of Invention
In order to solve the problems of slow convergence in training and insufficient recognition of dynamic pedestrians in the prior art, the invention provides a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment.
The technical scheme adopted for solving the technical problems is as follows:
a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment comprises the following steps:
s1: defining a state space according to the states of the robot and the person at the current moment; modeling the mutual observation between the robot and the pedestrian as a partially observable Markov tuple according to the state space;
s2: setting a multi-step agent stage reward function, wherein the multi-step agent stage reward function comprises a penalty for collision between the robot and a dynamic obstacle, a constraint on the distance between the robot and the dynamic obstacle, a penalty on the distance between the robot and the target point, and a time cost;
s3: randomly generating a pedestrian trajectory data set formed from pedestrian states based on collision avoidance algorithms according to the pedestrians' action space;
s4: inputting the pedestrian trajectory data set D into a codec network and extracting pedestrian motion features;
s5: estimating the robot's state-action value pair with a deep learning network according to the pedestrian motion features and the robot's current state and observation obtained from the partially observable Markov tuple;
s6: outputting the optimal action selected by the robot in the current state with the action neural network according to the estimated value of the robot's state-action value pair, scoring the robot's state-action with the evaluation neural network, and updating and optimizing the parameters of the codec network, the deep learning network, the action neural network and the evaluation neural network through reinforcement-learning iterative training combined with the multi-step agent stage reward function;
s7: in the path collision avoidance planning process, firstly, coding the pedestrian state by utilizing a trained codec network to obtain pedestrian motion characteristics; estimating the state-action value pair of the robot by using the trained deep learning network according to the pedestrian motion characteristics and the current state and observation of the robot obtained by the partially observable Markov tuple; and finally, outputting the optimal action in the current state through the trained action neural network, and realizing the safe navigation in the dynamic pedestrian environment.
Further, the state space is expressed as:
s = [d_g, v_pref, v_x, v_y, r]
h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r]
wherein s represents the robot state, d_g represents the distance between the robot and the target point, v_pref represents the preferred speed of the robot, v_x, v_y represent the speed of the robot along the x and y axes, and r represents the radius of the space occupied by the robot; h_i represents the state of the i-th pedestrian, p_x, p_y represent the position of the pedestrian, v_x, v_y represent the speed of the pedestrian along the x and y axes, r_i represents the radius of the space occupied by the i-th pedestrian, d_i represents the distance between the robot and the i-th pedestrian, and r_i + r represents the minimum safe distance between the robot and the person.
Further, the Markov tuple is represented as (S, A, P, R, Ω, O, γ), where S represents the state space, A represents the action space, P represents the state transition probability, R represents the reward, Ω represents the observation probability distribution, O represents the mapping from the state space to the observation space, and γ represents the discount rate.
Further, the multi-step agent stage reward function is as follows:
R_t = Σ_{k=0}^{N-1} γ^k · r_{t+k}
wherein N represents the total number of steps of the multi-step agent stage reward function, r_t represents the reward of step t, T represents the total number of decisions in the navigation process, γ represents the discount rate, and k represents the current step number.
Further, said r_t is as follows:
r_t(s_t^jn, a_t) = r_col if d_t < 0; α(d_t - 0.15) if 0 ≤ d_t < 0.15; r_goal(t) if the robot reaches the target position; 0 otherwise
wherein t represents the current time, s_t^jn represents the joint state at time t in robot navigation, a_t represents the action at time t, and d_t represents the distance between the robot and the pedestrian at time t; r_col < 0 is the collision penalty, α > 0 scales the proximity penalty, and r_goal(t) is a goal reward that decreases with the arrival time. When the distance between the robot and the pedestrian is less than 0, a collision has occurred and the robot is penalized; when the distance is less than 0.15, the robot is penalized according to the degree of proximity; when the robot reaches the target position within the prescribed time, it is rewarded, and the arrival time is inversely related to the reward.
Further, the pedestrian trajectory data set D in step S3 is generated by three collision avoidance algorithms: RVO, ORCA and SFM.
Further, the codec network is composed of an encoder and a decoder; the input of the encoder is the pedestrian states in the pedestrian trajectory data set, and the output of the decoder is the pedestrian motion feature F = {f_t, f_{t-1}, f_{t-2}}, where f_t represents the motion feature of the pedestrian at time t.
Further, the deep neural network is composed, in order, of a multi-layer perceptron network, a softmax activation function, a multi-layer perceptron network and a fully connected layer; the inputs of the deep neural network are the robot's current state s_t and observation o_t together with the pedestrian motion feature F, and its output is the estimated value of the robot's state-action value pair.
The beneficial effects of the invention are mainly as follows:
(1) The invention combines three collision avoidance algorithms, RVO, ORCA and SFM, to randomly generate pedestrian data, enhancing the robustness of the data and solving the problem that a single model is difficult to generalize to complex dynamic pedestrian environments.
(2) The invention uses the multi-step agent stage reward function, which reduces the variance of the value function and accelerates training convergence.
(3) Through the encoder-decoder network structure, the invention uses observed pedestrian behavior as an additional sensor measurement to estimate the positions of occluded pedestrians, and integrates this mechanism into deep reinforcement learning, effectively improving the robot's obstacle avoidance success rate and navigation efficiency in pedestrian environments, alleviating the short-sightedness and occlusion problems of robot navigation in dynamic environments, and improving the safety and timeliness of obstacle avoidance behavior.
Drawings
Fig. 1 is a flowchart of a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment.
Fig. 2 is a diagram of the model training network of the present invention.
Fig. 3 is a simulation result at 2.5s after the robot starts.
Fig. 4 is a simulation result at 7.75s after the robot starts.
Detailed Description
The invention is further described below with reference to the accompanying drawings and some examples.
The invention discloses a robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment, whose flow chart is shown in fig. 1. The method comprises: defining a state space and a partially observable Markov tuple; setting a multi-step reward function that accounts for the influence of future multi-step robot actions on the current action; randomly generating a pedestrian trajectory data set D; building a local perception map of the robot from the data set and initializing the robot's state space; extracting pedestrian motion features with a multi-layer encoder-decoder network; and training an action-evaluation network through reinforcement learning to predict pedestrian trajectories in the simulation scene.
As shown in fig. 1-2, the specific implementation steps are as follows:
Step 1: Define the state space according to the states of the robot and the pedestrians at the current moment:
s = [d_g, v_pref, v_x, v_y, r] (1)
h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r] (2)
wherein s represents the robot state, d_g represents the distance between the robot and the target point, v_pref represents the preferred speed of the robot, v_x, v_y represent the speed of the robot along the x and y axes, and r represents the radius of the space occupied by the robot; h_i represents the state of the i-th pedestrian, p_x, p_y represent the position of the pedestrian, v_x, v_y represent the speed of the pedestrian along the x and y axes, r_i represents the radius of the space occupied by the i-th pedestrian, d_i represents the distance between the robot and the i-th pedestrian, and r_i + r represents the minimum safe distance between the robot and the person.
The mutual observation between the robot and the pedestrians is modeled as a partially observable Markov tuple (S, A, P, R, Ω, O, γ), where S represents the state space, A the action space, P the state transition probability, R the reward, Ω the observation probability distribution, O the mapping from the state space to the observation space, and γ the discount rate. Within each time step, the robot selects an action a ∈ A given an observation o ∈ O. By the Markov assumption, the next state s′ is determined only by the current state s; the observation probability distribution Ω(o, s′, a) and the state transition P(s, a, s′) are given by conditional probabilities.
Step 2: setting a multi-step proxy stage rewarding function, wherein the multi-step proxy stage rewarding function comprises punishment on collision of the robot and the dynamic obstacle, constraint on distance between the robot and the dynamic obstacle, punishment on distance between the robot and a target point and time cost;
in this embodiment, the multi-step proxy stage rewards function is defined as follows:
where N represents the total number of steps of the multi-step proxy stage rewarding function, r i Indicating the rewards of the T-th step, wherein T indicates the total decision number in the navigation process, gamma indicates the discount rate, and k indicates the current step number. Take the multi-step value N from the experience stored in the temporary replay buffer. For the training segments of T steps, the buffer is from the initial state s 0 To the end state s T Is T. The influence of future multi-step robot action selection on the current rewarding value is considered, so that the action of the robot is converged in a safe action interval, the variance of a cost function is reduced, and the training convergence speed is accelerated.
wherein r_t is expressed as follows:
r_t(s_t^jn, a_t) = r_col if d_t < 0; α(d_t - 0.15) if 0 ≤ d_t < 0.15; r_goal(t) if the robot reaches the target position; 0 otherwise
wherein t represents the current time, s_t^jn represents the joint state at time t in robot navigation, a_t represents the action at time t, and d_t represents the distance between the robot and the pedestrian at time t. The reward function comprises a penalty for collision with a dynamic obstacle, a constraint on the distance between the robot and the dynamic obstacle, a penalty on the distance to the target point, and a time cost. When the distance between the robot and the pedestrian is less than 0, a collision has occurred and the robot is penalized (the dynamic-obstacle collision penalty); when the distance is less than 0.15, the robot is penalized according to the degree of proximity (the distance constraint with the dynamic obstacle). When the robot reaches the target position within the prescribed time, it is rewarded, and the reward is inversely related to the arrival time (the target-distance penalty and time cost).
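The piecewise structure of r_t can be sketched as below. The specific constants (collision penalty -0.25, proximity scale 0.1, time horizon 100) are illustrative stand-ins, not values from the disclosure:

```python
def step_reward(d_t, reached_goal, t, t_max=100.0,
                r_collision=-0.25, r_goal=1.0):
    """Piecewise stage reward (constants are illustrative placeholders):
    collision penalty, proximity penalty inside the 0.15 m comfort zone,
    and a goal reward that shrinks as the arrival time t grows."""
    if d_t < 0:                       # overlap with a pedestrian: collision
        return r_collision
    if reached_goal:                  # earlier arrival -> larger reward
        return r_goal * (1.0 - t / t_max)
    if d_t < 0.15:                    # too close: penalty grows as d_t shrinks
        return -0.1 * (0.15 - d_t)
    return 0.0

print(step_reward(-0.01, False, 10))  # -0.25
print(step_reward(0.05, False, 10))   # small proximity penalty
```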
Step 3: random generation of pedestrian track data based on collision algorithm according to pedestrian action spaceA set D; in one embodiment of the invention, three RVO, ORCA, SFM collision avoidance algorithms are adopted in the action space of the pedestrians, and a pedestrian motion trail data set D is randomly generated according to the proportion of 1:1:1, wherein the pedestrian motion trail data set consists of a plurality of pedestrian states h i =[p x ,p y ,v x ,v y ,r i ,d i ,r i +r]The composition is formed.
Step 4: a surrounding grid map of the robot is generated for capturing the locations of surrounding dynamic pedestrians.
Step 5: the encoder and decoder are constructed in a three-layer structure, wherein each encoder layer comprises a feedforward network and a self-attention module, and each decoder layer comprises a self-attention module, a codec attention module and a feedforward network. The self-attention module is used for learning the relation between the current behavior of the pedestrian and the previous behavior, and the encoding and decoding attention module is used for learning the current behavior of the pedestrian and the motion characteristic F= { F of the pedestrian to be encoded t ,f t-1 ,f t-2 Relationships of }. As shown in fig. 2, the input of the encoder is the pedestrian state in the pedestrian track data set, and the output of the decoder is the pedestrian motion characteristic f= { F t ,f t-1 ,f t-2 And (f), where f t The motion characteristics of the pedestrian at the time t are shown.
Step 6: and estimating the state-action value pair of the robot by using the deep learning network.
The robot's current state s_t and observation o_t, together with the pedestrian motion feature F, are input into the deep neural network to estimate the robot's state-action value pair. In this embodiment, the deep neural network is composed, in order, of 3 multi-layer perceptron networks, a softmax activation function, 2 multi-layer perceptron networks and 1 fully connected layer.
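A forward pass through such a stack (perceptron layers, a softmax activation, then a fully connected output) can be sketched with a generic dense layer; the layer sizes, toy input and random weights below are placeholders, not trained values from the disclosure:

```python
import math
import random

def dense(x, W, b, act=None):
    """One fully connected layer: y = act(W x + b)."""
    y = [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    if act == "relu":
        y = [max(0.0, v) for v in y]
    elif act == "softmax":
        m = max(y)
        e = [math.exp(v - m) for v in y]
        s = sum(e)
        y = [v / s for v in e]
    return y

rng = random.Random(0)
def rand_layer(n_out, n_in):
    """Random placeholder weights; real values would come from training."""
    return ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

x = [0.5, -0.2, 0.1]                             # toy concat of s_t, o_t and F
h = dense(x, *rand_layer(4, 3), act="relu")      # perceptron layer
h = dense(h, *rand_layer(4, 4), act="softmax")   # softmax activation stage
q = dense(h, *rand_layer(2, 4))                  # final fully connected layer
print(len(q))  # 2 estimated state-action values
```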
Step 7: and outputting the optimal action selected by the robot in the current state by using an action neural network in the PPO algorithm, scoring the state-action of the robot by using an evaluation neural network in the PPO algorithm, and selecting the action with the largest rewarding value for updating. By using the course learning method, the number of pedestrians in the environment is gradually increased, the pedestrian speed is changed, and the like. And (5) carrying out iterative training, updating and optimizing network parameters, and selecting an optimal motion trail.
In one implementation of the invention, the simulation environment is built on the Gym platform; the scene is a square space with a side length of 12 meters, the initial positions of the pedestrians are randomly distributed in the space, and there are 8 dynamic pedestrians. In this environment, the trained reinforcement learning network selects the optimal collision avoidance path and completes the robot's safe navigation.
The simulation results of the robot path planning process are shown in fig. 3 and fig. 4, wherein the square unit is the robot, the round units numbered 0-7 represent pedestrians, the legend in the upper left corner shows the instantaneous speeds of pedestrians 0-7, the arrows indicate the speed directions, and the five-pointed star is the destination. The black area is the blind area of the lidar scan while the robot is moving.
As can be seen from fig. 3, at 2.5 s after the robot starts, pedestrians No. 2 and No. 5 are completely covered by pedestrians No. 3 and No. 7 and lie in the robot's blind area. Meanwhile, the position of pedestrian No. 6 causes local occlusion of pedestrian No. 0. In fig. 4, at 7.75 s, an occlusion relationship still exists between them. The invention outputs the pedestrian feature vector F = {f_t, f_{t-1}, f_{t-2}} through the encoder-decoder and, combined with the multi-step reward function, quickly captures pedestrian dynamics and selects the optimal behavior of the current robot to avoid collision. The simulation experiments show that the method can solve the obstacle avoidance and path planning problems of a robot in a real pedestrian environment, and demonstrate the safety of navigation and the effectiveness of obstacle avoidance in a multi-pedestrian environment.
The embodiments described in this specification are merely illustrative of the manner in which the inventive concepts may be implemented. The scope of the present invention should not be construed as limited to the specific forms set forth in the embodiments; it also covers the equivalents thereof that would occur to those skilled in the art based on the inventive concept.
Claims (8)
1. The robot path collision avoidance planning method based on deep reinforcement learning in the pedestrian environment is characterized by comprising the following steps of:
s1: defining a state space according to the states of the robot and the person at the current moment; modeling the mutual observation between the robot and the pedestrian as a partially observable Markov tuple according to the state space;
s2: setting a multi-step agent stage reward function, wherein the multi-step agent stage reward function comprises a penalty for collision between the robot and a dynamic obstacle, a constraint on the distance between the robot and the dynamic obstacle, a penalty on the distance between the robot and the target point, and a time cost;
s3: randomly generating a pedestrian trajectory data set formed from pedestrian states based on collision avoidance algorithms according to the pedestrians' action space;
s4: inputting the pedestrian trajectory data set into a codec network and extracting pedestrian motion features;
s5: estimating the robot's state-action value pair with a deep learning network according to the pedestrian motion features and the robot's current state and observation obtained from the partially observable Markov tuple;
s6: outputting the optimal action selected by the robot in the current state with the action neural network according to the estimated value of the robot's state-action value pair, scoring the robot's state-action with the evaluation neural network, and updating and optimizing the parameters of the codec network, the deep learning network, the action neural network and the evaluation neural network through reinforcement-learning iterative training combined with the multi-step agent stage reward function;
s7: in the robot path collision avoidance planning process, acquiring surrounding environment data, and firstly, coding the pedestrian state by utilizing a trained codec network to obtain pedestrian motion characteristics; estimating the state-action value pair of the robot by using the trained deep learning network according to the pedestrian motion characteristics and the current state and observation of the robot obtained by the partially observable Markov tuple; and finally, outputting the optimal action in the current state through the trained action neural network, and realizing the safe navigation in the dynamic pedestrian environment.
2. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the state space is expressed as:
s = [d_g, v_pref, v_x, v_y, r]
h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r]
wherein s represents the robot state, d_g represents the distance between the robot and the target point, v_pref represents the preferred speed of the robot, v_x, v_y represent the robot's speed along the x and y axes, and r represents the radius of the space occupied by the robot; h_i represents the state of the i-th pedestrian, p_x, p_y represent the pedestrian's position, v_x, v_y represent the pedestrian's speed along the x and y axes, r_i represents the radius of the space occupied by the i-th pedestrian, d_i represents the distance between the robot and the i-th pedestrian, and r_i + r represents the minimum safe distance between the robot and the pedestrian.
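Purely as an illustrative sketch (not part of the claims), the state vectors of claim 2 could be assembled as follows; the helper names are hypothetical:

```python
import math

def robot_state(px, py, gx, gy, v_pref, vx, vy, r):
    """Robot state s = [d_g, v_pref, v_x, v_y, r], where d_g is the
    distance from the robot position (px, py) to the goal (gx, gy)."""
    d_g = math.hypot(gx - px, gy - py)
    return [d_g, v_pref, vx, vy, r]

def pedestrian_state(px, py, vx, vy, r_i, rob_x, rob_y, rob_r):
    """Pedestrian state h_i = [p_x, p_y, v_x, v_y, r_i, d_i, r_i + r],
    where d_i is the robot-pedestrian distance and r_i + r is the
    minimum safe separation."""
    d_i = math.hypot(px - rob_x, py - rob_y)
    return [px, py, vx, vy, r_i, d_i, r_i + rob_r]
```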
3. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the Markov tuple is expressed as (S, A, P, R, Ω, O, γ), wherein S represents the state space, A represents the action space, P represents the state transition probability, R represents the reward, Ω represents the observation probability distribution, O represents the mapping from the state space to the observation space, and γ represents the discount rate.
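As an illustrative aside (not part of the claims), the tuple of claim 3 maps naturally onto a small container type; all field names here are hypothetical:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class POMDP:
    """Partially observable Markov tuple (S, A, P, R, Omega, O, gamma)."""
    states: Any            # S: state space
    actions: Any           # A: action space
    transition: Callable   # P: state transition probability
    reward: Callable       # R: reward function
    obs_prob: Callable     # Omega: observation probability distribution
    obs_map: Callable      # O: mapping from state space to observation space
    gamma: float           # discount rate
```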
4. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the multi-step agent stage reward function is as follows:
wherein N represents the total number of steps of the multi-step agent stage reward function, r_t represents the reward at step t, T represents the total number of decisions in the navigation process, γ represents the discount rate, and k represents the current step number.
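The formula image of claim 4 did not survive extraction; from the variable definitions it appears to be a multi-step discounted return. Assuming the standard n-step discounted sum (an assumption, not the patent's verbatim formula), a minimal sketch:

```python
def multi_step_return(rewards, gamma, n):
    """Discounted sum of the next n step rewards, assumed to be
    R = sum_{k=0}^{n-1} gamma**k * r_{t+k}.
    `rewards` holds r_t, r_{t+1}, ...; n is capped at len(rewards)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[:n]))
```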
5. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: said r_t is as follows:
wherein t represents the current time, s_t represents the joint state at time t during robot navigation, a_t represents the action at time t, and d_t represents the distance between the robot and the pedestrian at time t; when d_t is smaller than 0, a collision has occurred and the robot is penalized; when the distance between the robot and the pedestrian is smaller than 0.15, a penalty is applied according to the degree of proximity; when the robot reaches the target position within the prescribed time, the robot is rewarded, the reward being inversely related to the arrival time.
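The piecewise formula for r_t is likewise missing from the extraction. A sketch of the cases described in claim 5 follows; the specific coefficients (collision penalty, goal reward, scaling factors) are hypothetical placeholders, not values from the patent:

```python
def step_reward(d_t, reached_goal, t, t_max,
                collision_penalty=-0.25, goal_reward=1.0,
                safe_dist=0.15):
    """Hypothetical per-step reward following claim 5's cases:
    collision (d_t < 0) is penalized; closeness below 0.15 is
    penalized in proportion to the intrusion; reaching the goal
    within the allowed time is rewarded, later arrivals less."""
    if d_t < 0:
        return collision_penalty
    if d_t < safe_dist:
        return -0.1 * (safe_dist - d_t) / safe_dist
    if reached_goal and t <= t_max:
        return goal_reward * (1.0 - t / (2.0 * t_max))
    return 0.0
```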
6. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the pedestrian trajectory data set D described in step S3 is generated by three collision avoidance algorithms: RVO, ORCA and SFM.
7. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the encoder-decoder network consists of an encoder and a decoder; the input of the encoder is the pedestrian states in the pedestrian trajectory data set, and the output of the decoder is the pedestrian motion features F = {f_t, f_{t-1}, f_{t-2}}, where f_t represents the motion feature of the pedestrian at time t.
8. The robot path collision avoidance planning method based on deep reinforcement learning in a pedestrian environment according to claim 1, wherein: the deep neural network consists, in sequence, of a multi-layer perceptron network, a softmax activation function, a multi-layer perceptron network and a fully connected layer; the inputs of the deep neural network are the robot's current state s_t, its observation o_t and the pedestrian motion features F, and its output is the estimate of the robot's state-action value.
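As an illustrative sketch of the layer ordering in claim 8 (MLP, then softmax, then MLP, then a fully connected output layer) — not the patent's actual network, with hypothetical parameter names and toy dimensions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mlp(x, w, b):
    # One ReLU hidden layer standing in for a multi-layer perceptron
    return np.maximum(0.0, w @ x + b)

def q_estimate(s_t, o_t, feat, params):
    """Value-network sketch: concatenate robot state s_t, observation
    o_t and pedestrian motion features F, pass through MLP -> softmax
    (attention-style weighting) -> MLP -> fully connected layer, and
    return a scalar state-action value estimate. `params` holds the
    hypothetical weight matrices and biases."""
    x = np.concatenate([s_t, o_t, feat])
    h1 = mlp(x, params["w1"], params["b1"])
    a = softmax(h1)
    h2 = mlp(a, params["w2"], params["b2"])
    return float(params["w_out"] @ h2 + params["b_out"])
```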
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310437715.2A CN116360454A (en) | 2023-04-18 | 2023-04-18 | Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116360454A true CN116360454A (en) | 2023-06-30 |
Family
ID=86909203
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310437715.2A Pending CN116360454A (en) | 2023-04-18 | 2023-04-18 | Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116360454A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116562332A (en) * | 2023-07-10 | 2023-08-08 | 长春工业大学 | Robot social movement planning method in man-machine co-fusion environment |
CN116562332B (en) * | 2023-07-10 | 2023-09-12 | 长春工业大学 | Robot social movement planning method in man-machine co-fusion environment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chiang et al. | RL-RRT: Kinodynamic motion planning via learning reachability estimators from RL policies | |
CN113110592B (en) | Unmanned aerial vehicle obstacle avoidance and path planning method | |
Bai et al. | Intention-aware online POMDP planning for autonomous driving in a crowd | |
CN102819264B (en) | Path planning Q-learning initial method of mobile robot | |
CN102402712B (en) | Robot reinforced learning initialization method based on neural network | |
Cao et al. | Target search control of AUV in underwater environment with deep reinforcement learning | |
Wang et al. | A survey of learning‐based robot motion planning | |
Eiffert et al. | Path planning in dynamic environments using generative rnns and monte carlo tree search | |
CN112034887A (en) | Optimal path training method for unmanned aerial vehicle to avoid cylindrical barrier to reach target point | |
Kanezaki et al. | Goselo: Goal-directed obstacle and self-location map for robot navigation using reactive neural networks | |
CN116360454A (en) | Robot path collision avoidance planning method based on deep reinforcement learning in pedestrian environment | |
CN113391633A (en) | Urban environment-oriented mobile robot fusion path planning method | |
Liu et al. | Reinforcement learning-based collision avoidance: Impact of reward function and knowledge transfer | |
US11911902B2 (en) | Method for obstacle avoidance in degraded environments of robots based on intrinsic plasticity of SNN | |
Martinez-Baselga et al. | Improving robot navigation in crowded environments using intrinsic rewards | |
Alcalde et al. | DA-SLAM: Deep Active SLAM based on Deep Reinforcement Learning | |
CN114396949B (en) | DDPG-based mobile robot apriori-free map navigation decision-making method | |
CN116202526A (en) | Crowd navigation method combining double convolution network and cyclic neural network under limited view field | |
CN116127853A (en) | Unmanned driving overtaking decision method based on DDPG (distributed data base) with time sequence information fused | |
CN115031753A (en) | Driving condition local path planning method based on safety potential field and DQN algorithm | |
Tang et al. | Reinforcement learning for robots path planning with rule-based shallow-trial | |
Martovytskyi et al. | Approach to building a global mobile agent way based on Q-learning | |
Xiaoxian et al. | Obstacle Avoidance Algorithm for Mobile Robot Based on Deep Reinforcement Learning in Dynamic Environments | |
CN113033893B (en) | Method for predicting running time of automatic guided vehicle of automatic container terminal | |
CN117606490B (en) | Collaborative search path planning method for autonomous underwater vehicle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||