CN116202526A - Crowd navigation method combining double convolution network and cyclic neural network under limited view field

Crowd navigation method combining double convolution network and cyclic neural network under limited view field

Info

Publication number
CN116202526A
CN116202526A (application CN202310163370.6A)
Authority
CN
China
Prior art keywords
robot
network
neural network
information
environment
Prior art date
Legal status
Pending
Application number
CN202310163370.6A
Other languages
Chinese (zh)
Inventor
黄静
鲁亚洲
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202310163370.6A priority Critical patent/CN116202526A/en
Publication of CN116202526A publication Critical patent/CN116202526A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a crowd navigation method that combines a dual convolutional network with a recurrent neural network under a limited field of view. The method fully accounts for the state information available within the limited field of view: it hierarchically encodes the robot information and the pedestrian information within the perception range, feeds the encoded result to the recurrent neural network as one feature input and the robot state information as another, and trains the network with the proximal policy optimization algorithm, thereby achieving crowd navigation for a robot with a limited field of view. Because the algorithm only needs to encode the robot's own information and its relations to the pedestrians within the partial perception range, the complexity of multi-agent interaction modeling is reduced. Using the encoded information as features, the recurrent neural network effectively mines the feature relations between the robot and the pedestrians, helps the robot act more accurately, and, by predicting the optimal action, accomplishes the navigation task of a limited-field-of-view robot in a dynamic dense-crowd environment.

Description

Crowd navigation method combining double convolution network and cyclic neural network under limited view field
Technical Field
The invention belongs to the field of artificial intelligence and robot navigation, and particularly relates to a method for navigating a robot with a limited field of view in a dynamic dense crowd by utilizing deep reinforcement learning.
Background
One of the main purposes of mobile robots is to serve humans, and ensuring human safety is a prerequisite for a robot to do its work. A mobile robot navigating in a crowd environment must cope with walking pedestrians and react to emergencies, so the planned path should be safe, efficient, and comfortable. Safety means the robot keeps itself safe while maintaining a certain safe distance from pedestrians; efficiency means the robot reaches the target point by the shortest path; comfort means the planned trajectory is smooth and does not intrude into pedestrians' comfort zones. When performing a navigation task in a dynamic dense crowd, the robot can only perceive a limited range (a limited field of view), which poses a great challenge to its navigation performance. Current crowd navigation algorithms can be divided, according to whether they have autonomous learning ability, into traditional navigation algorithms and learning-based navigation algorithms.
Traditional navigation algorithms treat pedestrians as simple dynamic obstacles and apply collision-avoidance rules when the robot encounters them. They are simple in principle, but they require hand-designed models to complete navigation, and it remains unclear whether complex human behavior follows precise geometric behavior rules. Facing a dense crowd environment, a robot without learning ability is limited in navigation performance and cannot serve well in real scenes.
In recent years, rapidly evolving artificial intelligence techniques have been introduced to address the crowd navigation problem, with deep reinforcement learning frameworks used to train crowd navigation strategies. Such strategies give the robot learning ability: by learning from the experience of interacting with pedestrians in the environment, the robot navigates the crowd according to the learned navigation policy. However, existing crowd navigation strategies pay no attention to the robot's partial perception range, which causes a dramatic drop in navigation performance for robots with a limited field of view.
Disclosure of Invention
The present invention aims to overcome the above disadvantages of the prior art with a mobile-robot crowd navigation model that combines a dual convolutional network and a recurrent neural network (Double Convolutional Neural Networks and Recurrent Neural Network, DCRNN). The DCRNN-based robot crowd navigation model comprises a DCRNN policy network and a DCRNN value network. The algorithm is developed under the realistic condition that the robot has a limited field of view. First, the robot's navigation decision in a crowd environment is formulated as a reinforcement learning (Reinforcement Learning, RL) decision problem; then the robot and the crowd in the environment are modeled, and the optimal navigation behavior of the robot is computed from the state information updated by the DCRNN. FIG. 1 shows the overall model structure: the environment information is input into the model, the model models the environment and selects an action a_t based on the reinforcement-learning policy, and after acting according to the policy the robot obtains a reward r_t ∈ R.
The invention comprises the following modules:
(1) An environment modeling module:
1.1) The interaction behavior of the robot and the pedestrians is modeled as a partially observable Markov decision process (Partially Observable Markov Decision Process, POMDP), represented by the five-tuple <S,A,P,R,S*>. s_t ∈ S denotes the state of the agent in the environment at time t,

s_t = (r_t, h_t^1, ..., h_t^n)

where r_t is the state information of the robot at time t, containing the robot position (p_xr, p_yr), velocity (v_xr, v_yr), target-point position (g_x, g_y), steering angle θ, perception radius α, and so on; h_t^i denotes the state of the i-th pedestrian observed by the robot and contains only the position information (p_xi, p_yi). The pedestrian's velocity and perception radius are not used in this invention, mainly because these two parameters are difficult to measure in real scenes. In each test the robot always starts from an initial state s_0; at each time step it takes an action a_t ∈ A according to the policy π(a_t|s_t), receives the corresponding reward, and transitions from s_t to s_{t+1} according to the state-transition probability function P(s_t, s_{t+1}|a_t); meanwhile, the other pedestrians move according to their existing behavior policies into their next states. This state-update process continues until the robot collides, reaches the target point, or times out.
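The episode loop just described can be sketched as follows. This is a minimal illustration only: the environment dynamics, the straight-to-goal stand-in policy, and all thresholds are hypothetical, not the patent's implementation.

```python
import math
import random

GOAL = (4.0, 4.0)          # hypothetical target point (g_x, g_y)
COLLISION_DIST = 0.3       # hypothetical robot-pedestrian collision radius
MAX_STEPS = 200            # timeout condition

def policy(state):
    # stand-in for pi(a_t | s_t): head straight toward the goal at unit speed
    (px, py), _ = state
    dx, dy = GOAL[0] - px, GOAL[1] - py
    norm = math.hypot(dx, dy) or 1.0
    return (dx / norm, dy / norm)

def step(state, action, dt=0.25):
    # stand-in for the transition P(s_t, s_{t+1} | a_t): the robot moves with
    # the chosen velocity; pedestrians follow their own (here random) policies
    (px, py), peds = state
    px, py = px + action[0] * dt, py + action[1] * dt
    peds = [(x + random.uniform(-0.1, 0.1), y + random.uniform(-0.1, 0.1))
            for (x, y) in peds]
    return ((px, py), peds)

def rollout(s0):
    """Run one episode until collision, goal, or timeout."""
    state = s0
    for t in range(MAX_STEPS):
        state = step(state, policy(state))
        (px, py), peds = state
        if any(math.hypot(px - x, py - y) < COLLISION_DIST for x, y in peds):
            return "collision", t
        if math.hypot(px - GOAL[0], py - GOAL[1]) < 0.3:
            return "goal", t
    return "timeout", MAX_STEPS

random.seed(0)
outcome, steps = rollout(((-4.0, -4.0), [(0.0, 2.0), (2.0, -1.0)]))
```

The three return labels correspond exactly to the three termination conditions of the state-update process above.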
1.2) To encourage the robot to avoid pedestrians and reach the target point quickly, a reward function R(s_t, a_t, s_{t+1}) is designed (Equation 2; its full piecewise definition is given as an image in the original document). It consists of: a penalty that grows as the distance between the robot and a pedestrian at time t shrinks, with the collision penalty the largest; a reward r_s = 20 for reaching the target point; and a progress term that rewards moving toward the target and penalizes moving away from it. At time t, the policy selects from the observed action space the action that maximizes the expected cumulative reward (Equation 3; also given as an image in the original), where R(s_t, a_t, s_{t+1}) is the reward function and γ ∈ (0,1] is the discount factor of the cumulative reward. In addition, it is assumed that the robot can observe the pedestrians while the pedestrians cannot observe the robot; this prevents the pedestrians from yielding to the robot when the robot moves toward them.
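A minimal sketch of a reward with this shape is given below. Only the goal reward r_s = 20 comes from the text; the collision penalty value, the discomfort distance, and the progress weight are illustrative assumptions, not the patent's Equation (2).

```python
import math

R_GOAL = 20.0        # r_s = 20, from the text
R_COLLISION = -20.0  # assumed: collision penalty is the largest penalty
DISCOMFORT = 0.5     # assumed discomfort distance (m)
W_PROGRESS = 1.0     # assumed weight of the progress term

def reward(robot_pos, prev_pos, goal, pedestrians, collision_dist=0.3):
    """Reward r_t: goal bonus, collision/discomfort penalty, progress term."""
    d_goal = math.dist(robot_pos, goal)
    if d_goal < collision_dist:
        return R_GOAL                                  # reached the target point
    d_min = min(math.dist(robot_pos, p) for p in pedestrians)
    if d_min < collision_dist:
        return R_COLLISION                             # collision: largest penalty
    r = 0.0
    if d_min < DISCOMFORT:
        # penalty grows as the robot-pedestrian distance shrinks
        r -= (DISCOMFORT - d_min)
    # positive when moving toward the goal, negative when moving away
    r += W_PROGRESS * (math.dist(prev_pos, goal) - d_goal)
    return r
```

The three branches mirror the verbal description: largest penalty on collision, r_s = 20 on success, and a signed progress term otherwise.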
(2) DCRNN neural network module design
The crowd environment contains a variety of data. When the robot gathers information for training, only the robot's own information and the pedestrians' position coordinates are considered; the pedestrians' velocities and perception radii are not used as navigation inputs, because they are difficult to measure accurately in real scenes. The internal structure of the DCRNN neural network is shown in FIG. 2. State information collected from the environment is passed to the DCNN module and the RNN module respectively: the DCNN encoder encodes the relative positional relations between the agents in the environment (robot and pedestrians) and the motion-trend information of the robot (velocity and steering angle), while the RNN module takes as input the concatenation of the robot-node information with the DCNN encoder output, together with the hidden-state information of the previous time step, and outputs the values of the action and value functions through fully connected layers.
2.1 DCNN module design
The DCNN module mainly processes the motion-trend information of the robot (velocity and steering angle) and the relative positional relation between the robot and the pedestrians. Its internal design is shown in FIG. 3: the robot-pedestrian relative position information passes through the fully connected layer f_edge and is then encoded by the CNN1 encoder; the robot velocity information is expanded by the fully connected layer f_v and then encoded by the CNN2 encoder; the two encodings are coupled by an MLP, and the result is output as one of the feature inputs of the RNN module.
The relative positional relations between the robot and all pedestrians within the perception range are encoded by a CNN, Equation 4:

output1 = CNN1(f_edge(e_t))   (4)

where e_t collects the relative positional relations within the perception range at time t, both robot-to-pedestrian and pedestrian-to-pedestrian, and f_edge denotes a fully connected layer. The velocity information v_pref = (v_xr, v_yr) and the steering angle θ, representing the motion trend of the robot, are input to another CNN for encoding, Equation 5:

output2 = CNN2(f_v(v_pref, θ))   (5)
where f_v() denotes a fully connected layer. One MLP network couples the two outputs into one of the feature inputs of the RNN module:

output3 = MLP(output1, output2)   (6)

The feature output3 effectively characterizes the hidden relation between the robot and the pedestrians; the robot state information r_t is used as the other feature input of the RNN module.
2.2 RNN module design
The internal structure of the RNN module is shown in FIG. 4. The module has three inputs: the feature information output by the MLP, the robot feature information, and the hidden-state information of the previous time step. The first two are combined and fed, together with the previous hidden state, into the GRU unit for the state update. The output of the module, x, represents the updated feature of the robot node.
One of the inputs of the GRU unit is defined as follows:

output5 = f_cat(output3, f_robot(r_t))   (7)

where f_cat() is a matrix concatenation function, f_robot() is a fully connected layer, and r_t is the robot feature information. The GRU unit is defined as follows:

h_t = GRU(output5, h_{t-1})   (8)

where h_{t-1} is the hidden-state information of the previous time step and h_t is the hidden-state information of the current time step. The output of the RNN module is fed to fully connected layers to obtain the values of the action and value functions, respectively.
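The GRU update of Equation (8) can be written out explicitly. Below is a standard GRU cell in NumPy; the dimensions and weight initialization are illustrative, and the gate equations are the usual GRU definitions rather than anything patent-specific.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """h_t = GRU(x_t, h_{t-1}) with the standard update/reset gates."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = (input_dim + hidden_dim, hidden_dim)
        self.Wz = rng.normal(scale=0.1, size=s)   # update-gate weights
        self.Wr = rng.normal(scale=0.1, size=s)   # reset-gate weights
        self.Wh = rng.normal(scale=0.1, size=s)   # candidate-state weights

    def __call__(self, x, h_prev):
        xh = np.concatenate([x, h_prev])
        z = sigmoid(xh @ self.Wz)                       # update gate
        r = sigmoid(xh @ self.Wr)                       # reset gate
        h_tilde = np.tanh(np.concatenate([x, r * h_prev]) @ self.Wh)
        return (1.0 - z) * h_prev + z * h_tilde         # new hidden state h_t

# output5 (the concatenated features of Equation 7) feeds the cell
# together with the previous hidden state h_{t-1}
cell = GRUCell(input_dim=24, hidden_dim=32)
output5 = np.random.default_rng(1).normal(size=24)
h_prev = np.zeros(32)
h_t = cell(output5, h_prev)
```

Because the candidate state passes through tanh and the gates lie in (0,1), every component of h_t stays strictly inside (-1, 1) when h_prev does.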
(3) Strategy learning algorithm
A model-free policy-gradient algorithm, proximal policy optimization (Proximal Policy Optimization, PPO), is used to learn the policy and value functions. According to the update mode of the actor network, PPO divides into PPO-Penalty, which uses an adaptive KL-divergence penalty (KL Penalty), and PPO-Clip, which uses a clipped surrogate objective. As shown in FIG. 1, the robot obtains a reward r_t ∈ R after selecting an action according to the DCRNN policy, where (p_xi, p_yi) denotes the current position of the i-th pedestrian, (p_xr, p_yr) the current position of the robot, and (v_xr, v_yr) the velocity of the robot. After the computation of the DCRNN policy network, the next optimal action a_t is predicted according to the probability distribution, and the value of the action is evaluated with the DCRNN value network. After the robot executes the predicted action for one step, the next environment state s_{t+1} is obtained, containing the position of each pedestrian, the position of the robot, and the velocity of the robot at the next time step. After the state transition, the next state becomes the current state s_t, and the update magnitude of the network is computed by comparing the predicted state with the real state. During training, 12 environment instances are run to collect experience; the robot crowd navigation problem is decomposed into independent sub-problems, the two convolutional neural networks and the one recurrent neural network effectively learn the factors of the corresponding sub-problems, and the DCRNN policy network and value network are trained with the PPO gradient algorithm to determine the actions the robot selects in the dynamic dense-crowd environment.
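The PPO-Clip variant named above optimizes a clipped surrogate objective. A minimal NumPy sketch of that loss follows; the clip range ε = 0.2 and the sample batch are illustrative assumptions.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective of PPO-Clip (negated, to be minimized)."""
    ratio = np.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # take the pessimistic (smaller) objective per sample, then negate
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

# illustrative batch: if the policy is unchanged, every ratio equals 1
# and the loss reduces to the negative mean advantage
logp_old = np.log(np.array([0.2, 0.5, 0.1]))
adv = np.array([1.0, -0.5, 2.0])
loss_same = ppo_clip_loss(logp_old, logp_old, adv)
```

The clipping keeps any single update from moving the policy too far from the policy that collected the experience, which is what makes on-policy training with 12 parallel environment instances stable.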
Drawings
Fig. 1 is a general structural view of the present invention.
Fig. 2 is a diagram of the DCRNN neural network.
Fig. 3 is a block diagram of DCNN modules.
Fig. 4 is a structural diagram of RNN modules.
Fig. 5 is a diagram of a group size 5 person test scenario.
FIG. 6 is a graph of a test scenario with a crowd scale of 10 people, comparing baseline models with the present model.
FIG. 7 is a diagram of one of the test scenarios involving dynamic pedestrians and static pedestrians.
Detailed Description
The invention will be described in detail with reference to the drawings and examples.
The invention takes the DCRNN neural network as its core and establishes a DCRNN-based crowd navigation model with which a robot with a limited field of view can accomplish navigation tasks in a dynamic dense-crowd environment. Different simulation environments are designed to test the model's performance against its characteristics. First, the navigation function is tested with a crowd scale of 5 people, verifying whether the model can navigate in a dynamic sparse-crowd environment. Second, the navigation performance of the baseline models and the present model is tested in a dynamic dense-crowd environment with a crowd scale of 10 people, verifying whether the model achieves the navigation function there and performs well. Finally, stationary pedestrians are added to the dynamic crowd environment as static obstacles, testing whether the model can complete navigation tasks in an environment containing both pedestrians and static obstacles.
The experimental environment is two-dimensional: the simulated robot navigates in a 10m × 10m space containing a dense crowd. The simulation is more realistic than previous crowd navigation simulations in three respects. First, the robot's limited field-of-view radius is 4m, whereas previous models assumed access to the state information of all pedestrians in the environment, which does not match reality. Second, the simulated scene has more pedestrians, higher crowd density, and frequent interactions. Third, the simulation includes both a pedestrian environment without static obstacles and one with static obstacles. The start and target positions of the robot and the pedestrians are randomly sampled in the environment. All pedestrians follow a collision-avoidance strategy and react only to other pedestrians, ignoring the robot; this prevents the invalid situation in which the robot earns high rewards because pedestrians yield to it, and it also models the robot's influence on pedestrian behavior in real scenes. The simulation uses a complete kinematic model of the robot and the pedestrians: the motion at time t is composed of the x and y velocity components v_x and v_y, the motion space of the robot is continuous, and all agents are assumed to reach the desired velocity immediately and hold it for the very short interval Δt. To mimic pedestrians in real scenes, pedestrian behavior changes randomly within the same experiment, and a pedestrian that reaches its target point immediately heads toward a new target point instead of stopping there.
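The holonomic kinematic update just described amounts to a single vector step per interval. A sketch, with Δt and the sample velocity chosen for illustration:

```python
def kinematic_step(px, py, vx, vy, dt=0.25):
    """Agents reach the desired velocity (vx, vy) immediately and
    hold it for the short interval dt."""
    return px + vx * dt, py + vy * dt

# illustrative: an agent at the origin moving with v = (1.0, -2.0) for dt = 0.25
px, py = kinematic_step(0.0, 0.0, 1.0, -2.0)
```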
For the DCRNN model, the reinforcement learning model is trained for 5×10^6 steps with the reward function of Equation (2) and a learning rate of 2×10^-5. Each model is tested on 400 random test cases, and the quality of a model is measured by its success rate (Success Rate, SR), collision rate (Collision Rate, CR), and average path length (average path length, PL): the success rate is the fraction of the 400 tests that successfully reach the target point, the collision rate is the fraction that collide with a pedestrian, and the average path length is the mean path length over the successful runs. All simulation code is written in Python; the hardware and software platform is listed in Table 1.
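The three evaluation indices can be computed from per-episode outcomes as follows; the episode records in the example are illustrative, not experimental data.

```python
def evaluate(episodes):
    """episodes: list of (outcome, path_length), outcome in
    {'success', 'collision', 'timeout'}. Returns (SR, CR, PL)."""
    n = len(episodes)
    successes = [p for o, p in episodes if o == "success"]
    collisions = sum(1 for o, _ in episodes if o == "collision")
    sr = len(successes) / n                       # success rate
    cr = collisions / n                           # collision rate
    # average path length over the *successful* runs only
    pl = sum(successes) / len(successes) if successes else float("nan")
    return sr, cr, pl

# illustrative records for four episodes
sr, cr, pl = evaluate([("success", 12.0), ("success", 14.0),
                       ("collision", 3.0), ("timeout", 20.0)])
```

Note that PL is averaged only over successful runs, matching the definition in the text; averaging over all runs would reward early collisions with short paths.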
Table 1. Hardware and software platform parameters
Hardware/Software    Parameters
RAM                  40 GB
CPU                  Intel(R) Xeon(R) Platinum 8260L
Operating system     Ubuntu 18.04
Fig. 5 shows one of the test scenarios, with the robot's limited field of view set to 4 meters and a crowd scale of 5 randomly moving pedestrians. The yellow circle denotes the robot, the arrow its direction of motion, and the black circle its perception range, i.e., the limited field of view. Pedestrians outside the robot's limited field of view are drawn as red circles and pedestrians perceived within it as green circles; each pedestrian carries a number and an arrow indicating its direction of motion, and the red five-pointed star denotes the robot's target point. From Fig. 5 (a)-(d) it can be seen that the pedestrians' movements are random and that the robot can plan a route to the target point based on the pedestrians perceived within its limited field of view.
Completing navigation tasks in a dynamic dense-crowd environment under limited-field-of-view conditions raises the difficulty for a crowd navigation model. Fig. 6 shows one of the test cases in a scene with a crowd scale of 10 people and a robot field of view limited to 4m: 6(a) shows the ORCA test scenario, 6(b) the SF scenario, 6(c) the SARL scenario, 6(d) the SSTGCN scenario, and 6(e) the DCRNN scenario. Table 2 lists the experimental indices of each model over the 400 tests. Compared with ORCA and SF, the two non-learning methods achieve lower success rates in the dense-crowd environment, SF the lowest, and the present model outperforms them in both success rate and navigation time. This is mainly because the DCRNN model explores the environment and gathers experience for learning during RL training, makes decisions from past trajectory information and current state information, and optimizes the cumulative reward, whereas ORCA and SF consider only the current state, so both perform worse. Compared with the two value-learning models SARL and SSTGCN, Table 2 shows that both have lower success rates than DCRNN: during supervised learning their value networks are initialized by ORCA and cannot perform optimal value estimation, so the trained models inherit ORCA's defects, while DCRNN learns directly from past experience, which effectively prevents premature convergence of the network and avoids the drawbacks of supervised initialization with ORCA. Compared with SSTGCN's mediocre performance under a limited field of view, DCRNN exhibits excellent navigation performance by effectively exploiting the robot's motion information and its relation to each pedestrian within the limited field of view.
Through testing and analysis in a dense crowd environment, the DCRNN can quickly and efficiently realize crowd navigation tasks under the condition of limited field of view and is superior to other baseline models.
(Table 2: experimental indices SR, CR, and PL of each model over the 400 tests; presented as an image in the original document.)
The model is also tested in an environment containing both dynamic and static pedestrians, as shown in Fig. 7: stationary pedestrians form a circle that acts as a static obstacle to the robot's motion, and walking pedestrians are also present. The robot navigates to the target point under partial perception of the environment while avoiding both the dynamic and the static pedestrians, showing that the model can still complete navigation tasks even in a complex environment.
In summary, the DCRNN-based robot crowd navigation method provided by the embodiments of the invention achieves safe and efficient crowd navigation for the robot.
The foregoing description is only exemplary embodiments of the invention and is not intended to limit the scope of the invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (4)

1. The crowd navigation method combining the double convolution network and the cyclic neural network under the limited view field is characterized by comprising the following steps:
(1) And (3) environment modeling, namely modeling the interaction behaviors of the robot and the pedestrian.
(2) The double convolution network and the cyclic neural network module are designed and divided into a double convolution module and a cyclic neural module.
(3) And a strategy learning algorithm for performing reinforcement learning training on the dual convolutional network and the cyclic neural network module.
2. The crowd navigation method combining a dual convolutional network and a recurrent neural network in a limited field of view of claim 1, wherein the environmental modeling comprises:
(1) Modeling part of human-computer interaction behaviors in the environment.
(2) The reward function is rationally designed: the robot is encouraged to approach the target point and penalized when it approaches a pedestrian; and since the robot observes the pedestrians while the pedestrians cannot observe the robot, the pedestrians are prevented from yielding to the robot as it moves toward them.
3. The crowd navigation method combining a dual convolutional network and a recurrent neural network under a limited field of view according to claim 1, wherein the dual convolutional network and recurrent neural network module design comprises:
(1) The method comprises the steps of designing a double convolution network module, respectively processing the relative position relation of an intelligent agent in the environment and the movement trend information of a robot by using two convolution neural networks as encoders, and carrying out full-connection propagation on the information by using a multi-layer perceptron MLP to obtain the output of a DCNN module.
(2) The method comprises the steps of designing a circulating neural network module, splicing the output of a DCNN module and the characteristic information of a robot, updating the neural network by the output and the last-moment hidden state information of a circulating neural network unit of the RNN module, and respectively obtaining the values of an action and a value function by the output of the RNN module.
4. The crowd navigation method combining a dual convolutional network and a recurrent neural network under a limited field of view according to claim 1, wherein the strategy learning algorithm comprises:
(1) Policy learning is trained with the proximal policy optimization algorithm PPO-Clip, with DCRNN neural networks serving as the policy network and the value network respectively; the DCRNN policy network predicts the next action from the state information of the robot and the pedestrians in the environment, and the DCRNN value network evaluates the value of an action from its influence on the environment after the robot executes it.
(2) In the policy learning stage, 12 environment instances are run to collect experience; the robot crowd navigation problem is decomposed into independent sub-problems, the two convolutional neural networks and the one recurrent neural network effectively learn the factors of the corresponding sub-problems, and the DCRNN policy network and value network are trained with the PPO gradient algorithm to determine the actions the robot selects in the dynamic dense-crowd environment.
CN202310163370.6A 2023-02-24 2023-02-24 Crowd navigation method combining double convolution network and cyclic neural network under limited view field Pending CN116202526A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310163370.6A CN116202526A (en) 2023-02-24 2023-02-24 Crowd navigation method combining double convolution network and cyclic neural network under limited view field


Publications (1)

Publication Number Publication Date
CN116202526A true CN116202526A (en) 2023-06-02

Family

ID=86507370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310163370.6A Pending CN116202526A (en) 2023-02-24 2023-02-24 Crowd navigation method combining double convolution network and cyclic neural network under limited view field

Country Status (1)

Country Link
CN (1) CN116202526A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117191046A (en) * 2023-11-03 2023-12-08 齐鲁工业大学(山东省科学院) Crowd navigation method and system based on deep reinforcement learning and graph neural network
CN117191046B (en) * 2023-11-03 2024-01-26 齐鲁工业大学(山东省科学院) Crowd navigation method and system based on deep reinforcement learning and graph neural network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination