CN117191046B - Crowd navigation method and system based on deep reinforcement learning and graph neural network - Google Patents

Crowd navigation method and system based on deep reinforcement learning and graph neural network

Info

Publication number
CN117191046B
CN117191046B (application CN202311450875.7A)
Authority
CN
China
Prior art keywords
robot
state
state diagram
network
navigation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311450875.7A
Other languages
Chinese (zh)
Other versions
CN117191046A (en)
Inventor
李彬
杨姝慧
徐一明
刘丽霞
张明亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202311450875.7A priority Critical patent/CN117191046B/en
Publication of CN117191046A publication Critical patent/CN117191046A/en
Application granted granted Critical
Publication of CN117191046B publication Critical patent/CN117191046B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a crowd navigation method and system based on deep reinforcement learning and a graph neural network, relating to the technical fields of artificial intelligence and mobile robot navigation. The method comprises the following steps: modeling a simulation scene of the robot-crowd navigation environment and converting the scene into a graph representation; acquiring state feature information of the robot and of the pedestrians in the crowd from the graph representation; extracting state graph representations containing spatial and temporal feature information through a GAT network and a GRU network, respectively; and, based on the fused state graph representation containing spatio-temporal feature information, computing a behavior decision with the SAC algorithm to generate the optimal action for the robot to execute at the current moment, iterating this computation until the robot reaches the set target position, collides, or exceeds the set maximum navigation time, thereby completing the navigation task. The invention captures complex human-human and human-robot interactions and uses them for navigation decisions, achieving safer and more efficient movement in crowd navigation.

Description

Crowd navigation method and system based on deep reinforcement learning and graph neural network
Technical Field
The invention belongs to the technical field of artificial intelligence and mobile robot navigation, and particularly relates to a crowd navigation method and system based on deep reinforcement learning and a graph neural network.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the rapid development of artificial intelligence, mobile robots are increasingly used in crowded public places such as convenience stores, shops, restaurants, and hospitals; that is, mobile robots are increasingly deployed in human-centric social environments. In these crowded scenarios, a mobile robot must navigate among dynamic pedestrians in order to reach a specified target and complete a given task. This requires the robot to have a certain capacity for environment perception and dynamic obstacle avoidance, so that it can efficiently and rapidly plan a feasible, optimal, collision-free path to its destination. Crowd navigation involves complex interpersonal interactions, and the robot needs to predict and adapt to pedestrian behavior to ensure safe and smooth movement. However, human intentions and goals are not always apparent, which makes the navigation problem in crowd environments all the more complex.
To address the challenges of mobile robot crowd navigation, existing methods fall mainly into two categories: reaction-based and learning-based. Reaction-based methods attempt to guide robot behavior through manually established interaction rules, but such methods often lack a comprehensive understanding of complex human-robot interactions. Learning-based methods, particularly those based on deep reinforcement learning, use neural networks to learn robot behavior policies by simulating human-robot interactions, mapping environment states to an action space so that the robot can navigate autonomously in complex environments; however, these methods often cannot sufficiently predict future human behavior, and the robot may make inaccurate decisions.
Human-robot interaction is a direct factor affecting crowd navigation; in addition, interaction among pedestrians is an important factor that affects navigation indirectly. Existing methods, however, focus mainly on robot-pedestrian interaction and neglect the influence of pedestrians on one another. In crowded and highly interactive environments, human-human interaction is critical to robot decisions; because existing methods fail to fully consider this factor, the resulting mobile robots navigate poorly in dynamic crowds.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides a crowd navigation method and system based on deep reinforcement learning and a graph neural network. By combining a graph attention mechanism, a temporal function, and a deep reinforcement learning algorithm, the method captures complex human-human and human-robot interactions as well as the temporal correlation of the crowd navigation graph across time steps, uses these interactions for navigation decisions, and introduces maximum entropy into the reinforcement learning to improve the robot's exploration capability, thereby achieving safer, more efficient, and more natural movement in crowd navigation.
In a first aspect, the invention provides a crowd navigation method based on deep reinforcement learning and a graph neural network.
A crowd navigation method based on deep reinforcement learning and a graph neural network comprises the following steps:
modeling a simulation scene of the robot-crowd navigation environment and converting the simulation scene into a graph representation;
acquiring state feature information of the robot and of the pedestrians in the crowd from the graph representation;
extracting a state graph representation containing spatial feature information through a GAT network, based on the state feature information of the robot and the pedestrians;
extracting a state graph representation containing temporal feature information through a GRU network, based on the state graph representation containing spatial feature information, and fusing the two extracted state graph representations;
based on the fused state graph representation containing spatio-temporal feature information, computing a behavior decision with the SAC algorithm to generate the optimal action for the robot to execute at the current moment, and iterating this computation until the robot reaches the set target position, collides, or exceeds the set maximum navigation time, thereby completing the navigation task.
In a second aspect, the invention provides a crowd navigation system based on deep reinforcement learning and a graph neural network.
A crowd navigation system based on deep reinforcement learning and a graph neural network, comprising:
a model construction module for modeling a simulation scene of the robot-crowd navigation environment and converting the simulation scene into a graph representation;
a state information acquisition module for acquiring state feature information of the robot and of the pedestrians in the graph representation;
a state information processing module for extracting a state graph representation containing spatial feature information through a GAT network based on the state feature information of the robot and the pedestrians, for extracting a state graph representation containing temporal feature information through a GRU network based on the state graph representation containing spatial feature information, and for fusing the two extracted state graph representations;
a crowd navigation module for computing a behavior decision with the SAC algorithm based on the fused state graph representation containing spatio-temporal feature information, generating the optimal action for the robot to execute at the current moment, and iterating the computation until the robot reaches the set target position, collides, or exceeds the set maximum navigation time, thereby completing the navigation task.
The above technical solutions have the following beneficial effects:
1. The invention provides a crowd navigation method and system based on deep reinforcement learning and a graph neural network that use the graph neural network to capture complex human-human and human-robot interactions and apply them to navigation decisions, improving the accuracy of the final decision and of navigation.
2. The method introduces a temporal function, the GRU network, to capture the temporal correlation of the crowd navigation graph across time steps. Because all agents move, the graph changes dynamically, and the nodes, edge set, and interaction-function parameters of the graph representation change accordingly; the introduced temporal function links the graphs of adjacent time steps, overcoming the short-sightedness of traditional reactive methods and enabling the robot to make longer-term decisions.
3. The invention adopts the SAC deep reinforcement learning algorithm, which is based on an off-policy strategy and the maximum entropy principle. Compared with traditional deterministic policy learning algorithms, SAC performs better in robot task learning: it adopts a stochastic policy whose output actions follow a dispersed probability distribution rather than concentrating probability on one specific action, and by pursuing maximum policy entropy it explores all potentially optimal paths, which strengthens action exploration, improves the robustness and generalization of the algorithm, and stabilizes model performance.
4. The invention redesigns the reward function for the present navigation scene, encouraging the robot to dynamically avoid pedestrians while keeping a certain safe distance from them, keeping changes in speed and angle within a bounded range so they do not vary too abruptly, and reaching the designated target position quickly. This simulates real scenes more faithfully and enhances the convergence of training.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a general flow chart of a crowd navigation method based on deep reinforcement learning and a graph neural network according to an embodiment of the invention;
FIG. 2 is a system block diagram of a crowd navigation method based on deep reinforcement learning and a graph neural network according to an embodiment of the invention;
fig. 3 is a schematic diagram of a GAT network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a GRU network according to an embodiment of the invention;
FIG. 5 is a schematic diagram showing the structure of a SAC algorithm in an embodiment of the present invention;
fig. 6 is a schematic diagram of a simulation scenario of a robot-crowd navigation environment in an embodiment of the invention.
Detailed Description
It should be noted that the following detailed description is exemplary only for the purpose of describing particular embodiments and is intended to provide further explanation of the invention and is not intended to limit exemplary embodiments according to the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Furthermore, it will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or groups thereof.
As described in the background, mobile robots face a number of challenges in crowd navigation. First, in crowded environments, human-robot interaction is very complex and involves predicting and adapting to human behavior to ensure safe and smooth interaction between the robot and humans. Second, the mobile robot needs high environmental awareness so that it can identify obstacles and other pedestrians in time to plan a safe path. Finally, because crowd navigation scenes are irregular and the number of pedestrians around the robot changes continuously during human-robot interaction, traditional models such as CNNs (Convolutional Neural Networks) struggle to process such irregular navigation data and cannot be applied directly to crowd navigation graph data. The invention therefore introduces a graph neural network together with temporal-correlation modeling to capture complex human-human and human-robot interactions, improving the robot's understanding of irregular navigation data and its decision capability; it uses these interactions for navigation decisions and introduces maximum entropy into the reinforcement learning to improve the robot's exploration capability, thereby achieving higher performance and stronger social awareness in crowd navigation and enabling the mobile robot to navigate crowds more intelligently with safe, efficient, and natural movement.
The invention provides a crowd navigation method and system based on deep reinforcement learning and a graph neural network for mobile robots in dynamic crowds, proposing a mobile robot-crowd navigation model that applies a graph attention mechanism (GAT, Graph Attention Network) and a deep reinforcement learning (DRL) algorithm in the unstructured environment of a dynamic crowd. Specifically, the model comprises two main parts. The first part is a neural network structure that extracts the spatio-temporal feature information of robot-pedestrian interaction, comprising a graph attention network for extracting spatial features and a GRU (Gated Recurrent Unit) network for extracting temporal features. The second part is the SAC (Soft Actor-Critic) algorithm, which learns from and trains on the extracted spatio-temporal features. In the invention, the navigation problem of a robot in a dynamic crowd is first converted into a sequential-decision reinforcement learning problem, and the information of the robot and pedestrians in the environment is modeled; next, the state information of the robot and pedestrians is input into the GAT and GRU networks to obtain processed and updated state information, and the optimal robot action is then computed with the SAC algorithm. In addition, within the SAC algorithm, to simulate a more realistic social scene, accelerate convergence, and improve navigation safety, the invention also designs a denser reward function; finally, the optimal navigation decision of the robot in the dynamic crowd is obtained.
Example 1
The embodiment provides a crowd navigation method based on deep reinforcement learning and a graph neural network, as shown in fig. 1, comprising the following steps:
Step S1, modeling a simulation scene of the robot-crowd navigation environment, and converting the simulation scene into a graph representation;
Step S2, acquiring state feature information of the robot and of the pedestrians in the crowd from the graph representation;
Step S3, extracting a state graph representation containing spatial feature information through a GAT network, based on the state feature information of the robot and the pedestrians;
Step S4, extracting a state graph representation containing temporal feature information through a GRU network, based on the state graph representation containing spatial feature information, and fusing the two extracted state graph representations;
Step S5, based on the fused state graph representation containing spatio-temporal feature information, computing a behavior decision with the SAC algorithm to generate the optimal action for the robot to execute at the current moment, and iterating this computation until the robot reaches the set target position, collides, or exceeds the set maximum navigation time, thereby completing the navigation task.
In step S1, modeling is performed based on a simulation scene of the robot-crowd navigation environment, and the simulation scene is converted into a graph representation.
First, a simulation scene of the robot-crowd navigation environment is constructed. In this embodiment, the robot-crowd navigation environment simulation scene (hereinafter, the crowd navigation scene) is established using the SFM (Social Force Model) and the RVO2 (Reciprocal Velocity Obstacles) obstacle avoidance algorithm. As shown in the crowd navigation scene diagram of fig. 6, within a square area (the abscissa x and the ordinate y are in meters), a robot-crowd navigation scene is set randomly: one mobile robot and a crowd of n dynamic pedestrians are assumed, and the starting position and target position of the robot are set randomly. The robot is modeled as a unicycle; the social force model simulates the dynamic motion of pedestrians and the interactions between pedestrians, and the RVO2 obstacle avoidance algorithm (i.e., the ORCA obstacle avoidance algorithm) simulates the dynamic avoidance of the robot and pedestrians in the 2D plane.
The crowd navigation scene is then converted into a graph representation. Specifically, the graph representation formulates the human-robot interaction scene as a graph centered on the agents (i.e., the mobile robot and the pedestrians), $G = (V, E)$, where $V$ and $E$ denote the nodes and edges, respectively. In the robot's crowd navigation scene, the nodes of the graph data represent the robot and the pedestrians, and the edges represent the pairwise interaction relationships between nodes. In a crowd navigation scene, the robot must interact with the pedestrians to avoid possible collisions.
In this embodiment, the graph contains two types of nodes: a robot node and pedestrian nodes. The robot is taken as the robot node, and there is only one robot node in the graph representation, since this embodiment focuses on navigating a single robot. To avoid collisions, the robot needs to interact with all other pedestrians. Note that the robot is assumed to be invisible to pedestrians, so pedestrians do not react to the robot; because this embodiment focuses on an active collision avoidance strategy, the robot must take full responsibility for avoidance. The pedestrians in the crowd are taken as pedestrian nodes, whose number is defined by the simulation environment. Since it is assumed that pedestrians do not notice the robot, a pedestrian never initiates interaction with the robot.
In step S2, the state feature information of the robot and the pedestrians in the graph representation is acquired. The state feature information (state features, or simply states) of an agent (robot or pedestrian) in the crowd navigation environment can be divided into two parts. One part is the observable state: the agent's position $p = (p_x, p_y)$, velocity $v = (v_x, v_y)$, and radius $r$ (the radius is an abstract scale representing the extent of the agent). The other part is the hidden state, composed of the agent's expected target position $g = (g_x, g_y)$, preferred speed $v_{pref}$, and heading angle $\theta$. The state feature information of the $i$-th agent is acquired accordingly: the full state of the robot contains both parts, $s = [p_x, p_y, v_x, v_y, r, g_x, g_y, v_{pref}, \theta]$, while only the observable state of each pedestrian, $o = [p_x, p_y, v_x, v_y, r]$, is available. Both are expressed in robot-centered coordinates: the coordinate system takes the robot as the origin, with the $x$-axis pointing toward the robot's target position.
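For illustration only (the patent itself specifies no code), the following sketch shows how the nine-dimensional robot state and five-dimensional pedestrian observable state described above could be assembled; the function names and the NumPy representation are assumptions, not part of the disclosure.

```python
# Illustrative sketch: assembling the agent state vectors described above.
# Function names and array layout are assumptions, not the patent's code.
import numpy as np

def robot_full_state(px, py, vx, vy, r, gx, gy, v_pref, theta):
    """Full robot state (observable + hidden): [px, py, vx, vy, r, gx, gy, v_pref, theta],
    expressed in robot-centered coordinates with the x-axis toward the goal."""
    return np.array([px, py, vx, vy, r, gx, gy, v_pref, theta], dtype=np.float32)

def pedestrian_observable_state(px, py, vx, vy, r):
    """Observable pedestrian state: [px, py, vx, vy, r]."""
    return np.array([px, py, vx, vy, r], dtype=np.float32)
```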
In step S3, a state graph representation containing spatial feature information is extracted through the GAT network based on the state feature information of the robot and the pedestrians. This step specifically comprises the following:
Step S3.1: extract fixed-length latent state features through a multi-layer perceptron (MLP) based on the state feature information of the robot and the pedestrians. Because the state feature vectors of the robot and the pedestrians have different dimensions, they are first input into separate multi-layer perceptrons (MLPs); the two types of node features are then connected to form a larger feature space, yielding new fixed-length latent state feature vectors.
Specifically, as shown in fig. 2, "R" denotes the robot node, "P" denotes a pedestrian node, and the asterisk denotes the target position. The original input feature corresponding to the robot node is the robot's state $s$, and the original input feature corresponding to a pedestrian node is the pedestrian's observable state $o$. The MLP serves as layer 0 and applies only a simple linear transformation to the original input features: $s$ and $o$ are respectively input into linear MLPs with shared node weights, producing the hidden state features (i.e., latent state features) of the robot node and the pedestrian nodes.
After the hidden state features of the robot node and the pedestrian nodes are obtained, an aggregation (concatenation) operation is applied to the hidden state features of all nodes: the robot's hidden state features and the pedestrians' hidden state features are joined to give the final fixed-length hidden state feature vector.
The latent state features are extracted using the MLPs as follows:
$$H_r^0 = \phi_r(s; W_r), \qquad H_p^0 = \phi_p(o; W_p), \qquad H^0 = [H_r^0; H_p^0]$$
where $\phi_r$ and $\phi_p$ denote multi-layer perceptrons (MLPs) with ReLU activation functions, and $W_r$ and $W_p$ are the corresponding network weights; $H_r^0$ is the latent state feature matrix of the robot, $H_p^0$ is the latent state feature matrix of the pedestrians, and $H^0$ is the final latent state feature matrix, i.e., the layer-0 latent state features of the neural network.
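As a sketch of step S3.1, the embedding could be implemented as follows in PyTorch; the latent dimension of 64 and the module structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NodeEmbedding(nn.Module):
    """Embed the 9-dim robot state and 5-dim pedestrian states into a shared
    fixed-length latent space and stack them into one node-feature matrix H^0.
    Dimensions are illustrative, not taken from the patent."""
    def __init__(self, robot_dim=9, ped_dim=5, latent_dim=64):
        super().__init__()
        self.phi_r = nn.Sequential(nn.Linear(robot_dim, latent_dim), nn.ReLU())
        self.phi_p = nn.Sequential(nn.Linear(ped_dim, latent_dim), nn.ReLU())

    def forward(self, robot_state, ped_states):
        # robot_state: (1, robot_dim); ped_states: (n, ped_dim)
        h_r = self.phi_r(robot_state)        # (1, latent_dim)
        h_p = self.phi_p(ped_states)         # (n, latent_dim)
        return torch.cat([h_r, h_p], dim=0)  # (n+1, latent_dim), i.e. H^0
```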
Step S3.2: based on the extracted latent state features, compute the relation matrix of the robot neighborhood using a pairwise similarity function.
After the state feature matrix $H^0$ is obtained, the relation matrix $A$ is computed using a pairwise similarity function (embedded Gaussian); the edges of the graph represent the pairing relationships between nodes. The relation matrix between the robot node and the pedestrian nodes is
$$A = \mathrm{softmax}\big((H^0 W_a)(H^0 W_b)^{\top}\big)$$
where $W_a$ and $W_b$ are weight matrices.
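A minimal sketch of this computation, assuming the embedded-Gaussian form from the non-local-network literature; the two projection matrices stand in for the patent's weight matrices:

```python
import torch

def relation_matrix(h0, w_a, w_b):
    """Pairwise embedded-Gaussian similarity between all nodes.
    h0: (N, d) latent node features; w_a, w_b: (d, d) learned projections.
    Returns a row-normalized (N, N) relation matrix A."""
    theta = h0 @ w_a                    # query-side embedding, (N, d)
    phi = h0 @ w_b                      # key-side embedding, (N, d)
    scores = theta @ phi.T              # pairwise similarities, (N, N)
    return torch.softmax(scores, dim=-1)
```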
Step S3.3: extract the state graph representation containing spatial feature information through the GAT network, based on the latent state features of the robot and pedestrians and the relation matrix of the robot neighborhood.
As shown in fig. 3, the layer-0 latent state features (i.e., hidden states) of the robot and pedestrians and the relation matrix (i.e., neighborhood matrix) of the robot neighborhood are input into the first GAT layer for encoding, which outputs the first-layer latent state features $H^1$ and the corresponding attention weights $\alpha^1$:
$$H^1, \alpha^1 = \mathrm{GAT}^1(H^0, A)$$
The first-layer latent state features of the robot and pedestrians and the relation matrix of the robot neighborhood are then input into the second GAT layer for encoding, which outputs the second-layer latent state features $H^2$ and the corresponding attention weights $\alpha^2$:
$$H^2, \alpha^2 = \mathrm{GAT}^2(H^1, A)$$
The outputs of the two GAT layers are added, together with the agent-level state features obtained from the earlier MLP, to form the final crowd navigation scene graph representation containing spatial feature information:
$$H = H^0 + H^1 + H^2$$
where $H$ is the graph state representation finally output by the GAT network, $H^0$ is the hidden state feature output obtained after the original input state features pass through the MLPs, $H^1$ is the latent state feature output after one GAT layer, and $H^2$ is the latent state feature output after two GAT layers.
In this embodiment, the GAT network is as follows. The core idea of GAT is to aggregate the information embedded in neighboring nodes in the form of a weighted sum; it differs from other methods in that the attention weights are "self-learned" by the model through an attention mechanism rather than calculated manually from a predefined formula.
For the $i$-th node and its neighborhood $\mathcal{N}_i$, the layered graph convolution proceeds as follows. First, a query matrix $Q$ and a key matrix $K$ are established to convert the input features into higher-level features. Second, these features are connected through a fully connected network (FCN) with a LeakyReLU activation function to compute the attention coefficients $e_{ij}$. Then the attention coefficients are normalized with the softmax function to obtain $\alpha_{ij}$, and the weights $\alpha_{ij}$ are used to form a weighted sum of the hidden states. The larger $\alpha_{ij}$ is, the more attention the $i$-th agent needs to pay to the $j$-th agent when avoiding a collision.
Further, the above attention mechanism can be described as:
$$e_{ij}^{(l)} = \mathrm{LeakyReLU}\!\left(\mathrm{FCN}\!\left[\,W_Q h_i^{(l)} \,\|\, W_K h_j^{(l)}\,\right]\right)$$
$$\alpha_{ij} = \mathrm{softmax}_j\!\left(e_{ij}^{(l)}\right) = \frac{\exp\!\left(e_{ij}^{(l)}\right)}{\sum_{k \in \mathcal{N}_i} \exp\!\left(e_{ik}^{(l)}\right)}$$
$$h_i^{(l+1)} = \sum_{j \in \mathcal{N}_i} \alpha_{ij}\, h_j^{(l)}$$
where the superscript $(l)$ distinguishes the latent states of different layers; $h_i^{(l)}$ and $h_j^{(l)}$ denote the layer-$l$ latent state features of the $i$-th and $j$-th nodes; $W_Q$ and $W_K$ denote the weights of the query matrix $Q$ and the key matrix $K$, so that $W_Q h_i^{(l)}$ and $W_K h_j^{(l)}$ are the features converted by the query and key matrices; $\|$ denotes the concatenation operation, and a fully connected network FCN serves as the attention function mapping the concatenated state to an attention weight; $\alpha_{ij}$ is the normalized attention weight representing the importance of the $j$-th node to the $i$-th node; $\mathcal{N}_i$ is the neighborhood of the $i$-th node in the graph representation, and $k$ indexes the nodes in that neighborhood.
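A simplified single-head GAT layer along the lines of these equations (query/key projections, LeakyReLU-scored attention, softmax normalization, weighted sum); the masking rule and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GATLayer(nn.Module):
    """One graph-attention layer in the spirit of the equations above.
    Single-head sketch; dimensions and the masking rule are assumptions."""
    def __init__(self, dim=64):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # query projection W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # key projection W_K
        self.attn = nn.Linear(2 * dim, 1)            # FCN mapping [q || k] -> score
        self.leaky = nn.LeakyReLU(0.2)

    def forward(self, h, adj):
        # h: (N, dim) node features; adj: (N, N) relation/neighborhood matrix
        n = h.size(0)
        q = self.w_q(h).unsqueeze(1).expand(n, n, -1)   # (N, N, dim)
        k = self.w_k(h).unsqueeze(0).expand(n, n, -1)   # (N, N, dim)
        e = self.leaky(self.attn(torch.cat([q, k], dim=-1))).squeeze(-1)  # (N, N)
        e = e.masked_fill(adj <= 0, float('-inf'))  # restrict attention to the neighborhood
        alpha = torch.softmax(e, dim=-1)            # normalized attention weights
        return alpha @ h, alpha                     # weighted sum of hidden states
```

Stacking two such layers and summing their outputs with the layer-0 features then yields the fused spatial representation $H = H^0 + H^1 + H^2$ given above.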
In step S4, based on the state graph representation containing spatial feature information, a state graph representation containing temporal feature information is extracted through the GRU network, and the two are fused to obtain the state graph representation containing spatio-temporal feature information.
As shown in fig. 4, the GRU (Gated Recurrent Unit) network is a recurrent neural network (RNN) that, like LSTM (Long Short-Term Memory), was proposed to address long-term memory and the gradient problems of backpropagation, while training considerably more efficiently. The GRU network has two inputs: the state graph representation containing spatial feature information at the current moment, $H_t$ (i.e., the hidden state currently output by the GAT), and the state graph representation containing spatio-temporal feature information from the previous moment, $h_{t-1}$ (i.e., the hidden state carrying temporal information from the last time step), which acts as the neural network's memory and contains the information passed along by the previous nodes. Combining $H_t$ and $h_{t-1}$, the GRU network outputs the current hidden state $h_t$, which is also the hidden state passed to the next moment. The hidden state $h_t$ fuses the spatial and temporal feature information of the preceding nodes and can be expressed as:
$$h_t = \mathrm{GRU}(h_{t-1}, H_t)$$
where $h_{t-1}$ is the spatio-temporal state feature of the previous moment and $h_t$ is the spatio-temporal state feature of the current moment. The output of the GRU network and the output of the GAT network are fused and then input into the SAC algorithm to obtain the values of the action and value functions.
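A sketch of this temporal step, treating the N graph nodes as a batch for a GRU cell; the residual-style fusion of the GAT and GRU outputs is an assumption, since the patent does not spell out the fusion operator:

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Link the spatial graph states of adjacent time steps: h_t = GRU(h_{t-1}, H_t)."""
    def __init__(self, dim=64):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, H_t, h_prev):
        # H_t: (N, dim) GAT output at time t; h_prev: (N, dim) spatio-temporal
        # state carried over from t-1 (the network's "memory").
        h_t = self.gru(H_t, h_prev)
        return H_t + h_t, h_t   # assumed fusion: sum of GAT and GRU outputs
```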
In step S5, based on the state graph representation containing spatio-temporal feature information, a behavior decision is computed with the SAC algorithm to generate the optimal action for the robot to execute at the current moment; the computation iterates until the robot reaches the set target position, collides, or exceeds the set maximum navigation time, completing the navigation task.
The state graph representation containing spatial and temporal feature information is input into the deep reinforcement learning algorithm for training and learning to complete the navigation task. In this embodiment, the deep reinforcement learning algorithm is the model-free SAC (Soft Actor-Critic) algorithm, which addresses reinforcement learning over both discrete and continuous action spaces and is a model-free, off-policy reinforcement learning algorithm combining maximum entropy learning with the Actor-Critic (i.e., action-evaluation) framework. The SAC algorithm has two versions. SAC-v1 combines soft Q-learning with the Actor-Critic framework, learning a state-value network, a soft Q network, and an Actor network, with a constant entropy coefficient. SAC-v2, compared with SAC-v1, removes the state-value network learning and introduces adaptive adjustment of the entropy temperature coefficient, making the algorithm lighter and more robust. This embodiment uses the SAC-v2 method.
SAC is an algorithm based on the AC (Actor-Critic) framework, i.e., a combination of policy-based and value-based methods. As shown in fig. 5, the SAC algorithm comprises six networks: an Actor network and two Critic networks serving as prediction networks, plus a target Actor network and two target Critic networks serving as target networks. Each prediction network and its corresponding target network have the same structure but different parameters.
The Actor network serves as the prediction policy network and is responsible for generating actions and directing the robot's interaction with the environment. In each episode, the state graph representation containing spatio-temporal feature information is input into the Actor policy network, and an action $a_t$ is selected according to a probability distribution. When the action selected by the robot acts on the environment, the environment moves to the next state $s_{t+1}$ according to the unknown state transition probability $P(s_{t+1} \mid s_t, a_t)$, and in return the robot receives the reward $r_t$. Meanwhile, the pedestrians take actions according to their own policies and enter their next states, with unknown state transition probabilities. The parameters of the Actor network are updated by gradient descent according to the policy gradient, improving the quality of the generated actions.
The Critic networks serve as prediction value networks and are responsible for evaluating the Actor's performance and guiding the Actor's action in the next stage. The state features are therefore input into the Critic networks together with the action, and the value of the state-action pair is output as a $Q$ value; two $Q$ values are output (i.e., $Q_1$ and $Q_2$), and the smaller of the two is taken. The parameters of the Critic networks are updated by gradient descent on the value-function loss so as to estimate the value of the state-action pair more accurately.
The target Actor policy network and the target Critic value networks update their parameters by soft updates, so the parameters change more slowly, improving training stability. Introducing target networks helps improve the stability of the algorithm and reduce the variance in training, making the SAC algorithm easier to converge to a robust, high-performance policy.
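The soft update mentioned here is commonly implemented as a Polyak average; a sketch, with the rate tau as an illustrative value:

```python
import torch

def soft_update(target_net, net, tau=0.005):
    """Slowly track the prediction network:
    theta_target <- (1 - tau) * theta_target + tau * theta."""
    with torch.no_grad():
        for tp, p in zip(target_net.parameters(), net.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```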
Compared with traditional deterministic policy learning algorithms, the SAC algorithm performs better in robot task learning and is particularly suitable for real-world application scenarios. A traditional deterministic policy learning algorithm typically searches for a single optimal path, and once it is found the learning process ends. What makes the SAC algorithm distinctive is its stochastic policy: it tends to output actions with a dispersed probability distribution rather than concentrating the probability on one specific action. The core idea of the SAC algorithm's maximum entropy principle is not to ignore any potentially useful action. The SAC algorithm pursues maximum policy entropy, which means the neural network must explore all possibly optimal paths. The key of the SAC algorithm is therefore the introduction of maximum entropy, which strengthens action exploration, improves the robustness and generalization of the algorithm, and further stabilizes model performance.
Specifically, the crowd navigation problem is converted into a reinforcement learning decision problem. In the reinforcement learning framework, the problem of a mobile robot navigating in 2D space and interacting with $n$ dynamic pedestrians is modeled as a partially observable Markov decision process (POMDP), which can be represented by the tuple $(S, A, P, R, S')$, where $S$ is the state space at the current moment, $A$ is the robot's action space, $P$ is the state transition function, $R$ is the reward function, and $S'$ is the state space at the next moment.
The state space at the current moment represents the state information of the agents in the environment, including the robot and the pedestrians, and comprises both observable and hidden states. The joint state can be described as $S_t = [s_t, o_t^1, o_t^2, \ldots, o_t^n]$, i.e., the robot's full state together with the observable states of the $n$ pedestrians.
The robot's action space $A$ is discrete. Specifically, the robot's action at time $t$ can be described as $a_t = [v_t, \theta_t]$, where $v_t$ and $\theta_t$ are the robot's target speed and steering angle.
As shown in fig. 5, the robot-environment interaction proceeds as follows. In each episode, the robot perceives the state $s_t$ of the current environment (the state features processed by steps S2-S4), and under the current policy $\pi$ of the Actor policy network, at each time step $t$ the robot selects an action $a_t$ and executes it. When the selected action $a_t$ acts on the environment, the environment transitions to the next state $s_{t+1}$ according to the unknown state transition probability $P(s_{t+1} \mid s_t, a_t)$, and in return the robot receives the reward $r_t$. Meanwhile, the pedestrians take actions according to their own policies and enter their next states, with unknown state transition probabilities.
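One interaction step could be sketched as follows; the env and actor interfaces are hypothetical placeholders, not APIs defined by the patent:

```python
def rollout_step(env, actor, s_t):
    """Sample a_t = (target speed, steering angle) from the Actor's output
    distribution, execute it, and observe the transition and reward."""
    dist = actor(s_t)                   # action probability distribution pi(.|s_t)
    a_t = dist.sample()                 # stochastic action selection
    s_next, r_t, done = env.step(a_t)   # unknown transition dynamics
    return a_t, s_next, r_t, done
```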
To simulate the robot's motion more realistically, this embodiment designs a denser reward function. The reward function is a weighted sum of eight components:
$$r_t = w_1 r_{goal} + w_2 r_{col} + w_3 r_{safe} + w_4 r_{vel} + w_5 r_{ang} + w_6 r_{still} + w_7 r_{back} + w_8 r_{dist}$$
where $r_{goal}$ guides the robot forward toward the target; $r_{col}$ penalizes collisions between the robot and pedestrians; $r_{safe}$ encourages the robot to maintain a safe distance from pedestrians; $r_{vel}$ and $r_{ang}$ keep the changes in speed and angle between the previous and current moments within a certain range, penalizing abrupt changes in the robot's velocity and heading; $r_{still}$ penalizes the robot remaining stationary; $r_{back}$ penalizes backward travel by the robot; and $r_{dist}$ is a reward computed from the distance travelled.
The quantities entering these components are: $p_{t-1}$ and $p_t$, the robot's positions at the previous and current moments; $p_g$, the robot's target position; $\rho$, the robot's radius, and $\rho_i$, the radius of the $i$-th pedestrian; $d_t^i$, the actual nearest distance between the robot and the $i$-th pedestrian at time $t$; $d_{disc}$, a given fixed threshold for the uncomfortable or unsafe distance between robot and pedestrian; $v_{t-1}$ and $v_t$, the robot's linear velocities at the previous moment $t-1$ and the current moment $t$; $\theta_{t-1}$ and $\theta_t$, the robot's heading angles at the previous and current moments; and $v_x$ and $v_y$, the components of the current linear velocity along the $x$ and $y$ axes. The weights $w_1, \ldots, w_8$ of the reward function are 1.0, 1.0, 1.0, -0.5, -0.5, -0.5, -0.5, and -0.5, respectively.
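Because the closed forms of the individual components are not reproduced above, the sketch below uses plausible stand-in definitions: only the component purposes and the weights are taken from the description, and the signs inside each component are assumptions chosen so that the stated weights yield penalties where penalties are described.

```python
import numpy as np

def total_reward(p_prev, p_cur, p_goal, d_min, d_disc, v_prev, v_cur,
                 th_prev, th_cur, collided, moved_back,
                 w=(1.0, 1.0, 1.0, -0.5, -0.5, -0.5, -0.5, -0.5)):
    """Illustrative eight-component weighted reward; NOT the patent's exact formulas."""
    r_goal  = np.linalg.norm(p_prev - p_goal) - np.linalg.norm(p_cur - p_goal)  # progress
    r_col   = -1.0 if collided else 0.0                     # collision penalty
    r_safe  = -(d_disc - d_min) if d_min < d_disc else 0.0  # discomfort-distance intrusion
    r_vel   = abs(v_cur - v_prev)                           # abrupt speed change
    r_ang   = abs(th_cur - th_prev)                         # abrupt heading change
    r_still = 1.0 if abs(v_cur) < 1e-3 else 0.0             # standing still
    r_back  = 1.0 if moved_back else 0.0                    # backward travel
    r_dist  = np.linalg.norm(p_cur - p_prev)                # distance travelled this step
    comps = (r_goal, r_col, r_safe, r_vel, r_ang, r_still, r_back, r_dist)
    return float(sum(wi * ri for wi, ri in zip(w, comps)))
```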
Further, in the above process the Critic serves as the value function, responsible for evaluating the Actor's performance and guiding the Actor's action in the next stage. The state features are therefore input into the Critic network together with the action, and the value of the state-action pair is output. The loss function of the Q network (i.e., the Critic network) in the SAC algorithm is
$$J_Q(\theta) = \mathbb{E}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - \hat{y}_t\big)^2\Big], \qquad \hat{y}_t = r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\Big[\min\big(Q_{\bar\theta_1}(s_{t+1}, a_{t+1}),\, Q_{\bar\theta_2}(s_{t+1}, a_{t+1})\big) - \alpha \log \pi(a_{t+1} \mid s_{t+1})\Big]$$
where $r_t$ is the reward value computed from the reward function; $\gamma$ is the discount factor; $a_{t+1}$ ranges over the possible actions re-predicted through the Actor policy network; $Q_{\bar\theta_1}$ and $Q_{\bar\theta_2}$ are the predicted values of the two target Critic networks; and $\alpha$ is the entropy temperature coefficient, which determines the relative importance of the entropy term: the larger its value, the more important the entropy. Introducing the maximum entropy strengthens action exploration and improves the robustness and generalization of the algorithm, making model performance more stable.
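A sketch of the corresponding target computation in the standard SAC-v2 form; the network and distribution interfaces are placeholders:

```python
import torch

def critic_target(r_t, s_next, actor, target_q1, target_q2, alpha, gamma=0.99):
    """Soft Bellman target y_t for the Q-network loss above."""
    dist = actor(s_next)                        # policy distribution pi(.|s_{t+1})
    a_next = dist.rsample()                     # re-predicted action
    log_pi = dist.log_prob(a_next).sum(-1)      # entropy term (summed over action dims)
    q_min = torch.min(target_q1(s_next, a_next),
                      target_q2(s_next, a_next))  # clipped double-Q
    return r_t + gamma * (q_min - alpha * log_pi)
    # The Q loss is then the mean squared error between Q_theta(s_t, a_t) and this target.
```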
Through the above deep reinforcement learning training, the optimal action policy at the current moment is learned and generated, and the iterative computation continues until the robot reaches the set target position, collides, or exceeds the set maximum navigation time, completing the navigation task.
By the above method, the graph neural network and the temporal function GRU capture the complex human-robot and human-human interactions and the temporal correlation of the crowd navigation graph across time steps; the learned, temporally correlated interactions are used in the navigation policy, enabling the robot to make longer-horizon and more accurate decisions.
Example 2
The embodiment provides a crowd navigation system based on deep reinforcement learning and a graph neural network, comprising:
a model construction module for modeling a simulation scene of the robot-crowd navigation environment and converting the simulation scene into a graph representation;
a state information acquisition module for acquiring state feature information of the robot and of the pedestrians in the graph representation;
a state information processing module for extracting a state graph representation containing spatial feature information through a GAT network based on the state feature information of the robot and the pedestrians, for extracting a state graph representation containing temporal feature information through a GRU network based on the state graph representation containing spatial feature information, and for fusing the two extracted state graph representations;
a crowd navigation module for computing a behavior decision with the SAC algorithm based on the fused state graph representation containing spatio-temporal feature information, generating the optimal action for the robot to execute at the current moment, and iterating the computation until the robot reaches the set target position, collides, or exceeds the set maximum navigation time, thereby completing the navigation task.
The steps involved in the second embodiment correspond to those of the first method embodiment; for details, see the relevant description of the first embodiment.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by general-purpose computing means; alternatively, they may be implemented by program code executable by computing means, so that they may be stored in storage means and executed by computing means, or made into individual integrated circuit modules, or several of the modules or steps among them made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
While the present invention has been described in connection with the preferred embodiments, it should be understood that the invention is not limited to the specific embodiments; its scope is set forth in the following claims.

Claims (4)

1. A crowd navigation method based on deep reinforcement learning and a graph neural network, characterized by comprising the following steps:
modeling a simulation scene of the robot-crowd navigation environment and converting the simulation scene into a graph representation;
acquiring state feature information of the robot and of the pedestrians in the crowd from the graph representation, wherein the state feature information of the robot comprises the robot's position, speed, and radius, and the robot's expected target position, preferred speed, and heading angle;
and the state feature information of a pedestrian comprises the pedestrian's position, speed, and radius;
extracting a state graph representation containing spatial feature information through a GAT network based on the state feature information of the robot and the pedestrians, wherein said extracting comprises:
extracting fixed-length latent state features through a multi-layer perceptron (MLP) based on the state feature information of the robot and the pedestrians;
computing the relation matrix of the robot neighborhood using a pairwise similarity function based on the extracted latent state features; and
extracting the state graph representation containing spatial feature information through the GAT network based on the latent state features of the robot and pedestrians and the relation matrix of the robot neighborhood;
extracting a state graph representation containing temporal feature information through a GRU network based on the state graph representation containing spatial feature information, and fusing the two extracted state graph representations, wherein said extracting and fusing comprise:
fusing the feature information of the state graph representation containing spatial feature information at the current moment with the state feature information of the robot, inputting the fused features and the state graph representation containing spatio-temporal feature information from the previous moment into the GRU network, and outputting the state graph representation containing spatio-temporal feature information at the current moment;
based on the fused state graph representation containing spatio-temporal feature information, computing a behavior decision with the SAC algorithm to generate the optimal action for the robot to execute at the current moment, and iterating the computation until the robot reaches the set target position, collides, or exceeds the set maximum navigation time, thereby completing the navigation task, wherein said computing comprises:
inputting the state graph representation containing spatio-temporal feature information at the current moment into the Actor policy network of the SAC algorithm, selecting an action according to a probability distribution and executing it; when the action selected by the robot acts on the environment, transitioning to the next state according to the unknown state transition probability, in return for which the robot receives a reward, while the pedestrians take actions according to their own policies and enter their next states; and
evaluating the performance of the executed action through the Critic value network, generating the optimal action for the robot to execute at the current moment, and guiding the robot's action in the next stage.
2. The crowd navigation method based on deep reinforcement learning and graph neural network of claim 1, wherein the extracting, through the GAT network, of the state graph representation containing spatial feature information based on the latent state features of the robot and pedestrians and the relation matrix of the robot neighborhood comprises:
inputting the latent state features of the robot and pedestrians and the relation matrix of the robot neighborhood into the first GAT layer for encoding, and outputting the first-layer latent state features and the corresponding attention weights;
inputting the first-layer latent state features of the robot and pedestrians and the relation matrix of the robot neighborhood into the second GAT layer for encoding, and outputting the second-layer latent state features and the corresponding attention weights; and
adding the latent state features output by the first and second GAT layers, together with the initial latent state features of the robot and pedestrians, to form the final crowd navigation scene graph representation containing spatial feature information.
3. The crowd navigation method based on deep reinforcement learning and graph neural network of claim 1, wherein the reward is calculated based on a reward function that is a weighted sum of eight components, the components comprising: a component for guiding the robot forward toward the target; a component for penalizing collisions between the robot and pedestrians; a component for encouraging the robot to maintain a safe distance from pedestrians; components for keeping the changes in the robot's speed and angle within a certain range and penalizing abrupt changes in speed and angle; a component for penalizing the robot remaining stationary; a component for penalizing backward travel by the robot; and a component for rewards computed from the distance travelled.
4. A crowd navigation system based on deep reinforcement learning and a graph neural network, characterized by comprising:
a model construction module for modeling a simulation scene of the robot-crowd navigation environment and converting the simulation scene into a graph representation;
a state information acquisition module for acquiring state feature information of the robot and of the pedestrians in the graph representation, wherein the state feature information of the robot comprises the robot's position, speed, and radius, and the robot's expected target position, preferred speed, and heading angle;
and the state feature information of a pedestrian comprises the pedestrian's position, speed, and radius;
a state information processing module for extracting a state graph representation containing spatial feature information through a GAT network based on the state feature information of the robot and the pedestrians, wherein said extracting comprises:
extracting fixed-length latent state features through a multi-layer perceptron (MLP) based on the state feature information of the robot and the pedestrians;
computing the relation matrix of the robot neighborhood using a pairwise similarity function based on the extracted latent state features; and
extracting the state graph representation containing spatial feature information through the GAT network based on the latent state features of the robot and pedestrians and the relation matrix of the robot neighborhood;
and for extracting a state graph representation containing temporal feature information through a GRU network based on the state graph representation containing spatial feature information, and fusing the two extracted state graph representations, wherein said extracting and fusing comprise:
fusing the feature information of the state graph representation containing spatial feature information at the current moment with the state feature information of the robot, inputting the fused features and the state graph representation containing spatio-temporal feature information from the previous moment into the GRU network, and outputting the state graph representation containing spatio-temporal feature information at the current moment; and
a crowd navigation module for computing a behavior decision with the SAC algorithm based on the fused state graph representation containing spatio-temporal feature information, generating the optimal action for the robot to execute at the current moment, and iterating the computation until the robot reaches the set target position, collides, or exceeds the set maximum navigation time, thereby completing the navigation task, wherein said computing comprises:
inputting the state graph representation containing spatio-temporal feature information at the current moment into the Actor policy network of the SAC algorithm, selecting an action according to a probability distribution and executing it; when the action selected by the robot acts on the environment, transitioning to the next state according to the unknown state transition probability, in return for which the robot receives a reward, while the pedestrians take actions according to their own policies and enter their next states; and
evaluating the performance of the executed action through the Critic value network, generating the optimal action for the robot to execute at the current moment, and guiding the robot's action in the next stage.
CN202311450875.7A 2023-11-03 2023-11-03 Crowd navigation method and system based on deep reinforcement learning and graph neural network Active CN117191046B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311450875.7A CN117191046B (en) 2023-11-03 2023-11-03 Crowd navigation method and system based on deep reinforcement learning and graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311450875.7A CN117191046B (en) 2023-11-03 2023-11-03 Crowd navigation method and system based on deep reinforcement learning and graph neural network

Publications (2)

Publication Number Publication Date
CN117191046A CN117191046A (en) 2023-12-08
CN117191046B true CN117191046B (en) 2024-01-26

Family

ID=88985366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311450875.7A Active CN117191046B (en) 2023-11-03 2023-11-03 Crowd navigation method and system based on deep reinforcement learning and graph neural network

Country Status (1)

Country Link
CN (1) CN117191046B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061277B (en) * 2019-12-31 2022-04-05 歌尔股份有限公司 Unmanned vehicle global path planning method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114485673A (en) * 2022-02-09 2022-05-13 山东大学 Service robot crowd perception navigation method and system based on deep reinforcement learning
CN114964247A (en) * 2022-03-21 2022-08-30 山东大学 Crowd sensing navigation method and system based on high-order graph convolution neural network
CN114662639A (en) * 2022-03-24 2022-06-24 河海大学 Multi-agent reinforcement learning method and system based on value decomposition
CN115457081A (en) * 2022-08-31 2022-12-09 南京工业大学 Hierarchical fusion prediction method based on graph neural network
CN116202526A (en) * 2023-02-24 2023-06-02 北京工业大学 Crowd navigation method combining double convolution network and cyclic neural network under limited view field
CN116682056A (en) * 2023-05-17 2023-09-01 华南理工大学 Multi-mode crowd density prediction method based on time convolution network
CN116772857A (en) * 2023-06-25 2023-09-19 广西大学 Path planning method for bidirectional gating cyclic network and quantum transducer network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A survey of deep learning-based action recognition algorithms; He Lei; Shao Zhanpeng; Zhang Jianhua; Zhou Xiaolong; Computer Science (S1); full text *
Research on neural-network-based reinforcement learning in service robot navigation; Chen Shuang; Li Long; Luo Hainan; Modern Computer (12); full text *

Also Published As

Publication number Publication date
CN117191046A (en) 2023-12-08

Similar Documents

Publication Publication Date Title
Chen et al. Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning
Jesus et al. Deep deterministic policy gradient for navigation of mobile robots in simulated environments
Sakuma et al. Psychological model for animating crowded pedestrians
Wang et al. A survey of learning‐based robot motion planning
Xie et al. Learning with stochastic guidance for robot navigation
Wu et al. Learn to navigate autonomously through deep reinforcement learning
Kanezaki et al. Goselo: Goal-directed obstacle and self-location map for robot navigation using reactive neural networks
Liu et al. Deep structured reactive planning
Jiang et al. A brief survey: Deep reinforcement learning in mobile robot navigation
CN113792930B (en) Blind person walking track prediction method, electronic equipment and storage medium
Golchoubian et al. Pedestrian trajectory prediction in pedestrian-vehicle mixed environments: A systematic review
He et al. IRLSOT: Inverse reinforcement learning for scene‐oriented trajectory prediction
Yu et al. Hybrid attention-oriented experience replay for deep reinforcement learning and its application to a multi-robot cooperative hunting problem
Wu et al. Vision-language navigation: a survey and taxonomy
Liu et al. Socially aware robot crowd navigation with interaction graphs and human trajectory prediction
Lu et al. Socially aware robot navigation in crowds via deep reinforcement learning with resilient reward functions
De Jesus et al. Deep deterministic policy gradient for navigation of mobile robots
CN113515131A (en) Mobile robot obstacle avoidance method and system based on condition variation automatic encoder
CN117191046B (en) Crowd navigation method and system based on deep reinforcement learning and graph neural network
Petrović et al. Efficient machine learning of mobile robotic systems based on convolutional neural networks
CN114964247A (en) Crowd sensing navigation method and system based on high-order graph convolution neural network
CN116202526A (en) Crowd navigation method combining double convolution network and cyclic neural network under limited view field
Wang et al. NavFormer: A Transformer Architecture for Robot Target-Driven Navigation in Unknown and Dynamic Environments
Zipfl et al. Utilizing Hybrid Trajectory Prediction Models to Recognize Highly Interactive Traffic Scenarios
Li et al. Multi-Agent Dynamic Relational Reasoning for Social Robot Navigation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant