CN113359717A - Mobile robot navigation obstacle avoidance method based on deep reinforcement learning - Google Patents

Mobile robot navigation obstacle avoidance method based on deep reinforcement learning Download PDF

Info

Publication number
CN113359717A
CN113359717A CN202110575846.8A
Authority
CN
China
Prior art keywords
robot
obstacle avoidance
network
reinforcement learning
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110575846.8A
Other languages
Chinese (zh)
Other versions
CN113359717B (en)
Inventor
刘安东
崔奇
夏浩
周时钎
滕游
张文安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110575846.8A priority Critical patent/CN113359717B/en
Publication of CN113359717A publication Critical patent/CN113359717A/en
Application granted granted Critical
Publication of CN113359717B publication Critical patent/CN113359717B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02: Control of position or course in two dimensions
    • G05D1/021: Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G05D1/0221: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process

Abstract

A mobile robot navigation obstacle avoidance method based on deep reinforcement learning uses a neural network to extract features of the pairwise interaction between the robot and each person and captures interactions among people through a local map. A self-attention mechanism aggregates the interaction features to infer the relative importance of neighboring humans with respect to their future states; a value network is pre-trained with a reinforcement learning method, and a safe action space is added during training to prevent emergencies. A two-dimensional grid map is built, a global path toward a set global target point is planned with an RRT (rapidly-exploring random tree) algorithm, the nearest point within a circle centered on the current robot position with radius r is found and set as a dynamic local target, and the optimal action is selected through the optimal strategy, finally realizing the navigation obstacle avoidance function of the mobile robot. The invention solves problems such as short-sightedness and slow reaction of the robot in existing navigation obstacle avoidance processes.

Description

Mobile robot navigation obstacle avoidance method based on deep reinforcement learning
Technical Field
The invention relates to a navigation obstacle avoidance method of a mobile robot, in particular to a navigation obstacle avoidance method of a mobile robot based on the combination of deep learning and reinforcement learning, and belongs to the field of mobile robots.
Background
A robot is a device that integrates technologies from multiple intersecting fields such as machinery, electronics, computing, control and artificial intelligence; the mobile robot in particular, owing to its autonomous control and flexible motion, is widely applied in many areas of production and daily life. A mobile robot is an intelligent robot that can sense the external environment through sensors and drive continuously and autonomously in complex environments; it involves disciplines such as information perception, motion planning and autonomous control, and is among the latest achievements of artificial intelligence technology and computer information science.
The most important capability of a mobile robot is navigation, i.e. the ability to avoid obstacles within its workspace and move safely from an initial position to a target position. In recent years the application of mobile robots has expanded to unknown outdoor environments such as the deep sea and polar regions, which places higher requirements on their navigation function. Autonomous navigation and motion control technology is the key to trajectory planning and motion of a mobile robot under unknown, unstructured environmental conditions, so research on mobile robot navigation control methods has important theoretical and application value.
Deep learning and reinforcement learning are two important branches in the field of machine learning, and are always taken as a research hotspot by scholars at home and abroad, particularly in the field of mobile robots.
The learning process of reinforcement learning is dynamic and continuously interactive, and the required data are also generated by constant interaction with the environment. Reinforcement learning involves many elements, such as actions, environments, state transition probabilities and reward functions. In addition, deep learning techniques such as image recognition and speech recognition solve perception problems, whereas reinforcement learning solves decision problems. Deep reinforcement learning, produced by combining mature deep learning techniques with reinforcement learning algorithms, is therefore regarded as a development trend of artificial intelligence.
In reality, mobile robots generally operate in environments where people, machines and objects coexist; in such environments, many moving agents, people and other devices must plan and execute trajectories in narrow, crowded spaces, so mobile robots are required to avoid obstacles in crowded scenes. As humans, we have the innate ability to adjust our behavior by observing others, so we can easily pass through crowds or other objects. For mobile robots, however, collision-free navigation in dynamic and crowded scenes is still a difficult task. Conventional mobile robot navigation methods generally regard moving agents as static obstacles or take the next action according to specific interaction rules; they prevent collisions through passive reaction or manually defined functions to ensure safety, which leads to problems such as short-sightedness, slow reaction and a lack of safety.
Disclosure of Invention
In order to overcome the shortcomings of the prior art and based on research into deep learning and reinforcement learning, the invention provides a mobile robot navigation obstacle avoidance method based on deep reinforcement learning, which can predict human dynamics, solves problems such as short-sightedness and slow reaction of the robot during navigation obstacle avoidance, and adds a safety mechanism to prevent emergencies during navigation. Finally, a dynamic local target mechanism is designed to reduce the time of the navigation obstacle avoidance process.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a mobile robot navigation obstacle avoidance method based on deep reinforcement learning comprises the following steps
1) Training a value network model by adopting a time difference method in deep learning middle and deep layer cyclic neural network and reinforcement learning so as to realize robot navigation obstacle avoidance;
2) simplifying the robot and each person into a circle, and defining the state of the robot at the time t:
S_t = [p_x, p_y, v, θ, g_x, g_y, r, v_pref]   (1)
wherein p_x, p_y denote the current position of the robot, v the current velocity of the robot, θ the azimuth angle of the robot, g_x, g_y the target position of the robot, r the radius of the robot, and v_pref the preferred speed of the robot;
defining each person's state at time t:
O_t^i = [p_x, p_y, v_x, v_y, r]   (2)
defining a reward and penalty function:
R(K_t, a_t): a piecewise reward that penalizes collision when d_min < 0, penalizes discomfort when 0 ≤ d_min < d_comf, rewards reaching the goal when d_goal = 0, and is 0 otherwise   (3)
wherein a_t = v_t denotes the action of the robot, d_min the minimum separation distance between the robot and any person during Δt, d_comf the comfortable distance a person can tolerate, and d_goal the distance from the current robot position to the target point;
3) inputting S_t and O_t into a deep recurrent neural network with initial weights; the robot imitates the navigation strategy of a human expert to obtain demonstration experience D, which is stored in an initialized experience pool E; the value network V is initialized with random weights θ, the target value network V' is initialized to the current value network V, and each episode is looped over to obtain the optimal value network V;
4) establishing a two-dimensional grid map, setting a global target point, and continuously updating the joint state of the robot and the human by using a pre-trained value network V:
K_t = [S_t, O_t]   (4)
5) then the RRT algorithm is used to plan a globally optimal path and formulate the optimal strategy:
π*(K_t) = argmax_{a_t∈A} [R(K_t, a_t) + γ^(Δt·v_pref) V*(K_{t+Δt})]   (5)
where A denotes the set of the action space, γ ∈ (0,1) the attenuation factor, Δt the time interval between two actions, v_pref the preferred speed, and V*(K_{t+Δt}) the optimal value at time t + Δt;
6) selecting the optimal action a_t, i.e. the optimal velocity v_t, through the optimal strategy, and realizing local obstacle avoidance until the robot reaches the target point.
Further, in step 1), the value network model consists of an interaction module, a pooling module and a planning module, wherein the interaction module uses a multilayer perceptron to embed the state of each person and the state of the robot into a fixed-length vector e_i:
e_i = ψ_e(S_t, O_t; W_e),  i = 1, 2, …, n   (6)
where ψ_e is a multilayer perceptron with an activation function used to model the human-robot interaction and W_e is the embedding weight;
the embedding vector e_i is then fed into a subsequent multilayer perceptron:
h_i = φ_h(e_i; W_h),  i = 1, 2, …, n   (7)
where φ_h is a fully connected layer with a nonlinear activation function that yields the interaction feature between the robot and the i-th person, and W_h is the network weight;
the pooling module first embeds the interaction into a vector eiConversion to attention score βi
Figure BDA0003084322280000032
βi=ρβ(ei,em;Wβ)(i=1,2,3,…n) (9)
Wherein e ismIs a fixed length of embedded vector, rho, obtained by pooling all the individuals on averageβIs a multi-layer perceptron with activation functions;
then will beGiving pairwise interaction vectors hiAnd corresponding attention scores betaiThe final calculated population is represented by a weighted linear combination of all pairs:
Figure BDA0003084322280000033
the planning module is used for estimating the joint state value of the robot and the crowd in the navigation process:
v = g_v(S_t, C_t; W_v)   (11)
where g_v is a multilayer perceptron with an activation function and W_v is the network weight.
Still further, in step 3), the loop over each episode is as follows:
initialize a random joint state K_t and loop over each step of each episode: with probability ε select a random action a_t; if the small-probability event does not occur, select the action with the maximum current value function using a greedy strategy:
a_t = argmax_{a_t∈A} [R(K_t, a_t) + γ^(Δt·v_pref) V(K_{t+Δt})]   (12)
continuously update the current state and reward value and store them in the experience replay pool; update the experience pool once every 3000 steps and update the current value network by gradient descent until the robot reaches the terminal state, which ends the inner loop of the episode and copies the current network to the target network; after the set number of episodes, the value network model V is obtained.
Furthermore, in step 4), a map-based velocity screening mechanism is added to form a safe action space so that the robot avoids known obstacles in the environment. At each decision step the safe action space is determined by the current robot position p_t, the two-dimensional grid map M and the initialized action space A, i.e. A_safe = (p_t, M, A); for each velocity in the action space a forward simulation is performed to check whether the robot would collide with an obstacle in the map.
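As an illustration of this forward simulation, the following sketch (an assumption, not code from the patent; the occupancy-grid convention, resolution and helper names are hypothetical) keeps only the candidate velocities whose simulated short-horizon path stays in free cells of the grid map M, assuming M is already inflated by the robot radius:

```python
import numpy as np

def safe_action_space(p_t, M, actions, dt=0.25, resolution=0.05, substeps=5):
    """Map-based velocity screening: keep velocities whose forward-simulated path stays free."""
    M = np.asarray(M)                      # occupancy grid, 0 = free cell (assumed convention)
    def is_free(x, y):
        i, j = int(round(y / resolution)), int(round(x / resolution))
        return 0 <= i < M.shape[0] and 0 <= j < M.shape[1] and M[i, j] == 0
    safe = []
    for vx, vy in actions:                 # candidate velocity (vx, vy) from the action space A
        if all(is_free(p_t[0] + vx * dt * k / substeps,
                       p_t[1] + vy * dt * k / substeps)
               for k in range(1, substeps + 1)):
            safe.append((vx, vy))
    return safe                            # A_safe = (p_t, M, A)
```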
In step 5), an RRT algorithm generates a minimum-cost global path on the two-dimensional grid map between the current robot position and the global target; all waypoints on the global path are then traversed, the nearest point within a circle centered on the current robot position with radius r is found, and this point is set as the dynamic local target.
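A possible realization of the dynamic local target selection is sketched below. The text only states that "the nearest point" within the circle of radius r is chosen; this sketch interprets it as the waypoint inside the circle that lies furthest along the global path, a common local-goal choice, and the helper names are assumptions:

```python
import math

def dynamic_local_target(path, p_t, r=4.0):
    """Pick a local goal on the global RRT path: the in-circle waypoint furthest along the path."""
    inside = [(i, wp) for i, wp in enumerate(path)
              if math.hypot(wp[0] - p_t[0], wp[1] - p_t[1]) <= r]
    if not inside:
        # no waypoint within radius r: fall back to the waypoint closest to the robot
        return min(path, key=lambda wp: math.hypot(wp[0] - p_t[0], wp[1] - p_t[1]))
    return max(inside, key=lambda item: item[0])[1]
```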
The invention has the following beneficial effects: (1) introducing the multilayer perceptron accelerates the training time and convergence speed of reinforcement learning; (2) the trained model can predict human dynamics and solves problems such as short-sightedness and slow reaction of the robot during navigation obstacle avoidance; (3) a dynamic local target mechanism is designed so that the robot searches for the optimal path in local planning, reducing the time of the navigation obstacle avoidance process; (4) a map-based action screening mechanism is introduced as a dynamic safe action space.
drawings
FIG. 1 is a flow chart of a robot navigation obstacle avoidance method;
FIG. 2 is a value network training flow diagram;
FIG. 3 is a graph of training simulation results.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to Fig. 1 to Fig. 3, a mobile robot navigation obstacle avoidance method based on deep reinforcement learning includes the following steps:
Step 1): In a two-dimensional space, the robot and each person are regarded as circles, and the robot moves to a target point among n persons. For each agent (robot or human), the position p = [p_x, p_y], the velocity v = [v_x, v_y] and the radius r can be observed by the other agents. The target position g = [g_x, g_y], the azimuth angle θ and the preferred speed v_pref cannot be observed by the other agents. S_t denotes the state of the robot and O_t^i denotes the state of the i-th person at time t. By concatenating the robot state with the observable states of the humans, the joint state of all n + 1 agents at time t is obtained as K_t = [S_t, O_t]. According to the navigation strategy, the robot outputs a motion command that immediately changes its velocity at time t: v_t = a_t = π(K_t).
Navigating with the optimal strategy π*(K_t), the optimal value of the joint state K_t at time t is:
V*(K_t) = Σ_{t'=t}^{T} γ^(t'·Δt·v_pref) R(K_{t'}, π*(K_{t'}))   (13)
In formula (13), T denotes the number of steps from the state at time t to the terminal state, Δt the time interval between two actions, γ ∈ (0,1) the attenuation factor, v_pref the preferred speed, and R(K_t, a_t) the corresponding reward during time t.
The optimal strategy is determined by the maximum accumulated return and is defined as follows:
π*(K_t) = argmax_{a_t∈A} [R(K_t, a_t) + γ^(Δt·v_pref) V*(K_{t+Δt})]   (14)
In formula (14), A denotes the set of the action space and V*(K_{t+Δt}) denotes the optimal value at time t + Δt.
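The one-step lookahead in formulas (5) and (14) can be sketched as follows; this is an illustrative sketch rather than the patent's code, and `propagate`, `reward_fn` and `value_net` are assumed callables that simulate the next joint state, evaluate R(K_t, a_t) and evaluate the value function, respectively:

```python
def select_action(actions, K_t, reward_fn, propagate, value_net, gamma=0.9, dt=0.25, v_pref=0.25):
    """Return argmax_a [ R(K_t, a) + gamma^(dt * v_pref) * V(K_{t+dt}) ] over the action space."""
    best_a, best_q = None, float("-inf")
    for a in actions:
        K_next = propagate(K_t, a, dt)                       # simulate the joint state after dt
        q = reward_fn(K_t, a) + (gamma ** (dt * v_pref)) * value_net(K_next)
        if q > best_q:
            best_a, best_q = a, q
    return best_a
```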
The reward and penalty function is defined as in formula (3), where d_min denotes the minimum separation distance between the robot and any person within Δt, d_comf the comfortable distance a person can tolerate, and d_goal the distance from the current robot position to the target point.
Step 2: The value network model comprises three parts: an interaction module, a pooling module and a planning module. The interaction module models human-robot interactions and encodes human-human interactions through a coarse-grained local map; the pooling module aggregates the interactions into a fixed-length embedding vector through a self-attention mechanism and learns, in a data-driven manner, the relative importance of each person and the collective influence of the crowd; the planning module estimates the value of the joint state of the robot and the crowd during navigation. The specific steps are as follows:
step 2.1: embedding the state of the ith person and the state of the robot into a vector e with a fixed length by using a multilayer perceptroniThe method comprises the following steps:
ei=ψe(St,Ot;We)(i=1,2,3,…n) (6)
in formula (16), phie(. is an inline function, WeAre the embedding weights.
Step 2.2: Feed the embedding vector e_i into a subsequent multilayer perceptron to obtain the pairwise interaction feature between the robot and the i-th person:
h_i = φ_h(e_i; W_h),  i = 1, 2, …, n   (17)
In formula (17), φ_h(·) is a fully connected layer and W_h is the network weight.
Step 2.3: Convert the interaction embeddings e_i into attention scores β_i:
e_m = (1/n) Σ_{k=1}^{n} e_k   (18)
β_i = ρ_β(e_i, e_m; W_β),  i = 1, 2, …, n   (19)
In formulas (18) and (19), e_m is the fixed-length embedding vector obtained by mean-pooling over all individuals and ρ_β(·) is a multilayer perceptron with an activation function.
Step 2.4: For each person, given the pairwise interaction vector h_i and the corresponding attention score β_i, the final crowd representation is a weighted linear combination of all pairs:
c = Σ_{i=1}^{n} softmax(β_i) · h_i   (20)
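A compact PyTorch sketch of the three modules of Steps 2.1-2.4 is given below; the layer sizes, the 13-dimensional pairwise input (8 robot dimensions plus 5 observable human dimensions) and the single hidden layers are assumptions made only to keep the example short:

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Interaction (psi_e, phi_h), attention pooling (rho_beta) and planning (g_v) modules."""
    def __init__(self, robot_dim=8, pair_dim=13, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.psi_e = nn.Sequential(nn.Linear(pair_dim, embed_dim), nn.ReLU())    # Eq. (16)
        self.phi_h = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU())  # Eq. (17)
        self.rho_beta = nn.Linear(2 * embed_dim, 1)                              # Eq. (19)
        self.g_v = nn.Sequential(nn.Linear(robot_dim + hidden_dim, hidden_dim),
                                 nn.ReLU(), nn.Linear(hidden_dim, 1))            # planning module

    def forward(self, robot_state, pair_states):
        # pair_states: (n, pair_dim), one row per person, concatenating S_t with O_t of person i
        e = self.psi_e(pair_states)                           # embeddings e_i
        h = self.phi_h(e)                                     # pairwise interaction features h_i
        e_m = e.mean(dim=0, keepdim=True).expand_as(e)        # mean pooling, Eq. (18)
        beta = self.rho_beta(torch.cat([e, e_m], dim=1))      # attention scores beta_i
        c = (torch.softmax(beta, dim=0) * h).sum(dim=0)       # crowd representation, Eq. (20)
        return self.g_v(torch.cat([robot_state, c], dim=0))   # joint state value v
```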
and step 3: as shown in fig. 2, the value network is trained using the time difference method in reinforcement learning. Recording the value network V as the current value network, setting the initial training frequency to be 0, setting the capacity of an experience playback pool to be 50000, setting the sampling number to be 100, setting a target network V', initializing the random joint state, setting the training frequency to be 10000, and setting the state to be s according to an epsilon greedy strategytSelecting an action:
Figure BDA0003084322280000063
get a return rtAnd the next state st', at state st' obtaining a according to the greedy strategy of epsilontStoring the updated return value and state into an experience pool, updating the experience pool once every 3000 steps, and updating the current value network by a gradient descent method until the robot reaches the final state or exceeds the set maximum time tmaxAnd the time is 25s, otherwise, the current network is updated to the target network, and when the training times are reached, the value network V is obtained.
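A schematic version of this temporal-difference training loop is sketched below. The environment interface (`env.reset`, `env.step`, `env.sample_action`, `env.best_action`, `env.encode`) and the assumption that `value_net` maps a batch of flattened joint-state tensors to scalar values are hypothetical; the replay capacity, batch size, update period and episode count mirror the figures stated above, while γ, ε and the learning rate are placeholders:

```python
import copy
import random
import torch

def train_value_network(value_net, env, episodes=10000, capacity=50000, batch_size=100,
                        epsilon=0.1, gamma=0.9, dt=0.25, v_pref=0.25, lr=1e-3, update_every=3000):
    target_net = copy.deepcopy(value_net)                        # target network V'
    optimizer = torch.optim.SGD(value_net.parameters(), lr=lr)   # gradient-descent update
    replay, step = [], 0                                         # experience replay pool E
    discount = gamma ** (dt * v_pref)
    for _ in range(episodes):
        K, done = env.reset(), False
        while not done:
            # epsilon-greedy: random action with probability epsilon, otherwise the greedy action
            a = env.sample_action() if random.random() < epsilon else env.best_action(value_net, K)
            K_next, r, done = env.step(a)
            replay.append((env.encode(K), r, env.encode(K_next), float(done)))
            replay = replay[-capacity:]
            K, step = K_next, step + 1
            if step % update_every == 0 and len(replay) >= batch_size:
                s, r_b, s_next, d = zip(*random.sample(replay, batch_size))
                s, s_next = torch.stack(s), torch.stack(s_next)
                r_b, d = torch.tensor(r_b), torch.tensor(d)
                with torch.no_grad():                            # TD target from the target network
                    y = r_b + discount * (1.0 - d) * target_net(s_next).squeeze(-1)
                loss = torch.nn.functional.mse_loss(value_net(s).squeeze(-1), y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        target_net.load_state_dict(value_net.state_dict())       # refresh the target network
    return value_net
```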
Step 4: A two-dimensional grid map is built with a laser radar. During navigation a global path is planned with the RRT algorithm, and local dynamic obstacle avoidance is then realized with the deep reinforcement learning method: with the trained value network V, the robot selects actions according to the optimal strategy. The radii of the robot and the humans are set to 0.3 m, the preferred speed of the robot to 0.25 m/s, the minimum comfortable distance to 0.5 m and the dynamic local target radius to 4 m. When there is no dynamic obstacle in the environment space, the robot moves directly toward the target point; when a dynamic obstacle appears, the robot avoids it quickly and safely. Introducing the dynamic local target and the safety mechanism effectively improves the navigation time and efficiency of the mobile robot.
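How these pieces fit together at run time can be sketched as follows, reusing the helper sketches above; `rrt_plan`, `candidate_velocities`, `propagate` and `execute` are assumed callables (global planner, action-space generator, one-step simulator and velocity executor), so this is an outline of the loop rather than the patent's implementation:

```python
import math

def navigate(robot, humans, grid_map, global_goal, value_net, rrt_plan,
             candidate_velocities, propagate, execute, reward_fn,
             gamma=0.9, dt=0.25, r_local=4.0):
    path = rrt_plan(grid_map, (robot.px, robot.py), global_goal)        # global path, planned once
    while math.hypot(robot.px - global_goal[0], robot.py - global_goal[1]) > robot.r:
        # steer the value network toward a dynamic local target on the global path
        robot.gx, robot.gy = dynamic_local_target(path, (robot.px, robot.py), r_local)
        K = joint_state(robot, humans)
        actions = safe_action_space((robot.px, robot.py), grid_map, candidate_velocities(robot))
        a = select_action(actions, K, reward_fn, propagate, value_net,
                          gamma=gamma, dt=dt, v_pref=robot.v_pref)
        robot = execute(robot, a, dt)                                    # apply the chosen velocity
```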
The embodiments described in this specification are merely illustrative of implementations of the inventive concepts, which are intended for purposes of illustration only. The scope of the present invention should not be construed as being limited to the particular forms set forth in the examples, but rather as being defined by the claims and the equivalents thereof which can occur to those skilled in the art upon consideration of the present inventive concept.

Claims (5)

1. A mobile robot navigation obstacle avoidance method based on deep reinforcement learning is characterized by comprising the following steps:
1) training a value network model with a deep recurrent neural network from deep learning and the temporal-difference method from reinforcement learning, so as to realize robot navigation obstacle avoidance;
2) simplifying the robot and each person into a circle, and defining the state of the robot at the time t:
S_t = [p_x, p_y, v, θ, g_x, g_y, r, v_pref]   (1)
wherein p_x, p_y denote the current position of the robot, v the current velocity of the robot, θ the azimuth angle of the robot, g_x, g_y the target position of the robot, r the radius of the robot, and v_pref the preferred speed of the robot;
defining each person's state at time t:
O_t^i = [p_x, p_y, v_x, v_y, r]   (2)
defining a reward and penalty function:
R(K_t, a_t): a piecewise reward that penalizes collision when d_min < 0, penalizes discomfort when 0 ≤ d_min < d_comf, rewards reaching the goal when d_goal = 0, and is 0 otherwise   (3)
wherein a_t = v_t denotes the action of the robot, d_min the minimum separation distance between the robot and any person during Δt, d_comf the comfortable distance a person can tolerate, and d_goal the distance from the current robot position to the target point;
3) inputting S_t and O_t into a deep recurrent neural network with initial weights; the robot imitates the navigation strategy of a human expert to obtain demonstration experience D, which is stored in an initialized experience pool E; the value network V is initialized with random weights θ, the target value network V' is initialized to the current value network V, and each episode is looped over to obtain the optimal value network V;
4) establishing a two-dimensional grid map, setting a global target point, and continuously updating the joint state of the robot and the human by using a pre-trained value network V:
K_t = [S_t, O_t]   (4)
5) then the RRT algorithm is used to plan a globally optimal path and formulate the optimal strategy:
π*(K_t) = argmax_{a_t∈A} [R(K_t, a_t) + γ^(Δt·v_pref) V*(K_{t+Δt})]   (5)
where A denotes the set of the action space, γ ∈ (0,1) the attenuation (decay) factor, Δt the time interval between two actions, v_pref the preferred speed, and V*(K_{t+Δt}) the optimal value at time t + Δt;
6) selecting the optimal action a_t, i.e. the optimal velocity v_t, through the optimal strategy, and realizing local obstacle avoidance until the robot reaches the target point.
2. The robot navigation obstacle avoidance method based on deep reinforcement learning as claimed in claim 1, characterized in that: in step 1), the value network model consists of an interaction module, a pooling module and a planning module, wherein the interaction module uses a multilayer perceptron to embed the state of the i-th person and the state of the robot into a fixed-length vector e_i:
e_i = ψ_e(S_t, O_t; W_e),  i = 1, 2, …, n   (6)
where ψ_e is a multilayer perceptron with an activation function used to model the human-robot interaction and W_e is the embedding weight;
the embedding vector e_i is then fed into a subsequent multilayer perceptron:
h_i = φ_h(e_i; W_h),  i = 1, 2, …, n   (7)
where φ_h is a fully connected layer with a nonlinear activation function that yields the interaction feature between the robot and the i-th person, and W_h is the network weight;
the pooling module first embeds the interaction into a vector eiConversion to attention score βi
Figure FDA0003084322270000021
βi=ρβ(ei,em;Wβ)(i=1,2,3,…n) (9)
Wherein e ismIs a fixed length of embedded vector, rho, obtained by pooling all the individuals on averageβIs a multi-layer perceptron with activation functions;
then will give two-by-two interaction vector hiAnd corresponding attention scores betaiThe final calculated population is represented by a weighted linear combination of all pairs:
Figure FDA0003084322270000022
the planning module is used for estimating the joint state value of the robot and the crowd in the navigation process:
v = g_v(S_t, C_t; W_v)   (11)
where g_v is a multilayer perceptron with an activation function and W_v is the network weight.
3. The robot navigation obstacle avoidance method based on deep reinforcement learning as claimed in claim 1 or 2, characterized in that: in step 3), the loop over each episode is as follows:
initialize a random joint state K_t and loop over each step of each episode: with probability ε select a random action a_t; if the small-probability event does not occur, select the action with the maximum current value function using a greedy strategy:
a_t = argmax_{a_t∈A} [R(K_t, a_t) + γ^(Δt·v_pref) V(K_{t+Δt})]   (12)
continuously update the current state and reward value and store them in the experience replay pool; update the experience pool once every 3000 steps and update the current value network by gradient descent until the robot reaches the terminal state, which ends the inner loop of the episode and copies the current network to the target network; after the set number of episodes, the value network model V is obtained.
4. The robot navigation obstacle avoidance method based on deep reinforcement learning as claimed in claim 1 or 2, characterized in that: in step 4), a map-based velocity screening mechanism is added to form a safe action space so that the robot avoids known obstacles in the environment; at each decision step the safe action space is determined by the current robot position p_t, the two-dimensional grid map M and the initialized action space A, i.e. A_safe = (p_t, M, A), and for each velocity in the action space a forward simulation is performed to check whether the robot would collide with an obstacle in the map.
5. The robot navigation obstacle avoidance method based on deep reinforcement learning as claimed in claim 1 or 2, characterized in that: in step 5), an RRT algorithm generates a minimum-cost global path on the two-dimensional grid map between the current robot position and the global target; all waypoints on the global path are then traversed, the nearest point within a circle centered on the current robot position with radius r is found, and this point is set as the dynamic local target.
CN202110575846.8A 2021-05-26 2021-05-26 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning Active CN113359717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110575846.8A CN113359717B (en) 2021-05-26 2021-05-26 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110575846.8A CN113359717B (en) 2021-05-26 2021-05-26 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN113359717A true CN113359717A (en) 2021-09-07
CN113359717B CN113359717B (en) 2022-07-26

Family

ID=77527872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110575846.8A Active CN113359717B (en) 2021-05-26 2021-05-26 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113359717B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109822579A (en) * 2019-04-10 2019-05-31 江苏艾萨克机器人股份有限公司 Cooperation robot security's control method of view-based access control model
CN110244734A (en) * 2019-06-20 2019-09-17 中山大学 A kind of automatic driving vehicle paths planning method based on depth convolutional neural networks
CN110125943A (en) * 2019-06-27 2019-08-16 易思维(杭州)科技有限公司 Multi-degree-of-freemechanical mechanical arm obstacle-avoiding route planning method
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110703768A (en) * 2019-11-08 2020-01-17 福州大学 Improved dynamic RRT mobile robot motion planning method
CN111844007A (en) * 2020-06-02 2020-10-30 江苏理工学院 Pollination robot mechanical arm obstacle avoidance path planning method and device
CN111596668A (en) * 2020-06-17 2020-08-28 苏州大学 Mobile robot anthropomorphic path planning method based on reverse reinforcement learning
CN112179367A (en) * 2020-09-25 2021-01-05 广东海洋大学 Intelligent autonomous navigation method based on deep reinforcement learning
CN112631173A (en) * 2020-12-11 2021-04-09 中国人民解放军国防科技大学 Brain-controlled unmanned platform cooperative control system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114237235A (en) * 2021-12-02 2022-03-25 之江实验室 Mobile robot obstacle avoidance method based on deep reinforcement learning
CN114237235B (en) * 2021-12-02 2024-01-19 之江实验室 Mobile robot obstacle avoidance method based on deep reinforcement learning
CN114485673A (en) * 2022-02-09 2022-05-13 山东大学 Service robot crowd perception navigation method and system based on deep reinforcement learning
CN114485673B (en) * 2022-02-09 2023-11-03 山东大学 Service robot crowd sensing navigation method and system based on deep reinforcement learning
CN114384920A (en) * 2022-03-23 2022-04-22 安徽大学 Dynamic obstacle avoidance method based on real-time construction of local grid map
US11720110B2 (en) 2022-03-23 2023-08-08 Anhui University Dynamic obstacle avoidance method based on real-time local grid map construction

Also Published As

Publication number Publication date
CN113359717B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
CN113359717B (en) Mobile robot navigation obstacle avoidance method based on deep reinforcement learning
Cai ROBOTICS: From Manipulator to Mobilebot
JP6854549B2 (en) AUV action planning and motion control methods based on reinforcement learning
Cao et al. Multi-AUV target search based on bioinspired neurodynamics model in 3-D underwater environments
Parker Cooperative robotics for multi-target observation
Velagic et al. A 3-level autonomous mobile robot navigation system designed by using reasoning/search approaches
Zalama et al. Adaptive behavior navigation of a mobile robot
Low et al. A hybrid mobile robot architecture with integrated planning and control
Botteghi et al. On reward shaping for mobile robot navigation: A reinforcement learning and SLAM based approach
Modayil et al. Autonomous development of a grounded object ontology by a learning robot
Kazem et al. Modified vector field histogram with a neural network learning model for mobile robot path planning and obstacle avoidance.
Liu et al. Episodic memory-based robotic planning under uncertainty
Tung et al. Socially aware robot navigation using deep reinforcement learning
Malviya et al. Autonomous social robot navigation using a behavioral finite state social machine
Sebastian et al. Neural network based heterogeneous sensor fusion for robot motion planning
Azouaoui et al. Soft‐computing based navigation approach for a bi‐steerable mobile robot
Gavrilov et al. Mobile robot navigation using reinforcement learning based on neural network with short term memory
Pandey et al. Trajectory Planning and Collision Control of a Mobile Robot: A Penalty-Based PSO Approach
Kondratenko et al. Safe Navigation of an Autonomous Robot in Dynamic and Unknown Environments
De Villiers et al. Learning fine-grained control for mapless navigation
de Almeida Afonso et al. Autonomous robot navigation in crowd
Springer et al. Simple strategies for collision-avoidance in robot soccer
Al Arafat et al. Neural network-based obstacle and pothole avoiding robot
Beom et al. Behavioral control in mobile robot navigation using fuzzy decision making approach
Song et al. Robot Navigation in Crowd via Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant