CN114905505B - Navigation control method, system and storage medium of mobile robot


Info

Publication number
CN114905505B
CN114905505B (application CN202210383369.XA)
Authority
CN
China
Prior art keywords
navigation control
control model
training
mobile robot
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210383369.XA
Other languages
Chinese (zh)
Other versions
CN114905505A (en)
Inventor
余淼盈
杨尚东
陈蕾
王昱川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202210383369.XA
Publication of CN114905505A
Application granted
Publication of CN114905505B
Legal status: Active

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1656 Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a navigation control method, system and storage medium for a mobile robot in the field of robot navigation. The method comprises: adjusting the use sequence of the sub-strategies in a navigation control model according to target task data in the real environment, and navigating the mobile robot by using the navigation control model. The training process of the navigation control model comprises: constructing the navigation control model by using a hierarchical reinforcement learning algorithm, and introducing an LSTM network into the model as a track coding network; training the navigation control model through a training data set, and performing meta-learning training on the LSTM track coding network through a meta-training data set; and iterating the updates repeatedly until the loss function converges to obtain the final navigation control model. Because the sub-strategies are used in a task-specific order, the migration of the learned navigation control model to the actual environment is simplified and the real-time performance of the navigation control model is improved.

Description

Navigation control method, system and storage medium of mobile robot
Technical Field
The invention belongs to the field of robot navigation, and particularly relates to a navigation control method, a navigation control system and a storage medium of a mobile robot.
Background
In recent years, reinforcement learning has attracted wide attention with the rise of artificial intelligence. In particular, deep reinforcement learning algorithms, which combine reinforcement learning with deep learning, have achieved great breakthroughs in many fields. Reinforcement learning aims to let an agent sample in its environment and learn autonomously to make correct behavior decisions, and the learned policies can be migrated to reality to help people solve practical problems.
Navigation control is a technology that guides a mobile robot to a target position without collision by planning its movement direction and displacement. It is one of the basic functions of a mobile robot and one of the core research topics in the field of robot control. Traditional path planning algorithms depend on a global high-precision map and require complex modeling and accurate positioning; their computational efficiency decreases as the complexity of the environment increases, so their real-time performance is weak. Meanwhile, a reinforcement learning agent requires large-scale exploration and sampling in the training environment, and when the learned model is applied to an actual environment, all parameters of the model must be readjusted on data from that environment, so the learning efficiency is low.
Disclosure of Invention
The invention aims to provide a navigation control method, system and storage medium for a mobile robot, in which a hierarchical meta-reinforcement learning algorithm is used to construct a navigation control model and the sub-strategies are used in a task-specific order, so that the migration of the learned navigation control model to the actual environment is simplified and the real-time performance of the navigation control model is improved.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
The first aspect of the present invention provides a navigation control method for a mobile robot, comprising:
controlling the mobile robot to acquire target task data in a real environment by using the trained navigation control model;
Adjusting the use sequence of sub-strategies in the navigation control model according to the target task data, and navigating the mobile robot by using the navigation control model;
the training process of the navigation control model comprises the following steps:
constructing a navigation control model by using a hierarchical reinforcement learning algorithm, and introducing an LSTM network into the navigation control model to serve as a track coding network;
Building a training environment of a navigation control model and a mobile robot model, and controlling the mobile robot to interact with the training environment through the navigation control model to obtain a plurality of groups of training data sets;
Training the navigation control model through the training data set to obtain an updated navigation control model, controlling the mobile robot model to interact with the training environment again by using the updated navigation control model to obtain multiple groups of meta-training data, and performing meta-learning training on the LSTM track coding network of the navigation control model through the meta-training data set; the updates are iterated repeatedly to obtain a final navigation control model whose loss function converges.
Preferably, the method for training the navigation control model through the training data set to obtain the updated navigation control model comprises the following steps:
Training the navigation control model through a training data set containing multiple tasks, constructing the loss function of the navigation control model, calculating the training loss value of the navigation control model according to the loss function, iteratively updating the navigation control model parameters by gradient descent on the training loss value, and saving the navigation control model parameters for each task.
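As an illustration only, this per-task training loop can be sketched in Python (PyTorch) as follows. This is a minimal sketch, not the patent's implementation: the model object, its compute_loss method (assumed to return the total training loss for a batch of interaction data) and task.sample_trajectories are hypothetical interfaces introduced for readability.

    import copy
    import torch

    def train_per_task(model, tasks, lr=1e-3, iters=100):
        """Train the navigation control model on each task and save per-task parameters."""
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        per_task_params = []
        for task in tasks:
            for _ in range(iters):
                batch = task.sample_trajectories(model)  # interact with the training environment
                loss = model.compute_loss(batch)         # training loss value
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()                         # gradient-descent update of the parameters
            per_task_params.append(copy.deepcopy(model.state_dict()))  # save parameters per task
        return per_task_params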
Preferably, the method for performing meta-learning training on the LSTM track coding network of the navigation control model through the meta-training data set comprises the following steps:
Constructing a meta-training loss function according to the loss function of the navigation control model, calculating the meta-training loss value of the navigation control model, iteratively updating the parameters of the LSTM track coding network in the navigation control model by gradient descent on the meta-training loss value to obtain a final navigation control model whose loss function converges, and saving the parameters of the final navigation control model.
Preferably, the method for constructing a training environment of a navigation control model and a mobile robot model and controlling the mobile robot to interact with the training environment through the navigation control model to obtain multiple sets of training data sets comprises the following steps:
Constructing a mobile robot model by adopting a robot physical simulation engine MuJoCo platform, and initializing and setting sensor parameters of the mobile robot model;
Designing a training environment comprising a plurality of obstacle areas and a plurality of target point areas, randomly generating obstacles and target points in the obstacle areas and the target point areas respectively to obtain training tasks, resetting positions of the obstacles and the target points to collect a plurality of groups of training tasks, and controlling the mobile robot to interact with each group of training tasks in the training environment by using a navigation control model to obtain a training data set.
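A minimal Python sketch of the random task generation described above, assuming axis-aligned rectangular regions given as ((x_min, y_min), (x_max, y_max)) pairs; sample_training_task and the obstacle and goal counts are illustrative assumptions, and the MuJoCo scene construction itself is omitted.

    import numpy as np

    def sample_training_task(obstacle_regions, goal_regions, n_obstacles=4, n_goals=2, rng=None):
        """Randomly place obstacles and target points inside their designated regions."""
        rng = rng or np.random.default_rng()

        def sample_in(regions, n):
            pts = []
            for _ in range(n):
                lo, hi = regions[rng.integers(len(regions))]  # pick a region, then a point in it
                pts.append(rng.uniform(lo, hi))
            return np.array(pts)

        return {"obstacles": sample_in(obstacle_regions, n_obstacles),
                "goals": sample_in(goal_regions, n_goals)}

    # A group of training tasks is collected by repeatedly resetting the positions:
    tasks = [sample_training_task([((0.0, 0.0), (2.0, 2.0))], [((4.0, 4.0), (6.0, 6.0))])
             for _ in range(16)]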
Preferably, controlling the mobile robot to interact with each set of training tasks in the training environment through the navigation control model to obtain the training data set includes:
Generating a group of training tasks in the training environment, placing the mobile robot model into the training environment, and acquiring sensor information through the sensors of the mobile robot model; encoding the track information according to the sensor information, and outputting the track state z_t and the memory hidden variable (h_t, c_t);
The top-level strategy network π_Ω in the navigation control model selects a strategy sequence number ω_t according to the obtained track state z_t, and the sub-strategy network π_θ^{ω_t} corresponding to the strategy sequence number ω_t in the navigation control model is started; the sub-strategy network π_θ^{ω_t} outputs an action a_t according to the track state z_t;
After the mobile robot model executes the action a_t, it interacts with the training environment to obtain the reward r_t: r_t = -1 if an obstacle is encountered, r_t = 1 if a target point is reached, and r_t = 0 otherwise;
The mobile robot model acquires the next group of sensor information through its sensors and encodes the track information to obtain a new track state z_{t+1};
The termination network β_ϑ^{ω_t} of the navigation control model decides, according to the track state z_{t+1}, whether to terminate execution of the sub-strategy network π_θ^{ω_t}; if the sub-strategy network π_θ^{ω_t} is terminated, a new sub-strategy network is selected and started through the value function network Q_U of the navigation control model;
Controlling the mobile robot model to interact with the set of training tasks by using the navigation control model obtains one set of training data D_i = {(z_t, ω_t, a_t, r_t, z_{t+1})}; the training task in the training environment is reset and the iterative process is repeated to obtain multiple sets of training data {D_1, ..., D_n}.
Preferably, the method for encoding the track information according to the sensor information and outputting the track state z_t and the memory hidden variable (h_t, c_t) comprises the following steps:
Acquiring the state s_t of the mobile robot model at the current moment from the sensor information, and reading the memory hidden variable (h_{t-1}, c_{t-1}) stored at the previous moment, wherein the memory hidden variable (h_{t-1}, c_{t-1}) of the initial state of the mobile robot model is a zero vector; inputting the memory hidden variable (h_{t-1}, c_{t-1}) and the state s_t into the long short-term memory network f_λ of the navigation control model to encode the track information, and outputting the track state z_t and the memory hidden variable (h_t, c_t).
Preferably, the loss functions of the navigation control model include a loss function Loss_c, a loss function Loss_a and a loss function Loss_l;
The loss function Loss_c is the temporal-difference loss of the value function network Q_U, with the expression formula:
Loss_c = E[(r_i + γ·max_{ω'} Q_U(z_{i+1}, ω') - Q_U(z_i, ω_i))²]
The loss function Loss_a is the actor loss of the sub-strategy networks π_θ^ω and the termination networks β_ϑ^ω, constructed from the value function network Q_U;
The expression formula of the loss function Loss_l is:
Loss_l = Loss_a + Loss_c
In the formulas, Q_U(z_i, ω_i) denotes the expected cumulative reward of the sub-strategy ω_i selected in the track state z_i; max_{ω'} Q_U(z_{i+1}, ω') denotes the maximum expected cumulative reward obtainable by selecting a sub-strategy ω' in the track state z_{i+1}; γ denotes the discount rate, in the range [0, 1].
Preferably, the method for controlling the mobile robot model to interact with the training environment again by using the updated navigation control model to obtain multiple groups of meta-training data comprises the following steps:
Using a set of training data D_i, according to the loss functions of the navigation control model, respectively calculating the gradient ∇_U Loss_c of the loss function Loss_c with the parameter U of the value function network as independent variable, the gradient ∇_θ Loss_a of the loss function Loss_a with the parameter θ of the sub-strategy networks as independent variable, the gradient ∇_ϑ Loss_a of the loss function Loss_a with the parameter ϑ of the termination networks as independent variable, and the gradient ∇_λ Loss_l of the loss function Loss_l with the parameter λ of the long short-term memory network as independent variable; using the gradient descent method, the parameters of the value function network Q_U, the sub-strategy networks π_θ^ω, the termination networks β_ϑ^ω and the long short-term memory network f_λ are updated to U', θ', ϑ' and λ', and the training environment parameters and the navigation control model parameters are saved;
Controlling the mobile robot model to interact with the set of training tasks again by using the updated navigation control model to obtain a group of meta-training data D'_i; the iterative process is repeated to obtain T groups of meta-training data and the meta-training data set D_meta = {D'_1, ..., D'_T} is constructed.
Preferably, the expression formula of the meta-training loss function Loss_meta_l is:
Loss_meta_l = Σ_{i=1}^{T} Loss_l^{(i)}
In the formula, Loss_l^{(i)} denotes the loss value computed with the i-th group of navigation control model parameters on the i-th group of meta-training data, and T denotes the number of groups of navigation control model parameters.
A second aspect of the present invention provides a navigation control system of a mobile robot, comprising:
The target task acquisition module is used for controlling the mobile robot to acquire target task data in a real environment by using the trained navigation control model;
the strategy migration module is used for adjusting the use sequence of sub-strategies in the navigation control model according to the target task data;
The navigation module is used for controlling the mobile robot to navigate by using the navigation control model;
The navigation control model construction module is used for constructing a navigation control model by using a hierarchical reinforcement learning algorithm, and introducing the LSTM network into the navigation control model to serve as a track coding network;
The training data set acquisition module is used for building a training environment of the navigation control model and a mobile robot model, and controlling the mobile robot to interact with the training environment through the navigation control model to acquire a plurality of groups of training data sets;
The pre-training module is used for training the navigation control model through the training data set to obtain an updated navigation control model, controlling the mobile robot model to interact with the training environment again by using the updated navigation control model to obtain multiple groups of meta-training data, and performing meta-learning training on the LSTM track coding network of the navigation control model through the meta-training data set; the updates are iterated repeatedly to obtain a final navigation control model whose loss function converges.
Preferably, the mobile robot is provided with a sensor for interacting with a training environment and a real environment; the sensor information detected by the sensor comprises coordinates of a mobile robot, coordinates of a plurality of obstacles, coordinates of a plurality of target points, obstacle areas and target point areas.
A third aspect of the present invention provides a computer-readable storage medium, characterized in that a computer program is stored thereon, which program, when being executed by a processor, implements the steps of the navigation control method.
Compared with the prior art, the invention has the beneficial effects that:
(1) The method comprises the steps of controlling a mobile robot to acquire target task data in a real environment by using a trained navigation control model; adjusting the use sequence of sub-strategies in the navigation control model according to the target task data, and navigating the mobile robot by using the navigation control model; the sub-strategies are used in a specific order according to specific tasks, so that the migration process of the learned navigation control model applied to the actual environment is simplified, and the instantaneity of the navigation control model is improved.
(2) The invention utilizes the LSTM network to abstract the characteristics of the state track, so that the mobile robot can distinguish different navigation tasks, thereby better learning the sub-strategies shared by the tasks and memorizing the sub-strategy combination sequence required by different tasks.
Drawings
FIG. 1 is a block diagram of a navigation control model provided by an embodiment of the present invention;
fig. 2 is a flowchart of encoding track information according to sensor information through an LSTM network according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Example 1
As shown in fig. 1 to 2, the present embodiment provides a navigation control method of a mobile robot, including:
controlling the mobile robot to acquire target task data in a real environment by using the trained navigation control model;
Adjusting the use sequence of sub-strategies in the navigation control model according to the target task data, and navigating the mobile robot by using the navigation control model;
the training process of the navigation control model comprises the following steps:
Constructing a navigation control model by using a hierarchical reinforcement learning algorithm, and introducing an LSTM network into the navigation control model to serve as a track coding network; the navigation control model internally comprises a value function network Q_U, sub-strategy networks π_θ^ω, a top-level strategy network π_Ω and termination networks β_ϑ^ω.
The method for constructing the training environment of the navigation control model and the mobile robot model and controlling the mobile robot to interact with the training environment through the navigation control model to obtain a plurality of groups of training data sets comprises the following steps:
Constructing a mobile robot model by adopting a robot physical simulation engine MuJoCo platform, and initializing and setting sensor parameters of the mobile robot model;
Designing a training environment comprising a plurality of obstacle areas and a plurality of target point areas, randomly generating obstacles and target points in the obstacle areas and the target point areas respectively to obtain training tasks, resetting positions of the obstacles and the target points to collect a plurality of groups of training tasks, and controlling the mobile robot to interact with each group of training tasks in the training environment by using a navigation control model to obtain a training data set.
The method for controlling the mobile robot to interact with each group of training tasks in the training environment through the navigation control model to obtain the training data set comprises the following steps:
Generating a group of training tasks in the training environment, placing the mobile robot model into the training environment, and acquiring sensor information through the sensors of the mobile robot model; acquiring the state s_t of the mobile robot model at the current moment from the sensor information, and reading the memory hidden variable (h_{t-1}, c_{t-1}) stored at the previous moment, wherein the memory hidden variable of the initial state of the mobile robot model is a zero vector; inputting the memory hidden variable (h_{t-1}, c_{t-1}) and the state s_t into the long short-term memory network f_λ of the navigation control model to encode the track information, and outputting the track state z_t and the memory hidden variable (h_t, c_t). The expression formulas are:
i_t = σ(W_i s_t + U_i h_{t-1} + b_i)
f_t = σ(W_f s_t + U_f h_{t-1} + b_f)
o_t = σ(W_o s_t + U_o h_{t-1} + b_o)
c̃_t = tanh(W_c s_t + U_c h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)
z_t = h_t
Wherein s_t denotes the environment state of the mobile robot at time t, including the coordinates of the mobile robot, the coordinates of the obstacles, the coordinates of the target points, and the codes of the obstacle regions and target point regions at time t; h_{t-1} and c_{t-1} denote the memory information at the previous time t-1; W_i, U_i, b_i, W_f, U_f, b_f, W_o, U_o, b_o, W_c, U_c, b_c are network parameters; σ(·) denotes the Logistic function; ⊙ denotes the element-wise product of vectors.
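The encoding step follows directly from the equations above; the following NumPy sketch is a plain transcription, with the dictionary p holding the weight matrices and biases (a hypothetical packaging of the network parameters).

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_encode_step(s_t, h_prev, c_prev, p):
        """One step of the track-coding LSTM; p holds the W, U, b parameters above."""
        i_t = sigmoid(p["Wi"] @ s_t + p["Ui"] @ h_prev + p["bi"])     # input gate
        f_t = sigmoid(p["Wf"] @ s_t + p["Uf"] @ h_prev + p["bf"])     # forget gate
        o_t = sigmoid(p["Wo"] @ s_t + p["Uo"] @ h_prev + p["bo"])     # output gate
        c_cand = np.tanh(p["Wc"] @ s_t + p["Uc"] @ h_prev + p["bc"])  # candidate memory
        c_t = f_t * c_prev + i_t * c_cand                             # memory hidden variable c_t
        h_t = o_t * np.tanh(c_t)                                      # memory hidden variable h_t
        z_t = h_t                                                     # track state z_t
        return z_t, (h_t, c_t)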
The top-level strategy network π_Ω in the navigation control model selects a strategy sequence number ω_t according to the obtained track state z_t, and the sub-strategy network π_θ^{ω_t} corresponding to the strategy sequence number ω_t in the navigation control model is started; the sub-strategy network π_θ^{ω_t} outputs an action a_t according to the track state z_t;
The mobile robot model executes the action a_t and interacts with the training environment to obtain the reward r_t: r_t = -1 if an obstacle is encountered, r_t = 1 if a target point is reached, and r_t = 0 otherwise;
The mobile robot model acquires the next group of sensor information through its sensors and encodes the track information to obtain a new track state z_{t+1};
The termination network β_ϑ^{ω_t} of the navigation control model decides, according to the track state z_{t+1} and the reward r_t, whether to terminate execution of the sub-strategy network π_θ^{ω_t}; if the sub-strategy network π_θ^{ω_t} is terminated, a new sub-strategy network is selected and started through the value function network Q_U of the navigation control model;
Controlling the mobile robot model to interact with the set of training tasks by using the navigation control model obtains one set of training data D_i = {(z_t, ω_t, a_t, r_t, z_{t+1})}; the training task in the training environment is reset and the iterative process is repeated to obtain multiple sets of training data {D_1, ..., D_n}.
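The interaction loop can be summarized by the following Python sketch; env and model are hypothetical interfaces standing in for the training environment and for the networks f_λ, π_Ω, π_θ^ω, β_ϑ^ω and Q_U, so the sketch shows the control flow only, not the patent's code.

    def collect_training_data(env, model, max_steps=200):
        """One episode of interaction; returns (z_t, omega_t, a_t, r_t, z_next) tuples."""
        data = []
        s = env.reset()                               # first group of sensor information
        z, hc = model.encode(s, model.zero_state())   # track state z_t and hidden (h_t, c_t)
        omega = model.top_policy(z)                   # pi_Omega selects strategy number omega_t
        for _ in range(max_steps):
            a = model.sub_policy(omega, z)            # sub-strategy outputs action a_t
            s_next, r, done = env.step(a)             # reward: -1 obstacle, +1 target point, else 0
            z_next, hc = model.encode(s_next, hc)     # new track state z_{t+1}
            data.append((z, omega, a, r, z_next))
            if model.terminate(omega, z_next, r):     # termination network decides
                omega = model.best_option(z_next)     # re-select via value function network Q_U
            z = z_next
            if done:
                break
        return data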
Constructing the loss functions of the navigation control model, wherein the loss functions include a loss function Loss_c, a loss function Loss_a and a loss function Loss_l;
The loss function Loss_c is the temporal-difference loss of the value function network Q_U, with the expression formula:
Loss_c = E[(r_i + γ·max_{ω'} Q_U(z_{i+1}, ω') - Q_U(z_i, ω_i))²]
The loss function Loss_a is the actor loss of the sub-strategy networks π_θ^ω and the termination networks β_ϑ^ω, constructed from the value function network Q_U;
The expression formula of the loss function Loss_l is:
Loss_l = Loss_a + Loss_c
In the formulas, Q_U(z_i, ω_i) denotes the expected cumulative reward of the sub-strategy ω_i selected in the track state z_i; max_{ω'} Q_U(z_{i+1}, ω') denotes the maximum expected cumulative reward obtainable by selecting a sub-strategy ω' in the track state z_{i+1}; γ denotes the discount rate, in the range [0, 1].
A training loss value of the navigation control model is calculated according to the loss functions, and the parameters of the navigation control model are updated for the first time by gradient descent on the training loss value.
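For illustration, the loss computation can be sketched in PyTorch. The critic term follows the Loss_c expression above; for Loss_a the sketch substitutes a standard option-critic style advantage-weighted log-likelihood actor loss, which is an assumption made for readability, and model.Q and model.sub_policy_log_prob are hypothetical interfaces.

    import torch

    def compute_losses(model, batch, gamma=0.99):
        """Sketch of Loss_c, Loss_a and Loss_l = Loss_a + Loss_c for one batch."""
        z, omega, a, r, z_next = batch                            # tensors from a data set D_i
        q = model.Q(z).gather(1, omega.unsqueeze(1)).squeeze(1)   # Q_U(z_i, omega_i)
        with torch.no_grad():
            q_next = model.Q(z_next).max(dim=1).values            # max_w' Q_U(z_{i+1}, w')
            td_target = r + gamma * q_next
        loss_c = ((td_target - q) ** 2).mean()                    # temporal-difference loss
        # assumed option-critic actor loss: advantage-weighted log-likelihood
        log_pi = model.sub_policy_log_prob(omega, z, a)           # log pi_theta^omega(a_i | z_i)
        advantage = (td_target - q).detach()
        loss_a = -(log_pi * advantage).mean()
        loss_l = loss_a + loss_c
        return loss_c, loss_a, loss_l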
Using a set of training data D_i, according to the loss functions of the navigation control model, respectively calculating the gradient ∇_U Loss_c of the loss function Loss_c with the parameter U as independent variable, the gradient ∇_θ Loss_a of the loss function Loss_a with the parameter θ as independent variable, the gradient ∇_ϑ Loss_a of the loss function Loss_a with the parameter ϑ as independent variable, and the gradient ∇_λ Loss_l of the loss function Loss_l with the parameter λ as independent variable; the parameters of the navigation control model are updated for the second time through the training data set: using the gradient descent method, the network parameters of the value function network Q_U, the sub-strategy networks π_θ^ω, the termination networks β_ϑ^ω and the long short-term memory network f_λ are updated to U', θ', ϑ' and λ', and the navigation control model parameters are saved;
Controlling the mobile robot model to interact with the set of training tasks again by using the updated navigation control model to obtain a group of meta-training data D'_i;
Repeating the iterative process, the training of T groups of navigation control models is completed with the T groups of training data, obtaining T groups of navigation control model parameters {(U'_i, θ'_i, ϑ'_i, λ'_i)}, i = 1, ..., T; the updated T groups of navigation control models are respectively used to control the mobile robot model to interact with the corresponding T groups of training tasks to obtain the meta-training data set D_meta = {D'_1, ..., D'_T}.
Performing meta-training on the navigation control model through the meta-training data set, wherein the process comprises the following steps:
Constructing the meta-training loss function according to the loss function of the navigation control model, wherein the expression formula of the meta-training loss function Loss_meta_l is:
Loss_meta_l = Σ_{i=1}^{T} Loss_l^{(i)}
In the formula, Loss_l^{(i)} denotes the loss value on the i-th group of meta-training data, and T denotes the number of groups of meta-training data;
Calculating the gradient ∇_λ Loss_meta_l of the meta-training loss function with the parameter λ as independent variable, performing meta-training on the navigation control model through the meta-training data set, performing one meta-update of the long short-term memory network f_λ by the gradient descent method, and saving the updated parameter λ* of the long short-term memory network f_λ; the N rounds of the training process are repeated until the navigation control model converges, obtaining the final navigation control model; the final parameters of the navigation control model are saved as (U*, θ*, ϑ*, λ*).
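One meta-update of the LSTM track coding network can be sketched as follows, reusing compute_losses from the sketch above; load_task_heads (restoring the i-th group of saved parameters of the non-LSTM networks) is a hypothetical helper, and the first-order form in which only λ receives the meta-gradient is an assumption.

    import torch

    def meta_update(model, meta_dataset, saved_task_params, meta_lr=1e-3):
        """One meta-update of the LSTM parameters lambda from T groups of meta-training data."""
        meta_optimizer = torch.optim.SGD(model.lstm.parameters(), lr=meta_lr)
        meta_loss = 0.0
        for task_params, meta_data in zip(saved_task_params, meta_dataset):
            model.load_task_heads(task_params)   # i-th group of navigation control model parameters
            _, _, loss_l = compute_losses(model, meta_data)
            meta_loss = meta_loss + loss_l       # Loss_meta_l = sum_i Loss_l^(i)
        meta_optimizer.zero_grad()
        meta_loss.backward()                     # gradient with respect to lambda only
        meta_optimizer.step()
        return float(meta_loss)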
When the trained final navigation control model is used, a set of real-environment data D_real = {(z_t, ω_t, a_t, r_t, z_{t+1})} is obtained through interaction between the sensors and the real environment and sampling; the loss function Loss_l of the final navigation control model is calculated on D_real; the gradient ∇_λ Loss_l is calculated, and the long short-term memory network f_λ is updated using the gradient descent method. Sampling on the real environment is repeated several times and the update is iterated to complete the policy migration; the use sequence of the sub-strategies in the navigation control model is thereby adjusted according to the target task data to obtain the navigation control model for actual use, and the mobile robot is controlled to navigate by the actually used navigation control model.
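The policy migration can then be sketched by combining the rollout and loss sketches above and fine-tuning only the LSTM track encoder on real-environment samples; to_tensors (batching the sampled tuples into tensors) is a hypothetical helper, and the loop structure is an illustrative assumption.

    import torch

    def migrate_to_real(model, real_env, n_rounds=10, lr=1e-3):
        """Adapt only the LSTM track encoder on real-environment samples (sketch)."""
        optimizer = torch.optim.SGD(model.lstm.parameters(), lr=lr)
        for _ in range(n_rounds):
            batch = to_tensors(collect_training_data(real_env, model))  # sample via the sensors
            _, _, loss_l = compute_losses(model, batch)
            optimizer.zero_grad()
            loss_l.backward()
            optimizer.step()  # updating f_lambda re-orders the use of the shared sub-strategies
        return model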
Example two
The present embodiment provides a navigation control system for a mobile robot, where the navigation control system may be applied to the navigation control method of the first embodiment, and the navigation control system includes:
The target task acquisition module is used for controlling the mobile robot to acquire target task data in a real environment by using the trained navigation control model;
the strategy migration module is used for adjusting the use sequence of sub-strategies in the navigation control model according to the target task data;
The navigation module is used for controlling the mobile robot to navigate by using the navigation control model;
The navigation control model construction module is used for constructing a navigation control model by using a hierarchical reinforcement learning algorithm, and introducing the LSTM network into the navigation control model to serve as a track coding network;
The training data set acquisition module is used for building a training environment of the navigation control model and a mobile robot model, and controlling the mobile robot to interact with the training environment through the navigation control model to acquire a plurality of groups of training data sets;
The pre-training module is used for training the navigation control model through the training data set to obtain an updated navigation control model, controlling the mobile robot model to interact with the training environment again by using the updated navigation control model to obtain multiple groups of meta-training data, and performing meta-learning training on the LSTM track coding network of the navigation control model through the meta-training data set; the updates are iterated repeatedly to obtain a final navigation control model whose loss function converges.
The mobile robot is provided with a sensor for interacting with a training environment and a real environment; the sensor information detected by the sensor comprises coordinates of a mobile robot, coordinates of a plurality of obstacles, coordinates of a plurality of target points, obstacle areas and target point areas.
Example III
The present embodiment provides a computer-readable storage medium, characterized in that a computer program is stored thereon, which when executed by a processor, implements the steps of the navigation control method of the embodiment.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. A navigation control method of a mobile robot, comprising:
controlling the mobile robot to acquire target task data in a real environment by using the trained navigation control model;
Adjusting the use sequence of sub-strategies in the navigation control model according to the target task data, and navigating the mobile robot by using the navigation control model;
the training process of the navigation control model comprises the following steps:
constructing a navigation control model by using a hierarchical reinforcement learning algorithm, and introducing an LSTM network into the navigation control model to serve as a track coding network;
Building a training environment of a navigation control model and a mobile robot model, and controlling the mobile robot to interact with the training environment through the navigation control model to obtain a plurality of groups of training data sets;
Training the navigation control model through the training data set to obtain an updated navigation control model, controlling the mobile robot model to interact with the training environment again by using the updated navigation control model to obtain multiple groups of meta-training data, and performing meta-learning training on the LSTM track coding network of the navigation control model through the meta-training data set; the updates are iterated repeatedly to obtain a final navigation control model whose loss function converges.
2. The method for controlling navigation of a mobile robot according to claim 1, wherein the method for obtaining an updated navigation control model by training the navigation control model with the training data set comprises:
Training the navigation control model through a training data set containing multiple tasks, constructing the loss function of the navigation control model, calculating the training loss value of the navigation control model according to the loss function, iteratively updating the navigation control model parameters by gradient descent on the training loss value, and saving the navigation control model parameters for each task.
3. The method for controlling navigation of a mobile robot according to claim 2, wherein the method for performing meta-learning training on the LSTM track coding network of the navigation control model by using the meta-training data set comprises:
Constructing a meta-training loss function according to the loss function of the navigation control model, calculating the meta-training loss value of the navigation control model, iteratively updating the parameters of the LSTM track coding network in the navigation control model by gradient descent on the meta-training loss value to obtain a final navigation control model whose loss function converges, and saving the parameters of the final navigation control model.
4. A method for controlling navigation of a mobile robot according to claim 3, wherein the method for constructing a training environment of a navigation control model and a mobile robot model, and controlling the mobile robot to interact with the training environment by the navigation control model to obtain a plurality of sets of training data sets comprises:
Constructing a mobile robot model by adopting a robot physical simulation engine MuJoCo platform, and initializing and setting sensor parameters of the mobile robot model;
Designing a training environment comprising a plurality of obstacle areas and a plurality of target point areas, randomly generating obstacles and target points in the obstacle areas and the target point areas respectively to obtain training tasks, resetting positions of the obstacles and the target points to collect a plurality of groups of training tasks, and controlling the mobile robot to interact with each group of training tasks in the training environment by using a navigation control model to obtain a training data set.
5. The method of claim 4, wherein controlling the mobile robot to interact with each set of training tasks in the training environment via the navigation control model to obtain the training data set comprises:
Generating a group of training tasks in the training environment, placing the mobile robot model into the training environment, and acquiring sensor information through the sensors of the mobile robot model; encoding the track information according to the sensor information, and outputting the track state z_t and the memory hidden variable (h_t, c_t);
The top-level strategy network π_Ω in the navigation control model selects a strategy sequence number ω_t according to the obtained track state z_t, and the sub-strategy network π_θ^{ω_t} corresponding to the strategy sequence number ω_t in the navigation control model is started; the sub-strategy network π_θ^{ω_t} outputs an action a_t according to the track state z_t;
After the mobile robot model executes the action a_t, it interacts with the training environment to obtain the reward r_t: r_t = -1 if an obstacle is encountered, r_t = 1 if a target point is reached, and r_t = 0 otherwise;
The mobile robot model acquires the next group of sensor information through its sensors and encodes the track information to obtain a new track state z_{t+1};
The termination network β_ϑ^{ω_t} of the navigation control model decides, according to the track state z_{t+1}, whether to terminate execution of the sub-strategy network π_θ^{ω_t}; if the sub-strategy network π_θ^{ω_t} is terminated, a new sub-strategy network is selected and started through the value function network Q_U of the navigation control model;
Controlling the mobile robot model to interact with the set of training tasks by using the navigation control model obtains one set of training data D_i = {(z_t, ω_t, a_t, r_t, z_{t+1})}; the training task in the training environment is reset and the iterative process is repeated to obtain multiple sets of training data {D_1, ..., D_n}.
6. The method of claim 5, wherein the method of encoding the track information based on the sensor information and outputting the track state z_t and the memory hidden variable (h_t, c_t) comprises:
Acquiring the state s_t of the mobile robot model at the current moment from the sensor information, and reading the memory hidden variable (h_{t-1}, c_{t-1}) stored at the previous moment, wherein the memory hidden variable (h_{t-1}, c_{t-1}) of the initial state of the mobile robot model is a zero vector; inputting the memory hidden variable (h_{t-1}, c_{t-1}) and the state s_t into the long short-term memory network f_λ of the navigation control model to encode the track information, and outputting the track state z_t and the memory hidden variable (h_t, c_t).
7. The method according to claim 6, wherein the loss functions of the navigation control model include a loss function Loss_c, a loss function Loss_a and a loss function Loss_l;
the loss function Loss_c is the temporal-difference loss of the value function network Q_U, with the expression formula:
Loss_c = E[(r_i + γ·max_{ω'} Q_U(z_{i+1}, ω') - Q_U(z_i, ω_i))²]
the loss function Loss_a is the actor loss of the sub-strategy networks π_θ^ω and the termination networks β_ϑ^ω, constructed from the value function network Q_U;
the expression formula of the loss function Loss_l is:
Loss_l = Loss_a + Loss_c
in the formulas, Q_U(z_i, ω_i) denotes the expected cumulative reward of the sub-strategy ω_i selected in the track state z_i; max_{ω'} Q_U(z_{i+1}, ω') denotes the maximum expected cumulative reward obtainable by selecting a sub-strategy ω' in the track state z_{i+1}; γ denotes the discount rate, in the range [0, 1].
8. The method of claim 7, wherein the method of using the updated navigation control model to control the mobile robot model to interact again with the training environment to obtain multiple groups of meta-training data comprises:
Using a set of training data D_i, according to the loss functions of the navigation control model, respectively calculating the gradient ∇_U Loss_c of the loss function Loss_c with the parameter U as independent variable, the gradient ∇_θ Loss_a of the loss function Loss_a with the parameter θ as independent variable, the gradient ∇_ϑ Loss_a of the loss function Loss_a with the parameter ϑ as independent variable, and the gradient ∇_λ Loss_l of the loss function Loss_l with the parameter λ as independent variable; using the gradient descent method, the parameters of the value function network Q_U, the sub-strategy networks π_θ^ω, the termination networks β_ϑ^ω and the long short-term memory network f_λ are updated to U', θ', ϑ' and λ', and the navigation control model parameters are saved;
Controlling the mobile robot model to interact with the set of training tasks again by using the updated navigation control model to obtain a group of meta-training data D'_i; the iterative process is repeated to obtain T groups of meta-training data and the meta-training data set D_meta = {D'_1, ..., D'_T} is constructed.
9. A navigation control system for a mobile robot, comprising:
The target task acquisition module is used for controlling the mobile robot to acquire target task data in a real environment by using the trained navigation control model;
the strategy migration module is used for adjusting the use sequence of sub-strategies in the navigation control model according to the target task data;
The navigation module is used for controlling the mobile robot to navigate by using the navigation control model;
The navigation control model construction module is used for constructing a navigation control model by using a hierarchical reinforcement learning algorithm, and introducing the LSTM network into the navigation control model to serve as a track coding network;
The training data set acquisition module is used for building a training environment of the navigation control model and a mobile robot model, and controlling the mobile robot to interact with the training environment through the navigation control model to acquire a plurality of groups of training data sets;
The pre-training module is used for training the navigation control model through the training data set to obtain an updated navigation control model, controlling the mobile robot model to interact with the training environment again by using the updated navigation control model to obtain multiple groups of meta-training data, and performing meta-learning training on the LSTM track coding network of the navigation control model through the meta-training data set; the updates are iterated repeatedly to obtain a final navigation control model whose loss function converges.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which program, when being executed by a processor, realizes the steps of the navigation control method according to any one of claims 1 to 8.
CN202210383369.XA 2022-04-13 2022-04-13 Navigation control method, system and storage medium of mobile robot Active CN114905505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210383369.XA CN114905505B (en) 2022-04-13 2022-04-13 Navigation control method, system and storage medium of mobile robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210383369.XA CN114905505B (en) 2022-04-13 2022-04-13 Navigation control method, system and storage medium of mobile robot

Publications (2)

Publication Number Publication Date
CN114905505A CN114905505A (en) 2022-08-16
CN114905505B true CN114905505B (en) 2024-04-19

Family

ID=82765617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210383369.XA Active CN114905505B (en) 2022-04-13 2022-04-13 Navigation control method, system and storage medium of mobile robot

Country Status (1)

Country Link
CN (1) CN114905505B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020154542A1 (en) * 2019-01-23 2020-07-30 Google Llc Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning
CN111260026A (en) * 2020-01-10 2020-06-09 电子科技大学 Navigation migration method based on meta reinforcement learning
CN111506063A (en) * 2020-04-13 2020-08-07 中国科学技术大学 Mobile robot map-free navigation method based on layered reinforcement learning framework
CN111783994A (en) * 2020-05-29 2020-10-16 华为技术有限公司 Training method and device for reinforcement learning
WO2022012265A1 (en) * 2020-07-13 2022-01-20 Guangzhou Institute Of Advanced Technology, Chinese Academy Of Sciences Robot learning from demonstration via meta-imitation learning
CN111942621A (en) * 2020-07-17 2020-11-17 北京控制工程研究所 On-orbit autonomous filling control method and system based on multitask learning
CN112433525A (en) * 2020-11-16 2021-03-02 南京理工大学 Mobile robot navigation method based on simulation learning and deep reinforcement learning
CN112947081A (en) * 2021-02-05 2021-06-11 浙江大学 Distributed reinforcement learning social navigation method based on image hidden variable probability model
CN112809689A (en) * 2021-02-26 2021-05-18 同济大学 Language-guidance-based mechanical arm action element simulation learning method and storage medium
CN113609786A (en) * 2021-08-27 2021-11-05 中国人民解放军国防科技大学 Mobile robot navigation method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mobile robot navigation based on reinforcement learning and fuzzy logic; 卓睿 (Zhuo Rui), 陈宗海 (Chen Zonghai), 陈春林 (Chen Chunlin); Computer Simulation (计算机仿真), No. 08; full text *

Also Published As

Publication number Publication date
CN114905505A (en) 2022-08-16

Similar Documents

Publication Publication Date Title
CN109655066A (en) UAV path planning method based on the Q(λ) algorithm
CN113805572A (en) Method and device for planning movement
CN109726676B (en) Planning method for automatic driving system
CN101871782B (en) Position error forecasting method for GPS (Global Position System)/MEMS-INS (Micro-Electricomechanical Systems-Inertial Navigation System) integrated navigation system based on SET2FNN
CN112119409A (en) Neural network with relational memory
CN106338919A (en) USV (Unmanned Surface Vehicle) track tracking control method based on enhanced learning type intelligent algorithm
CN111783994A (en) Training method and device for reinforcement learning
CN109858137A (en) It is a kind of based on the complicated maneuvering-vehicle track estimation method that can learn Extended Kalman filter
CN115280322A (en) Hidden state planning actor control using learning
CN110268338A (en) It is inputted using vision and carries out Agent navigation
CN114521262A (en) Controlling an agent using a causal correct environment model
CN116848532A (en) Attention neural network with short term memory cells
CN116147627A (en) Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
CN115206099A (en) Self-adaptive path inference method for vehicle GPS track
CN114626505A (en) Mobile robot deep reinforcement learning control method
CN116300977B (en) Articulated vehicle track tracking control method and device based on reinforcement learning
CN114905505B (en) Navigation control method, system and storage medium of mobile robot
CN118043824A (en) Retrieval enhanced reinforcement learning
CN115009291A (en) Automatic driving aid decision-making method and system based on network evolution replay buffer area
CN115016499A (en) Path planning method based on SCA-QL
CN115576317A (en) Multi-preview-point path tracking control method and system based on neural network
CN114967472A (en) Unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method
Ge et al. Deep reinforcement learning navigation via decision transformer in autonomous driving
CN114721397A (en) Maze robot path planning method based on reinforcement learning and curiosity
CN116295449B (en) Method and device for indicating path of autonomous underwater vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant