CN116858241A - Application method of mobile robot in reinforcement learning of matching network - Google Patents

Application method of mobile robot in reinforcement learning of matching network

Info

Publication number
CN116858241A
Authority
CN
China
Prior art keywords
robot
matching network
samples
ppo
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310795952.6A
Other languages
Chinese (zh)
Inventor
张祺琛
倪彬
滕伟
潘志刚
彭志颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Logistics University Of Pla
Original Assignee
Air Force Logistics University Of Pla
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Logistics University Of Pla filed Critical Air Force Logistics University Of Pla
Priority to CN202310795952.6A priority Critical patent/CN116858241A/en
Publication of CN116858241A publication Critical patent/CN116858241A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34 Route searching; Route guidance
    • G01C21/3446 Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses an application method of a mobile robot in reinforcement learning of a matching network. The model mainly comprises a proximal policy optimization (PPO) algorithm and a matching network: the PPO is used to realize navigation of the robot in an unknown environment, and the matching network is used to provide a reward value for each action of the robot. Only a small number of successful samples in the current environment need to be provided before DRL training starts in order to obtain an accurate reward function; the model obtains the reward value by comparing the difference between the path samples of the current robot and the provided positive and negative samples.

Description

Application method of mobile robot in reinforcement learning of matching network
Technical Field
The invention discloses an application method of a mobile robot in reinforcement learning of a matching network, and belongs to the technical field of mobile robot navigation.
Background
A mobile robot can move to a target point through an autonomous navigation system; a traditional navigation pipeline comprises localization and mapping, path planning, and motion control. A high-accuracy map must be provided when using conventional navigation methods, so these methods may be limited when the robot is in an unknown environment.
Deep reinforcement learning (DRL) learns a mapping from states to actions through continuous interaction with the environment and is increasingly used in a variety of fields. In recent years, various DRL techniques have been applied to navigation tasks: a two-stream Q-network has been used in dynamic environments to realize navigation and obstacle avoidance and is suited to training with multiple auxiliary subtasks, and DDPG has been used in map-free environments to realize continuous action control and target-driven navigation. These works show that a robot can accomplish navigation tasks through DRL. However, because the reward functions are simply designed, the learning speed of DRL is slow; moreover, when the robot is in a complex environment, a traditional reward function makes it difficult for the robot to learn an appropriate policy.
To address the difficulty of designing a reward function, inverse reinforcement learning has been proposed. Inverse reinforcement learning uses expert demonstrations to let DRL learn policies that solve complex problems, but expert demonstrations often take a significant amount of time to acquire. Some research has been done on reducing the number of samples required for learning from demonstration, but the required sample size remains large. Moreover, relying on expert demonstrations deprives DRL of its autonomous trial-and-error process and undermines the core purpose of reinforcement learning. Another alternative to a hand-crafted reward function is to select an appropriate classifier and provide a certain amount of data to train it; the classifier outputs the success probability of the current state, which serves as the reinforcement learning reward. However, to ensure that the classifier provides an accurate success rate in any situation, a large amount of data must be collected so that it can be adequately trained. In addition, overfitting can occur when the classifier is used in a navigation task, so the reward it generates behaves much like a discrete reward and fails to accelerate reinforcement learning training.
In current research, meta-learning can learn the characteristics of a class from a small number of samples, and the class of the current sample is obtained by comparing it with a support set. Google proposed matching networks in 2016 to achieve few-shot learning. For this purpose, the matching network is applied to reward-function generation for the navigation task: before DRL training starts, it is trained on a navigation data set prepared in advance, and a small number of successfully completed samples under the current map are provided to it. The matching network judges the success rate of the robot completing the task in the current state by comparing the similarity between the current path and the successful paths.
How a mobile robot can navigate in an unknown environment has long been an urgent problem and challenge. Deep reinforcement learning enables a robot to learn navigation rules in an unknown environment. Reward-function design is an important link in reinforcement learning training, and the quality of the reward function directly influences the training speed and generalization ability of the reinforcement learning model. In this context we provide a general reward-function model that, in use, only requires a small number of positive and negative samples to produce accurate reward values; the reward-function model can still be used in a completely new map environment. We combine the reward-function model with an efficient reinforcement learning model. The model is evaluated in a simulation environment, and experimental results show that it can improve training speed by more than 50% and can realize end-to-end mobile robot navigation in complex unknown map environments.
Disclosure of Invention
The invention aims to provide an application method of a mobile robot in reinforcement learning of a matching network, so as to solve the problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solution: an application method of a mobile robot in reinforcement learning of a matching network, wherein the model mainly comprises a proximal policy optimization (PPO) algorithm and a matching network; the PPO is used to realize navigation of the robot in an unknown environment, and the matching network is used to provide a reward value for each action of the robot. The steps of the robot's autonomous navigation method are as follows:
Step one: by introducing a small number of examples during training, the matching network can quickly learn the differences among categories;
Step two: the robot obtains a state sequence (s_1, s_2, ..., s_t), where each state contains the radar observation data (L_1, L_2, ..., L_t) and the relative position coordinates (c_1, c_2, ..., c_t) between the robot and the target point;
Step three: the matching network randomly selects a group of samples from the existing database, the samples comprising equal numbers of successful and failed samples;
Step four: as the robot continuously interacts with the environment, the matching network compares the state sequence obtained by the current robot with the provided sample sequences, and the similarity between the two sequences is taken as the success rate of the current robot in completing the task;
Step five: the success rate of the current state is fed back as a reward value to the reinforcement learning proximal policy optimization algorithm;
Step six: the reward value generated by the matching network together with a common Euclidean-distance-based reward value is used as the reward obtained after the current robot performs an action, and the PPO updates its parameters according to the obtained reward and the current state of the robot.
As a preferred scheme of the invention, the PPO proposes a new objective function so that updates can be performed in small batches over multiple training steps. The parameters of the PPO are updated during training as
θ_{t+1} = θ_t + α ∇_θ L(θ_t)
where θ_t denotes the value of the PPO network parameters at the current time t and α denotes the learning rate, and L(θ) is the objective function of the PPO update, expressed as
L(θ) = E[min(r_i(θ)A_i, clip(r_i(θ), 1-ε, 1+ε)A_i)]
where r_i(θ) = π_θ(a_i|s_i) / π_{θ_old}(a_i|s_i) is the probability ratio between the policies before and after the update, π_θ(a_i|s_i) denotes the probability that policy π generates action a_i in state s_i under parameters θ, and A_i denotes the advantage function. The clip function is a truncation function that limits the value of r_i(θ) to between 1-ε and 1+ε, which avoids abrupt changes in the policy and keeps training stable.
As a preferred embodiment of the present invention, the matching network is able to learn the classification relationship among the k samples of a support set S = {(x_i, y_i)}_{i=1}^k, i.e. the mapping between samples and labels. Given a test sample x̂, the matching network defines a probability distribution over the test label given the support set, P(ŷ | x̂, S), where the mapping from S to this distribution is implemented by neural networks. The most simplified model of the matching network is
ŷ = Σ_{i=1}^{k} β(x̂, x_i) y_i
where x_i, y_i denote the samples and labels of the support set S and x̂, ŷ denote a sample and label of the test set; β denotes the attention mechanism, whose expression is
β(x̂, x_i) = exp(c(g(x̂), f(x_i))) / Σ_{j=1}^{k} exp(c(g(x̂), f(x_j)))
where c denotes the cosine distance between two vectors and f, g denote the encodings of the support set and the test set, respectively.
As a preferred embodiment of the present invention, the matching network takes a set of state sequences as input and predicts the success rate of the current state: the current sequence (s_1, s_2, ..., s_t) is encoded by the encoder g_θ into an input vector G, while the matching network takes the provided samples as positive samples, randomly extracts negative samples from the database, passes these samples through the encoder f_θ to obtain vectors F, and computes the similarity between G and F by cosine similarity.
As a preferable scheme of the invention, the radar produces a one-dimensional array of data at each step, so a one-dimensional convolution is used to compress it; the relative position between the robot and the target point is also a one-dimensional array and is likewise compressed with a one-dimensional convolution. The compressed data are concatenated and passed through a fully connected layer to obtain the encoded vector.
As a preferable scheme of the invention, the PPO takes the current state s_t of the robot as input and outputs the linear velocity and angular velocity of the robot; in the experiments the ranges of the robot's angular velocity and linear velocity are set, and the network is composed of fully connected layers.
Compared with the prior art, the invention has the following beneficial effects. The application method of the mobile robot in reinforcement learning of the matching network improves the training speed of DRL in the navigation task; only a small number of successful samples in the current environment need to be provided before DRL training starts in order to obtain an accurate reward function. The model obtains the reward value by comparing the difference between the path samples of the current robot and the provided positive and negative samples, and the resulting reward value, together with the environmental reward, guides the robot in learning navigation skills. Experiments show that the MNR+ER-based reward-function model can effectively improve the training speed of DRL and accurately learn navigation skills, and the model can provide accurate reward values to accelerate DRL training with few or no expert samples.
Drawings
FIG. 1 is a flow chart of a model of the present invention;
FIG. 2 is a matching network rewards flowchart of the invention;
FIG. 3 is a diagram of an encoder network architecture of the present invention;
FIG. 4 is a diagram of the PPO network structure of the present invention;
FIG. 5 is a graph of error and accuracy curves of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to Figs. 1-5, the invention provides an application method of a mobile robot in reinforcement learning of a matching network. The model mainly comprises a proximal policy optimization (PPO) algorithm and a matching network; the PPO is used to realize navigation of the robot in an unknown environment, and the matching network is used to provide a reward value for each action of the robot. The steps of the robot's autonomous navigation method are as follows:
Step one: by introducing a small number of examples during training, the matching network can quickly learn the differences among categories;
Step two: the robot obtains a state sequence (s_1, s_2, ..., s_t), where each state contains the radar observation data (L_1, L_2, ..., L_t) and the relative position coordinates (c_1, c_2, ..., c_t) between the robot and the target point;
Step three: the matching network randomly selects a group of samples from the existing database, the samples comprising equal numbers of successful and failed samples;
Step four: as the robot continuously interacts with the environment, the matching network compares the state sequence obtained by the current robot with the provided sample sequences, and the similarity between the two sequences is taken as the success rate of the current robot in completing the task;
Step five: the success rate of the current state is fed back as a reward value to the reinforcement learning proximal policy optimization algorithm;
Step six: the reward value generated by the matching network together with a common Euclidean-distance-based reward value is used as the reward obtained after the current robot performs an action, and the PPO updates its parameters according to the obtained reward and the current state of the robot, as sketched below.
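To make step six concrete, the following is a minimal sketch of how the matching-network reward and the Euclidean-distance reward could be combined before being fed back to PPO. The weighting coefficients, the exact distance-based shaping form, and the names matching_net, success_rate, w_mn and w_env are assumptions introduced here for illustration; they are not specified by the invention.

```python
import numpy as np

def euclidean_reward(prev_pos, pos, goal):
    """Assumed distance-based shaping term: positive when the robot moves closer to the goal."""
    return float(np.linalg.norm(prev_pos - goal) - np.linalg.norm(pos - goal))

def combined_reward(state_sequence, prev_pos, pos, goal, matching_net, w_mn=1.0, w_env=1.0):
    """Reward fed back to PPO after an action: the matching-network success rate for the
    current state sequence plus the common Euclidean-distance reward (weights are assumed)."""
    r_mn = matching_net.success_rate(state_sequence)   # assumed interface, value in [0, 1]
    r_env = euclidean_reward(np.asarray(prev_pos), np.asarray(pos), np.asarray(goal))
    return w_mn * r_mn + w_env * r_env
```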
The PPO proposes a new objective function so that updates can be performed in small batches over multiple training steps. The parameters of the PPO are updated during training as
θ_{t+1} = θ_t + α ∇_θ L(θ_t)
where θ_t denotes the value of the PPO network parameters at the current time t and α denotes the learning rate, and L(θ) is the objective function of the PPO update, expressed as
L(θ) = E[min(r_i(θ)A_i, clip(r_i(θ), 1-ε, 1+ε)A_i)]
where r_i(θ) = π_θ(a_i|s_i) / π_{θ_old}(a_i|s_i) is the probability ratio between the policies before and after the update, π_θ(a_i|s_i) denotes the probability that policy π generates action a_i in state s_i under parameters θ, and A_i denotes the advantage function. The clip function is a truncation function that limits the value of r_i(θ) to between 1-ε and 1+ε, which avoids abrupt changes in the policy and keeps training stable.
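For reference, a minimal PyTorch-style sketch of the clipped objective L(θ) is given below. It illustrates only the standard clipped surrogate; the log-probabilities and advantages are assumed to be computed elsewhere, and the default ε = 0.2 is an assumption, not a value taken from the patent.

```python
import torch

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """L(θ) = E[min(r_i(θ) A_i, clip(r_i(θ), 1-ε, 1+ε) A_i)],
    with r_i(θ) = π_θ(a_i|s_i) / π_θold(a_i|s_i)."""
    ratio = torch.exp(log_prob_new - log_prob_old)        # r_i(θ)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)    # clip(r_i(θ), 1-ε, 1+ε)
    return torch.min(ratio * advantages, clipped * advantages).mean()

# The parameters θ are then moved along the gradient of L(θ) with learning rate α,
# e.g. by minimizing -L(θ) with a standard optimizer.
```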
The matching network is able to learn the classification relationship among the k samples of a support set S = {(x_i, y_i)}_{i=1}^k, i.e. the mapping between samples and labels. Given a test sample x̂, the matching network defines a probability distribution over the test label given the support set, P(ŷ | x̂, S), where the mapping from S to this distribution is implemented by neural networks. The most simplified model of the matching network is
ŷ = Σ_{i=1}^{k} β(x̂, x_i) y_i
where x_i, y_i denote the samples and labels of the support set S and x̂, ŷ denote a sample and label of the test set; β denotes the attention mechanism, whose expression is
β(x̂, x_i) = exp(c(g(x̂), f(x_i))) / Σ_{j=1}^{k} exp(c(g(x̂), f(x_j)))
where c denotes the cosine distance between two vectors and f, g denote the encodings of the support set and the test set, respectively. The encodings are mainly realized by a convolutional neural network (CNN) and a long short-term memory (LSTM) network; in the experiments we set f = g. The training process of the matching network is described by the following pseudocode.
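The pseudocode referred to above appears as a figure in the original filing and is not reproduced here; in its place, the sketch below shows the forward computation it relies on, i.e. the attention-weighted prediction ŷ = Σ_i β(x̂, x_i) y_i with a cosine-similarity softmax, under the stated assumption f = g (a single encoder). The encoder interface and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def matching_net_predict(encoder, query_seq, support_seqs, support_labels):
    """Success probability of the current (query) state sequence given a support set of
    success/failure paths. `encoder` plays the role of both f and g (f = g)."""
    q = encoder(query_seq.unsqueeze(0))        # G: encoding of the current sequence, shape (1, d)
    s = encoder(support_seqs)                  # F: encodings of the support samples, shape (k, d)
    sims = F.cosine_similarity(q, s, dim=1)    # c(G, F_i) for each support sample
    beta = torch.softmax(sims, dim=0)          # attention weights β(x̂, x_i)
    y = F.one_hot(support_labels, num_classes=2).float()   # labels: 0 = failure, 1 = success
    probs = beta @ y                           # distribution over {failure, success}
    return probs[1]                            # success probability, used as the reward signal
```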
Further, the matching network takes a set of state sequences as input and predicts the success rate of the current state: the current sequence (s_1, s_2, ..., s_t) is encoded by the encoder g_θ into an input vector G, while the matching network takes the provided samples as positive samples, randomly extracts negative samples from the database, passes these samples through the encoder f_θ to obtain vectors F, and computes the similarity between G and F by cosine similarity.
The flow chart is shown in Fig. 2, and the structure of the encoders g_θ and f_θ is shown in Fig. 3. The radar produces a one-dimensional array of data at each step, so we use a one-dimensional convolution to compress it; the relative position between the robot and the target point is also a one-dimensional array and is likewise compressed with a one-dimensional convolution. The compressed data are concatenated and passed through a fully connected layer to obtain the encoded vector.
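A minimal sketch of an encoder with this structure is shown below. Since Fig. 3 is not reproduced, the channel counts, kernel sizes and array lengths (e.g. a 360-beam radar scan and a 2-dimensional relative position) are assumptions; only the overall layout (one-dimensional convolutions, concatenation, then a fully connected layer) follows the description.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Compresses the 1-D radar array and the 1-D relative-position array with 1-D convolutions,
    concatenates the results and maps them through a fully connected layer. Sizes are assumed."""
    def __init__(self, radar_dim=360, pos_dim=2, embed_dim=64):
        super().__init__()
        self.radar_conv = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.pos_conv = nn.Sequential(nn.Conv1d(1, 4, kernel_size=1), nn.ReLU(), nn.Flatten())
        with torch.no_grad():  # infer the flattened feature sizes for the fully connected layer
            r = self.radar_conv(torch.zeros(1, 1, radar_dim)).shape[1]
            p = self.pos_conv(torch.zeros(1, 1, pos_dim)).shape[1]
        self.fc = nn.Linear(r + p, embed_dim)

    def forward(self, radar, rel_pos):
        # radar: (batch, radar_dim), rel_pos: (batch, pos_dim)
        r = self.radar_conv(radar.unsqueeze(1))
        p = self.pos_conv(rel_pos.unsqueeze(1))
        return self.fc(torch.cat([r, p], dim=1))   # encoded state vector
```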
As shown in Fig. 4, the PPO takes the current state s_t of the robot as input and outputs the robot's linear and angular velocities; in the experiments we set the ranges of the angular velocity and linear velocity, and the network is composed of fully connected layers.
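A minimal sketch of such a fully connected policy network follows. The hidden width and the concrete velocity bounds (linear velocity in [0, v_max], angular velocity in [-w_max, w_max]) are assumptions; the patent states only that these ranges are set in the experiments and that the network consists of fully connected layers.

```python
import torch
import torch.nn as nn

class PPOActor(nn.Module):
    """Fully connected policy: input the current state s_t, output linear and angular velocity.
    Hidden width and velocity bounds below are illustrative assumptions."""
    def __init__(self, state_dim, hidden=128, v_max=0.5, w_max=1.0):
        super().__init__()
        self.v_max, self.w_max = v_max, w_max
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, s_t):
        v_raw, w_raw = self.net(s_t).unbind(dim=-1)
        v = torch.sigmoid(v_raw) * self.v_max   # linear velocity in [0, v_max]
        w = torch.tanh(w_raw) * self.w_max      # angular velocity in [-w_max, w_max]
        return v, w
```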
To collect appropriate data for training the matching-network reward, we collected state sequences under a variety of different maps; 5 maps were designed for data collection. Each map uses randomly generated start points and target points during data collection. To ensure that the distribution of positive and negative samples in the collected data is balanced, a PPO algorithm based on a traditional reward function is used to navigate the agent. The number of positive and negative samples collected under each map is shown in Table 1.
Table 1. Distribution of the data set under each map
Before reinforcement learning training begins, we need to pretrain the matching network. The matching network adopts a 2-way 20-shot training mode, i.e., the samples in each training episode are drawn from two categories (success and failure), with 20 samples selected from each category. Cosine similarity is used as the matching measure. The learning rate was set to 1e-4 and the batch size to 20. We use 60% of the paths in the sample set as the training set and the rest as the validation set. The error and accuracy during training of the matching network are shown in Fig. 5.
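For concreteness, one pretraining step under the 2-way 20-shot setting described above could look like the sketch below (cosine similarity via matching_net_predict from the earlier sketch, learning rate 1e-4). The sample_episode helper and the query-batch handling are assumptions, not the actual training code.

```python
import torch

def pretrain_step(encoder, optimizer, sample_episode):
    """One 2-way 20-shot episode: 20 successful and 20 failed support paths plus a batch of
    query paths drawn from the collected data set; `sample_episode()` is an assumed helper."""
    support_x, support_y, query_x, query_y = sample_episode()
    losses = []
    for x, y in zip(query_x, query_y):                  # e.g. a batch of 20 query paths
        p_success = matching_net_predict(encoder, x, support_x, support_y)
        p = torch.stack([1.0 - p_success, p_success]).clamp(1e-6, 1.0 - 1e-6)
        losses.append(-torch.log(p[y]))                 # episodic cross-entropy loss
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)   # learning rate from the text
```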
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. An application method of a mobile robot in reinforcement learning of a matching network, characterized in that the model mainly comprises a proximal policy optimization (PPO) algorithm and a matching network, wherein the PPO is used to realize navigation of the robot in an unknown environment and the matching network is used to provide a reward value for each action of the robot, the method of the robot's autonomous navigation system comprising the following steps:
Step one: by introducing a small number of examples during training, the matching network can quickly learn the differences among categories;
Step two: the robot obtains a state sequence (s_1, s_2, ..., s_t), where each state contains the radar observation data (L_1, L_2, ..., L_t) and the relative position coordinates (c_1, c_2, ..., c_t) between the robot and the target point;
Step three: the matching network randomly selects a group of samples from the existing database, the samples comprising equal numbers of successful and failed samples;
Step four: as the robot continuously interacts with the environment, the matching network compares the state sequence obtained by the current robot with the provided sample sequences, and the similarity between the two sequences is taken as the success rate of the current robot in completing the task;
Step five: the success rate of the current state is fed back as a reward value to the reinforcement learning proximal policy optimization algorithm;
Step six: the reward value generated by the matching network together with a common Euclidean-distance-based reward value is used as the reward obtained after the current robot performs an action, and the PPO updates its parameters according to the obtained reward and the current state of the robot.
2. The application method of a mobile robot in reinforcement learning of a matching network according to claim 1, characterized in that: the PPO proposes a new objective function so that updates can be performed in small batches over multiple training steps, and the parameters of the PPO are updated during training as
θ_{t+1} = θ_t + α ∇_θ L(θ_t)
where θ_t denotes the value of the PPO network parameters at the current time t and α denotes the learning rate; L(θ) is the objective function of the PPO update, expressed as
L(θ) = E[min(r_i(θ)A_i, clip(r_i(θ), 1-ε, 1+ε)A_i)]
where r_i(θ) = π_θ(a_i|s_i) / π_{θ_old}(a_i|s_i) is the probability ratio between the policies before and after the update, π_θ(a_i|s_i) denotes the probability that policy π generates action a_i in state s_i under parameters θ, and A_i denotes the advantage function; the clip function is a truncation function that limits the value of r_i(θ) to between 1-ε and 1+ε, which avoids abrupt changes in the policy and keeps training stable.
3. The application method of a mobile robot in reinforcement learning of a matching network according to claim 1, characterized in that: the matching network is able to learn the classification relationship among the k samples of a support set S = {(x_i, y_i)}_{i=1}^k, i.e. the mapping between samples and labels; given a test sample x̂, the matching network defines a probability distribution over the test label given the support set, P(ŷ | x̂, S), where the mapping from S to this distribution is implemented by neural networks, and the most simplified model of the matching network is
ŷ = Σ_{i=1}^{k} β(x̂, x_i) y_i
where x_i, y_i denote the samples and labels of the support set S and x̂, ŷ denote a sample and label of the test set; β denotes the attention mechanism, whose expression is
β(x̂, x_i) = exp(c(g(x̂), f(x_i))) / Σ_{j=1}^{k} exp(c(g(x̂), f(x_j)))
where c denotes the cosine distance between two vectors and f, g denote the encodings of the support set and the test set, respectively.
4. The application method of a mobile robot in reinforcement learning of a matching network according to claim 1, characterized in that: the matching network takes a set of state sequences as input and predicts the success rate of the current state; the current sequence (s_1, s_2, ..., s_t) is encoded by the encoder g_θ into an input vector G, while the matching network takes the provided samples as positive samples, randomly extracts negative samples from the database, passes these samples through the encoder f_θ to obtain vectors F, and computes the similarity between G and F by cosine similarity.
5. The application method of a mobile robot in reinforcement learning of a matching network according to claim 4, characterized in that: the radar produces a one-dimensional array of data at each step, so a one-dimensional convolution is used to compress it; the relative position between the robot and the target point is also a one-dimensional array and is likewise compressed with a one-dimensional convolution, and the compressed data are concatenated and passed through a fully connected layer to obtain the encoded vector.
6. The application method of a mobile robot in reinforcement learning of a matching network according to claim 1, characterized in that: the PPO takes the current state s_t of the robot as input and outputs the linear velocity and angular velocity of the robot; in the experiments the ranges of the robot's angular velocity and linear velocity are set, and the network is composed of fully connected layers.
CN202310795952.6A 2023-07-01 2023-07-01 Application method of mobile robot in reinforcement learning of matching network Pending CN116858241A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310795952.6A CN116858241A (en) 2023-07-01 2023-07-01 Application method of mobile robot in reinforcement learning of matching network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310795952.6A CN116858241A (en) 2023-07-01 2023-07-01 Application method of mobile robot in reinforcement learning of matching network

Publications (1)

Publication Number Publication Date
CN116858241A (en) 2023-10-10

Family

ID=88220949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310795952.6A Pending CN116858241A (en) 2023-07-01 2023-07-01 Application method of mobile robot in reinforcement learning of matching network

Country Status (1)

Country Link
CN (1) CN116858241A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117873089A (en) * 2024-01-10 2024-04-12 南京理工大学 Multi-mobile robot cooperation path planning method based on clustering PPO algorithm
CN117933346A (en) * 2024-03-25 2024-04-26 之江实验室 Instant rewarding learning method based on self-supervision reinforcement learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination