CN110084307B - Mobile robot vision following method based on deep reinforcement learning - Google Patents
Mobile robot vision following method based on deep reinforcement learning
- Publication number
- CN110084307B · CN201910361528.4A
- Authority
- CN
- China
- Prior art keywords
- model
- robot
- cnn
- image
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/12—Target-seeking control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention provides a mobile robot visual following method based on deep reinforcement learning. The method adopts a framework of "simulated-image supervised pre-training + model migration + RL". First, a small amount of data is collected in a real environment, and the data set is expanded automatically with a computer program and image processing techniques, yielding in a short time a large simulated data set adapted to real scenes for the supervised training of the following robot's direction control model. Second, a CNN model for controlling the robot's direction is built and trained in a supervised manner on the automatically constructed simulated data set, serving as a pre-trained model. Finally, the knowledge of the pre-trained model is migrated to a control model based on deep reinforcement learning (DRL) so that the robot can perform the following task in a real environment; combined with the reinforcement learning mechanism, the robot keeps following while interacting with the environment and improves its direction control performance. The method is highly robust and greatly reduces cost.
Description
Technical Field
The invention belongs to the technical field of intelligent robots, and relates to a mobile robot vision following method based on deep reinforcement learning.
Background
With the progress of technology and the development of society, more and more intelligent robots are appearing in people's lives. The following robot is one of the new systems that has attracted much attention in recent years: it can serve as an assistant that moves along with its owner in complex environments such as hospitals, shopping malls, or schools, bringing great convenience to people's lives. A following robot has the capabilities of autonomous perception, recognition, decision-making, and movement; it can recognize a specific target and, combined with a corresponding control system, follow that target in a complex scene.
At present, following robot systems are generally based either on a visual sensor alone or on a combination of multiple sensors. The former usually acquires visual images with a stereo camera, which requires complicated calibration steps and adapts poorly to strong outdoor illumination; the latter increases system cost through the additional sensors and introduces a complex data fusion process. To ensure robust tracking in dynamic, unknown environments, complex features typically have to be designed by hand, which greatly increases human cost, time cost, and computational resources. In addition, a conventional following robot system usually splits the whole system into a target tracking module and a robot motion control module; in such a pipeline design, errors produced in an earlier module propagate to the later modules, so error accumulation is gradually amplified and ultimately has a large impact on system performance.
In summary, the traditional following robot system suffers from high hardware and design costs, and with simple hardware it cannot fully adapt to the variability and complexity of indoor and outdoor environments. As a result the robot easily loses the target person, the robustness of the following system is reduced, and the application and popularization of following robots in everyday life are seriously hindered.
Disclosure of Invention
To address the defects of current traditional following robot designs, the invention provides a mobile robot visual following method based on deep reinforcement learning.
The invention uses a monocular color camera as the robot's only input sensor and introduces a Convolutional Neural Network (CNN) and Deep Reinforcement Learning (DRL) into the following robot system. This removes the tedious manual feature design of traditional following robot systems, lets the robot learn the control strategy directly from its field-of-view images, greatly reduces the chance of losing the tracked target, and adapts better to illumination changes, background object interference, and distractor pedestrians in complex environments. Meanwhile, the introduction of deep reinforcement learning enables the following robot to keep learning from experience while interacting with the environment, continuously raising its level of intelligence.
The method adopts a framework of "simulated-image supervised pre-training + model migration + RL". First, a small amount of data is collected in a real environment, and the data set is expanded automatically with a computer program and image processing techniques, yielding in a short time a large simulated data set adapted to real scenes for the supervised training of the following robot's direction control model. Second, a CNN model for controlling the robot's direction is built and trained in a supervised manner on the automatically constructed simulated data set, serving as a pre-trained model. Finally, the knowledge of the pre-trained model is migrated to a DRL-based control model so that the robot can perform the following task in a real environment; combined with a Reinforcement Learning (RL) mechanism, the robot keeps following while interacting with the environment and improves its direction control performance.
The specific technical scheme is as follows:
a mobile robot vision following method based on deep reinforcement learning comprises the following steps:
the method comprises the following steps: automated construction of a data set;
in order to reduce the cost of data collection and quickly obtain large-scale training data, the invention designs an automatic data set construction method by utilizing a computer program and an image processing technology. A small amount of data is collected in a simple experiment scene, then the obtained small amount of experiment data is expanded in a large scale by using an image mask technology, and a large amount of data which can adapt to complex indoor and outdoor scenes can be obtained in a short time, so that the cost of manually collecting and marking the data is greatly reduced.
(1) Preparing a simple scene with a followed object easily distinguished from the background; acquiring view images of different positions of a target person in the robot view from the view of the following robot in a simple scene;
(2) Prepare images of the application scenes of the following robot as complex scene images, such as indoor scenes, outdoor scenes, and street views. Because the followed target person in the simple scene is easily separated from the background, the target person can be extracted from the simple-scene background with an image mask technique and then superimposed on a complex scene to obtain an image of the target person in that complex scene; the corresponding action space label from the simple scene is assigned directly to the synthesized complex scene image;
the image mask technology is mainly to multiply a two-dimensional matrix (namely a mask) designed for an interested area of an image and an image to be processed, and an obtained result is the interested area to be extracted.
Step two: constructing and training a direction control model based on the CNN;
the CNN-based direction control model is responsible for outputting direction prediction of actions to be taken for the robot visual field image. The model is supervised and trained by utilizing a large-scale simulation data set which is automatically constructed, so that the model has a high direction control level. Knowledge learned in this model will be migrated by means of model migration into the DRL-based directional control model as a priori knowledge of the latter with respect to the directional control strategy.
Before an image collected by the robot's monocular color camera is input to the CNN, its three RGB channels are converted into HSV channels; the HSV image is then fed to the CNN as input. The CNN model is then trained in a supervised manner with the data set automatically constructed in step one, so that the CNN can output the corresponding action state from the robot's view input image;
step three: model migration;
the invention takes the learned strategy in the CNN direction control model as a model migration method based on the prior knowledge of the DRL direction control model. Although the output of the CNN model and the DRL model have different meanings: the output of the CNN model is the probability of each directional motion, while the output of the DRL model is typically a value estimate of each directional motion, but they have the same output dimensions. Generally, the CNN model has a higher value estimate corresponding to the direction of motion with a higher output probability.
Transferring the CNN parameter weight trained in the step two to a DRL model as an initial parameter so that the DRL model obtains the same control level as the CNN model;
step four: building and training a direction control model based on the DRL;
the DRL model is responsible for further improving the performance of the model by utilizing an RL mechanism on the basis of obtaining the prior knowledge of the CNN model. The introduction of the RL mechanism enables the robot to collect experience and improve own knowledge in the process of interacting with the environment, so that the robot following direction control level higher than that of a CNN model is obtained.
Apply the DRL model with the initial parameters migrated in step three on the robot, and let the robot continuously update the model, learn, and adapt to the current environment through continuous interaction with the environment.
Further, the second step: the size of an image collected by a monocular color camera of the robot is 640 x 480, RGB three channels of the image are converted into HSV channels before the image is input to a neural network, the size of the image of 640 x 480 is adjusted to be 60 x 80, the images collected at 4 adjacent moments are combined together to be used as the input of the network, a final input layer comprises 12 channels of 4 x 3, and the size of each channel is 60 x 80.
Further, in step two: the CNN structure consists of 8 layers: three convolutional layers, two pooling layers, two fully connected layers, and an output layer. The convolutional layers extract features from the input image, and the pooling layers reduce the dimensionality of the extracted features to cut the computation required for forward propagation. From front to back, the convolution kernel sizes of the three convolutional layers are 8 × 8, 4 × 4, and 2 × 2; both pooling layers use max pooling with a size of 2 × 2. After the third convolution, the data is fed into two fully connected layers of 384 nodes each; the output layer follows the fully connected layers and produces a multi-dimensional output, each dimension representing the action in the corresponding direction. Three directional actions are included: forward, left, and right. A ReLU activation function is added after each of the three convolutional layers and the two fully connected layers to apply a nonlinear transformation. The CNN parameters are updated with a cross-entropy loss function, expressed as:

L(y', f(x)) = −Σ_i y'_i log f_i(x)
where y' is the label of the sample, a three-dimensional one-hot vector in which the dimension equal to 1 marks the correct action, and f(x) is the probability predicted by the CNN model for each action dimension.
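A minimal numeric check of this loss for the three-action case; the prediction probabilities are made up for the example:

```python
import math

def cross_entropy(y_true, y_pred):
    """H(y', f(x)) = -sum_i y'_i * log f_i(x) for a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

# One-hot label: the correct action is "forward" (first dimension).
y_true = [1.0, 0.0, 0.0]
# CNN output probabilities for (forward, left, right).
y_pred = [0.7, 0.2, 0.1]

loss = cross_entropy(y_true, y_pred)
print(round(loss, 4))  # 0.3567, i.e. -log(0.7)
```

Because the label is one-hot, only the term for the correct action survives, so the loss reduces to the negative log-probability the model assigns to that action.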
Further, the DRL model in the third step is specifically a DQN model, and the migration process is: and removing the Softmax layer of the trained CNN network, and directly endowing the weight parameters of the previous layers to the DQN model.
Further, in step four: DQN uses a neural network to approximate the value function, i.e., the input of the neural network is the current state s and the output is the predicted value Q_θ(s, a). At each time step the environment gives a state s; the agent obtains from the value function network a value Q_θ(s, a) for s and every action, then selects an action with the ε-greedy algorithm and makes a decision. After receiving the action a, the environment returns a reward r and the next state s'; this constitutes one step. The parameters of the value function network are updated according to r. DQN defines its objective function with the mean squared error:

L(θ) = E[(r + γ max_{a'} Q_θ(s', a') − Q_θ(s, a))²]
where s' and a' are the state and action at the next moment, γ is a hyper-parameter, and θ is the model parameter;
During training, the parameters are updated as:

θ ← θ + α [r + γ max_{a'} Q_θ(s', a') − Q_θ(s, a)] ∇_θ Q_θ(s, a)

where α is the learning rate.
when the final depth reinforcement learning algorithm is applied to a physical robot, a real-time view image acquired by a monocular color camera carried by the robot is used as a state input value of a DRL algorithm, an algorithm output action space is a set of direction control signals, and the robot can move along with a target person in real time by executing a direction control instruction.
Addressing the problems of following robot systems in practical applications, the method provides an intelligent following robot system based on a deep reinforcement learning algorithm. The end-to-end design fuses the tracking module and the direction control module of a traditional following robot system, preventing error transmission and accumulation between modules and letting the robot learn the mapping from target to behavior strategy directly. Compared with a traditional following robot system, this system is more robust, greatly reduces hardware and labor costs, and makes the popularization of following robots in everyday life more feasible.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a schematic diagram of an automated data set construction process of the present invention.
FIG. 3 is a diagram of the effect of the invention on the automated construction of a composite image from a data set.
Wherein, each subgraph is described as follows:
(a) (b) (c) (d) are examples of pictures of different positions of a target person acquired by the robot in a simple scene;
(e) (f) (g) (h) are examples of complex scene pictures collected on the Internet;
(i) (j) (k) (l) are examples of composite data set images processed by the image mask technique;
(a)(e)(i), (b)(f)(j), (c)(g)(k), and (d)(h)(l) show the complete image mask synthesis process and effect for the target person at different positions in the simple image, respectively.
Fig. 4 is a diagram showing correspondence between an input image and an operation space according to the present invention.
FIG. 5 is a following robot system architecture diagram of the present invention.
Detailed Description
The software environment of this embodiment is the Ubuntu 14.04 system, the mobile robot is a TurtleBot2 robot, and the robot's input sensor is a monocular color camera with a resolution of 640 × 480.
The method comprises the following steps: automatic data set construction process
For the direction control model of the supervised following robot, a camera view image of the following robot is input, and an action which the robot should take at the current moment is output. The construction process of the entire data set includes two parts: and obtaining an input visual field image and labeling an output action.
A simple scene is prepared in which the followed object needs to be more easily distinguished from the background. In a simple scene, a plurality of visual field images of the target person at different positions in the visual field of the robot are collected from the visual field of the following robot. A certain number of complex scene images are downloaded from the Internet, and the scene images mainly comprise common application scenes of the following robot, such as indoor and outdoor scenes, street scenes and the like. Because the followed target person in the simple scene can be easily separated from the background, the target person can be extracted from the background by utilizing an image mask technology and then superposed with the complex scene obtained on the Internet, so that the image of the target person in the complex scene can be obtained, and a corresponding action space label in the simple scene can be directly given to the synthesized complex scene image. A schematic diagram of an automated data set construction process is shown in fig. 2. The effect diagrams of the simple scene image, the internet complex scene image and the data set automatic construction process are shown in fig. 3.
After collecting images containing the target person at different positions in the simple scene, and because the color of the tracked target differs greatly from the background color of the simple scene, an image mask is designed directly by setting a color threshold. Applying the mask to the robot's view image yields a binary image separating the tracked target from the background, from which the outline of the target person is extracted: the image values of the background are all 0 and those of the tracked target person are 1. The target person's image region can then be superimposed on a complex scene picture. The action label is obtained by averaging the horizontal positions of the pixels with value 1 in the masked binary image.
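The action-label rule just described (averaging the horizontal positions of the target's value-1 pixels in the binary mask) can be sketched like this; the fractional thresholds that split the view into left/forward/right zones are assumptions for illustration, not values from the patent:

```python
import numpy as np

def action_label(mask, left_frac=1/3, right_frac=2/3):
    """Derive the action label from the mean horizontal position of
    the mask's foreground (value-1) pixels."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # target lost: no foreground pixels
    mean_x = xs.mean() / mask.shape[1]  # normalised horizontal position
    if mean_x < left_frac:
        return "left"
    if mean_x > right_frac:
        return "right"
    return "forward"

mask = np.zeros((60, 80), dtype=np.uint8)
mask[20:40, 5:15] = 1          # target blob near the left edge
print(action_label(mask))       # left
```

A blob centred in the view would yield "forward", and an all-zero mask returns None, signalling a lost target.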
Step two: CNN-based direction control model building and training process
The size of the image collected from the monocular color camera is 640 × 480, the RGB three channels are converted into HSV channels before being input to the neural network, and the 640 × 480 image is adjusted to 60 × 80 and then is used as an input image to be input to the neural network. The design of the invention combines the images collected at 4 adjacent moments together as the input of the network, and because a single image is a three-channel HSV image, the final input layer comprises 12 channels of 4 multiplied by 3, and the size of each channel is 60 multiplied by 80. And then, carrying out supervised training on the CNN model by using the automatically constructed data set, so that the CNN network can achieve the effect of outputting a corresponding action state through a robot visual field input image.
Step three: model migration process
The DQN model finally used in the invention has a structure similar to the CNN direction control network described above, but the last Softmax layer is removed, and value predictions for each state-action pair are output directly instead of a probability distribution over actions. The model migration strategy of the invention is therefore: remove the Softmax layer of the trained CNN network and assign the weight parameters of the preceding layers directly to the DQN model, achieving the transfer of prior knowledge.
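A minimal sketch of this parameter hand-over, representing both networks as name→array dictionaries; the layer names and shapes are invented stand-ins for the real model checkpoints:

```python
import numpy as np

# Trained CNN weights (random stand-ins); "softmax" is the output head.
cnn_weights = {
    "conv1": np.random.randn(8, 8, 12, 32),
    "conv2": np.random.randn(4, 4, 32, 64),
    "conv3": np.random.randn(2, 2, 64, 64),
    "fc1": np.random.randn(384, 384),
    "fc2": np.random.randn(384, 384),
    "softmax": np.random.randn(384, 3),
}

def migrate(cnn, drop=("softmax",)):
    """Copy every layer except the softmax head into the DQN model."""
    return {name: w.copy() for name, w in cnn.items() if name not in drop}

dqn_weights = migrate(cnn_weights)
print(sorted(dqn_weights))  # the softmax head is gone
```

The DQN then adds its own value-output head on top of the copied layers, so it starts from the CNN's control level rather than from random initialisation.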
Step four: DRL-based directional control model training process
After model migration is complete, the DRL model can be used on the robot and continuously interacts with the environment, so that the robot keeps updating the model, learns the current environment, and improves following robustness. In this process the algorithm outputs a discrete action space to control the robot; the action space of the following robot is the set of left, right, and forward instructions, and the correspondence between the action space and the input image is shown in fig. 4.
In RL the data has no explicit labels; only the externally fed reward signal indicates how good an action is, so the design of the reward function is crucial to a successful RL application. The direction control reward function of the following robot is designed as follows: a user remotely connects to the local end of the following robot to observe the robot's view image; initially STOP = 0, meaning following has not ended. When the user finds that his or her position in the robot's view image deviates from the center, the user sends a STOP message through a handheld device; upon receiving it the robot knows that following has failed, sets STOP to 1, and stops moving. This design both simplifies the user's operation and provides a more accurate reward signal that accelerates model convergence. The reward function can then be expressed as:

r = C, if STOP = 1; r = 0, otherwise
wherein C is a negative number.
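A sketch of this reward signal; the particular value of C is an assumption (the patent only requires it to be negative):

```python
def reward(stop_signal, C=-1.0):
    """Return C (< 0) when the user sends STOP (following failed),
    otherwise 0: the robot is only punished for losing the target."""
    return C if stop_signal == 1 else 0.0

print(reward(0), reward(1))  # 0.0 -1.0
```

Because the only non-zero reward is the penalty on failure, the learned policy is pushed to keep the target person centred in the view for as long as possible.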
Through a verification experiment on the TurtleBot2 robot, the method can accurately follow a specific target person and has high robustness.
Claims (3)
1. A mobile robot vision following method based on deep reinforcement learning is characterized by comprising the following steps:
the method comprises the following steps: automated construction of a data set;
(1) preparing a simple scene with a followed object easily distinguished from the background; acquiring view images of different positions of a target person in the robot view from the view of the following robot in a simple scene;
(2) preparing an application scene following the robot as a complex scene image, extracting a target person from the background of a simple scene by using an image mask technology, and then overlapping the target person with the complex scene to obtain an image of the target person in the complex scene, and directly endowing a corresponding action space label under the simple scene to the synthesized complex scene image;
step two: constructing and training a direction control model based on the CNN;
performing supervised training on the CNN model with the data set automatically constructed in step one, so that the CNN can output the corresponding action state from the robot's view input image; the three RGB channels of the image collected from the robot's monocular color camera are converted into HSV channels before being input to the CNN, the HSV image is then fed to the CNN as the input image, and the network outputs the corresponding action state;
the CNN structure consists of 8 layers, including a convolution layer 3 layer, a pooling layer 2 layer, a full-communication layer 2 layer and an output layer; from front to back, the convolution kernel parameter settings for the three convolution layers are: 8 × 8, 4 × 4, 2 × 2; the two pooling layers are both subjected to maximum pooling, and the sizes of the two pooling layers are both 2 multiplied by 2; after the third convolution, the data is input to two full-connection layers, each layer is provided with 384 nodes, the output layer is arranged behind the full-connection layer, the data is output in a multi-dimensional mode after passing through the output layer, each dimension represents the action in the corresponding direction, and the actions in three directions are contained: forward, left, right; a Relu activation function is added after the three convolutional layers and the two full-link layers to carry out nonlinear transformation on the result of the input layer; the CNN parameter is updated by adopting a cross entropy loss function, which is specifically expressed as:
wherein y' is the label of the sample, a three-dimensional one-hot vector in which the dimension equal to 1 marks the correct action; f(x) represents the predicted probability of the CNN model for each action dimension;
step three: model migration;
transferring the CNN parameter weight trained in the step two to a DRL model as an initial parameter so that the DRL model obtains the same control level as the CNN model; the DRL model is a DQN model, and the migration process is as follows: removing the Softmax layer of the trained CNN network, and directly endowing the weight parameters of the previous layers to the DQN model;
step four: building and training a direction control model based on the DRL;
and (4) applying the DRL model after the initial parameter migration in the step three to a robot end for use, and enabling the robot to continuously update the model and learn the current environment through continuous interaction with the environment.
2. The mobile robot visual following method based on deep reinforcement learning according to claim 1, wherein in step two: the image collected by the robot's monocular color camera is 640 × 480; before the image is input to the neural network, its RGB channels are converted to HSV channels and the 640 × 480 image is resized to 60 × 80; the images collected at 4 adjacent moments are stacked together as the network input, so the final input layer comprises 4 × 3 = 12 channels, each of size 60 × 80.
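The preprocessing in this claim can be sketched as follows. To keep the example dependency-free, the resize is a naive stride-8 subsampling and the RGB-to-HSV conversion is omitted (in practice a library call such as OpenCV's `cvtColor` would perform it); the random frames stand in for real camera images:

```python
import numpy as np

def downsample(img, out_h=60, out_w=80):
    # Naive stride-based resize from 480x640 to 60x80 (factor 8 per axis).
    # A real pipeline would use proper interpolation and would convert
    # RGB to HSV first; both are omitted to keep the sketch minimal.
    h, w, _ = img.shape
    return img[:: h // out_h, :: w // out_w, :]

# Four hypothetical frames from adjacent moments, each 480x640 RGB.
frames = [np.random.rand(480, 640, 3) for _ in range(4)]
small = [downsample(f) for f in frames]    # each (60, 80, 3)
stacked = np.concatenate(small, axis=-1)   # (60, 80, 12): 4 frames x 3 channels
print(stacked.shape)
```

Stacking 4 adjacent frames gives the network short-term motion information that a single frame cannot provide, which is what makes the 12-channel input useful for following a moving target.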
3. The mobile robot visual following method based on deep reinforcement learning according to claim 1, wherein in step four: the DQN approximates the value function with a neural network, i.e. the input of the neural network is the current state value s and the output is the predicted value Qθ(s, a); at each time step, the environment gives a state value s, the agent derives the values Qθ(s, a) for s and all actions from the value-function network, then selects an action by the ε-greedy algorithm and makes the decision, and after receiving the action a the environment returns a reward value r and the next state s'; this constitutes one step; the parameters of the value-function network are then updated according to r; the DQN uses the mean square error to define the objective function:

L(θ) = E[(r + γ max_{a'} Qθ(s', a') − Qθ(s, a))²]
wherein s' and a' are the state and action at the next moment, γ is a hyper-parameter (the discount factor), and θ is the model parameter;
during training, the parameters are updated as follows:

θ ← θ + α [r + γ max_{a'} Qθ(s', a') − Qθ(s, a)] ∇θ Qθ(s, a)

where α is the learning rate.
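One ε-greedy decision and one TD update from this claim can be sketched with a tabular stand-in for the Q-network (the tabular form, state encoding, and the hyper-parameter values α, γ, ε are illustrative assumptions; the patent's actual Q-function is the migrated CNN):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3            # actions: forward, left, right
Q = np.zeros((n_states, n_actions))   # tabular stand-in for Q_theta(s, a)
alpha, gamma, eps = 0.1, 0.9, 0.1     # learning rate, discount, exploration

def select_action(Q, s, eps):
    # epsilon-greedy: explore with probability eps, otherwise act greedily.
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def update(Q, s, a, r, s_next):
    # TD target r + gamma * max_a' Q(s', a'); move Q(s, a) toward it by alpha.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# One environment step: in state 0, action 0 yields reward 1.0, next state 1.
s, a, r, s_next = 0, 0, 1.0, 1
update(Q, s, a, r, s_next)
print(Q[0, 0])
```

After the update, Q(0, 0) has moved a fraction α of the way from 0 toward the TD target, which is exactly the gradient step of the mean-square objective above for a tabular Q.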
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910361528.4A CN110084307B (en) | 2019-04-30 | 2019-04-30 | Mobile robot vision following method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110084307A CN110084307A (en) | 2019-08-02 |
CN110084307B true CN110084307B (en) | 2021-06-18 |
Family
ID=67418184
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910361528.4A Expired - Fee Related CN110084307B (en) | 2019-04-30 | 2019-04-30 | Mobile robot vision following method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110084307B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114270370A (en) * | 2019-09-05 | 2022-04-01 | 三菱电机株式会社 | Inference device, device control system, and learning device |
CN110728368B (en) * | 2019-10-25 | 2022-03-15 | 中国人民解放军国防科技大学 | Acceleration method for deep reinforcement learning of simulation robot |
CN112731804A (en) * | 2019-10-29 | 2021-04-30 | 北京京东乾石科技有限公司 | Method and device for realizing path following |
CN111578940B (en) * | 2020-04-24 | 2021-05-11 | 哈尔滨工业大学 | Indoor monocular navigation method and system based on cross-sensor transfer learning |
CN111539979B (en) * | 2020-04-27 | 2022-12-27 | 天津大学 | Human body front tracking method based on deep reinforcement learning |
CN111523495B (en) * | 2020-04-27 | 2023-09-01 | 天津中科智能识别产业技术研究院有限公司 | End-to-end active human body tracking method in monitoring scene based on deep reinforcement learning |
CN112297012B (en) * | 2020-10-30 | 2022-05-31 | 上海交通大学 | Robot reinforcement learning method based on self-adaptive model |
CN112702423B (en) * | 2020-12-23 | 2022-05-03 | 杭州比脉科技有限公司 | Robot learning system based on Internet of things interactive entertainment mode |
CN112799401A (en) * | 2020-12-28 | 2021-05-14 | 华南理工大学 | End-to-end robot vision-motion navigation method |
CN113031441B (en) * | 2021-03-03 | 2022-04-08 | 北京航空航天大学 | Rotary mechanical diagnosis network automatic search method based on reinforcement learning |
CN113158778A (en) * | 2021-03-09 | 2021-07-23 | 中国电子科技集团公司第五十四研究所 | SAR image target detection method |
CN113011526B (en) * | 2021-04-23 | 2024-04-26 | 华南理工大学 | Robot skill learning method and system based on reinforcement learning and unsupervised learning |
CN113156959B (en) * | 2021-04-27 | 2024-06-04 | 东莞理工学院 | Self-supervision learning and navigation method for autonomous mobile robot in complex scene |
CN113485326A (en) * | 2021-06-28 | 2021-10-08 | 南京深一科技有限公司 | Autonomous mobile robot based on visual navigation |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108549928A (en) * | 2018-03-19 | 2018-09-18 | 清华大学 | Visual tracking method and device based on continuous moving under deeply learning guide |
CN108932735A (en) * | 2018-07-10 | 2018-12-04 | 广州众聚智能科技有限公司 | A method of generating deep learning sample |
CN109242882A (en) * | 2018-08-06 | 2019-01-18 | 北京市商汤科技开发有限公司 | Visual tracking method, device, medium and equipment |
CN109341689A (en) * | 2018-09-12 | 2019-02-15 | 北京工业大学 | Vision navigation method of mobile robot based on deep learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180018562A1 (en) * | 2016-07-14 | 2018-01-18 | Cside Japan Inc. | Platform for providing task based on deep learning |
AU2017101165A4 (en) * | 2017-08-25 | 2017-11-02 | Liu, Yichen MR | Method of Structural Improvement of Game Training Deep Q-Network |
Non-Patent Citations (4)
Title |
---|
Bridging the Gap Between Value and Policy Based Reinforcement Learning;Ofir Nachum et al;《arXiv:1702.08892v1》;20170228;1-16 * |
Human-level control through deep reinforcement learning;Volodymyr Mnih et al;《NATURE》;20150226;Vol. 518;p. 1 right column, p. 6 right column, p. 7 right column *
Research on Robust Object Tracking Methods Based on Multimodal Video;Li Chang;《China Master's Theses Full-text Database, Information Science and Technology》;20180815;Vol. 2018(No. 8);Chapter 4 *
Research on Visual Object Tracking Algorithms Based on Machine Learning;Yang Yuesheng;《China Master's Theses Full-text Database, Information Science and Technology》;20190215;Vol. 2019(No. 2);I138-2106 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20210618 |