CN112802197A - Visual SLAM method and system based on full convolution neural network in dynamic scene - Google Patents

Visual SLAM method and system based on full convolution neural network in dynamic scene Download PDF

Info

Publication number
CN112802197A
Authority
CN
China
Prior art keywords
image
dynamic
fcn
time image
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110295567.6A
Other languages
Chinese (zh)
Inventor
吕艳
柳双磊
倪益华
倪忠进
宋源普
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang A&F University ZAFU
Original Assignee
Zhejiang A&F University ZAFU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang A&F University ZAFU filed Critical Zhejiang A&F University ZAFU
Priority to CN202110295567.6A
Publication of CN112802197A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05Geographic models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4007Interpolation-based scaling, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Abstract

The invention provides a visual SLAM method and system based on a full convolution neural network in a dynamic scene, wherein the method comprises the following steps: acquiring an image dataset; constructing a full convolution neural network model according to the image data set; performing semantic segmentation on a monocular real-time image currently acquired by a camera by using the full convolution neural network model to obtain a semantic label image; removing the dynamic feature points of the monocular real-time image according to the semantic label image to obtain the static feature points of the monocular real-time image; and estimating the pose of the camera according to the static feature points. The method provided by the invention can accurately identify the dynamic target and complete semantic segmentation, effectively improve the accuracy and robustness of camera tracking, and improve the positioning and mapping precision of the visual SLAM in the dynamic scene.

Description

Visual SLAM method and system based on full convolution neural network in dynamic scene
Technical Field
The invention relates to the technical field of computer vision, in particular to a visual SLAM method based on a full convolution neural network in a dynamic scene and a visual SLAM system based on the full convolution neural network in the dynamic scene.
Background
Simultaneous Localization And Mapping (SLAM) refers to the process in which a robot, in an unknown environment, estimates its own pose and constructs a map of the environment from its on-board sensors; it is a prerequisite for many robot applications such as path planning, collision-free navigation and environment perception. Visual SLAM refers to using visual information to estimate the pose of the camera itself and to build a three-dimensional map of the environment.
In the prior art, the relative displacement between two adjacent frames of the input image can be estimated from the matching of feature points between the two frames, and the actual displacement of the camera is then calculated. However, a slowly moving dynamic object in the scene causes deviations in the camera pose calculation, so that the positioning of the whole visual SLAM system drifts.
Disclosure of Invention
Aiming at the technical problem in the prior art that a slowly moving dynamic object causes deviations in the camera pose calculation, so that the positioning of the whole visual SLAM system drifts, the invention provides a visual SLAM method based on a full convolution neural network in a dynamic scene and a visual SLAM system based on the full convolution neural network in the dynamic scene.
In order to achieve the above object, one aspect of the present invention provides a visual SLAM method based on a full convolution neural network in a dynamic scene, including the following steps: acquiring an image dataset; constructing a full convolution neural network model according to the image data set; performing semantic segmentation on a monocular real-time image currently acquired by a camera by using the full convolution neural network model to obtain a semantic label image; removing the dynamic feature points of the monocular real-time image according to the semantic label image to obtain the static feature points of the monocular real-time image; and estimating the pose of the camera according to the static feature points.
Further, the performing semantic segmentation on the monocular real-time image currently acquired by the camera by using the full convolution neural network model to obtain a semantic label image includes: converting the fully connected layers of the VGG16 network into convolutional layers by using convolution kernels with the same size as the input data of the fully connected layers of the VGG16 network in the full convolution neural network model, so as to obtain an FCN-VGG16 network; optimizing the FCN-VGG16 network; and performing a binary classification of dynamic target and background on the monocular real-time image by using the optimized FCN-VGG16 network to obtain the semantic label image.
Further, the method further comprises: after each convolution calculation in the FCN-VGG16 network, performing sparsity processing on a previous convolution layer by using a linear rectification function as an excitation function; pooling operations are performed for each pooling layer in the FCN-VGG16 network.
Further, the optimizing the FCN-VGG16 network includes: introducing a skip structure to perform upsampling operations on the pooling layers in the FCN-VGG16 network and fuse the results, so as to obtain the optimized FCN-VGG16 network.
Further, the performing, by using the optimized FCN-VGG16 network, two classification operations of a dynamic target and a background on the monocular real-time image to obtain the semantic label image includes: determining the category number of heatmaps of the optimized FCN-VGG16 network; determining the prediction probability of the monocular real-time image belonging to a target category; identifying a dynamic target of the monocular real-time image according to the category number of the heat map and the prediction probability; and performing semantic segmentation on the monocular real-time image according to the identified dynamic target to obtain the semantic label image.
Further, the dynamic target of the monocular real-time image is identified according to the category number of the heat map and the prediction probability through the following loss function:

L = -∑_{c=1}^{M} y_ic · log(p_ic)

wherein M is the category number of the heat map, C is the target category of the monocular real-time image, y_ic is a virtual parameter, p_ic is the prediction probability that the monocular real-time image belongs to target category C, and L is the loss value; the dynamic target of the monocular real-time image is then identified according to the loss value.
Further, the removing the dynamic feature points of the monocular real-time image according to the semantic segmentation result to obtain the static feature points of the monocular real-time image includes: carrying out image pyramid layering on the monocular real-time image by using a first scaling factor to obtain a plurality of layers of first scaled images; carrying out image pyramid layering on the semantic label image by using a second scaling factor to obtain a plurality of layers of second scaled images; sequentially removing the dynamic feature points in each first scaled image: carrying out ORB feature extraction on each first scaled image to obtain the ORB feature points of each first scaled image; outputting the image coordinate values of the pixel points of the dynamic region in each second scaled image to form a set U_i, wherein i is the layer number of the second scaled image and the dynamic region is the region where the dynamic target is located; matching the coordinate values of the ORB feature points of each first scaled image with the coordinate values in the corresponding set U_i; eliminating, from each first scaled image, the ORB feature points whose coordinate values match coordinate values in the set U_i; and forming a static feature point set M from the ORB feature points retained in each first scaled image. The estimating of the pose of the camera from the static feature points comprises: estimating the pose of the camera according to the static feature point set M.
Further, the first scaling factor is the same as the second scaling factor.
Further, before the coordinate values of the ORB feature points are matched with the coordinate values in the set U_i, the method further comprises the following step: storing the ORB feature points.
Another aspect of the present invention provides a full convolutional neural network based visual SLAM system in a dynamic scene, the system being configured to estimate the pose of the camera by using the full convolutional neural network based visual SLAM method in the dynamic scene described above.
Through the technical scheme provided by the invention, the invention at least has the following technical effects:
the vision SLAM method based on the full convolution neural network in the dynamic scene is characterized in that an ORB _ SLAM2 open source vision SLAM system is taken as a frame, a full convolution neural network model is constructed, a dynamic target in a monocular real-time image currently acquired by a camera is identified by using the full convolution neural network model, semantic segmentation is carried out on the monocular real-time image, a dynamic eye in the image is marked, finally, a dynamic feature point of the monocular real-time image is removed by using coordinate point mapping, a static feature point in the monocular real-time image is reserved, and the pose of the camera is estimated according to the static feature point. The method provided by the invention can accurately identify the dynamic target and complete semantic segmentation, effectively improve the accuracy and robustness of camera tracking, and improve the positioning and mapping precision of the visual SLAM in the dynamic scene.
Additional features and advantages of the invention will be set forth in the detailed description which follows.
Drawings
Fig. 1 is a flowchart of a visual SLAM method based on a full convolutional neural network in a dynamic scene according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an FCN-VGG16 network in a full convolutional neural network-based visual SLAM method in a dynamic scenario according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an upsampling operation in a visual SLAM method based on a full convolutional neural network in a dynamic scene according to an embodiment of the present invention;
fig. 4 is a schematic diagram of bilinear interpolation in a visual SLAM method based on a full convolution neural network in a dynamic scene according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating optimization of the FCN-VGG16 network in the visual SLAM method based on the full convolutional neural network in the dynamic scenario provided in the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
In the present invention, unless specified to the contrary, orientation terms such as "upper, lower, top, bottom" and the like are generally used with reference to the orientation shown in the drawings or to the positional relationship of the components in the vertical or gravitational direction.
The present invention will be described in detail below by way of embodiments with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present invention provides a visual SLAM method based on a full convolution neural network in a dynamic scene, including the following steps: S101: acquiring an image dataset; S102: constructing a full convolution neural network model according to the image data set; S103: performing semantic segmentation on a monocular real-time image currently acquired by a camera by using the full convolution neural network model to obtain a semantic label image; S104: removing the dynamic feature points of the monocular real-time image according to the semantic label image to obtain the static feature points of the monocular real-time image; S105: and estimating the pose of the camera according to the static feature points.
Specifically, in the embodiment of the present invention, the ORB_SLAM2 open source visual SLAM system is used as a framework, a convolutional neural network is improved, a full convolutional neural network is constructed, model training is performed on the constructed full convolutional neural network through an image data set, and a trained full convolutional neural network model is output. The method comprises the steps of identifying a dynamic target in a monocular real-time image currently acquired by a camera by using a full convolution neural network model, performing semantic segmentation on the monocular real-time image currently acquired by the camera to obtain a semantic label image, removing dynamic feature points of the monocular real-time image according to the semantic label image to obtain static feature points of the monocular real-time image, and estimating the pose of the camera by using the static feature points.
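Purely as an illustration of steps S101–S105, the following Python sketch shows how a single frame could be processed, assuming a trained PyTorch segmentation model whose output resolution matches the input and an OpenCV ORB detector; `fcn_model`, `pose_estimator` and the glue code are hypothetical placeholders for the patent's modules, not ORB_SLAM2 code.

```python
import cv2
import torch

def run_visual_slam_frame(frame_bgr, fcn_model, orb, pose_estimator, device="cpu"):
    # S103: semantic segmentation of the current monocular frame
    # (two classes: dynamic target vs. background)
    tensor = torch.from_numpy(frame_bgr).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        logits = fcn_model(tensor.to(device))            # shape (1, 2, H, W)
    dynamic_mask = logits.argmax(dim=1).squeeze(0).cpu().numpy().astype("uint8")

    # S104: extract ORB features and keep only those outside the dynamic region
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    keypoints = orb.detect(gray, None)
    static_points = [kp for kp in keypoints
                     if dynamic_mask[int(kp.pt[1]), int(kp.pt[0])] == 0]

    # S105: estimate the camera pose from the remaining static feature points
    return pose_estimator(static_points)
```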
According to the method provided by the invention, the dynamic target can be accurately identified and the semantic segmentation is completed, the tracking accuracy and robustness of the camera are effectively improved, and the positioning and mapping accuracy of the visual SLAM in the dynamic scene is improved.
Further, the performing semantic segmentation on the monocular real-time image currently acquired by the camera by using the full convolution neural network model to obtain a semantic label image includes: converting the fully connected layers of the VGG16 network into convolutional layers by using convolution kernels with the same size as the input data of the fully connected layers of the VGG16 network in the full convolution neural network model, so as to obtain an FCN-VGG16 network; optimizing the FCN-VGG16 network; and performing a binary classification of dynamic target and background on the monocular real-time image by using the optimized FCN-VGG16 network to obtain the semantic label image.
Specifically, in the embodiment of the invention, the VGG16 network is composed of 13 convolutional layers, 5 pooling layers and 3 fully connected layers. The fully connected layers of the VGG16 network are converted into convolutional layers by using convolution kernels in the full convolution neural network model whose size equals the size of the input data of the corresponding fully connected layer, so as to obtain the FCN-VGG16 network; fig. 2 is a schematic diagram of the FCN-VGG16 network. In fig. 2, the last convolutional layer Conv16 outputs heat maps, and the number of heat maps corresponds to the number of categories. The FCN-VGG16 network is then optimized, and the optimized FCN-VGG16 network is used to perform the binary classification of dynamic target and background on the monocular real-time image to obtain the semantic label image.
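As a rough illustration of this conversion, the sketch below builds an FCN head on top of VGG16 in PyTorch, assuming the standard torchvision VGG16 layout in which the first fully connected layer sees a 7×7×512 input; the layer arrangement and the two-class output are assumptions for this example, not the patent's training code.

```python
import torch.nn as nn
from torchvision.models import vgg16

def build_fcn_vgg16(num_classes=2):
    """Replace the three fully connected layers of VGG16 with convolutions whose
    kernel size equals the spatial size of their input (7x7 first, then 1x1)."""
    backbone = vgg16(weights=None).features               # 13 conv + 5 pooling layers
    head = nn.Sequential(
        nn.Conv2d(512, 4096, kernel_size=7), nn.ReLU(inplace=True), nn.Dropout2d(),
        nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True), nn.Dropout2d(),
        nn.Conv2d(4096, num_classes, kernel_size=1),       # heat maps, one per category
    )
    return nn.Sequential(backbone, head)
```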
Further, the method further comprises: after each convolution calculation in the FCN-VGG16 network, performing sparsity processing on a previous convolution layer by using a linear rectification function as an excitation function; pooling operations are performed for each pooling layer in the FCN-VGG16 network.
Specifically, in the embodiment of the present invention, after each convolution calculation in the FCN-VGG16 network, a linear rectification function is used as an excitation function to perform sparsity processing on a previous convolution layer, and a mathematical expression of the linear rectification function is as follows:
f(x)=max(0,x)
where x is the input value and f(x) is the output value. When the input x < 0, the output is 0; when the input x ≥ 0, the output equals the input. The linear rectification function is simple to compute, has low computational complexity and converges quickly under gradient descent; it ensures that the output values of the FCN-VGG16 network are non-negative, while increasing the sparsity of the FCN-VGG16 network and reducing overfitting.
Pooling at each pooling layer in the FCN-VGG16 network reduces the size of the input monocular real-time image and performs information filtering and feature screening on the features captured by the convolutional layers; the pooling operation is generally represented in the form:

A^k_{i,j} = [ ∑_{m=1}^{f} ∑_{n=1}^{f} ( A^k_{s0·i+m, s0·j+n} )^p ]^{1/p}

wherein i, j are the pixel coordinates of the feature, k is the channel index of the feature map, f and s0 are respectively the convolution kernel size and the convolution stride parameters, and p is the pooling exponent (p = 1 gives average pooling, p → ∞ gives max pooling). In the FCN-VGG16 network, after the five pooling operations, the length and the width of the heat map are both reduced to 1/32 of those of the input monocular real-time image. However, in order to keep the output semantic segmentation label image the same size and dimension as the input image, a 32-fold deconvolution operation needs to be performed on the heat map; this inverse of pooling reconstructs the input resolution from the output and yields the FCN-32s-VGG16 network.
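For instance, the 32-fold upsampling of FCN-32s can be realized as a single transposed convolution; the kernel size, stride and padding below are one possible choice that maps an h×w heat map back to exactly 32h×32w (a sketch, not the patent's code).

```python
import torch.nn as nn

num_classes = 2
# (h - 1) * 32 - 2 * 16 + 64 = 32 * h, so the heat map returns to input resolution
upscore32 = nn.ConvTranspose2d(num_classes, num_classes,
                               kernel_size=64, stride=32, padding=16, bias=False)
```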
Further, the optimizing the FCN-VGG16 network includes: introducing a skip structure to perform upsampling operations on the pooling layers in the FCN-VGG16 network and fuse the results, so as to obtain the optimized FCN-VGG16 network.
Specifically, in the embodiment of the present invention, in the FCN-VGG16 (FCN-32s-VGG16) network, the image output by the 32-fold deconvolution operation is reconstructed only from the features of the Conv16 layer, so its accuracy is limited and fine details of the image cannot be restored well. Therefore, a skip structure is introduced for feature optimization, namely the results of different pooling layers are upsampled and fused. Referring to fig. 3, in the present embodiment an upsampling method based on bilinear interpolation is adopted: an input feature map of size N×N is first expanded by bilinear interpolation into a transition feature map of size (2N+1)×(2N+1), and the transition feature map is then convolved with a convolution kernel of size 2×2, finally yielding a new feature map of size 2N×2N.
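This two-fold upsampling step can be sketched in PyTorch as follows; the use of `F.interpolate` for the (2N+1)×(2N+1) transition map and a single 2×2 convolution per call are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def upsample_2x(feature_map, conv2x2):
    """Bilinear interpolation to (2N+1) x (2N+1), then a 2x2 convolution
    (stride 1, no padding) back down to 2N x 2N."""
    n = feature_map.shape[-1]                        # assumes an N x N input map
    transition = F.interpolate(feature_map, size=(2 * n + 1, 2 * n + 1),
                               mode="bilinear", align_corners=True)
    return conv2x2(transition)

channels = 2
conv2x2 = nn.Conv2d(channels, channels, kernel_size=2)
out = upsample_2x(torch.randn(1, channels, 8, 8), conv2x2)   # -> (1, 2, 16, 16)
```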
Referring to FIG. 4, the bilinear interpolation is calculated as follows. Let the interpolation point p0 have coordinates (x0, y0) and unknown pixel value T0. Around p0 there are 4 pixel points with coordinates P1, P2, P3, P4 and corresponding pixel values T1, T2, T3 and T4; taking P1 = (x1, y1), P2 = (x2, y1), P3 = (x1, y2) and P4 = (x2, y2), the correspondence is:

s = x2 - x1, t = y2 - y1

T0 = [ (x2 - x0)(y2 - y0)·T1 + (x0 - x1)(y2 - y0)·T2 + (x2 - x0)(y0 - y1)·T3 + (x0 - x1)(y0 - y1)·T4 ] / (s·t)
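The same computation written out directly; the corner labelling (P1 through P4 as above) follows the reconstruction given here and is an assumption, since the original figure is not reproduced.

```python
def bilinear_value(x0, y0, x1, y1, x2, y2, T1, T2, T3, T4):
    """Interpolate the value T0 at (x0, y0) from the four surrounding pixels,
    with P1=(x1,y1), P2=(x2,y1), P3=(x1,y2), P4=(x2,y2)."""
    s, t = x2 - x1, y2 - y1
    return ((x2 - x0) * (y2 - y0) * T1 +
            (x0 - x1) * (y2 - y0) * T2 +
            (x2 - x0) * (y0 - y1) * T3 +
            (x0 - x1) * (y0 - y1) * T4) / (s * t)

# The centre of a unit square averages the four corner values:
print(bilinear_value(0.5, 0.5, 0, 0, 1, 1, 10, 20, 30, 40))   # 25.0
```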
referring to fig. 5, based on the FCN-VGG16 network, bilinear interpolation is performed on the Conv16 layer to obtain a result of two-fold upsampling, the result is fused with the Pool4 layer characteristics to obtain a Fuse-Pool4 layer, and finally 16-fold deconvolution is performed on the Fuse-Pool4 layer to obtain the FCN-16s-VGG16 network. Similarly, on the basis of the FCN-16s-VGG16 network, performing bilinear interpolation on a Fuse-Pool4 layer, performing fusion feature on the Fuse-Pool4 layer and a Pool3 layer to obtain a Fuse-Pool3 layer, performing 8-time deconvolution on the feature map result of the Fuse-Pool3 layer, and outputting to obtain an FCN-8s-VGG16 network, namely the optimized FCN-VGG16 network.
Further, the performing, by using the optimized FCN-VGG16 network, two classification operations of a dynamic target and a background on the monocular real-time image to obtain the semantic label image includes: determining the category number of heatmaps of the optimized FCN-VGG16 network; determining the prediction probability of the monocular real-time image belonging to a target category; identifying a dynamic target of the monocular real-time image according to the category number of the heat map and the prediction probability; and performing semantic segmentation on the monocular real-time image according to the identified dynamic target to obtain the semantic label image.
Specifically, in the embodiment of the present invention, the category number of the heat map is determined according to the number of heat maps of the optimized FCN-VGG16 network, the prediction probability that the monocular real-time image belongs to the target category is determined, and the dynamic target of the monocular real-time image is identified according to the category number of the heat map and the prediction probability through the following loss function:

L = -∑_{c=1}^{M} y_ic · log(p_ic)

wherein M is the category number of the heat map, C is the target category of the monocular real-time image, y_ic is a virtual parameter, p_ic is the prediction probability that the monocular real-time image belongs to target category C, and L is the loss value; the dynamic target of the monocular real-time image is then identified according to the magnitude of the loss value. Semantic segmentation is performed on the monocular real-time image according to the identified dynamic target to obtain the semantic label image.
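The loss written above is the standard multi-class cross entropy; a small NumPy sketch of the computation (the one-hot encoding of y_ic is an interpretation of the "virtual parameter"):

```python
import numpy as np

def cross_entropy_loss(y, p, eps=1e-12):
    """L = -sum_c y_c * log(p_c): y is the one-hot vector over the M heat-map
    categories, p the predicted class probabilities for the image."""
    return -np.sum(y * np.log(p + eps))

# Two categories (dynamic target vs. background), true class = dynamic target:
y = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
print(cross_entropy_loss(y, p))   # ~0.105
```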
Further, the removing the dynamic feature points of the monocular real-time image according to the semantic segmentation result to obtain the static feature points of the monocular real-time image includes: carrying out image pyramid layering on the monocular real-time image by using a first scaling factor to obtain a plurality of layers of first scaled images; carrying out image pyramid layering on the semantic label image by using a second scaling factor to obtain a plurality of layers of second scaled images; sequentially removing the dynamic feature points in each first scaled image: carrying out ORB feature extraction on each first scaled image to obtain the ORB feature points of each first scaled image; outputting the image coordinate values of the pixel points of the dynamic region in each second scaled image to form a set U_i, wherein i is the layer number of the second scaled image and the dynamic region is the region where the dynamic target is located; matching the coordinate values of the ORB feature points of each first scaled image with the coordinate values in the corresponding set U_i; eliminating, from each first scaled image, the ORB feature points whose coordinate values match coordinate values in the set U_i; and forming a static feature point set M from the ORB feature points retained in each first scaled image. The estimating of the pose of the camera from the static feature points comprises: estimating the pose of the camera according to the static feature point set M.
Specifically, in the embodiment of the present invention, the full convolution neural network is used to obtain the two-class (dynamic target and background) semantic label image, and a coordinate mapping method based on an image pyramid is used to eliminate the feature points of the dynamic region, with the semantic label image used to delimit the dynamic region. The operations are specifically as follows:
Image pyramid layering is carried out on the monocular real-time image by using the first scaling factor to obtain a plurality of layers of first scaled images, and image pyramid layering is carried out on the semantic label image by using the second scaling factor to obtain a plurality of layers of second scaled images. Further, the first scaling factor is the same as the second scaling factor. Then, the dynamic feature points in each first scaled image are removed in sequence: the image coordinate values of the pixel points in the dynamic region of the second scaled image are output to form a set U_i, and ORB feature extraction is carried out on the first scaled image with the same layer number as the second scaled image to obtain the ORB feature points. All the extracted ORB feature points are traversed, and the coordinate values of the ORB feature points are matched with the coordinate values in the corresponding set U_i: for any ORB feature point P, if there exists a point P′ in the corresponding set U_i such that

P_x = P′_x and P_y = P′_y,

the ORB feature point P is considered a dynamic feature point and is eliminated. After the elimination of the dynamic feature points of a single layer is completed, the next layer is processed in the same way. The ORB feature points retained in each first scaled image then form the static feature point set M, and the pose of the camera is estimated according to the static feature point set M. A per-layer sketch of this elimination is given below.
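A minimal sketch of the per-layer coordinate-mapping elimination using OpenCV, assuming the image and the dynamic-region mask are layered with the same scaling factor; the layer count, scale factor and single-level ORB extraction are assumptions for this example, not the patent's implementation.

```python
import cv2

def remove_dynamic_points(gray_image, dynamic_mask, n_levels=8, scale_factor=1.2):
    """Build matching pyramids of the image and of the dynamic-region mask,
    extract ORB points on each layer, and drop any point whose pixel
    coordinates fall inside the dynamic region of its own layer."""
    orb = cv2.ORB_create(nfeatures=500, nlevels=1)      # extract one layer at a time
    static_points = []                                   # the retained set M
    for i in range(n_levels):
        f = 1.0 / (scale_factor ** i)
        level_img = cv2.resize(gray_image, None, fx=f, fy=f)
        level_mask = cv2.resize(dynamic_mask,
                                (level_img.shape[1], level_img.shape[0]),
                                interpolation=cv2.INTER_NEAREST)
        for kp in orb.detect(level_img, None):
            x, y = int(kp.pt[0]), int(kp.pt[1])
            if level_mask[y, x] == 0:                    # coordinates not in U_i
                static_points.append((i, kp))            # keep the layer index too
    return static_points
```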
Further, before the coordinate values of the ORB feature points are matched with the coordinate values in the set U_i, the method further comprises the following step: storing the ORB feature points.
Specifically, in the embodiment of the present invention, since a quadtree mechanism is used when the ORB feature points are extracted, the ORB feature points need to be stored and converted into single-layer image coordinate values before the coordinate matching is performed; otherwise, static feature points would be lost together with the eliminated dynamic feature points. A possible conversion is sketched below.
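The patent does not give this conversion explicitly. Assuming the ORB_SLAM2 convention that the extractor rescales keypoint coordinates to the base (layer-0) image after the quadtree distribution, a stored point could be mapped back to its own layer like this (a hypothetical helper, not ORB_SLAM2 code):

```python
def to_layer_coords(pt_layer0, layer, scale_factor=1.2):
    """Map a keypoint stored in base-image coordinates back to the coordinates
    of pyramid layer `layer` before comparing it with the per-layer set U_i."""
    f = 1.0 / (scale_factor ** layer)
    return pt_layer0[0] * f, pt_layer0[1] * f
```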
Another aspect of the present invention provides a visual SLAM system based on a full convolution neural network in a dynamic scene, the system being configured to estimate the pose of the camera by using the visual SLAM method based on the full convolution neural network in the dynamic scene described above. This system is hereinafter referred to as the ORB_SLAM2-FCN system.
By the method and the system, the dynamic target can be accurately identified and semantic segmentation is completed, the tracking accuracy and robustness of the camera are effectively improved, and the positioning and mapping accuracy of the visual SLAM in a dynamic scene is improved.
The preferred embodiments of the present invention have been described in detail with reference to the accompanying drawings, however, the present invention is not limited to the specific details of the above embodiments, and various simple modifications can be made to the technical solution of the present invention within the technical idea of the present invention, and these simple modifications are within the protective scope of the present invention.
It should be noted that the various technical features described in the above embodiments can be combined in any suitable manner without contradiction; in order to avoid unnecessary repetition, the possible combinations are not separately described.
In addition, any combination of the various embodiments of the present invention is also possible, and the same should be considered as the disclosure of the present invention as long as it does not depart from the spirit of the present invention.

Claims (10)

1. A visual SLAM method based on a full convolution neural network in a dynamic scene, the method comprising:
acquiring an image dataset;
constructing a full convolution neural network model according to the image data set;
performing semantic segmentation on a monocular real-time image currently acquired by a camera by using the full convolution neural network model to obtain a semantic label image;
removing the dynamic feature points of the monocular real-time image according to the semantic label image to obtain the static feature points of the monocular real-time image;
and estimating the pose of the camera according to the static feature points.
2. The method according to claim 1, wherein the performing semantic segmentation on the monocular real-time image currently acquired by the camera by using the full convolutional neural network model to obtain a semantic tag image comprises:
converting the fully connected layers of the VGG16 network into convolutional layers by using convolution kernels with the same size as the input data of the fully connected layers of the VGG16 network in the full convolution neural network model, so as to obtain an FCN-VGG16 network;
optimizing the FCN-VGG16 network;
and performing a binary classification of dynamic target and background on the monocular real-time image by using the optimized FCN-VGG16 network to obtain the semantic label image.
3. The method of claim 2, further comprising:
after each convolution calculation in the FCN-VGG16 network, performing sparsity processing on a previous convolution layer by using a linear rectification function as an excitation function;
pooling operations are performed for each pooling layer in the FCN-VGG16 network.
4. The method of claim 3, wherein the optimizing the FCN-VGG16 network comprises:
and introducing a skip structure to perform upsampling operations on the pooling layers in the FCN-VGG16 network and fuse the results, so as to obtain the optimized FCN-VGG16 network.
5. The method according to claim 4, wherein the performing dynamic target and background binary classification operations on the monocular real-time images by using the optimized FCN-VGG16 network to obtain the semantic label images comprises:
determining the category number of heatmaps of the optimized FCN-VGG16 network;
determining the prediction probability of the monocular real-time image belonging to a target category;
identifying a dynamic target of the monocular real-time image according to the category number of the heat map and the prediction probability;
and performing semantic segmentation on the monocular real-time image according to the identified dynamic target to obtain the semantic label image.
6. The method of claim 5, wherein the dynamic target of the monocular real-time image is identified according to the category number of the heat map and the prediction probability through the following loss function:

L = -∑_{c=1}^{M} y_ic · log(p_ic)

wherein M is the category number of the heat map, C is the target category of the monocular real-time image, y_ic is a virtual parameter, p_ic is the prediction probability that the monocular real-time image belongs to target category C, and L is the loss value;

and identifying the dynamic target of the monocular real-time image according to the loss value.
7. The method according to claim 6, wherein the removing the dynamic feature points of the monocular real-time image according to the semantic segmentation result to obtain the static feature points of the monocular real-time image comprises:
carrying out image pyramid layering on the monocular real-time image by using a first scaling factor to obtain a plurality of layers of first scaled images;
carrying out image pyramid layering on the semantic label image by using a second scaling factor to obtain a plurality of layers of second scaled images;
and sequentially removing the dynamic feature points in each first scaled image: carrying out ORB feature extraction on each first scaled image to obtain the ORB feature points of each first scaled image; outputting the image coordinate values of the pixel points of the dynamic region in each second scaled image to form a set U_i, wherein i is the layer number of the second scaled image, and the dynamic region is the region where the dynamic target is located; matching the coordinate values of the ORB feature points of each first scaled image with the coordinate values in the corresponding set U_i; and eliminating, from each first scaled image, the ORB feature points whose coordinate values match coordinate values in the set U_i;
forming a static feature point set M from the ORB feature points retained in each first scaled image;
the estimating of the pose of the camera from the static feature points comprises:
and estimating the pose of the camera according to the static feature point set M.
8. The method of claim 7, wherein the first scaling factor is the same as the second scaling factor.
9. The method of claim 8, wherein before the coordinate values of the ORB feature points are matched with the coordinate values in the set U_i, the method further comprises the following step:
storing the ORB feature points.
10. A full convolutional neural network based visual SLAM system in a dynamic scene, wherein the system is configured to adopt the full convolutional neural network based visual SLAM method in the dynamic scene of any one of claims 1 to 9 to estimate the pose of the camera.
CN202110295567.6A 2021-03-19 2021-03-19 Visual SLAM method and system based on full convolution neural network in dynamic scene Pending CN112802197A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110295567.6A CN112802197A (en) 2021-03-19 2021-03-19 Visual SLAM method and system based on full convolution neural network in dynamic scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110295567.6A CN112802197A (en) 2021-03-19 2021-03-19 Visual SLAM method and system based on full convolution neural network in dynamic scene

Publications (1)

Publication Number Publication Date
CN112802197A true CN112802197A (en) 2021-05-14

Family

ID=75817245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110295567.6A Pending CN112802197A (en) 2021-03-19 2021-03-19 Visual SLAM method and system based on full convolution neural network in dynamic scene

Country Status (1)

Country Link
CN (1) CN112802197A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913235A (en) * 2022-07-18 2022-08-16 合肥工业大学 Pose estimation method and device and intelligent robot
CN115100643A (en) * 2022-08-26 2022-09-23 潍坊现代农业与生态环境研究院 Monocular vision positioning enhancement method and equipment fusing three-dimensional scene semantics
CN115700781A (en) * 2022-11-08 2023-02-07 广东技术师范大学 Visual positioning method and system based on image inpainting in dynamic scene
CN115700781B (en) * 2022-11-08 2023-05-05 广东技术师范大学 Visual positioning method and system based on image complementary painting in dynamic scene
CN117809025A (en) * 2024-03-01 2024-04-02 深圳魔视智能科技有限公司 Attention network-based target tracking method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
JP7236545B2 (en) Video target tracking method and apparatus, computer apparatus, program
CN112802197A (en) Visual SLAM method and system based on full convolution neural network in dynamic scene
CN110135455B (en) Image matching method, device and computer readable storage medium
CN112270249A (en) Target pose estimation method fusing RGB-D visual features
CN112991413A (en) Self-supervision depth estimation method and system
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN111260688A (en) Twin double-path target tracking method
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
CN111696110B (en) Scene segmentation method and system
CN112991447A (en) Visual positioning and static map construction method and system in dynamic environment
CN110443883B (en) Plane three-dimensional reconstruction method for single color picture based on droplock
CN110766002B (en) Ship name character region detection method based on deep learning
JP6902811B2 (en) Parallax estimation systems and methods, electronic devices and computer readable storage media
CN111382647B (en) Picture processing method, device, equipment and storage medium
CN111768415A (en) Image instance segmentation method without quantization pooling
CN114663502A (en) Object posture estimation and image processing method and related equipment
CN113781659A (en) Three-dimensional reconstruction method and device, electronic equipment and readable storage medium
CN113052755A (en) High-resolution image intelligent matting method based on deep learning
CN115424017A (en) Building internal and external contour segmentation method, device and storage medium
CN115035172A (en) Depth estimation method and system based on confidence degree grading and inter-stage fusion enhancement
CN114359361A (en) Depth estimation method, depth estimation device, electronic equipment and computer-readable storage medium
CN113313162A (en) Method and system for detecting multi-scale feature fusion target
CN116958927A (en) Method and device for identifying short column based on BEV (binary image) graph
Kottler et al. 3GAN: A Three-GAN-based Approach for Image Inpainting Applied to the Reconstruction of Occluded Parts of Building Walls.
US20220198707A1 (en) Method and apparatus with object pose estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination