CN110084850B - Dynamic scene visual positioning method based on image semantic segmentation - Google Patents

Dynamic scene visual positioning method based on image semantic segmentation

Info

Publication number
CN110084850B
CN110084850B (application CN201910270280.0A)
Authority
CN
China
Prior art keywords
feature
image
size
dynamic
original image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910270280.0A
Other languages
Chinese (zh)
Other versions
CN110084850A (en)
Inventor
潘树国
盛超
曾攀
黄砺枭
赵涛
王帅
高旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201910270280.0A
Publication of CN110084850A
Application granted
Publication of CN110084850B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dynamic scene visual positioning method based on image semantic segmentation, belonging to the field of SLAM (Simultaneous Localization and Mapping). First, the dynamic objects in the original image are segmented with a supervised deep-learning approach to obtain a semantic image; on this basis, ORB feature points are extracted from the original image and the feature points of dynamic objects are removed according to the semantic image; finally, the camera motion is positioned and tracked with a point-feature-based monocular SLAM method using the remaining feature points. The positioning results show that, compared with the traditional method, the positioning accuracy of the disclosed method in dynamic scenes is improved by 13% to 30%.

Description

Dynamic scene visual positioning method based on image semantic segmentation
Technical Field
The invention relates to the application of deep learning in the field of visual SLAM (Simultaneous Localization and Mapping).
Background
Simultaneous localization and mapping (SLAM) is a key technology for the autonomous operation of a robot in an unknown environment. Based on the environmental data detected by the robot's external sensors, SLAM constructs a map of the robot's surroundings and simultaneously gives the robot's position within that map. Compared with ranging instruments such as radar and sonar, visual sensors are small, consume little power and acquire rich information, and can provide abundant texture information about the external environment. Therefore, visual SLAM has become a hotspot of current research and is applied in fields such as autonomous navigation and VR/AR.
Conventional point-feature-based visual SLAM algorithms rely on a static-environment assumption when recovering scene information and camera motion. Dynamic objects in the scene therefore degrade positioning accuracy. Currently, conventional point-feature-based visual SLAM algorithms handle simple dynamic scenes by detecting dynamic points and labeling them as outliers. ORB-SLAM reduces the impact of dynamic objects on positioning accuracy through RANSAC, the chi-square test, a keyframe method and a local map. Direct methods address the occlusion caused by dynamic objects by optimizing a cost function. In 2013, researchers proposed a new keyframe representation and updating method for adaptively modeling dynamic environments and effectively detecting and handling appearance or structure changes in them. In the same year, researchers introduced a multi-camera pose estimation and mapping method for handling dynamic scenes. However, the positioning accuracy and robustness of traditional SLAM methods in dynamic scenes still need improvement.
Disclosure of Invention
The technical problems to be solved by the invention are as follows:
In order to improve the positioning accuracy and robustness of traditional SLAM methods in dynamic scenes, a dynamic scene visual positioning method based on image semantic segmentation is provided, which can segment the dynamic objects in the scene and remove their feature points.
The invention adopts the following technical scheme for solving the technical problems:
The invention provides a dynamic scene visual positioning method based on image semantic segmentation, which comprises the following steps:
Step 1, acquiring an original image, constructing a convolutional neural network, and segmenting the dynamic objects in the original image with the convolutional neural network to obtain a semantic image;
Step 2, extracting ORB feature points from the original image;
Step 3, eliminating the dynamic object feature points among the ORB feature points obtained in step 2 according to the semantic image obtained in step 1, and keeping only the static object feature points;
Step 4, based on the static object feature points obtained in step 3, positioning and tracking the camera motion with a traditional point-feature-based SLAM method.
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 1, the step of constructing a convolutional neural network includes:
Step 1.1.1, downsampling the original image to 1/4 size and inputting it into PSPNet, obtaining feature maps of 1/8 and 1/16 size stage by stage, and finally outputting a feature map F1 of 1/32 size;
Step 1.1.2, downsampling the original image to 1/2 size and inputting it into PSPNet, obtaining feature maps of 1/4 and 1/8 size stage by stage, and finally outputting a feature map F2 of 1/16 size;
Step 1.1.3, inputting the feature maps F1 and F2, together with the truth label of 1/16 the original image size, into a first CFF unit for fusion, and outputting a feature map F₁ of 1/16 size and the loss term L₁ of the first branch;
Step 1.1.4, inputting the original image into PSPNet, obtaining feature maps of 1/2 and 1/4 size stage by stage, and finally outputting a feature map F3 of 1/8 size; fusing the feature map F₁ with the feature map F3 and the truth label of 1/8 the original image size in a second CFF unit, and outputting a feature map F₂ of 1/8 size and the loss term L₂ of the second branch;
Step 1.1.5, upsampling the feature map F₂ to obtain a feature map F₃ of 1/4 size, and processing the feature map F₃ with the truth label of 1/4 size to output the loss term L₃ of the third branch;
Step 1.1.6, superimposing the loss terms L₁, L₂ and L₃ to train the convolutional neural network.
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: the CFF unit in step 1.1.3 and step 1.1.4 comprises the following steps:
upsampling the smaller of the two input feature maps at a sampling rate of 2 and feeding it into a classification convolution layer and a dilated convolution layer respectively, wherein the kernel size of the classification convolution layer is 1×1 and the kernel size of the dilated convolution layer is 3×3×C₃ with a dilation rate of 2; feeding the larger of the two input feature maps into a projection convolution layer with a kernel size of 1×1×C₃; batch-normalizing the outputs of the dilated convolution layer and the projection convolution layer respectively, summing the two normalized results, inputting the sum into a ReLU function and outputting a feature map F_c; substituting the output of the classification convolution layer and the truth label into a Softmax function to obtain the loss term of the branch corresponding to the CFF unit.
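For illustration, the following is a minimal PyTorch sketch of a cascade feature fusion unit with this structure (upsampling, dilated and projection convolutions, batch normalization, summation, ReLU, and an auxiliary classification branch with softmax cross-entropy). It is an explanatory reconstruction rather than the patent's reference implementation; the channel counts, the class count of 19 and all identifier names are assumed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CFFUnit(nn.Module):
    """Cascade feature fusion: fuse a small (low-resolution) feature map with a large (2x) one."""

    def __init__(self, c_small, c_large, c_out, num_classes):
        super().__init__()
        # 3x3 dilated convolution (dilation rate 2) applied to the upsampled small map
        self.dilated = nn.Conv2d(c_small, c_out, kernel_size=3, padding=2, dilation=2, bias=False)
        self.bn_dilated = nn.BatchNorm2d(c_out)
        # 1x1 projection convolution applied to the large map
        self.proj = nn.Conv2d(c_large, c_out, kernel_size=1, bias=False)
        self.bn_proj = nn.BatchNorm2d(c_out)
        # 1x1 classification convolution producing the auxiliary (branch) loss
        self.classifier = nn.Conv2d(c_small, num_classes, kernel_size=1)

    def forward(self, f_small, f_large, label=None):
        # Upsample the smaller feature map at a sampling rate of 2 (to the size of f_large)
        f_up = F.interpolate(f_small, size=f_large.shape[2:], mode='bilinear', align_corners=False)
        # Batch-normalize the dilated and projection outputs, sum them, then apply ReLU
        fused = F.relu(self.bn_dilated(self.dilated(f_up)) + self.bn_proj(self.proj(f_large)))
        # Auxiliary branch loss: classification convolution + softmax cross-entropy against the label
        aux_loss = F.cross_entropy(self.classifier(f_up), label) if label is not None else None
        return fused, aux_loss

# Shape check with assumed sizes: f_small has half the resolution of f_large.
if __name__ == "__main__":
    cff = CFFUnit(c_small=256, c_large=128, c_out=128, num_classes=19)
    f_small = torch.randn(1, 256, 16, 32)
    f_large = torch.randn(1, 128, 32, 64)
    label = torch.randint(0, 19, (1, 32, 64))
    fused, loss = cff(f_small, f_large, label)
    print(fused.shape, loss.item())   # torch.Size([1, 128, 32, 64]) and a scalar loss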
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 1.1.6, the specific steps of using the loss terms L₁, L₂ and L₃ to train the convolutional neural network include:
summing the loss terms L₁, L₂ and L₃ to obtain the final loss term L_total:

$$L_{total}=\sum_{i}\omega_{i}\frac{1}{Y_{i}X_{i}}\sum_{y=1}^{Y_{i}}\sum_{x=1}^{X_{i}}-\log\frac{e^{F^{i}_{\hat{n},y,x}}}{\sum_{n=1}^{N}e^{F^{i}_{n,y,x}}}$$

where i is the branch index, ω_i is the loss weight of each branch, F^i is the feature map used to calculate the loss function in each branch, Y_i×X_i is the spatial size of F^i, N is the preset number of object classes to be segmented in the image, F^i_{n,y,x} is the value of F^i at position (n, y, x), and n̂_{y,x} is the corresponding truth label at (y, x).
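The following is a small worked sketch of this weighted superposition, mirroring the formula term by term (per-branch softmax cross-entropy averaged over the Y_i×X_i positions, weighted by ω_i). The branch weights, the class count of 19 and the feature-map sizes are assumed values, not values specified by the patent.

import torch
import torch.nn.functional as F

def cascade_loss(branch_logits, branch_labels, weights=(0.16, 0.4, 1.0)):
    """Weighted sum of the per-branch losses L1 + L2 + L3.

    branch_logits: list of tensors of shape (B, N, Y_i, X_i), one per branch.
    branch_labels: list of tensors of shape (B, Y_i, X_i) holding class indices.
    weights:       assumed loss weights omega_i of the branches.
    """
    total = 0.0
    for logits, labels, w in zip(branch_logits, branch_labels, weights):
        # F.cross_entropy computes -log softmax at the true class, averaged over the Y_i * X_i
        # positions, i.e. the inner double sum of the formula divided by Y_i * X_i.
        total = total + w * F.cross_entropy(logits, labels)
    return total

# Example with assumed sizes: branches at 1/16, 1/8 and 1/4 of a 512x1024 image, N = 19 classes.
if __name__ == "__main__":
    sizes = [(32, 64), (64, 128), (128, 256)]
    logits = [torch.randn(1, 19, h, w) for h, w in sizes]
    labels = [torch.randint(0, 19, (1, h, w)) for h, w in sizes]
    print(cascade_loss(logits, labels).item())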
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 1, segmenting the dynamic objects in the original image with the convolutional neural network to obtain a semantic image comprises the following steps:
Step 1.2.1, downsampling the original image to 1/4 size and inputting it into PSPNet, obtaining feature maps of 1/8 and 1/16 size stage by stage, and finally outputting a feature map F1 of 1/32 size;
Step 1.2.2, downsampling the original image to 1/2 size and inputting it into PSPNet, obtaining feature maps of 1/4 and 1/8 size stage by stage, and finally outputting a feature map F2 of 1/16 size;
Step 1.2.3, inputting the feature maps F1 and F2, together with the truth label of 1/16 the original image size, into the first CFF unit for fusion, and outputting a feature map F₁ of 1/16 size;
Step 1.2.4, inputting the original image into PSPNet, obtaining feature maps of 1/2 and 1/4 size stage by stage, and finally outputting a feature map F3 of 1/8 size; fusing the feature map F₁ with the feature map F3 in the second CFF unit, and outputting a feature map F₂ of 1/8 size;
Step 1.2.5, upsampling the feature map F₂ to obtain a feature map F₃ of 1/4 size; during testing, F₃ is upsampled again and a feature map of full size is output, which is the semantic segmentation map;
Step 1.2.6, binarizing the semantic segmentation map: marking dynamic objects in the semantic segmentation map with black pixels (0) and other objects with white pixels (1) to obtain a black-and-white semantic image i'_t containing only dynamic objects (a sketch of this binarization is given after these steps);
Step 1.2.7, performing the operations of steps 1.2.1 to 1.2.6 on the image sequence composed of the original images, finally obtaining a semantic image sequence I' = {i'_1, i'_2, i'_3, i'_4, ..., i'_t} containing only dynamic objects.
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 2, the specific step of extracting ORB feature points from the original image includes:
setting the number of features to be extracted according to the complexity of the scene, and using an ORB feature extractor to extract the feature points i_t(x, y) of the input image i_t, where (x, y) are the pixel coordinates of the feature point.
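As an illustration, the ORB feature points can be extracted with OpenCV as sketched below; the feature count of 2000 is an assumed setting chosen according to scene complexity, not a value specified by the patent.

import cv2

# Number of features to extract, set according to the complexity of the scene (assumed value).
orb = cv2.ORB_create(nfeatures=2000)

def extract_orb_keypoints(image_path: str):
    """Extract the ORB keypoints i_t(x, y) and their descriptors from one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    keypoints, descriptors = orb.detectAndCompute(img, None)
    return keypoints, descriptors   # keypoints[k].pt holds the (x, y) pixel coordinates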
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 3, the step of eliminating the dynamic object feature points among the ORB feature points obtained in step 2 according to the semantic image obtained in step 1 and keeping only the static object feature points includes:
for each feature point i_t(x, y) of the original image i_t, looking up the corresponding position i'_t(x, y) in its semantic image i'_t;
if i'_t(x, y) = 0, the point is a black pixel, i.e. it belongs to a dynamic object feature, and the elimination operation is performed;
if i'_t(x, y) = 1, the point is a white pixel, i.e. it belongs to a static object feature, and the retention operation is performed.
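A minimal sketch of this elimination step is given below, assuming that the semantic image i'_t has the same resolution as the original image (0 for dynamic pixels, 1 for static pixels) and that the keypoints come from an OpenCV-style ORB extractor; the function name is illustrative.

import numpy as np

def cull_dynamic_keypoints(keypoints, descriptors, mask: np.ndarray):
    """Keep only the keypoints that fall on static (white, value 1) pixels of the mask i'_t."""
    kept_kps, kept_desc = [], []
    for kp, desc in zip(keypoints, descriptors):
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if mask[y, x] == 1:            # white pixel: static object feature, keep it
            kept_kps.append(kp)
            kept_desc.append(desc)
        # black pixel (0): dynamic object feature, discard it
    return kept_kps, np.asarray(kept_desc)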
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 4, positioning and tracking the camera motion with a traditional point-feature-based SLAM method based on the static object feature points obtained in step 3, specifically:
for the image sequence I = {i_1, i_2, i_3, i_4, ..., i_t}, based on the ORB feature points remaining after the elimination in step 3, calculating and optimizing the camera pose with a traditional point-feature-based SLAM framework to complete the positioning and tracking of the camera.
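The patent delegates this step to an existing point-feature SLAM framework. Purely as an illustration of point-feature-based pose recovery, the sketch below estimates the relative camera pose between two consecutive frames from matched static keypoints using OpenCV's essential-matrix routines; it is a simplified frame-to-frame stand-in, not the full SLAM pipeline described here, and the camera intrinsic matrix is assumed to be known from calibration.

import cv2
import numpy as np

def relative_pose(pts_prev: np.ndarray, pts_curr: np.ndarray, K: np.ndarray):
    """Estimate the rotation R and (unit-scale) translation t between two frames.

    pts_prev, pts_curr: Nx2 float arrays of matched static keypoint coordinates.
    K: 3x3 camera intrinsic matrix.
    """
    E, inliers = cv2.findEssentialMat(pts_prev, pts_curr, K,
                                      method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_curr, K, mask=inliers)
    return R, t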
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. First, the dynamic objects in the original image are segmented with a supervised deep-learning approach to obtain a semantic image; on this basis, ORB feature points are extracted from the original image and the feature points of dynamic objects are removed according to the semantic image, which improves the positioning accuracy and robustness of the traditional SLAM method in dynamic scenes;
2. The positioning results of the method provided by the invention are superior to those of the traditional ORB-SLAM, with positioning accuracy improved by 13% to 30%.
Drawings
FIG. 1 is a flow chart of the method;
FIG. 2 is a diagram of a network architecture for semantic segmentation of images according to the present method;
FIG. 3 is a block diagram of a cascading feature fusion unit of the present method;
FIG. 4 is a flow chart of the dynamic object segmentation method;
FIG. 5 is a graph of the result of semantic segmentation of an image according to the method;
FIG. 6 is a graph of the result of eliminating feature points of dynamic objects in the method;
FIG. 7 is a plan view of the positioning track of the present method and the complete ORB-SLAM in four sequences;
FIG. 8 is a plan view of the positioning trajectories of the present method and incomplete ORB-SLAM in four sequences.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
With the development of deep learning, researchers have explored the semantic information of images to improve the performance of visual SLAM. Semantic segmentation, which divides the visual input into different semantically interpretable categories, is a fundamental task in computer vision. The invention provides a dynamic scene visual positioning method based on image semantic segmentation, which aims to improve the positioning accuracy of SLAM in dynamic scenes and to obtain rich semantic information of the scene on the basis of eliminating the feature points of dynamic objects.
The invention provides a dynamic scene visual positioning method based on image semantic segmentation; FIG. 1 is a flow chart of the method and FIG. 4 is a flow chart of the dynamic object segmentation. First, the dynamic objects in the original image are segmented with a supervised deep-learning approach to obtain a semantic image; on this basis, ORB feature points are extracted from the original image and the feature points of dynamic objects are removed according to the semantic image; finally, the camera motion is positioned and tracked with a point-feature-based monocular SLAM method using the remaining feature points.
Step 1, constructing a convolutional neural network to segment the dynamic objects in the original image and obtain a semantic image:
step 1.1, constructing a convolutional neural network for semantic segmentation
The structure of the constructed neural network is shown in FIG. 2. In the network architecture depicted in FIG. 2, there are three branches: top, middle and bottom; the numbers in brackets are the size ratios relative to the original input image; 'CFF' denotes the cascade feature fusion unit; the first three sub-networks of the top and middle branches share the same parameters.
The network architecture will now be described in further detail:
Cascading image input: at the top branch of the network depicted in FIG. 2, the original image is first downsampled to 1/4 size and input into PSPNet, which outputs a 1/32 size feature map; this is a rough segmentation result missing many details and boundaries. At the middle and bottom branches, the 1/2 size image and the original image are used to restore and refine the details of this rough result. Although the segmentation result of the top branch is rough, it contains rich semantics. The middle and bottom branch networks used for detail restoration and refinement can therefore be lightweight. The output feature maps of the different branches are fused by cascade feature fusion (CFF) units, and cascade labels are used to guide and enhance the learning process of the different branches.
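To make the branch wiring concrete, the following is a condensed PyTorch sketch of the cascade input scheme. The backbone is a small stand-in rather than a real PSPNet, the fusion step is a stripped-down version of the CFF unit sketched earlier (without batch normalization, the projection convolution or the auxiliary loss), and all sizes, channel counts and names are assumed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StubBackbone(nn.Module):
    """Stand-in for the PSPNet trunk: reduces the spatial resolution of its input by 8x."""
    def __init__(self, out_ch=128):
        super().__init__()
        layers, ch_in = [], 3
        for ch_out in (64, 96, out_ch):            # three stride-2 stages: 1/2 -> 1/4 -> 1/8
            layers += [nn.Conv2d(ch_in, ch_out, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
            ch_in = ch_out
        self.stages = nn.Sequential(*layers)

    def forward(self, x):
        return self.stages(x)

class CascadeBranches(nn.Module):
    """Top (1/4 input), middle (1/2 input) and bottom (full-resolution input) branches."""
    def __init__(self, ch=128):
        super().__init__()
        self.shared = StubBackbone(ch)             # shared by the top and middle branches
        self.bottom = StubBackbone(ch)             # lightweight full-resolution branch
        self.refine1 = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)   # condensed CFF fuse
        self.refine2 = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)

    def fuse(self, f_small, f_large, refine):
        # Upsample the coarse map to the size of the finer map, refine it and add the finer map.
        f_up = F.interpolate(f_small, size=f_large.shape[2:], mode='bilinear', align_corners=False)
        return F.relu(refine(f_up) + f_large)

    def forward(self, image):
        quarter = F.interpolate(image, scale_factor=0.25, mode='bilinear', align_corners=False)
        half = F.interpolate(image, scale_factor=0.5, mode='bilinear', align_corners=False)
        f1 = self.shared(quarter)                  # 1/32 of the original resolution (coarse semantics)
        f2 = self.shared(half)                     # 1/16 of the original resolution
        f3 = self.bottom(image)                    # 1/8 of the original resolution (fine detail)
        fused_16 = self.fuse(f1, f2, self.refine1)       # first CFF: output at 1/16
        fused_8 = self.fuse(fused_16, f3, self.refine2)  # second CFF: output at 1/8
        return fused_8

if __name__ == "__main__":
    out = CascadeBranches()(torch.randn(1, 3, 256, 512))
    print(out.shape)   # expected: torch.Size([1, 128, 32, 64]), i.e. 1/8 resolution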
Cascade feature fusion: FIG. 3 shows the specific structure of the cascade feature fusion unit, where F1 and F2 are feature maps output by different branches and the spatial size of F2 is twice that of F1. The cascade feature fusion unit is used to fuse the feature maps output by different branches; its inputs are the two feature maps F1 and F2 and a truth label, where the size of F1 is Y₁×X₁×C₁, the size of F2 is Y₂×X₂×C₂, and the size of the label is Y₁×X₁×1. The feature map F1 is first upsampled at a sampling rate of 2, outputting a feature map of the same size as F2; a dilated convolution layer with kernel size 3×3×C₃ and dilation rate 2 is then used to refine the upsampled feature map, so that the size of F1 becomes Y₂×X₂×C₃. The feature map F2 passes through a projection convolution layer with kernel size 1×1×C₃, outputting a feature map of size Y₂×X₂×C₃. The outputs for F1 and F2 are then batch-normalized, and the fused feature map F2' is finally output through a summation layer and a ReLU layer.
Cascade label guidance: in the network architecture depicted in FIG. 2, three truth labels of different sizes (1/16, 1/8 and 1/4 of the original image, respectively) are used to generate three independent loss terms at the top, middle and bottom branches of the network, and the three loss terms are summed to obtain the final loss term:

$$L_{total}=\sum_{t}\omega_{t}\frac{1}{Y_{t}X_{t}}\sum_{y=1}^{Y_{t}}\sum_{x=1}^{X_{t}}-\log\frac{e^{F^{t}_{\hat{n},y,x}}}{\sum_{n=1}^{N}e^{F^{t}_{n,y,x}}}$$

where ω_t is the loss weight of each branch, F_t is the feature map output by each branch, Y_t×X_t is the spatial size of F_t, N is the preset number of object classes to be segmented in the image, F^t_{n,y,x} is the value of F_t at position (n, y, x), and n̂_{y,x} is the corresponding truth label at (y, x).
Step 1.2, segmenting the dynamic objects in the original input image:
FIG. 4 shows the implementation of this step. For a given image sequence I = {i_1, i_2, i_3, i_4, ..., i_t}, where i_t is the image captured by the camera at time t:
(1) Input the image i_t into the semantic segmentation network constructed in step 1.1 and output a segmented color semantic image, in which objects such as cars, pedestrians, buildings and signs are marked with pixels of different colors;
(2) Binarize the semantic image from step (1): mark the dynamic objects (pedestrians and cars) in the image with black pixels (0) and other objects with white pixels (1) to obtain a black-and-white semantic image i'_t containing only dynamic objects;
(3) Repeat steps (1) and (2) for each image in the image sequence I;
finally, a semantic image sequence I' = {i'_1, i'_2, i'_3, i'_4, ..., i'_t} containing only dynamic objects is obtained.
Step 2, extracting ORB feature points from the original image, removing the feature points of dynamic objects according to the semantic image, and keeping only the feature points of static objects:
Step 2.1, extracting the ORB feature points in the original image:
setting the number of features to be extracted according to the complexity of the scene, and using an ORB feature extractor to extract the feature points i_t(x, y) of the input image i_t, where (x, y) are the pixel coordinates of the feature point.
Step 3, eliminating the feature points of dynamic objects according to the semantic image and keeping only the feature points of static objects:
(1) For each feature point i_t(x, y) of i_t, look up the corresponding position i'_t(x, y) in the semantic image i'_t;
(2) If i'_t(x, y) = 0, the point is a black pixel, i.e. it belongs to a dynamic object feature, and the elimination operation is performed;
(3) If i'_t(x, y) = 1, the point is a white pixel, i.e. it belongs to a static object feature, and the retention operation is performed.
Step 4, based on the ORB feature points remaining after elimination in step 3, positioning and tracking the camera with a traditional point-feature-based SLAM framework:
for the image sequence I = {i_1, i_2, i_3, i_4, ..., i_t}, based on the ORB feature points remaining after the elimination in step 3, the camera pose is calculated and optimized with a traditional point-feature-based SLAM framework to complete the positioning and tracking of the camera.
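Tying the steps together, the following sketch shows how each frame of the sequence would flow through segmentation, binarization, ORB extraction, culling and tracking. The helpers segment_classes and track_frame are hypothetical placeholders standing in for the segmentation network of step 1 and the underlying point-feature SLAM framework of step 4; neither name comes from the patent, and the dynamic class indices are assumed.

import cv2
import numpy as np

def process_sequence(image_paths, segment_classes, track_frame, dynamic_ids=(11, 12, 13)):
    """Per-frame flow: segment -> binarize -> extract ORB -> cull dynamic points -> track.

    segment_classes(img) is assumed to return a full-resolution per-pixel class-index map;
    track_frame(img, keypoints, descriptors) is assumed to hand the remaining static features
    to a point-feature SLAM framework and return the current camera pose.
    """
    orb = cv2.ORB_create(nfeatures=2000)                        # feature count chosen per scene
    poses = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        class_map = segment_classes(img)                        # step 1: semantic segmentation
        mask = ~np.isin(class_map, dynamic_ids)                 # step 1.2.6: False = dynamic pixel
        kps, desc = orb.detectAndCompute(img, None)             # step 2: ORB extraction
        if desc is None:                                        # no features found in this frame
            poses.append(None)
            continue
        kept = [(kp, d) for kp, d in zip(kps, desc)             # step 3: keep static points only
                if mask[int(round(kp.pt[1])), int(round(kp.pt[0]))]]
        kept_kps = [kp for kp, _ in kept]
        kept_desc = np.asarray([d for _, d in kept])
        poses.append(track_frame(img, kept_kps, kept_desc))     # step 4: point-feature SLAM tracking
    return poses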
Example 1
The invention has been evaluated using the Frankfurt monocular image sequence, which is part of the Cityscapes dataset. The entire Frankfurt sequence provides more than 100,000 frames of outdoor environment images and provides localization results that can be used as ground truth. The sequence was divided into several smaller sub-sequences of 1300 to 2500 frames containing dynamic objects such as moving cars or pedestrians. The experimental platform was configured as follows: Intel Xeon E5-2690 v4 CPU; 128 GB of RAM; NVIDIA Titan V GPU.
The sequences isolated from the original Frankfurt sequence were as follows:
Seq.01:frankfurt_000001_054140_leftImg8bit.png-frankfurt_000001_056555_leftImg8bit.png
Seq.02:frankfurt_000001_012745_leftImg8bit.png-frankfurt_000001_014100_leftImg8bit.png
Seq.03:frankfurt_000001_003311_leftImg8bit.png-frankfurt_000001_005555_leftImg8bit.png
Seq.04:frankfurt_000001_010580_leftImg8bit.png-frankfurt_000001_012739_leftImg8bit.png
FIG. 5 shows the results of semantic segmentation. The middle column shows that trees, buildings, roads, traffic signs and other objects in the scene are well segmented. The right column retains only the segmentation results of the dynamic objects (cars and pedestrians). Although the boundaries are not entirely accurate, the results are sufficient for culling feature points.
FIG. 6 shows the result of dynamic object feature point culling. The white car is a dynamic object traveling on the road. The two images in the left column show the result before culling, where a number of feature points belong to the moving car; the right column shows the result after culling, where the feature points on the car have been completely removed.
FIG. 7 shows plan views of the positioning tracks of the present method, built on the complete ORB-SLAM, and of the complete ORB-SLAM itself in the four video sequences Seq.01, Seq.02, Seq.03 and Seq.04. It can be seen from the four figures that the positioning track obtained by the method of the present invention (Ours) deviates less from the real track (Ground Truth) than the track calculated by the complete ORB-SLAM (ORB-SLAM Full). Because there are more dynamic vehicles and pedestrians in the Seq.01 sequence, the deviation of both methods from the ground truth is larger there, but the present method is still superior to the complete ORB-SLAM in positioning accuracy. Partial discontinuities may appear in the positioning track because the system performs position tracking based on keyframes.
The complete ORB-SLAM uses the chi-square test, which reduces the influence of dynamic feature points on positioning accuracy to a certain extent. FIG. 8 therefore shows plan views of the positioning tracks of the present method, built on the incomplete ORB-SLAM with the chi-square test removed, and of the incomplete ORB-SLAM itself in the four video sequences Seq.01, Seq.02, Seq.03 and Seq.04. It can be seen from the four figures that the positioning track obtained by the method of the present invention (Ours) deviates less from the real track (Ground Truth) than the track calculated by the incomplete ORB-SLAM. Because the many pedestrians in the scene produce a large number of dynamic feature points, the incomplete ORB-SLAM fails to localize in Seq.02, whereas the robustness of the method provided by the invention is better. Partial discontinuities may appear in the positioning track because the system performs position tracking based on keyframes.
Finally, the positioning results of the four image sequences under the complete ORB-SLAM, the incomplete ORB-SLAM and the present method are given. As can be seen from Tables 1 and 2, the positioning results of the method provided by the invention are superior to those of the conventional ORB-SLAM, and the positioning accuracy is improved by 13% to 30%.
Table 1: two methods locate result comparisons on the sequence of Seq01-Seq04 images
Figure BDA0002018152050000081
Table 2: two methods locate result comparisons on the sequence of Seq01-Seq04 images
Figure BDA0002018152050000082
The foregoing is only a partial embodiment of the present invention, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (6)

1. A dynamic scene visual positioning method based on image semantic segmentation, characterized by comprising the following steps:
Step 1, acquiring an original image, constructing a convolutional neural network, and segmenting the dynamic objects in the original image with the convolutional neural network to obtain a semantic image;
the step of constructing a convolutional neural network includes:
Step 1.1.1, downsampling the original image to 1/4 size and inputting it into PSPNet, obtaining feature maps of 1/8 and 1/16 size stage by stage, and finally outputting a feature map F1 of 1/32 size;
Step 1.1.2, downsampling the original image to 1/2 size and inputting it into PSPNet, obtaining feature maps of 1/4 and 1/8 size stage by stage, and finally outputting a feature map F2 of 1/16 size;
Step 1.1.3, inputting the feature maps F1 and F2, together with the truth label of 1/16 the original image size, into a first CFF unit for fusion, and outputting a feature map F₁ of 1/16 size and the loss term L₁ of the first branch;
Step 1.1.4, inputting the original image into PSPNet, obtaining feature maps of 1/2 and 1/4 size stage by stage, and finally outputting a feature map F3 of 1/8 size; fusing the feature map F₁ with the feature map F3 and the truth label of 1/8 the original image size in a second CFF unit, and outputting a feature map F₂ of 1/8 size and the loss term L₂ of the second branch;
Step 1.1.5, upsampling the feature map F₂ to obtain a feature map F₃ of 1/4 size, and processing the feature map F₃ with the truth label of 1/4 size to output the loss term L₃ of the third branch;
Step 1.1.6, superimposing the loss terms L₁, L₂ and L₃ to train the convolutional neural network;
in step 1, segmenting the dynamic objects in the original image with the convolutional neural network to obtain a semantic image comprises the following steps:
Step 1.2.1, downsampling the original image to 1/4 size and inputting it into PSPNet, obtaining feature maps of 1/8 and 1/16 size stage by stage, and finally outputting a feature map F1 of 1/32 size;
Step 1.2.2, downsampling the original image to 1/2 size and inputting it into PSPNet, obtaining feature maps of 1/4 and 1/8 size stage by stage, and finally outputting a feature map F2 of 1/16 size;
Step 1.2.3, inputting the feature maps F1 and F2, together with the truth label of 1/16 the original image size, into the first CFF unit for fusion, and outputting a feature map F₁ of 1/16 size;
Step 1.2.4, inputting the original image into PSPNet, obtaining feature maps of 1/2 and 1/4 size stage by stage, and finally outputting a feature map F3 of 1/8 size; fusing the feature map F₁ with the feature map F3 in the second CFF unit, and outputting a feature map F₂ of 1/8 size;
Step 1.2.5, upsampling the feature map F₂ to obtain a feature map F₃ of 1/4 size; during testing, F₃ is upsampled again and a feature map of full size is output, which is the semantic segmentation map;
Step 1.2.6, binarizing the semantic segmentation map: marking dynamic objects in the semantic segmentation map with black pixels (0) and other objects with white pixels (1) to obtain a black-and-white semantic image i'_t containing only dynamic objects;
Step 1.2.7, performing the operations of steps 1.2.1 to 1.2.6 on the image sequence composed of the original images, finally obtaining a semantic image sequence I' = {i'_1, i'_2, i'_3, i'_4, ..., i'_t} containing only dynamic objects;
Step 2, extracting ORB feature points from the original image;
Step 3, eliminating the dynamic object feature points among the ORB feature points obtained in step 2 according to the semantic image obtained in step 1, and keeping only the static object feature points;
Step 4, based on the static object feature points obtained in step 3, positioning and tracking the camera motion with a traditional point-feature-based SLAM method.
2. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that the CFF unit in step 1.1.3 and step 1.1.4 comprises the following steps:
upsampling the smaller of the two input feature maps at a sampling rate of 2 and feeding it into a classification convolution layer and a dilated convolution layer respectively, wherein the kernel size of the classification convolution layer is 1×1 and the kernel size of the dilated convolution layer is 3×3×C₃ with a dilation rate of 2; feeding the larger of the two input feature maps into a projection convolution layer with a kernel size of 1×1×C₃; batch-normalizing the outputs of the dilated convolution layer and the projection convolution layer respectively, summing the two normalized results, inputting the sum into a ReLU function and outputting a feature map F_c; substituting the output of the classification convolution layer and the truth label into a Softmax function to obtain the loss term of the branch corresponding to the CFF unit.
3. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that in step 1.1.6, the specific steps of using the loss terms L₁, L₂ and L₃ to train the convolutional neural network include:
summing the loss terms L₁, L₂ and L₃ to obtain the final loss term L_total:

$$L_{total}=\sum_{i}\omega_{i}\frac{1}{Y_{i}X_{i}}\sum_{y=1}^{Y_{i}}\sum_{x=1}^{X_{i}}-\log\frac{e^{F^{i}_{\hat{n},y,x}}}{\sum_{n=1}^{N}e^{F^{i}_{n,y,x}}}$$

where i is the branch index, ω_i is the loss weight of each branch, F^i is the feature map used to calculate the loss function in each branch, Y_i×X_i is the spatial size of F^i, N is the preset number of object classes to be segmented in the image, F^i_{n,y,x} is the value of F^i at position (n, y, x), and n̂_{y,x} is the corresponding truth label at (y, x).
4. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that in step 2, the specific step of extracting ORB feature points from the original image includes:
setting the number of features to be extracted according to the complexity of the scene, and using an ORB feature extractor to extract the feature points i_t(x, y) of the input image i_t, where (x, y) are the pixel coordinates of the feature point.
5. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that in step 3, the step of eliminating the dynamic object feature points among the ORB feature points obtained in step 2 according to the semantic image obtained in step 1 and keeping only the static object feature points includes:
for each feature point i_t(x, y) of the original image i_t, looking up the corresponding position i'_t(x, y) in its semantic image i'_t;
if i'_t(x, y) = 0, the point is a black pixel, i.e. it belongs to a dynamic object feature, and the elimination operation is performed;
if i'_t(x, y) = 1, the point is a white pixel, i.e. it belongs to a static object feature, and the retention operation is performed.
6. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that in step 4, positioning and tracking the camera motion with a traditional point-feature-based SLAM method based on the static object feature points obtained in step 3 is specifically:
for the image sequence I = {i_1, i_2, i_3, i_4, ..., i_t}, based on the ORB feature points remaining after the elimination in step 3, calculating and optimizing the camera pose with a traditional point-feature-based SLAM framework to complete the positioning and tracking of the camera.
CN201910270280.0A 2019-04-04 2019-04-04 Dynamic scene visual positioning method based on image semantic segmentation Active CN110084850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910270280.0A CN110084850B (en) 2019-04-04 2019-04-04 Dynamic scene visual positioning method based on image semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910270280.0A CN110084850B (en) 2019-04-04 2019-04-04 Dynamic scene visual positioning method based on image semantic segmentation

Publications (2)

Publication Number Publication Date
CN110084850A CN110084850A (en) 2019-08-02
CN110084850B true CN110084850B (en) 2023-05-23

Family

ID=67414356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910270280.0A Active CN110084850B (en) 2019-04-04 2019-04-04 Dynamic scene visual positioning method based on image semantic segmentation

Country Status (1)

Country Link
CN (1) CN110084850B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110706269B (en) * 2019-08-30 2021-03-19 武汉斌果科技有限公司 Binocular vision SLAM-based dynamic scene dense modeling method
CN110673607B (en) * 2019-09-25 2023-05-16 优地网络有限公司 Feature point extraction method and device under dynamic scene and terminal equipment
CN110610521B (en) * 2019-10-08 2021-02-26 云海桥(北京)科技有限公司 Positioning system and method adopting distance measurement mark and image recognition matching
CN110827305B (en) * 2019-10-30 2021-06-08 中山大学 Semantic segmentation and visual SLAM tight coupling method oriented to dynamic environment
CN111311708B (en) * 2020-01-20 2022-03-11 北京航空航天大学 Visual SLAM method based on semantic optical flow and inverse depth filtering
CN111340881B (en) * 2020-02-18 2023-05-19 东南大学 Direct method visual positioning method based on semantic segmentation in dynamic scene
CN111488882B (en) * 2020-04-10 2020-12-25 视研智能科技(广州)有限公司 High-precision image semantic segmentation method for industrial part measurement
CN111950561A (en) * 2020-08-25 2020-11-17 桂林电子科技大学 Semantic SLAM dynamic point removing method based on semantic segmentation
CN112163502B (en) * 2020-09-24 2022-07-12 电子科技大学 Visual positioning method under indoor dynamic scene
CN112734845B (en) * 2021-01-08 2022-07-08 浙江大学 Outdoor monocular synchronous mapping and positioning method fusing scene semantics
CN112766136B (en) * 2021-01-14 2024-03-19 华南理工大学 Space parking space detection method based on deep learning
CN112435278B (en) * 2021-01-26 2021-05-04 华东交通大学 Visual SLAM method and device based on dynamic target detection
CN112967317B (en) * 2021-03-09 2022-12-06 北京航空航天大学 Visual odometry method based on convolutional neural network architecture in dynamic environment
CN113673524A (en) * 2021-07-05 2021-11-19 北京物资学院 Method and device for removing dynamic characteristic points of warehouse semi-structured environment
CN113516664A (en) * 2021-09-02 2021-10-19 长春工业大学 Visual SLAM method based on semantic segmentation dynamic points

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021373B (en) * 2014-05-27 2017-02-15 江苏大学 Semi-supervised speech feature variable factor decomposition method
CN107169974A (en) * 2017-05-26 2017-09-15 中国科学技术大学 It is a kind of based on the image partition method for supervising full convolutional neural networks more
CN107833236B (en) * 2017-10-31 2020-06-26 中国科学院电子学研究所 Visual positioning system and method combining semantics under dynamic environment
CN109186586B (en) * 2018-08-23 2022-03-18 北京理工大学 Method for constructing simultaneous positioning and mixed map facing dynamic parking environment

Also Published As

Publication number Publication date
CN110084850A (en) 2019-08-02

Similar Documents

Publication Publication Date Title
CN110084850B (en) Dynamic scene visual positioning method based on image semantic segmentation
Sakaridis et al. Map-guided curriculum domain adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation
CN110622213B (en) System and method for depth localization and segmentation using 3D semantic maps
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN108830171B (en) Intelligent logistics warehouse guide line visual detection method based on deep learning
CN107239730B (en) Quaternion deep neural network model method for intelligent automobile traffic sign recognition
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
CN109341703A (en) A kind of complete period uses the vision SLAM algorithm of CNNs feature detection
Han et al. Aerial image change detection using dual regions of interest networks
Bescos et al. Empty cities: Image inpainting for a dynamic-object-invariant space
CN108921850B (en) Image local feature extraction method based on image segmentation technology
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN112766136A (en) Space parking space detection method based on deep learning
CN111767854B (en) SLAM loop detection method combined with scene text semantic information
Gao et al. Local feature performance evaluation for structure-from-motion and multi-view stereo using simulated city-scale aerial imagery
CN115100469A (en) Target attribute identification method, training method and device based on segmentation algorithm
Zhu et al. Fusing panoptic segmentation and geometry information for robust visual slam in dynamic environments
CN104463962A (en) Three-dimensional scene reconstruction method based on GPS information video
Gökçe et al. Recognition of dynamic objects from UGVs using Interconnected Neuralnetwork-based Computer Vision system
CN112597996A (en) Task-driven natural scene-based traffic sign significance detection method
Wang et al. 3D object detection algorithm for panoramic images with multi-scale convolutional neural network
CN109740405B (en) Method for detecting front window difference information of non-aligned similar vehicles
CN116385477A (en) Tower image registration method based on image segmentation
Yuan et al. Graph neural network based multi-feature fusion for building change detection
Liu et al. MFF-PR: Point Cloud and Image Multi-modal Feature Fusion for Place Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant