CN110084850B - Dynamic scene visual positioning method based on image semantic segmentation - Google Patents

Dynamic scene visual positioning method based on image semantic segmentation

Info

Publication number
CN110084850B
CN110084850B (application CN201910270280.0A)
Authority
CN
China
Prior art keywords
feature
image
size
dynamic
original image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910270280.0A
Other languages
Chinese (zh)
Other versions
CN110084850A (en)
Inventor
潘树国
盛超
曾攀
黄砺枭
赵涛
王帅
高旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201910270280.0A
Publication of CN110084850A
Application granted
Publication of CN110084850B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dynamic scene visual positioning method based on image semantic segmentation, belonging to the field of SLAM (Simultaneous Localization and Mapping). First, the dynamic objects in the original image are segmented with a supervised deep-learning approach to obtain a semantic image; on this basis, ORB feature points are extracted from the original image and the feature points of dynamic objects are removed according to the semantic image; finally, the camera motion is positioned and tracked with a point-feature-based monocular SLAM method using the remaining feature points. The positioning results show that, compared with the traditional method, the positioning accuracy of the disclosed method in dynamic scenes is improved by 13% to 30%.

Description

Dynamic scene visual positioning method based on image semantic segmentation
Technical Field
The invention relates to the application of deep learning in the field of visual SLAM (Simultaneous Localization and Mapping).
Background
Simultaneous localization and mapping (SLAM) is a key technology for the autonomous operation of a robot in an unknown environment. Based on the environmental data detected by the robot's external sensors, SLAM constructs a map of the robot's surroundings and simultaneously gives the robot's position within that map. Compared with ranging instruments such as radar and sonar, visual sensors are small, consume little power and acquire rich information, and can provide abundant texture information about the external environment. Therefore, visual SLAM has become a hotspot of current research and is applied in fields such as autonomous navigation and VR/AR.
Conventional point-feature-based visual SLAM algorithms rely on a static-environment assumption when recovering scene information and camera motion. Dynamic objects in the scene therefore degrade positioning accuracy. Currently, conventional point-feature-based visual SLAM algorithms handle simple dynamic scenes by detecting dynamic points and labeling them as outliers. ORB-SLAM reduces the impact of dynamic objects on positioning accuracy through RANSAC, the chi-square test, a keyframe method and a local map. Direct methods address the occlusion caused by dynamic objects by optimizing a cost function. In 2013, researchers proposed a new keyframe representation and updating method for adaptively modeling dynamic environments and effectively detecting and handling appearance or structure changes in them. In the same year, researchers introduced a multi-camera pose estimation and mapping method for handling dynamic scenes. However, the positioning accuracy and robustness of traditional SLAM methods in dynamic scenes still need improvement.
Disclosure of Invention
The technical problems to be solved by the invention are as follows:
In order to improve the positioning accuracy and robustness of traditional SLAM methods in dynamic scenes, a dynamic scene visual positioning method based on image semantic segmentation is provided, which can segment the dynamic objects in the scene and remove their feature points.
The invention adopts the following technical scheme for solving the technical problems:
The invention provides a dynamic scene visual positioning method based on image semantic segmentation, which comprises the following steps:
Step 1, acquiring an original image, constructing a convolutional neural network, and segmenting the dynamic objects in the original image with the convolutional neural network to obtain a semantic image;
Step 2, extracting ORB feature points from the original image;
Step 3, eliminating the dynamic object feature points among the ORB feature points obtained in step 2 according to the semantic image obtained in step 1, and keeping only the static object feature points;
Step 4, based on the static object feature points obtained in step 3, positioning and tracking the camera motion with a traditional point-feature-based SLAM method.
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 1, the step of constructing a convolutional neural network includes:
Step 1.1.1, downsampling the original image to 1/4 size and inputting it into PSPNet, obtaining feature maps of 1/8 and 1/16 size stage by stage, and finally outputting a feature map F1 of 1/32 size;
Step 1.1.2, downsampling the original image to 1/2 size and inputting it into PSPNet, obtaining feature maps of 1/4 and 1/8 size stage by stage, and finally outputting a feature map F2 of 1/16 size;
Step 1.1.3, inputting the feature maps F1 and F2, together with the truth label of 1/16 the original image size, into a first CFF unit for fusion, and outputting a feature map F₁ of 1/16 size and the loss term L₁ of the first branch;
Step 1.1.4, inputting the original image into PSPNet, obtaining feature maps of 1/2 and 1/4 size stage by stage, and finally outputting a feature map F3 of 1/8 size; fusing the feature map F₁ with the feature map F3 and the truth label of 1/8 the original image size in a second CFF unit, and outputting a feature map F₂ of 1/8 size and the loss term L₂ of the second branch;
Step 1.1.5, upsampling the feature map F₂ to obtain a feature map F₃ of 1/4 size, and processing the feature map F₃ with the truth label of 1/4 size to output the loss term L₃ of the third branch;
Step 1.1.6, superimposing the loss terms L₁, L₂ and L₃ to train the convolutional neural network.
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: the CFF unit in step 1.1.3 and step 1.1.4 comprises the following steps:
upsampling the smaller of the two input feature maps at a sampling rate of 2 and feeding it into a classification convolution layer and a dilated convolution layer respectively, wherein the kernel size of the classification convolution layer is 1×1 and the kernel size of the dilated convolution layer is 3×3×C₃ with a dilation rate of 2; feeding the larger of the two input feature maps into a projection convolution layer with a kernel size of 1×1×C₃; batch-normalizing the outputs of the dilated convolution layer and the projection convolution layer respectively, summing the two normalized results, inputting the sum into a ReLU function and outputting a feature map F_c; substituting the output of the classification convolution layer and the truth label into a Softmax function to obtain the loss term of the branch corresponding to the CFF unit.
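For illustration, the following is a minimal PyTorch sketch of a cascade feature fusion unit with this structure (upsampling, dilated and projection convolutions, batch normalization, summation, ReLU, and an auxiliary classification branch with softmax cross-entropy). It is an explanatory reconstruction rather than the patent's reference implementation; the channel counts, the class count of 19 and all identifier names are assumed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CFFUnit(nn.Module):
    """Cascade feature fusion: fuse a small (low-resolution) feature map with a large (2x) one."""

    def __init__(self, c_small, c_large, c_out, num_classes):
        super().__init__()
        # 3x3 dilated convolution (dilation rate 2) applied to the upsampled small map
        self.dilated = nn.Conv2d(c_small, c_out, kernel_size=3, padding=2, dilation=2, bias=False)
        self.bn_dilated = nn.BatchNorm2d(c_out)
        # 1x1 projection convolution applied to the large map
        self.proj = nn.Conv2d(c_large, c_out, kernel_size=1, bias=False)
        self.bn_proj = nn.BatchNorm2d(c_out)
        # 1x1 classification convolution producing the auxiliary (branch) loss
        self.classifier = nn.Conv2d(c_small, num_classes, kernel_size=1)

    def forward(self, f_small, f_large, label=None):
        # Upsample the smaller feature map at a sampling rate of 2 (to the size of f_large)
        f_up = F.interpolate(f_small, size=f_large.shape[2:], mode='bilinear', align_corners=False)
        # Batch-normalize the dilated and projection outputs, sum them, then apply ReLU
        fused = F.relu(self.bn_dilated(self.dilated(f_up)) + self.bn_proj(self.proj(f_large)))
        # Auxiliary branch loss: classification convolution + softmax cross-entropy against the label
        aux_loss = F.cross_entropy(self.classifier(f_up), label) if label is not None else None
        return fused, aux_loss

# Shape check with assumed sizes: f_small has half the resolution of f_large.
if __name__ == "__main__":
    cff = CFFUnit(c_small=256, c_large=128, c_out=128, num_classes=19)
    f_small = torch.randn(1, 256, 16, 32)
    f_large = torch.randn(1, 128, 32, 64)
    label = torch.randint(0, 19, (1, 32, 64))
    fused, loss = cff(f_small, f_large, label)
    print(fused.shape, loss.item())   # torch.Size([1, 128, 32, 64]) and a scalar loss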
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 1.1.6, the specific steps of using the loss terms L₁, L₂ and L₃ to train the convolutional neural network include:
summing the loss terms L₁, L₂ and L₃ to obtain the final loss term L_total:

$$L_{total}=\sum_{i}\omega_{i}\frac{1}{Y_{i}X_{i}}\sum_{y=1}^{Y_{i}}\sum_{x=1}^{X_{i}}-\log\frac{e^{F^{i}_{\hat{n},y,x}}}{\sum_{n=1}^{N}e^{F^{i}_{n,y,x}}}$$

where i is the branch index, ω_i is the loss weight of each branch, F^i is the feature map used to calculate the loss function in each branch, Y_i×X_i is the spatial size of F^i, N is the preset number of object classes to be segmented in the image, F^i_{n,y,x} is the value of F^i at position (n, y, x), and n̂_{y,x} is the corresponding truth label at (y, x).
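The following is a small worked sketch of this weighted superposition, mirroring the formula term by term (per-branch softmax cross-entropy averaged over the Y_i×X_i positions, weighted by ω_i). The branch weights, the class count of 19 and the feature-map sizes are assumed values, not values specified by the patent.

import torch
import torch.nn.functional as F

def cascade_loss(branch_logits, branch_labels, weights=(0.16, 0.4, 1.0)):
    """Weighted sum of the per-branch losses L1 + L2 + L3.

    branch_logits: list of tensors of shape (B, N, Y_i, X_i), one per branch.
    branch_labels: list of tensors of shape (B, Y_i, X_i) holding class indices.
    weights:       assumed loss weights omega_i of the branches.
    """
    total = 0.0
    for logits, labels, w in zip(branch_logits, branch_labels, weights):
        # F.cross_entropy computes -log softmax at the true class, averaged over the Y_i * X_i
        # positions, i.e. the inner double sum of the formula divided by Y_i * X_i.
        total = total + w * F.cross_entropy(logits, labels)
    return total

# Example with assumed sizes: branches at 1/16, 1/8 and 1/4 of a 512x1024 image, N = 19 classes.
if __name__ == "__main__":
    sizes = [(32, 64), (64, 128), (128, 256)]
    logits = [torch.randn(1, 19, h, w) for h, w in sizes]
    labels = [torch.randint(0, 19, (1, h, w)) for h, w in sizes]
    print(cascade_loss(logits, labels).item())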
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 1, segmenting the dynamic objects in the original image with the convolutional neural network to obtain a semantic image comprises the following steps:
Step 1.2.1, downsampling the original image to 1/4 size and inputting it into PSPNet, obtaining feature maps of 1/8 and 1/16 size stage by stage, and finally outputting a feature map F1 of 1/32 size;
Step 1.2.2, downsampling the original image to 1/2 size and inputting it into PSPNet, obtaining feature maps of 1/4 and 1/8 size stage by stage, and finally outputting a feature map F2 of 1/16 size;
Step 1.2.3, inputting the feature maps F1 and F2, together with the truth label of 1/16 the original image size, into the first CFF unit for fusion, and outputting a feature map F₁ of 1/16 size;
Step 1.2.4, inputting the original image into PSPNet, obtaining feature maps of 1/2 and 1/4 size stage by stage, and finally outputting a feature map F3 of 1/8 size; fusing the feature map F₁ with the feature map F3 in the second CFF unit, and outputting a feature map F₂ of 1/8 size;
Step 1.2.5, upsampling the feature map F₂ to obtain a feature map F₃ of 1/4 size; during testing, F₃ is upsampled again and a feature map of full size is output, which is the semantic segmentation map;
Step 1.2.6, binarizing the semantic segmentation map: marking dynamic objects in the semantic segmentation map with black pixels (0) and other objects with white pixels (1) to obtain a black-and-white semantic image i'_t containing only dynamic objects (a sketch of this binarization is given after these steps);
Step 1.2.7, performing the operations of steps 1.2.1 to 1.2.6 on the image sequence composed of the original images, finally obtaining a semantic image sequence I' = {i'_1, i'_2, i'_3, i'_4, ..., i'_t} containing only dynamic objects.
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 2, the specific step of extracting ORB feature points from the original image includes:
setting the number of features to be extracted according to the complexity of the scene, and using an ORB feature extractor to extract the feature points i_t(x, y) of the input image i_t, where (x, y) are the pixel coordinates of the feature point.
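As an illustration, the ORB feature points can be extracted with OpenCV as sketched below; the feature count of 2000 is an assumed setting chosen according to scene complexity, not a value specified by the patent.

import cv2

# Number of features to extract, set according to the complexity of the scene (assumed value).
orb = cv2.ORB_create(nfeatures=2000)

def extract_orb_keypoints(image_path: str):
    """Extract the ORB keypoints i_t(x, y) and their descriptors from one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    keypoints, descriptors = orb.detectAndCompute(img, None)
    return keypoints, descriptors   # keypoints[k].pt holds the (x, y) pixel coordinates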
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 3, the step of eliminating the dynamic object feature points among the ORB feature points obtained in step 2 according to the semantic image obtained in step 1 and keeping only the static object feature points includes:
for each feature point i_t(x, y) of the original image i_t, looking up the corresponding position i'_t(x, y) in its semantic image i'_t;
if i'_t(x, y) = 0, the point is a black pixel, i.e. it belongs to a dynamic object feature, and the elimination operation is performed;
if i'_t(x, y) = 1, the point is a white pixel, i.e. it belongs to a static object feature, and the retention operation is performed.
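A minimal sketch of this elimination step is given below, assuming that the semantic image i'_t has the same resolution as the original image (0 for dynamic pixels, 1 for static pixels) and that the keypoints come from an OpenCV-style ORB extractor; the function name is illustrative.

import numpy as np

def cull_dynamic_keypoints(keypoints, descriptors, mask: np.ndarray):
    """Keep only the keypoints that fall on static (white, value 1) pixels of the mask i'_t."""
    kept_kps, kept_desc = [], []
    for kp, desc in zip(keypoints, descriptors):
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        if mask[y, x] == 1:            # white pixel: static object feature, keep it
            kept_kps.append(kp)
            kept_desc.append(desc)
        # black pixel (0): dynamic object feature, discard it
    return kept_kps, np.asarray(kept_desc)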
The dynamic scene visual positioning method based on image semantic segmentation further comprises the following steps: in step 4, positioning and tracking the camera motion with a traditional point-feature-based SLAM method based on the static object feature points obtained in step 3, specifically:
for the image sequence I = {i_1, i_2, i_3, i_4, ..., i_t}, based on the ORB feature points remaining after the elimination in step 3, calculating and optimizing the camera pose with a traditional point-feature-based SLAM framework to complete the positioning and tracking of the camera.
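The patent delegates this step to an existing point-feature SLAM framework. Purely as an illustration of point-feature-based pose recovery, the sketch below estimates the relative camera pose between two consecutive frames from matched static keypoints using OpenCV's essential-matrix routines; it is a simplified frame-to-frame stand-in, not the full SLAM pipeline described here, and the camera intrinsic matrix is assumed to be known from calibration.

import cv2
import numpy as np

def relative_pose(pts_prev: np.ndarray, pts_curr: np.ndarray, K: np.ndarray):
    """Estimate the rotation R and (unit-scale) translation t between two frames.

    pts_prev, pts_curr: Nx2 float arrays of matched static keypoint coordinates.
    K: 3x3 camera intrinsic matrix.
    """
    E, inliers = cv2.findEssentialMat(pts_prev, pts_curr, K,
                                      method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_curr, K, mask=inliers)
    return R, t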
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. First, the dynamic objects in the original image are segmented with a supervised deep-learning approach to obtain a semantic image; on this basis, ORB feature points are extracted from the original image and the feature points of dynamic objects are removed according to the semantic image, which improves the positioning accuracy and robustness of the traditional SLAM method in dynamic scenes;
2. The positioning results of the method provided by the invention are superior to those of the traditional ORB-SLAM, with positioning accuracy improved by 13% to 30%.
Drawings
FIG. 1 is a flow chart of the method;
FIG. 2 is a diagram of a network architecture for semantic segmentation of images according to the present method;
FIG. 3 is a block diagram of a cascading feature fusion unit of the present method;
FIG. 4 is a flow chart of the dynamic object segmentation method;
FIG. 5 is a graph of the result of semantic segmentation of an image according to the method;
FIG. 6 is a graph of the result of eliminating feature points of dynamic objects in the method;
FIG. 7 is a plan view of the positioning track of the present method and the complete ORB-SLAM in four sequences;
FIG. 8 is a plan view of the positioning trajectories of the present method and incomplete ORB-SLAM in four sequences.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
With the development of deep learning, researchers have explored the semantic information of images to improve the performance of visual SLAM. Semantic segmentation, which divides the visual input into different semantically interpretable categories, is a fundamental task in computer vision. The invention provides a dynamic scene visual positioning method based on image semantic segmentation, which aims to improve the positioning accuracy of SLAM in dynamic scenes and to obtain rich semantic information of the scene on the basis of eliminating the feature points of dynamic objects.
The invention provides a dynamic scene visual positioning method based on image semantic segmentation; FIG. 1 is a flow chart of the method and FIG. 4 is a flow chart of the dynamic object segmentation. First, the dynamic objects in the original image are segmented with a supervised deep-learning approach to obtain a semantic image; on this basis, ORB feature points are extracted from the original image and the feature points of dynamic objects are removed according to the semantic image; finally, the camera motion is positioned and tracked with a point-feature-based monocular SLAM method using the remaining feature points.
Step 1, constructing a convolutional neural network to segment the dynamic objects in the original image and obtain a semantic image:
step 1.1, constructing a convolutional neural network for semantic segmentation
The structure of the constructed neural network is shown in FIG. 2. In the network architecture depicted in FIG. 2, there are three branches: top, middle and bottom; the numbers in brackets are the size ratios relative to the original input image; 'CFF' denotes the cascade feature fusion unit; the first three sub-networks of the top and middle branches share the same parameters.
The network architecture will now be described in further detail:
Cascading image input: at the top branch of the network depicted in FIG. 2, the original image is first downsampled to 1/4 size and input into PSPNet, which outputs a 1/32 size feature map; this is a rough segmentation result missing many details and boundaries. At the middle and bottom branches, the 1/2 size image and the original image are used to restore and refine the details of this rough result. Although the segmentation result of the top branch is rough, it contains rich semantics. The middle and bottom branch networks used for detail restoration and refinement can therefore be lightweight. The output feature maps of the different branches are fused by cascade feature fusion (CFF) units, and cascade labels are used to guide and enhance the learning process of the different branches.
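To make the branch wiring concrete, the following is a condensed PyTorch sketch of the cascade input scheme. The backbone is a small stand-in rather than a real PSPNet, the fusion step is a stripped-down version of the CFF unit sketched earlier (without batch normalization, the projection convolution or the auxiliary loss), and all sizes, channel counts and names are assumed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StubBackbone(nn.Module):
    """Stand-in for the PSPNet trunk: reduces the spatial resolution of its input by 8x."""
    def __init__(self, out_ch=128):
        super().__init__()
        layers, ch_in = [], 3
        for ch_out in (64, 96, out_ch):            # three stride-2 stages: 1/2 -> 1/4 -> 1/8
            layers += [nn.Conv2d(ch_in, ch_out, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
            ch_in = ch_out
        self.stages = nn.Sequential(*layers)

    def forward(self, x):
        return self.stages(x)

class CascadeBranches(nn.Module):
    """Top (1/4 input), middle (1/2 input) and bottom (full-resolution input) branches."""
    def __init__(self, ch=128):
        super().__init__()
        self.shared = StubBackbone(ch)             # shared by the top and middle branches
        self.bottom = StubBackbone(ch)             # lightweight full-resolution branch
        self.refine1 = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)   # condensed CFF fuse
        self.refine2 = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)

    def fuse(self, f_small, f_large, refine):
        # Upsample the coarse map to the size of the finer map, refine it and add the finer map.
        f_up = F.interpolate(f_small, size=f_large.shape[2:], mode='bilinear', align_corners=False)
        return F.relu(refine(f_up) + f_large)

    def forward(self, image):
        quarter = F.interpolate(image, scale_factor=0.25, mode='bilinear', align_corners=False)
        half = F.interpolate(image, scale_factor=0.5, mode='bilinear', align_corners=False)
        f1 = self.shared(quarter)                  # 1/32 of the original resolution (coarse semantics)
        f2 = self.shared(half)                     # 1/16 of the original resolution
        f3 = self.bottom(image)                    # 1/8 of the original resolution (fine detail)
        fused_16 = self.fuse(f1, f2, self.refine1)       # first CFF: output at 1/16
        fused_8 = self.fuse(fused_16, f3, self.refine2)  # second CFF: output at 1/8
        return fused_8

if __name__ == "__main__":
    out = CascadeBranches()(torch.randn(1, 3, 256, 512))
    print(out.shape)   # expected: torch.Size([1, 128, 32, 64]), i.e. 1/8 resolution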
Cascade feature fusion: FIG. 3 shows the specific structure of the cascade feature fusion unit, where F1 and F2 are feature maps output by different branches and the spatial size of F2 is twice that of F1. The cascade feature fusion unit is used to fuse the feature maps output by different branches; its inputs are the two feature maps F1 and F2 and a truth label, where the size of F1 is Y₁×X₁×C₁, the size of F2 is Y₂×X₂×C₂, and the size of the label is Y₁×X₁×1. The feature map F1 is first upsampled at a sampling rate of 2, outputting a feature map of the same size as F2; a dilated convolution layer with kernel size 3×3×C₃ and dilation rate 2 is then used to refine the upsampled feature map, so that the size of F1 becomes Y₂×X₂×C₃. The feature map F2 passes through a projection convolution layer with kernel size 1×1×C₃, outputting a feature map of size Y₂×X₂×C₃. The outputs for F1 and F2 are then batch-normalized, and the fused feature map F2' is finally output through a summation layer and a ReLU layer.
Cascade label guidance: in the network architecture depicted in FIG. 2, three truth labels of different sizes (1/16, 1/8 and 1/4 of the original image, respectively) are used to generate three independent loss terms at the top, middle and bottom branches of the network, and the three loss terms are summed to obtain the final loss term:

$$L_{total}=\sum_{t}\omega_{t}\frac{1}{Y_{t}X_{t}}\sum_{y=1}^{Y_{t}}\sum_{x=1}^{X_{t}}-\log\frac{e^{F^{t}_{\hat{n},y,x}}}{\sum_{n=1}^{N}e^{F^{t}_{n,y,x}}}$$

where ω_t is the loss weight of each branch, F_t is the feature map output by each branch, Y_t×X_t is the spatial size of F_t, N is the preset number of object classes to be segmented in the image, F^t_{n,y,x} is the value of F_t at position (n, y, x), and n̂_{y,x} is the corresponding truth label at (y, x).
Step 1.2, segmenting the dynamic objects in the original input image:
FIG. 4 shows the implementation of this step. For a given image sequence I = {i_1, i_2, i_3, i_4, ..., i_t}, where i_t is the image captured by the camera at time t:
(1) Input the image i_t into the semantic segmentation network constructed in step 1.1 and output a segmented color semantic image, in which objects such as cars, pedestrians, buildings and signs are marked with pixels of different colors;
(2) Binarize the semantic image from step (1): mark the dynamic objects (pedestrians and cars) in the image with black pixels (0) and other objects with white pixels (1) to obtain a black-and-white semantic image i'_t containing only dynamic objects;
(3) Repeat steps (1) and (2) for each image in the image sequence I;
finally, a semantic image sequence I' = {i'_1, i'_2, i'_3, i'_4, ..., i'_t} containing only dynamic objects is obtained.
Step 2, extracting ORB feature points from the original image, removing the feature points of dynamic objects according to the semantic image, and keeping only the feature points of static objects:
Step 2.1, extracting the ORB feature points in the original image:
setting the number of features to be extracted according to the complexity of the scene, and using an ORB feature extractor to extract the feature points i_t(x, y) of the input image i_t, where (x, y) are the pixel coordinates of the feature point.
Step 3, eliminating the feature points of dynamic objects according to the semantic image and keeping only the feature points of static objects:
(1) For each feature point i_t(x, y) of i_t, look up the corresponding position i'_t(x, y) in the semantic image i'_t;
(2) If i'_t(x, y) = 0, the point is a black pixel, i.e. it belongs to a dynamic object feature, and the elimination operation is performed;
(3) If i'_t(x, y) = 1, the point is a white pixel, i.e. it belongs to a static object feature, and the retention operation is performed.
Step 4, based on the ORB feature points remaining after elimination in step 3, positioning and tracking the camera with a traditional point-feature-based SLAM framework:
for the image sequence I = {i_1, i_2, i_3, i_4, ..., i_t}, based on the ORB feature points remaining after the elimination in step 3, the camera pose is calculated and optimized with a traditional point-feature-based SLAM framework to complete the positioning and tracking of the camera.
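Tying the steps together, the following sketch shows how each frame of the sequence would flow through segmentation, binarization, ORB extraction, culling and tracking. The helpers segment_classes and track_frame are hypothetical placeholders standing in for the segmentation network of step 1 and the underlying point-feature SLAM framework of step 4; neither name comes from the patent, and the dynamic class indices are assumed.

import cv2
import numpy as np

def process_sequence(image_paths, segment_classes, track_frame, dynamic_ids=(11, 12, 13)):
    """Per-frame flow: segment -> binarize -> extract ORB -> cull dynamic points -> track.

    segment_classes(img) is assumed to return a full-resolution per-pixel class-index map;
    track_frame(img, keypoints, descriptors) is assumed to hand the remaining static features
    to a point-feature SLAM framework and return the current camera pose.
    """
    orb = cv2.ORB_create(nfeatures=2000)                        # feature count chosen per scene
    poses = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        class_map = segment_classes(img)                        # step 1: semantic segmentation
        mask = ~np.isin(class_map, dynamic_ids)                 # step 1.2.6: False = dynamic pixel
        kps, desc = orb.detectAndCompute(img, None)             # step 2: ORB extraction
        if desc is None:                                        # no features found in this frame
            poses.append(None)
            continue
        kept = [(kp, d) for kp, d in zip(kps, desc)             # step 3: keep static points only
                if mask[int(round(kp.pt[1])), int(round(kp.pt[0]))]]
        kept_kps = [kp for kp, _ in kept]
        kept_desc = np.asarray([d for _, d in kept])
        poses.append(track_frame(img, kept_kps, kept_desc))     # step 4: point-feature SLAM tracking
    return poses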
Example 1
The invention has been evaluated using the Frankfurt monocular image sequence, which is part of the Cityscapes dataset. The entire Frankfurt sequence provides more than 100,000 frames of outdoor environment images and provides localization results that can be used as ground truth. The sequence was divided into several smaller sub-sequences of 1300 to 2500 frames containing dynamic objects such as moving cars or pedestrians. The experimental platform was configured as follows: Intel Xeon E5-2690 v4 CPU; 128 GB of RAM; NVIDIA Titan V GPU.
The sequences isolated from the original Frankfurt sequence were as follows:
Seq.01:frankfurt_000001_054140_leftImg8bit.png-frankfurt_000001_056555_leftImg8bit.png
Seq.02:frankfurt_000001_012745_leftImg8bit.png-frankfurt_000001_014100_leftImg8bit.png
Seq.03:frankfurt_000001_003311_leftImg8bit.png-frankfurt_000001_005555_leftImg8bit.png
Seq.04:frankfurt_000001_010580_leftImg8bit.png-frankfurt_000001_012739_leftImg8bit.png
FIG. 5 shows the results of semantic segmentation. The middle column shows that trees, buildings, roads, traffic signs and other objects in the scene are well segmented. The right column retains only the segmentation results of the dynamic objects (cars and pedestrians). Although the boundaries are not entirely accurate, the results are sufficient for culling feature points.
FIG. 6 shows the result of dynamic object feature point culling. The white car is a dynamic object traveling on the road. The two images in the left column show the result before culling, where a number of feature points belong to the moving car; the right column shows the result after culling, where the feature points on the car have been completely removed.
FIG. 7 shows plan views of the positioning tracks of the present method, built on the complete ORB-SLAM, and of the complete ORB-SLAM itself in the four video sequences Seq.01, Seq.02, Seq.03 and Seq.04. It can be seen from the four figures that the positioning track obtained by the method of the present invention (Ours) deviates less from the real track (Ground Truth) than the track calculated by the complete ORB-SLAM (ORB-SLAM Full). Because there are more dynamic vehicles and pedestrians in the Seq.01 sequence, the deviation of both methods from the ground truth is larger there, but the present method is still superior to the complete ORB-SLAM in positioning accuracy. Partial discontinuities may appear in the positioning track because the system performs position tracking based on keyframes.
The complete ORB-SLAM uses the chi-square test, which reduces the influence of dynamic feature points on positioning accuracy to a certain extent. FIG. 8 therefore shows plan views of the positioning tracks of the present method, built on the incomplete ORB-SLAM with the chi-square test removed, and of the incomplete ORB-SLAM itself in the four video sequences Seq.01, Seq.02, Seq.03 and Seq.04. It can be seen from the four figures that the positioning track obtained by the method of the present invention (Ours) deviates less from the real track (Ground Truth) than the track calculated by the incomplete ORB-SLAM. Because the many pedestrians in the scene produce a large number of dynamic feature points, the incomplete ORB-SLAM fails to localize in Seq.02, whereas the robustness of the method provided by the invention is better. Partial discontinuities may appear in the positioning track because the system performs position tracking based on keyframes.
Finally, the positioning results of the four image sequences under the complete ORB-SLAM, the incomplete ORB-SLAM and the present method are given. As can be seen from Tables 1 and 2, the positioning results of the method provided by the invention are superior to those of the conventional ORB-SLAM, and the positioning accuracy is improved by 13% to 30%.
Table 1: two methods locate result comparisons on the sequence of Seq01-Seq04 images
Figure BDA0002018152050000081
Table 2: two methods locate result comparisons on the sequence of Seq01-Seq04 images
Figure BDA0002018152050000082
The foregoing is only a partial embodiment of the present invention, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (6)

1. A dynamic scene visual positioning method based on image semantic segmentation, characterized by comprising the following steps:
Step 1, acquiring an original image, constructing a convolutional neural network, and segmenting the dynamic objects in the original image with the convolutional neural network to obtain a semantic image;
the step of constructing a convolutional neural network includes:
Step 1.1.1, downsampling the original image to 1/4 size and inputting it into PSPNet, obtaining feature maps of 1/8 and 1/16 size stage by stage, and finally outputting a feature map F1 of 1/32 size;
Step 1.1.2, downsampling the original image to 1/2 size and inputting it into PSPNet, obtaining feature maps of 1/4 and 1/8 size stage by stage, and finally outputting a feature map F2 of 1/16 size;
Step 1.1.3, inputting the feature maps F1 and F2, together with the truth label of 1/16 the original image size, into a first CFF unit for fusion, and outputting a feature map F₁ of 1/16 size and the loss term L₁ of the first branch;
Step 1.1.4, inputting the original image into PSPNet, obtaining feature maps of 1/2 and 1/4 size stage by stage, and finally outputting a feature map F3 of 1/8 size; fusing the feature map F₁ with the feature map F3 and the truth label of 1/8 the original image size in a second CFF unit, and outputting a feature map F₂ of 1/8 size and the loss term L₂ of the second branch;
Step 1.1.5, upsampling the feature map F₂ to obtain a feature map F₃ of 1/4 size, and processing the feature map F₃ with the truth label of 1/4 size to output the loss term L₃ of the third branch;
Step 1.1.6, superimposing the loss terms L₁, L₂ and L₃ to train the convolutional neural network;
in step 1, segmenting the dynamic objects in the original image with the convolutional neural network to obtain a semantic image comprises the following steps:
Step 1.2.1, downsampling the original image to 1/4 size and inputting it into PSPNet, obtaining feature maps of 1/8 and 1/16 size stage by stage, and finally outputting a feature map F1 of 1/32 size;
Step 1.2.2, downsampling the original image to 1/2 size and inputting it into PSPNet, obtaining feature maps of 1/4 and 1/8 size stage by stage, and finally outputting a feature map F2 of 1/16 size;
Step 1.2.3, inputting the feature maps F1 and F2, together with the truth label of 1/16 the original image size, into the first CFF unit for fusion, and outputting a feature map F₁ of 1/16 size;
Step 1.2.4, inputting the original image into PSPNet, obtaining feature maps of 1/2 and 1/4 size stage by stage, and finally outputting a feature map F3 of 1/8 size; fusing the feature map F₁ with the feature map F3 in the second CFF unit, and outputting a feature map F₂ of 1/8 size;
Step 1.2.5, upsampling the feature map F₂ to obtain a feature map F₃ of 1/4 size; during testing, F₃ is upsampled again and a feature map of full size is output, which is the semantic segmentation map;
Step 1.2.6, binarizing the semantic segmentation map: marking dynamic objects in the semantic segmentation map with black pixels (0) and other objects with white pixels (1) to obtain a black-and-white semantic image i'_t containing only dynamic objects;
Step 1.2.7, performing the operations of steps 1.2.1 to 1.2.6 on the image sequence composed of the original images, finally obtaining a semantic image sequence I' = {i'_1, i'_2, i'_3, i'_4, ..., i'_t} containing only dynamic objects;
Step 2, extracting ORB feature points from the original image;
Step 3, eliminating the dynamic object feature points among the ORB feature points obtained in step 2 according to the semantic image obtained in step 1, and keeping only the static object feature points;
Step 4, based on the static object feature points obtained in step 3, positioning and tracking the camera motion with a traditional point-feature-based SLAM method.
2. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that the CFF unit in step 1.1.3 and step 1.1.4 comprises the following steps:
upsampling the smaller of the two input feature maps at a sampling rate of 2 and feeding it into a classification convolution layer and a dilated convolution layer respectively, wherein the kernel size of the classification convolution layer is 1×1 and the kernel size of the dilated convolution layer is 3×3×C₃ with a dilation rate of 2; feeding the larger of the two input feature maps into a projection convolution layer with a kernel size of 1×1×C₃; batch-normalizing the outputs of the dilated convolution layer and the projection convolution layer respectively, summing the two normalized results, inputting the sum into a ReLU function and outputting a feature map F_c; substituting the output of the classification convolution layer and the truth label into a Softmax function to obtain the loss term of the branch corresponding to the CFF unit.
3. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that in step 1.1.6, the specific steps of using the loss terms L₁, L₂ and L₃ to train the convolutional neural network include:
summing the loss terms L₁, L₂ and L₃ to obtain the final loss term L_total:

$$L_{total}=\sum_{i}\omega_{i}\frac{1}{Y_{i}X_{i}}\sum_{y=1}^{Y_{i}}\sum_{x=1}^{X_{i}}-\log\frac{e^{F^{i}_{\hat{n},y,x}}}{\sum_{n=1}^{N}e^{F^{i}_{n,y,x}}}$$

where i is the branch index, ω_i is the loss weight of each branch, F^i is the feature map used to calculate the loss function in each branch, Y_i×X_i is the spatial size of F^i, N is the preset number of object classes to be segmented in the image, F^i_{n,y,x} is the value of F^i at position (n, y, x), and n̂_{y,x} is the corresponding truth label at (y, x).
4. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that in step 2, the specific step of extracting ORB feature points from the original image includes:
setting the number of features to be extracted according to the complexity of the scene, and using an ORB feature extractor to extract the feature points i_t(x, y) of the input image i_t, where (x, y) are the pixel coordinates of the feature point.
5. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that in step 3, the step of eliminating the dynamic object feature points among the ORB feature points obtained in step 2 according to the semantic image obtained in step 1 and keeping only the static object feature points includes:
for each feature point i_t(x, y) of the original image i_t, looking up the corresponding position i'_t(x, y) in its semantic image i'_t;
if i'_t(x, y) = 0, the point is a black pixel, i.e. it belongs to a dynamic object feature, and the elimination operation is performed;
if i'_t(x, y) = 1, the point is a white pixel, i.e. it belongs to a static object feature, and the retention operation is performed.
6. The dynamic scene visual positioning method based on image semantic segmentation according to claim 1, characterized in that in step 4, positioning and tracking the camera motion with a traditional point-feature-based SLAM method based on the static object feature points obtained in step 3 is specifically:
for the image sequence I = {i_1, i_2, i_3, i_4, ..., i_t}, based on the ORB feature points remaining after the elimination in step 3, calculating and optimizing the camera pose with a traditional point-feature-based SLAM framework to complete the positioning and tracking of the camera.
CN201910270280.0A 2019-04-04 2019-04-04 Dynamic scene visual positioning method based on image semantic segmentation Active CN110084850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910270280.0A CN110084850B (en) 2019-04-04 2019-04-04 Dynamic scene visual positioning method based on image semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910270280.0A CN110084850B (en) 2019-04-04 2019-04-04 Dynamic scene visual positioning method based on image semantic segmentation

Publications (2)

Publication Number Publication Date
CN110084850A CN110084850A (en) 2019-08-02
CN110084850B true CN110084850B (en) 2023-05-23

Family

ID=67414356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910270280.0A Active CN110084850B (en) 2019-04-04 2019-04-04 Dynamic scene visual positioning method based on image semantic segmentation

Country Status (1)

Country Link
CN (1) CN110084850B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110706269B (en) * 2019-08-30 2021-03-19 武汉斌果科技有限公司 Binocular vision SLAM-based dynamic scene dense modeling method
CN110673607B (en) * 2019-09-25 2023-05-16 优地网络有限公司 Feature point extraction method and device under dynamic scene and terminal equipment
CN110610521B (en) * 2019-10-08 2021-02-26 云海桥(北京)科技有限公司 Positioning system and method adopting distance measurement mark and image recognition matching
CN110827305B (en) * 2019-10-30 2021-06-08 中山大学 Semantic segmentation and visual SLAM tight coupling method oriented to dynamic environment
CN111311708B (en) * 2020-01-20 2022-03-11 北京航空航天大学 Visual SLAM method based on semantic optical flow and inverse depth filtering
CN111340881B (en) * 2020-02-18 2023-05-19 东南大学 Direct method visual positioning method based on semantic segmentation in dynamic scene
CN111488882B (en) * 2020-04-10 2020-12-25 视研智能科技(广州)有限公司 High-precision image semantic segmentation method for industrial part measurement
CN111950561A (en) * 2020-08-25 2020-11-17 桂林电子科技大学 Semantic SLAM dynamic point removing method based on semantic segmentation
CN112163502B (en) * 2020-09-24 2022-07-12 电子科技大学 Visual positioning method under indoor dynamic scene
CN112734845B (en) * 2021-01-08 2022-07-08 浙江大学 Outdoor monocular synchronous mapping and positioning method fusing scene semantics
CN112766136B (en) * 2021-01-14 2024-03-19 华南理工大学 Space parking space detection method based on deep learning
CN112435278B (en) * 2021-01-26 2021-05-04 华东交通大学 Visual SLAM method and device based on dynamic target detection
CN112967317B (en) * 2021-03-09 2022-12-06 北京航空航天大学 Visual odometry method based on convolutional neural network architecture in dynamic environment
CN113673524A (en) * 2021-07-05 2021-11-19 北京物资学院 Method and device for removing dynamic characteristic points of warehouse semi-structured environment
CN113516664A (en) * 2021-09-02 2021-10-19 长春工业大学 Visual SLAM method based on semantic segmentation dynamic points

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021373B (en) * 2014-05-27 2017-02-15 江苏大学 Semi-supervised speech feature variable factor decomposition method
CN107169974A (en) * 2017-05-26 2017-09-15 中国科学技术大学 It is a kind of based on the image partition method for supervising full convolutional neural networks more
CN107833236B (en) * 2017-10-31 2020-06-26 中国科学院电子学研究所 Visual positioning system and method combining semantics under dynamic environment
CN109186586B (en) * 2018-08-23 2022-03-18 北京理工大学 Method for constructing simultaneous positioning and mixed map facing dynamic parking environment

Also Published As

Publication number Publication date
CN110084850A (en) 2019-08-02

Similar Documents

Publication Publication Date Title
CN110084850B (en) Dynamic scene visual positioning method based on image semantic segmentation
Sakaridis et al. Map-guided curriculum domain adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation
CN110622213B (en) System and method for depth localization and segmentation using 3D semantic maps
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN108830171B (en) Intelligent logistics warehouse guide line visual detection method based on deep learning
CN107239730B (en) Quaternion deep neural network model method for intelligent automobile traffic sign recognition
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
CN109341703A (en) A kind of complete period uses the vision SLAM algorithm of CNNs feature detection
Han et al. Aerial image change detection using dual regions of interest networks
Bescos et al. Empty cities: Image inpainting for a dynamic-object-invariant space
CN108921850B (en) Image local feature extraction method based on image segmentation technology
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN112766136A (en) Space parking space detection method based on deep learning
CN111767854B (en) SLAM loop detection method combined with scene text semantic information
Gao et al. Local feature performance evaluation for structure-from-motion and multi-view stereo using simulated city-scale aerial imagery
CN115100469A (en) Target attribute identification method, training method and device based on segmentation algorithm
Zhu et al. Fusing panoptic segmentation and geometry information for robust visual slam in dynamic environments
CN104463962A (en) Three-dimensional scene reconstruction method based on GPS information video
Gökçe et al. Recognition of dynamic objects from UGVs using Interconnected Neuralnetwork-based Computer Vision system
CN112597996A (en) Task-driven natural scene-based traffic sign significance detection method
Wang et al. 3D object detection algorithm for panoramic images with multi-scale convolutional neural network
CN109740405B (en) Method for detecting front window difference information of non-aligned similar vehicles
CN116385477A (en) Tower image registration method based on image segmentation
Yuan et al. Graph neural network based multi-feature fusion for building change detection
Liu et al. MFF-PR: Point Cloud and Image Multi-modal Feature Fusion for Place Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant