CN112802197A - Visual SLAM method and system based on full convolution neural network in dynamic scene - Google Patents

Visual SLAM method and system based on full convolution neural network in dynamic scene Download PDF

Info

Publication number
CN112802197A
Authority
CN
China
Prior art keywords
image
dynamic
fcn
time image
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110295567.6A
Other languages
Chinese (zh)
Inventor
吕艳
柳双磊
倪益华
倪忠进
宋源普
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang A&F University ZAFU
Original Assignee
Zhejiang A&F University ZAFU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang A&F University ZAFU filed Critical Zhejiang A&F University ZAFU
Priority to CN202110295567.6A
Publication of CN112802197A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05Geographic models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4007Interpolation-based scaling, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Abstract

The invention provides a visual SLAM method and system based on a full convolution neural network in a dynamic scene, wherein the method comprises the following steps: acquiring an image dataset; constructing a full convolution neural network model according to the image data set; performing semantic segmentation on a monocular real-time image currently acquired by a camera by using the full convolution neural network model to obtain a semantic label image; removing the dynamic feature points of the monocular real-time image according to the semantic label image to obtain the static feature points of the monocular real-time image; and estimating the pose of the camera according to the static feature points. The method provided by the invention can accurately identify the dynamic target and complete semantic segmentation, effectively improve the accuracy and robustness of camera tracking, and improve the positioning and mapping precision of the visual SLAM in the dynamic scene.

Description

Visual SLAM method and system based on full convolution neural network in dynamic scene
Technical Field
The invention relates to the technical field of computer vision, in particular to a visual SLAM method based on a full convolution neural network in a dynamic scene and a visual SLAM system based on the full convolution neural network in the dynamic scene.
Background
Simultaneous Localization And Mapping (SLAM) refers to the process in which a robot, in an unknown environment, estimates its own pose and constructs a map of the environment from its on-board sensors; it is a prerequisite for many robot applications such as path planning, collision-free navigation and environment perception. Visual SLAM refers to using visual information to estimate the pose of the camera itself and to build a three-dimensional map of the environment.
In the prior art, the relative displacement between two adjacent frames of the input image can be estimated from the matching of feature points between the two frames, and the actual displacement of the camera is then calculated. However, a slowly moving dynamic object in the scene causes deviations in the camera pose calculation, so that the positioning of the whole visual SLAM system drifts.
Disclosure of Invention
Aiming at the technical problem in the prior art that a slowly moving dynamic object causes deviations in the camera pose calculation, so that the positioning of the whole visual SLAM system drifts, the invention provides a visual SLAM method based on a full convolution neural network in a dynamic scene and a visual SLAM system based on the full convolution neural network in the dynamic scene.
In order to achieve the above object, one aspect of the present invention provides a visual SLAM method based on a full convolution neural network in a dynamic scene, including the following steps: acquiring an image dataset; constructing a full convolution neural network model according to the image data set; performing semantic segmentation on a monocular real-time image currently acquired by a camera by using the full convolution neural network model to obtain a semantic label image; removing the dynamic feature points of the monocular real-time image according to the semantic label image to obtain the static feature points of the monocular real-time image; and estimating the pose of the camera according to the static feature points.
Further, the performing semantic segmentation on the monocular real-time image currently acquired by the camera by using the full convolution neural network model to obtain a semantic label image includes: converting the fully connected layers of the VGG16 network into convolutional layers by using convolution kernels with the same size as the input data of the fully connected layers of the VGG16 network in the full convolution neural network model, so as to obtain an FCN-VGG16 network; optimizing the FCN-VGG16 network; and performing a binary classification of dynamic target and background on the monocular real-time image by using the optimized FCN-VGG16 network to obtain the semantic label image.
Further, the method further comprises: after each convolution calculation in the FCN-VGG16 network, performing sparsity processing on a previous convolution layer by using a linear rectification function as an excitation function; pooling operations are performed for each pooling layer in the FCN-VGG16 network.
Further, the optimizing the FCN-VGG16 network includes: introducing a skip structure to perform upsampling operations on the pooling layers in the FCN-VGG16 network and fuse the results, so as to obtain the optimized FCN-VGG16 network.
Further, the performing, by using the optimized FCN-VGG16 network, two classification operations of a dynamic target and a background on the monocular real-time image to obtain the semantic label image includes: determining the category number of heatmaps of the optimized FCN-VGG16 network; determining the prediction probability of the monocular real-time image belonging to a target category; identifying a dynamic target of the monocular real-time image according to the category number of the heat map and the prediction probability; and performing semantic segmentation on the monocular real-time image according to the identified dynamic target to obtain the semantic label image.
Further, the dynamic target of the monocular real-time image is identified according to the category number of the heat map and the prediction probability through the following loss function:

L = -∑_{c=1}^{M} y_ic · log(p_ic)

wherein M is the category number of the heat map, C is the target category of the monocular real-time image, y_ic is a virtual parameter, p_ic is the prediction probability that the monocular real-time image belongs to target category C, and L is the loss value; the dynamic target of the monocular real-time image is then identified according to the loss value.
Further, the removing the dynamic feature points of the monocular real-time image according to the semantic segmentation result to obtain the static feature points of the monocular real-time image includes: carrying out image pyramid layering on the monocular real-time image by using a first scaling factor to obtain a plurality of layers of first scaled images; carrying out image pyramid layering on the semantic label image by using a second scaling factor to obtain a plurality of layers of second scaled images; sequentially removing the dynamic feature points in each first scaled image: carrying out ORB feature extraction on each first scaled image to obtain the ORB feature points of each first scaled image; outputting the image coordinate values of the pixel points of the dynamic region in each second scaled image to form a set U_i, wherein i is the layer number of the second scaled image and the dynamic region is the region where the dynamic target is located; matching the coordinate values of the ORB feature points of each first scaled image with the coordinate values in the corresponding set U_i; eliminating, from each first scaled image, the ORB feature points whose coordinate values match coordinate values in the set U_i; and forming a static feature point set M from the ORB feature points retained in each first scaled image. The estimating of the pose of the camera from the static feature points comprises: estimating the pose of the camera according to the static feature point set M.
Further, the first scaling factor is the same as the second scaling factor.
Further, before the coordinate values of the ORB feature points are matched with the coordinate values in the set U_i, the method further comprises the following step: storing the ORB feature points.
Another aspect of the present invention provides a full convolutional neural network based visual SLAM system in a dynamic scene, the system being configured to estimate the pose of the camera by using the full convolutional neural network based visual SLAM method in the dynamic scene described above.
Through the technical scheme provided by the invention, the invention at least has the following technical effects:
the vision SLAM method based on the full convolution neural network in the dynamic scene is characterized in that an ORB _ SLAM2 open source vision SLAM system is taken as a frame, a full convolution neural network model is constructed, a dynamic target in a monocular real-time image currently acquired by a camera is identified by using the full convolution neural network model, semantic segmentation is carried out on the monocular real-time image, a dynamic eye in the image is marked, finally, a dynamic feature point of the monocular real-time image is removed by using coordinate point mapping, a static feature point in the monocular real-time image is reserved, and the pose of the camera is estimated according to the static feature point. The method provided by the invention can accurately identify the dynamic target and complete semantic segmentation, effectively improve the accuracy and robustness of camera tracking, and improve the positioning and mapping precision of the visual SLAM in the dynamic scene.
Additional features and advantages of the invention will be set forth in the detailed description which follows.
Drawings
Fig. 1 is a flowchart of a visual SLAM method based on a full convolutional neural network in a dynamic scene according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an FCN-VGG16 network in a full convolutional neural network-based visual SLAM method in a dynamic scenario according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an upsampling operation in a visual SLAM method based on a full convolutional neural network in a dynamic scene according to an embodiment of the present invention;
fig. 4 is a schematic diagram of bilinear interpolation in a visual SLAM method based on a full convolution neural network in a dynamic scene according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating optimization of the FCN-VGG16 network in the visual SLAM method based on the full convolutional neural network in the dynamic scenario provided in the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
In the present invention, unless specified to the contrary, orientation terms such as "upper, lower, top, bottom" and the like are generally used with reference to the orientation shown in the drawings or to the positional relationship of the components in the vertical or gravitational direction.
The present invention will be described in detail below by way of embodiments with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present invention provides a visual SLAM method based on a full convolution neural network in a dynamic scene, including the following steps: S101: acquiring an image dataset; S102: constructing a full convolution neural network model according to the image data set; S103: performing semantic segmentation on a monocular real-time image currently acquired by a camera by using the full convolution neural network model to obtain a semantic label image; S104: removing the dynamic feature points of the monocular real-time image according to the semantic label image to obtain the static feature points of the monocular real-time image; S105: and estimating the pose of the camera according to the static feature points.
Specifically, in the embodiment of the present invention, the ORB_SLAM2 open source visual SLAM system is used as a framework, a convolutional neural network is improved, a full convolutional neural network is constructed, model training is performed on the constructed full convolutional neural network through an image data set, and a trained full convolutional neural network model is output. The method comprises the steps of identifying a dynamic target in a monocular real-time image currently acquired by a camera by using a full convolution neural network model, performing semantic segmentation on the monocular real-time image currently acquired by the camera to obtain a semantic label image, removing dynamic feature points of the monocular real-time image according to the semantic label image to obtain static feature points of the monocular real-time image, and estimating the pose of the camera by using the static feature points.
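Purely as an illustration of steps S101–S105, the following Python sketch shows how a single frame could be processed, assuming a trained PyTorch segmentation model whose output resolution matches the input and an OpenCV ORB detector; `fcn_model`, `pose_estimator` and the glue code are hypothetical placeholders for the patent's modules, not ORB_SLAM2 code.

```python
import cv2
import torch

def run_visual_slam_frame(frame_bgr, fcn_model, orb, pose_estimator, device="cpu"):
    # S103: semantic segmentation of the current monocular frame
    # (two classes: dynamic target vs. background)
    tensor = torch.from_numpy(frame_bgr).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        logits = fcn_model(tensor.to(device))            # shape (1, 2, H, W)
    dynamic_mask = logits.argmax(dim=1).squeeze(0).cpu().numpy().astype("uint8")

    # S104: extract ORB features and keep only those outside the dynamic region
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    keypoints = orb.detect(gray, None)
    static_points = [kp for kp in keypoints
                     if dynamic_mask[int(kp.pt[1]), int(kp.pt[0])] == 0]

    # S105: estimate the camera pose from the remaining static feature points
    return pose_estimator(static_points)
```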
According to the method provided by the invention, the dynamic target can be accurately identified and the semantic segmentation is completed, the tracking accuracy and robustness of the camera are effectively improved, and the positioning and mapping accuracy of the visual SLAM in the dynamic scene is improved.
Further, the performing semantic segmentation on the monocular real-time image currently acquired by the camera by using the full convolution neural network model to obtain a semantic label image includes: converting the fully connected layers of the VGG16 network into convolutional layers by using convolution kernels with the same size as the input data of the fully connected layers of the VGG16 network in the full convolution neural network model, so as to obtain an FCN-VGG16 network; optimizing the FCN-VGG16 network; and performing a binary classification of dynamic target and background on the monocular real-time image by using the optimized FCN-VGG16 network to obtain the semantic label image.
Specifically, in the embodiment of the invention, the VGG16 network is composed of 13 convolutional layers, 5 pooling layers and 3 fully connected layers. The fully connected layers of the VGG16 network are converted into convolutional layers by using convolution kernels in the full convolution neural network model whose size equals the size of the input data of the corresponding fully connected layer, so as to obtain the FCN-VGG16 network; fig. 2 is a schematic diagram of the FCN-VGG16 network. In fig. 2, the last convolutional layer Conv16 outputs heat maps, and the number of heat maps corresponds to the number of categories. The FCN-VGG16 network is then optimized, and the optimized FCN-VGG16 network is used to perform the binary classification of dynamic target and background on the monocular real-time image to obtain the semantic label image.
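As a rough illustration of this conversion, the sketch below builds an FCN head on top of VGG16 in PyTorch, assuming the standard torchvision VGG16 layout in which the first fully connected layer sees a 7×7×512 input; the layer arrangement and the two-class output are assumptions for this example, not the patent's training code.

```python
import torch.nn as nn
from torchvision.models import vgg16

def build_fcn_vgg16(num_classes=2):
    """Replace the three fully connected layers of VGG16 with convolutions whose
    kernel size equals the spatial size of their input (7x7 first, then 1x1)."""
    backbone = vgg16(weights=None).features               # 13 conv + 5 pooling layers
    head = nn.Sequential(
        nn.Conv2d(512, 4096, kernel_size=7), nn.ReLU(inplace=True), nn.Dropout2d(),
        nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True), nn.Dropout2d(),
        nn.Conv2d(4096, num_classes, kernel_size=1),       # heat maps, one per category
    )
    return nn.Sequential(backbone, head)
```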
Further, the method further comprises: after each convolution calculation in the FCN-VGG16 network, performing sparsity processing on a previous convolution layer by using a linear rectification function as an excitation function; pooling operations are performed for each pooling layer in the FCN-VGG16 network.
Specifically, in the embodiment of the present invention, after each convolution calculation in the FCN-VGG16 network, a linear rectification function is used as an excitation function to perform sparsity processing on a previous convolution layer, and a mathematical expression of the linear rectification function is as follows:
f(x)=max(0,x)
where x is the input value and f(x) is the output value. When the input x < 0, the output is 0; when the input x ≥ 0, the output equals the input. The linear rectification function is simple to compute, has low computational complexity and converges quickly under gradient descent; it ensures that the output values of the FCN-VGG16 network are non-negative, while increasing the sparsity of the FCN-VGG16 network and reducing overfitting.
Pooling at each pooling layer in the FCN-VGG16 network reduces the size of the input monocular real-time image and performs information filtering and feature screening on the features captured by the convolutional layers; the pooling operation is generally represented in the form:

A^k_{i,j} = [ ∑_{m=1}^{f} ∑_{n=1}^{f} ( A^k_{s0·i+m, s0·j+n} )^p ]^{1/p}

wherein i, j are the pixel coordinates of the feature, k is the channel index of the feature map, f and s0 are respectively the convolution kernel size and the convolution stride parameters, and p is the pooling exponent (p = 1 gives average pooling, p → ∞ gives max pooling). In the FCN-VGG16 network, after the five pooling operations, the length and the width of the heat map are both reduced to 1/32 of those of the input monocular real-time image. However, in order to keep the output semantic segmentation label image the same size and dimension as the input image, a 32-fold deconvolution operation needs to be performed on the heat map; this inverse of pooling reconstructs the input resolution from the output and yields the FCN-32s-VGG16 network.
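For instance, the 32-fold upsampling of FCN-32s can be realized as a single transposed convolution; the kernel size, stride and padding below are one possible choice that maps an h×w heat map back to exactly 32h×32w (a sketch, not the patent's code).

```python
import torch.nn as nn

num_classes = 2
# (h - 1) * 32 - 2 * 16 + 64 = 32 * h, so the heat map returns to input resolution
upscore32 = nn.ConvTranspose2d(num_classes, num_classes,
                               kernel_size=64, stride=32, padding=16, bias=False)
```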
Further, the optimizing the FCN-VGG16 network includes: introducing a skip structure to perform upsampling operations on the pooling layers in the FCN-VGG16 network and fuse the results, so as to obtain the optimized FCN-VGG16 network.
Specifically, in the embodiment of the present invention, in the FCN-VGG16 (FCN-32s-VGG16) network, the image output by the 32-fold deconvolution operation is reconstructed only from the features of the Conv16 layer, so its accuracy is limited and fine details of the image cannot be restored well. Therefore, a skip structure is introduced for feature optimization, namely the results of different pooling layers are upsampled and fused. Referring to fig. 3, in the present embodiment an upsampling method based on bilinear interpolation is adopted: an input feature map of size N×N is first expanded by bilinear interpolation into a transition feature map of size (2N+1)×(2N+1), and the transition feature map is then convolved with a convolution kernel of size 2×2, finally yielding a new feature map of size 2N×2N.
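This two-fold upsampling step can be sketched in PyTorch as follows; the use of `F.interpolate` for the (2N+1)×(2N+1) transition map and a single 2×2 convolution per call are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def upsample_2x(feature_map, conv2x2):
    """Bilinear interpolation to (2N+1) x (2N+1), then a 2x2 convolution
    (stride 1, no padding) back down to 2N x 2N."""
    n = feature_map.shape[-1]                        # assumes an N x N input map
    transition = F.interpolate(feature_map, size=(2 * n + 1, 2 * n + 1),
                               mode="bilinear", align_corners=True)
    return conv2x2(transition)

channels = 2
conv2x2 = nn.Conv2d(channels, channels, kernel_size=2)
out = upsample_2x(torch.randn(1, channels, 8, 8), conv2x2)   # -> (1, 2, 16, 16)
```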
Referring to FIG. 4, the bilinear interpolation is calculated as follows. Let the interpolation point p0 have coordinates (x0, y0) and unknown pixel value T0. Around p0 there are 4 pixel points with coordinates P1, P2, P3, P4 and corresponding pixel values T1, T2, T3 and T4; taking P1 = (x1, y1), P2 = (x2, y1), P3 = (x1, y2) and P4 = (x2, y2), the correspondence is:

s = x2 - x1, t = y2 - y1

T0 = [ (x2 - x0)(y2 - y0)·T1 + (x0 - x1)(y2 - y0)·T2 + (x2 - x0)(y0 - y1)·T3 + (x0 - x1)(y0 - y1)·T4 ] / (s·t)
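The same computation written out directly; the corner labelling (P1 through P4 as above) follows the reconstruction given here and is an assumption, since the original figure is not reproduced.

```python
def bilinear_value(x0, y0, x1, y1, x2, y2, T1, T2, T3, T4):
    """Interpolate the value T0 at (x0, y0) from the four surrounding pixels,
    with P1=(x1,y1), P2=(x2,y1), P3=(x1,y2), P4=(x2,y2)."""
    s, t = x2 - x1, y2 - y1
    return ((x2 - x0) * (y2 - y0) * T1 +
            (x0 - x1) * (y2 - y0) * T2 +
            (x2 - x0) * (y0 - y1) * T3 +
            (x0 - x1) * (y0 - y1) * T4) / (s * t)

# The centre of a unit square averages the four corner values:
print(bilinear_value(0.5, 0.5, 0, 0, 1, 1, 10, 20, 30, 40))   # 25.0
```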
referring to fig. 5, based on the FCN-VGG16 network, bilinear interpolation is performed on the Conv16 layer to obtain a result of two-fold upsampling, the result is fused with the Pool4 layer characteristics to obtain a Fuse-Pool4 layer, and finally 16-fold deconvolution is performed on the Fuse-Pool4 layer to obtain the FCN-16s-VGG16 network. Similarly, on the basis of the FCN-16s-VGG16 network, performing bilinear interpolation on a Fuse-Pool4 layer, performing fusion feature on the Fuse-Pool4 layer and a Pool3 layer to obtain a Fuse-Pool3 layer, performing 8-time deconvolution on the feature map result of the Fuse-Pool3 layer, and outputting to obtain an FCN-8s-VGG16 network, namely the optimized FCN-VGG16 network.
Further, the performing, by using the optimized FCN-VGG16 network, two classification operations of a dynamic target and a background on the monocular real-time image to obtain the semantic label image includes: determining the category number of heatmaps of the optimized FCN-VGG16 network; determining the prediction probability of the monocular real-time image belonging to a target category; identifying a dynamic target of the monocular real-time image according to the category number of the heat map and the prediction probability; and performing semantic segmentation on the monocular real-time image according to the identified dynamic target to obtain the semantic label image.
Specifically, in the embodiment of the present invention, the category number of the heat map is determined according to the number of heat maps of the optimized FCN-VGG16 network, the prediction probability that the monocular real-time image belongs to the target category is determined, and the dynamic target of the monocular real-time image is identified according to the category number of the heat map and the prediction probability through the following loss function:

L = -∑_{c=1}^{M} y_ic · log(p_ic)

wherein M is the category number of the heat map, C is the target category of the monocular real-time image, y_ic is a virtual parameter, p_ic is the prediction probability that the monocular real-time image belongs to target category C, and L is the loss value; the dynamic target of the monocular real-time image is then identified according to the magnitude of the loss value. Semantic segmentation is performed on the monocular real-time image according to the identified dynamic target to obtain the semantic label image.
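The loss written above is the standard multi-class cross entropy; a small NumPy sketch of the computation (the one-hot encoding of y_ic is an interpretation of the "virtual parameter"):

```python
import numpy as np

def cross_entropy_loss(y, p, eps=1e-12):
    """L = -sum_c y_c * log(p_c): y is the one-hot vector over the M heat-map
    categories, p the predicted class probabilities for the image."""
    return -np.sum(y * np.log(p + eps))

# Two categories (dynamic target vs. background), true class = dynamic target:
y = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
print(cross_entropy_loss(y, p))   # ~0.105
```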
Further, the removing the dynamic feature points of the monocular real-time image according to the semantic segmentation result to obtain the static feature points of the monocular real-time image includes: carrying out image pyramid layering on the monocular real-time image by using a first scaling factor to obtain a plurality of layers of first scaled images; carrying out image pyramid layering on the semantic label image by using a second scaling factor to obtain a plurality of layers of second scaled images; sequentially removing the dynamic feature points in each first scaled image: carrying out ORB feature extraction on each first scaled image to obtain the ORB feature points of each first scaled image; outputting the image coordinate values of the pixel points of the dynamic region in each second scaled image to form a set U_i, wherein i is the layer number of the second scaled image and the dynamic region is the region where the dynamic target is located; matching the coordinate values of the ORB feature points of each first scaled image with the coordinate values in the corresponding set U_i; eliminating, from each first scaled image, the ORB feature points whose coordinate values match coordinate values in the set U_i; and forming a static feature point set M from the ORB feature points retained in each first scaled image. The estimating of the pose of the camera from the static feature points comprises: estimating the pose of the camera according to the static feature point set M.
Specifically, in the embodiment of the present invention, the full convolution neural network is used to obtain the two-class (dynamic target and background) semantic label image, and a coordinate mapping method based on an image pyramid is used to eliminate the feature points of the dynamic region, with the semantic label image used to delimit the dynamic region. The operations are specifically as follows:
Image pyramid layering is carried out on the monocular real-time image by using the first scaling factor to obtain a plurality of layers of first scaled images, and image pyramid layering is carried out on the semantic label image by using the second scaling factor to obtain a plurality of layers of second scaled images. Further, the first scaling factor is the same as the second scaling factor. Then, the dynamic feature points in each first scaled image are removed in sequence: the image coordinate values of the pixel points in the dynamic region of the second scaled image are output to form a set U_i, and ORB feature extraction is carried out on the first scaled image with the same layer number as the second scaled image to obtain the ORB feature points. All the extracted ORB feature points are traversed, and the coordinate values of the ORB feature points are matched with the coordinate values in the corresponding set U_i: for any ORB feature point P, if there exists a point P′ in the corresponding set U_i such that

P_x = P′_x and P_y = P′_y,

the ORB feature point P is considered a dynamic feature point and is eliminated. After the elimination of the dynamic feature points of a single layer is completed, the next layer is processed in the same way. The ORB feature points retained in each first scaled image then form the static feature point set M, and the pose of the camera is estimated according to the static feature point set M. A per-layer sketch of this elimination is given below.
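A minimal sketch of the per-layer coordinate-mapping elimination using OpenCV, assuming the image and the dynamic-region mask are layered with the same scaling factor; the layer count, scale factor and single-level ORB extraction are assumptions for this example, not the patent's implementation.

```python
import cv2

def remove_dynamic_points(gray_image, dynamic_mask, n_levels=8, scale_factor=1.2):
    """Build matching pyramids of the image and of the dynamic-region mask,
    extract ORB points on each layer, and drop any point whose pixel
    coordinates fall inside the dynamic region of its own layer."""
    orb = cv2.ORB_create(nfeatures=500, nlevels=1)      # extract one layer at a time
    static_points = []                                   # the retained set M
    for i in range(n_levels):
        f = 1.0 / (scale_factor ** i)
        level_img = cv2.resize(gray_image, None, fx=f, fy=f)
        level_mask = cv2.resize(dynamic_mask,
                                (level_img.shape[1], level_img.shape[0]),
                                interpolation=cv2.INTER_NEAREST)
        for kp in orb.detect(level_img, None):
            x, y = int(kp.pt[0]), int(kp.pt[1])
            if level_mask[y, x] == 0:                    # coordinates not in U_i
                static_points.append((i, kp))            # keep the layer index too
    return static_points
```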
Further, before the coordinate values of the ORB feature points are matched with the coordinate values in the set U_i, the method further comprises the following step: storing the ORB feature points.
Specifically, in the embodiment of the present invention, since a quadtree mechanism is used when the ORB feature points are extracted, the ORB feature points need to be stored and converted into single-layer image coordinate values before the coordinate matching is performed; otherwise, static feature points would be lost together with the eliminated dynamic feature points. A possible conversion is sketched below.
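The patent does not give this conversion explicitly. Assuming the ORB_SLAM2 convention that the extractor rescales keypoint coordinates to the base (layer-0) image after the quadtree distribution, a stored point could be mapped back to its own layer like this (a hypothetical helper, not ORB_SLAM2 code):

```python
def to_layer_coords(pt_layer0, layer, scale_factor=1.2):
    """Map a keypoint stored in base-image coordinates back to the coordinates
    of pyramid layer `layer` before comparing it with the per-layer set U_i."""
    f = 1.0 / (scale_factor ** layer)
    return pt_layer0[0] * f, pt_layer0[1] * f
```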
Another aspect of the present invention provides a visual SLAM system based on a full convolution neural network in a dynamic scene, the system being configured to estimate the pose of the camera by using the visual SLAM method based on the full convolution neural network in the dynamic scene described above. This system is hereinafter referred to as the ORB_SLAM2-FCN system.
By the method and the system, the dynamic target can be accurately identified and semantic segmentation is completed, the tracking accuracy and robustness of the camera are effectively improved, and the positioning and mapping accuracy of the visual SLAM in a dynamic scene is improved.
The preferred embodiments of the present invention have been described in detail with reference to the accompanying drawings, however, the present invention is not limited to the specific details of the above embodiments, and various simple modifications can be made to the technical solution of the present invention within the technical idea of the present invention, and these simple modifications are within the protective scope of the present invention.
It should be noted that the various technical features described in the above embodiments can be combined in any suitable manner without contradiction; in order to avoid unnecessary repetition, the possible combinations are not separately described.
In addition, any combination of the various embodiments of the present invention is also possible, and the same should be considered as the disclosure of the present invention as long as it does not depart from the spirit of the present invention.

Claims (10)

1. A visual SLAM method based on a full convolution neural network in a dynamic scene, the method comprising:
acquiring an image dataset;
constructing a full convolution neural network model according to the image data set;
performing semantic segmentation on a monocular real-time image currently acquired by a camera by using the full convolution neural network model to obtain a semantic label image;
removing the dynamic feature points of the monocular real-time image according to the semantic label image to obtain the static feature points of the monocular real-time image;
and estimating the pose of the camera according to the static feature points.
2. The method according to claim 1, wherein the performing semantic segmentation on the monocular real-time image currently acquired by the camera by using the full convolutional neural network model to obtain a semantic tag image comprises:
converting the fully connected layers of the VGG16 network into convolutional layers by using convolution kernels with the same size as the input data of the fully connected layers of the VGG16 network in the full convolution neural network model, so as to obtain an FCN-VGG16 network;
optimizing the FCN-VGG16 network;
and performing a binary classification of dynamic target and background on the monocular real-time image by using the optimized FCN-VGG16 network to obtain the semantic label image.
3. The method of claim 2, further comprising:
after each convolution calculation in the FCN-VGG16 network, performing sparsity processing on a previous convolution layer by using a linear rectification function as an excitation function;
pooling operations are performed for each pooling layer in the FCN-VGG16 network.
4. The method of claim 3, wherein the optimizing the FCN-VGG16 network comprises:
and introducing a skip structure to perform upsampling operations on the pooling layers in the FCN-VGG16 network and fuse the results, so as to obtain the optimized FCN-VGG16 network.
5. The method according to claim 4, wherein the performing dynamic target and background binary classification operations on the monocular real-time images by using the optimized FCN-VGG16 network to obtain the semantic label images comprises:
determining the category number of heatmaps of the optimized FCN-VGG16 network;
determining the prediction probability of the monocular real-time image belonging to a target category;
identifying a dynamic target of the monocular real-time image according to the category number of the heat map and the prediction probability;
and performing semantic segmentation on the monocular real-time image according to the identified dynamic target to obtain the semantic label image.
6. The method of claim 5, wherein the dynamic target of the monocular real-time image is identified according to the category number of the heat map and the prediction probability through the following loss function:

L = -∑_{c=1}^{M} y_ic · log(p_ic)

wherein M is the category number of the heat map, C is the target category of the monocular real-time image, y_ic is a virtual parameter, p_ic is the prediction probability that the monocular real-time image belongs to target category C, and L is the loss value;

and identifying the dynamic target of the monocular real-time image according to the loss value.
7. The method according to claim 6, wherein the removing the dynamic feature points of the monocular real-time image according to the semantic segmentation result to obtain the static feature points of the monocular real-time image comprises:
carrying out image pyramid layering on the monocular real-time image by using a first scaling factor to obtain a plurality of layers of first scaled images;
carrying out image pyramid layering on the semantic label image by using a second scaling factor to obtain a plurality of layers of second scaled images;
and sequentially removing the dynamic feature points in each first scaled image: carrying out ORB feature extraction on each first scaled image to obtain the ORB feature points of each first scaled image; outputting the image coordinate values of the pixel points of the dynamic region in each second scaled image to form a set U_i, wherein i is the layer number of the second scaled image, and the dynamic region is the region where the dynamic target is located; matching the coordinate values of the ORB feature points of each first scaled image with the coordinate values in the corresponding set U_i; and eliminating, from each first scaled image, the ORB feature points whose coordinate values match coordinate values in the set U_i;
forming a static feature point set M from the ORB feature points retained in each first scaled image;
the estimating of the pose of the camera from the static feature points comprises:
and estimating the pose of the camera according to the static feature point set M.
8. The method of claim 7, wherein the first scaling factor is the same as the second scaling factor.
9. The method of claim 8, wherein before the coordinate values of the ORB feature points are matched with the coordinate values in the set U_i, the method further comprises the following step:
storing the ORB feature points.
10. A full convolutional neural network based visual SLAM system in a dynamic scene, wherein the system is configured to adopt the full convolutional neural network based visual SLAM method in the dynamic scene of any one of claims 1 to 9 to estimate the pose of the camera.
CN202110295567.6A 2021-03-19 2021-03-19 Visual SLAM method and system based on full convolution neural network in dynamic scene Pending CN112802197A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110295567.6A CN112802197A (en) 2021-03-19 2021-03-19 Visual SLAM method and system based on full convolution neural network in dynamic scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110295567.6A CN112802197A (en) 2021-03-19 2021-03-19 Visual SLAM method and system based on full convolution neural network in dynamic scene

Publications (1)

Publication Number Publication Date
CN112802197A true CN112802197A (en) 2021-05-14

Family

ID=75817245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110295567.6A Pending CN112802197A (en) 2021-03-19 2021-03-19 Visual SLAM method and system based on full convolution neural network in dynamic scene

Country Status (1)

Country Link
CN (1) CN112802197A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913235A (en) * 2022-07-18 2022-08-16 合肥工业大学 Pose estimation method and device and intelligent robot
CN115100643A (en) * 2022-08-26 2022-09-23 潍坊现代农业与生态环境研究院 Monocular vision positioning enhancement method and equipment fusing three-dimensional scene semantics
CN115700781A (en) * 2022-11-08 2023-02-07 广东技术师范大学 Visual positioning method and system based on image inpainting in dynamic scene
CN115700781B (en) * 2022-11-08 2023-05-05 广东技术师范大学 Visual positioning method and system based on image complementary painting in dynamic scene
CN117809025A (en) * 2024-03-01 2024-04-02 深圳魔视智能科技有限公司 Attention network-based target tracking method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
JP7236545B2 (en) Video target tracking method and apparatus, computer apparatus, program
CN112802197A (en) Visual SLAM method and system based on full convolution neural network in dynamic scene
CN110135455B (en) Image matching method, device and computer readable storage medium
CN112270249A (en) Target pose estimation method fusing RGB-D visual features
CN112991413A (en) Self-supervision depth estimation method and system
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN111260688A (en) Twin double-path target tracking method
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
CN111696110B (en) Scene segmentation method and system
CN112991447A (en) Visual positioning and static map construction method and system in dynamic environment
CN110443883B (en) Plane three-dimensional reconstruction method for single color picture based on droplock
CN110766002B (en) Ship name character region detection method based on deep learning
JP6902811B2 (en) Parallax estimation systems and methods, electronic devices and computer readable storage media
CN111382647B (en) Picture processing method, device, equipment and storage medium
CN111768415A (en) Image instance segmentation method without quantization pooling
CN114663502A (en) Object posture estimation and image processing method and related equipment
CN113781659A (en) Three-dimensional reconstruction method and device, electronic equipment and readable storage medium
CN113052755A (en) High-resolution image intelligent matting method based on deep learning
CN115424017A (en) Building internal and external contour segmentation method, device and storage medium
CN115035172A (en) Depth estimation method and system based on confidence degree grading and inter-stage fusion enhancement
CN114359361A (en) Depth estimation method, depth estimation device, electronic equipment and computer-readable storage medium
CN113313162A (en) Method and system for detecting multi-scale feature fusion target
CN116958927A (en) Method and device for identifying short column based on BEV (binary image) graph
Kottler et al. 3GAN: A Three-GAN-based Approach for Image Inpainting Applied to the Reconstruction of Occluded Parts of Building Walls.
US20220198707A1 (en) Method and apparatus with object pose estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination