CN111998862B

CN111998862B - BNN-based dense binocular SLAM method

Info

Publication number: CN111998862B
Application number: CN202010627927.3A
Authority: CN
Inventors: 陈刚; 吴志慧
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2020-07-02
Filing date: 2020-07-02
Publication date: 2023-05-16
Anticipated expiration: 2040-07-02
Also published as: CN111998862A

Abstract

The invention relates to a dense binocular SLAM method based on BNN, which comprises the following steps: step one: extracting ORB characteristics from the corrected left graph; step two: acquiring a left image depth map; step three: obtaining the depth of ORB characteristic points of the left graph; step four: estimating the current camera motion gesture; step five: generating a key frame, and carrying out local mapping through the key frame; step six: establishing a dense map; step seven: loop detection and loop correction. The invention uses BNN to realize binocular matching, improves matching precision and matching speed, thereby improving positioning precision and mapping speed, not only solving the problem that the feature method vision SLAM can not build a dense map, but also realizing real-time mapping of the dense map, and enabling the feature method vision SLAM to be practically applied.

Description

BNN-based dense binocular SLAM method

Technical Field

The invention relates to the field of simultaneous positioning and map creation of robots, in particular to a BNN-based dense binocular SLAM method.

Background

Simultaneous localization and mapping (Simultaneous Localization and Mapping, SLAM) refers to a mobile robot estimating its own motion in an unknown environment by means of onboard sensors while building an environment map.

Visual SLAM can be divided into two categories: characteristic methods and direct methods. The visual SLAM method of the feature method needs to extract image feature points, and a sparse map can be created; the feature method vision SLAM has good robustness to illumination and intense movement, but has poor performance in a scene without texture, and the established sparse map can not be further used for tasks such as obstacle avoidance, movement planning and the like.

For binocular signature visual SLAM, a binocular matching process is typically involved. The traditional binocular matching method has large calculated amount and low accuracy. The patent publication with the application number of CN201811416964.9 discloses a robot positioning system and a robot positioning method based on binocular vision and a convolutional neural network, wherein the convolutional neural network is adopted to improve the binocular matching precision, so that more accurate data are better obtained to build a dense map, but the convolutional neural network still causes large calculation amount of binocular matching, poor instantaneity and incapability of building the dense map in real time, and the system and the method are difficult to be applied in practice.

Disclosure of Invention

The invention provides a dense binocular SLAM method based on BNN, which aims to solve the problems that the binocular matching precision of the binocular feature method visual SLAM method in the prior art is low and a dense map is difficult to build in real time, realizes high-precision positioning, reduces the calculated amount and can build the dense map in real time.

In order to solve the technical problems, the invention adopts the following technical scheme: a dense binocular SLAM method based on BNN adopts a system comprising BNN (binary neural network) module, ORB-SLAM2 frame and dense mapping module; BNN module: receiving left and right images of a binocular camera, and calculating a parallax image corresponding to the left image in real time through a trained binary neural network; ORB-SLAM2 framework: ORB-SLAM2 is an open source feature-based SLAM method supporting monocular, binocular and RGB-D cameras, providing high precision positioning through fusion of BNN modules with ORB-SLAM2 frames; and (5) a dense mapping module: and the dense map building module builds a dense three-dimensional environment map in real time according to the images input by the binocular camera, the parallax map obtained by the BNN module and the positioning result obtained by the ORB-SLAM2 frame.

The BNN module comprises a feature extraction unit, a matching cost calculation unit and a disparity map optimization unit;

the feature extraction unit extracts binary image block features from the left and right eye images through a twin binary network

Weight for each image block feature>

Obtaining a binary descriptor with weights, wherein the weights are learned by a network;

the matching cost calculation unit measures the similarity between image blocks by calculating the weighted hamming distance of the weighted binary description, and the calculation formula is as follows:

in the method, in the process of the invention,

representing the weight vector learned by the network; />

The j-th element of the weight vector obtained by the network learning is represented; the H () function represents the hamming distance.

And the disparity map optimizing unit optimizes the matching cost through semi-global matching SGM to obtain a final disparity map.

The ORB-SLAM2 framework comprises a camera motion tracking module, a local mapping module and a loop detection module;

the camera motion tracking module is used for estimating the motion gesture of the camera;

the local map building module is used for building a sparse environment map;

the loop detection module is used for detecting a loop, judging whether the camera returns to a scene which is arrived before, and correcting the loop after detecting the loop, so as to correct the pose and map point position of the relevant camera;

the camera motion tracking module, the local mapping module and the loop detection module run on three CPU threads in parallel.

The system runs on four CPU threads in real time and accelerates by using a GPU, the BNN module is contained in a camera motion tracking module of the ORB-SLAM2 framework and runs in the GPU, the camera motion tracking module, the local mapping module and the loop detection module of the ORB-SLAM2 framework run on three CPU threads, and the dense mapping module runs on one CPU thread.

A dense binocular SLAM method based on BNN comprises the following specific steps:

step one: extracting ORB characteristics from the corrected left graph; the ORB feature consists of rotating multi-scale FAST corner points and 256-bit BRIEF (Binary Robust Independent Elementary Features) descriptors corresponding to the corner points;

step two: acquiring a left image depth map; the method comprises the steps of calculating a parallax map of a left image by using a binary neural network for inputting left and right eye images, and obtaining a depth map of the left image according to a binocular distance principle from the obtained parallax map of the left image;

step three: acquiring the depth of a characteristic point of the left graph ORB (Oriented FAST and Rotated BRIEF); positioning the positions of ORB feature points in the trusted region of the depth map, and obtaining the depth of the corresponding ORB feature points;

step four: estimating the current camera motion gesture;

step five: generating a key frame, and carrying out local mapping through the key frame;

step six: establishing a dense map;

step seven: loop detection and loop correction.

Preferably, in the second step, the binary neural network shares four layers and is trained by a binocular image library, the depth map of the left image is obtained by measuring the similarity of image blocks by using weighted hamming distance to obtain Matching cost, and finally, the Matching cost is optimized by Semi-Global Matching (SGM) to obtain a parallax map, and then, the depth map of the left image is obtained according to a binocular distance principle.

Preferably, for extracted image block features

The cosine similarity is as follows: />

In the method, in the process of the invention,

a feature vector representing an i-th image block of the left image; />

A feature vector representing an i-th image block of the right image;

representing the cosine similarity of the two feature vectors; />

A j-th element representing a feature vector of an i-th image block of the left image; />

A j-th element of the feature vector representing the i-th image block of the right image; the sign () function represents binarization;

cosine similarity is expressed by weighted hamming distance:

in the method, in the process of the invention,

representing the weight vector learned by the network; />

The j-th element of the weight vector obtained by the network learning is represented; the H () function represents the hamming distance;

the solving step of the weighted hamming distance optimal weight is as follows:

in the method, in the process of the invention,

a j-th element representing an i-th weight vector; />

Representing the weight vector learned by the network; />

The j-th element of the weight vector obtained by the network learning is represented; />

Representing the optimal weight; the H () function represents the hamming distance; j () represents building a least squares optimization problem;

for a pair of

About->

Deriving and setting the weight optimal solution to be 0: />

In the method, in the process of the invention,

a j-th element representing an i-th weight vector; />

The j-th element of the weight vector obtained by the network learning is represented.

Preferably, the specific flow of the third step is as follows: selecting a trusted region of the depth map of the left image, enabling the length and width of the trusted region to be a and b respectively relative to the aspect ratio of the original depth map, setting parameters a and b to be determined according to parameters of the binocular camera, and selecting smaller a and smaller b when the baseline of the binocular camera is shorter; unreliable depth rejection, namely, the depth is too large or too small, as unreliable depth, setting upper and lower thresholds of the depth, and rejecting the depth outside the threshold range, wherein the threshold of the depth can be set by multiplying the baseline length of the binocular camera by a value of a certain constant; and positioning the positions of the ORB feature points in the trusted region of the depth map, and acquiring the depth of the corresponding ORB feature points.

Preferably, the specific flow of the fourth step is as follows:

s4.1, if the camera moves at a uniform speed, estimating the motion gesture of the current frame according to the motion gesture of the previous frame according to a uniform motion model; if the camera does not move at a uniform speed, performing feature matching on the current frame and a reference key frame of the current frame, accelerating a matching process by using a Bag-of-Words (Bag) dictionary, and optimizing the motion gesture of the current frame according to a matching result; the optimization method is a Levenberg-Marquardt method in g2o, and the optimization method can be applied to other optimizations of the scheme.

S4.2: and (3) performing feature matching on the current frame and the local map, and using a matching result to optimize the camera motion pose estimation in the step (4.1) to obtain final camera motion pose estimation. The local map consists of map points of a series of key frames that have partially identical map points as the current frame or, in a co-view, are neighbors of a key frame that has partially identical map points as the current frame. A common view is made up of a series of key frames, with the connection weights between them representing the number of points that have the same map.

Preferably, the specific steps of establishing the local map in the fifth step are as follows:

s5.1: establishing a common view and a spanning tree; the common view consists of key frames, if at least 15 identical map points exist between two key frames, an edge exists, and the weight of the edge is the number of the identical map points; the spanning tree is a sub-graph of the common view, and the edge weight requirement of the spanning tree is at least 100;

s5.2: inserting a key frame; processing the key frames sent by the camera motion tracking thread, and updating the common view and the spanning tree;

s5.3: updating map points; the method comprises the steps of clipping and generating map points, wherein the generation of the map points is generated from ORB characteristic points in key frames of common views by using a triangulation principle; cutting map points is to remove map points with less observation times;

s5.4: local BA optimization; optimizing the motion postures of the current frame and all key frames related to the current frame in the common view, and the spatial positions of map points corresponding to the key frames;

s5.5: and cutting out the local key frames. The 90% map points are deleted from the key frames which can be observed by other key frames so as to control the scale of the local map not to be overlarge.

Preferably, in the sixth step, the specific steps of dense map establishment are:

s6.1, for an input key frame, calculating a depth map of the key frame according to a binocular range principle by using a parallax map of the key frame and an internal reference of a camera;

s6.2, mapping image pixel coordinates in the key frame to three-dimensional world point coordinates according to the depth map of the key frame, the camera motion gesture of the key frame and the camera internal parameters;

s6.3, representing and storing three-dimensional world points by using RGB point clouds, and incrementally constructing a three-dimensional environment map;

s6.4: and performing filtering optimization on the three-dimensional environment map by using voxel filtering to obtain an optimized three-dimensional environment map.

Preferably, the specific flow in the step seven is as follows: using a closed-loop key frame selection strategy of an algorithm frame, and correcting corresponding map points and key frame motion pose if a closed loop is formed; and then starting a new thread to perform global optimization, and fusing the optimized key frame motion pose and map points with the initial key frame motion pose and map points after the global optimization is completed.

Preferably, in the first step: in order to realize scale invariance of descriptors, gaussian blur and downsampling are carried out on a left graph to form an 8-layer Gaussian pyramid; the specific method is that each layer of image of the pyramid is subjected to grid division, and FAST corner points are extracted from grids; extracting not less than 5 corner points from each grid; for each corner point, a corresponding descriptor is calculated. In the image block near the corner point, two pixels are randomly selected 256 times according to Gaussian distribution, gray values are compared, if the first pixel is large, 1 is calculated, and finally a 256-bit descriptor is obtained; in order to realize the invariance of the direction of the descriptors, the gray centroid method is used for calculating the direction of the angular points, and the direction of the descriptors is updated.

Compared with the prior art, the invention has the beneficial effects that: the invention uses BNN to realize binocular matching, improves matching precision and matching speed, thereby improving positioning precision and mapping speed, not only solving the problem that the feature method vision SLAM can not build a dense map, but also realizing real-time mapping of the dense map, and enabling the feature method vision SLAM to be practically applied. The invention has the complete function of SLAM, provides the functions of high-precision positioning and dense mapping, and has wide application range and good robustness.

Drawings

FIG. 1 is a schematic overall flow diagram of a BNN-based dense binocular SLAM method of the present invention;

FIG. 2 is a flowchart of the operation of a binary neural network of the BNN-based dense binocular SLAM method of the present invention;

FIG. 3 is a graphical illustration of the positioning accuracy of the present invention on an EuRoC dataset;

fig. 4 is an effect of the present invention to build a dense map on an EuRoC dataset.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the present patent; for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationship depicted in the drawings is for illustrative purposes only and is not to be construed as limiting the present patent.

The technical scheme of the invention is further specifically described by the following specific embodiments with reference to the accompanying drawings:

example 1

As shown in fig. 1-2, an embodiment of a BNN-based dense binocular SLAM method includes the steps of:

step one: extracting ORB characteristics from the corrected left graph; ORB features consist of rotating multi-scale FAST corner points and 256 BRIEF descriptors corresponding to the corner points; in order to realize scale invariance of descriptors, gaussian blur and downsampling are carried out on a left graph to form an 8-layer Gaussian pyramid; the specific method is that each layer of image of the pyramid is subjected to grid division, and FAST corner points are extracted from grids; extracting not less than 5 corner points from each grid; for each corner point, a corresponding descriptor is calculated. In the image block near the corner point, two pixels are randomly selected 256 times according to Gaussian distribution, gray values are compared, if the first pixel is large, 1 is calculated, and finally a 256-bit descriptor is obtained; in order to realize the invariance of the direction of the descriptors, the gray centroid method is used for calculating the direction of the angular points, and the direction of the descriptors is updated.

Step two: acquiring a left image depth map; the method comprises the steps of calculating a parallax map of a left image by using a binary neural network for inputting left and right eye images, and obtaining a depth map of the left image according to a binocular distance principle from the obtained parallax map of the left image; specifically, the binary neural network has four layers and is trained by a binocular image library, the binary neural network is used for obtaining the depth map of the left image by measuring the similarity of image blocks by using weighted Hamming distance to obtain matching cost, and finally, the matching cost is optimized by semi-global matching SGM to obtain a parallax map, and then, the depth map of the left image is obtained according to the principle of binocular distance.

For extracted image features

The cosine similarity is as follows:

in the method, in the process of the invention,

a feature vector representing an i-th image block of the left image; />

A feature vector representing an i-th image block of the right image;

representing the cosine similarity of the two feature vectors; />

cosine similarity is expressed by weighted hamming distance:

in the method, in the process of the invention,

representing the weight vector learned by the network; />

the solving step of the weighted hamming distance optimal weight is as follows:

/>

in the method, in the process of the invention,

a j-th element representing an i-th weight vector; />

Representing the weight vector learned by the network; />

for a pair of

About->

Deriving and setting the weight optimal solution to be 0:

in the method, in the process of the invention,

a j-th element representing an i-th weight vector; />

Step three: obtaining the depth of ORB characteristic points of the left graph; selecting a trusted region of the depth map of the left image, enabling the length and width of the trusted region to be a and b respectively relative to the aspect ratio of the original depth map, setting parameters a and b to be determined according to parameters of the binocular camera, and selecting smaller a and smaller b when the baseline of the binocular camera is shorter; unreliable depth rejection, namely, the depth is too large or too small, as unreliable depth, setting upper and lower thresholds of the depth, and rejecting the depth outside the threshold range, wherein the threshold of the depth can be set by multiplying the baseline length of the binocular camera by a value of a certain constant; and positioning the positions of the ORB feature points in the trusted region of the depth map, and acquiring the depth of the corresponding ORB feature points.

Step four: estimating the current camera motion gesture;

s4.1, if the camera moves at a uniform speed, estimating the motion gesture of the current frame according to the motion gesture of the previous frame according to a uniform motion model; if the camera does not move at a uniform speed, performing feature matching on the current frame and a reference key frame of the current frame, accelerating a matching process by using a BoW dictionary, and optimizing the motion gesture of the current frame according to a matching result; the optimization method is a Levenberg-Marquardt method in g2o, and the optimization method can be applied to other optimizations of the scheme.

Step five: generating a key frame, and carrying out local mapping through the key frame; the specific steps are as follows:

Step six: the method for establishing the dense map comprises the following specific steps:

Step seven: loop detection and loop correction. Using a closed-loop key frame selection strategy of the ORB-SLAM2 framework, and correcting corresponding map points and key frame motion pose if a closed loop is formed; and then starting a new thread to perform global optimization, and fusing the optimized key frame motion pose and map points with the initial key frame motion pose and map points after the global optimization is completed.

To further illustrate the effect of the method presented in this example, the method presented in this example was tested on the V101 sequence of the EuRoC dataset. The EuRoC dataset is a series of images captured by a binocular camera mounted on an unmanned aerial vehicle, and is a common dataset for evaluating binocular vision SLAM. Fig. 3 is a comparison of the positioning accuracy of the method of the present embodiment and the open source ORB-SLAM2 method (a sequence segment with a stronger camera motion and a higher positioning difficulty is selected), and it can be seen that the positioning accuracy of the method and ORB-SLAM2 provided in the present embodiment is higher than that of ORB-SLAM2. Fig. 4 is a dense three-dimensional point cloud map established by the method provided by the embodiment, so that the outline of an object can be seen clearly, and the reconstruction effect on a scene is good.

The beneficial effects of this embodiment are: the invention uses BNN to realize binocular matching, improves matching precision and matching speed, thereby improving positioning precision and mapping speed, not only solving the problem that the feature method vision SLAM can not build a dense map, but also realizing real-time mapping of the dense map, and enabling the feature method vision SLAM to be practically applied. The invention has the complete function of SLAM, provides the functions of high-precision positioning and dense mapping, and has wide application range and good robustness.

Example 2

Another embodiment of a dense binocular SLAM method based on BNN, the method in this embodiment employs a system comprising a BNN (binary neural network) module, an ORB-SLAM2 framework, and a dense mapping module; BNN module: receiving left and right images of a binocular camera, and calculating a parallax image corresponding to the left image in real time through a trained binary neural network; ORB-SLAM2 framework: ORB-SLAM2 is an open source feature-based SLAM method supporting monocular, binocular and RGB-D cameras, providing high precision positioning through fusion of BNN modules with ORB-SLAM2 frames; and (5) a dense mapping module: and the dense map building module builds a dense three-dimensional environment map in real time according to the images input by the binocular camera, the parallax map obtained by the BNN module and the positioning result obtained by the ORB-SLAM2 frame.

the feature extraction unit extracts binary images of left and right eyes through a twin binary networkImage block making feature

Weight for each image block feature>

the local map building module is used for building a sparse environment map;

It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims

1. A BNN-based dense binocular SLAM method comprising the steps of:

step one: extracting ORB characteristics from the corrected left graph; ORB features consist of rotating multi-scale FAST corner points and 256 BRIEF descriptors corresponding to the corner points;

step two: acquiring a left image depth map; the method comprises the steps of calculating a parallax map of a left image by using a binary neural network for inputting left and right eye images, and obtaining a depth map of the left image according to a binocular distance principle from the obtained parallax map of the left image; the binary neural network has four layers and is trained by a binocular image library, the binary neural network is used for obtaining a depth image of the left image, the similarity of image blocks is measured by using weighted Hamming distance to obtain matching cost, the matching cost is optimized by semi-global matching SGM to obtain a parallax image, and the depth image of the left image is obtained according to a binocular distance principle;

features for network extracted image blocks

The cosine similarity is as follows:

in the method, in the process of the invention,

a feature vector representing an i-th image block of the left image; />

A feature vector representing an i-th image block of the right image;

representing the calculation of twoCosine similarity of feature vectors; />

cosine similarity is expressed by weighted hamming distance:

in the method, in the process of the invention,

representing the weight vector learned by the network; />

the solving step of the weighted hamming distance optimal weight is as follows:

in the method, in the process of the invention,

a j-th element representing an i-th weight vector; />

Representing the weight vector learned by the network; />

for a pair of

About->

Deriving and setting the weight optimal solution to be 0:

in the method, in the process of the invention,

a j-th element representing an i-th weight vector; />

The j-th element of the weight vector obtained by the network learning is represented;

step three: obtaining the depth of ORB characteristic points of the left graph; positioning the positions of ORB feature points in the trusted region of the depth map, and obtaining the depth of the corresponding ORB feature points;

step four: estimating the current camera motion gesture;

step six: establishing a dense map;

step seven: loop detection and loop correction.

2. The BNN-based dense binocular SLAM method of claim 1, wherein the step three is performed as follows: selecting a trusted region of the depth map of the left image, enabling the length and width of the trusted region to be a and b respectively relative to the aspect ratio of the original depth map, setting parameters a and b to be determined according to parameters of the binocular camera, and selecting smaller a and smaller b when the baseline of the binocular camera is shorter; unreliable depth rejection, namely, the depth is too large or too small, as unreliable depth, setting upper and lower thresholds of the depth, and rejecting the depth outside the threshold range, wherein the threshold of the depth can be set by multiplying the baseline length of the binocular camera by a value of a certain constant; and positioning the positions of the ORB feature points in the trusted region of the depth map, and acquiring the depth of the corresponding ORB feature points.

3. The BNN-based dense binocular SLAM method of claim 2, wherein the step four is performed as follows:

s4.1, if the camera moves at a uniform speed, estimating the motion gesture of the current frame according to the motion gesture of the previous frame according to a uniform motion model; if the camera does not move at a uniform speed, performing feature matching on the current frame and a reference key frame of the current frame, accelerating a matching process by using a BoW dictionary, and optimizing the motion gesture of the current frame according to a matching result;

s4.2: and (3) performing feature matching on the current frame and the local map, and using a matching result to optimize the camera motion pose estimation in the step (4.1) to obtain final camera motion pose estimation.

4. The BNN-based dense binocular SLAM method of claim 3, wherein the step five of creating the local map comprises the specific steps of:

s5.5: cutting local key frames; the 90% map points are deleted from the key frames which can be observed by other key frames so as to control the scale of the local map not to be overlarge.

5. The BNN-based dense binocular SLAM method of claim 4, wherein in step six, the dense map is constructed by the specific steps of:

6. A dense binocular SLAM method based on BNN according to any of claims 1-5, characterized in that a system is employed comprising BNN modules, ORB-SLAM2 frames and dense mapping modules; BNN module: receiving left and right images of a binocular camera, and calculating a parallax image corresponding to the left image in real time through a trained binary neural network; ORB-SLAM2 framework: ORB-SLAM2 is an open source feature-based SLAM method supporting monocular, binocular and RGB-D cameras, providing high precision positioning through fusion of BNN modules with ORB-SLAM2 frames; and (5) a dense mapping module: and the dense map building module builds a dense three-dimensional environment map in real time according to the images input by the binocular camera, the parallax map obtained by the BNN module and the positioning result obtained by the ORB-SLAM2 frame.

7. The BNN-based dense binocular SLAM method of claim 6, wherein the BNN module comprises a feature extraction unit, a matching cost calculation unit, and a disparity map optimization unit.

8. The BNN-based dense binocular SLAM method of claim 6, wherein the ORB-SLAM2 framework comprises a camera motion tracking module, a local mapping module, a loop detection module;

the camera motion tracking module is used for estimating the motion gesture of the camera; the local map building module is used for building a sparse environment map; the loop detection module is used for detecting a loop, judging whether the camera returns to a scene which is arrived before, and correcting the loop after detecting the loop, so as to correct the pose and map point position of the relevant camera.