CN111795704A - Method and device for constructing visual point cloud map

Method and device for constructing visual point cloud map

Info

Publication number
CN111795704A
CN111795704A (application CN202010615170.6A)
Authority
CN
China
Prior art keywords
key frame
frame
feature points
map
image
Prior art date
Legal status
Granted
Application number
CN202010615170.6A
Other languages
Chinese (zh)
Other versions
CN111795704B (en)
Inventor
易雨亭
李建禹
龙学雄
党志强
Current Assignee
Hangzhou Hikrobot Co Ltd
Original Assignee
Hangzhou Hikrobot Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikrobot Technology Co Ltd filed Critical Hangzhou Hikrobot Technology Co Ltd
Priority to CN202010615170.6A priority Critical patent/CN111795704B/en
Publication of CN111795704A publication Critical patent/CN111795704A/en
Priority to PCT/CN2021/103653 priority patent/WO2022002150A1/en
Application granted granted Critical
Publication of CN111795704B publication Critical patent/CN111795704B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/28 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C21/30 - Map- or contour-matching
    • G01C21/32 - Structuring or formatting of map data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 - Geographical information databases
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images

Abstract

The method comprises: performing feature extraction on source image frames acquired from the space to be mapped to obtain feature points of the source image frames; performing inter-frame tracking on the source image frames to determine key frames; matching the feature points in the current key frame with the feature points in the previous key frame to obtain the matched feature points of the current key frame; and calculating the spatial position information of the matched feature points in the current key frame and using it as the map point information of the current key frame, the point cloud formed by the map point sets of all key frames being a first visual point cloud map. The mapping process separates map construction from positioning, effectively removing their mutual influence, and has better adaptability and stability in complex and changeable environments.

Description

Method and device for constructing visual point cloud map
Technical Field
The invention relates to the field of navigation and positioning, in particular to a construction method of a visual point cloud map.
Background
Map construction and positioning are key technologies in simultaneous localization and mapping (SLAM) research; map construction is a precondition for positioning, and the quality of the map directly influences the accuracy of positioning. The visual point cloud map is one kind of constructed map: it describes the appearance, pose and other information of points in the environment through a set of three-dimensional points in space. Constructing a visual point cloud map therefore requires two types of data: key frames and map points, where the key frames describe the appearance of points in the environment, the map points describe their poses, and a set formed by a large number of map points forms the point cloud.
SLAM starts from an unknown place in an unknown environment, estimates the position and attitude of the robot from repeatedly observed map features during its motion, and builds the map incrementally from the estimated robot position, so that positioning and map construction are achieved simultaneously. In terms of input, there is no special input before the robot runs; once it starts running, there is raw sensor data. In terms of output, both the pose and the map are estimated; that is, the robot is positioned on a map model while a new map model is being built or a known map is being improved, much like a person dropped into an unfamiliar city who gradually becomes familiar with the environment. Existing SLAM mapping therefore typically couples the mapping and positioning problems together, so that they interact with each other.
Disclosure of Invention
The invention provides a construction method of a visual point cloud map, which is used for avoiding the influence of positioning on map construction.
The construction method of the visual point cloud map provided by the invention is realized as follows:
carrying out feature extraction on a source image frame acquired from the space to be mapped to obtain feature points of the source image frame,
performing inter-frame tracking on the source image frame, determining a key frame,
matching the feature points in the current key frame with the feature points in the previous key frame to obtain the matched feature points of the current key frame,
calculating the spatial position information of the matched feature points in the current key frame, taking the spatial position information of the matched feature points as the map point information of the current key frame,
the point cloud formed by the map point sets of all the key frames is a first visual point cloud map.
Preferably, the extracting the features of the source image frames collected from the space of the map to be created to obtain the feature points of the source image frames further comprises,
carrying out image preprocessing on the source image frame to obtain a target image frame,
performing feature extraction based on the target image frame to obtain feature points of the target image frame;
the method further comprises the step of enabling the user to select the target,
based on the first visual point cloud map,
according to the closed loop key frame with closed loop constraint, the least square method is adopted to carry out the image optimization of the key frame position,
and/or, optimizing the spatial position information of the map points according to the reprojection error,
and obtaining a second visual point cloud map.
The invention further provides a construction device of a visual point cloud map, comprising a first visual point cloud map construction unit, wherein the first visual point cloud map construction unit comprises:
a feature extraction module, configured to perform feature extraction on source image frames acquired from the space to be mapped to obtain feature points of the source image frames,
a map point generating module for performing inter-frame tracking on the source image frame, determining the key frame, matching the feature points in the current key frame with the feature points in the previous key frame to obtain the matching feature points of the current key frame, calculating the spatial position information of the matching feature points in the current key frame, using the spatial position information of the matching feature points as the map point information of the current key frame,
the point cloud formed by the map point sets of all the key frames is a first visual point cloud map.
According to the construction method of the visual point cloud map, feature extraction is performed on the image frames acquired from the space to be mapped, the spatial position information of the matched feature points is obtained through inter-frame matching, the matched feature points are used as map points, and the visual point cloud map formed by the map point sets of all key frames is obtained, realizing the generation and description of three-dimensional points in the physical environment. The mapping process separates map construction from positioning, effectively removing their mutual influence, and has better adaptability and stability in complex and changeable environments. Because the point cloud map is continuous, compared with a map formed of map nodes it allows continuous positioning, avoids jumps in the positioning process, and reduces the probability of relocation. Furthermore, optimizing the key frame poses and/or the map point positions improves the map accuracy, the map can be corrected in time in the case of a wrong closed loop without losing the initial map data, and the method has strong extensibility and is easy to combine with improved map construction methods.
Drawings
Fig. 1 is a schematic flow chart of a process for constructing a map based on image data acquired by a monocular camera.
FIG. 2 is a schematic diagram of feature point screening.
Fig. 3 is a schematic flow chart of a process for constructing a map based on forward-looking image data acquired by a monocular camera.
FIG. 4 is a schematic view of a process for constructing a map based on image data acquired by a binocular camera
FIG. 5 is a diagram illustrating accumulated error.
Fig. 6 is a schematic flow chart illustrating optimization of the first visual point cloud map.
FIG. 7 is a schematic diagram of a visual dictionary.
Fig. 8 is a schematic diagram of a visual point cloud map construction apparatus according to the present application.
FIG. 9 is a diagram of an image pre-processing module.
Detailed Description
For the purpose of making the objects, technical means and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings.
In the present application, based on the collected image data, a visual point cloud map is obtained through feature extraction and feature point matching during inter-frame tracking; further, pose graph optimization is performed through closed-loop constraints and/or map point optimization is performed through reprojection errors, so as to improve map accuracy. The constructed visual point cloud map comprises at least key frame pose information and map point position information, and each map point further carries feature point descriptor information.
Example one
For ease of understanding, in this embodiment the image data is collected by a monocular camera and the images are of ground texture. It should be understood that this embodiment is not limited to ground texture images; other images may also be suitable.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating a process of constructing a map based on image data acquired by a monocular camera. The method mainly comprises image preprocessing, feature extraction and inter-frame tracking, and specifically comprises the following steps of for each frame of source image:
Step 101: the collected image is taken as a source image and preprocessed to obtain a target image from which feature points can be extracted. For a ground texture image, for example, the purpose of the preprocessing is to obtain an image whose main component is texture information.
Step 1011: distortion removal is performed on the source image according to the distortion coefficients of the camera to obtain an undistorted image I(u, v), where u and v denote pixel coordinates and I(u, v) denotes the pixel value at those coordinates.
Step 1012: the undistorted image I(u, v) is filtered to obtain a background image I_b(u, v). For example, Gaussian filtering is performed, and the Gaussian kernel size can be set to 45 × 45. Expressed mathematically:
I_b(u, v) = G × I(u, v),
where G is the filter kernel, I_b(u, v) is the background image, and I(u, v) is the undistorted image.
Preferably, the filter kernel is set relatively large, so that the filtered image is as close to the background image as possible.
If the texture region is the darker part of the image, the undistorted image I(u, v) may first be inverted, expressed as: pixel maximum value - I(u, v);
for example, if the maximum pixel value is 255, the inversion operation is 255 - I(u, v).
Step 1013: the background image is subtracted from the undistorted image to obtain a foreground image I_f(u, v) dominated by texture information, expressed as:
I_f(u, v) = I(u, v) - I_b(u, v)
Step 1014: the foreground image is stretched to obtain the target image.
In this step, because the texture information in the acquired image is usually weak, the pixel values (gray levels) of the texture region are mostly distributed in a narrow gray interval, so the pixel values in this interval are stretched over the full pixel value range. Specifically:
when a foreground image pixel value is less than or equal to the minimum gray value, it is set to the minimum of the pixel value range, i.e. 0;
when a foreground image pixel value is greater than the minimum gray value and less than the maximum gray value, the contrast of the pixel value is increased; preferably, the pixel value is set to the value bearing a certain proportion to the maximum pixel value, the proportion being the ratio of the difference between the foreground pixel value and the minimum gray value to the difference between the maximum and minimum gray values;
when a foreground image pixel value is greater than or equal to the maximum gray value, it is set to the maximum of the pixel value range, for example 255.
Expressed mathematically, the stretched foreground image I_f'(u, v) is:
I_f'(u, v) = 0,                                            if I_f(u, v) <= I_min
I_f'(u, v) = 255 · (I_f(u, v) - I_min) / (I_max - I_min),  if I_min < I_f(u, v) < I_max
I_f'(u, v) = 255,                                          if I_f(u, v) >= I_max
where I_min is the minimum gray value, I_max is the maximum gray value, and the pixel value range in the above formula is 0-255.
Image preprocessing helps improve the contrast of the image; in environments with weak texture information, improving the gray-level contrast of the texture region helps extract more feature points.
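As an illustration of steps 1011 to 1014, a minimal Python/OpenCV sketch of the preprocessing chain is given below; the kernel size, the gray-value bounds i_min/i_max, the inversion heuristic and the camera parameters K and dist are assumptions made for illustration, not values fixed by the method.

```python
import cv2
import numpy as np

def preprocess(src, K, dist, kernel=45, i_min=10, i_max=60):
    """Undistort, remove the background, and stretch the texture region.

    K, dist      : camera intrinsic matrix and distortion coefficients (assumed given)
    kernel       : Gaussian kernel size; large, so the blur approximates the background
    i_min, i_max : assumed gray-value bounds of the texture region
    """
    img = cv2.undistort(src, K, dist)                # step 1011: de-distortion
    if img.mean() > 127:                             # crude check: texture darker than background
        img = 255 - img                              # optional inversion (pixel maximum - I)
    bg = cv2.GaussianBlur(img, (kernel, kernel), 0)  # step 1012: background image I_b
    fg = cv2.subtract(img, bg)                       # step 1013: foreground I_f = I - I_b
    # step 1014: stretch [i_min, i_max] linearly onto the full 0-255 range
    stretched = (fg.astype(np.float32) - i_min) * 255.0 / (i_max - i_min)
    return np.clip(stretched, 0, 255).astype(np.uint8)
```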
Step 102: feature points are extracted from the current target image frame to convert image information into feature information, obtaining the feature point set of the current target image frame; features such as ORB (Oriented FAST and Rotated BRIEF), Scale-Invariant Feature Transform (SIFT), or SURF (Speeded Up Robust Features) may be used.
Taking the ORB algorithm as an example, based on a frame of target image:
Step 1021: feature detection is performed using the FAST (Features from Accelerated Segment Test) algorithm to obtain FAST feature points.
Step 1022: the FAST feature points are screened to control the number of feature points. To keep the feature points evenly distributed while retaining the most salient ones, the target image is divided into a certain number of grids, as shown in fig. 2, which is a schematic diagram of feature point screening. All feature points are screened per grid: for any grid, the feature points in the grid are sorted in descending order of FAST response value, and the first Q feature points are retained, where Q is determined by the number of feature points in the frame, the set upper limit on the total number of feature points, and the number of feature points in the grid; the number retained may differ from grid to grid. For example, if the upper limit on the total number of feature points is set to 100 and the current target image has 2000 feature points, one feature point is kept out of every 20, so a grid containing 20 feature points retains 1 feature point, i.e. Q = 1. Expressed mathematically:
Q = ⌊ n_grid × N_max / N_total ⌋
where n_grid is the number of feature points in the grid, N_max is the set upper limit on the total number of feature points, N_total is the number of feature points in the frame, and ⌊·⌋ denotes rounding down.
Step 1023: the direction of each FAST feature point is determined, namely the centroid of the image patch within radius r of the feature point is computed, and the vector from the feature point coordinates to the centroid is taken as the direction of the feature point.
Step 1024: for each screened FAST feature point, a binary-string feature descriptor is computed, thereby obtaining the feature point information of the current target image. Specifically, descriptors such as rBRIEF or oBRIEF may be adopted.
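As a sketch of steps 1021 to 1024, grid-based screening of ORB/FAST feature points can be written as follows; the grid layout, the per-frame cap and the ORB settings are illustrative assumptions.

```python
import cv2
import numpy as np

def extract_screened_features(img, grid=(8, 8), max_total=100):
    """Detect ORB key points, then keep the top-Q responses in each grid cell,
    with Q = floor(n_cell * max_total / n_all)."""
    orb = cv2.ORB_create(nfeatures=5000)
    kps = orb.detect(img, None)
    n_all = max(len(kps), 1)
    h, w = img.shape[:2]
    cells = {}
    for kp in kps:                                   # bucket key points by grid cell
        cx = min(int(kp.pt[0] * grid[1] / w), grid[1] - 1)
        cy = min(int(kp.pt[1] * grid[0] / h), grid[0] - 1)
        cells.setdefault((cy, cx), []).append(kp)
    kept = []
    for cell_kps in cells.values():
        cell_kps.sort(key=lambda k: k.response, reverse=True)  # descending response
        q = int(np.floor(len(cell_kps) * max_total / n_all))   # per-cell quota Q
        kept.extend(cell_kps[:q])
    kept, desc = orb.compute(img, kept)              # binary (rBRIEF-style) descriptors
    return kept, desc
```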
Step 103: inter-frame tracking is performed to match the feature points of successive frames, the coordinates of the matched feature points in the world coordinate system are calculated, and these coordinates are stored as map points carrying three-dimensional spatial position information.
In this step, for the target image frame for which the feature points are currently extracted:
step 1031, judging whether the current target image frame is the first frame, if yes, using the target image frame as the key frame, otherwise, executing step 1032, performing inter-frame matching to determine whether the current target image frame is the key frame,
Step 1032: the current target image frame is matched with the previous key frame, namely:
for any feature point i of the current target image frame, whether the matching degree between the descriptor of feature point i in the current target image frame and the descriptor of feature point i in the previous key frame is smaller than a set matching threshold is calculated; if so, the two feature points are judged to match, otherwise they do not match. The matching degree can be described by the Hamming distance, in which case the matching threshold is a Hamming distance threshold;
and 1033, judging whether the current target image frame is a key frame according to the key frame condition, if so, taking the current target image frame as the key frame, executing 1034 to update the map based on the key frame, otherwise, not updating the map.
In step 1033, when the number of matched feature points is greater than a set first threshold, it is determined that the current target image frame is a key frame;
the key frame condition may also be one of the following conditions:
the spatial distance from the previous key frame is larger than a set second threshold;
the spatial angle from the previous key frame is larger than a set third threshold;
When the key frame condition is a condition other than the number of matched feature points being larger than the set first threshold, step 1033 and step 1032 may be exchanged, that is, the key frame is determined first and the matching feature points are then determined.
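A sketch of the matching and key-frame test of steps 1032 and 1033, using Hamming-distance matching against the previous key frame; the thresholds used here are assumed values.

```python
import cv2

def match_to_previous_keyframe(desc_cur, desc_key, hamming_thresh=50):
    """Match binary descriptors of the current frame against the previous key frame."""
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(desc_cur, desc_key)
    return [m for m in matches if m.distance < hamming_thresh]

def is_keyframe(matches, dist_to_key=0.0, angle_to_key=0.0,
                min_matches=30, max_dist=0.5, max_angle=0.3):
    """Key-frame conditions: enough matched feature points, or the frame has
    moved / rotated far enough from the previous key frame."""
    return (len(matches) > min_matches
            or dist_to_key > max_dist
            or angle_to_key > max_angle)
```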
Step 1034, calculating the coordinates of each matched feature point (simply referred to as a matched feature point) based on the current key frame, and storing the coordinates as map point information;
since the matching feature points of each current keyframe are different from the matching feature points of the previous keyframe, the matching feature points in the previous keyframe are updated by the calculation result of step 1034, and the coordinates of the unmatched feature points are not updated, so that the obtained current map information includes the un-updated map point information and the updated map point information, wherein each map point corresponds to the three-dimensional spatial location information.
In this embodiment, considering that the ground textures acquired by the monocular camera lie in the same plane and the camera is usually installed at the bottom of the mobile robot, the coordinates of any matching feature point i of the current key frame in the world coordinate system can be obtained from its image-plane coordinates through the external parameters of the camera, specifically:
the x coordinate is the ratio of the product of the pixel abscissa u of the current key frame matching feature point i and the camera installation height to the camera focal length,
the y coordinate is the ratio of the product of the pixel ordinate v of the current key frame matching feature point i and the camera installation height to the camera focal length,
the z coordinate is the camera mounting height.
Expressed mathematically:
X = u · H / f
Y = v · H / f
Z = H
where H is the installation height of the camera, f is the focal length of the camera, and (u, v) are the pixel coordinates of matching feature point i in the image coordinate system.
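A sketch of this back-projection for the ground-texture case; the pixel coordinates are assumed to be taken relative to the principal point, as the formulas above imply.

```python
def ground_map_point(u, v, cam_height, focal_len):
    """World coordinates of a matched feature point lying on the ground plane.
    (u, v): pixel coordinates of the point relative to the principal point."""
    x = u * cam_height / focal_len
    y = v * cam_height / focal_len
    z = cam_height
    return (x, y, z)
```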
Steps 101 to 103 are repeated until all source images are processed, and a first visual point cloud map composed of a plurality of map points is obtained.
The embodiment provides a visual point cloud construction method based on ground texture, which comprises the steps of carrying out post-processing on an acquired source image, carrying out feature point matching on adjacent key frames, and generating spatial three-dimensional coordinates of map points based on pixel coordinates of the matched feature points so as to obtain a visual point cloud map.
Example two
In this embodiment, the image data is collected by a monocular camera, and the collected image points do not lie in a single plane. For example, the camera is mounted forward-looking, i.e. the mobile robot captures images with a forward-looking camera.
Referring to fig. 3, fig. 3 is a schematic diagram illustrating a process of constructing a map based on forward-looking image data collected by a monocular camera. For each frame of image, the following steps are performed:
step 301, performing distortion removal processing on a source image according to a distortion coefficient of a camera to obtain a distortion-removed image I (u, v), wherein u and v represent pixel coordinates.
Step 302: whether the pixel value of each pixel in the undistorted image is larger than a set first pixel threshold is judged; if so, the pixels whose values exceed the first pixel threshold are inverted and the inverted undistorted image is then filtered; otherwise the undistorted image I(u, v) is filtered directly; in either case a background image I_b(u, v) is obtained.
Step 303: the background image is subtracted from the undistorted image to obtain a foreground image I_f(u, v), expressed as:
I_f(u, v) = I(u, v) - I_b(u, v)
Step 304: whether the pixel values of the foreground image I_f(u, v) are uniformly distributed is judged; if so, the foreground image is taken as the target image; otherwise the foreground image is stretched to obtain the target image, the stretching being the same as in step 1014.
In this step, images of high quality are not stretched while images of low quality are, so the stretching is applied selectively according to image quality.
Step 305: feature points are extracted from the current target image frame to convert the image information into feature information and obtain the feature point set of the current target image frame; features such as ORB (Oriented FAST and Rotated BRIEF), Scale-Invariant Feature Transform (SIFT), or SURF (Speeded Up Robust Features, an efficient variant of SIFT) may be used. This step 305 is the same as step 102.
Step 306, judging whether the current target image frame is the first frame, if so, taking the target image frame as a key frame, then returning to step 310, otherwise, executing step 307, performing inter-frame matching to determine whether the current target image frame is the key frame,
step 307, matching the current target image frame with the previous key frame, that is:
for any feature point i of the current target image frame, whether the matching degree between the descriptor of feature point i in the current target image frame and the descriptor of feature point i in the previous key frame is smaller than a set matching threshold is calculated; if so, the two feature points are judged to match, otherwise they do not match. The matching degree can be described by the Hamming distance, in which case the matching threshold is a Hamming distance threshold;
step 308, judging whether the current target image frame is a key frame according to the key frame condition, if so, taking the current target image frame as the key frame, executing step 309 to update the map based on the key frame, otherwise, not updating the map, and directly executing step 310. When one of the following key frame conditions is satisfied, the current target image frame is determined to be a key frame:
the number of the matched feature points is greater than a set first threshold value;
the spatial distance from the previous key frame is larger than a set second threshold;
the spatial angle from the previous key frame is larger than a set third threshold;
step 309, based on the current key frame, calculating the coordinates of each matching feature point, and storing the coordinates as map point information, so that the obtained current map information includes the non-updated map point information and the updated map point information, wherein each map point corresponds to three-dimensional spatial position information.
In this embodiment, since images acquired by the monocular camera are on a non-identical plane, an eight-point method is adopted to calculate an essential matrix between two frames according to pixel coordinates of the matched feature points, the essential matrix is subjected to SVD decomposition to obtain a relative pose of the camera, and then a least square method is adopted to calculate coordinates of any matched feature point i in a world coordinate system according to the poses of the two frames based on a triangulation calculation principle. Specifically, the method comprises the following steps:
Step 3091: according to the essential matrix E, the normalized-plane coordinate p1 of matching feature point i in the current key frame and the normalized-plane coordinate p2 of matching feature point i in the previous key frame satisfy: for any matched feature point, the product of the transpose of the normalized-plane coordinate of the matching feature point in the previous key frame, the essential matrix, and the normalized-plane coordinate of the matching feature point in the current key frame equals 0. Expressed mathematically:
p2^T E p1 = 0
The essential matrix E reflects the relationship between the representations, in the camera coordinate system, of the image points of a spatial point P as seen by cameras with different viewing angles; it is a 3 × 3 matrix, and multiplying a point of the first image frame by the essential matrix yields the epipolar line of that point in the second image frame. The normalized-plane coordinate of matching feature point i in the current key frame is p1 = [u1, v1, 1]^T, and the normalized-plane coordinate of matching feature point i in the previous key frame is p2 = [u2, v2, 1]^T. The normalized-plane coordinates of eight pairs of matched feature points are substituted in to solve for the essential matrix; the matching feature point i of the current key frame and the matching feature point i of the previous key frame form a pair of matched feature points.
Step 3092: singular value decomposition (SVD) is performed on the essential matrix E to obtain the inter-frame relative pose between the current key frame and the previous key frame, i.e. the relative pose of the camera, comprising the translation vector t and the rotation matrix R.
Step 3093: based on the triangulation principle, the depth value s1 of matching feature point i in the current key frame and the depth value s2 of matching feature point i in the previous key frame satisfy:
s1 p1 = s2 R p2 + t
Using least squares, s1 and s2 can be solved.
Alternatively,
multiplying both sides of the above formula by the skew-symmetric matrix p1^ of p1 gives:
s1 p1^ p1 = 0 = s2 p1^ R p2 + p1^ t
from which s2 can be obtained; substituting the obtained s2 back into the previous formula gives s1.
Step 3094: according to the depth value s1 of matching feature point i in the current key frame, the coordinates of matching feature point i in the world coordinate system are calculated:
the x coordinate is: the product of the pixel abscissa of the normalized plane of the matched feature points in the current keyframe and the depth value of the matched feature points,
the y coordinate is: the product of the pixel ordinate of the matching feature point normalization plane in the current keyframe and the matching feature point depth value,
the z coordinate is the camera focal length.
Expressed mathematically:
X = s1 · u1
Y = s1 · v1
Z = f
where f is the focal length of the camera, used to convert the normalized-plane coordinates onto the imaging plane.
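A sketch of steps 3091 to 3093 using OpenCV to estimate the essential matrix and relative pose and then solving the two depths by least squares; RANSAC inlier filtering, error handling and sign/scale ambiguities are omitted, and the point ordering convention is an assumption.

```python
import cv2
import numpy as np

def relative_pose_and_depths(pts_cur, pts_prev, K):
    """pts_cur, pts_prev: Nx2 arrays of matched pixel coordinates (N >= 8)."""
    E, _ = cv2.findEssentialMat(pts_prev, pts_cur, K, method=cv2.RANSAC)  # eight-point + RANSAC
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_cur, K)                 # SVD of E inside

    Kinv = np.linalg.inv(K)
    depths = []
    for pc, pp in zip(pts_cur, pts_prev):
        p1 = Kinv @ np.array([pc[0], pc[1], 1.0])   # normalized plane, current key frame
        p2 = Kinv @ np.array([pp[0], pp[1], 1.0])   # normalized plane, previous key frame
        # solve s1 * p1 = s2 * R @ p2 + t for (s1, s2) in the least-squares sense
        A = np.column_stack((p1, -(R @ p2)))        # 3x2 system matrix
        s1, s2 = np.linalg.lstsq(A, t.ravel(), rcond=None)[0]
        depths.append((s1, s2))
    return R, t, depths
```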
Step 310: whether all source images have been processed is judged; if so, the process ends; otherwise the next frame is processed by returning to step 301, until all source images are processed and a first visual point cloud map consisting of a large number of map points is obtained.
The embodiment provides a visual point cloud construction method of a non-coplanar image based on forward-looking collection, which comprises the steps of carrying out post-processing on a collected source image, carrying out feature point matching on adjacent key frames, and generating a space three-dimensional coordinate of a map point based on a pixel coordinate of the matched feature point so as to obtain a visual point cloud map.
Example three
In the present embodiment, the image data is collected by a binocular camera, and the collected images are non-coplanar.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating a process of constructing a map based on image data collected by a binocular camera. For each binocular image frame, i.e. a first source image frame from the first camera and a second source image frame from the second camera captured at the same time, the following steps are performed:
step 401, performing image preprocessing on a first source image frame and a second source image frame to obtain a current binocular target image frame, which comprises a first target image frame and a second target image frame;
in this step, the first image frame and the second image frame may be subjected to image preprocessing in parallel, or the first image frame and the second image frame may be subjected to image preprocessing in series respectively,
the specific processing is the same as steps 301 to 304,
Step 402: the feature points of the first target image frame and of the second target image frame are extracted from the current binocular target image frame, so as to convert image information into feature information and obtain the feature point set of the current target image frame; features such as ORB (Oriented FAST and Rotated BRIEF), Scale-Invariant Feature Transform (SIFT), or SURF (Speeded Up Robust Features) may be used. This step is the same as step 102.
Step 403, determining whether the current binocular target image frame is the first binocular image frame, if so, taking any frame in the binocular target image frame as a key frame, executing step 406, otherwise, executing step 404, performing inter-frame tracking to determine whether any frame in the current binocular target image frame is a key frame,
in step 404, in order to improve the matching efficiency, any target image frame in the current binocular target image frames may be matched with the previous key frame, so as to obtain the matching feature points of the target image frames,
the specific matching method is the same as step 307.
Step 405, judging whether the current target image frame is a key frame according to the key frame condition, if so, taking the current target image frame as the key frame of the binocular target image frame, executing step 406 to update the map based on the key frame, otherwise, not updating the map;
when one of the following key frame conditions is satisfied, the current target image frame is determined to be a key frame:
the number of the matched feature points is greater than a set first threshold value;
the spatial distance from the previous key frame is larger than a set second threshold;
the spatial angle from the previous key frame is larger than a set third threshold;
step 406, searching a second matching feature point successfully matched in the binocular image frame based on the first matching feature point in the current key frame, calculating coordinates of the second matching feature point, and storing the coordinates as map point information, so that the obtained current map information comprises non-updated map point information and updated map point information, wherein each map point corresponds to three-dimensional space position information, namely map point coordinates.
In this embodiment, the calculation process of the coordinates of any matching feature point (first matching feature point) i in the current keyframe is as follows:
taking the current key frame as a first frame in the current binocular target image frame, wherein the matching feature points in the frame are obtained through the step 404; taking another target image frame in the binocular target image frames as a second frame; matching the matching feature point i in the first frame with the feature point j in the second frame, namely, calculating the matching degree between the descriptors of the matching feature point i and the feature point j, and if the matching feature point i in the second frame is obtained through matching, the map point coordinates of the matching feature point i are as follows:
the x coordinate is: dividing the product of the pixel abscissa of the normalization plane of the matching feature point in the current key frame and the length of the binocular baseline by the absolute value of the difference between the pixel abscissa of the normalization plane of the matching feature point in the current key frame and the pixel abscissa of the normalization plane of the matching feature point in the second frame;
the y coordinate is: dividing the product of the pixel ordinate of the matching feature point normalization plane in the current key frame and the length of the binocular baseline by the absolute value of the difference between the pixel abscissa of the matching feature point normalization plane in the current key frame and the pixel abscissa of the matching feature point normalization plane in the second frame;
the z coordinate is: the product of the camera focal length and the binocular baseline length is divided by the absolute value of the difference between the pixel abscissa of the matched feature point normalized plane in the current keyframe and the pixel abscissa of the matched feature point normalized plane in the second frame.
The mathematical formula is expressed as:
X = u1 · b / |u1 - u2|
Y = v1 · b / |u1 - u2|
Z = f · b / |u1 - u2|
where (u1, v1) are the normalized-plane pixel coordinates of the matching feature point in the first frame, (u2, v2) are the normalized-plane pixel coordinates of the matched feature point in the second frame, f is the focal length of the camera, and b is the length of the binocular baseline;
if not, the coordinate calculation of the matching feature point i is abandoned.
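A sketch of the binocular triangulation used here, assuming a rectified stereo pair so that the disparity is purely horizontal; the variable names follow the formulas above.

```python
def stereo_map_point(u1, v1, u2, focal_len, baseline):
    """Map point coordinates from a matched pair of normalized-plane pixel coordinates.
    u1, v1: first frame (current key frame); u2: matched abscissa in the second frame."""
    disparity = abs(u1 - u2)
    if disparity < 1e-6:            # no usable depth; the match is abandoned
        return None
    x = u1 * baseline / disparity
    y = v1 * baseline / disparity
    z = focal_len * baseline / disparity
    return (x, y, z)
```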
Steps 401 to 406 are repeated until all source binocular images are processed, and a first visual point cloud map composed of a large number of map points is obtained.
The embodiment provides a binocular-image-based visual point cloud construction method: the spatial coordinates of the matched feature points are obtained from the binocular images with a simple calculation, and the positioning information of the mobile robot is not needed during construction, so map construction is separated from positioning and the stability and adaptability in complex environments are improved.
Since the first visual point cloud map is obtained by continuously matching inter-frame images and continuously recording the generated map points, this way of generating map points by continuous inter-frame matching accumulates error, and the error grows with the distance travelled. Referring to fig. 5, fig. 5 is a schematic diagram of accumulated error. In the figure, T_i represents the pose of the i-th frame of data; the left side is the objective real trajectory and the right side is the calculated trajectory. In the real trajectory T_1 and T_19 are near the same location, but because of the accumulated error the calculated trajectory does not bring them near the same location. To eliminate the accumulated error and improve the precision of the map point spatial position information in the first visual point cloud map, closed-loop constraints are constructed and optimized with a least square method.
Referring to fig. 6, fig. 6 is a schematic flow chart illustrating optimization of the first visual point cloud map. The optimization method mainly comprises closed-loop point identification, closed-loop constraint calculation and map optimization, wherein the map optimization comprises pose map optimization and/or map point optimization. The details are as follows.
Step 601, identifying a key frame with closed-loop constraint in the first visual point cloud map through manual marking or key frame similarity calculation.
In one embodiment, the method of manual marking is adopted: a unique identification pattern is placed in the environment at the time of image data acquisition so that a closed loop is created between key frames of the same identification acquired at different times. This approach has the advantage of high reliability.
In the second mode, natural marking is used: whether a closed loop occurs is judged by calculating whether the similarity between two key frame images is larger than a set similarity threshold, where the similarity covers both the distribution of the feature points and the image pixels. Specifically,
a visual dictionary of natural features is constructed: given N feature points, a tree of depth d with branching factor k at each level is built, as follows:
at the root node, all feature points are clustered into k classes by using a k-means clustering algorithm (k-means), so that a first-layer node is obtained.
For each node of the first layer, feature points belonging to the node are further aggregated into k types to obtain the next layer.
And so on until the leaf level, wherein d levels are counted from the root node to the leaf level. The leaf layer is the word feature point in the dictionary.
Referring to fig. 7, fig. 7 is a schematic diagram of a visual dictionary.
For an image containing n features in total, if word feature point ω_i appears c_i times, the weight of the word feature point is expressed as:
η_i = c_i / n
according to the dictionary model, any key frame A is described as each word feature point omegaiAnd the weight thereof is a set of elements, and the mathematical expression is:
A={(ω1,η1),(ω2,η2),……(ωN,ηN)}
and N is the total number of the feature points in the visual dictionary.
According to the above model, the similarity S between any two key frames A and B can be described in L1-norm form:
S(A, B) = 2 Σ_{i=1}^{N} ( |v_Ai| + |v_Bi| - |v_Ai - v_Bi| )
where v_Ai is an element of the set describing key frame A according to the dictionary model and v_Bi is an element of the set describing key frame B.
And if the similarity S is larger than the set similarity threshold, judging that closed-loop constraint exists between the two frames.
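A sketch of the natural-marking path: a k-branch, depth-d vocabulary tree built with k-means and an L1-style similarity score between two key-frame word-weight vectors. Using float k-means on binary descriptors and the 1/2 normalization of the score are simplifying assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocab_tree(descriptors, k=10, depth=3):
    """Recursively cluster descriptors into a k-ary tree of the given depth;
    the leaves play the role of the word feature points."""
    if depth == 0 or len(descriptors) <= k:
        return {"leaf": True, "center": descriptors.mean(axis=0)}
    km = KMeans(n_clusters=k, n_init=10).fit(descriptors)
    children = [build_vocab_tree(descriptors[km.labels_ == i], k, depth - 1)
                for i in range(k) if np.any(km.labels_ == i)]
    return {"leaf": False, "centers": km.cluster_centers_, "children": children}

def bow_similarity(va, vb):
    """L1-style similarity between two key frames described as dicts word -> weight."""
    words = set(va) | set(vb)
    return 0.5 * sum(abs(va.get(w, 0.0)) + abs(vb.get(w, 0.0))
                     - abs(va.get(w, 0.0) - vb.get(w, 0.0)) for w in words)
```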
In step 602, a closed-loop constraint is calculated based on the key frames determined to have a closed-loop constraint (hereinafter referred to as closed-loop key frames). Specifically:
step 6021, based on the closed-loop key frame, calculating the matching feature points in the closed-loop key frame:
if the first closed-loop key frame A and the second closed-loop key frame B have closed loops, calculating the matching degree of any map point i in the first closed-loop key frame A and any map point j in the second closed-loop key frame B, if the matching degree is smaller than a set matching degree threshold value, judging that the two map points are matched, repeatedly matching each map point in such a way to obtain m matched feature points, wherein the mathematical expression is as follows:
P={p1,p2……pm},P′={p1′,p2′……pm′}
and P is a set of m matched feature points in the first closed-loop key frame, and P is a set of m matched feature points in the second closed-loop key frame.
The matching degree calculation may be to calculate a hamming distance between the two map point descriptors, and determine a match if the hamming distance is smaller than a set hamming threshold.
Step 6022, calculating the inter-frame motion information between the closed-loop key frames according to the matching feature points in the closed-loop key frames, namely, calculating the relative pose between the closed-loop key frames, wherein the relative pose represents the accumulated error.
Any matching feature point in the closed-loop key frames satisfies the relationship:
p_i = R p'_i + t
where (R, t) reflects the closed-loop constraint between the two closed-loop key frames, and the relative pose calculated through this relationship is used as an initial value; i is a natural number with 1 ≤ i ≤ m, and p_i, p'_i are the coordinate matrices of the matched feature points.
The closed-loop constraint can be solved with a least square method, for example the nonlinear LM optimization algorithm: a first objective function is constructed, the position information of all matched feature points in the closed-loop key frames is substituted in, and ζ is solved iteratively so that the first objective function reaches its minimum, thereby obtaining R and t. Expressed mathematically:
ζ* = argmin_ζ (1/2) Σ_{i=1}^{m} || p_i - exp(ζ^) p'_i ||²
where ζ is the lie algebra representation of (R, t).
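A minimal sketch of solving the closed-loop constraint (R, t) from the matched points by least squares; it assumes the matched points are given as 3-D coordinates and uses a rotation-vector parameterization with scipy's Levenberg-Marquardt solver rather than the lie-algebra formulation verbatim.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def closed_loop_constraint(P, P_prime):
    """Estimate (R, t) such that p_i ~ R p'_i + t for the matched points.
    P, P_prime: Nx3 arrays of matched points in the two closed-loop key frames."""
    def residual(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        t = x[3:]
        return (P - (P_prime @ R.T + t)).ravel()
    x0 = np.zeros(6)                                 # identity pose as initial value
    sol = least_squares(residual, x0, method="lm")   # Levenberg-Marquardt
    return Rotation.from_rotvec(sol.x[:3]).as_matrix(), sol.x[3:]
```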
Step 603: the first visual point cloud map is optimized according to the closed-loop constraint. Specifically, this comprises pose graph optimization and map point optimization, where the pose graph optimization is processed in step 6031 and the map point optimization in step 6032; steps 6031 and 6032 may be performed in either order.
It should be understood that steps 6031 and 6032 may also be performed selectively to optimize either step, such as only pose map optimization or only map point optimization.
Step 6031: let ζ_i be the lie algebra representation of the pose T_i of any key frame i and ζ_j be the lie algebra representation of the pose T_j of any key frame j; the error of the relative pose between the two is expressed as:
e_ij = ln( T_ij^{-1} T_i^{-1} T_j )^∨
where T_ij is the measured relative pose between the two key frames, the symbol ^ denotes the skew-symmetric matrix, and the symbol ∨ denotes the inverse operation of the skew-symmetric matrix.
Therefore, a second objective function for pose graph optimization of the key frames is constructed:
min (1/2) Σ_{i,j} e_ij^T Ω e_ij
where Ω is the weight of the error term and the summation runs over the set of key frames.
The measured relative pose errors between key frame i and key frame j are substituted into the second objective function as initial values, the relative poses between the closed-loop key frames obtained in step 6022 are used as constraints, and the Gauss-Newton method or the LM method is used to iteratively solve for the lie algebra ζ_i of the pose T_i of key frame i and the lie algebra ζ_j of the pose T_j of key frame j at which the second objective function reaches its minimum. In this way the accumulated error determined from the closed-loop key frames is distributed over the individual key frames, thereby correcting the key frame poses.
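A compact sketch of the pose-graph step: each key-frame pose is parameterized by a rotation vector and a translation, and the relative-pose residuals (including the closed-loop edges) are minimized with a least-squares solver. This is a simplification of the lie-algebra formulation in the text; gauge fixing and information matrices are omitted.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def optimize_pose_graph(poses, edges):
    """poses: list of (R, t) initial key-frame poses.
    edges: list of (i, j, R_ij, t_ij) relative-pose measurements, including the
    closed-loop constraints; in practice the first pose would be held fixed."""
    x0 = np.concatenate([np.r_[Rotation.from_matrix(R).as_rotvec(), t] for R, t in poses])

    def unpack(x, k):
        return Rotation.from_rotvec(x[6 * k:6 * k + 3]).as_matrix(), x[6 * k + 3:6 * k + 6]

    def residual(x):
        res = []
        for i, j, R_ij, t_ij in edges:
            Ri, ti = unpack(x, i)
            Rj, tj = unpack(x, j)
            R_err = R_ij.T @ Ri.T @ Rj                  # rotation part of T_ij^-1 T_i^-1 T_j
            t_err = R_ij.T @ (Ri.T @ (tj - ti) - t_ij)  # translation part
            res.append(np.r_[Rotation.from_matrix(R_err).as_rotvec(), t_err])
        return np.concatenate(res)

    sol = least_squares(residual, x0)                   # Gauss-Newton / LM style refinement
    return [unpack(sol.x, k) for k in range(len(poses))]
```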
Step 6032: from the pose T_i of any key frame i, the coordinates y_j of three-dimensional map point j, and the pixel position z_ij at which map point j is observed in key frame i, a reprojection error is constructed:
e_ij = z_ij - h(T_i, y_j)
where h(T_i, y_j), the reprojected position of the map point in the image, is expressed as:
s · h(T_i, y_j) = K [I_{3×3} 0_{3×1}] T_i [y_j^T 1]^T
where I is the identity matrix, [I_{3×3} 0_{3×1}] is a 3 × 4 matrix, T_i is a 4 × 4 matrix, [y_j^T 1]^T is a 4 × 1 vector, s is the depth of the reprojected point, and K contains the camera intrinsic parameters.
A third objective function of the reprojection error is constructed:
min (1/2) Σ_{i,j} e_ij^T Ω e_ij
where Ω is the weight of the error term and j indexes the map points.
The reprojection error obtained from the pose T_i of key frame i, the coordinates of map point j, the camera intrinsics, and the pixel coordinates of map point j in key frame i is substituted into the third objective function as an initial value, and the Gauss-Newton method or the LM method is used to iteratively solve for the coordinates y_j of three-dimensional map point j at which the third objective function reaches its minimum, thereby correcting the three-dimensional spatial position information of the map point.
In this step, the pose T_i of key frame i is preferably the pose optimized in step 6031.
The optimized key frame poses and/or the optimized map point positions are stored as the map information of the visual point cloud, so that a second visual point cloud map is obtained by optimizing the key frame poses and/or the map point coordinates.
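A sketch of the reprojection residual of step 6032 and its minimization over a single map point's coordinates, with the key-frame poses held fixed at their optimized values; the 4 × 4 poses are assumed to map world coordinates into the camera frame.

```python
import numpy as np
from scipy.optimize import least_squares

def reproject(K, T_i, y_j):
    """Project map point y_j (3-vector, world frame) into key frame i with pose T_i (4x4)."""
    p_cam = (T_i @ np.r_[y_j, 1.0])[:3]     # [I 0] T_i [y_j; 1]
    p_img = K @ p_cam
    return p_img[:2] / p_img[2]             # divide by the depth s

def optimize_map_point(K, observations, y0):
    """observations: list of (T_i, z_ij) pairs, z_ij the measured pixel position of the point."""
    def residual(y):
        return np.concatenate([z - reproject(K, T, y) for T, z in observations])
    return least_squares(residual, y0).x    # refined map point coordinates y_j
```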
According to the method, the mapping process is separated into an independent first visual point cloud map construction, followed by closed-loop constraint calculation and map optimization to obtain the second visual point cloud map. A corresponding output map is stored at each processing stage, so even if the mapping result is not ideal, the original data of the previous stage is preserved; the method has strong extensibility and is easy to combine with various improved methods.
Referring to fig. 8, fig. 8 is a schematic diagram of a visual point cloud map building apparatus according to the present application. The device comprises a first visual point cloud map building unit 801, a closed-loop unit 802, a map optimization unit 803, and an input/output IO unit 804 for reading and saving map files, wherein an external source image is input to the first visual point cloud map building unit 801, the closed-loop unit 802 is used for adding closed-loop constraints to the first visual point cloud map generated by the first visual point cloud map building unit 801, and the map optimization unit 803 performs keyframe pose map optimization and map point position information optimization on the first visual point cloud map based on the closed-loop constraints.
The first visual point cloud mapping unit 801 includes,
an image preprocessing module 8011 configured to preprocess an image;
a feature extraction module 8012, configured to convert the image information into feature information;
a map point generating module 8013, configured to track image frames from frame to frame, determine a key frame, match feature points in the current key frame with feature points in the previous key frame to obtain matching feature points of the current key frame, calculate spatial location information of the matching feature points in the current key frame, use the spatial location information of the matching feature points as map point information of the current key frame,
the point cloud formed by the map point sets of all the key frames is a first visual point cloud map.
The closed-loop element 802 comprises a closed-loop element,
a closed-loop key frame identification module 8021, which identifies a closed-loop key frame in the first visual point cloud map according to the artificial mark or the key frame similarity calculation;
the closed-loop constraint calculation module 8022 calculates the relative pose between the closed-loop key frames based on the closed-loop key frames as closed-loop constraints; and constructing a second objective function for optimizing the key frame pose graph, and solving the key frame pose when the second objective function obtains the minimum value by adopting a least square method.
The map optimization unit 803 includes a keyframe pose map optimization module 8031 and/or a map point optimization module 8032, wherein,
a key frame pose graph optimization module 8031, which performs graph optimization on the key frame pose by using a least square method according to the closed loop key frame with closed loop constraint based on the first visual point cloud map to obtain a second visual point cloud map;
the map point optimization module 8032 optimizes the spatial position information of the map points according to the reprojection error based on the first visual point cloud map to obtain a second visual point cloud map.
Referring to fig. 9, fig. 9 is a schematic diagram of an image preprocessing module. The image pre-processing module comprises a pre-processing module,
the image distortion removal submodule is used for carrying out distortion removal processing on the source image frame according to the distortion coefficient of the camera to obtain a distortion removal image,
the image filtering submodule carries out image filtering on the distortion-removed image to obtain a background image,
an image difference submodule for subtracting the background image from the de-distorted image to obtain a foreground image,
and the image stretching submodule is used for stretching the foreground image to obtain a target image frame.
The mapping process is separated into an independent first visual point cloud map construction unit, a closed-loop unit and a map optimization unit with no coupling between them; a corresponding output map is stored at each processing stage, so even if the mapping result is not ideal, the original data of the previous stage is preserved; the scheme has strong extensibility and is easy to combine with various improved methods.
The present application may also provide a visual point cloud mapping apparatus comprising a memory storing executable computer instructions and a processor configured to execute the instructions in the memory to implement the steps of the visual point cloud mapping method.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
The embodiment of the invention also provides a computer readable storage medium, wherein a computer program is stored in the storage medium, and when being executed by a processor, the computer program realizes the steps of the visual point cloud map construction method.
For the device/network side device/storage medium embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for the relevant points, refer to the partial description of the method embodiment.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (15)

1. A construction method of a visual point cloud map is characterized by comprising the following steps,
carrying out feature extraction on a source image frame acquired from the space to be mapped to obtain feature points of the source image frame,
performing inter-frame tracking on the source image frame, determining a key frame,
matching the feature points in the current key frame with the feature points in the previous key frame to obtain the matched feature points of the current key frame,
calculating the spatial position information of the matched feature points in the current key frame, taking the spatial position information of the matched feature points as the map point information of the current key frame,
the point cloud formed by the map point sets of all the key frames is a first visual point cloud map.
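For illustration only, the following minimal Python sketch shows how the steps of claim 1 fit together; the helper names (extract_features, is_key_frame, match, compute_map_points) are hypothetical placeholders supplied by the caller, not part of the claimed method.

```python
# Minimal sketch of the claim-1 pipeline: feature extraction, inter-frame
# tracking / key-frame selection, matching against the previous key frame,
# and accumulation of map points into a first visual point cloud map.
# All helper names are illustrative placeholders.

def build_first_point_cloud_map(source_frames, extract_features, is_key_frame,
                                match, compute_map_points):
    map_points = []          # the first visual point cloud map (all key frames' map points)
    prev_key_frame = None

    for frame in source_frames:
        features = extract_features(frame)            # source-image-frame feature points
        if prev_key_frame is None or is_key_frame(frame, features, prev_key_frame):
            if prev_key_frame is not None:
                matches = match(features, prev_key_frame["features"])
                # spatial positions of the matched points become this key frame's map points
                map_points.extend(compute_map_points(matches, frame, prev_key_frame))
            prev_key_frame = {"frame": frame, "features": features}

    return map_points
```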
2. The method as claimed in claim 1, wherein said performing feature extraction on source image frames collected in a space to be mapped to obtain source image frame feature points, further comprises,
carrying out image preprocessing on the source image frame to obtain a target image frame,
performing feature extraction based on the target image frame to obtain feature points of the target image frame;
the method further comprises,
based on the first visual point cloud map,
performing, according to the closed-loop key frames having closed-loop constraints, graph optimization of the key frame poses by using a least squares method,
and/or, optimizing the spatial position information of the map points according to the reprojection error,
and obtaining a second visual point cloud map.
3. The method of claim 2, wherein the image pre-processing comprises,
performing distortion removal processing on the source image frame according to the distortion coefficient of the camera to obtain a distortion-removed image,
judging whether the pixel value of each pixel point in the distortion-removed image is larger than a first pixel threshold value, if so, carrying out inversion operation on the pixel points with the pixel values larger than the first pixel threshold value, then carrying out image filtering to obtain a background image, otherwise, carrying out image filtering on the distortion-removed image to obtain the background image,
subtracting the background image from the de-distorted image to obtain a foreground image,
and judging whether the pixel values in the foreground image are uniformly distributed, if so, taking the foreground image as a target image frame, otherwise, stretching the foreground image to obtain the target image frame.
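As a rough illustration of the preprocessing chain of claim 3 (and the stretch of claim 4), the sketch below uses OpenCV for undistortion and filtering; the Gaussian kernel size, the threshold values and the "uniformly distributed" test are assumptions of this sketch, not values given by the claims.

```python
import cv2
import numpy as np

def preprocess_frame(source_frame, K, dist_coeffs,
                     invert_threshold=200.0, min_gray=5.0, max_gray=250.0):
    """Sketch of the claim-3 preprocessing chain: undistort, estimate a
    background image (inverting over-bright pixels first), subtract it,
    then stretch if the foreground is not spread out enough."""
    undistorted = cv2.undistort(source_frame, K, dist_coeffs).astype(np.float32)

    # invert only the pixels brighter than the first pixel threshold, then filter
    bg_input = undistorted.copy()
    bright = bg_input > invert_threshold
    bg_input[bright] = 255.0 - bg_input[bright]
    background = cv2.GaussianBlur(bg_input, (51, 51), 0)   # large kernel ~ background estimate

    foreground = np.clip(undistorted - background, 0.0, 255.0)

    # crude "uniformly distributed" test: enough dynamic range already
    if foreground.max() - foreground.min() >= (max_gray - min_gray):
        return foreground.astype(np.uint8)

    # claim-4 style piecewise linear stretch: clamp below min_gray / above max_gray,
    # scale linearly in between
    stretched = (foreground - min_gray) / (max_gray - min_gray) * 255.0
    return np.clip(stretched, 0.0, 255.0).astype(np.uint8)
```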
4. The method of claim 3, wherein the stretching the foreground image comprises,
if a pixel value of the foreground image is less than or equal to a set minimum gray value, setting that pixel value to the minimum value of the pixel value range;
if the pixel value of the foreground image is greater than the minimum gray value and less than a set maximum gray value, setting that pixel value to the maximum pixel value scaled by a ratio, the ratio being the difference between the pixel value and the minimum gray value divided by the difference between the maximum gray value and the minimum gray value;
if the pixel value of the foreground image is greater than or equal to the maximum gray value, setting that pixel value to the maximum value of the pixel value range;
the feature extraction is carried out based on the target image frame to obtain the feature points of the target image frame, including,
carrying out feature detection on the target image frame to obtain feature points,
the target image frame is divided into a number of grids,
for the feature points in any grid, arranging the feature points in the grid in a descending order according to the response values of the feature points, and reserving the first Q feature points to obtain screened feature points; q is determined according to the number of the feature points in the target image frame, the set upper limit of the total number of the feature points and the total number of the feature points in the grid;
and respectively calculating the feature descriptors of the screened feature points.
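The grid-based screening of claim 4 can be sketched as follows; ORB is used only as an example detector/descriptor (the claim does not name one), and the grid size and the q_for_grid callback are assumptions of this sketch.

```python
import cv2

def grid_screen_features(target_frame, rows=8, cols=8, q_for_grid=None):
    """Detect feature points, split the image into a grid, keep the top-Q
    responses per grid cell, then compute descriptors for the kept points."""
    orb = cv2.ORB_create(nfeatures=5000)
    keypoints = orb.detect(target_frame, None)

    # assign each keypoint to a grid cell
    h, w = target_frame.shape[:2]
    cells = {}
    for kp in keypoints:
        cell = (int(kp.pt[1] * rows / h), int(kp.pt[0] * cols / w))
        cells.setdefault(cell, []).append(kp)

    kept = []
    for cell_kps in cells.values():
        cell_kps.sort(key=lambda kp: kp.response, reverse=True)   # descending response
        q = q_for_grid(len(keypoints), len(cell_kps)) if q_for_grid else len(cell_kps)
        kept.extend(cell_kps[:q])

    kept, descriptors = orb.compute(target_frame, kept)
    return kept, descriptors
```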
5. The method as claimed in claim 4, wherein Q is determined according to the number of feature points in the target image frame, the set upper limit of the total number of feature points and the total number of feature points in the grid, and comprises: dividing the set upper limit of the total number of feature points by the number of feature points in the target image frame, multiplying the result by the total number of feature points in the grid, and rounding down;
the inter-frame tracking of the source image frame, the determination of the key frame, including,
for each target image frame: judging whether the target image frame is the first frame or not, if so, taking the target image frame as a key frame, otherwise, determining whether the target image frame is the key frame or not according to the key frame condition;
wherein the key frame condition satisfies at least one of the following conditions:
the number of the matched feature points is greater than a set first threshold value;
the spatial distance between the target image frame and the previous key frame is greater than a set second threshold;
the spatial angle between the target image frame and the previous key frame is greater than a set third threshold.
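A minimal sketch of the per-grid quota Q and the key-frame test of claim 5 follows, with the Q ratio taken as the feature-point upper limit divided by the frame's feature count (so that the retained total stays near the limit); the threshold values are illustrative placeholders only.

```python
import math

def q_for_grid(num_frame_points, num_grid_points, max_total_points=1000):
    """Per-grid quota Q: floor(upper_limit / points_in_frame * points_in_grid),
    so the total number of retained feature points stays near the upper limit."""
    if num_frame_points <= max_total_points:
        return num_grid_points                      # no screening needed
    return math.floor(max_total_points / num_frame_points * num_grid_points)

def is_key_frame(num_matched, distance_to_prev_kf, angle_to_prev_kf,
                 first_threshold=50, second_threshold=0.3, third_threshold=0.26):
    """Key-frame test of claim 5: satisfying any one condition suffices.
    Threshold values, metres and radians are assumptions of this sketch."""
    return (num_matched > first_threshold
            or distance_to_prev_kf > second_threshold
            or angle_to_prev_kf > third_threshold)
```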
6. The method of claim 5, wherein the source image frames originate from a monocular camera and the imaged scene lies in a single plane;
the calculating the spatial location information of the matching feature points in the current key frame comprises,
for each matching feature point:
the x coordinate is: the ratio of the product of the pixel abscissa of the matched feature point in the current key frame and the camera installation height to the camera focal length,
the y coordinate is: the ratio of the product of the pixel ordinate of the matched feature point in the current key frame and the camera installation height to the camera focal length,
the z coordinate is the camera mounting height.
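As an illustration of the back-projection of claim 6, assuming the pixel coordinates are measured relative to the principal point (which the claim leaves implicit), a minimal sketch:

```python
def planar_back_projection(u, v, camera_height, focal_length, cx=0.0, cy=0.0):
    """Claim-6 back-projection for a monocular camera viewing a single plane.
    (u, v) are pixel coordinates, assumed here to be taken relative to the
    principal point (cx, cy)."""
    x = (u - cx) * camera_height / focal_length
    y = (v - cy) * camera_height / focal_length
    z = camera_height
    return x, y, z
```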
7. The method of claim 5, wherein the source image frames originate from a monocular camera and the imaged scene is non-coplanar;
the calculating the spatial location information of the matching feature points in the current key frame comprises,
obtaining an essential matrix between the current key frame and the previous key frame according to the pixel coordinates of at least 8 matched feature point pairs, each pair consisting of a matched feature point in the current key frame and the corresponding matched feature point in the previous key frame,
performing singular value decomposition on the essential matrix to obtain the inter-frame relative pose between the current key frame and the previous key frame;
for each matching feature point: at least obtaining the depth value of the matching feature point of the current key frame according to the inter-frame relative pose between the current key frame and the previous key frame and the triangularization calculation relationship; and obtaining the spatial position information of the matched feature points according to the depth value of the matched feature points of the current key frame.
8. The method of claim 7, wherein obtaining the essential matrix between the current key frame and the previous key frame according to the pixel coordinates of at least 8 matched feature point pairs, each pair consisting of a matched feature point in the current key frame and the corresponding matched feature point in the previous key frame, comprises,
for any matched feature point,
according to the relationship that the transpose of the normalized-plane coordinate vector of the matched feature point in the previous key frame, multiplied by the essential matrix and then by the normalized-plane coordinate vector of the matched feature point in the current key frame, equals 0,
substituting the pixel coordinates of the 8 pairs of matched feature points to obtain the essential matrix;
the obtaining at least the depth value of the matched feature point of the current key frame according to the inter-frame relative pose between the current key frame and the previous key frame and a triangulation relationship comprises,
for any matched feature point,
based on the relationship that the depth value of the matched feature point in the current key frame multiplied by its normalized-plane coordinate vector equals the depth value of the matched feature point in the previous key frame multiplied by the rotation matrix of the relative pose and by the normalized-plane coordinate vector of the matched feature point in the previous key frame, plus the translation vector of the relative pose,
obtaining the depth value of the matched feature point in the current key frame from the rotation matrix and translation vector of the relative pose and the normalized-plane coordinate vectors of the matched feature point in the current key frame and the previous key frame;
the obtaining the spatial position information of the matching feature point according to the depth value of the matching feature point of the current key frame includes,
the x coordinate is: the product of the pixel abscissa of the normalized plane of the matched feature points in the current keyframe and the depth value of the matched feature points,
the y coordinate is: the product of the pixel ordinate of the matching feature point normalization plane in the current keyframe and the matching feature point depth value,
the z coordinate is the camera focal length.
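A numpy sketch of the two relations in claim 8: the epipolar constraint assembled over at least 8 matches to estimate the essential matrix, and the depth solved from the triangulation relation. Recovering the relative pose (R, t) from E by singular value decomposition, as in claim 7, is omitted here; in practice it could be done with, for example, OpenCV's recoverPose. The function names are assumptions of this sketch.

```python
import numpy as np

def estimate_essential(prev_norm, cur_norm):
    """8-point estimate of E: for every match, x_prev^T @ E @ x_cur = 0,
    with x = [u, v, 1] on the normalized plane.
    prev_norm, cur_norm: (N, 2) arrays of normalized-plane coordinates, N >= 8."""
    rows = []
    for (u1, v1), (u2, v2) in zip(prev_norm, cur_norm):
        x_prev = np.array([u1, v1, 1.0])
        x_cur = np.array([u2, v2, 1.0])
        rows.append(np.outer(x_prev, x_cur).ravel())    # coefficients of the 9 entries of E
    _, _, vt = np.linalg.svd(np.asarray(rows))
    E = vt[-1].reshape(3, 3)                            # null-space solution, up to scale
    # enforce the two-equal-singular-values structure of an essential matrix
    u, s, vt = np.linalg.svd(E)
    return u @ np.diag([1.0, 1.0, 0.0]) @ vt

def triangulate_depth(x_prev, x_cur, R, t):
    """Depth of the current-frame point from s_cur * x_cur = s_prev * R @ x_prev + t,
    solved in least squares for (s_cur, s_prev)."""
    x_prev = np.array([x_prev[0], x_prev[1], 1.0])
    x_cur = np.array([x_cur[0], x_cur[1], 1.0])
    A = np.column_stack([x_cur, -(R @ x_prev)])         # 3x2 system A @ [s_cur, s_prev] = t
    (s_cur, s_prev), *_ = np.linalg.lstsq(A, np.asarray(t, float), rcond=None)
    return s_cur
```

The least-squares solve treats the two depths as the unknowns of an overdetermined 3x2 linear system, which is one common way of triangulating a point from two normalized rays and a known relative pose.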
9. The method of claim 5, wherein the source image frames are binocular image frames originating from a binocular camera and the imaged scene is non-coplanar;
the performing image preprocessing on the source image frames to obtain target image frames comprises performing image preprocessing on a first source image frame from the first camera and a second source image frame from the second camera respectively to obtain a first target image frame and a second target image frame as binocular target image frames;
performing feature extraction based on the target image frame to obtain feature points of the target image frame, wherein the feature points of the first target image frame and the feature points of the second target image frame are respectively extracted;
judging whether the target image frame is the first frame or not comprises judging whether the binocular target image frame is the first frame or not, if so, taking any frame in the binocular target image frame as a key frame, and otherwise, determining whether any frame in the target image frame is the key frame or not according to the key frame condition;
the calculating the spatial location information of the matching feature points in the current key frame comprises,
for each matching feature point in the current keyframe:
taking the current key frame as a first frame of the current binocular target image frames, taking the other target image frame of the current binocular target image frames as a second frame, matching the matched feature point in the first frame with the feature points in the second frame, and, if the matching is successful, obtaining the matched feature point in the second frame, wherein
The x coordinate of the matched feature point in the current keyframe is: dividing the product of the pixel abscissa of the normalization plane of the matching feature point in the current key frame and the length of the binocular baseline by the absolute value of the difference between the pixel abscissa of the normalization plane of the matching feature point in the current key frame and the pixel abscissa of the normalization plane of the matching feature point in the second frame;
the y coordinate of the matching feature point in the current key frame is: dividing the product of the pixel ordinate of the matching feature point normalization plane in the current key frame and the length of the binocular baseline by the absolute value of the difference between the pixel abscissa of the matching feature point normalization plane in the current key frame and the pixel abscissa of the matching feature point normalization plane in the second frame;
the z-coordinate of the matched feature point in the current keyframe is: the product of the camera focal length and the binocular baseline length is divided by the absolute value of the difference between the pixel abscissa of the matched feature point normalized plane in the current keyframe and the pixel abscissa of the matched feature point normalized plane in the second frame.
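The binocular back-projection of claim 9 reduces to standard stereo triangulation from horizontal disparity; the sketch below writes it with pixel coordinates and pixel disparity (the claim states the same relation on the normalized plane), and the principal-point handling is an assumption of this sketch.

```python
def stereo_back_projection(u_left, v_left, u_right, focal_length, baseline,
                           cx=0.0, cy=0.0, min_disparity=1e-6):
    """Binocular back-projection: depth from the left/right horizontal
    disparity of a matched feature point; pixel coordinates are assumed to
    be relative to the principal point (cx, cy) when cx = cy = 0."""
    disparity = abs(u_left - u_right)
    if disparity < min_disparity:
        return None                                   # point at (near-)infinite depth
    z = focal_length * baseline / disparity
    x = (u_left - cx) * z / focal_length
    y = (v_left - cy) * z / focal_length
    return x, y, z
```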
10. The method according to any one of claims 2 to 9, wherein the performing the graph optimization on the key frame positions by using a least square method according to the closed-loop key frame with the closed-loop constraint comprises,
identifying a closed loop key frame in the first visual point cloud map according to the manual mark or key frame similarity calculation;
calculating the relative pose between the closed-loop key frames based on the closed-loop key frames to serve as closed-loop constraint;
and constructing a second objective function for optimizing the key frame pose graph, and solving the key frame pose when the second objective function obtains the minimum value by adopting a least square method.
11. The method of claim 10, wherein identifying closed-loop keyframes in the first visual point cloud map based on artificial markers or keyframe similarity calculations comprises,
acquiring key frames with the same identification at different times as closed-loop key frames;
alternatively,
calculating whether the similarity between the two key frames is greater than a set similarity threshold, if so, judging the two key frames to be closed-loop key frames, wherein the similarity comprises the similarity on the distribution of the feature points and the similarity of the image pixels;
the computing of relative poses between closed-loop keyframes as closed-loop constraints based on the closed-loop keyframes includes,
calculating matching feature points in the closed-loop key frame based on the closed-loop key frame,
for any matched feature point in the closed-loop key frames, calculating a relative pose as an initial value according to the relationship that the pixel coordinate vector of the matched feature point in the first closed-loop key frame equals the rotation matrix of the relative pose multiplied by the pixel coordinate vector of the matched feature point in the second closed-loop key frame, plus the translation vector of the relative pose;
constructing a first objective function that accumulates the pixel position errors of all matched feature points in the closed-loop key frames, substituting the pixel coordinate vectors of all matched feature points into the first objective function, and iteratively solving for the relative pose at which the first objective function attains its minimum value;
the second objective function for optimizing the key frame pose graph is constructed, the key frame pose when the second objective function obtains the minimum value is solved by adopting a least square method, and the method comprises the following steps of,
constructing a second objective function for accumulating the pose errors of the first key frame and the second key frame according to the relative pose errors between the pose of any first key frame and the pose of any second key frame,
and taking the inter-frame relative pose error of the first key frame and the second key frame as an initial value, taking the closed-loop constraint as a constraint, and iteratively solving the pose of the first key frame and the pose of the second key frame when the second objective function obtains the minimum value.
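To show how closed-loop constraints enter the pose-graph optimization of claims 10 and 11, the following deliberately simplified sketch optimizes 2D (x, y, yaw) key-frame poses with scipy's least_squares; the claimed method operates on full camera poses, so this is only a structural illustration, and all names here are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import least_squares

def relative_pose_2d(pose_a, pose_b):
    """Pose of b expressed in a's frame; poses are (x, y, yaw)."""
    dx, dy = pose_b[:2] - pose_a[:2]
    c, s = np.cos(pose_a[2]), np.sin(pose_a[2])
    return np.array([c * dx + s * dy, -s * dx + c * dy, pose_b[2] - pose_a[2]])

def optimize_pose_graph(initial_poses, constraints):
    """initial_poses: (N, 3) key-frame poses. constraints: list of
    (i, j, measured_relative_pose) including the closed-loop constraints.
    Minimizes the accumulated relative-pose errors; the first pose is fixed."""
    initial_poses = np.asarray(initial_poses, float)

    def residuals(flat):
        poses = np.vstack([initial_poses[0], flat.reshape(-1, 3)])   # anchor pose 0
        res = []
        for i, j, measured in constraints:
            err = relative_pose_2d(poses[i], poses[j]) - np.asarray(measured, float)
            err[2] = np.arctan2(np.sin(err[2]), np.cos(err[2]))      # wrap the angle error
            res.append(err)
        return np.concatenate(res)

    result = least_squares(residuals, initial_poses[1:].ravel())
    return np.vstack([initial_poses[0], result.x.reshape(-1, 3)])
```

A closed-loop constraint is simply one more (i, j, measured relative pose) entry linking two closed-loop key frames; fixing the first pose removes the gauge freedom so the least-squares problem has a unique solution.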
12. The method of claim 11, wherein said calculating whether the similarity between two key frames is greater than a set similarity threshold comprises,
at the root node, all the characteristic points are clustered into k classes by using a k-means clustering algorithm to obtain a first-layer node,
for each node of the first layer, clustering the characteristic points belonging to the node into k classes to obtain the next layer node,
and so on until the final leaf layer, obtaining a visual dictionary that contains N word feature points and branches into k child nodes at each level,
wherein there are d layers from the root node to the leaf layer, and the leaf layer contains the word feature points of the dictionary; k, d and N are all natural numbers, and N is the total number of feature points in the dictionary;
for any key frame, calculating the weight of the word feature point according to the quantity of all features in the key frame and the occurrence frequency of any word feature point, describing the key frame as a set taking each word feature point and the weight thereof as elements, wherein the set comprises N elements,
calculating the similarity of the first key frame and the second key frame according to all elements in the set described by the first key frame and all elements in the set described by the second key frame,
and if the similarity is greater than the set similarity threshold, judging that closed-loop constraint exists between the two key frames.
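A much-reduced sketch of the claim-12 similarity computation: a single-level k-means vocabulary (standing in for the k-branch, d-level tree), term-frequency word weights, and an L1-based score, which is one common bag-of-words similarity choice. scikit-learn's KMeans and the treatment of descriptors as real vectors are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_flat_vocabulary(all_descriptors, k=100):
    """Single-level k-means stand-in for the vocabulary tree of claim 12."""
    return KMeans(n_clusters=k, n_init=10).fit(np.asarray(all_descriptors, float))

def bow_vector(vocabulary, frame_descriptors):
    """Describe a key frame by its word weights: each word weighted by its
    frequency among the frame's features (a plain term-frequency weight)."""
    words = vocabulary.predict(np.asarray(frame_descriptors, float))
    counts = np.bincount(words, minlength=vocabulary.n_clusters)
    return counts / max(counts.sum(), 1)

def similarity(v1, v2):
    """An L1-based score in [0, 1] between two key-frame descriptions."""
    return 1.0 - 0.5 * np.abs(v1 / max(np.abs(v1).sum(), 1e-12)
                              - v2 / max(np.abs(v2).sum(), 1e-12)).sum()
```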
13. The method of any one of claims 2 to 9, wherein said optimizing the spatial location information of the map points based on the reprojection error comprises,
for any key frame, constructing a third objective function on the reprojection error according to the reprojection error, under the pose of the key frame, of any map point relative to the pixel position at which that map point is observed in the key frame,
using the initial value of the reprojection error, iteratively solving for the spatial position information of the map point at which the third objective function attains its minimum value,
wherein, the initial value of the reprojection error is: the difference between the pixel position of the map point in the key frame and the position of the re-projected map point in the image,
and the position of the re-projection of the map point in the image is obtained according to the camera internal parameters, the pose of the key frame and the spatial position of the map point.
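As an illustration of the reprojection-error optimization of claim 13, the sketch below refines the spatial position of a single map point from its observations in several key frames, holding the key-frame poses fixed; the pinhole projection model and the function names are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import least_squares

def project(point_w, R, t, fx, fy, cx, cy):
    """Reproject a world-frame map point into a key frame with pose (R, t)
    (world-to-camera) and pinhole intrinsics; returns the pixel position."""
    p_c = R @ point_w + t
    return np.array([fx * p_c[0] / p_c[2] + cx, fy * p_c[1] / p_c[2] + cy])

def refine_map_point(initial_point, observations, intrinsics):
    """Minimize the accumulated reprojection error of one map point over the
    key frames that observe it. observations: list of (R, t, observed_pixel)."""
    fx, fy, cx, cy = intrinsics

    def residuals(point_w):
        return np.concatenate([project(point_w, R, t, fx, fy, cx, cy) - uv
                               for R, t, uv in observations])

    return least_squares(residuals, np.asarray(initial_point, float)).x
```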
14. A construction device of a visual point cloud map, characterized by comprising a first visual point cloud map construction unit, wherein the unit comprises,
the characteristic extraction module is used for extracting the characteristics of the source image frames acquired by the space of the map to be built to obtain the characteristic points of the source image frames,
a map point generating module for performing inter-frame tracking on the source image frame, determining the key frame, matching the feature points in the current key frame with the feature points in the previous key frame to obtain the matching feature points of the current key frame, calculating the spatial position information of the matching feature points in the current key frame, using the spatial position information of the matching feature points as the map point information of the current key frame,
the point cloud formed by the map point sets of all the key frames is a first visual point cloud map.
15. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for constructing a visual point cloud map according to any one of claims 1 to 13.