CN110335319B

CN110335319B - Semantic-driven camera positioning and map reconstruction method and system

Info

Publication number: CN110335319B
Application number: CN201910557726.8A
Authority: CN
Inventors: 桑农; 王玘; 高常鑫
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2019-06-26
Filing date: 2019-06-26
Publication date: 2022-03-18
Anticipated expiration: 2039-06-26
Also published as: CN110335319A

Abstract

The invention discloses a semantic-driven camera positioning and map reconstruction method, and belongs to the technical field of computer vision. Firstly, performing semantic segmentation on the feature points of the current frame image; then, according to the similarity and the semantic category, matching all the feature points in the current frame and the key frame by adopting a similar matching method to obtain matching pairs; initializing the camera attitude through all matching in the current frame and the key frame; then, updating the feature point matching pairs by adopting a three-dimensional projection method in combination with semantic judgment; updating all feature point matching pairs by utilizing attitude minimization; finally, constructing a three-dimensional map by using the camera attitude; the invention also realizes a semantic-driven camera positioning and map reconstruction system. According to the technical scheme, not only are multiple processes performed in the camera positioning stage, but also point cloud constraints are performed in the reconstruction stage, so that semantic segmentation and a camera positioning and reconstruction system are combined more closely, and a more accurate positioning result and a more complete reconstruction effect are obtained.

Description

Semantic-driven camera positioning and map reconstruction method and system

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a semantic-driven camera positioning and map reconstruction method.

Background

Currently, camera localization and reconstruction techniques are not combined or not tightly combined with semantic segmentation techniques.

For the correlation algorithm without combining semantic segmentation, on one hand, the method is difficult to cope with various environments, such as dynamic scenes and weak texture scenes. On the other hand, the map models reconstructed by these algorithms are often composed of point clouds or landmarks, and are maps based on geometric information, so that they cannot provide any high-level understanding of the surrounding environment.

For the correlation algorithm combined with semantic segmentation, class labels are generally attached to identification objects, optimization for removing the influence of dynamic objects is performed, but the result of semantic segmentation is not fully utilized, and then the semantic segmentation is tightly integrated into a positioning and map reconstruction technical system.

Disclosure of Invention

Aiming at the defects or improvement requirements of the prior art, the invention provides a semantic-driven camera positioning and map reconstruction method, aiming at optimizing the matching of characteristic points, the optimization of reprojection errors, the constraint of reconstructed point clouds and the detection of loop in the camera positioning and reconstruction process by utilizing semantic information, so that the accuracy of camera positioning is higher, and the reconstruction contains high-level understanding and is more complete.

In order to achieve the above object, the present invention provides a semantic-driven camera positioning and map reconstructing method, which comprises the following steps:

(1) extracting feature points of the current frame image, performing semantic segmentation on the current frame image by using the built full convolution neural network, and obtaining corresponding semantic categories by using each feature point;

(2) according to the similarity and the semantic category, matching all the feature points in the current frame and the key frame by adopting a similar matching method to obtain feature point matching pairs;

the similar matching method specifically comprises the following substeps:

(21) acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;

(22) calculating the point cloud main direction of each object in the same class of objects in the current frame and the key frame, wherein if the difference value of the point cloud main directions of a certain object in the current frame and a certain object in the key frame is smaller than a set threshold value, the two objects are matched pairs of the objects;

(23) carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;

(3) initializing the camera attitude through matching of all feature points in the current frame and the key frame;

(4) calculating by utilizing the camera attitude to obtain a three-dimensional point corresponding to the matched feature point d in the current frame, projecting the three-dimensional point to the current frame by utilizing camera intrinsic parameters, judging whether the projected point is in an object region where the feature point d is located, if not, searching a new matched point of the feature point d in the unmatched feature points of the key frame by adopting a similar matching method to form a new matched pair;

(5) updating all the feature point matching pairs by using the step (4), and updating the camera attitude by minimizing the following formula:

wherein, exp (ξ)^∧) A lie algebraic representation representing a camera pose; n represents the number of matching pairs of the feature points; u. of_iRepresenting the image coordinates of the ith feature point matching pair in the current frame; s_iRepresents the ith scale factor; p is a radical of_iRepresenting the image coordinates of the ith matching pair in the key frame;

(6) and constructing a three-dimensional map by using the new camera attitude, acquiring the appearance characteristics of the object according to the semantic category of the object in the three-dimensional map, and deleting the three-dimensional points which do not accord with the appearance characteristics of the object in the object.

Further, the method further comprises the steps of:

(7) further judging whether a loop exists in the current frame or not by utilizing the semantic category, the point cloud number and the point cloud main direction of the object in the current frame, and if so, eliminating an accumulated error by utilizing closed loop optimization;

(8) and optimizing the global key map by using a nonlinear least square map optimization method, and finally performing global optimization.

Further, the step (23) specifically includes:

centering by matching objectsThe feature point sets of the areas where the two objects are located are respectively as follows:

(231) selecting a feature point ai from the set A, and sequentially calculating the similarity between the feature point ai and all feature points in the set B; if the similarity between one feature point bj and the feature point ai in the set B is the maximum and is greater than the set similarity threshold, bj and ai are feature point matching pairs,

(232) another feature point is selected from the set a, and the step (231) is repeated until all matching pairs of all feature points of the set a are found.

Further, the step (3) is specifically:

(31) calculating an essential matrix E by using an eight-point method;

(32) decomposing the essential matrix through SVD (Singular Value Decomposition) to obtain four possible solutions, namely the camera attitude;

(33) and calculating three-dimensional point cloud according to each possible camera pose and the feature point matching pair, wherein if the position of the point cloud conforms to the camera imaging model, the corresponding camera pose is the initialized camera pose.

Further, the step (7) of determining whether a loop exists in the current frame specifically includes the following sub-steps:

(41) detecting a candidate loopback frame through a Bag of words (BOW);

(42) comparing the semantic categories of the detected candidate loopback frames with the current frame again, and finding out the candidate loopback frames with the same number and the same semantic categories;

(43) comparing the number of the reconstructed point clouds of the candidate loopback frames again, and storing the candidate loopback frames with the similarity larger than a set threshold;

(44) finally, comparing the main direction of the point cloud reconstructed by the current frame and each candidate loopback frame, and reserving the candidate loopback frame which is greater than the similarity threshold value, namely, a loopback;

(45) cumulative errors are eliminated using closed loop optimization.

Further, the elimination of the accumulated error by using the closed-loop optimization in the step (45) specifically includes the following sub-steps:

(451) solving the transformation between the two frames by calculating the matching pair between the current key frame and the loop key frame;

(452) and if the matching pair of the feature points meets the correction threshold, performing closed-loop correction, and calculating the correct posture of each key frame by using a propagation algorithm.

Further, the step (8) is specifically:

(81) taking the pose and the point cloud of each key frame as a vertex;

(82) establishing a constraint edge between the vertexes, wherein the constraint edge is the relative motion estimation between two pose nodes, the mapping constraint between the point cloud and the camera and the semantic constraint between the point clouds;

(83) and (4) using the vertex as an optimization variable and the edge as a constraint term, and solving the optimal vertex meeting the constraint by using a Gauss-Newton method (GAUSS-NEWTON), namely solving the optimized camera attitude and the point cloud position.

Further, the selection criteria of the key frame are as follows: the key frame is created if one of the following conditions is met:

determining the Nth frame after the previous round of map reconstruction as a new key frame;

after inserting the previous key frame, determining the new key frame by N frames;

and if the number of the tracked feature point matching pairs of the current frame is less than ninety percent of the number of the feature point matching pairs of the reference key frame, determining the current frame as a new key frame.

According to another aspect of the present invention, there is provided a semantic-driven camera localization and mapping system, comprising:

the first module is used for extracting the feature points of the current frame image, performing semantic segmentation on the current frame image by using the built full convolution neural network, and obtaining corresponding semantic categories by using each feature point;

the second module is used for matching all the feature points in the current frame and the key frame by adopting a similar matching method according to the similarity and the semantic category to obtain a feature point matching pair;

the second module comprises a similar matching unit, and the similar matching unit comprises the following parts:

the first subunit is used for acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;

the second subunit is used for calculating the point cloud main direction of each object in the same class of objects in the current frame and the key frame, and if the difference value of the point cloud main directions of a certain object in the current frame and a certain object in the key frame is smaller than a set threshold value, the two objects are an object matching pair;

the third subunit is used for carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;

the third module is used for initializing the camera attitude through matching of all feature points in the current frame and the key frame;

the fourth module is used for calculating by utilizing the camera attitude to obtain a three-dimensional point corresponding to the matched feature point d in the current frame, projecting the three-dimensional point to the current frame by utilizing camera intrinsic parameters, judging whether the projection point is in an object region where the feature point d is located, and if not, searching a new matched point of the feature point d in the unmatched feature points of the key frame by adopting a similar matching method to form a new matched pair;

a fifth module for updating all the feature point matching pairs using the fourth module, and then updating the camera pose by minimizing:

and the sixth module is used for constructing a three-dimensional map by using the new camera posture, acquiring the appearance characteristics of the object according to the semantic category of the object in the three-dimensional map, and deleting the three-dimensional points which do not accord with the appearance characteristics of the object in the object.

Further, the system further comprises:

the seventh module is used for further judging whether the current frame has a loop or not by utilizing the semantic type, the point cloud number and the point cloud main direction of the object in the current frame, and if so, eliminating the accumulated error by utilizing closed loop optimization;

and the eighth module is used for optimizing the global key map by using a nonlinear least square map optimization method and finally performing global optimization.

Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:

(1) the method adopts the similar matching method based on semantic segmentation to match the feature points, utilizes the semantic label information of each frame image, matches the direction of the object after semantic classification and classification as constraint, increases the constraint condition in the matching process, reduces the range of the feature point matching, thereby saving the time of the feature point matching, reducing a plurality of wrong feature point matching pairs, improving the matching precision and providing a good computing environment for the estimation of the camera attitude;

(2) the method adopts a semantic-based re-projection optimization method, combines semantic information of each frame of image, increases constraint conditions of re-projection points, filters a part of wrong re-projection points, improves the re-projection optimization efficiency, further improves the accuracy of camera attitude optimization due to the removal of a part of wrong re-projection points, enables the camera to track more accurately, and is not easy to drift due to overlarge errors;

(3) the method adopts a semantic-based graph optimization method, utilizes the geometric information of objects segmented by semantics, optimizes the camera pose and the point cloud according to the transformation between the camera poses and the mapping between the point cloud and the camera pose, also restricts the position between the point cloud and the point cloud through geometric constraint, and indirectly influences the optimization and adjustment of the camera pose, thereby obtaining more accurate camera pose and point cloud

(4) The method adopts a semantic-based loop detection method, and the method takes the category number of semantic labels of each frame of image as a constraint item, and further judges after a series of candidate loop frames are found through the BOW, so that the found loop frames are more similar to the current frame, the loop accuracy is higher, and the loop optimization error elimination is more accurate.

Drawings

FIG. 1 is a general flow diagram of the process of the present invention;

FIG. 2 is a schematic diagram of four possible camera poses obtained by decomposing an essential matrix using SVD in the method of the present invention;

FIG. 3 is a schematic view of a three-dimensional point projection in the method of the present invention;

FIG. 4 is a schematic diagram of global optimization in the method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

As shown in fig. 1, the method of the present invention comprises the steps of:

the similar matching method specifically comprises the following substeps:

(5) updating all the feature point matching pairs by adopting the step (4), and updating the camera attitude by minimizing the following formula:

(6) constructing a three-dimensional map by using the new camera posture, acquiring the appearance characteristics of the object according to the semantic category of the object in the three-dimensional map, and deleting the three-dimensional points which do not accord with the appearance characteristics of the object in the object;

(8) and optimizing the global key map by using a nonlinear least square map optimization method, and finally performing global BA optimization.

The method of the invention will now be described in connection with one embodiment of the invention:

1. semantic segmentation: extracting the feature points of the current frame image, performing semantic segmentation on the current frame image by using the built full convolution neural network, and obtaining corresponding semantic categories by using each feature point.

2. Tracking: the pose estimation of the current frame is optimized by finding the corresponding relation between the current frame and the local map as much as possible. The method specifically comprises the following steps:

ORB feature extraction and semantic segmentation: and setting the input frame as a current frame, extracting ORB feature points and corresponding ORB feature descriptors, putting the current frame into a segmentation network, and waiting until the prediction result is obtained.

b. Estimating camera motion: firstly, matching feature points of a current frame and a previous frame by using a similar matching method through the similarity and semantic category information of an ORB feature descriptor, and specifically comprising the following steps:

(b1) acquiring the positions of objects in the same category in the current frame and the previous frame according to the semantic category of the feature points;

(b2) calculating the descriptor similarity between every two feature points in the object positions of the same category, and storing the descriptor similarity as a final feature point matching pair when each group of highest similarity is obtained;

the camera pose is then predicted using the motion pattern. The motion model assumes that the camera moves at a constant speed, the pose of the current frame is estimated through the pose of the camera of the previous frame, the feature point matching relation between two frames and the speed, and if the number of the feature point matching pairs is lower than a threshold value, the key frame mode is changed. The method comprises the following steps of trying to match feature points with the nearest key frame, matching the current frame with all global key frames if the number of matching pairs of the current frame and the nearest key frame is still lower than a threshold value, searching the key frame with the highest number of matching pairs, and solving the pose of the camera by using a PnP algorithm, wherein the specific method comprises the following steps:

(bb1) computing the essential matrix E using an eight-point method;

(bb2) decomposing the essential matrix through SVD to obtain four possible solutions (rotation matrix and translation matrix), namely postures;

(bb3) calculating a three-dimensional point cloud according to each possible pose and the feature point matching pair, and determining which solution to select by judging the position of the point cloud, namely calculating the pose of the camera, as shown in fig. 2.

And then carrying out re-projection optimization based on semantic segmentation on the pose of the previous frame by using the matched feature points to obtain the pose of the current frame, wherein the specific method comprises the following steps:

if the feature points projected into the image fall into the place with different types of the feature points matched with the feature points in the original image, the re-projection of the pair of feature points is considered to be unqualified, the pair of matched pairs is removed, and the optimization of the objective function is not participated in. As shown in the figure, the corresponding two-dimensional image point of the P space point is P1, and in the feature matching stage, the feature point P1 of the previous frame is matched with the feature point P2 of the current frame, so that it is considered that P should be projected to the position of P2, however, because of the error of the camera pose estimation, the drop point is not at the position of P2, but falls to P'. If the semantic label of the p' pixel point is different from p1, it is determined that the p1 and p2 match incorrectly, so that the pair of matching points is eliminated and no longer participates in the estimation of the camera motion, as shown in fig. 3.

For all the reserved reprojection points, calculating the distance between the projection point and the feature point matched with the projection point in the same image, and minimizing all the distances to update the camera posture:

wherein, exp (ξ)^∧) A lie algebraic representation representing a camera pose; n represents the number of matching pairs of the feature points; u. of_iRepresenting the image coordinates of the ith pair of matching pairs in the current frame; s_iRepresents the ith scale factor; p is a radical of_iRepresenting the image coordinates in the current frame tracking frame of the ith pair of matching pairs;

c. tracking a local map: and finding key frames which have a common three-dimensional space point with the current frame in the local map and key frames adjacent to the key frames. And projecting the three-dimensional points correspondingly projected to the three-dimensional space in the key frame into the current frame, updating and matching the three-dimensional points with the feature points in the current frame, and finally optimizing the camera pose again by using all matched pairs, wherein the optimization mode is the same as that in the previous step.

d. And (3) key frame judgment: the key frame is created if one of the following conditions is met: the number of pairs of feature points tracked by the current frame is less than ninety percent of the number of matching pairs of the reference key frame from the last globally repositioned 15 frames and from the last key frame inserted by 15 frames. (the reference key frame is the key frame which has the most common observation three-dimensional point with the current frame) if the condition is not met, the clustering adjustment is carried out, and the posture of the previous key frame is optimized.

3. And (3) semantic label fusion: after the key frame is created, the key frame with the common view degree higher than a certain threshold value with the current key frame in the local map is used for updating the semantic label probability corresponding to each pixel in the current key frame. The degree of common vision is determined by the number of matching pairs between two frames and the number of the same point of the three-dimensional space observed together.

4. Local map building: after the semantic labels are fused, the current key frame is inserted into a local map, redundant three-dimensional space points and key frames are filtered, and finally local clustering adjustment is carried out.

a. Inserting the key frame: and adding the pose of the key frame as a node into the pose graph, and adding an optimized edge of the key frame which has the same observation three-dimensional space point with the current key frame.

b. Local bundling adjustment: and putting the current key frame, the adjacent key frames, the key frames with the common observation three-dimensional points and the corresponding three-dimensional space points into a pose graph for optimization. Each key frame is examined and rejected if ninety percent of the feature points are observed by more than three other key frames.

5. Loop detection: if the number of the key frames in the map is less than 10, loop detection is not carried out, if the number of the key frames in the map is more than 10, the key frames with a common BoW word with the current key frame are searched in the map, then the number of the words which are the most common with the BoW of the current key frame is counted, eighty percent of the number is used as a threshold value, and the key frames with the number of the words which is more than the threshold value are searched and used as candidate key frames. Comparing the semantic categories of a series of detected candidate loop frames with the current frame again, finding out candidate loop frames with the same number and the same semantic categories, comparing the reconstructed point cloud number of the candidate loop frames again, storing the candidate loop frames with the similarity larger than a certain threshold, comparing the main direction of the point cloud reconstructed by the current frame and each candidate loop frame, keeping the candidate loop frames with the similarity larger than a certain degree, namely, the loop, calculating the matching pair between the current key frame and the loop key frame, solving the transformation between the two frames, performing closed-loop correction if the matching pair of the feature points meets the number, and calculating the correct transformation value of each key frame by using a propagation algorithm.

6. And finally, carrying out graph optimization and global optimization.

(1) Taking the pose and the point cloud of each key frame as a vertex;

(2) establishing a constraint edge between the vertexes, wherein the constraint edge is a relative motion estimation (marked as T) between two pose nodes, a mapping constraint (marked as M) between the point cloud and the camera, and a semantic constraint (shown in the figure) between the point clouds;

(3) the vertex is used as an optimization variable, the edge is used as a constraint item, and the optimal vertex meeting the constraint is solved by using an L-M method, namely the optimized camera attitude and the point cloud position are solved, as shown in FIG. 4.

A semantic-driven camera localization and mapping system is further described with reference to specific embodiments, the system comprising the following components:

wherein, exp (ξ)^∧) A lie algebraic representation representing a camera pose; n represents the number of matching pairs of the feature points; u. of_iRepresenting the image coordinates of the ith feature point matching pair in the current frame; s_iRepresents the ith scale factor; p is a radical of_iGraph representing the ith matching pair in key frameImage coordinates;

It will be appreciated by those skilled in the art that the foregoing is only a preferred embodiment of the invention, and is not intended to limit the invention, such that various modifications, equivalents and improvements may be made without departing from the spirit and scope of the invention.

Claims

1. A semantic-driven camera positioning and map reconstruction method is characterized by specifically comprising the following steps:

the similar matching method specifically comprises the following substeps:

the step (3) is specifically as follows:

(31) calculating an essential matrix E by using an eight-point method;

(32) decomposing the essential matrix through SVD to obtain four possible solutions, namely the camera attitude;

(33) calculating three-dimensional point cloud according to each possible camera pose and the feature point matching pair, wherein if the position of the point cloud conforms to the camera imaging model, the corresponding camera pose is an initialized camera pose;

wherein exp (ξ ^) represents a lie algebraic representation of the camera pose; n represents the number of matching pairs of the feature points; u. of_iRepresenting the image coordinates of the ith feature point matching pair in the current frame; s_iRepresents the ith scale factor; p is a radical of_iRepresenting the image coordinates of the ith matching pair in the key frame;

2. The semantically driven camera localization and mapping method according to claim 1, wherein said step (23) specifically comprises:

setting the feature point sets of the areas where the two objects are located in the object matching pair as follows:

3. The semantic-driven camera positioning and map reconstructing method according to claim 1, wherein the step (7) of determining whether the current frame has a loop specifically comprises the following sub-steps:

(41) detecting candidate loop frames through a bag-of-words model;

(45) cumulative errors are eliminated using closed loop optimization.

4. The semantically driven camera localization and mapping method according to claim 3, wherein the step (45) of eliminating the accumulated error by using closed loop optimization specifically comprises the following sub-steps:

5. The semantically driven camera positioning and map reconstructing method according to claim 1, wherein said step (8) is specifically:

(81) taking the pose and the point cloud of each key frame as a vertex;

(83) and (4) using the vertex as an optimization variable and the edge as a constraint item, solving the optimal vertex meeting the constraint by using a Gauss-Newton method, namely solving the optimized camera attitude and the point cloud position.

6. The method for semantically driving camera positioning and map reconstruction as claimed in any one of claims 1 to 5, wherein the key frame is selected according to the following criteria: the key frame is created if one of the following conditions is met:

7. A semantically driven camera localization and mapping system, comprising:

the sixth module is used for constructing a three-dimensional map by using the new camera posture, acquiring the appearance characteristics of the object according to the semantic category of the object in the three-dimensional map, and deleting the three-dimensional points which do not accord with the appearance characteristics of the object in the object;