CN110335319B - Semantic-driven camera positioning and map reconstruction method and system - Google Patents
Publication number: CN110335319B; application number: CN201910557726.8A
Legal status: Active
Classifications
- G06F18/22: Pattern recognition; analysing; matching criteria, e.g. proximity measures
- G06N3/045: Neural networks; architectures; combinations of networks
- G06T17/05: Three-dimensional [3D] modelling; geographic models
- G06T7/11: Image analysis; segmentation; region-based segmentation
- G06T7/80: Image analysis; analysis of captured images to determine intrinsic or extrinsic camera parameters (camera calibration)
- G06T2207/20084: Indexing scheme for image analysis; artificial neural networks [ANN]
Abstract
The invention discloses a semantic-driven camera positioning and map reconstruction method, belonging to the technical field of computer vision. First, semantic segmentation is performed on the current frame image and each extracted feature point is assigned a semantic category. Then, according to descriptor similarity and semantic category, all feature points of the current frame and the key frame are matched with a similarity-matching method to obtain matching pairs, and the camera pose is initialized from all matches between the current frame and the key frame. Next, the feature point matching pairs are updated by a three-dimensional projection method combined with a semantic check, and the camera pose is updated by minimizing the reprojection error over all matching pairs. Finally, a three-dimensional map is constructed from the camera pose. The invention also provides a semantic-driven camera positioning and map reconstruction system. In this technical scheme, semantic information drives several stages of camera positioning and also constrains the point cloud in the reconstruction stage, so semantic segmentation is combined more closely with the camera positioning and reconstruction system, yielding a more accurate positioning result and a more complete reconstruction.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a semantic-driven camera positioning and map reconstruction method.
Background
Currently, camera positioning and reconstruction techniques are either not combined with semantic segmentation techniques at all, or only loosely combined.
Algorithms that do not incorporate semantic segmentation, on the one hand, have difficulty coping with challenging environments such as dynamic scenes and weakly textured scenes. On the other hand, the maps reconstructed by these algorithms usually consist of point clouds or landmarks, that is, maps based purely on geometric information, so they cannot provide any high-level understanding of the surrounding environment.
Algorithms that do incorporate semantic segmentation generally attach class labels to recognized objects and perform optimizations that remove the influence of dynamic objects, but they do not fully exploit the semantic segmentation results, nor do they tightly integrate semantic segmentation into the positioning and map reconstruction pipeline.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a semantic-driven camera positioning and map reconstruction method, which aims to use semantic information to optimize feature point matching, reprojection error optimization, reconstructed point-cloud constraints and loop detection throughout the camera positioning and reconstruction process, so that camera positioning is more accurate and the reconstruction carries high-level understanding and is more complete.
To achieve the above object, the present invention provides a semantic-driven camera positioning and map reconstruction method, comprising the following steps:
(1) extracting feature points from the current frame image, performing semantic segmentation on the current frame image with a pre-built fully convolutional neural network, and assigning each feature point its corresponding semantic category;
(2) matching all feature points of the current frame and the key frame with a similarity-matching method according to descriptor similarity and semantic category, to obtain feature point matching pairs;
the similar matching method specifically comprises the following substeps:
(21) acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;
(22) calculating the point-cloud principal direction of each object among the same-category objects in the current frame and the key frame; if the difference between the principal direction of an object in the current frame and that of an object in the key frame is smaller than a set threshold, the two objects form an object matching pair;
(23) carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;
(3) initializing the camera pose from all feature point matching pairs between the current frame and the key frame;
(4) using the camera pose to compute the three-dimensional point corresponding to a matched feature point d of the current frame, projecting this three-dimensional point into the current frame with the camera intrinsic parameters, and judging whether the projected point falls inside the object region containing feature point d; if not, searching among the unmatched feature points of the key frame, with the similarity-matching method, for a new match of feature point d to form a new matching pair;
(5) updating all feature point matching pairs as in step (4), and updating the camera pose by minimizing

$$\xi^{*} = \arg\min_{\xi} \frac{1}{2}\sum_{i=1}^{n}\left\| u_{i} - \frac{1}{s_{i}} K \exp(\xi^{\wedge}) P_{i} \right\|_{2}^{2}$$

where $\exp(\xi^{\wedge})$ is the Lie-algebra representation of the camera pose, $K$ the camera intrinsic matrix, $n$ the number of feature point matching pairs, $u_{i}$ the image coordinate of the i-th matching pair in the current frame, $s_{i}$ the i-th scale (depth) factor, and $P_{i}$ the three-dimensional space point of the i-th matching pair triangulated from the key frame;
(6) constructing a three-dimensional map with the new camera pose, obtaining the appearance characteristics of each object in the three-dimensional map according to its semantic category, and deleting the three-dimensional points inside an object that do not conform to its appearance characteristics.
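The point-cloud principal direction used in step (22) is not specified further in the text; a common choice is the dominant eigenvector of the point cloud's covariance matrix (PCA). A minimal sketch under that assumption (the function name and the sign convention are illustrative):

```python
import numpy as np

def principal_direction(points):
    """Principal direction of an object's point cloud via PCA: the eigenvector
    of the covariance matrix with the largest eigenvalue."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    cov = centered.T @ centered / len(pts)
    vals, vecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    d = vecs[:, -1]                       # eigenvector of the largest eigenvalue
    # fix the sign so directions computed in different frames are comparable
    return d if d[np.argmax(np.abs(d))] >= 0 else -d
```

Two objects would then form a matching pair in step (22) when, for example, the angle between their principal directions is below the set threshold.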
Further, the method further comprises the steps of:
(7) further judging whether the current frame closes a loop by using the semantic categories, the point-cloud counts and the point-cloud principal directions of the objects in the current frame, and if so, eliminating the accumulated error with closed-loop optimization;
(8) optimizing the global keyframe map with a nonlinear least-squares graph optimization method, and finally performing global optimization.
Further, the step (23) specifically includes:
denote the feature point sets of the regions where the two objects of an object matching pair are located as A and B, respectively; then:
(231) select a feature point ai from set A and compute its similarity with every feature point of set B in turn; if some feature point bj of set B has the highest similarity with ai and this similarity is greater than the set similarity threshold, then bj and ai form a feature point matching pair;
(232) select another feature point from set A and repeat step (231) until the matching pairs of all feature points of set A have been found.
Further, the step (3) is specifically:
(31) calculating an essential matrix E by using an eight-point method;
(32) decomposing the essential matrix by SVD (singular value decomposition) to obtain four possible solutions for the camera pose;
(33) computing a three-dimensional point cloud from each possible camera pose and the feature point matching pairs; the pose for which the point-cloud positions conform to the camera imaging model, i.e. the points lie in front of both cameras, is taken as the initialized camera pose.
Further, the step (7) of determining whether a loop exists in the current frame specifically includes the following sub-steps:
(41) detecting candidate loop frames with a bag-of-words (BoW) model;
(42) comparing the semantic categories of the detected candidate loop frames with those of the current frame, and keeping the candidate loop frames whose semantic categories agree in both kind and number;
(43) comparing the reconstructed point-cloud counts of the remaining candidate loop frames with that of the current frame, and keeping the candidates whose similarity is greater than a set threshold;
(44) finally, comparing the principal direction of the point cloud reconstructed from the current frame with that of each remaining candidate loop frame, and keeping the candidates above the similarity threshold; each retained candidate is a loop;
(45) cumulative errors are eliminated using closed loop optimization.
Further, the elimination of the accumulated error by using the closed-loop optimization in the step (45) specifically includes the following sub-steps:
(451) computing the matching pairs between the current key frame and the loop key frame, and solving the transformation between the two frames;
(452) if the number of feature point matching pairs meets the correction threshold, performing closed-loop correction, and computing the corrected pose of each key frame with a propagation algorithm.
Further, the step (8) is specifically:
(81) taking the pose of each key frame and the point cloud as vertices;
(82) establishing constraint edges between the vertices, namely the relative motion estimate between two pose nodes, the mapping constraint between point cloud and camera, and the semantic constraints between point clouds;
(83) taking the vertices as optimization variables and the edges as constraint terms, and solving for the optimal vertices satisfying the constraints with the Gauss-Newton method, i.e. obtaining the optimized camera poses and point-cloud positions.
Further, the selection criteria of the key frame are as follows: the key frame is created if one of the following conditions is met:
the N-th frame after the previous round of map reconstruction is determined as a new key frame;
the N-th frame after the previous key frame was inserted is determined as a new key frame;
if the number of feature point matching pairs tracked in the current frame is less than ninety percent of the number of matching pairs of the reference key frame, the current frame is determined as a new key frame.
According to another aspect of the present invention, there is provided a semantic-driven camera localization and mapping system, comprising:
the first module is used for extracting feature points from the current frame image, performing semantic segmentation on the current frame image with a pre-built fully convolutional neural network, and assigning each feature point its corresponding semantic category;
the second module is used for matching all feature points of the current frame and the key frame with a similarity-matching method according to descriptor similarity and semantic category, to obtain feature point matching pairs;
the second module comprises a similar matching unit, and the similar matching unit comprises the following parts:
the first subunit is used for acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;
the second subunit is used for calculating the point-cloud principal direction of each object among the same-category objects in the current frame and the key frame; if the difference between the principal direction of an object in the current frame and that of an object in the key frame is smaller than a set threshold, the two objects form an object matching pair;
the third subunit is used for carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;
the third module is used for initializing the camera pose from all feature point matching pairs between the current frame and the key frame;
the fourth module is used for using the camera pose to compute the three-dimensional point corresponding to a matched feature point d of the current frame, projecting this three-dimensional point into the current frame with the camera intrinsic parameters, and judging whether the projected point falls inside the object region containing feature point d; if not, searching among the unmatched feature points of the key frame, with the similarity-matching method, for a new match of feature point d to form a new matching pair;
the fifth module is used for updating all feature point matching pairs with the fourth module, and then updating the camera pose by minimizing

$$\xi^{*} = \arg\min_{\xi} \frac{1}{2}\sum_{i=1}^{n}\left\| u_{i} - \frac{1}{s_{i}} K \exp(\xi^{\wedge}) P_{i} \right\|_{2}^{2}$$

where $\exp(\xi^{\wedge})$ is the Lie-algebra representation of the camera pose, $K$ the camera intrinsic matrix, $n$ the number of feature point matching pairs, $u_{i}$ the image coordinate of the i-th matching pair in the current frame, $s_{i}$ the i-th scale (depth) factor, and $P_{i}$ the three-dimensional space point of the i-th matching pair triangulated from the key frame;
the sixth module is used for constructing a three-dimensional map with the new camera pose, obtaining the appearance characteristics of each object in the three-dimensional map according to its semantic category, and deleting the three-dimensional points inside an object that do not conform to its appearance characteristics.
Further, the system further comprises:
the seventh module is used for further judging whether the current frame closes a loop by using the semantic categories, the point-cloud counts and the point-cloud principal directions of the objects in the current frame, and if so, eliminating the accumulated error with closed-loop optimization;
the eighth module is used for optimizing the global keyframe map with a nonlinear least-squares graph optimization method, and finally performing global optimization.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
(1) The method matches feature points with a similarity-matching method based on semantic segmentation. It uses the semantic label information of each frame image and takes the principal directions of semantically classified objects as an additional matching constraint. Adding this constraint narrows the search range of feature point matching, which saves matching time, removes many erroneous feature point matching pairs, improves matching precision, and provides a good basis for camera pose estimation.
(2) The method adopts semantics-based reprojection optimization. By combining the semantic information of each frame image, it adds a constraint on the reprojected points and filters out part of the erroneous ones, which improves the efficiency of reprojection optimization. Removing erroneous reprojected points further improves the accuracy of camera pose optimization, so the camera is tracked more accurately and is less prone to drift caused by excessive error.
(3) The method adopts semantics-based graph optimization. Using the geometric information of semantically segmented objects, it optimizes the camera poses and the point cloud according to the transformations between camera poses and the mapping between point cloud and camera; it also constrains the relative positions between point clouds through geometric constraints, which indirectly influences the optimization of the camera poses, yielding more accurate camera poses and point clouds.
(4) The method adopts semantics-based loop detection. It takes the number of semantic label categories of each frame image as a constraint term and re-checks the candidate loop frames found by the BoW model, so the selected loop frames are more similar to the current frame, the loop accuracy is higher, and the elimination of error by loop optimization is more precise.
Drawings
FIG. 1 is a general flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of four possible camera poses obtained by decomposing an essential matrix using SVD in the method of the present invention;
FIG. 3 is a schematic view of a three-dimensional point projection in the method of the present invention;
FIG. 4 is a schematic diagram of global optimization in the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, the method of the present invention comprises the steps of:
(1) extracting feature points from the current frame image, performing semantic segmentation on the current frame image with a pre-built fully convolutional neural network, and assigning each feature point its corresponding semantic category;
(2) matching all feature points of the current frame and the key frame with a similarity-matching method according to descriptor similarity and semantic category, to obtain feature point matching pairs;
the similarity-matching method specifically comprises the following substeps:
(21) acquiring the same-category objects of the current frame and the key frame according to the semantic categories of the feature points;
(22) calculating the point-cloud principal direction of each object among the same-category objects in the current frame and the key frame; if the difference between the principal direction of an object in the current frame and that of an object in the key frame is smaller than a set threshold, the two objects form an object matching pair;
(23) performing similarity matching on the feature points of the regions where the two objects of the object matching pair are located, to obtain the final feature point matching pairs;
(3) initializing the camera pose from all feature point matching pairs between the current frame and the key frame;
(4) using the camera pose to compute the three-dimensional point corresponding to a matched feature point d of the current frame, projecting this three-dimensional point into the current frame with the camera intrinsic parameters, and judging whether the projected point falls inside the object region containing feature point d; if not, searching among the unmatched feature points of the key frame, with the similarity-matching method, for a new match of feature point d to form a new matching pair;
(5) updating all feature point matching pairs as in step (4), and updating the camera pose by minimizing

$$\xi^{*} = \arg\min_{\xi} \frac{1}{2}\sum_{i=1}^{n}\left\| u_{i} - \frac{1}{s_{i}} K \exp(\xi^{\wedge}) P_{i} \right\|_{2}^{2}$$

where $\exp(\xi^{\wedge})$ is the Lie-algebra representation of the camera pose, $K$ the camera intrinsic matrix, $n$ the number of feature point matching pairs, $u_{i}$ the image coordinate of the i-th matching pair in the current frame, $s_{i}$ the i-th scale (depth) factor, and $P_{i}$ the three-dimensional space point of the i-th matching pair triangulated from the key frame;
(6) constructing a three-dimensional map with the new camera pose, obtaining the appearance characteristics of each object in the three-dimensional map according to its semantic category, and deleting the three-dimensional points inside an object that do not conform to its appearance characteristics;
(7) further judging whether the current frame closes a loop by using the semantic categories, the point-cloud counts and the point-cloud principal directions of the objects in the current frame, and if so, eliminating the accumulated error with closed-loop optimization;
(8) optimizing the global keyframe map with a nonlinear least-squares graph optimization method, and finally performing global BA optimization.
The method of the invention will now be described in connection with one embodiment of the invention:
1. Semantic segmentation: extract the feature points of the current frame image, perform semantic segmentation on the current frame image with the pre-built fully convolutional neural network, and assign each feature point its corresponding semantic category.
2. Tracking: the pose estimate of the current frame is refined by finding as many correspondences as possible between the current frame and the local map. This specifically comprises the following steps:
a. ORB feature extraction and semantic segmentation: take the input frame as the current frame, extract ORB feature points and the corresponding ORB descriptors, feed the current frame into the segmentation network, and wait for the prediction result.
b. Estimating camera motion: first, match the feature points of the current frame and the previous frame with the similarity-matching method, using the similarity of the ORB descriptors and the semantic category information. The specific steps are:
(b1) acquiring the positions of same-category objects in the current frame and the previous frame according to the semantic categories of the feature points;
(b2) calculating the descriptor similarity between every two feature points inside the same-category object regions, and storing each pair with the highest similarity as a final feature point matching pair;
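Substeps (b1) and (b2) can be sketched as follows. This is a minimal illustration assuming binary (ORB-style) descriptors compared by normalized Hamming similarity; the function name and the `min_sim` threshold are illustrative, not part of the patent:

```python
import numpy as np

def semantic_match(desc_a, cats_a, desc_b, cats_b, min_sim=0.8):
    """Match binary descriptors only between features of the same semantic category.

    desc_a, desc_b: (N, D) uint8 arrays of ORB-like descriptors.
    cats_a, cats_b: per-feature semantic category ids.
    Returns a list of (i, j) index pairs."""
    matches = []
    for i, (da, ca) in enumerate(zip(desc_a, cats_a)):
        # same-category constraint narrows the candidate set (substep b1)
        cand = [j for j, cb in enumerate(cats_b) if cb == ca]
        best_j, best_sim = -1, -1.0
        for j in cand:
            # similarity = fraction of identical bits (1 - normalized Hamming distance)
            diff = np.unpackbits(np.bitwise_xor(da, desc_b[j])).sum()
            sim = 1.0 - diff / (8.0 * da.size)
            if sim > best_sim:
                best_sim, best_j = sim, j
        # keep only the highest-similarity pair above the threshold (substep b2)
        if best_j >= 0 and best_sim >= min_sim:
            matches.append((i, best_j))
    return matches
```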
the camera pose is then predicted using the motion pattern. The motion model assumes that the camera moves at a constant speed, the pose of the current frame is estimated through the pose of the camera of the previous frame, the feature point matching relation between two frames and the speed, and if the number of the feature point matching pairs is lower than a threshold value, the key frame mode is changed. The method comprises the following steps of trying to match feature points with the nearest key frame, matching the current frame with all global key frames if the number of matching pairs of the current frame and the nearest key frame is still lower than a threshold value, searching the key frame with the highest number of matching pairs, and solving the pose of the camera by using a PnP algorithm, wherein the specific method comprises the following steps:
(bb1) computing the essential matrix E using an eight-point method;
(bb2) decompose the essential matrix by SVD to obtain four possible solutions (rotation matrix and translation vector), i.e. candidate poses;
(bb3) compute a three-dimensional point cloud from each candidate pose and the feature point matching pairs, and select the correct solution by checking the point-cloud positions, thereby obtaining the camera pose, as shown in Fig. 2.
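The SVD decomposition in step (bb2) yields the four candidate poses illustrated in Fig. 2. A sketch of the decomposition is given below; the cheirality check of step (bb3), which picks the candidate whose triangulated points lie in front of both cameras, is omitted here:

```python
import numpy as np

def decompose_essential(E):
    """Decompose an essential matrix into the four (R, t) candidates.
    Returns a list of (R, t) with det(R) = +1; t is defined up to scale."""
    U, _, Vt = np.linalg.svd(E)
    # enforce proper rotations (det = +1) on both orthogonal factors
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]                 # null direction of E^T
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```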
Semantic-segmentation-based reprojection optimization is then applied, starting from the pose of the previous frame and using the matched feature points, to obtain the pose of the current frame. The specific method is as follows:
if a feature point projected into the image falls in a region whose semantic category differs from that of the feature point it is matched with in the original image, the reprojection of this feature point pair is considered invalid: the matching pair is removed and does not participate in the optimization of the objective function. As shown in Fig. 3, the two-dimensional image point corresponding to space point P is p1; in the feature matching stage, feature point p1 of the previous frame is matched with feature point p2 of the current frame, so P should project to the position of p2. However, because of error in the camera pose estimate, the projection does not land at p2 but at p'. If the semantic label of pixel p' differs from that of p1, the match between p1 and p2 is judged incorrect, so this matching pair is eliminated and no longer participates in estimating the camera motion.
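The semantic reprojection check described above can be sketched as follows; `label_map` stands for the per-pixel semantic segmentation of the current frame, and the function name and return convention are illustrative:

```python
import numpy as np

def reprojection_label_check(P_w, R, t, K, label_map, expected_label):
    """Project world point P_w into the current frame with pose (R, t) and
    intrinsics K, then verify the semantic label at the landing pixel.
    Returns (u, v, ok); ok is False if the point is behind the camera,
    projects outside the image, or lands on a different semantic label."""
    P_c = R @ np.asarray(P_w, float) + t
    if P_c[2] <= 0:
        return None, None, False          # behind the camera
    uvw = K @ P_c
    u = int(round(uvw[0] / uvw[2]))
    v = int(round(uvw[1] / uvw[2]))
    h, w = label_map.shape
    if not (0 <= v < h and 0 <= u < w):
        return u, v, False                # outside the image
    return u, v, bool(label_map[v, u] == expected_label)
```

Matching pairs for which `ok` is False would be discarded before the pose optimization, as the text describes for the point p'.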
For all retained reprojection points, the distance between each projected point and its matched feature point in the same image is computed, and the camera pose is updated by minimizing the sum of these distances:

$$\xi^{*} = \arg\min_{\xi} \frac{1}{2}\sum_{i=1}^{n}\left\| u_{i} - \frac{1}{s_{i}} K \exp(\xi^{\wedge}) P_{i} \right\|_{2}^{2}$$

where $\exp(\xi^{\wedge})$ is the Lie-algebra representation of the camera pose, $K$ the camera intrinsic matrix, $n$ the number of feature point matching pairs, $u_{i}$ the image coordinate of the i-th matching pair in the current frame, $s_{i}$ the i-th scale (depth) factor, and $P_{i}$ the three-dimensional space point of the i-th matching pair.
c. Tracking the local map: find the key frames in the local map that share observed three-dimensional space points with the current frame, together with the key frames adjacent to them. Project the three-dimensional space points observed in those key frames into the current frame, update their matches with the feature points of the current frame, and finally optimize the camera pose again with all matching pairs, in the same way as in the previous step.
d. Key frame decision: a key frame is created when at least 15 frames have passed since the last global relocalization, at least 15 frames have passed since the last key frame was inserted, and the number of feature point matching pairs tracked in the current frame is less than ninety percent of the number of matching pairs of the reference key frame (the reference key frame is the key frame that shares the most commonly observed three-dimensional points with the current frame). If the conditions are not met, bundle adjustment is performed and the pose of the previous key frame is optimized.
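One reading of the key frame criteria in section d, as a minimal sketch (the 15-frame window and the ninety-percent ratio follow the text; the function name and parameterization are illustrative):

```python
def need_new_keyframe(frames_since_reloc, frames_since_last_kf,
                      tracked_matches, ref_kf_matches, n=15):
    """Key frame decision: enough frames since the last global relocalization
    and since the last inserted key frame, and tracking has decayed below
    ninety percent of the reference key frame's matching pairs."""
    return (frames_since_reloc >= n
            and frames_since_last_kf >= n
            and tracked_matches < 0.9 * ref_kf_matches)
```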
3. Semantic label fusion: after a key frame is created, the key frames in the local map whose co-visibility with the current key frame is higher than a threshold are used to update the semantic label probability of each pixel in the current key frame. The degree of co-visibility is determined by the number of matching pairs between the two frames and the number of three-dimensional space points they observe in common.
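The text does not specify the exact probability update rule for label fusion; one common choice is a normalized per-pixel product of the label distributions from the co-visible key frames. A sketch under that assumption:

```python
import numpy as np

def fuse_label_probs(prob_maps):
    """Fuse per-pixel semantic label probabilities from co-visible key frames.

    prob_maps: list of (..., C) arrays, each a per-pixel distribution over
    C labels. Returns the normalized elementwise product (a simple Bayesian
    fusion; an assumption, not the patent's stated rule)."""
    fused = np.ones_like(prob_maps[0])
    for p in prob_maps:
        fused = fused * p
    s = fused.sum(axis=-1, keepdims=True)
    return fused / np.where(s == 0, 1.0, s)   # avoid division by zero
```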
4. Local mapping: after the semantic labels are fused, the current key frame is inserted into the local map, redundant three-dimensional space points and key frames are filtered out, and finally local bundle adjustment is performed.
a. Key frame insertion: the pose of the key frame is added as a node to the pose graph, and optimization edges are added to the key frames that observe the same three-dimensional space points as the current key frame.
b. Local bundle adjustment: the current key frame, its adjacent key frames, the key frames sharing observed three-dimensional points, and the corresponding three-dimensional space points are placed into the pose graph for optimization. Each key frame is examined and discarded if ninety percent of its feature points are observed by more than three other key frames.
5. Loop detection: if there are fewer than 10 key frames in the map, loop detection is skipped. Otherwise, key frames sharing common BoW words with the current key frame are searched in the map; the maximum number of words shared with the current key frame's BoW vector is counted, eighty percent of that number is used as a threshold, and key frames whose shared-word count exceeds the threshold are taken as candidate key frames. The semantic categories of the detected candidate loop frames are then compared with those of the current frame, keeping the candidates whose semantic categories agree in kind and number; next, the reconstructed point-cloud counts of the candidates are compared, keeping those whose similarity exceeds a threshold; then the principal directions of the point clouds reconstructed from the current frame and from each candidate are compared, keeping the candidates above a similarity threshold, which are the loops. The matching pairs between the current key frame and the loop key frame are computed to solve the transformation between the two frames; if the number of feature point matching pairs is sufficient, closed-loop correction is performed and the corrected transformation of each key frame is computed with a propagation algorithm.
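The three semantic checks applied to the BoW candidates above (category agreement, point-cloud count similarity, principal-direction similarity) can be sketched as follows; the frame representation and the threshold values are illustrative:

```python
import numpy as np

def filter_loop_candidates(current, candidates, count_sim=0.8, dir_thresh=0.9):
    """Filter BoW loop candidates with three semantic checks.

    Each frame is a dict: {'cats': list of semantic category ids,
    'n_points': reconstructed point-cloud count, 'dir': unit principal
    direction vector}. Thresholds are illustrative placeholders."""
    kept = []
    for c in candidates:
        # 1) semantic categories must agree in kind and number
        if sorted(c['cats']) != sorted(current['cats']):
            continue
        # 2) reconstructed point-cloud counts must be similar
        ratio = (min(c['n_points'], current['n_points'])
                 / max(c['n_points'], current['n_points']))
        if ratio < count_sim:
            continue
        # 3) point-cloud principal directions must be aligned (|cosine|)
        cos = abs(float(np.dot(c['dir'], current['dir'])))
        if cos < dir_thresh:
            continue
        kept.append(c)
    return kept
```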
6. Finally, graph optimization and global optimization are performed:
(1) take the pose of each key frame and the point cloud as vertices;
(2) establish constraint edges between the vertices: the relative motion estimate between two pose nodes (denoted T), the mapping constraint between point cloud and camera (denoted M), and the semantic constraints between point clouds;
(3) take the vertices as optimization variables and the edges as constraint terms, and solve for the optimal vertices satisfying the constraints with the Levenberg-Marquardt (L-M) method, i.e. obtain the optimized camera poses and point-cloud positions, as shown in Fig. 4.
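To illustrate the "vertices as optimization variables, edges as constraint terms" formulation, here is a toy one-dimensional pose graph solved by Gauss-Newton. The real system optimizes 6-DoF poses and 3-D points with the L-M method, so this is only a structural sketch:

```python
import numpy as np

def optimize_pose_graph(x0, edges, iters=10):
    """Toy 1-D pose-graph optimization by Gauss-Newton.

    x0:    initial scalar 'poses' (the vertices).
    edges: list of (i, j, z) constraints meaning x[j] - x[i] should equal z.
    The first pose is pinned as the gauge."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        H = np.zeros((len(x), len(x)))
        b = np.zeros(len(x))
        for i, j, z in edges:
            r = (x[j] - x[i]) - z          # residual of one constraint edge
            Ji, Jj = -1.0, 1.0             # Jacobian entries of r w.r.t. x_i, x_j
            H[i, i] += Ji * Ji; H[j, j] += Jj * Jj
            H[i, j] += Ji * Jj; H[j, i] += Jj * Ji
            b[i] += Ji * r; b[j] += Jj * r
        H[0, 0] += 1e6                     # gauge prior: pin the first pose
        x += np.linalg.solve(H, -b)        # Gauss-Newton update
    return x
```

With consistent relative measurements, the poses converge to the values that satisfy every edge, which is the same mechanism the text describes for camera poses, projection constraints and semantic constraints.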
A semantic-driven camera localization and mapping system is further described with reference to specific embodiments, the system comprising the following components:
the first module is used for extracting the feature points of the current frame image, performing semantic segmentation on the current frame image by using the built full convolution neural network, and obtaining the corresponding semantic category of each feature point;
the second module is used for matching all the feature points in the current frame and the key frame by adopting a similar matching method according to the similarity and the semantic category to obtain a feature point matching pair;
the second module comprises a similar matching unit, and the similar matching unit comprises the following parts:
the first subunit is used for acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;
the second subunit is used for calculating the point cloud main direction of each object in the same class of objects in the current frame and the key frame, and if the difference value of the point cloud main directions of a certain object in the current frame and a certain object in the key frame is smaller than a set threshold value, the two objects are an object matching pair;
the third subunit is used for carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;
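The similarity matching performed by the third subunit might be sketched as follows. This is illustrative only: the patent does not fix the descriptor type or similarity measure, so cosine similarity with an assumed 0.8 threshold is used here.

```python
import numpy as np

def match_object_features(desc_a, desc_b, sim_thresh=0.8):
    """Match the feature descriptors of a matched object pair (region A in
    the current frame, region B in the key frame) by cosine similarity.
    Returns (index_in_A, index_in_B) pairs."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                      # pairwise cosine similarities
    pairs = []
    for i in range(len(a)):
        j = int(np.argmax(sim[i]))     # most similar feature in B
        if sim[i, j] > sim_thresh:     # keep only confident matches
            pairs.append((i, j))
    return pairs
```

Because the search is restricted to the two regions of an object matching pair, mismatches between look-alike features on different objects are avoided by construction.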
the third module is used for initializing the camera attitude through matching of all feature points in the current frame and the key frame;
the fourth module is used for calculating by utilizing the camera attitude to obtain a three-dimensional point corresponding to the matched feature point d in the current frame, projecting the three-dimensional point to the current frame by utilizing camera intrinsic parameters, judging whether the projection point is in an object region where the feature point d is located, and if not, searching a new matched point of the feature point d in the unmatched feature points of the key frame by adopting a similar matching method to form a new matched pair;
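The projection test performed by the fourth module can be sketched as below. This is an illustration, not the patent's implementation: a boolean mask is assumed as the object-region representation, and the point is assumed already transformed into the current camera frame.

```python
import numpy as np

def reprojects_into_region(X_cam, K, region_mask):
    """Project a 3-D point X_cam with intrinsics K and test whether the
    pixel falls inside the object region (boolean mask) of feature point d."""
    if X_cam[2] <= 0:                 # behind the camera: cannot project
        return False
    u = K @ (X_cam / X_cam[2])        # pinhole projection, homogeneous pixel
    px, py = int(round(u[0])), int(round(u[1]))
    h, w = region_mask.shape
    return 0 <= px < w and 0 <= py < h and bool(region_mask[py, px])
```

When this check fails, the match for feature point d is discarded and a replacement is sought among the key frame's unmatched feature points via the similar matching method.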
a fifth module for updating all the feature point matching pairs using the fourth module, and then updating the camera pose by minimizing:
wherein exp(ξ^) represents the Lie algebra representation of the camera pose; n represents the number of feature point matching pairs; u_i represents the image coordinates of the i-th feature point matching pair in the current frame; s_i represents the i-th scale factor; p_i represents the image coordinates of the i-th matching pair in the key frame;
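The minimized expression itself is not reproduced in this text. With the symbols defined above, a standard reprojection-error objective consistent with them can be sketched as follows; this is an assumption rather than the patent's exact formula, and the scale factor s_i is taken to be the projective depth of the i-th point.

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues' formula: axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(phi)
    if theta < 1e-12:
        return np.eye(3)
    a = phi / theta
    A = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * A + (1 - np.cos(theta)) * (A @ A)

def reprojection_cost(xi, K, points_3d, u_obs):
    """Sum over i of || u_i - (1/s_i) K exp(xi^) P_i ||^2, with s_i taken
    as the projective depth. xi = (rotation vector, translation)."""
    R, t = so3_exp(xi[:3]), xi[3:]
    err = 0.0
    for X, u in zip(points_3d, u_obs):
        p = K @ (R @ X + t)                       # s_i * [u; v; 1]
        err += np.sum((u - p[:2] / p[2]) ** 2)    # pixel reprojection error
    return err
```

Minimizing this cost over xi (e.g., with the L-M method of step 6) yields the updated camera pose.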
and the sixth module is used for constructing a three-dimensional map by using the new camera posture, acquiring the appearance characteristics of the object according to the semantic category of the object in the three-dimensional map, and deleting the three-dimensional points which do not accord with the appearance characteristics of the object in the object.
the seventh module is used for further judging whether the current frame has a loop by utilizing the semantic category, the point cloud number and the point cloud main direction of the object in the current frame, and if so, eliminating the accumulated error by utilizing closed loop optimization;
and the eighth module is used for optimizing the global key map by using a nonlinear least square map optimization method and finally performing global optimization.
It will be appreciated by those skilled in the art that the foregoing is only a preferred embodiment of the invention and is not intended to limit the invention; various modifications, equivalents and improvements may be made without departing from the spirit and scope of the invention.
Claims (7)
1. A semantic-driven camera positioning and map reconstruction method is characterized by specifically comprising the following steps:
(1) extracting feature points of the current frame image, performing semantic segmentation on the current frame image by using the built full convolution neural network, and obtaining the corresponding semantic category of each feature point;
(2) according to the similarity and the semantic category, matching all the feature points in the current frame and the key frame by adopting a similar matching method to obtain feature point matching pairs;
the similar matching method specifically comprises the following substeps:
(21) acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;
(22) calculating the point cloud main direction of each object in the same class of objects in the current frame and the key frame, wherein if the difference value of the point cloud main directions of a certain object in the current frame and a certain object in the key frame is smaller than a set threshold value, the two objects are matched pairs of the objects;
(23) carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;
(3) initializing the camera attitude through matching of all feature points in the current frame and the key frame;
the step (3) is specifically as follows:
(31) calculating an essential matrix E by using an eight-point method;
(32) decomposing the essential matrix through SVD to obtain four possible solutions, namely four possible camera poses;
(33) calculating three-dimensional point cloud according to each possible camera pose and the feature point matching pair, wherein if the position of the point cloud conforms to the camera imaging model, the corresponding camera pose is an initialized camera pose;
(4) calculating by utilizing the camera attitude to obtain a three-dimensional point corresponding to the matched feature point d in the current frame, projecting the three-dimensional point to the current frame by utilizing camera intrinsic parameters, judging whether the projected point is in an object region where the feature point d is located, if not, searching a new matched point of the feature point d in the unmatched feature points of the key frame by adopting a similar matching method to form a new matched pair;
(5) updating all the feature point matching pairs by using the step (4), and updating the camera attitude by minimizing the following formula:
wherein exp(ξ^) represents the Lie algebra representation of the camera pose; n represents the number of feature point matching pairs; u_i represents the image coordinates of the i-th feature point matching pair in the current frame; s_i represents the i-th scale factor; p_i represents the image coordinates of the i-th matching pair in the key frame;
(6) constructing a three-dimensional map by using the new camera posture, acquiring the appearance characteristics of the object according to the semantic category of the object in the three-dimensional map, and deleting the three-dimensional points which do not accord with the appearance characteristics of the object in the object;
(7) further judging whether a loop exists in the current frame or not by utilizing the semantic category, the point cloud number and the point cloud main direction of the object in the current frame, and if so, eliminating an accumulated error by utilizing closed loop optimization;
(8) and optimizing the global key map by using a nonlinear least square map optimization method, and finally performing global optimization.
2. The semantically driven camera localization and mapping method according to claim 1, wherein said step (23) specifically comprises:
setting the feature point sets of the regions where the two objects in the object matching pair are located as set A and set B, respectively;
(231) selecting a feature point a_i from the set A, and sequentially calculating the similarity between the feature point a_i and all feature points in the set B; if the similarity between a feature point b_j in the set B and the feature point a_i is the largest and is greater than the set similarity threshold, b_j and a_i form a feature point matching pair;
(232) selecting another feature point from the set A, and repeating the step (231) until the matching pairs of all feature points of the set A are found.
3. The semantic-driven camera positioning and map reconstructing method according to claim 1, wherein the step (7) of determining whether the current frame has a loop specifically comprises the following sub-steps:
(41) detecting candidate loop frames through a bag-of-words model;
(42) comparing the semantic categories of the detected candidate loopback frames with the current frame again, and finding out the candidate loopback frames with the same number and the same semantic categories;
(43) comparing the number of the reconstructed point clouds of the candidate loopback frames again, and storing the candidate loopback frames with the similarity larger than a set threshold;
(44) finally, comparing the main direction of the point cloud reconstructed by the current frame and each candidate loopback frame, and reserving the candidate loopback frame which is greater than the similarity threshold value, namely, a loopback;
(45) cumulative errors are eliminated using closed loop optimization.
4. The semantically driven camera localization and mapping method according to claim 3, wherein the step (45) of eliminating the accumulated error by using closed loop optimization specifically comprises the following sub-steps:
(451) solving the transformation between the two frames by calculating the matching pair between the current key frame and the loop key frame;
(452) and if the matching pair of the feature points meets the correction threshold, performing closed-loop correction, and calculating the correct posture of each key frame by using a propagation algorithm.
5. The semantically driven camera positioning and map reconstructing method according to claim 1, wherein said step (8) is specifically:
(81) taking the pose and the point cloud of each key frame as a vertex;
(82) establishing a constraint edge between the vertexes, wherein the constraint edge is the relative motion estimation between two pose nodes, the mapping constraint between the point cloud and the camera and the semantic constraint between the point clouds;
(83) and (4) using the vertex as an optimization variable and the edge as a constraint item, solving the optimal vertex meeting the constraint by using a Gauss-Newton method, namely solving the optimized camera attitude and the point cloud position.
6. The semantically driven camera positioning and map reconstruction method as claimed in any one of claims 1 to 5, wherein a key frame is created if one of the following conditions is met:
determining the Nth frame after the previous round of map reconstruction as a new key frame;
determining the Nth frame after the previous key frame was inserted as a new key frame;
and if the number of the tracked feature point matching pairs of the current frame is less than ninety percent of the number of the feature point matching pairs of the reference key frame, determining the current frame as a new key frame.
7. A semantically driven camera localization and mapping system, comprising:
the first module is used for extracting the feature points of the current frame image, performing semantic segmentation on the current frame image by using the built full convolution neural network, and obtaining the corresponding semantic category of each feature point;
the second module is used for matching all the feature points in the current frame and the key frame by adopting a similar matching method according to the similarity and the semantic category to obtain a feature point matching pair;
the second module comprises a similar matching unit, and the similar matching unit comprises the following parts:
the first subunit is used for acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;
the second subunit is used for calculating the point cloud main direction of each object in the same class of objects in the current frame and the key frame, and if the difference value of the point cloud main directions of a certain object in the current frame and a certain object in the key frame is smaller than a set threshold value, the two objects are an object matching pair;
the third subunit is used for carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;
the third module is used for initializing the camera attitude through matching of all feature points in the current frame and the key frame;
the fourth module is used for calculating by utilizing the camera attitude to obtain a three-dimensional point corresponding to the matched feature point d in the current frame, projecting the three-dimensional point to the current frame by utilizing camera intrinsic parameters, judging whether the projection point is in an object region where the feature point d is located, and if not, searching a new matched point of the feature point d in the unmatched feature points of the key frame by adopting a similar matching method to form a new matched pair;
a fifth module for updating all the feature point matching pairs using the fourth module, and then updating the camera pose by minimizing:
wherein exp(ξ^) represents the Lie algebra representation of the camera pose; n represents the number of feature point matching pairs; u_i represents the image coordinates of the i-th feature point matching pair in the current frame; s_i represents the i-th scale factor; p_i represents the image coordinates of the i-th matching pair in the key frame;
the sixth module is used for constructing a three-dimensional map by using the new camera posture, acquiring the appearance characteristics of the object according to the semantic category of the object in the three-dimensional map, and deleting the three-dimensional points which do not accord with the appearance characteristics of the object in the object;
the seventh module is used for further judging whether the current frame has a loop by utilizing the semantic category, the point cloud number and the point cloud main direction of the object in the current frame, and if so, eliminating the accumulated error by utilizing closed loop optimization;
and the eighth module is used for optimizing the global key map by using a nonlinear least square map optimization method and finally performing global optimization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910557726.8A CN110335319B (en) | 2019-06-26 | 2019-06-26 | Semantic-driven camera positioning and map reconstruction method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110335319A CN110335319A (en) | 2019-10-15 |
CN110335319B true CN110335319B (en) | 2022-03-18 |
Family
ID=68142729
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910557726.8A Active CN110335319B (en) | 2019-06-26 | 2019-06-26 | Semantic-driven camera positioning and map reconstruction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110335319B (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110910389B (en) * | 2019-10-30 | 2021-04-09 | 中山大学 | Laser SLAM loop detection system and method based on graph descriptor |
CN111046125A (en) * | 2019-12-16 | 2020-04-21 | 视辰信息科技(上海)有限公司 | Visual positioning method, system and computer readable storage medium |
CN111311708B (en) * | 2020-01-20 | 2022-03-11 | 北京航空航天大学 | Visual SLAM method based on semantic optical flow and inverse depth filtering |
CN111310654B (en) * | 2020-02-13 | 2023-09-08 | 北京百度网讯科技有限公司 | Map element positioning method and device, electronic equipment and storage medium |
CN111325842B (en) * | 2020-03-04 | 2023-07-28 | Oppo广东移动通信有限公司 | Map construction method, repositioning method and device, storage medium and electronic equipment |
CN111368759B (en) * | 2020-03-09 | 2022-08-30 | 河海大学常州校区 | Monocular vision-based mobile robot semantic map construction system |
CN111429517A (en) * | 2020-03-23 | 2020-07-17 | Oppo广东移动通信有限公司 | Relocation method, relocation device, storage medium and electronic device |
CN111427373B (en) * | 2020-03-24 | 2023-11-24 | 上海商汤临港智能科技有限公司 | Pose determining method, pose determining device, medium and pose determining equipment |
CN112585946A (en) * | 2020-03-27 | 2021-03-30 | 深圳市大疆创新科技有限公司 | Image shooting method, image shooting device, movable platform and storage medium |
CN111311742B (en) * | 2020-03-27 | 2023-05-05 | 阿波罗智能技术(北京)有限公司 | Three-dimensional reconstruction method, three-dimensional reconstruction device and electronic equipment |
CN111815687A (en) * | 2020-06-19 | 2020-10-23 | 浙江大华技术股份有限公司 | Point cloud matching method, positioning method, device and storage medium |
CN112085026A (en) * | 2020-08-26 | 2020-12-15 | 的卢技术有限公司 | Closed loop detection method based on deep neural network semantic segmentation |
CN112419512B (en) * | 2020-10-13 | 2022-09-13 | 南昌大学 | Air three-dimensional model repairing system and method based on semantic information |
CN112507056B (en) * | 2020-12-21 | 2023-03-21 | 华南理工大学 | Map construction method based on visual semantic information |
CN112927269A (en) * | 2021-03-26 | 2021-06-08 | 深圳市无限动力发展有限公司 | Map construction method and device based on environment semantics and computer equipment |
CN113591865B (en) * | 2021-07-28 | 2024-03-26 | 深圳甲壳虫智能有限公司 | Loop detection method and device and electronic equipment |
CN114639006B (en) * | 2022-03-15 | 2023-09-26 | 北京理工大学 | Loop detection method and device and electronic equipment |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101645170A (en) * | 2009-09-03 | 2010-02-10 | 北京信息科技大学 | Precise registration method of multilook point cloud |
CN102308320A (en) * | 2009-02-06 | 2012-01-04 | 香港科技大学 | Generating three-dimensional models from images |
CN104361627A (en) * | 2014-11-07 | 2015-02-18 | 武汉科技大学 | SIFT-based (scale-invariant feature transform) binocular vision three-dimensional image reconstruction method of asphalt pavement micro-texture |
CN107392964A (en) * | 2017-07-07 | 2017-11-24 | 武汉大学 | The indoor SLAM methods combined based on indoor characteristic point and structure lines |
CN107833236A (en) * | 2017-10-31 | 2018-03-23 | 中国科学院电子学研究所 | Semantic vision positioning system and method are combined under a kind of dynamic environment |
CN108230337A (en) * | 2017-12-31 | 2018-06-29 | 厦门大学 | A kind of method that semantic SLAM systems based on mobile terminal are realized |
CN108596053A (en) * | 2018-04-09 | 2018-09-28 | 华中科技大学 | A kind of vehicle checking method and system based on SSD and vehicle attitude classification |
CN109272577A (en) * | 2018-08-30 | 2019-01-25 | 北京计算机技术及应用研究所 | A kind of vision SLAM method based on Kinect |
CN109544629A (en) * | 2018-11-29 | 2019-03-29 | 南京人工智能高等研究院有限公司 | Camera pose determines method and apparatus and electronic equipment |
CN109658449A (en) * | 2018-12-03 | 2019-04-19 | 华中科技大学 | A kind of indoor scene three-dimensional rebuilding method based on RGB-D image |
CN109815847A (en) * | 2018-12-30 | 2019-05-28 | 中国电子科技集团公司信息科学研究院 | A kind of vision SLAM method based on semantic constraint |
CN109816686A (en) * | 2019-01-15 | 2019-05-28 | 山东大学 | Robot semanteme SLAM method, processor and robot based on object example match |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107610175A (en) * | 2017-08-04 | 2018-01-19 | 华南理工大学 | The monocular vision SLAM algorithms optimized based on semi-direct method and sliding window |
Non-Patent Citations (1)
Title |
---|
"基于云的语义库设计及机器人语义地图构建";于金山等;《机器人 ROBOT》;20161231;第 38 卷(第 4 期);第410-419页 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110335319B (en) | Semantic-driven camera positioning and map reconstruction method and system | |
CN107967457B (en) | Site identification and relative positioning method and system adapting to visual characteristic change | |
CN110781262B (en) | Semantic map construction method based on visual SLAM | |
US8798357B2 (en) | Image-based localization | |
Eade et al. | Monocular graph SLAM with complexity reduction | |
CN113129335B (en) | Visual tracking algorithm and multi-template updating strategy based on twin network | |
CN110119768B (en) | Visual information fusion system and method for vehicle positioning | |
CN112037268B (en) | Environment sensing method based on probability transfer model in dynamic scene | |
CN112446882A (en) | Robust visual SLAM method based on deep learning in dynamic scene | |
CN114088081A (en) | Map construction method for accurate positioning based on multi-segment joint optimization | |
CN114140527A (en) | Dynamic environment binocular vision SLAM method based on semantic segmentation | |
Shi et al. | Dense semantic 3D map based long-term visual localization with hybrid features | |
Hu et al. | Multiple maps for the feature-based monocular SLAM system | |
CN112287906B (en) | Template matching tracking method and system based on depth feature fusion | |
Yang et al. | Probabilistic projective association and semantic guided relocalization for dense reconstruction | |
Ali et al. | A life-long SLAM approach using adaptable local maps based on rasterized LIDAR images | |
CN113570713B (en) | Semantic map construction method and device for dynamic environment | |
CN112560651B (en) | Target tracking method and device based on combination of depth network and target segmentation | |
CN113888603A (en) | Loop detection and visual SLAM method based on optical flow tracking and feature matching | |
CN114067128A (en) | SLAM loop detection method based on semantic features | |
Zhang et al. | Appearance-based loop closure detection via bidirectional manifold representation consensus | |
CN116592897B (en) | Improved ORB-SLAM2 positioning method based on pose uncertainty | |
CN112396593B (en) | Closed loop detection method based on key frame selection and local features | |
CN113435256B (en) | Three-dimensional target identification method and system based on geometric consistency constraint | |
CN113012212B (en) | Depth information fusion-based indoor scene three-dimensional point cloud reconstruction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||