CN110335319B - Semantic-driven camera positioning and map reconstruction method and system - Google Patents

Semantic-driven camera positioning and map reconstruction method and system

Info

Publication number
CN110335319B
CN110335319B (application CN201910557726.8A)
Authority
CN
China
Prior art keywords
matching
current frame
point
camera
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910557726.8A
Other languages
Chinese (zh)
Other versions
CN110335319A (en)
Inventor
桑农 (Sang Nong)
王玘 (Wang Qi)
高常鑫 (Gao Changxin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201910557726.8A priority Critical patent/CN110335319B/en
Publication of CN110335319A publication Critical patent/CN110335319A/en
Application granted granted Critical
Publication of CN110335319B publication Critical patent/CN110335319B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05Geographic models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The invention discloses a semantic-driven camera positioning and map reconstruction method, belonging to the technical field of computer vision. First, feature points are extracted from the current frame image and semantic segmentation assigns each feature point a semantic category. Then, according to descriptor similarity and semantic category, all feature points in the current frame and the key frame are matched with a similar matching method to obtain matching pairs, and the camera attitude is initialized from all matching pairs between the current frame and the key frame. Next, the feature point matching pairs are updated with a three-dimensional projection method combined with a semantic check, and the camera attitude is updated by minimizing the reprojection error over all matching pairs. Finally, a three-dimensional map is constructed from the camera attitude. The invention also realizes a semantic-driven camera positioning and map reconstruction system. In this technical scheme, semantic information drives several optimizations in the camera positioning stage as well as point cloud constraints in the reconstruction stage, so that semantic segmentation is coupled more closely with the camera positioning and reconstruction system, yielding more accurate positioning and a more complete reconstruction.

Description

Semantic-driven camera positioning and map reconstruction method and system
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a semantic-driven camera positioning and map reconstruction method.
Background
Currently, camera localization and reconstruction techniques are either not combined with semantic segmentation techniques at all, or only loosely combined with them.
Related algorithms that do not incorporate semantic segmentation have two shortcomings. On the one hand, they cope poorly with varied environments such as dynamic scenes and weakly textured scenes. On the other hand, the maps they reconstruct usually consist of point clouds or landmarks, that is, maps built on geometric information alone, and therefore cannot provide any high-level understanding of the surrounding environment.
Related algorithms that do incorporate semantic segmentation generally attach class labels to recognized objects and apply optimizations that remove the influence of dynamic objects, but they neither fully exploit the semantic segmentation results nor tightly integrate semantic segmentation into the positioning and map reconstruction pipeline.
Disclosure of Invention
Aiming at the above defects or improvement requirements of the prior art, the invention provides a semantic-driven camera positioning and map reconstruction method. Its aim is to use semantic information to optimize feature point matching, reprojection error optimization, reconstructed point cloud constraints, and loop detection throughout the camera positioning and reconstruction process, so that camera positioning is more accurate and the reconstruction carries high-level understanding and is more complete.
In order to achieve the above object, the present invention provides a semantic-driven camera positioning and map reconstructing method, which comprises the following steps:
(1) extracting feature points of the current frame image, performing semantic segmentation on the current frame image with the built fully convolutional neural network, and obtaining the corresponding semantic category for each feature point;
(2) according to the similarity and the semantic category, matching all the feature points in the current frame and the key frame by adopting a similar matching method to obtain feature point matching pairs;
the similar matching method specifically comprises the following substeps:
(21) acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;
(22) calculating the point cloud main direction of each object among the same-class objects in the current frame and the key frame, wherein if the difference between the point cloud main directions of an object in the current frame and an object in the key frame is smaller than a set threshold, the two objects form an object matching pair;
(23) carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;
(3) initializing the camera attitude through matching of all feature points in the current frame and the key frame;
(4) calculating by utilizing the camera attitude to obtain a three-dimensional point corresponding to the matched feature point d in the current frame, projecting the three-dimensional point to the current frame by utilizing camera intrinsic parameters, judging whether the projected point is in an object region where the feature point d is located, if not, searching a new matched point of the feature point d in the unmatched feature points of the key frame by adopting a similar matching method to form a new matched pair;
(5) updating all the feature point matching pairs as in step (4), and updating the camera attitude by minimizing the following formula:

$$\xi^{*}=\arg\min_{\xi}\ \frac{1}{2}\sum_{i=1}^{n}\left\|u_{i}-\frac{1}{s_{i}}K\exp\left(\xi^{\wedge}\right)P_{i}\right\|_{2}^{2}$$

wherein exp(ξ^) denotes the Lie-algebra (exponential map) representation of the camera pose; n denotes the number of feature point matching pairs; u_i denotes the image coordinates of the i-th feature point matching pair in the current frame; s_i denotes the i-th scale factor; K denotes the camera intrinsic matrix; and P_i denotes the spatial point corresponding to the i-th matching pair in the key frame;
(6) constructing a three-dimensional map with the new camera attitude, acquiring the appearance characteristics of each object in the three-dimensional map according to its semantic category, and deleting the three-dimensional points within the object that do not accord with its appearance characteristics.
Further, the method further comprises the steps of:
(7) further judging whether a loop exists in the current frame or not by utilizing the semantic category, the point cloud number and the point cloud main direction of the object in the current frame, and if so, eliminating an accumulated error by utilizing closed loop optimization;
(8) and optimizing the global key map by using a nonlinear least square map optimization method, and finally performing global optimization.
Further, the step (23) specifically includes:
Let the feature point sets of the regions where the two objects of an object matching pair are located be

$$A=\{a_{1},a_{2},\ldots,a_{m}\},\qquad B=\{b_{1},b_{2},\ldots,b_{k}\}$$

(231) selecting a feature point a_i from the set A, and computing in turn the similarity between a_i and every feature point in the set B; if the similarity between some feature point b_j in the set B and a_i is the largest and is greater than the set similarity threshold, then b_j and a_i form a feature point matching pair;
(232) selecting another feature point from the set A and repeating step (231) until matching pairs have been sought for all feature points of the set A.
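As an illustration of steps (22) and (23), the following minimal Python sketch matches same-class objects by the difference of their point cloud main directions and then matches features inside the paired regions. All function names, data layouts and thresholds are hypothetical, and cosine similarity stands in for whatever descriptor similarity the implementation uses (ORB descriptors would normally be compared by Hamming distance):

```python
import numpy as np

def principal_direction(points):
    """Main direction of an object's point cloud: the eigenvector of the
    covariance matrix with the largest eigenvalue (PCA)."""
    centered = points - points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    return eigvecs[:, np.argmax(eigvals)]

def match_objects(objects_cur, objects_key, angle_thresh_deg=10.0):
    """Step (22): pair same-class objects whose main directions differ by
    less than the threshold. Each object is a dict with hypothetical keys
    'label' (semantic class) and 'points' (N x 3 point cloud)."""
    pairs = []
    for oc in objects_cur:
        for ok in objects_key:
            if oc["label"] != ok["label"]:
                continue
            cos = abs(np.clip(principal_direction(oc["points"])
                              @ principal_direction(ok["points"]), -1.0, 1.0))
            if np.degrees(np.arccos(cos)) < angle_thresh_deg:
                pairs.append((oc, ok))
    return pairs

def match_features(desc_a, desc_b, sim_thresh=0.8):
    """Steps (231)-(232): for each descriptor in region A, keep the most
    similar descriptor in region B if its similarity exceeds the threshold."""
    matches = []
    norms_b = np.linalg.norm(desc_b, axis=1)
    for i, da in enumerate(desc_a):
        sims = desc_b @ da / (norms_b * np.linalg.norm(da) + 1e-9)
        j = int(np.argmax(sims))
        if sims[j] > sim_thresh:
            matches.append((i, j))
    return matches
```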
Further, the step (3) is specifically:
(31) calculating an essential matrix E by using an eight-point method;
(32) decomposing the essential matrix through SVD (Singular Value Decomposition) to obtain four possible solutions, i.e., candidate camera attitudes;
(33) and calculating three-dimensional point cloud according to each possible camera pose and the feature point matching pair, wherein if the position of the point cloud conforms to the camera imaging model, the corresponding camera pose is the initialized camera pose.
Further, the step (7) of determining whether a loop exists for the current frame specifically includes the following sub-steps:
(41) detecting candidate loop frames through a bag-of-words (BoW) model;
(42) comparing the semantic categories of the detected candidate loop frames with those of the current frame, and finding the candidate loop frames whose semantic categories match the current frame's in both number and kind;
(43) comparing the reconstructed point cloud counts of the remaining candidate loop frames, and keeping the candidates whose similarity is greater than a set threshold;
(44) finally, comparing the main direction of the point cloud reconstructed by the current frame with that of each candidate loop frame, and retaining the candidates above the similarity threshold; such a candidate constitutes a loop;
(45) eliminating the accumulated error through closed-loop optimization.
Further, the elimination of the accumulated error by using the closed-loop optimization in the step (45) specifically includes the following sub-steps:
(451) computing the matching pairs between the current key frame and the loop key frame to solve for the transformation between the two frames;
(452) if the number of feature point matching pairs meets the correction threshold, performing closed-loop correction and computing the corrected pose of each key frame with a propagation algorithm.
Further, the step (8) is specifically:
(81) taking the pose and the point cloud of each key frame as a vertex;
(82) establishing a constraint edge between the vertexes, wherein the constraint edge is the relative motion estimation between two pose nodes, the mapping constraint between the point cloud and the camera and the semantic constraint between the point clouds;
(83) using the vertices as optimization variables and the edges as constraint terms, and solving for the optimal vertices satisfying the constraints with the Gauss-Newton method, i.e., obtaining the optimized camera attitudes and point cloud positions.
Further, the key frame selection criteria are as follows: the current frame is made a new key frame if one of the following conditions is met:
N frames have passed since the previous round of map reconstruction;
N frames have passed since the previous key frame was inserted;
the number of feature point matching pairs tracked by the current frame is less than ninety percent of the number of feature point matching pairs of the reference key frame.
According to another aspect of the present invention, there is provided a semantic-driven camera localization and mapping system, comprising:
the first module is used for extracting the feature points of the current frame image, performing semantic segmentation on the current frame image with the built fully convolutional neural network, and obtaining the corresponding semantic category for each feature point;
the second module is used for matching all the feature points in the current frame and the key frame by adopting a similar matching method according to the similarity and the semantic category to obtain a feature point matching pair;
the second module comprises a similar matching unit, and the similar matching unit comprises the following parts:
the first subunit is used for acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;
the second subunit is used for calculating the point cloud main direction of each object in the same class of objects in the current frame and the key frame, and if the difference value of the point cloud main directions of a certain object in the current frame and a certain object in the key frame is smaller than a set threshold value, the two objects are an object matching pair;
the third subunit is used for carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;
the third module is used for initializing the camera attitude through matching of all feature points in the current frame and the key frame;
the fourth module is used for calculating by utilizing the camera attitude to obtain a three-dimensional point corresponding to the matched feature point d in the current frame, projecting the three-dimensional point to the current frame by utilizing camera intrinsic parameters, judging whether the projection point is in an object region where the feature point d is located, and if not, searching a new matched point of the feature point d in the unmatched feature points of the key frame by adopting a similar matching method to form a new matched pair;
a fifth module for updating all the feature point matching pairs using the fourth module, and then updating the camera pose by minimizing:
$$\xi^{*}=\arg\min_{\xi}\ \frac{1}{2}\sum_{i=1}^{n}\left\|u_{i}-\frac{1}{s_{i}}K\exp\left(\xi^{\wedge}\right)P_{i}\right\|_{2}^{2}$$

wherein exp(ξ^) denotes the Lie-algebra (exponential map) representation of the camera pose; n denotes the number of feature point matching pairs; u_i denotes the image coordinates of the i-th feature point matching pair in the current frame; s_i denotes the i-th scale factor; K denotes the camera intrinsic matrix; and P_i denotes the spatial point corresponding to the i-th matching pair in the key frame;
and the sixth module is used for constructing a three-dimensional map with the new camera posture, acquiring the appearance characteristics of each object in the three-dimensional map according to its semantic category, and deleting the three-dimensional points within the object that do not accord with its appearance characteristics.
Further, the system further comprises:
the seventh module is used for further judging whether the current frame has a loop or not by utilizing the semantic type, the point cloud number and the point cloud main direction of the object in the current frame, and if so, eliminating the accumulated error by utilizing closed loop optimization;
and the eighth module is used for optimizing the global key map by using a nonlinear least square map optimization method and finally performing global optimization.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
(1) The method matches feature points with a similar matching method based on semantic segmentation. It exploits the semantic label information of each frame image and uses the directions of semantically classified objects as matching constraints. Adding constraints to the matching process narrows the range of feature point matching, which saves matching time, removes many wrong feature point matching pairs, improves matching precision, and provides a sound computational basis for camera attitude estimation;
(2) The method adopts a semantics-based reprojection optimization. By combining the semantic information of each frame image, it adds constraints on the reprojected points and filters out a portion of the wrong reprojections, which improves reprojection optimization efficiency. Removing wrong reprojections further improves the accuracy of camera attitude optimization, so the camera tracks more accurately and is less prone to drift from excessive error;
(3) The method adopts a semantics-based graph optimization. Using the geometric information of semantically segmented objects, it optimizes camera poses and point clouds through the transformations between camera poses and the mappings between point clouds and camera poses; it also constrains the relative positions of point clouds through geometric constraints, which indirectly influences the optimization of the camera poses, yielding more accurate camera poses and point clouds;
(4) The method adopts a semantics-based loop detection. It takes the number of semantic label categories in each frame image as a constraint term and applies further checks after a series of candidate loop frames has been found through BoW, so the loop frames found are more similar to the current frame, loop detection is more accurate, and the error elimination of loop optimization is more precise.
Drawings
FIG. 1 is a general flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of four possible camera poses obtained by decomposing an essential matrix using SVD in the method of the present invention;
FIG. 3 is a schematic view of a three-dimensional point projection in the method of the present invention;
FIG. 4 is a schematic diagram of global optimization in the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, the method of the present invention comprises the steps of:
(1) extracting feature points of the current frame image, performing semantic segmentation on the current frame image with the built fully convolutional neural network, and obtaining the corresponding semantic category for each feature point;
(2) according to the similarity and the semantic category, matching all the feature points in the current frame and the key frame by adopting a similar matching method to obtain feature point matching pairs;
the similar matching method specifically comprises the following substeps:
(21) acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;
(22) calculating the point cloud main direction of each object among the same-class objects in the current frame and the key frame, wherein if the difference between the point cloud main directions of an object in the current frame and an object in the key frame is smaller than a set threshold, the two objects form an object matching pair;
(23) carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;
(3) initializing the camera attitude through matching of all feature points in the current frame and the key frame;
(4) calculating by utilizing the camera attitude to obtain a three-dimensional point corresponding to the matched feature point d in the current frame, projecting the three-dimensional point to the current frame by utilizing camera intrinsic parameters, judging whether the projected point is in an object region where the feature point d is located, if not, searching a new matched point of the feature point d in the unmatched feature points of the key frame by adopting a similar matching method to form a new matched pair;
(5) updating all the feature point matching pairs as in step (4), and updating the camera attitude by minimizing the following formula:

$$\xi^{*}=\arg\min_{\xi}\ \frac{1}{2}\sum_{i=1}^{n}\left\|u_{i}-\frac{1}{s_{i}}K\exp\left(\xi^{\wedge}\right)P_{i}\right\|_{2}^{2}$$

wherein exp(ξ^) denotes the Lie-algebra (exponential map) representation of the camera pose; n denotes the number of feature point matching pairs; u_i denotes the image coordinates of the i-th feature point matching pair in the current frame; s_i denotes the i-th scale factor; K denotes the camera intrinsic matrix; and P_i denotes the spatial point corresponding to the i-th matching pair in the key frame;
(6) constructing a three-dimensional map with the new camera posture, acquiring the appearance characteristics of each object in the three-dimensional map according to its semantic category, and deleting the three-dimensional points within the object that do not accord with its appearance characteristics;
(7) further judging whether a loop exists in the current frame or not by utilizing the semantic category, the point cloud number and the point cloud main direction of the object in the current frame, and if so, eliminating an accumulated error by utilizing closed loop optimization;
(8) and optimizing the global key map by using a nonlinear least square map optimization method, and finally performing global BA optimization.
The method of the invention will now be described in connection with one embodiment of the invention:
1. semantic segmentation: extracting the feature points of the current frame image, performing semantic segmentation on the current frame image by using the built full convolution neural network, and obtaining corresponding semantic categories by using each feature point.
2. Tracking: the pose estimate of the current frame is optimized by finding as many correspondences as possible between the current frame and the local map. This stage specifically comprises the following steps:
a. ORB feature extraction and semantic segmentation: set the input frame as the current frame, extract ORB feature points and the corresponding ORB descriptors, feed the current frame into the segmentation network, and wait for the prediction result.
b. Estimating camera motion: first, the feature points of the current frame and the previous frame are matched with the similar matching method, using ORB descriptor similarity and semantic category information; specifically:
(b1) acquiring the positions of objects in the same category in the current frame and the previous frame according to the semantic category of the feature points;
(b2) calculating the descriptor similarity between every pair of feature points inside same-category object regions, and keeping, for each feature point, the pair with the highest similarity as a final feature point matching pair;
the camera pose is then predicted using the motion pattern. The motion model assumes that the camera moves at a constant speed, the pose of the current frame is estimated through the pose of the camera of the previous frame, the feature point matching relation between two frames and the speed, and if the number of the feature point matching pairs is lower than a threshold value, the key frame mode is changed. The method comprises the following steps of trying to match feature points with the nearest key frame, matching the current frame with all global key frames if the number of matching pairs of the current frame and the nearest key frame is still lower than a threshold value, searching the key frame with the highest number of matching pairs, and solving the pose of the camera by using a PnP algorithm, wherein the specific method comprises the following steps:
(bb1) computing the essential matrix E using an eight-point method;
(bb2) decomposing the essential matrix through SVD to obtain four possible solutions (rotation and translation matrices), i.e., candidate poses;
(bb3) calculating a three-dimensional point cloud according to each possible pose and the feature point matching pair, and determining which solution to select by judging the position of the point cloud, namely calculating the pose of the camera, as shown in fig. 2.
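A minimal sketch of this initialization with OpenCV, assuming matched undistorted pixel coordinates pts_key and pts_cur as N×2 arrays and the intrinsic matrix K (RANSAC is used here in place of the plain eight-point solver; the function name is illustrative):

```python
import cv2

def initialize_pose(pts_key, pts_cur, K):
    """Estimate the essential matrix from matched image points and recover
    the relative pose. cv2.recoverPose triangulates with each of the four
    SVD solutions and keeps the one placing the points in front of both
    cameras (the check of step (bb3))."""
    E, inliers = cv2.findEssentialMat(pts_key, pts_cur, cameraMatrix=K,
                                      method=cv2.RANSAC, prob=0.999,
                                      threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_key, pts_cur,
                                 cameraMatrix=K, mask=inliers)
    return R, t
```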
Reprojection optimization based on semantic segmentation is then applied, starting from the previous frame's pose and using the matched feature points, to obtain the pose of the current frame. The specific method is as follows:
If a feature point projected into the image falls in a region whose semantic class differs from that of its matched feature point in the original image, the reprojection of that pair is considered invalid; the matching pair is removed and no longer participates in the optimization of the objective function. As shown in fig. 3, the two-dimensional image point corresponding to the space point P is p1. In the feature matching stage, the feature point p1 of the previous frame was matched with the feature point p2 of the current frame, so P should project to the position of p2. Because of error in the camera pose estimate, however, the projection does not land at p2 but at p'. If the semantic label of the pixel p' differs from that of p1, the match between p1 and p2 is judged incorrect, so the pair is eliminated and no longer participates in the estimation of the camera motion.
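A minimal sketch of this semantic reprojection check (the helper name and argument layout are illustrative; label_map is the per-pixel semantic segmentation of the current frame, and R, t, K are the current pose estimate and the intrinsics):

```python
import numpy as np

def reprojection_consistent(P, R, t, K, label_map, label_p1):
    """Project space point P into the current frame and accept the match
    only if the landing pixel p' carries the same semantic label as p1."""
    cam = R @ P + t                        # world -> camera coordinates
    if cam[2] <= 0:
        return False                       # behind the camera
    uv = K @ (cam / cam[2])                # pinhole projection
    u, v = int(round(uv[0])), int(round(uv[1]))
    h, w = label_map.shape
    if not (0 <= v < h and 0 <= u < w):
        return False                       # projects outside the image
    return label_map[v, u] == label_p1
```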
For all retained reprojection points, the distance between each projection point and its matched feature point in the same image is computed, and the camera attitude is updated by minimizing the sum of these distances:

$$\xi^{*}=\arg\min_{\xi}\ \frac{1}{2}\sum_{i=1}^{n}\left\|u_{i}-\frac{1}{s_{i}}K\exp\left(\xi^{\wedge}\right)P_{i}\right\|_{2}^{2}$$

wherein exp(ξ^) denotes the Lie-algebra (exponential map) representation of the camera pose; n denotes the number of feature point matching pairs; u_i denotes the image coordinates of the i-th feature point matching pair in the current frame; s_i denotes the i-th scale factor; K denotes the camera intrinsic matrix; and P_i denotes the spatial point corresponding to the i-th matching pair in the key frame;
c. Tracking the local map: find the key frames in the local map that share observed three-dimensional space points with the current frame, together with the key frames adjacent to them. Project the three-dimensional points observed in those key frames into the current frame, update the matches with the feature points of the current frame, and finally optimize the camera pose again with all matched pairs, in the same way as in the previous step.
d. Key frame decision: a key frame is created if one of the following conditions is met: 15 frames have passed since the last global relocalization; 15 frames have passed since the last key frame was inserted; or the number of feature point pairs tracked by the current frame is less than ninety percent of the number of matching pairs of the reference key frame (the reference key frame is the key frame that shares the most commonly observed three-dimensional points with the current frame). If no condition is met, bundle adjustment is performed and the pose of the previous key frame is optimized.
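The policy can be summarized in a small predicate (names are illustrative):

```python
def should_create_keyframe(frames_since_relocalization, frames_since_keyframe,
                           tracked_pairs, reference_pairs, n=15):
    """Create a key frame if any one of the three conditions holds."""
    return (frames_since_relocalization >= n
            or frames_since_keyframe >= n
            or tracked_pairs < 0.9 * reference_pairs)
```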
3. Semantic label fusion: after a key frame is created, the key frames in the local map whose co-visibility with the current key frame exceeds a certain threshold are used to update the semantic label probability of each pixel in the current key frame. The degree of co-visibility is determined by the number of matching pairs between the two frames and the number of jointly observed three-dimensional space points.
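The text leaves the exact fusion rule unspecified; one common choice, sketched below with hypothetical names, multiplies the per-pixel class probability maps of the co-visible key frames into the current key frame's map and renormalizes (the maps are assumed to have been aligned across views beforehand):

```python
import numpy as np

def fuse_label_probabilities(prob_cur, probs_covisible):
    """Multiply per-pixel class probability maps (H x W x C) of co-visible
    key frames into the current key frame's map and renormalize."""
    fused = prob_cur.copy()
    for p in probs_covisible:
        fused *= p
    fused /= fused.sum(axis=2, keepdims=True) + 1e-12
    return fused
```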
4. Local map building: after the semantic labels are fused, the current key frame is inserted into the local map, redundant three-dimensional space points and key frames are filtered out, and finally local bundle adjustment is performed.
a. Inserting the key frame: add the pose of the key frame as a node to the pose graph, and add optimization edges to the key frames that observe the same three-dimensional space points as the current key frame.
b. Local bundle adjustment: put the current key frame, its adjacent key frames, the key frames sharing observed three-dimensional points, and the corresponding three-dimensional space points into the pose graph for optimization. Each key frame is examined and is rejected if ninety percent of its feature points are observed by more than three other key frames.
5. Loop detection: if the map contains fewer than 10 key frames, loop detection is not performed. Otherwise, key frames sharing at least one BoW word with the current key frame are searched for in the map; the largest number of words shared with the current key frame's BoW vector is counted, eighty percent of that number is taken as a threshold, and the key frames sharing more words than the threshold are kept as candidate key frames. The semantic categories of the detected candidate loop frames are then compared with those of the current frame to find candidates whose semantic categories match in both number and kind. The reconstructed point cloud counts of these candidates are compared next, and those with similarity above a threshold are kept. Finally, the main directions of the point clouds reconstructed by the current frame and by each candidate are compared, and the candidates above the similarity threshold are retained as loops. The matching pairs between the current key frame and the loop key frame are then computed to solve for the transformation between the two frames; if the number of feature point matching pairs is sufficient, closed-loop correction is performed and the corrected transformation of each key frame is computed with a propagation algorithm.
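A sketch of the three semantic screening stages applied after BoW retrieval (the frame records and thresholds are hypothetical stand-ins):

```python
from collections import Counter
import numpy as np

def filter_loop_candidates(cur, candidates, cloud_ratio=0.8, angle_deg=10.0):
    """Keep BoW candidates that (a) carry the same multiset of object
    semantic classes as the current frame, (b) have a similar reconstructed
    point cloud count, and (c) have a close point cloud main direction."""
    kept = []
    for cand in candidates:
        if Counter(cand["labels"]) != Counter(cur["labels"]):
            continue                       # class sets differ in number/kind
        lo, hi = sorted((cur["n_points"], cand["n_points"]))
        if lo / hi < cloud_ratio:
            continue                       # point cloud counts too dissimilar
        cos = abs(float(np.clip(cur["dir"] @ cand["dir"], -1.0, 1.0)))
        if np.degrees(np.arccos(cos)) > angle_deg:
            continue                       # main directions too far apart
        kept.append(cand)
    return kept
```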
6. Graph optimization and global optimization are finally carried out:
(1) Taking the pose and the point cloud of each key frame as a vertex;
(2) establishing constraint edges between the vertices: relative motion estimates (denoted T) between two pose nodes, mapping constraints (denoted M) between point clouds and the camera, and semantic constraints between point clouds;
(3) using the vertices as optimization variables and the edges as constraint terms, solve for the optimal vertices satisfying the constraints with the L-M (Levenberg-Marquardt) method, i.e., obtain the optimized camera attitudes and point cloud positions, as shown in FIG. 4.
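As a sketch of this refinement under simplifying assumptions, the snippet below minimizes the reprojection residuals of a single camera pose with SciPy's Levenberg-Marquardt solver; the full graph optimization would additionally treat the point clouds as vertices and add the T, M and semantic constraint edges described above (the function name and parameterization are illustrative):

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def refine_pose(pose0, pts3d, pts2d, K):
    """Minimize the reprojection objective with the L-M method. pose0 is a
    6-vector (rotation vector, translation); pts3d is N x 3, pts2d is N x 2,
    K is the 3 x 3 intrinsic matrix."""
    def residuals(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        cam = (R @ pts3d.T).T + x[3:]       # points in the camera frame
        uv = (K @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:3]         # perspective division (the 1/s_i)
        return (uv - pts2d).ravel()         # stacked u_i - projection terms
    return least_squares(residuals, pose0, method="lm").x
```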
A semantic-driven camera localization and mapping system is further described with reference to specific embodiments, the system comprising the following components:
the first module is used for extracting the feature points of the current frame image, performing semantic segmentation on the current frame image with the built fully convolutional neural network, and obtaining the corresponding semantic category for each feature point;
the second module is used for matching all the feature points in the current frame and the key frame by adopting a similar matching method according to the similarity and the semantic category to obtain a feature point matching pair;
the second module comprises a similar matching unit, and the similar matching unit comprises the following parts:
the first subunit is used for acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;
the second subunit is used for calculating the point cloud main direction of each object in the same class of objects in the current frame and the key frame, and if the difference value of the point cloud main directions of a certain object in the current frame and a certain object in the key frame is smaller than a set threshold value, the two objects are an object matching pair;
the third subunit is used for carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;
the third module is used for initializing the camera attitude through matching of all feature points in the current frame and the key frame;
the fourth module is used for calculating by utilizing the camera attitude to obtain a three-dimensional point corresponding to the matched feature point d in the current frame, projecting the three-dimensional point to the current frame by utilizing camera intrinsic parameters, judging whether the projection point is in an object region where the feature point d is located, and if not, searching a new matched point of the feature point d in the unmatched feature points of the key frame by adopting a similar matching method to form a new matched pair;
a fifth module for updating all the feature point matching pairs using the fourth module, and then updating the camera pose by minimizing:
$$\xi^{*}=\arg\min_{\xi}\ \frac{1}{2}\sum_{i=1}^{n}\left\|u_{i}-\frac{1}{s_{i}}K\exp\left(\xi^{\wedge}\right)P_{i}\right\|_{2}^{2}$$

wherein exp(ξ^) denotes the Lie-algebra (exponential map) representation of the camera pose; n denotes the number of feature point matching pairs; u_i denotes the image coordinates of the i-th feature point matching pair in the current frame; s_i denotes the i-th scale factor; K denotes the camera intrinsic matrix; and P_i denotes the spatial point corresponding to the i-th matching pair in the key frame;
and the sixth module is used for constructing a three-dimensional map with the new camera posture, acquiring the appearance characteristics of each object in the three-dimensional map according to its semantic category, and deleting the three-dimensional points within the object that do not accord with its appearance characteristics.
The seventh module is used for further judging whether the current frame has a loop or not by utilizing the semantic type, the point cloud number and the point cloud main direction of the object in the current frame, and if so, eliminating the accumulated error by utilizing closed loop optimization;
and the eighth module is used for optimizing the global key map by using a nonlinear least square map optimization method and finally performing global optimization.
It will be appreciated by those skilled in the art that the foregoing is only a preferred embodiment of the invention, and is not intended to limit the invention, such that various modifications, equivalents and improvements may be made without departing from the spirit and scope of the invention.

Claims (7)

1. A semantic-driven camera positioning and map reconstruction method is characterized by specifically comprising the following steps:
(1) extracting feature points of the current frame image, performing semantic segmentation on the current frame image with the built fully convolutional neural network, and obtaining the corresponding semantic category for each feature point;
(2) according to the similarity and the semantic category, matching all the feature points in the current frame and the key frame by adopting a similar matching method to obtain feature point matching pairs;
the similar matching method specifically comprises the following substeps:
(21) acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;
(22) calculating the point cloud main direction of each object among the same-class objects in the current frame and the key frame, wherein if the difference between the point cloud main directions of an object in the current frame and an object in the key frame is smaller than a set threshold, the two objects form an object matching pair;
(23) carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;
(3) initializing the camera attitude through matching of all feature points in the current frame and the key frame;
the step (3) is specifically as follows:
(31) calculating an essential matrix E by using an eight-point method;
(32) decomposing the essential matrix through SVD to obtain four possible solutions, i.e., candidate camera attitudes;
(33) calculating three-dimensional point cloud according to each possible camera pose and the feature point matching pair, wherein if the position of the point cloud conforms to the camera imaging model, the corresponding camera pose is an initialized camera pose;
(4) calculating by utilizing the camera attitude to obtain a three-dimensional point corresponding to the matched feature point d in the current frame, projecting the three-dimensional point to the current frame by utilizing camera intrinsic parameters, judging whether the projected point is in an object region where the feature point d is located, if not, searching a new matched point of the feature point d in the unmatched feature points of the key frame by adopting a similar matching method to form a new matched pair;
(5) updating all the feature point matching pairs by using the step (4), and updating the camera attitude by minimizing the following formula:
$$\xi^{*}=\arg\min_{\xi}\ \frac{1}{2}\sum_{i=1}^{n}\left\|u_{i}-\frac{1}{s_{i}}K\exp\left(\xi^{\wedge}\right)P_{i}\right\|_{2}^{2}$$

wherein exp(ξ^) denotes the Lie-algebra (exponential map) representation of the camera pose; n denotes the number of feature point matching pairs; u_i denotes the image coordinates of the i-th feature point matching pair in the current frame; s_i denotes the i-th scale factor; K denotes the camera intrinsic matrix; and P_i denotes the spatial point corresponding to the i-th matching pair in the key frame;
(6) constructing a three-dimensional map with the new camera posture, acquiring the appearance characteristics of each object in the three-dimensional map according to its semantic category, and deleting the three-dimensional points within the object that do not accord with its appearance characteristics;
(7) further judging whether a loop exists in the current frame or not by utilizing the semantic category, the point cloud number and the point cloud main direction of the object in the current frame, and if so, eliminating an accumulated error by utilizing closed loop optimization;
(8) and optimizing the global key map by using a nonlinear least square map optimization method, and finally performing global optimization.
2. The semantically driven camera localization and mapping method according to claim 1, wherein said step (23) specifically comprises:
Let the feature point sets of the regions where the two objects of the object matching pair are located be

$$A=\{a_{1},a_{2},\ldots,a_{m}\},\qquad B=\{b_{1},b_{2},\ldots,b_{k}\}$$

(231) selecting a feature point a_i from the set A, and computing in turn the similarity between a_i and every feature point in the set B; if the similarity between some feature point b_j in the set B and a_i is the largest and is greater than the set similarity threshold, then b_j and a_i form a feature point matching pair;
(232) selecting another feature point from the set A and repeating step (231) until matching pairs have been sought for all feature points of the set A.
3. The semantic-driven camera positioning and map reconstructing method according to claim 1, wherein the step (7) of determining whether the current frame has a loop specifically comprises the following sub-steps:
(41) detecting candidate loop frames through a bag-of-words model;
(42) comparing the semantic categories of the detected candidate loop frames with those of the current frame, and finding the candidate loop frames whose semantic categories match the current frame's in both number and kind;
(43) comparing the reconstructed point cloud counts of the remaining candidate loop frames, and keeping the candidates whose similarity is greater than a set threshold;
(44) finally, comparing the main direction of the point cloud reconstructed by the current frame with that of each candidate loop frame, and retaining the candidates above the similarity threshold; such a candidate constitutes a loop;
(45) eliminating the accumulated error through closed-loop optimization.
4. The semantically driven camera localization and mapping method according to claim 3, wherein the step (45) of eliminating the accumulated error by using closed loop optimization specifically comprises the following sub-steps:
(451) computing the matching pairs between the current key frame and the loop key frame to solve for the transformation between the two frames;
(452) if the number of feature point matching pairs meets the correction threshold, performing closed-loop correction and computing the corrected pose of each key frame with a propagation algorithm.
5. The semantically driven camera positioning and map reconstructing method according to claim 1, wherein said step (8) is specifically:
(81) taking the pose and the point cloud of each key frame as a vertex;
(82) establishing a constraint edge between the vertexes, wherein the constraint edge is the relative motion estimation between two pose nodes, the mapping constraint between the point cloud and the camera and the semantic constraint between the point clouds;
(83) using the vertices as optimization variables and the edges as constraint terms, solving for the optimal vertices satisfying the constraints with the Gauss-Newton method, i.e., obtaining the optimized camera attitudes and point cloud positions.
6. The semantic-driven camera positioning and map reconstruction method according to any one of claims 1 to 5, wherein the key frame selection criteria are as follows: the current frame is made a new key frame if one of the following conditions is met:
N frames have passed since the previous round of map reconstruction;
N frames have passed since the previous key frame was inserted;
the number of feature point matching pairs tracked by the current frame is less than ninety percent of the number of feature point matching pairs of the reference key frame.
7. A semantically driven camera localization and mapping system, comprising:
the first module is used for extracting the feature points of the current frame image, performing semantic segmentation on the current frame image with the built fully convolutional neural network, and obtaining the corresponding semantic category for each feature point;
the second module is used for matching all the feature points in the current frame and the key frame by adopting a similar matching method according to the similarity and the semantic category to obtain a feature point matching pair;
the second module comprises a similar matching unit, and the similar matching unit comprises the following parts:
the first subunit is used for acquiring objects of the same category in the current frame and the key frame according to the semantic category of the feature points;
the second subunit is used for calculating the point cloud main direction of each object in the same class of objects in the current frame and the key frame, and if the difference value of the point cloud main directions of a certain object in the current frame and a certain object in the key frame is smaller than a set threshold value, the two objects are an object matching pair;
the third subunit is used for carrying out similarity matching on the feature points of the areas where the two objects are located in the object matching pair to obtain a final feature point matching pair;
the third module is used for initializing the camera attitude through matching of all feature points in the current frame and the key frame;
the fourth module is used for calculating by utilizing the camera attitude to obtain a three-dimensional point corresponding to the matched feature point d in the current frame, projecting the three-dimensional point to the current frame by utilizing camera intrinsic parameters, judging whether the projection point is in an object region where the feature point d is located, and if not, searching a new matched point of the feature point d in the unmatched feature points of the key frame by adopting a similar matching method to form a new matched pair;
a fifth module for updating all the feature point matching pairs using the fourth module, and then updating the camera pose by minimizing:
$$\xi^{*}=\arg\min_{\xi}\ \frac{1}{2}\sum_{i=1}^{n}\left\|u_{i}-\frac{1}{s_{i}}K\exp\left(\xi^{\wedge}\right)P_{i}\right\|_{2}^{2}$$

wherein exp(ξ^) denotes the Lie-algebra (exponential map) representation of the camera pose; n denotes the number of feature point matching pairs; u_i denotes the image coordinates of the i-th feature point matching pair in the current frame; s_i denotes the i-th scale factor; K denotes the camera intrinsic matrix; and P_i denotes the spatial point corresponding to the i-th matching pair in the key frame;
the sixth module is used for constructing a three-dimensional map with the new camera posture, acquiring the appearance characteristics of each object in the three-dimensional map according to its semantic category, and deleting the three-dimensional points within the object that do not accord with its appearance characteristics;
the seventh module is used for further judging whether the current frame has a loop or not by utilizing the semantic type, the point cloud number and the point cloud main direction of the object in the current frame, and if so, eliminating the accumulated error by utilizing closed loop optimization;
and the eighth module is used for optimizing the global key map by using a nonlinear least square map optimization method and finally performing global optimization.
CN201910557726.8A 2019-06-26 2019-06-26 Semantic-driven camera positioning and map reconstruction method and system Active CN110335319B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910557726.8A CN110335319B (en) 2019-06-26 2019-06-26 Semantic-driven camera positioning and map reconstruction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910557726.8A CN110335319B (en) 2019-06-26 2019-06-26 Semantic-driven camera positioning and map reconstruction method and system

Publications (2)

Publication Number Publication Date
CN110335319A CN110335319A (en) 2019-10-15
CN110335319B true CN110335319B (en) 2022-03-18

Family

ID=68142729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910557726.8A Active CN110335319B (en) 2019-06-26 2019-06-26 Semantic-driven camera positioning and map reconstruction method and system

Country Status (1)

Country Link
CN (1) CN110335319B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110910389B (en) * 2019-10-30 2021-04-09 中山大学 Laser SLAM loop detection system and method based on graph descriptor
CN111046125A (en) * 2019-12-16 2020-04-21 视辰信息科技(上海)有限公司 Visual positioning method, system and computer readable storage medium
CN111311708B (en) * 2020-01-20 2022-03-11 北京航空航天大学 Visual SLAM method based on semantic optical flow and inverse depth filtering
CN111310654B (en) * 2020-02-13 2023-09-08 北京百度网讯科技有限公司 Map element positioning method and device, electronic equipment and storage medium
CN111325842B (en) * 2020-03-04 2023-07-28 Oppo广东移动通信有限公司 Map construction method, repositioning method and device, storage medium and electronic equipment
CN111368759B (en) * 2020-03-09 2022-08-30 河海大学常州校区 Monocular vision-based mobile robot semantic map construction system
CN111429517A (en) * 2020-03-23 2020-07-17 Oppo广东移动通信有限公司 Relocation method, relocation device, storage medium and electronic device
CN111427373B (en) * 2020-03-24 2023-11-24 上海商汤临港智能科技有限公司 Pose determining method, pose determining device, medium and pose determining equipment
CN112585946A (en) * 2020-03-27 2021-03-30 深圳市大疆创新科技有限公司 Image shooting method, image shooting device, movable platform and storage medium
CN111311742B (en) * 2020-03-27 2023-05-05 阿波罗智能技术(北京)有限公司 Three-dimensional reconstruction method, three-dimensional reconstruction device and electronic equipment
CN111815687A (en) * 2020-06-19 2020-10-23 浙江大华技术股份有限公司 Point cloud matching method, positioning method, device and storage medium
CN112085026A (en) * 2020-08-26 2020-12-15 的卢技术有限公司 Closed loop detection method based on deep neural network semantic segmentation
CN112419512B (en) * 2020-10-13 2022-09-13 南昌大学 Air three-dimensional model repairing system and method based on semantic information
CN112507056B (en) * 2020-12-21 2023-03-21 华南理工大学 Map construction method based on visual semantic information
CN112927269A (en) * 2021-03-26 2021-06-08 深圳市无限动力发展有限公司 Map construction method and device based on environment semantics and computer equipment
CN113591865B (en) * 2021-07-28 2024-03-26 深圳甲壳虫智能有限公司 Loop detection method and device and electronic equipment
CN114639006B (en) * 2022-03-15 2023-09-26 北京理工大学 Loop detection method and device and electronic equipment

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645170A (en) * 2009-09-03 2010-02-10 北京信息科技大学 Precise registration method of multilook point cloud
CN102308320A (en) * 2009-02-06 2012-01-04 香港科技大学 Generating three-dimensional models from images
CN104361627A (en) * 2014-11-07 2015-02-18 武汉科技大学 SIFT-based (scale-invariant feature transform) binocular vision three-dimensional image reconstruction method of asphalt pavement micro-texture
CN107392964A (en) * 2017-07-07 2017-11-24 武汉大学 The indoor SLAM methods combined based on indoor characteristic point and structure lines
CN107833236A (en) * 2017-10-31 2018-03-23 中国科学院电子学研究所 Semantic vision positioning system and method are combined under a kind of dynamic environment
CN108230337A (en) * 2017-12-31 2018-06-29 厦门大学 A kind of method that semantic SLAM systems based on mobile terminal are realized
CN108596053A (en) * 2018-04-09 2018-09-28 华中科技大学 A kind of vehicle checking method and system based on SSD and vehicle attitude classification
CN109272577A (en) * 2018-08-30 2019-01-25 北京计算机技术及应用研究所 A kind of vision SLAM method based on Kinect
CN109544629A (en) * 2018-11-29 2019-03-29 南京人工智能高等研究院有限公司 Camera pose determines method and apparatus and electronic equipment
CN109658449A (en) * 2018-12-03 2019-04-19 华中科技大学 A kind of indoor scene three-dimensional rebuilding method based on RGB-D image
CN109815847A (en) * 2018-12-30 2019-05-28 中国电子科技集团公司信息科学研究院 A kind of vision SLAM method based on semantic constraint
CN109816686A (en) * 2019-01-15 2019-05-28 山东大学 Robot semanteme SLAM method, processor and robot based on object example match

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610175A (en) * 2017-08-04 2018-01-19 华南理工大学 The monocular vision SLAM algorithms optimized based on semi-direct method and sliding window

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102308320A (en) * 2009-02-06 2012-01-04 香港科技大学 Generating three-dimensional models from images
CN101645170A (en) * 2009-09-03 2010-02-10 北京信息科技大学 Precise registration method of multilook point cloud
CN104361627A (en) * 2014-11-07 2015-02-18 武汉科技大学 SIFT-based (scale-invariant feature transform) binocular vision three-dimensional image reconstruction method of asphalt pavement micro-texture
CN107392964A (en) * 2017-07-07 2017-11-24 武汉大学 The indoor SLAM methods combined based on indoor characteristic point and structure lines
CN107833236A (en) * 2017-10-31 2018-03-23 中国科学院电子学研究所 Semantic vision positioning system and method are combined under a kind of dynamic environment
CN108230337A (en) * 2017-12-31 2018-06-29 厦门大学 A kind of method that semantic SLAM systems based on mobile terminal are realized
CN108596053A (en) * 2018-04-09 2018-09-28 华中科技大学 A kind of vehicle checking method and system based on SSD and vehicle attitude classification
CN109272577A (en) * 2018-08-30 2019-01-25 北京计算机技术及应用研究所 A kind of vision SLAM method based on Kinect
CN109544629A (en) * 2018-11-29 2019-03-29 南京人工智能高等研究院有限公司 Camera pose determines method and apparatus and electronic equipment
CN109658449A (en) * 2018-12-03 2019-04-19 华中科技大学 A kind of indoor scene three-dimensional rebuilding method based on RGB-D image
CN109815847A (en) * 2018-12-30 2019-05-28 中国电子科技集团公司信息科学研究院 A kind of vision SLAM method based on semantic constraint
CN109816686A (en) * 2019-01-15 2019-05-28 山东大学 Robot semanteme SLAM method, processor and robot based on object example match

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于云的语义库设计及机器人语义地图构建";于金山等;《机器人 ROBOT》;20161231;第 38 卷(第 4 期);第410-419页 *

Also Published As

Publication number Publication date
CN110335319A (en) 2019-10-15

Similar Documents

Publication Publication Date Title
CN110335319B (en) Semantic-driven camera positioning and map reconstruction method and system
CN107967457B (en) Site identification and relative positioning method and system adapting to visual characteristic change
CN110781262B (en) Semantic map construction method based on visual SLAM
US8798357B2 (en) Image-based localization
Eade et al. Monocular graph SLAM with complexity reduction
CN113129335B (en) Visual tracking algorithm and multi-template updating strategy based on twin network
CN110119768B (en) Visual information fusion system and method for vehicle positioning
CN112037268B (en) Environment sensing method based on probability transfer model in dynamic scene
CN112446882A (en) Robust visual SLAM method based on deep learning in dynamic scene
CN114088081A (en) Map construction method for accurate positioning based on multi-segment joint optimization
CN114140527A (en) Dynamic environment binocular vision SLAM method based on semantic segmentation
Shi et al. Dense semantic 3D map based long-term visual localization with hybrid features
Hu et al. Multiple maps for the feature-based monocular SLAM system
CN112287906B (en) Template matching tracking method and system based on depth feature fusion
Yang et al. Probabilistic projective association and semantic guided relocalization for dense reconstruction
Ali et al. A life-long SLAM approach using adaptable local maps based on rasterized LIDAR images
CN113570713B (en) Semantic map construction method and device for dynamic environment
CN112560651B (en) Target tracking method and device based on combination of depth network and target segmentation
CN113888603A (en) Loop detection and visual SLAM method based on optical flow tracking and feature matching
CN114067128A (en) SLAM loop detection method based on semantic features
Zhang et al. Appearance-based loop closure detection via bidirectional manifold representation consensus
CN116592897B (en) Improved ORB-SLAM2 positioning method based on pose uncertainty
CN112396593B (en) Closed loop detection method based on key frame selection and local features
CN113435256B (en) Three-dimensional target identification method and system based on geometric consistency constraint
CN113012212B (en) Depth information fusion-based indoor scene three-dimensional point cloud reconstruction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant