CN118089753A - Monocular semantic SLAM positioning method and system based on three-dimensional target - Google Patents

Monocular semantic SLAM positioning method and system based on three-dimensional target

Publication number: CN118089753A (pending)
Application number: CN202410511058.6A
Authority: CN (China)
Original language: Chinese (zh)
Priority date / filing date: 2024-04-26
Publication date: 2024-05-28
Inventors: 秦晓辉, 周云水, 曾聪磊, 徐彪, 秦兆博, 谢国涛, 王晓伟
Applicant / assignee: Jiangsu Jicui Qinglian Intelligent Control Technology Co., Ltd.


Abstract

The invention relates to the technical field of automatic driving positioning, and particularly discloses a monocular semantic SLAM positioning method and system based on a three-dimensional target. The method comprises the following steps: performing two-dimensional target detection on a two-dimensional image with target label information to obtain a two-dimensional target detection result with geometric characteristics; performing cuboid proposal sampling on the two-dimensional target detection result to realize three-dimensional target detection and obtain a three-dimensional target detection result; performing two-dimensional rough matching according to the two-dimensional target detection result, and performing three-dimensional fine matching based on the two-dimensional rough matching result and the three-dimensional target detection result to obtain a three-dimensional object matching result; and performing bundle adjustment optimization on the three-dimensional object matching result to obtain a positioning result. The monocular semantic SLAM positioning method based on the three-dimensional target can achieve higher-precision positioning.

Description

Monocular semantic SLAM positioning method and system based on three-dimensional target
Technical Field
The invention relates to the technical field of automatic driving positioning, in particular to a monocular semantic SLAM positioning method based on a three-dimensional target and a monocular semantic SLAM positioning system based on the three-dimensional target.
Background
SLAM (Simultaneous Localization and Mapping) technology is a key technology for implementing automatic driving localization tasks. Conventional SLAM techniques based on low-level features such as points and lines are now fairly mature. However, because of the complex interaction between illumination and object surfaces, low-level features extracted from real scenes are often difficult to observe consistently under changes in time and illumination, so the same feature may be wrongly associated across different sequence data, which ultimately interferes with higher-level functions such as subsequent positioning, navigation and obstacle avoidance. Therefore, an automatic driving system needs to introduce higher-level semantic information to improve its environment-understanding capability, thereby improving the positioning accuracy and robustness of the system. Three-dimensional object targets are important semantic information in automatic driving. Because three-dimensional object landmarks are more reliable and identifiable, an automatic driving system can track them to realize higher-precision positioning and better robustness in various scenes, while also constructing a three-dimensional semantic map that lays a foundation for tasks such as navigation, obstacle avoidance and path planning. Object detection is a core technology for acquiring object-level semantic information. Although 2D target detection technology is mature, the position and pose information of objects in the real world cannot be effectively obtained from two-dimensional detection results alone, which falls far short of the actual requirements of automatic driving scenarios.
In the prior art, a binocular semantic SLAM method oriented to automatic driving scenes has been proposed. The method uses a quadric surface as the representation of a three-dimensional object, predicts the orientation prior of the object with an orientation prediction network, and obtains a mask prior for each object with an instance segmentation network. The pose of the object is then obtained from the constraint between the two-dimensional bounding box and the projection of the quadric, together with the constraints provided by the prior information. However, this method relies on networks to acquire a large amount of prior information, so real-time performance cannot be guaranteed in automatic driving scenes. Meanwhile, the method uses feature point matching for data association, but this matching strategy cannot handle the situation where objects overlap, is prone to mismatching, and its matching accuracy is difficult to guarantee.
In addition, the prior art also provides a vehicle positioning and vehicle 3D detection method based on binocular visual SLAM. This method mainly addresses vehicle 3D detection: on the basis of a two-dimensional target detection result, the positions of spatial points are obtained from the feature points matched between each image pair, and the three-dimensional detection of the vehicle is obtained by fitting these spatial points with a preset fitting algorithm for the binocular camera, namely a minimum circumscribed cube fitting algorithm. In this method, the minimum circumscribed cube fitting uses only the spatial points corresponding to the matched image feature points and does not use the geometric feature constraints of the object, so when the extracted feature points are sparse, the accuracy of the obtained cube cannot be guaranteed. Meanwhile, the acquired three-dimensional targets are not used to construct a semantic map, add constraints, improve positioning accuracy, or realize obstacle-avoidance and navigation functions.
Therefore, how to improve positioning and navigation accuracy and achieve higher-precision positioning is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The invention provides a monocular semantic SLAM positioning method based on a three-dimensional target and a monocular semantic SLAM positioning system based on the three-dimensional target, which solve the problem in the related art that the positioning and navigation precision of an automatic driving vehicle is difficult to guarantee.
As a first aspect of the present invention, there is provided a monocular semantic SLAM positioning method based on a three-dimensional target, including: performing two-dimensional target detection on a two-dimensional image with target label information to obtain a two-dimensional target detection result with geometric characteristics; performing cuboid proposal sampling on the two-dimensional target detection result to realize three-dimensional target detection and obtain a three-dimensional target detection result; performing two-dimensional rough matching according to the two-dimensional target detection result, and performing three-dimensional fine matching based on the two-dimensional rough matching result and the three-dimensional target detection result to obtain a three-dimensional object matching result; and performing bundle adjustment optimization on the three-dimensional object matching result to obtain a positioning result. The two-dimensional rough matching comprises associating the feature points obtained from the two-dimensional target detection result with the target label information in different frames of the two-dimensional image to obtain associated feature points; the three-dimensional fine matching comprises selecting, from the associated feature points of the two-dimensional rough matching, those that can be associated to a three-dimensional cuboid object.
Further, performing cuboid proposal sampling on the two-dimensional target detection result to realize three-dimensional target detection and obtain a three-dimensional target detection result includes: performing cuboid proposal sampling on the two-dimensional target detection result to obtain the projection result of the three-dimensional cuboid target on the two-dimensional image; performing ground assumption according to the projection result of the three-dimensional cuboid target on the two-dimensional image to recover the pose of the three-dimensional cuboid target and obtain a target object proposal result; performing edge detection and line feature extraction on the target object proposal result to obtain the geometric features of the target object; and constructing a scoring function comprising an angle error, a distance error and a shape error according to the geometric features of the target object, and scoring the target object proposal result according to the scoring function to obtain a three-dimensional target detection result.
Further, performing cuboid proposal sampling on the two-dimensional target detection result to obtain the projection result of the three-dimensional cuboid target on the two-dimensional image includes: determining a two-dimensional target detection frame according to the two-dimensional target detection result; determining the number of parameters of the three-dimensional cuboid target, and determining vanishing point constraint conditions based on the requirement that the projected three-dimensional cuboid target fits tightly against the two-dimensional target detection frame; and obtaining the projection result of the three-dimensional cuboid target on the two-dimensional image according to the vanishing point constraint conditions.
Further, performing ground assumption according to the projection result of the three-dimensional cuboid target on the two-dimensional image to recover the pose of the three-dimensional cuboid target and obtain a target object proposal result includes: setting, according to the ground assumption, the roll angle and the pitch angle of the three-dimensional cuboid target to zero; and back-projecting the ground corner points of the two-dimensional cuboid target onto the three-dimensional ground plane, and calculating the vertical corner points of the two-dimensional cuboid target to recover the pose of the three-dimensional cuboid target.
Further, performing two-dimensional rough matching according to the two-dimensional target detection result, and performing three-dimensional fine matching based on the two-dimensional rough matching result and the three-dimensional target detection result to obtain a three-dimensional object matching result, including: screening out characteristic points positioned in a two-dimensional target detection frame according to a two-dimensional target detection result; respectively associating the characteristic points positioned in the two-dimensional target detection frame with target label information in different frames of the two-dimensional image, and obtaining association characteristic points based on two-dimensional rough matching; and carrying out three-dimensional fine matching on the associated feature points of the two-dimensional coarse matching and the three-dimensional target detection result to obtain a three-dimensional object matching result.
Further, performing three-dimensional fine matching on the associated feature points of the two-dimensional coarse matching and the three-dimensional target detection result to obtain a three-dimensional object matching result includes: recovering the associated feature points of the two-dimensional coarse matching into three-dimensional space; calculating the distance between each associated feature point of the two-dimensional coarse matching and the centroid of the corresponding three-dimensional cuboid target; and if the distance is within the preset threshold, determining that the associated feature point of the two-dimensional coarse matching can be associated to the three-dimensional cuboid target, and taking the associated feature point of the two-dimensional coarse matching as part of the three-dimensional object matching result.
Further, performing bundle adjustment optimization on the three-dimensional object matching result to obtain a positioning result includes: constructing map elements including cameras, feature points and cuboid objects according to the three-dimensional object matching result; and performing bundle adjustment optimization on the map elements in combination with the measurement errors between camera and feature point, between camera and cuboid object, and between feature point and cuboid object, so as to obtain a positioning result.
Further, the measurement error between the camera and the cuboid object comprises a corner projection error and a two-dimensional target-frame projection error, wherein the corner projection error can be written as:

$$e_{\mathrm{corner}} = \sum_{i=1}^{8} \left\| u_i - \pi\big( K ( R\, P_i + t ) \big) \right\|_2$$

where $u_i$ represents the projection coordinates of a cuboid corner point in the two-dimensional plane, $P_i$ represents the coordinates of the corresponding corner point of the cuboid object in three-dimensional space, $K$ represents the camera intrinsic matrix, $R$ represents the rotation matrix, $t$ represents the translation vector, and $\pi(\cdot)$ denotes perspective projection.

The projection error of the two-dimensional target frame can be written as:

$$e_{\mathrm{bbox}} = \left\| \left[ c_{2D} - c'_{2D},\ s_{2D} - s'_{2D} \right] \right\|_2$$

where $c_{2D}$ and $s_{2D}$ respectively represent the center and size of the true two-dimensional bounding box, and $c'_{2D}$ and $s'_{2D}$ respectively represent the center and size of the projected two-dimensional bounding box.
Further, the measurement error between the feature point and the cuboid object can be written as:

$$e_{po} = \left\| \max\left( \left| T_{ow}\, P_w \right| - s,\ 0 \right) \right\|_2$$

where $P_w$ represents the spatial coordinates of the feature point in the world coordinate system, $T_{ow}$ represents the transformation matrix that transforms the feature point from the world coordinate system to the cuboid object coordinate system, $s$ represents the dimensions of the cuboid object, and $|\cdot|$ and $\max(\cdot,0)$ are applied element-wise.
As another aspect of the present invention, there is provided a monocular semantic SLAM positioning system based on a three-dimensional target, for implementing the monocular semantic SLAM positioning method based on a three-dimensional target described above, including: a two-dimensional target detection module, configured to perform two-dimensional target detection on a two-dimensional image with target label information to obtain a two-dimensional target detection result with geometric characteristics; a three-dimensional target detection module, configured to perform cuboid proposal sampling on the two-dimensional target detection result to realize three-dimensional target detection and obtain a three-dimensional target detection result; a three-dimensional object matching module, configured to perform two-dimensional rough matching according to the two-dimensional target detection result, and perform three-dimensional fine matching based on the two-dimensional rough matching result and the three-dimensional target detection result to obtain a three-dimensional object matching result; and an optimization module, configured to perform bundle adjustment optimization on the three-dimensional object matching result to obtain a positioning result.
According to the monocular semantic SLAM positioning method based on the three-dimensional target provided by the invention, a three-dimensional target detection result is obtained from two-dimensional target detection and three-dimensional cuboid proposals, coarse matching is performed based on the two-dimensional target detection result, fine matching is performed on the coarse matching result together with the three-dimensional target detection result, and finally bundle adjustment optimization is performed on the three-dimensional object matching result. The method therefore realizes three-dimensional target detection using two-dimensional target detection and the geometric features of objects, avoiding the heavy workload of training a three-dimensional target detection network while keeping the detection result accurate, and it adds the three-dimensional target detection result to the map, so that the map contains semantics, is denser, and meets upper-layer navigation and obstacle-avoidance requirements. In addition, the inter-frame data association uses semantic information and geometric relations to realize coarse-to-fine matching, effectively eliminating mismatches and achieving accurate data association. Finally, in the point-feature-based BA, constraints from the three-dimensional target detection result are added to the BA optimization, realizing higher-precision positioning.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification, illustrate the invention and together with the description serve to explain, without limitation, the invention.
FIG. 1 is a flow chart of a monocular semantic SLAM positioning method based on a three-dimensional object.
Fig. 2 is a flowchart of obtaining a three-dimensional object detection result according to the present invention.
Fig. 3 is a flowchart of cuboid proposal sampling for a two-dimensional target detection result provided by the invention.
Fig. 4 is a schematic diagram of the two-dimensional projection of a three-dimensional cuboid object when three faces are observed, provided by the invention.
Fig. 5 is a flowchart of obtaining a three-dimensional object matching result according to the present invention.
FIG. 6 is a block diagram of a three-dimensional object-based monocular semantic SLAM positioning system according to the present invention.
Fig. 7 is a schematic diagram of a specific working process of the monocular semantic SLAM positioning system based on a three-dimensional object.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this embodiment, a monocular semantic SLAM positioning method based on a three-dimensional object is provided, and fig. 1 is a flowchart of a monocular semantic SLAM positioning method based on a three-dimensional object according to an embodiment of the present invention, as shown in fig. 1, including:
s100, performing two-dimensional target detection on a two-dimensional image with target label information to obtain a two-dimensional target detection result with geometric characteristics; in the embodiment of the invention, a two-dimensional image with target label information, namely a two-dimensional RGB image, is subjected to target recognition by adopting a two-dimensional target detection model YOLO model which is currently mainstream, and objects with obvious geometric characteristics, such as vehicles, chairs and the like, are screened out according to semantic labels.
S200, performing cuboid proposal sampling on the two-dimensional target detection result to realize three-dimensional target detection and obtain a three-dimensional target detection result. According to the two-dimensional target detection result, a series of VPs (vanishing points) are sampled to construct three-dimensional cuboid target proposals, thereby realizing three-dimensional target detection.
S300, performing two-dimensional rough matching according to the two-dimensional target detection result, and performing three-dimensional fine matching based on the two-dimensional rough matching result and the three-dimensional target detection result to obtain a three-dimensional object matching result; it should be understood that, based on the matching relationship between the two-dimensional target detection result and the feature points and the matching relationship between the cross-frame feature points, the rough matching of the cross-frame object is realized; and based on the distance relation between the three-dimensional cuboid object and the matching feature points, mismatching caused by overlapping is eliminated, and object fine matching is realized.
S400, performing bundle adjustment optimization on the three-dimensional object matching result to obtain a positioning result.
In the embodiment of the invention, a map containing camera, feature point and cuboid target elements is constructed based on the three-dimensional target detection result. BA (Bundle Adjustment) is then constructed by combining the measurement errors between camera and feature point, between camera and cuboid target, and between feature point and cuboid target, optimizing the map elements and realizing higher-precision positioning.
According to the monocular semantic SLAM positioning method based on the three-dimensional target provided by the invention, a three-dimensional target detection result is obtained from two-dimensional target detection and three-dimensional cuboid proposals, coarse matching is performed based on the two-dimensional target detection result, fine matching is performed on the coarse matching result together with the three-dimensional target detection result, and finally bundle adjustment optimization is performed on the three-dimensional object matching result. The method therefore realizes three-dimensional target detection using two-dimensional target detection and the geometric features of objects, avoiding the heavy workload of training a three-dimensional target detection network while keeping the detection result accurate, and it adds the three-dimensional target detection result to the map, so that the map contains semantics, is denser, and meets upper-layer navigation and obstacle-avoidance requirements. In addition, the inter-frame data association uses semantic information and geometric relations to realize coarse-to-fine matching, effectively eliminating mismatches and achieving accurate data association. Finally, in the point-feature-based BA, constraints from the three-dimensional target detection result are added to the BA optimization, realizing higher-precision positioning.
Specifically, in the embodiment of the invention, two-dimensional target detection is performed on a two-dimensional image with target tag information, specifically, a visual SLAM system based on low-level point features, namely ORB-SLAM2, is used for extracting feature points, and simultaneously, a YOLO two-dimensional target detection network is fused to perform two-dimensional target detection on an RGB image, and targets with obvious geometric features, such as automobiles, chairs and the like, are screened and reserved according to the tag information.
It should be understood that, when extracting two-dimensional targets, the open-source two-dimensional target detection framework YOLO adopted in the embodiment of the invention can be replaced by other target detection models or semantic segmentation models in order to obtain the semantic information of the two-dimensional image.
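As a purely illustrative sketch of this screening step (the label set, the score threshold, and the Detection2D structure below are assumptions, not part of the patent), the filtering of detections by semantic label can look like this:

```python
from dataclasses import dataclass

# Labels assumed (for illustration) to have distinct cuboid-like geometry.
GEOMETRIC_LABELS = {"car", "truck", "bus", "chair"}

@dataclass
class Detection2D:
    label: str      # semantic label from the 2D detector (e.g. YOLO)
    score: float    # detection confidence
    box: tuple      # (x_min, y_min, x_max, y_max) in pixels

def screen_detections(detections, min_score=0.5):
    """Keep only confident detections whose labels suggest clear geometric features."""
    return [d for d in detections
            if d.label in GEOMETRIC_LABELS and d.score >= min_score]

if __name__ == "__main__":
    dets = [Detection2D("car", 0.9, (100, 120, 260, 240)),
            Detection2D("sky", 0.8, (0, 0, 640, 100))]
    print(screen_detections(dets))   # only the "car" detection survives
```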
In the embodiment of the present invention, the sampling of the cuboid proposal is performed on the two-dimensional target detection result to realize three-dimensional target detection, so as to obtain a three-dimensional target detection result, as shown in fig. 2, including:
And S210, performing cuboid proposal sampling on the two-dimensional target detection result to obtain a projection result of the three-dimensional cuboid target on a two-dimensional image.
In an embodiment of the present invention, as shown in fig. 3, specifically, the method includes:
S211, determining a two-dimensional target detection frame according to the two-dimensional target detection result.
S212, determining the number of parameters of the three-dimensional cuboid target, and determining the vanishing point constraint conditions based on the requirement that the projected three-dimensional cuboid target fits tightly against the two-dimensional target detection frame.
A cuboid target is described by 9 parameters, O = (t, R, s), where t = (t_x, t_y, t_z) represents the position, R = R(roll, pitch, yaw) represents the rotation, and s = (s_x, s_y, s_z) represents the dimensions. The cuboid coordinate system is located at the center of the cuboid and aligned with its principal axes.
Based on the assumption that the projection of the three-dimensional cuboid fits tightly against the two-dimensional target detection frame, the projected corner points of the cuboid should lie on the four sides of the detection frame. The four sides thus provide four constraints; since four constraints are insufficient to fully constrain the 9 degrees of freedom, additional constraints are added using the vanishing points (VPs) of the projected parallel edges of the cuboid.
S213, obtaining the projection result of the three-dimensional cuboid object on the two-dimensional image according to the vanishing point constraint conditions.
Depending on the number of visible faces of the projected cuboid, several cases arise; taking the three-visible-face case shown in Fig. 4 as an example, a corner point (denoted p1 here) is sampled on the top edge of the detection frame, and the remaining seven corner points can then be calculated from the three vanishing points, giving the projection of the cuboid object on the two-dimensional plane. The specific process is as follows: the vanishing points VP1, VP2, VP3 and the top corner point p1 are obtained by sampling; the bounding box is obtained through two-dimensional target detection; and, with the operator × denoting the intersection of two straight lines, each remaining corner point is obtained by intersecting two lines determined by the known quantities (for example, the line through a vanishing point and an already known corner with an edge of the detection frame, or with another such line). In this way the projection of the three-dimensional cuboid object on the two-dimensional image is obtained.
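A minimal sketch of this line-intersection step in homogeneous coordinates is given below; which detection-frame edge is intersected, the corner naming, and the numeric values are illustrative assumptions for the three-visible-face case, not the patent's exact derivation.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two 2D points given as (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersect(l1, l2):
    """Intersection of two homogeneous lines, returned as a 2D point."""
    x = np.cross(l1, l2)
    return x[:2] / x[2]

def corner_from_vp(vp1, p1, box):
    """Example of one intersection: the line through VP1 and the sampled top
    corner p1 meets the right edge of the 2D detection box at another corner."""
    x_min, y_min, x_max, y_max = box
    right_edge = line_through((x_max, y_min), (x_max, y_max))
    return intersect(line_through(vp1, p1), right_edge)

if __name__ == "__main__":
    box = (100.0, 120.0, 260.0, 240.0)
    p1 = (150.0, 120.0)       # sampled corner on the top edge of the box
    vp1 = (900.0, 140.0)      # one of the three sampled vanishing points
    print(corner_from_vp(vp1, p1, box))   # corner on the right edge, x = 260
```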
S220, performing ground assumption according to the projection result of the three-dimensional cuboid object on the two-dimensional image to recover the pose of the three-dimensional cuboid object and obtain a target object proposal result.
Specifically, in the embodiment of the present invention, performing ground assumption according to the projection result of the three-dimensional cuboid object on the two-dimensional image to recover the pose of the three-dimensional cuboid object and obtain a target object proposal result includes: setting, according to the ground assumption, the roll angle and the pitch angle of the three-dimensional cuboid target to zero; and back-projecting the ground corner points of the two-dimensional cuboid target onto the three-dimensional ground plane, then calculating the vertical corner points of the two-dimensional cuboid target to recover the pose of the three-dimensional cuboid target.
Based on the ground assumption, the roll angle and pitch angle of the cuboid object are zero. The degrees of freedom of the cuboid are thus reduced by two, so that the available constraints exactly match the remaining degrees of freedom (DoF). The ground corner points of the two-dimensional cuboid are back-projected onto the three-dimensional ground plane, and the other, vertical corner points are then calculated to form the three-dimensional cuboid. The back-projection can be written as:

$$P_i = \frac{-d}{\,n^{\top} K^{-1} \tilde{u}_i\,}\; K^{-1} \tilde{u}_i$$

where $\tilde{u}_i$ represents a ground corner point of the two-dimensional cuboid projection in homogeneous coordinates, $n$ represents the normal vector of the ground plane on which the corresponding three-dimensional corner point lies, $d$ represents the distance from that plane to the camera coordinate system, and $K^{-1}$ is the back-projection matrix that maps the two-dimensional image corner points onto rays intersecting the three-dimensional plane.
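The back-projection of a ground corner can be sketched as follows, assuming the ground plane is expressed in the camera frame by a normal n and an offset d (so that nᵀX + d = 0); the intrinsics and plane parameters below are illustrative values only.

```python
import numpy as np

def backproject_to_ground(u, K, n, d):
    """Back-project pixel u = (x, y) onto the plane n^T X + d = 0 in the camera frame.

    The viewing ray through the pixel is X(s) = s * K^{-1} [x, y, 1]^T; substituting
    it into the plane equation gives the scale s, i.e. the depth along the ray."""
    ray = np.linalg.inv(K) @ np.array([u[0], u[1], 1.0])
    s = -d / float(n @ ray)      # ray-plane intersection
    return s * ray               # 3D ground corner in the camera frame

if __name__ == "__main__":
    K = np.array([[700.0, 0.0, 320.0],
                  [0.0, 700.0, 240.0],
                  [0.0, 0.0, 1.0]])
    n = np.array([0.0, -1.0, 0.0])   # plane -Y + 1.5 = 0, i.e. ground at Y = 1.5 m
    d = 1.5                          # camera assumed mounted 1.5 m above the ground
    print(backproject_to_ground((350.0, 300.0), K, n, d))   # point lies on Y = 1.5
```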
And S230, carrying out edge detection and line feature extraction on the target object proposal result to obtain the geometric features of the target object.
S240, constructing a scoring function comprising an angle error, a distance error and a shape error according to the geometric features of the target object, and scoring the target object proposal result according to the scoring function to obtain a three-dimensional target detection result.
After a series of object proposals are obtained by sampling, based on the assumption that the projection of a cuboid object fits the edges detected in the image, the geometric features of the object are rapidly extracted through Canny edge detection and FLD line feature extraction, and a fast and effective scoring function is constructed from the geometric feature constraints to score the proposals. The scoring function includes three terms, namely a distance error, an angle alignment error and a shape error, and can be written as:

$$E(O \mid I) = \phi_{\mathrm{dist}}(O, I) + w_1\, \phi_{\mathrm{angle}}(O, I) + w_2\, \phi_{\mathrm{shape}}(O)$$

where $E(O \mid I)$ represents the total error of the three-dimensional target $O$ in image $I$ and comprises the three error terms; $\phi_{\mathrm{dist}}$, $\phi_{\mathrm{angle}}$ and $\phi_{\mathrm{shape}}$ respectively represent the distance error, the angle alignment error and the shape error; and $w_1$ and $w_2$ respectively represent the weights of the angle alignment error and the shape error, which are set to preset values.
Distance error: the edges of the two-dimensional cuboid should fit the actual image edges. A distance map is constructed using Canny edge detection, the chamfer distances of the cuboid edges are accumulated and summed, and the result is normalized by the size of the two-dimensional detection frame. Concretely, 10 points are sampled from the object edge lines obtained after edge detection and 12 points are sampled from the parallel edges of the two-dimensional cuboid projection; a distance function computes the distance between the object edge lines and the parallel edges of the cuboid, and the average of these distances is taken as the distance error.
Angle alignment error: the line feature extraction algorithm FLD is used to rapidly extract line segments inside the target frame, shorter segments are merged and removed, and the angles between the remaining segments and the three groups of parallel edges of the cuboid are measured to judge whether they are aligned with the vanishing points. The angle error is defined over the vanishing points and the detected line segments: for each detected segment, the difference between the direction of the segment and the direction from one of its endpoints towards the corresponding vanishing point is accumulated.
Shape error: the first two terms can be evaluated in the two-dimensional image space, but similar two-dimensional cuboid corners may correspond to quite different three-dimensional cuboids, so a cost is added to penalize cuboids with a large skew ratio $s = \mathrm{length}/\mathrm{width}$. The shape error is defined as:

$$\phi_{\mathrm{shape}}(O) = \max\left( s - \sigma,\ 0 \right)$$

where $\sigma$ represents a threshold selected according to the target label information, ensuring that the generated cuboid target is neither too long nor too flat. The $\max(\cdot, 0)$ operation returns 0 when the skew ratio is below the threshold, in which case the shape error is not penalized.
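To make the overall structure of the scoring concrete, the sketch below combines the three terms with placeholder weights and threshold; the numeric values and the way the distance and angle terms are supplied are assumptions, since only the shape term max(s − σ, 0) and the weighted-sum form are fixed by the text above.

```python
def shape_error(length, width, sigma):
    """Penalize overly elongated cuboids: max(length/width - sigma, 0)."""
    skew = length / width
    return max(skew - sigma, 0.0)

def proposal_score(dist_err, angle_err, shp_err, w_angle=0.8, w_shape=1.5):
    """Total cost of one cuboid proposal; lower is better.  The weights here are
    placeholders -- the patent presets them but the values are not reproduced."""
    return dist_err + w_angle * angle_err + w_shape * shp_err

def best_proposal(proposals):
    """proposals: list of (cuboid_id, dist_err, angle_err, length, width)."""
    scored = [(proposal_score(d, a, shape_error(l, w, sigma=3.0)), cub)
              for cub, d, a, l, w in proposals]
    return min(scored)[1]

if __name__ == "__main__":
    candidates = [("proposal_A", 0.20, 0.10, 4.5, 1.8),
                  ("proposal_B", 0.15, 0.40, 9.0, 1.2)]   # very elongated
    print(best_proposal(candidates))   # proposal_A wins despite its larger distance term
```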
It should be understood that in the embodiment of the invention, the three-dimensional target detection does not need to train a three-dimensional target detection model with huge workload, semantic information is obtained only by using a mature two-dimensional target detection model, a label object with obvious geometric characteristics is screened, and then a three-dimensional cuboid target is constructed by using a VP model, so that the quality of generating the three-dimensional target is ensured. Meanwhile, a scoring function is constructed based on the assumption that the cuboid targets are attached to the edges of the images, geometric features of the objects are extracted by means of Canny edge detection and FLD line features, the cuboid target proposals are scored based on the geometric features and corresponding error functions, the threshold value of the scoring function is set by means of label semantic information, and accuracy of three-dimensional target detection results is guaranteed.
In the embodiment of the present invention, two-dimensional rough matching is performed according to a two-dimensional target detection result, and three-dimensional fine matching is performed based on the two-dimensional rough matching result and a three-dimensional target detection result, so as to obtain a three-dimensional object matching result, as shown in fig. 5, including:
and S310, screening out characteristic points positioned in the two-dimensional target detection frame according to the two-dimensional target detection result.
S320, respectively associating the characteristic points located in the two-dimensional target detection frame with target tag information in different frames of the two-dimensional image, and obtaining association characteristic points based on two-dimensional rough matching.
In the embodiment of the invention, the feature points lying inside the detection frames are screened out based on the front-end two-dimensional target detection result, and each labeled object is associated with its feature points. Objects are then compared between different frames: the matches among their associated feature points are compared, the numbers of matched feature points are sorted, and a one-to-one association is made between targets whose number of matched feature points exceeds a threshold, completing the rough matching of two-dimensional detected objects across frames. Concretely, a set collects candidate objects with the same label in different frames; a counting function computes the number of feature point matches between a labeled object in one frame and the same-label object in another frame; the pair of objects with the largest number of feature point matches is returned, and if that number exceeds the threshold the two objects are associated once, finally realizing rough matching of the same object across frames.
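A rough sketch of this cross-frame coarse association is shown below; the data layout (object id → label and feature-id set, plus a feature-match map between frames) and the threshold value are illustrative assumptions.

```python
from collections import defaultdict

def coarse_match(objs_prev, objs_curr, point_matches, min_shared=10):
    """Associate same-label 2D objects across two frames.

    objs_prev / objs_curr: dicts object_id -> (label, set of feature point ids).
    point_matches: dict mapping a feature id in the previous frame to its matched
    feature id in the current frame (from descriptor matching)."""
    shared = defaultdict(int)
    for oid_p, (label_p, feats_p) in objs_prev.items():
        # feature ids of the previous object, mapped into the current frame
        mapped = {point_matches[f] for f in feats_p if f in point_matches}
        for oid_c, (label_c, feats_c) in objs_curr.items():
            if label_p == label_c:
                shared[(oid_p, oid_c)] = len(mapped & feats_c)

    associations = {}
    for (oid_p, oid_c), n in sorted(shared.items(), key=lambda kv: -kv[1]):
        if n >= min_shared and oid_p not in associations and \
           oid_c not in associations.values():
            associations[oid_p] = oid_c    # one-to-one, highest count first
    return associations

if __name__ == "__main__":
    prev = {"obj1": ("car", {1, 2, 3, 4})}
    curr = {"objA": ("car", {11, 12, 13, 14})}
    matches = {1: 11, 2: 12, 3: 13}
    print(coarse_match(prev, curr, matches, min_shared=3))   # {'obj1': 'objA'}
```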
S330, performing three-dimensional fine matching on the associated feature points of the two-dimensional coarse matching and the three-dimensional target detection result to obtain a three-dimensional object matching result.
Further specifically, performing three-dimensional fine matching on the associated feature points of the two-dimensional coarse matching and the three-dimensional target detection result to obtain a three-dimensional object matching result includes: recovering the associated feature points of the two-dimensional coarse matching into three-dimensional space; calculating the distance between each associated feature point of the two-dimensional coarse matching and the centroid of the corresponding three-dimensional cuboid target; and if the distance is within the preset threshold, determining that the associated feature point of the two-dimensional coarse matching can be associated to the three-dimensional cuboid target, and taking the associated feature point of the two-dimensional coarse matching as part of the three-dimensional object matching result.
In the embodiment of the invention, the feature points contained in a two-dimensional detected object can be recovered into three-dimensional space, and for each feature point the quantity $e = t_s - \lVert P - c \rVert_2$ is calculated, where $P$ represents the spatial coordinates of the feature point, $c$ represents the centroid coordinates of the cuboid object, and $t_s$ represents a size threshold of the object selected according to its semantic label. If the landmark point is close enough to the centroid of the cuboid in 3D space, that is, if $e$ is larger than 0, the point is associated to the object. The object match that shares the most map points is finally selected, and the erroneous associations caused by overlapping two-dimensional target detection frames during rough matching are removed, so that more accurate data association is realized. For the same cuboid target generated from different frames, the best-scoring proposal is selected as its unique representation.
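The centroid-distance test used for fine matching can be sketched as below; the per-label thresholds and the coordinates are illustrative assumptions.

```python
import numpy as np

# Per-label size thresholds (metres); illustrative values only.
SIZE_THRESHOLD = {"car": 3.0, "chair": 1.0}

def fine_match(point_xyz, cuboid_centroid, label):
    """Keep an association only if the 3D point lies close enough to the centroid
    of its candidate cuboid (distance within the label's size threshold)."""
    t = SIZE_THRESHOLD[label]
    dist = np.linalg.norm(np.asarray(point_xyz) - np.asarray(cuboid_centroid))
    return (t - dist) > 0.0          # positive margin => associate the point

if __name__ == "__main__":
    print(fine_match((1.0, 0.2, 8.0), (1.5, 0.0, 8.5), "car"))   # True
    print(fine_match((9.0, 0.2, 8.0), (1.5, 0.0, 8.5), "car"))   # False
```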
It should be understood that in the embodiment of the invention, the two-dimensional target detection result and the feature point inclusion relationship are utilized to perform coarse association of the object, and the geometric relationship between the three-dimensional target and the feature point is utilized to perform fine matching, so that the calculation amount of the matching process is reduced, and the accuracy of the association of the object is ensured.
In the embodiment of the invention, performing bundle adjustment optimization on the three-dimensional object matching result to obtain a positioning result includes: constructing map elements including cameras, feature points and cuboid objects according to the three-dimensional object matching result; and performing bundle adjustment optimization on the map elements in combination with the measurement errors between camera and feature point, between camera and cuboid object, and between feature point and cuboid object, so as to obtain a positioning result.
It should be appreciated that BA (Bundle Adjustment) is used to jointly optimize different map elements, including camera poses, points, lines, and so on. After the cuboid target element is added to the map, corresponding error terms need to be added. Considering a set of camera poses $C = \{c_i\}$, a set of 3D object landmarks $O = \{o_k\}$ and a set of feature point landmarks $P = \{p_j\}$, the BA can be expressed as the least squares problem

$$\{C, O, P\}^{*} = \arg\min_{\{C,O,P\}} \sum \left\| e(c_i, p_j) \right\|^{2} + \sum \left\| e(c_i, o_k) \right\|^{2} + \sum \left\| e(o_k, p_j) \right\|^{2},$$

whose three types of residuals are defined below.
Specifically, the measurement error between the camera and a feature point adopts the standard 3D point re-projection error:

$$e(c, p) = u - \pi\big( K ( R\, P + t ) \big)$$

where $u$ represents the two-dimensional pixel position of the feature point, $P$ represents the three-dimensional spatial position of the feature point, $K$ represents the camera intrinsic matrix, $R$ represents the rotation matrix, $t$ represents the translation vector, and $\pi(\cdot)$ denotes perspective projection.
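The standard camera-feature point residual can be written directly as a short function; the intrinsics and the test point below are illustrative.

```python
import numpy as np

def reprojection_error(u, P, K, R, t):
    """Camera-point residual: observed pixel u minus the projection of the 3D
    point P under pose (R, t) and intrinsics K."""
    p_cam = R @ P + t                # point in the camera frame
    p_img = K @ p_cam
    proj = p_img[:2] / p_img[2]      # perspective division
    return np.asarray(u) - proj

if __name__ == "__main__":
    K = np.array([[700.0, 0.0, 320.0],
                  [0.0, 700.0, 240.0],
                  [0.0, 0.0, 1.0]])
    R, t = np.eye(3), np.zeros(3)
    P = np.array([0.5, -0.2, 10.0])
    u = (K @ P / P[2])[:2]
    print(reprojection_error(u, P, K, R, t))   # ~[0, 0] for a perfect observation
```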
In the embodiment of the invention, the measurement error between the camera and a cuboid object comprises a corner projection error and a two-dimensional target-frame projection error: after projection, the corner points of the three-dimensional cuboid landmark should coincide with the corners of the two-dimensional cuboid observation, and a distance error is constructed over the projected corner points.
Specifically, the corner projection error can be written as:

$$e_{\mathrm{corner}} = \sum_{i=1}^{8} \left\| u_i - \pi\big( K ( R\, P_i + t ) \big) \right\|_2$$

where $u_i$ represents the projection coordinates of a cuboid corner point in the two-dimensional plane, $P_i$ represents the coordinates of the corresponding corner point of the cuboid object in three-dimensional space, $K$ represents the camera intrinsic matrix, $R$ represents the rotation matrix, and $t$ represents the translation vector.
The cuboid object is projected onto the image plane, a two-dimensional frame surrounding the projected cuboid is constructed, and it is compared with the two-dimensional bounding box obtained from target detection. The projection error of the two-dimensional target frame can be written as:

$$e_{\mathrm{bbox}} = \left\| \left[ c_{2D} - c'_{2D},\ s_{2D} - s'_{2D} \right] \right\|_2$$

where $c_{2D}$ and $s_{2D}$ respectively represent the center and size of the true two-dimensional bounding box, and $c'_{2D}$ and $s'_{2D}$ respectively represent the center and size of the projected two-dimensional bounding box.
The feature points and the cuboid object constrain each other: if a feature point belongs to a cuboid object, it should lie inside the three-dimensional cuboid, so the feature point is transformed into the coordinate frame of the cuboid object and compared with the size of the cuboid. In the embodiment of the invention, the measurement error between the feature point and the cuboid object can be written as:

$$e_{po} = \left\| \max\left( \left| T_{ow}\, P_w \right| - s,\ 0 \right) \right\|_2$$

where $P_w$ represents the spatial coordinates of the feature point in the world coordinate system, $T_{ow}$ represents the transformation matrix that transforms the feature point from the world coordinate system to the cuboid object coordinate system, and $s$ represents the dimensions of the cuboid object. After the difference with $s$ is taken, the $\max(\cdot, 0)$ operation compares it with 0: if the difference is larger than 0 it constitutes an error, and if it is smaller than 0 the feature point lies inside the cuboid object and no error is produced.
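The feature point-cuboid residual described above can be sketched as follows, assuming the cuboid pose is supplied as a world-to-object rotation and translation and that half_size holds the cuboid's half-extents; these names are illustrative.

```python
import numpy as np

def point_object_error(P_w, R_ow, t_ow, half_size):
    """Zero if the point lies inside the cuboid, otherwise the per-axis overshoot.

    P_w: feature point in the world frame.
    (R_ow, t_ow): transform taking world coordinates into the cuboid frame.
    half_size: half-extent of the cuboid along its own x, y, z axes."""
    p_obj = R_ow @ P_w + t_ow                        # point in the cuboid frame
    return np.maximum(np.abs(p_obj) - half_size, 0.0)

if __name__ == "__main__":
    half = np.array([2.0, 0.8, 1.0])
    R, t = np.eye(3), np.zeros(3)
    print(point_object_error(np.array([0.5, 0.1, -0.3]), R, t, half))   # [0. 0. 0.]
    print(point_object_error(np.array([3.0, 0.1, -0.3]), R, t, half))   # [1. 0. 0.]
```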
In the embodiment of the invention, the BA optimization process incorporates the cuboid targets: on the basis of the traditional camera-feature point re-projection error, the camera-object corner projection error and the feature point-object geometric position error are added, so that higher-precision positioning is realized.
In summary, the monocular semantic SLAM positioning method based on the three-dimensional target provided by the invention realizes three-dimensional target detection using two-dimensional target detection and the geometric features of objects, avoiding the heavy workload of training a three-dimensional target detection network while producing accurate detection results, and it adds the three-dimensional target detection results to the map, so that the map contains semantics, is denser, and meets the requirements of upper-layer navigation and obstacle avoidance. In addition, the inter-frame data association in the embodiment of the invention uses semantic information and geometric relations to realize coarse-to-fine matching, effectively eliminating mismatches and achieving accurate data association. Finally, in the point-feature-based BA optimization process, constraints from the three-dimensional target detection results are added and error terms between camera and feature point, camera and object, and feature point and object are constructed for BA optimization, realizing higher-precision positioning.
As another embodiment of the present invention, a monocular semantic SLAM positioning system 100 based on a three-dimensional object is provided, for implementing the monocular semantic SLAM positioning method based on a three-dimensional object described above, where, as shown in fig. 6, the method includes:
The two-dimensional target detection module 110 is configured to perform two-dimensional target detection on a two-dimensional image with target tag information, and obtain a two-dimensional target detection result with geometric features.
The three-dimensional target detection module 120 is configured to sample the two-dimensional target detection result by using a cuboid proposal to implement three-dimensional target detection, thereby obtaining a three-dimensional target detection result.
The three-dimensional object matching module 130 is configured to perform two-dimensional rough matching according to the two-dimensional target detection result, and perform three-dimensional fine matching based on the two-dimensional rough matching result and the three-dimensional target detection result, so as to obtain a three-dimensional object matching result.
And the optimization module 140 is used for performing bundle adjustment optimization on the three-dimensional object matching result to obtain a positioning result.
It should be understood that, in the embodiment of the present invention, the two-dimensional target detection module mainly realizes two-dimensional target detection; the three-dimensional target detection module mainly realizes cuboid proposal sampling, cuboid target three-dimensional structure recovery and cuboid proposal scoring; the three-dimensional object matching module mainly realizes object coarse matching and object fine matching; and the optimization module mainly performs BA optimization combining the three map elements of camera, map point and cuboid object. The observation errors of the BA optimization comprise three types: camera-map point, camera-object and map point-object errors. The camera-map point error adopts the standard feature point re-projection error; the camera-object error consists of the re-projection error of the cuboid corner points and the two-dimensional target-frame projection error; and the map point-object error constrains a map point to lie inside its associated cuboid.
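For orientation only, a thin skeleton of how the four modules could be wired together is sketched below; the interfaces are placeholders and do not reflect the patent's actual module APIs.

```python
class MonocularSemanticSLAM:
    """Thin skeleton wiring the four modules described above."""

    def __init__(self, detector_2d, detector_3d, matcher, optimizer):
        self.detector_2d = detector_2d   # 2D target detection module
        self.detector_3d = detector_3d   # cuboid proposal sampling + scoring
        self.matcher = matcher           # coarse 2D matching + fine 3D matching
        self.optimizer = optimizer       # bundle adjustment over the map

    def process_frame(self, image, map_state):
        dets_2d = self.detector_2d(image)
        cuboids = self.detector_3d(image, dets_2d)
        matches = self.matcher(dets_2d, cuboids, map_state)
        return self.optimizer(map_state, matches)   # updated map + camera pose
```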
In summary, the monocular semantic SLAM positioning system based on the three-dimensional target provided by the invention can realize three-dimensional target detection using two-dimensional target detection and the geometric features of objects, avoiding the heavy workload of training a three-dimensional target detection network while producing accurate detection results, and it adds the three-dimensional target detection results to the map, so that the map contains semantics, is denser, and meets the requirements of upper-layer navigation and obstacle avoidance. In addition, the inter-frame data association uses semantic information and geometric relations to realize coarse-to-fine matching, effectively eliminating mismatches and achieving accurate data association. Finally, in the point-feature-based BA, constraints from the three-dimensional target detection results are added to the BA optimization, realizing higher-precision positioning.
The specific working principle of the monocular semantic SLAM positioning system based on the three-dimensional object of the present invention can refer to the description of the monocular semantic SLAM positioning method based on the three-dimensional object, and will not be repeated here.
It is to be understood that the above embodiments are merely illustrative of the application of the principles of the present invention, but not in limitation thereof. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the invention, and are also considered to be within the scope of the invention.

Claims (10)

1. A monocular semantic SLAM positioning method based on a three-dimensional target is characterized by comprising the following steps:
performing two-dimensional target detection on the two-dimensional image with the target label information to obtain a two-dimensional target detection result with geometric characteristics;
performing cuboid proposal sampling on the two-dimensional target detection result to realize three-dimensional target detection, and obtaining a three-dimensional target detection result;
Performing two-dimensional rough matching according to the two-dimensional target detection result, and performing three-dimensional fine matching based on the two-dimensional rough matching result and the three-dimensional target detection result to obtain a three-dimensional object matching result;
performing bundle adjustment optimization on the three-dimensional object matching result to obtain a positioning result;
wherein the two-dimensional rough matching comprises associating the feature points obtained from the two-dimensional target detection result with the target tag information in different frames of the two-dimensional image to obtain associated feature points; and the three-dimensional fine matching comprises selecting, from the associated feature points of the two-dimensional rough matching, those that can be associated to a three-dimensional cuboid object.
2. The three-dimensional object-based monocular semantic SLAM positioning method of claim 1, wherein performing cuboid proposal sampling on the two-dimensional object detection result to realize three-dimensional object detection, obtaining a three-dimensional object detection result, comprises:
Sampling the cuboid proposal of the two-dimensional target detection result to obtain a projection result of the three-dimensional cuboid target on a two-dimensional image;
performing ground assumption according to the projection result of the three-dimensional cuboid object on the two-dimensional image to recover the pose of the three-dimensional cuboid object and obtain a target object proposal result;
Performing edge detection and line feature extraction on the target object proposal result to obtain a target object geometric feature;
and constructing a scoring function comprising an angle error, a distance error and a shape error according to the geometric features of the target object, and scoring the target object proposal result according to the scoring function to obtain a three-dimensional target detection result.
3. The monocular semantic SLAM positioning method based on a three-dimensional object according to claim 2, wherein the sampling of the cuboid proposal is performed on the two-dimensional object detection result to obtain a projection result of the three-dimensional cuboid object on a two-dimensional image, comprising:
determining a two-dimensional target detection frame according to the two-dimensional target detection result;
determining the number of parameters of the three-dimensional cuboid target, and determining vanishing point constraint conditions based on the requirement that the projected three-dimensional cuboid target fits tightly against the two-dimensional target detection frame;
and obtaining the projection result of the three-dimensional cuboid object on the two-dimensional image according to the vanishing point constraint conditions.
4. The monocular semantic SLAM positioning method based on a three-dimensional object according to claim 2, wherein performing ground assumption according to the projection result of the three-dimensional cuboid object on the two-dimensional image to recover the pose of the three-dimensional cuboid object and obtain the target object proposal result comprises:
setting, according to the ground assumption, the roll angle and the pitch angle of the three-dimensional cuboid target to zero;
and back-projecting the ground corner points of the two-dimensional cuboid target onto the three-dimensional ground plane, and calculating the vertical corner points of the two-dimensional cuboid target to recover the pose of the three-dimensional cuboid target.
5. The monocular semantic SLAM positioning method based on a three-dimensional object according to any one of claims 1 to 4, wherein performing two-dimensional rough matching according to a two-dimensional object detection result, and performing three-dimensional fine matching based on the two-dimensional rough matching result and the three-dimensional object detection result, to obtain a three-dimensional object matching result, comprises:
Screening out characteristic points positioned in a two-dimensional target detection frame according to a two-dimensional target detection result;
Respectively associating the characteristic points positioned in the two-dimensional target detection frame with target label information in different frames of the two-dimensional image, and obtaining association characteristic points based on two-dimensional rough matching;
and carrying out three-dimensional fine matching on the associated feature points of the two-dimensional coarse matching and the three-dimensional target detection result to obtain a three-dimensional object matching result.
6. The monocular semantic SLAM positioning method based on a three-dimensional object according to claim 5, wherein performing three-dimensional fine matching on the associated feature points of the two-dimensional coarse matching and the three-dimensional target detection result to obtain a three-dimensional object matching result comprises:
Restoring the associated feature points of the two-dimensional rough matching into a three-dimensional space;
calculating the distance between the associated feature point of the two-dimensional rough matching and the centroid of the corresponding three-dimensional cuboid target;
and if the distance is within the preset threshold, determining that the associated feature points of the two-dimensional rough matching can be associated to a three-dimensional cuboid target, and taking the associated feature points of the two-dimensional rough matching as a three-dimensional object matching result.
7. The monocular semantic SLAM positioning method based on a three-dimensional object according to any one of claims 1 to 4, wherein performing bundle adjustment optimization on the three-dimensional object matching result to obtain a positioning result comprises:
constructing map elements including cameras, feature points and cuboid objects according to the three-dimensional object matching result;
and performing bundle adjustment optimization on the map elements in combination with the measurement errors between camera and feature point, between camera and cuboid object, and between feature point and cuboid object, so as to obtain a positioning result.
8. The monocular semantic SLAM positioning method based on a three-dimensional target according to claim 7, wherein the measurement error between the camera and the cuboid object comprises a corner projection error and a two-dimensional target box projection error, wherein the expression of the corner projection error is:
$e_{\text{corner}} = \left\lVert\, p - \pi\!\left( K \left( R P + t \right) \right) \right\rVert$
wherein $p$ represents the projection coordinates of a corner point of the cuboid object in the two-dimensional image plane, $P$ represents the coordinates of that corner point of the cuboid object in three-dimensional space (the error being evaluated for each corner point), $K$ represents the camera intrinsic matrix, $R$ represents a rotation matrix, $t$ represents a translation vector, and $\pi(\cdot)$ denotes the perspective projection (division by the depth component);
the expression of the two-dimensional target box projection error is:
$e_{\text{box}} = \left\lVert \left[\, c - \hat{c},\; s - \hat{s} \,\right] \right\rVert$
wherein $c$ and $s$ respectively represent the center $(x, y)$ and size $(w, h)$ of the true two-dimensional bounding box, and $\hat{c}$ and $\hat{s}$ respectively represent the center and size of the projected two-dimensional bounding box.
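A minimal numerical sketch of the two reconstructed error terms above; the projection convention (world-to-camera pose R, t), the helper names and the example values are assumptions for illustration only.

```python
import numpy as np

def project(K, R, t, P):
    """Pinhole projection pi(K (R P + t)) of a 3-D point P."""
    x = K @ (R @ P + t)
    return x[:2] / x[2]

def corner_projection_error(p_obs, P_world, K, R, t):
    """Corner projection error || p - pi(K (R P + t)) || for one cuboid corner."""
    return np.linalg.norm(p_obs - project(K, R, t, P_world))

def box_projection_error(box_obs, corners_world, K, R, t):
    """Compare the detected 2-D box (center, size) with the box enclosing the projected corners."""
    proj = np.array([project(K, R, t, P) for P in corners_world])
    mn, mx = proj.min(axis=0), proj.max(axis=0)
    c_hat, s_hat = (mn + mx) / 2.0, mx - mn
    c, s = box_obs
    return np.linalg.norm(np.concatenate([c - c_hat, s - s_hat]))

# Illustrative cuboid with eight corners, seen by a camera at the world origin.
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
corners = np.array([[1.0, 1.0, 5.0], [-1.0, 1.0, 5.0], [1.0, -1.0, 5.0], [-1.0, -1.0, 5.0],
                    [1.0, 1.0, 7.0], [-1.0, 1.0, 7.0], [1.0, -1.0, 7.0], [-1.0, -1.0, 7.0]])
print(corner_projection_error(np.array([455.0, 375.0]), corners[0], K, R, t))  # ~7.07 px
print(box_projection_error((np.array([320.0, 240.0]), np.array([280.0, 280.0])),
                           corners, K, R, t))                                  # 0 for a consistent box
```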
9. The monocular semantic SLAM positioning method based on a three-dimensional target according to claim 7, wherein the expression of the measurement error between the feature point and the cuboid object is:
$e_{\text{point}} = \max\!\left( \left| T_{ow}\, P_w \right| - d,\; 0 \right)$
wherein $P_w$ represents the spatial coordinates of the feature point in the world coordinate system, $T_{ow}$ represents the transformation matrix that transforms the feature point from the world coordinate system to the cuboid object coordinate system, and $d$ represents the dimensions of the cuboid object.
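A minimal sketch of the reconstructed feature-point-to-cuboid error: the point is transformed into the cuboid frame and only the excess over the cuboid dimensions d (treated here as half-dimensions along each axis, an assumption) is penalised, so points inside the cuboid contribute zero error. The values of T_ow and d below are illustrative.

```python
import numpy as np

def point_cuboid_error(P_w, T_ow, d):
    """max(|T_ow * P_w| - d, 0): T_ow is a 4x4 world-to-object transform, d the half-dimensions."""
    P_o = (T_ow @ np.append(P_w, 1.0))[:3]   # feature point in the cuboid object frame
    return np.maximum(np.abs(P_o) - d, 0.0)

# Example: a point 0.3 m outside a 2 m x 1 m x 1 m cuboid along x.
T_ow = np.eye(4)                       # object frame coincides with the world frame here
d = np.array([1.0, 0.5, 0.5])          # half-dimensions of the cuboid
print(point_cuboid_error(np.array([1.3, 0.0, 0.0]), T_ow, d))   # -> [0.3, 0., 0.]
```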
10. A monocular semantic SLAM positioning system based on a three-dimensional target, for implementing the monocular semantic SLAM positioning method based on a three-dimensional target according to any one of claims 1 to 9, characterized by comprising:
a two-dimensional target detection module, configured to perform two-dimensional target detection on a two-dimensional image with target label information to obtain a two-dimensional target detection result with geometric characteristics;
a three-dimensional target detection module, configured to perform cuboid proposal sampling on the two-dimensional target detection result so as to realize three-dimensional target detection and obtain a three-dimensional target detection result;
a three-dimensional object matching module, configured to perform two-dimensional rough matching according to the two-dimensional target detection result, and to perform three-dimensional fine matching based on the two-dimensional rough matching result and the three-dimensional target detection result to obtain a three-dimensional object matching result; and
an optimization module, configured to perform bundle adjustment optimization on the three-dimensional object matching result to obtain a positioning result.
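A minimal skeleton showing how the four modules of claim 10 could be composed into a single pipeline; the class and method names are assumptions for illustration, not the patent's or any library's API.

```python
class MonocularSemanticSLAM:
    """Composes the four modules of claim 10 into one localization pipeline (illustrative)."""

    def __init__(self, detector_2d, detector_3d, matcher, optimizer):
        self.detector_2d = detector_2d   # two-dimensional target detection module
        self.detector_3d = detector_3d   # cuboid-proposal-based three-dimensional detection module
        self.matcher = matcher           # two-dimensional rough / three-dimensional fine matching module
        self.optimizer = optimizer       # bundle adjustment optimization module

    def localize(self, image, labels):
        det2d = self.detector_2d.detect(image, labels)
        det3d = self.detector_3d.sample_cuboids(det2d)
        matches = self.matcher.match(det2d, det3d)
        return self.optimizer.bundle_adjust(matches)
```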
CN202410511058.6A 2024-04-26 2024-04-26 Monocular semantic SLAM positioning method and system based on three-dimensional target Pending CN118089753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410511058.6A CN118089753A (en) 2024-04-26 2024-04-26 Monocular semantic SLAM positioning method and system based on three-dimensional target

Publications (1)

Publication Number Publication Date
CN118089753A 2024-05-28

Family

ID=91142571

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11321937B1 (en) * 2020-11-02 2022-05-03 National University Of Defense Technology Visual localization method and apparatus based on semantic error image
CN115147344A (en) * 2022-04-13 2022-10-04 东南大学 Three-dimensional detection and tracking method for parts in augmented reality assisted automobile maintenance
CN115205560A (en) * 2022-07-19 2022-10-18 东南大学 Monocular camera-based prior map-assisted indoor positioning method
CN116844124A (en) * 2023-06-30 2023-10-03 驭势科技(北京)有限公司 Three-dimensional object detection frame labeling method, three-dimensional object detection frame labeling device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
闫冬 (Yan Dong): "Research on Vision-Based SLAM Methods in Dynamic Environments" (动态环境下基于视觉的SLAM方法研究), China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库信息科技辑), no. 4, 15 April 2022 (2022-04-15), pages 13-14 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination