CN109815847B - Visual SLAM method based on semantic constraint - Google Patents

Visual SLAM method based on semantic constraint

Info

Publication number
CN109815847B
CN109815847B (application CN201811648994.2A)
Authority
CN
China
Prior art keywords
semantic
constraint
points
map
pose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811648994.2A
Other languages
Chinese (zh)
Other versions
CN109815847A (en)
Inventor
王蓉
查文中
葛建军
孟繁乐
孟祥瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Information Science Research Institute
Original Assignee
CETC Information Science Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Information Science Research Institute filed Critical CETC Information Science Research Institute
Priority to CN201811648994.2A priority Critical patent/CN109815847B/en
Publication of CN109815847A publication Critical patent/CN109815847A/en
Application granted granted Critical
Publication of CN109815847B publication Critical patent/CN109815847B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a visual SLAM method based on semantic constraints, comprising the following steps: continuously acquiring an image sequence of the surrounding environment with a depth camera; reconstructing a map from the key frames of the image sequence by a visual SLAM method; performing semantic segmentation on the key frames and deriving semantic constraint parameters from the segmentation results; and applying the semantic constraint parameters to the reconstructed map and fusing the segmentation results to obtain a semantic map. The semantic constraint parameters and the constraint points are then bound and updated each time a new key frame is detected, and the semantic map is used to estimate the pose of the depth camera for non-key frames both when texture features are abundant and when they are lacking. The semantic constraints yield a more accurate semantic map, and combining pose estimation for the texture-rich and texture-poor cases improves the accuracy of camera pose estimation.

Description

Visual SLAM method based on semantic constraint
Technical Field
The invention belongs to the technical field of computer vision and artificial intelligence, and particularly relates to a visual SLAM method based on semantic constraint.
Background
Simultaneous Localization and Mapping (SLAM) localizes a sensor in real time from its motion through an unknown environment while recovering the three-dimensional structure of that environment. SLAM methods can be broadly classified into laser SLAM and visual SLAM according to the sensor used. Visual SLAM, which uses color or depth cameras, is gaining increasing attention due to its advantages in cost, convenience, and versatility, and has broad application prospects in the fields of robotics, augmented reality, automatic driving, and others.
Conventional visual SLAM techniques are prone to failure under conditions of weak texture, fast motion, and the like. With the continuous development of deep learning and its excellent performance in classification and recognition tasks, combining deep learning with visual SLAM presents broad application prospects and great potential value, and semantic SLAM is one of the important directions. Conventional visual SLAM utilizes and presents only low-level information such as color and geometric structure, and does not exploit the rich semantic information present in the scene.
Disclosure of Invention
The invention aims to provide a visual SLAM method based on semantic constraints, realized by the following technical scheme: continuously acquiring an image sequence of the surrounding environment with a depth camera; performing semantic segmentation on the key frames in the image sequence and obtaining semantic constraint parameters from the segmentation results; processing the segmentation results with a visual SLAM method to reconstruct a map; and applying the semantic constraints to the reconstructed map through the semantic constraint parameters and fusing the segmentation results to obtain a semantic map.
Further, the semantic segmentation of the key frame includes: segmenting the image according to the texture features of specific objects in the key frame, thereby dividing one frame of image into a plurality of regions, and identifying the real-world semantics of each corresponding region according to the texture features.
Further, the obtaining of semantic constraint parameters according to the semantic segmentation result includes: obtaining depth information of all feature points in the key frame; obtaining a plurality of three-dimensional points corresponding to the feature points according to the depth information of the feature points of the ground area in the semantic segmentation result; obtaining optimal plane parameters from these three-dimensional points by using the random sample consensus algorithm; and updating the semantic constraint parameters each time a key frame is detected.
Further, the semantic constraining of the reconstructed map through the semantic constraint parameters includes: drawing connecting straight lines from a plurality of three-dimensional points of the ground area in the reconstructed map to a plurality of feature points of the ground area in the segmentation result, obtaining a plurality of straight-line parameters; and obtaining a plurality of intersection points from the straight-line parameters and the optimal plane parameters, the intersection points serving as the constraint points through which semantic constraint is imposed on the reconstructed map; wherein the constraint points are updated in binding with the semantic constraint parameters.
Further, the obtaining of the semantic map comprises: fusing the constraint points obtained by semantically constraining the reconstructed map through the semantic constraint parameters, the three-dimensional points corresponding to the feature points of the plurality of regions in the semantic segmentation result, and their real-world semantics, thereby obtaining the semantic map.
Further, the semantic-constraint-based visual SLAM method further includes: estimating the pose of the depth camera by analyzing non-key frames in the image sequence in combination with the semantic map; wherein estimating the pose of the depth camera comprises: pose estimation for non-key frames with abundant texture features and pose estimation for non-key frames lacking texture features.
Further, the pose estimation for texture-feature-rich non-key frames comprises: identifying texture features in non-key frames of the image sequence and determining the ground area; extracting feature points in the ground area and obtaining the projections of the constraint points of the semantic map into the non-key frame; constructing a first energy function from the Euclidean distances between the feature points and the projected points by the least-squares method; and solving the first energy function to estimate the pose of the depth camera.
Further, the solving the first energy function includes: solving the first energy function by using a singular value decomposition method to obtain a transformation matrix; wherein the transformation matrix is used for pose estimation of the depth camera.
Further, the pose estimation for the non-key frames lacking texture features comprises: acquiring corresponding three-dimensional points by using the depth information of the pixel points in the non-key frame; judging whether the pixel point belongs to a ground area in the image or not according to the distance from the three-dimensional point to a plane formed by constraint points in the current semantic map; for the three-dimensional points which belong to the ground area, a second energy function is constructed according to the distance from the three-dimensional points to a plane formed by constraint points in the current semantic map by using a least square method; solving the second energy function to estimate the pose of the depth camera.
Further, the solving of the second energy function includes: decomposing the transformation matrix in the second energy function into a rotation matrix and a translation vector; computing, with a gradient descent algorithm, the partial derivatives with respect to the parameters to be solved in the rotation matrix and the translation vector; wherein the partial derivatives with respect to the parameters to be solved are used for estimating the pose of the depth camera.
The invention has the advantages that:
(1) Aiming at the problem that visual SLAM generally obtains only the color and geometric information of a scene, this patent combines visual SLAM with image semantic segmentation to construct a semantic map of the scene, thereby obtaining high-level cognitive information about the scene and providing a more natural human-machine interaction mode for application fields including robot navigation, augmented reality, and automatic driving.
(2) Aiming at the problem that scene semantic information and geometric information are treated as independent and unrelated, this patent proposes converting semantic information into a geometric-structure constraint within SLAM: it focuses on the ground area obtained by semantic segmentation and imposes the constraint that all ground areas should lie on the same spatial plane. Taking this semantically constructed constraint into account during the SLAM process improves the performance of the SLAM algorithm, and the approach applies widely to indoor scenes. Furthermore, the semantic information is not limited to constraining the ground area and extends naturally to constraints at the level of arbitrary objects.
(3) Aiming at the problems of generating and updating semantic constraints, this patent provides a key-frame-based method for generating and updating the ground parameters in SLAM, obtaining accurate global semantic ground parameters incrementally. To avoid introducing the noise of the input depth map when generating and optimizing map points, the three-dimensional coordinates of the feature points in the ground area are recovered directly from the current plane parameters.
(4) Aiming at the problem that traditional feature-point-based pose estimation exploits only the texture-salient regions of the image, the invention corrects the camera pose-estimation result by applying the constraint of the semantic ground area: the energy function of equation (4) is designed and solved with the gradient descent method. The advantage is that texture-poor regions such as the ground, which cannot be neglected in practical indoor applications, are taken into account during pose estimation, improving its accuracy.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart illustrating semantic map construction in a visual SLAM method based on semantic constraints according to an embodiment of the present invention.
Fig. 2 is a flow chart illustrating the processing of key frames in the visual SLAM method based on semantic constraints according to an embodiment of the present invention.
FIG. 3 is a flow chart illustrating the operation of a visual SLAM method based on semantic constraints according to an embodiment of the present invention.
Fig. 4 shows a schematic diagram of the effect of semantic segmentation.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The invention provides a semantic-constraint-based visual Simultaneous Localization and Mapping (SLAM) method, hereinafter referred to as the semantic-constraint-based visual SLAM method. Semantic constraints are derived from the semantic segmentation performed within the visual SLAM pipeline, and the global plane parameters are updated accordingly, so that the SLAM semantic map is constructed and the camera pose is corrected. The invention is explained in more detail below with reference to the figures.
Fig. 1 is a flow chart illustrating the construction of the semantic map in the semantic-constraint-based visual SLAM method according to an embodiment of the present invention. The semantic map is constructed by: continuously acquiring an image sequence of the surrounding environment with a depth camera; performing semantic segmentation on the key frames in the image sequence and obtaining semantic constraint parameters from the segmentation results; processing the segmentation results with a visual SLAM method to reconstruct a map; and applying the semantic constraints to the reconstructed map through the semantic constraint parameters and fusing the segmentation results to obtain the semantic map.
Specifically, the method adopts a fully convolutional neural network to perform image-level semantic segmentation on each key frame of the image sequence: one frame of image is segmented into a plurality of regions according to the texture features of the specific objects it contains, the real-world semantics of each region are identified from those texture features, and feature points are extracted within the regions. The texture features and their corresponding semantics form a semantic point cloud that is used for self-learning of the network. Next, the depth information of all feature points in the key frame is obtained, and the three-dimensional points corresponding to the feature points of the ground area in the segmentation result are computed from their depth values; the optimal plane parameters are then obtained from these three-dimensional points with the random sample consensus (RANSAC) algorithm. The optimal plane parameters serve as the semantic constraint parameters and are updated each time a new key frame is detected. Then, connecting straight lines are drawn from the three-dimensional points of the ground area in the reconstructed map to the corresponding feature points of the ground area in the segmentation result, yielding a set of straight-line parameters; the intersections of these lines with the optimal plane are taken as the constraint points, through which the semantic constraint is imposed on the reconstructed map. The constraint points are bound to the semantic constraint parameters and updated together with them, and they become the three-dimensional map points of the ground area in the semantic map being generated. In this way the input noise that would be introduced by recovering three-dimensional map points directly from the depth values of the feature points is avoided, and the visual SLAM method, combined with the semantic constraint, obtains more accurate and robust localization and reconstruction results. Finally, the constraint points, the three-dimensional points corresponding to the feature points of the other regions, and their real-world semantics are fused to obtain the semantic map.
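The plane-fitting and constraint-point steps can be made concrete with a short sketch. The following Python snippet is a minimal illustration only: the function names and inlier threshold are assumptions, and the connecting lines are cast here as rays through the camera center, which is one reading of the line/plane intersection step rather than the patent's exact construction. It fits the ground plane with RANSAC and then recovers constraint points by intersecting pixel rays with the fitted plane, so that no depth-map noise enters the constraint points:

```python
import numpy as np

def ransac_ground_plane(points, iters=200, inlier_thresh=0.02):
    """Fit a plane n.X + d = 0 (||n|| = 1) to Nx3 ground points with
    RANSAC, keeping the hypothesis with the most inliers."""
    best_inliers, best_plane = 0, None
    rng = np.random.default_rng()
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:              # degenerate (collinear) sample
            continue
        n /= norm
        d = -n @ sample[0]
        inliers = np.sum(np.abs(points @ n + d) < inlier_thresh)
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (n, d)
    return best_plane

def constraint_points(K, pixels, plane):
    """Back-project ground-area pixels onto the fitted plane: each
    pixel defines a ray through the camera center, and the constraint
    point is the ray/plane intersection (no depth values are used)."""
    n, d = plane
    uv1 = np.hstack([pixels, np.ones((len(pixels), 1))])  # homogeneous
    rays = (np.linalg.inv(K) @ uv1.T).T                   # ray directions
    t = -d / (rays @ n)                                   # ray parameter at plane
    return rays * t[:, None]
```

In practice, rays nearly parallel to the plane (|rays @ n| close to zero) should be rejected before the division.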
More specifically, key frames in the image sequence are identified by setting an inspection frequency and judging whether a given frame is a key frame according to its texture features and the pose change of the depth camera. The semantic map is a three-dimensional map comprising dense map points and their corresponding semantics. The meaning of the semantic constraint is the following: in most cases the spatial position of a pixel is identified from its features, such as ORB features; however, because of factors such as lighting angle, the features of pixels lying on the same spatial plane still differ across the image. The invention therefore proposes to constrain, through the semantic constraint, the three-dimensional points corresponding to the feature points of the ground area to lie on the same three-dimensional plane, so that a more accurate three-dimensional map is constructed. This enables the machine to obtain high-level cognitive information about the scene and provides a more natural human-machine interaction mode for application fields including robot navigation, augmented reality, and automatic driving.
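For intuition only, the constraint that all ground points share one plane can be visualized by snapping labeled points onto that plane; note that the patent itself generates constraint points by line/plane intersection rather than by this orthogonal projection, so the sketch below (with assumed names) is merely the simplest illustration of the constraint:

```python
import numpy as np

def snap_to_plane(points, n, d):
    """Orthogonally project Nx3 points onto the plane n.X + d = 0
    (||n|| = 1), illustrating the 'same spatial plane' constraint."""
    signed_dist = points @ n + d          # signed distances to the plane
    return points - np.outer(signed_dist, n)
```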
The processing of key frames involved in the semantic map building process described above is shown in fig. 2.
Fig. 2 is a flowchart illustrating the processing of a key frame in the semantic-constraint-based visual SLAM method according to an embodiment of the present invention. The processing of a key frame comprises: judging that the frame is a key frame and performing image-level semantic segmentation on it; selecting the three-dimensional points identified as belonging to the ground area in the segmentation map; obtaining the optimal plane parameters from these three-dimensional points with the random sample consensus algorithm; then fitting the three-dimensional points corresponding to the feature points of the ground area to the optimal plane to obtain the constraint points, i.e., the three-dimensional map points of the ground area in the semantic map; and finally binding and updating the optimal plane parameters, the three-dimensional points, and the constraint points each time a new key frame is detected.
In the process of building and updating the semantic map, the pose of the depth camera is estimated in real time from the non-key frames of the image sequence and the constraint points of the ground area in the semantic map built so far. The pose of the depth camera can be expressed as the transformation matrix $T_{wc}$ from the local to the global coordinate system, or as the transformation matrix $T_{cw}$ from the global to the local coordinate system; the two are inverses of each other.
Here the subscript $c$ denotes the local (camera) frame and the subscript $w$ denotes the global (world) frame, and a three-dimensional point is written $X = [X, Y, Z]^T$. A plane is written $\pi = (\pi_1, \pi_2, \pi_3, \pi_4)^T = (n^T, d)^T$, where $n$ is the normal vector of the plane and $d$ is the distance of the plane from the origin of the world coordinate system. Through the pose, i.e. the transformation matrix $T_{wc}$, a local point $X_c$ is transformed into the global coordinate system by $X_w = T_{wc} X_c$. Likewise, a local plane is converted to the global coordinate system by

$$\pi_w = T_{cw}^{T}\, \pi_c \qquad (1)$$
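As a quick check of equation (1), the following sketch (illustrative names, assuming numpy) transforms a point and a plane and verifies that a point lying on the local plane still lies on the transformed plane:

```python
import numpy as np

def transform_point(T_wc, X_c):
    """X_w = T_wc X_c, lifting to homogeneous coordinates internally."""
    return (T_wc @ np.append(X_c, 1.0))[:3]

def transform_plane(T_cw, pi_c):
    """Eq. (1): pi_w = T_cw^T pi_c, with T_cw = inv(T_wc)."""
    return T_cw.T @ pi_c

# A point on the local plane satisfies pi_c . [X_c, 1] = 0; after the
# transform it satisfies pi_w . [X_w, 1] = 0 as well:
T_wc = np.eye(4); T_wc[:3, 3] = [0.5, 0.0, 1.0]   # a pure translation
pi_c = np.array([0.0, 0.0, 1.0, -2.0])            # local plane z = 2
X_c = np.array([3.0, 4.0, 2.0])                   # lies on that plane
pi_w = transform_plane(np.linalg.inv(T_wc), pi_c)
assert abs(pi_w @ np.append(transform_point(T_wc, X_c), 1.0)) < 1e-9
```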
In the above, $X$ is in essence the homogeneous coordinate $[X, Y, Z, 1]^T$, but for simplicity of presentation $X$ is no longer distinguished herein from its homogeneous form, which is used automatically as computation requires. The key to estimating the pose of the depth camera is therefore to find the transformation matrix $T_{wc}$ or $T_{cw}$; the transformation matrix is obtained by constructing an energy function and solving it. The method considers two situations in the acquired images: sufficient texture features and sparse texture features.

When the texture features are sufficient, estimating the pose of the depth camera comprises: identifying texture features in the non-key frames of the image sequence and determining the ground area; extracting feature points in the ground area and obtaining the projections of the constraint points of the semantic map into the non-key frame; constructing a first energy function from the Euclidean distances between the feature points and the projected points by the least-squares method; and solving the first energy function to estimate the pose of the depth camera. The conversion relationship between a projected point and its constraint point can be expressed as

$$X_c = \frac{-\,d_c\, K^{-1} \tilde{u}}{n_c^{T} K^{-1} \tilde{u}} \qquad (2)$$

where $X_c$ is the constraint point, $d_c$ is the distance from the origin of the local coordinate system to the local plane, $n_c$ is the normal vector of the local plane on which the current feature point lies, $K$ is the calibration matrix, and $\tilde{u}$ is the homogeneous coordinate of the projected point in the image plane. The calibration matrix $K$ is specifically

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

where $f_x$ and $f_y$ are the focal lengths along the two image axes and $c_x$ and $c_y$ are the corresponding optical-center coordinates. The projections of the three-dimensional map points into the current frame are then obtained, and the first energy function is constructed by the least-squares method from the Euclidean distances between feature points and projected points:

$$E_1 = \sum_i \bigl\| u_i - \pi\!\left(K\, T_{cw}\, X_w^{i}\right) \bigr\|^2 \qquad (3)$$

where $\pi(K T_{cw} X_w)$ denotes the projected point, $K$ is the calibration matrix, $T_{cw}$ is the transformation matrix, $X_w$ is a global constraint point, and $u$ is the matched feature point. The feature points $u$ are ORB features obtained with the ORB feature descriptor, which has scale, rotation, and illumination invariance.
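A minimal sketch of this texture-rich branch, assuming numpy: local 3D points recovered for the matched ground features (e.g. via equation (2)) are aligned to their global constraint points in closed form with SVD. This is the Kabsch/Umeyama alignment, one standard way to realize the singular-value-decomposition solution the method calls for; the function name is illustrative:

```python
import numpy as np

def solve_pose_svd(X_local, X_world):
    """Closed-form rigid alignment of matched Nx3 point sets,
    minimizing sum ||X_world - (R X_local + t)||^2 via SVD."""
    mu_l, mu_w = X_local.mean(0), X_world.mean(0)
    H = (X_local - mu_l).T @ (X_world - mu_w)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # reflection guard: keep det(R) = +1
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_w - R @ mu_l
    T_wc = np.eye(4)
    T_wc[:3, :3], T_wc[:3, 3] = R, t
    return T_wc
```

In practice the matches come from ORB descriptors, and one would typically also reject outlier matches, e.g. with RANSAC.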
In the pose-estimation process, the global constraint points are bound and updated during semantic-map updating (i.e., key-frame updating), and the pose correction of the depth camera is completed through formulas (1), (2), and (3).
Since the visual SLAM method is generally used in applications such as robot navigation, augmented reality, and automatic driving, the texture features in the images acquired by the depth camera are not uniform, and the invention therefore gives particular consideration to the case of insufficient texture. When texture features are lacking in a non-key frame, the ground area cannot be identified from texture. The invention therefore obtains the corresponding three-dimensional points from the depth information of the pixels in the non-key frame; judges whether each pixel belongs to the ground area of the image according to the distance from its three-dimensional point to the ground of the current semantic map (i.e., the plane formed by the constraint points, given by the optimal plane parameters); and, for the three-dimensional points judged to belong to the ground area, constructs a second energy function from their distances to the ground of the current semantic map by the least-squares method. The second energy function is as follows:
$$E_2 = \sum_i \left( n_w^{T} X_w^{i} + d_w \right)^2 \qquad (4)$$

where $n_w$ and $d_w$ are the normal vector and origin distance of the semantic-map ground plane, and the $X_w^{i}$ are all the three-dimensional points, in the global coordinate system, for which the distance from the current frame's three-dimensional map point to the semantic-map ground is less than a set threshold, with

$$X_w = T_{wc}\, X_c$$

where $X_c$ is the corresponding three-dimensional point in the local coordinate system and $T_{wc}$ is the transformation matrix.
The $Z$ (depth) component of $X_c$ is obtained directly from the depth map corresponding to the frame. The variable to be solved is the transformation matrix $T_{wc}$, which can be decomposed into a $3 \times 3$ rotation matrix $R$ and a $3 \times 1$ translation vector $t$. Unlike the solution of the first energy function, since noise from the depth map is introduced when the three-dimensional points are recovered from the pixel depth values, the second energy function is solved by gradient descent on the partial derivatives, and its solution is used to correct the pose:

$$\lambda_k \leftarrow \lambda_k - t\, \frac{\partial E_2}{\partial \lambda_k} \qquad (5)$$

where $\lambda_k$ is a parameter to be solved and $t$ here is the iteration step length (not to be confused with the translation vector). For ease of solution, the $R$ and $t$ of the transform to be solved can be expressed as

$$R = \begin{bmatrix} 1 - 2(q_y^2 + q_z^2) & 2(q_x q_y - q_z q_w) & 2(q_x q_z + q_y q_w) \\ 2(q_x q_y + q_z q_w) & 1 - 2(q_x^2 + q_z^2) & 2(q_y q_z - q_x q_w) \\ 2(q_x q_z - q_y q_w) & 2(q_y q_z + q_x q_w) & 1 - 2(q_x^2 + q_y^2) \end{bmatrix} \qquad (6)$$

$$t = [t_x \;\; t_y \;\; t_z]^T \qquad (7)$$

where $q_x, q_y, q_z, q_w$ are the components of a rotation quaternion and $t_x, t_y, t_z$ are the translations along the three coordinate axes.
The partial derivative of each summand of equation (4) with respect to a parameter to be solved can be expressed as

$$\frac{\partial}{\partial \lambda_k} \left( n_w^{T} X_w + d_w \right)^2 = 2 \left( n_w^{T} X_w + d_w \right) n_w^{T}\, \frac{\partial X_w}{\partial \lambda_k}$$
through the formula (6) and the formula (7), the partial derivatives of the three-dimensional points in the world coordinate system to each parameter to be solved can be expressed as functions of the three-dimensional points in the current camera coordinate system and the current pose parameter values, as shown in table 1.
Table 1. Partial derivatives of a three-dimensional point in the world coordinate system with respect to each pose parameter

(The table entries are not reproduced here; for the translations they are simply $\partial X_w / \partial t_x = [1, 0, 0]^T$, $\partial X_w / \partial t_y = [0, 1, 0]^T$, $\partial X_w / \partial t_z = [0, 0, 1]^T$, and for the quaternion components $\partial X_w / \partial q_k = (\partial R / \partial q_k)\, X_c$, with $\partial R / \partial q_k$ obtained by differentiating equation (6).)
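A minimal numerical sketch of this texture-poor branch, assuming numpy. The patent uses the analytic Table 1 derivatives; for brevity the quaternion gradient below is computed by finite differences, and the quaternion is re-normalized each step, an implementation choice not stated in the patent:

```python
import numpy as np

def quat_to_R(q):
    """Rotation matrix of eq. (6) from quaternion (qx, qy, qz, qw)."""
    x, y, z, w = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - z*w),     2*(x*z + y*w)],
        [2*(x*y + z*w),     1 - 2*(x*x + z*z), 2*(y*z - x*w)],
        [2*(x*z - y*w),     2*(y*z + x*w),     1 - 2*(x*x + y*y)],
    ])

def refine_pose_on_plane(X_c, n_w, d_w, q0, t0, step=1e-4, iters=100):
    """Gradient descent on E2 = sum (n_w . (R X_c + t) + d_w)^2 of
    eq. (4), over the 7 parameters (qx, qy, qz, qw, tx, ty, tz)."""
    q, t = q0.astype(float), t0.astype(float)
    for _ in range(iters):
        R = quat_to_R(q)
        r = (X_c @ R.T + t) @ n_w + d_w        # per-point plane residuals
        g_t = 2.0 * r.sum() * n_w              # dE2/dt = sum 2 r_i n_w
        g_q = np.zeros(4)                      # quaternion gradient
        eps = 1e-6
        for k in range(4):
            dq = q.copy(); dq[k] += eps
            r2 = (X_c @ quat_to_R(dq).T + t) @ n_w + d_w
            g_q[k] = ((r2**2).sum() - (r**2).sum()) / eps
        q -= step * g_q
        t -= step * g_t
        q /= np.linalg.norm(q)                 # keep q a unit quaternion
    return q, t
```

Only points already classified as ground, i.e., whose distance to the current plane is below the set threshold, should enter `X_c`.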
For the pose estimation of the depth camera, the estimates obtained under the two conditions of rich texture features and lacking texture features are combined, and the pose is corrected in the texture-lacking case, so that the pose estimation attains higher accuracy than traditional pose estimation. This gain in accuracy is, of course, also owed to the refinement of the constraint points in the semantic map.
Fig. 3 is a flowchart illustrating the semantic-constraint-based visual SLAM method according to an embodiment of the present invention. Since the image sequence is acquired by a depth camera, the input of the method is images carrying both color and depth-map information. Each non-key frame is used to estimate or correct the depth-camera pose. The semantic map is obtained by carrying out semantic segmentation, semantic point-cloud generation, three-dimensional map-point updating, binding optimization, and global semantic updating upon the insertion of each key frame. When the global semantic constraints are updated, the map points of the ground area in the semantic map are updated and the camera pose is corrected at the same time. Images before and after semantic segmentation are shown in fig. 4.
Fig. 4 is a schematic diagram illustrating the effect of semantic segmentation. The left side shows an input image and the right side the corresponding image after semantic segmentation; regions recognized as having different semantics are displayed in different colors with their specific segmentation maps (rendered here in different gray levels).
Finally, it should be noted that the above description of the method is directed to scenes such as indoor floors: in an indoor scene the ground is a plane, so it is reasonable, conversely, to constrain the feature points of the semantic "ground area" to lie on the same plane. It should be emphasized, however, that the method of the present invention is not limited to constraining a ground area and extends naturally to object-level constraints of any kind. For example, if a spherical object is recognized in the scene, then once its region in the image is determined, the feature points in that region should likewise conform to the geometry of a sphere: when converted to three-dimensional points, they should be equidistant from a single spatial point.
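The patent does not spell out the sphere case; as one hedged reading, the plane residual of equation (4) would simply be swapped for a sphere residual, e.g.:

```python
import numpy as np

def sphere_residuals(points, center, radius):
    """Object-level analogue of the ground constraint: for a region
    recognized as a sphere, each 3D point should be equidistant from
    the sphere center, so the residual is |X - c| - r."""
    return np.linalg.norm(points - center, axis=1) - radius
```

Summing the squares of these residuals in place of the plane distances of equation (4) yields the corresponding object-level energy.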
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (8)

1. A visual SLAM method based on semantic constraints is characterized by comprising the following steps:
continuously acquiring a sequence of images of the surrounding environment by a depth camera;
performing semantic segmentation on the key frames in the image sequence, and obtaining semantic constraint parameters according to semantic segmentation results;
processing a semantic segmentation result by a visual SLAM method, and reconstructing a map;
performing semantic constraint on the reconstructed map through the semantic constraint parameters, and fusing semantic segmentation results to obtain a semantic map;
the obtaining of the semantic constraint parameters according to the semantic segmentation result comprises:
obtaining depth information of all feature points in the key frame;
obtaining a plurality of three-dimensional points corresponding to a plurality of feature points according to depth information of the feature points of the ground area in the semantic segmentation result;
obtaining optimal plane parameters by utilizing a random sample consensus algorithm according to the three-dimensional points; wherein,
the optimal plane parameter is used as a semantic constraint parameter, and the semantic constraint parameter is updated after a key frame is detected each time;
the semantically constraining the reconstructed map by the semantically constraining parameters comprises:
drawing connecting straight lines from a plurality of three-dimensional points of the ground area in the reconstructed map to a plurality of feature points of the ground area in the segmentation result, to obtain a plurality of straight-line parameters;
obtaining a plurality of intersection points according to the plurality of straight-line parameters and the optimal plane parameters, the plurality of intersection points serving as the obtained constraint points, whereby semantic constraint is imposed on the reconstructed map; wherein,
and the constraint points and the semantic constraint parameters are updated in a binding mode.
2. The visual SLAM method based on semantic constraints of claim 1 wherein said semantically segmenting key frames comprises:
segmenting the image according to the texture features of specific objects in the key frame, thereby segmenting one frame of image into a plurality of regions, and identifying the real-world semantics of each corresponding region according to the texture features.
3. The visual SLAM method based on semantic constraints of claim 1 wherein the obtaining a semantic map comprises:
fusing the constraint points obtained by semantically constraining the reconstructed map through the semantic constraint parameters, the three-dimensional points corresponding to the feature points of the plurality of regions in the semantic segmentation result, and their real-world semantics, thereby obtaining the semantic map.
4. The visual SLAM method based on semantic constraints of claim 1 further comprising:
estimating the pose of the depth camera by analyzing non-key frames in the image sequence in combination with the semantic map; wherein,
estimating the pose of the depth camera comprises: pose estimation for non-key frames with abundant texture features and pose estimation for non-key frames lacking texture features.
5. The visual SLAM method based on semantic constraints of claim 4, wherein the pose estimation for texture-feature-rich non-key frames comprises:
identifying texture features in non-key frames of the image sequence and determining a ground area;
extracting feature points in the ground area, and obtaining projection points of constraint points in the semantic map in the non-key frame;
constructing a first energy function according to the Euclidean distance from the characteristic point to the projection point by using a least square method;
solving the first energy function to estimate a pose of the depth camera.
6. The visual SLAM method based on semantic constraints of claim 5 wherein said solving a first energy function comprises:
solving the first energy function by using a singular value decomposition method to obtain a transformation matrix; wherein,
the transformation matrix is used for pose estimation of the depth camera.
7. The visual SLAM method based on semantic constraints of claim 4, wherein the pose estimation for non-key frames lacking texture features comprises:
acquiring corresponding three-dimensional points by using the depth information of the pixel points in the non-key frame;
judging whether the pixel point belongs to a ground area in the image or not according to the distance from the three-dimensional point to a plane formed by constraint points in the current semantic map;
for the three-dimensional points which belong to the ground area, a second energy function is constructed according to the distance from the three-dimensional points to a plane formed by constraint points in the current semantic map by using a least square method;
solving the second energy function to estimate the pose of the depth camera.
8. The visual SLAM method based on semantic constraints of claim 7 wherein said solving a second energy function comprises:
decomposing the transformation matrix in the second energy function into a rotation matrix and a translation vector;
computing, by using a gradient descent algorithm, the partial derivatives with respect to the parameters to be solved in the rotation matrix and the translation vector; wherein the partial derivatives with respect to the parameters to be solved are used for estimating the pose of the depth camera.
CN201811648994.2A 2018-12-30 2018-12-30 Visual SLAM method based on semantic constraint Active CN109815847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811648994.2A CN109815847B (en) 2018-12-30 2018-12-30 Visual SLAM method based on semantic constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811648994.2A CN109815847B (en) 2018-12-30 2018-12-30 Visual SLAM method based on semantic constraint

Publications (2)

Publication Number Publication Date
CN109815847A CN109815847A (en) 2019-05-28
CN109815847B true CN109815847B (en) 2020-12-01

Family

ID=66603833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811648994.2A Active CN109815847B (en) 2018-12-30 2018-12-30 Visual SLAM method based on semantic constraint

Country Status (1)

Country Link
CN (1) CN109815847B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110335319B (en) * 2019-06-26 2022-03-18 华中科技大学 Semantic-driven camera positioning and map reconstruction method and system
CN110298921B (en) * 2019-07-05 2023-07-07 青岛中科智保科技有限公司 Method for constructing three-dimensional map with character semantic information and processing equipment
CN110533720B (en) * 2019-08-20 2023-05-02 西安电子科技大学 Semantic SLAM system and method based on joint constraint
CN111210518B (en) * 2020-01-15 2022-04-05 西安交通大学 Topological map generation method based on visual fusion landmark
CN111427373B (en) * 2020-03-24 2023-11-24 上海商汤临港智能科技有限公司 Pose determining method, pose determining device, medium and pose determining equipment
CN111707275B (en) * 2020-05-12 2022-04-29 驭势科技(北京)有限公司 Positioning method, positioning device, electronic equipment and computer readable storage medium
EP4020111B1 (en) * 2020-12-28 2023-11-15 Zenseact AB Vehicle localisation
CN113674416A (en) * 2021-08-26 2021-11-19 中国电子科技集团公司信息科学研究院 Three-dimensional map construction method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732518A (en) * 2015-01-19 2015-06-24 北京工业大学 PTAM improvement method based on ground characteristics of intelligent robot
CN105045263A (en) * 2015-07-06 2015-11-11 杭州南江机器人股份有限公司 Kinect-based robot self-positioning method
CN108229416A (en) * 2018-01-17 2018-06-29 苏州科技大学 Robot SLAM methods based on semantic segmentation technology

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679355B2 (en) * 2017-05-02 2020-06-09 Hrl Laboratories, Llc System and method for detecting moving obstacles based on sensory prediction from ego-motion

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732518A (en) * 2015-01-19 2015-06-24 北京工业大学 PTAM improvement method based on ground characteristics of intelligent robot
CN105045263A (en) * 2015-07-06 2015-11-11 杭州南江机器人股份有限公司 Kinect-based robot self-positioning method
CN108229416A (en) * 2018-01-17 2018-06-29 苏州科技大学 Robot SLAM methods based on semantic segmentation technology

Also Published As

Publication number Publication date
CN109815847A (en) 2019-05-28

Similar Documents

Publication Publication Date Title
CN109815847B (en) Visual SLAM method based on semantic constraint
CN111968129B (en) Instant positioning and map construction system and method with semantic perception
CN111563442B (en) Slam method and system for fusing point cloud and camera image data based on laser radar
CN112258618B (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
WO2021233029A1 (en) Simultaneous localization and mapping method, device, system and storage medium
CN109166149B (en) Positioning and three-dimensional line frame structure reconstruction method and system integrating binocular camera and IMU
Park et al. High-precision depth estimation using uncalibrated LiDAR and stereo fusion
CN106780592A (en) Kinect depth reconstruction algorithms based on camera motion and image light and shade
CN111862201A (en) Deep learning-based spatial non-cooperative target relative pose estimation method
CN110223382B (en) Single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning
CN111797692B (en) Depth image gesture estimation method based on semi-supervised learning
CN111998862A (en) Dense binocular SLAM method based on BNN
Gao et al. Pose refinement with joint optimization of visual points and lines
CN113160275A (en) Automatic target tracking and track calculating method based on multiple videos
Zhu et al. Fusing panoptic segmentation and geometry information for robust visual slam in dynamic environments
CN113822996A (en) Pose estimation method and device for robot, electronic device and storage medium
Dang et al. Real-time semantic plane reconstruction on a monocular drone using sparse fusion
CN116612235A (en) Multi-view geometric unmanned aerial vehicle image three-dimensional reconstruction method and storage medium
Lai et al. 3D semantic map construction system based on visual SLAM and CNNs
Wang et al. A Visual SLAM Algorithm Based on Image Semantic Segmentation in Dynamic Environment
CN113570713A (en) Semantic map construction method and device for dynamic environment
Xu et al. DOS-SLAM: A real-time dynamic object segmentation visual SLAM system
CN111915632A (en) Poor texture target object truth value database construction method based on machine learning
Su et al. Omnidirectional Depth Estimation With Hierarchical Deep Network for Multi-Fisheye Navigation Systems
CN110930519A (en) Semantic ORB-SLAM sensing method and device based on environment understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant