CN112435262A - Dynamic environment information detection method based on semantic segmentation network and multi-view geometry


Info

Publication number
CN112435262A
CN112435262A
Authority
CN
China
Prior art keywords
point
dynamic
points
semantic segmentation
segmentation network
Prior art date
Legal status
Pending
Application number
CN202011365151.9A
Other languages
Chinese (zh)
Inventor
孙仝
游林辉
胡峰
陈政
张谨立
宋海龙
黄达文
王伟光
梁铭聪
黄志就
何彧
陈景尚
谭子毅
尤德柱
区嘉亮
罗鲜林
Current Assignee
Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202011365151.9A
Publication of CN112435262A
Pending legal-status Current

Classifications

    • G06T — Image data processing or generation, in general (G — Physics; G06 — Computing; calculating or counting)
    • G06T 7/10 — Image analysis: segmentation; edge detection
    • G06T 7/33 — Image analysis: image registration using feature-based methods
    • G06T 7/70 — Image analysis: determining position or orientation of objects or cameras
    • G06T 2207/10016 — Image acquisition modality: video; image sequence
    • G06T 2207/20208 — Algorithmic details: high dynamic range [HDR] image processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a dynamic environment information detection method based on a semantic segmentation network and multi-view geometry, which comprises the following steps: calibrating the camera and removing image distortion; acquiring and inputting an environment image; segmenting the input image through a semantic segmentation network to obtain the masks of all objects, achieving a preliminary segmentation of dynamic regions; extracting ORB feature points from the input image and computing their descriptors; detecting and eliminating dynamic feature points by a method combining multi-view geometry and semantic information; matching the ORB feature points to obtain the pose information of the robot; judging and inserting key frames, and processing point clouds through the local mapping thread to obtain a sparse point cloud map; and optimizing the pose and correcting drift errors using loop closure detection. The invention improves the accuracy and robustness of an RGB-D SLAM system running in highly dynamic environments. Meanwhile, the lightweight semantic segmentation network reduces the storage and time overhead of the system, preserving real-time performance while maintaining high accuracy.

Description

Dynamic environment information detection method based on semantic segmentation network and multi-view geometry
Technical Field
The invention relates to the field of vision-based positioning and navigation for autonomous unmanned aerial vehicle inspection, and in particular to a dynamic environment information detection method based on a semantic segmentation network and multi-view geometry.
Background
Intelligent inspection requires the unmanned aerial vehicle to autonomously determine its next operation from real-time information about the current environment. Real-time positioning of the unmanned aerial vehicle and mapping of the working environment are therefore key links in the intelligent inspection process. Especially when multiple unmanned aerial vehicles arranged in a grid work cooperatively, the environment observed by each vehicle is a dynamic scene (one containing occasionally moving objects), so a dedicated algorithm is needed for dynamic scenes during positioning and environment mapping.
Simultaneous Localization and Mapping (SLAM) is a technology that estimates the current position and attitude through sensors and a corresponding motion-estimation algorithm, and builds a three-dimensional map of the environment without any prior environmental information. With the development of computer vision and deep learning and the improvement of hardware computing power, vision-based SLAM research has steadily deepened and is now widely applied in autonomous driving, mobile robotics, unmanned aerial vehicles, and other fields.
The Chinese patent application published as CN110322511A on 2019-10-11 discloses a semantic SLAM method and system based on object and plane features. It acquires RGB-D image streams of a scene and performs frame-by-frame tracking on them to obtain key frame images; it builds a local map of the scene from the key frame images, performs plane segmentation on the depth maps of the key frames to obtain the current planes, and builds a global plane map from them; it runs object detection on the key frame images to obtain detection boxes and confidences, reconstructs object point clouds from them, and merges the feature points inside each detection box into the object to obtain a global object map; finally, it performs loop detection with the key frame images to obtain loop frames, and uses them in loop correction to optimize the plane and object constraints, yielding a plane map and an object map of the scene. That invention can improve SLAM optimization performance and enrich the semantic description of the environment. However, once dynamic objects occupy a large portion of the scene, the accuracy and robustness of the method suffer. In addition, the per-frame processing delay of the method is too long, and the impact of real-time performance on the system is not considered.
Disclosure of Invention
To overcome the low detection accuracy and robustness of the prior art, the invention provides a dynamic environment information detection method based on a semantic segmentation network and multi-view geometry, which improves the detection accuracy and robustness of the system while also increasing the detection speed.
To solve the above technical problems, the invention adopts the following technical scheme: a dynamic environment information detection method based on a semantic segmentation network and multi-view geometry, comprising the following steps:
Step one: calibrate the camera and remove image distortion; acquire and input an environment image;
Step two: segment the input image through a semantic segmentation network to obtain the masks of all objects, achieving a preliminary segmentation of dynamic regions;
Step three: extract ORB feature points from the input image, then compute their descriptors;
Step four: detect and eliminate dynamic feature points by a method combining multi-view geometry and semantic information;
Step five: match the ORB feature points to obtain the pose information of the robot;
Step six: judge and insert key frames, and process point clouds through the local mapping thread to obtain a sparse point cloud map;
Step seven: optimize the pose and correct drift errors using loop closure detection.
Preferably, in step one, calibrating the camera and removing image distortion specifically comprises:
S1.1: first obtain the camera intrinsics, which comprise $f_x, f_y, c_x, c_y$, and normalize the three-dimensional coordinates $(X, Y, Z)$ to the coordinates $(x, y) = (X/Z, Y/Z)$ on the normalized image plane;
S1.2: remove the effect of distortion on the image, where $[k_1, k_2, k_3, p_1, p_2]$ are the lens distortion coefficients and $r$ is the distance from the point to the origin of the normalized coordinate system:

$$x_{dist} = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2x^2)$$
$$y_{dist} = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2y^2) + 2 p_2 x y$$

S1.3: transform the coordinates from the camera coordinate system to the pixel coordinate system:

$$u = f_x \, x_{dist} + c_x, \qquad v = f_y \, y_{dist} + c_y$$
preferably, the semantic segmentation network is a lightweight semantic segmentation network FcHarDnet. The model size is reduced by convolution of HDB block connection 1x1, and the network has image processing speed about 30% higher than other network structures under the same hardware environment. Potential dynamic regions of the image are segmented by FcHarDnet and a mask is generated.
Preferably, in step three, a Gaussian pyramid is constructed while extracting ORB feature points from the input image, and feature points are detected on the different pyramid levels, so that the features are scale-invariant.
Preferably, the specific steps of ORB feature point extraction are as follows:
S3.1: when the gray values of a sufficient number of points on a circle around a point P differ from the gray value of P by more than a threshold, P is regarded as a candidate corner; specifically:

$$N = \sum_{x \in \mathrm{circle}(P)} \mathbb{1}\big(|I(x) - I(P)| > \varepsilon_d\big),$$

and P is accepted as a corner when N exceeds a preset count;
S3.2: to keep the orientation of the feature points invariant, the intensity centroid is computed from the image moments:

$$m_{pq} = \sum_{x,y} x^p y^q I(x, y), \qquad C = \left(\frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}}\right), \qquad \theta = \arctan\frac{m_{01}}{m_{10}};$$

S3.3: to give the feature points scale invariance, an image pyramid is built by repeatedly scaling the original image by a fixed ratio, and corresponding corners are extracted from the differently sized images at each pyramid level;
S3.4: meanwhile, to distribute the feature points uniformly over the image, a quadtree homogenization algorithm is adopted: the image is recursively subdivided into four equally sized child nodes, nodes containing no feature points are discarded, and a node is subdivided further whenever it contains more than one feature point; after the node allocation is finished, redundant feature points within the child nodes are deleted.
Preferably, the specific steps of step four are as follows:
S4.1: project the feature point Xkf in the key frame into the current frame according to the computed relative pose, obtaining the matched feature point Xcur and its projected depth Zproj in the current frame;
S4.2: judge whether the feature point is a dynamic point;
S4.3: eliminate all dynamic points obtained by the multi-view geometric method together with all points inside the dynamic mask of the semantic segmentation result, and output the remaining static points.
Preferably, in step S4.2, one method for judging a dynamic point is: obtain the measured depth value Zcur of the feature point Xcur from the depth map corresponding to the current frame. If the point were static, its projected depth and its measured depth would be approximately equal, so whether the feature point is dynamic is judged by computing the difference ΔZ between the projected depth Zproj and the measured depth Zcur:

$$\Delta Z = Z_{proj} - Z_{cur}$$

If ΔZ is greater than a preset threshold, the feature point Xcur is considered to lie on a dynamic object; it is a dynamic point and must be removed.
Preferably, in step S4.2, the other method for judging a dynamic point uses parallax: if the parallax angle α between the feature point Xkf and its projection point Xcur is greater than 30°, the feature point Xcur is considered a dynamic point and is removed from the image.
Preferably, the camera pose is calculated by the iterative closest point (ICP) algorithm, which takes the spatial distance between matched point pairs as its criterion and iteratively optimizes and adjusts the camera pose to minimize the accumulated distance between matched points, thereby computing the optimal rotation matrix R and translation vector t of the camera motion between two frames. The specific steps are:
S5.1: formulate the sum of squared errors over the matched point pairs as a least-squares problem, where {p_i} is the feature point set and {p'_i} is the set of matched feature points:

$$\min_{R, t} \; \frac{1}{2} \sum_{i=1}^{n} \left\| p_i - (R\, p'_i + t) \right\|_2^2$$

where p_i is a feature point, p'_i its matched point, R the optimal rotation matrix, and t the translation vector;
S5.2: obtain the rotation matrix by singular value decomposition, then recover the translation vector.
Preferably, in step six, the sparse point cloud is constructed by the following steps:
S6.1: a frame is selected and inserted as a key frame when it satisfies the following conditions: the distance between the current frame and the nearest key frame is greater than a threshold; the time since the last key frame insertion is greater than a threshold; the number of points co-visible with the last key frame is below a threshold; and the tracking quality of the frame meets the requirement;
S6.2: update the co-visibility relations between the key frame and the other key frames and the observation relations between the key frame and the map points, and delete key frames that do not meet the conditions; convert the qualifying feature points in the remaining key frames into point cloud map points.
Compared with the prior art, the invention has the following beneficial effects: it improves the accuracy and robustness of an RGB-D SLAM system running in highly dynamic environments, while the lightweight semantic segmentation network reduces the storage and time overhead of the system, preserving real-time performance while maintaining high accuracy.
Drawings
FIG. 1 is a flow chart of a dynamic environment information detection method based on semantic segmentation network and multi-view geometry of the present invention;
FIG. 2 is a flow chart of the present invention for rejecting dynamic feature points;
FIG. 3 is the network framework of the semantic segmentation network.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there are terms such as "upper", "lower", "left", "right", "long", "short", etc., indicating orientations or positional relationships based on the orientations or positional relationships shown in the drawings, it is only for convenience of description and simplicity of description, but does not indicate or imply that the device or element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationships in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and specific meanings of the terms may be understood by those skilled in the art according to specific situations.
The technical scheme of the invention is further described in detail by the following specific embodiments in combination with the attached drawings:
Embodiment
Fig. 1-3 show an embodiment of a method for detecting dynamic environment information based on a semantic segmentation network and multi-view geometry, comprising the following steps:
Step one: calibrate the camera and remove image distortion; acquire and input an environment image. Calibrating the camera and removing image distortion specifically comprises the following steps (a code sketch follows S1.3):
S1.1: first obtain the camera intrinsics, which comprise $f_x, f_y, c_x, c_y$, and normalize the three-dimensional coordinates $(X, Y, Z)$ to the coordinates $(x, y) = (X/Z, Y/Z)$ on the normalized image plane;
S1.2: remove the effect of distortion on the image, where $[k_1, k_2, k_3, p_1, p_2]$ are the lens distortion coefficients and $r$ is the distance from the point to the origin of the normalized coordinate system:

$$x_{dist} = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2x^2)$$
$$y_{dist} = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2y^2) + 2 p_2 x y$$

S1.3: transform the coordinates from the camera coordinate system to the pixel coordinate system:

$$u = f_x \, x_{dist} + c_x, \qquad v = f_y \, y_{dist} + c_y$$
step two: segmenting the input image through a semantic segmentation network to obtain mask codes of all objects, and realizing preliminary dynamic segmentation; the semantic segmentation network is a lightweight semantic segmentation network FcHarDnet. The model size is reduced by convolution of HDB block connection 1x1, and the network has image processing speed about 30% higher than other network structures under the same hardware environment. Potential dynamic regions of the image are segmented by FcHarDnet and a mask is generated.
Step three: extract ORB feature points from the input image, then compute their descriptors. The specific steps of ORB feature point extraction are as follows (a code sketch follows S3.4):
S3.1: when the gray values of a sufficient number of points on a circle around a point P differ from the gray value of P by more than a threshold, P is regarded as a candidate corner; specifically:

$$N = \sum_{x \in \mathrm{circle}(P)} \mathbb{1}\big(|I(x) - I(P)| > \varepsilon_d\big),$$

and P is accepted as a corner when N exceeds a preset count;
S3.2: to keep the orientation of the feature points invariant, the intensity centroid is computed from the image moments:

$$m_{pq} = \sum_{x,y} x^p y^q I(x, y), \qquad C = \left(\frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}}\right), \qquad \theta = \arctan\frac{m_{01}}{m_{10}};$$

S3.3: to give the feature points scale invariance, an image pyramid is built by repeatedly scaling the original image by a fixed ratio, and corresponding corners are extracted from the differently sized images at each pyramid level;
S3.4: meanwhile, to distribute the feature points uniformly over the image, a quadtree homogenization algorithm is adopted: the image is recursively subdivided into four equally sized child nodes, nodes containing no feature points are discarded, and a node is subdivided further whenever it contains more than one feature point; after the node allocation is finished, redundant feature points within the child nodes are deleted.
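A sketch of the extraction with OpenCV's ORB implementation, which internally builds the scale pyramid (nlevels and scaleFactor correspond to the image pyramid of S3.3) and assigns each keypoint the intensity-centroid orientation of S3.2. The quadtree homogenization of S3.4 is not part of OpenCV, so this covers only the extraction itself; the parameter values are illustrative.

```python
import cv2

orb = cv2.ORB_create(nfeatures=1000,    # cap on extracted feature points
                     scaleFactor=1.2,   # pyramid scaling ratio
                     nlevels=8,         # number of pyramid levels
                     fastThreshold=20)  # FAST gray-difference threshold of S3.1

def extract_orb(gray):
    """gray: single-channel image. Returns keypoints and binary descriptors."""
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return keypoints, descriptors
```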
Step four: detect and eliminate dynamic feature points by a method combining multi-view geometry and semantic information. The specific steps are as follows (a sketch of the two geometric tests follows S4.3):
S4.1: project the feature point Xkf in the key frame into the current frame according to the computed relative pose, obtaining the matched feature point Xcur and its projected depth Zproj in the current frame;
S4.2: judge whether the feature point is a dynamic point. The first method: obtain the measured depth value Zcur of the feature point Xcur from the depth map corresponding to the current frame. If the point were static, its projected depth and its measured depth would be approximately equal, so whether the feature point is dynamic is judged by computing the difference ΔZ between the projected depth Zproj and the measured depth Zcur:

$$\Delta Z = Z_{proj} - Z_{cur}$$

If ΔZ is greater than a preset threshold, the feature point Xcur is considered to lie on a dynamic object; it is a dynamic point and must be removed. In addition, a point judged static by the depth test is still considered dynamic according to parallax: if the parallax angle α between the feature point Xkf and its projection point Xcur is greater than 30°, the feature point Xcur is removed from the image.
S4.3: eliminate all dynamic points obtained by the multi-view geometric method together with all points inside the dynamic mask of the semantic segmentation result, and output the remaining static points.
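A minimal sketch of the two geometric tests of S4.2, under stated assumptions: T_kf_to_cur is the already-estimated 4×4 relative pose, K the intrinsics, depth_cur the registered depth map of the current frame, and X_kf the 3D point (the patent's Xkf) in keyframe camera coordinates. The depth threshold is an illustrative assumption; the text fixes only the 30° parallax bound.

```python
import numpy as np

DEPTH_THRESH_M = 0.1      # assumed threshold on delta-Z, in meters
PARALLAX_DEG = 30.0       # parallax-angle bound stated in the text

def is_dynamic(X_kf, T_kf_to_cur, K, depth_cur):
    R, t = T_kf_to_cur[:3, :3], T_kf_to_cur[:3, 3]
    X_cur = R @ X_kf + t                          # point in current-frame coords
    z_proj = X_cur[2]                             # projected depth Zproj

    # Look up the measured depth Zcur at the projected pixel (u, v).
    u = K[0, 0] * X_cur[0] / z_proj + K[0, 2]
    v = K[1, 1] * X_cur[1] / z_proj + K[1, 2]
    z_cur = depth_cur[int(round(v)), int(round(u))]

    # Test 1: a static point's projected and measured depths nearly agree.
    if z_proj - z_cur > DEPTH_THRESH_M:           # delta-Z = Zproj - Zcur
        return True

    # Test 2: parallax angle between the rays from the two camera centers.
    ray_cur = X_cur                               # ray from current camera center
    ray_kf = X_cur - t                            # ray from keyframe center (= R @ X_kf)
    cos_a = ray_cur @ ray_kf / (np.linalg.norm(ray_cur) * np.linalg.norm(ray_kf))
    alpha = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    return alpha > PARALLAX_DEG
```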
Step five: match the ORB feature points to obtain the pose information of the robot. The camera pose is calculated by the iterative closest point (ICP) algorithm, which takes the spatial distance between matched point pairs as its criterion and iteratively optimizes and adjusts the camera pose to minimize the accumulated distance between matched points, thereby computing the optimal rotation matrix R and translation vector t of the camera motion between two frames. The specific steps are as follows (a sketch follows S5.2):
S5.1: formulate the sum of squared errors over the matched point pairs as a least-squares problem, where {p_i} is the feature point set and {p'_i} is the set of matched feature points:

$$\min_{R, t} \; \frac{1}{2} \sum_{i=1}^{n} \left\| p_i - (R\, p'_i + t) \right\|_2^2$$

where p_i is a feature point, p'_i its matched point, R the optimal rotation matrix, and t the translation vector.
S5.2: obtain the rotation matrix by singular value decomposition, then recover the translation vector.
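A sketch of S5.1-S5.2: the closed-form SVD alignment that ICP iterates. Here p and p_prime are N×3 arrays of matched 3D points (the sets {p_i} and {p'_i}), and the result minimizes the sum of squared errors above.

```python
import numpy as np

def rigid_align(p: np.ndarray, p_prime: np.ndarray):
    mu_p, mu_q = p.mean(axis=0), p_prime.mean(axis=0)   # centroids of both sets
    P, Q = p - mu_p, p_prime - mu_q                     # de-meaned point sets
    H = Q.T @ P                                         # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)                         # S5.2: SVD of H
    d = np.sign(np.linalg.det(Vt.T @ U.T))              # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T             # optimal rotation
    t = mu_p - R @ mu_q                                 # translation from centroids
    return R, t

# A full ICP loop re-matches points after applying (R, t) and repeats until the
# accumulated distance between matched points stops decreasing.
```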
Step six: judge and insert key frames, and process point clouds through the local mapping thread to obtain a sparse point cloud map. The specific steps are as follows (a key-frame test sketch follows S6.2):
S6.1: a frame is selected and inserted as a key frame when it satisfies the following conditions: the distance between the current frame and the nearest key frame is greater than a threshold; the time since the last key frame insertion is greater than a threshold; the number of points co-visible with the last key frame is below a threshold; and the tracking quality of the frame meets the requirement;
S6.2: update the co-visibility relations between the key frame and the other key frames and the observation relations between the key frame and the map points, and delete key frames that do not meet the conditions; convert the qualifying feature points in the remaining key frames into point cloud map points.
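A sketch of the key-frame test of S6.1. All numeric thresholds here are illustrative assumptions; the patent states only that such thresholds exist.

```python
def should_insert_keyframe(dist_to_last_kf: float,
                           frames_since_last_kf: int,
                           covisible_points: int,
                           tracked_points: int,
                           min_dist: float = 0.2,
                           min_frames: int = 20,
                           max_covisible: int = 300,
                           min_tracked: int = 50) -> bool:
    """True when the current frame satisfies all four insertion conditions."""
    return (dist_to_last_kf > min_dist             # far enough from nearest key frame
            and frames_since_last_kf > min_frames  # long enough since last insertion
            and covisible_points < max_covisible   # co-visible points below threshold
            and tracked_points > min_tracked)      # tracking quality requirement
```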
Step seven: optimize the pose and correct drift errors using loop closure detection.
The beneficial effects of this embodiment: the invention improves the accuracy and robustness of an RGB-D SLAM system running in highly dynamic environments. Meanwhile, the lightweight semantic segmentation network reduces the storage and time overhead of the system, preserving real-time performance while maintaining high accuracy.
It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A dynamic environment information detection method based on a semantic segmentation network and multi-view geometry, characterized by comprising the following steps:
Step one: calibrating the camera and removing image distortion; acquiring and inputting an environment image;
Step two: segmenting the input image through a semantic segmentation network to obtain the masks of all objects, achieving a preliminary segmentation of dynamic regions;
Step three: extracting ORB feature points from the input image, then computing their descriptors;
Step four: detecting and eliminating dynamic feature points by a method combining multi-view geometry and semantic information;
Step five: matching the ORB feature points to obtain the pose information of the robot;
Step six: judging and inserting key frames, and processing point clouds through the local mapping thread to obtain a sparse point cloud map;
Step seven: optimizing the pose and correcting drift errors using loop closure detection.
2. The method for detecting dynamic environment information based on a semantic segmentation network and multi-view geometry as claimed in claim 1, wherein in step one, calibrating the camera and removing image distortion specifically comprises the steps of:
S1.1: first obtaining the camera intrinsics, which comprise $f_x, f_y, c_x, c_y$, and normalizing the three-dimensional coordinates $(X, Y, Z)$ to the coordinates $(x, y) = (X/Z, Y/Z)$ on the normalized image plane;
S1.2: removing the effect of distortion on the image, where $[k_1, k_2, k_3, p_1, p_2]$ are the lens distortion coefficients and $r$ is the distance from the point to the origin of the normalized coordinate system:

$$x_{dist} = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2x^2)$$
$$y_{dist} = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2y^2) + 2 p_2 x y$$

S1.3: transforming the coordinates from the camera coordinate system to the pixel coordinate system:

$$u = f_x \, x_{dist} + c_x, \qquad v = f_y \, y_{dist} + c_y$$
3. The method for detecting dynamic environment information based on a semantic segmentation network and multi-view geometry as claimed in claim 1, wherein the semantic segmentation network is the lightweight semantic segmentation network FC-HarDNet.
4. The method for detecting dynamic environment information based on a semantic segmentation network and multi-view geometry as claimed in claim 1, wherein in step three, a Gaussian pyramid is constructed while extracting ORB feature points from the input image, and feature points are detected on the different pyramid levels, so that the features are scale-invariant.
5. The method for detecting dynamic environment information based on a semantic segmentation network and multi-view geometry as claimed in claim 4, wherein the specific steps of ORB feature point extraction are as follows:
S3.1: when the gray values of a sufficient number of points on a circle around a point P differ from the gray value of P by more than a threshold, P is regarded as a candidate corner; specifically:

$$N = \sum_{x \in \mathrm{circle}(P)} \mathbb{1}\big(|I(x) - I(P)| > \varepsilon_d\big),$$

P being accepted as a corner when N exceeds a preset count;
S3.2: to keep the orientation of the feature points invariant, the intensity centroid is computed from the image moments:

$$m_{pq} = \sum_{x,y} x^p y^q I(x, y), \qquad C = \left(\frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}}\right), \qquad \theta = \arctan\frac{m_{01}}{m_{10}};$$

S3.3: building an image pyramid by repeatedly scaling the original image by a fixed ratio, and extracting corresponding corners from the differently sized images at each pyramid level;
S3.4: adopting a quadtree homogenization algorithm that recursively subdivides the image into four equally sized child nodes, discards nodes containing no feature points, and further subdivides a node whenever it contains more than one feature point; after the node allocation is finished, deleting redundant feature points within the child nodes.
6. The method for detecting dynamic environment information based on a semantic segmentation network and multi-view geometry as claimed in claim 1, wherein the specific steps of step four are as follows:
S4.1: projecting the feature point Xkf in the key frame into the current frame according to the computed relative pose, obtaining the matched feature point Xcur and its projected depth Zproj in the current frame;
S4.2: judging whether the feature point is a dynamic point;
S4.3: eliminating all dynamic points obtained by the multi-view geometric method together with all points inside the dynamic mask of the semantic segmentation result, and outputting the remaining static points.
7. The method for detecting dynamic environment information based on a semantic segmentation network and multi-view geometry as claimed in claim 6, wherein in step S4.2 the method for judging a dynamic point is: obtaining the measured depth value Zcur of the feature point Xcur from the depth map corresponding to the current frame, and judging whether the feature point is dynamic by computing the difference ΔZ between the projected depth Zproj and the measured depth Zcur; if ΔZ is greater than a preset threshold, the feature point Xcur is considered to lie on a dynamic object and is a dynamic point, and it is removed, where:

$$\Delta Z = Z_{proj} - Z_{cur}$$
8. The method for detecting dynamic environment information based on a semantic segmentation network and multi-view geometry as claimed in claim 7, wherein in step S4.2 the method for judging a dynamic point further uses parallax: if the parallax angle α between the feature point Xkf and its projection point Xcur is greater than 30°, the feature point Xcur is considered a dynamic point and is removed from the image.
9. The method for detecting dynamic environment information based on a semantic segmentation network and multi-view geometry as claimed in claim 1, wherein the camera pose is calculated by an iterative closest point algorithm, with the following specific steps:
S5.1: formulating the sum of squared errors over the matched point pairs as a least-squares problem, where {p_i} is the feature point set and {p'_i} is the set of matched feature points:

$$\min_{R, t} \; \frac{1}{2} \sum_{i=1}^{n} \left\| p_i - (R\, p'_i + t) \right\|_2^2$$

where p_i is a feature point, p'_i its matched point, R the optimal rotation matrix, and t the translation vector;
S5.2: obtaining the rotation matrix by singular value decomposition, then recovering the translation vector.
10. The method for detecting dynamic environment information based on a semantic segmentation network and multi-view geometry as claimed in claim 1, wherein in step six the sparse point cloud is constructed by:
S6.1: selecting and inserting a frame as a key frame when it satisfies the following conditions: the distance between the current frame and the nearest key frame is greater than a threshold; the time since the last key frame insertion is greater than a threshold; the number of points co-visible with the last key frame is below a threshold; and the tracking quality of the frame meets the requirement;
S6.2: updating the co-visibility relations between the key frame and the other key frames and the observation relations between the key frame and the map points, deleting key frames that do not meet the conditions, and converting the qualifying feature points in the remaining key frames into point cloud map points.
CN202011365151.9A 2020-11-27 2020-11-27 Dynamic environment information detection method based on semantic segmentation network and multi-view geometry Pending CN112435262A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011365151.9A CN112435262A (en) 2020-11-27 2020-11-27 Dynamic environment information detection method based on semantic segmentation network and multi-view geometry


Publications (1)

Publication Number Publication Date
CN112435262A 2021-03-02

Family

ID=74697514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011365151.9A Pending CN112435262A (en) 2020-11-27 2020-11-27 Dynamic environment information detection method based on semantic segmentation network and multi-view geometry

Country Status (1)

Country Link
CN (1) CN112435262A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200349763A1 (en) * 2019-05-03 2020-11-05 Facebook Technologies, Llc Semantic Fusion
CN110414533A (en) * 2019-06-24 2019-11-05 东南大学 A kind of feature extracting and matching method for improving ORB
CN110349250A (en) * 2019-06-28 2019-10-18 浙江大学 A kind of three-dimensional rebuilding method of the indoor dynamic scene based on RGBD camera
CN110852356A (en) * 2019-10-24 2020-02-28 华南农业大学 Method for extracting characteristic points of V-SLAM dynamic threshold image of mobile robot
CN111222514A (en) * 2019-12-31 2020-06-02 西安航天华迅科技有限公司 Local map optimization method based on visual positioning
CN111899334A (en) * 2020-07-28 2020-11-06 北京科技大学 Visual synchronous positioning and map building method and device based on point-line characteristics

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113126115B (en) * 2021-04-06 2023-11-17 北京航空航天大学杭州创新研究院 Semantic SLAM method and device based on point cloud, electronic equipment and storage medium
CN113126115A (en) * 2021-04-06 2021-07-16 北京航空航天大学杭州创新研究院 Semantic SLAM method and device based on point cloud, electronic equipment and storage medium
CN113256663A (en) * 2021-05-13 2021-08-13 深圳亿嘉和科技研发有限公司 Dynamic feature eliminating method and angular point detection algorithm thereof
CN113673524A (en) * 2021-07-05 2021-11-19 北京物资学院 Method and device for removing dynamic characteristic points of warehouse semi-structured environment
CN113643322B (en) * 2021-07-16 2024-03-22 重庆邮电大学 Dynamic object detection method based on deep Labv3+ _SLAM
CN113643322A (en) * 2021-07-16 2021-11-12 重庆邮电大学 DeepLabv3+ _ SLAM-based dynamic object detection method
CN113435412A (en) * 2021-07-26 2021-09-24 张晓寒 Cement distribution area detection method based on semantic segmentation
CN113739786A (en) * 2021-07-30 2021-12-03 国网江苏省电力有限公司电力科学研究院 Indoor environment sensing method, device and equipment for quadruped robot
CN113447014A (en) * 2021-08-30 2021-09-28 深圳市大道智创科技有限公司 Indoor mobile robot, mapping method, positioning method, and mapping positioning device
US11875534B2 (en) * 2022-03-02 2024-01-16 Guangdong University Of Technology Pose estimation method for unmanned aerial vehicle based on point line and plane feature fusion
CN114663506A (en) * 2022-03-08 2022-06-24 安徽信息工程学院 Robust closed loop detection method, system, storage medium and equipment based on threshold constraint
CN116468786A (en) * 2022-12-16 2023-07-21 中国海洋大学 Semantic SLAM method based on point-line combination and oriented to dynamic environment
CN116468786B (en) * 2022-12-16 2023-12-26 中国海洋大学 Semantic SLAM method based on point-line combination and oriented to dynamic environment
CN117593454A (en) * 2023-11-21 2024-02-23 重庆市祥和大宇包装有限公司 Three-dimensional reconstruction and target surface Ping Miandian cloud generation method

Similar Documents

Publication Publication Date Title
CN112435262A (en) Dynamic environment information detection method based on semantic segmentation network and multi-view geometry
CN109345588B (en) Tag-based six-degree-of-freedom attitude estimation method
CN110116407B (en) Flexible robot position and posture measuring method and device
CN108460779B (en) Mobile robot image visual positioning method in dynamic environment
CN112396595B (en) Semantic SLAM method based on point-line characteristics in dynamic environment
CN112381841A (en) Semantic SLAM method based on GMS feature matching in dynamic scene
CN111507222B (en) Three-dimensional object detection frame based on multisource data knowledge migration
CN110298884A (en) A kind of position and orientation estimation method suitable for monocular vision camera in dynamic environment
CN111998862B (en) BNN-based dense binocular SLAM method
CN114140527B (en) Dynamic environment binocular vision SLAM method based on semantic segmentation
CN113744315B (en) Semi-direct vision odometer based on binocular vision
CN113570662B (en) System and method for 3D localization of landmarks from real world images
CN113420590B (en) Robot positioning method, device, equipment and medium in weak texture environment
CN114549629A (en) Method for estimating three-dimensional pose of target by underwater monocular vision
CN114494150A (en) Design method of monocular vision odometer based on semi-direct method
CN117367427A (en) Multi-mode slam method applicable to vision-assisted laser fusion IMU in indoor environment
CN116468786A (en) Semantic SLAM method based on point-line combination and oriented to dynamic environment
CN116128966A (en) Semantic positioning method based on environmental object
CN112446885A (en) SLAM method based on improved semantic optical flow method in dynamic environment
CN112101145B (en) SVM classifier based pose estimation method for mobile robot
CN116681733B (en) Near-distance real-time pose tracking method for space non-cooperative target
CN117029870A (en) Laser odometer based on road surface point cloud
CN116045965A (en) Multi-sensor-integrated environment map construction method
CN110570473A (en) weight self-adaptive posture estimation method based on point-line fusion
CN117928519B (en) Multi-sensor fusion positioning and mapping method and system for service robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination