CN108648270B - Unmanned aerial vehicle real-time three-dimensional scene reconstruction method capable of realizing real-time synchronous positioning and map construction

Info

Publication number: CN108648270B (granted patent for application CN201810452378.3A; earlier publication CN108648270A)
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 布树辉, 张咪, 赵勇
Applicant and current assignee: Northwestern Polytechnical University
Legal status: Active (granted)


Classifications

    • G06T17/00 Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T15/04 Texture mapping (3D [three-dimensional] image rendering)
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration, using feature-based methods
    • G06T2207/10016 Video; image sequence (image acquisition modality indexing scheme)
    • G06T2207/10021 Stereoscopic video; stereoscopic image sequence

Abstract

The invention provides a real-time three-dimensional scene reconstruction method for unmanned aerial vehicles based on EG-SLAM. Compared with existing methods, the method runs directly on a CPU after the images are collected and achieves real-time localization and three-dimensional map reconstruction; the pose of the unmanned aerial vehicle is solved by the EG-SLAM method, i.e. directly from the feature point matching relation between two frames, which lowers the requirement on the repetition rate of the acquired images; in addition, the large amount of environment information obtained allows the unmanned aerial vehicle to perceive the environment structure more precisely, and texture rendering of the large-scale three-dimensional point cloud map generated in real time yields a reconstructed large-scale three-dimensional map and a more intuitive and realistic three-dimensional scene.

Description

Unmanned aerial vehicle real-time three-dimensional scene reconstruction method capable of realizing real-time synchronous positioning and map construction
Technical Field
The invention relates to the fields of unmanned aerial vehicle autonomous flight, computer vision and map construction, and in particular to a method by which an unmanned aerial vehicle reconstructs a three-dimensional scene of a ground target by performing dense diffusion and texture rendering on the three-dimensional point cloud obtained with real-time simultaneous localization and mapping (EG-SLAM).
Background
Real-time simultaneous localization and three-dimensional scene reconstruction by unmanned aerial vehicles have long been a hot topic and a difficulty in the fields of autonomous UAV flight and computer vision. Unmanned aerial vehicles have many advantages, such as small size, low cost, good stealth, flexibility, convenience and strong operational capability, and in recent years they have been widely applied in many military and civilian fields. Militarily, unmanned aerial vehicles can be used for reconnaissance, attack and electronic countermeasures, and can also serve as target drones, which is of great significance for future air combat. On the civilian side, where unmanned aerial vehicles are genuinely needed, they already play a very positive role in agriculture, plant protection, express delivery, rescue and disaster relief, city management, news reporting, film and television shooting and even aerial selfies. In unknown environments, external environment information often has to be acquired in real time and a map constructed to meet the needs of obstacle avoidance, path planning and autonomous flight of the unmanned aerial vehicle, so real-time simultaneous localization and three-dimensional scene reconstruction are of particularly important practical significance.
Simultaneous localization and mapping (SLAM) describes how an unmanned aerial vehicle, starting from an unknown position in an unknown environment, repeatedly observes the environment during motion, localizes its own position and attitude from the environmental features sensed by its sensors, and incrementally builds a map according to its own position. With the development of image processing technology and camera hardware, vision-based SLAM has become a popular research direction. Visual SLAM is generally divided into two highly correlated parts: camera position tracking and scene map construction. Camera localization and tracking means determining the position and pose of the camera in the environment, while scene map construction means building a three-dimensional map of the scene in which the camera is located.
However, the existing synchronous positioning and map building method has several defects:
1) Image frames whose repetition rate is not high enough cannot be processed well: for example, ORB-SLAM, which is based on the ORB feature point method, as well as Large-Scale Direct Monocular SLAM (LSD-SLAM), Direct Sparse Odometry (DSO) and Semi-Direct Visual Odometry (SVO), all require data sets with a high repetition rate.
2) The redundancy of the initial image information obtained by the camera is extremely high, resulting in the waste of transmission bandwidth and computing resources.
3) At present, most of large-scale maps generated by unmanned aerial vehicles in real time by using a computer vision technology are three-dimensional point clouds only having geometric structures and not containing texture information.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an unmanned aerial vehicle real-time three-dimensional scene reconstruction method based on EG-SLAM, which specifically comprises the following steps:
step 1: collecting and processing images:
the unmanned aerial vehicle airborne camera collects a series of images, transmits the images to the computing unit in real time, and utilizes the camera calibration data to perform distortion removal processing on each collected frame of image to obtain a distortion-removed image;
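For illustration only, step 1 can be sketched in Python with OpenCV as below; the intrinsic matrix K and distortion coefficients dist stand in for the pre-obtained camera calibration data and carry placeholder values, and the library choice is an assumption rather than part of the disclosure:

    # Sketch of step 1: undistort each incoming frame with pre-computed calibration data.
    import cv2
    import numpy as np

    K = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])                # intrinsic matrix (placeholder values)
    dist = np.array([-0.12, 0.05, 0.0, 0.0, 0.0])  # distortion coefficients (placeholder values)

    def undistort_frame(frame):
        """Return the undistorted image used by all later steps, plus the refined intrinsics."""
        h, w = frame.shape[:2]
        new_K, _ = cv2.getOptimalNewCameraMatrix(K, dist, (w, h), alpha=0)
        return cv2.undistort(frame, K, dist, None, new_K), new_K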
step 2: performing the following processing on the first two frames of images obtained in the step 1 to obtain an initialized map:
step 2.1: setting a first frame image as a first key frame; extracting and storing SIFT feature points of the images of the first two frames, and arbitrarily setting the initial inverse depth of the feature points extracted from the first two frames;
step 2.2: solving an essential matrix E between the two previous frames by utilizing a characteristic point matching relation, and decomposing a rotation matrix R and a translation matrix t of camera pose change corresponding to the second frame image to the first frame image through SVD;
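The essential-matrix initialization of step 2.2 can be sketched as follows; pts1 and pts2 are assumed to be the pixel coordinates of matched SIFT feature points in the first two undistorted frames, and cv2.recoverPose performs the SVD-based decomposition (its sign and direction conventions may differ from the description above):

    import cv2
    import numpy as np

    def relative_pose_from_matches(pts1, pts2, K):
        """Estimate the essential matrix E from matched feature points and
        decompose it into a rotation R and a unit-scale translation t."""
        E, inliers = cv2.findEssentialMat(pts1, pts2, K,
                                          method=cv2.RANSAC, prob=0.999, threshold=1.0)
        # recoverPose selects the physically valid (R, t) among the possible decompositions
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
        return R, t, inliers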
step 2.3: according to the rotation matrix R and the translation matrix t of the camera pose change corresponding to the second frame image to the first frame image obtained in the step 2.2, calculating the inverse depth of the feature points in the first two frames by an inverse depth filtering method, and obtaining the corresponding space coordinates of the feature points in the three-dimensional space, so as to complete the initialization of the three-dimensional point cloud map;
Step 3: the following processing is performed on each i-th frame image obtained in real time to complete the tracking and mapping process, where i = 3, 4, 5, …:
Step 3.1: after the undistorted i-th frame image is obtained, the camera pose change between the i-th frame and the current key frame is solved, taking the current key frame as the reference:
extracting SIFT feature points of the ith frame image, initializing the inverse depth of each feature point, solving an essential matrix E between the ith frame image and the current key frame by using a feature point matching relation, and decomposing an initial value of a pose change SE (3) matrix between the ith frame and the current key frame through SVD;
For a feature matching point pair p_1, p_2 between the current key frame and the i-th frame: knowing the pixel position p of the feature point p_1 in the current key frame and the initial inverse depth d, the position P of p_1 in three-dimensional space can be obtained; the three-dimensional point P is re-projected onto the i-th frame image with the pose-change SE(3) matrix to obtain a projection point p̂_2; the re-projection error e between the projection point p̂_2 and the feature matching point p_2 of p_1 on the i-th frame is calculated; taking the pose-change SE(3) matrix and the inverse depth d as optimization variables, the pose-change SE(3) matrix between the current key frame and the i-th frame and the inverse depth d′ of the feature points in the current key frame are optimized and solved so that the re-projection error is minimized;
the ratio d′/d of the optimized inverse depth d′ of every feature point in the current key frame to its inverse depth d before optimization is calculated, and the weighted average of the ratios over all feature points in the current key frame is taken as the scale parameter s of the pose change between the cameras corresponding to the i-th frame and the current key frame; the camera pose-change SIM(3) matrix between the i-th frame and the current key frame is obtained from the optimized pose-change SE(3) matrix and the scale parameter s;
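A minimal sketch of the quantities used in step 3.1 is given below; K, the relative pose (R, t), pixel coordinates and inverse depths are assumed inputs, and in a full implementation the re-projection residual would be fed to a Levenberg-Marquardt optimizer:

    import numpy as np

    def reproject(p1, inv_depth, R, t, K):
        """Back-project key-frame pixel p1 with its inverse depth, transform the 3D point
        by the pose change (R, t) and project it into the i-th frame."""
        ray = np.linalg.inv(K) @ np.array([p1[0], p1[1], 1.0])  # normalized viewing ray
        P = ray / inv_depth                                     # 3D point in key-frame coordinates
        P_i = R @ P + t.ravel()                                 # point in i-th frame coordinates
        uv = K @ (P_i / P_i[2])
        return uv[:2]

    def reprojection_error(p1, p2, inv_depth, R, t, K):
        return np.linalg.norm(reproject(p1, inv_depth, R, t, K) - np.asarray(p2))

    def scale_parameter(d_before, d_after, weights=None):
        """Scale s: weighted average of the ratios d'/d over the key-frame feature points."""
        return np.average(np.asarray(d_after) / np.asarray(d_before), weights=weights)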
Step 3.2: with the first-optimized camera pose-change SIM(3) matrix between the i-th frame and the current key frame and the feature point matching relation obtained in step 3.1, the inverse depths of the feature points of the current key frame are updated with the inverse depth filtering method;
step 3.3: optimizing a camera pose change SIM (3) matrix between the ith frame and the current key frame and the inverse depth of the feature point of the current key frame again by adopting a method based on graph optimization, and updating the finally optimized feature point with the inverse depth into a map;
step 3.4: if the camera pose change between the ith frame and the current key frame obtained in the step 3.3 is larger than the set pose change threshold value, and the feature matching point number of the ith frame and the current key frame is smaller than 1/2 of the feature point number of the ith frame, determining the ith frame as a new key frame and updating the new key frame into the current key frame;
Step 4: the collected images and the point cloud map updated in step 3 are returned to the ground terminal for subsequent processing:
step 4.1: clustering space points in the point cloud map by using a k-d tree algorithm, taking the mean value of space point coordinates in each cluster as a central space coordinate of the cluster, and forming a new point set by using the central space coordinates of all clusters;
step 4.2: projecting the space points in the new point set obtained in the step 4.1 onto each collected image, if the number of the projection points on the image is greater than a set threshold value, retaining the image, otherwise, deleting the image; obtaining a new image set through screening;
step 4.3: according to the image segmentation principle, segmenting the image set obtained in the step 4.2 to obtain image clusters, and enabling the number of images in each image cluster to be smaller than a given threshold value;
Step 5: feature extraction and matching are performed separately on the images in each image cluster obtained in step 4 to generate a series of three-dimensional space points, patches are generated from these space points, and patches with large errors are filtered out to obtain a dense space point set:
Step 5.1: feature extraction is performed on all images in each image cluster with a DoG pyramid and the Harris operator. For an image I_i in an image cluster, the images among the other images of the cluster whose main optical axis makes an angle smaller than a set angle with that of image I_i are selected to form an image set I(p); feature matching point pairs between image I_i and each image in I(p) are found by feature matching, three-dimensional space points are generated from the feature matching point pairs by triangulation, and the three-dimensional space points are arranged in order of increasing distance from the optical center O(I_i) corresponding to image I_i; patches are then generated in turn with the following method to obtain a series of initialized sparse patches:
the patch p generation method comprises the following steps:
A patch p is represented by a center point c(p) and a normal vector n(p); the candidate patch center c(p) is initialized with the obtained three-dimensional space point, and the unit direction vector from c(p) to O(I_i) is taken as the normal vector n(p);
the center point c(p) of each patch p is obtained from the image I_i and the corresponding matching image I_j in the image set I(p); the two images I_i and I_j are first added to the visible image set V(p) of the patch; the center point c(p) is projected onto the images in the image set I(p) other than I_j; if an image I_m contains a projection point, and the angle between the ray from c(p) to the optical center of image I_m and the direction of n(p) is larger than a set angle threshold, the image I_m is added to the visible image set V(p) of the patch p;
A photometric error function is defined:

g(p) = 1/|V(p)\R(p)| · Σ_{I ∈ V(p)\R(p)} h(p, I, R(p))

where V(p) denotes the visible image set of the patch p, R(p) denotes the reference image of the patch p, I denotes any image in the visible image set V(p) other than the reference image R(p), and h(p, I, R(p)) denotes the photometric error between images I and R(p); the arguments of the photometric error function are the center point c(p) of the patch p, the normal vector n(p) and the visible image set V(p); the photometric error function of the patch p is minimized with the conjugate gradient method, and c(p), n(p), V(p) are updated; if the number of images in the visible image set V(p) is larger than a set threshold, the patch p is generated successfully, otherwise the patch is rejected;
step 5.2: diffusion of the initialized patches, generating dense patches:
For a patch p and the corresponding V(p), each image in V(p) is divided into image blocks C_i(x, y) of β × β pixels, where x and y denote the indices of the image block and i denotes the i-th image in V(p); the patch p is projected onto each image in V(p), and the set of patches projected into image block C_i(x, y) is denoted Q_i(x, y); this process is repeated for all patches p generated successfully in step 5.1, yielding the patch set of each image block of every image in the image cluster;
For a patch p, the set of neighboring image blocks C(p) satisfying the following condition is found:
C(p) = { C_i(x′, y′) | p ∈ Q_i(x, y), |x − x′| + |y − y′| = 1 }
If a patch p′ is a neighbor of p and the image block C_i(x′, y′) into which p′ projects satisfies C_i(x′, y′) ∈ C(p), then C_i(x′, y′) is removed from C(p); a new patch is created for each remaining image block in C(p): the center of the new patch is the intersection of the ray through the center of the image block and the optical center with the plane in which the patch p lies, and the normal vector of the new patch is the same as that of the patch p; the center of the new patch is projected onto the images in the visible image set V(p) of the patch p, and if the projection point lies on an image, that image is added to the visible image set of the new patch; if the number of images in the visible image set of the new patch is larger than a set threshold, one patch is considered to have been expanded successfully, otherwise the expansion fails and the next patch is expanded;
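The geometric core of the expansion step, intersecting the viewing ray through an image-block center with the plane of an existing patch p, can be sketched as below; all inputs are assumed to be given in world coordinates:

    import numpy as np

    def new_patch_center(cam_center, ray_dir, c_p, n_p):
        """Intersect the ray cam_center + s * ray_dir with the plane through c(p)
        with normal n(p); the intersection is the center of the new patch."""
        denom = float(np.dot(n_p, ray_dir))
        if abs(denom) < 1e-9:      # ray nearly parallel to the patch plane
            return None
        s = float(np.dot(n_p, c_p - cam_center)) / denom
        if s <= 0.0:               # intersection behind the camera
            return None
        return cam_center + s * ray_dir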
step 5.3: for a plurality of patches in each image block, calculating the average correlation coefficient of a certain patch and the rest patches in the same image block by using the visibility consistency constraint, and deleting the patches of which the average correlation coefficients are larger than a set threshold;
Step 5.4: steps 5.2 and 5.3 are repeated N times, the dense diffusion is completed, and a dense space point set is obtained;
Step 6: Poisson reconstruction is performed on the dense space point set obtained in step 5 to obtain an iso-value; using this iso-value, an iso-surface is extracted from the indicator function with the marching cubes algorithm to serve as the surface of the point cloud, and a triangular mesh model composed of triangular patches is obtained;
Step 7: texture rendering is performed on the triangular mesh model obtained in step 6 to obtain a complete three-dimensional texture scene:
Step 7.1: the acquired images are registered, and the triangular patches and the images in the triangular mesh model are then screened and pruned: three triangular patches whose vertex positions coincide are regarded as the same patch and only one is kept; images that do not meet the conditions are deleted with a view-cone culling method and a texture consistency checking method, a Markov random field energy function E(l) is minimized with graph cuts and alpha-expansion, an optimal image l is selected for texture rendering of each triangular patch, and the triangular patch is projected back onto its optimal image to obtain the texture color of the triangular patch;

E(l) = Σ_{F_i ∈ Faces} E_data(F_i, l_i) + Σ_{(F_i, F_j) adjacent} E_smooth(F_i, F_j, l_i, l_j)

where the E_data term represents, for each triangular patch F_i among all the patches (Faces) and each image l_i in which F_i is visible, the data cost between them, expressed in the form

E_data(F_i, l_i) = −∫_{φ(F_i, l_i)} ‖∇I_{l_i}(p)‖ dp

where, for each image l_i in which the triangular patch F_i is visible, φ(F_i, l_i) denotes the image area obtained by projecting F_i onto l_i, and ∇I_{l_i}(p) denotes the gradient of each pixel p inside the projection area φ(F_i, l_i); the E_smooth term represents the seam visibility of the adjacent triangular patches F_i, F_j after texture rendering with the images l_i, l_j in which F_i and F_j are respectively visible;
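As an illustration of the data term, a rough sketch of evaluating the cost of one (triangular patch, image) pair is given below; the gradient integral over the projection area is approximated with Sobel gradients and a rasterized triangle mask, which is an assumption about the discretization rather than the patented formulation itself:

    import cv2
    import numpy as np

    def data_cost(image_gray, tri_px):
        """Approximate E_data(F_i, l_i): negative sum of gradient magnitudes over the
        projection area of the triangle. tri_px is a (3, 2) array of pixel coordinates."""
        gx = cv2.Sobel(image_gray, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(image_gray, cv2.CV_32F, 0, 1, ksize=3)
        grad_mag = np.sqrt(gx * gx + gy * gy)
        mask = np.zeros(image_gray.shape[:2], dtype=np.uint8)
        cv2.fillConvexPoly(mask, np.round(tri_px).astype(np.int32), 1)  # rasterize phi(F_i, l_i)
        return -float(grad_mag[mask == 1].sum())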
step 7.2: after each triangular patch obtains the optimal image l, the color of the projection area of the common edge of the adjacent triangular patches in the image texture block corresponding to the triangular patch is adjusted globally firstly, then Poisson editing is used for local adjustment, and finally the color-adjusted image texture block is used for texture rendering of the common edge of the adjacent triangular patches to obtain the color of the common edge of the adjacent triangular patches.
In a further preferred scheme, the unmanned aerial vehicle real-time three-dimensional scene reconstruction method based on EG-SLAM is characterized in that: the inverse depth filtering method in the step 2 comprises the following steps:
For two adjacent frame images I_r and I_k, according to the position coordinates of a feature matching point pair u_i, u_i′ in the respective images and the camera pose change from image I_r to image I_k, an inverse depth range [d_min, d_max] is set for the feature point u_i in image I_r, and the epipolar line positions in image I_k corresponding to the two end points of this inverse depth range are computed; the perpendicular distance from the feature point u_i′ in image I_k corresponding to u_i to this epipolar line is calculated; if this distance is smaller than a set threshold, the inverse depth of the feature point u_i is computed from the coordinate relation of the matching pair u_i, u_i′ and the camera pose change from image I_r to image I_k; otherwise the inverse depth computation of the feature point u_i fails.
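The acceptance test of the inverse depth filter can be sketched as follows; K and the relative pose (R, t) from I_r to I_k are assumed known from tracking, and the threshold value is illustrative:

    import numpy as np

    def skew(v):
        return np.array([[0.0, -v[2], v[1]],
                         [v[2], 0.0, -v[0]],
                         [-v[1], v[0], 0.0]])

    def epipolar_distance(u_i, u_i_prime, R, t, K):
        """Perpendicular pixel distance of u_i' in I_k to the epipolar line of u_i from I_r."""
        E = skew(t.ravel()) @ R                          # essential matrix from the relative pose
        F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)    # fundamental matrix
        line = F @ np.array([u_i[0], u_i[1], 1.0])       # epipolar line in image I_k
        x2 = np.array([u_i_prime[0], u_i_prime[1], 1.0])
        return abs(x2 @ line) / np.hypot(line[0], line[1])

    def match_accepted(u_i, u_i_prime, R, t, K, thresh=2.0):
        return epipolar_distance(u_i, u_i_prime, R, t, K) < thresh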
In a further preferred scheme, the unmanned aerial vehicle real-time three-dimensional scene reconstruction method based on EG-SLAM is characterized in that: the initial inverse depth d of the key frame in step 3 is obtained as follows: when a frame becomes a new key frame, the initial inverse depth d of any feature point p_i′ in the new key frame is the weighted average of d_1 and d_2, where d_1 is the inverse depth of the feature point p_i corresponding to p_i′ in the previous key frame, and d_2 is the inverse depth of the feature point p_i′ in the new key frame obtained by the inverse depth filtering method from the camera pose change and the feature point matching relation between the new key frame and the previous key frame.
In a further preferred scheme, the unmanned aerial vehicle real-time three-dimensional scene reconstruction method based on EG-SLAM is characterized in that: in step 4.1, a threshold is set so that the number of points in each cluster is not greater than the set value.
In a further preferred scheme, the unmanned aerial vehicle real-time three-dimensional scene reconstruction method based on EG-SLAM is characterized in that: in step 5.1, the photometric error between the two images is obtained by adopting the following process:
For a pair of images I_1 and I_2, the patch p is projected onto I_1 and I_2 respectively; bilinear interpolation is performed on the pixels in the projection areas to obtain the pixel gray values q(p, I_1) and q(p, I_2); the NCC value of q(p, I_1) and q(p, I_2) is computed, and subtracting this NCC value from 1 gives the photometric error h(p, I_1, I_2) between images I_1 and I_2.
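A sketch of the photometric error h(p, I_1, I_2), assuming the grids of gray values q(p, I_1) and q(p, I_2) have already been sampled by bilinear interpolation over the two projections of the patch:

    import numpy as np

    def photometric_error(q1, q2):
        """h(p, I1, I2) = 1 - NCC(q(p, I1), q(p, I2)); q1 and q2 are gray-value grids."""
        a = np.asarray(q1, dtype=np.float64).ravel()
        b = np.asarray(q2, dtype=np.float64).ravel()
        a -= a.mean()
        b -= b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom < 1e-12:
            return 1.0            # flat, textureless samples: treat as uncorrelated
        return 1.0 - float(a @ b) / denom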
In a further preferred scheme, the unmanned aerial vehicle real-time three-dimensional scene reconstruction method based on EG-SLAM is characterized in that: in step 6, the dense space point set obtained in step 5 is uniformly sampled to obtain an oriented sample point set S, where each sample point s in S consists of its position coordinate s.p and an inward surface normal vector s.N. The sample points are uniformly distributed on the surface of the three-dimensional scene; using the equivalence between the vector field formed by the inward surface normal vectors s.N of the sample points in S and the gradient field of the indicator function χ_M, the indicator function χ_M of the three-dimensional scene model is estimated, after which an iso-surface of the model is found and extracted and a seamless triangular mesh model is reconstructed for the model surface.
In a further preferred scheme, the unmanned aerial vehicle real-time three-dimensional scene reconstruction method based on EG-SLAM is characterized in that: the step 6 is realized by the following steps:
step 6.1: defining an octree according to the position coordinates of the sampling directed point set, giving a maximum tree depth D, and subdividing the octree to enable each sampling point to fall on a leaf node;
Step 6.2: a basis function F that can be translated, scaled and has unit integral is defined; for each leaf node o of the octree a node function F_o with unit integral is set and expressed in terms of the basis function F, where q denotes the sample position variable and o.c and o.w are the center and width of node o respectively:

F_o(q) = F( (q − o.c) / o.w ) · 1 / o.w³

Step 6.3: trilinear interpolation is used to distribute each sample point to its eight nearest leaf nodes, and the node functions F_o are used to represent the vector field V of the sample point set:

V(q) = Σ_{s ∈ S} Σ_{o ∈ Ngbr_D(s)} α_{o,s} · F_o(q) · s.N

where Ngbr_D(s) are the eight depth-D leaf nodes nearest to s.p, and α_{o,s} are the trilinear interpolation weights;
Step 6.4: for sample points near the surface of the three-dimensional model, the gradient of the indicator function equals the inward normal vector of the model surface, which gives the Poisson equation

Δχ_M = ∇·V

The vector field V is obtained in step 6.3, and the indicator function χ_M is solved by Laplacian matrix iteration; for each sample point s the value χ_M(s.p) of the indicator function is estimated, and the weighted average of the indicator function values at all sample points is taken as the iso-value;
Step 6.5: using the iso-value obtained in step 6.4, an iso-surface is extracted from the indicator function with the marching cubes algorithm to serve as the surface of the point cloud, and a triangular mesh model composed of triangular patches is obtained.
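Steps 6.1 to 6.5 amount to Poisson surface reconstruction followed by marching cubes; a sketch using the Open3D library (a library choice that is an assumption, not part of the disclosure) is:

    import numpy as np
    import open3d as o3d

    def poisson_mesh(points, normals, depth=9):
        """Oriented sample set S (positions plus inward normals) -> triangular mesh model.
        Open3D's Poisson reconstruction plays the role of steps 6.1-6.5; depth is the
        maximum octree depth D."""
        pcd = o3d.geometry.PointCloud()
        pcd.points = o3d.utility.Vector3dVector(np.asarray(points, dtype=np.float64))
        pcd.normals = o3d.utility.Vector3dVector(np.asarray(normals, dtype=np.float64))
        mesh, _densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=depth)
        return mesh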
In a further preferred scheme, the unmanned aerial vehicle real-time three-dimensional scene reconstruction method based on EG-SLAM is characterized in that: in step 7.1, the E_smooth term is based on the Potts model; when the triangular patches F_i, F_j are texture-rendered with the same image, E_smooth is 0, otherwise it takes the set maximum value.
In a further preferred scheme, the unmanned aerial vehicle real-time three-dimensional scene reconstruction method based on EG-SLAM is characterized in that: the texture consistency checking method in step 7.1 is specifically as follows:
1) all images in which the triangular patch F_i is visible are regarded as inliers;
2) the average color value c_i of the projection of F_i in each inlier image is computed, together with the covariance matrix Σ and the mean μ of all the average color values c_i;
3) for each image in which F_i is visible, a value is obtained with the Gaussian function exp(−½ (c_i − μ)ᵀ Σ⁻¹ (c_i − μ));
4) when this value is smaller than the set threshold, the corresponding image is removed;
5) steps 2) to 4) are repeated, and the iteration stops when the element values of the covariance matrix fall below the corresponding set value, or the number of inliers falls below the corresponding set value, or the set number of iterations is reached.
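A sketch of the texture consistency check, assuming colors is an (N, 3) array holding the mean color c_i of the patch F_i projected into each of the N images in which it is currently visible; the numeric thresholds are illustrative:

    import numpy as np

    def consistency_filter(colors, value_thresh=6e-3, max_iters=10, min_inliers=4):
        """Iteratively drop views whose mean color is unlikely under a Gaussian
        fitted to the inlier views (steps 1) to 5) of the check)."""
        colors = np.asarray(colors, dtype=np.float64)
        inliers = np.ones(len(colors), dtype=bool)          # 1) all views start as inliers
        for _ in range(max_iters):
            mu = colors[inliers].mean(axis=0)               # 2) mean and covariance of inliers
            cov = np.cov(colors[inliers], rowvar=False) + 1e-9 * np.eye(3)
            diff = colors - mu
            # 3) Gaussian function value for every view
            vals = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff))
            new_inliers = inliers & (vals >= value_thresh)  # 4) drop low-value views
            # 5) stop on convergence, small covariance, or too few remaining inliers
            if (np.array_equal(new_inliers, inliers) or np.abs(cov).max() < 1e-6
                    or new_inliers.sum() < min_inliers):
                break
            inliers = new_inliers
        return inliers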
Advantageous effects
The invention discloses a method for synchronously positioning and reconstructing a three-dimensional scene by an unmanned aerial vehicle in real time. Compared with the existing methods, the method has the characteristics of rapidness and real-time performance, reduces the requirement on the repetition rate of the image data set, can reconstruct a three-dimensional scene with texture details, and improves the fidelity of environmental perception.
The reason why the present invention has the above-mentioned advantageous effects is that: 1) the unmanned aerial vehicle airborne camera directly runs on the CPU after collecting images, and rapidly realizes positioning in real time and reconstructs a three-dimensional map; 2) the pose of the unmanned aerial vehicle is solved by the EG-SLAM method instead of the traditional PNP method, namely the pose is directly solved by the feature point matching relationship between two frames, and the requirement on the repetition rate of the acquired images is reduced; 3) the acquired large amount of environment information enables the unmanned aerial vehicle to sense the environment structure more precisely and meticulously, and texture rendering is carried out on the large-scale three-dimensional point cloud map generated in real time, so that reconstruction of the large-scale three-dimensional map is realized, and a more visual and real three-dimensional scene is obtained.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1: determining a pixel depth schematic diagram by inverse depth filtering;
FIG. 2: a reprojection error diagram;
FIG. 3: a triangular mesh model diagram;
FIG. 4: triangular patches; the vertex v_1 is a common point of adjacent patches;
FIG. 5: a schematic diagram of a linear variation function of the sampling point color value weight;
FIG. 6: a texture block with its border strip.
Detailed Description
The following detailed description of embodiments of the invention is intended to be illustrative, and not to be construed as limiting the invention.
The main devices of the unmanned aerial vehicle are a microminiature computer and an airborne industrial camera. In the flight process of the unmanned aerial vehicle, image information is collected through a camera, data are transmitted to an embedded computer, the embedded computer processes the collected images, real-time positioning and point cloud map construction are completed, and a three-dimensional map scene is reconstructed by combining a GPS provided by a flight controller.
The following are specific implementation steps:
step 1: collecting and processing images:
a series of images are collected by the unmanned aerial vehicle airborne camera, and the images are transmitted to the computing unit in real time, so that the rapidity of image transmission is guaranteed. And carrying out distortion removal treatment on each acquired frame image by using the camera calibration data acquired in advance to obtain a distortion-removed image.
Step 2: processing the first two frames of images to obtain an initialization map (initialization of EG-SLAM):
step 2.1: the undistorted first frame image is set as the first key frame. Extracting and storing SIFT feature points of the images of the first two frames, setting the initial inverse depth of the feature points extracted from the first two frames to be 1, wherein the set initial inverse depth does not have effective depth information, but does not influence the reconstruction precision of subsequent point clouds, because the accurate inverse depth value can be obtained through optimization later.
Step 2.2: and solving an essential matrix E between the two previous frames by utilizing the characteristic point matching relation, and decomposing a rotation matrix R and a translation matrix t of the camera pose change corresponding to the second frame image to the first frame image by SVD. The baseline of the first and second frame images is set to 1.
Step 2.3: and (3) according to the rotation matrix R and the translation matrix t of the camera pose change corresponding to the second frame image to the first frame image obtained in the step (2.2), obtaining the inverse depth of the feature points in the first two frames by an inverse depth filtering method, obtaining the corresponding space coordinates of the feature points in the three-dimensional space, and finishing the initialization of the three-dimensional point cloud map.
The inverse depth filtering method is specifically described as follows:
As shown in fig. 1, for two adjacent frame images I_r and I_k, according to the position coordinates of a feature matching point pair u_i, u_i′ in the respective images and the camera pose change from image I_r to image I_k, an inverse depth range [d_min, d_max] is set for the feature point u_i in image I_r, and the epipolar line positions in image I_k corresponding to the two end points of this range are computed; the perpendicular distance d from the feature point u_i′ in image I_k corresponding to u_i to this epipolar line is calculated; if d is smaller than the set threshold, the inverse depth of the feature point u_i is computed from the coordinate relation of the matching pair u_i, u_i′ and the camera pose change from image I_r to image I_k. Otherwise the inverse depth computation of the feature point u_i fails, and the inverse depth of the next feature matching point pair in images I_r and I_k is computed in the same way.
And step 3: the following processing is performed on the ith frame image obtained in real time (i ═ 3,4,5.. to.) to complete the tracking and composition process (EG-SLAM positioning and sparse map construction):
step 3.1: after the image data of the ith frame is obtained, the camera pose change between the ith frame and the current key frame is solved by taking the current key frame as a reference (namely, a tracking basis):
The pose change between the i-th frame (the current frame) and the current key frame can be expressed as an SE(3) transformation; SE(3) is a 4×4 matrix, also called the extrinsic matrix of the camera, which represents the rotation and translation of the camera:

T = [ a_00 a_01 a_02 T_0 ; a_10 a_11 a_12 T_1 ; a_20 a_21 a_22 T_2 ; 0 0 0 1 ]

where the upper-left 3×3 block a_00 … a_22 is the rotation component SO(3) and the upper-right column T_0 … T_2 is the translation component. Adding a scale parameter to the SE(3) matrix gives SIM(3), with s denoting the scale parameter of the similarity transformation:

S = [ sR t ; 0 1 ]
SIFT feature points of the ith frame image are extracted, and the inverse depth of each feature point is initialized to 1 (the initial inverse depth does not have effective depth information). Solving an essential matrix E between the image of the ith frame and the current key frame by utilizing the feature point matching relation, and decomposing an initial value of pose change (SE (3) matrix) between the ith frame and the current key frame through SVD;
For a feature matching point pair p_1, p_2 between the current key frame and the i-th frame: knowing the pixel position p of the feature point p_1 in the current key frame and the initial inverse depth d, the position P of p_1 in three-dimensional space can be obtained; the three-dimensional point P is re-projected onto the i-th frame image with the pose change (SE(3) matrix) to obtain a projection point p̂_2, as shown in fig. 2. The re-projection error e between the projection point p̂_2 and the feature matching point p_2 of p_1 on the i-th frame is calculated (the SE(3) matrix and the inverse depth are the optimization variables); the pose change (SE(3) matrix) between the current key frame and the i-th frame and the inverse depth d′ of the feature points in the current key frame are solved iteratively with the LM (Levenberg-Marquardt) optimization method so that the re-projection error is minimized.

In theory, the optimized inverse depth d′ of a feature point in the current key frame and its inverse depth d before the iteration should be the same value. Using this relation, the ratio d′/d is computed, and the weighted average of the ratios over all feature points in the current key frame is taken as the scale parameter s of the pose change between the cameras corresponding to the i-th frame and the current key frame. With the SE(3) matrix and the scale parameter s known, the camera pose change (SIM(3) matrix) between the i-th frame and the current key frame is obtained.

The initial inverse depth d of a key frame is obtained as follows: when the i-th frame becomes a new key frame, the initial inverse depth d of any feature point p_i′ in the i-th frame is the weighted average of d_1 and d_2, where d_1 is the inverse depth of the feature point p_i corresponding to p_i′ in the previous key frame, and d_2 is the inverse depth of the feature point p_i′ in the i-th frame obtained by the inverse depth filtering method from the camera pose change and the feature point matching relation between the i-th frame and the previous key frame.
Step 3.2: through the step 3.1, knowing the camera pose change (SIM (3) matrix) and the feature point matching relation after the first optimization of the corresponding camera between the ith frame and the current key frame, and updating the inverse depth of the feature point of the current key frame by using an inverse depth filtering method;
step 3.3: and finally, optimizing the pose change of a camera corresponding to the ith frame and the current key frame and the inverse depth of the feature point of the current key frame by adopting a method based on graph optimization, and updating the finally optimized feature point with the inverse depth into the map.
Formula of the basic graph optimization method:
F(x) = Σ_k e_k(x_k, z_k)^T Ω_k e_k(x_k, z_k)

x_k denotes a node (an unknown state variable), z_k denotes an edge (which can be understood as a constraint), and e_k denotes how well the nodes satisfy the constraint; Ω_k is the information matrix and represents the confidence in the constraint: if the error is large, the confidence is low. In the invention, the inverse depths of map points and the pose changes of the unmanned aerial vehicle are defined as nodes, the projection relations from images to map points and the SIM(3) changes between two adjacent frames are defined as edges, and GPS observations, if available, are also added to the edges. The information matrix contains the weights of the respective constraints. A local sliding window is defined whose range covers all key frames with common observations; after the nodes and edges of the graph optimization are defined, the error function over all variables in the local sliding window is minimized with the LM method to obtain the globally optimal solution, i.e. the optimal values of the pose changes and of the inverse depths of the feature points.
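A toy sketch of the F(x) = Σ e_k^T Ω_k e_k objective on a small graph is given below, using 2-D position nodes, hand-made relative constraints and scipy's Levenberg-Marquardt solver; the data and the solver choice are assumptions used only to show how nodes, edges and the information matrix enter the objective:

    import numpy as np
    from scipy.optimize import least_squares

    # Nodes: 2-D positions x0, x1, x2 (x0 is held fixed as the gauge).
    # Edges: relative-position measurements z_k with information matrices Omega_k.
    edges = [
        (0, 1, np.array([1.0, 0.0]), np.eye(2) * 100.0),   # confident odometry-like edge
        (1, 2, np.array([1.0, 0.2]), np.eye(2) * 100.0),
        (0, 2, np.array([2.1, 0.1]), np.eye(2) * 10.0),    # weaker loop-closure-like edge
    ]

    def residuals(flat):
        x = np.vstack([[0.0, 0.0], flat.reshape(-1, 2)])   # prepend the fixed node x0
        res = []
        for i, j, z, omega in edges:
            e = (x[j] - x[i]) - z                          # e_k(x_k, z_k)
            L = np.linalg.cholesky(omega)                  # Omega = L L^T, so ||L^T e||^2 = e^T Omega e
            res.append(L.T @ e)
        return np.concatenate(res)

    x_init = np.array([1.0, 0.0, 2.0, 0.0])                # initial guesses for x1, x2
    sol = least_squares(residuals, x_init, method='lm')    # minimizes F(x) = sum_k e_k^T Omega_k e_k
    print(sol.x.reshape(-1, 2))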
Step 3.4: if the camera pose change of the ith frame and the current key frame obtained in the step 3.3 is larger than the set pose change threshold, and the number of feature matching point pairs of the ith frame and the current key frame is smaller than 1/2 of the feature point number of the ith frame, determining the ith frame as a new key frame and updating the new key frame into the current key frame;
in order to increase the operation rate and reduce the amount of stored data, in this embodiment, if the camera pose change between the ith frame and the current key frame obtained in step 3.3 is greater than the set pose change threshold, and the feature matching point pair between the ith frame and the current key frame is less than 1/2 of the feature point number of the ith frame, the ith frame is used to replace the current key frame as a new key frame. Otherwise, returning to the step 3.1, and repeatedly processing the next acquired frame image.
The key frame is defined because the camera pose corresponding to the current key frame has larger change relative to the previous key frame, the observed three-dimensional environment information is obviously different from the previous key frame, and the three-dimensional environment information can be used as a reference to expand an environment map and detect whether a new frame can be used as the key frame. Compared with the method that each frame is used as a key frame to perform complex calculation, the method saves calculation resources and accelerates the overall calculation speed.
And 4, step 4: and (3) transmitting the acquired image and the point cloud map updated in the step (3) (the point cloud map is stored in a space point set form) back to the ground end for subsequent processing. The images are overlapped in a large amount, and in order to reduce the processing of data quantity, the images are clustered and classified by a hash image clustering method and are screened:
step 4.1: and clustering space points in the point cloud map by using a k-d tree algorithm, and setting a threshold value so that the number of points in each category is not greater than a set value 400. And taking the mean value of the space point coordinates in each cluster as the central space coordinates of the cluster, and forming a new point set by the central space coordinates of all clusters.
Step 4.2: selecting an image: and 4, projecting the space points in the new point set obtained in the step 4.1 onto each collected image, if the number of the projection points on the image is more than a certain threshold value, retaining the image, and otherwise, deleting the image. Repeated screening may result in a new set of images.
Step 4.3: and according to the image segmentation principle, segmenting the image set obtained in the step 4.2 to obtain image clusters, so that the number of images in each image cluster is smaller than a given threshold value. For image clusters that do not satisfy the number condition, they are segmented into smaller image clusters.
And 5: and (4) respectively extracting and matching the characteristics of the images in each image cluster obtained in the step (4) to generate a series of three-dimensional space points, generating patches by using the space points, and filtering out the patches with larger errors. The generation and the screening of the patches need to be carried out repeatedly, so that the patches are dense enough, and a dense point cloud map is obtained:
the patch p is a small rectangle approximately tangent to the surface of the scene to be reconstructed. Patch p is represented by a center point c (p) and a unit normal vector n (p).
Step 5.1: and (4) performing feature extraction on all images in each image cluster by using a DOG pyramid and a Harries operator. For a certain picture I in a cluster of picturesiSelecting the main optical axis and image I from other images in the image clusteriThe images with the included angle smaller than the set angle form an image set I (p). Finding an image I by means of feature matchingiMatching the feature matching point pairs of each image in the images I (p), generating three-dimensional space points by using the feature matching point pairs through a triangulation method, and matching the three-dimensional space points with the image IiCorresponding optical center O (I)i) The distances are arranged from small to large. The following methods are sequentially adopted to generate patches, and finally a series of initialized sparse patches are generated.
The patch p generation method comprises the following steps:
A patch p is represented by a center point c(p) and a normal vector n(p). The candidate patch center c(p) is initialized with the obtained three-dimensional space point, and the unit direction vector from c(p) to O(I_i) is taken as the normal vector n(p).

The center point c(p) of each patch p is obtained from the image I_i (the reference image R(p) of the patch p) and the corresponding matching image I_j in the image set I(p); the two images I_i and I_j are first added to the visible image set V(p) of the patch. The center point c(p) is projected onto the images in the image set I(p) other than I_j; if an image I_m contains a projection point, and the angle between the ray from c(p) to the optical center of image I_m and the direction of n(p) is larger than the set angle threshold, the image I_m is added to the visible image set V(p) of the patch p.

First, a photometric error function is defined:

g(p) = 1/|V(p)\R(p)| · Σ_{I ∈ V(p)\R(p)} h(p, I, R(p))

where V(p) denotes the visible image set of the patch p, R(p) denotes the reference image of the patch p, and I denotes any image in the visible image set V(p) other than the reference image R(p). h(p, I, R(p)) denotes the photometric error between images I and R(p). The arguments of the photometric error function are the center point c(p) of the patch p, the normal vector n(p) and the visible image set V(p). c(p), n(p) and V(p) are updated by minimizing the photometric error function of the patch p with the conjugate gradient method. If the number of images in the visible image set V(p) is greater than the set threshold, the patch p is generated successfully; otherwise the patch is rejected.

For a pair of images I_1 and I_2, the patch p is projected onto I_1 and I_2 respectively; bilinear interpolation is performed on the pixels in the projection areas to obtain the pixel gray values q(p, I_1) and q(p, I_2), and the NCC (normalized cross correlation) value of q(p, I_1) and q(p, I_2) is computed. Subtracting this NCC value from 1 gives the photometric error h(p, I_1, I_2) between images I_1 and I_2.
Step 5.2: diffusion of the initialized patches, generating dense patches:
For a patch p and the corresponding V(p), each image in V(p) is divided into image blocks C_i(x, y) of β × β pixels (x and y denote the indices of the image block, and i denotes the i-th image in V(p)); the patch p is projected onto each image in V(p), and the set of patches projected into image block C_i(x, y) is denoted Q_i(x, y); this process is repeated for all patches p generated successfully in step 5.1, yielding the patch set of each image block of every image in the image cluster;
For a patch p, the set of neighboring image blocks C(p) satisfying the following condition is found:
C(p) = { C_i(x′, y′) | p ∈ Q_i(x, y), |x − x′| + |y − y′| = 1 }
If a patch p′ is a neighbor of p and the image block C_i(x′, y′) into which p′ projects satisfies C_i(x′, y′) ∈ C(p), then C_i(x′, y′) is removed from C(p). A new patch is created for each remaining image block in C(p): the center of the new patch is the intersection of the ray through the center of the image block and the optical center with the plane in which the patch p lies, and the normal vector of the new patch is the same as that of the patch p. The center of the new patch is projected onto the images in the visible image set V(p) of the patch p; if the projection point lies on an image, that image is added to the visible image set of the new patch. If the number of images in the visible image set of the new patch is greater than a set threshold, one patch is considered to have been expanded successfully; otherwise the expansion of the next patch begins.
Step 5.3: for a plurality of patches in each image block, calculating the average correlation coefficient of a certain patch and the rest patches in the same image block by using the visibility consistency constraint, and deleting the patches of which the average correlation coefficients are larger than a set threshold;
Steps 5.2 and 5.3 are repeated N times, the dense diffusion is completed, and a dense space point set is obtained (the point cloud map is stored in the form of a space point set).
Step 6: performing Poisson reconstruction on the dense space point set obtained in the step 5 to obtain an equivalent, extracting an equivalent surface (formed by triangular patches) from the indication function by using the equivalent through a Marching Cube algorithm (MC method for short), wherein the equivalent surface is the surface of the point cloud, and obtaining a triangular mesh model formed by the triangular patches:
The dense space point set obtained in step 5 is uniformly sampled to obtain an oriented sample point set S; each sample point s (s ∈ S) in the set consists of its position coordinate s.p and an inward surface normal vector s.N. The sample points are uniformly distributed on the surface ∂M of the three-dimensional scene. Using the equivalence between the vector field formed by the inward surface normal vectors s.N of the sample points in S and the gradient field of the indicator function χ_M, the indicator function χ_M of the three-dimensional scene model is estimated; an iso-surface of the model is then found and extracted, and a seamless triangular mesh model is reconstructed for the model surface:
step 6.1: defining an octree according to the position coordinates of the sampling directed point set, giving a maximum tree depth D, and subdividing the octree to enable each sampling point to fall on a leaf node;
Step 6.2: a basis function F that can be translated, scaled and has unit integral is defined; the basis function F is the n-fold convolution of a box filter, and as n increases F approaches a Gaussian filter with unit variance. For each leaf node o of the octree a node function F_o with unit integral is set and expressed in terms of the basis function F; q denotes the sample position variable, and o.c and o.w are the center and width of node o respectively:

F_o(q) = F( (q − o.c) / o.w ) · 1 / o.w³

Step 6.3: so that the leaf-node functions F_o can represent the contribution of every sample, trilinear interpolation is used to distribute each sample point to its eight nearest leaf nodes, and the node functions F_o are used to represent the vector field V of the sample point set (Ngbr_D(s) are the eight depth-D leaf nodes nearest to s.p, and α_{o,s} are the trilinear interpolation weights):

V(q) = Σ_{s ∈ S} Σ_{o ∈ Ngbr_D(s)} α_{o,s} · F_o(q) · s.N
Step 6.4: for sample points near the surface of the three-dimensional model, the gradient of the indicator function equals the inward normal vector of the model surface, which gives the Poisson equation

Δχ_M = ∇·V

The vector field V is obtained in step 6.3, and the indicator function χ_M is solved by Laplacian matrix iteration, i.e. for each sample point s the value χ_M(s.p) of the indicator function is estimated, and the weighted average of the indicator function values at all sample points is taken as the iso-value.

Step 6.5: using the iso-value obtained in step 6.4, an iso-surface (composed of triangular patches) is extracted from the indicator function with the Marching Cubes algorithm (MC method for short); this iso-surface is the surface of the point cloud, and the triangular mesh model composed of triangular patches is obtained, as shown in fig. 3.
And 7: and (4) performing texture rendering on the triangular mesh model obtained in the step (6) to obtain a complete three-dimensional texture scene:
Step 7.1: first, the acquired images are registered. Then the triangular patches and the images in the triangular mesh model are screened and pruned: three triangular patches with the same vertex positions are considered to be the same patch, and only one is kept. Images that do not meet the conditions are deleted with a view-cone culling method and a texture consistency checking method; a Markov random field energy function E(l) is minimized with graph cuts and alpha-expansion, an optimal image l is selected for texture rendering of each triangular patch, and the triangular patch is projected back onto its optimal image to obtain the texture color of the triangular patch.

E(l) = Σ_{F_i ∈ Faces} E_data(F_i, l_i) + Σ_{(F_i, F_j) adjacent} E_smooth(F_i, F_j, l_i, l_j)

The E_data term represents, for each triangular patch F_i of all patches (Faces) and each image l_i in which F_i is visible, the data cost between them; E_data takes the form

E_data(F_i, l_i) = −∫_{φ(F_i, l_i)} ‖∇I_{l_i}(p)‖ dp

where, for each image l_i in which the triangular patch F_i is visible, φ(F_i, l_i) denotes the image area obtained by projecting F_i onto l_i, and ∇I_{l_i}(p) denotes the gradient of each pixel p inside the projection area φ(F_i, l_i).

The E_smooth term represents the seam visibility of the adjacent triangular patches F_i, F_j after texture rendering with the images l_i, l_j in which F_i and F_j are respectively visible. The E_smooth term is based on the Potts model: when the triangular patches F_i, F_j are texture-rendered with the same image (l_i = l_j), E_smooth is 0; otherwise it takes the set maximum value.
The texture consistency checking method specifically comprises the following steps:
1) all images in which the triangular patch F_i is visible are regarded as inliers;
2) the average color value c_i of the projection of F_i in each inlier image is computed, together with the covariance matrix Σ and the mean μ of all the average color values c_i;
3) for each image in which F_i is visible, a value is obtained with the Gaussian function exp(−½ (c_i − μ)ᵀ Σ⁻¹ (c_i − μ));
4) when this value is smaller than the set threshold, the corresponding image is removed;
5) steps 2) to 4) are repeated, and the iteration stops when the element values of the covariance matrix fall below a certain value, or the number of inliers falls below a set value, or the set number of iterations is reached.
Step 7.2: by step 7.1, after each triangular patch obtains the optimal image l, basic global adjustment needs to be performed on the color of the projection area of the common edge of the adjacent triangular patches in the image texture block corresponding to the triangular patch, then the Poisson editing is used for local adjustment, and finally the image texture block after color adjustment is used for performing texture rendering on the common edge of the adjacent triangular patches to obtain the color of the common edge of the adjacent triangular patches:
is known per seAnd obtaining the rectangular image texture block used for rendering the patch in the optimal image. Before adjusting the color of the common edge, the vertex v on the common edge of the adjacent patches can be regarded as two vertexes, corresponding to two colors. I.e. vertex vleftAnd vertex vrightPatches to the left and right of a common edge, respectively, vertex vleftAnd vertex vrightAll have their own independent corresponding texture block pixel colors
Figure GDA0003453900920000174
And
Figure GDA0003453900920000175
The color adjustment function is minimized using the conjugate gradient method:

∑_v ((f_{v_left} + g_{v_left}) − (f_{v_right} + g_{v_right}))² + ∑_{v_i, v_j} (g_{v_i} − g_{v_j})²

where the first sum runs over the vertices v on the common edges of adjacent patches and the second sum runs over pairs of adjacent vertices v_i, v_j on the same triangular patch. For a vertex v on a common edge of adjacent patches, g_{v_left} denotes the correction of the pixel color of the image texture block corresponding to vertex v_left, and g_{v_right} denotes the correction of the pixel color of the image texture block corresponding to vertex v_right. For adjacent vertices v_i, v_j on the same triangular patch, g_{v_i} and g_{v_j} denote the corrections of the pixel colors of the image texture blocks corresponding to v_i and v_j. Minimizing the first term makes the left adjusted color value f_{v_left} + g_{v_left} and the right adjusted color value f_{v_right} + g_{v_right} of the pixel corresponding to a vertex v on a common edge of adjacent patches as consistent as possible; minimizing the second term makes the color difference between adjacent vertices v_i, v_j on the same patch as small as possible.
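For illustration only, a Python sketch (one color channel) of how the least-squares color adjustment just described could be assembled and solved with a conjugate gradient solver; the variable layout, the smoothness weight lam, the small diagonal regularization and the function name are assumptions introduced here.

import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import cg

def solve_seam_corrections(f_left, f_right, same_patch_pairs, lam=0.1):
    # f_left, f_right: length-V arrays with the seam-vertex colors sampled from
    # the left and right patch.  same_patch_pairs: (i, j) index pairs of seam
    # vertices adjacent on the same patch.  Returns (g_left, g_right).
    V = len(f_left)
    n = 2 * V                       # unknowns: g_left = g[0:V], g_right = g[V:2V]
    A = lil_matrix((n, n))
    b = np.zeros(n)
    # First term: ((f_left[i] + g_left[i]) - (f_right[i] + g_right[i]))^2
    for i in range(V):
        r = f_left[i] - f_right[i]
        idx = [(i, 1.0), (V + i, -1.0)]
        for (u, su) in idx:
            for (w, sw) in idx:
                A[u, w] += su * sw
            b[u] -= su * r
    # Second term: lam * (g[i] - g[j])^2 for vertices adjacent on the same patch,
    # applied to the left-side and the right-side corrections alike.
    for off in (0, V):
        for (i, j) in same_patch_pairs:
            A[off + i, off + i] += lam
            A[off + j, off + j] += lam
            A[off + i, off + j] -= lam
            A[off + j, off + i] -= lam
    for k in range(n):
        A[k, k] += 1e-6             # keep the normal equations well-posed
    g, _ = cg(A.tocsr(), b, atol=1e-8)
    return g[:V], g[V:]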
After the pixel color correction g_v of the image texture block corresponding to each vertex v on the common edge of adjacent triangular patches has been obtained (comprising g_{v_left} and g_{v_right}), the color corrections of all pixels in the image texture block region corresponding to the common edge of adjacent triangular patches are computed by barycentric-coordinate interpolation, giving the adjusted color values of all pixels in that region. Although this global adjustment provides adjusted color values for all pixels in the image texture block region corresponding to the common edge of adjacent triangular patches, the left and right adjusted color values of these pixels cannot be made completely identical, so an obvious boundary remains in the color of that region, and the color of the image texture block region corresponding to the common edge of adjacent triangular patches is adjusted further as follows:
As shown in FIG. 4, vertex v_1 is a vertex on the common edge of two adjacent triangular patches. The two edges of the image texture block corresponding to the triangular patch represented by the dark gray region that meet at vertex v_1 are uniformly sampled. For each of these two edges, every sampled pixel is assigned a weight that varies linearly from 0 to 1 according to its distance from vertex v_1, as shown in FIG. 5; the adjusted color values of the sampled pixels are multiplied by their weights and summed, giving one accumulated color value per edge. The two accumulated edge color values are combined according to the ratio of the lengths of the two edges to obtain the color mean of vertex v_1 with respect to the dark gray triangular patch. The color mean of vertex v_1 with respect to the triangular patch represented by the light gray region is obtained in the same way, and the color sample of vertex v_1 is the mean of these two color means. For the other sampling points on the common edge, their projected pixel positions are found in the image texture blocks corresponding to the dark gray triangular patch and to the light gray triangular patch respectively, and the average of the two colors is taken as the color of the sampling point, so that the colors of the projection regions corresponding to the common edge of the dark gray and light gray triangular patches become consistent.
After the colors of the projection regions of the common edges of adjacent triangular patches in the corresponding image texture blocks have been made consistent, Poisson editing is performed on the 20-pixel-wide area at the edge of each image texture block (the dark gray annular area), as shown in FIG. 6. The triangular patches are then texture-rendered with the locally color-adjusted image texture blocks (each texture block is back-projected onto its triangular patch to obtain the texture color), so that the rendered triangular patches remain texture-consistent. This completes the reconstruction of the three-dimensional scene with texture details.
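For illustration only, a minimal single-channel Python sketch of a Poisson edit restricted to the border ring of a texture block; the Jacobi iteration, the ring construction and the parameter names are assumptions introduced here and do not represent the embodiment's implementation.

import numpy as np

def poisson_edit_ring(block, target_border, border_px=20, iters=500):
    # block: HxW array, the texture block after global color adjustment.
    # target_border: HxW array whose outermost 1-pixel frame holds the seam
    # colors agreed with the neighbouring patch (only that frame is used).
    # The interior stays fixed; the guidance gradients come from block itself.
    blk = np.asarray(block, dtype=np.float64)
    h, w = blk.shape
    out = blk.copy()
    out[0, :], out[-1, :] = target_border[0, :], target_border[-1, :]
    out[:, 0], out[:, -1] = target_border[:, 0], target_border[:, -1]
    ring = np.zeros((h, w), dtype=bool)          # border_px-wide ring, excluding the outer frame
    ring[1:border_px, 1:-1] = True
    ring[-border_px:-1, 1:-1] = True
    ring[1:-1, 1:border_px] = True
    ring[1:-1, -border_px:-1] = True
    # Guidance field: discrete Laplacian of the original block.
    lap = (np.roll(blk, 1, 0) + np.roll(blk, -1, 0) +
           np.roll(blk, 1, 1) + np.roll(blk, -1, 1) - 4.0 * blk)
    for _ in range(iters):                       # Jacobi iterations of the Poisson equation
        nb = (np.roll(out, 1, 0) + np.roll(out, -1, 0) +
              np.roll(out, 1, 1) + np.roll(out, -1, 1))
        out[ring] = 0.25 * (nb[ring] - lap[ring])
    return out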
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.

Claims (9)

1. An unmanned aerial vehicle real-time three-dimensional scene reconstruction method based on real-time synchronous positioning and map construction is characterized in that: the method comprises the following steps:
step 1: collecting and processing images:
the unmanned aerial vehicle airborne camera collects a series of images, transmits the images to the computing unit in real time, and utilizes the camera calibration data to perform distortion removal processing on each collected frame of image to obtain a distortion-removed image;
step 2: performing the following processing on the first two frames of images obtained in the step 1 to obtain an initialized map:
step 2.1: setting a first frame image as a first key frame; extracting and storing SIFT feature points of the images of the first two frames, and arbitrarily setting the initial inverse depth of the feature points extracted from the first two frames;
step 2.2: solving an essential matrix E between the two previous frames by utilizing a characteristic point matching relation, and decomposing a rotation matrix R and a translation matrix t of camera pose change corresponding to the second frame image to the first frame image through SVD;
step 2.3: according to the rotation matrix R and the translation matrix t of the camera pose change corresponding to the second frame image to the first frame image obtained in the step 2.2, calculating the inverse depth of the feature points in the first two frames by an inverse depth filtering method, and obtaining the corresponding space coordinates of the feature points in the three-dimensional space, so as to complete the initialization of the three-dimensional point cloud map;
Step 3: the following processing is performed on the i-th frame image obtained in real time to complete the tracking and composition process, where i = 3, 4, 5, ...:
step 3.1: after an i frame distortion-removed image is obtained, the camera pose change between the i frame and the current key frame is solved by taking the current key frame as a reference:
extracting SIFT feature points of the ith frame image, initializing the inverse depth of each feature point, solving an essential matrix E between the ith frame image and the current key frame by using a feature point matching relation, and decomposing an initial value of a pose change SE (3) matrix between the ith frame and the current key frame through SVD;
For a feature matching point pair p_1, p_2 between the current key frame and the i-th frame: knowing the feature point p_1 in the current key frame and its initial inverse depth d, the position P of p_1 in three-dimensional space can be obtained; the pose change SE(3) matrix is used to re-project the three-dimensional point P onto the i-th frame image to obtain a projection point, and the re-projection error e between this projection point and the feature matching point p_2 corresponding to p_1 in the i-th frame is calculated. Taking the pose change SE(3) matrix and the inverse depth d as optimization variables, the pose change SE(3) matrix between the current key frame and the i-th frame and the inverse depth d' of the feature points in the current key frame are optimized so that the re-projection error is minimized;
calculating the ratio d '/d of the inverse depth optimized values d' of all the feature points in the current key frame to the inverse depth d before optimization, and taking the weighted average of the ratios of all the feature points in the current key frame as a scale parameter s of pose change between cameras corresponding to the ith frame and the current key frame; obtaining a camera pose change SIM (3) matrix between the ith frame and the current key frame according to the optimized pose change SE (3) matrix and the scale parameter s;
step 3.2: obtaining a matching relation between the first optimized camera pose change SIM (3) matrix and the feature points between the ith frame and the current key frame according to the step 3.1, and updating the inverse depth of the feature points of the current key frame by using an inverse depth filtering method;
step 3.3: optimizing a camera pose change SIM (3) matrix between the ith frame and the current key frame and the inverse depth of the feature point of the current key frame again by adopting a method based on graph optimization, and updating the finally optimized feature point with the inverse depth into a map;
step 3.4: if the camera pose change between the ith frame and the current key frame obtained in the step 3.3 is larger than the set pose change threshold value, and the feature matching point number of the ith frame and the current key frame is smaller than 1/2 of the feature point number of the ith frame, determining the ith frame as a new key frame and updating the new key frame into the current key frame;
Step 4: the collected images and the point cloud map updated in step 3 are returned to the ground terminal for subsequent processing:
step 4.1: clustering space points in the point cloud map by using a k-d tree algorithm, taking the mean value of space point coordinates in each cluster as a central space coordinate of the cluster, and forming a new point set by using the central space coordinates of all clusters;
step 4.2: projecting the space points in the new point set obtained in the step 4.1 onto each collected image, if the number of the projection points on the image is greater than a set threshold value, retaining the image, otherwise, deleting the image; obtaining a new image set through screening;
step 4.3: according to the image segmentation principle, segmenting the image set obtained in the step 4.2 to obtain image clusters, and enabling the number of images in each image cluster to be smaller than a given threshold value;
Step 5: feature extraction and matching are performed on the images in each image cluster obtained in step 4 to generate a series of three-dimensional space points, patches are generated from these space points, and patches with large errors are filtered out to obtain a dense space point set:
Step 5.1: feature extraction is performed on all images in each image cluster using a DoG pyramid and the Harris operator. For a given image I_i in an image cluster, the images in the cluster whose main optical axis makes an angle smaller than a set angle with that of image I_i form an image set I(p); feature matching point pairs between image I_i and each image in I(p) are found by feature matching, three-dimensional space points are generated from the feature matching point pairs by triangulation, and these three-dimensional space points are sorted in increasing order of their distance from the optical center O(I_i) corresponding to image I_i. Patches are then generated one by one with the following method, giving a series of initialized sparse patches:
the patch p generation method comprises the following steps:
A patch p is represented by a center point c(p) and a normal vector n(p). The center point c(p) of a candidate patch is initialized with an obtained three-dimensional space point, and the unit direction vector from c(p) to O(I_i) is taken as the normal vector n(p);
The center point c(p) of each patch p is obtained from the image I_i and the corresponding matching image I_j in the image set I(p); the two images I_i and I_j are first added to the visible image set V(p) of the patch. The center point c(p) is then projected onto the images of the image set I(p) other than I_j; if an image I_m contains a projection point and the angle between the ray from c(p) to the optical center of image I_m and the direction n(p) is larger than the set angle threshold, the image I_m is added to the visible image set V(p) of patch p;
defining a photometric error function:
(1 / |V(p) \ R(p)|) ∑_{I ∈ V(p) \ R(p)} h(p, I, R(p))

where V(p) denotes the visible image set of the patch p, R(p) denotes the reference image of the patch p, I denotes any image in the visible image set V(p) other than the reference image R(p), and h(p, I, R(p)) denotes the photometric error between images I and R(p). The independent variables of the photometric error function are the center point c(p) of the patch p, the normal vector n(p) and the visible image set V(p). The photometric error function of the patch p is minimized by the conjugate gradient method and c(p), n(p), V(p) are updated; if the number of images in the visible image set V(p) is larger than a set threshold, the patch p is generated successfully; otherwise the patch is rejected;
step 5.2: diffusion of the initialized patches, generating dense patches:
For a patch p and its corresponding V(p), each image in V(p) is divided into image blocks C_i(x, y) of β × β pixels, where x and y are the indices of the image block and i indicates the i-th image in V(p). The patch p is projected onto each image in V(p), and the set of patches projected into image block C_i(x, y) is denoted Q_i(x, y). This process is repeated for all patches p generated successfully in step 5.1, giving the patch set of every image block of each image in the image cluster;
For a given patch p, the neighborhood image block set C(p) satisfying the following condition is found:
C(p) = { C_i(x′, y′) | p ∈ Q_i(x, y), |x − x′| + |y − y′| = 1 }
When a patch p′ has a neighboring relationship with p and the image block C_i(x′, y′) corresponding to p′ satisfies C_i(x′, y′) ∈ C(p), C_i(x′, y′) is deleted from C(p). A new patch is established for each remaining image block in C(p): the center of the new patch is the intersection of the ray through the center of the image block and the optical center with the plane in which patch p lies, and the normal vector of the new patch is identical to the normal vector of patch p. The center of the new patch is projected onto the images in the visible image set V(p) of patch p; if the projection point lies on an image, that image is added to the visible image set of the new patch. If the number of images in the visible image set of the new patch is greater than a set threshold, the patch is considered to be diffused successfully; otherwise the diffusion fails and the next patch is diffused;
step 5.3: for a plurality of patches in each image block, calculating the average correlation coefficient of a certain patch and the rest patches in the same image block by using the visibility consistency constraint, and deleting the patches of which the average correlation coefficients are larger than a set threshold;
step 5.4: repeating the step 5.2 and the step 5.3N times, and completing dense diffusion to obtain a dense space point set;
Step 6: Poisson reconstruction is performed on the dense space point set obtained in step 5 to obtain an isovalue, and the isosurface corresponding to this isovalue is extracted from the indicator function by the marching cubes algorithm as the surface of the point cloud, giving a triangular mesh model composed of triangular patches;
Step 7: texture rendering is performed on the triangular mesh model obtained in step 6 to obtain a complete three-dimensional textured scene:
Step 7.1: the acquired images are registered, and the triangular patches and the images in the triangular mesh model are then screened: triangular patches whose three vertices have identical positions are regarded as the same patch and only one of them is kept; images that do not meet the conditions are deleted using a cone-culling method and a texture consistency checking method; a Markov random field energy function E(l) is minimized with the graph cut and α-expansion method to select, for each triangular patch, an optimal image l for texture rendering, and each triangular patch is projected back onto its optimal image to obtain its texture color;
E(l) = ∑_{F_i ∈ Faces} E_data(F_i, l_i) + ∑_{(F_i, F_j) adjacent} E_smooth(F_i, F_j, l_i, l_j)

where the E_data term represents, for each triangular patch F_i among all patches Faces and each image l_i in which F_i is visible, the data cost between them, in the form

E_data(F_i, l_i) = −∑_{p ∈ φ(F_i, l_i)} ‖∇I_{l_i}(p)‖

For each image l_i in which the triangular patch F_i is visible, φ(F_i, l_i) denotes the region of l_i onto which F_i is projected, and ∇I_{l_i}(p) denotes the gradient of each pixel p in the projection region φ(F_i, l_i); the E_smooth term measures the seam visibility of the adjacent triangular patches F_i, F_j after they have been texture-rendered with images l_i, l_j in which F_i and F_j are respectively visible;
step 7.2: after each triangular patch obtains the optimal image l, the color of the projection area of the common edge of the adjacent triangular patches in the image texture block corresponding to the triangular patch is adjusted globally firstly, then Poisson editing is used for local adjustment, and finally the color-adjusted image texture block is used for texture rendering of the common edge of the adjacent triangular patches to obtain the color of the common edge of the adjacent triangular patches.
2. The real-time three-dimensional scene reconstruction method of the unmanned aerial vehicle based on the real-time synchronous positioning and the map building as claimed in claim 1, wherein: the inverse depth filtering method in the step 2 comprises the following steps:
For two adjacent frames of images I_r and I_k, given the position coordinates of the feature matching point pair u_i, u_i′ in their respective images and the camera pose change from image I_r to image I_k, an inverse depth range is set for the feature point u_i in image I_r,
and the epipolar-line positions at which the inverse depths of the two endpoints of this inverse depth range appear in image I_k are calculated; the perpendicular distance between the feature point u_i′ in image I_k corresponding to u_i and the epipolar line is computed, and if this distance is less than the set threshold, the inverse depth of the feature point u_i is calculated from the coordinate relationship of the feature matching point pair u_i, u_i′ and the camera pose between image I_r and image I_k; otherwise the inverse depth calculation for the feature point u_i fails.
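Illustrative reference sketch only (not part of the claims): one possible way to carry out the epipolar check of the inverse depth filter in Python; the coordinate conventions, the pixel threshold and the assumption of a strictly positive inverse-depth interval are introduced here for illustration.

import numpy as np

def epipolar_check(K, R, t, u_i, u_i_prime, inv_depth_range, max_dist=2.0):
    # K: 3x3 intrinsics; R, t: pose of frame I_k with respect to I_r.
    # u_i: pixel (x, y) in I_r; u_i_prime: matched pixel (x, y) in I_k.
    # inv_depth_range: (rho_min, rho_max), assumed strictly positive.
    ray = np.linalg.inv(K) @ np.array([u_i[0], u_i[1], 1.0])   # bearing of u_i in I_r

    def project(inv_depth):
        X = ray / inv_depth                      # 3D point in the I_r frame
        x = K @ (R @ X + np.asarray(t, dtype=float))
        return x[:2] / x[2]

    a = project(inv_depth_range[0])
    b = project(inv_depth_range[1])
    d = b - a                                    # epipolar segment direction
    n = np.array([-d[1], d[0]])
    n /= np.linalg.norm(n) + 1e-12
    # Perpendicular distance of the matched point from the epipolar line.
    dist = abs(n @ (np.asarray(u_i_prime, dtype=float) - a))
    return dist < max_dist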
3. The real-time three-dimensional scene reconstruction method of the unmanned aerial vehicle based on the real-time synchronous positioning and the map building as claimed in claim 2, wherein: the initial inverse depth d of the key frame in step 3 is obtained as follows: when a frame becomes a new key frame, the initial inverse depth d of any feature point p_i′ in the new key frame is the weighted average of d_1 and d_2, where d_1 is the inverse depth of the feature point p_i in the previous key frame corresponding to p_i′, and d_2 is the inverse depth of the feature point p_i′ in the new key frame obtained by the inverse depth filtering method from the camera pose change and the feature point matching relation between the new key frame and the previous key frame.
4. The real-time three-dimensional scene reconstruction method of the unmanned aerial vehicle based on the real-time synchronous positioning and the map building as claimed in claim 3, wherein: in step 4.1, a threshold is set so that the number of points in each cluster is not greater than the set value.
5. The real-time three-dimensional scene reconstruction method of the unmanned aerial vehicle based on the real-time synchronous positioning and the map building as claimed in claim 3, wherein: in step 5.1, the photometric error between the two images is obtained by adopting the following process:
For a pair of images I_1 and I_2, the patch p is projected onto image I_1 and image I_2 respectively, bilinear interpolation is applied to the pixels in the projection regions to obtain the pixel gray values q(p, I_1) and q(p, I_2), the NCC value of q(p, I_1) and q(p, I_2) is computed, and 1 minus this NCC value gives the photometric error h(p, I_1, I_2) between images I_1 and I_2.
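Illustrative reference sketch only (not part of the claims): the 1 − NCC photometric error of claim 5 could be computed as below, assuming q1 and q2 are the grids of bilinearly interpolated gray values already obtained by projecting the patch into the two images; the small epsilon guarding against division by zero is an assumption.

import numpy as np

def photometric_error(q1, q2):
    # q1, q2: equally shaped arrays of interpolated gray values q(p, I1), q(p, I2).
    a = q1.astype(float).ravel() - q1.mean()
    b = q2.astype(float).ravel() - q2.mean()
    ncc = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 1.0 - ncc                             # photometric error h(p, I1, I2)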
6. The real-time three-dimensional scene reconstruction method of the unmanned aerial vehicle based on the real-time synchronous positioning and the map building as claimed in claim 3, wherein: in step 6 the dense space point set obtained in step 5 is uniformly sampled to obtain a directed sample point set S, in which each sample point s comprises its position coordinates s.p and an inward surface normal vector s.N, and the sample points are uniformly distributed on the surface of the three-dimensional scene; using the equivalence between the vector field formed by the inward surface normal vectors s.N of the sample points in the directed sample point set and the gradient field of the indicator function χ_M, the indicator function χ_M of the three-dimensional scene model is estimated, an isosurface of the model is then found and extracted, and a seamless triangular mesh model is reconstructed for the model surface.
7. The real-time three-dimensional scene reconstruction method of the unmanned aerial vehicle based on the real-time synchronous positioning and the map building as claimed in claim 6, wherein: the step 6 is realized by the following steps:
step 6.1: defining an octree according to the position coordinates of the sampling directed point set, giving a maximum tree depth D, and subdividing the octree to enable each sampling point to fall on a leaf node;
Step 6.2: a basis function F that can be translated, scaled and has unit integral is defined; for each leaf node o of the octree, a node function F_o with unit integral is set and expressed through the basis function F, where q denotes the sample position variable and o.c and o.w are the center and width of node o, respectively:

F_o(q) = F((q − o.c) / o.w) · (1 / o.w³)
Step 6.3: trilinear (cubic linear) interpolation is used to distribute each sample point over its eight nearest-neighbor leaf nodes, and the node functions F_o are used to represent the vector field V of the sample point set:

V(q) = ∑_{s ∈ S} ∑_{o ∈ Ngbr_D(s)} α_{o,s} F_o(q) s.N

where Ngbr_D(s) denotes the eight leaf nodes of depth D nearest to s.p, and α_{o,s} are the trilinear interpolation weights;
step 6.4: for sampling points near the surface of the three-dimensional model, the gradient of the indicator function is equal to the inward normal vector of the model surface, and a Poisson equation is obtained:
Δχ_M = ∇ · V

with the vector field V obtained in step 6.3, the indicator function χ_M is solved by Laplacian matrix iteration; for each sample point s the value χ_M(s.p) of the indicator function is estimated, and the weighted average of the indicator function values of all sample points is taken as the isovalue;
Step 6.5: using the isovalue obtained in step 6.4, an isosurface is extracted from the indicator function by the marching cubes algorithm as the surface of the point cloud, giving a triangular mesh model composed of triangular patches.
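Illustrative reference sketch only (not part of the claims): the trilinear ("cubic linear") weights α_{o,s} of step 6.3 could be computed as below for one sample point, assuming the eight leaf-node centers enclose the sample position in a cube of side o.w; the layout of the eight nodes and the function name are assumptions.

import numpy as np

def trilinear_weights(p, node_centers, node_width):
    # p: sample position s.p (3,); node_centers: (8, 3) centers o.c of the eight
    # nearest leaf nodes of depth D; node_width: their common width o.w.
    d = 1.0 - np.abs(np.asarray(node_centers, dtype=float) - np.asarray(p, dtype=float)) / node_width
    d = np.clip(d, 0.0, 1.0)
    w = d.prod(axis=1)                           # separable tent weights per node
    return w / (w.sum() + 1e-12)                 # weights alpha_{o,s}, summing to 1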
8. The real-time three-dimensional scene reconstruction method of the unmanned aerial vehicle based on the real-time synchronous positioning and the map building as claimed in claim 3, wherein: in step 7.1, the E_smooth term is based on the Potts model: when the adjacent triangular patches F_i, F_j are texture-rendered with the same image, E_smooth is 0; otherwise it takes the set maximum value.
9. The real-time three-dimensional scene reconstruction method for the unmanned aerial vehicle based on real-time synchronous positioning and mapping as claimed in claim 8, wherein: the texture consistency checking method in step 7.1 is specifically as follows:
1) regard all images in which the triangular patch F_i is visible as interior points;
2) compute the average color value c_i of the projection of F_i in each interior-point image, and the covariance matrix Σ and the mean μ of all the average color values c_i;
3) for each image in which F_i is visible, evaluate the Gaussian function

exp(−(c_i − μ)^T Σ^{-1} (c_i − μ) / 2)

to obtain a value;
4) when the value is smaller than the set threshold, remove the corresponding image;
5) repeat steps 2) to 4), and stop iterating when the element values of the covariance matrix fall below the corresponding set value, or the number of interior points becomes less than the corresponding set value, or the set number of iterations is reached.
CN201810452378.3A 2018-05-12 2018-05-12 Unmanned aerial vehicle real-time three-dimensional scene reconstruction method capable of realizing real-time synchronous positioning and map construction Active CN108648270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810452378.3A CN108648270B (en) 2018-05-12 2018-05-12 Unmanned aerial vehicle real-time three-dimensional scene reconstruction method capable of realizing real-time synchronous positioning and map construction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810452378.3A CN108648270B (en) 2018-05-12 2018-05-12 Unmanned aerial vehicle real-time three-dimensional scene reconstruction method capable of realizing real-time synchronous positioning and map construction

Publications (2)

Publication Number Publication Date
CN108648270A CN108648270A (en) 2018-10-12
CN108648270B true CN108648270B (en) 2022-04-19

Family

ID=63754992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810452378.3A Active CN108648270B (en) 2018-05-12 2018-05-12 Unmanned aerial vehicle real-time three-dimensional scene reconstruction method capable of realizing real-time synchronous positioning and map construction

Country Status (1)

Country Link
CN (1) CN108648270B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046698B (en) * 2018-10-12 2023-06-20 锥能机器人(上海)有限公司 Visual positioning method and system for visual editing
CN109377560A (en) * 2018-10-26 2019-02-22 北京理工大学 A kind of method of Outdoor Augmented Reality military simulation-based training
CN109556616A (en) * 2018-11-09 2019-04-02 同济大学 A kind of automatic Jian Tu robot of view-based access control model label builds figure dressing method
CN109544615B (en) * 2018-11-23 2021-08-24 深圳市腾讯信息技术有限公司 Image-based repositioning method, device, terminal and storage medium
CN109341707B (en) * 2018-12-03 2022-04-08 南开大学 Method for constructing three-dimensional map of mobile robot in unknown environment
CN111260698B (en) * 2018-12-03 2024-01-02 北京魔门塔科技有限公司 Binocular image feature matching method and vehicle-mounted terminal
CN109685879B (en) * 2018-12-13 2023-09-29 广东启慧城市信息有限公司 Method, device, equipment and storage medium for determining multi-view image texture distribution
CN111247563A (en) * 2019-03-12 2020-06-05 深圳市大疆创新科技有限公司 Image processing method, device and system
CN109947886B (en) * 2019-03-19 2023-01-10 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN110060332B (en) * 2019-04-09 2022-12-02 上海科技大学 High-precision three-dimensional mapping and modeling system based on airborne acquisition equipment
CN110322492B (en) * 2019-07-03 2022-06-07 西北工业大学 Space three-dimensional point cloud registration method based on global optimization
CN110675483B (en) * 2019-07-17 2022-09-09 电子科技大学 Dense vision SLAM-based rapid reconstruction method for three-dimensional map of unmanned aerial vehicle
CN110517348B (en) * 2019-07-23 2023-01-06 西安电子科技大学 Target object three-dimensional point cloud reconstruction method based on image foreground segmentation
CN110599545B (en) * 2019-09-06 2022-12-02 电子科技大学中山学院 Feature-based dense map construction system
CN114170307A (en) * 2019-09-30 2022-03-11 深圳市瑞立视多媒体科技有限公司 Active rigid body pose positioning method in single-camera environment and related equipment
CN110751223B (en) * 2019-10-25 2022-09-30 北京达佳互联信息技术有限公司 Image matching method and device, electronic equipment and storage medium
CN111192313B (en) * 2019-12-31 2023-11-07 深圳优地科技有限公司 Method for constructing map by robot, robot and storage medium
CN111311750B (en) * 2020-01-17 2022-06-21 武汉大学 Mosaic line network global optimization method based on constrained triangulation network
CN111311708B (en) * 2020-01-20 2022-03-11 北京航空航天大学 Visual SLAM method based on semantic optical flow and inverse depth filtering
CN111369660B (en) * 2020-03-02 2023-10-13 中国电子科技集团公司第五十二研究所 Seamless texture mapping method of three-dimensional model
CN111461998A (en) * 2020-03-11 2020-07-28 中国科学院深圳先进技术研究院 Environment reconstruction method and device
CN111415420B (en) * 2020-03-25 2024-01-23 北京迈格威科技有限公司 Spatial information determining method and device and electronic equipment
CN111882668B (en) * 2020-07-30 2022-06-24 清华大学 Multi-view three-dimensional object reconstruction method and system
CN112419497A (en) * 2020-11-13 2021-02-26 天津大学 Monocular vision-based SLAM method combining feature method and direct method
CN112308963B (en) * 2020-11-13 2022-11-08 四川川大智胜软件股份有限公司 Non-inductive three-dimensional face reconstruction method and acquisition reconstruction system
CN112434709B (en) * 2020-11-20 2024-04-12 西安视野慧图智能科技有限公司 Aerial survey method and system based on unmanned aerial vehicle real-time dense three-dimensional point cloud and DSM
CN112650422B (en) * 2020-12-17 2022-07-29 咪咕文化科技有限公司 AR interaction method and device for equipment, electronic equipment and storage medium
CN114252051A (en) * 2021-11-01 2022-03-29 杭州迅蚁网络科技有限公司 Method and system for estimating airway height of unmanned aerial vehicle
CN114022619B (en) * 2021-11-26 2022-09-23 贝壳找房(北京)科技有限公司 Image pose optimization method and apparatus, device, storage medium, and program product
CN113837326B (en) * 2021-11-30 2022-03-25 自然资源部第一海洋研究所 Airborne laser sounding data registration method based on characteristic curve
CN115512242B (en) * 2022-07-22 2023-05-30 北京微视威信息科技有限公司 Scene change detection method and flight device
CN115115847B (en) * 2022-08-31 2022-12-16 海纳云物联科技有限公司 Three-dimensional sparse reconstruction method and device and electronic device
CN116558504B (en) * 2023-07-11 2023-09-29 之江实验室 Monocular vision positioning method and device
CN117635875A (en) * 2024-01-25 2024-03-01 深圳市其域创新科技有限公司 Three-dimensional reconstruction method, device and terminal

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537709A (en) * 2014-12-15 2015-04-22 西北工业大学 Real-time three-dimensional reconstruction key frame determination method based on position and orientation changes
CN105678754A (en) * 2015-12-31 2016-06-15 西北工业大学 Unmanned aerial vehicle real-time map reconstruction method
CN106097304A (en) * 2016-05-31 2016-11-09 西北工业大学 A kind of unmanned plane real-time online ground drawing generating method
CN106097436A (en) * 2016-06-12 2016-11-09 广西大学 A kind of three-dimensional rebuilding method of large scene object

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9996976B2 (en) * 2014-05-05 2018-06-12 Avigilon Fortress Corporation System and method for real-time overlay of map features onto a video feed
US11232583B2 (en) * 2016-03-25 2022-01-25 Samsung Electronics Co., Ltd. Device for and method of determining a pose of a camera

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SLAM Research Based on UAV Vision; Guo Run; China Excellent Master's Theses Electronic Journal Network (《中国优秀硕士论文电子期刊网》); 20180215; full text *

Also Published As

Publication number Publication date
CN108648270A (en) 2018-10-12

Similar Documents

Publication Publication Date Title
CN108648270B (en) Unmanned aerial vehicle real-time three-dimensional scene reconstruction method capable of realizing real-time synchronous positioning and map construction
Moreau et al. Lens: Localization enhanced by nerf synthesis
CN112085845B (en) Outdoor scene rapid three-dimensional reconstruction device based on unmanned aerial vehicle image
CN112085844B (en) Unmanned aerial vehicle image rapid three-dimensional reconstruction method for field unknown environment
US20220028163A1 (en) Computer Vision Systems and Methods for Detecting and Modeling Features of Structures in Images
US7583275B2 (en) Modeling and video projection for augmented virtual environments
WO2018061010A1 (en) Point cloud transforming in large-scale urban modelling
CN110728671B (en) Dense reconstruction method of texture-free scene based on vision
EP2218057A1 (en) Geospatial modeling system and related method using multiple sources of geographic information
Li et al. Dense surface reconstruction from monocular vision and LiDAR
CN114648640B (en) Target object monomer method, device, equipment and storage medium
CN114202632A (en) Grid linear structure recovery method and device, electronic equipment and storage medium
CN116563493A (en) Model training method based on three-dimensional reconstruction, three-dimensional reconstruction method and device
CN114494589A (en) Three-dimensional reconstruction method, three-dimensional reconstruction device, electronic equipment and computer-readable storage medium
CN116051747A (en) House three-dimensional model reconstruction method, device and medium based on missing point cloud data
Jang et al. Egocentric scene reconstruction from an omnidirectional video
Hu et al. IMGTR: Image-triangle based multi-view 3D reconstruction for urban scenes
Zhao et al. A review of 3D reconstruction from high-resolution urban satellite images
CN113902802A (en) Visual positioning method and related device, electronic equipment and storage medium
Frommholz et al. Reconstructing buildings with discontinuities and roof overhangs from oblique aerial imagery
Le Besnerais et al. Dense height map estimation from oblique aerial image sequences
EP2879090B1 (en) Aligning ground based images and aerial imagery
CN114758087B (en) Method and device for constructing urban information model
CN113129422A (en) Three-dimensional model construction method and device, storage medium and computer equipment
Bethmann et al. Object-based semi-global multi-image matching

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant