CN117726747A - Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene

Info

Publication number: CN117726747A
Authority: CN (China)
Prior art keywords: aerial survey, image, depth map, weak texture, pose
Legal status: Pending
Application number: CN202311461325.5A
Other languages: Chinese (zh)
Inventors: 郑川江, 贾学富, 杨心宇
Current Assignee: South Surveying & Mapping Technology Co ltd
Original Assignee: South Surveying & Mapping Technology Co ltd
Application filed by South Surveying & Mapping Technology Co ltd
Priority to CN202311461325.5A
Publication of CN117726747A


Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a three-dimensional reconstruction method, a device, a storage medium and equipment for complementing a weak texture scene, which comprises the following steps: acquiring a plurality of aerial survey images and GPS coordinates shot for a target scene under different view angles; inputting each aerial survey image into a preset target detection network model so as to mark a weak texture scene area contained in the aerial survey image; generating a weak texture segmentation mask based on the weak texture scene region; performing space-three pose calculation according to all aerial survey images and GPS coordinates to obtain a sparse point cloud of the target scene and the pose of each aerial survey image; performing multi-view stereo matching based on the sparse point cloud and the poses, and estimating a depth map for each aerial survey image; completing the depth map based on the weak texture segmentation mask to obtain an optimized depth map; fusing the sparse point cloud and the optimized depth maps to generate a dense point cloud; and generating a three-dimensional model based on the dense point cloud. Three-dimensional information of the weak texture region can be recovered, and the quality of the three-dimensional model is improved.

Description

Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene
Technical Field
The present invention relates to the field of three-dimensional reconstruction technologies, and in particular, to a three-dimensional reconstruction method, apparatus, storage medium, and device for complementing a weak texture scene.
Background
The traditional photogrammetric three-dimensional reconstruction method has an extremely poor reconstruction effect in weak texture areas (especially water areas), where holes easily appear, and later manual model repair involves complex procedures and a large workload; therefore, the prior art offers no method that is convenient and fast and achieves high modeling quality for three-dimensional reconstruction tasks that include weak texture scenes.
Disclosure of Invention
In order to overcome the problems in the related art, the invention provides a three-dimensional reconstruction method, a device, a storage medium and equipment for complementing a weak texture scene, which are used for solving the defects in the related art.
According to a first aspect of the present invention, there is provided a three-dimensional reconstruction method of a complement weak texture scene, the method comprising:
acquiring a plurality of aerial survey images shot for a target scene under different view angles and corresponding GPS coordinates;
inputting each aerial survey image into a preset target detection network model to mark out a weak texture scene area contained in the aerial survey image;
generating a weak texture segmentation mask corresponding to the aerial survey image based on the weak texture scene region;
Performing space three-pose calculation according to all aerial survey images and corresponding GPS coordinates, and acquiring sparse point clouds corresponding to the target scene and poses corresponding to each aerial survey image;
performing multi-view stereo matching based on the sparse point cloud and the pose corresponding to each aerial survey image, and estimating a depth map corresponding to each aerial survey image;
completing the depth map corresponding to each aerial survey image based on the weak texture segmentation mask to obtain an optimized depth map corresponding to each aerial survey image;
based on the sparse point cloud and the optimized depth map corresponding to each aerial survey image, fusing to generate a dense point cloud corresponding to the target scene;
and generating a three-dimensional model corresponding to the target scene based on the dense point cloud.
Preferably, the preset object detection network model includes a YOLO v8 network model, and based on the YOLO v8 network model, a visual transformer attention mechanism with bi-level routing attention is used, and a bounding box similarity comparison metric based on minimum point distance is used as a loss function of bounding box regression of the YOLO v8 network model, and training is performed using an auto-discovery neural network optimizer based on genetic programming.
Preferably, the performing space three pose calculation according to all the aerial survey images and the corresponding GPS coordinates, and obtaining the sparse point cloud corresponding to the target scene and the pose corresponding to each aerial survey image includes:
extracting and matching feature points of all aerial survey images based on SIFT_GPU;
solving each aerial survey image through a PnP algorithm to obtain an initial value of a corresponding pose;
and minimizing the re-projection error of the pose based on a BA optimization method.
Preferably, the BA-based optimization method minimizes the re-projection error of the pose, specifically:
the pose with the smallest reprojection error is calculated by the following formula:

$$\xi^{*} = \arg\min_{\xi} \frac{1}{2} \sum_{i=1}^{n} \sum_{k \in S_i} \left\| u_i - \pi\!\left(\exp(\xi_k^{\wedge})\, P_i\right) \right\|^{2}$$

wherein ξ represents the current pose; u_i represents the pixel coordinates of the current feature point i, and n is the total number of feature points; S_i represents the set of all aerial survey images associated with the current feature point i; k represents the current aerial survey image; ξ^∧ represents the associated pose of the current pose; P_i represents the three-dimensional point coordinates corresponding to the pixel coordinates of the current feature point i; π(·) denotes the camera projection; ξ* represents the pose for which the reprojection error is minimized, i.e. the pose for which the difference between the pixel coordinates u_i observed under the current pose ξ and the reprojected coordinates of the three-dimensional point P_i is minimal.
Preferably, the estimating the depth map corresponding to each aerial survey image based on the sparse point cloud and the pose corresponding to each aerial survey image for multi-view stereo matching includes:
Matching object surface elements represented by each pixel window of the aerial survey image based on a patch-match method, and calculating the NCC correlation coefficient between the matching image blocks of each pair of aerial survey images, guided by a homography matrix H, as the matching cost, wherein the matching image blocks comprise a reference image A and a neighboring image AB; wherein,

$$NCC(A, AB) = \frac{M_{A\cdot AB} - M_A\, M_{AB}}{\sqrt{V_A\, V_{AB}}}$$

wherein M_{A·AB} is the mean of the product of the reference image A and the neighboring image AB, M_A is the mean of the reference image A, M_{AB} is the mean of the neighboring image AB, V_A is the variance of the reference image A, and V_{AB} is the variance of the neighboring image AB;
and carrying out propagation optimization on the matching cost and the depth value based on the 4-direction disturbance and the random optimization depth value, generating the depth map, and storing the final matching cost as a confidence map.
Preferably, the complementing the depth map corresponding to each aerial survey image based on the weak texture segmentation mask, to obtain an optimized depth map corresponding to each aerial survey image, includes:
aiming at the depth map corresponding to each aerial survey image, acquiring a corresponding reference confidence map and the depth map and confidence map of a neighbor image of the aerial survey image, and projecting the depth map and the confidence map of the neighbor image to an image space where the depth map corresponding to the aerial survey image is located to form a depth map and confidence map array;
Identifying a weak texture region in the depth map based on the weak texture segmentation mask, and identifying all contour points of the weak texture region;
traversing all the contour points, obtaining the maximum neighborhood depth value of each contour point in a specified range window, and constructing a contour point depth histogram based on specified sampling intervals according to the maximum neighborhood depth value;
acquiring contour points of which the corresponding depth values are in a peak interval and a peak adjacent interval of the depth histogram of the contour points in all the contour points to form a reference point set, and converting coordinates of each contour point in the reference point set from an image coordinate system to a camera coordinate system;
performing plane fitting on the contour points in the reference point set by adopting a RANSAC least square method to obtain a weak texture geometric equation in the camera coordinate system; wherein the weak texture geometric equation is

$$a\,x_c + b\,y_c + c\,z_c = m$$

wherein the normal vector (a, b, c) is constrained to be a unit vector during solving, (x_c, y_c, z_c) are the coordinates of a contour point in the camera coordinate system, and the geometric meaning of m is the distance from the camera origin to the plane;

and based on the weak texture geometric equation, filling the depth value of each pixel point in the weak texture region of the depth map pixel by pixel;
And obtaining the optimized depth map.
Preferably, the three-dimensional model corresponding to the target scene generated based on the dense point cloud is a Delaunay three-dimensional grid model.
According to a second aspect of the present invention, there is provided a three-dimensional reconstruction apparatus for complementing a weak texture scene, the apparatus comprising:
the data acquisition module is used for acquiring a plurality of aerial survey images shot for a target scene under different view angles and corresponding GPS coordinates;
the weak texture recognition module is used for inputting each aerial survey image into a preset target detection network model so as to mark out a weak texture scene area contained in the aerial survey image;
the mask generation module is used for generating a weak texture segmentation mask corresponding to the aerial survey image based on the weak texture scene area;
the pose resolving module is used for resolving the three poses of the sky according to all the aerial survey images and the corresponding GPS coordinates, and acquiring sparse point clouds corresponding to the target scene and poses corresponding to each aerial survey image;
the depth map generation module is used for carrying out multi-view three-dimensional matching based on the sparse point cloud and the pose corresponding to each aerial survey image, and estimating a depth map corresponding to each aerial survey image;
The depth map complement module is used for complementing the depth map corresponding to each aerial survey image based on the weak texture segmentation mask to obtain an optimized depth map corresponding to each aerial survey image;
the point cloud generation module is used for fusing the optimized depth map corresponding to each aerial survey image based on the sparse point cloud and generating dense point cloud corresponding to the target scene;
and the three-dimensional reconstruction module is used for generating a three-dimensional model corresponding to the target scene based on the dense point cloud.
According to a third aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of three-dimensional reconstruction of a complement weak texture scene according to any embodiment of the present invention.
According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method for reconstructing a complement weak texture scene according to any embodiment of the present invention when the program is executed.
The invention discloses a three-dimensional reconstruction method, a device, a storage medium and equipment for complementing a weak texture scene. The method can recover the three-dimensional scene information of the extremely weak texture region and better improve the quality of the three-dimensional model of the weak texture region.
The three-dimensional reconstruction method for complementing a weak texture scene of the invention can be applied to application scenarios such as AR/VR, 3D games, 3D film and television works, short videos, automatic driving and free viewpoints, can effectively assist in recovering better 3D structures, generates more attractive results, and improves the user experience of the product.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
Fig. 1 is a flowchart illustrating a three-dimensional reconstruction method for complementing a weak texture scene according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a network layer after adding a BiFormer module to a YOLO v8 network model according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of test results of labeling a weak texture scene region contained in an aerial survey image by a target detection network model according to an embodiment of the present invention.
FIG. 4 is a schematic diagram illustrating a weak texture segmentation mask generated from a detected weak texture scene region in an aerial survey image according to one embodiment of the present invention.
Fig. 5 is a schematic diagram of a BA optimization method based on multi-view feature points according to an embodiment of the present invention.
FIG. 6 is a schematic diagram illustrating pose computation and sparse point cloud solution according to an embodiment of the present invention.
FIG. 7 is a schematic diagram illustrating optimizing depth values based on multiple directional perturbations in accordance with an embodiment of the present invention.
Fig. 8 is a schematic diagram illustrating a comparison of a single aerial survey image, an identified weak texture scene region, a generated weak texture segmentation mask and an estimated depth map in accordance with an embodiment of the invention.
FIG. 9 is a schematic diagram illustrating a comparison of multiple aerial images with an estimated depth map according to an embodiment of the present invention.
FIG. 10 is a flow chart illustrating the completion of a depth map based on a weak texture segmentation mask according to an embodiment of the present invention.
FIG. 11 is a schematic diagram of a comparison of an aerial survey image, an original depth map, and an optimized depth map according to an embodiment of the present invention.
FIG. 12 is a schematic diagram illustrating a comparison of a sparse point cloud, an original dense point cloud, and an optimized dense point cloud of a target scene according to an embodiment of the present invention.
FIG. 13 is a schematic diagram showing a comparison of a three-dimensional model generated before optimization and a three-dimensional model result generated after optimization of a target scene according to an embodiment of the present invention.
Fig. 14 is a schematic structural diagram of a three-dimensional reconstruction device for complementing a weak texture scene according to an embodiment of the present invention.
FIG. 15 is a schematic diagram illustrating the hardware architecture of a computer device according to one embodiment of the invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the invention. The word "if" as used herein may be interpreted as "at … …" or "at … …" or "responsive to a determination", depending on the context.
The invention is described in detail below with reference to the drawings and the detailed description.
Referring to fig. 1, fig. 1 is a flowchart of a three-dimensional reconstruction method for complementing a weak texture scene according to an embodiment of the present invention, including the following steps:
step S101, acquiring a plurality of aerial survey images and corresponding GPS coordinates shot for a target scene under different view angles;
step S102, inputting each aerial survey image into a preset target detection network model so as to mark a weak texture scene area contained in the aerial survey image;
step S103, generating a weak texture segmentation mask corresponding to the aerial survey image based on the weak texture scene area;
step S104, performing space three-pose calculation according to all aerial survey images and corresponding GPS coordinates, and obtaining sparse point clouds corresponding to a target scene and poses corresponding to each aerial survey image;
step S105, multi-view stereo matching is carried out based on the sparse point cloud and the pose corresponding to each aerial survey image, and a depth map corresponding to each aerial survey image is estimated;
step S106, complementing the depth map corresponding to each aerial survey image based on the weak texture segmentation mask to obtain an optimized depth map corresponding to each aerial survey image;
step S107, based on the sparse point cloud and the optimized depth map corresponding to each aerial survey image, fusing to generate dense point cloud corresponding to the target scene;
Step S108, generating a three-dimensional model corresponding to the target scene based on the dense point cloud.
In step S101, a plurality of aerial survey images may be obtained by capturing the target scene under different viewing angles to form a multi-viewing angle image group, and simultaneously, the GPS coordinates corresponding to each capturing point are obtained, so that three-dimensional reconstruction is performed according to the multi-viewing angle image group and the corresponding GPS coordinates, and a three-dimensional model corresponding to the target scene is generated.
Specifically, in step S101, a weak texture scene is included in the target scene as the shooting target, where the weak texture scene refers to a scene having characteristics of single color, repeated texture, more similar areas, less dotted textures, or the like, such as a water area or a desert. In particular, the weak texture scene included in the aerial survey image taken for the target scene may refer to a specific type of weak texture scene, such as a water area, and the specific referred scene may be determined according to the requirements.
Specifically, the aerial survey image obtained in step S101 may be a plurality of independent images that are taken separately from the target scene at different viewing angles, or may be a plurality of video frames taken from a specific video obtained by continuously recording the target scene at different viewing angles, that is, the method of the present invention may also be applicable to obtaining an aerial survey image of the target scene by recording a video. Specifically, the calculation amount in the subsequent processes of recognition detection, depth map estimation and the like can be adaptively reduced according to the similarity of the front frame and the rear frame of each video frame in the video, and the invention is not limited to this. Specifically, when video recording is performed on a target scene, the GPS coordinates of the camera corresponding to each moment can be recorded simultaneously in real time for subsequent point cloud and depth map generation.
In step S102, each aerial survey image is input into a preset target detection network model, so that the weak texture scene areas in the aerial survey image can be marked. The target detection network model preset by the invention is used for identifying and marking the weak texture scene area.
Specifically, in some embodiments, the object detection network model used in step S102 may include a YOLO v8 network model, i.e., the object detection network model may be a network model obtained by improving on the YOLO v8 network model. In particular, the object detection network model may use a vision transformer attention mechanism with bi-level routing attention on the basis of the YOLO v8 network model, may use a bounding box similarity comparison metric based on minimum point distance as the loss function of the bounding box regression of the YOLO v8 network model, and may also be trained using an automatically discovered neural network optimizer based on genetic programming. Specifically, the YOLO v8 network model of the present invention may refer to the segmentation detection network model of YOLO v8, i.e., the YOLO v8-seg model structure. The object detection network model of the invention may comprise a YOLO v8 network model, and may additionally adopt the vision transformer attention mechanism with bi-level routing attention (BiFormer, Vision Transformer with Bi-Level Routing Attention), a bounding box similarity comparison metric based on minimum point distance (MPDIoU), and an automatically discovered, genetic-programming-based neural network optimizer (Lion, EvoLved Sign Momentum) on the basis of the YOLO v8 network model, for training and intelligent inference on weak texture scenes (taking water areas as an example).
Compared with a conventional attention mechanism, the invention can adopt dynamic sparse attention based on bi-level routing, saving computation and memory through sparsity, and applies the BiFormer vision transformer for query-adaptive attention, so that the target detection network model is greatly improved in the segmentation and identification of weak texture scenes.
In particular, a specific way to use the vision transformer attention mechanism with bi-level routing attention (BiFormer) on the basis of the YOLO v8 network model may be to add one BiFormer module after each C2f layer of the neck of the YOLO v8 network model. Fig. 2 is a schematic diagram of the network layers after adding BiFormer modules to the YOLO v8 network model according to an embodiment of the invention.
Specifically, the core functions of the above-mentioned bifomer module added in the YOLO v8 network model may be:
Upon receiving an input image $X \in \mathbb{R}^{H\times W\times C}$, divide it into $S \times S$ non-overlapping regions, each containing $\frac{HW}{S^2}$ feature vectors, so that $X$ is reshaped into $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$, and obtain the query, key and value tensors $Q, K, V$:

$$Q = X^r W^q,\quad K = X^r W^k,\quad V = X^r W^v$$

wherein $W^q, W^k, W^v \in \mathbb{R}^{C\times C}$ are respectively the projection weights of query, key and value, and H and W are the height and width of the input image X;

calculate the region-level queries and keys $Q^r, K^r$ (obtained by averaging Q and K within each region) and the adjacency matrix $A^r$ of the inter-region correlation:

$$A^r = Q^r (K^r)^{T}$$

retain only the first k connections of each region to prune the correlation graph and obtain an index matrix $I^r$:

$$I^r = \mathrm{topkIndex}(A^r)$$

wherein the i-th row of the index matrix $I^r$ contains the indices of the first k most relevant regions of the i-th region;

using the index matrix $I^r$, gather the tensors of keys and values, i.e.:

$$K^g = \mathrm{gather}(K, I^r),\quad V^g = \mathrm{gather}(V, I^r)$$

wherein $K^g$ and $V^g$ are the gathered key and value tensors;

then perform the attention operation on the gathered key-value pairs:

$$O = \mathrm{Attention}(Q, K^g, V^g) + \mathrm{LCE}(V)$$

wherein a local context enhancement term LCE(V) is introduced, parameterized with a depth-wise separable convolution with a convolution kernel size of 5.
This completes the improved optimization of the attention mechanism of the YOLO v8 network model.
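By way of illustration only, and not as part of the claimed method, the following is a minimal NumPy sketch of the bi-level routing attention computation described above (region partition, region-level adjacency, top-k routing, gather, and token-level attention). It is an illustrative simplification under stated assumptions: the LCE depth-wise convolution term and multi-head handling are omitted, and the shapes follow the description above.

```python
import numpy as np

def bi_level_routing_attention(X, Wq, Wk, Wv, S=2, topk=2):
    """Sketch of bi-level routing attention (single head, no LCE term).

    X  : (H, W, C) input feature map, H and W assumed divisible by S
    Wq, Wk, Wv : (C, C) projection weights for query, key, value
    S  : number of regions per side (S*S regions in total)
    topk : number of most relevant regions each region attends to
    """
    H, W, C = X.shape
    n_reg, reg_len = S * S, (H // S) * (W // S)

    # Reshape X into S*S regions, each holding HW/S^2 feature vectors: (S^2, HW/S^2, C)
    Xr = (X.reshape(S, H // S, S, W // S, C)
            .transpose(0, 2, 1, 3, 4)
            .reshape(n_reg, reg_len, C))

    # Linear projections: Q = Xr Wq, K = Xr Wk, V = Xr Wv
    Q, K, V = Xr @ Wq, Xr @ Wk, Xr @ Wv

    # Region-level queries/keys by averaging inside each region, then adjacency A^r = Q^r (K^r)^T
    Qr, Kr = Q.mean(axis=1), K.mean(axis=1)           # (S^2, C)
    Ar = Qr @ Kr.T                                    # (S^2, S^2)

    # Keep only the top-k most correlated regions per region: I^r = topkIndex(A^r)
    Ir = np.argsort(-Ar, axis=1)[:, :topk]            # (S^2, topk)

    # Gather key/value tensors of the routed regions: K^g, V^g
    Kg = K[Ir].reshape(n_reg, topk * reg_len, C)
    Vg = V[Ir].reshape(n_reg, topk * reg_len, C)

    # Token-level attention inside each region over the gathered keys/values
    logits = Q @ Kg.transpose(0, 2, 1) / np.sqrt(C)   # (S^2, reg_len, topk*reg_len)
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    O = attn @ Vg                                     # (S^2, reg_len, C)

    # Restore the (H, W, C) layout
    return (O.reshape(S, S, H // S, W // S, C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(H, W, C))

# Toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 8, 16))
Wq, Wk, Wv = (rng.standard_normal((16, 16)) for _ in range(3))
print(bi_level_routing_attention(X, Wq, Wk, Wv).shape)  # (8, 8, 16)
```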
Further, the default loss function under the YOLO v8 network model is designed based on CIoU (Complete-IoU, an intersection-over-union metric that also incorporates the center distance and the predicted box size), which, although it takes the aspect ratio of the bounding box into account, does not reflect the true differences in width, height and confidence; the invention can adopt MPDIoU to improve the loss function of the YOLO v8 network model, which not only considers the center point distance and the width-height deviation, but also simplifies the calculation process, so as to improve prediction accuracy and efficiency.
Specifically, a minimum point distance-based bounding box similarity comparison metric (MPDIoU) is used as a loss function of the bounding box regression of the YOLO v8 network model, and specifically, the MPDIoU loss function is used to replace a default loss function of the bounding box regression of the YOLO v8 network model, so that the default loss function of the YOLO v8 network model is improved and optimized.
Specifically, the calculation method of the MPDIoU loss function used in the YOLO v8 network model may be:

1) Obtain the two detection boxes, wherein gt is the ground-truth box and prd is the predicted box, and take the coordinates of the top-left and bottom-right points of prd and gt:

$$(x_1^{prd}, y_1^{prd}),\ (x_2^{prd}, y_2^{prd}),\ (x_1^{gt}, y_1^{gt}),\ (x_2^{gt}, y_2^{gt})$$

wherein it is necessary to ensure that the coordinates of the predicted box prd satisfy:

$$x_2^{prd} > x_1^{prd},\quad y_2^{prd} > y_1^{prd}$$

2) According to the coordinates, solve the point distances from the predicted box to the ground-truth box:

$$d_1^2 = (x_1^{prd} - x_1^{gt})^2 + (y_1^{prd} - y_1^{gt})^2,\quad d_2^2 = (x_2^{prd} - x_2^{gt})^2 + (y_2^{prd} - y_2^{gt})^2$$

3) Calculate the areas of the ground-truth box gt and the predicted box prd:

$$A^{gt} = (x_2^{gt} - x_1^{gt})(y_2^{gt} - y_1^{gt})$$

$$A^{prd} = (x_2^{prd} - x_1^{prd})(y_2^{prd} - y_1^{prd})$$

4) Calculate the intersection I of the ground-truth box gt and the predicted box prd, where the coordinates of the top-left and bottom-right points of the intersection I respectively take the maxima and minima of the corresponding coordinates of gt and prd:

$$x_1^{I} = \max(x_1^{prd}, x_1^{gt}),\quad x_2^{I} = \min(x_2^{prd}, x_2^{gt})$$

$$y_1^{I} = \max(y_1^{prd}, y_1^{gt}),\quad y_2^{I} = \min(y_2^{prd}, y_2^{gt})$$

if $x_2^{I} > x_1^{I}$ and $y_2^{I} > y_1^{I}$, the area of the intersection I can be calculated as:

$$I = (x_2^{I} - x_1^{I})(y_2^{I} - y_1^{I})$$

if the above condition is not satisfied, take I = 0;

5) Calculate the union U of the ground-truth box gt and the predicted box prd:

$$U = A^{gt} + A^{prd} - I$$

6) Calculate the original IoU of the ground-truth box gt and the predicted box prd according to the IoU (Intersection over Union) definition, i.e. the intersection ratio:

$$IoU = \frac{I}{U}$$

7) Solve the minimum point distance intersection-over-union MPDIoU of the ground-truth box gt and the predicted box prd on the basis of their original IoU, where w and h denote the width and height of the input image:

$$MPDIoU = IoU - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2}$$

8) The improved MPDIoU loss function is:

$$L_{MPDIoU} = 1 - MPDIoU$$
the original loss function of YOLO v8 is replaced by the algorithm, and the improvement process of the loss function can be completed.
Further, the default optimizer of the YOLO v8 network model is the SGD (Stochastic Gradient Descent) optimizer, which involves a large amount of computation, has poor generalization capability, and is prone to over-fitting. The invention can adopt the Lion optimizer instead, which only tracks momentum and, through the sign operation, applies an update of the same magnitude to every parameter, so that computing resources are greatly saved and a better optimization iteration effect can be achieved.
Specifically, the automatically discovered, genetic-programming-based neural network optimizer is used for training; that is, the Lion optimizer is used to replace the default SGD optimizer of the YOLO v8 network model;
the updating algorithm of each parameter in the Lion optimizer is as follows:
wherein, gamma t For the weight attenuation rate, the recommended value is 0.01; g t Gradient of the loss function for the t-th round; sign is a sign function, namely, the internal calculation result is positive number and takes 1, and the negative number is taken 1; u (u) t The update amount of the t-th round of the optimizer is also the final update result required by the optimizer, u t May be used to update the weights w:
w t =w t-1 -u t
while u is to be obtained t There are also two intermediate parameters θ t And m is equal to t These two parameters also need to be updated continuously with training, and the calculation method is provided in the above formula, wherein L t For the learning rate of the t-th round, the suggested value of the initial learning rate is 0.0003; beta 1 And beta 2 Is a preset super parameter, beta 1 Recommended value is 0.9, beta 2 A value of 0.99 is recommended. Specifically, the values of the basic parameters may be modified according to actual experimental data and effects to improve accuracy, which is not limited by the present invention.
And replacing the YOLO v8 default optimizer with the Lion optimizer to finish the improvement of the optimizer.
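As an illustrative sketch only (not the ultralytics implementation and not part of the claimed method), the toy NumPy class below applies the Lion update rule as reconstructed above, with variable names mirroring the symbols in the formulas and the recommended hyper-parameter values as defaults.

```python
import numpy as np

class LionSketch:
    """Toy Lion optimizer for a single NumPy parameter array."""

    def __init__(self, lr=3e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
        self.lr, self.beta1, self.beta2, self.gamma = lr, beta1, beta2, weight_decay
        self.m = None  # momentum m_t

    def step(self, w, grad):
        if self.m is None:
            self.m = np.zeros_like(w)
        # theta_t: interpolation of momentum and current gradient, used only through its sign
        theta = self.beta1 * self.m + (1.0 - self.beta1) * grad
        # u_t = L_t * (sign(theta_t) + gamma_t * w_{t-1})
        u = self.lr * (np.sign(theta) + self.gamma * w)
        # m_t = beta2 * m_{t-1} + (1 - beta2) * g_t
        self.m = self.beta2 * self.m + (1.0 - self.beta2) * grad
        # w_t = w_{t-1} - u_t
        return w - u

# Toy usage: a few sign-based steps on f(w) = ||w||^2
opt = LionSketch()
w = np.array([1.0, -2.0, 0.5])
for _ in range(200):
    w = opt.step(w, grad=2.0 * w)
print(w)
```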
After adding the BiFormer module, replacing the original CIoU loss function with the MPDIoU loss function and replacing the original SGD optimizer with the Lion optimizer, the overall improvement of the YOLO v8 network model is completed, forming the target detection network model that can be used in step S102 of the invention.
Specifically, in other embodiments, other target detection network models may be used to identify and label the weak texture scene area in the aerial survey image, which is not limited by the present invention.
Specifically, the object detection network model also needs to be trained before it is used for the recognition of weak texture scenes. Specifically, labeling work can first be performed on high-resolution aerial survey images containing weak texture scenes such as water areas, for example by labeling the water surface area of each picture, converting the labels into corresponding training-format files, and dividing them into a training set, a validation set and a test set. Initial parameters are then set for training; besides the parameters with recommended values mentioned above, other basic parameters can be adjusted according to the specific conditions of the experiment to achieve a better effect. After the parameters are set, training can be performed with the training set and the validation set, testing and evaluation can be performed with the test set, and after the training and evaluation indexes reach expectations, the trained model is output, so that subsequent inference and detection can be performed with the model.
In step S103, a weak texture segmentation mask corresponding to the aerial survey image may be generated for the subsequent depth map completion operation by using the reasoning result of the object detection network model.
Specifically, the test results of detecting the weak texture scene of the water surface data type using the improved YOLO v8 network model of the above embodiment as the target detection network model, and of extracting the weak texture segmentation mask from the detection results, may be as shown in fig. 3 and fig. 4, where fig. 3 is a schematic diagram of the test result of marking the weak texture scene region contained in the aerial survey image by the target detection network model according to an embodiment of the present invention, and fig. 4 is a schematic diagram of the weak texture segmentation mask generated from the weak texture scene region detected in the aerial survey image according to an embodiment of the present invention.
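As a brief illustration of step S103 (not a required implementation), the OpenCV sketch below rasterizes detected weak-texture polygons, for example the water-surface contours output by the segmentation model, into a binary mask of the same size as the aerial survey image; the polygon input format is an assumption made for this sketch.

```python
import numpy as np
import cv2

def build_weak_texture_mask(image_shape, polygons):
    """Rasterize segmentation polygons into a binary weak-texture mask.

    image_shape : (height, width) of the aerial survey image
    polygons    : list of (N, 2) arrays of (x, y) pixel coordinates, one per detected region
    """
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    for poly in polygons:
        cv2.fillPoly(mask, [np.asarray(poly, dtype=np.int32)], color=255)
    return mask

# Toy usage: one triangular "water" region in a 480x640 image
poly = np.array([[100, 200], [500, 220], [300, 420]])
mask = build_weak_texture_mask((480, 640), [poly])
print(mask.shape, int(mask.max()))  # (480, 640) 255
```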
In step S104, a space-three (aerial triangulation) pose solution based on position constraints may be performed, relying on the high-resolution aerial survey images and the corresponding high-precision GPS coordinates, to generate the pose of each aerial survey image, and the poses of the aerial survey images are optimized through the space-three pose solution to generate the sparse point cloud. Specifically, in step S104, performing the space-three pose calculation according to all aerial survey images and corresponding GPS coordinates, and obtaining the sparse point cloud corresponding to the target scene and the pose corresponding to each aerial survey image, may include: firstly, extracting and matching feature points of all aerial survey images based on SIFT_GPU; then, solving each aerial survey image through a PnP (Perspective-n-Point) algorithm to obtain an initial value of the corresponding pose; and finally, minimizing the re-projection error of the pose based on the BA optimization method, thereby obtaining the final aerial survey image pose. Here, SIFT_GPU refers to the SIFT (Scale-Invariant Feature Transform) algorithm implemented on a GPU (Graphics Processing Unit), and can be used for feature point extraction and matching between multi-view images. The PnP algorithm refers to an algorithm for estimating the pose of a camera (i.e., the pose of the camera in a coordinate system A) given n three-dimensional spatial point coordinates (relative to the specified coordinate system A) and their two-dimensional projection positions, and can be used to estimate the pose of each aerial survey image in the same spatial coordinate system. Specifically, the PnP algorithm used in step S104 of the present invention may be any one of Direct Linear Transformation (DLT), the P3P (Perspective-3-Point) method, the Perspective Similar Triangle (PST) method or other PnP algorithms, which the invention does not limit. The BA (Bundle Adjustment) optimization method, which may also be referred to as the beam adjustment method, may be used to minimize the re-projection error and thereby optimize the pose.
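A hedged sketch of the PnP initial-pose step using OpenCV's solvePnPRansac is given below; the matched 3D–2D correspondences and the intrinsic matrix are assumed to come from the SIFT_GPU matching and camera calibration, and this is only one of the PnP variants the method allows.

```python
import numpy as np
import cv2

def initial_pose_pnp(points_3d, points_2d, K):
    """Estimate an initial camera pose (R, t) from 3D-2D correspondences.

    points_3d : (N, 3) object points in the scene coordinate system
    points_2d : (N, 2) corresponding pixel observations in one aerial survey image
    K         : (3, 3) camera intrinsic matrix
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
    )
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return R, tvec.reshape(3), inliers
```

The (R, t) returned here corresponds to the initial pose value that the BA step described next then refines.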
Specifically, the step of minimizing the re-projection error of the pose based on the BA optimization method may specifically be:
the pose with the smallest reprojection error is calculated by the following formula:

$$\xi^{*} = \arg\min_{\xi} \frac{1}{2} \sum_{i=1}^{n} \sum_{k \in S_i} \left\| u_i - \pi\!\left(\exp(\xi_k^{\wedge})\, P_i\right) \right\|^{2}$$

wherein ξ represents the current pose; u_i represents the pixel coordinates of the current feature point i, and n is the total number of feature points; S_i represents the set of all aerial survey images associated with the current feature point i; k represents the current aerial survey image; ξ^∧ represents the associated pose of the current pose; P_i represents the three-dimensional point coordinates corresponding to the pixel coordinates of the current feature point i; π(·) denotes the camera projection; ξ* represents the pose for which the reprojection error is minimized, i.e. the pose for which the difference between the pixel coordinates u_i observed under the current pose ξ and the reprojected coordinates of the three-dimensional point P_i is minimal.
Specifically, as shown in fig. 5 and fig. 6, fig. 5 is a schematic diagram of a BA optimization method based on multi-view feature points according to an embodiment of the present invention, and fig. 6 is a schematic diagram of pose calculation and sparse point cloud calculation according to an embodiment of the present invention.
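To make the reprojection-error objective concrete, here is a small SciPy sketch that refines one camera pose by minimizing the residuals between the observed pixels u_i and the projections of the 3D points P_i; it uses an angle-axis pose parameterization and a pinhole projection as stand-ins for the Lie-algebra formulation above, and it optimizes a single pose rather than the full bundle.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(pose, points_3d, pixels, K):
    """pose = (rx, ry, rz, tx, ty, tz): angle-axis rotation and translation."""
    R = Rotation.from_rotvec(pose[:3]).as_matrix()
    t = pose[3:]
    cam = points_3d @ R.T + t                 # world -> camera frame
    proj = cam @ K.T                          # pinhole projection
    uv = proj[:, :2] / proj[:, 2:3]           # normalize by depth
    return (uv - pixels).ravel()              # residuals u_i - pi(xi, P_i)

def refine_pose(pose0, points_3d, pixels, K):
    """Minimize the summed squared reprojection error for one aerial survey image."""
    result = least_squares(reprojection_residuals, pose0, args=(points_3d, pixels, K))
    return result.x

# Toy usage: recover a small pose perturbation on synthetic data
rng = np.random.default_rng(1)
K = np.array([[1000.0, 0, 320], [0, 1000.0, 240], [0, 0, 1]])
P = rng.uniform([-2, -2, 8], [2, 2, 12], size=(50, 3))       # 3D points in front of camera
true_pose = np.array([0.02, -0.01, 0.03, 0.1, -0.05, 0.2])
pix = reprojection_residuals(true_pose, P, np.zeros((50, 2)), K).reshape(-1, 2)
est = refine_pose(np.zeros(6), P, pix, K)
print(np.round(est - true_pose, 6))                          # close to zero
```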
In step S105, depth estimation may be performed by performing multi-view stereo matching based on the pose obtained by the three-pose solution in step S104 and the sparse point cloud, so as to generate a depth map corresponding to each aerial survey image. Specifically, in step S105, multi-view stereo matching is performed based on the sparse point cloud and the pose corresponding to each aerial survey image, the depth map corresponding to each aerial survey image is estimated, and the key core algorithm includes:
Firstly, matching object surface elements represented by each pixel window of the aerial survey image based on a patch-match method, and calculating the NCC correlation coefficient between the matching image blocks of each pair of aerial survey images, guided by a homography matrix H, as the matching cost, wherein the matching image blocks comprise a reference image A and a neighboring image AB; wherein,

$$NCC(A, AB) = \frac{M_{A\cdot AB} - M_A\, M_{AB}}{\sqrt{V_A\, V_{AB}}}$$

wherein M_{A·AB} is the mean of the product of the reference image A and the neighboring image AB, M_A is the mean of the reference image A, M_{AB} is the mean of the neighboring image AB, V_A is the variance of the reference image A, and V_{AB} is the variance of the neighboring image AB. Then, propagation optimization of the matching cost and the depth values is carried out based on perturbations in the up, down, left and right directions and random optimization of the depth values and normal vectors, until a whole depth map is generated, where the final matching cost is stored as a confidence map and the final normal vectors are stored as a normal map. Specifically, as shown in fig. 7, fig. 7 is a schematic diagram of optimizing depth values based on perturbations in a plurality of directions according to an embodiment of the present invention.
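A minimal NumPy sketch of the NCC matching cost between a reference patch A and a neighboring patch AB, matching the mean/variance formulation above, is given for illustration; the warping of the neighboring patch by the homography H is omitted and the patches are assumed to be already aligned.

```python
import numpy as np

def ncc(patch_a, patch_ab, eps=1e-8):
    """NCC(A, AB) = (M_{A*AB} - M_A * M_AB) / sqrt(V_A * V_AB)."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_ab.astype(np.float64).ravel()
    m_ab = (a * b).mean()          # mean of the product of A and AB
    m_a, m_b = a.mean(), b.mean()  # means of A and AB
    v_a, v_b = a.var(), b.var()    # variances of A and AB
    return (m_ab - m_a * m_b) / np.sqrt(v_a * v_b + eps)

# Toy usage: identical patches give NCC close to 1, independent noise gives NCC near 0
rng = np.random.default_rng(2)
p = rng.random((11, 11))
print(round(ncc(p, p), 3), round(ncc(p, rng.random((11, 11))), 3))
```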
Specifically, the effect of the depth maps generated by step S105 of the present invention may be as shown in fig. 8 and fig. 9. FIG. 8 is a schematic diagram illustrating a comparison of a single aerial survey image, the identified weak texture scene region, the generated weak texture segmentation mask and the estimated depth map according to an embodiment of the present invention. Here, image (8-1) is the original aerial survey image, image (8-2) is the weak texture scene area of the aerial survey image identified in step S102, image (8-3) is the weak texture segmentation mask corresponding to the aerial survey image generated from the weak texture scene area, and image (8-4) is the depth map estimated from the aerial survey image. FIG. 9 is a schematic diagram comparing a plurality of aerial survey images with the estimated depth maps according to an embodiment of the present invention. Here, image (9-1) shows several original aerial survey images, and image (9-2) shows the depth map correspondingly estimated for each original aerial survey image in image (9-1).
In step S106, the depth map corresponding to each aerial survey image may be complemented based on the depth value at the contour formed by the weak texture segmentation mask, so as to obtain an optimized depth map, for example, when the water area is a weak texture scene area, the depth of the missing water body may be complemented and optimized based on the depth value at the contour point of the water body mask. Specifically, as shown in fig. 10, fig. 10 is a schematic flow chart of the present invention for complementing a depth map based on a weak texture segmentation mask, and in step S106, complementing a depth map corresponding to each aerial survey image based on the weak texture segmentation mask to obtain an optimized depth map corresponding to each aerial survey image may include:
for the depth map corresponding to each aerial survey image (taken as the reference depth map D_ref, together with its corresponding reference confidence map), acquiring the depth maps and confidence maps of its neighbor images, and projecting the depth maps and confidence maps of the neighbor images into the image space of the reference depth map D_ref corresponding to the aerial survey image; the projected neighbor depth maps and confidence maps, together with the reference depth map D_ref and its confidence map, form a depth map array and a confidence map array;
identifying a weak texture region in the depth map based on the weak texture segmentation mask, and identifying all contour points of the weak texture region;
Traversing all the contour points, obtaining the maximum neighborhood depth value of each contour point in a window of a specified range, and constructing a contour point depth histogram H based on a specified sampling interval b according to the maximum neighborhood depth value; the window of the specified range can be a window of 5×5 or other sizes, and can be specifically determined according to actual requirements, which is not limited by the invention; likewise, the designated sampling interval b can be determined according to actual requirements, which is not limited by the present invention;
acquiring contour points of which the corresponding depth values are near the peak value (including a peak value interval and a peak value adjacent interval) of the depth histogram H of the contour points in all the contour points to form a reference point set, and converting the coordinate of each contour point in the reference point set from an image coordinate system to a camera coordinate system;
performing plane fitting on the contour points in the reference point set by adopting the RANSAC (Random Sample Consensus) least square method to obtain a weak texture geometric equation in the camera coordinate system; wherein the weak texture geometric equation may be

$$a\,x_c + b\,y_c + c\,z_c = m$$

wherein the normal vector (a, b, c) is constrained to be a unit vector during solving, (x_c, y_c, z_c) are the coordinates of a contour point in the camera coordinate system, and the geometric meaning of m is the distance from the camera origin to the plane;
Based on the weak texture geometric equation, the depth value of each pixel point in the weak texture region of the depth map is filled pixel by pixel, thereby complementing the weak texture areas in the depth map and the normal map and obtaining the optimized depth map. Because normal-consistency checks are required when the subsequent multi-view depth maps are fused to generate the point cloud, the normal direction (a, b, c) of the fitted plane can also be used to complement the same region of the normal map corresponding to the depth map, so that the correct completion result of the weak texture region is well preserved in the fusion process.
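The following NumPy sketch strings together the completion steps described above for a single reference depth map: contour-point depth histogram, plane fitting on the dominant-depth contour points, and per-pixel filling of the masked region by intersecting each pixel ray with the fitted plane (z = m / (n · K⁻¹ [u, v, 1]ᵀ)). It is illustrative only: it assumes a pinhole camera with intrinsics K and replaces the RANSAC least-squares fit with a plain least-squares fit for brevity.

```python
import numpy as np
import cv2

def complete_weak_texture_depth(depth, mask, K, bin_width=0.5, win=5):
    """Fill the masked (weak texture) region of a depth map with a fitted plane.

    depth : (H, W) depth map with holes inside the weak texture region
    mask  : (H, W) uint8 mask, >0 inside the weak texture region
    K     : (3, 3) camera intrinsic matrix
    """
    contours, _ = cv2.findContours((mask > 0).astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    pts = np.vstack([c.reshape(-1, 2) for c in contours])        # contour points (u, v)

    # Max depth in a win x win neighborhood of every contour point
    r = win // 2
    neigh = np.array([depth[max(v - r, 0):v + r + 1, max(u - r, 0):u + r + 1].max()
                      for u, v in pts])

    # Depth histogram of contour points; keep points in the peak bin and its neighbors
    bins = np.arange(neigh.min(), neigh.max() + 2 * bin_width, bin_width)
    hist, edges = np.histogram(neigh, bins=bins)
    peak = np.argmax(hist)
    keep = (neigh >= edges[max(peak - 1, 0)]) & (neigh <= edges[min(peak + 2, len(edges) - 1)])
    ref_pts, ref_z = pts[keep], neigh[keep]

    # Back-project reference contour points to the camera coordinate system
    Kinv = np.linalg.inv(K)
    rays = np.column_stack([ref_pts, np.ones(len(ref_pts))]) @ Kinv.T
    Pc = rays * ref_z[:, None]                                   # (x_c, y_c, z_c)

    # Least-squares plane a*x + b*y + c*z = m with unit normal (RANSAC omitted in this sketch)
    centroid = Pc.mean(axis=0)
    _, _, vt = np.linalg.svd(Pc - centroid)
    n = vt[-1] / np.linalg.norm(vt[-1])                          # unit normal (a, b, c)
    m = float(n @ centroid)                                      # distance term

    # Fill every masked pixel: depth z such that z * K^-1 [u, v, 1]^T lies on the plane
    vs, us = np.nonzero(mask > 0)
    dirs = np.column_stack([us, vs, np.ones(len(us))]) @ Kinv.T
    z = m / (dirs @ n)
    out = depth.copy()
    out[vs, us] = z
    return out, n, m
```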
Specifically, as shown in fig. 11, fig. 11 is a schematic diagram illustrating comparison of an aerial survey image, an original depth map and an optimized depth map according to an embodiment of the present invention. Wherein, the image (11-1) is two original aerial survey images, the image (11-2) is an original depth map generated according to step S105 corresponding to the two aerial survey images in the image (11-1), respectively, and the image (11-3) is a depth map optimized according to step S106 corresponding to the two aerial survey images in the image (11-1), respectively. As can be seen from fig. 11, compared with the original depth map, the optimized depth map can well complement the depth values of the weak texture scene region.
In step S107, fusion may be performed based on the sparse point cloud obtained in step S104 and the optimized depth map obtained in step S106, so as to generate a dense point cloud corresponding to the target scene.
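As an illustration of step S107 (a sketch under assumed conventions, not the full fusion), the function below back-projects one optimized depth map into the scene coordinate system using its pose; fusing all views, with confidence and normal-consistency checks and merging with the sparse point cloud, would repeat this per aerial survey image and is omitted here.

```python
import numpy as np

def depth_to_points(depth, K, R, t, mask=None):
    """Back-project a depth map to 3D points in the scene coordinate system.

    depth : (H, W) optimized depth map
    K     : (3, 3) intrinsics; R, t : camera pose such that X_cam = R @ X_world + t
    mask  : optional (H, W) boolean array of pixels to keep
    """
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    keep = (depth > 0) if mask is None else ((depth > 0) & mask)
    pix = np.column_stack([us[keep], vs[keep], np.ones(keep.sum())])
    cam = (pix @ np.linalg.inv(K).T) * depth[keep][:, None]   # points in camera frame
    world = (cam - t) @ R                                     # R^T (X_cam - t)
    return world

# Toy usage with an identity pose and a flat depth map
K = np.array([[1000.0, 0, 320], [0, 1000.0, 240], [0, 0, 1]])
pts = depth_to_points(np.full((480, 640), 5.0), K, np.eye(3), np.zeros(3))
print(pts.shape)  # (307200, 3)
```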
Specifically, as shown in fig. 12, fig. 12 is a schematic diagram illustrating comparison of a sparse point cloud, an original dense point cloud and an optimized dense point cloud of a target scene according to an embodiment of the present invention. Wherein, the graph (12-1) is a sparse point cloud obtained in step S104 observed under a view angle, the graph (12-2) is an original dense point cloud generated based on the original depth map obtained in step S105 observed under the view angle, the graph (12-3) is an optimized dense point cloud generated based on the optimized depth map obtained in step S106 observed under the view angle, the graph (12-4) is a sparse point cloud obtained in step S104 observed under another view angle, the graph (12-5) is an original dense point cloud generated based on the original depth map obtained in step S105 observed under the another view angle, and the graph (12-6) is an optimized dense point cloud generated based on the optimized depth map obtained in step S106 observed under the another view angle.
In step S108, a three-dimensional model corresponding to the target scene may be generated based on the dense point cloud obtained in step S107. Specifically, in some embodiments, the three-dimensional model corresponding to the target scene generated based on the dense point cloud may be a Delaunay three-dimensional mesh model.
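Step S108 can be illustrated with a very simplified 2.5D Delaunay triangulation of the dense point cloud (triangulating the X–Y projection and lifting the triangles with the Z values); a full Delaunay-tetrahedralization surface reconstruction as typically used in photogrammetric pipelines is considerably more involved and is not shown here.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_mesh_2_5d(points):
    """Build a simple height-field style mesh from a dense point cloud.

    points : (N, 3) dense point cloud; triangulation is done on the (x, y) plane.
    Returns (vertices, faces) where faces index into vertices.
    """
    tri = Delaunay(points[:, :2])     # 2D Delaunay triangulation of the ground-plane projection
    return points, tri.simplices      # simplices are (M, 3) vertex-index triangles

# Toy usage
rng = np.random.default_rng(3)
pts = np.column_stack([rng.random((100, 2)) * 10, rng.random(100)])
verts, faces = delaunay_mesh_2_5d(pts)
print(verts.shape, faces.shape)
```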
Specifically, as shown in fig. 13, fig. 13 is a schematic diagram showing a comparison of a three-dimensional model generated before optimization and a three-dimensional model generated after optimization of a target scene according to an embodiment of the present invention. Wherein, the graph (13-1) is a three-dimensional model constructed by dense point cloud generated based on the depth map before optimization observed under the view angle, the graph (13-2) is a three-dimensional model constructed by dense point cloud generated based on the depth map after optimization obtained in the step S107 of the invention observed under the view angle, the graph (13-3) is a three-dimensional model constructed by dense point cloud generated based on the depth map before optimization observed under another view angle, and the graph (13-4) is a three-dimensional model constructed by dense point cloud generated based on the depth map after optimization obtained in the step S107 of the invention observed under the other view angle.
Specifically, in other embodiments, the three-dimensional model generated by the present invention may be other types of three-dimensional mesh models, which is not limited by the present invention.
According to the three-dimensional reconstruction method for the complement weak texture scene, firstly, a weak texture scene area in an aerial survey image is extracted through a preset target detection network model, a weak texture segmentation mask is generated, so that a depth map generated by the aerial survey image is complemented according to the weak texture segmentation mask, and then a dense point cloud and a three-dimensional model are generated according to the complemented depth map. The method can recover the three-dimensional scene information of the extremely weak texture region and better improve the quality of the three-dimensional model of the weak texture region.
The three-dimensional reconstruction method for complementing a weak texture scene of the invention can be applied to application scenarios such as AR/VR, 3D games, 3D film and television works, short videos, automatic driving and free viewpoints, can effectively assist in recovering better 3D structures, generates more attractive results, and improves the user experience of the product.
Corresponding to the embodiment of the three-dimensional reconstruction method for the full-weak texture scene, the invention also provides a three-dimensional reconstruction device for the full-weak texture scene.
As shown in fig. 14, fig. 14 is a schematic structural diagram of a three-dimensional reconstruction device for complementing a weak texture scene according to an embodiment of the present invention, which includes the following modules:
the data acquisition module 1401 is configured to acquire a plurality of aerial survey images and corresponding GPS coordinates that are shot for a target scene under different viewing angles;
the weak texture recognition module 1402 is configured to input each aerial survey image into a preset target detection network model to label a weak texture scene area included in the aerial survey image;
a mask generation module 1403 for generating a weak texture segmentation mask corresponding to the aerial survey image based on the weak texture scene region;
the pose resolving module 1404 is configured to perform a space three pose resolving according to all aerial survey images and corresponding GPS coordinates, and obtain a sparse point cloud corresponding to the target scene and a pose corresponding to each aerial survey image;
The depth map generating module 1405 is configured to perform multi-view stereo matching based on the sparse point cloud and the pose corresponding to each aerial survey image, and estimate a depth map corresponding to each aerial survey image;
the depth map complement module 1406 is configured to complement the depth map corresponding to each aerial survey image based on the weak texture segmentation mask, and obtain an optimized depth map corresponding to each aerial survey image;
the point cloud generating module 1407 is configured to generate a dense point cloud corresponding to the target scene based on the sparse point cloud and the optimized depth map corresponding to each aerial survey image;
the three-dimensional reconstruction module 1408 is configured to generate a three-dimensional model corresponding to the target scene based on the dense point cloud.
Preferably, the target detection network model preset in the weak texture recognition module 1402 may include a YOLO v8 network model, and a visual transformer attention mechanism with bi-level routing attention may be used on the basis of the YOLO v8 network model, and a bounding box similarity comparison metric based on a minimum point distance may also be used as a loss function of bounding box regression of the YOLO v8 network model, and a neural network optimizer may also be trained using auto discovery based on genetic programming.
Preferably, the step of performing the space three-pose calculation in the pose calculation module 1404 according to all aerial survey images and corresponding GPS coordinates to obtain a sparse point cloud corresponding to the target scene and a pose corresponding to each aerial survey image may include:
extracting and matching feature points of all aerial survey images based on SIFT_GPU;
aiming at each aerial survey image, obtaining an initial value of a corresponding pose through PnP algorithm;
and (5) minimizing the re-projection error of the pose based on the BA optimization method.
Preferably, the above-mentioned BA-based optimization method minimizes the re-projection error of the pose, which may specifically be:
the pose with the smallest reprojection error is calculated by the following formula:

$$\xi^{*} = \arg\min_{\xi} \frac{1}{2} \sum_{i=1}^{n} \sum_{k \in S_i} \left\| u_i - \pi\!\left(\exp(\xi_k^{\wedge})\, P_i\right) \right\|^{2}$$

wherein ξ represents the current pose; u_i represents the pixel coordinates of the current feature point i, and n is the total number of feature points; S_i represents the set of all aerial survey images associated with the current feature point i; k represents the current aerial survey image; ξ^∧ represents the associated pose of the current pose; P_i represents the three-dimensional point coordinates corresponding to the pixel coordinates of the current feature point i; π(·) denotes the camera projection; ξ* represents the pose for which the reprojection error is minimized, i.e. the pose for which the difference between the pixel coordinates u_i observed under the current pose ξ and the reprojected coordinates of the three-dimensional point P_i is minimal.
Preferably, the step of estimating the depth map corresponding to each aerial survey image in the depth map generating module 1405 based on the sparse point cloud and the pose corresponding to each aerial survey image may include:
Matching object surface elements represented by each pixel window of the aerial survey image based on a patch-match method, and calculating the NCC correlation coefficient between the matching image blocks of each pair of aerial survey images, guided by a homography matrix H, as the matching cost, wherein the matching image blocks comprise a reference image A and a neighboring image AB; wherein,

$$NCC(A, AB) = \frac{M_{A\cdot AB} - M_A\, M_{AB}}{\sqrt{V_A\, V_{AB}}}$$

wherein M_{A·AB} is the mean of the product of the reference image A and the neighboring image AB, M_A is the mean of the reference image A, M_{AB} is the mean of the neighboring image AB, V_A is the variance of the reference image A, and V_{AB} is the variance of the neighboring image AB;
and carrying out propagation optimization on the matching cost and the depth value based on the 4-direction disturbance and the random optimization depth value, generating a depth map, and storing the final matching cost as a confidence map.
Preferably, the step of obtaining the optimized depth map corresponding to each aerial survey image in the depth map completion module 1406 based on the weak texture segmentation mask may include:
aiming at the depth map corresponding to each aerial survey image, acquiring a corresponding reference confidence map and the depth map and confidence map of the neighbor image, and projecting the depth map and the confidence map of the neighbor image to an image space where the depth map corresponding to the aerial survey image is located to form a depth map and confidence map array;
Identifying a weak texture region in the depth map based on the weak texture segmentation mask, and identifying all contour points of the weak texture region;
traversing all the contour points, obtaining the maximum neighborhood depth value of each contour point in a window of a specified range, and constructing a contour point depth histogram based on a specified sampling interval according to the maximum neighborhood depth value;
acquiring contour points of which the corresponding depth values are in a peak interval and a peak adjacent interval of a contour point depth histogram in all the contour points to form a reference point set, and converting coordinates of each contour point in the reference point set from an image coordinate system to a camera coordinate system;
performing plane fitting on the contour points in the reference point set by adopting a RANSAC least square method to obtain a weak texture geometric equation under the camera coordinate system; wherein the weak texture geometric equation is

$$a\,x_{c}+b\,y_{c}+c\,z_{c}+m=0$$

wherein, when solving, the normal vector (a, b, c) is taken as a unit vector, (x_c, y_c, z_c) is the coordinates of a contour point under the camera coordinate system, and the geometric meaning of m is the distance from the origin of the camera to the plane;
based on the weak texture geometric equation, filling the depth value of each pixel point in the weak texture region of the depth map pixel by pixel;
and obtaining an optimized depth map.
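As an illustrative sketch only, the histogram-based selection of reference contour points, the RANSAC plane fitting, and the pixel-by-pixel depth fill described above might be implemented as follows; the neighbourhood window size, histogram bin width, RANSAC thresholds, the pinhole intrinsic matrix K and all function names are assumptions introduced for illustration and do not reproduce the exact procedure of the present method.

```python
# Hedged sketch of the weak-texture depth completion step: build a depth
# histogram over contour points, fit a plane with RANSAC + least squares,
# then fill the masked region from the ray/plane intersection.
# Window size, bin width, thresholds, intrinsics and names are assumptions.
import numpy as np


def max_neighborhood_depth(depth, pt, half_win=7):
    """Largest valid depth around contour point pt = (u, v)."""
    u, v = pt
    win = depth[max(v - half_win, 0):v + half_win + 1,
                max(u - half_win, 0):u + half_win + 1]
    valid = win[win > 0]
    return float(valid.max()) if valid.size else 0.0


def select_reference_points(depth, contour_pts, bin_width=0.5):
    """Keep contour points whose neighbourhood depth lies in the peak bin
    of the depth histogram or in one of the two adjacent bins."""
    depths = np.array([max_neighborhood_depth(depth, p) for p in contour_pts])
    valid = depths > 0
    if not valid.any():
        return [], []
    edges = np.arange(depths[valid].min(),
                      depths[valid].max() + 2 * bin_width, bin_width)
    hist, edges = np.histogram(depths[valid], bins=edges)
    peak = int(np.argmax(hist))
    lo, hi = edges[max(peak - 1, 0)], edges[min(peak + 2, len(edges) - 1)]
    keep = [(p, d) for p, d in zip(contour_pts, depths) if lo <= d < hi]
    return [p for p, _ in keep], [d for _, d in keep]


def backproject(pts2d, depths, K):
    """Image coordinates + depth -> points in the camera coordinate system."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    us, vs = np.asarray(pts2d, float).T
    z = np.asarray(depths, float)
    return np.stack([(us - cx) / fx * z, (vs - cy) / fy * z, z], axis=1)


def fit_plane_ransac(pts, n_iter=200, dist_thr=0.05, seed=0):
    """Return (n, m) with unit normal n and offset m such that n . p + m = 0."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(n_iter):
        sample = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        if np.linalg.norm(n) < 1e-9:
            continue
        n /= np.linalg.norm(n)
        inliers = np.abs(pts @ n - n @ sample[0]) < dist_thr
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    inl = pts[best_inliers]                      # least-squares refinement via SVD
    centroid = inl.mean(axis=0)
    n = np.linalg.svd(inl - centroid)[2][-1]
    n /= np.linalg.norm(n)
    return n, float(-n @ centroid)


def fill_weak_texture(depth, mask, K, n, m):
    """Fill masked pixels with the depth of the ray/plane intersection."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    vs, us = np.nonzero(mask)
    rays = np.stack([(us - cx) / fx, (vs - cy) / fy, np.ones(len(us))], axis=1)
    denom = rays @ n
    z = np.zeros_like(denom)
    ok = np.abs(denom) > 1e-9
    z[ok] = -m / denom[ok]                       # depth along the optical axis
    filled = depth.copy()
    filled[vs, us] = z
    return filled
```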
Preferably, the three-dimensional model corresponding to the target scene generated based on the dense point cloud in the three-dimensional reconstruction module 1408 may be a Delaunay three-dimensional mesh model.
The implementation process of the functions and roles of each module in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The invention also provides a computer device comprising at least a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of the preceding embodiments when executing the program.
FIG. 15 illustrates a more specific hardware architecture of a computing device provided by the present invention, which may include: a processor 1501, memory 1502, input/output interfaces 1503, communication interfaces 1504, and a bus 1505. Wherein the processor 1501, the memory 1502, the input/output interface 1503 and the communication interface 1504 enable communication connection between each other within the device via the bus 1505.
The processor 1501 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is used for executing relevant programs to implement the technical scheme provided by the present invention. The processor 1501 may also include a graphics card, which may be an Nvidia Titan X graphics card, a 1080Ti graphics card, or the like.
The Memory 1502 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory ), static storage device, dynamic storage device, or the like. The memory 1502 may store an operating system and other application programs, and when the present invention is implemented in software or firmware, the relevant program code is stored in the memory 1502 and invoked for execution by the processor 1501.
The input/output interface 1503 is used to connect with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The communication interface 1504 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 1505 includes a path to transfer information between components of the device (e.g., processor 1501, memory 1502, input/output interface 1503, and communication interface 1504).
It should be noted that although the above device only shows the processor 1501, the memory 1502, the input/output interface 1503, the communication interface 1504, and the bus 1505, in the specific implementation, the device may further include other components necessary to achieve normal operation. Furthermore, it will be appreciated by those skilled in the art that the apparatus may include only the components necessary to implement the present invention, and not all of the components shown in the drawings.
The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method as described in any of the preceding embodiments.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
The embodiments of the present invention are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the functions of the modules may be implemented in the same piece or pieces of software and/or hardware when implementing the aspects of the present invention. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

1. A method of three-dimensional reconstruction of a complement weak texture scene, the method comprising:
acquiring a plurality of aerial survey images shot for a target scene under different view angles and corresponding GPS coordinates;
inputting each aerial survey image into a preset target detection network model to mark out a weak texture scene area contained in the aerial survey image;
generating a weak texture segmentation mask corresponding to the aerial survey image based on the weak texture scene region;
performing space three-pose calculation according to all aerial survey images and corresponding GPS coordinates, and acquiring sparse point clouds corresponding to the target scene and poses corresponding to each aerial survey image;
performing multi-view stereo matching based on the sparse point cloud and the pose corresponding to each aerial survey image, and estimating a depth map corresponding to each aerial survey image;
completing the depth map corresponding to each aerial survey image based on the weak texture segmentation mask to obtain an optimized depth map corresponding to each aerial survey image;
based on the sparse point cloud and the optimized depth map corresponding to each aerial survey image, fusing to generate a dense point cloud corresponding to the target scene;
And generating a three-dimensional model corresponding to the target scene based on the dense point cloud.
2. The method of claim 1, wherein the preset target detection network model comprises a YOLO v8 network model, wherein a vision transformer attention mechanism with bi-level routing attention is used on the basis of the YOLO v8 network model, wherein a bounding box similarity comparison metric based on minimum point distance is used as the loss function for bounding box regression of the YOLO v8 network model, and wherein training is performed using a neural network optimizer automatically discovered through genetic programming.
3. The method of claim 1, wherein the performing a space three pose solution according to all the aerial survey images and corresponding GPS coordinates to obtain a sparse point cloud corresponding to the target scene and a pose corresponding to each of the aerial survey images comprises:
extracting and matching feature points of all aerial survey images based on SIFT_GPU;
solving each aerial survey image through a PnP algorithm to obtain an initial value of a corresponding pose;
and minimizing the re-projection error of the pose based on a BA optimization method.
4. A method according to claim 3, characterized in that the BA-based optimization method minimizes the re-projection error of the pose, in particular:
The pose with the smallest reprojection error is calculated by the following formula:

$$\xi^{*}=\arg\min_{\xi}\ \frac{1}{2}\sum_{i=1}^{n}\sum_{k\in S_{i}}\left\|u_{i}-\pi\!\left(\exp\left(\xi^{\wedge}\right)P_{i}\right)\right\|_{2}^{2}$$

wherein ξ represents the current pose; u_i represents the pixel coordinates of the current feature point i, and n is the total number of feature points; S_i represents the set of all aerial survey images associated with the current feature point i; k represents the current aerial survey image; ξ^∧ represents the Lie-algebra form associated with the current pose, and π(·) denotes projection onto the image plane; P_i represents the three-dimensional point coordinates corresponding to the pixel coordinates of the current feature point i; ξ* represents the pose that minimizes the re-projection error, i.e. the pose for which the difference between the pixel coordinate u_i observed at the current pose ξ and the re-projected coordinate of the three-dimensional point P_i is smallest.
5. The method of claim 1, wherein the performing multi-view stereo matching based on the sparse point cloud and the pose corresponding to each aerial survey image, and estimating the depth map corresponding to each aerial survey image, comprises:
matching object surface elements represented by each pixel window of the aerial survey image based on a patch-match method, and calculating the NCC correlation coefficient between the aerial survey images as the matching cost through matching image blocks guided by a homography matrix H, wherein the matching image blocks comprise a reference image A and an adjacent image AB; wherein,

$$NCC(A,AB)=\frac{M_{A\cdot AB}-M_{A}\,M_{AB}}{\sqrt{V_{A}\cdot V_{AB}}}$$

wherein M_{A·AB} is the mean value of the product of the reference image A and the adjacent image AB, M_A is the mean value of the reference image A, M_{AB} is the mean value of the adjacent image AB, V_A is the variance of the reference image A, and V_{AB} is the variance of the adjacent image AB;
and carrying out propagation optimization on the matching cost and the depth value based on 4-direction perturbation and random refinement of the depth value, generating the depth map, and storing the final matching cost as a confidence map.
6. The method of claim 1, wherein the complementing the depth map corresponding to each aerial survey image based on the weak texture segmentation mask to obtain an optimized depth map corresponding to each aerial survey image comprises:
aiming at the depth map corresponding to each aerial survey image, acquiring a corresponding reference confidence map and the depth map and confidence map of a neighbor image of the aerial survey image, and projecting the depth map and the confidence map of the neighbor image to an image space where the depth map corresponding to the aerial survey image is located to form a depth map and confidence map array;
identifying a weak texture region in the depth map based on the weak texture segmentation mask, and identifying all contour points of the weak texture region;
Traversing all the contour points, obtaining the maximum neighborhood depth value of each contour point in a specified range window, and constructing a contour point depth histogram based on specified sampling intervals according to the maximum neighborhood depth value;
acquiring, from all the contour points, the contour points whose corresponding depth values are in the peak interval and the intervals adjacent to the peak of the contour point depth histogram, to form a reference point set, and converting the coordinates of each contour point in the reference point set from the image coordinate system to the camera coordinate system;
performing plane fitting on contour points in the reference point set by adopting a RANSAC least square method to obtain a weak texture geometric equation under a camera coordinate system; wherein the weak texture geometric equation is

$$a\,x_{c}+b\,y_{c}+c\,z_{c}+m=0$$

wherein, when solving, the normal vector (a, b, c) is taken as a unit vector, (x_c, y_c, z_c) is the coordinates of a contour point under the camera coordinate system, and the geometric meaning of m is the distance from the origin of the camera to the plane;
based on the weak texture geometric equation, filling the depth value of each pixel point in the weak texture region of the depth map pixel by pixel;
and obtaining the optimized depth map.
7. The method of claim 1, wherein the three-dimensional model corresponding to the target scene generated based on the dense point cloud is a Delaunay three-dimensional mesh model.
8. A three-dimensional reconstruction apparatus for complementing a weakly textured scene, the apparatus comprising:
the data acquisition module is used for acquiring a plurality of aerial survey images shot for a target scene under different view angles and corresponding GPS coordinates;
the weak texture recognition module is used for inputting each aerial survey image into a preset target detection network model so as to mark out a weak texture scene area contained in the aerial survey image;
the mask generation module is used for generating a weak texture segmentation mask corresponding to the aerial survey image based on the weak texture scene area;
the pose resolving module is used for performing space three pose calculation according to all the aerial survey images and the corresponding GPS coordinates, and acquiring the sparse point cloud corresponding to the target scene and the pose corresponding to each aerial survey image;
the depth map generation module is used for carrying out multi-view stereo matching based on the sparse point cloud and the pose corresponding to each aerial survey image, and estimating a depth map corresponding to each aerial survey image;
the depth map completion module is used for completing the depth map corresponding to each aerial survey image based on the weak texture segmentation mask to obtain an optimized depth map corresponding to each aerial survey image;
the point cloud generation module is used for generating, through fusion based on the sparse point cloud and the optimized depth map corresponding to each aerial survey image, a dense point cloud corresponding to the target scene;
and the three-dimensional reconstruction module is used for generating a three-dimensional model corresponding to the target scene based on the dense point cloud.
9. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of three-dimensional reconstruction of a complement weak texture scene of any one of claims 1-7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements a method of three-dimensional reconstruction of a scene of complementary weak textures as claimed in any one of claims 1-7 when executing the program.
CN202311461325.5A 2023-11-03 2023-11-03 Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene Pending CN117726747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311461325.5A CN117726747A (en) 2023-11-03 2023-11-03 Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311461325.5A CN117726747A (en) 2023-11-03 2023-11-03 Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene

Publications (1)

Publication Number Publication Date
CN117726747A true CN117726747A (en) 2024-03-19

Family

ID=90204134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311461325.5A Pending CN117726747A (en) 2023-11-03 2023-11-03 Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene

Country Status (1)

Country Link
CN (1) CN117726747A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117994444A (en) * 2024-04-03 2024-05-07 浙江华创视讯科技有限公司 Reconstruction method, device and storage medium of complex scene


Similar Documents

Publication Publication Date Title
CN110335316B (en) Depth information-based pose determination method, device, medium and electronic equipment
CN108961327B (en) Monocular depth estimation method and device, equipment and storage medium thereof
US11100401B2 (en) Predicting depth from image data using a statistical model
CN106940704B (en) Positioning method and device based on grid map
US10334168B2 (en) Threshold determination in a RANSAC algorithm
US8452081B2 (en) Forming 3D models using multiple images
US8447099B2 (en) Forming 3D models using two images
US9177381B2 (en) Depth estimate determination, systems and methods
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
US10311595B2 (en) Image processing device and its control method, imaging apparatus, and storage medium
US8903161B2 (en) Apparatus for estimating robot position and method thereof
CN115205489A (en) Three-dimensional reconstruction method, system and device in large scene
US20130215221A1 (en) Key video frame selection method
CN113689578B (en) Human body data set generation method and device
CN111627065A (en) Visual positioning method and device and storage medium
CN109472820B (en) Monocular RGB-D camera real-time face reconstruction method and device
EP3293700B1 (en) 3d reconstruction for vehicle
CN114556422A (en) Joint depth prediction from dual cameras and dual pixels
CN117726747A (en) Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene
CN114494589A (en) Three-dimensional reconstruction method, three-dimensional reconstruction device, electronic equipment and computer-readable storage medium
CN114332125A (en) Point cloud reconstruction method and device, electronic equipment and storage medium
CN110706332B (en) Scene reconstruction method based on noise point cloud
CN110443228B (en) Pedestrian matching method and device, electronic equipment and storage medium
CN114170290A (en) Image processing method and related equipment
CN114981845A (en) Image scanning method and device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination