CN116091706B - Three-dimensional reconstruction method for multi-mode remote sensing image deep learning matching


Info

Publication number
CN116091706B
CN116091706B
Authority
CN
China
Prior art keywords
matching
image
descriptor
mode
distance
Prior art date
Legal status
Active
Application number
CN202310363863.4A
Other languages
Chinese (zh)
Other versions
CN116091706A
Inventor
姚国标
张力
艾海滨
张进
任晓芳
傅青青
Current Assignee
Shandong Jianzhu University
Original Assignee
Shandong Jianzhu University
Priority date
Filing date
Publication date
Application filed by Shandong Jianzhu University
Priority to CN202310363863.4A
Publication of CN116091706A
Application granted
Publication of CN116091706B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Biophysics (AREA)
  • Geometry (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

In the current real-scene three-dimensional model acquisition process, the discrimination and extraction of terrain elements and their point, line and surface features among large numbers of multi-mode images still depend on manual work.

Description

Three-dimensional reconstruction method for multi-mode remote sensing image deep learning matching
Technical Field
The invention relates to the fields of new-generation information technology, digital photogrammetry, computer vision and artificial intelligence, and their intersection, and in particular to a three-dimensional reconstruction method for multi-mode remote sensing image deep learning matching.
Background
Multi-mode remote sensing data of the same area, such as optical images, infrared images and synthetic aperture radar (Synthetic Aperture Radar, SAR) images, can provide complementary texture, geometry, spectrum and radiation information for real-scene three-dimensional reconstruction once processing technologies such as homonymous feature extraction, image fusion and analysis are applied. However, in the current real-scene three-dimensional model acquisition process, the discrimination and extraction of terrain elements and their point, line and surface features among large numbers of multi-mode images still depend on manual work, which consumes enormous manpower and material resources, and measurement accuracy is limited by the skill of the operators. Specifically, the prior art has the following problems:
(1) Because multi-mode images such as optical and SAR images are derived from different types of sensors, their imaging light sources and imaging mechanisms differ intrinsically. As a result, the homonymous features of multi-mode images tend to have a low repetition rate, small numbers and uneven spatial distribution, and homonymous feature matching is sparse or even fails, which severely constrains the automation and intelligence of current geographic information mapping technology.
(2) Conventional matching methods cannot adapt to geometric deformation between images, and prior information is generally needed to roughly correct the images before matching in order to eliminate the geometric deformation. When such prior information is missing, the geometric transformation between images must be estimated from manually collected ground control points, and manual point selection is time-consuming and labor-intensive, which limits the popularization and application of these methods.
(3) Multi-mode remote sensing images are often large in scale and unordered. Although algorithms for three-dimensional digital modeling from multi-source multi-mode remote sensing data have been well developed at home and abroad, they have not formed a system: the automatic construction of semantic entity models is still immature and requires time-consuming manual operation, and an integrated, efficient computing framework combining fast retrieval of multi-mode overlapping images, deep learning matching and real-scene three-dimensional reconstruction is lacking.
Disclosure of Invention
The invention aims to provide a three-dimensional reconstruction method for multi-mode remote sensing image deep learning matching that not only solves the problem of homonymous feature matching failure but also forms a mature system framework.
This aim is achieved by the following technical scheme:
a three-dimensional reconstruction method for multi-mode remote sensing image deep learning matching comprises the following steps:
(1) deep learning feature extraction and description algorithm construction
Acquiring a multi-mode homonymous image block training data set, learning a trained DNN model based on the data set, and performing feature extraction and multi-channel descriptor output;
(2) feature matching algorithm establishment of full-connection network iteration
Fusing the multi-channel descriptors, constructing a complementary comprehensive descriptor that takes various image information into account together with a matching measure model, and completing the optimal matching of multi-mode image features through adaptive iteration of the fully connected network FCN;
(3) matching algorithm integration and live-action three-dimensional reconstruction application
Constructing an application framework integrating multi-mode overlapping image fast retrieval, deep learning matching and live-action three-dimensional reconstruction.
The method for obtaining the multi-mode homonymous image block training data set comprises the following steps: introducing a multi-mode homonymous image block data set and performing data augmentation with a deep generative matching network to obtain the multi-mode homonymous image block training data set.
The DNN network structure comprises two key subnetworks, namely the affine-invariant learning network HesAffNet and the nonlinear brightness distortion learning network NLIntensNet; the former learns the affine distortion of homonymous regions, and the latter learns the brightness distortion.
The fast retrieval of multi-mode overlapping images comprises the following steps: for multi-mode image combination retrieval based on visual features, the maximally stable extremal region features from computer vision are introduced, the feature principal direction is set to the 0 direction, and feature descriptors are generated from the pixel information of the feature point neighborhoods, yielding a descriptor vector set; a hierarchical K-means algorithm generates visual words and constructs a vocabulary tree; an inverted document is generated at each node and its length recorded; the visual words are weighted, and image combinations with overlapping or similar content are retrieved. A random forest is then constructed over the visual features detected in the images to be matched using a K-d tree algorithm, and the search is accelerated by a best-neighbor search strategy and principal component analysis, so as to determine the overlapping area of the images to be matched.
The training method of the DNN comprises the following steps: multi-mode homonymous image blocks of 64x64 pixels are transformed by $T_1$ and $T_2$ respectively and further cut into homonymous image blocks of 32x32 pixels; these are imported into twin HesAffNet networks with shared weights, which output the affine 4-parameter matrices $A_1$ and $A_2$ through network learning, from which geometrically normalized homonymous image blocks are generated; the normalized homonymous image blocks are imported into twin NLIntensNet networks with shared weights for nonlinear radiation distortion learning, outputting brightness-invariant 3-channel descriptors. According to the established multi-mode loss function

$$L = \frac{1}{k}\sum_{i=1}^{k}\max\left(0,\ 1 + d(m_i, p_i) - d(m_i, n_i)\right)$$

the loss function values of the different channels can be calculated respectively, where $k$ is the number of samples, $d(m_i, p_i)$ is the multi-mode matching descriptor distance, and $d(m_i, n_i)$ is the multi-mode nearest-neighbor non-matching descriptor distance. Finally, based on a large number of training samples, stochastic gradient descent and backward iterative propagation of residuals drive the loss function toward its minimum, completing the joint HesAffNet and NLIntensNet training and global optimization.
Further, the multi-channel descriptors in step (2) comprise the directional gradient descriptor G, the structural information descriptor S and the local extremum response descriptor R. The descriptor vector dimension is reduced with PCA principal component analysis, and the three reduced descriptors can be expressed as

$$G = [g_1, \ldots, g_n],\quad S = [s_1, \ldots, s_n],\quad R = [r_1, \ldots, r_n]$$

where $n$ denotes the number of extracted features and $g_i$, $s_i$ and $r_i$ denote, respectively, the local gradient vector, the overall structural information entropy vector and the neighborhood Laplacian-of-Gaussian extremum vector of a feature. The three descriptors are then fused to construct a comprehensive descriptor $F$ that takes the global and local information of the image into account:

$$F = [G, S, R]$$

A robust matching measure model $EM$ of the comprehensive descriptor is established:

$$EM = k_1 E_1 + k_2 E_2 + k_3 E_3$$

where $E_1$, $E_2$ and $E_3$ denote the gradient space distance, the structural information entropy distance and the local response distance respectively, and $k_1$, $k_2$ and $k_3$ denote the corresponding weights, so that $EM \in [0,1]$.

The distance weights $k_1$, $k_2$, $k_3$ of the descriptor vectors and the matching measure threshold $T_E$ are obtained through adaptive iteration of the fully connected network FCN, completing the optimal matching of the multi-mode image features.
In order to reduce the influence of partial distance extrema and improve the robustness and universality of the matching measure, the gradient space distance, the structural information entropy distance and the local response distance are each normalized to [0,1] using a sigmoid function.
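As an illustration only, the following Python sketch shows one way the sigmoid-normalized distances could be combined into the weighted measure; the sigmoid centering and the acceptance rule (a pair accepted when the measure falls below the threshold) are assumptions not fixed by the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def matching_measure(e_grad, e_entropy, e_resp, k=(1/3, 1/3, 1/3)):
    """Sketch of the robust matching measure EM = k1*E1 + k2*E2 + k3*E3.

    e_grad, e_entropy, e_resp: raw gradient-space, structural-information-
    entropy and local-response distances for one candidate pair; each is
    squashed to (0, 1) with a sigmoid (the centering/scale is an assumption).
    """
    E = sigmoid(np.array([e_grad, e_entropy, e_resp]))
    k = np.asarray(k)            # distance weights, initially 1/3 each
    return float(np.dot(k, E))   # EM lies in [0, 1]

# Assumed acceptance rule: a candidate pair is kept as a homonymous match
# when matching_measure(...) < T_E, with T_E initialized to 0.6.
```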
The adaptive iteration of the fully connected network FCN comprises the following steps:
(1) The three distance weights of the matching measure are initialized to $k_{10} = 1/3$, $k_{20} = 1/3$ and $k_{30} = 1/3$, and the matching measure threshold is initialized to $T_{E0} = 0.6$;
(2) Under the double geometric constraint, based on the comprehensive descriptor $F$, the initialized matching measure model $EM$ and the matching measure threshold $T_{E0}$, an initial homonymous matching set $\Phi_0$ is acquired, and a new homography matrix and a new fundamental matrix of the stereo pair are estimated;
(3) The FCN adaptively generates new distance weights $k_1$, $k_2$, $k_3$ and a new matching measure threshold $T_E$, each parameter having been initialized with the values of step (1);
(4) Steps (2) and (3) are repeated until the number of homonymous features no longer changes; the iteration then terminates and the best matching result is output.
The live-action three-dimensional reconstruction comprises the following steps:
(1) Incorporating the three-dimensional points observed by each sensor into collinearity condition equations and jointly solving the object-space three-dimensional coordinates of each feature point;
(2) Expanding the matching window to the whole overlapping area to be matched using the deep learning matching algorithm and object-space geometric constraints to obtain a pixel-by-pixel dense matching point cloud;
(3) Performing automatic dense matching and three-dimensional reconstruction of the multi-mode images by bundle adjustment.
Preferably, the live-action three-dimensional reconstruction further comprises the following steps:
determining the image point coordinates of spatial points on each image from the three-dimensional model information using the collinearity condition equations, screening target textures based on the angle between the normal vector of the current model face and the vector from the face center to the photographing center, and selecting target textures according to the size of the back-projected texture area;
automatically cropping the target texture with a minimum-area bounding rectangle strategy;
and sampling point by point with a ray tracing method according to the inverse mapping rule, realizing automatic registration of textures and models.
The invention has the advantages that:
aiming at the problems of low repetition rate, rare number, uneven spatial distribution and the like of the homonymous feature extraction of the multi-modal image, the scheme provides a multi-modal feature extraction and description algorithm fused with a deep neural network on the basis of respectively designing an affine invariant feature extraction network and a brightness invariant descriptor network, and improves the repetition rate, the detection number and the spatial distribution uniformity of the features by utilizing the powerful learning and optimizing functions of the invariant network, thereby guaranteeing the invariance, universality and accuracy of feature extraction and description and laying a foundation for reliable matching of the subsequent multi-modal image;
because the scale, azimuth, surface brightness and neighborhood information of the same space target on the multi-mode image are subjected to complex distortion or missing, great difficulty is formed for describing and matching image characteristics;
the multi-mode remote sensing image often presents difficult states such as large scale, disorder and the like, so the multi-mode remote sensing large data intelligent application is oriented, and the scheme provides an integrated efficient computing framework integrating multi-mode overlapped image rapid retrieval, deep learning matching and live-action three-dimensional reconstruction, and finally builds a three-dimensional geographic information product production technology system.
The method is oriented to the multi-mode remote sensing image sequence, can fully automatically identify and reliably extract the homonymous features of various terrain elements, and then replaces manual work with a computer to realize intelligent processing and deep analysis of the multi-mode remote sensing image, so that the geographic information productivity of mapping is liberated and developed, and key technology and high-quality data support are provided for real-scene three-dimensional Chinese construction.
Drawings
FIG. 1 is a flow chart of the overall technical scheme of the invention;
FIG. 2 is a schematic diagram of a deep learning feature extraction and description strategy according to the present invention;
fig. 3 is a block diagram of an adaptive matching algorithm for FCN iteration of the present invention;
FIG. 4 is a diagram of a matching algorithm integration and live three-dimensional reconstruction application framework of the invention;
FIG. 5 is a schematic diagram of the DGMN-based multi-mode image block pair data augmentation strategy of the present invention;
FIG. 6 is a flowchart of invariant feature extraction DNN construction and training according to the present invention;
FIG. 7 shows the fully automatic reconstruction of a three-dimensional surface model fused from a multi-mode image set according to an embodiment of the present invention, wherein (a) is a side view of the reconstructed three-dimensional surface model with pose information and (b) is a top view of the reconstruction;
FIG. 8 shows the fully automatic registration of real textures and the model according to an embodiment of the present invention, wherein (a) is automatic selection of the best texture, (b) is automatic cropping of the best texture, and (c) is automatic texture mapping.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The invention discloses a three-dimensional reconstruction method for multi-mode remote sensing image deep learning matching. The flow, shown in FIG. 1, comprises three steps: deep learning feature extraction and description algorithm construction; establishment of the fully connected network iterative feature matching algorithm; and matching algorithm integration with live-action three-dimensional reconstruction application. The detailed method is as follows:
step one, deep learning feature extraction and description algorithm construction
Referring to FIG. 2, the description performance of the DNN model on invariant features depends to a certain extent on the breadth and number of training samples, so internationally published multi-source homonymous image block data sets (such as the SEN1-2 and UBC data sets) are introduced; on this basis, a deep generative matching network (Deep Generative Matching Network, DGMN), shown in FIG. 5, is adopted to maximally augment the existing data sets, where G1 denotes an optical-to-SAR image block translator and G2 a SAR-to-optical image block translator. First, a generative adversarial network (Generative Adversarial Network, GAN) generates numerous matched and non-matched image block pairs; then a generative matching network (Generative Matching Network, GMN) outputs the matching label of each image block pair, where '1' and '0' denote the matched and non-matched marks respectively. A sufficient multi-mode homonymous image block training data set is finally obtained, laying a sample foundation for training and learning of the deep neural network (Deep Neural Network, DNN).
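Purely as a structural sketch of this augmentation step: `G1`, `G2` and `gmn` below are hypothetical callables standing in for the two GAN translators and the generative matching network, and the 0.5 score threshold is an assumption.

```python
import torch

def augment_pairs(optical_blocks, sar_blocks, G1, G2, gmn, thresh=0.5):
    """DGMN-style augmentation sketch: translate blocks across modalities
    with the GAN generators, then label each synthesized pair with the
    generative matching network ('1' = match, '0' = non-match)."""
    pairs = []
    with torch.no_grad():
        for o, s in zip(optical_blocks, sar_blocks):
            # Candidate pairs: (optical translated to SAR, real SAR) and
            # (real optical, SAR translated to optical)
            for a, b in ((G1(o), s), (o, G2(s))):
                label = int(gmn(a, b) > thresh)  # hypothetical match score
                pairs.append((a, b, label))
    return pairs
```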
The DNN comprises two key subnetworks: the affine-invariant learning network HesAffNet and the nonlinear brightness distortion learning network NLIntensNet. The former learns the affine distortion of homonymous regions and outputs the affine 4-parameters (4D); the latter learns the brightness distortion and outputs a 3-channel descriptor. The DNN training process first requires designing the Hessian affine-invariant neighborhood parameter extraction network HesAffNet and the nonlinear brightness-invariant descriptor generation network NLIntensNet, from which the feature extraction and description network DNN is constructed; a DNN model with good geometric and brightness invariance is then learned from the multi-mode homonymous image block training data set; Hessian corner features are subsequently extracted from the test images, and the DNN extracts affine-invariant neighborhoods and brightness-invariant multi-channel descriptors at the Hessian corners, yielding geometric- and brightness-invariant descriptors of the multi-mode images. As shown in FIG. 6, the flow is constructed as follows: multi-mode homonymous image blocks of 64x64 pixels are transformed by $T_1$ and $T_2$ respectively and further cut into homonymous image blocks of 32x32 pixels; these are imported into twin HesAffNet networks with shared weights, which output the affine 4-parameter matrices $A_1$ and $A_2$ through network learning, from which geometrically normalized homonymous image blocks are generated; the normalized blocks are then imported into twin NLIntensNet networks with shared weights for nonlinear radiation distortion learning, outputting brightness-invariant 3-channel descriptors. According to the established multi-mode loss function

$$L = \frac{1}{k}\sum_{i=1}^{k}\max\left(0,\ 1 + d(m_i, p_i) - d(m_i, n_i)\right)$$

where $k$ is the number of samples, $d(m_i, p_i)$ is the multi-mode matching descriptor distance and $d(m_i, n_i)$ is the multi-mode nearest-neighbor non-matching descriptor distance, the loss function values of the different channels can be calculated. Finally, based on a large number of training samples, stochastic gradient descent and backward iterative propagation of residuals drive the loss function toward its minimum, completing the joint HesAffNet and NLIntensNet training and global optimization.
Step two, establishing a feature matching algorithm of full-connection network iteration
The geometric- and brightness-invariant descriptors of the multi-mode images are obtained in step one. Step two fuses the multi-channel descriptors to construct a complementary comprehensive descriptor that takes the global structural information, local gradients, extrema and other information of the images into account; on this basis, a robust matching measure model based on the comprehensive descriptor is constructed; reliable distance weights for the descriptors and a reliable adaptive measure threshold are then obtained through adaptive iteration of the fully connected network (Fully Connected Network, FCN), achieving optimal matching of the multi-mode image features.
Referring to fig. 3, the directional gradient descriptor G, the structural information descriptor S and the local extremum response descriptor R are first fused, with the descriptor vector dimension reduced by principal component analysis (Principal Components Analysis, PCA) to improve subsequent efficiency. The three reduced descriptors can be expressed as

$$G = [g_1, \ldots, g_n],\quad S = [s_1, \ldots, s_n],\quad R = [r_1, \ldots, r_n]$$

where $n$ denotes the number of extracted features and $g_i$, $s_i$ and $r_i$ denote, respectively, the local gradient vector, the overall structural information entropy vector and the neighborhood Laplacian-of-Gaussian extremum vector of a feature. The three descriptors are then fused to construct a comprehensive descriptor $F$ that takes the global and local information of the image into account:

$$F = [G, S, R]$$

Next, a robust matching measure calculation model $EM$ based on the comprehensive descriptor is established:

$$EM = k_1 E_1 + k_2 E_2 + k_3 E_3$$

where $E_1$, $E_2$ and $E_3$ denote the gradient space distance, the structural information entropy distance and the local response distance respectively (note: to attenuate the influence of partial distance extrema and improve the robustness and universality of the matching measure, the three distances are each normalized to [0,1] with a sigmoid function), and $k_1$, $k_2$ and $k_3$ denote the corresponding weights, so that $EM \in [0,1]$.

The distance weights $k_1$, $k_2$, $k_3$ of the descriptor vectors and the matching measure threshold $T_E$ are then obtained through adaptive iteration of the fully connected network FCN. The iteration can be summarized as follows: Step1, the three distance weights of the matching measure are initialized to $k_{10} = 1/3$, $k_{20} = 1/3$ and $k_{30} = 1/3$, and the matching measure threshold is initialized to $T_{E0} = 0.6$; Step2, under the double geometric constraint, based on the comprehensive feature descriptor $F$, the initialized matching measure $EM$ and the matching measure threshold $T_{E0}$, an initial homonymous matching set $\Phi_0$ is acquired, and a new homography matrix and a new fundamental matrix of the stereo pair are estimated; Step3, the FCN adaptively generates new, more reliable distance weights $k_1$, $k_2$, $k_3$ and a new matching measure threshold $T_E$, each parameter having been initialized with the values of Step1; Step4, Step2 and Step3 are repeated until the number of homonymous features no longer changes, the iteration terminates, and the best matching result is output.
Step three, matching algorithm integration and live-action three-dimensional reconstruction application
Referring to fig. 4, the scheme establishes an application framework integrating multi-mode overlapping image fast retrieval, deep learning matching and live-action three-dimensional reconstruction.
First, to accurately and efficiently obtain the multi-mode image sequences to be matched from large-scale unordered multi-mode images, maximally stable extremal region (Maximally Stable Extremal Regions, MSERs) features, widely applied in computer vision, are introduced; these features are stable with respect to viewing angle, brightness, noise and the like. The feature principal direction is set to the 0 direction, and feature descriptors are generated from the pixel information of the feature point neighborhoods, yielding a descriptor vector set S. A hierarchical K-means algorithm then generates visual words from the vector set S and constructs a vocabulary tree; an inverted document is generated for each node and its length recorded; TF-IDF (Term Frequency-Inverse Document Frequency) statistics weight the visual words, and image combinations with overlapping or similar content are retrieved. On this basis, a K-d tree algorithm constructs a random forest over the visual features detected in the images to be matched, and a best-neighbor search strategy and principal component analysis accelerate the search, rapidly determining the overlapping area of the images to be matched.
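The sketch below illustrates the vocabulary-tree and TF-IDF ideas; the branch factor, tree depth and the use of scikit-learn's KMeans are assumptions, not the patent's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocab_tree(descriptors, branch=10, depth=3):
    """Hierarchical k-means over MSER descriptor vectors; returns the
    leaf-level visual-word centroids (branch/depth are hypothetical)."""
    def split(vecs, level):
        if level == depth or len(vecs) < branch:
            return [vecs.mean(axis=0)]
        km = KMeans(n_clusters=branch, n_init=4).fit(vecs)
        words = []
        for c in range(branch):
            words.extend(split(vecs[km.labels_ == c], level + 1))
        return words
    return np.array(split(np.asarray(descriptors, dtype=float), 0))

def tfidf_weights(image_word_counts):
    """TF-IDF weighting of visual-word histograms: images whose weighted
    histograms are similar are candidate overlapping combinations."""
    counts = np.asarray(image_word_counts, dtype=float)  # (images, words)
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    df = (counts > 0).sum(axis=0)                        # document frequency
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return tf * idf
```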
Then, image blocking and parallel distributed processing technologies are introduced to efficiently extract uniformly distributed Hessian feature points from the overlapping area; the HesAffNet and NLIntensNet network models are integrated to realize Hessian affine-invariant neighborhood detection and nonlinear brightness-invariant descriptor generation; on this basis, the FCN adaptive iterative matching technique yields the optimal matching result of the multi-mode image features, realizing joint positioning of the multi-mode, multi-source images.
Finally, in order to further recover the image poses and the dense object point cloud positions, the following technical scheme is adopted:
(1) Recovering the three-dimensional coordinates of the feature points. Considering that the position of the same object point on multi-view images involves at least two sensors, the three-dimensional points observed by each sensor are incorporated into collinearity condition equations, and the object-space three-dimensional coordinates of each feature point are solved jointly.
(2) Extracting the high-density three-dimensional point cloud. The matching window is expanded to the whole overlapping area to be matched using the deep learning matching algorithm and object-space geometric constraints, finally obtaining a pixel-by-pixel dense matching point cloud. During dense matching, an adaptive window and gray-scale weighting strategy automatically compensates for deformation caused by terrain relief, scale change and the like, guaranteeing the reliability of dense matching propagation.
(3) Bundle adjustment. The bundle adjustment follows the formula

$$\min_{P,\,X}\ \sum_{i,j} \rho\left(\left\| x_{ij} - \pi(P_i, X_j) \right\|^2\right)$$

where $P$ denotes the sensor pose parameters, $X$ and $x$ denote the object-space three-dimensional point coordinates and the corresponding image point coordinates respectively, $\pi$ is the collinearity-equation projection, and $\rho$ denotes a loss function that robustly suppresses gross errors. The joint nonlinear least-squares error is estimated with this formula, and minimizing the sum of reprojection errors optimizes the sensor pose parameters and the object-space three-dimensional point coordinates simultaneously. Integrating this highly extensible nonlinear least-squares model for scene optimization, automatic dense matching and three-dimensional reconstruction were performed on the 1190 multi-mode images of the test area; the initial results, shown in fig. 7, demonstrate the feasibility of the method.
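A hedged sketch of the adjustment follows, using SciPy's robust least squares; the `soft_l1` loss stands in for the robust function ρ, and `project` is a hypothetical collinearity-equation projection, both assumptions rather than the patent's exact implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx,
                           obs_xy, project):
    """Residuals x_ij - project(P_i, X_j) for every observation."""
    cams = params[:n_cams * 6].reshape(n_cams, 6)   # pose: 3 rot + 3 trans
    pts = params[n_cams * 6:].reshape(n_pts, 3)     # object coordinates X
    pred = np.array([project(cams[c], pts[p])
                     for c, p in zip(cam_idx, pt_idx)])
    return (pred - obs_xy).ravel()

def bundle_adjust(cams0, pts0, cam_idx, pt_idx, obs_xy, project):
    """Jointly refine sensor poses P and object points X by minimizing
    the robustified sum of reprojection errors."""
    x0 = np.hstack([cams0.ravel(), pts0.ravel()])
    res = least_squares(
        reprojection_residuals, x0, loss='soft_l1',  # robust loss for rho
        args=(len(cams0), len(pts0), cam_idx, pt_idx, obs_xy, project))
    n = len(cams0) * 6
    return res.x[:n].reshape(-1, 6), res.x[n:].reshape(-1, 3)
```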
After the three-dimensional surface of the object is recovered, automatic mapping of real textures is performed. The specific scheme comprises the following steps:
(1) Optimal texture selection. According to the three-dimensional model information, the image point coordinates of spatial points on each image are determined using the collinearity condition equations; target textures are then further screened based on the angle between the normal vector of the current model face and the vector from the face center to the photographing center, and the optimal target texture is preferred according to the size of the back-projected texture area (a sketch of this screening appears after these steps).
(2) Texture cropping and extraction. Automatic cropping with a minimum-area bounding rectangle strategy preserves the realism of the textures while maximally saving storage space.
(3) Automatic texture mapping. According to the inverse mapping rule, a ray tracing method samples point by point, finally realizing automatic registration of textures and models. The texture mapping effect on a local model is shown in fig. 8.
Finally, it should be noted that the foregoing describes only preferred embodiments of the invention and is not intended to limit the invention to the precise forms disclosed; any modifications, equivalents and alternatives falling within the spirit and principles of the invention are intended to be included within its scope.

Claims (2)

1. A three-dimensional reconstruction method for multi-mode remote sensing image deep learning matching, characterized by comprising the following steps:
(1) deep learning feature extraction and description algorithm construction
introducing a multi-mode homonymous image block data set, performing data augmentation with a deep generative matching network to obtain a multi-mode homonymous image block training data set, learning a trained DNN model based on the data set, and performing feature extraction and multi-channel descriptor output;
the network structure of the DNN model comprises two key subnetworks, namely affine invariant learning HesAffNet and nonlinear brightness distortion learning NLIntenNet, wherein the affine distortion of the same-name region is learned by the affine invariant learning HesAffNet, and the brightness distortion is learned by the nonlinear brightness distortion learning NLIntenNet;
the training method of the DNN comprises the following steps: for the multi-mode homonymous image blocks with the size of 64x64 pixels, the multi-mode homonymous image blocks respectively pass through T 1 And T 2 Is further cut into homonymous image blocks with the size of 32x32 pixels; respectively leading the affine 4 parameter matrix A into the twin HesAffNet with shared weight values, and outputting the affine 4 parameter matrix A through network learning 1 And A 2 Further generating a geometric normalized homonymous image block; the method comprises the steps of importing the image blocks with the same name into a twin NLIntenNet with shared weights, carrying out nonlinear radiation distortion learning, outputting a 3-channel descriptor with unchanged brightness, and according to an established multi-mode loss function:
Figure QLYQS_1
the loss function values of different channels can be calculated respectively, wherein k is the number of samples, d (m i ,p i ) For multimode matching descriptor distance, d (m i ,n i ) Distance for multimode nearest neighbor non-matching descriptor; finally, based on a large number of training samples, the random gradient descent method and residual error backward iterative propagation are utilized to enable the loss function to finally tend to be minimum, thereby completing the HesAffNet and NLIntensNet connectionTraining and global optimization are combined;
(2) feature matching algorithm establishment of full-connection network iteration
fusing the multi-channel descriptors, constructing a complementary comprehensive descriptor that takes various image information into account together with a matching measure model, and completing the optimal matching of multi-mode image features through adaptive iteration of the fully connected network FCN;
the multi-channel descriptor comprises a direction gradient descriptor G, a structural information descriptor S and a local extremum response descriptor R, the vector dimension of the descriptor is simplified by using a PCA principal component analysis method, and three descriptors after dimension reduction can be expressed as:
Figure QLYQS_2
wherein n represents the number of extracted features, g i 、s i And r i A local gradient vector, an overall structure information entropy vector and a neighborhood Gaussian Laplace extremum vector which respectively represent the characteristics; then fusing the three descriptors, and constructing a comprehensive descriptor F taking global and local information of the image into consideration:
Figure QLYQS_3
establishing a robust matching measure model EM of the comprehensive descriptor:
EM=k 1 E 1 +k 2 E 2 +k 3 E 3
wherein E is 1 、E 2 、E 3 Respectively representing gradient space distance, structure information entropy distance and local response distance, k 1 、k 2 、k 3 Respectively represent the weighted values corresponding to the three distances, so that EM E [0,1]];
Obtaining the distance weighted value k of the descriptor vector through the self-adaptive iteration of the fully-connected network FCN 1 、k 2 、k 3 And a matching measure threshold T E The optimal matching of the multi-mode image features is completed;
the step of self-adapting iteration of the fully connected network FCN comprises the following steps:
(21) Three distance weighted values in the matching measure are to be initialized to k 10 =1/3、k 20 =1/3、k 30 =1/3, the matching measure threshold is to be initialized to T E0 =0.6;
(22) Under the double geometric constraint, based on the comprehensive descriptor F, the initialized matching measure model EM and the matching measure threshold T E0 Acquiring an initial homonymous matching set phi 0 Estimating a new homography matrix and a new fundamental matrix of the stereopair;
(23) Adaptive generation of new distance weighting value k by FCN 1 、k 2 、k 3 Threshold T of matching measure E Initializing each parameter by referring to the value of the step (1);
(24) Repeating the steps (22) and (23) until the number of the homonymous features is no longer changed, terminating the iteration, and outputting the best matching result;
(3) matching algorithm integration and live-action three-dimensional reconstruction application
constructing an application framework integrating multi-mode overlapping image fast retrieval, deep learning matching and live-action three-dimensional reconstruction;
the multi-mode overlapped image quick search comprises the following steps: the method comprises the steps of based on multi-mode image combination retrieval of visual features, introducing the maximum stable extremum region features in computer vision, setting a feature main direction as 0 direction, generating feature descriptors by utilizing pixel information of feature point neighborhood, thus obtaining a descriptor vector set, generating visual words by using a hierarchical K-means algorithm, constructing a vocabulary tree, generating an inverted document at each node, recording the length of the document, weighting the visual words, and retrieving the image combination with overlapped or similar content; constructing a random forest based on the detected visual features in the images to be matched by a K-d tree algorithm, and adopting an optimal adjacent point searching strategy and a principal component analysis method to accelerate searching so as to determine an overlapping region of the images to be matched;
then, introducing an image blocking and parallel distributed processing technology, efficiently extracting uniformly distributed Hessian characteristic points from an overlapping area, integrating a HesAffNet network model and an NLIntensNet network model, realizing Hessian affine invariant neighborhood detection and nonlinear brightness invariant descriptor generation, and obtaining an optimal matching result of multi-mode image characteristics by applying an FCN adaptive iterative matching technology on the basis, so as to realize the joint positioning of multi-mode multi-source images;
finally, in order to further recover the image gestures and the dense object point cloud positions, the following technical scheme is adopted:
(31) recovering the three-dimensional coordinates of the feature points: considering that the position of the same object point on multi-view images involves at least two sensors, the three-dimensional points observed by each sensor are incorporated into collinearity condition equations, and the object-space three-dimensional coordinates of each feature point are solved jointly;
(32) extracting the high-density three-dimensional point cloud: the matching window is expanded to the whole overlapping area to be matched using the deep learning matching algorithm and object-space geometric constraints, finally obtaining a pixel-by-pixel dense matching point cloud; during dense matching, an adaptive window and gray-scale weighting strategy automatically compensates for deformation caused by terrain relief, scale change and the like, guaranteeing the reliability of dense matching propagation;
(33) bundle adjustment, performed according to the following formula:

$$\min_{P,\,X}\ \sum_{i,j} \rho\left(\left\| x_{ij} - \pi(P_i, X_j) \right\|^2\right)$$

wherein $P$ denotes the sensor pose parameters, $X$ and $x$ denote respectively the object-space three-dimensional point coordinates and the corresponding image point coordinates, $\pi$ is the collinearity-equation projection, and $\rho$ denotes a loss function that robustly suppresses gross errors; the joint nonlinear least-squares error is estimated with the above, and the sum of the reprojection errors is minimized;
(4) after the three-dimensional surface of the object is recovered, automatic mapping of real textures is performed, the specific scheme comprising the following steps:
(41) optimal texture selection: determining the image point coordinates of spatial points on each image from the three-dimensional model information using the collinearity condition equations, then screening target textures based on the angle between the normal vector of the current model face and the vector from the face center to the photographing center, and preferring the optimal target texture according to the size of the back-projected texture area;
(42) texture cropping and extraction: automatic cropping with a minimum-area bounding rectangle strategy, preserving texture realism while maximally saving storage space;
(43) automatic texture mapping: according to the inverse mapping rule, sampling point by point with a ray tracing method, finally realizing automatic registration of textures and models.
2. The three-dimensional reconstruction method for multi-mode remote sensing image deep learning matching according to claim 1, wherein the gradient space distance, the structural information entropy distance and the local response distance are each normalized to [0,1] using a sigmoid function.
CN202310363863.4A, filed 2023-04-07: Three-dimensional reconstruction method for multi-mode remote sensing image deep learning matching (Active, CN116091706B)


Publications (2)

Publication Number | Publication Date
CN116091706A | 2023-05-09
CN116091706B | 2023-06-20






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant