CN111028277B - SAR and optical remote sensing image registration method based on pseudo-twin convolution neural network - Google Patents

SAR and optical remote sensing image registration method based on pseudo-twin convolution neural network

Info

Publication number
CN111028277B
CN111028277B CN201911256966.0A
Authority
CN
China
Prior art keywords
pseudo
sar
twin
neural network
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911256966.0A
Other languages
Chinese (zh)
Other versions
CN111028277A (en)
Inventor
帅通
董喆
孙建国
田左
关键
林尤添
田野
袁野
刘加贝
肖飞扬
尹晗琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
CETC 54 Research Institute
Original Assignee
Harbin Engineering University
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University, CETC 54 Research Institute filed Critical Harbin Engineering University
Priority to CN201911256966.0A priority Critical patent/CN111028277B/en
Publication of CN111028277A publication Critical patent/CN111028277A/en
Application granted granted Critical
Publication of CN111028277B publication Critical patent/CN111028277B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10032Satellite or aerial image; Remote sensing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10032Satellite or aerial image; Remote sensing
    • G06T2207/10044Radar image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20164Salient point detection; Corner detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pseudo-twin convolutional neural network-based SAR and optical remote sensing image registration method in the technical field of remote sensing image registration. The method comprises the steps of collecting and matching feature image blocks, then removing abnormal points, and finally registering. It adopts a strategy of maximizing the feature distance between positive samples and hard negative samples, defines a new loss function to train the network, and connects the two branches of the pseudo-twin network through a convolution operation to obtain a similarity score between the two input image blocks. By providing the pseudo-twin convolutional neural network architecture, the left and right branches of the pseudo-twin network can respectively take optical and SAR remote sensing images of different sizes as input, solving the task of identifying corresponding image blocks in optical and SAR remote sensing images at extremely high resolution.

Description

SAR and optical remote sensing image registration method based on pseudo-twin convolutional neural network
Technical Field
The invention relates to the technical field of remote sensing image registration, in particular to an SAR and optical remote sensing image registration method based on a pseudo-twin convolutional neural network.
Background
Earth monitoring by Remote Sensing (RS) has been widely used in both military and civilian applications in recent decades. Multimodal remote sensing images contain much complementary information, which benefits many remote sensing applications; for this reason, image registration is a common prerequisite for exploiting multimodal images. However, due to differences in imaging mechanisms, multimodal image registration is more challenging than general image registration, especially for optical and Synthetic Aperture Radar (SAR) images, and finding common features in optical and SAR images is very difficult. Moreover, as the spatial resolution increases, the existing geometric and radiometric differences between the two sensors widen further.
There are two key steps in remote sensing image registration. The first is to construct a set of Corresponding Points (CPs) distributed as evenly as possible across the images. The second is to estimate a spatial transformation (e.g., an affine or projective transformation) based on the corresponding points. According to the way in which hypothetical corresponding points are constructed, image registration methods can be divided into two categories: feature-based methods and region-based methods.
For feature-based methods, a pair of points is assumed to correspond if the structures around them are the most similar. First, candidate feature points with salient surrounding structure are detected on the master and slave images. Feature descriptors are then generated from local structural information (mainly gradient information), and the point pair with the smallest descriptor distance is regarded as a corresponding point pair. Many hand-crafted features (e.g., SIFT, SURF, ORB) are designed for point matching, but most cannot be applied to multimodal image registration.
Region-based methods first generate a set of candidate feature points from the master image. For each feature point, a correspondence is sought on the slave image within a local search window, with the metric for locating correspondences defined by the similarity between local intensities. The definition of the similarity measure is therefore essential for region-based methods. Normalized Cross-Correlation (NCC) and Mutual Information (MI) are two baseline similarity measures. The NCC metric is mainly used for optical image registration but generally fails in multimodal image registration; in contrast, MI is more robust to complex radiometric variations and is widely used for multimodal image registration.
The ideal feature for multimodal image registration should be distinctive and robust to the various nonlinear radiometric changes caused by different imaging conditions. Hand-crafted features or MI are not sufficient to describe such highly nonlinear relationships. It is well known that the advent of convolutional neural networks has revolutionized almost all computer vision problems, and in the field of image registration, learning-based features have attracted a great deal of attention. Learned features perform better than hand-crafted descriptors on some visual tasks, but they have not yet demonstrated an overwhelming advantage, and classical local image block detectors and descriptors still provide very competitive registration results. One reason is that restating the image registration task as a differentiable end-to-end process is very challenging. In addition, for remote sensing image registration, the local image block data sets currently available for training are not large and diverse enough to allow learning high-quality, widely applicable descriptors.
Deep features are also increasingly used in the remote sensing field. However, to date, most deep-learning-related research has focused on classification and detection tasks in different remote sensing domains; for remote sensing image registration, deep learning has only recently begun to succeed. Currently, almost all remote sensing images are geocoded with latitude and longitude. In theory, correctly geocoded remote sensing images are already registered, but inaccurate measurement of the spatial attitude angle often causes geolocation errors that require further refinement. On this basis, the SAR and optical remote sensing image registration method based on a pseudo-twin convolutional neural network is designed to solve these problems.
Disclosure of Invention
The invention aims to provide a pseudo-twin convolutional neural network-based SAR and optical remote sensing image registration method, so as to solve the problems of poor registration effect and low precision for SAR and optical images raised in the background art.
In order to achieve the purpose, the invention provides the following technical scheme: a pseudo-twin convolutional neural network-based SAR and optical remote sensing image registration method, which registers an SAR and an optical remote sensing image, the registration comprising the following specific steps:
first step, feature image block acquisition and matching
Step1, carrying out block-based Harris feature point detection on the main image, and extracting local candidate image blocks around the Harris feature points from the main image;
step2, extracting a larger local search image block from the slave image around the same geographic position as each master Harris feature point;
step3, the local candidate image blocks and the corresponding local search image blocks are transmitted as the input of a pseudo-twin convolution network;
second, outlier removal and final registration
Step4, using the similarity score output by the pseudo-twin convolutional neural network as the index of matching confidence, and using the score value as the sole index to judge whether a hypothetical corresponding point (CP) is an outlier, wherein the higher the score of a CP, the higher the probability that it is correct;
step5, outlier removal is performed by removing all hypothetical CPs whose score value is less than the given threshold Tscore; a compact sketch of this pipeline is given below.
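The following Python sketch strings steps 1 through 5 together. detect_harris_blocks, extract_patch and psnn_score are hypothetical helpers standing in for the block-based Harris detector, patch cropping, and the trained pseudo-twin network, and the defaults M = 101, s = 20 and t_score = 0.5 are illustrative assumptions rather than values fixed by the invention.

```python
# Compact sketch of the five registration steps, assuming hypothetical helpers.
import numpy as np

def register_pair(master, slave, detect_harris_blocks, extract_patch,
                  psnn_score, M=101, s=20, t_score=0.5):
    cps = []
    for (x, y) in detect_harris_blocks(master):           # step 1
        cand = extract_patch(master, x, y, M)             # M x M candidate block
        search = extract_patch(slave, x, y, M + s)        # (M+s) x (M+s) search block, step 2
        y_map = psnn_score(cand, search)                  # step 3: similarity score map
        dy, dx = np.unravel_index(y_map.argmax(), y_map.shape)
        c = y_map.shape[0] // 2                           # centre position u_0
        if y_map.max() >= t_score:                        # steps 4-5: keep confident CPs
            cps.append(((x, y), (x + dx - c, y + dy - c), y_map.max()))
    return cps                                            # CPs used to fit the final transform
```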
Preferably, a pseudo-twin convolutional neural network model is constructed; the pseudo-twin convolutional neural network adopts a strategy of maximizing the feature distance between positive samples and hard negative samples, with the following specific steps:
firstly, the features of the SAR and optical image blocks are extracted through convolutional layers; the pseudo-twin convolutional neural network uses convolution filters with a 3 × 3 receptive field, since the 3 × 3 convolution filter is the smallest kernel capable of capturing patterns in different directions, and the use of small convolution filters increases the nonlinearity inside the network, making the network more discriminative;
secondly, the pseudo-twin convolutional neural network has two separate but identical convolutional streams for handling the very different geometric and radiometric appearances of SAR and optical images; SAR image blocks and optical image blocks are processed in parallel, and the result information is fused only at a later decision stage; features of the SAR and optical image blocks are extracted by convolutional layers, each of the two separate convolutional streams comprising 8 convolutional layers and 3 maximum pooling layers; the convolutional layer inputs are spatially padded so that the spatial resolution is preserved after convolution, convolution filters with a 3 × 3 receptive field are used, the convolution stride is fixed to one pixel, and all 3 × 3 convolutional layers are padded by 1 pixel; spatial pooling is carried out by 7 maximum pooling layers, the pooling layers following convolutional layers to reduce the dimension of the feature maps, and maximum pooling is performed over a 2 × 2 pixel window with a stride of 2.
Preferably, the fusion stage of the pseudo-twin convolutional neural network comprises two successive convolutional layers followed by two fully connected layers; the convolutional layers consist of 3 × 3 filters operating on the concatenated SAR and optical feature maps in order to learn the fusion rule that minimizes the final loss function; in the fusion stage, maximum pooling is omitted after the first convolutional layer, whose stride is instead set to 2 so as to downsample the feature map while preserving spatial information; the last stage of the fusion network consists of two fully connected layers, the first containing 512 channels and the second performing one-hot binary classification with 2 channels.
Preferably, in order to improve the discriminability of the pseudo-twin convolutional neural network model, the loss function is defined so as to maximize the feature distance between positive samples and hard negative samples, i.e. to make the difference between correctly matched features and nearly matched but incorrect features as large as possible. A score map is obtained by a convolution operation between the conv-8 output feature pair. On the score map, the responses of correctly matched positions are denoted y_ps, the responses of mismatched positions are denoted y_ns, and the hard negative responses are taken as the k largest mismatched responses,

y_hns = max_k(y_ns)

The goal of training is to maximize the distance between the average response values of the positive and hard negative matches, i.e.

max( mean(y_ps) - mean(y_hns) )

But directly taking

L = mean(y_hns) - mean(y_ps)

as the loss function leads to unstable training results, and the optimization process is very sensitive to the learning rate on different training data sets; a logistic operation is therefore adopted to obtain a smoother loss function,

f_logi(y) = log(1 + exp(-y))

and the loss function L is defined as

L = (1/N_ps) Σ_{u ∈ PS} f_logi(y(u)) + (1/N_hns) Σ_{u ∈ HNS} f_logi(-y(u))

wherein N_ps and N_hns are the numbers of positive samples and hard negative samples, respectively.

For network training, the ground truth map is defined as

gt(u) = 1 if ||u - u_0||_1 < r, and -1 otherwise

wherein u = (x, y) is the two-dimensional coordinate of an arbitrary position on the score map, u_0 = (x_0, y_0) is the coordinate of the center position, and r is the effective L1 distance; if the distance from position u to u_0 is less than r, position u is taken as positive. The loss function can therefore be rewritten as

L = (1/N_ps) Σ_{gt(u)=1} f_logi(y(u)) + (1/N_hns) Σ_{gt(u)=-1} f_logi(-y(u))

which can also be written as

L = Σ_u (1/N_gt(u)) f_logi( gt(u) · y(u) )
Preferably, a semi-manual training data set curation method is established to obtain a data set with higher matching precision; collecting co-registered small image blocks is essential for training the pseudo-twin network, and the semi-manual training set curation method is as follows:
a first step of roughly registering the image pair by manually selecting four corresponding points (CPs) at the four corners of the image;
secondly, detecting candidate feature points of the master image by adopting a Harris corner detector;
thirdly, searching for high-confidence corresponding points on the slave image;
fourthly, registering the large Geotiff image pair;
and fifthly, acquiring more corresponding points CP from the registered large Geotiff image pair.
Preferably, the local candidate image blocks are M × M pixels, and the local search image blocks are (M + s) × (M + s) pixels.
Compared with the prior art, the invention has the following beneficial effects: by providing the pseudo-twin convolutional neural network architecture, the left and right branches of the pseudo-twin network can respectively take optical and SAR remote sensing images of different sizes as input, solving the task of identifying corresponding image blocks in optical and SAR remote sensing images at extremely high resolution. Unlike the twin structure traditionally used in CNNs for image registration, the pseudo-twin structure has no interconnection between the two data streams for the SAR and optical images; the network is trained using automatically generated training data and does not resort to any hand-crafted features. Secondly, a new loss function is defined: the similarity measure is computed as a convolution between the two outputs of the pseudo-twin network rather than as an L2 distance or dot product, and a new hard example mining strategy significantly improves the performance of the designed network.
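To make the convolution-defined similarity concrete, a minimal sketch follows, assuming single-channel branch outputs for brevity: the score map is the valid-mode 2D cross-correlation of the search-patch features with the candidate-patch features.

```python
# Score map as cross-correlation (convolution without kernel flipping) of the
# two branch outputs, rather than an L2 distance or dot product.
import numpy as np
from scipy.signal import correlate2d

def score_map(feat_master, feat_search):
    # 'valid' mode yields an (Ns - Nm + 1) x (Ns - Nm + 1) map of similarity scores
    return correlate2d(feat_search, feat_master, mode="valid")

y = score_map(np.random.randn(32, 32), np.random.randn(40, 40))  # 9 x 9 map
```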
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a diagram of a pseudo-twin convolutional neural network framework of the present invention.
FIG. 2 is a diagram of a pseudo-twin convolutional neural network layer configuration in accordance with the present invention.
FIG. 3 is a loss function construction diagram of the present invention.
FIG. 4 is an overall flow chart of SAR and optical image registration according to the present invention.
FIG. 5 is the RMSE-i plot for SAR-optical registration according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-5, the present invention provides a technical solution: an SAR and optical remote sensing image registration method based on a pseudo-twin convolution neural network,
(1) A pseudo-twin convolutional network model was constructed as shown in figure 1.
In order to deal with the very different geometric and radiometric appearances of SAR and optical remote sensing images, a pseudo-twin network architecture with two separate but identical convolution streams is proposed, which processes SAR image blocks and optical image blocks in parallel and fuses the result information only at a later decision stage.
Feature extraction is performed on the SAR and optical image blocks through convolutional layers. In the network, convolution filters with a 3 × 3 receptive field are used, because the 3 × 3 convolution filter is the smallest kernel that can capture patterns in different directions; moreover, the use of small convolution filters increases the nonlinearity inside the network, making the network more discriminative.
The convolution stride in the network is fixed to one pixel; the convolutional layer inputs are spatially padded so that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for all 3 × 3 convolutional layers in the network. Spatial pooling is achieved by 7 maximum pooling layers, which follow convolutional layers and reduce the dimensionality of the feature maps. Maximum pooling is performed over a 2 × 2 pixel window with a stride of 2. The layer configuration of the specific network model is shown in FIG. 2.
The fusion stage of the network consists of two successive convolutional layers followed by two fully connected layers. The convolutional layers consist of 3 × 3 filters that operate on the concatenated SAR and optical feature maps to learn the fusion rule that minimizes the final loss function. In the fusion stage, maximum pooling is omitted after the first convolutional layer, whose stride is set to 2 to downsample the feature map while preserving spatial information. Using a 3 × 3 filter without max pooling after the first convolution enables the fusion layers to learn a fusion rule with some invariance to the spatial mismatch caused by differences in imaging modalities: the fusion layers learn the relationships between features using 3 × 3 convolutions while preserving nearby spatial information, and the absence of max pooling means these learned spatial relationships are preserved rather than only the maximum response being kept, with the stride of 2 still reducing the feature size. The final stage of the fusion network consists of two fully connected layers: the first contains 512 channels, and the second performs one-hot binary classification with 2 channels.
In summary, the convolutional layers in the network, except those of the fusion stage, consist of 3 × 3 filters and follow two rules:
1) Layers with the same feature map size have the same number of filters.
2) The number of feature maps increases in deeper layers, approximately doubling after each maximum pooling layer (except for the last convolution stack in each stream). Apart from the last fully connected layer, which uses Softmax as its activation function, all layers use ReLU as the activation function.
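The following PyTorch sketch assembles the network as just described: two separate, structurally identical streams of 8 convolutional layers and 3 max-pooling layers each (all 3 × 3 kernels, 1-pixel padding), a fusion stage of two convolutional layers with the first downsampling by stride 2 and no max pooling, and fully connected layers of 512 and 2 channels. The channel widths and the 112 × 112 input size are illustrative assumptions; the text does not fix them.

```python
import torch
import torch.nn as nn

def conv3x3(cin, cout, stride=1):
    # 3 x 3 convolution, 1-pixel padding (size preserved when stride=1), ReLU
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.ReLU(inplace=True))

def make_stream():
    # one branch: 8 conv layers with 3 max-pooling layers (2 x 2, stride 2);
    # feature maps roughly double after each pooling, except the last stack
    return nn.Sequential(
        conv3x3(1, 32), conv3x3(32, 32), nn.MaxPool2d(2, 2),
        conv3x3(32, 64), conv3x3(64, 64), nn.MaxPool2d(2, 2),
        conv3x3(64, 128), conv3x3(128, 128), nn.MaxPool2d(2, 2),
        conv3x3(128, 128), conv3x3(128, 128))

class PseudoTwin(nn.Module):
    def __init__(self):
        super().__init__()
        self.sar_stream = make_stream()   # separate weights: pseudo-twin, not twin
        self.opt_stream = make_stream()
        self.fusion = nn.Sequential(
            conv3x3(256, 256, stride=2),  # stride 2 replaces max pooling here
            conv3x3(256, 256))
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(inplace=True),  # FC-512
            nn.Linear(512, 2))                          # FC-2: one-hot binary output

    def forward(self, sar, opt):
        f = torch.cat([self.sar_stream(sar), self.opt_stream(opt)], dim=1)
        return self.head(self.fusion(f))  # logits; Softmax is applied in the loss

model = PseudoTwin()
logits = model(torch.randn(1, 1, 112, 112), torch.randn(1, 1, 112, 112))
```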
(2) A new loss function is proposed for training the neural network, improving the discriminability of the model.
The designed loss function is intended to make the difference between correctly matched features and nearly matched but incorrect features as large as possible; a graphical representation of the loss function construction is shown in fig. 3. The score map is obtained by a convolution operation between the conv-8 output feature pairs. On the score map, the responses of correctly matched positions are denoted y_ps, the responses of mismatched positions are denoted y_ns, and the hard negative responses are defined as the k largest mismatched responses:

y_hns = max_k(y_ns)

The goal of training is to maximize the distance between the mean response values of the positive and hard negative matches, i.e.

max( mean(y_ps) - mean(y_hns) )

But directly taking

L = mean(y_hns) - mean(y_ps)

as the loss function leads to unstable training results, and the optimization process is very sensitive to the learning rate on different training data sets. A logistic operation is therefore adopted to obtain a smoother loss function,

f_logi(y) = log(1 + exp(-y))

and the loss function L is defined as

L = (1/N_ps) Σ_{u ∈ PS} f_logi(y(u)) + (1/N_hns) Σ_{u ∈ HNS} f_logi(-y(u))

where N_ps and N_hns are the numbers of positive samples and hard negative samples, respectively.

For network training, the ground truth map is defined as

gt(u) = 1 if ||u - u_0||_1 < r, and -1 otherwise

where u = (x, y) is the two-dimensional coordinate of an arbitrary position on the score map, u_0 = (x_0, y_0) is the coordinate of the center position, and r is the effective L1 distance; if the distance from position u to u_0 is less than r, position u is taken as positive. The loss function can therefore be rewritten as

L = (1/N_ps) Σ_{gt(u)=1} f_logi(y(u)) + (1/N_hns) Σ_{gt(u)=-1} f_logi(-y(u))

which can also be written as

L = Σ_u (1/N_gt(u)) f_logi( gt(u) · y(u) )
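A minimal NumPy sketch of the loss above, for a single score map: positive positions lie within L1 distance r of the centre, and hard negatives are the k largest responses among the remaining positions; k is a free parameter here, since the text does not fix it.

```python
import numpy as np

def f_logi(y):
    # f_logi(y) = log(1 + exp(-y)), computed stably
    return np.logaddexp(0.0, -y)

def psnn_loss(score_map, r=2, k=32):
    h, w = score_map.shape
    yy, xx = np.mgrid[0:h, 0:w]
    y0, x0 = h // 2, w // 2                          # u_0: centre of the score map
    pos = (np.abs(yy - y0) + np.abs(xx - x0)) < r    # positive: L1 distance below r
    y_ps = score_map[pos]                            # positive responses
    neg = np.sort(score_map[~pos].ravel())[::-1]     # mismatched responses, descending
    y_hns = neg[:k]                                  # hard negatives: k largest
    # push positive responses up and hard negative responses down
    return f_logi(y_ps).mean() + f_logi(-y_hns).mean()

loss = psnn_loss(np.random.randn(21, 21))
```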
During the training process, master and slave image patches are drawn around each pair of corresponding points on the optical-SAR image pair. That is, for a pair of feature points {p_m, p_s}, a main image block of size N_m × N_m and a search image block of size N_s × N_s are extracted centered at p_m and p_s, respectively. To allow the correspondence search, the search range I_s is larger than I_m, i.e. N_s > N_m. By convolving the conv-8 output of the main image block with that of the search image block, the score map y has size (N_s - N_m + 1) × (N_s - N_m + 1). Theoretically there is only one correctly matched location, centered in the final score map y, but experiments show that better results are obtained when the network tolerates small displacements during training. Thus, as shown in fig. 3, the effective distance r is set to 2, where the red, brown and yellow positions are all considered positive.
During testing, the master block and the larger slave search block are fed to the network to obtain a score map y. The maximum response position on y is located as u_max, and the offset u_max - u_0 is exactly the offset between the master and slave corresponding feature points. Therefore, for each feature point on the master image, its correspondence can be located in the slave image.
(3) A semi-manual training data set sorting method is established to obtain a data set with higher matching precision.
The network was trained and tested using Geotiff image pairs; the optical images were extracted from Google Earth historical imagery, and the SAR images come from TerraSAR-X.
Collecting co-registered small image blocks is essential for training the pseudo-twin network. The semi-manual training set curation method mainly comprises the following five steps:
(1) Each pair of large Geotiff images is roughly registered: the image pair is roughly registered by manually selecting four corresponding points (CPs) at the four corners of the image, and the slave image is warped under an affine transformation. This coarse registration ensures that each CP on the master and slave images has a limited offset, thereby reducing the time overhead of the CP search step.
(2) Detecting candidate feature points of the master image: candidate feature points are detected with a Harris corner detector. Specifically, block-based feature points uniformly distributed over the large Geotiff master image are extracted; the optical image is taken as the master image of the optical-SAR image pair. Considering that the texture of the map image is very low, Harris candidate feature points are extracted from the map image (a block-based Harris sketch is given after step (5)).
(3) Searching for high-confidence corresponding points on the slave image: the MI and HOPC methods are combined to obtain high-confidence matching results. First, a local image block of size 101 × 101 pixels is drawn around each candidate master feature point, and then a corresponding search block of size 121 × 121 pixels is drawn from the roughly matched slave image around the same position. The MI and HOPC methods are applied to the master and slave image block pairs respectively, yielding the matching results CP_MI and CP_HOPC.
For each corresponding point (CP), if ||CP_MI - CP_HOPC||_2 is below 1.5 pixels, it is considered a high-confidence match, and such matches are collected into the agreement set. Then, within this set, matching points that are actually mismatched are manually eliminated by visual inspection. If the number of high-confidence matches is insufficient for fine image registration, a new set of valid matches is manually selected from the remaining CP_MI and CP_HOPC results. Thus, a high-confidence matching feature point set CPconfi is obtained (see the agreement-test sketch after step (5)).
(4) Registration of the large Geotiff: the large Geotiff image pair is registered with a piecewise linear (PL) transform based on the CPconfi set obtained in step (3); the transform uses triangulation to divide the image into triangular regions, and an affine transformation maps each triangular region to the corresponding region on the master image. Since the topographic variation of each image pair to be registered is nonlinear and unpredictable, PL is more robust than other higher-order geometric transformation models. Thus, large Geotiff image pairs are registered with high accuracy.
(5) More corresponding points (CPs) are acquired from the registered large Geotiff image pair: the CPconfi data set collected in step (3) is not sufficient for network training. Since the large Geotiffs have been accurately registered in step (4), any pair of feature points extracted from the same positions of the master image and the warped slave image can be regarded as CPs. In this step, more Harris feature points are collected on the master image, and their correspondences are located at the same positions on the registered slave image, yielding the final CP set. Training patch pairs are drawn around the final CPs, which serve not only for training patch collection but also as checkpoints for network validation.
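As referenced in steps (2) and (3), the two sketches below illustrate those operations under stated assumptions. First, a block-based Harris detector, assuming OpenCV; keeping the single strongest response per block is one way to obtain the uniform point distribution described above, and the 256-pixel block size is an illustrative assumption:

```python
# Block-based Harris detection sketch: compute the Harris response once, then
# keep the strongest corner in each block so points spread uniformly.
import cv2
import numpy as np

def block_harris(gray, block=256):
    resp = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
    pts = []
    h, w = gray.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = resp[y:y + block, x:x + block]
            dy, dx = np.unravel_index(np.argmax(tile), tile.shape)
            if tile[dy, dx] > 0:          # keep only genuine corner responses
                pts.append((x + dx, y + dy))
    return pts
```

Second, the agreement test of step (3): hypothetical arrays cp_mi and cp_hopc hold, for each candidate point, the position matched by MI and by HOPC respectively, and a pair is kept as high-confidence when the two land within 1.5 pixels of each other:

```python
# High-confidence CP filtering sketch: ||CP_MI - CP_HOPC||_2 < 1.5 pixels.
import numpy as np

def high_confidence(cp_mi, cp_hopc, tol=1.5):
    cp_mi = np.asarray(cp_mi, dtype=float)      # shape (n, 2): MI match positions
    cp_hopc = np.asarray(cp_hopc, dtype=float)  # shape (n, 2): HOPC match positions
    keep = np.linalg.norm(cp_mi - cp_hopc, axis=1) < tol
    return cp_mi[keep], keep                    # accepted positions and the mask
```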
Performing SAR and optical image registration:
SAR and optical remote sensing image registration comprises the following two main steps; the registration flow chart is shown in FIG. 4.
(1) Collecting and matching feature image blocks: due to local deformations of remote sensing images, a uniform distribution of CPs is an important factor for computing an accurate transformation. Thus, similar to the training data set curation process, a block-based Harris feature point detector is run on the master image, and local candidate image blocks (M × M pixels) around the Harris feature points are extracted from it. Since the offset between geocoded image pairs is not large, a larger local search block ((M + s) × (M + s) pixels) is extracted from the slave image around the same geographic position as each master Harris feature point. The candidate blocks and the corresponding search blocks are taken as inputs to the pseudo-twin network. On the output score map, the offset between the maximum-score position and the center position is taken as the offset between the geographic positions of the corresponding master and slave feature points.
(2) Outlier removal and final matching: in the method, the similarity score output by the network is used as the index of matching confidence, and the score value is used as the sole index to determine whether a CP is an outlier. Clearly, a higher CP score indicates a greater likelihood that the CP is correct. Outlier removal is performed by removing all hypothetical CPs whose score values are less than a given threshold Tscore.
A method for automatically determining the Tscore value is devised. Assume that there are N hypothetical CPs {cp_1, ..., cp_i, ..., cp_N}, sorted in descending order of score. For each i, the first i hypothetical CPs are used to estimate the projective transformation between the image pair, and then the root mean square error (RMSE) of these i CPs is calculated, giving a plot of RMSE versus i. A third-order polynomial curve R(i) is fitted to the RMSE-i plot, as shown in FIG. 5. The index minimizing |R'(i)| is denoted iP, where R'(i) denotes the first derivative of R(i). The score of cp_iP is taken as the threshold Tscore, and the first iP CPs are considered inliers. A minimal |R'(i)| indicates that adding or removing nearby CPs has the least impact on the RMSE, which means the transformation determined by the first iP CPs is relatively stable.
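A NumPy sketch of this threshold selection, with fit_projective and rmse as hypothetical helpers (for example, a DLT homography fit and its reprojection error); the starting index i_min = 4 is an assumption reflecting the minimum number of point pairs a projective transform needs:

```python
# Automatic Tscore sketch: sort hypothetical CPs by score, compute the RMSE of
# the first i CPs under a projective fit, fit a cubic R(i) to the RMSE-i curve,
# and take iP where |R'(i)| is smallest; the score of cp_iP becomes Tscore.
import numpy as np

def auto_tscore(cps, scores, fit_projective, rmse, i_min=4):
    order = np.argsort(scores)[::-1]               # descending score order
    cps = [cps[j] for j in order]
    scores = [scores[j] for j in order]
    idx = np.arange(i_min, len(cps) + 1)
    errs = [rmse(fit_projective(cps[:i]), cps[:i]) for i in idx]
    coeff = np.polyfit(idx, errs, 3)               # third-order polynomial R(i)
    deriv = np.polyval(np.polyder(coeff), idx)     # R'(i) on the same grid
    iP = idx[np.argmin(np.abs(deriv))]             # position of min |R'(i)|
    return scores[iP - 1], iP                      # Tscore and the inlier count
```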
The final registration is performed by warping the slave image onto the master image using the filtered CPs. The choice of transformation model depends on the specific conditions of the image pair to be registered: for low- or medium-resolution images an affine or projective transformation is usually sufficient, while for high-resolution images the PL transform achieves better registration results.
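For the PL transform, a minimal sketch follows, assuming scikit-image and that the filtered CPs are given as matched point arrays; PiecewiseAffineTransform implements exactly the triangulation-plus-per-triangle-affine idea described in step (4) above.

```python
# Hedged sketch of the final piecewise linear (PL) warp: the CPs define a
# triangulation and each triangle gets its own affine map; the slave image is
# resampled onto the master grid.
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def pl_register(slave, master_pts, slave_pts):
    tform = PiecewiseAffineTransform()
    # warp() expects the inverse map: master (output) coords -> slave (input) coords
    tform.estimate(np.asarray(master_pts, float), np.asarray(slave_pts, float))
    return warp(slave, tform)  # slave image warped onto the master frame
```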
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (6)

1. A pseudo-twin convolutional neural network-based SAR and optical remote sensing image registration method, comprising an SAR and an optical remote sensing image, characterized in that the SAR and the optical remote sensing image are registered, the registration comprising the following specific steps:
first step, feature image block acquisition and matching
Step1, carrying out block-based Harris feature point detection on the main image, and extracting local candidate image blocks around the Harris feature points from the main image;
step2, extracting a larger local search image block from the slave image around the same geographic position as each master Harris feature point;
step3, the local candidate image blocks and the corresponding local search image blocks are transmitted as the input of a pseudo-twin convolution network;
second, outlier removal and final registration
Step4, using the similarity score output by the pseudo-twin convolutional neural network as the index of matching confidence, and using the score value as the sole index to judge whether a hypothetical corresponding point (CP) is an outlier, wherein the higher the score of a CP, the higher the probability that it is correct;
step5, outlier removal is performed by removing all hypothetical CPs whose score value is less than the given threshold Tscore.
2. The SAR and optical remote sensing image registration method based on the pseudo-twin convolutional neural network as claimed in claim 1, characterized in that a pseudo-twin convolutional neural network model is constructed, the pseudo-twin convolutional neural network adopting a strategy of maximizing the feature distance between positive samples and hard negative samples, with the following specific steps:
firstly, extracting the features of the SAR and optical image blocks through convolutional layers, the pseudo-twin convolutional neural network using convolution filters with a 3 × 3 receptive field, since the 3 × 3 convolution filter is the smallest kernel capable of capturing patterns in different directions, and the use of small convolution filters increases the nonlinearity inside the network, making the network more discriminative;
secondly, the pseudo-twin convolutional neural network has two separate but identical convolutional streams for handling the very different geometric and radiometric appearances of SAR and optical images, processing SAR image blocks and optical image blocks in parallel and fusing the result information only at a later decision stage; features of the SAR and optical image blocks are extracted by convolutional layers, each of the two separate convolutional streams comprising 8 convolutional layers and 3 maximum pooling layers; the convolutional layer inputs are spatially padded so that the spatial resolution is preserved after convolution, convolution filters with a 3 × 3 receptive field are used, the convolution stride in the pseudo-twin convolutional neural network is fixed to one pixel, and all 3 × 3 convolutional layers in the pseudo-twin convolutional neural network are padded by 1 pixel; spatial pooling is carried out by 7 maximum pooling layers, the pooling layers following convolutional layers to reduce the dimension of the feature maps, and maximum pooling is performed over a 2 × 2 pixel window with a stride of 2.
3. The pseudo-twin convolutional neural network based SAR and optical remote sensing image registration method of claim 2, wherein the fusion stage of the pseudo-twin convolutional neural network comprises two successive convolutional layers followed by two fully connected layers; the convolutional layers consist of 3 × 3 filters operating on the concatenated SAR and optical feature maps in order to learn the fusion rule that minimizes the final loss function; in the fusion stage, maximum pooling is omitted after the first convolutional layer, whose stride is instead set to 2 so as to downsample the feature map while preserving spatial information; the last stage of the fusion network consists of two fully connected layers, the first containing 512 channels and the second performing one-hot binary classification with 2 channels.
4. The SAR and optical remote sensing image registration method based on the pseudo-twin convolutional neural network as claimed in claim 1, characterized in that: in order to improve the discriminability of the pseudo-twin convolutional neural network model, the loss function is defined so as to maximize the feature distance between positive samples and hard negative samples, i.e. to make the difference between correctly matched features and nearly matched but incorrect features as large as possible; a score map is obtained by a convolution operation between the conv-8 output feature pair; on the score map, the responses of correctly matched positions are denoted y_ps, the responses of mismatched positions are denoted y_ns, and the hard negative responses are taken as the k largest mismatched responses,

y_hns = max_k(y_ns)

the goal of training is to maximize the distance between the average response values of the positive and hard negative matches, i.e.

max( mean(y_ps) - mean(y_hns) )

but directly taking

L = mean(y_hns) - mean(y_ps)

as the loss function leads to unstable training results, and the optimization process is very sensitive to the learning rate on different training data sets, so a logistic operation is adopted to obtain a smoother loss function,

f_logi(y) = log(1 + exp(-y))

and the loss function L is defined as

L = (1/N_ps) Σ_{u ∈ PS} f_logi(y(u)) + (1/N_hns) Σ_{u ∈ HNS} f_logi(-y(u))

wherein N_ps and N_hns are the numbers of positive samples and hard negative samples, respectively;

for network training, the ground truth map is defined as

gt(u) = 1 if ||u - u_0||_1 < r, and -1 otherwise

wherein u = (x, y) is the two-dimensional coordinate of an arbitrary position on the score map, u_0 = (x_0, y_0) is the coordinate of the center position, and r is the effective L1 distance; if the distance from position u to u_0 is less than r, position u is taken as positive; the loss function can therefore be rewritten as

L = (1/N_ps) Σ_{gt(u)=1} f_logi(y(u)) + (1/N_hns) Σ_{gt(u)=-1} f_logi(-y(u))

which can also be written as

L = Σ_u (1/N_gt(u)) f_logi( gt(u) · y(u) )
5. The SAR and optical remote sensing image registration method based on the pseudo-twin convolutional neural network as claimed in claim 1, characterized in that a semi-manual training data set curation method is established to obtain a data set with higher matching precision, collecting co-registered small image blocks being essential for training the pseudo-twin network; the semi-manual training set curation method is as follows:
a first step of rough registration of the image pair by manually selecting four corresponding points CP at the four corners of the image;
secondly, detecting candidate characteristic points of the main image by adopting a Harris angular point detector;
thirdly, searching high-confidence corresponding points on the slave images;
fourthly, registering the large Geotiff;
and fifthly, acquiring more corresponding points CP from the registered large Geotiff image pair.
6. The SAR and optical remote sensing image registration method based on the pseudo-twin convolutional neural network according to claim 1, characterized in that: the local candidate image blocks are M × M pixels, and the local search image blocks are (M + s) × (M + s) pixels.
CN201911256966.0A 2019-12-10 2019-12-10 SAR and optical remote sensing image registration method based on pseudo-twin convolution neural network Active CN111028277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911256966.0A CN111028277B (en) 2019-12-10 2019-12-10 SAR and optical remote sensing image registration method based on pseudo-twin convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911256966.0A CN111028277B (en) 2019-12-10 2019-12-10 SAR and optical remote sensing image registration method based on pseudo-twin convolution neural network

Publications (2)

Publication Number Publication Date
CN111028277A CN111028277A (en) 2020-04-17
CN111028277B true CN111028277B (en) 2023-01-10

Family

ID=70208394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911256966.0A Active CN111028277B (en) 2019-12-10 2019-12-10 SAR and optical remote sensing image registration method based on pseudo-twin convolution neural network

Country Status (1)

Country Link
CN (1) CN111028277B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801110B (en) * 2021-02-01 2022-11-01 中车青岛四方车辆研究所有限公司 Target detection method and device for image distortion correction of linear array camera of rail train
CN113066023B (en) * 2021-03-19 2022-12-13 哈尔滨工程大学 SAR image speckle removing method based on self-calibration convolutional neural network
CN113223065B (en) * 2021-03-30 2023-02-03 西南电子技术研究所(中国电子科技集团公司第十研究所) Automatic matching method for SAR satellite image and optical image
CN112989792B (en) * 2021-04-25 2024-04-16 中国人民解放军国防科技大学 Case detection method and electronic equipment
CN113223068B (en) * 2021-05-31 2024-02-02 西安电子科技大学 Multi-mode image registration method and system based on depth global features
CN113538534B (en) * 2021-06-23 2022-05-20 复旦大学 Image registration method based on depth reinforcement learning nano imaging
CN113537379B (en) * 2021-07-27 2024-04-16 沈阳工业大学 Three-dimensional matching method based on CGANs
CN113743515B (en) * 2021-09-08 2022-03-11 感知天下(北京)信息科技有限公司 Remote sensing image feature matching method based on self-supervision and self-learning feature points
CN113838107B (en) * 2021-09-23 2023-12-22 哈尔滨工程大学 Automatic heterogeneous image registration method based on dense connection
CN114387439B (en) * 2022-01-13 2023-09-12 中国电子科技集团公司第五十四研究所 Semantic segmentation network based on optical and PolSAR feature fusion
CN114565653B (en) * 2022-03-02 2023-07-21 哈尔滨工业大学 Heterologous remote sensing image matching method with rotation change and scale difference
CN114359359B (en) * 2022-03-11 2022-07-01 北京化工大学 Multitask optical and SAR remote sensing image registration method, equipment and medium
CN115129917B (en) * 2022-06-06 2024-04-09 武汉大学 optical-SAR remote sensing image cross-modal retrieval method based on modal common characteristics
CN115170979B (en) * 2022-06-30 2023-02-24 国家能源投资集团有限责任公司 Mining area fine land classification method based on multi-source data fusion
CN115574831A (en) * 2022-09-28 2023-01-06 曾丽红 Unmanned aerial vehicle navigation method based on map fusion
CN116701695B (en) * 2023-06-01 2024-01-30 中国石油大学(华东) Image retrieval method and system for cascading corner features and twin network
CN117315433B (en) * 2023-11-30 2024-02-13 中国科学院空天信息创新研究院 Remote sensing multi-mode multi-space functional mapping method based on distribution consistency constraint
CN117649613B (en) * 2024-01-30 2024-04-26 之江实验室 Optical remote sensing image optimization method and device, storage medium and electronic equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510532B (en) * 2018-03-30 2022-07-15 西安电子科技大学 Optical and SAR image registration method based on deep convolution GAN
CN109902192B (en) * 2019-01-15 2020-10-23 华南师范大学 Remote sensing image retrieval method, system, equipment and medium based on unsupervised depth regression
CN110414349A (en) * 2019-06-26 2019-11-05 长安大学 Introduce the twin convolutional neural networks face recognition algorithms of sensor model

Also Published As

Publication number Publication date
CN111028277A (en) 2020-04-17

Similar Documents

Publication Publication Date Title
CN111028277B (en) SAR and optical remote sensing image registration method based on pseudo-twin convolution neural network
CN110119438B (en) Airborne LiDAR point cloud filtering method based on active learning
Jiang et al. Multiscale locality and rank preservation for robust feature matching of remote sensing images
CN103337052B (en) Automatic geometric correcting method towards wide cut remote sensing image
CN108376408B (en) Three-dimensional point cloud data rapid weighting registration method based on curvature features
Matkan et al. Road extraction from lidar data using support vector machine classification
CN104077760A (en) Rapid splicing system for aerial photogrammetry and implementing method thereof
CN107909018B (en) Stable multi-mode remote sensing image matching method and system
CN109993800A (en) A kind of detection method of workpiece size, device and storage medium
CN103727930A (en) Edge-matching-based relative pose calibration method of laser range finder and camera
CN105160649A (en) Multi-target tracking method and system based on kernel function unsupervised clustering
CN107862319B (en) Heterogeneous high-light optical image matching error eliminating method based on neighborhood voting
CN110084743B (en) Image splicing and positioning method based on multi-flight-zone initial flight path constraint
CN110569861A (en) Image matching positioning method based on point feature and contour feature fusion
CN105354841A (en) Fast matching method and system for remote sensing images
CN111242000A (en) Road edge detection method combining laser point cloud steering
JP2023530449A (en) Systems and methods for air and ground alignment
Gao et al. Multi-scale PIIFD for registration of multi-source remote sensing images
Jiang et al. Leveraging vocabulary tree for simultaneous match pair selection and guided feature matching of UAV images
Huang et al. SAR and optical images registration using shape context
CN114332172A (en) Improved laser point cloud registration method based on covariance matrix
Lu et al. A lightweight real-time 3D LiDAR SLAM for autonomous vehicles in large-scale urban environment
CN110765993B (en) SEM graph measuring method based on AI algorithm
Jin et al. Registration of UAV images using improved structural shape similarity based on mathematical morphology and phase congruency
CN104700359A (en) Super-resolution reconstruction method of image sequence in different polar axis directions of image plane

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant