CN113052311A - Feature extraction network with layer jump structure and method for generating features and descriptors

Info

Publication number
CN113052311A
Authority
CN
China
Prior art keywords
feature
image
layer
network
channel
Prior art date
Legal status
Granted
Application number
CN202110281763.8A
Other languages
Chinese (zh)
Other versions
CN113052311B (en)
Inventor
杨宁
韩云龙
郭雷
方俊
钟卫军
徐安林
Current Assignee
BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY
Northwestern Polytechnical University
China Xian Satellite Control Center
Original Assignee
BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY
Northwestern Polytechnical University
China Xian Satellite Control Center
Priority date
Filing date
Publication date
Application filed by BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY, Northwestern Polytechnical University, China Xian Satellite Control Center filed Critical BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY
Priority to CN202110281763.8A priority Critical patent/CN113052311B/en
Publication of CN113052311A publication Critical patent/CN113052311A/en
Application granted granted Critical
Publication of CN113052311B publication Critical patent/CN113052311B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a feature extraction network with a skip-layer structure and a method for generating features and descriptors. The network is an image feature extraction network with a skip-layer structure in which the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of VGG16 are fused, so that detail information from all levels is retained and the positioning accuracy of the feature points is effectively improved. The uniqueness index of a feature point describes how similar a local region of an image is to the other regions of the same image; the uniqueness of each position in the image, i.e. its similarity to all other positions, is measured by a uniqueness score. By selecting sufficiently unique feature points in the image, the matching performance of the network is improved. The invention achieves superior performance on the HPatches image matching data set, especially on its illumination sequences.

Description

Feature extraction network with layer jump structure and method for generating features and descriptors
Technical Field
The invention belongs to the field of image processing and computer vision, and relates to a feature extraction network with a layer jump structure and a method for detecting image feature points and generating descriptors by using the network.
Background
In many applications, such as visual localization, object detection, pose estimation and three-dimensional reconstruction, extracting feature points and descriptors of an image is crucial. These tasks require both high feature matching accuracy and high feature point detection accuracy. High feature matching accuracy means that, for two pictures to be matched, as few mismatches as possible occur when the feature points are matched; high feature point detection accuracy means that, for a successfully matched pair of feature points, the positions they point to in the two pictures correspond to exactly the same location of the scene.
The classical feature point detection pipeline works in two stages: key points are detected first, and a local descriptor is then computed for each key point. FAST was the first keypoint detection method to rely on machine learning. The SIFT method integrates the whole pipeline of image feature point detection and local feature description and is a typical detect-then-describe approach. LIFT, proposed by Yi et al., was the first to carry out image feature point detection, local feature description and matching entirely with convolutional neural networks, integrating a CNN-based feature point detection network, a local feature orientation estimation network and a local feature description network. Its drawbacks are the high overall complexity of the network and the fact that, during training, LIFT still uses screened SIFT feature points as feature point labels, so it cannot escape the limitations of the SIFT method. LF-Net, proposed by Ono et al., continues the overall idea of LIFT but greatly simplifies the network structure: the network is designed as a whole rather than treating each step of the feature point detection and local description pipeline as an independent module. For training, the feature point detector is trained in an unsupervised manner and the whole network is trained end to end, which improves performance while reducing network complexity.
The describe-then-detect methods that have appeared in recent years generally perform better than the earlier detect-then-describe methods. They realise both the detection and the description task with the same network and share most of the parameters between the two tasks, which reduces network complexity. SuperPoint, proposed by DeTone et al., uses a VGG-style network to extract image features and detects feature point coordinates after the convolutions with a technique similar to image super-resolution; its labels come from a feature point detector trained on a synthetic data set, which removes the bias of manual labelling. The D2-Net network proposed by Dusmanu et al. is one of the most prominent describe-then-detect methods. It uses a VGG-16 network as the backbone and connects a feature point detector in series after the output feature map of VGG-16. What distinguishes D2-Net from other work is that its feature point detector has no learned parameters; feature points are detected by a fixed rule only. Despite this simple structure, D2-Net achieved results comparable to SuperPoint when it was published, demonstrating the feasibility of the idea. The R2D2 network proposed by Revaud et al. likewise uses training data free of manual labelling errors; it uses optical flow instead of the MegaDepth data set to generate point correspondences, providing a new way to obtain training data, and it additionally introduces a descriptor reliability index to suppress mismatches.
However, we consider that many methods that jointly learn local feature points and descriptors suffer from two major limitations: 1. the feature point positioning accuracy is low, so camera geometry problems cannot be solved effectively; 2. much of the work concentrates on the design of the keypoint detector and only on its repeatability, which causes mismatches in regions with similar texture.
The localization accuracy of the keypoints has a large impact on the performance of many computer vision tasks; for example, D2-Net produces large projection errors in SfM. The low localization accuracy mainly stems from performing keypoint detection on low-resolution feature maps (e.g., D2-Net detects on a map at 1/4 of the original image resolution). To obtain better feature point accuracy, SuperPoint upsamples the low-resolution feature map produced by its VGG-like structure back to the original resolution and then performs feature point detection with pixel-level supervision, while R2D2 replaces the pooling layers with dilated convolutions so that the feature map resolution is unchanged, which adds a large amount of computation. ASLFeat improves on D2-Net by upsampling and fusing the feature point scores obtained at different resolutions, so as to recover all feature points while preserving their spatial accuracy. Although ASLFeat thus addresses feature point localization with little extra computation and obtains information from different levels, it only fuses score maps of different resolutions and can therefore capture only a small amount of multi-level information.
In many images, repetitive textures occupy large areas, for example leaves in nature, the windows of skyscrapers or ocean waves. For methods based on local gradient histograms, many positions with large gradients can be taken as feature points in such areas, but their similarity and instability make them impossible to match reliably. At the same time, much of the deep-learning-based work considers only repeatability when designing the keypoint detector. On the other hand, metric-learning methods for learning locally robust descriptors are trained at the locations provided by repeatability, which lie in regions that are repeatable but not necessarily matchable, and this may compromise performance. The recent R2D2 method handles unstable texture regions by learning a reliability score for each dense descriptor.
Disclosure of Invention
Technical problem to be solved
Aiming at the two problems of the currently popular methods that jointly learn image feature points and descriptors, the invention provides a feature extraction network with a skip-layer structure and a method for generating features and descriptors, i.e. an image feature point detection and descriptor generation method based on the skip-layer structure. Soft and hard feature point detection is then performed, and channel scores and uniqueness scores are used during feature point detection to select the correct feature points and descriptors; the uniqueness score effectively eliminates mismatches. Finally, feature points and descriptors with high positioning accuracy and high matching accuracy are obtained.
Technical scheme
A feature extraction network having a skip-layer structure, characterized in that: the main structure is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed; the output feature maps of the conv3_3 layer and the conv4_3 layer are upsampled by bilinear interpolation to the resolution of the conv2_2 output feature map, and the conv2_2 output is then concatenated (tensor splicing) with the upsampled feature maps, so that the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main structure are fused; the concatenation yields a feature map with 896 channels, which is passed through a 1 × 1 convolution to become a feature map F with 512 channels.
A method for detecting image feature points and generating descriptors by adopting the feature extraction network with the skip layer structure is characterized by comprising the following steps:
Step 1: a visible-light open-source data set is selected and labelled: each image in the data set is processed with a random homography transformation and color jitter, the processed image and the original image form an image pair, and the pixels of the image pair are related by the homography matrix; the labelled data set is used as the training set, and a labelled data set is selected as the verification set;
Step 2: the feature extraction network F with the skip-layer structure is used to extract features from the images in the training set, giving a 512-dimensional feature map F = F(I), F ∈ R^{h×w×n}, where h × w is the spatial resolution of the feature map and n is the number of channels;
Step 3: descriptor extraction is performed on the 512-dimensional feature map; each channel vector is regarded as a dense description of its position and is L2-regularized to obtain the dense descriptors of the image:

    d_ij = F_ij: ,   d̂_ij = d_ij / || d_ij ||_2

where i = 1, …, h, j = 1, …, w, d_ij is the dense description vector of the image at (i, j) and F is the 512-dimensional feature map;
Step 4: a soft feature point detector is used to compute the channel score and the uniqueness score of the feature points, and the channel score c_ij and the uniqueness score u_ij are multiplied to obtain the soft feature detector score at pixel (i, j): s_ij = c_ij · u_ij,
wherein:
the uniqueness score of the dense descriptor d̂_ij is

    u_ij = min_{(i′,j′)≠(i,j)} || d̂_ij − d̂_{i′j′} ||_2

where i = 1, …, h, j = 1, …, w, d̂_ij is the dense descriptor of the image at (i, j), u_ij is its uniqueness score, and U is the set of all u_ij;
the channel score of the descriptor d̂_ij is

    c_ij = max_{t=1,…,n} d̂_ij^t

where i = 1, …, h, j = 1, …, w, t = 1, …, n, and d̂_ij^t is the value of the descriptor d̂_ij in channel t;
Step 5: the loss of the soft feature point detection results is computed with a loss function, and the loss is back-propagated to train the feature extraction network with the skip-layer structure of step 2:

    L(I_1, I_2) = Σ_{c∈C} [ s_c^(1) · s_c^(2) / Σ_{q∈C} s_q^(1) · s_q^(2) ] · m(c)

where I_1, I_2 are the RGB image pair input to the network, C is the set of pixel correspondences between I_1 and I_2, s_c^(1) and s_c^(2) are the total feature point scores of correspondence c in I_1 and I_2 respectively, and m(c) is the triplet ranking loss;
Step 6: the feature extraction network with the skip-layer structure trained in step 5 is used to extract the feature maps of the verification set, and a hard feature detector selects, in the extracted 512-dimensional feature map, the pixels whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels as feature points, giving the feature points and descriptors of the test image.
Advantageous effects
The network of the invention is an image feature extraction network with a skip-layer structure in which the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of VGG16 are fused, so that detail information from all levels is retained and the positioning accuracy of the feature points is effectively improved. The uniqueness index of a feature point describes how similar a local region of an image is to the other regions of the same image; the uniqueness of each position in the image, i.e. its similarity to all other positions, is measured by a uniqueness score. By selecting sufficiently unique feature points in the image, the matching performance of the network is improved. The invention achieves superior performance on the HPatches image matching data set, especially on its illumination sequences.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
According to the invention, a feature fusion structure is added to the image feature extraction network, so that the feature map produced by the network contains semantic information of different levels: the lower-level information preserves more low-level image structure, such as edges and corners, which makes high-accuracy localization of image features possible, while the high-level semantic information helps to increase the accuracy of feature matching and to reduce mismatches when the local features are finally matched. At the same time, by introducing uniqueness detection of the feature points in the feature point detection stage, the invention effectively alleviates the mismatches that texture regions tend to produce. In tests, compared with D2-Net in image matching, the method roughly doubles the feature point positioning accuracy at a projection error threshold of 1, and it also performs very well at larger projection errors, where the mean matching accuracy reaches 0.913, an improvement of 0.011 over the currently best-performing ASLFeat.
Drawings
Fig. 1 is an overall structural diagram of the present invention, which includes three parts of image feature extraction, feature fusion and feature point detection.
Fig. 2 is a diagram of a feature extraction network architecture of the present invention.
FIG. 3 is a network training flow diagram.
FIG. 4 is a comparison graph of the feature point extraction effect of the present invention on HPatches.
FIG. 5 is a comparison of the matching effect of the present invention in HPatches.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
A method for detecting image feature points and generating descriptors with a skip-layer structure comprises the following steps.
Step 1: a visible-light open-source data set is selected and labelled. Each image in the data set is processed with a random homography transformation and color jitter, the processed image and the original image form an image pair, and the pixel correspondences between the image pair are given by the homography matrix. The labelled data set is used as the training set, and a labelled data set is selected as the verification set.
Step 2: the feature extraction network F with the skip-layer structure described below is used to extract features from the images of the training set from step 1, giving a 512-dimensional feature map F = F(I), F ∈ R^{h×w×n}, where h × w is the spatial resolution of the feature map and n is the number of channels.
The feature extraction network with the skip-layer structure is designed as follows. The main structure is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed. The output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main structure are then fused, which preserves the spatial localization accuracy of the feature points while combining features of different levels. First, bilinear interpolation is applied to the output feature maps of the conv3_3 and conv4_3 layers to upsample them to the resolution of the conv2_2 output, and the conv2_2 output is concatenated with the upsampled feature maps. The concatenation yields a feature map with 896 channels, which is then passed through a 1 × 1 convolution to become the 512-channel feature map F.
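A minimal PyTorch sketch of this skip-layer network is given below for illustration. The class name, the slicing indices into torchvision's VGG16 and the use of ImageNet-pretrained weights are assumptions made for the sketch; the patent itself only names the VGG16 conv layers that are tapped and fused.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import vgg16   # assumes torchvision >= 0.13 for the weights argument

    class SkipLayerExtractor(nn.Module):
        def __init__(self):
            super().__init__()
            feats = vgg16(weights="IMAGENET1K_V1").features
            # Tap points (indices are an assumption about the torchvision VGG16 layout):
            # conv2_2+ReLU ends at index 8, conv3_3+ReLU at 15, conv4_3+ReLU at 22.
            self.block1 = feats[:9]     # conv1_1 ... conv2_2 -> 128 channels, 1/2 resolution
            self.block2 = feats[9:16]   # pool2, conv3_1 ... conv3_3 -> 256 channels, 1/4 resolution
            self.block3 = feats[16:23]  # pool3, conv4_1 ... conv4_3 -> 512 channels, 1/8 resolution
            self.fuse = nn.Conv2d(128 + 256 + 512, 512, kernel_size=1)  # 896 -> 512 channels

        def forward(self, x):
            f2 = self.block1(x)
            f3 = self.block2(f2)
            f4 = self.block3(f3)
            size = f2.shape[-2:]
            # upsample conv3_3 and conv4_3 outputs to the conv2_2 resolution by bilinear interpolation
            f3_up = F.interpolate(f3, size=size, mode="bilinear", align_corners=False)
            f4_up = F.interpolate(f4, size=size, mode="bilinear", align_corners=False)
            fused = torch.cat([f2, f3_up, f4_up], dim=1)   # tensor splicing, 896 channels
            return self.fuse(fused)                        # 512-channel feature map F at 1/2 resolution

The fully connected layers of VGG16 never appear here because only the convolutional "features" part of the torchvision model is used.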
Step 3: descriptor extraction is performed on the 512-dimensional feature map extracted in step 2. Each channel vector is regarded as a dense description of its position and is L2-regularized to obtain the dense descriptors of the image:

    d_ij = F_ij: ,   d̂_ij = d_ij / || d_ij ||_2

where i = 1, …, h, j = 1, …, w, d_ij is the dense description vector of the image and F is the 512-dimensional feature map.
Step 4: a soft feature point detector is used to compute the channel score and the uniqueness score of the feature points, and the detected channel score and uniqueness score are multiplied to obtain the soft feature detector score at pixel (i, j).

The uniqueness score u_ij of the dense descriptor d̂_ij is

    u_ij = min_{(i′,j′)≠(i,j)} || d̂_ij − d̂_{i′j′} ||_2

where i = 1, …, h, j = 1, …, w, d̂_ij is the dense descriptor of the image at (i, j), u_ij is its uniqueness score, and U is the set of all u_ij.
The channel score of the descriptor d̂_ij is

    c_ij = max_{t=1,…,n} d̂_ij^t

where i = 1, …, h, j = 1, …, w, t = 1, …, n, and d̂_ij^t is the value of the descriptor d̂_ij in channel t.
The soft feature detector score at pixel (i, j) is

    s_ij = c_ij · u_ij

where c_ij is the channel score of the dense descriptor d̂_ij, u_ij is its uniqueness score, and s_ij is its total score. Only when both c_ij and u_ij are sufficiently large can s_ij be large; s_ij reflects how suitable the spatial position (i, j) is to serve as a feature point.
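As a small illustration, the soft detector scores described above can be computed as follows (assumptions: the descriptors are already L2-normalized, and the channel score is taken as the maximum channel response, which is one reading of the description; the pairwise-distance computation is quadratic in the number of positions and is only practical for small feature maps).

    import torch

    def soft_detection_scores(desc):                # desc: (n, h, w), L2-normalized per position
        n, h, w = desc.shape
        d = desc.reshape(n, h * w).t()              # (h*w, n) dense descriptor vectors
        dist = torch.cdist(d, d)                    # pairwise Euclidean distances
        dist.fill_diagonal_(float("inf"))           # exclude the distance of a position to itself
        u = dist.min(dim=1).values.reshape(h, w)    # uniqueness score u_ij (minimum distance)
        c = desc.max(dim=0).values                  # channel score c_ij (assumed: max over channels)
        return c * u                                # soft feature detector score s_ij = c_ij * u_ij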
Step 5: the following loss function is used to compute the loss of the soft feature point detection results of step 4, and the loss is then back-propagated to train the feature extraction network with the skip-layer structure of step 2.
The loss function is designed as follows. To train the network, the channel scores and uniqueness scores produced by the feature point detector are incorporated into the loss. For an image pair (I_1, I_2) input to the network with labelled pixel correspondences c: A ↔ B, A ∈ I_1, B ∈ I_2, where A and B are pixels of I_1 and I_2 respectively, we use the loss

    L(I_1, I_2) = Σ_{c∈C} [ s_c^(1) · s_c^(2) / Σ_{q∈C} s_q^(1) · s_q^(2) ] · m(c)

where I_1, I_2 are the RGB image pair input to the network, C is the set of correspondences between I_1 and I_2, s_c^(1) and s_c^(2) are the total feature point scores of correspondence c in I_1 and I_2 respectively, and m(c) is the triplet ranking loss, which minimizes the distance between the corresponding descriptors d̂_A^(1) and d̂_B^(2) while maximizing their distances to the most confusing other descriptors in the two images.
Using the feature point scores as weights in the loss function guarantees the sparsity of the loss and effectively prevents overfitting of the network. The loss can be decreased either by decreasing m(c), i.e. increasing the separation between the description vectors of matched feature points and the other description vectors and thus their discriminability, or by decreasing the score product s_c^(1) · s_c^(2), i.e. reducing the area of the picture with high feature point scores.
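The weighting described above can be sketched as follows (a sketch under the assumption that the per-correspondence triplet term m(c) and the soft scores at the labelled correspondences have already been computed; it is not the authors' training code).

    import torch

    def weighted_detection_loss(scores1, scores2, margin_term):
        # scores1, scores2: (num_correspondences,) soft detector scores s_c in image 1 and image 2
        # margin_term:      (num_correspondences,) triplet ranking term m(c) for each correspondence
        weight = scores1 * scores2                  # s_c^(1) * s_c^(2)
        weight = weight / weight.sum()              # normalize over all correspondences
        return (weight * margin_term).sum()         # weighted sum = loss L(I_1, I_2)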
Step 6: the feature extraction network with the skip-layer structure trained in step 5 is used to extract feature maps from the verification set of step 1, and the hard feature detector described below is applied to the extracted 512-dimensional feature map to select as feature points the pixels whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels, giving the feature points and descriptors of the test image.
The hard feature detector design is as follows:
the method comprises the steps of analyzing factors influencing feature point matching performance, providing a feature point uniqueness index from the viewpoint of improving the feature point matching performance, and designing a feature point detector based on feature uniqueness by combining analysis on feature point description vectors. The uniqueness of each position in the image, i.e. the similarity of each position in the image to all other positions, is measured by a uniqueness score. We improve the matching performance of the network by selecting sufficiently unique feature points in the image.
We define the uniqueness of a feature as the degree of similarity between a local region of an image and the other regions of the same image: the more dissimilar a local region is to the other local regions, the higher its distinctiveness. For a position (i, j) of the dense description, the uniqueness u_ij of the description vector d̂_ij is

    u_ij = min_{(i′,j′)≠(i,j)} || d̂_ij − d̂_{i′j′} ||_2

where U is the set of all u_ij. The uniqueness u_ij is the minimum distance between the dense descriptor vector d̂_ij and all the other dense descriptor vectors d̂_{i′j′}; the larger this minimum distance, the more unique the descriptor vector is among all descriptor vectors. We sort U in descending order and denote the rank of u_ij by p, requiring p < α|U|.
In our invention, the feature point position (i, j, k) in the dense descriptor d is determined jointly by the uniqueness u_ij of the descriptor vector d̂_ij and by its channel extremum d̂_ij^k. The uniqueness u_ij determines the spatial position of the feature point, while the channel extremum k indicates which feature response in the descriptor vector is largest; the spatial position (i, j) of the feature point is then refined using the corresponding k-th feature map.

For a description vector d̂_ij, the position k of its channel extremum d̂_ij^k is

    k = argmax_{t=1,…,n} d̂_ij^t,

and its uniqueness u_ij has rank p in the descending ordering of U. In the experiments α = 0.25 gave the best results, so the obtained feature points are more unique than the other 75% of points. In addition, although the coordinates of a feature point are (i, j, k), k is unique for each spatial position (i, j), so at most one point of the description vector at each spatial position can become a feature point.
The hard feature point detection condition of our invention is:

    (i, j, k) is a feature point  ⇔  k = argmax_{t=1,…,n} d̂_ij^t  and  u_ij is larger than a fraction α = 0.75 of the values in U,

where (i, j) is a spatial position in the feature map, i = 1, …, h, j = 1, …, w, d̂_ij^t is the value of the descriptor d̂_ij in channel t, k is the t at which d̂_ij^t is maximal, u_ij is the uniqueness of the dense descriptor at (i, j), and U is the set of uniqueness values of all dense descriptors; in other words, the selected points are more unique than 75% of all positions.
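A sketch of this hard detection rule is given below (assumptions: the uniqueness scores u_ij were computed as in the soft-detection sketch above, and the percentile form of the uniqueness condition is an interpretation of the prose, since the exact inequality appears only as an image in the original text).

    import torch

    def hard_detect(desc, uniqueness, keep_ratio=0.25):
        # desc:       (n, h, w) L2-normalized dense descriptors
        # uniqueness: (h, w) uniqueness scores u_ij
        n, h, w = desc.shape
        u = uniqueness.reshape(-1)
        threshold = torch.quantile(u, 1.0 - keep_ratio)    # keep the top 25% most unique positions
        keep = u >= threshold                              # i.e. more unique than 75% of positions
        k = desc.reshape(n, -1).argmax(dim=0)              # channel extremum k at every position
        idx = keep.nonzero(as_tuple=False).squeeze(1)
        return idx // w, idx % w, k[idx]                   # feature point coordinates (i, j, k)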
The specific embodiment is as follows:
referring to fig. 1, the present invention performs the detection and local feature description of the image feature points according to the following steps:
step 1: the train dataset of COCO2014 was chosen for annotation, containing 82783 images. Each image in the data set is processed by using random homography change and color dithering, the processed image and the original image form an image pair, and pixels between the image pair are connected together through a homography matrix. The labeled train dataset of COCO2014 is used as a training set. The test set was trained using a standard HPatches data set.
Step 2: referring to fig. 2, feature extraction is performed on the training set generated in step 1 using the feature extraction network with the skip-layer structure. The main structure of the feature extraction network is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed. To maintain the spatial localization accuracy of the feature points while fusing features of different levels, the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the VGG-16 network are fused: first, the conv3_3 and conv4_3 outputs are upsampled by bilinear interpolation to the resolution of conv2_2 and then concatenated with the conv2_2 output. After concatenation we obtain a feature map with 896 channels carrying semantic information of three different levels. A 1 × 1 convolution is then applied to fuse the semantic features of the different levels, yielding the feature map F with 512 channels.
Step 3: the feature map F of step 2 contains detail information of different levels and has 1/2 of the resolution of the original image, and the invention performs descriptor extraction directly on it.

The dense description vectors d of the image are

    d_ij = F_ij: ,   i = 1, …, h, j = 1, …, w.

When comparing images, corresponding descriptor vectors can conveniently be related using the Euclidean distance. As in previous work, the descriptor vectors d_ij are L2-regularized:

    d̂_ij = d_ij / || d_ij ||_2
and 4, step 4: referring to the feature point detection section of FIG. 1, we use soft feature detection on the feature map, each dense descriptor
Figure BDA0002978815040000103
Is characterized by a uniqueness score of uijChannel contrast score of cijCalculating each dense descriptor
Figure BDA0002978815040000104
The total score is sij
Step 5: the loss of the soft feature point detection scores of step 4 is computed in the following form:

    L(I_1, I_2) = Σ_{c∈C} [ s_c^(1) · s_c^(2) / Σ_{q∈C} s_q^(1) · s_q^(2) ] · m(c)

where I_1, I_2 are the RGB image pair input to the network, C is the set of correspondences between I_1 and I_2, s_c^(1) and s_c^(2) are the total feature point scores of correspondence c in I_1 and I_2 respectively, and m(c) is the triplet ranking loss, which minimizes the distance between the corresponding descriptors d̂_A^(1) and d̂_B^(2) while maximizing their distances to the most confusing other descriptors in the two images.
Referring to fig. 3, the network is trained following the sequence of steps 2, 3, 4 and 5. To reduce the amount of computation during training, the extracted feature maps F_1 and F_2 are first down-sampled by average pooling to 1/2 resolution, and soft feature point detection is then performed. To obtain a better training effect and to save training time, Adam is used to fine-tune the network from the weights pre-trained on the ImageNet image classification task; during fine-tuning, the conv2_2, conv3_3 and conv4_3 layers are unlocked. During training, each iteration takes a batch of 8 image pairs with their labels, center-cropped to 224 × 224, and Adam is used with an initial learning rate of 10^-5; 40 training batches are run in total.
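The fine-tuning setup described above can be sketched as follows. The helper assumes the SkipLayerExtractor sketch given earlier (the unfrozen modules are located as the last convolution of each tapped block); it also unfreezes the newly added 1 × 1 fusion convolution, which is not stated in the text but has to be trained since it is initialized from scratch.

    import torch
    import torch.nn as nn

    def configure_finetuning(model):
        for p in model.parameters():
            p.requires_grad = False                          # freeze everything first
        def last_conv(block):
            return [m for m in block if isinstance(m, nn.Conv2d)][-1]
        # unlock conv2_2, conv3_3, conv4_3 and the 1x1 fusion layer
        for module in (last_conv(model.block1), last_conv(model.block2),
                       last_conv(model.block3), model.fuse):
            for p in module.parameters():
                p.requires_grad = True
        trainable = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.Adam(trainable, lr=1e-5)          # initial learning rate 10^-5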
Using the feature point scores as weights in the loss function guarantees the sparsity of the loss and effectively prevents overfitting of the network. The loss can be decreased either by decreasing m(c), i.e. increasing the separation between the description vectors of matched feature points and the other description vectors and thus their discriminability, or by decreasing the score product s_c^(1) · s_c^(2), i.e. reducing the area of the picture with high feature point scores.
Step 6: the feature extraction network with the skip-layer structure trained in step 5 is used to extract feature maps from the verification set HPatches provided in step 1. The pixels whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels in the feature map are selected as feature points, giving the feature points and descriptors of the test image.
The test results are shown in fig. 4. During testing, the keypoints are post-processed with SIFT-like edge elimination (threshold set to 10) and sub-pixel refinement, and the descriptors are then bilinearly interpolated at the refined positions.
The HPatches data set was constructed by Balntas et al. for evaluating image feature descriptors. It contains 116 image sequences of different scenes: 59 scenes form the viewpoint group, i.e. sequences of the same planar scene photographed from different viewpoints, and the other 57 scenes form the illumination group, i.e. sequences of the same scene photographed from a fixed viewpoint under different illumination conditions. Each scene of the HPatches data set has 6 images, the first of which is the reference image. In the experiments, sequences with resolution greater than 1600 × 1200 are discarded and the remaining 52 illumination sequences and 56 viewpoint sequences are used for testing. We first extract feature points and descriptors for each image sequence with the different methods, then match the feature points of each method using nearest neighbour search, accepting only mutual nearest neighbours, and use the mean matching accuracy (MMA) as the evaluation metric.
For each image pair, the invention matches the features extracted by each method using a nearest neighbour search, accepting only mutual nearest neighbours. A match is considered correct if the reprojection error under the homography estimate provided by the data set is below a given matching threshold. To demonstrate the superiority of the invention, we compare it with different methods: the traditional methods HAN + HN++ and RootSIFT, and the methods that jointly learn feature points and descriptors, namely SuperPoint, LF-Net, D2-Net, R2D2 and the latest ASLFeat. The MMA values of the different methods at different thresholds are recorded, and the comparison is given in tables 1 and 2.
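The matching and scoring protocol described in the two paragraphs above can be sketched as follows (assumptions: descriptors and keypoints are torch tensors, H_gt is the 3 × 3 ground-truth homography mapping image-1 pixels to image-2 pixels, and the threshold is in pixels).

    import torch

    def mutual_nn_matches(desc1, desc2):             # (N1, d), (N2, d) descriptors
        dist = torch.cdist(desc1, desc2)
        nn12 = dist.argmin(dim=1)                    # best match in image 2 for each keypoint of image 1
        nn21 = dist.argmin(dim=0)                    # best match in image 1 for each keypoint of image 2
        idx1 = torch.arange(desc1.shape[0])
        mutual = nn21[nn12] == idx1                  # accept only mutual nearest neighbours
        return idx1[mutual], nn12[mutual]

    def matching_accuracy(kp1, kp2, matches, H_gt, threshold=3.0):   # kp*: (N, 2) float (x, y) pixels
        i1, i2 = matches
        p1 = torch.cat([kp1[i1], torch.ones(len(i1), 1)], dim=1) @ H_gt.t()
        p1 = p1[:, :2] / p1[:, 2:3]                  # project image-1 keypoints into image 2
        err = (p1 - kp2[i2]).norm(dim=1)             # reprojection error in pixels
        return (err < threshold).float().mean()      # fraction of correct matches at this threshold

Averaging this accuracy over image pairs at each threshold gives the MMA values reported in tables 1 and 2.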
Referring to fig. 4 and fig. 5, we qualitatively show the effect of our method on the HPatches data set. Three groups of feature points are selected, and it can clearly be seen that our method effectively suppresses repetitive texture regions in the scene, such as leaves, grass and paved ground; these regions contain a large number of feature points but, because of their self-similarity and instability, very easily cause mismatches. For feature point matching, one sequence each from the illumination group and the viewpoint group is selected for comparison. In the illumination sequence, D2-Net and ASLFeat clearly obtain more matches, but in unstable texture regions such as sky and leaves these matches are invalid. In the viewpoint sequence, ASLFeat and D2-Net spend more effort on these unstable matches, while our method produces more representative matches.
Table 1 comparison of the verification effect of the present invention on the HPatches verification set.
Table 2 comparison results of feature point positioning accuracy of the overall effect of the present invention.

Claims (2)

1. A feature extraction network having a skip-layer structure, characterized in that: the main structure is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed; the output feature maps of the conv3_3 layer and the conv4_3 layer are upsampled by bilinear interpolation to the resolution of the conv2_2 output feature map, and the conv2_2 output is then concatenated (tensor splicing) with the upsampled feature maps, so that the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main structure are fused; the concatenation yields a feature map with 896 channels, which is passed through a 1 × 1 convolution to become a feature map F with 512 channels.
2. A method for detecting image feature points and generating descriptors by using the feature extraction network with a layer-skipping structure as claimed in claim 1, characterized by the steps of:
Step 1: a visible-light open-source data set is selected and labelled: each image in the data set is processed with a random homography transformation and color jitter, the processed image and the original image form an image pair, and the pixels of the image pair are related by the homography matrix; the labelled data set is used as the training set, and a labelled data set is selected as the verification set;
Step 2: the feature extraction network F with the skip-layer structure is used to extract features from the images in the training set, giving a 512-dimensional feature map F = F(I), F ∈ R^{h×w×n}, where h × w is the spatial resolution of the feature map and n is the number of channels;
Step 3: descriptor extraction is performed on the 512-dimensional feature map; each channel vector is regarded as a dense description of its position and is L2-regularized to obtain the dense descriptors of the image:

    d_ij = F_ij: ,   d̂_ij = d_ij / || d_ij ||_2

where i = 1, …, h, j = 1, …, w, d_ij is the dense description vector of the image at (i, j) and F is the 512-dimensional feature map;
Step 4: a soft feature point detector is used to compute the channel score and the uniqueness score of the feature points, and the channel score c_ij and the uniqueness score u_ij are multiplied to obtain the soft feature detector score at pixel (i, j): s_ij = c_ij · u_ij,
wherein:
the uniqueness score of the dense descriptor d̂_ij is

    u_ij = min_{(i′,j′)≠(i,j)} || d̂_ij − d̂_{i′j′} ||_2

where i = 1, …, h, j = 1, …, w, d̂_ij is the dense descriptor of the image at (i, j), u_ij is its uniqueness score, and U is the set of all u_ij;
the channel score of the descriptor d̂_ij is

    c_ij = max_{t=1,…,n} d̂_ij^t

where i = 1, …, h, j = 1, …, w, t = 1, …, n, and d̂_ij^t is the value of the descriptor d̂_ij in channel t;
Step 5: the loss of the soft feature point detection results is computed with a loss function, and the loss is back-propagated to train the feature extraction network with the skip-layer structure of step 2:

    L(I_1, I_2) = Σ_{c∈C} [ s_c^(1) · s_c^(2) / Σ_{q∈C} s_q^(1) · s_q^(2) ] · m(c)

where I_1, I_2 are the RGB image pair input to the network, C is the set of pixel correspondences between I_1 and I_2, s_c^(1) and s_c^(2) are the total feature point scores of correspondence c in I_1 and I_2 respectively, and m(c) is the triplet ranking loss;
Step 6: the feature extraction network with the skip-layer structure trained in step 5 is used to extract the feature maps of the verification set, and a hard feature detector selects, in the extracted 512-dimensional feature map, the pixels whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels as feature points, giving the feature points and descriptors of the test image.
CN202110281763.8A 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors Active CN113052311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110281763.8A CN113052311B (en) 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110281763.8A CN113052311B (en) 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors

Publications (2)

Publication Number Publication Date
CN113052311A true CN113052311A (en) 2021-06-29
CN113052311B CN113052311B (en) 2024-01-19

Family

ID=76512664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110281763.8A Active CN113052311B (en) 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors

Country Status (1)

Country Link
CN (1) CN113052311B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332509A (en) * 2021-12-29 2022-04-12 阿波罗智能技术(北京)有限公司 Image processing method, model training method, electronic device and automatic driving vehicle

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN110781924A (en) * 2019-09-29 2020-02-11 哈尔滨工程大学 Side-scan sonar image feature extraction method based on full convolution neural network
CN110827238A (en) * 2019-09-29 2020-02-21 哈尔滨工程大学 Improved side-scan sonar image feature extraction method of full convolution neural network
CN110929748A (en) * 2019-10-12 2020-03-27 杭州电子科技大学 Motion blur image feature matching method based on deep learning
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN112365501A (en) * 2021-01-13 2021-02-12 南京理工大学 Weldment contour detection algorithm based on convolutional neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN110781924A (en) * 2019-09-29 2020-02-11 哈尔滨工程大学 Side-scan sonar image feature extraction method based on full convolution neural network
CN110827238A (en) * 2019-09-29 2020-02-21 哈尔滨工程大学 Improved side-scan sonar image feature extraction method of full convolution neural network
CN110929748A (en) * 2019-10-12 2020-03-27 杭州电子科技大学 Motion blur image feature matching method based on deep learning
CN112365501A (en) * 2021-01-13 2021-02-12 南京理工大学 Weldment contour detection algorithm based on convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liao Mingzhe; Wu Jin; Zhu Lei: "Remote sensing image matching based on ResNet and RF-Net", Chinese Journal of Liquid Crystals and Displays, no. 09 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332509A (en) * 2021-12-29 2022-04-12 阿波罗智能技术(北京)有限公司 Image processing method, model training method, electronic device and automatic driving vehicle

Also Published As

Publication number Publication date
CN113052311B (en) 2024-01-19

Similar Documents

Publication Publication Date Title
Zhou et al. Bottom-up object detection by grouping extreme and center points
US12020474B2 (en) Image processing apparatus, image processing method, and non-transitory computer-readable storage medium
Zheng et al. Cross-domain object detection through coarse-to-fine feature adaptation
CN110956185B (en) Method for detecting image salient object
CN112101150B (en) Multi-feature fusion pedestrian re-identification method based on orientation constraint
CN108549891B (en) Multi-scale diffusion well-marked target detection method based on background Yu target priori
CN109344701B (en) Kinect-based dynamic gesture recognition method
Von Stumberg et al. Gn-net: The gauss-newton loss for multi-weather relocalization
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
US9633282B2 (en) Cross-trained convolutional neural networks using multimodal images
CN110059586B (en) Iris positioning and segmenting system based on cavity residual error attention structure
Najibi et al. Fa-rpn: Floating region proposals for face detection
CN111126412B (en) Image key point detection method based on characteristic pyramid network
CN113408492A (en) Pedestrian re-identification method based on global-local feature dynamic alignment
Lee et al. Unsupervised video object segmentation via prototype memory network
CN111046789A (en) Pedestrian re-identification method
CN109977834B (en) Method and device for segmenting human hand and interactive object from depth image
CN113159043A (en) Feature point matching method and system based on semantic information
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN111652273A (en) Deep learning-based RGB-D image classification method
CN111709317A (en) Pedestrian re-identification method based on multi-scale features under saliency model
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN113052311B (en) Feature extraction network with layer jump structure and method for generating features and descriptors
CN111582057B (en) Face verification method based on local receptive field

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant