CN113052311A - Feature extraction network with layer jump structure and method for generating features and descriptors

Info

Publication number
CN113052311A
Authority
CN
China
Prior art keywords
feature
image
layer
network
channel
Prior art date
Legal status
Granted
Application number
CN202110281763.8A
Other languages
Chinese (zh)
Other versions
CN113052311B (en)
Inventor
杨宁
韩云龙
郭雷
方俊
钟卫军
徐安林
Current Assignee
BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY
Northwestern Polytechnical University
China Xian Satellite Control Center
Original Assignee
BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY
Northwestern Polytechnical University
China Xian Satellite Control Center
Priority date
Filing date
Publication date
Application filed by BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY, Northwestern Polytechnical University, China Xian Satellite Control Center filed Critical BEIJING INSTITUTE OF TRACKING AND COMMUNICATION TECHNOLOGY
Priority to CN202110281763.8A priority Critical patent/CN113052311B/en
Publication of CN113052311A publication Critical patent/CN113052311A/en
Application granted granted Critical
Publication of CN113052311B publication Critical patent/CN113052311B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a feature extraction network with a skip-layer structure and a method for generating features and descriptors. The network is an image feature extraction network with a skip-layer structure in which the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of VGG16 are fused, so that detail information from all levels is retained and the positioning accuracy of the feature points is effectively improved. The uniqueness index of a feature point describes how similar a local region of an image is to the other regions of the same image; the uniqueness of each position in the image, i.e. its similarity to all other positions, is measured by a uniqueness score. By selecting sufficiently unique feature points in the image, the matching performance of the network is improved. The invention achieves superior performance on the HPatches image matching data set, especially on its illumination sequences.

Description

Feature extraction network with layer jump structure and method for generating features and descriptors
Technical Field
The invention belongs to the field of image processing and computer vision, and relates to a feature extraction network with a layer jump structure and a method for detecting image feature points and generating descriptors by using the network.
Background
In many applications, such as visual localization, object detection, pose estimation and three-dimensional reconstruction, extracting feature points and descriptors of an image is crucial. These tasks require both high feature matching accuracy and high feature point detection accuracy. High feature matching accuracy means that, for two pictures to be matched, as few mismatches as possible occur when the feature points are matched; high feature point detection accuracy means that, for a successfully matched pair of feature points, the positions they point to in the two pictures correspond to exactly the same location of the scene.
The classical feature point detection pipeline works in two stages: key points are detected first, and a local descriptor is then computed for each key point. FAST was the first keypoint detection method to rely on machine learning. The SIFT method integrates the whole pipeline of image feature point detection and local feature description and is a typical detect-then-describe approach. LIFT, proposed by Yi et al., was the first to carry out image feature point detection, local feature description and matching entirely with convolutional neural networks, integrating a CNN-based feature point detection network, a local feature orientation estimation network and a local feature description network. Its drawbacks are the high overall complexity of the network and the fact that, during training, LIFT still uses screened SIFT feature points as feature point labels, so it cannot escape the limitations of the SIFT method. LF-Net, proposed by Ono et al., continues the overall idea of LIFT but greatly simplifies the network structure: the network is designed as a whole rather than treating each step of the feature point detection and local description pipeline as an independent module. For training, the feature point detector is trained in an unsupervised manner and the whole network is trained end to end, which improves performance while reducing network complexity.
The describe-then-detect methods that have appeared in recent years generally perform better than the earlier detect-then-describe methods. They realise both the detection and the description task with the same network and share most of the parameters between the two tasks, which reduces network complexity. SuperPoint, proposed by DeTone et al., uses a VGG-style network to extract image features and detects feature point coordinates after the convolutions with a technique similar to image super-resolution; its labels come from a feature point detector trained on a synthetic data set, which removes the bias of manual labelling. The D2-Net network proposed by Dusmanu et al. is one of the most prominent describe-then-detect methods. It uses a VGG-16 network as the backbone and connects a feature point detector in series after the output feature map of VGG-16. What distinguishes D2-Net from other work is that its feature point detector has no learned parameters; feature points are detected by a fixed rule only. Despite this simple structure, D2-Net achieved results comparable to SuperPoint when it was published, demonstrating the feasibility of the idea. The R2D2 network proposed by Revaud et al. likewise uses training data free of manual labelling errors; it uses optical flow instead of the MegaDepth data set to generate point correspondences, providing a new way to obtain training data, and it additionally introduces a descriptor reliability index to suppress mismatches.
However, we consider that many methods that jointly learn local feature points and descriptors suffer from two major limitations: 1. the feature point positioning accuracy is low, so camera geometry problems cannot be solved effectively; 2. much of the work concentrates on the design of the keypoint detector and only on its repeatability, which causes mismatches in regions with similar texture.
The localization accuracy of the keypoints has a large impact on the performance of many computer vision tasks; for example, D2-Net produces large projection errors in SfM. The low localization accuracy mainly stems from performing keypoint detection on low-resolution feature maps (e.g., D2-Net detects on a map at 1/4 of the original image resolution). To obtain better feature point accuracy, SuperPoint upsamples the low-resolution feature map produced by its VGG-like structure back to the original resolution and then performs feature point detection with pixel-level supervision, while R2D2 replaces the pooling layers with dilated convolutions so that the feature map resolution is unchanged, which adds a large amount of computation. ASLFeat improves on D2-Net by upsampling and fusing the feature point scores obtained at different resolutions, so as to recover all feature points while preserving their spatial accuracy. Although ASLFeat thus addresses feature point localization with little extra computation and obtains information from different levels, it only fuses score maps of different resolutions and can therefore capture only a small amount of multi-level information.
In many images, repetitive textures occupy large areas, for example leaves in nature, the windows of skyscrapers or ocean waves. For methods based on local gradient histograms, many positions with large gradients can be taken as feature points in such areas, but their similarity and instability make them impossible to match reliably. At the same time, much of the deep-learning-based work considers only repeatability when designing the keypoint detector. On the other hand, metric-learning methods for learning locally robust descriptors are trained at the locations provided by repeatability, which lie in regions that are repeatable but not necessarily matchable, and this may compromise performance. The recent R2D2 method handles unstable texture regions by learning a reliability score for each dense descriptor.
Disclosure of Invention
Technical problem to be solved
Aiming at the two problems of the currently popular methods that jointly learn image feature points and descriptors, the invention provides a feature extraction network with a skip-layer structure and a method for generating features and descriptors, i.e. an image feature point detection and descriptor generation method based on the skip-layer structure. Soft and hard feature point detection is then performed, and channel scores and uniqueness scores are used during feature point detection to select the correct feature points and descriptors; the uniqueness score effectively eliminates mismatches. Finally, feature points and descriptors with high positioning accuracy and high matching accuracy are obtained.
Technical scheme
A feature extraction network having a skip-layer structure, characterized in that: the main structure is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed; the output feature maps of the conv3_3 layer and the conv4_3 layer are upsampled by bilinear interpolation to the resolution of the conv2_2 output feature map, and the conv2_2 output is then concatenated (tensor splicing) with the upsampled feature maps, so that the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main structure are fused; the concatenation yields a feature map with 896 channels, which is passed through a 1 × 1 convolution to become a feature map F with 512 channels.
A method for detecting image feature points and generating descriptors by adopting the feature extraction network with the skip layer structure is characterized by comprising the following steps:
Step 1: a visible-light open-source data set is selected and labelled: each image in the data set is processed with a random homography transformation and color jitter, the processed image and the original image form an image pair, and the pixels of the image pair are related by the homography matrix; the labelled data set is used as the training set, and a labelled data set is selected as the verification set;
Step 2: the feature extraction network F with the skip-layer structure is used to extract features from the images in the training set, giving a 512-dimensional feature map F = F(I), F ∈ R^{h×w×n}, where h × w is the spatial resolution of the feature map and n is the number of channels;
Step 3: descriptor extraction is performed on the 512-dimensional feature map; each channel vector is regarded as a dense description of its position and is L2-regularized to obtain the dense descriptors of the image:

    d_ij = F_ij: ,   d̂_ij = d_ij / || d_ij ||_2

where i = 1, …, h, j = 1, …, w, d_ij is the dense description vector of the image at (i, j) and F is the 512-dimensional feature map;
Step 4: a soft feature point detector is used to compute the channel score and the uniqueness score of the feature points, and the channel score c_ij and the uniqueness score u_ij are multiplied to obtain the soft feature detector score at pixel (i, j): s_ij = c_ij · u_ij,
wherein:
the uniqueness score of the dense descriptor d̂_ij is

    u_ij = min_{(i′,j′)≠(i,j)} || d̂_ij − d̂_{i′j′} ||_2

where i = 1, …, h, j = 1, …, w, d̂_ij is the dense descriptor of the image at (i, j), u_ij is its uniqueness score, and U is the set of all u_ij;
the channel score of the descriptor d̂_ij is

    c_ij = max_{t=1,…,n} d̂_ij^t

where i = 1, …, h, j = 1, …, w, t = 1, …, n, and d̂_ij^t is the value of the descriptor d̂_ij in channel t;
Step 5: the loss of the soft feature point detection results is computed with a loss function, and the loss is back-propagated to train the feature extraction network with the skip-layer structure of step 2:

    L(I_1, I_2) = Σ_{c∈C} [ s_c^(1) · s_c^(2) / Σ_{q∈C} s_q^(1) · s_q^(2) ] · m(c)

where I_1, I_2 are the RGB image pair input to the network, C is the set of pixel correspondences between I_1 and I_2, s_c^(1) and s_c^(2) are the total feature point scores of correspondence c in I_1 and I_2 respectively, and m(c) is the triplet ranking loss;
Step 6: the feature extraction network with the skip-layer structure trained in step 5 is used to extract the feature maps of the verification set, and a hard feature detector selects, in the extracted 512-dimensional feature map, the pixels whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels as feature points, giving the feature points and descriptors of the test image.
Advantageous effects
The network of the invention is an image feature extraction network with a skip-layer structure in which the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of VGG16 are fused, so that detail information from all levels is retained and the positioning accuracy of the feature points is effectively improved. The uniqueness index of a feature point describes how similar a local region of an image is to the other regions of the same image; the uniqueness of each position in the image, i.e. its similarity to all other positions, is measured by a uniqueness score. By selecting sufficiently unique feature points in the image, the matching performance of the network is improved. The invention achieves superior performance on the HPatches image matching data set, especially on its illumination sequences.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
According to the invention, a feature fusion structure is added to the image feature extraction network, so that the feature map produced by the network contains semantic information of different levels: the lower-level information preserves more low-level image structure, such as edges and corners, which makes high-accuracy localization of image features possible, while the high-level semantic information helps to increase the accuracy of feature matching and to reduce mismatches when the local features are finally matched. At the same time, by introducing uniqueness detection of the feature points in the feature point detection stage, the invention effectively alleviates the mismatches that texture regions tend to produce. In tests, compared with D2-Net in image matching, the method roughly doubles the feature point positioning accuracy at a projection error threshold of 1, and it also performs very well at larger projection errors, where the mean matching accuracy reaches 0.913, an improvement of 0.011 over the currently best-performing ASLFeat.
Drawings
Fig. 1 is an overall structural diagram of the present invention, which includes three parts of image feature extraction, feature fusion and feature point detection.
Fig. 2 is a diagram of a feature extraction network architecture of the present invention.
FIG. 3 is a network training flow diagram.
FIG. 4 is a comparison graph of the feature point extraction effect of the present invention on HPatches.
FIG. 5 is a comparison of the matching effect of the present invention in HPatches.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
A method for detecting image feature points and generating descriptors with a skip-layer structure comprises the following steps.
Step 1: a visible-light open-source data set is selected and labelled. Each image in the data set is processed with a random homography transformation and color jitter, the processed image and the original image form an image pair, and the pixel correspondences between the image pair are given by the homography matrix. The labelled data set is used as the training set, and a labelled data set is selected as the verification set.
Step 2: the feature extraction network F with the skip-layer structure described below is used to extract features from the images of the training set from step 1, giving a 512-dimensional feature map F = F(I), F ∈ R^{h×w×n}, where h × w is the spatial resolution of the feature map and n is the number of channels.
The feature extraction network with the skip-layer structure is designed as follows. The main structure is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed. The output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main structure are then fused, which preserves the spatial localization accuracy of the feature points while combining features of different levels. First, bilinear interpolation is applied to the output feature maps of the conv3_3 and conv4_3 layers to upsample them to the resolution of the conv2_2 output, and the conv2_2 output is concatenated with the upsampled feature maps. The concatenation yields a feature map with 896 channels, which is then passed through a 1 × 1 convolution to become the 512-channel feature map F.
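A minimal PyTorch sketch of this skip-layer network is given below for illustration. The class name, the slicing indices into torchvision's VGG16 and the use of ImageNet-pretrained weights are assumptions made for the sketch; the patent itself only names the VGG16 conv layers that are tapped and fused.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import vgg16   # assumes torchvision >= 0.13 for the weights argument

    class SkipLayerExtractor(nn.Module):
        def __init__(self):
            super().__init__()
            feats = vgg16(weights="IMAGENET1K_V1").features
            # Tap points (indices are an assumption about the torchvision VGG16 layout):
            # conv2_2+ReLU ends at index 8, conv3_3+ReLU at 15, conv4_3+ReLU at 22.
            self.block1 = feats[:9]     # conv1_1 ... conv2_2 -> 128 channels, 1/2 resolution
            self.block2 = feats[9:16]   # pool2, conv3_1 ... conv3_3 -> 256 channels, 1/4 resolution
            self.block3 = feats[16:23]  # pool3, conv4_1 ... conv4_3 -> 512 channels, 1/8 resolution
            self.fuse = nn.Conv2d(128 + 256 + 512, 512, kernel_size=1)  # 896 -> 512 channels

        def forward(self, x):
            f2 = self.block1(x)
            f3 = self.block2(f2)
            f4 = self.block3(f3)
            size = f2.shape[-2:]
            # upsample conv3_3 and conv4_3 outputs to the conv2_2 resolution by bilinear interpolation
            f3_up = F.interpolate(f3, size=size, mode="bilinear", align_corners=False)
            f4_up = F.interpolate(f4, size=size, mode="bilinear", align_corners=False)
            fused = torch.cat([f2, f3_up, f4_up], dim=1)   # tensor splicing, 896 channels
            return self.fuse(fused)                        # 512-channel feature map F at 1/2 resolution

The fully connected layers of VGG16 never appear here because only the convolutional "features" part of the torchvision model is used.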
Step 3: descriptor extraction is performed on the 512-dimensional feature map extracted in step 2. Each channel vector is regarded as a dense description of its position and is L2-regularized to obtain the dense descriptors of the image:

    d_ij = F_ij: ,   d̂_ij = d_ij / || d_ij ||_2

where i = 1, …, h, j = 1, …, w, d_ij is the dense description vector of the image and F is the 512-dimensional feature map.
Step 4: a soft feature point detector is used to compute the channel score and the uniqueness score of the feature points, and the detected channel score and uniqueness score are multiplied to obtain the soft feature detector score at pixel (i, j).

The uniqueness score u_ij of the dense descriptor d̂_ij is

    u_ij = min_{(i′,j′)≠(i,j)} || d̂_ij − d̂_{i′j′} ||_2

where i = 1, …, h, j = 1, …, w, d̂_ij is the dense descriptor of the image at (i, j), u_ij is its uniqueness score, and U is the set of all u_ij.
The channel score of the descriptor d̂_ij is

    c_ij = max_{t=1,…,n} d̂_ij^t

where i = 1, …, h, j = 1, …, w, t = 1, …, n, and d̂_ij^t is the value of the descriptor d̂_ij in channel t.
The soft feature detector score at pixel (i, j) is

    s_ij = c_ij · u_ij

where c_ij is the channel score of the dense descriptor d̂_ij, u_ij is its uniqueness score, and s_ij is its total score. Only when both c_ij and u_ij are sufficiently large can s_ij be large; s_ij reflects how suitable the spatial position (i, j) is to serve as a feature point.
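As a small illustration, the soft detector scores described above can be computed as follows (assumptions: the descriptors are already L2-normalized, and the channel score is taken as the maximum channel response, which is one reading of the description; the pairwise-distance computation is quadratic in the number of positions and is only practical for small feature maps).

    import torch

    def soft_detection_scores(desc):                # desc: (n, h, w), L2-normalized per position
        n, h, w = desc.shape
        d = desc.reshape(n, h * w).t()              # (h*w, n) dense descriptor vectors
        dist = torch.cdist(d, d)                    # pairwise Euclidean distances
        dist.fill_diagonal_(float("inf"))           # exclude the distance of a position to itself
        u = dist.min(dim=1).values.reshape(h, w)    # uniqueness score u_ij (minimum distance)
        c = desc.max(dim=0).values                  # channel score c_ij (assumed: max over channels)
        return c * u                                # soft feature detector score s_ij = c_ij * u_ij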
Step 5: the following loss function is used to compute the loss of the soft feature point detection results of step 4, and the loss is then back-propagated to train the feature extraction network with the skip-layer structure of step 2.
The loss function is designed as follows. To train the network, the channel scores and uniqueness scores produced by the feature point detector are incorporated into the loss. For an image pair (I_1, I_2) input to the network with labelled pixel correspondences c: A ↔ B, A ∈ I_1, B ∈ I_2, where A and B are pixels of I_1 and I_2 respectively, we use the loss

    L(I_1, I_2) = Σ_{c∈C} [ s_c^(1) · s_c^(2) / Σ_{q∈C} s_q^(1) · s_q^(2) ] · m(c)

where I_1, I_2 are the RGB image pair input to the network, C is the set of correspondences between I_1 and I_2, s_c^(1) and s_c^(2) are the total feature point scores of correspondence c in I_1 and I_2 respectively, and m(c) is the triplet ranking loss, which minimizes the distance between the corresponding descriptors d̂_A^(1) and d̂_B^(2) while maximizing their distances to the most confusing other descriptors in the two images.
Using the feature point scores as weights in the loss function guarantees the sparsity of the loss and effectively prevents overfitting of the network. The loss can be decreased either by decreasing m(c), i.e. increasing the separation between the description vectors of matched feature points and the other description vectors and thus their discriminability, or by decreasing the score product s_c^(1) · s_c^(2), i.e. reducing the area of the picture with high feature point scores.
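The weighting described above can be sketched as follows (a sketch under the assumption that the per-correspondence triplet term m(c) and the soft scores at the labelled correspondences have already been computed; it is not the authors' training code).

    import torch

    def weighted_detection_loss(scores1, scores2, margin_term):
        # scores1, scores2: (num_correspondences,) soft detector scores s_c in image 1 and image 2
        # margin_term:      (num_correspondences,) triplet ranking term m(c) for each correspondence
        weight = scores1 * scores2                  # s_c^(1) * s_c^(2)
        weight = weight / weight.sum()              # normalize over all correspondences
        return (weight * margin_term).sum()         # weighted sum = loss L(I_1, I_2)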
Step 6: the feature extraction network with the skip-layer structure trained in step 5 is used to extract feature maps from the verification set of step 1, and the hard feature detector described below is applied to the extracted 512-dimensional feature map to select as feature points the pixels whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels, giving the feature points and descriptors of the test image.
The hard feature detector design is as follows:
the method comprises the steps of analyzing factors influencing feature point matching performance, providing a feature point uniqueness index from the viewpoint of improving the feature point matching performance, and designing a feature point detector based on feature uniqueness by combining analysis on feature point description vectors. The uniqueness of each position in the image, i.e. the similarity of each position in the image to all other positions, is measured by a uniqueness score. We improve the matching performance of the network by selecting sufficiently unique feature points in the image.
We define the uniqueness of a feature as the degree of similarity between a local region of an image and the other regions of the same image: the more dissimilar a local region is to the other local regions, the higher its distinctiveness. For a position (i, j) of the dense description, the uniqueness u_ij of the description vector d̂_ij is

    u_ij = min_{(i′,j′)≠(i,j)} || d̂_ij − d̂_{i′j′} ||_2

where U is the set of all u_ij. The uniqueness u_ij is the minimum distance between the dense descriptor vector d̂_ij and all the other dense descriptor vectors d̂_{i′j′}; the larger this minimum distance, the more unique the descriptor vector is among all descriptor vectors. We sort U in descending order and denote the rank of u_ij by p, requiring p < α|U|.
In our invention, the feature point position (i, j, k) in the dense descriptor d is determined jointly by the uniqueness u_ij of the descriptor vector d̂_ij and by its channel extremum d̂_ij^k. The uniqueness u_ij determines the spatial position of the feature point, while the channel extremum k indicates which feature response in the descriptor vector is largest; the spatial position (i, j) of the feature point is then refined using the corresponding k-th feature map.

For a description vector d̂_ij, the position k of its channel extremum d̂_ij^k is

    k = argmax_{t=1,…,n} d̂_ij^t,

and its uniqueness u_ij has rank p in the descending ordering of U. In the experiments α = 0.25 gave the best results, so the obtained feature points are more unique than the other 75% of points. In addition, although the coordinates of a feature point are (i, j, k), k is unique for each spatial position (i, j), so at most one point of the description vector at each spatial position can become a feature point.
The hard feature point detection condition of our invention is:

    (i, j, k) is a feature point  ⇔  k = argmax_{t=1,…,n} d̂_ij^t  and  u_ij is larger than a fraction α = 0.75 of the values in U,

where (i, j) is a spatial position in the feature map, i = 1, …, h, j = 1, …, w, d̂_ij^t is the value of the descriptor d̂_ij in channel t, k is the t at which d̂_ij^t is maximal, u_ij is the uniqueness of the dense descriptor at (i, j), and U is the set of uniqueness values of all dense descriptors; in other words, the selected points are more unique than 75% of all positions.
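A sketch of this hard detection rule is given below (assumptions: the uniqueness scores u_ij were computed as in the soft-detection sketch above, and the percentile form of the uniqueness condition is an interpretation of the prose, since the exact inequality appears only as an image in the original text).

    import torch

    def hard_detect(desc, uniqueness, keep_ratio=0.25):
        # desc:       (n, h, w) L2-normalized dense descriptors
        # uniqueness: (h, w) uniqueness scores u_ij
        n, h, w = desc.shape
        u = uniqueness.reshape(-1)
        threshold = torch.quantile(u, 1.0 - keep_ratio)    # keep the top 25% most unique positions
        keep = u >= threshold                              # i.e. more unique than 75% of positions
        k = desc.reshape(n, -1).argmax(dim=0)              # channel extremum k at every position
        idx = keep.nonzero(as_tuple=False).squeeze(1)
        return idx // w, idx % w, k[idx]                   # feature point coordinates (i, j, k)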
The specific embodiment is as follows:
referring to fig. 1, the present invention performs the detection and local feature description of the image feature points according to the following steps:
step 1: the train dataset of COCO2014 was chosen for annotation, containing 82783 images. Each image in the data set is processed by using random homography change and color dithering, the processed image and the original image form an image pair, and pixels between the image pair are connected together through a homography matrix. The labeled train dataset of COCO2014 is used as a training set. The test set was trained using a standard HPatches data set.
Step 2: referring to fig. 2, feature extraction is performed on the training set generated in step 1 using the feature extraction network with the skip-layer structure. The main structure of the feature extraction network is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed. To maintain the spatial localization accuracy of the feature points while fusing features of different levels, the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the VGG-16 network are fused: first, the conv3_3 and conv4_3 outputs are upsampled by bilinear interpolation to the resolution of conv2_2 and then concatenated with the conv2_2 output. After concatenation we obtain a feature map with 896 channels carrying semantic information of three different levels. A 1 × 1 convolution is then applied to fuse the semantic features of the different levels, yielding the feature map F with 512 channels.
Step 3: the feature map F of step 2 contains detail information of different levels and has 1/2 of the resolution of the original image, and the invention performs descriptor extraction directly on it.

The dense description vectors d of the image are

    d_ij = F_ij: ,   i = 1, …, h, j = 1, …, w.

When comparing images, corresponding descriptor vectors can conveniently be related using the Euclidean distance. As in previous work, the descriptor vectors d_ij are L2-regularized:

    d̂_ij = d_ij / || d_ij ||_2
and 4, step 4: referring to the feature point detection section of FIG. 1, we use soft feature detection on the feature map, each dense descriptor
Figure BDA0002978815040000103
Is characterized by a uniqueness score of uijChannel contrast score of cijCalculating each dense descriptor
Figure BDA0002978815040000104
The total score is sij
Step 5: the loss of the soft feature point detection scores of step 4 is computed in the following form:

    L(I_1, I_2) = Σ_{c∈C} [ s_c^(1) · s_c^(2) / Σ_{q∈C} s_q^(1) · s_q^(2) ] · m(c)

where I_1, I_2 are the RGB image pair input to the network, C is the set of correspondences between I_1 and I_2, s_c^(1) and s_c^(2) are the total feature point scores of correspondence c in I_1 and I_2 respectively, and m(c) is the triplet ranking loss, which minimizes the distance between the corresponding descriptors d̂_A^(1) and d̂_B^(2) while maximizing their distances to the most confusing other descriptors in the two images.
Referring to fig. 3, the network is trained following the sequence of steps 2, 3, 4 and 5. To reduce the amount of computation during training, the extracted feature maps F_1 and F_2 are first down-sampled by average pooling to 1/2 resolution, and soft feature point detection is then performed. To obtain a better training effect and to save training time, Adam is used to fine-tune the network from the weights pre-trained on the ImageNet image classification task; during fine-tuning, the conv2_2, conv3_3 and conv4_3 layers are unlocked. During training, each iteration takes a batch of 8 image pairs with their labels, center-cropped to 224 × 224, and Adam is used with an initial learning rate of 10^-5; 40 training batches are run in total.
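The fine-tuning setup described above can be sketched as follows. The helper assumes the SkipLayerExtractor sketch given earlier (the unfrozen modules are located as the last convolution of each tapped block); it also unfreezes the newly added 1 × 1 fusion convolution, which is not stated in the text but has to be trained since it is initialized from scratch.

    import torch
    import torch.nn as nn

    def configure_finetuning(model):
        for p in model.parameters():
            p.requires_grad = False                          # freeze everything first
        def last_conv(block):
            return [m for m in block if isinstance(m, nn.Conv2d)][-1]
        # unlock conv2_2, conv3_3, conv4_3 and the 1x1 fusion layer
        for module in (last_conv(model.block1), last_conv(model.block2),
                       last_conv(model.block3), model.fuse):
            for p in module.parameters():
                p.requires_grad = True
        trainable = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.Adam(trainable, lr=1e-5)          # initial learning rate 10^-5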
Using the feature point scores as weights in the loss function guarantees the sparsity of the loss and effectively prevents overfitting of the network. The loss can be decreased either by decreasing m(c), i.e. increasing the separation between the description vectors of matched feature points and the other description vectors and thus their discriminability, or by decreasing the score product s_c^(1) · s_c^(2), i.e. reducing the area of the picture with high feature point scores.
Step 6: the feature extraction network with the skip-layer structure trained in step 5 is used to extract feature maps from the verification set HPatches provided in step 1. The pixels whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels in the feature map are selected as feature points, giving the feature points and descriptors of the test image.
The test results are shown in fig. 4. During testing, the keypoints are post-processed with SIFT-like edge elimination (threshold set to 10) and sub-pixel refinement, and the descriptors are then bilinearly interpolated at the refined positions.
The HPatches data set was constructed by Balntas et al. for evaluating image feature descriptors. It contains 116 image sequences of different scenes: 59 scenes form the viewpoint group, i.e. sequences of the same planar scene photographed from different viewpoints, and the other 57 scenes form the illumination group, i.e. sequences of the same scene photographed from a fixed viewpoint under different illumination conditions. Each scene of the HPatches data set has 6 images, the first of which is the reference image. In the experiments, sequences with resolution greater than 1600 × 1200 are discarded and the remaining 52 illumination sequences and 56 viewpoint sequences are used for testing. We first extract feature points and descriptors for each image sequence with the different methods, then match the feature points of each method using nearest neighbour search, accepting only mutual nearest neighbours, and use the mean matching accuracy (MMA) as the evaluation metric.
For each image pair, the invention matches the features extracted by each method using a nearest neighbour search, accepting only mutual nearest neighbours. A match is considered correct if the reprojection error under the homography estimate provided by the data set is below a given matching threshold. To demonstrate the superiority of the invention, we compare it with different methods: the traditional methods HAN + HN++ and RootSIFT, and the methods that jointly learn feature points and descriptors, namely SuperPoint, LF-Net, D2-Net, R2D2 and the latest ASLFeat. The MMA values of the different methods at different thresholds are recorded, and the comparison is given in tables 1 and 2.
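The matching and scoring protocol described in the two paragraphs above can be sketched as follows (assumptions: descriptors and keypoints are torch tensors, H_gt is the 3 × 3 ground-truth homography mapping image-1 pixels to image-2 pixels, and the threshold is in pixels).

    import torch

    def mutual_nn_matches(desc1, desc2):             # (N1, d), (N2, d) descriptors
        dist = torch.cdist(desc1, desc2)
        nn12 = dist.argmin(dim=1)                    # best match in image 2 for each keypoint of image 1
        nn21 = dist.argmin(dim=0)                    # best match in image 1 for each keypoint of image 2
        idx1 = torch.arange(desc1.shape[0])
        mutual = nn21[nn12] == idx1                  # accept only mutual nearest neighbours
        return idx1[mutual], nn12[mutual]

    def matching_accuracy(kp1, kp2, matches, H_gt, threshold=3.0):   # kp*: (N, 2) float (x, y) pixels
        i1, i2 = matches
        p1 = torch.cat([kp1[i1], torch.ones(len(i1), 1)], dim=1) @ H_gt.t()
        p1 = p1[:, :2] / p1[:, 2:3]                  # project image-1 keypoints into image 2
        err = (p1 - kp2[i2]).norm(dim=1)             # reprojection error in pixels
        return (err < threshold).float().mean()      # fraction of correct matches at this threshold

Averaging this accuracy over image pairs at each threshold gives the MMA values reported in tables 1 and 2.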
Referring to fig. 4 and fig. 5, we qualitatively show the effect of our method on the HPatches data set. Three groups of feature points are selected, and it can clearly be seen that our method effectively suppresses repetitive texture regions in the scene, such as leaves, grass and paved ground; these regions contain a large number of feature points but, because of their self-similarity and instability, very easily cause mismatches. For feature point matching, one sequence each from the illumination group and the viewpoint group is selected for comparison. In the illumination sequence, D2-Net and ASLFeat clearly obtain more matches, but in unstable texture regions such as sky and leaves these matches are invalid. In the viewpoint sequence, ASLFeat and D2-Net spend more effort on these unstable matches, while our method produces more representative matches.
Table 1 comparison of the verification effect of the present invention on the HPatches verification set.
Table 2 comparison results of feature point positioning accuracy of the overall effect of the present invention.

Claims (2)

1. A feature extraction network having a skip-layer structure, characterized in that: the main structure is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed; the output feature maps of the conv3_3 layer and the conv4_3 layer are upsampled by bilinear interpolation to the resolution of the conv2_2 output feature map, and the conv2_2 output is then concatenated (tensor splicing) with the upsampled feature maps, so that the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main structure are fused; the concatenation yields a feature map with 896 channels, which is passed through a 1 × 1 convolution to become a feature map F with 512 channels.
2. A method for detecting image feature points and generating descriptors by using the feature extraction network with a layer-skipping structure as claimed in claim 1, characterized by the steps of:
Step 1: a visible-light open-source data set is selected and labelled: each image in the data set is processed with a random homography transformation and color jitter, the processed image and the original image form an image pair, and the pixels of the image pair are related by the homography matrix; the labelled data set is used as the training set, and a labelled data set is selected as the verification set;
Step 2: the feature extraction network F with the skip-layer structure is used to extract features from the images in the training set, giving a 512-dimensional feature map F = F(I), F ∈ R^{h×w×n}, where h × w is the spatial resolution of the feature map and n is the number of channels;
Step 3: descriptor extraction is performed on the 512-dimensional feature map; each channel vector is regarded as a dense description of its position and is L2-regularized to obtain the dense descriptors of the image:

    d_ij = F_ij: ,   d̂_ij = d_ij / || d_ij ||_2

where i = 1, …, h, j = 1, …, w, d_ij is the dense description vector of the image at (i, j) and F is the 512-dimensional feature map;
Step 4: a soft feature point detector is used to compute the channel score and the uniqueness score of the feature points, and the channel score c_ij and the uniqueness score u_ij are multiplied to obtain the soft feature detector score at pixel (i, j): s_ij = c_ij · u_ij,
wherein:
the uniqueness score of the dense descriptor d̂_ij is

    u_ij = min_{(i′,j′)≠(i,j)} || d̂_ij − d̂_{i′j′} ||_2

where i = 1, …, h, j = 1, …, w, d̂_ij is the dense descriptor of the image at (i, j), u_ij is its uniqueness score, and U is the set of all u_ij;
the channel score of the descriptor d̂_ij is

    c_ij = max_{t=1,…,n} d̂_ij^t

where i = 1, …, h, j = 1, …, w, t = 1, …, n, and d̂_ij^t is the value of the descriptor d̂_ij in channel t;
Step 5: the loss of the soft feature point detection results is computed with a loss function, and the loss is back-propagated to train the feature extraction network with the skip-layer structure of step 2:

    L(I_1, I_2) = Σ_{c∈C} [ s_c^(1) · s_c^(2) / Σ_{q∈C} s_q^(1) · s_q^(2) ] · m(c)

where I_1, I_2 are the RGB image pair input to the network, C is the set of pixel correspondences between I_1 and I_2, s_c^(1) and s_c^(2) are the total feature point scores of correspondence c in I_1 and I_2 respectively, and m(c) is the triplet ranking loss;
Step 6: the feature extraction network with the skip-layer structure trained in step 5 is used to extract the feature maps of the verification set, and a hard feature detector selects, in the extracted 512-dimensional feature map, the pixels whose channel response is maximal and whose uniqueness exceeds that of 75% of the other pixels as feature points, giving the feature points and descriptors of the test image.
CN202110281763.8A 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors Active CN113052311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110281763.8A CN113052311B (en) 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110281763.8A CN113052311B (en) 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors

Publications (2)

Publication Number Publication Date
CN113052311A true CN113052311A (en) 2021-06-29
CN113052311B CN113052311B (en) 2024-01-19

Family

ID=76512664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110281763.8A Active CN113052311B (en) 2021-03-16 2021-03-16 Feature extraction network with layer jump structure and method for generating features and descriptors

Country Status (1)

Country Link
CN (1) CN113052311B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332509A (en) * 2021-12-29 2022-04-12 阿波罗智能技术(北京)有限公司 Image processing method, model training method, electronic device and automatic driving vehicle

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN110781924A (en) * 2019-09-29 2020-02-11 哈尔滨工程大学 Side-scan sonar image feature extraction method based on full convolution neural network
CN110827238A (en) * 2019-09-29 2020-02-21 哈尔滨工程大学 Improved side-scan sonar image feature extraction method of full convolution neural network
CN110929748A (en) * 2019-10-12 2020-03-27 杭州电子科技大学 Motion blur image feature matching method based on deep learning
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN112365501A (en) * 2021-01-13 2021-02-12 南京理工大学 Weldment contour detection algorithm based on convolutional neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN110781924A (en) * 2019-09-29 2020-02-11 哈尔滨工程大学 Side-scan sonar image feature extraction method based on full convolution neural network
CN110827238A (en) * 2019-09-29 2020-02-21 哈尔滨工程大学 Improved side-scan sonar image feature extraction method of full convolution neural network
CN110929748A (en) * 2019-10-12 2020-03-27 杭州电子科技大学 Motion blur image feature matching method based on deep learning
CN112365501A (en) * 2021-01-13 2021-02-12 南京理工大学 Weldment contour detection algorithm based on convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liao Mingzhe; Wu Jin; Zhu Lei: "Remote sensing image matching based on ResNet and RF-Net", Chinese Journal of Liquid Crystals and Displays, no. 09 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332509A (en) * 2021-12-29 2022-04-12 阿波罗智能技术(北京)有限公司 Image processing method, model training method, electronic device and automatic driving vehicle

Also Published As

Publication number Publication date
CN113052311B (en) 2024-01-19

Similar Documents

Publication Publication Date Title
Zhou et al. Bottom-up object detection by grouping extreme and center points
US12020474B2 (en) Image processing apparatus, image processing method, and non-transitory computer-readable storage medium
Zheng et al. Cross-domain object detection through coarse-to-fine feature adaptation
CN110956185B (en) Method for detecting image salient object
CN112101150B (en) Multi-feature fusion pedestrian re-identification method based on orientation constraint
CN108549891B (en) Multi-scale diffusion well-marked target detection method based on background Yu target priori
CN109344701B (en) Kinect-based dynamic gesture recognition method
Von Stumberg et al. Gn-net: The gauss-newton loss for multi-weather relocalization
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
US9633282B2 (en) Cross-trained convolutional neural networks using multimodal images
CN110059586B (en) Iris positioning and segmenting system based on cavity residual error attention structure
Najibi et al. Fa-rpn: Floating region proposals for face detection
CN111126412B (en) Image key point detection method based on characteristic pyramid network
CN113408492A (en) Pedestrian re-identification method based on global-local feature dynamic alignment
Lee et al. Unsupervised video object segmentation via prototype memory network
CN111046789A (en) Pedestrian re-identification method
CN109977834B (en) Method and device for segmenting human hand and interactive object from depth image
CN113159043A (en) Feature point matching method and system based on semantic information
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN111652273A (en) Deep learning-based RGB-D image classification method
CN111709317A (en) Pedestrian re-identification method based on multi-scale features under saliency model
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN113052311B (en) Feature extraction network with layer jump structure and method for generating features and descriptors
CN111582057B (en) Face verification method based on local receptive field

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant