CN113052311A - Feature extraction network with layer jump structure and method for generating features and descriptors - Google Patents
- Publication number
- CN113052311A (application CN202110281763.8A)
- Authority
- CN
- China
- Prior art keywords
- feature
- image
- layer
- network
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The invention relates to a feature extraction network with a layer-skipping structure and a method for generating feature points and descriptors. The network is an image feature extraction network with a layer-skipping structure in which the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of VGG16 are fused, so that detail information of all levels is retained and feature point positioning accuracy is effectively improved. The uniqueness index of a feature point describes how similar a local area of an image is to the other areas of the same image. A uniqueness score, derived from the similarity of each position in the image to all other positions, measures the uniqueness of each position. The matching performance of the network is improved by selecting sufficiently unique feature points in the image. The invention achieves superior performance on the HPatches data set, especially on its illumination sequences.
Description
Technical Field
The invention belongs to the field of image processing and computer vision, and relates to a feature extraction network with a layer jump structure and a method for detecting image feature points and generating descriptors by using the network.
Background
In many applications, such as visual localization, object detection, pose estimation and three-dimensional reconstruction, it is crucial to extract feature points and descriptors from an image. These tasks call for both high feature matching accuracy and high feature point detection accuracy. High feature matching accuracy means that, for two pictures to be matched, as few mismatches as possible occur during feature point matching. High feature point detection accuracy means that, for each successfully matched feature point pair, the positions the two points indicate in the two pictures correspond to exactly the same position in the scene.
The classical feature point detection approach has two stages: key points are detected first, and a local descriptor is then computed for each key point. FAST was the first keypoint detection method to rely on machine learning. The SIFT method integrates the whole process of image feature point detection and local feature description, and is a typical detect-then-describe approach. LIFT, proposed by Yi et al., was the first to use convolutional neural networks for the complete pipeline of image feature point detection, local feature description and matching; it integrates a CNN-based feature point detection network, a local feature orientation estimation network and a local feature description network. Its drawbacks are that the overall network complexity is high and that, during training, LIFT still uses filtered SIFT feature points as feature point labels, so it cannot escape the limitations of SIFT. LF-Net, proposed by Ono et al., follows the overall idea of LIFT but greatly simplifies the network structure: the network is designed as a whole instead of treating each step of the detection and description pipeline as an independent module. For training, the feature point detector is trained in an unsupervised manner and the whole network is trained end to end, which improves performance and reduces network complexity.
Describe-then-detect methods, which have appeared in recent years, generally perform better than the earlier detect-then-describe methods. They use the same network for both detection and description, and most parameters are shared between the two tasks, which reduces network complexity. SuperPoint, proposed by DeTone et al., uses a VGG-style network to extract image features and, after the convolutions, detects feature point coordinates with a method similar to image super-resolution. The labels of SuperPoint are produced by a feature point detector trained on a synthetic data set, which removes the bias of manual labeling. D2-Net, proposed by Dusmanu et al., is a prominent representative of the describe-then-detect approach. It uses a VGG-16 network as the backbone and attaches a feature point detector to the output feature map of VGG-16. What distinguishes D2-Net from other work is that its feature point detector has no learned parameters; feature points are detected by a fixed algorithm only. Despite its simple structure, D2-Net achieved results not inferior to those of SuperPoint when it was published, which demonstrates the feasibility of the idea. The R2D2 network, proposed by Revaud et al., also uses training data free of manual annotation bias: it generates point correspondences from optical flow instead of from the MegaDepth data set, which offers a new way to obtain training data. It also introduces a reliability index for descriptors to eliminate mismatches.
However, we consider that many methods for jointly learning local feature points and descriptors share two major limitations: 1. The feature point positioning accuracy is low, so camera geometry problems cannot be solved effectively. 2. Much of the work on keypoint detector design focuses only on repeatability, which can cause mismatches in regions with similar texture.
The positioning accuracy of keypoints has a large impact on the performance of many computer vision tasks; for example, D2-Net shows a large projection error in SfM. The low positioning accuracy is mainly due to keypoint detection being performed on low-resolution feature maps (e.g., D2-Net detects at 1/4 of the original image resolution). To obtain better accuracy, SuperPoint up-samples the low-resolution feature map produced by a VGG-like network to the original resolution and then detects feature points under pixel-level supervision, while R2D2 replaces pooling layers with dilated convolutions to keep the feature map resolution unchanged, which adds a large amount of computation. ASLFeat improves on D2-Net by up-sampling and fusing the feature point scores obtained at different resolutions, so as to obtain all feature points while preserving their spatial precision. Although ASLFeat addresses positioning accuracy with little additional computation and obtains feature information of different levels, it fuses only the score maps of different resolutions and therefore captures only a small amount of multi-level information.
Many images contain large areas of salient repetitive texture, such as leaves in nature, the windows of skyscrapers or ocean waves. For methods based on local gradient histograms, many positions with large gradients can be taken as feature points, but because these positions are similar to one another and unstable, they cannot be matched. At the same time, much deep-learning-based work focuses only on repeatability when designing the keypoint detector. Conversely, metric learning methods for learning locally robust descriptors are trained at locations supplied by a repeatability-based detector; these locations are repeatable but may not be accurately matchable, which can harm performance. The recent R2D2 method handles unstable texture regions by learning a reliability score for each dense descriptor.
Disclosure of Invention
Technical problem to be solved
Aiming at the two problems of the currently popular methods for jointly learning image feature points and descriptors, the invention provides a feature extraction network with a layer-skipping structure and a method for detecting image feature points and generating descriptors with that network. Soft and hard feature point detection are then carried out, and correct feature points and descriptors are selected using the channel score and the uniqueness score in feature point detection; the uniqueness score effectively eliminates mismatches. Finally, feature points and descriptors with high positioning accuracy and high matching accuracy are obtained.
Technical scheme
A feature extraction network having a layer-skipping structure, characterized in that: the main body structure is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed; the output feature maps of the conv3_3 layer and the conv4_3 layer are up-sampled by bilinear interpolation to the resolution of the output feature map of the conv2_2 layer, and tensor concatenation is then performed on the conv2_2 output and the up-sampled feature maps, so that the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main body structure are fused; the tensor concatenation yields a feature map with 896 channels, which is converted by a 1 × 1 convolution into a feature map F with 512 channels.
A method for detecting image feature points and generating descriptors by adopting the feature extraction network with the skip layer structure is characterized by comprising the following steps:
step 1: selecting an open-source visible light data set for labeling, processing each image in the data set with a random homography transformation and color jitter, forming an image pair from the processed image and the original image, and relating the pixels of the image pair through the homography matrix; taking the labeled data set as the training set, and selecting a data set with labels as the verification set;
step 2: using a feature extraction network $\mathcal{F}$ with a layer-skipping structure to extract features from each image $I$ in the training set, obtaining a 512-dimensional feature map $F = \mathcal{F}(I)$, $F \in \mathbb{R}^{h \times w \times n}$, where $h \times w$ is the spatial resolution of the feature map and $n$ is the number of channels;
step 3: performing descriptor extraction on the 512-dimensional feature map: the channel vector at each position is regarded as the dense description of that position, and L2 normalization is applied to the channel vectors to obtain the dense descriptors of the image:
$$d_{ij} = F_{ij:}, \qquad \hat{d}_{ij} = \frac{d_{ij}}{\lVert d_{ij} \rVert_2}$$
where $i = 1, \dots, h$, $j = 1, \dots, w$, $d_{ij}$ is the dense description vector of the image, and $F$ is the 512-dimensional feature map;
step 4: using a soft feature point detector to compute the channel score and the uniqueness score of the feature points, and finally multiplying the channel score $c_{ij}$ by the uniqueness score $u_{ij}$ to obtain the soft feature detector score at pixel $(i, j)$: $s_{ij} = c_{ij} u_{ij}$;
Wherein:
$i = 1, \dots, h$ and $j = 1, \dots, w$; $\hat{d}_{ij}$ is the dense descriptor of the image; $u_{ij}$ is the uniqueness score of the dense descriptor; and $U$ is the set of all $u_{ij}$;
step 5: computing the loss of the soft feature point detection component with a loss function, and then training the feature extraction network with the layer-skipping structure of step 2 by back-propagating the loss:
where $I_1, I_2$ is the RGB image pair input to the network, $C$ is the set of pixel correspondences between the images $I_1$ and $I_2$, the two score terms are the total scores of the corresponding feature points in $I_1$ and $I_2$ respectively, and $m(c)$ is the triplet margin ranking loss;
step 6: using the feature extraction network with the layer-skipping structure trained in step 5 to extract the feature maps of the verification set, and using a hard feature detector to select, in the extracted 512-dimensional feature map, the pixels whose channel response is the largest and which are more unique than the other 75% of pixels as feature points, thereby obtaining the feature points and descriptors of the test image.
Advantageous effects
The network is an image feature extraction network with a layer-skipping structure in which the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of VGG16 are fused, so that detail information of all levels is retained and feature point positioning accuracy is effectively improved. The uniqueness index of a feature point describes how similar a local area of an image is to the other areas of the same image. A uniqueness score, derived from the similarity of each position in the image to all other positions, measures the uniqueness of each position. The matching performance of the network is improved by selecting sufficiently unique feature points in the image. The invention achieves superior performance on the HPatches data set, especially on its illumination sequences.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
According to the invention, a feature fusion structure is added to the image feature extraction network, so that the feature map obtained by the network contains semantic information of different levels. The lower-level information retains more low-level image details, such as edges and corners, which makes high-precision detection of image features possible, while the high-level semantic information helps improve the accuracy of feature matching and reduce mismatches when local features are finally matched. In addition, by designing a uniqueness check for feature points in the feature point detection stage, the invention effectively addresses the mismatches that easily arise in textured regions. In image matching tests, compared with D2-Net, the method improves feature point positioning accuracy at a projection error threshold of 1 by 2 times. The method also performs very well at large projection error thresholds, where the mean matching accuracy reaches 0.913, which is 0.011 higher than that of ASLFeat, the best-performing existing method.
Drawings
Fig. 1 is an overall structural diagram of the present invention, which includes three parts of image feature extraction, feature fusion and feature point detection.
Fig. 2 is a diagram of a feature extraction network architecture of the present invention.
FIG. 3 is a network training flow diagram.
FIG. 4 is a comparison graph of the feature point extraction effect of the present invention on HPatches.
FIG. 5 is a comparison of the matching effect of the present invention in HPatches.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
A method for detecting image feature points and generating descriptors with a layer-skipping structure comprises the following steps:
Step 1: an open-source visible light data set is selected for labeling. Each image in the data set is processed with a random homography transformation and color jitter, the processed image and the original image form an image pair, and the pixels of the image pair are related through the homography matrix. The labeled data set is taken as the training set, and a data set with labels is selected as the verification set.
Step 2: using the feature extraction network $\mathcal{F}$ with the layer-skipping structure described below, features are extracted from each training set image $I$ of step 1, obtaining a 512-dimensional feature map $F = \mathcal{F}(I)$, $F \in \mathbb{R}^{h \times w \times n}$, where $h \times w$ is the spatial resolution of the feature map and $n$ is the number of channels.
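As an illustration of how a homography matrix relates the pixels of an image pair, the following is a minimal NumPy sketch. The function name `warp_points` and the example matrix are hypothetical and are not taken from the invention; the sketch only shows the standard projective mapping of pixel coordinates.

```python
import numpy as np

def warp_points(points_xy, H):
    """Map (N, 2) pixel coordinates (x, y) through a 3x3 homography H."""
    pts = np.asarray(points_xy, dtype=np.float64)
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])  # homogeneous coordinates
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                  # back to Cartesian

# Hypothetical homography: a pure translation by (+5, -3) pixels.
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, -3.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[10.0, 20.0], [0.0, 0.0]])
warped = warp_points(pts, H)
```

Each row of `warped` is the position in the transformed image that corresponds to the same row of `pts` in the original image, which is the kind of pixel-level label the training pairs rely on.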
The feature extraction network with a layer-skipping structure is designed as follows. The main body structure is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed. We then fuse the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main body structure, so that the spatial positioning accuracy of the feature points is maintained and features of different levels are fused. First, the output feature maps of the conv3_3 and conv4_3 layers are up-sampled by bilinear interpolation to the resolution of the output feature map of the conv2_2 layer, and tensor concatenation is then performed on the conv2_2 output and the up-sampled feature maps. The concatenation yields a feature map with 896 channels. This feature map is then passed through a 1 × 1 convolution to become the 512-channel feature map F.
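The fusion step can be sketched in NumPy as follows. This is only a shape-level illustration under stated assumptions: the three inputs stand in for the VGG16 outputs (128, 256 and 512 channels, which sum to the 896 channels mentioned above), the weights are random placeholders, and the helper names `bilinear_resize` and `fuse` are hypothetical.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinearly resize a (C, H, W) array to (C, out_h, out_w)."""
    _, h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]          # vertical interpolation weights
    wx = (xs - x0)[None, None, :]          # horizontal interpolation weights
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def fuse(conv2_2, conv3_3, conv4_3, weight):
    """Upsample to the conv2_2 resolution, concatenate, then apply a 1x1 convolution."""
    _, h, w = conv2_2.shape
    stacked = np.concatenate(
        [conv2_2, bilinear_resize(conv3_3, h, w), bilinear_resize(conv4_3, h, w)],
        axis=0)                            # 128 + 256 + 512 = 896 channels
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    return np.einsum("oc,chw->ohw", weight, stacked)

rng = np.random.default_rng(0)
h, w = 16, 16                              # conv2_2 resolution (placeholder size)
conv2_2 = rng.standard_normal((128, h, w))
conv3_3 = rng.standard_normal((256, h // 2, w // 2))
conv4_3 = rng.standard_normal((512, h // 4, w // 4))
weight = rng.standard_normal((512, 896)) * 0.01   # placeholder 1x1 conv weights
F = fuse(conv2_2, conv3_3, conv4_3, weight)       # (512, 16, 16)
```

In a real implementation the three inputs would come from the corresponding layers of a VGG16 backbone and the 1 × 1 convolution weights would be learned.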
Step 3: descriptor extraction is performed on the 512-dimensional feature map extracted in step 2. The channel vector at each position is regarded as the dense description of that position, and L2 normalization is applied to it to obtain the dense descriptor of the image:
$$d_{ij} = F_{ij:}, \qquad \hat{d}_{ij} = \frac{d_{ij}}{\lVert d_{ij} \rVert_2}$$
where $i = 1, \dots, h$, $j = 1, \dots, w$, $d_{ij}$ is the dense description vector of the image, and $F$ is the 512-dimensional feature map.
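A minimal NumPy sketch of this descriptor extraction is given here, assuming the feature map is stored with shape (h, w, n); the function name `dense_descriptors` and the small `eps` guard against division by zero are illustrative additions.

```python
import numpy as np

def dense_descriptors(F, eps=1e-12):
    """Turn an (h, w, n) feature map into (h, w, n) unit-length descriptors."""
    norms = np.linalg.norm(F, axis=-1, keepdims=True)   # L2 norm of each channel vector
    return F / np.maximum(norms, eps)

F = np.random.default_rng(0).standard_normal((4, 5, 512))  # placeholder feature map
d_hat = dense_descriptors(F)
```

After normalization, every position holds a 512-dimensional unit vector, so descriptors from different images can be compared directly by distance.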
Step 4: a soft feature point detector is used to compute the channel score and the uniqueness score of the feature points, and the two scores are finally multiplied to obtain the soft feature detector score at pixel $(i, j)$.
Here $i = 1, \dots, h$ and $j = 1, \dots, w$; $\hat{d}_{ij}$ is the dense descriptor of the image; $u_{ij}$ is the uniqueness score of the dense descriptor; and $U$ is the set of all $u_{ij}$.
The soft feature detector score at pixel $(i, j)$ is:
$$s_{ij} = c_{ij} u_{ij}$$
where $c_{ij}$ is the channel score of the dense descriptor $\hat{d}_{ij}$, $u_{ij}$ is its uniqueness score, and $s_{ij}$ is its total score. Because $s_{ij}$ is a product, it is large only when both $c_{ij}$ and $u_{ij}$ are sufficiently large; $s_{ij}$ reflects how suitable the spatial position $(i, j)$ is as a feature point.
Step 5: the loss of the soft feature point detection component of step 4 is computed with the following loss function, and the feature extraction network with the layer-skipping structure of step 2 is then trained by back-propagating the loss.
The loss function is designed as follows. To train the network, the channel score and the uniqueness score produced by the feature point detector are incorporated into the loss function. An image pair $(I_1, I_2)$ input to the network has labeled pixel correspondences $c: A \in I_1, B \in I_2$, where $A$ and $B$ are pixels of images $I_1$ and $I_2$ respectively. We take a loss of the following form:
where $I_1, I_2$ is the RGB image pair input to the network, $C$ is the set of pixel correspondences between the images $I_1$ and $I_2$, and the two score terms are the total scores of the corresponding feature points in $I_1$ and $I_2$ respectively. $m(c)$ is the triplet margin ranking loss, which minimizes the distance between the descriptors of the corresponding pixels $A$ and $B$ while maximizing their distance to the most confusing other descriptors in either image.
Using the feature point scores as weights in the loss function guarantees the sparsity of the loss and effectively prevents overfitting of the network. The loss can be reduced in two ways: by decreasing $m(c)$, i.e. bringing the description vectors of matched feature points closer together and making the description vectors more discriminative; or by decreasing the score weights, i.e. shrinking the area of the picture that has a high feature point score.
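To make the role of the score weights concrete, the following NumPy sketch computes a score-weighted triplet margin ranking loss over a set of correspondences. It is an illustration under explicit assumptions, not the loss of the invention: the normalization of the weights, the squared-distance form of the margin term, the hardest-negative selection and the `margin` value follow common practice in D2-Net-style training and are not stated in the text above. The function name is hypothetical.

```python
import numpy as np

def weighted_triplet_loss(d1, d2, s1, s2, margin=1.0):
    """d1, d2: (N, n) unit descriptors of N corresponding pixels (N >= 2).
    s1, s2: (N,) soft detection scores of the same pixels."""
    dist = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=-1)  # (N, N)
    pos = np.diag(dist)                        # distance between true correspondences
    off = dist + np.diag(np.full(len(dist), np.inf))   # mask out the positives
    # Most confusing non-matching descriptor in either image.
    neg = np.minimum(off.min(axis=1), off.min(axis=0))
    m = np.maximum(0.0, margin + pos ** 2 - neg ** 2)   # triplet margin ranking term
    weights = s1 * s2 / np.sum(s1 * s2)        # assumed normalization of the score weights
    return float(np.sum(weights * m))

d = np.eye(4)                                   # four orthogonal unit descriptors
scores = np.ones(4)
loss_matched = weighted_triplet_loss(d, d, scores, scores)        # correct matches
loss_swapped = weighted_triplet_loss(d, d[::-1], scores, scores)  # wrong matches
```

With perfectly matching descriptors the margin term vanishes, while mismatched descriptors are penalized; a correspondence with a low detection score in either image contributes little to the sum.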
Step 6: the feature extraction network with the layer-skipping structure trained in step 5 is used to extract the feature maps of the verification set of step 1, and the hard feature detector described below selects, in the extracted 512-dimensional feature map, the pixels whose channel response is the largest and which are more unique than the other 75% of pixels as feature points, yielding the feature points and descriptors of the test image.
The hard feature detector design is as follows:
We analyze the factors that influence feature point matching performance, propose a feature point uniqueness index with the aim of improving matching performance, and, combining this with an analysis of feature point description vectors, design a feature point detector based on feature uniqueness. A uniqueness score, derived from the similarity of each position in the image to all other positions, measures the uniqueness of each position. We improve the matching performance of the network by selecting sufficiently unique feature points in the image.
We define the uniqueness of a feature through the similarity between a local area of an image and the other areas of the same picture: the more dissimilar a local region is to the other local regions, the higher its uniqueness. For a position $(i, j)$ of the dense description, the uniqueness $u_{ij}$ of the description vector $\hat{d}_{ij}$ is
$$u_{ij} = \min_{(i', j') \neq (i, j)} \lVert \hat{d}_{ij} - \hat{d}_{i'j'} \rVert_2$$
where $U$ is the set of all $u_{ij}$, and the uniqueness $u_{ij}$ is the minimum distance between the dense descriptor vector $\hat{d}_{ij}$ and the other dense descriptor vectors $\hat{d}_{i'j'}$. The larger this minimum distance, the more unique the descriptor vector is among all descriptor vectors. We sort $U$ in descending order and denote the rank of $u_{ij}$ by $p$; a sufficiently unique position satisfies $p < \alpha \lvert U \rvert$.
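The uniqueness computation and the rank test can be sketched in NumPy as follows. The sketch assumes Euclidean distance between unit-length descriptors and a zero-based rank $p$; it builds the full pairwise distance matrix, so it is only practical for small feature maps. The function names are hypothetical.

```python
import numpy as np

def uniqueness_scores(d_hat):
    """d_hat: (h, w, n) unit descriptors -> (h, w) minimum distance to any other position."""
    h, w, n = d_hat.shape
    flat = d_hat.reshape(-1, n)
    # For unit vectors, ||a - b||^2 = 2 - 2 a.b, so the Gram matrix gives all distances.
    sq = np.clip(2.0 - 2.0 * (flat @ flat.T), 0.0, None)
    np.fill_diagonal(sq, np.inf)            # exclude the position itself
    return np.sqrt(sq.min(axis=1)).reshape(h, w)

def is_unique_enough(u, alpha=0.25):
    """True where the rank p of u_ij (descending order, p starting at 0) satisfies p < alpha * |U|."""
    order = np.argsort(-u, axis=None)       # indices sorted by decreasing uniqueness
    rank = np.empty(u.size, dtype=int)
    rank[order] = np.arange(u.size)
    return (rank < alpha * u.size).reshape(u.shape)

# Two identical descriptors (a repeated texture) and two distinct ones.
d_hat = np.array([[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
                  [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]])
u = uniqueness_scores(d_hat)
mask = is_unique_enough(np.arange(8.0).reshape(2, 4), alpha=0.25)
```

The two positions that share a descriptor receive a uniqueness of zero, which is how repeated texture is suppressed, while the distinct positions receive a large score.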
In our invention, we consider that the feature point position $(i, j, k)$ in the dense descriptor $d$ is determined by the uniqueness $u_{ij}$ of the descriptor vector $\hat{d}_{ij}$ and by the channel extremum $\hat{d}_{ij}^{k}$ of $\hat{d}_{ij}$. The uniqueness $u_{ij}$ determines the spatial position of the feature point, and the channel extremum position $k$ indicates which feature response is the largest in the descriptor vector; we refine the spatial position $(i, j)$ of the feature point using the corresponding $k$-th channel of the feature map.
For the description vector $\hat{d}_{ij}$, the position $k$ of its channel extremum $\hat{d}_{ij}^{k}$ is $k = \arg\max_{t} \hat{d}_{ij}^{t}$, and the rank of its uniqueness $u_{ij}$ is $p$. In the experiments, $\alpha = 0.25$ gives better results, so that the obtained feature points are more unique than the other 75% of points. In addition, although the coordinates of a feature point are $(i, j, k)$, $k$ is fixed by the channel maximum at $(i, j)$, so at most one point in the description vector at each spatial position $(i, j)$ can be a feature point.
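The hard feature point detection condition of our invention is as follows:
$$(i, j)\ \text{is a feature point} \iff k = \arg\max_{t} \hat{d}_{ij}^{t}\ \text{and}\ p < \alpha \lvert U \rvert$$
where $(i, j)$ is the spatial position in the feature map, $i = 1, \dots, h$, $j = 1, \dots, w$; $\hat{d}_{ij}^{t}$ is the value of the descriptor $\hat{d}_{ij}$ at channel $t$; $k$ is the channel $t$ at which this value is the largest; $u_{ij}$ is the uniqueness of the dense descriptor at $(i, j)$ and $p$ is its rank when $U$ is sorted in descending order; $U$ is the set of uniqueness values of all dense descriptors; and $\alpha = 0.25$, so that a detected point is more unique than the other 75% of points.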
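A compact NumPy sketch of such a hard detector is given here. It assumes the uniqueness map has already been computed and replaces the explicit rank test with an equivalent quantile threshold that keeps roughly the top 25% of positions; the function name `hard_detect` and the toy inputs are hypothetical.

```python
import numpy as np

def hard_detect(d_hat, u, alpha=0.25):
    """Return an (N, 3) integer array of feature points (i, j, k).
    d_hat: (h, w, n) dense descriptors; u: (h, w) uniqueness scores."""
    k = np.argmax(d_hat, axis=-1)              # channel extremum at every position
    threshold = np.quantile(u, 1.0 - alpha)    # keep the most unique alpha fraction
    i, j = np.nonzero(u >= threshold)
    return np.stack([i, j, k[i, j]], axis=1)

# Toy inputs: channel 1 is the strongest response everywhere,
# and uniqueness increases in row-major order.
d_hat = np.zeros((2, 4, 3))
d_hat[..., 1] = 1.0
u = np.arange(8.0).reshape(2, 4)
points = hard_detect(d_hat, u)
```

Only the two most unique of the eight positions survive, each paired with the index of its strongest channel.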
The specific embodiment is as follows:
referring to fig. 1, the present invention performs the detection and local feature description of the image feature points according to the following steps:
Step 1: the train split of COCO2014, which contains 82783 images, is chosen for annotation. Each image in the data set is processed with a random homography transformation and color jitter, the processed image and the original image form an image pair, and the pixels of the image pair are related through the homography matrix. The labeled train split of COCO2014 is used as the training set. The standard HPatches data set is used as the test set.
Step 2: referring to Fig. 2, feature extraction is performed on the training set generated in step 1 using the feature extraction network with the layer-skipping structure. The main body of the feature extraction network is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed. To maintain the spatial positioning accuracy of the feature points and to fuse features of different levels, the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the VGG-16 network are fused. First, the outputs of the conv3_3 and conv4_3 layers are up-sampled by bilinear interpolation to the resolution of conv2_2, and tensor concatenation is then performed. The concatenation yields a feature map with 896 channels that carries three different levels of semantic information. A 1 × 1 convolution is applied to this feature map to fuse the semantic features of the different levels, giving a feature map F with 512 channels.
Step 3: the feature map F of step 2 contains detail information of different levels and has 1/2 the resolution of the original image; the invention performs descriptor extraction directly on it.
The dense description vector $d$ of the image is:
$$d_{ij} = F_{ij:}$$
where $i = 1, \dots, h$ and $j = 1, \dots, w$. When comparing images, correspondences between descriptor vectors can conveniently be established using the Euclidean distance. As in previous work, we apply L2 normalization to the descriptor vector $d_{ij}$ before comparison:
$$\hat{d}_{ij} = \frac{d_{ij}}{\lVert d_{ij} \rVert_2}$$
Step 4: referring to the feature point detection part of Fig. 1, we apply soft feature detection to the feature map. For each dense descriptor $\hat{d}_{ij}$, with uniqueness score $u_{ij}$ and channel contrast score $c_{ij}$, we compute the total score $s_{ij}$.
Step 5: we compute the loss on the soft feature point detection scores of step 4 using the loss function, where $I_1, I_2$ is the RGB image pair input to the network, $C$ is the set of pixel correspondences between the images $I_1$ and $I_2$, and the two score terms are the total scores of the corresponding feature points in $I_1$ and $I_2$ respectively. $m(c)$ is the triplet margin ranking loss, which minimizes the distance between the descriptors of the corresponding pixels while maximizing their distance to the most confusing other descriptors in either image.
Referring to Fig. 3, we train the network following steps 2, 3, 4 and 5 in sequence. To reduce the amount of computation during training, the extracted feature maps $F_1, F_2$ are average-pooled to 1/2 of their resolution before soft feature point detection is performed. To obtain a better training result and save training time, the network is fine-tuned starting from weights pre-trained on the ImageNet image classification task. During fine-tuning, we unfreeze the conv2_2, conv3_3 and conv4_3 layers. Each training iteration takes a batch of 8 image pairs, center-cropped to 224 × 224, together with their labels, and we use the Adam optimizer with an initial learning rate of $10^{-5}$. A total of 40 batches were trained.
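The two preprocessing operations mentioned in this paragraph can be sketched in NumPy as follows. The sketch assumes that the average pooling uses non-overlapping 2 × 2 windows, which is one plausible reading of the text; the function names are hypothetical.

```python
import numpy as np

def center_crop(img, size=224):
    """Cut a size x size window from the center of an (H, W, ...) image."""
    h, w = img.shape[:2]
    if h < size or w < size:
        raise ValueError("image is smaller than the crop size")
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def avg_pool_2x2(F):
    """Halve the resolution of an (h, w, n) feature map with 2x2 average pooling.
    A trailing odd row or column is dropped."""
    h, w, n = F.shape
    F = F[:h - h % 2, :w - w % 2]
    return F.reshape(h // 2, 2, w // 2, 2, n).mean(axis=(1, 3))

crop = center_crop(np.zeros((300, 400, 3)))
pooled = avg_pool_2x2(np.arange(16.0).reshape(4, 4, 1))
```

Pooling before soft detection reduces the number of positions by a factor of four, which matters because the uniqueness score compares every position with every other position.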
The loss can be reduced in two ways: by decreasing $m(c)$, i.e. bringing the description vectors of matched feature points closer together and making the description vectors more discriminative; or by decreasing the score weights, i.e. shrinking the area of the picture that has a high feature point score.
Step 6: the feature extraction network with the layer-skipping structure trained in step 5 is used to extract feature maps from the HPatches verification set of step 1. The pixels whose channel response is the largest and which are more unique than the other 75% of pixels in the feature map are selected as feature points, yielding the feature points and descriptors of the test image.
The test results are shown in FIG. 4. During testing, we post-process the keypoints with SIFT-like edge elimination (threshold set to 10) and sub-pixel refinement, and then bilinearly interpolate the descriptors at the refined positions.
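For reference, the SIFT-style edge test checks the ratio of principal curvatures of a response map through its 2 × 2 Hessian: a point is kept when $\operatorname{tr}^2 / \det < (r + 1)^2 / r$ with $r = 10$. The NumPy sketch below applies this test to one interior pixel of a single feature channel using finite differences; which map the test is applied to in the invention is not stated, so the input here is only a placeholder, and the function name is hypothetical.

```python
import numpy as np

def passes_edge_test(channel, i, j, r=10.0):
    """SIFT-style principal-curvature ratio test at interior pixel (i, j) of a 2-D map."""
    dii = channel[i + 1, j] - 2 * channel[i, j] + channel[i - 1, j]
    djj = channel[i, j + 1] - 2 * channel[i, j] + channel[i, j - 1]
    dij = 0.25 * (channel[i + 1, j + 1] - channel[i + 1, j - 1]
                  - channel[i - 1, j + 1] + channel[i - 1, j - 1])
    det = dii * djj - dij ** 2
    tr = dii + djj
    # Reject saddle points and flat directions (det <= 0) and elongated, edge-like responses.
    return bool(det > 0 and tr ** 2 / det < (r + 1) ** 2 / r)

yy, xx = np.mgrid[-2:3, -2:3]
blob = -(yy ** 2 + xx ** 2).astype(float)    # isotropic peak at the center
ridge = -(yy ** 2).astype(float)             # edge-like ridge along the x axis
keep_blob = passes_edge_test(blob, 2, 2)
keep_ridge = passes_edge_test(ridge, 2, 2)
```

The isotropic peak passes the test, while the ridge, which is poorly localized along one direction, is rejected.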
The HPatches data set was constructed by Balntas et al. for evaluating image feature descriptors. It comprises image sequences of 116 scenes in total. Of these, 59 scenes form the viewpoint group: they are sequences of the same scene shot from different viewpoints, and all pictures in this group show planar scenes. The other 57 scenes form the illumination group: they are sequences of the same scene at a fixed viewpoint under different illumination conditions. Each scene of the HPatches data set has 6 images, the first of which is the reference image. In the experiment, we discarded picture sequences with resolutions greater than 1600 × 1200 and tested on the remaining 52 illumination sequences and 56 viewpoint sequences. We first extract feature points and descriptors for each image sequence with the different methods, and then match feature points for each method by nearest neighbor search, accepting only mutual nearest neighbors. We use the mean matching accuracy (MMA) as the evaluation metric.
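Mutual nearest neighbor matching can be sketched in NumPy as follows, assuming Euclidean distance between descriptors; the function name and the toy descriptors are illustrative.

```python
import numpy as np

def mutual_nearest_neighbors(desc1, desc2):
    """Return (M, 2) index pairs (a, b) where desc1[a] and desc2[b] are each other's nearest neighbor."""
    dist = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)
    nn12 = dist.argmin(axis=1)               # nearest desc2 index for each desc1
    nn21 = dist.argmin(axis=0)               # nearest desc1 index for each desc2
    idx1 = np.arange(len(desc1))
    mutual = nn21[nn12] == idx1               # keep only pairs that agree in both directions
    return np.stack([idx1[mutual], nn12[mutual]], axis=1)

desc1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
desc2 = np.array([[0.0, 1.0], [1.0, 0.0]])
matches = mutual_nearest_neighbors(desc1, desc2)
```

The third descriptor of the first image is close to the second descriptor of the other image, but that descriptor prefers a different partner, so the pair is rejected.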
For each image pair, the present invention uses a nearest neighbor search to match the features extracted by each method, accepting only mutual nearest neighbors. A match is considered correct if its reprojection error, computed with the homography provided by the dataset, is below a given matching threshold. To show the advantages of our invention, we compare it with different methods: the traditional methods HAN + HN++ and RootSIFT, and the methods that jointly learn feature points and descriptors, namely SuperPoint, LF-Net, D2-Net, R2D2 and the more recent ASLFeat. We record the MMA values of the different methods at different thresholds and give the comparison in tables 1 and 2.
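The evaluation protocol described above can be sketched in a few functions. This is an illustrative sketch rather than the evaluation code used for the tables: all function names are hypothetical, descriptors are assumed to be L2-normalized so that the dot product serves as the similarity, and keypoints are assumed to be (x, y) pixel coordinates.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mutual_nearest_neighbors(desc_a, desc_b):
    """Return pairs (i, j) where desc_a[i] and desc_b[j] are each other's nearest neighbor."""
    def nearest(src, dst):
        return [max(range(len(dst)), key=lambda k: dot(s, dst[k])) for s in src]

    a_to_b = nearest(desc_a, desc_b)
    b_to_a = nearest(desc_b, desc_a)
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


def project(homography, point):
    """Map an (x, y) point through a 3x3 homography given as nested lists."""
    x, y = point
    h = homography
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def matching_accuracy(kp_a, kp_b, matches, homography, threshold):
    """Fraction of matches whose reprojection error is below the threshold (pixels)."""
    if not matches:
        return 0.0
    correct = 0
    for i, j in matches:
        px, py = project(homography, kp_a[i])
        if math.hypot(px - kp_b[j][0], py - kp_b[j][1]) < threshold:
            correct += 1
    return correct / len(matches)


def mean_matching_accuracy(per_pair_accuracies):
    """Average the per-pair accuracies over all evaluated image pairs."""
    return sum(per_pair_accuracies) / len(per_pair_accuracies)
```

Calling `matching_accuracy` once per threshold for every image pair and averaging with `mean_matching_accuracy` gives one MMA value per threshold, which is the quantity compared across methods.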
Refer to fig. 5 and 6. To better illustrate our method, we visualize our results on the HPatches data set for three groups of feature points. The results clearly show that our method effectively removes repeated-texture regions of a scene, such as leaves, grassland and paved ground. These regions contain a large number of feature points but easily cause mismatches because of their self-similarity and instability. For feature point matching, we select one sequence from the illumination group and one from the viewpoint group for comparison. In the illumination group, D2-Net and ASLFeat clearly obtain more matches, but in unstable texture regions such as sky and leaves these matches are invalid. In the viewpoint group, ASLFeat and D2-Net clearly concentrate more of their matches on such unstable regions, while our method yields more representative matches.
Table 1. Comparison of the verification results of the present invention on the HPatches verification set.
Table 2. Comparison of the feature point localization accuracy in the overall results of the present invention.
Claims (2)
1. A feature extraction network with a layer-skipping structure, characterized in that: the main body structure is the part of VGG16 from the conv1_1 layer to the conv4_3 layer, with the fully connected layers removed; the output feature maps of the conv3_3 and conv4_3 layers are up-sampled by bilinear interpolation to the resolution of the output feature map of conv2_2, and the output of the conv2_2 layer is then tensor-concatenated with the up-sampled feature maps, so that the output feature maps of the conv2_2, conv3_3 and conv4_3 layers of the main body structure are fused; the tensor concatenation yields a feature map with 896 channels, which is converted by a 1 × 1 convolution into a feature map F with 512 channels.
2. A method for detecting image feature points and generating descriptors by using the feature extraction network with a layer-skipping structure as claimed in claim 1, characterized by the steps of:
step 1: selecting an open-source visible-light data set for labeling, processing each image in the data set with a random homography transformation and color jittering, forming an image pair from the processed image and the original image, and relating the pixels of the image pair through the homography matrix; taking the labeled data set as the training set, and selecting a data set that already has labels as the verification set;
step 2: using the feature extraction network ℱ with the layer-skipping structure to extract features from the images in the training set, obtaining a 512-dimensional feature map F = ℱ(I);
step 3: performing descriptor extraction on the 512-dimensional feature map, regarding each channel vector as the dense description of its position, and then performing L2 normalization on the channel vectors to obtain the dense descriptors of the image, $\hat{d}_{ij} = d_{ij} / \lVert d_{ij} \rVert_2$,
wherein i = 1, …, h, j = 1, …, w, $d_{ij}$ is the dense description vector of the image at position (i, j), and F is the 512-dimensional feature map;
step 4: computing the channel score and the uniqueness score of the feature points with a soft feature point detector, and finally multiplying the channel score $c_{ij}$ by the uniqueness score $u_{ij}$ to obtain the soft feature detector score at pixel (i, j), $s_{ij} = c_{ij} u_{ij}$;
wherein:
i = 1, …, h, j = 1, …, w, the dense descriptors of the image are those obtained in step 3, $u_{ij}$ is the uniqueness score of a dense descriptor of the image, and U is the set of all $u_{ij}$;
step 5: computing the loss of the soft feature point detection component with a loss function, and then back-propagating the loss to train the feature extraction network with the layer-skipping structure of step 2:
wherein $I_1$, $I_2$ are the RGB image pair input to the network, C is the set of correspondences between the image pair $I_1$, $I_2$, the two score terms are the total feature point scores in the images $I_1$ and $I_2$ respectively, and m(c) is the triplet margin ranking loss;
step 6: using the feature extraction network with the layer-skipping structure trained in step 5 to extract feature maps from the verification set, and using a hard feature detector to select, in each extracted 512-dimensional feature map, the pixels that have the largest channel response and are more unique than 75% of the other pixels as feature points, obtaining the feature points and descriptors of the test image.
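The channel count in claim 1 and the normalization in step 3 of claim 2 can be checked with a short sketch. The layer widths and down-sampling factors below are those of the standard VGG16 architecture (2 × 2 max pooling after each convolution block), which is an assumption about the backbone configuration; `VGG16_TAPS`, `fused_shapes` and `l2_normalize` are hypothetical names.

```python
import math

# layer name -> (output channels, spatial down-sampling factor) in standard VGG16
VGG16_TAPS = {
    "conv2_2": (128, 2),
    "conv3_3": (256, 4),
    "conv4_3": (512, 8),
}


def fused_shapes(height, width, out_channels=512):
    """Return the (channels, h, w) shapes before and after the 1x1 convolution."""
    # conv3_3 and conv4_3 are up-sampled to the conv2_2 resolution before concatenation.
    _, factor = VGG16_TAPS["conv2_2"]
    h, w = height // factor, width // factor
    concat_channels = sum(channels for channels, _ in VGG16_TAPS.values())
    return (concat_channels, h, w), (out_channels, h, w)


def l2_normalize(vector):
    """Scale a channel vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]
```

For a 224 × 224 crop, `fused_shapes(224, 224)` gives a 896-channel map at 112 × 112 before the 1 × 1 convolution and a 512-channel map at the same resolution after it, matching the 128 + 256 + 512 = 896 channels stated in claim 1.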
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110281763.8A CN113052311B (en) | 2021-03-16 | 2021-03-16 | Feature extraction network with layer jump structure and method for generating features and descriptors |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113052311A (en) | 2021-06-29 |
CN113052311B (en) | 2024-01-19 |
Family
ID=76512664
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202110281763.8A | Active | CN113052311B (en) | 2021-03-16 | 2021-03-16 | Feature extraction network with layer jump structure and method for generating features and descriptors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113052311B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114332509A (en) * | 2021-12-29 | 2022-04-12 | 阿波罗智能技术(北京)有限公司 | Image processing method, model training method, electronic device and automatic driving vehicle |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019148898A1 (en) * | 2018-02-01 | 2019-08-08 | 北京大学深圳研究生院 | Adversarial cross-media retrieving method based on restricted text space |
CN110188817A (en) * | 2019-05-28 | 2019-08-30 | 厦门大学 | A kind of real-time high-performance street view image semantic segmentation method based on deep learning |
CN110781924A (en) * | 2019-09-29 | 2020-02-11 | 哈尔滨工程大学 | Side-scan sonar image feature extraction method based on full convolution neural network |
CN110827238A (en) * | 2019-09-29 | 2020-02-21 | 哈尔滨工程大学 | Improved side-scan sonar image feature extraction method of full convolution neural network |
CN110929748A (en) * | 2019-10-12 | 2020-03-27 | 杭州电子科技大学 | Motion blur image feature matching method based on deep learning |
WO2020215236A1 (en) * | 2019-04-24 | 2020-10-29 | 哈尔滨工业大学(深圳) | Image semantic segmentation method and system |
CN112365501A (en) * | 2021-01-13 | 2021-02-12 | 南京理工大学 | Weldment contour detection algorithm based on convolutional neural network |
Non-Patent Citations (1)
Title |
---|
廖明哲;吴谨;朱磊;: "Remote sensing image matching based on ResNet and RF-Net", 液晶与显示 (Chinese Journal of Liquid Crystals and Displays), no. 09 * |
Also Published As
Publication number | Publication date |
---|---|
CN113052311B (en) | 2024-01-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||