CN111612075A - Interest point and descriptor extraction method based on joint feature recombination and feature mixing - Google Patents
Interest point and descriptor extraction method based on joint feature recombination and feature mixing
- Publication number: CN111612075A
- Application number: CN202010444152.6A
- Authority: CN (China)
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06F18/214 — Pattern recognition; generating training patterns (bootstrap methods, e.g. bagging or boosting)
- G06N3/045 — Neural networks; combinations of networks
- G06V10/40 — Extraction of image or video features
Abstract
The invention belongs to the field of computer vision, and particularly relates to an interest point and descriptor extraction method, system and device based on joint feature recombination and feature mixing, aiming at solving the problem of the low detection and extraction accuracy of existing interest point and descriptor extraction methods. The method comprises the following steps: acquiring an image to be extracted as an input image, and extracting multi-scale feature maps of the image through a feature extraction network; performing pixel recombination on each feature map, obtaining a score map through convolution and nonlinear mapping, and obtaining interest points through non-maximum suppression; and acquiring the feature vectors of the pixel points of the input image at the corresponding positions of the multi-scale feature maps, connecting them, and filtering and compressing the connected feature vectors through connection layers to obtain the descriptor corresponding to each pixel point. The invention improves the accuracy of interest point and descriptor detection and extraction.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a method, a system and a device for extracting interest points and descriptors based on joint feature recombination and feature mixing.
Background
Using image interest points and their local feature descriptors to find correct correspondences between images has long been the basis of many vision-based applications, such as visual localization and image retrieval. However, with the rapid development of the industry, these applications need to deal with increasingly complex and difficult scenarios. Since image interest point detection and description are key components of these advanced algorithms, it is important to further improve their accuracy.
Over the past two decades, a number of excellent algorithms have been proposed to solve the above problems. Both the traditional statistics- and filtering-based methods and the deep-learning-based methods have made significant breakthroughs. In particular, deep-learning-based algorithms such as SuperPoint, D2-Net and R2D2 have greatly improved the accuracy of interest point detection and local feature description. However, previous approaches have focused primarily on a better deep-learning paradigm for this problem and have somewhat neglected network architecture design. The relatively good network architectures proposed for other visual applications, such as classification, object detection and segmentation, are not suitable for image interest point detection or local feature description. Therefore, the invention provides a method, a system and a device for extracting interest points and descriptors based on joint feature recombination and feature mixing.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, to solve the problem of low detection and extraction accuracy of the existing interest point and descriptor extraction method, a first aspect of the present invention provides an interest point and descriptor extraction method based on joint feature recombination and feature mixing, including:
step S100, acquiring an image to be extracted as an input image, and extracting a multi-scale feature map of the image through a feature extraction network; the feature extraction network is constructed based on a residual error network;
step S200, performing pixel recombination on each feature map, and obtaining a score map (Score map) through convolution and nonlinear mapping; obtaining interest points through non-maximum suppression based on the score map;
step S300, acquiring and connecting feature vectors of the pixel points in the input image at corresponding positions of the multi-scale feature map; filtering and compressing the connected feature vectors through N connecting layers to obtain descriptors corresponding to all pixel points; wherein N is a positive integer;
the construction method of the loss function of the extraction network corresponding to the descriptor comprises the following steps:
taking an acquired sample image as a first image; performing a composite transformation on the first image through multiple preset image transformation methods, and taking the new image synthesized after transformation as a second image;
uniformly sampling M pixel points in the first image to serve as first pixel points, and sampling pixel points corresponding to the first pixel points in the second image to serve as second pixel points; extracting descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of the first pixel point and the descriptor of the second pixel point as a first distance; calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point in the second image, other than the second pixel point, whose descriptor distance to the first pixel point is minimal and whose spatial distance to the second pixel point is greater than a set threshold;
and combining the first distance and the second distance to construct a descriptor loss function.
In some preferred embodiments, the residual network does not include a max-pooling layer, and the widths and heights of the extracted multi-scale feature maps and the input image satisfy:

h = h_m × 2^m

w = w_m × 2^m

where h denotes the height of the input image, w denotes the width of the input image, h_m denotes the height of the feature map of the m-th convolution, w_m denotes the width of the feature map of the m-th convolution, and m denotes the number of convolutions.
In some preferred embodiments, the corresponding position of each pixel point of the input image in the multi-scale feature maps is calculated as:

p^(m) = p/2^m = [x/2^m, y/2^m]^T

where p^(m) denotes the position of the pixel point in the feature map of the m-th convolution, T denotes transposition, and x and y denote the coordinates of the pixel point in the input image.
In some preferred embodiments, the preset multiple image transformation methods include translation transformation, scale transformation, in-plane rotation transformation, and symmetric perspective transformation within a preset range.
In some preferred embodiments, the descriptor loss function is:

L_triplet(D, D′, V) = (1 / Σ_{j=1}^{n} v_j) · Σ_{i=1}^{n} v_i · l(d_i, d′_i)

where L_triplet(D, D′, V) denotes the total loss of the descriptors corresponding to the pixel points of the first and second images, D denotes the set of descriptors corresponding to the pixel points of the first image, D′ denotes the set of descriptors corresponding to the pixel points of the second image, V denotes the set of masks corresponding to the pixel points of the second image under the transformation, l(d_i, d′_i) denotes the loss value corresponding to the descriptor d_i of a pixel point in the first image and is computed from the first distance and the second distance, v_i and v_j denote the i-th and j-th masks in V, and n denotes the number of pixel points in the second image.
In some preferred embodiments, the loss function of the extraction network corresponding to the interest points is a weighted cross-entropy loss function, whose loss value is calculated as:

L_bce(S, Y) = −(1/(h·w)) · Σ_{u,v} [λ · Y_{u,v} · log S_{u,v} + (1 − Y_{u,v}) · log(1 − S_{u,v})]

where S denotes the score map, Y denotes the labeled interest points corresponding to the score map, λ denotes a preset ratio, u and v denote the position of a pixel point in the sample image, Y_{u,v} denotes the interest-point label at that position, S_{u,v} denotes the score of the pixel point in the score map, and L_bce(S, Y) denotes the weighted cross-entropy loss value.
The second aspect of the invention provides an interest point and descriptor extraction system based on joint feature recombination and feature mixing, which comprises a feature extraction module, an interest point acquisition module and a descriptor extraction module;
the characteristic extraction module is configured to acquire an image to be extracted as an input image and extract a multi-scale characteristic diagram of the image through a characteristic extraction network; the feature extraction network is constructed based on a residual error network;
the interest point acquisition module is configured to perform pixel recombination on each feature map, and obtain a score map (Score map) through convolution and nonlinear mapping; and obtain interest points through non-maximum suppression based on the score map;
the descriptor extraction module is configured to acquire and connect feature vectors of the pixel points in the input image at corresponding positions of the multi-scale feature map; filtering and compressing the connected feature vectors through N connecting layers to obtain descriptors corresponding to all pixel points; wherein N is a positive integer;
the construction method of the loss function of the extraction network corresponding to the descriptor comprises the following steps:
taking an acquired sample image as a first image; performing a composite transformation on the first image through multiple preset image transformation methods, and taking the new image synthesized after transformation as a second image;
uniformly sampling M pixel points in the first image to serve as first pixel points, and sampling pixel points corresponding to the first pixel points in the second image to serve as second pixel points; extracting descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of the first pixel point and the descriptor of the second pixel point as a first distance; calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point in the second image, other than the second pixel point, whose descriptor distance to the first pixel point is minimal and whose spatial distance to the second pixel point is greater than a set threshold;
and combining the first distance and the second distance to construct a descriptor loss function.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being loaded and executed by a processor to implement the above-mentioned interest point and descriptor extraction method based on joint feature reorganization and feature mixing.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described point of interest, descriptor extraction method based on joint feature reorganization and feature mixing.
The invention has the beneficial effects that:
the invention improves the precision of detecting and extracting the interest points and the descriptors. The method utilizes the residual error network to extract the features, realizes the interest point detection through the feature recombination after the features are extracted, and improves the robustness of the interest point detection. And acquiring the characteristic vectors of the pixel points in the input image at the corresponding positions of the multi-scale characteristic graph for connection, and filtering and compressing the connected characteristic vectors through a connection layer to finish the extraction of the characteristic descriptors.
In addition, supervision information for fast and effective training is established based on random uniform sampling during the training process, and the loss functions for interest point detection and feature description are constructed. Training based on the constructed loss functions improves the accuracy of interest point detection and descriptor extraction.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of a method for extracting interest points and descriptors based on joint feature reorganization and feature mixing according to an embodiment of the present invention;
FIG. 2 is a block diagram of a point of interest, descriptor extraction system based on joint feature reorganization and feature mixing according to an embodiment of the present invention;
FIG. 3 is a detailed framework diagram of a method for extracting interest points and descriptors based on joint feature reorganization and feature mixing according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the result of feature matching based on interest points and descriptors according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The method for extracting interest points and descriptors based on joint feature recombination and feature mixing, as shown in fig. 1, comprises the following steps:
step S100, acquiring an image to be extracted as an input image, and extracting a multi-scale feature map of the image through a feature extraction network; the feature extraction network is constructed based on a residual error network;
step S200, performing pixel recombination on each feature map, and obtaining a score map (Score map) through convolution and nonlinear mapping; obtaining interest points through non-maximum suppression based on the score map;
step S300, acquiring and connecting feature vectors of the pixel points in the input image at corresponding positions of the multi-scale feature map; filtering and compressing the connected feature vectors through N connecting layers to obtain descriptors corresponding to all pixel points; wherein N is a positive integer;
the construction method of the loss function of the extraction network corresponding to the descriptor comprises the following steps:
taking an acquired sample image as a first image; performing a composite transformation on the first image through multiple preset image transformation methods, and taking the new image synthesized after transformation as a second image;
uniformly sampling M pixel points in the first image to serve as first pixel points, and sampling pixel points corresponding to the first pixel points in the second image to serve as second pixel points; extracting descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of the first pixel point and the descriptor of the second pixel point as a first distance; calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point in the second image, other than the second pixel point, whose descriptor distance to the first pixel point is minimal and whose spatial distance to the second pixel point is greater than a set threshold;
and combining the first distance and the second distance to construct a descriptor loss function.
In order to more clearly explain the method for extracting interest points and descriptors based on joint feature reorganization and feature mixing, the following describes in detail the steps of an embodiment of the method of the present invention with reference to the drawings.
Step S100, acquiring an image to be extracted as an input image, and extracting a multi-scale feature map of the image through a feature extraction network; the feature extraction network is constructed based on a residual error network.
In the invention, the interest point and descriptor extraction method based on joint feature recombination and feature mixing detects and extracts the interest points and descriptors through an interest point detector and a descriptor extraction network. Before interest point detection and descriptor extraction are carried out, feature extraction is first performed on the acquired image.
In the present embodiment, the multi-scale features of the input image are extracted through a feature extraction network constructed on a residual network. The feature extraction process is a feed-forward computation of the backbone residual network, generating feature maps at multiple scales with a scale stride of 2. Convolutional layers that produce output maps of the same size belong to the same network stage, and only the output of the last unit of each stage is used by the other modules.
In the present invention, the feature extraction network uses the residual network ResNet, slightly modified from the original: the max-pooling layer is removed, because it would make the scale of the top feature map unsuitable for feature reorganization. The network is therefore divided into four stages (i.e., four convolutions). As shown in FIG. 3, Conv denotes a convolutional layer, and 7 × 7 and 3 × 3 denote the sizes of the convolution kernels. Taking the width and height of the input image as w and h, the feature map extracted at each stage can be expressed as C_m ∈ R^(d_m × h_m × w_m), where C_m denotes the feature map obtained by the m-th convolution, d_m its depth, h_m its height, and w_m its width, with m ∈ {1, 2, 3, 4} and d_m ∈ {64, 128, 256, 512}. C_m and the size of the input image I satisfy the constraints of formulas (1) and (2):
h = h_m × 2^m (1)

w = w_m × 2^m (2)
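The scale relation of formulas (1) and (2) can be sketched in plain Python as a quick sanity check; `stage_sizes` is an illustrative helper (assuming four stride-2 stages), not part of the patented network:

```python
# Illustrative sketch: each of the four residual stages halves the spatial
# resolution, so the stage-m feature map satisfies h = h_m * 2^m, w = w_m * 2^m.
def stage_sizes(h, w, num_stages=4):
    """Return (h_m, w_m) for m = 1..num_stages, assuming stride-2 stages."""
    return [(h // 2**m, w // 2**m) for m in range(1, num_stages + 1)]

sizes = stage_sizes(240, 320)
# Constraints (1) and (2) hold at every stage.
for m, (hm, wm) in enumerate(sizes, start=1):
    assert hm * 2**m == 240 and wm * 2**m == 320
```

With the 240 × 320 training resolution used later in this document, the four stages yield 120 × 160, 60 × 80, 30 × 40 and 15 × 20 maps.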
step S200, performing pixel recombination on each feature map, and obtaining a score map (Score map) through convolution and nonlinear mapping; and obtaining interest points through non-maximum suppression based on the score map.
In this embodiment, the multi-scale feature maps extracted in step S100 are reorganized by the Feature Shuffle Module in fig. 3, specifically as follows:

Each feature map C_m ∈ R^(d_m × h_m × w_m) is converted by the pixel-recombination operation into a feature map C̃_m of spatial size h × w without adding extra memory resources; C̃_m denotes the feature map after pixel recombination, and Shuffled in FIG. 3 denotes this reorganization processing.
All the converted feature maps are then processed as a whole: they are input into a single 3 × 3 convolutional layer followed by a Sigmoid activation function to generate the score map, expressed as S ∈ R^(h × w).
The whole process can be described in an abstract way as shown in formula (3):
S = FSM(C_1, C_2, C_3, C_4) (3)

where FSM (Feature Shuffle Module) denotes the feature reorganization operation.
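The pixel-recombination step inside the FSM can be sketched as a channel-to-space rearrangement (as in sub-pixel convolution). This is a minimal pure-Python illustration on nested lists; `pixel_shuffle` is a hypothetical name, not the patent's implementation:

```python
# Hedged sketch of pixel recombination: a (c*r*r, h, w) tensor, stored as
# nested lists, is rearranged into (c, h*r, w*r) without creating new values.
def pixel_shuffle(x, r):
    """x: nested list of shape (c*r*r, h, w); returns shape (c, h*r, w*r)."""
    cr2, h, w = len(x), len(x[0]), len(x[0][0])
    c = cr2 // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c)]
    for ch in range(cr2):
        oc = ch // (r * r)                 # output channel
        dy, dx = divmod(ch % (r * r), r)   # sub-pixel offset inside an r x r cell
        for i in range(h):
            for j in range(w):
                out[oc][i * r + dy][j * r + dx] = x[ch][i][j]
    return out

# A 4-channel 1x1 map becomes one 2x2 map (r = 2).
shuffled = pixel_shuffle([[[1.0]], [[2.0]], [[3.0]], [[4.0]]], r=2)
# shuffled == [[[1.0, 2.0], [3.0, 4.0]]]
```

Applied with r = 2^m, each stage-m map is brought back to the full h × w resolution before the 3 × 3 convolution that produces the score map.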
In the inference process, non-maximum suppression is first applied to the predicted score map S. Then, when the response value of a certain pixel in S exceeds a fixed detection threshold α, the current point is marked as a point of interest.
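The inference step above — non-maximum suppression followed by thresholding at α — can be sketched as a simple grid NMS; `detect_points`, the window radius, and the list-based score map are illustrative assumptions:

```python
# Sketch (under assumptions) of inference on the score map S: keep a pixel
# as an interest point only if it exceeds alpha and is the local maximum
# within its suppression window.
def detect_points(S, alpha=0.9, radius=1):
    """S: 2-D list of scores in [0, 1]; returns [(row, col), ...]."""
    h, w = len(S), len(S[0])
    points = []
    for i in range(h):
        for j in range(w):
            s = S[i][j]
            if s < alpha:
                continue
            window = [S[y][x]
                      for y in range(max(0, i - radius), min(h, i + radius + 1))
                      for x in range(max(0, j - radius), min(w, j + radius + 1))]
            if s >= max(window):
                points.append((i, j))
    return points

S = [[0.1, 0.2, 0.1],
     [0.2, 0.95, 0.2],
     [0.1, 0.2, 0.92]]
# (1, 1) survives; (2, 2) is suppressed because 0.95 lies in its window.
```

In the evaluation setting described later in this document, the suppression radius is 4 pixels and α = 0.9.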
Step S300, acquiring and connecting feature vectors of the pixel points in the input image at corresponding positions of the multi-scale feature map; filtering and compressing the connected feature vectors through N connecting layers to obtain descriptors corresponding to all pixel points; wherein N is a positive integer;
In this embodiment, in order to fully utilize the multi-layer semantics, the present invention proposes a new feature fusion module, i.e. the Feature Blend Module in fig. 3, which extracts the most discriminative information from the multi-layer feature vectors to construct the descriptor. The specific steps are as follows:
Given a pixel point p = [x, y]^T on the input image, where x and y denote coordinates and T denotes transposition, its corresponding position in each feature map is calculated as shown in formula (4):

p^(m) = p/2^m = [x/2^m, y/2^m]^T (4)

where p^(m) denotes the position of the pixel point in the feature map of the m-th convolution.
The feature vector C_m(p^(m)) corresponding to each pixel point can be obtained by bilinear interpolation on the multi-scale feature maps. After all the feature vectors belonging to the same point are generated, the Feature Blend Module (FBM) of the present invention concatenates them into a single feature vector, as expressed in formula (5):

C_cat(p) = Cat(C_1(p^(1)), C_2(p^(2)), C_3(p^(3)), C_4(p^(4))) (5)

where Cat denotes concatenation, C_cat denotes the concatenated feature vector, and P = {p_1, p_2, ..., p_n} denotes the set of pixel points.
C_cat(p) is then filtered using two fully connected layers to remove irrelevant information, and the remaining valid semantics are compressed to generate the final descriptor d_p ∈ R^dim, where dim = 128. The whole process can be summarized as formula (6):
D = FBM(C_1, C_2, C_3, C_4, P) (6)

where D = {d_1, d_2, ..., d_n} denotes the set of extracted descriptors, and d_1, d_2, ..., d_n denote the descriptors corresponding to each pixel point.
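The FBM lookup of formulas (4) and (5) can be sketched as follows, assuming bilinear interpolation on nested-list feature maps; the two learned fully connected layers that produce the final 128-dimensional descriptor are omitted, and the function names are illustrative:

```python
# Hedged sketch of the Feature Blend Module: for a pixel p = [x, y]^T the
# position in the m-th map is p / 2^m, the feature vector there is read by
# bilinear interpolation, and the vectors from all maps are concatenated.
def bilinear(fmap, x, y):
    """fmap: [channel][row][col]; sample every channel at float (x, y)."""
    h, w = len(fmap[0]), len(fmap[0][0])
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    return [ch[y0][x0] * (1 - fx) * (1 - fy) + ch[y0][x1] * fx * (1 - fy)
            + ch[y1][x0] * (1 - fx) * fy + ch[y1][x1] * fx * fy
            for ch in fmap]

def blend_descriptor(feature_maps, x, y):
    """Concatenate the interpolated vectors C_m(p / 2^m) over all stages m."""
    vec = []
    for m, fmap in enumerate(feature_maps, start=1):
        vec.extend(bilinear(fmap, x / 2**m, y / 2**m))
    return vec  # the patent then filters/compresses this with two FC layers
```

The concatenated vector gathers semantics from all four scales at once, which is the "feature mixing" this document refers to.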
Based on the obtained interest points and descriptors, feature matching can be performed, and the matching result is shown in fig. 4.
In addition, in order to further improve the precision of interest point detection and descriptor extraction, the invention establishes the supervision information of quick and effective training based on random uniform sampling and defines the loss function. The method comprises the following specific steps:
Specifically, for a training sample image I, a composite transformation T ∈ R^(3×3) is first sampled from a series of simple transformations, including translation, scale, in-plane rotation, and symmetric perspective deformation within a preset range. Applying this transformation T to the sample image I synthesizes a new image I′, constructing the image pair (I, I′) with known transformation T.
Through the transformation T between the image pair, an infinite number of correspondences can be found. However, this is resource-consuming, and training the model does not require so many correspondences. Instead, a random uniform sampling strategy is adopted to obtain a small number of unbiased correspondences. Specifically, n points P = {p_1, p_2, ..., p_n} are uniformly sampled on the original image I, and the corresponding points P′ = {p′_1, p′_2, ..., p′_n} are then generated on the synthesized image I′ based on the transformation T. The whole process is shown in formulas (7) and (8):
P, T = RandomSample(·) (7)

P′, V = Transform(P) (8)
where RandomSample denotes random uniform sampling, Transform denotes the image-space transformation, and V ∈ B^n is the validity mask of the points P′ = {p′_1, p′_2, ..., p′_n}, B denoting the binary mask set: since not all transformed points lie within the image boundaries, points falling outside them are invalid.
The sampled points P and their corresponding points P′ can be used to construct the descriptor loss function. Note that the number of samples n can be chosen small enough to speed up the training process without affecting the accuracy of the model. The present invention sets n to 400 for all experiments.
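The sampling and transformation of formulas (7) and (8) can be sketched as below, assuming T is a 3 × 3 homography acting on pixel coordinates; `random_sample` and `transform` are hypothetical helper names:

```python
# Sketch (with assumed helper names): n points are sampled uniformly at
# random on image I, mapped through a known 3x3 homography T into I', and
# masked out when they leave the image boundaries.
import random

def random_sample(h, w, n):
    return [(random.uniform(0, w - 1), random.uniform(0, h - 1))
            for _ in range(n)]

def transform(points, T, h, w):
    """Apply homography T (3x3 nested list); return (points', validity mask)."""
    out, mask = [], []
    for x, y in points:
        xs = T[0][0] * x + T[0][1] * y + T[0][2]
        ys = T[1][0] * x + T[1][1] * y + T[1][2]
        zs = T[2][0] * x + T[2][1] * y + T[2][2]
        xp, yp = xs / zs, ys / zs
        out.append((xp, yp))
        mask.append(1 if 0 <= xp <= w - 1 and 0 <= yp <= h - 1 else 0)
    return out, mask

# Pure 10-pixel translation: a point shifted off the right edge is masked.
T = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pts, mask = transform([(5.0, 5.0), (315.0, 5.0)], T, h=240, w=320)
# mask == [1, 0]
```

The mask corresponds to V in formula (8): invalid (out-of-boundary) correspondences simply drop out of the loss.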
Given the set of pixel points P = {p_1, p_2, ..., p_n} in image I and the corresponding set P′ = {p′_1, p′_2, ..., p′_n} in I′, the descriptors D and D′ of the pixel points are obtained respectively. Then, for each descriptor d_i, the distance to the descriptor of its corresponding pixel point, referred to in the present invention as the positive sample distance, is calculated as shown in formula (9):

d_pos(i) = ‖d_i − d′_i‖ (9)

where d_i denotes the descriptor of a pixel point in image I and d′_i denotes the descriptor of the corresponding pixel point in I′.
The negative sample point p′_k* is selected by screening the pixel points p′_k of I′ other than p′_i: among the points whose descriptor distance to d_i is minimal, whose spatial distance to p′_i is greater than the threshold, and which lie within the image boundary, the point satisfying these requirements is denoted p′_k*. The distance between d_i and the descriptor of this selected point, referred to in the present invention as the negative sample distance, is shown in formula (10):

d_neg(i) = ‖d_i − d′_k*‖ (10)

In the present invention, the threshold θ is set to 16, ensuring that the spatial distance between p′_k* and p′_i exceeds a certain value. In addition, the selected negative sample must be located within the image boundaries, otherwise it is invalid.
Given d_pos(i) and d_neg(i), the present invention defines the triplet-distance descriptor loss function as shown in formula (11):

l(d_i, d′_i) = max(0, d_pos(i) − d_neg(i) + margin) (11)

where l(d_i, d′_i) denotes the loss value corresponding to the descriptor d_i of a pixel point in image I, and margin is a preset constant.
The total loss constructed from descriptors D and D′ is shown in formula (12):

L_triplet(D, D′, V) = (1 / Σ_{j=1}^{n} v_j) · Σ_{i=1}^{n} v_i · l(d_i, d′_i) (12)

where L_triplet(D, D′, V) denotes the total loss of the descriptors corresponding to the pixel points of I and I′, D denotes the set of descriptors of the pixel points in I, D′ denotes the set of descriptors of the pixel points in I′, V denotes the set of masks of the pixel points in I′ under the transformation, l(d_i, d′_i) denotes the loss value corresponding to the descriptor d_i of a pixel point in I, v_i and v_j denote the i-th and j-th masks in V, and n denotes the number of pixel points in I′.
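A minimal sketch of the descriptor loss described above, assuming an L2 descriptor distance and a hinge (margin) form of the triplet loss — the margin value and helper names (`l2`, `triplet_loss`) are illustrative assumptions, since the patent's formula images are not reproduced in this text:

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(D, Dp, V, margin=1.0, theta=16.0, coords=None):
    """D, Dp: descriptors of P and P'; V: validity mask; coords: positions in I'."""
    total, count = 0.0, 0
    for i in range(len(D)):
        if not V[i]:
            continue
        d_pos = l2(D[i], Dp[i])  # first (positive sample) distance
        # Mine the negative: closest descriptor among valid points whose
        # spatial distance to the true correspondence exceeds theta.
        negs = [l2(D[i], Dp[k]) for k in range(len(Dp))
                if k != i and V[k] and l2(coords[k], coords[i]) > theta]
        if not negs:
            continue
        total += max(0.0, d_pos - min(negs) + margin)  # hinge over the margin
        count += 1
    return total / count if count else 0.0
```

The spatial exclusion radius `theta` plays the role of the threshold θ = 16 above, and the mask `V` discards out-of-boundary correspondences as in formula (12).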
Combining the predicted score map S of the sample image with the interest point label Y of the sample image, the weighted cross-entropy loss of feature point detection is shown in formula (13):

Lbce(S, Y) = -(1 / (h·w)) · Σ_{u,v} [λ · Y_{u,v} · log(S_{u,v}) + (1 - Y_{u,v}) · log(1 - S_{u,v})] (13)

wherein S represents the score map, Y represents the labeled interest points corresponding to the score map, u and v represent the position of a pixel point in the sample image, Y_{u,v} represents the interest point label at position (u, v), S_{u,v} represents the score at position (u, v), and Lbce(S, Y) represents the weighted cross-entropy loss value. λ represents a preset ratio that balances the positive and negative samples, since the number of positive samples is much smaller than the number of negative samples; in the present invention λ is preferably set to 200.
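The weighted cross-entropy of formula (13) can be sketched as follows; the function name, the averaging over all h·w pixels, and the epsilon guard are illustrative assumptions.

```python
import numpy as np

def weighted_bce_loss(S, Y, lam=200.0, eps=1e-12):
    """Weighted cross-entropy of formula (13).

    S   -- predicted score map in (0, 1), shape (h, w)
    Y   -- binary interest point labels, shape (h, w)
    lam -- ratio that up-weights the rare positive samples
    eps -- small guard against log(0) (illustrative)
    """
    pos = lam * Y * np.log(S + eps)           # positive (interest point) term
    neg = (1.0 - Y) * np.log(1.0 - S + eps)   # negative (background) term
    return float(-np.mean(pos + neg))
```

With λ = 200, a missed interest point is penalised far more heavily than a false response on background.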
Based on the loss function of the key point detector and the loss function of the descriptor extraction network, the total loss function is obtained, as shown in formula (14):
Ltotal(S, S', D, D'; Y, Y', V) = Lbce(S, Y) + Lbce(S', Y') + Ltriplet(D, D', V) (14)
wherein S and S' are the score maps of images I and I' respectively, and Y and Y' are the ground-truth interest point labels of images I and I'. It is noted that the random transformation from (I, Y) to (I', Y') and the random correspondence sampling between the image pair I, I' are processed in parallel with the training procedure.
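The correspondence sampling between the image pair can be sketched as follows; representing the composed random transformation as a 3×3 homography, and the two function names, are assumptions for illustration.

```python
import numpy as np

def warp_points(points, H):
    """Apply a 3x3 homography H to (n, 2) pixel points [x, y]."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    warped = pts_h @ H.T
    return warped[:, :2] / warped[:, 2:3]                   # perspective divide

def valid_mask(points, h, w):
    """Mask v_i = 1 for warped points that remain inside the h x w image."""
    x, y = points[:, 0], points[:, 1]
    return ((x >= 0) & (x < w) & (y >= 0) & (y < h)).astype(np.float64)
```

Points warped outside the boundary receive mask 0 and are excluded from the triplet loss.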
During the training of the interest point detector and the descriptor extraction network, the invention uses the Adam optimizer with parameters β1 = 0.9, β2 = 0.999, learning rate lr = 0.001, and weight decay 10^-4. The training image size is set to 240 × 320 and the training batch size is set to 32. The entire training process is typically completed within 15 epochs. During model evaluation, the non-maximum suppression radius is set to 4 pixels and the detection threshold α is set to 0.9 to produce reliable interest points.
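The stated training hyperparameters map onto a standard PyTorch optimizer setup as sketched below; the model variable is a placeholder, not the patent's network.

```python
import torch

# Stand-in module; the patent's residual feature extraction network is not
# reproduced here.
model = torch.nn.Conv2d(3, 65, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,             # learning rate
    betas=(0.9, 0.999),   # beta1, beta2
    weight_decay=1e-4,    # weight decay 10^-4
)

# Training images are resized to 240 x 320 and batched 32 at a time;
# training typically completes within 15 epochs.
```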
A second embodiment of the present invention provides a system for extracting interest points and descriptors based on joint feature reorganization and feature mixing, as shown in fig. 2, including: the system comprises a feature extraction module 100, an interest point acquisition module 200 and a descriptor extraction module 300;
the feature extraction module 100 is configured to acquire an image to be extracted as an input image, and extract a multi-scale feature map of the image through a feature extraction network; the feature extraction network is constructed based on a residual error network;
the interest point acquisition module 200 is configured to perform pixel recombination on each feature map and obtain a score map through convolution and nonlinear mapping, and to obtain interest points through non-maximum suppression based on the score map;
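The pixel recombination and non-maximum suppression steps above can be sketched as follows; the pixel rearrangement follows the common depth-to-space (pixel shuffle) convention, the suppression uses the radius 4 and threshold 0.9 stated in the description, and the function names are illustrative assumptions.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (c*r*r, h, w) feature map into (c, h*r, w*r): the 'pixel
    recombination' that restores full resolution without deconvolution."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)        # -> (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

def nms_points(score, radius=4, alpha=0.9):
    """Keep (y, x) where the score is the window maximum and exceeds alpha."""
    h, w = score.shape
    pts = []
    for y in range(h):
        for x in range(w):
            s = score[y, x]
            if s < alpha:
                continue
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            if s >= score[y0:y1, x0:x1].max():
                pts.append((y, x))
    return pts
```

The convolution and sigmoid mapping that produce the score map itself are omitted; only the recombination and suppression stages are shown.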
the descriptor extraction module 300 is configured to obtain and connect feature vectors of the pixel points in the input image at corresponding positions of the multi-scale feature map; filtering and compressing the connected feature vectors through N connecting layers to obtain descriptors corresponding to all pixel points; wherein N is a positive integer;
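The multi-scale descriptor sampling above can be sketched as follows; the nearest-neighbour lookup and the final L2 normalisation are assumptions standing in for the patent's learned filtering and compression layers, which are not reproduced here.

```python
import numpy as np

def sample_descriptor(feature_maps, x, y):
    """Concatenate the feature vectors at the projected position p / 2^m of
    each scale-m map into one raw descriptor, then L2-normalise it.

    feature_maps -- list of arrays of shape (c_m, h_m, w_m), m = 1, 2, ...
    x, y         -- pixel coordinates in the full-resolution input image
    """
    parts = []
    for m, fmap in enumerate(feature_maps, start=1):
        xm, ym = x // 2**m, y // 2**m        # nearest position in the m-th map
        parts.append(fmap[:, ym, xm])
    d = np.concatenate(parts)                # connect the feature vectors
    return d / np.linalg.norm(d)             # unit-length descriptor
```

A bilinear lookup and learned compression layers would replace the last two steps in a full implementation.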
the construction method of the loss function of the extraction network corresponding to the descriptor comprises the following steps:
taking the acquired sample image as a first image; performing a compound transformation on the first image by multiple preset image transformation methods, and taking the new image synthesized after the transformation as a second image;
uniformly sampling M pixel points in the first image to serve as first pixel points, and sampling pixel points corresponding to the first pixel points in the second image to serve as second pixel points; extracting descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point in the second image, other than the second pixel point, that has the minimum descriptor distance to the first pixel point and whose spatial distance to the second pixel point is greater than the set threshold value;
and combining the first distance and the second distance to construct a descriptor loss function.
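The selection of the third pixel point (the hardest negative) can be sketched as follows; the threshold value 16 is taken from the description, while the function name and the brute-force search are illustrative assumptions.

```python
import numpy as np

def hardest_negative(d_i, desc2, pts2, i, theta=16.0, h=240, w=320):
    """Among points of the second image other than p'_i, inside the boundary,
    and farther than theta pixels from p'_i, return the index whose descriptor
    is closest to d_i, together with that descriptor distance."""
    best_k, best_dist = -1, np.inf
    for k, (p, d) in enumerate(zip(pts2, desc2)):
        if k == i:
            continue
        if not (0 <= p[0] < w and 0 <= p[1] < h):
            continue                              # negatives outside the image are invalid
        if np.linalg.norm(p - pts2[i]) <= theta:
            continue                              # enforce the spatial gap theta
        dist = np.linalg.norm(d - d_i)
        if dist < best_dist:
            best_k, best_dist = k, dist
    return best_k, best_dist
```

The returned distance is the second distance used, with the first distance, in the triplet descriptor loss.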
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that, the interest point and descriptor extraction system based on joint feature reorganization and feature mixing provided in the foregoing embodiment is only illustrated by dividing the foregoing functional modules, and in practical applications, the foregoing function allocation may be completed by different functional modules according to needs, that is, the modules or steps in the embodiments of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores therein a plurality of programs, which are adapted to be loaded by a processor and to implement the above-described point of interest and descriptor extraction method based on joint feature reorganization and feature mixing.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described point of interest, descriptor extraction method based on joint feature reorganization and feature mixing.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Referring now to FIG. 5, there is illustrated a block diagram of a computer system suitable for use as a server in implementing embodiments of the method, system, and apparatus of the present application. The server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the computer system includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for system operation are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An Input/Output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output section 507 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is installed into the storage section 508 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 501. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
Claims (9)
1. A method for extracting interest points and descriptors based on joint feature recombination and feature mixing is characterized by comprising the following steps:
step S100, acquiring an image to be extracted as an input image, and extracting a multi-scale feature map of the image through a feature extraction network; the feature extraction network is constructed based on a residual error network;
step S200, carrying out pixel recombination on each feature map, and obtaining a score map through convolution and nonlinear mapping; obtaining interest points through non-maximum suppression based on the score map;
step S300, acquiring and connecting feature vectors of the pixel points in the input image at corresponding positions of the multi-scale feature map; filtering and compressing the connected feature vectors through N connecting layers to obtain descriptors corresponding to all pixel points; wherein N is a positive integer;
the construction method of the loss function of the extraction network corresponding to the descriptor comprises the following steps:
taking the acquired sample image as a first image; performing a compound transformation on the first image by multiple preset image transformation methods, and taking the new image synthesized after the transformation as a second image;
uniformly sampling M pixel points in the first image to serve as first pixel points, and sampling pixel points corresponding to the first pixel points in the second image to serve as second pixel points; extracting descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point in the second image, other than the second pixel point, that has the minimum descriptor distance to the first pixel point and whose spatial distance to the second pixel point is greater than the set threshold value;
and combining the first distance and the second distance to construct a descriptor loss function.
2. The method of claim 1, wherein the residual network does not include a maximum pooling layer, and the width and height of the input image and of the extracted multi-scale feature maps satisfy:
h = h_m × 2^m
w = w_m × 2^m
wherein h denotes the height of the input image, w denotes the width of the input image, h_m denotes the height of the feature map of the m-th convolution, w_m denotes the width of the feature map of the m-th convolution, and m denotes the number of convolutions.
3. The method for extracting interest points and descriptors based on joint feature reconstruction and feature mixing according to claim 2, wherein the method for calculating the corresponding positions of the pixel points in the input image in the multi-scale feature map comprises:
p^(m) = p / 2^m = [x/2^m, y/2^m]^T
wherein p^(m) denotes the position of the pixel point in the feature map of the m-th convolution, T denotes transposition, and x and y denote the coordinates of the pixel point in the input image.
4. The method for extracting interest points and descriptors based on joint feature reconstruction and feature mixing as claimed in claim 1, wherein the preset multiple image transformation methods include translation transformation, scale transformation, in-plane rotation transformation, and symmetric perspective transformation within a preset range.
5. The method for extracting interest points and descriptors based on joint feature reorganization and feature mixing according to claim 3, wherein the descriptor loss function is:

Ltriplet(D, D', V) = (Σ_{i=1..n} v_i · l_i) / (Σ_{j=1..n} v_j), with l_i = max(0, dpos_i - dneg_i + m)

wherein Ltriplet(D, D', V) represents the total loss of the descriptors corresponding to the pixel points of the first image and the second image, D represents the set of descriptors corresponding to the pixel points of the first image, D' represents the set of descriptors corresponding to the pixel points of the second image, V represents the set of masks corresponding to the pixel points of the second image during the transformation, l_i represents the loss value corresponding to the descriptor of the i-th pixel point of the first image, v_i and v_j represent the i-th and j-th masks in V, n represents the number of pixel points in the second image, dpos_i represents the first distance, dneg_i represents the second distance, and m denotes a preset margin.
6. The method for extracting interest points and descriptors based on joint feature reorganization and feature mixing according to claim 5, wherein the loss function of the extraction network corresponding to the interest points is a weighted cross-entropy loss function, and the loss value is calculated as:

Lbce(S, Y) = -(1 / (h·w)) · Σ_{u,v} [λ · Y_{u,v} · log(S_{u,v}) + (1 - Y_{u,v}) · log(1 - S_{u,v})]

wherein S represents the score map, Y represents the labeled interest points corresponding to the score map, λ represents a preset ratio, u and v represent the position of a pixel point in the sample image, Y_{u,v} represents the interest point label at position (u, v), S_{u,v} represents the score at position (u, v), and Lbce(S, Y) represents the weighted cross-entropy loss value.
7. An interest point and descriptor extraction system based on joint feature reorganization and feature mixing, comprising: the system comprises a feature extraction module, an interest point acquisition module and a descriptor extraction module;
the characteristic extraction module is configured to acquire an image to be extracted as an input image and extract a multi-scale characteristic diagram of the image through a characteristic extraction network; the feature extraction network is constructed based on a residual error network;
the interest point acquisition module is configured to perform pixel recombination on each feature map and obtain a score map through convolution and nonlinear mapping, and to obtain interest points through non-maximum suppression based on the score map;
the descriptor extraction module is configured to acquire and connect feature vectors of the pixel points in the input image at corresponding positions of the multi-scale feature map; filtering and compressing the connected feature vectors through N connecting layers to obtain descriptors corresponding to all pixel points; wherein N is a positive integer;
the construction method of the loss function of the extraction network corresponding to the descriptor comprises the following steps:
taking the acquired sample image as a first image; performing a compound transformation on the first image by multiple preset image transformation methods, and taking the new image synthesized after the transformation as a second image;
uniformly sampling M pixel points in the first image to serve as first pixel points, and sampling pixel points corresponding to the first pixel points in the second image to serve as second pixel points; extracting descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point in the second image, other than the second pixel point, that has the minimum descriptor distance to the first pixel point and whose spatial distance to the second pixel point is greater than the set threshold value;
and combining the first distance and the second distance to construct a descriptor loss function.
8. A storage device having stored therein a plurality of programs, wherein said programs are adapted to be loaded and executed by a processor to implement the joint feature reorganization and feature mixing based interest point and descriptor extraction method of any of claims 1-6.
9. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that said program is adapted to be loaded and executed by a processor to implement the point of interest, descriptor extraction method based on joint feature reorganization and feature mixing according to any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010444152.6A CN111612075A (en) | 2020-05-22 | 2020-05-22 | Interest point and descriptor extraction method based on joint feature recombination and feature mixing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010444152.6A CN111612075A (en) | 2020-05-22 | 2020-05-22 | Interest point and descriptor extraction method based on joint feature recombination and feature mixing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111612075A true CN111612075A (en) | 2020-09-01 |
Family
ID=72205282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010444152.6A Withdrawn CN111612075A (en) | 2020-05-22 | 2020-05-22 | Interest point and descriptor extraction method based on joint feature recombination and feature mixing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111612075A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112232361A (en) * | 2020-10-13 | 2021-01-15 | 国网电子商务有限公司 | Image processing method and device, electronic equipment and computer readable storage medium |
CN112232361B (en) * | 2020-10-13 | 2021-09-21 | 国网电子商务有限公司 | Image processing method and device, electronic equipment and computer readable storage medium |
CN113656698A (en) * | 2021-08-24 | 2021-11-16 | 北京百度网讯科技有限公司 | Training method and device of interest feature extraction model and electronic equipment |
CN113656698B (en) * | 2021-08-24 | 2024-04-09 | 北京百度网讯科技有限公司 | Training method and device for interest feature extraction model and electronic equipment |
CN114693940A (en) * | 2022-03-22 | 2022-07-01 | 电子科技大学 | Image description method for enhancing feature mixing resolvability based on deep learning |
CN114693940B (en) * | 2022-03-22 | 2023-04-28 | 电子科技大学 | Image description method with enhanced feature mixing decomposability based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108509915B (en) | Method and device for generating face recognition model | |
EP3716198A1 (en) | Image reconstruction method and device | |
CN111401516B (en) | Searching method for neural network channel parameters and related equipment | |
CN112560876A (en) | Single-stage small sample target detection method for decoupling measurement | |
CN108172213B (en) | Surge audio identification method, surge audio identification device, surge audio identification equipment and computer readable medium | |
CN111612075A (en) | Interest point and descriptor extraction method based on joint feature recombination and feature mixing | |
CN112465828A (en) | Image semantic segmentation method and device, electronic equipment and storage medium | |
CN111612017A (en) | Target detection method based on information enhancement | |
CN112598597A (en) | Training method of noise reduction model and related device | |
CN109977832B (en) | Image processing method, device and storage medium | |
CN113191489B (en) | Training method of binary neural network model, image processing method and device | |
CN111444807B (en) | Target detection method, device, electronic equipment and computer readable medium | |
CN111091524A (en) | Prostate transrectal ultrasound image segmentation method based on deep convolutional neural network | |
CN113065997B (en) | Image processing method, neural network training method and related equipment | |
CN111738174B (en) | Human body example analysis method and system based on depth decoupling | |
CN112580720A (en) | Model training method and device | |
CN114359289A (en) | Image processing method and related device | |
Zhang et al. | A GPU-accelerated real-time single image de-hazing method using pixel-level optimal de-hazing criterion | |
CN111428566A (en) | Deformation target tracking system and method | |
WO2024027347A1 (en) | Content recognition method and apparatus, device, storage medium, and computer program product | |
CN109508582A (en) | The recognition methods of remote sensing image and device | |
CN111738069A (en) | Face detection method and device, electronic equipment and storage medium | |
CN116798041A (en) | Image recognition method and device and electronic equipment | |
CN114332489B (en) | Image salient target detection method and system based on uncertainty perception | |
CN113688928B (en) | Image matching method and device, electronic equipment and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20200901 |