CN111612075A - Interest point and descriptor extraction method based on joint feature recombination and feature mixing - Google Patents

Interest point and descriptor extraction method based on joint feature recombination and feature mixing

Info

Publication number: CN111612075A
Application number: CN202010444152.6A
Authority: CN (China)
Prior art keywords: image, feature, pixel, descriptor, points
Legal status: Withdrawn
Inventors: Xu Shibiao, Zhang Yuyang, Meng Weiliang, Zhang Jiguang, Zhang Xiaopeng
Current and original assignee: Institute of Automation, Chinese Academy of Sciences
Other languages: Chinese (zh)
Priority/filing date: 2020-05-22 (priority to CN202010444152.6A)
Publication date: 2020-09-01 (publication of CN111612075A)


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features

Abstract

The invention belongs to the field of computer vision, and particularly relates to an interest point and descriptor extraction method, system and device based on joint feature recombination and feature mixing, aiming at solving the problem of low detection and extraction accuracy in existing interest point and descriptor extraction methods. The method comprises the following steps: acquiring an image to be processed as the input image and extracting its multi-scale feature maps through a feature extraction network; performing pixel recombination on each feature map, obtaining a score map through convolution and nonlinear mapping, and obtaining interest points through non-maximum suppression; and acquiring the feature vectors of each pixel of the input image at the corresponding positions of the multi-scale feature maps, concatenating them, and filtering and compressing the concatenated feature vectors through fully connected layers to obtain the descriptor of each pixel. The invention improves the accuracy of interest point detection and descriptor extraction.

Description

Interest point and descriptor extraction method based on joint feature recombination and feature mixing
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a method, a system and a device for extracting interest points and descriptors based on joint feature recombination and feature mixing.
Background
Using image interest points and their local feature descriptors to find correct correspondences between images has long been the basis of many vision-based applications, such as visual localization and image retrieval. However, with the rapid development of the industry, these applications must handle increasingly complex and difficult scenarios. Since interest point detection and description are key components of these higher-level algorithms, further improving their accuracy is important.
Over the past two decades, a number of excellent algorithms have been proposed to solve the above problem. Both traditional statistics- and filtering-based methods and deep learning-based methods have made significant breakthroughs; in particular, deep learning-based algorithms such as SuperPoint, D2-Net, and R2D2 have greatly improved the accuracy of interest point detection and local feature description. However, previous approaches have focused primarily on a better deep learning paradigm for the problem and have somewhat neglected network architecture design, and the strong network architectures proposed for other visual tasks, such as classification, object detection, and segmentation, are not suitable for image interest point detection or local feature description. Therefore, the invention provides a method, a system and a device for extracting interest points and descriptors based on joint feature recombination and feature mixing.
Disclosure of Invention
In order to solve the above problem in the prior art, namely the low detection and extraction accuracy of existing interest point and descriptor extraction methods, a first aspect of the present invention provides an interest point and descriptor extraction method based on joint feature recombination and feature mixing, comprising:
Step S100, acquiring an image to be processed as the input image, and extracting the multi-scale feature maps of the image through a feature extraction network; the feature extraction network is constructed based on a residual network;
Step S200, performing pixel recombination on each feature map and obtaining a score map (Score Map) through convolution and nonlinear mapping; obtaining interest points through non-maximum suppression based on the score map;
Step S300, acquiring and concatenating the feature vectors of each pixel point of the input image at its corresponding positions in the multi-scale feature maps; filtering and compressing the concatenated feature vectors through N fully connected layers to obtain the descriptor corresponding to each pixel point; wherein N is a positive integer;
the loss function of the extraction network corresponding to the descriptors is constructed as follows:
taking an acquired sample image as a first image; performing a compound transformation on the first image using multiple preset image transformation methods, and taking the synthesized transformed image as a second image;
uniformly sampling M pixel points in the first image as first pixel points, and sampling the pixel points corresponding to the first pixel points in the second image as second pixel points; extracting the descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of each first pixel point and the descriptor of its second pixel point as a first distance; calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point of the second image, other than the second pixel point, whose descriptor distance to the first pixel point is minimum and whose spatial distance to the second pixel point is greater than a set threshold;
and combining the first distance and the second distance to construct the descriptor loss function.
In some preferred embodiments, the residual network does not include a max pooling layer, and the widths and heights of the extracted multi-scale feature maps and the input image satisfy:

h = h_m × 2^m
w = w_m × 2^m

where h denotes the height of the input image, w the width of the input image, h_m the height of the feature map of the m-th convolution, w_m the width of the feature map of the m-th convolution, and m the convolution stage index.
In some preferred embodiments, the corresponding position of each pixel point of the input image in the multi-scale feature maps is calculated as:

p^(m) = p / 2^m = [x / 2^m, y / 2^m]^T

where p^(m) denotes the position of the pixel point in the feature map of the m-th convolution, T denotes transposition, and x and y denote the coordinates of the pixel point in the input image.
In some preferred embodiments, the preset multiple image transformation methods include translation transformation, scale transformation, in-plane rotation transformation, and symmetric perspective transformation within a preset range.
In some preferred embodiments, the descriptor loss function is:

l_i = max(0, μ + d⁺(i) − d⁻(i))

L_triplet(D, D′, V) = ( Σ_{i=1}^{n} v_i · l_i ) / ( Σ_{j=1}^{n} v_j )

where L_triplet(D, D′, V) denotes the total loss of the descriptors corresponding to the pixel points of the first image and the second image, D denotes the set of descriptors of the pixel points of the first image, D′ denotes the set of descriptors of the pixel points of the second image, V denotes the set of masks of the pixel points of the second image under the transformation, l_i denotes the loss value corresponding to the descriptor d_{p_i} of the i-th pixel point of the first image, v_i and v_j denote the i-th and j-th masks in V, n denotes the number of pixel points in the second image, μ denotes a preset margin, d⁺(i) denotes the first distance, and d⁻(i) denotes the second distance.
In some preferred embodiments, the loss function of the extraction network corresponding to the interest points is a weighted cross entropy loss function, and its loss value is calculated as:

L_bce(S, Y) = − (1 / (h·w)) Σ_{u,v} [ λ · Y_{u,v} · log S_{u,v} + (1 − Y_{u,v}) · log(1 − S_{u,v}) ]

where S denotes the score map, Y denotes the labelled interest points corresponding to the score map, λ denotes a preset ratio, u and v denote the position of a pixel point in the sample image, Y_{u,v} denotes the interest point label at that position, S_{u,v} denotes the score at that position in the score map, and L_bce(S, Y) denotes the weighted cross entropy loss value.
The second aspect of the invention provides an interest point and descriptor extraction system based on joint feature recombination and feature mixing, which comprises a feature extraction module, an interest point acquisition module and a descriptor extraction module;
the feature extraction module is configured to acquire an image to be processed as the input image and extract the multi-scale feature maps of the image through a feature extraction network; the feature extraction network is constructed based on a residual network;
the interest point acquisition module is configured to perform pixel recombination on each feature map, obtain a score map through convolution and nonlinear mapping, and obtain interest points through non-maximum suppression based on the score map;
the descriptor extraction module is configured to acquire and concatenate the feature vectors of each pixel point of the input image at its corresponding positions in the multi-scale feature maps, and to filter and compress the concatenated feature vectors through N fully connected layers to obtain the descriptor corresponding to each pixel point, wherein N is a positive integer;
the loss function of the extraction network corresponding to the descriptors is constructed as follows:
taking an acquired sample image as a first image; performing a compound transformation on the first image using multiple preset image transformation methods, and taking the synthesized transformed image as a second image;
uniformly sampling M pixel points in the first image as first pixel points, and sampling the pixel points corresponding to the first pixel points in the second image as second pixel points; extracting the descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of each first pixel point and the descriptor of its second pixel point as a first distance; calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point of the second image, other than the second pixel point, whose descriptor distance to the first pixel point is minimum and whose spatial distance to the second pixel point is greater than a set threshold;
and combining the first distance and the second distance to construct the descriptor loss function.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above interest point and descriptor extraction method based on joint feature recombination and feature mixing.
In a fourth aspect of the present invention, a processing device is provided, comprising a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above interest point and descriptor extraction method based on joint feature recombination and feature mixing.
The invention has the beneficial effects that:
the invention improves the precision of detecting and extracting the interest points and the descriptors. The method utilizes the residual error network to extract the features, realizes the interest point detection through the feature recombination after the features are extracted, and improves the robustness of the interest point detection. And acquiring the characteristic vectors of the pixel points in the input image at the corresponding positions of the multi-scale characteristic graph for connection, and filtering and compressing the connected characteristic vectors through a connection layer to finish the extraction of the characteristic descriptors.
In addition, supervision information of quick and effective training is established based on random uniform sampling in the training process, and loss functions of interest point detection and feature descriptors are constructed. Training is carried out based on the constructed loss function, and the precision of interest point detection and descriptor extraction is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of the method for extracting interest points and descriptors based on joint feature recombination and feature mixing according to an embodiment of the present invention;
FIG. 2 is a block diagram of the interest point and descriptor extraction system based on joint feature recombination and feature mixing according to an embodiment of the present invention;
FIG. 3 is a detailed framework diagram of the method for extracting interest points and descriptors based on joint feature recombination and feature mixing according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the result of feature matching based on interest points and descriptors according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The method for extracting interest points and descriptors based on joint feature recombination and feature mixing, as shown in FIG. 1, comprises the following steps:
Step S100, acquiring an image to be processed as the input image, and extracting the multi-scale feature maps of the image through a feature extraction network; the feature extraction network is constructed based on a residual network;
Step S200, performing pixel recombination on each feature map and obtaining a score map (Score Map) through convolution and nonlinear mapping; obtaining interest points through non-maximum suppression based on the score map;
Step S300, acquiring and concatenating the feature vectors of each pixel point of the input image at its corresponding positions in the multi-scale feature maps; filtering and compressing the concatenated feature vectors through N fully connected layers to obtain the descriptor corresponding to each pixel point; wherein N is a positive integer;
the loss function of the extraction network corresponding to the descriptors is constructed as follows:
taking an acquired sample image as a first image; performing a compound transformation on the first image using multiple preset image transformation methods, and taking the synthesized transformed image as a second image;
uniformly sampling M pixel points in the first image as first pixel points, and sampling the pixel points corresponding to the first pixel points in the second image as second pixel points; extracting the descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of each first pixel point and the descriptor of its second pixel point as a first distance; calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point of the second image, other than the second pixel point, whose descriptor distance to the first pixel point is minimum and whose spatial distance to the second pixel point is greater than a set threshold;
and combining the first distance and the second distance to construct the descriptor loss function.
In order to more clearly explain the method for extracting interest points and descriptors based on joint feature recombination and feature mixing, the steps of an embodiment of the method of the present invention are described in detail below with reference to the drawings.
Step S100, acquiring an image to be processed as the input image, and extracting the multi-scale feature maps of the image through a feature extraction network; the feature extraction network is constructed based on a residual network.
In the invention, the interest point and descriptor extraction method based on joint feature recombination and feature mixing detects interest points and extracts descriptors through an interest point detector and a descriptor extraction network. Before interest point detection and descriptor extraction, feature extraction is first performed on the acquired image.
In this embodiment, the multi-scale features of the input image are extracted through a feature extraction network constructed based on a residual network. Feature extraction is a feed-forward computation of the backbone residual network that generates feature maps at several scales, with a scale stride of 2. Convolutional layers that produce output maps of the same size belong to the same network stage, and only the output of the last layer of each stage is used by the other modules.
In the present invention, the feature extraction network uses the residual network ResNet with one difference from the original: the max pooling layer is removed, because it would make the scale of the top feature map unsuitable for feature recombination. The network architecture is therefore divided into four stages (i.e., four convolutions), as shown in FIG. 3, where Conv denotes a convolutional layer and 7 × 7 and 3 × 3 denote the convolution kernel sizes. Taking the height and width of the input image as h and w, the feature map extracted at each stage can be expressed as

C_m ∈ R^{h_m × w_m × d_m}

where C_m denotes the feature map obtained by the m-th convolution, d_m the depth of that feature map, h_m its height, and w_m its width, with m ∈ {1, 2, 3, 4} and d_m ∈ {64, 128, 256, 512}. The sizes h_m and w_m and the size of the input image I satisfy the constraints of equations (1) and (2):

h = h_m × 2^m   (1)
w = w_m × 2^m   (2)
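By way of illustration, the following minimal PyTorch sketch builds such a four-stage residual backbone without max pooling; the residual block layout and the number of blocks per stage are assumptions of the sketch, while the 7 × 7 stem, the stage depths {64, 128, 256, 512}, and the stride-2 stages follow the description above.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """A standard residual block: two 3x3 convolutions plus a skip path."""
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, 1, 1, bias=False),
            nn.BatchNorm2d(cout))
        self.skip = (nn.Identity() if stride == 1 and cin == cout else
                     nn.Sequential(nn.Conv2d(cin, cout, 1, stride, bias=False),
                                   nn.BatchNorm2d(cout)))

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class Backbone(nn.Module):
    """Four-stage residual backbone without max pooling: stage m halves the
    resolution, so C_m has spatial size (h / 2^m, w / 2^m), matching the
    constraints of equations (1) and (2)."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, 64, 7, 2, 3, bias=False),  # 7x7 stem, stride 2
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            BasicBlock(64, 64))
        self.stage2 = BasicBlock(64, 128, stride=2)
        self.stage3 = BasicBlock(128, 256, stride=2)
        self.stage4 = BasicBlock(256, 512, stride=2)

    def forward(self, x):
        c1 = self.stage1(x)   # (B,  64, h/2,  w/2)
        c2 = self.stage2(c1)  # (B, 128, h/4,  w/4)
        c3 = self.stage3(c2)  # (B, 256, h/8,  w/8)
        c4 = self.stage4(c3)  # (B, 512, h/16, w/16)
        return c1, c2, c3, c4
```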
Step S200, performing pixel recombination on each feature map and obtaining a score map (Score Map) through convolution and nonlinear mapping; obtaining interest points through non-maximum suppression based on the score map.
In this embodiment, the multi-scale feature maps extracted in step S100 are recombined by the feature recombination module (Feature Shuffle Module) in FIG. 3, specifically as follows:

Each feature map C_m ∈ R^{h_m × w_m × d_m} extracted from the image is converted by a pixel recombination (pixel shuffle) operation into a rearranged feature map Ĉ_m without consuming extra memory resources; Ĉ_m denotes the feature map after pixel recombination, and Shuffled in FIG. 3 denotes this recombination processing.
All the converted feature maps are processed as a whole and input into a single 3 × 3 convolutional layer followed by a Sigmoid activation function to generate a score map, expressed as S ∈ R^{h×w}. The whole process can be abstractly described as equation (3):

S = FSM(C_1, C_2, C_3, C_4)   (3)

where FSM (Feature Shuffle Module) denotes the feature recombination operation.
During inference, non-maximum suppression is first applied to the predicted score map S; then, any pixel of S whose response value exceeds a fixed detection threshold α is marked as an interest point.
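A minimal sketch of this score-map branch is given below. The per-stage upscale factor 2^m is an assumption chosen so that every shuffled map, and hence S, has the full input resolution h × w stated above; torch's pixel_shuffle performs the rearrangement without extra learned parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureShuffleModule(nn.Module):
    """Pixel-shuffles each stage map C_m to full resolution, concatenates them,
    and predicts the score map S with a single 3x3 convolution plus Sigmoid."""
    def __init__(self, depths=(64, 128, 256, 512)):
        super().__init__()
        # Stage m (1-based) is upscaled by 2^m, keeping d_m / 4^m channels:
        # 64/4 + 128/16 + 256/64 + 512/256 = 16 + 8 + 4 + 2 = 30 channels.
        merged = sum(d // 4 ** (m + 1) for m, d in enumerate(depths))
        self.head = nn.Conv2d(merged, 1, kernel_size=3, padding=1)

    def forward(self, feats):
        shuffled = [F.pixel_shuffle(c, 2 ** (m + 1)) for m, c in enumerate(feats)]
        s = torch.sigmoid(self.head(torch.cat(shuffled, dim=1)))  # (B, 1, h, w)
        return s.squeeze(1)

def detect_keypoints(score, radius=4, alpha=0.9):
    """Inference step: non-maximum suppression followed by thresholding."""
    k = 2 * radius + 1
    local_max = F.max_pool2d(score[None, None], k, stride=1, padding=radius)[0, 0]
    keep = (score == local_max) & (score > alpha)
    return keep.nonzero()  # (num_points, 2) tensor of (row, col) indices
```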
Step S300, acquiring and connecting feature vectors of the pixel points in the input image at corresponding positions of the multi-scale feature map; filtering and compressing the connected feature vectors through N connecting layers to obtain descriptors corresponding to all pixel points; wherein N is a positive integer;
In this embodiment, in order to fully utilize multi-level semantics, the present invention proposes a new feature fusion module, the feature mixing module (Feature Blend Module, FBM) in FIG. 3, which extracts the most discriminative information from the multi-level feature vectors to construct the descriptor. The specific steps are as follows:
Given a pixel point p = [x, y]^T on the input image, where x and y denote coordinates and T denotes transposition, its corresponding position p^(m) in each feature map is calculated as in equation (4):

p^(m) = p / 2^m = [x / 2^m, y / 2^m]^T   (4)

where p^(m) denotes the position of the pixel point in the feature map of the m-th convolution.
The feature vector C_m(p^(m)) corresponding to each pixel point is obtained from the multi-scale feature maps by bilinear interpolation. After all the feature vectors belonging to the same point are generated, the feature mixing module (FBM) of the present invention concatenates them into a single feature vector, expressed as equation (5):

C_cat(p) = Cat(C_1(p^(1)), C_2(p^(2)), C_3(p^(3)), C_4(p^(4)))   (5)

where Cat denotes concatenation, C_cat denotes the concatenated feature vector, and P = {p_1, p_2, ..., p_n} denotes the set of pixel points.
Two fully connected layers are then used to filter the useless information in C_cat(p) and compress the remaining valid semantics to generate the final descriptor d_p ∈ R^dim, where dim = 128. The whole process above can be summarized as equation (6):

D = FBM(C_1, C_2, C_3, C_4, P)   (6)

where D = {d_1, d_2, ..., d_n} denotes the set of extracted descriptors and d_1, d_2, ..., d_n denote the descriptors corresponding to the pixel points.
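The feature mixing step can be sketched with grid_sample for the bilinear lookups; the hidden width of the first fully connected layer and the final L2 normalization of the descriptor are assumptions of the sketch, while the two-layer filtering-and-compression structure and dim = 128 follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureBlendModule(nn.Module):
    """Samples each stage map at p^(m) = p / 2^m by bilinear interpolation,
    concatenates the four vectors, and compresses them with two fully
    connected layers into a dim=128 descriptor."""
    def __init__(self, depths=(64, 128, 256, 512), dim=128, hidden=512):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(sum(depths), hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, dim))

    def forward(self, feats, points, image_size):
        """feats: stage maps C_1..C_4 with batch size 1; points: (n, 2) pixel
        coordinates [x, y] in the input image; image_size: (h, w)."""
        h, w = image_size
        # Normalize coordinates to [-1, 1]; the same normalized grid serves
        # every scale because p and the map size shrink together (equation 4),
        # up to boundary rounding.
        grid = points.float().clone()
        grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0
        grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
        grid = grid.view(1, -1, 1, 2)                      # (1, n, 1, 2)
        sampled = [F.grid_sample(c, grid, mode='bilinear', align_corners=True)
                   .squeeze(-1).squeeze(0).t()             # (n, d_m)
                   for c in feats]
        c_cat = torch.cat(sampled, dim=1)                  # (n, sum(d_m))
        return F.normalize(self.fc(c_cat), dim=1)          # (n, 128); L2 norm assumed
```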
Based on the obtained interest points and descriptors, feature matching can be performed, and the matching result is shown in fig. 4.
In addition, in order to further improve the accuracy of interest point detection and descriptor extraction, the invention establishes supervision information for fast and effective training based on random uniform sampling and defines the loss functions, specifically as follows:
For a training sample image I, a transformation T ∈ R^{3×3} is first sampled by composing a series of simple transformations, including translation, scale, in-plane rotation, and symmetric perspective deformation within preset ranges. Applying this transformation T to the sample image I synthesizes a new image I′, constructing the image pair (I, I′) with the known transformation T.
Through the transformation T between the image pair, an infinite number of correspondences could be found. However, this would be resource-consuming, and not many correspondences are needed to train the model. Instead, a random uniform sampling strategy is employed to obtain a small number of unbiased correspondences. Specifically, n points P = {p_1, p_2, ..., p_n} are uniformly sampled on the original image I, and the corresponding points P′ = {p′_1, p′_2, ..., p′_n} are then generated on the synthesized image I′ based on the transformation T. The whole process is shown in equations (7) and (8):

P, T = RandomSample(·)   (7)
P′, V = Transform(P)   (8)

where RandomSample denotes random uniform sampling, Transform denotes the image space transformation, and V ∈ B^n is the validity mask of the points P′ = {p′_1, p′_2, ..., p′_n}, B being the set of mask values: since not all transformed points lie within the image boundary, points outside the boundary are invalid.
The sampled points P and their corresponding points P′ can be used to construct the descriptor loss function. Note that the number of samples n can be chosen small enough to speed up the training process without affecting the accuracy of the model; the present invention sets n to 400 for all experiments.
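Equations (7) and (8) can be sketched as follows, treating T as a 3 × 3 homography matrix (an assumption consistent with the perspective deformation above); the helper name is illustrative.

```python
import numpy as np

def random_correspondences(H, h, w, n=400, rng=None):
    """Uniformly samples n points P on image I, maps them through the known
    transformation T (here the 3x3 matrix H) to obtain P', and builds the
    validity mask V, mirroring equations (7) and (8)."""
    if rng is None:
        rng = np.random.default_rng()
    pts = np.stack([rng.uniform(0, w - 1, n),       # x coordinates
                    rng.uniform(0, h - 1, n)], 1)   # y coordinates
    homog = np.concatenate([pts, np.ones((n, 1))], axis=1)  # homogeneous coords
    warped = homog @ H.T
    warped = warped[:, :2] / warped[:, 2:3]         # back to pixel coordinates
    # V masks out points that fall outside the boundary of I'.
    valid = ((warped[:, 0] >= 0) & (warped[:, 0] <= w - 1) &
             (warped[:, 1] >= 0) & (warped[:, 1] <= h - 1))
    return pts, warped, valid
```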
Given the set of pixel points P = {p_1, p_2, ..., p_n} in image I and the corresponding set of pixel points P′ = {p′_1, p′_2, ..., p′_n} in image I′, their descriptors D and D′ are obtained respectively. Then, for each descriptor d_{p_i}, the distance to the descriptor of the corresponding pixel point, referred to in the present invention as the positive sample distance, is calculated as in equation (9):

d⁺(i) = || d_{p_i} − d′_{p′_i} ||_2   (9)

where d⁺(i) denotes the distance between the descriptor d_{p_i} of a pixel point in image I and the descriptor d′_{p′_i} of the corresponding pixel point in I′.
The negative counterpart of d_{p_i} is obtained by screening the pixel points of I′ other than p′_i: among all pixel points p′_k (k ≠ i) of I′, the one whose descriptor has the minimum distance to d_{p_i}, whose spatial distance to p′_i is greater than a threshold, and which lies within the image boundary is selected and denoted p′_{k*}. The distance between its descriptor and d_{p_i}, referred to in the present invention as the negative sample distance, is calculated as in equation (10):

d⁻(i) = || d_{p_i} − d′_{p′_{k*}} ||_2   (10)
In the present invention, the threshold θ = 16 is used to ensure that the spatial distance between p′_{k*} and p′_i exceeds a certain value. In addition, the selected negative sample p′_{k*} must lie within the image boundary; otherwise it is invalid.
Given d⁺(i) and d⁻(i), the present invention defines the triplet-distance descriptor loss function as in equation (11):

l_i = max(0, μ + d⁺(i) − d⁻(i))   (11)

where l_i denotes the loss value corresponding to the descriptor d_{p_i} of a pixel point in image I and μ denotes a preset margin.
The total loss constructed from descriptors D and D′ is shown in equation (12):

L_triplet(D, D′, V) = ( Σ_{i=1}^{n} v_i · l_i ) / ( Σ_{j=1}^{n} v_j )   (12)

where L_triplet(D, D′, V) denotes the total loss of the descriptors corresponding to the pixel points of I and I′, D denotes the set of descriptors of the pixel points in I, D′ denotes the set of descriptors of the pixel points in I′, V denotes the set of masks of the pixel points of I′ under the transformation, l_i denotes the loss value corresponding to the descriptor d_{p_i} of a pixel point in I, v_i and v_j denote the i-th and j-th masks in V, and n denotes the number of pixel points in I′.
Combining the predicted score map S of the sample image with the interest point labels Y of the sample image, the weighted cross entropy loss of interest point detection is shown in equation (13):

L_bce(S, Y) = − (1 / (h·w)) Σ_{u,v} [ λ · Y_{u,v} · log S_{u,v} + (1 − Y_{u,v}) · log(1 − S_{u,v}) ]   (13)

where S denotes the score map, Y denotes the labelled interest points corresponding to the score map, u and v denote the position of a pixel point in the sample image, Y_{u,v} denotes the interest point label at that position, S_{u,v} denotes the score at that position in the score map, and L_bce(S, Y) denotes the weighted cross entropy loss value. λ denotes a preset ratio for balancing positive and negative samples, since the number of positive samples is much smaller than the number of negative samples; the present invention preferably sets λ to 200.
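Equation (13) can be sketched directly; averaging over all pixels is an assumption about the normalization.

```python
import torch

def weighted_bce_loss(S, Y, lam=200.0, eps=1e-7):
    """Weighted cross entropy of equation (13): positive (interest point)
    pixels are up-weighted by lambda because they are far rarer than
    background pixels."""
    S = S.clamp(eps, 1 - eps)  # guard the logarithms
    loss = -(lam * Y * torch.log(S) + (1 - Y) * torch.log(1 - S))
    return loss.mean()
```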
Combining the loss function of the interest point detector with the loss function of the descriptor extraction network yields the total loss function, as shown in equation (14):

L_total(S, S′, D, D′; Y, Y′, V) = L_bce(S, Y) + L_bce(S′, Y′) + L_triplet(D, D′, V)   (14)

where S and S′ are the score maps of images I and I′ respectively, and Y and Y′ are the ground-truth interest point labels of images I and I′. Note that the random transformation from (I, Y) to (I′, Y′) and the random correspondence sampling between the image pair I, I′ are performed in parallel with the training procedure.
During the training of the interest point detector and descriptor extraction network, the invention uses the Adam optimizer with parameters β_1 = 0.9, β_2 = 0.999, learning rate lr = 0.001, and weight decay 10⁻⁴. The training image size is set to 240 × 320 and the batch size to 32; the entire training process is typically completed within 15 epochs. During model evaluation, the non-maximum suppression radius is set to 4 pixels and the detection threshold α to 0.9 to produce reliable interest points.
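Combining the pieces, the stated Adam configuration and the total objective of equation (14) might look as follows; the helper names reuse the sketches above and are illustrative.

```python
import torch

def build_optimizer(modules):
    """Adam with the hyper-parameters stated above: beta1 = 0.9, beta2 = 0.999,
    learning rate 0.001, weight decay 1e-4."""
    params = [p for m in modules for p in m.parameters()]
    return torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999),
                            weight_decay=1e-4)

def total_loss(S, Sp, D, Dp, Y, Yp, pts_p, V):
    """Equation (14): two detector terms plus one descriptor term."""
    return (weighted_bce_loss(S, Y) + weighted_bce_loss(Sp, Yp)
            + triplet_descriptor_loss(D, Dp, pts_p, V))
```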
A second embodiment of the present invention provides an interest point and descriptor extraction system based on joint feature recombination and feature mixing, as shown in FIG. 2, comprising a feature extraction module 100, an interest point acquisition module 200 and a descriptor extraction module 300;
the feature extraction module 100 is configured to acquire an image to be processed as the input image and extract the multi-scale feature maps of the image through a feature extraction network; the feature extraction network is constructed based on a residual network;
the interest point acquisition module 200 is configured to perform pixel recombination on each feature map, obtain a score map through convolution and nonlinear mapping, and obtain interest points through non-maximum suppression based on the score map;
the descriptor extraction module 300 is configured to acquire and concatenate the feature vectors of each pixel point of the input image at its corresponding positions in the multi-scale feature maps, and to filter and compress the concatenated feature vectors through N fully connected layers to obtain the descriptor corresponding to each pixel point, wherein N is a positive integer;
the loss function of the extraction network corresponding to the descriptors is constructed as follows:
taking an acquired sample image as a first image; performing a compound transformation on the first image using multiple preset image transformation methods, and taking the synthesized transformed image as a second image;
uniformly sampling M pixel points in the first image as first pixel points, and sampling the pixel points corresponding to the first pixel points in the second image as second pixel points; extracting the descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of each first pixel point and the descriptor of its second pixel point as a first distance; calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point of the second image, other than the second pixel point, whose descriptor distance to the first pixel point is minimum and whose spatial distance to the second pixel point is greater than a set threshold;
and combining the first distance and the second distance to construct the descriptor loss function.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that the interest point and descriptor extraction system based on joint feature recombination and feature mixing provided in the foregoing embodiment is only illustrated by the division of the above functional modules; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the modules or steps in the embodiments of the present invention may be further decomposed or combined. For example, the modules of the above embodiment may be combined into one module or further split into multiple sub-modules to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs, which are adapted to be loaded and executed by a processor to implement the above interest point and descriptor extraction method based on joint feature recombination and feature mixing.
A processing apparatus according to a fourth embodiment of the present invention includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above interest point and descriptor extraction method based on joint feature recombination and feature mixing.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Referring now to FIG. 5, there is illustrated a block diagram of a computer system suitable for use as a server in implementing embodiments of the method, system, and apparatus of the present application. The server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the computer system includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for system operation are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An Input/Output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication portion 509 including a network interface card such as a LAN (Local Area Network) card or a modem. The communication portion 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read out therefrom is installed into the storage portion 508 as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 501. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (9)

1. A method for extracting interest points and descriptors based on joint feature recombination and feature mixing, characterized by comprising:
step S100, acquiring an image to be processed as the input image, and extracting the multi-scale feature maps of the image through a feature extraction network, the feature extraction network being constructed based on a residual network;
step S200, performing pixel recombination on each feature map and obtaining a score map (Score Map) through convolution and nonlinear mapping; obtaining interest points through non-maximum suppression based on the score map;
step S300, acquiring and concatenating the feature vectors of each pixel point of the input image at its corresponding positions in the multi-scale feature maps; filtering and compressing the concatenated feature vectors through N fully connected layers to obtain the descriptor corresponding to each pixel point, wherein N is a positive integer;
wherein the loss function of the extraction network corresponding to the descriptors is constructed as follows:
taking an acquired sample image as a first image; performing a compound transformation on the first image using multiple preset image transformation methods, and taking the synthesized transformed image as a second image;
uniformly sampling M pixel points in the first image as first pixel points, and sampling the pixel points corresponding to the first pixel points in the second image as second pixel points; extracting the descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of each first pixel point and the descriptor of its second pixel point as a first distance; calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point of the second image, other than the second pixel point, whose descriptor distance to the first pixel point is minimum and whose spatial distance to the second pixel point is greater than a set threshold;
and combining the first distance and the second distance to construct the descriptor loss function.
2. The method of claim 1, wherein the residual network does not include a max pooling layer, and the widths and heights of the extracted multi-scale feature maps and the input image satisfy:

h = h_m × 2^m
w = w_m × 2^m

where h denotes the height of the input image, w the width of the input image, h_m the height of the feature map of the m-th convolution, w_m the width of the feature map of the m-th convolution, and m the convolution stage index.
3. The method for extracting interest points and descriptors based on joint feature recombination and feature mixing according to claim 2, wherein the corresponding position of each pixel point of the input image in the multi-scale feature maps is calculated as:

p^(m) = p / 2^m = [x / 2^m, y / 2^m]^T

where p^(m) denotes the position of the pixel point in the feature map of the m-th convolution, T denotes transposition, and x and y denote the coordinates of the pixel point in the input image.
4. The method for extracting interest points and descriptors based on joint feature recombination and feature mixing according to claim 1, wherein the preset multiple image transformation methods include translation transformation, scale transformation, in-plane rotation transformation, and symmetric perspective transformation within preset ranges.
5. The method for extracting interest points and descriptors based on joint feature recombination and feature mixing according to claim 3, wherein the descriptor loss function is:

l_i = max(0, μ + d⁺(i) − d⁻(i))

L_triplet(D, D′, V) = ( Σ_{i=1}^{n} v_i · l_i ) / ( Σ_{j=1}^{n} v_j )

where L_triplet(D, D′, V) denotes the total loss of the descriptors corresponding to the pixel points of the first image and the second image, D denotes the set of descriptors of the pixel points of the first image, D′ denotes the set of descriptors of the pixel points of the second image, V denotes the set of masks of the pixel points of the second image under the transformation, l_i denotes the loss value corresponding to the descriptor d_{p_i} of the i-th pixel point of the first image, v_i and v_j denote the i-th and j-th masks in V, n denotes the number of pixel points in the second image, μ denotes a preset margin, d⁺(i) denotes the first distance, and d⁻(i) denotes the second distance.
6. The method for extracting interest points and descriptors based on joint feature recombination and feature mixing according to claim 5, wherein the loss function of the extraction network corresponding to the interest points is a weighted cross entropy loss function, and its loss value is calculated as:

L_bce(S, Y) = − (1 / (h·w)) Σ_{u,v} [ λ · Y_{u,v} · log S_{u,v} + (1 − Y_{u,v}) · log(1 − S_{u,v}) ]

where S denotes the score map, Y denotes the labelled interest points corresponding to the score map, λ denotes a preset ratio, u and v denote the position of a pixel point in the sample image, Y_{u,v} denotes the interest point label at that position, S_{u,v} denotes the score at that position in the score map, and L_bce(S, Y) denotes the weighted cross entropy loss value.
7. An interest point and descriptor extraction system based on joint feature recombination and feature mixing, characterized by comprising a feature extraction module, an interest point acquisition module and a descriptor extraction module;
the feature extraction module is configured to acquire an image to be processed as the input image and extract the multi-scale feature maps of the image through a feature extraction network, the feature extraction network being constructed based on a residual network;
the interest point acquisition module is configured to perform pixel recombination on each feature map, obtain a score map through convolution and nonlinear mapping, and obtain interest points through non-maximum suppression based on the score map;
the descriptor extraction module is configured to acquire and concatenate the feature vectors of each pixel point of the input image at its corresponding positions in the multi-scale feature maps, and to filter and compress the concatenated feature vectors through N fully connected layers to obtain the descriptor corresponding to each pixel point, wherein N is a positive integer;
wherein the loss function of the extraction network corresponding to the descriptors is constructed as follows:
taking an acquired sample image as a first image; performing a compound transformation on the first image using multiple preset image transformation methods, and taking the synthesized transformed image as a second image;
uniformly sampling M pixel points in the first image as first pixel points, and sampling the pixel points corresponding to the first pixel points in the second image as second pixel points; extracting the descriptors of the first pixel points and the second pixel points;
calculating the distance between the descriptor of each first pixel point and the descriptor of its second pixel point as a first distance; calculating the distance between the descriptor of each first pixel point and the descriptor of a third pixel point as a second distance; the third pixel point is the pixel point of the second image, other than the second pixel point, whose descriptor distance to the first pixel point is minimum and whose spatial distance to the second pixel point is greater than a set threshold;
and combining the first distance and the second distance to construct the descriptor loss function.
8. A storage device in which a plurality of programs are stored, wherein the programs are adapted to be loaded and executed by a processor to implement the method for extracting interest points and descriptors based on joint feature recombination and feature mixing according to any one of claims 1-6.
9. A processing device, comprising a processor and a storage device, the processor being adapted to execute programs and the storage device being adapted to store a plurality of programs, characterized in that the programs are adapted to be loaded and executed by the processor to implement the method for extracting interest points and descriptors based on joint feature recombination and feature mixing according to any one of claims 1-6.
CN202010444152.6A 2020-05-22 2020-05-22 Interest point and descriptor extraction method based on joint feature recombination and feature mixing Withdrawn CN111612075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010444152.6A CN111612075A (en) 2020-05-22 2020-05-22 Interest point and descriptor extraction method based on joint feature recombination and feature mixing


Publications (1)

Publication Number Publication Date
CN111612075A true CN111612075A (en) 2020-09-01

Family

ID=72205282


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232361A (en) * 2020-10-13 2021-01-15 国网电子商务有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN112232361B (en) * 2020-10-13 2021-09-21 国网电子商务有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN113656698A (en) * 2021-08-24 2021-11-16 北京百度网讯科技有限公司 Training method and device of interest feature extraction model and electronic equipment
CN113656698B (en) * 2021-08-24 2024-04-09 北京百度网讯科技有限公司 Training method and device for interest feature extraction model and electronic equipment
CN114693940A (en) * 2022-03-22 2022-07-01 电子科技大学 Image description method for enhancing feature mixing resolvability based on deep learning
CN114693940B (en) * 2022-03-22 2023-04-28 电子科技大学 Image description method with enhanced feature mixing decomposability based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200901