CN111612075A

CN111612075A - Interest point and descriptor extraction method based on joint feature recombination and feature mixing

Info

Publication number: CN111612075A
Application number: CN202010444152.6A
Authority: CN
Inventors: 徐士彪; 张宇阳; 孟维亮; 张吉光; 张晓鹏
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2020-05-22
Filing date: 2020-05-22
Publication date: 2020-09-01

Abstract

The invention belongs to the field of computer vision, and particularly relates to an interest point and descriptor extraction method, system and device based on joint feature recombination and feature mixing, aiming at solving the problem of low detection and extraction precision of the existing interest point and descriptor extraction method. The system method comprises the following steps: acquiring an image to be extracted as an input image, and extracting a multi-scale feature map of the image through a feature extraction network; performing pixel recombination on each characteristic graph, obtaining a score graph through convolution and nonlinear mapping, and obtaining interest points through non-maximum suppression; and acquiring the characteristic vectors of the pixel points in the input image at the corresponding positions of the multi-scale characteristic graph for connection, and filtering and compressing the connected characteristic vectors through a connection layer to obtain the descriptors corresponding to the pixel points. The invention improves the precision of detecting and extracting the interest points and the descriptors.

Description

Interest point and descriptor extraction method based on joint feature recombination and feature mixing

Technical Field

The invention belongs to the field of computer vision, and particularly relates to a method, a system and a device for extracting interest points and descriptors based on joint feature recombination and feature mixing.

Background

It has long been the basis of many vision-based applications, such as visual localization and image retrieval, to use image interest points and their local feature descriptors to find the correct correspondence between images. However, with the rapid development of the industry, these applications need to deal with more complex and difficult scenarios. Since image interest point detection and description are key components of these advanced algorithms, it is important to further improve the accuracy thereof.

Over the past two decades, a number of excellent algorithms have been proposed to solve the above problems. Both the traditional statistical-based and filtering-based methods and the deep learning-based method make a significant breakthrough. Especially, the accuracy of interest point detection and local feature description is greatly improved by deep learning-based algorithms such as SuperPoint, D2-net and R2D 2. However, previous approaches have focused primarily on a better deep learning-based paradigm to solve this problem, and have neglected network architecture design somewhat. The relatively good network architectures proposed in other visual applications, such as classification, object detection, segmentation, etc., are not suitable for image interest point detection or local feature description. Therefore, the invention provides a method, a system and a device for extracting interest points and descriptors based on joint feature recombination and feature mixing.

Disclosure of Invention

In order to solve the above problems in the prior art, that is, to solve the problem of low detection and extraction accuracy of the existing interest point and descriptor extraction method, a first aspect of the present invention provides an interest point and descriptor extraction method based on joint feature recombination and feature mixing, including:

step S100, acquiring an image to be extracted as an input image, and extracting a multi-scale feature map of the image through a feature extraction network; the feature extraction network is constructed based on a residual error network;

step S200, carrying out pixel recombination on each characteristic diagram, and obtaining a score map Scoremap through convolution and nonlinear mapping; obtaining interest points through non-maximum inhibition based on the score map;

step S300, acquiring and connecting feature vectors of the pixel points in the input image at corresponding positions of the multi-scale feature map; filtering and compressing the connected feature vectors through N connecting layers to obtain descriptors corresponding to all pixel points; wherein N is a positive integer;

the construction method of the loss function of the extraction network corresponding to the descriptor comprises the following steps:

based on the acquired sample image, as a first image; performing compound transformation on the first image by multiple preset image transformation methods, and synthesizing a new image serving as a second image after transformation;

uniformly sampling M pixel points in the first image to serve as first pixel points, and sampling pixel points corresponding to the first pixel points in the second image to serve as second pixel points; extracting descriptors of the first pixel points and the second pixel points;

calculating the distance between the descriptor of the first pixel point and the descriptor of the second pixel point as a first distance; calculating the distance between each first pixel point and the descriptor of the third pixel point as a second distance; the third pixel point is the pixel point which is in the second image except the second pixel point, has the minimum descriptor distance with the first pixel point and has the distance with the second pixel point smaller than the set threshold value;

and combining the first distance and the second distance to construct a descriptor loss function.

In some preferred embodiments, the residual network does not include a maximum pooling layer, and the extracted multi-scale feature map and the input image have widths and heights:

h＝h_m×2^m

w＝w_m×2^m

where h denotes the height of the input image, w denotes the width of the input image, h_mHigh, w of the feature map representing the mth convolution_mThe width of the feature map representing the mth convolution is wide, and m represents the number of convolutions.

In some preferred embodiments, the method for calculating the corresponding position of each pixel point in the input image in the multi-scale feature map includes:

p^(m)＝p/2^m＝[x/2^m,y/2^m]^T

wherein p is^(m)And the position of a pixel point in the characteristic diagram of the mth convolution is represented, T represents transposition, and x and y represent coordinates of the pixel point in the input image.

In some preferred embodiments, the preset multiple image transformation methods include translation transformation, scale transformation, in-plane rotation transformation, and symmetric perspective transformation within a preset range.

In some preferred embodiments, the descriptor loss function is:

wherein L is_triplet(D, D ', V) represents the total loss of the descriptors corresponding to each pixel of the first image and the second image, D represents the set of the descriptors corresponding to each pixel of the first image, D' represents the set of the descriptors corresponding to each pixel of the second image, V represents the mask set corresponding to each pixel of the second image during conversion,

descriptor representing a pixel in a first image

Corresponding loss value, v_i、v_jRepresenting the ith, j mask in V, n representing the number of pixel points in the second image,

the first distance is represented as a function of,

representing the second distance.

In some preferred embodiments, the loss function of the extraction network corresponding to the interest point is a weighted cross entropy loss function, and the loss value is calculated by:

wherein S represents a score map, Y represents a marked interest point corresponding to the score map, λ represents a preset ratio, u and v represent positions of pixel points in the sample image, and Y represents the position of the pixel point in the sample image_u,vIndicating the location of the point of interest, S_u,vIndicating the position of the pixel point in the score map, L_bce(S, Y) represents a weighted cross-entropy loss value.

The second aspect of the invention provides an interest point and descriptor extraction system based on joint feature recombination and feature mixing, which comprises a feature extraction module, an interest point acquisition module and a descriptor extraction module;

the characteristic extraction module is configured to acquire an image to be extracted as an input image and extract a multi-scale characteristic diagram of the image through a characteristic extraction network; the feature extraction network is constructed based on a residual error network;

the interest point acquisition module is configured to perform pixel recombination on each feature map, and obtain a Score map through convolution and nonlinear mapping; obtaining interest points through non-maximum inhibition based on the score map;

the descriptor extraction module is configured to acquire and connect feature vectors of the pixel points in the input image at corresponding positions of the multi-scale feature map; filtering and compressing the connected feature vectors through N connecting layers to obtain descriptors corresponding to all pixel points; wherein N is a positive integer;

In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being loaded and executed by a processor to implement the above-mentioned interest point and descriptor extraction method based on joint feature reorganization and feature mixing.

In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described point of interest, descriptor extraction method based on joint feature reorganization and feature mixing.

The invention has the beneficial effects that:

the invention improves the precision of detecting and extracting the interest points and the descriptors. The method utilizes the residual error network to extract the features, realizes the interest point detection through the feature recombination after the features are extracted, and improves the robustness of the interest point detection. And acquiring the characteristic vectors of the pixel points in the input image at the corresponding positions of the multi-scale characteristic graph for connection, and filtering and compressing the connected characteristic vectors through a connection layer to finish the extraction of the characteristic descriptors.

In addition, supervision information of quick and effective training is established based on random uniform sampling in the training process, and loss functions of interest point detection and feature descriptors are constructed. Training is carried out based on the constructed loss function, and the precision of interest point detection and descriptor extraction is improved.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.

FIG. 1 is a schematic flow chart of a method for extracting interest points and descriptors based on joint feature reorganization and feature mixing according to an embodiment of the present invention;

FIG. 2 is a block diagram of a point of interest, descriptor extraction system based on joint feature reorganization and feature mixing according to an embodiment of the present invention;

FIG. 3 is a detailed framework diagram of a method for extracting interest points and descriptors based on joint feature reorganization and feature mixing according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating the result of feature matching based on interest points and descriptors according to an embodiment of the present invention;

FIG. 5 is a schematic block diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The method for extracting interest points and descriptors based on joint feature recombination and feature mixing, as shown in fig. 1, comprises the following steps:

In order to more clearly explain the method for extracting interest points and descriptors based on joint feature reorganization and feature mixing, the following describes in detail the steps of an embodiment of the method of the present invention with reference to the drawings.

Step S100, acquiring an image to be extracted as an input image, and extracting a multi-scale feature map of the image through a feature extraction network; the feature extraction network is constructed based on a residual error network.

In the invention, the interest point and descriptor extraction method based on combined feature recombination and feature mixing detects and extracts the interest point and the descriptor through an interest point detector and a descriptor extraction network. When the interest point detection and the description point extraction are carried out, firstly, the feature extraction is carried out on the acquired image.

In the present embodiment, the multi-scale features of the input image are extracted through a feature extraction network constructed based on a residual error network. The feature extraction process is a feed-forward calculation of the trunk residual error network, and a feature map with a plurality of scales is generated, wherein the step length of the scale is 2. Convolutional layers generate output maps of the same size, then these layers belong to the same network phase. Only the output of the last element of each network stage is used by the other modules.

In the present invention, the feature extraction network using the residual error network ResNet, which is slightly different from the original residual error network, removes the largest pooling layer from the network because it would make the scale of the top feature map unsuitable for feature reorganization. therefore, it is divided into four stages (i.e., four convolutions) in the network architecture, as shown in FIG. 3, Conv denotes convolutional layers, 7 × 7, and 3 × 3 denotes convolutional layerThe size of the product kernel, considering the width and height of the input image as h and w, the corresponding feature map extracted at each stage can be expressed as

C_mFeature map representing the m-th convolution acquisition, d_mDepth, h, of the feature map representing the mth convolution_mHigh, w of the feature map representing the mth convolution_mThe width of the profile representing the mth convolution, where m ∈ {1,2,3,4} and d ∈ {64,128,256,512_mAnd the size of the input image I both satisfy the constraint as in equation (1) (2):

h＝h_m×2^m(1)

w＝w_m×2^m(2)

step S200, carrying out pixel recombination on each characteristic diagram, and obtaining a score map Scoremap through convolution and nonlinear mapping; and obtaining interest points through non-maximum inhibition based on the score map.

In this embodiment, the multi-scale Feature map extracted in step S100 is reconstructed, for example, a Feature reconstruction Module (Feature fragment Module) in fig. 3 is specifically as follows:

extracting each feature map from the image by a pixel rebinning operation

Switch over to

And does not add extra memory resources to the memory,

the characteristic diagram after the pixel recombination is shown, and Shuffled in FIG. 3 shows the recombination treatment.

All the converted feature maps are processed as a whole and input into a single 3 × 3 convolutional layer to be added with a Sigmoid activation function to generate a Score map, which is expressed as S ∈ R^h×w。

The whole process can be described in an abstract way as shown in formula (3):

S＝FSM(C₁,C₂,C₃,C₄) (3)

FSM, Feature Shuffle Module, represents the Feature recombination.

In the inference process, non-maximum suppression is first applied to the predicted score map S. Then, when the response value of a certain pixel in S exceeds a fixed detection threshold α, the current point is marked as a point of interest.

in this embodiment, in order to fully utilize the multi-layer semantics, the present invention proposes a new Feature fusion, i.e. a Feature Blend generation Module (Feature Blend Module) in fig. 3, which can extract the most discriminating information from the multi-layer Feature vector to construct the descriptor. The method comprises the following specific steps:

giving a pixel point p ═ x, y on the input image]^TX, y denote coordinates, T denotes a transpose, which is found in each feature map

The calculation method of (2) is shown in formula (4):

p^(m)＝p/2^m＝[x/2^m,y/2^m]^T(4)

wherein p is^(m)And the position of a pixel point in the characteristic diagram of the mth convolution is represented, and T represents transposition.

Feature vector corresponding to each pixel point

It can be obtained by bilinear interpolation from multi-scale feature maps. After all the feature vectors are generated to belong to the same point, the feature fusion module (FBM) of the present invention concatenates them into a single feature vector, which is expressed as formula (5):

C_cat(p)＝Cat(C₁(p⁽¹⁾),C₂(p⁽²⁾),C₃(p⁽³⁾),C₄(p⁽⁴⁾)) (5)

wherein Cat represents a linkage, C_catRepresents the connected feature vector, P ═ P₁,p₂,...,p_nIndicates a pixel point.

Then filtering C by using two fully-communicated layers_cat(p) garbage and compress the remaining valid semantics to generate the final descriptor d_p∈R^dimWhere dim is 128. The above whole process can be summarized as formula (6):

D＝FBM(C₁,C₂,C₃,C₄,P) (6)

wherein D ═ { D ═ D₁,d₂,...,d_nDenotes extracted descriptor binding, d₁,d₂,...,d_nAnd a descriptor representing the correspondence of each pixel point.

Based on the obtained interest points and descriptors, feature matching can be performed, and the matching result is shown in fig. 4.

In addition, in order to further improve the precision of interest point detection and descriptor extraction, the invention establishes the supervision information of quick and effective training based on random uniform sampling and defines the loss function. The method comprises the following specific steps:

in particular, for a trained sample image I, a series of simple transformations, including translation, scale, in-plane rotation, and symmetric perspective deformation within a predetermined range, are first sampled^3×3. Applying this transformation T to the sample image I, a new image I 'is synthesized and the image pair I, I' with the known transformation T is constructed.

By transforming T between image pairs, an infinite number of correspondences can be found. However, this is resource consuming and does not involveToo many correspondences need to be used to train the model. Instead, a random uniform sampling strategy is employed to obtain a small number of unbiased correspondences. Specifically, n points p ═ p are uniformly sampled on the original image I₁,p₂,...,p_nAnd then generating a corresponding point p ' ═ { p ' based on the transform T on the synthetic image I '₁,p’₂,...,p’_n}. The whole process is shown in the formulas (7) and (8):

P,T＝RandomSample(·) (7)

P',V＝Transform(P) (8)

where RandomSample denotes random uniform sampling, Transform denotes image space transformation, V ∈ BⁿIs a representation point p '═ p'₁,p’₂,...,p’_nB denotes a set of mask boundaries, since not all transform points lie on the image boundaries, and points outside these boundaries are invalid.

The sample points P and their corresponding points P0 may be used to construct a descriptor loss function. It is noted that the number of samples n may be chosen to be small enough not to affect the accuracy of the model, to speed up the training process. The present invention sets n to 400 for all experiments.

Given a set of pixel points p ═ { p ] in image I₁,p₂,...,p_nGet the corresponding set of pixel points p '═ { p'₁,p’₂,...,p’_nAnd obtaining descriptors D and D' of the pixel points respectively. Then, for the descriptor

The distance between the descriptor and the corresponding pixel point, which is referred to as the positive sample distance in the present invention for short, is calculated as shown in formula (9):

description of pixels in a representation image ISymbol

And descriptors of pixels in I

The distance of (c).

Wherein

Is that

Is divided by p 'in I'_iThe external pixel points are screened, and descriptors corresponding to the screened points and descriptors of the pixel points in the image I are calculated

The distance (d) is referred to as a negative sample distance in the present invention, as shown in formula (10):

wherein the content of the first and second substances,

p'_kdenotes l 'except p'_iThe other pixels are obtained first, namely descriptors of the pixels in the I

And descriptors of pixels in I

Minimum distance and p'_kAnd p'_iIs greater than a threshold value and p'_kThe point within the boundary that satisfies these requirements is denoted as p'_k ^*。

In the present invention, the threshold θ is 16 for ensuring

And p'_iThe spatial distance between them exceeds a certain value. In addition, the selected negative example

It needs to be located within the image boundaries, otherwise it is invalid.

Give a

And

the present invention defines the triplet distance descriptor loss function as shown in equation (11):

wherein the content of the first and second substances,

descriptor for representing a pixel in an image I

Corresponding loss value.

The total penalty constructed from descriptors D and D' is shown in equation (12):

wherein L is_triplet(D, D ', V) represents I, I ' total loss of descriptor corresponding to each pixel point, D represents set of descriptor corresponding to each pixel point in I, D ' represents set of descriptor corresponding to each pixel point in I ', V represents mask set corresponding to each pixel point in I ' during transformation,

descriptor for representing a pixel in I

Corresponding loss value, v_i、v_jAnd (3) representing the ith and j masks in the V, and n representing the number of pixel points in the I'.

In combination with the prediction score map S corresponding to the sample image and the interest point label Y corresponding to the sample image, the weighted cross entropy loss of feature point detection is shown in formula (13):

wherein S represents a score map, Y represents a marked interest point corresponding to the score map, u and v represent positions of pixel points in the sample image, and Y represents the position of the pixel point in the sample image_u,vIndicating the location of the point of interest, S_u,vIndicating the position of the pixel point in the score map, L_bce(S, Y) represents a weighted cross-entropy loss value. λ represents the ratio between the preset balanced positive and negative samples, since the number of positive samples is much smaller than the number of negative samples. The inventive setting λ is preferably set to 200.

Based on the loss function of the key point detector and the loss function of the descriptor corresponding extraction network, a total loss function is obtained, as shown in formula (14):

L_total(S,S',D,D'；Y,Y',V)＝L_bce(S,Y)+L_bce(S',Y')+L_triplet(D,D',V) (14)

wherein S and S 'are respectively score maps of the images I and I', and Y 'are real value labels of interest points of the images I and I'. It is noted that the random transformation from (I, Y) to (I ', Y ') and the random correspondence sampling between the image pairs I, I ' are processed in parallel with the training program.

During the training process of the interest point detector and descriptor extraction network, the invention uses the adam optimizer to optimize the optimization parameters β₁＝0.9β₂＝0.999l_r0.001 and 10 weight decapay^-4The training image size is set to 240 × 320 and the training batch is set to 32 the entire training process is typically completed in 15 cycles during the model evaluation process, the non-maximum rejection radius is set to 4 pixels and the detection threshold α is 0.9 to produce reliable points of interest.

A second embodiment of the present invention provides a system for extracting interest points and descriptors based on joint feature reorganization and feature mixing, as shown in fig. 2, including: the system comprises a feature extraction module 100, an interest point acquisition module 200 and a descriptor extraction module 300;

the feature extraction module 100 is configured to acquire an image to be extracted as an input image, and extract a multi-scale feature map of the image through a feature extraction network; the feature extraction network is constructed based on a residual error network;

the interest point obtaining module 200 is configured to perform pixel recombination on each feature map, and obtain a Score map through convolution and nonlinear mapping; obtaining interest points through non-maximum inhibition based on the score map;

the descriptor extraction module 300 is configured to obtain and connect feature vectors of the pixel points in the input image at corresponding positions of the multi-scale feature map; filtering and compressing the connected feature vectors through N connecting layers to obtain descriptors corresponding to all pixel points; wherein N is a positive integer;

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.

It should be noted that, the interest point and descriptor extraction system based on joint feature reorganization and feature mixing provided in the foregoing embodiment is only illustrated by dividing the foregoing functional modules, and in practical applications, the foregoing function allocation may be completed by different functional modules according to needs, that is, the modules or steps in the embodiments of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.

A storage device according to a third embodiment of the present invention stores therein a plurality of programs, which are adapted to be loaded by a processor and to implement the above-described point of interest and descriptor extraction method based on joint feature reorganization and feature mixing.

A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described point of interest, descriptor extraction method based on joint feature reorganization and feature mixing.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.

Referring now to FIG. 5, there is illustrated a block diagram of a computer system suitable for use as a server in implementing embodiments of the method, system, and apparatus of the present application. The server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 5, the computer system includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for system operation are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An Input/Output (I/O) interface 505 is also connected to bus 504.

The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output section 507 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN (Local area network) card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. A driver 510 is also connected to the I/O interface 405 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 501. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.

The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims

1. A method for extracting interest points and descriptors based on joint feature recombination and feature mixing is characterized by comprising the following steps:

2. The method of claim 1, wherein the residual network does not include a maximum pooling layer, and the width and height of the extracted multi-scale feature map and the input image are:

h＝h_m×2^m

w＝w_m×2^m

3. The method for extracting interest points and descriptors based on joint feature reconstruction and feature mixing according to claim 2, wherein the method for calculating the corresponding positions of the pixel points in the input image in the multi-scale feature map comprises:

p^(m)＝p/2^m＝[x/2^m,y/2^m]^T

4. The method for extracting interest points and descriptors based on joint feature reconstruction and feature mixing as claimed in claim 1, wherein the preset multiple image transformation methods include translation transformation, scale transformation, in-plane rotation transformation, and symmetric perspective transformation within a preset range.

5. The joint feature reorganization and feature mixing-based interest point and descriptor extraction method according to claim 3, wherein the descriptor loss function is:

descriptor representing a pixel in a first image

the first distance is represented as a function of,

representing the second distance.

6. The method for extracting interest points and descriptors based on joint feature reconstruction and feature mixing as claimed in claim 5, wherein the loss function of the extraction network corresponding to the interest points is a weighted cross entropy loss function, and the loss value is calculated by:

7. An interest point and descriptor extraction system based on joint feature reorganization and feature mixing, comprising: the system comprises a feature extraction module, an interest point acquisition module and a descriptor extraction module;

8. A storage device having stored therein a plurality of programs, wherein said program applications are loaded and executed by a processor to implement the joint feature reorganization and feature mixing based point of interest, descriptor extraction method of any of claims 1-6.

9. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that said program is adapted to be loaded and executed by a processor to implement the point of interest, descriptor extraction method based on joint feature reorganization and feature mixing according to any of claims 1-6.