CN114140623A - Image feature point extraction method and system - Google Patents

Image feature point extraction method and system

Info

Publication number
CN114140623A
Authority
CN
China
Prior art keywords
feature
image
feature point
training
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111503824.7A
Other languages
Chinese (zh)
Inventor
汪志涛
胡健萌
卢勇
唐德军
唐崇伟
谢勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Intelligent Transportation Co ltd
Original Assignee
Shanghai Intelligent Transportation Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Intelligent Transportation Co ltd filed Critical Shanghai Intelligent Transportation Co ltd
Priority to CN202111503824.7A priority Critical patent/CN114140623A/en
Publication of CN114140623A publication Critical patent/CN114140623A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention relates to an image feature point extraction method and system. An initial feature point extraction model is trained with a training data set to obtain a trained feature point extraction model, and the image to be processed is then input into the trained feature point extraction model to extract its feature points. Because a multi-scale convolution layer and a deformable convolution layer are introduced into the feature point extraction model, multi-scale image feature information can be fused and local image features can be described more accurately, overcoming the lack of multi-scale information fusion and the risk of damaging the integrity of local textures. The method therefore achieves accurate and robust feature point extraction and has significant theoretical and practical value for research on and practical application of image feature extraction.

Description

Image feature point extraction method and system
Technical Field
The invention relates to the technical field of image processing, in particular to an image feature point extraction method and system based on multi-scale convolution and deformable convolution.
Background
With the rapid development of computer hardware and the arrival of the big data era, the field of computer vision has developed vigorously. Image feature point detection is a low-level task in computer vision that underpins many higher-level tasks: accurate and robust feature point detection improves the accuracy of tasks such as three-dimensional reconstruction, visual SLAM and image registration. Therefore, how to detect feature points in images accurately and robustly in complex environments is a difficult problem that must be solved before computer vision technology can be applied in practice.
With the rise of deep learning, convolutional neural networks have been widely applied in computer vision, and how to detect, describe and match feature points with deep learning has become a research hotspot. Based on the role deep learning plays in these studies, the methods can be divided into three categories: methods that train with feature points detected by traditional algorithms as supervision, methods that combine a self-supervised feature description network with manual screening of feature points, and methods that train the feature point detection and description networks simultaneously in a self-supervised manner. When features extracted by SIFT or other traditional strategies are used as supervision, the upper limit of the feature point detection network is bounded by the manually designed detector used for supervision, so better results cannot be achieved; training the network in a self-supervised manner has therefore become the mainstream approach for feature point extraction with deep learning. During feature point extraction, the scale changes greatly between layers of a convolutional neural network, yet the input feature map can only be convolved with a convolution kernel of fixed size, so the fusion of multi-scale information is lacking; meanwhile, because the local texture shapes of an image are irregular, the convolution layers may introduce irrelevant information and damage the integrity of the local texture. Therefore, how to fuse multi-scale image feature information and better describe local image features has become the key to improving the performance of feature point extraction networks.
Based on this, an accurate and robust feature point extraction method and system are needed.
Disclosure of Invention
The invention aims to provide an image feature point extraction method and system, which can fuse multi-scale image feature information and better describe image local features by introducing a multi-scale convolution layer and a deformable convolution layer into a feature point extraction model, thereby realizing accurate and robust feature point extraction.
In order to achieve the purpose, the invention provides the following scheme:
an image feature point extraction method, comprising:
constructing an initial characteristic point extraction model; the initial feature point extraction model comprises a feature extraction module; the feature extraction module comprises a multi-scale convolutional layer and a deformable convolutional layer;
training the initial feature point extraction model by using a training data set to obtain a trained feature point extraction model; the training data set comprises a plurality of images for training;
and inputting the image to be processed into the trained feature point extraction model, and extracting the feature points of the image to be processed.
An image feature point extraction system, the extraction system comprising:
the model construction module is used for constructing an initial feature point extraction model; the initial feature point extraction model comprises a feature extraction module; the feature extraction module comprises a multi-scale convolutional layer and a deformable convolutional layer;
the training module is used for training the initial feature point extraction model by utilizing a training data set to obtain a trained feature point extraction model; the training data set comprises a plurality of images for training;
and the extraction module is used for inputting the image to be processed into the trained feature point extraction model and extracting the feature points of the image to be processed.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides an image feature point extraction method and system. And training the initial characteristic point extraction model by using a training data set to obtain a trained characteristic point extraction model. And inputting the image to be processed into the trained feature point extraction model, so as to extract the feature points of the image to be processed. The multi-scale convolution layer and the deformable convolution layer are introduced into the feature point extraction model, so that multi-scale image feature information can be fused, local features of the image can be better described, the problems that fusion of the multi-scale information is lacked and the integrity of local textures can be damaged are solved, accurate and robust feature point extraction is achieved, and important theoretical and practical values are achieved for research and practical application of image feature extraction.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the extraction method provided in embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the extraction method provided in embodiment 1 of the present invention;
FIG. 3 is a network structure diagram of the feature point extraction model provided in embodiment 1 of the present invention;
FIG. 4 is a network structure diagram of the convolution attention sub-module provided in embodiment 1 of the present invention;
FIG. 5 is a network structure diagram of the coordinate attention sub-module provided in embodiment 1 of the present invention;
FIG. 6 is a system block diagram of the extraction system provided in embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
The invention aims to provide an image feature point extraction method and system, which can fuse multi-scale image feature information and better describe image local features by introducing a multi-scale convolution layer and a deformable convolution layer into a feature point extraction model, thereby realizing accurate and robust feature point extraction.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1:
This embodiment provides an image feature point extraction method that is trained in a self-supervised manner and can achieve multi-scale, robust feature point extraction. As shown in fig. 1 and fig. 2, the extraction method of this embodiment comprises:
S1: constructing an initial feature point extraction model; the initial feature point extraction model comprises a feature extraction module; the feature extraction module comprises a multi-scale convolutional layer and a deformable convolutional layer;
S2: training the initial feature point extraction model by using a training data set to obtain a trained feature point extraction model; the training data set comprises a plurality of images for training;
S3: inputting the image to be processed into the trained feature point extraction model, and extracting the feature points of the image to be processed.
The image to be processed in this embodiment may be a natural image obtained by photographing the same natural scene from different angles. Specifically, it may include a first natural image captured by a camera at a first angle and a second natural image captured by a camera at a second angle, where the two images partially overlap. In S3, both images to be processed are input into the trained feature point extraction model to obtain the first feature points of the first natural image and the second feature points of the second natural image, respectively. The first and second feature points are then matched, a transformation matrix between the first and second natural images is calculated from the successfully matched feature point pairs, and the two images are stitched according to the transformation matrix to obtain a wide-angle image of the natural scene, as sketched below.
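The following sketch (not part of the original patent text) illustrates this stitching application, assuming OpenCV and NumPy are available and that the feature point coordinates and 128-dimensional descriptors of both natural images have already been produced by the trained model. The brute-force matcher, the RANSAC-based homography estimation and the fixed output canvas size are stand-in assumptions for the matching, transformation-matrix and stitching steps.

```python
import cv2
import numpy as np

def stitch(img1, pts1, desc1, img2, pts2, desc2):
    # Match the 128-d descriptors of the two views with a brute-force matcher.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(desc1.astype(np.float32), desc2.astype(np.float32))
    dst = np.float32([pts1[m.queryIdx] for m in matches])  # points in the first natural image
    src = np.float32([pts2[m.trainIdx] for m in matches])  # matched points in the second image
    # Transformation matrix between the two views, estimated from the matched pairs.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    h, w = img1.shape[:2]
    pano = cv2.warpPerspective(img2, H, (2 * w, h))  # bring the second view into the first frame
    pano[:h, :w] = img1                              # overlay the first view
    return pano                                      # wide-angle image of the natural scene
```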
In S1, the idea of multi-scale convolution is that, within the same convolution layer, feature extraction and fusion are performed on different channels with convolution kernels of different strides or dilation rates. The idea of deformable convolution is to additionally predict the offsets of the convolution sampling points inside the convolution layer, so that features of irregular regions can be extracted. By introducing the multi-scale convolution layer and the deformable convolution layer into the feature point extraction model, multi-scale image feature information can be fused and local image features can be described better, which further improves the model's ability to extract multi-scale features and features of irregular image regions and realizes accurate and robust feature point extraction. A minimal sketch of the multi-scale idea is given below.
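The following minimal PyTorch sketch (not part of the original patent text) illustrates the multi-scale convolution idea: the input channels are split into groups and each group is convolved with a different dilation rate before the results are concatenated. The concrete group split and dilation rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Minimal sketch of a multi-scale convolution layer: channels are split
    into groups and each group is convolved with a different dilation rate,
    so features at several receptive-field scales are fused in one layer."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        assert channels % len(dilations) == 0
        group = channels // len(dilations)
        self.branches = nn.ModuleList([
            nn.Conv2d(group, group, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])

    def forward(self, x):
        chunks = torch.chunk(x, len(self.branches), dim=1)   # split the channels into groups
        outs = [branch(c) for branch, c in zip(self.branches, chunks)]
        return torch.cat(outs, dim=1)                        # fuse the scales back together
```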
The initial feature point extraction model of this embodiment further includes a feature point detection module and a descriptor extraction module, both of which are connected to the feature extraction module. The feature extraction module extracts the features of the input image. The feature point detection module determines, based on these features, the probability that each pixel in the input image is a feature point; the pixels are sorted in descending order of probability and the first N pixels are selected as feature points, which determines the coordinates of the feature points of the input image. The descriptor extraction module determines a feature descriptor for each pixel in the input image based on the features. Because this embodiment contains both a feature point detection module and a descriptor extraction module, it can output the feature point coordinates and their feature descriptors simultaneously; the two complement each other, so the model finally obtains a more accurate and robust result.
Specifically, the feature point detection module comprises a feature point repetition sub-module and a feature point confidence sub-module, both connected to the feature extraction module. The feature point repetition sub-module calculates, based on the features of the input image, the repetition degree of the local feature region corresponding to each pixel within the image range. The feature point confidence sub-module calculates, based on the same features, the confidence that each pixel in the input image is a feature point. The probability that a pixel is a feature point is the product of the repetition degree of its local feature region within the image range and the confidence that it is a feature point.
More specifically, as shown in fig. 3, the feature extraction module of the present embodiment further includes a first convolution layer (3 × 3Conv1), a second convolution layer (3 × 3Conv2), and a third convolution layer (3 × 3Conv3), where the first convolution layer, the second convolution layer, the third convolution layer, the multi-scale convolution layer, and the deformable convolution layer are sequentially connected. The feature point repetition sub-module includes a fourth convolution layer (1 × 1Conv) and a first L2 regularization layer connected in series. The feature point confidence submodule includes a fifth convolution layer (1 × 1Conv) and a second L2 regularization layer connected in series. The descriptor extraction module includes a third L2 regularization layer.
As an optional implementation manner, the feature extraction module of this embodiment further includes a convolution attention submodule and a coordinate attention submodule. The convolution attention submodule is connected between the multi-scale convolution layer and the deformable convolution layer, the multi-scale convolution layer is connected with the convolution attention submodule through the first connecting channel and the second connecting channel respectively, and the convolution attention submodule is further connected with the deformable convolution layer. The coordinate attention submodule is connected between the deformable convolution layer and the output end of the feature extraction module, the deformable convolution layer is connected with the coordinate attention submodule through a third connecting channel and a fourth connecting channel respectively, and the coordinate attention submodule is further connected with the output end of the feature extraction module. A convolution attention submodule and a coordinate attention submodule are additionally arranged in the feature point extraction model, so that the position accuracy and the descriptor distinguishability of feature point extraction can be further improved.
Fig. 4 shows the network structure of the convolution attention sub-module. The convolution attention sub-module comprises, connected in sequence, a global average pooling layer, a global max pooling layer, a fully connected layer, a ReLU activation function, a 1 × 1 convolution layer, a Sigmoid activation function and a first splicing layer, followed by a channel-domain average pooling layer, a 7 × 7 convolution layer and a second splicing layer. The input feature is also fed directly into the first splicing layer; the input feature is the output feature of the multi-scale convolution layer, the connecting channel between the input feature and the global average pooling layer is the first connecting channel, and the connecting channel between the input feature and the first splicing layer is the second connecting channel. The first splicing layer combines the features input to it, and its output is fed directly into the second splicing layer; the second splicing layer combines the features input to it, and its output is the output feature of the convolution attention sub-module. A hedged sketch of this sub-module is given below.
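The following sketch is one plausible CBAM-style reading of the description above, not the patent's own code: interpreting the two splicing layers as element-wise re-weighting of the input feature, using a shared bottleneck for the pooled channel statistics, and the reduction ratio are all assumptions.

```python
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    """Hedged sketch of the convolution attention sub-module: a channel-attention
    branch built from global average and max pooling followed by a shared
    bottleneck, and a spatial-attention branch built from a channel-domain
    average map and a 7x7 convolution."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared FC -> ReLU -> 1x1 conv bottleneck
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(1, 1, kernel_size=7, padding=3)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg = torch.mean(x, dim=(2, 3), keepdim=True)            # global average pooling
        mx = torch.amax(x, dim=(2, 3), keepdim=True)             # global max pooling
        channel_w = self.sigmoid(self.mlp(avg) + self.mlp(mx))   # channel attention weights
        x = x * channel_w                                        # first "splice": re-weight channels
        spatial_w = self.sigmoid(self.spatial(x.mean(dim=1, keepdim=True)))  # channel-domain avg pool + 7x7 conv
        return x * spatial_w                                     # second "splice": re-weight positions
```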
Fig. 5 shows the network structure of the coordinate attention sub-module. The coordinate attention sub-module comprises, connected in sequence, a pooling layer, a splicing layer, a convolution layer (Conv2d), a normalization layer (BatchNorm), a ReLU activation function, a channel separation layer, a convolution activation layer and a third splicing layer. The pooling layer comprises an X-direction average pooling layer and a Y-direction average pooling layer in parallel, both connected to the splicing layer. The convolution activation layer comprises two branches, each consisting of a convolution layer (Conv2d) and a Sigmoid activation function, and both Sigmoid activation functions are connected to the third splicing layer. The input feature is also connected directly to the third splicing layer; the input feature is the output feature of the deformable convolution layer, the connecting channel between the input feature and the pooling layer is the third connecting channel, and the connecting channel between the input feature and the third splicing layer is the fourth connecting channel. The third splicing layer combines the features input to it, and its output is the output feature of the coordinate attention sub-module, i.e. the feature of the input image extracted by the feature extraction module. A hedged sketch of this sub-module is given below.
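A corresponding sketch of the coordinate attention sub-module as described above; interpreting the third splicing layer as element-wise re-weighting, the reduction ratio and the use of 1 × 1 convolutions in the two branches are assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Hedged sketch of the coordinate attention sub-module: directional average
    pooling along X and Y, a shared Conv2d+BatchNorm+ReLU, a channel split, and
    two Conv2d+Sigmoid branches whose outputs re-weight the input feature map."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.shared = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True)
        )
        self.conv_h = nn.Conv2d(mid, channels, 1)   # branch for the Y (height) direction
        self.conv_w = nn.Conv2d(mid, channels, 1)   # branch for the X (width) direction
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        n, c, h, w = x.shape
        pool_h = x.mean(dim=3, keepdim=True)                      # Y-direction average pooling: N x C x H x 1
        pool_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # X-direction average pooling: N x C x W x 1
        y = self.shared(torch.cat([pool_h, pool_w], dim=2))       # splice along the spatial axis
        y_h, y_w = torch.split(y, [h, w], dim=2)                  # channel separation (split) layer
        a_h = self.sigmoid(self.conv_h(y_h))                      # N x C x H x 1 attention weights
        a_w = self.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # N x C x 1 x W attention weights
        return x * a_h * a_w                                      # third "splice": re-weight the input
```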
After an input image enters the feature point extraction model, it first passes through the 3 × 3 convolution layers with 32, 64 and 128 channels in turn to obtain a 128-dimensional feature map, and then passes through the multi-scale convolution layer combined with the channel attention mechanism (i.e. the multi-scale convolution layer and the convolution attention sub-module) and the deformable convolution layer combined with the coordinate attention mechanism (i.e. the deformable convolution layer and the coordinate attention sub-module) to obtain the features of the input image. These features pass through a 1 × 1 convolution layer and an L2 regularization layer, with the output of each pixel limited to 0–1, giving the confidence that each pixel is a feature point; they also pass through another 1 × 1 convolution layer and another L2 regularization layer, with the output of each pixel limited to 0–1, giving the repetition degree of the local feature region corresponding to each pixel within the image range. The probability that each pixel is a feature point is then obtained by multiplying its confidence by the repetition degree of its local feature region. The features of the input image also pass directly through an L2 regularization layer, producing a 128-dimensional feature descriptor for each pixel. A sketch of this overall data flow is given below.
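The following sketch ties the pieces together according to the data flow just described, reusing the MultiScaleConv, ConvAttention and CoordinateAttention sketches above. The ReLU activations after the backbone convolutions, the plain convolution standing in for the deformable convolution layer (torchvision.ops.DeformConv2d would be used in practice), and mapping the head outputs to 0–1 with a sigmoid in place of the patent's L2 regularization layers are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePointNet(nn.Module):
    """Sketch of the overall data flow: backbone, multi-scale + attention stages,
    then the confidence, repetition-degree and descriptor heads."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),    # 3x3 Conv1
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),   # 3x3 Conv2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),  # 3x3 Conv3
        )
        self.multi_scale = MultiScaleConv(128)
        self.conv_att = ConvAttention(128)
        self.deform = nn.Conv2d(128, 128, 3, padding=1)  # placeholder for the deformable conv layer
        self.coord_att = CoordinateAttention(128)
        self.conf_head = nn.Conv2d(128, 1, 1)            # feature point confidence head (1x1 conv)
        self.rep_head = nn.Conv2d(128, 1, 1)             # repetition degree head (1x1 conv)

    def forward(self, img):
        f = self.backbone(img)                           # 128-dimensional feature map
        f = self.conv_att(self.multi_scale(f))           # multi-scale conv + channel attention
        f = self.coord_att(self.deform(f))               # deformable conv + coordinate attention
        confidence = torch.sigmoid(self.conf_head(f))    # per-pixel confidence in [0, 1]
        repeatability = torch.sigmoid(self.rep_head(f))  # per-pixel repetition degree in [0, 1]
        probability = confidence * repeatability         # probability of being a feature point
        descriptors = F.normalize(f, p=2, dim=1)         # 128-d L2-normalized descriptors
        return probability, descriptors
```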
As an optional implementation, the deformable convolution layer of this embodiment is a deformable convolution layer constrained by a homography matrix. With this constraint, the offset layer only needs to predict the offsets of four sampling points to determine the homography matrix; the homography matrix is then applied in the deformable convolution layer so that the offsets of the other points can be predicted quickly from it, yielding the feature map of the irregular region. This saves computation and simplifies the calculation of the deformable convolution layer.
A normalization constraint is added when calculating the homography matrix corresponding to the deformable convolution layer, and solving the homography matrix comprises the following steps:
(1) performing a scale standardization transformation on the sampling points used to calculate the homography matrix to obtain the transformed parameters;
The offsets of the sampling points are predicted by the offset layer; in this embodiment only 4 sampling points need to be selected, and the homography matrix can be calculated from the offsets of these 4 sampling points. Performing the scale standardization transformation on the sampling points to obtain the transformed parameters specifically comprises: applying a similarity transformation to the coordinates p_i = (x_i, y_i, w_i)^T of the sampling points before the offset, so that the average distance from each sampling point to the origin of coordinates is 1.414, and obtaining the coordinates of each sampling point after the similarity transformation; adding the corresponding offset to the coordinates of each transformed sampling point to obtain the offset coordinates of each sampling point; and applying the inverse of the similarity transformation to the offset coordinates to obtain the coordinates p'_i = (x'_i, y'_i, w'_i)^T of each offset sampling point in the original coordinate system. The transformed parameters include the pre-offset and post-offset coordinates of each sampling point. A sketch of this step is given below.
(2) calculating the parameters of the homography matrix with the DLT algorithm based on the transformed parameters, to obtain the homography matrix corresponding to the deformable convolution layer.
This embodiment solves the homography matrix with the direct linear transformation (DLT) algorithm. The translation entries of the homography matrix are fixed to 0 and the last element h_33 = 1, so the homography matrix H ∈ R^(3×3) is flattened into a vector h ∈ R^(6×1); then:
[Formula image not reproduced in the original text: the DLT linear system relating h to the pre-offset and post-offset sampling point coordinates.]
The homography matrix can then be computed from the transformed parameters using this system; a hedged sketch follows.
In the embodiment, normalization constraint is added in the process of calculating the homography matrix corresponding to the deformable convolution layer, so that the calculation stability of the homography matrix can be improved, and the homography matrix obtained through calculation is more robust and has higher precision.
Before S2, the extraction method of this embodiment further includes performing data enhancement on the training data set to obtain an enhanced data set, and using the enhanced data set as the new training data set.
Performing data enhancement on the training data set to obtain the enhanced data set may include: for each training image in the training data set, transforming the training image in a plurality of transformation modes to obtain a plurality of transformed images corresponding to it, all of the transformed images forming the enhanced data set. The transformation modes include picture scaling, horizontal flipping, random translation, random rotation, color channel scaling, noise addition, random homography perturbation, color jittering, random cropping and brightness adjustment. Random cropping of the training images provides a large amount of training data; brightness adjustment randomly changes the brightness of the images and effectively copes with the illumination changes of real scenes; adding random homography perturbations yields a large number of virtual images of the same scene from different viewpoints, which facilitates model training. The magnitude of the homography perturbation may be within 10%. Through the above data enhancement operations, a large amount of training data can be obtained.
For the complex scenes encountered in practice, this embodiment adopts a more diversified data enhancement scheme than the prior art, covering the various types of scenes that may appear and essentially covering the situations found in reality, which greatly increases the amount of data and effectively improves the generalization ability of the network. A sketch of such an augmentation pipeline is given after this paragraph.
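A sketch of such an augmentation pipeline, assuming an H x W x 3 color image; only a subset of the listed transformations is illustrated, and apart from the roughly 10% homography perturbation stated above, all parameter ranges are illustrative assumptions.

```python
import numpy as np
import cv2

def augment(image, rng=np.random.default_rng()):
    """Illustrative data enhancement: flip, color scaling, noise, brightness,
    and a random homography perturbation of up to ~10% of the image size."""
    h, w = image.shape[:2]
    img = image.astype(np.float32)

    if rng.random() < 0.5:                                 # horizontal flip
        img = img[:, ::-1]
    img = img * rng.uniform(0.8, 1.2, size=(1, 1, 3))      # per-channel color scaling
    img = img + rng.normal(0, 5, img.shape)                # additive noise
    img = img * rng.uniform(0.7, 1.3)                      # brightness adjustment

    # Random homography perturbation: move each corner by up to ~10% of the image size.
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = rng.uniform(-0.1, 0.1, size=(4, 2)) * [w, h]
    H = cv2.getPerspectiveTransform(corners, (corners + jitter).astype(np.float32))
    img = cv2.warpPerspective(np.clip(img, 0, 255).astype(np.uint8), H, (w, h))
    return img, H                                          # transformed image and its homography
```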
Using the enhanced data set as a training data set, S2 may include:
(1) randomly extracting a plurality of images for training from a training data set to form a first data set;
(2) inputting the first data set into an initial characteristic point extraction model to obtain a first detection result of each image for training in the first data set; the first detection result comprises the probability that each pixel point in the training image is a feature point and a feature descriptor of each pixel point;
(3) applying a random homographic transformation to the images for training to obtain transformed images corresponding to each image for training, wherein the transformed images and the images for training are images in the same scene, and all the transformed images form a second data set;
(4) inputting the second data set into the initial characteristic point extraction model to obtain a second detection result of each transformed image in the second data set; the second detection result comprises the probability that each pixel point in the transformed image is a feature point and a feature descriptor of each pixel point;
(5) calculating a loss value based on a loss function with all the first detection results and all the second detection results as inputs;
specifically, the loss function used in this embodiment is:
L_total = λ_det · L_det + λ_desc · L_desc
where L_total is the total loss function; L_det is the detector loss of the feature point detection module; L_desc is the descriptor loss of the descriptor extraction module; λ_det and λ_desc are the weights of the detector loss and the descriptor loss, respectively.
The detector loss L_det consists of a cosine similarity loss L_cosim and a difference-degree loss L_peaky.
[Formula image not reproduced in the original text: the definition of the cosine similarity loss L_cosim.]
where P is the set formed by the neighborhood points of pixel p; |P| is the number of neighborhood points of pixel p; p' is a neighborhood point of pixel p; S[p] is the probability that pixel p is a feature point; S[p'] is the probability that the neighborhood point p' is a feature point; and cosim denotes the cosine of the angle between vectors.
To ensure that the selected feature points have values that differ significantly within the local region, this embodiment uses the difference-degree loss L_peaky.
[Formula image not reproduced in the original text: the definition of L_peaky(I).]
where I is a training image.
[Formula image not reproduced in the original text: the definition of L_peaky(I').]
where I' is the transformed image corresponding to the training image; P_1 is the set of neighborhood points of pixel p_1, p_1 being the pixel in I' corresponding to pixel p in I; |P_1| is the number of neighborhood points of p_1; p'_1 is a neighborhood point of p_1; and S[p'_1] is the probability that the neighborhood point p'_1 is a feature point.
The difference-degree loss is L_peaky = 0.5 (L_peaky(I) + L_peaky(I')).
The detector loss is L_det = L_cosim + 0.5 (L_peaky(I) + L_peaky(I')).
The descriptor loss L_desc measures whether the feature descriptors generated by the same feature point in the two images are similar. All pixels are sorted by the probability that they are feature points, and the first n pixels are selected as feature points, giving n feature points for the training image I and n feature points for the transformed image I'. Let (x_1, x_2, ..., x_n) be the descriptor distances between a given feature point in the training image I and the n feature points in the transformed image I', arranged in increasing order, where x_i is the distance between the feature descriptor of that feature point in I and the feature descriptor of the i-th feature point in I'. The homography matrix between the training image and the transformed image is then computed; through this homography matrix, pixels in the training image can be projected into the transformed image. (x_1, x_2, ..., x_n) is divided into a correct match set S_q^+ and an error match set S_q^-: specifically, the feature point in the training image I is projected into the transformed image according to the homography matrix to determine the position of the projection point, and if the distance between the i-th feature point in the transformed image I' and this projection point is within a preset distance threshold, the descriptor distance corresponding to the i-th feature point belongs to the correct match set; otherwise it belongs to the error match set.
[Formula image not reproduced in the original text: the definition of AP(q) in terms of Prec@K.]
where Prec@K is a proportion coefficient whose meaning is the fraction of the first K descriptor distances in (x_1, x_2, ..., x_n) that belong to the correct match set, K = 1, 2, ..., n; AP(q) is the average precision loss, representing the average of Prec@K over the cases in which the K-th descriptor distance belongs to the correct match set. The larger this value, the better: its physical meaning is that if the K-th descriptor distance belongs to the correct match set, the descriptor distances smaller than it are also expected to belong to the correct match set, and the more of the higher-ranked descriptor distances that do so, the better. |S_q^+| denotes the number of elements in the correct match set.
where 1[·] is the indicator function, equal to 1 when its argument holds and 0 otherwise.
descriptor loss Ldesc=1-[AP(q)R[q]+K(1-R[q])]。
Wherein, R [ q ] is the repetition degree of the local characteristic region corresponding to the characteristic point q in the image range; k is a hyperparameter.
Each corresponding pair of training image and transformed image is fed into the loss function to obtain a loss value, and all loss values are averaged to obtain the loss value of the current iteration.
(6) Optimizing and adjusting network parameters of the initial feature point extraction model based on the loss value to obtain an adjusted feature point extraction model;
(7) judging whether a preset iteration termination condition is reached; the preset iteration termination condition can be that the iteration times reach the maximum iteration times or the loss value is smaller than the loss value threshold value; if so, taking the adjusted feature point extraction model as a trained feature point extraction model, and ending the iteration; if not, the adjusted feature point extraction model is used as an initial feature point extraction model in the next iteration, the step of randomly extracting a plurality of images for training from the training data set to form a first data set is returned, and the iteration is continued.
In this embodiment, the network parameters of the whole feature point extraction model are iteratively optimized through this loss function, so the whole model is trained end to end. The optimization uses the Adam (Adaptive Moment Estimation) optimizer with an initial learning rate of 0.001 and a batch size of 4; the learning rate decays to 90% every five epochs, and the final trained feature point extraction model is obtained after training for 25 epochs. A sketch of this training schedule is given below.
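A sketch of this training schedule; `sample_homography` and `warp` are hypothetical helpers for generating the transformed image pairs, and `loss_fn` stands for the combined detector and descriptor loss defined above.

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, loss_fn, sample_homography, warp, epochs=25):
    """Adam with an initial learning rate of 0.001, batch size 4, learning rate
    decayed to 90% every five epochs, 25 epochs in total (as stated above)."""
    loader = DataLoader(dataset, batch_size=4, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)

    for epoch in range(epochs):
        for images in loader:
            H = sample_homography(images)        # random homography per training image
            warped = warp(images, H)             # transformed images of the same scenes
            out1 = model(images)                 # first detection results
            out2 = model(warped)                 # second detection results
            loss = loss_fn(out1, out2, H)        # detector loss + descriptor loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                         # decay the learning rate every 5 epochs
```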
S3 may include:
(1) inputting the image to be processed into the trained feature point extraction model to obtain a third detection result of the image to be processed; the third detection result comprises the probability that each pixel point in the image to be processed is a characteristic point and a characteristic descriptor of each pixel point;
(2) sorting all pixels in descending order of probability and selecting the first N pixels as the feature points of the image to be processed. The coordinates of the feature points of the image to be processed and their feature descriptors are thus obtained simultaneously, as sketched below.
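A sketch of this top-N selection, assuming the model returns a per-pixel probability map and a per-pixel descriptor map as NumPy arrays:

```python
import numpy as np

def select_feature_points(prob_map, desc_map, top_n=500):
    """Sort pixels by their feature-point probability, keep the first N, and
    return their coordinates together with the corresponding 128-d descriptors."""
    order = np.argsort(prob_map.ravel())[::-1][:top_n]   # indices of the N most probable pixels
    ys, xs = np.unravel_index(order, prob_map.shape)
    coords = np.stack([xs, ys], axis=1)                  # feature point coordinates (x, y)
    descriptors = desc_map[ys, xs]                       # matching feature descriptors
    return coords, descriptors
```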
Example 2:
this embodiment is used to provide an image feature point extraction system, as shown in fig. 6, the extraction system includes:
the model construction module M1 is used for constructing an initial characteristic point extraction model; the initial feature point extraction model comprises a feature extraction module; the feature extraction module comprises a multi-scale convolutional layer and a deformable convolutional layer;
a training module M2, configured to train the initial feature point extraction model by using a training data set, to obtain a trained feature point extraction model; the training data set comprises a plurality of images for training;
and the extraction module M3 is used for inputting the image to be processed into the trained feature point extraction model and extracting the feature points of the image to be processed.
The emphasis of each embodiment in the present specification is on the difference from the other embodiments, and the same and similar parts among the various embodiments may be referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. An image feature point extraction method, characterized by comprising:
constructing an initial characteristic point extraction model; the initial feature point extraction model comprises a feature extraction module; the feature extraction module comprises a multi-scale convolutional layer and a deformable convolutional layer;
training the initial characteristic point extraction model by using a training data set to obtain a trained characteristic point extraction model; the training data set comprises a plurality of images for training;
and inputting the image to be processed into the trained feature point extraction model, and extracting the feature points of the image to be processed.
2. The extraction method according to claim 1, wherein the initial feature point extraction model further comprises a feature point detection module and a descriptor extraction module; the feature point detection module and the descriptor extraction module are both connected with the feature extraction module;
the characteristic extraction module is used for extracting the characteristics of the input image;
the characteristic point detection module is used for determining the probability that each pixel point in the input image is a characteristic point based on the characteristics;
the descriptor extraction module is configured to determine a feature descriptor for each pixel point in the input image based on the features.
3. The extraction method according to claim 2, wherein the feature point detection module includes a feature point repetition degree sub-module and a feature point confidence degree sub-module;
the feature point repetition degree submodule is used for calculating the repetition degree of a local feature region corresponding to each pixel point in the input image in an image range based on the features;
the feature point confidence coefficient submodule is used for calculating the confidence coefficient of each pixel point in the input image as a feature point based on the features; the probability that each pixel point is a feature point is the product of the repetition degree of the local feature region corresponding to the pixel point in the image range and the confidence degree that the pixel point is the feature point.
4. The extraction method according to claim 3, wherein the feature extraction module further comprises a first convolutional layer, a second convolutional layer, and a third convolutional layer; the first convolutional layer, the second convolutional layer, the third convolutional layer, the multi-scale convolutional layer, and the deformable convolutional layer are sequentially connected;
the characteristic point repeatability submodule comprises a fourth convolution layer and a first L2 regularization layer which are sequentially connected;
the feature point confidence coefficient submodule comprises a fifth convolution layer and a second L2 regularization layer which are sequentially connected;
the descriptor extraction module includes a third L2 regularization layer.
5. The extraction method according to claim 4, wherein the feature extraction module further comprises a convolution attention submodule and a coordinate attention submodule;
the convolution attention submodule is connected between the multi-scale convolution layer and the deformable convolution layer; the multi-scale convolutional layer is connected with the convolution attention submodule through a first connecting channel and a second connecting channel respectively; the convolution attention sub-module is also connected with the deformable convolution layer;
the coordinate attention sub-module is connected between the deformable convolution layer and the output end of the feature extraction module; the deformable convolution layer is connected with the coordinate attention submodule through a third connecting channel and a fourth connecting channel respectively; the coordinate attention submodule is also connected with the output end of the feature extraction module.
6. The extraction method according to claim 1, wherein the deformable convolutional layer is a deformable convolutional layer constrained by a homography matrix;
wherein calculating the homography matrix corresponding to the deformable convolution layer comprises:
carrying out scale standardization transformation on a plurality of sampling points for calculating the homography matrix to obtain transformed parameters;
and calculating the parameters of the homography matrix by utilizing a DLT algorithm based on the transformed parameters to obtain the homography matrix corresponding to the deformable convolution layer.
7. The extraction method according to claim 1, wherein before the initial feature point extraction model is trained by using a training data set to obtain a trained feature point extraction model, the extraction method further comprises performing data enhancement on the training data set to obtain an enhanced data set, and using the enhanced data set as a new training data set;
wherein, the data enhancement of the training data set to obtain the enhanced data set specifically includes:
for each training image in the training data set, transforming the training image by adopting a plurality of transformation modes to obtain a plurality of transformed images corresponding to the training image; all the transformed images constitute the enhanced data set; the transformation modes comprise picture scaling, horizontal turning, random translation, random rotation, color channel scaling, noise addition, random homography matrix disturbance, color dithering, random cropping and shading adjustment.
8. The extraction method according to claim 1, wherein the training of the initial feature point extraction model by using the training data set to obtain the trained feature point extraction model specifically comprises:
randomly extracting a plurality of images for training from the training data set to form a first data set;
inputting the first data set into the initial characteristic point extraction model to obtain a first detection result of each training image in the first data set; the first detection result comprises the probability that each pixel point in the training image is a feature point and a feature descriptor of each pixel point;
applying homographic transformation to the images for training to obtain transformed images corresponding to each image for training; all the transformed images form a second data set;
inputting the second data set into the initial characteristic point extraction model to obtain a second detection result of each transformed image in the second data set; the second detection result comprises the probability that each pixel point in the transformed image is a feature point and a feature descriptor of each pixel point;
calculating a loss value based on a loss function with all of the first detection results and all of the second detection results as inputs;
optimizing and adjusting network parameters of the initial feature point extraction model based on the loss value to obtain an adjusted feature point extraction model;
judging whether a preset iteration termination condition is reached;
if so, taking the adjusted feature point extraction model as a trained feature point extraction model, and ending iteration;
if not, the adjusted feature point extraction model is used as an initial feature point extraction model in the next iteration, the step of randomly extracting a plurality of images for training from the training data set to form a first data set is returned, and the iteration is continued.
9. The extraction method according to claim 1, wherein the inputting of the image to be processed into the trained feature point extraction model, and the extracting of the feature points of the image to be processed specifically comprises:
inputting the image to be processed into the trained feature point extraction model to obtain a third detection result of the image to be processed; the third detection result comprises the probability that each pixel point in the image to be processed is a characteristic point and a characteristic descriptor of each pixel point;
and sequencing all the pixel points according to the sequence of the probability values from large to small, and selecting the first N pixel points as the feature points of the image to be processed.
10. An image feature point extraction system, characterized in that the extraction system comprises:
the model construction module is used for constructing an initial characteristic point extraction model; the initial feature point extraction model comprises a feature extraction module; the feature extraction module comprises a multi-scale convolutional layer and a deformable convolutional layer;
the training module is used for training the initial characteristic point extraction model by utilizing a training data set to obtain a trained characteristic point extraction model; the training data set comprises a plurality of images for training;
and the extraction module is used for inputting the image to be processed into the trained feature point extraction model and extracting the feature points of the image to be processed.
CN202111503824.7A 2021-12-10 2021-12-10 Image feature point extraction method and system Pending CN114140623A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111503824.7A CN114140623A (en) 2021-12-10 2021-12-10 Image feature point extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111503824.7A CN114140623A (en) 2021-12-10 2021-12-10 Image feature point extraction method and system

Publications (1)

Publication Number Publication Date
CN114140623A true CN114140623A (en) 2022-03-04

Family

ID=80385641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111503824.7A Pending CN114140623A (en) 2021-12-10 2021-12-10 Image feature point extraction method and system

Country Status (1)

Country Link
CN (1) CN114140623A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114707611A (en) * 2022-04-21 2022-07-05 安徽工程大学 Mobile robot map construction method, storage medium and equipment based on graph neural network feature extraction and matching
CN115861745A (en) * 2022-10-25 2023-03-28 中国交通信息科技集团有限公司 Two-dimensional image feature extraction method and system for generating three-dimensional model
CN115861745B (en) * 2022-10-25 2023-06-06 中国交通信息科技集团有限公司 Two-dimensional image feature extraction method and system for generating three-dimensional model
CN116740488A (en) * 2023-05-16 2023-09-12 北京交通大学 Training method and device for feature extraction model for visual positioning
CN116740488B (en) * 2023-05-16 2024-01-05 北京交通大学 Training method and device for feature extraction model for visual positioning
CN116934591A (en) * 2023-06-28 2023-10-24 深圳市碧云祥电子有限公司 Image stitching method, device and equipment for multi-scale feature extraction and storage medium

Similar Documents

Publication Publication Date Title
US11630972B2 (en) Assembly body change detection method, device and medium based on attention mechanism
CN114140623A (en) Image feature point extraction method and system
CN109840556B (en) Image classification and identification method based on twin network
WO2015139574A1 (en) Static object reconstruction method and system
CN115601549B (en) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model
WO2022206020A1 (en) Method and apparatus for estimating depth of field of image, and terminal device and storage medium
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111626927B (en) Binocular image super-resolution method, system and device adopting parallax constraint
CN111768415A (en) Image instance segmentation method without quantization pooling
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN112084849A (en) Image recognition method and device
Fu et al. Learning to reduce scale differences for large-scale invariant image matching
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN114492755A (en) Target detection model compression method based on knowledge distillation
CN117036756B (en) Remote sensing image matching method and system based on variation automatic encoder
CN111696167A (en) Single image super-resolution reconstruction method guided by self-example learning
CN117541629A (en) Infrared image and visible light image registration fusion method based on wearable helmet
CN116128919A (en) Multi-temporal image abnormal target detection method and system based on polar constraint
CN115439669A (en) Feature point detection network based on deep learning and cross-resolution image matching method
CN114998630A (en) Ground-to-air image registration method from coarse to fine
CN114332748A (en) Target detection method based on multi-source feature joint network and self-generation of transformed image
CN113962846A (en) Image alignment method and device, computer readable storage medium and electronic device
CN114091519A (en) Shielded pedestrian re-identification method based on multi-granularity shielding perception
CN113112522A (en) Twin network target tracking method based on deformable convolution and template updating
Borkowski 2d to 3d conversion with direct geometrical search and approximation spaces

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination