CN116701695B - Image retrieval method and system for cascading corner features and twin network - Google Patents

Image retrieval method and system for cascading corner features and twin network

Info

Publication number
CN116701695B
CN116701695B (application CN202310640768.4A)
Authority
CN
China
Prior art keywords
image
images
searched
similar
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310640768.4A
Other languages
Chinese (zh)
Other versions
CN116701695A (en)
Inventor
陈程立诏
李潞铭
卢博
宋佳
宋梦柯
胡诗语
赵一汎
王子铭
张明月
杨龙燕
崔爽锌
薛子玥
刘新宇
梁少峰
朱晓东
尹涵冰
张钰
袁千禧
刘伊凡
崔奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China
Priority to CN202310640768.4A
Publication of CN116701695A
Application granted
Publication of CN116701695B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5854 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using shape and object relationship
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an image retrieval method and system that cascade corner features with a twin (Siamese) network. The method comprises the following steps: extracting key point features of the noise-reduced image to be retrieved and of each image in the retrieval dataset through corner detection, and screening out similar images according to the key point features; cropping regions of interest from the image to be retrieved and its similar images, and inputting the resulting local images into a trained twin network model, constructed on a deep residual network with a deformable attention mechanism, to obtain two groups of depth features; scoring the similarity of the depth features, and taking the similar images whose scores exceed a threshold as the retrieval results. The image retrieval method is particularly suitable for LOGO image retrieval, and can accurately and robustly retrieve target images containing the corresponding LOGO from a dataset even when the number of image types is uncertain.

Description

Image retrieval method and system for cascading corner features and twin network
Technical Field
The invention belongs to the technical field of digital image processing methods, and particularly relates to the technical field of content-based image retrieval methods.
Background
How to conveniently, rapidly and accurately find the images a user needs or is interested in within an image library is a research hotspot in the field of multimedia information retrieval. The most studied image retrieval methods fall into two categories: text-based image retrieval (TBIR, Text-Based Image Retrieval) and content-based image retrieval (CBIR, Content-Based Image Retrieval).
In text-based image retrieval, the content of each image must first be annotated with text, so that after a user provides retrieval keywords, the images of interest are retrieved through the correspondence between the annotated text and the keywords. In content-based image retrieval, a computer analyzes the images and builds an image feature library from the image vector features; when a user inputs a query image, its features are extracted in the same way, compared for similarity against the features in the feature library, and the images are output in order of similarity.
Content-based image retrieval has the following drawback: retrieval is easily disturbed by image scale changes and complex backgrounds, making it difficult to accurately retrieve similar images that have undergone certain changes under complex backgrounds, such as LOGO images with rotation, scale zooming and image quality differences. Accurate retrieval of LOGO images therefore remains a technical problem to be solved in the prior art, and the above drawback needs to be overcome by a new technique.
Disclosure of Invention
Aiming at the defects of the prior art, the invention seeks to overcome the limitations caused by image scale variation and by interference from complex background regions in existing content-based image retrieval methods, and provides a novel image retrieval method and system capable of accurately retrieving LOGO images.
The technical scheme of the invention is as follows:
an image retrieval method of cascading corner features and a twin network comprises the following steps:
s1, carrying out noise reduction treatment on each image in an image to be searched and a search data set to obtain a noise-reduced image to be searched and a noise-reduced search data set image, wherein the search data set is an image data set for carrying out image search according to the image to be searched;
s2, respectively carrying out global feature extraction on the image to be searched after noise reduction and the image of the search dataset after noise reduction through a SIFT angular point detection algorithm, carrying out similarity comparison on the extracted key point vectors, screening out similar images which are similar to the image to be searched in the search dataset, and forming a matching pair image by the image to be searched and the similar images;
s3, in an image coordinate system, based on the key points with similarity in the matching pair images, namely the matching points, respectively carrying out region-of-interest clipping on the obtained matching pair images to obtain local images of the images to be retrieved and local images of the similar images;
s4, inputting the local images of the images to be searched and the local images of the similar images into a trained twin network model, obtaining the depth characteristics of the two local images and the similarity scores between the two obtained sets of depth characteristics, screening out the local images with scores exceeding a similarity threshold, and taking the corresponding similar images as search results;
the twin network model is constructed based on a depth residual network and provided with a deformable attention mechanism.
According to some preferred embodiments of the invention, the noise reduction is achieved by gaussian filtering.
According to some preferred embodiments of the present invention, the similarity comparison is evaluated by euclidean distance between the key point vectors, that is, when the euclidean distance between the key point vectors extracted from the image to be retrieved after noise reduction and the key point vectors extracted from the image of the retrieved dataset after noise reduction is less than or equal to a priori threshold, the two are considered to be similar.
According to some preferred embodiments of the invention, the similarity score is obtained by linear cross-correlation.
According to some preferred embodiments of the invention, the a priori threshold is set to 0.6.
According to some preferred embodiments of the invention, the similarity threshold is set to 0.9.
According to some preferred embodiments of the invention, step S2 further comprises:
s21, extracting key point vectors of the noise-reduced image to be retrieved and the noise-reduced retrieval data set image through SIFT angular point detection;
s22, calculating Euclidean distance between key point vectors of the image to be retrieved after noise reduction and the image of the retrieval data set after noise reduction, and screening out the similar images;
s23, screening out images with the width and the height respectively larger than 256 pixels in the similar images to obtain secondary screening images;
s24, carrying out two-dimensional gridding treatment on the noise-reduced image to be searched and the secondary screening image to obtain the image to be searched and the secondary screening image with a two-dimensional coordinate system;
and S25, matching the key point pair with the highest similarity between the image to be retrieved and the secondary screening image with the two-dimensional coordinate system, to serve as the most similar key points of the image to be retrieved and its secondary screening image.
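As a minimal illustration of steps S21-S25 (not code from the patent), the key point screening and most-similar-pair selection can be sketched in numpy. The SIFT descriptors are assumed to have been extracted already (e.g., with an OpenCV SIFT detector) and L2-normalized so that the prior threshold of 0.6 is meaningful; the function name is illustrative:

```python
import numpy as np

def match_keypoints(desc_q, desc_d, threshold=0.6):
    """Match SIFT descriptors of a query image against one dataset image.

    desc_q, desc_d: (Nq, 128) and (Nd, 128) float arrays of SIFT
    descriptors, assumed L2-normalized so the 0.6 prior threshold
    from the method is meaningful.
    Returns the index pairs below the threshold (S22) and the single
    most similar pair with its distance (S25).
    """
    # Pairwise Euclidean distances, shape (Nq, Nd).
    diff = desc_q[:, None, :] - desc_d[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # S22: keep pairs whose distance is <= the prior threshold.
    pairs = np.argwhere(dist <= threshold)
    # S25: the most similar key point pair overall.
    best = np.unravel_index(np.argmin(dist), dist.shape)
    return pairs, best, dist[best]
```

In the full method this comparison is run against every image of the retrieval dataset, and images with at least one pair below the threshold become the similar images of S22.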
According to some preferred embodiments of the invention, step S3 further comprises:
cutting the image to be searched and the secondary screening image with the two-dimensional coordinate system according to the most similar key points to obtain the local image; wherein the cropping comprises:
when the most similar key point is located in the central area of the image to be retrieved or of the secondary screening image with the two-dimensional coordinate system, that is, when its distance to each of the four sides of the image is greater than or equal to 256 in the two-dimensional coordinate system, taking the most similar key point as the center point, cropping a 128×128 rectangular image from the image to be retrieved with the two-dimensional coordinate system as its local image, and cropping a 256×256 rectangular image from the secondary screening image with the two-dimensional coordinate system as its local image;
when the most similar key point is located in the edge area of the image to be retrieved or of the secondary screening image with the two-dimensional coordinate system, that is, when its distance to any of the four sides of the image is smaller than 256 in the two-dimensional coordinate system, taking the most similar key point as one corner of a rectangular cropping area that extends from that corner toward the central area of the image, and cropping once a cropping area of the target size is obtained, so as to obtain the local image; wherein the target size is: a 128×128 rectangular image cropped from the image to be retrieved with the two-dimensional coordinate system, and a 256×256 rectangular image cropped from the secondary screening image with the two-dimensional coordinate system.
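The two cropping cases above can be sketched as a single helper. This is a numpy sketch under our reading of the rule: the 256-pixel center margin and the 128/256 crop sizes follow the text, while the helper name, the direction-of-extension logic and the clamping at image borders are assumptions:

```python
import numpy as np

def crop_roi(img, kp, crop, center_margin=256):
    """Crop a square region around a key point.

    img: (H, W[, C]) array with H, W >= crop; kp: (x, y) key point.
    crop: side length of the crop (128 for the query image, 256 for
    the dataset image in the method).
    If the key point is at least `center_margin` pixels from every
    border, the crop is centered on it (central case); otherwise the
    key point becomes a corner of the crop rectangle, which extends
    toward the image center (edge case).
    """
    h, w = img.shape[:2]
    x, y = kp
    if min(x, y, w - x, h - y) >= center_margin:
        x0, y0 = x - crop // 2, y - crop // 2          # centered crop
    else:
        # extend from the key point toward the image center
        x0 = x if x <= w // 2 else x - crop
        y0 = y if y <= h // 2 else y - crop
        x0 = max(0, min(x0, w - crop))                 # keep inside image
        y0 = max(0, min(y0, h - crop))
    return img[y0:y0 + crop, x0:x0 + crop]
```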
According to some preferred embodiments of the invention, the twin network model is constructed by a modified Resnet50 network, the modified Resnet50 network having the following structure:
on the basis of the ResNet50 network, the third convolution layer is changed into a deformable convolution network layer, and a channel attention module is added after each of the fourth and fifth convolution layers.
According to some preferred embodiments of the invention, training the twin network model comprises:
forming a training set by the template images of the same type as the images to be searched and the cut search data set images;
five anchor boxes are arranged pixel by pixel on each input training set image by coordinate mapping, arranged as follows: the area of each anchor box is 1/64 of the original image, and the length-width ratios are 0.33, 0.5, 1, 2 and 3 respectively;
training the twin network model with a triplet loss function, using the weights of the pre-trained improved ResNet50 network as initialization weights, wherein during training the template image is set as the positive label, and retrieval dataset images whose key points have a Euclidean distance greater than 0.6 from the key points of the template image are set as negative labels;
and obtaining the trained twin network model once the triplet loss is stable, that is, once the network has converged.
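The triplet loss named above can be written out as a short numpy sketch. The margin value is an assumed hyperparameter (the patent does not specify one), and the function operates on pre-computed embeddings rather than on the network itself:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss over batches of embeddings (numpy sketch).

    anchor/positive/negative: (B, D) arrays. In the method the
    positive comes from the template image (positive label) and the
    negative from dataset images whose key points are farther than
    0.6 from the template's. `margin` is an assumed hyperparameter.
    """
    d_ap = np.linalg.norm(anchor - positive, axis=1)
    d_an = np.linalg.norm(anchor - negative, axis=1)
    # pull positives closer than negatives by at least `margin`
    return np.maximum(0.0, d_ap - d_an + margin).mean()
```

Training stops when this loss stabilizes, which is the convergence criterion stated above.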
According to some preferred embodiments of the invention, the obtaining of the search result includes:
s51, extracting feature vectors of local images of the images to be retrieved and local images of similar images of the images to be retrieved through the trained improved ResNet50 network;
s52, taking the extracted depth features of the images to be retrieved as convolution kernels, and carrying out convolution processing on the depth features of the images to be retrieved and the depth features of the similar images in a linear cross-correlation mode to obtain a similarity score graph;
s53, classifying the similarity score graph through a softmax function, wherein samples with the maximum value and the minimum value of the similarity score graph being larger than 0.5 are positive samples, and samples with the similarity score being larger than a preset similarity threshold value of 0.9 are retrieval results in the positive samples.
According to some preferred embodiments of the invention, the image to be retrieved is a LOGO-like image.
The invention further provides a retrieval system implementing the above image retrieval method, comprising: a database module storing the image to be retrieved and the retrieval dataset; a noise reduction processing module performing noise reduction on the image to be retrieved and the images in the retrieval dataset; a SIFT corner detection and similarity matching module performing all feature extraction and similar-image screening; an image cropping module performing the region-of-interest cropping; a model processing module performing the construction and training of the twin network model; and a secondary retrieval module capable of performing secondary retrieval on the retrieval results.
The image retrieval method is particularly suitable for LOGO image retrieval: it first applies Gaussian filtering to the LOGO image as a noise reduction and sampling preprocessing operation, then extracts SIFT corner features of the LOGO image and of the retrieval dataset images, crops a search region of interest from each dataset image that satisfies the matching condition through key point matching, and generates the corresponding candidate search region images.
The invention solves the poor practicability of LOGO image retrieval in complex scenes that affects existing retrieval methods, and achieves LOGO image retrieval with high accuracy and good robustness in such scenes.
The invention can effectively improve the accuracy and speed of LOGO image retrieval under complex application conditions; in particular, even when the number of LOGO image types is uncertain, the target images containing the corresponding LOGO can still be retrieved accurately from the image dataset.
Drawings
Fig. 1 is a flow chart of the search method of the present invention.
Fig. 2 is a schematic structural diagram of a specific deep learning network according to the present invention.
FIG. 3 is a schematic diagram of a particular attention module employed in the present invention.
Fig. 4 is a schematic diagram of the components of the retrieval system in the embodiment.
Detailed Description
The present invention will be described in detail with reference to the following examples and drawings, but it should be understood that the examples and drawings are only for illustrative purposes and are not intended to limit the scope of the present invention in any way. All reasonable variations and combinations that are included within the scope of the inventive concept fall within the scope of the present invention.
Referring to fig. 1, a specific embodiment of the image retrieval method of cascading corner features and a twin network model provided by the invention comprises the following steps:
s1, carrying out noise reduction processing on the LOGO image to be searched and each image in the search data set through Gaussian filtering, and obtaining the LOGO image to be searched after noise reduction and each image in the search data set after noise reduction.
In more specific embodiments, the gaussian filter is a two-dimensional gaussian filter.
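As a rough illustration of this two-dimensional Gaussian filtering (in practice a library routine such as cv2.GaussianBlur or scipy.ndimage.gaussian_filter would typically be used), a minimal numpy sketch follows; the kernel size and sigma are illustrative choices not specified by the patent:

```python
import numpy as np

def gaussian_kernel_2d(size=5, sigma=1.0):
    """Build a normalized 2-D Gaussian kernel (illustrative parameters)."""
    ax = np.arange(size) - size // 2
    g = np.exp(-ax ** 2 / (2 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def gaussian_denoise(img, size=5, sigma=1.0):
    """Noise reduction by 2-D Gaussian filtering (step S1 sketch).

    Plain valid-mode convolution, so the output shrinks by size-1 in
    each dimension; border handling is omitted for brevity.
    """
    k = gaussian_kernel_2d(size, sigma)
    h, w = img.shape
    out = np.empty((h - size + 1, w - size + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + size, j:j + size] * k)
    return out
```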
S2, performing global feature extraction on the noise-reduced LOGO image to be retrieved and on each image in the retrieval dataset respectively, and screening out the images in the retrieval dataset that are similar to the LOGO image to be retrieved, namely the similar images, through similarity comparison of the extracted global features; the LOGO image to be retrieved and a similar image in the retrieval dataset then form a matching pair image; the global features are SIFT key point features, and the SIFT key points with similarity within a matching pair image are the matching points.
More specific embodiments thereof are as follows: and extracting SIFT key point feature vectors of the LOGO image to be searched and each image in the search data set through a SIFT key point algorithm, calculating the similarity of the SIFT key point feature vectors by using Euclidean distance, and primarily screening out images with similarity Euclidean distance less than or equal to a priori threshold value, wherein the images are used as matching pair images with similarity.
Preferably, the a priori threshold is set to 0.6.
More specific embodiments thereof include:
s21, extracting key point vectors of LOGO images to be searched and each image in the search data set through a SIFT angular point detection algorithm;
s22, calculating Euclidean distance between the LOGO image to be searched and the key point vectors of all the images in the search data set, and screening out images with Euclidean distance smaller than 0.6 between the key point vectors of the LOGO image to be searched in the search data set, namely similar images of the LOGO image to be searched in the search data set;
s23, carrying out secondary screening on similar images in the search data set according to the pixel size of the images, and screening out images with the width and the height respectively larger than 256 pixels, namely secondary screening images;
s24, taking a point at the upper left corner of the secondary screening image as an origin, taking left and right as an X-axis extending direction and Y-axis extending directions respectively, and generating grid point two-dimensional coordinates of the searching and the image to be searched by using a merhgrid function in a python numpy library;
s25, matching the key point pair with the highest similarity between the LOGO image to be searched and the secondary screening image in the searching data set, and taking the key point pair as the most similar key point between the LOGO image to be searched and the secondary screening image.
S3, in the image coordinate system, cropping the regions of interest of the images based on the obtained matching points, so as to obtain the local images of the LOGO image to be retrieved and the local images of the similar images in the retrieval dataset.
A specific embodiment is as follows: according to the obtained matching points, the matching pair images are each mapped into an image coordinate system; further, taking the coordinates of either key point of a matching pair image as the center coordinates of the region of interest, the regions of interest of the matching pair images are respectively cropped to generate the local image of the LOGO image to be retrieved and the local image of the similar image in the retrieval dataset.
In more specific embodiments, the partial image of the LOGO image to be searched is set to be an image of a rectangular area with a size of 128×128 in the LOGO image to be searched, and the partial image of the similar image is an image of a rectangular area with a size of 256×256 in the similar image.
Other more specific embodiments thereof include:
s31, obtaining a LOGO image to be searched and a secondary screening image of the LOGO image in a search data set and the most similar key points through the steps S21-S25;
s32, in the LOGO image to be searched and the secondary screening image, local image cutting is carried out by taking the most similar key points as cutting basis, and the local image is obtained.
Preferably, the local image cropping includes:
when the most similar key points are located in the central area of the LOGO image to be retrieved or of the secondary screening image, that is, when their distances to each of the four sides of the image are greater than or equal to 256 in the two-dimensional coordinate system, the most similar key points are taken as center points: a 128×128 rectangular image is cropped from the LOGO image to be retrieved as its local image, and a 256×256 rectangular image is cropped from the secondary screening image as the local image of the similar image;
when the most similar key points are located in the edge area of the LOGO image to be retrieved or of the secondary screening image, that is, when their distance to any of the four sides of the image is smaller than 256 in the two-dimensional coordinate system, the most similar key points are taken as corner points of rectangular cropping areas that extend from those corners toward the central area of the image, and cropping is performed once a cropping area of the target size is obtained, yielding the local images of the LOGO image to be retrieved and of the secondary screening image; the target size is: a 128×128 rectangular image cropped from the LOGO image to be retrieved, and a 256×256 rectangular image cropped from the secondary screening image.
S4, inputting the local images of the LOGO images to be searched and the local images of the similar images in the search data set into a trained twin network model, obtaining the depth characteristics of the two local images and the similarity scores of the two obtained sets of depth characteristics, screening out the local images with scores exceeding a similarity threshold, and taking the corresponding original similar images as search results.
In more specific embodiments, the similarity score is obtained by linear cross-correlation analysis.
In more specific embodiments, the similarity threshold is set to 0.9.
Since the twin networks in a twin network model share network weights, the similarity of two inputs can be measured accurately; however, an ordinary fully convolutional twin network has difficulty extracting, accurately and robustly, image features with unknown deformation and complex background features of unknown origin. Therefore, in step S4, the invention further builds a twin network model with a deformable attention mechanism, making the retrieval more accurate and robust.
Further, the twin network model is a twin network model with a deformable attention (Deformable Attention) mechanism constructed on a deep residual network (Deep Residual Network, ResNet).
Further, referring to fig. 2 and 3, a more specific implementation manner of the twin network model is as follows:
with the ResNet50 deep fully convolutional network as the basic network structure, referring to FIG. 2, the third convolution layer of ResNet50 is changed into a deformable convolution network layer, and a channel attention module is added after the fourth and fifth convolution layers.
Further, referring to fig. 3, a more specific embodiment of the channel attention module is as follows:
The channel attention modules take the outputs of the fourth and fifth convolution layers as inputs. Let the input of a channel attention module be X_i ∈ ℝ^(C×H×W). The input feature is first downsampled using a convolution layer with a convolution kernel size of 1×1, reducing the number of input channels to one quarter, i.e., C' = C/4. The resulting downsampled feature is then deformed, using the reshape function in python's torch library, into a two-dimensional matrix V ∈ ℝ^(C'×N) with N = H×W. V is multiplied with its transposed matrix V^T, and after the matrix multiplication a channel attention map A ∈ ℝ^(C'×C') is obtained using a softmax function, that is, A = softmax(VV^T).
Then, a weighted residual addition is performed between the channel attention map A and the two-dimensional matrix V, that is, V' = γAV + V, where γ is a learnable weight initialized to 0 and trained with the network. Finally, V' is deformed again, using the reshape function in python's torch library, into a tensor with the same resolution as the input X.
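The attention computation just described can be sketched in numpy (the 1×1 channel-reduction convolution is omitted, and since the patent's matrix-product order is garbled in this text, we use the shape-consistent reading V' = γAV + V; the function name is illustrative):

```python
import numpy as np

def channel_attention(x, gamma=0.0):
    """Numpy sketch of the channel attention module (Fig. 3).

    x: (C', H, W) feature after the 1x1 channel reduction (the
    reduction itself is omitted here). Computes A = softmax(V V^T)
    for V in R^(C' x N) with N = H*W, then the weighted residual
    V' = gamma * A V + V, reshaped back to x's shape. gamma is the
    learnable weight initialized to 0, so the module starts as an
    identity mapping.
    """
    c, h, w = x.shape
    v = x.reshape(c, h * w)                       # V in R^(C' x N)
    s = v @ v.T                                   # V V^T, shape (C', C')
    e = np.exp(s - s.max(axis=1, keepdims=True))  # stable row-wise softmax
    a = e / e.sum(axis=1, keepdims=True)          # A = softmax(V V^T)
    out = gamma * (a @ v) + v                     # V' = gamma*A*V + V
    return out.reshape(c, h, w)
```

With γ = 0 the output equals the input, matching the stated initialization; during training γ grows and the attention term contributes.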
In the above embodiment, the improved ResNet50 network is used as the backbone network of the twin network model. The improvement replaces the convolution in the third convolution layer of the ResNet50 network with a deformable convolution, which adds an offset variable to the position of each sampling point in the convolution kernel; with these offsets, the convolution kernel can sample freely near its current position instead of being confined to the regular grid points. Once the offsets are learned, the size and position of the deformable convolution kernel can be adjusted dynamically according to the image content currently being recognized. The visible effect is that the sampling point positions of the convolution kernel adapt to the image content at different locations, thereby accommodating geometric deformations such as the shape and size of different objects in the complex and changeable dataset to be retrieved.
After the fourth and fifth layers, a channel attention mechanism is used: a bypass branch splits off after the ordinary convolution operation and first performs a reshape operation to compress the spatial dimension, so that each two-dimensional feature map becomes a real number, equivalent to a pooling operation with a global receptive field, while the number of feature channels remains unchanged. The softmax operation then generates a weight for each feature channel through learned parameters, explicitly modeling the correlation between feature channels. After the weight of each feature channel is obtained, it is applied to the corresponding original feature channel, so that the importance of different channels can be learned.
After that, the final LOGO feature tensor is used as a convolution kernel to perform a convolution operation on the image to be retrieved, and the similarity is calculated through channel-by-channel multiplication.
Further, training of the twin network model may include:
and taking the LOGO template image and the image area cut in the retrieval data set as the input of the twin network model, namely a training set, and carrying out network parameter training on the twin network model by adopting a triplet loss function.
More specific embodiments thereof are as follows:
five anchor boxes are arranged pixel by pixel on each input training set image by coordinate mapping, arranged as follows: the area of each anchor box is 1/64 of the original image, and the length-width ratios are 0.33, 0.5, 1, 2 and 3 respectively;
training a twin network model by using a triple loss function by using a Resnet50 network weight trained on 1400 ten thousand images of an ImageNet as an initialization weight of a backbone network in the twin network model, using a LOGO template image as a positive label, using an image with a Euclidean distance between a key point of the LOGO template image and a key point of the LOGO template image being larger than 0.6 as a negative label;
and obtaining the trained twin network model until the triplet loss is stable, namely the network converges.
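The triplet training step described above can be sketched as follows. This is a minimal illustration under stated assumptions: `embed` is a toy stand-in for the modified ResNet50 backbone, the tensor shapes are invented, and reusing the 0.6 key-point distance as the loss margin is a guess, not the patent's configuration.

```python
import torch
import torch.nn as nn

# Toy stand-in backbone (assumption); the patent uses a modified ResNet50.
embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
triplet = nn.TripletMarginLoss(margin=0.6)  # margin value is an assumption
opt = torch.optim.SGD(embed.parameters(), lr=0.001,
                      momentum=0.9, weight_decay=0.0001)

anchor   = torch.randn(4, 3, 64, 64)  # LOGO template crops
positive = torch.randn(4, 3, 64, 64)  # crops matching the template (positive label)
negative = torch.randn(4, 3, 64, 64)  # crops with key-point distance > 0.6 (negative label)

# One training step; in practice this repeats until the loss stabilizes.
loss = triplet(embed(anchor), embed(positive), embed(negative))
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss) >= 0.0)  # True -- triplet loss is non-negative
```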
Further, in a specific embodiment of the present invention, the process of obtaining the search result through the trained twin network model includes:
S51, extracting feature vectors of the local images of the LOGO image to be retrieved and the local images of its similar images through the trained improved ResNet50 model;
S52, taking the extracted depth features of the LOGO image to be retrieved as convolution kernels, and convolving them with the depth features of the similar images by linear cross-correlation to obtain a similarity score map with 2 channels and a pixel size of 17 × 17;
S53, classifying the similarity score map through a softmax function: the maximum value and the minimum value of the similarity score map are calculated, and if both are greater than 0.5, the sample is regarded as a positive sample; among the positive samples, a sample whose maximum score in the similarity score map is greater than the preset similarity threshold of 0.9 is taken as an image required by the retrieval.
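Step S53 might be sketched as follows. The choice of channel 1 as the "similar" class, the function name, and the decision to apply softmax over the channel dimension are assumptions made for the illustration.

```python
import torch

def classify(score_map: torch.Tensor, sim_threshold: float = 0.9) -> bool:
    """Hedged sketch of S53 on a (2, 17, 17) two-channel similarity score map.

    A crop counts as a positive sample when both the maximum and the minimum
    of the softmax scores exceed 0.5, and is returned as a retrieval hit when
    its top score also exceeds the 0.9 similarity threshold.
    """
    # Softmax over the two channels; channel 1 taken as "similar" (assumption).
    probs = torch.softmax(score_map, dim=0)[1]
    is_positive = probs.max() > 0.5 and probs.min() > 0.5
    return bool(is_positive and probs.max() > sim_threshold)

print(classify(torch.zeros(2, 17, 17)))  # False: a uniform map scores 0.5 everywhere
```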
Example 1
According to the above embodiments, the present invention further provides the following examples:
The method was implemented in Python using PyTorch and trained on 2 RTX 2080Ti cards. During training, the batch size was set to 16, and 20 epochs were run using stochastic gradient descent (SGD). The initial learning rate was set to 0.001, with a warm-up from 0.001 to 0.005 over the first 5 epochs and an exponential decay from 0.005 to 0.00005 over the last 15 epochs. The weight decay and momentum were set to 0.0001 and 0.9, respectively. The query images and the images to be retrieved are screenshots captured manually and at random in the national 242 information security plan small emergency special project; they exhibit large scale changes, complex background interference, rotation changes, poor image quality, and similar characteristics.
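The learning-rate schedule quoted above can be reproduced with a small helper. The exact interpolation (linear warm-up, per-epoch exponential decay) is an assumption, since the text only gives the endpoint values.

```python
def lr_at(epoch: int, total: int = 20, warmup: int = 5) -> float:
    """Sketch of the schedule: linear warm-up 0.001 -> 0.005 over the first
    5 epochs, then exponential decay 0.005 -> 0.00005 over the last 15."""
    if epoch < warmup:
        # Linear interpolation across the warm-up epochs (assumption).
        return 0.001 + (0.005 - 0.001) * epoch / (warmup - 1)
    # Exponential decay so the final epoch lands on 0.00005.
    t = (epoch - warmup) / (total - warmup - 1)
    return 0.005 * (0.00005 / 0.005) ** t

print(lr_at(0), lr_at(4), lr_at(19))
```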
The search system is shown in fig. 4, and the search result statistics are as follows:
Table 1. Summary of LOGO image search results
It can be seen that the retrieval method of the invention has excellent accuracy and robustness.
The above examples are only preferred embodiments of the present invention, and the scope of the present invention is not limited to them. All technical solutions falling within the concept of the invention fall within its protection scope. It should be noted that, for those skilled in the art, modifications and adaptations made without departing from the principles of the present invention are also within the protection scope of the present invention.

Claims (9)

1. An image retrieval method cascading corner features and a twin network, characterized by comprising the following steps:
S1, performing noise reduction on the image to be retrieved and each image in a retrieval dataset to obtain a noise-reduced image to be retrieved and noise-reduced retrieval dataset images, wherein the retrieval dataset is the image dataset searched according to the image to be retrieved;
S2, performing global feature extraction on the noise-reduced image to be retrieved and the noise-reduced retrieval dataset images through SIFT corner detection, comparing the similarity of the extracted key point vectors, screening out the images in the retrieval dataset similar to the image to be retrieved, and forming matched image pairs from the image to be retrieved and the similar images;
S3, in the image coordinate system, cropping regions of interest from the matched image pairs based on the key points with similarity in the matched image pairs, i.e. the matching points, to obtain a local image of the image to be retrieved and local images of the similar images;
S4, inputting the local image of the image to be retrieved and the local images of the similar images into a trained twin network model, obtaining the depth features of the two local images and the similarity score between the two sets of depth features, screening out the local images whose scores exceed a similarity threshold, and taking the corresponding similar images as the retrieval results;
wherein the twin network model is a twin network model with a deformable attention mechanism constructed on a depth residual network and built from a modified ResNet50 network, the modified ResNet50 network having the following structure:
on the basis of a ResNet50 network, the third convolution layer is changed into a deformable convolution layer, and a channel attention module is added after each of the fourth and fifth convolution layers;
step S2 further comprises:
S21, extracting key point vectors of the noise-reduced image to be retrieved and the noise-reduced retrieval dataset images through SIFT corner detection;
S22, calculating the Euclidean distance between the key point vectors of the noise-reduced image to be retrieved and those of the noise-reduced retrieval dataset images, and screening out the similar images;
S23, screening out the images in the similar images whose width and height are each larger than 256 pixels to obtain secondary screening images;
S24, performing two-dimensional gridding on the noise-reduced image to be retrieved and the secondary screening images to obtain the image to be retrieved and the secondary screening images with a two-dimensional coordinate system;
S25, matching the key points with the highest similarity in the image to be retrieved and the secondary screening images with the two-dimensional coordinate system as the most similar key points of the image to be retrieved and the secondary screening images;
step S3 further comprises:
cutting the image to be searched and the secondary screening image with the two-dimensional coordinate system according to the most similar key points to obtain the local image; wherein the cropping comprises:
when the most similar key point lies in the central area of the image to be retrieved or of the secondary screening image with the two-dimensional coordinate system, that is, its distance to each of the four sides of the image in the two-dimensional coordinate system is greater than or equal to 256, the most similar key point is taken as the center point: a rectangular image of size 128 × 128 is cropped from the image to be retrieved with the two-dimensional coordinate system as a local image, and a rectangular image of size 256 × 256 is cropped from the secondary screening image with the two-dimensional coordinate system as a local image;
when the most similar key point lies in the edge area of the image to be retrieved or of the secondary screening image with the two-dimensional coordinate system, that is, its distance to one of the four sides of the image in the two-dimensional coordinate system is smaller than 256, the most similar key point is taken as a corner point of a rectangular cropping area that extends from the corner point toward the central area of the image, and cropping is performed once a cropping area of the target size is obtained, yielding a local image, wherein the target size comprises: a rectangular image of size 128 × 128 cropped from the image to be retrieved with the two-dimensional coordinate system, and a rectangular image of size 256 × 256 cropped from the secondary screening image with the two-dimensional coordinate system.
2. The image retrieval method according to claim 1, wherein the noise reduction processing is realized by Gaussian filtering.
3. The image retrieval method of claim 1, wherein the similarity score is obtained by linear cross-correlation.
4. The image retrieval method according to claim 1, wherein the similarity comparison is evaluated by the Euclidean distance between key point vectors: when the Euclidean distance between a key point vector extracted from the noise-reduced image to be retrieved and a key point vector extracted from a noise-reduced retrieval dataset image is less than or equal to a priori threshold, the two are considered similar.
5. The image retrieval method as recited in claim 4, wherein the a priori threshold is set to 0.6; and/or, the similarity threshold is set to 0.9.
6. The image retrieval method of claim 1, wherein training the twin network model comprises:
forming a training set from template images of the same type as the image to be retrieved and the cropped retrieval dataset images;
arranging five anchor boxes pixel by pixel on each input training-set image by coordinate mapping, as follows: the area of each anchor box is 1/64 of the original image, with aspect ratios of 0.33, 0.5, 1, 2, and 3, respectively;
training the twin network model based on the triplet loss function, using the trained weights of the improved ResNet50 network as initialization weights, wherein the template images are set as positive labels during training, and the cropped retrieval dataset images whose key-point Euclidean distance to the key points of the template image is greater than 0.6 are set as negative labels;
and obtaining the trained twin network model once the triplet loss function is stable, i.e. the network has converged.
7. The image retrieval method according to claim 1, wherein the obtaining of the retrieval result includes:
S51, extracting feature vectors of the local image of the image to be retrieved and the local images of its similar images through the trained improved ResNet50 network;
S52, taking the extracted depth features of the image to be retrieved as convolution kernels, and convolving them with the depth features of the similar images by linear cross-correlation to obtain a similarity score map with 2 channels and a pixel size of 17 × 17;
S53, classifying the similarity score map through a softmax function, including: calculating the maximum value and the minimum value of the similarity score map, and if both are greater than 0.5, regarding the sample as a positive sample; among the positive samples, a sample whose similarity score is greater than the preset similarity threshold of 0.9 is taken as an image required by the retrieval.
8. The image retrieval method according to claim 1, wherein the image to be retrieved is a LOGO-like image.
9. A retrieval system implementing the image retrieval method of any one of claims 1 to 8, comprising: a database module that stores the image to be retrieved and the retrieval dataset; a noise reduction module that performs noise reduction on the image to be retrieved and the images in the retrieval dataset; a SIFT corner detection and similarity matching module that performs global feature extraction and similar-image screening; an image cropping module that performs region-of-interest cropping; a model processing module that builds and trains the twin network model; and a secondary retrieval module that can perform secondary retrieval on the retrieval results.
CN202310640768.4A 2023-06-01 2023-06-01 Image retrieval method and system for cascading corner features and twin network Active CN116701695B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310640768.4A CN116701695B (en) 2023-06-01 2023-06-01 Image retrieval method and system for cascading corner features and twin network


Publications (2)

Publication Number Publication Date
CN116701695A CN116701695A (en) 2023-09-05
CN116701695B true CN116701695B (en) 2024-01-30

Family

ID=87828619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310640768.4A Active CN116701695B (en) 2023-06-01 2023-06-01 Image retrieval method and system for cascading corner features and twin network

Country Status (1)

Country Link
CN (1) CN116701695B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013196458A (en) * 2012-03-21 2013-09-30 Casio Comput Co Ltd Image search system, image search device, image search method and program
CN105550381A (en) * 2016-03-17 2016-05-04 北京工业大学 Efficient image retrieval method based on improved SIFT (scale invariant feature transform) feature
CN107403407A (en) * 2017-08-04 2017-11-28 深圳市唯特视科技有限公司 A kind of breathing tracking based on thermal imaging
CN110825899A (en) * 2019-09-18 2020-02-21 武汉纺织大学 Clothing image retrieval method integrating color features and residual network depth features
CN111028277A (en) * 2019-12-10 2020-04-17 中国电子科技集团公司第五十四研究所 SAR and optical remote sensing image registration method based on pseudo-twin convolutional neural network
CN111881906A (en) * 2020-06-18 2020-11-03 广州万维创新科技有限公司 LOGO identification method based on attention mechanism image retrieval
CN112966137A (en) * 2021-01-27 2021-06-15 中国电子进出口有限公司 Image retrieval method and system based on global and local feature rearrangement
CN113223068A (en) * 2021-05-31 2021-08-06 西安电子科技大学 Multi-modal image registration method and system based on depth global features
CN113705588A (en) * 2021-10-28 2021-11-26 南昌工程学院 Twin network target tracking method and system based on convolution self-attention module
CN113742504A (en) * 2021-09-13 2021-12-03 城云科技(中国)有限公司 Method, device, computer program product and computer program for searching images by images
CN114168768A (en) * 2021-12-07 2022-03-11 深圳市华尊科技股份有限公司 Image retrieval method and related equipment
CN114299559A (en) * 2021-12-27 2022-04-08 杭州电子科技大学 Finger vein identification method based on lightweight fusion global and local feature network
CN115129920A (en) * 2022-06-16 2022-09-30 武汉大学 Cross-modal retrieval method and device for local feature enhanced optical SAR remote sensing image
CN115937552A (en) * 2022-10-21 2023-04-07 华南理工大学 Image matching method based on fusion of manual features and depth features

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7450740B2 (en) * 2005-09-28 2008-11-11 Facedouble, Inc. Image classification and information retrieval over wireless digital networks and the internet
US20210118136A1 (en) * 2019-10-22 2021-04-22 Novateur Research Solutions LLC Artificial intelligence for personalized oncology


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"图像特征检测与匹配方法研究综述" (A survey of image feature detection and matching methods); Tang Can et al.; Journal of Nanjing University of Information Science and Technology (Natural Science Edition); 261-273 *

Also Published As

Publication number Publication date
CN116701695A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
Su et al. A fast forgery detection algorithm based on exponential-Fourier moments for video region duplication
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
Lomio et al. Classification of building information model (BIM) structures with deep learning
Wang et al. Multiscale deep alternative neural network for large-scale video classification
Hamida et al. Handwritten arabic words recognition system based on hog and gabor filter descriptors
CN105654122B (en) Based on the matched spatial pyramid object identification method of kernel function
Zhang et al. Automatic discrimination of text and non-text natural images
CN112580480A (en) Hyperspectral remote sensing image classification method and device
CN110659374A (en) Method for searching images by images based on neural network extraction of vehicle characteristic values and attributes
Wu et al. Deep texture exemplar extraction based on trimmed T-CNN
CN109902690A (en) Image recognition technology
CN116701695B (en) Image retrieval method and system for cascading corner features and twin network
CN112446372B (en) Text detection method based on channel grouping attention mechanism
Kota et al. Summarizing lecture videos by key handwritten content regions
CN114359786A (en) Lip language identification method based on improved space-time convolutional network
Essa et al. High order volumetric directional pattern for video-based face recognition
Saudagar et al. Efficient Arabic text extraction and recognition using thinning and dataset comparison technique
Wang Extraction algorithm of English text information from color images based on radial wavelet transform
Wadhwa et al. Dissected Urdu Dots Recognition Using Image Compression and KNN Classifier
Shri et al. Video Analysis for Crowd and Traffic Management
Guyomard et al. Contextual detection of drawn symbols in old maps
Liu et al. TFPGAN: Tiny Face Detection with Prior Information and GAN
Raveendra et al. A novel automatic system for logo-based document image retrieval using hybrid SVDM-DLNN
Isnanto et al. Determination of the optimal threshold value and number of keypoints in scale invariant feature transform-based copy-move forgery detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant