CN114140657A - Image retrieval method based on multi-feature fusion - Google Patents

Image retrieval method based on multi-feature fusion

Info

Publication number: CN114140657A
Authority: CN (China)
Prior art keywords: feature, image, features, target image, vector
Legal status: Pending
Application number: CN202111017516.3A
Other languages: Chinese (zh)
Inventors: 张华熊, 江宁远
Current assignee: Zhejiang Sci Tech University ZSTU
Original assignee: Zhejiang Sci Tech University ZSTU
Application filed by Zhejiang Sci Tech University ZSTU
Priority to CN202111017516.3A (priority date 2021-08-30)
Publication of CN114140657A (2022-03-04)


Classifications

    • G06F18/253 — Pattern recognition; Analysing; Fusion techniques of extracted features (G: Physics; G06: Computing, Calculating or Counting; G06F: Electric Digital Data Processing)
    • G06F16/53 — Information retrieval of still image data; Querying
    • G06F18/2135 — Feature extraction, e.g. by transforming the feature space; based on approximation criteria, e.g. principal component analysis
    • G06N3/08 — Computing arrangements based on biological models; Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an image retrieval method based on multi-feature fusion, which extracts the content features of an image by fusing two complementary feature types at different levels, namely shallow visual features and deep learning features. The method describes image features accurately and improves the reliability and robustness of image retrieval. The designed fusion feature combines the geometric invariance of shallow visual features with the high-level semantic characteristics of deep learning features, and is therefore superior to traditional features and to any single feature. PCA dimension reduction is applied to the fusion feature, so the resulting feature dimension is low, which gives the method great advantages in feature-comparison speed and feature-storage space. The multi-feature fusion scheme is simple, the retrieval process is efficient, and the retrieval accuracy is high.

Description

Image retrieval method based on multi-feature fusion
Technical Field
The invention belongs to the technical field of image retrieval, and particularly relates to an image retrieval method based on multi-feature fusion.
Background
Image retrieval is one of the key research topics in the fields of information retrieval and machine vision. It refers to the process by which the user of a retrieval system searches an image database of a certain scope for the images he or she requires. Image retrieval techniques can be classified into two categories according to how images are described: text-based image retrieval and content-based image retrieval. Text-based image retrieval describes image content through manual text annotation, so that retrieval is realized by keyword search; it suffers from the huge workload of manual annotation, strong subjectivity, and the inability of text annotation to fully cover the content of an image. Content-based image retrieval, in contrast, starts from the content of the image itself and thus effectively overcomes the ambiguity inherent in text annotation.
The content features of an image can currently be divided into shallow visual features and deep learning features. Shallow visual features mainly refer to the visual content expressed by an image and generally comprise global features such as color, texture, and shape, as well as local features such as SIFT. SIFT local features are invariant to image rotation, scale change, brightness change, and other transformations, and are therefore widely applied in the field of computer vision. Deep learning features are image features extracted by a deep neural network, which can autonomously learn complex feature representations of images through data-driven training and extract high-level semantic information. Compared with shallow visual features, they effectively reduce the errors caused by the 'semantic gap' and achieve better retrieval results.
In [Babenko A, Slesarev A, Chigorin A, et al. Neural codes for image retrieval[C]//European Conference on Computer Vision. Springer, Cham, 2014: 584-599] it is proposed to extract image features from a fully connected layer of a CNN model pre-trained on ImageNet and use them in image retrieval scenarios; good results are obtained, but fully connected layer features lack a degree of geometric invariance, so certain problems remain in image retrieval. These shortcomings of retrieval methods based on single-feature extraction have promoted research on image retrieval methods based on multi-feature fusion. A coupled multi-index, in which complementary features such as SIFT and color are coupled at the index level, is proposed in [Zheng L, Wang S, Liu Z, et al. Packing and padding: Coupled multi-index for accurate image retrieval[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014: 1939-1946].
Existing image retrieval methods based on multi-feature fusion therefore still have drawbacks: the fusion pipeline is complex, feature extraction and fusion take more time, and the use and fusion of multiple features increases the dimensionality of the image features, which greatly increases retrieval time.
Disclosure of Invention
In view of the above, the present invention provides an image retrieval method based on multi-feature fusion, which extracts the content features of an image by fusing two complementary feature types at different levels, namely shallow visual features and deep learning features. The method describes image features accurately and improves the reliability and robustness of image retrieval.
An image retrieval method based on multi-feature fusion comprises the following steps:
(1) SIFT feature extraction is carried out on the target image, and the SIFT features are coded by utilizing a pre-trained visual dictionary and serve as the shallow visual features of the target image;
(2) inputting the preprocessed target image into a pre-trained Resnet50 neural network to extract convolutional layer characteristics as deep learning characteristics of the target image;
(3) respectively carrying out L2 norm normalization on the shallow visual feature and the deep learning feature of the target image, then carrying out weighted concatenation of the normalized features combined with PCA (Principal Component Analysis) dimension reduction, thereby obtaining the fusion feature of the target image;
(4) comparing the fusion feature of the target image with all image feature vectors in the feature library, and finally obtaining the retrieval result by means of query expansion.
Further, the pre-training process of the visual dictionary in step (1) is as follows: SIFT feature vectors are extracted from each image in an image data set, the set of feature vectors is clustered with the K-means clustering algorithm and finally divided into a number of clusters, and the cluster center of each cluster is regarded as a visual word of the visual dictionary.
Further, in step (1) the SIFT features of the target image are encoded with a local feature encoding algorithm that aggregates the SIFT local features by multi-neighbor soft assignment; the membership degree between a SIFT feature vector and its n neighbor visual words is calculated from a distance ratio as follows:
$$u_{ij} = \frac{\exp\left(-\beta \left\lVert x_i - b_j \right\rVert_2^2\right)}{\sum_{k=1}^{n} \exp\left(-\beta \left\lVert x_i - b_k \right\rVert_2^2\right)}$$
wherein: x_i is a SIFT feature vector of the target image, n is the number of neighbor visual words assigned to x_i, b_j is the j-th assigned neighbor visual word, u_ij is the membership degree of x_i with respect to the neighbor visual word b_j, and β is a smoothing factor controlling the rate of change of the function.
Further, step (2) is specifically implemented as follows: the target image is first scaled to 224 × 224 pixels and mean-subtracted; the processed image is then input into the pre-trained Resnet50 neural network, the feature map output by the 5th convolutional stage of the network is extracted, and the feature map is aggregated into a one-dimensional feature vector that serves as the deep learning feature of the target image.
Further, the pre-training process of the Resnet50 neural network is as follows: the Resnet50 network is initialized with weight parameters trained on the ImageNet data set and then transfer-trained on the image data set, i.e., Resnet50 is trained as a softmax classifier; during training, a cross-entropy loss function and a mini-batch optimizer are used to transfer-train the network batch by batch through forward propagation and backward propagation.
Further, the feature map is aggregated into a one-dimensional feature vector using RMAC (Regional Maximum Activation of Convolutions) coding, which is implemented as follows: for each two-dimensional channel of the feature map, uniform sampling is first performed with a multi-scale sliding-window strategy, where the side length of the square window at the l-th scale is 2 × min(W, H)/(l + 1), W and H being the width and height of the feature map; the square windows slide over the feature map with an overlap of no less than 40% between adjacent windows. The maximum feature responses of all local regions extracted at every scale are then summed to obtain the RMAC feature value of that channel. Finally, the RMAC feature values of all channels are assembled into a one-dimensional vector, which serves as the deep learning feature of the target image.
Further, step (4) is specifically implemented as follows: the fusion feature of the target image is first taken as the initial query vector F_0, its similarity to all image feature vectors in the feature library is computed, and the k most similar image feature vectors {F_1, F_2, …, F_k} are found; the mean F_avg of F_0 and {F_1, F_2, …, F_k} is then calculated by the following formula and taken as the new query vector; finally, the similarity between the new query vector and all image feature vectors in the feature library is computed, and the k most similar image feature vectors are returned as the final retrieval result;
$$F_{avg} = \frac{1}{k+1}\left(F_0 + \sum_{m=1}^{k} F_m\right)$$
wherein: k is a positive integer set by the user.
Based on the above technical scheme, the invention has the following beneficial technical effects:
1. By combining the geometric invariance of shallow visual features with the high-level semantic characteristics of deep learning features, the fusion feature designed in the invention is superior to traditional features and to single features.
2. PCA dimension reduction is applied to the fusion feature, so the resulting feature dimension is low, which gives great advantages in feature-comparison speed and feature-storage space.
3. The multi-feature fusion scheme designed in the invention is simple, the retrieval process is efficient, and the retrieval accuracy is high.
Drawings
FIG. 1 is a flowchart illustrating steps of an image retrieval method according to the present invention.
Fig. 2 is a schematic diagram of RMAC feature encoding in the present invention.
Fig. 3 shows an example of similar-image retrieval results obtained by the method of the present invention.
Detailed Description
In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.
As shown in fig. 1, the image retrieval method based on multi-feature fusion of the present invention includes the following steps:
(1) Feature extraction.
The invention extracts shallow visual features and deep learning features from the image to be retrieved.
For shallow visual features: SIFT feature vectors are first extracted from the image set; the set of feature vectors is clustered with the K-means clustering algorithm and finally divided into a number of clusters; each cluster center is regarded as a visual word, which yields the visual dictionary.
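By way of illustration, the dictionary construction can be sketched in a few lines of Python with OpenCV and scikit-learn; the dictionary size and function name here are illustrative assumptions, not values fixed by the invention:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_visual_dictionary(image_paths, num_words=1000):
    """Cluster SIFT descriptors of an image set; each cluster centre is a visual word."""
    sift = cv2.SIFT_create()
    descriptors = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            descriptors.append(desc)
    all_desc = np.vstack(descriptors)                  # (total_keypoints, 128)
    kmeans = KMeans(n_clusters=num_words, n_init=4, random_state=0).fit(all_desc)
    return kmeans.cluster_centers_                     # (num_words, 128)
```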
Then the SIFT local features of the query image are aggregated by multi-neighbor soft assignment, and the membership degree between each SIFT feature vector and its n neighbor visual words is calculated from a distance ratio:
$$u_{ij} = \frac{\exp\left(-\beta \left\lVert x_i - b_j \right\rVert_2^2\right)}{\sum_{k=1}^{n} \exp\left(-\beta \left\lVert x_i - b_k \right\rVert_2^2\right)}$$
wherein: u_ij denotes the membership degree of local feature x_i with respect to visual word b_j, β is a smoothing factor controlling the rate of change of the function, ‖x_i − b_j‖_2 denotes the Euclidean distance between local feature x_i and visual word b_j, and n denotes the number of neighbor visual words to which x_i is assigned.
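A minimal numpy sketch of this encoding step is given below; the exponential distance-ratio form of u_ij follows the formula above, while the parameter values (n = 5 neighbors, β = 0.1) are illustrative assumptions:

```python
import numpy as np

def soft_assign_bow(sift_descs, dictionary, n_neighbors=5, beta=0.1):
    """sift_descs: (m, 128) local features; dictionary: (K, 128) visual words.
    Returns an L2-normalized bag-of-words histogram of length K."""
    hist = np.zeros(dictionary.shape[0])
    for x in sift_descs:
        d2 = np.sum((dictionary - x) ** 2, axis=1)   # squared Euclidean distances
        nn = np.argsort(d2)[:n_neighbors]            # n nearest visual words
        u = np.exp(-beta * d2[nn])
        hist[nn] += u / u.sum()                      # memberships u_ij sum to 1
    return hist / (np.linalg.norm(hist) + 1e-12)
```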
For deep learning features: Resnet50 is first initialized with weight parameters trained on the ImageNet data set; transfer training is then performed on the target image set, with the Resnet50 network trained as a softmax classifier; during training, a cross-entropy loss function and a mini-batch optimizer are used to transfer-train the network batch by batch through forward propagation and backward propagation.
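The transfer-training step can be sketched as follows in PyTorch; mini-batch SGD with momentum stands in for the unspecified mini-batch optimizer, and the loader, class count, and hyper-parameters are assumptions of the sketch:

```python
import torch
import torch.nn as nn
from torchvision import models

def finetune_resnet50(train_loader, num_classes, epochs=10):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Initialize with ImageNet weights, replace the head with a new classifier.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                  # softmax + cross entropy
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:            # mini-batch training
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)    # forward propagation
            loss.backward()                            # backward propagation
            optimizer.step()
    return model
```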
The query image is then scaled to 224 × 224 pixels, mean-subtracted, and input into the transfer-learned Resnet50 network; the feature map of the Conv5 stage is extracted, and the three-dimensional feature map is aggregated into a one-dimensional feature vector.
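A sketch of the preprocessing and Conv5 extraction follows; the normalization constants are the common ImageNet convention (the invention only specifies 224 × 224 scaling and mean subtraction), so treat them as an assumption:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet means
                         std=[0.229, 0.224, 0.225]),
])

def extract_conv5(model, image_path):
    # Drop the average-pooling and fully connected layers, keeping conv1..layer4
    # (layer4 is the 5th convolutional stage, i.e. Conv5).
    backbone = nn.Sequential(*list(model.children())[:-2]).eval()
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = backbone(x)               # (1, 2048, 7, 7) for ResNet-50
    return fmap.squeeze(0).numpy()       # (2048, 7, 7)
```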
The aggregation uses RMAC feature coding. A sliding-window strategy first samples the feature map Ω uniformly: the side length of the square region at the 1st scale is min(W, H), the side length at the 2nd scale is 2 × min(W, H)/3, and in general the side length at the L-th scale is 2 × min(W, H)/(L + 1); the square regions slide over the feature map Ω with an overlap of no less than 40% between adjacent regions. Fig. 2 shows the region selection process at scales 1 to 3: 2 regions are extracted from the feature map when L = 1, 6 regions when L = 2, and 12 regions when L = 3, so 20 local regions in total are extracted from a single feature map.
The global maximum and the 20 regional maxima are then summed to obtain the RMAC feature of a single feature map, as follows:

$$F_i = f_{i,0} + \sum_{r=1}^{R} f_{i,r}$$

wherein R represents the number of local regions on the feature map, f_{i,0} represents the global maximum response of the i-th channel, and f_{i,r} represents the maximum feature response of region r in the i-th channel.
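A simplified numpy sketch of this aggregation is given below. It keeps the stated ingredients (multi-scale square windows with side 2 × min(W, H)/(l + 1), roughly 40% overlap, per-channel maxima summed with the global maximum); the exact window placement is an assumption of the sketch:

```python
import numpy as np

def _positions(dim, side):
    """Window offsets along one axis with roughly 40% overlap, covering the edges."""
    if dim <= side:
        return [0]
    n = int(np.ceil((dim - side) / (side * 0.6))) + 1
    return sorted({int(round(p)) for p in np.linspace(0, dim - side, n)})

def rmac(fmap, scales=(1, 2, 3)):
    """fmap: (C, H, W) conv feature map -> (C,) RMAC vector."""
    C, H, W = fmap.shape
    out = fmap.reshape(C, -1).max(axis=1)              # global maximum per channel
    for l in scales:
        side = max(1, int(round(2 * min(W, H) / (l + 1))))
        for top in _positions(H, side):
            for left in _positions(W, side):
                region = fmap[:, top:top + side, left:left + side]
                out = out + region.reshape(C, -1).max(axis=1)
    return out
```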
(2) Feature fusion.
The invention normalizes the shallow visual feature and the deep learning feature of the query image separately with the L2 norm. Suppose a feature vector X = (x_1, x_2, …, x_n).
First the L2 norm of the vector X is calculated:

$$\lVert X \rVert_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$
each dimension of vector X is then divided by | X |2To obtain a new normalized vector X' of L2 norm, i.e.:
Figure BDA0003236407610000063
The L2-normalized shallow visual feature and deep learning feature are then concatenated with weights, the fusion being:
$$F = [\gamma_1 F_B, \gamma_2 F_R]$$
wherein: fBSIFT-BOW shallow visual feature, F, for multi-neighbor soft allocationRResnet50 deep convolution feature, γ, for transfer learning1、γ2Are weight parameters of both, respectively, and γ12=1。
Finally, PCA (principal component analysis) dimension reduction is applied to the fused feature to remove redundant information and obtain the final fusion feature. PCA maps the original n-dimensional feature onto k dimensions, where the k dimensions are completely new orthogonal features, also called principal components. The main transformation steps, sketched in code after the list, are as follows:
first, the original data X (m, n) is de-averaged, i.e. the average value is subtracted from each feature dimension.
Secondly, the covariance matrix C(n, n) is calculated, together with its eigenvalues and eigenvectors.
And thirdly, selecting eigenvectors corresponding to the largest k eigenvalues in the covariance matrix C to form a dimension reduction matrix T (n, k).
Fourthly, by the formula X_new = X × T, the obtained X_new is the principal-component data of the original data X reduced to k dimensions.
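The four steps translate directly into numpy as below (in practice sklearn.decomposition.PCA fitted on the image library gives the same result; the function name is illustrative):

```python
import numpy as np

def pca_reduce(X, k):
    """X: (m, n) fused features of m images -> (m, k) principal components."""
    X_centered = X - X.mean(axis=0)                 # step 1: de-averaging
    C = np.cov(X_centered, rowvar=False)            # step 2: covariance matrix C(n, n)
    eigvals, eigvecs = np.linalg.eigh(C)            # its eigenvalues and eigenvectors
    T = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # step 3: top-k eigenvectors -> T(n, k)
    return X_centered @ T                           # step 4: X_new = X x T
```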
(3) Feature retrieval.
The fusion feature of the query image is compared against the vector library using Euclidean distance, and the results are retrieved and re-ranked with an average query expansion strategy to improve retrieval accuracy, yielding the final retrieval result. The average query expansion strategy is executed as follows:
① The content feature vector F_0 of the query image is extracted; F_0 is compared for similarity with the feature vectors in the feature library, and the k nearest feature vectors F = {F_1, F_2, …, F_k} are found.
② The mean F_avg of F_0 and the feature vector set F = {F_1, F_2, …, F_k} is calculated as:

$$F_{avg} = \frac{1}{k+1}\left(F_0 + \sum_{m=1}^{k} F_m\right)$$

wherein: F_0 is the initial query vector and F_m is the feature vector of the m-th result of the initial query.
③ Finally, the mean F_avg is used as the new query feature, and the query is executed again to obtain the final retrieval result (a code sketch of the whole strategy is given below). FIG. 3 shows an example of image retrieval results of the present invention, where the first boxed image is the target query image and the remaining images are the retrieved similar images.
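A compact sketch of retrieval with average query expansion, assuming a feature library held as a numpy matrix:

```python
import numpy as np

def retrieve_with_aqe(f0, library, k=10):
    """f0: (d,) query feature; library: (N, d) feature library.
    Returns the indices of the k final results."""
    dist = np.linalg.norm(library - f0, axis=1)          # Euclidean distances
    top_k = np.argsort(dist)[:k]                         # step 1: initial top-k
    f_avg = (f0 + library[top_k].sum(axis=0)) / (k + 1)  # step 2: mean query
    dist2 = np.linalg.norm(library - f_avg, axis=1)      # step 3: re-query
    return np.argsort(dist2)[:k]
```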
The embodiments described above are presented to enable a person of ordinary skill in the art to understand and use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments; improvements and modifications made by those skilled in the art based on the disclosure of the present invention fall within the protection scope of the present invention.

Claims (7)

1. An image retrieval method based on multi-feature fusion comprises the following steps:
(1) SIFT feature extraction is carried out on the target image, and the SIFT features are coded by utilizing a pre-trained visual dictionary and serve as the shallow visual features of the target image;
(2) inputting the preprocessed target image into a pre-trained Resnet50 neural network to extract convolutional layer characteristics as deep learning characteristics of the target image;
(3) respectively carrying out L2 norm normalization on the shallow visual feature and the deep learning feature of the target image, then carrying out weighted concatenation of the normalized features combined with PCA (Principal Component Analysis) dimension reduction, thereby obtaining the fusion feature of the target image;
(4) comparing the fusion feature of the target image with all image feature vectors in the feature library, and finally obtaining the retrieval result by means of query expansion.
2. The image retrieval method according to claim 1, characterized in that: the pre-training process of the visual dictionary in step (1) is as follows: SIFT feature vectors are extracted from each image in an image data set, the set of feature vectors is clustered with the K-means clustering algorithm and finally divided into a number of clusters, and the cluster center of each cluster is regarded as a visual word of the visual dictionary.
3. The image retrieval method according to claim 1, characterized in that: in step (1), the SIFT features of the target image are encoded with a local feature encoding algorithm that aggregates the SIFT local features by multi-neighbor soft assignment; the membership degree between a SIFT feature vector and its n neighbor visual words is calculated from a distance ratio as follows:
$$u_{ij} = \frac{\exp\left(-\beta \left\lVert x_i - b_j \right\rVert_2^2\right)}{\sum_{k=1}^{n} \exp\left(-\beta \left\lVert x_i - b_k \right\rVert_2^2\right)}$$

wherein: x_i is a SIFT feature vector of the target image, n is the number of neighbor visual words assigned to x_i, b_j is the j-th assigned neighbor visual word, u_ij is the membership degree of x_i with respect to the neighbor visual word b_j, and β is a smoothing factor controlling the rate of change of the function.
4. The image retrieval method according to claim 1, characterized in that: step (2) is specifically implemented as follows: the target image is first scaled to 224 × 224 pixels and mean-subtracted; the processed image is then input into the pre-trained Resnet50 neural network, the feature map output by the 5th convolutional stage of the network is extracted, and the feature map is aggregated into a one-dimensional feature vector that serves as the deep learning feature of the target image.
5. The image retrieval method according to claim 1, characterized in that: the pre-training process of the Resnet50 neural network is as follows: the Resnet50 network is initialized with weight parameters trained on the ImageNet data set and then transfer-trained on the image data set, i.e., Resnet50 is trained as a softmax classifier; during training, a cross-entropy loss function and a mini-batch optimizer are used to transfer-train the network batch by batch through forward propagation and backward propagation.
6. The image retrieval method according to claim 4, characterized in that: the feature map is aggregated into a one-dimensional feature vector using RMAC coding, which is implemented as follows: for each two-dimensional channel of the feature map, uniform sampling is first performed with a multi-scale sliding-window strategy, where the side length of the square window at the l-th scale is 2 × min(W, H)/(l + 1), W and H being the width and height of the feature map; the square windows slide over the feature map with an overlap of no less than 40% between adjacent windows. The maximum feature responses of all local regions extracted at every scale are then summed to obtain the RMAC feature value of that channel. Finally, the RMAC feature values of all channels are assembled into a one-dimensional vector, which serves as the deep learning feature of the target image.
7. The image retrieval method according to claim 1, characterized in that: step (4) is specifically implemented as follows: the fusion feature of the target image is first taken as the initial query vector F_0, its similarity to all image feature vectors in the feature library is computed, and the k most similar image feature vectors {F_1, F_2, …, F_k} are found; the mean F_avg of F_0 and {F_1, F_2, …, F_k} is then calculated by the following formula and taken as the new query vector; finally, the similarity between the new query vector and all image feature vectors in the feature library is computed, and the k most similar image feature vectors are returned as the final retrieval result;

$$F_{avg} = \frac{1}{k+1}\left(F_0 + \sum_{m=1}^{k} F_m\right)$$

wherein: k is a positive integer set by the user.
CN202111017516.3A (priority date 2021-08-30, filing date 2021-08-30): Image retrieval method based on multi-feature fusion; status: Pending; published as CN114140657A (en)

Priority Applications (1)

CN202111017516.3A | Priority date: 2021-08-30 | Filing date: 2021-08-30 | Title: Image retrieval method based on multi-feature fusion (published as CN114140657A, en)

Applications Claiming Priority (1)

CN202111017516.3A | Priority date: 2021-08-30 | Filing date: 2021-08-30 | Title: Image retrieval method based on multi-feature fusion (published as CN114140657A, en)

Publications (1)

Publication Number Publication Date
CN114140657A | 2022-03-04

Family

ID=80393797

Family Applications (1)

CN202111017516.3A | Priority date: 2021-08-30 | Filing date: 2021-08-30 | Title: Image retrieval method based on multi-feature fusion (CN114140657A, en)

Country Status (1)

Country Link
CN (1) CN114140657A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550220A (en) * 2022-04-21 2022-05-27 中国科学技术大学 Training method of pedestrian re-recognition model and pedestrian re-recognition method
CN114550220B (en) * 2022-04-21 2022-09-09 中国科学技术大学 Training method of pedestrian re-recognition model and pedestrian re-recognition method
CN116150417A (en) * 2023-04-19 2023-05-23 上海维智卓新信息科技有限公司 Multi-scale multi-fusion image retrieval method and device
CN116150417B (en) * 2023-04-19 2023-08-04 上海维智卓新信息科技有限公司 Multi-scale multi-fusion image retrieval method and device

Similar Documents

Publication Publication Date Title
CN107239565B (en) Image retrieval method based on saliency region
CN108038122B (en) Trademark image retrieval method
CN110222218B (en) Image retrieval method based on multi-scale NetVLAD and depth hash
CN111125411B (en) Large-scale image retrieval method for deep strong correlation hash learning
CN108897791B (en) Image retrieval method based on depth convolution characteristics and semantic similarity measurement
Wang et al. A new SVM-based active feedback scheme for image retrieval
CN111597298A (en) Cross-modal retrieval method and device based on deep confrontation discrete hash learning
CN114140657A (en) Image retrieval method based on multi-feature fusion
CN110647907A (en) Multi-label image classification algorithm using multi-layer classification and dictionary learning
Xu et al. Iterative manifold embedding layer learned by incomplete data for large-scale image retrieval
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN107169117A (en) A kind of manual draw human motion search method based on autocoder and DTW
Alemu et al. Multi-feature fusion for image retrieval using constrained dominant sets
CN113032613B (en) Three-dimensional model retrieval method based on interactive attention convolution neural network
CN114693397A (en) Multi-view multi-modal commodity recommendation method based on attention neural network
CN112036511B (en) Image retrieval method based on attention mechanism graph convolution neural network
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Xu et al. Weakly supervised facial expression recognition via transferred DAL-CNN and active incremental learning
Nakayama Linear distance metric learning for large-scale generic image recognition
CN105183845A (en) ERVQ image indexing and retrieval method in combination with semantic features
Lu et al. Image retrieval based on incremental subspace learning
CN116578734B (en) Probability embedding combination retrieval method based on CLIP
CN112084353A (en) Bag-of-words model method for rapid landmark-convolution feature matching
Li et al. Otcmr: Bridging heterogeneity gap with optimal transport for cross-modal retrieval
CN115100694A (en) Fingerprint quick retrieval method based on self-supervision neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination