CN114461827A - Method and device for searching picture by picture - Google Patents

Method and device for searching picture by picture

Info

Publication number
CN114461827A
CN114461827A
Authority
CN
China
Prior art keywords
features
image
dictionary
convolution
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210115546.6A
Other languages
Chinese (zh)
Inventor
朱利霞
伊文超
李明明
潘心冰
何彬彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd filed Critical Inspur Cloud Information Technology Co Ltd
Priority to CN202210115546.6A priority Critical patent/CN114461827A/en
Publication of CN114461827A publication Critical patent/CN114461827A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval of still image data
    • G06F16/51 Indexing; Data structures therefor; Storage structures
    • G06F16/53 Querying
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval using metadata automatically derived from the content
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks


Abstract

The invention relates to the technical field of image processing and provides a method for searching a picture by a picture, which comprises the following steps: S1, extracting image features and attention weight information; S2, fusing the obtained features of three different layers with the attention weight information; S3, clustering the features and constructing a dictionary tree; S4, performing reverse indexing according to the dictionary tree to obtain the dictionary vector of each image; and S5, calculating similarity according to the dictionary vectors, sorting, and outputting the search results. Compared with the prior art, the method compresses features on the basis of a dictionary tree: image features are converted into fixed-dimension vectors according to the constructed dictionary tree, which are then stored and used for similarity calculation. This reduces the storage space required, speeds up similarity calculation, and thus completes the image-based image search task accurately and quickly.

Description

Method and device for searching picture by picture
Technical Field
The invention relates to the technical field of image processing, and particularly provides a method and a device for searching images by using images.
Background
With the rapid development of social networks and e-commerce, image data grows at an astonishing rate every day, forming huge image databases. These databases contain rich information, and retrieving the images a user needs from a massive image database is currently a research direction of great application demand and development prospect in the field of image processing. Well-known search engines such as Google and Baidu already provide image search services, making it convenient for users to obtain more image data on demand.
The search-by-image technology is a branch of the image retrieval field. Conventional image retrieval techniques fall into two main categories: text-based image retrieval and image retrieval based on image content semantics. Text-based image retrieval performs retrieval according to textual descriptions, which requires annotating and paraphrasing the image data with text, an almost impossible task for a massive image database; image retrieval based on image content semantics mainly compares image information using cues such as the color and texture of an image and the object categories it contains.
In the actual image retrieval process, complex background noise directly affects the final search performance, so many algorithms use a region proposal network (RPN) to extract regions of interest before extracting features and comparing similarities. However, using an RPN to extract regions of interest requires considerable work to generate and configure the anchor boxes, and the boxed candidate regions are coarse, so foreground information of the image may be lost.
In deep-learning-based image retrieval, similarity is generally calculated on the features of the last layer of a deep neural network to complete the retrieval task, but high-level features lose much detail; fusing high-level and low-level features therefore uses image feature information more efficiently and reasonably. A deep-learning image retrieval pipeline generally proceeds as follows: a convolutional neural network first extracts image features to obtain feature maps, distances between the image features are calculated with a metric learning method such as the Euclidean distance, and the images are then sorted by distance to obtain similar images. However, the feature dimensionality obtained from a neural network is high, so storing the features occupies a large amount of space and calculating similarities is time-consuming.
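A minimal sketch of this conventional pipeline follows; it is an illustration rather than part of the patent, with the function name and array shapes assumed, and the features presumed already extracted by a CNN:

```python
import numpy as np

def baseline_retrieval(query_feat, db_feats, top_n=10):
    # Conventional pipeline described above: compare one CNN feature
    # vector with the database by Euclidean distance, then sort.
    # query_feat: (d,) array; db_feats: (num_images, d) array.
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    return np.argsort(dists)[:top_n]          # indices of nearest images
```

With d in the thousands, storing db_feats and scanning it for every query is exactly the storage and speed burden described above.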
Searching a picture by a picture is a technique in the image search field for finding similar images given a query image; current algorithms are derivatives and refinements of image search based on image content semantics. The most critical part of search-by-image technology is image feature extraction and expression. The problems encountered at present are that global features contain much background noise while local features lose key information, leading to low retrieval accuracy and slow retrieval speed; using both global and local features greatly increases the feature dimensionality and hence the cost of subsequent computation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a highly practical method for searching a picture by a picture.
The invention further aims to provide a reasonably designed, safe and applicable device for searching a picture by a picture.
The technical scheme adopted by the invention to solve the technical problem is as follows:
A method for searching a picture by a picture comprises the following steps:
S1, extracting image features and attention weight information;
S2, fusing the obtained features of three different layers with the attention weight information;
S3, clustering the features and constructing a dictionary tree;
S4, performing reverse indexing according to the dictionary tree to obtain the dictionary vector of each image;
and S5, calculating similarity according to the dictionary vectors, sorting, and outputting the search results.
Further, in step S1, the image I1 is resized to a fixed 224 × 224 × 3 and input into the VGG16 convolution model, and convolution feature maps of three different levels are extracted, wherein the first-level convolution feature map is taken from layers 3 to 5 of the VGG16 convolution model, the second-level map from layers 7 to 9, and the third-level map from layers 10 to 13.
Further, in step S1, the method further includes:
S11, extracting the layer-4 convolution feature map of VGG16, of size 112 × 112 × 128, denoted F1; extracting the layer-7 convolution feature map of VGG16, of size 56 × 56 × 256, denoted F2; and extracting the layer-13 convolution feature map of VGG16, of size 14 × 14 × 512, denoted F3;
S12, extracting the attention weight information A3 of F3; the attention weight information A1 of F1 and A2 of F2 is obtained in the same way.
Further, in step S2, the method further includes:
S21, merging the attention weight information into the corresponding features to obtain attention-weighted features:
F1' = A1 ⊙ F1, F2' = A2 ⊙ F2, F3' = A3 ⊙ F3
where ⊙ denotes the element-wise weighting of a feature map by its attention weights;
S22, fusing the three attention-weighted features of different scales: the feature F3' is upsampled to a feature of size 56 × 56 × 512 and spliced with the feature of size 56 × 56 × 512 obtained by passing F2' through a 1 × 1 convolution block:
F23 = Concat(Up(F3'), Conv1×1(F2'))
where Up(F3') denotes upsampling of the feature F3', Conv1×1(F2') denotes a convolution operation on F2' with a convolution kernel size of 1 × 1, and Concat(·, ·) denotes the splicing of the two features;
S23, the spliced feature is then upsampled to a feature of size 112 × 112 × 512 and spliced with the feature of size 112 × 112 × 512 obtained by passing F1' through a 1 × 1 convolution block, finally yielding the fused feature:
Ffuse = Concat(Up(F23), Conv1×1(F1'))
The feature information of all images in the image database can be extracted through step S2.
Further, in step S3, the method further includes:
S31, performing adaptive hierarchical K-Means clustering on the obtained features to construct a dictionary tree, where each layer of the dictionary tree has at most n cluster nodes and the tree has at most m layers: the features are first clustered to obtain p cluster centers {μ1, …, μp}; clustering then continues under each cluster center until the preset number of layers is reached, at which point clustering ends and the construction of the dictionary tree, whose nodes are the cluster centers, is complete;
S32, calculating the node weight according to the number of features covered by each node in the dictionary tree, using the formula:
W_T = log(N / N_T)
where N represents the total number of images in the image library and N_T represents the number of images whose features are covered by dictionary-tree node T; the logarithm of their ratio gives the weight of node T.
Further, in step S4, dictionary vectors of the images in the image library are calculated from the dictionary-tree information and stored: each image is processed through steps S1 and S2 to obtain its original features, the dictionary tree obtained in step S3 is applied, and the dictionary vector of each database image is calculated, completing the compressed expression of the features.
Further, for image I1, the frequency m_T with which its features appear in each dictionary-tree node T is counted in step S3, and the dictionary vector is calculated as:
d_I1 = (W_T1 · m_T1, …, W_Tp · m_Tp)
where W_T represents the weight of node T in the dictionary tree and d_I1 represents the dictionary vector of image I1 obtained from the dictionary tree; the dictionary vectors of all images in the database can be calculated in this way, the image features are characterized by these vectors, and the vectors are stored in the image feature library.
Further, in step S5, the dictionary vector of the image to be retrieved is calculated and denoted d_query; the similarity between d_query and each image dictionary vector d_j stored in the database is then calculated over their p dimensions, where p represents the dimension of the dictionary vectors d_j and d_query, and d_j represents the dictionary vector of the j-th image in the image library;
then the images are sorted according to similarity, and the top N images with the highest similarity are returned as results.
An apparatus for searching a picture by a picture, comprising: at least one memory and at least one processor;
the at least one memory to store a machine readable program;
the at least one processor is configured to invoke the machine readable program to perform the above method for searching a picture by a picture.
Compared with the prior art, the method and the device for searching a picture by a picture have the following outstanding advantages:
the invention uses the convolution network to extract the features of different layers and different scales, gradually fuses the features and fuses the attention weight information, so that the network not only focuses on the global feature information but also can increase the expression capability of the significant features in the training process, and the subsequent calculation cost is reduced.
The features are compressed on the basis of a dictionary tree: the image features are converted into fixed-dimension vectors according to the constructed dictionary tree, which are then stored and used for similarity calculation. This reduces the storage space required, speeds up similarity calculation, and thus completes the image-based image search task accurately and quickly.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of the method for searching a picture by a picture.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments in order to better understand the technical solutions of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A preferred embodiment is given below:
As shown in FIG. 1, the method for searching a picture by a picture in this embodiment first applies the convolution operations of a VGG16 network to the image and extracts image features of different layers together with their self-attention weight information. Next, the obtained features of the three different layers are fused with the attention weight information: lower-layer features have higher resolution and contain more position and detail information, but are less semantic and noisier, while higher-layer features carry stronger semantic information but have lower resolution and poorer perception of detail. Fusing the features of the three different levels enriches and strengthens the expressive power of the features.
The features are clustered to construct a dictionary tree, and node weights are calculated from the dictionary tree; reverse indexing is performed to calculate the dictionary vectors of the images in the database, which are then stored.
Finally, the similarities are calculated and sorted, and the image retrieval result is output: features of the image to be queried are extracted, reverse indexing is performed according to the constructed dictionary tree to obtain its image vector, the similarity between this vector and the image vectors in the image database is calculated, and the results are sorted to obtain the search result.
The specific operation steps are as follows:
S1, extracting image features and attention weight information:
The image I1 is resized to a fixed 224 × 224 × 3 and input into the VGG16 convolution model, and convolution feature maps of three different levels are extracted, wherein the first-level convolution feature map is taken from layers 3 to 5 of the VGG16 convolution model, the second-level map from layers 7 to 9, and the third-level map from layers 10 to 13.
This step further comprises:
S11, resizing the image I1 to a fixed 224 × 224 × 3 and inputting it into the VGG16 convolution model; the layer-4 convolution feature map of VGG16, of size 112 × 112 × 128, is extracted and denoted F1; the layer-7 convolution feature map, of size 56 × 56 × 256, is extracted and denoted F2; and the layer-13 convolution feature map, of size 14 × 14 × 512, is extracted and denoted F3.
S12, extracting the attention weight information of the features. In a convolutional layer, the output features are obtained as linear combinations of convolution kernels and the original features; convolution kernels are local and their receptive field is limited, whereas the attention mechanism captures global information to obtain a larger receptive field and context. The attention weight information A3 of F3 is extracted, and the attention weight information A1 of F1 and A2 of F2 is obtained in the same way.
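A minimal PyTorch sketch of step S1 follows. The tap indices assume the torchvision layout of VGG16, where the 4th, 7th and 13th convolutional layers sit at positions 7, 14 and 28 of model.features; the attention function is only a placeholder assumption (a softmax over the spatial positions of the channel-summed map), since the patent gives its attention formula only in a drawing:

```python
import torch
from torchvision.models import vgg16

# Tap the 4th, 7th and 13th conv layers of VGG16 to obtain the
# 112x112x128, 56x56x256 and 14x14x512 maps F1, F2, F3 described above.
# Load pretrained weights in practice; weights=None keeps the sketch light.
backbone = vgg16(weights=None).features.eval()
TAPS = {7: "F1", 14: "F2", 28: "F3"}

def extract_features(image):                    # image: (1, 3, 224, 224)
    feats, x = {}, image
    with torch.no_grad():
        for idx, layer in enumerate(backbone):
            x = layer(x)
            if idx in TAPS:
                feats[TAPS[idx]] = x
    return feats

def spatial_attention(f):
    # Placeholder attention (assumed form): softmax over the spatial
    # positions of the channel-summed activation map, one weight per pixel.
    a = f.sum(dim=1, keepdim=True)              # (1, 1, H, W)
    return torch.softmax(a.flatten(2), dim=-1).view_as(a)
```

Applied to a 224 × 224 × 3 input, extract_features returns maps of exactly the three sizes listed in S11.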
S2, fusing the obtained features of the three different layers with the attention weight information;
further comprising the following steps:
and S21, integrating the attention weight information into the corresponding features to obtain the features containing the attention weight:
Figure BDA0003496191570000085
Figure BDA0003496191570000086
Figure BDA0003496191570000087
and S22, fusing the features of the three different scales. Will be characterized by
Figure BDA0003496191570000088
Features of size 56 x 512 are obtained by upsampling and are compared with
Figure BDA0003496191570000089
Features of size 56 × 512 obtained by convolution block of 1 × 1 are spliced:
Figure BDA00034961915700000810
wherein
Figure BDA00034961915700000811
Characteristic of expression pair
Figure BDA00034961915700000812
The up-sampling is carried out and,
Figure BDA00034961915700000813
characteristic of expression pair
Figure BDA00034961915700000814
A convolution operation with a convolution kernel size of 1 x 1 is performed,
Figure BDA00034961915700000815
representation feature
Figure BDA00034961915700000816
And features of
Figure BDA00034961915700000817
And (5) splicing results.
S23, the spliced features are up-sampled to obtain features 112 x 512, and the features are combined with the features
Figure BDA0003496191570000091
Splicing features with the size of 112 × 512 obtained by the features through a convolution block of 1 × 1 to finally obtain fused features
Figure BDA0003496191570000092
Figure BDA0003496191570000093
Wherein the content of the first and second substances,
Figure BDA0003496191570000094
representation feature
Figure BDA0003496191570000095
And characteristics of
Figure BDA0003496191570000096
As a result of the splicing, the result,
Figure BDA0003496191570000097
characteristic of expression pair
Figure BDA0003496191570000098
The up-sampling is carried out and,
Figure BDA0003496191570000099
characteristic of expression pair
Figure BDA00034961915700000910
A convolution operation with a convolution kernel size of 1 x 1 is performed. Through the two steps, the characteristic information of all images in the image database can be extracted, and the next step is carried out.
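A sketch of this fusion under the stated sizes follows. The text leaves "splicing" open; channel-wise concatenation is assumed here, and bilinear interpolation stands in for the unspecified upsampling operator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 1x1 convolution blocks that lift F2' (256 ch) and F1' (128 ch) to the
# 512 channels named in S22/S23.
conv1x1_f2 = nn.Conv2d(256, 512, kernel_size=1)
conv1x1_f1 = nn.Conv2d(128, 512, kernel_size=1)

def fuse(F1, A1, F2, A2, F3, A3):
    # S21: element-wise attention weighting (assumed reading).
    F1w, F2w, F3w = F1 * A1, F2 * A2, F3 * A3
    # S22: upsample F3' to 56x56 and splice with Conv1x1(F2').
    up3 = F.interpolate(F3w, size=(56, 56), mode="bilinear", align_corners=False)
    f23 = torch.cat([up3, conv1x1_f2(F2w)], dim=1)
    # S23: upsample the spliced map to 112x112 and splice with Conv1x1(F1').
    up23 = F.interpolate(f23, size=(112, 112), mode="bilinear", align_corners=False)
    return torch.cat([up23, conv1x1_f1(F1w)], dim=1)
```

Note that with concatenation the fused map carries more than 512 channels; if the 112 × 112 × 512 size in S23 is to hold exactly, the splice would instead be an element-wise addition.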
S3, carrying out feature clustering and constructing a dictionary tree;
the constructed dictionary tree is adaptively constructed according to a specific image database,
further comprising:
and S31, carrying out self-adaptive K-Means hierarchical clustering on the features acquired in the steps and constructing a dictionary tree. Here, the clustering node of each layer in the dictionary tree is set to be at most n, and the dictionary tree is set to be at most m layers. Firstly, clustering the features to obtain p clustering centers, wherein the p clustering centers are respectively { mu1,...,μp}. And then clustering is carried out under each clustering center until the whole clustering process reaches the preset number of layers, clustering is finished, and the construction of the dictionary tree is completed until the clustering centers are the nodes of the dictionary tree.
And S32, calculating the node weight according to the number of the covering features of each node in the dictionary number. The node weight calculation formula is as follows:
WT=log(N/NT) (7)
where N represents the total number of images in the image library, NTRepresenting the number of images for the features covered in the dictionary tree node T, and then taking the logarithm of the ratio of the two to obtain the weight of the node T.
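A compact sketch of the tree construction follows, using scikit-learn's KMeans; the Node layout and the default branching factor n and depth m are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

class Node:
    # One dictionary-tree node: a cluster center plus its children.
    def __init__(self, center=None):
        self.center = center
        self.children = []

def build_tree(features, n=8, m=3, depth=0):
    # S31: recursive hierarchical K-Means, at most n children per node
    # and at most m layers; cluster centers become the tree nodes.
    node = Node()
    if depth == m or len(features) <= n:
        return node
    km = KMeans(n_clusters=n, n_init=10).fit(features)
    for i, center in enumerate(km.cluster_centers_):
        child = build_tree(features[km.labels_ == i], n, m, depth + 1)
        child.center = center
        node.children.append(child)
    return node

def node_weight(N, N_T):
    # S32: W_T = log(N / N_T), with N the images in the library and
    # N_T the images whose features are covered by node T.
    return np.log(N / N_T)
```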
S4, performing reverse indexing according to the dictionary tree to obtain a dictionary vector of the image;
and calculating and storing dictionary vectors of the images in the image library according to the dictionary tree information. The image is processed by the steps S1 and S2 to obtain original features, a dictionary tree is obtained according to the step S3, a dictionary vector of the database image is calculated, and feature compression expression is completed. For image I1Counting the frequency of the appearance of the characteristics in the nodes of the dictionary tree
Figure BDA0003496191570000101
The dictionary vector is calculated according to:
Figure BDA0003496191570000102
wherein WTThe weights of the nodes in the dictionary tree are represented,
Figure BDA0003496191570000103
representing an image I1And obtaining a dictionary vector according to the dictionary tree. According to the method, dictionary vectors of all images in the database can be calculated, the features are characterized and stored in the image feature library.
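A sketch of this step follows, under the TF-IDF-style reading of the formula above (one vector component per node, equal to W_T · m_T); the node_index and weights mappings are illustrative assumptions:

```python
import numpy as np

def descend(tree, feat, counts):
    # Route one local feature down the dictionary tree, always to the
    # nearest child center, counting every node it visits (the m_T values).
    node = tree
    while node.children:
        node = min(node.children,
                   key=lambda c: np.linalg.norm(feat - c.center))
        counts[id(node)] = counts.get(id(node), 0) + 1

def dictionary_vector(features, tree, node_index, weights):
    # node_index: {node id -> vector dimension}; weights: {node id -> W_T}.
    counts = {}
    for f in features:
        descend(tree, f, counts)
    vec = np.zeros(len(node_index))
    for nid, m_T in counts.items():
        vec[node_index[nid]] = weights[nid] * m_T   # W_T * m_T
    return vec
```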
S5, calculating similarity according to the dictionary vectors, sorting, and outputting the search-by-image result:
calculating dictionary vector of image to be retrieved, and recording as dqueryCalculating a dictionary vector d of the band search image according to the following formulaquerySimilarity to image dictionary vectors stored in the database:
Figure BDA0003496191570000104
wherein p represents a dictionary vector
Figure BDA0003496191570000105
Sum vector dqueryThe dimension (c) of (a) is,
Figure BDA0003496191570000106
a dictionary vector representing the jth image in the image library.
Then, the images are sorted according to the similarity, and the top N images with high similarity are returned as results.
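A sketch of the retrieval step follows, assuming the similarity is a normalized inner product (cosine) over the p-dimensional dictionary vectors; the patent gives its exact similarity formula only in a drawing, so this choice is an assumption:

```python
import numpy as np

def search(d_query, db_vectors, top_n=10):
    # db_vectors: (num_images, p) matrix of stored dictionary vectors.
    db = np.asarray(db_vectors)
    sims = db @ d_query / (
        np.linalg.norm(db, axis=1) * np.linalg.norm(d_query) + 1e-12)
    order = np.argsort(-sims)                   # descending similarity
    return order[:top_n], sims[order[:top_n]]
```

Because every dictionary vector has the same fixed dimension p, the comparison reduces to a single matrix-vector product, which is the storage and speed gain claimed for the dictionary-tree compression.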
In this embodiment, an apparatus for searching a picture by a picture includes: at least one memory and at least one processor;
the at least one memory to store a machine readable program;
the at least one processor is configured to invoke the machine readable program to perform the above method for searching a picture by a picture.
The above embodiments are only specific examples; the scope of the present invention includes but is not limited to them, and any suitable change or substitution made by one of ordinary skill in the art that is consistent with the claims of this method and apparatus for searching a picture by a picture shall fall within the protection scope of the present invention.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A method for searching a picture by a picture, characterized by comprising the following steps:
S1, extracting image features and attention weight information;
S2, fusing the obtained features of three different layers with the attention weight information;
S3, clustering the features and constructing a dictionary tree;
S4, performing reverse indexing according to the dictionary tree to obtain the dictionary vector of each image;
and S5, calculating similarity according to the dictionary vectors, sorting, and outputting the search results.
2. The method for searching a picture by a picture according to claim 1, characterized in that in step S1, the image I1 is resized to a fixed 224 × 224 × 3 and input into the VGG16 convolution model, and convolution feature maps of three different levels are extracted, wherein the first-level convolution feature map is taken from layers 3 to 5 of the VGG16 convolution model, the second-level map from layers 7 to 9, and the third-level map from layers 10 to 13.
3. The method for searching a picture by a picture according to claim 2, characterized in that step S1 further comprises:
S11, extracting the layer-4 convolution feature map of VGG16, of size 112 × 112 × 128, denoted F1; extracting the layer-7 convolution feature map of VGG16, of size 56 × 56 × 256, denoted F2; and extracting the layer-13 convolution feature map of VGG16, of size 14 × 14 × 512, denoted F3;
S12, extracting the attention weight information A3 of F3; the attention weight information A1 of F1 and A2 of F2 is obtained in the same way.
4. The method for searching a picture by a picture according to claim 3, characterized in that step S2 further comprises:
S21, merging the attention weight information into the corresponding features to obtain attention-weighted features:
F1' = A1 ⊙ F1, F2' = A2 ⊙ F2, F3' = A3 ⊙ F3
where ⊙ denotes the element-wise weighting of a feature map by its attention weights;
S22, fusing the three attention-weighted features of different scales: the feature F3' is upsampled to a feature of size 56 × 56 × 512 and spliced with the feature of size 56 × 56 × 512 obtained by passing F2' through a 1 × 1 convolution block:
F23 = Concat(Up(F3'), Conv1×1(F2'))
where Up(F3') denotes upsampling of the feature F3', Conv1×1(F2') denotes a convolution operation on F2' with a convolution kernel size of 1 × 1, and Concat(·, ·) denotes the splicing of the two features;
S23, upsampling the spliced feature to a feature of size 112 × 112 × 512 and splicing it with the feature of size 112 × 112 × 512 obtained by passing F1' through a 1 × 1 convolution block, finally yielding the fused feature:
Ffuse = Concat(Up(F23), Conv1×1(F1'))
the feature information of all images in the image database being extracted through step S2.
5. The method for searching a picture by a picture according to claim 4, characterized in that step S3 further comprises:
S31, performing adaptive hierarchical K-Means clustering on the obtained features to construct a dictionary tree, wherein each layer of the dictionary tree has at most n cluster nodes and the tree has at most m layers: the features are first clustered to obtain p cluster centers {μ1, …, μp}; clustering then continues under each cluster center until the preset number of layers is reached, at which point clustering ends and the construction of the dictionary tree, whose nodes are the cluster centers, is complete;
S32, calculating the node weight according to the number of features covered by each node in the dictionary tree, using the formula:
W_T = log(N / N_T)
where N represents the total number of images in the image library and N_T represents the number of images whose features are covered by dictionary-tree node T; the logarithm of their ratio gives the weight of node T.
6. The method for searching a picture by a picture according to claim 5, characterized in that in step S4, dictionary vectors of the images in the image library are calculated from the dictionary-tree information and stored; the images are processed through steps S1 and S2 to obtain their original features, the dictionary tree obtained in step S3 is applied, and the dictionary vectors of the database images are calculated, completing the compressed expression of the features.
7. The method for searching a picture by a picture according to claim 6, characterized in that for image I1, the frequency m_T with which its features appear in each dictionary-tree node T is counted in step S3, and the dictionary vector is calculated as:
d_I1 = (W_T1 · m_T1, …, W_Tp · m_Tp)
where W_T represents the weight of node T in the dictionary tree and d_I1 represents the dictionary vector of image I1 obtained from the dictionary tree; the dictionary vectors of all images in the database are calculated in this way, the image features are characterized by these vectors, and the vectors are stored in the image feature library.
8. The method for searching a picture by a picture according to claim 6, characterized in that in step S5, the dictionary vector of the image to be retrieved is calculated and denoted d_query, and the similarity between d_query and each image dictionary vector d_j stored in the database is calculated over their p dimensions, where p represents the dimension of the dictionary vectors d_j and d_query, and d_j represents the dictionary vector of the j-th image in the image library;
then, the images are sorted according to similarity, and the top N images with the highest similarity are returned as results.
9. An apparatus for searching a picture by a picture, characterized by comprising: at least one memory and at least one processor;
the at least one memory to store a machine readable program;
the at least one processor, configured to invoke the machine readable program, to perform the method of any of claims 1 to 8.
CN202210115546.6A 2022-02-07 2022-02-07 Method and device for searching picture by picture Pending CN114461827A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210115546.6A CN114461827A (en) 2022-02-07 2022-02-07 Method and device for searching picture by picture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210115546.6A CN114461827A (en) 2022-02-07 2022-02-07 Method and device for searching picture by picture

Publications (1)

Publication Number Publication Date
CN114461827A (en) 2022-05-10

Family

ID=81411487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210115546.6A Pending CN114461827A (en) 2022-02-07 2022-02-07 Method and device for searching picture by picture

Country Status (1)

Country Link
CN (1) CN114461827A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116662588A (en) * 2023-08-01 2023-08-29 山东省大数据中心 Intelligent searching method and system for mass data
CN116662588B (en) * 2023-08-01 2023-10-10 山东省大数据中心 Intelligent searching method and system for mass data


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination