CN113297410A - Image retrieval method and device, computer equipment and storage medium
- Publication number: CN113297410A
- Application number: CN202110841488.0A
- Authority: CN (China)
- Prior art keywords: image, features, retrieved, retrieval, feature
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses an image retrieval method, an image retrieval device, computer equipment and a storage medium. The method comprises the following steps: acquiring an image to be retrieved and its corresponding text; extracting image features with a VGGNet network model; extracting Word2vec features and TF-IDF features of the text and deeply concatenating them to obtain text features; fusing the image features and the text features to construct residual features and gate features, and linearly combining these by weight to obtain fusion features; learning the weights by a metric learning method to obtain the final weights; and taking the final fusion features of the image to be retrieved as the features to be retrieved, calculating the similarity between them and the retrieval features of the images in the retrieval database, and returning the images that meet the retrieval requirements. Based on data in the two modalities of image and text, the method fuses information across modalities and completes the retrieval task with the fused information, thereby improving retrieval performance.
Description
Technical Field
The embodiment of the invention relates to the technical field of image retrieval, in particular to an image retrieval method, an image retrieval device, computer equipment and a storage medium.
Background
In the network era, with the rise of various social networks, different types of information such as text, pictures, audio and video have grown on a large scale. Data in different modalities can describe the same object or event from different angles, allowing people to understand it more completely. How to use data of different modalities to accomplish specific tasks in particular scenarios has therefore become a research hotspot. As multi-modal data grows, it becomes increasingly difficult for ordinary users to retrieve the information they need accurately and efficiently. The multi-modal data in image retrieval comprises the textual descriptions and the image representations of the images.
Image retrieval technology falls mainly into two types: Text-Based Image Retrieval (TBIR) and Content-Based Image Retrieval (CBIR). TBIR relies mainly on the annotation information of images, but in the face of image data sets numbering in the tens of thousands, manual annotation is too expensive, so this retrieval scheme cannot meet the requirements of practical applications. CBIR mainly uses feature extraction and high-dimensional indexing techniques, but because the visual information a computer acquires from an image may not be consistent with the semantic information a user understands from it, a gap arises between the low-level features and the high-level retrieval requirements, i.e., the "semantic gap". Because of this gap, images with similar features in CBIR may well be semantically irrelevant, so content-based retrieval results often fail to meet users' information needs.
Disclosure of Invention
The invention provides an image retrieval method, an image retrieval device, computer equipment and a storage medium, which are used for solving the problems in the prior art.
In a first aspect, an embodiment of the present invention provides an image retrieval method. The method comprises the following steps:
S10: acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
S20: extracting image features of the image to be retrieved by using a VGGNet network model;
S30: extracting Word2vec (Word to Vector) features and TF-IDF (Term Frequency-Inverse Document Frequency) features of the text, and deeply concatenating the Word2vec features and the TF-IDF features to obtain the text features of the image to be retrieved;
S40: fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual features and the gate features by weight to obtain the fusion features of the image to be retrieved;
S50: acquiring a training data set, wherein the training data set comprises a plurality of training images and the texts corresponding to the training images; learning the weights of the residual features and the gate features in the fusion features by a metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights;
S60: linearly combining the residual features and the gate features of the image to be retrieved according to the final weights to obtain the final fusion features of the image to be retrieved, taking the final fusion features as the features to be retrieved, calculating the similarity between the features to be retrieved and the retrieval features of the plurality of images in the retrieval database, and returning the images among them that meet the retrieval requirements.
In an embodiment, the parameter configuration of the VGGNet network model comprises the following steps:
S11: pre-training the VGGNet network model on the ImageNet data set to obtain pre-trained network parameters;
S12: adjusting all images in the target data set of the VGGNet network model to a size of 256 × 256, and randomly selecting a 227 × 227 image content mirror image as the input of the VGGNet network model;
S13: modifying the number of neurons in the last fully-connected layer of the VGGNet network model from the number of image categories in the ImageNet data set to the number c of image categories in the target data set;
S14: performing a Softmax operation of dimension c on the output of the last fully-connected layer to obtain the probability distribution of the image to be retrieved over the c image categories.
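For illustration only, the following PyTorch-style sketch shows one way steps S11-S14 could be configured; the class count c, the torchvision calls, and all identifiers are assumptions made for the example rather than part of the claimed method.

```python
import torch
import torch.nn as nn
from torchvision import models

c = 20  # assumed number of image categories in the target data set

# S11: start from parameters pre-trained on ImageNet.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# S13: replace the last fully-connected layer (1000 ImageNet classes -> c).
model.classifier[6] = nn.Linear(model.classifier[6].in_features, c)

# S14: c-dimensional Softmax over the output of the last fully-connected layer.
def class_distribution(images: torch.Tensor) -> torch.Tensor:
    logits = model(images)               # shape (batch, c)
    return torch.softmax(logits, dim=1)  # probability over the c categories
```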
In an embodiment, in S30, deeply concatenating the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved includes:
S31: denote the Word2vec feature as $V_w = (w_1, w_2, \ldots, w_N)$, where each $w_i$ is a real number and $N$ is the dimension of the Word2vec feature; denote the TF-IDF feature as $V_t = (t_1, t_2, \ldots, t_T)$, where each $t_j$ is a real number and $T$ is the dimension of the TF-IDF feature;
S32: concatenate the two features into $V_c = [V_w; V_t]$, a vector of dimension $N + T$;
S33: input $V_c$ into a deep neural network, through which the high-order fusion feature of $V_w$ and $V_t$ is learned, obtaining the text feature $F_{text}$ of the image to be retrieved, where the dimension of $F_{text}$ is less than $N + T$.
In one embodiment, S40 includes:
S41: transform the text feature $F_{text}$ with a convolution filter $W_c$ according to equation (1), so that the transformed text feature $\tilde{F}_{text}$ has the same dimensions as the image feature $F_{img}$:

$\tilde{F}_{text} = W_c * F_{text} \qquad (1)$

where $*$ denotes the standard normalized convolution calculation;
S42: combine $F_{img}$ and $\tilde{F}_{text}$ by same-position element-wise multiplication and construct the residual feature $F_{res}$ according to equation (2):

$F_{res} = \mathrm{ReLU}(W_r * (F_{img} \odot \tilde{F}_{text})) \qquad (2)$

S43: construct the gate feature $F_{gate}$ according to equation (3):

$F_{gate} = \sigma(W_{g1} * (F_{img} \odot \tilde{F}_{text})) \odot (W_{g2} * (F_{img} \odot \tilde{F}_{text})) \qquad (3)$

where $\sigma$ is the sigmoid function, $W_{g1}$ and $W_{g2}$ denote two convolution filters, and $\odot$ denotes the same-position element-wise multiplication;
S44: linearly combine $F_{res}$ and $F_{gate}$ by their respective weights according to equation (4) to obtain the fusion feature $\phi$ of the image to be retrieved:

$\phi = w_g F_{gate} + w_r F_{res} \qquad (4)$

where $w_g$ and $w_r$ denote learnable weight values used to balance the proportions of $F_{gate}$ and $F_{res}$ in $\phi$.
In one embodiment, in S50, learning the weights of the residual feature and the gate feature in the fusion feature by the metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights includes:
S51: set the minibatch size used when the gradient descent algorithm searches for the minimum loss value during training to $B$, where the minibatch contains, for each training image $x_i$, its initial fusion feature $\phi_i$ and the corresponding retrieval target feature $\psi_i$; denote $\phi_i = f_{fuse}(x_i, t_i)$, where $t_i$ is the text corresponding to $x_i$ and $f_{fuse}(\cdot)$ is the function that obtains the initial fusion feature, $i = 1, 2, \ldots, B$; denote $\psi_i = f_{img}(y_i)$, where $y_i$ is the retrieval target image corresponding to $x_i$ and $f_{img}(\cdot)$ is the function that obtains the image feature of any image;
S52: for each training image $x_i$, repeatedly construct $M$ sets $\mathcal{N}_i^m$ ($m = 1, 2, \ldots, M$), each of size $K$, where each $\mathcal{N}_i^m$ contains $K$ samples selected from the minibatch: one positive example, namely the retrieval target feature $\psi_i$, and $K - 1$ negative examples $\psi_j$ ($j \neq i$); $M$ is less than or equal to $B$, and $M$ is less than or equal to $K$;
S53: compute the loss $L$ according to equation (5):

$L = -\dfrac{1}{MB}\sum_{i=1}^{B}\sum_{m=1}^{M}\log\dfrac{\exp\{\kappa(\phi_i,\psi_i)\}}{\sum_{\psi_j\in\mathcal{N}_i^m}\exp\{\kappa(\phi_i,\psi_j)\}} \qquad (5)$

where $\kappa$ denotes a similarity kernel function and $D(a, b)$ denotes the distance between two data-point vectors $a$ and $b$; $\phi_i$ and $\psi_i$ are the initial fusion feature and the retrieval target feature of the $i$-th sample, computed under the condition of the set $\mathcal{N}_i^m$; the softmax form characterizes the percentage of each converted result in the sum of all converted results;
S54: use $L$ to learn the weights of the residual feature and the gate feature in the fusion feature, obtaining the final weights.
In an embodiment, in S60, taking the final fusion feature as the feature to be retrieved, calculating the similarity between the feature to be retrieved and the retrieval features of the plurality of images in the retrieval database, and returning the images that meet the retrieval requirements includes:
S61: denote the final fusion feature of the image to be retrieved $q$ as $\phi_q = f_{final}(q, t)$, where $t$ is the text corresponding to $q$ and $f_{final}(\cdot)$ is the function that obtains the final fusion feature; take $\phi_q$ as the feature to be retrieved and calculate, according to equation (6), the distance between it and the retrieval feature $\psi_r$ of each image in the retrieval database:

$D_r = D(\phi_q, \psi_r) \qquad (6)$

where the number of images in the retrieval database is $R$, $r = 1, 2, \ldots, R$;
S62: take the $k$ smallest of the distances $D_1, D_2, \ldots, D_R$, whose corresponding image retrieval features are $\psi_{r_1}, \psi_{r_2}, \ldots, \psi_{r_k}$; return the images in the retrieval database corresponding to these features as the images meeting the retrieval requirements.
In one embodiment, the VGGNet network model is a VGGNet-16 network model; or the Word2vec features are obtained through a Skip-Gram model; or the TF-IDF features are obtained through the sklearn library in Python.
In a second aspect, an embodiment of the present invention further provides an image retrieval apparatus. The device includes:
the data acquisition module is used for acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
the image feature extraction module is used for extracting the image features of the image to be retrieved by utilizing a VGGNet network model;
the text feature extraction module is used for extracting Word2vec features and TF-IDF features of the text, and performing deep concatenation on the Word2vec features and the TF-IDF features to obtain text features of the image to be retrieved;
the feature fusion module is used for fusing the image features and the text features to construct the residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures, and for linearly combining the residual features and the gate features by weight to obtain the fusion features of the image to be retrieved;
the weight learning module is used for acquiring a training data set, wherein the training data set comprises a plurality of training images and the texts corresponding to the training images, and for learning the weights of the residual features and the gate features in the fusion features by a metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights;
and the image retrieval module is configured to linearly combine the residual features and the gate features of the image to be retrieved according to the final weights to obtain the final fusion features of the image to be retrieved, take the final fusion features as the features to be retrieved, calculate the similarity between the features to be retrieved and the retrieval features of a plurality of images in a retrieval database, and return the images among them that meet the retrieval requirements.
In a third aspect, an embodiment of the present invention further provides a computer device. The device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor executes the program, the image retrieval method provided by any embodiment of the invention is realized.
In a fourth aspect, the embodiment of the present invention further provides a storage medium, on which a computer-readable program is stored, where the program, when executed, implements the image retrieval method provided by any embodiment of the present invention.
The invention can realize the following beneficial effects:
1. the embodiment of the invention takes two different types of modal data, text and image, as entry points, constructs a multi-modal fusion feature carrying more comprehensive information through the residual feature and the gate feature, and fully explores the correlation between the low-level features and the high-level semantics of the text and image data; the retrieval task is completed through the multi-modal fusion features, improving the recall ratio and precision ratio of image retrieval as well as the retrieval efficiency;
2. the embodiment of the invention uses metric learning techniques to complete the construction of data samples, the training of the loss function, and the learning of the weight values of image features and text features, perfecting the optimization of the fusion features, so that after the text features and the image features are fused, the features constructed from the data to be retrieved are consistent with the features of the retrieval target image in spatial structure and similar in semantic expression;
3. the embodiment of the invention provides a deep fusion model for obtaining text features, taking the serial concatenation of the Word2vec feature vector and the TF-IDF feature vector as input to learn a high-order fusion feature that serves as the final feature of the text data, thereby avoiding the dimensionality disaster that would result from directly using the concatenated features as the text features;
4. the embodiment of the invention adopts the VGGNet-16 network model as the processing unit for the image data and fine-tunes the pre-trained parameters according to the characteristics of the target data set, so that the VGGNet-16 network model matches the target data set more closely, improving the accuracy and efficiency of image feature extraction;
5. the embodiment of the invention combines the image features and the text features by same-position element-wise multiplication, incorporating the element information contained in the image features to the maximum extent while completing the structural matching of the text features and the image features.
Drawings
Fig. 1 is a flowchart of an image retrieval method according to an embodiment of the present invention.
Fig. 2 is a flowchart of another image retrieval method according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an image retrieval apparatus according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples. It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
Fig. 1 is a flowchart of an image retrieval method according to an embodiment of the present invention. The method realizes information fusion of data in different modes based on data in two modes of images and texts, and completes retrieval tasks by using the fused information, thereby improving the retrieval performance. The method includes steps S10-S60.
S10: acquire an image to be retrieved and the text corresponding to the image to be retrieved.
S20: extract the image features of the image to be retrieved using a VGGNet network model.
S30: extract the Word2vec feature and the TF-IDF feature of the text, and deeply concatenate the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved.
S40: fuse the image features and the text features to construct the residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; and linearly combine the residual features and the gate features by weight to obtain the fusion features of the image to be retrieved.
S50: acquire a training data set, wherein the training data set comprises a plurality of training images and the texts corresponding to the training images; and learn the weights of the residual features and the gate features in the fusion features by a metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights.
S60: linearly combine the residual features and the gate features of the image to be retrieved according to the final weights to obtain the final fusion features of the image to be retrieved, take the final fusion features as the features to be retrieved, calculate the similarity between the features to be retrieved and the retrieval features of the plurality of images in the retrieval database, and return the images among them that meet the retrieval requirements.
Fig. 2 is a flowchart of another image retrieval method according to an embodiment of the present invention, which shows the basic framework of the image retrieval method more concisely. First, the image to be retrieved and its text are acquired, and the features of the data in different modalities are extracted with separate networks: a deep convolutional neural network extracts the image features of the image-modality data, and a pre-trained language network model extracts the text features of the text-modality data. The text features and the image features are then fused, and metric learning is used in training to balance the proportions of the text features and the image features in the fused features, realizing an organic combination of the different modal data; the final fusion features are obtained through this learning and training. Finally, the final fusion features are used as the features to be retrieved, similarity measurements are computed against the features stored in the database, the set of similar features meeting the retrieval requirements is returned, and the corresponding retrieval results complete the retrieval task.
In one embodiment, in S20, the extraction of the image features may be implemented by a deep convolutional neural network. As an end-to-end feature extraction method, the deep convolutional neural network remains outstanding at image feature extraction even though its training requires large-scale labeled data, and in image-vision research its universality and extensibility are very strong. Traditional feature extraction methods suffer from manual design, a large amount of computation, low speed, poor real-time performance, and unfriendliness to small-scale data; a deep convolutional neural network extracts features automatically through deep learning and efficiently overcomes these shortcomings. In this embodiment, the VGGNet network model is selected as the processing unit for the image data to obtain the corresponding image features.
In one embodiment, the parameter configuration of the VGGNet network model includes steps S11-S14.
S11: The VGGNet network model is pre-trained on the ImageNet data set to obtain pre-trained network parameters.
S12: All images in the target data set of the VGGNet network model are resized to 256 × 256, and a 227 × 227 image content mirror image is randomly selected as the input to the VGGNet network model.
S13: The number of neurons in the last fully-connected layer of the VGGNet network model is modified from the number of image categories in the ImageNet data set to the number c of image categories in the target data set.
S14: A Softmax operation of dimension c is performed on the output of the last fully-connected layer to obtain the probability distribution of the image to be retrieved over the c image categories.
In one embodiment, the VGGNet network model is a VGGNet-16 network model. In the parameter configuration of the VGGNet-16 network model, pre-training is first carried out to obtain relatively stable network parameters, and the network parameters are then fine-tuned to better meet the requirements of the target data set. The fine-tuning process includes three steps.
1) All images in the target data set are resized to 256 × 256, and in the fine-tuning operation an image content mirror image of size 227 × 227 is randomly selected as the network input.
Step 1) brings two benefits: a smaller input reduces the amount of computation, and it also keeps the model complexity from becoming too high, reducing the risk of overfitting. Alternatively, both the 256 × 256 and 227 × 227 parameters may be modified according to the model; 256, i.e., 2 to the 8th power, is typically used. Here, "image content mirroring" is a data expansion means that enlarges the data set by mirroring the original images. Moreover, if the trained model is meant to be independent of left-right orientation, the mirrored data belong to the same category as the originals, which can increase the robustness of the network.
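For illustration, a minimal torchvision sketch of step 1) follows, assuming the resize-then-random-crop-and-mirror pipeline described above; the embodiment does not fix an exact augmentation pipeline, so this is only one plausible realization.

```python
from torchvision import transforms

fine_tune_transform = transforms.Compose([
    transforms.Resize((256, 256)),      # all images adjusted to 256 x 256
    transforms.RandomCrop(227),         # randomly selected 227 x 227 content
    transforms.RandomHorizontalFlip(),  # mirror expansion of the data set
    transforms.ToTensor(),
])
```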
2) The number of neurons in the last fully-connected layer of the network model is modified from the original 1000, which refers to the 1000 categories of the ImageNet data set, to c, the number of image categories in the target data set. Through step 2), the last fully-connected layer is made to fit the target data set.
3) A Softmax operation of dimension c is performed on the output of the last layer to obtain the probability distribution of the picture content over the c categories. A Euclidean loss function is employed.
The specific settings in the fine-tuning are further described below.
In the whole fine-tuning process, the VGGNet-16 is first pre-trained on the ImageNet data set, and the pre-trained parameters are used to assign the parameters of the first 7 layers of VGGNet-16. For the final fully-connected layer, parameter assignment is completed by fine-tuning. In this embodiment, the final fully-connected layer is randomly initialized with a Gaussian distribution $N(\mu, \sigma^2)$, where $\mu$ is the location parameter of the Gaussian distribution, describing its central tendency, and $\sigma$ describes the dispersion of the data distribution: the larger $\sigma$, the more dispersed the data; the smaller $\sigma$, the more concentrated. Both $\mu$ and $\sigma$ can be set flexibly according to requirements.
In the fine-tuning process, different learning rates are set for the front and rear levels of VGGNet-16. The main function of the earlier convolutional layers is to extract the low-level feature representation of the image data, which stays close to the parameter settings of the pre-trained model obtained on the ImageNet data set, so a lower learning rate of 0.001 is set for them. For the last three fully-connected layers of VGGNet-16, to ensure that the network model converges on the target data set as soon as possible and reaches the corresponding optimal solution, the learning rates of the first two fully-connected layers are preset to 0.002 and the learning rate of the last fully-connected layer to 0.01, which are relatively high. Because of these different learning rates, the update rates of the front and rear layers also differ accordingly. With this fine-tuning operation, the network model can fit the target data set as soon as possible, improving optimization efficiency and effect without destroying the relatively stable parameters obtained in the pre-training stage. After the fine-tuning operation is finished, the network model with its complete parameters is used to extract the image features corresponding to the image data.
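The following sketch illustrates the layered learning rates described above, assuming a torchvision VGG-16 whose classifier holds the three fully-connected layers; the optimizer choice (SGD with momentum) is an assumption for the example.

```python
import torch
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

optimizer = torch.optim.SGD([
    {"params": model.features.parameters(),      "lr": 0.001},  # conv layers
    {"params": model.classifier[0].parameters(), "lr": 0.002},  # FC-1
    {"params": model.classifier[3].parameters(), "lr": 0.002},  # FC-2
    {"params": model.classifier[6].parameters(), "lr": 0.01},   # final FC
], momentum=0.9)
```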
In an embodiment, in S30, deeply concatenating the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved includes steps S31-S33.
S31: denote the Word2vec feature as $V_w = (w_1, w_2, \ldots, w_N)$, where each $w_i$ is a real number and $N$ is the dimension of the Word2vec feature; denote the TF-IDF feature as $V_t = (t_1, t_2, \ldots, t_T)$, where each $t_j$ is a real number and $T$ is the dimension of the TF-IDF feature.
S32: concatenate the two features into $V_c = [V_w; V_t]$, a vector of dimension $N + T$.
S33: input $V_c$ into a deep neural network, through which the high-order fusion feature of $V_w$ and $V_t$ is learned, obtaining the text feature $F_{text}$ of the image to be retrieved, where the dimension of $F_{text}$ is less than $N + T$.
In one embodiment, the Word2vec feature may be obtained by the Skip-Gram model, and the TF-IDF feature may be obtained by the sklearn library in Python.
The extraction of the text features performs corresponding processing on the original text data to obtain vector representations that can be used subsequently, namely the text features. In this embodiment, the text features are obtained by extracting the Word2vec feature and the TF-IDF feature of the text and deeply concatenating them. It should be noted that this embodiment does not simply connect the Word2vec feature and the TF-IDF feature in series to obtain the text feature; rather, the two features are fused by a deep neural network trained with a categorical cross-entropy loss, so as to learn a high-order fusion feature, making the semantics learned in the text feature more accurate.
The Word2vec feature and the TF-IDF feature each have advantages. The Word2vec feature represents the semantic information of words as vectors (i.e., word vectors) through learning on a large corpus, and words with similar semantics are close to each other in the embedding space. The Word2vec feature takes context information into account and performs better than earlier embedding methods; at the same time, its dimensionality is lower, so processing is faster. The TF-IDF feature comes from a feature-weighting algorithm based on word frequency that is widely used in text mining; its main idea is that if a word or phrase appears frequently in one article and rarely in other articles, it is considered to have good category-distinguishing ability and to be suitable for classification. The TF-IDF feature is simple and fast to compute.
Alternatively, the Skip-Gram model is used to obtain Word2vec. First, vector representations of all words in the pre-processed text data are obtained using the Skip-Gram model. Then, all word vectors belonging to the same text are averaged, and the average is taken as the Word2vec feature vector of the text. "Same text" is understood here as "same sentence", for example a sentence describing an image. The Word2vec feature vector of a text can be expressed as $V_w = (w_1, w_2, \ldots, w_{150})$, where each $w_i$ is a real number and 150 indicates that the word vector is 150 dimensions long. The dimensionality of the word vector can be set according to requirements: a larger value for a large corpus, a smaller value for a specific field.
Alternatively, the TF-IDF feature vectors of the text can be extracted using the sklearn library in Python. In the sklearn library, the CountVectorizer class only considers the frequency of occurrence of each vocabulary term, while the TfidfVectorizer class additionally accounts for the inverse of the number of texts that contain the term. Therefore, in this embodiment, the TfidfVectorizer class in the sklearn library is used to convert each text into a TF-IDF feature vector. A text can then be represented as a 500-dimensional TF-IDF feature vector $V_t = (t_1, t_2, \ldots, t_{500})$. The dimensionality of the TF-IDF feature vector can also be set according to requirements: larger for a large corpus, smaller for a specific field.
The feature vectors are then spliced, with the TF-IDF feature concatenated behind the Word2vec feature. The spliced feature vector can be represented as $V_c = [V_w; V_t]$, a 650-dimensional vector. The spliced feature vector is input into a deep neural network to learn the high-order fusion feature, finally obtaining a 256-dimensional text feature $F_{text}$.
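A minimal sketch of this text branch follows, assuming gensim for Skip-Gram Word2vec and scikit-learn for TF-IDF; the two example sentences and the two-layer fusion network are illustrative assumptions, not the embodiment's exact architecture.

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["a dog runs on the grass", "two people ride horses on a beach"]
tokenized = [t.split() for t in texts]

# Skip-Gram Word2vec (sg=1), 150-dimensional word vectors, averaged per text.
w2v = Word2Vec(tokenized, vector_size=150, sg=1, min_count=1)
w2v_feat = np.stack([np.mean([w2v.wv[w] for w in s], axis=0) for s in tokenized])

# TF-IDF features via TfidfVectorizer, capped at 500 dimensions.
tfidf = TfidfVectorizer(max_features=500)
tfidf_feat = tfidf.fit_transform(texts).toarray()

# Splice TF-IDF behind Word2vec, then learn the 256-d high-order fusion feature.
spliced = np.concatenate([w2v_feat, tfidf_feat], axis=1)
fusion_net = nn.Sequential(
    nn.Linear(spliced.shape[1], 512), nn.ReLU(),
    nn.Linear(512, 256),
)
text_features = fusion_net(torch.from_numpy(spliced).float())  # (n_texts, 256)
```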
In one embodiment, S40 includes steps S41-S44, through which the obtained text features and image features are spatially unified.
S41: transform the text feature $F_{text}$ with a convolution filter $W_c$ according to equation (1), so that the transformed text feature $\tilde{F}_{text}$ has the same dimensions as the image feature $F_{img}$:

$\tilde{F}_{text} = W_c * F_{text} \qquad (1)$

where $*$ denotes the standard normalized convolution calculation.
Optionally, a convolution filter of size 3 × 3 is provided to extend the text feature along the height and width dimensions of the underlying image feature to a size that matches the underlying image feature. Through the structural transformation in equation (1), the extension process is completed, and $\tilde{F}_{text}$ is the extended text feature, whose size matches the underlying image features.
S42: construct the residual feature $F_{res}$ according to equation (2), in which the ReLU activation function yields the final text characteristics; more feature elements of the text data are thereby combined, achieving an effective conversion of the underlying text features:

$F_{res} = \mathrm{ReLU}(W_r * (F_{img} \odot \tilde{F}_{text})) \qquad (2)$

S43: construct the gate feature $F_{gate}$ according to equation (3):

$F_{gate} = \sigma(W_{g1} * (F_{img} \odot \tilde{F}_{text})) \odot (W_{g2} * (F_{img} \odot \tilde{F}_{text})) \qquad (3)$

where $\sigma$ is the sigmoid function, $W_{g1}$ and $W_{g2}$ denote two convolution filters, and $\odot$ denotes the same-position element-wise multiplication.
Optionally, in equation (3), $W_{g1}$ and $W_{g2}$ are two convolution filters of size 3 × 3, and the combination of the underlying image features and the underlying text features is realized by the same-position element-wise multiplication. On the basis of completing the structural matching of the underlying text features and the underlying image features, the obtained $F_{gate}$ incorporates the element information contained in the image features to the maximum extent.
S44: linearly combine $F_{res}$ and $F_{gate}$ by their respective weights according to equation (4) to obtain the fusion feature $\phi$ of the image to be retrieved:

$\phi = w_g F_{gate} + w_r F_{res} \qquad (4)$

where $w_g$ and $w_r$ denote learnable weight values used to balance the proportions of $F_{gate}$ and $F_{res}$ in $\phi$.
After the construction of the residual feature and the gate feature is completed, in order to better balance the influence of any single-modality feature on the embedding, the residual feature and the gate feature are each given a corresponding weight proportion and linearly combined to complete the feature combination.
In addition, the learnable weight values $w_g$ and $w_r$ are each a single numerical value. Optionally, $w_g$ can be determined jointly with the convolution filters $W_{g1}$ and $W_{g2}$ through their weight correlation at multiple convolution positions, and $w_r$ likewise with $W_r$; learning the combination of filters and weights can therefore also be understood as learning $w_g$ and $w_r$.
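A hedged PyTorch sketch of equations (1)-(4) follows; it assumes convolutional feature maps with matching channel counts, 3 × 3 filters as described above, and scalar learnable weights initialized to 0.5. The exact parameterization of the embodiment may differ.

```python
import torch
import torch.nn as nn

class ResidualGateFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.w_c = nn.Conv2d(channels, channels, 3, padding=1)   # eq. (1)
        self.w_r = nn.Conv2d(channels, channels, 3, padding=1)   # eq. (2)
        self.w_g1 = nn.Conv2d(channels, channels, 3, padding=1)  # eq. (3)
        self.w_g2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.weight_g = nn.Parameter(torch.tensor(0.5))  # initial weights 0.5
        self.weight_r = nn.Parameter(torch.tensor(0.5))

    def forward(self, f_img: torch.Tensor, f_text: torch.Tensor) -> torch.Tensor:
        f_text = self.w_c(f_text)            # match the image feature dims
        joint = f_img * f_text               # same-position multiplication
        f_res = torch.relu(self.w_r(joint))                          # eq. (2)
        f_gate = torch.sigmoid(self.w_g1(joint)) * self.w_g2(joint)  # eq. (3)
        return self.weight_g * f_gate + self.weight_r * f_res        # eq. (4)
```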
In one embodiment, in S50, learning the weights of the residual feature and the gate feature in the fusion feature by the metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights includes steps S51-S54.
S51: set the minibatch size used when the gradient descent algorithm searches for the minimum loss value during training to $B$, where the minibatch contains, for each training image $x_i$, its initial fusion feature $\phi_i = f_{fuse}(x_i, t_i)$ and the corresponding retrieval target feature $\psi_i = f_{img}(y_i)$; here $t_i$ denotes the text corresponding to $x_i$, $f_{fuse}(\cdot)$ the function obtaining the initial fusion feature, $y_i$ the corresponding retrieval target image, $f_{img}(\cdot)$ the function obtaining the image feature of any image, and $i = 1, 2, \ldots, B$.
The goal of the learning training is to bring the fused feature closer to the target feature and push it farther from irrelevant features, so a supervised classification-loss training method is adopted. During training, each forward propagation yields a loss value between the output value and the true value; the smaller the loss value, the better the model. Optionally, a gradient descent algorithm is used to find the minimum loss value, so that the corresponding learnable parameters can be derived in reverse, optimizing the model. In the gradient descent process the minibatch size is set to $B$, and the initial fusion feature of a piece of data to be retrieved is $\phi_i = f_{fuse}(x_i, t_i)$, where $x_i$ represents the image to be retrieved and $t_i$ its text. The initial weights $w_g$ and $w_r$ are both 0.5, and the corresponding retrieval target image is characterized by $\psi_i = f_{img}(y_i)$, where $y_i$ represents the retrieval target image and $f_{img}(\cdot)$ the feature extraction function of VGGNet-16.
S52: for each training image $x_i$, repeatedly construct $M$ sets $\mathcal{N}_i^m$, each of size $K$, to obtain the sets $\mathcal{N}_i^1, \ldots, \mathcal{N}_i^M$, where each $\mathcal{N}_i^m$ contains $K$ samples selected from the minibatch: one positive example, namely the retrieval target feature $\psi_i$, and $K - 1$ negative examples $\psi_j$ ($j \neq i$); $M$ is less than or equal to $B$, and $M$ is less than or equal to $K$.
Optionally, for each training image, samples are selected from the set minibatch to construct a set of size $K$ with one positive example and $K - 1$ negative examples, the positive example being $\psi_i$ and the negative examples the other target features in the minibatch. This construction is repeated $M$ times, where $M$ is not greater than the minibatch size $B$ and not greater than the constructed set size $K$.
S53: the loss $L$ is computed according to equation (5):

$L = -\dfrac{1}{MB}\sum_{i=1}^{B}\sum_{m=1}^{M}\log\dfrac{\exp\{\kappa(\phi_i,\psi_i)\}}{\sum_{\psi_j\in\mathcal{N}_i^m}\exp\{\kappa(\phi_i,\psi_j)\}} \qquad (5)$

where $\kappa$ denotes a similarity kernel function and $D(a, b)$ denotes the distance between the two data-point vectors $a$ and $b$; $\phi_i$ and $\psi_i$ respectively denote the initial fusion feature and the retrieval target feature of the $i$-th sample, computed under the condition of the set $\mathcal{N}_i^m$; the softmax function characterizes the percentage of each converted result in the sum of all converted results.
Optionally, in the calculation, $\kappa(a, b)$ is set to the negative distance $-D(a, b)$, where $D$ is the Euclidean distance; when $a$ and $b$ are two data-point vectors and $n$ denotes the dimension of the vectors, it is defined as:

$D(a, b) = \sqrt{\sum_{j=1}^{n}(a_j - b_j)^2}$
S54: use $L$ to learn the weights of the residual feature and the gate feature in the fusion feature, obtaining the final weights.
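A minimal sketch of the loss in equation (5) follows, with kappa set to the negative Euclidean distance as described above; the random construction of the negative sets is an assumption about details the text leaves open.

```python
import torch

def kappa(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return -torch.norm(a - b, dim=-1)  # negative Euclidean distance

def metric_loss(phi: torch.Tensor, psi: torch.Tensor, K: int, M: int) -> torch.Tensor:
    """phi: (B, d) initial fusion features; psi: (B, d) retrieval target features."""
    B = phi.size(0)
    terms = []
    for i in range(B):
        for _ in range(M):
            perm = torch.randperm(B)
            neg = perm[perm != i][: K - 1]              # K-1 negative examples
            cand = torch.cat([psi[i:i + 1], psi[neg]])  # positive example first
            logits = kappa(phi[i], cand)                # similarity kernel values
            terms.append(-torch.log_softmax(logits, dim=0)[0])
    return torch.stack(terms).mean()                    # equation (5)
```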
By utilizing the related technology of metric learning, the construction of data samples, the training of loss functions and the learning of the weight values of image features and text features are completed, and the specific optimization of fusion features is perfected. After the text features and the image features are fused, the features constructed by the data to be retrieved are consistent with the features of the retrieval target image in the spatial structure and are similar to the features of the retrieval target image in semantic expression.
It should be noted that, in this embodiment, spatial-structure consistency may be understood as consistency of spatial dimensions, and semantic-expression similarity as similarity of high-level semantic expression. The "high-level semantics" of an image is a concept opposed to its "low-level features". The low-level features of an image are contour, edge, color, texture, and shape features; they carry little semantic information but locate targets accurately. The high-level semantic features build on the low-level ones: for example, extracting the low-level features of a face yields continuous contours, a nose, eyes, and the like, while the high-level feature presents them as a face. High-level features are rich in semantic information but locate targets only roughly. We refer to the visual features of an image as the visual space and to the semantic information of a category as the semantic space.
In an embodiment, in S60, taking the final fusion feature as the feature to be retrieved, calculating the similarity between the feature to be retrieved and the retrieval features of the plurality of images in the retrieval database, and returning the images that meet the retrieval requirements includes steps S61 and S62.
S61: the image to be retrieved is processedIs finally fused and characterized asWherein, in the step (A),tto representThe corresponding text is then displayed on the display screen,representation acquisitionA function of the final fused features of (a); will be provided withAs the feature to be retrieved, each image in the retrieval database of the feature to be retrieved is calculated according to formula (6)Search feature ofThe distance between:
Wherein the number of images in the search database is recorded asR,r=1,2,…,R。
In the image retrieval process, the sorting output of the similarity result is the last very important step. Fusion features after optimization of trainingThen, the feature vector is used as the basis of similarity check and the search feature of the existing picture in the databaseAnd (3) calculating the distance:
wherein, the selection of the distance function is consistent with the selection of the similar kernel function in the metric learning, and the distance can be expressed as:
the smaller the interval is, the higher the probability that the two feature vectors belong to the same class is, i.e. the similarity between the two vectors is higher.
S62: getD 1,D 2,…,D RFront of lowest numerical valuekA distance, wherein, the frontkThe picture retrieval characteristics corresponding to the distances are(ii) a Will be provided withAnd returning the corresponding image in the retrieval database as the image meeting the retrieval requirement.
Sorting the output results of S61, and taking the front with the minimum valuekEach vector is used for obtaining a similarity list of the examination results:
the set Sim represents the set of similar features of the fusion feature to be retrieved after similarity query, and each feature vector in the setThe corresponding original image is the final retrieval result.
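A short sketch of S61-S62 follows; the array names, shapes, and NumPy-based implementation are illustrative assumptions.

```python
import numpy as np

def retrieve(phi_q: np.ndarray, db_feats: np.ndarray, k: int = 10) -> np.ndarray:
    """phi_q: (d,) fused query feature; db_feats: (R, d) retrieval features.
    Returns the indices of the k most similar database images."""
    dists = np.linalg.norm(db_feats - phi_q, axis=1)  # D_1 ... D_R, Euclidean
    return np.argsort(dists)[:k]                      # smallest distances first
```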
The image retrieval method provided by the embodiment of the invention can realize the following beneficial effects.
1. The embodiment of the invention takes two different types of modal data, text and image, as entry points, constructs a multi-modal fusion feature carrying more comprehensive information through the residual feature and the gate feature, and fully explores the correlation between the low-level features and the high-level semantics of the text and image data; the retrieval task is completed through the multi-modal fusion features, improving the recall ratio and precision ratio of image retrieval as well as the retrieval efficiency.
2. The embodiment of the invention uses metric learning techniques to complete the construction of data samples, the training of the loss function, and the learning of the weight values of image features and text features, perfecting the optimization of the fusion features, so that after the text features and the image features are fused, the features constructed from the data to be retrieved are consistent with the features of the retrieval target image in spatial structure and similar in semantic expression.
3. The embodiment of the invention provides a deep fusion model for obtaining text features, taking the serial concatenation of the Word2vec feature vector and the TF-IDF feature vector as input to learn a high-order fusion feature that serves as the final feature of the text data, thereby avoiding the dimensionality disaster that would result from directly using the concatenated features as the text features.
4. According to the embodiment of the invention, the VGGNet-16 network model is used as the processing unit for the image data, and the pre-trained parameters are fine-tuned according to the characteristics of the target data set, so that the VGGNet-16 network model matches the target data set more closely, improving the accuracy and efficiency of image feature extraction.
5. The embodiment of the invention combines the image features and the text features by same-position element-wise multiplication, incorporating the element information contained in the image features to the maximum extent while completing the structural matching of the text features and the image features.
Example two
Fig. 3 is a schematic structural diagram of an image retrieval apparatus according to an embodiment of the present invention. The device is used for implementing the image retrieval method provided by the first embodiment, and includes a data acquisition module 310, an image feature extraction module 320, a text feature extraction module 330, a feature fusion module 340, a weight learning module 350, and an image retrieval module 360.
The data obtaining module 310 is configured to obtain an image to be retrieved and a text corresponding to the image to be retrieved.
The image feature extraction module 320 is configured to extract image features of the image to be retrieved using the VGGNet network model.
The text feature extraction module 330 is configured to extract Word2vec features and TF-IDF features of the text, and perform deep concatenation on the Word2vec features and the TF-IDF features to obtain text features of the image to be retrieved.
The feature fusion module 340 is configured to fuse the image features and the text features to construct the residual features and gate features of the image to be retrieved, where the residual features and the gate features have consistent spatial structures, and to linearly combine the residual features and the gate features by weight to obtain the fusion features of the image to be retrieved.
The weight learning module 350 is configured to acquire a training data set, where the training data set includes a plurality of training images and their corresponding texts, and to learn the weights of the residual features and the gate features in the fusion features by a metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights.
The image retrieval module 360 is configured to linearly combine the residual features and the gate features of the image to be retrieved according to the final weight to obtain final fusion features of the image to be retrieved, use the final fusion features as the features to be retrieved, calculate similarity between the features to be retrieved and retrieval features of a plurality of images in a retrieval database, and return the images meeting the retrieval requirements in the plurality of images.
In an embodiment, the parameter configuration of the VGGNet network model comprises the following steps:
S11: pre-training the VGGNet network model on the ImageNet data set to obtain pre-trained network parameters;
S12: adjusting all images in the target data set of the VGGNet network model to a size of 256 × 256, and randomly selecting a 227 × 227 image content mirror image as the input of the VGGNet network model;
S13: modifying the number of neurons in the last fully-connected layer of the VGGNet network model from the number of image categories in the ImageNet data set to the number c of image categories in the target data set;
S14: performing a Softmax operation of dimension c on the output of the last fully-connected layer to obtain the probability distribution of the image to be retrieved over the c image categories.
In an embodiment, the text feature extraction module 330 is configured to deeply concatenate the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved in the following manner:
S31: denote the Word2vec feature as $V_w = (w_1, w_2, \ldots, w_N)$, where each $w_i$ is a real number and $N$ is the dimension of the Word2vec feature; denote the TF-IDF feature as $V_t = (t_1, t_2, \ldots, t_T)$, where each $t_j$ is a real number and $T$ is the dimension of the TF-IDF feature;
S32: concatenate the two features into $V_c = [V_w; V_t]$, a vector of dimension $N + T$;
S33: input $V_c$ into a deep neural network, through which the high-order fusion feature of $V_w$ and $V_t$ is learned, obtaining the text feature $F_{text}$ of the image to be retrieved, where the dimension of $F_{text}$ is less than $N + T$.
In one embodiment, the feature fusion module 340 includes: a size transformation unit 341, a residual feature construction unit 342, a gate feature construction unit 343, and a feature fusion unit 344.
The size transformation unit 341 is arranged to transform the text feature $F_{text}$ with a convolution filter $W_c$ according to equation (1), so that the transformed text feature $\tilde{F}_{text}$ has the same dimensions as the image feature $F_{img}$:

$\tilde{F}_{text} = W_c * F_{text} \qquad (1)$

where $*$ denotes the standard normalized convolution calculation.
The residual feature construction unit 342 is arranged to construct the residual feature according to equation (2):

$F_{res} = \mathrm{ReLU}(W_r * (F_{img} \odot \tilde{F}_{text})) \qquad (2)$

The gate feature construction unit 343 is arranged to construct the gate feature according to equation (3):

$F_{gate} = \sigma(W_{g1} * (F_{img} \odot \tilde{F}_{text})) \odot (W_{g2} * (F_{img} \odot \tilde{F}_{text})) \qquad (3)$

where $\sigma$ is the sigmoid function, $W_{g1}$ and $W_{g2}$ denote two convolution filters, and $\odot$ denotes the same-position element-wise multiplication.
The feature fusion unit 344 is arranged to linearly combine $F_{res}$ and $F_{gate}$ by their respective weights according to equation (4), obtaining the fusion feature $\phi$ of the image to be retrieved:

$\phi = w_g F_{gate} + w_r F_{res} \qquad (4)$

where $w_g$ and $w_r$ denote learnable weight values used to balance the proportions of $F_{gate}$ and $F_{res}$ in $\phi$.
In one embodiment, the weight learning module 350 is configured to learn the weights of the residual features and the gate features in the fused features by metric learning method according to the following manner, and using the fused features of the training images and the respective search target features to obtain final weights:
s51: setting the size of minipatch when the gradient descent algorithm is adopted to search the minimum loss value in the training process asBWherein the minimatch includes each training imageInitial fusion feature ofAndcorresponding search target feature(ii) a Will be provided withIs marked asWherein, in the step (A),to representThe corresponding text is then displayed on the display screen,representation acquisitionAs a function of the initial fused features of (a),i=1,2,…,B(ii) a Will be provided withIs marked asWherein, in the step (A),to representA corresponding one of the retrieval-target images,a function representing image characteristics of any one of the acquired images;
s52: for each training imageRepeating the constructionMEach size isKSet of (2)To obtain theMAnSet of (2)Wherein each oneIncluding one selected from said minimatchKA sample, theKOne sample includes a positive exampleAnd (a)K-1) negative examples, said one positive example being said retrieval target featureThe above-mentioned (A) toK-1) negative examples,MIs less than or equal toBAnd is andMis less than or equal toK;
S53: construct the loss function L according to equation (5):

L = (1 / (M·B)) · Σ_{i=1..B} Σ_{m=1..M} −log[ exp(κ(φ_i, ψ_i)) / Σ_{ψ_j ∈ N_m} exp(κ(φ_i, ψ_j)) ]    (5)

where κ represents a similarity kernel function, with κ(a, b) representing the distance between the two data-point vectors a and b; φ_i and ψ_j respectively represent the initial fusion feature of the i-th training image and the retrieval target feature of a sample in N_m, and κ(φ_i, ψ_i) is calculated under the condition of the set N_m; the fraction inside the logarithm is a softmax function, used to characterize the proportion of one converted result in the sum of all converted results;

S54: use L to learn the weights of the residual feature and the gate feature in the fusion feature, obtaining the final weights.
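An illustrative sketch of the S51 to S54 training loss follows, assuming PyTorch, a dot-product similarity kernel (the text only requires some kernel κ, so this choice is an assumption), and K ≤ B so that K − 1 negatives can be drawn from the minibatch:

```python
import torch
import torch.nn.functional as F

def metric_loss(phi: torch.Tensor, psi: torch.Tensor, M: int, K: int) -> torch.Tensor:
    """phi: (B, D) initial fusion features; psi: (B, D) retrieval target features."""
    B = phi.size(0)
    losses = []
    for i in range(B):           # each training image in the minibatch
        for _ in range(M):       # S52: construct M candidate sets of size K
            perm = torch.randperm(B)
            negatives = perm[perm != i][:K - 1]          # K-1 negatives, assumes K <= B
            candidates = torch.cat([psi[i:i + 1], psi[negatives]])  # positive at index 0
            sims = candidates @ phi[i]                   # similarity kernel (dot product)
            target = torch.zeros(1, dtype=torch.long)    # index of the positive example
            # equation (5): softmax cross-entropy over the K candidates
            losses.append(F.cross_entropy(sims.unsqueeze(0), target))
    return torch.stack(losses).mean()
```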
In an embodiment, the image retrieval module 360 is configured to take the final fusion feature as the feature to be retrieved, calculate the similarity between the feature to be retrieved and the retrieval features of a plurality of images in the retrieval database, and return the images among them that meet the retrieval requirement, in the following manner:
S61: denote the final fusion feature of the image to be retrieved x as φ = g(x, t), where t represents the text corresponding to x and g represents the function that obtains the final fusion feature; take φ as the feature to be retrieved, and calculate, according to equation (6), the distance D_r between φ and the retrieval feature ψ_r of each image in the retrieval database:

D_r = κ(φ, ψ_r)    (6)

where the number of images in the retrieval database is denoted R, r = 1, 2, …, R;

S62: take the k distances with the lowest values among D_1, D_2, …, D_R, where the retrieval features corresponding to these k distances are ψ_{r_1}, ψ_{r_2}, …, ψ_{r_k}; return the images corresponding to ψ_{r_1}, ψ_{r_2}, …, ψ_{r_k} in the retrieval database as the images meeting the retrieval requirement.
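By way of illustration, S61 and S62 reduce to a nearest-neighbor search; the sketch below assumes NumPy and a Euclidean distance, since the exact form of equation (6) is not recoverable from the text:

```python
import numpy as np

def retrieve_top_k(query_feature: np.ndarray, db_features: np.ndarray, k: int):
    """query_feature: (D,); db_features: (R, D). Returns indices of the k nearest images."""
    distances = np.linalg.norm(db_features - query_feature, axis=1)  # D_1 .. D_R
    return np.argsort(distances)[:k]  # S62: the k lowest-valued distances
```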
In one embodiment, the VGGNet network model is a VGGNet-16 network model; or the Word2vec features are obtained through a Skip-Gram model; or the TF-IDF features are obtained through the sklearn library in Python.
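For illustration, both named text extractors are available off the shelf; the sketch below assumes gensim for the Skip-Gram Word2vec model (sg=1 selects Skip-Gram) and scikit-learn ("sklearn") for TF-IDF, with a toy corpus and parameters as stand-ins:

```python
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [["a", "dog", "on", "grass"], ["a", "cat", "on", "a", "sofa"]]

# Word2vec features from a Skip-Gram model (sg=1)
w2v = Word2Vec(sentences=corpus, vector_size=100, sg=1, min_count=1)
dog_vector = w2v.wv["dog"]  # an N-dimensional Word2vec feature for one word

# TF-IDF features via scikit-learn
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform([" ".join(doc) for doc in corpus])
```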
The image retrieval device provided by the embodiment of the invention can realize the following beneficial effects.
1. The embodiment of the invention takes two different modalities, text data and image data, as entry points, constructs a multi-modal fusion feature carrying more comprehensive information through the residual feature and the gate feature, and fully explores the relevance between the low-level features and the high-level semantics of the text data and the image data. Completing the retrieval task with the multi-modal fusion feature improves the recall ratio, the precision ratio, and the efficiency of image retrieval.
2. The embodiment of the invention uses metric learning techniques to construct the data samples, train the loss function, and learn the weight values of the image features and the text features, refining the optimization of the fusion feature, so that after the text features and the image features are fused, the feature constructed from the data to be retrieved is consistent with the feature of the retrieval target image in spatial structure and similar to it in semantic expression.
3. The embodiment of the invention provides a deep fusion model for obtaining the text features: the series concatenation of the Word2vec feature vector and the TF-IDF feature vector is used as the input from which a high-order fusion feature is learned and taken as the final feature of the text data, avoiding the curse of dimensionality that would arise if the concatenated vector, with its greatly increased dimension, were used directly as the text feature.
4. The embodiment of the invention uses the VGGNet-16 network model as the processing unit for the image data and fine-tunes the pre-trained parameters according to the characteristics of the target data set, so that the VGGNet-16 network model matches the target data set more closely, improving the accuracy and efficiency of image feature extraction.
5. The embodiment of the invention combines the image features and the text features by element-wise multiplication of same-position elements, preserving to the maximum extent the element information contained in the image features while matching the structures of the text features and the image features.
The image retrieval device of this embodiment has the same technical principle and beneficial effects as the image retrieval method of the first embodiment. For technical details not described in this embodiment, please refer to the image retrieval method of the first embodiment.
It should be noted that, in the embodiment of the apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Example Three
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes a processor 410 and a memory 420. The number of the processors 410 may be one or more, and one processor 410 is taken as an example in fig. 4.
The memory 420, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules of the image retrieval method in embodiments of the present invention. The processor 410 implements the image retrieval method described above by running software programs, instructions, and modules stored in the memory 420.
The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, which may be connected to the device/terminal/server via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Example Four
The embodiment of the invention further provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store a program for executing the following steps:
s10: acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
s20: extracting image features of the image to be retrieved by using a VGGNet network model;
s30: extracting Word2vec characteristics and TF-IDF characteristics of the text, and performing depth series connection on the Word2vec characteristics and the TF-IDF characteristics to obtain text characteristics of the image to be retrieved;
s40: fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual error characteristics and the gate characteristics according to weight to obtain fusion characteristics of the image to be retrieved;
s50: acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; learning the weights of the residual error features and the gate features in the fusion features by using the fusion features of the training images and the respective retrieval target features through a metric learning method to obtain final weights;
s60: and linearly combining the residual error characteristics and the gate characteristics of the image to be retrieved according to the final weight to obtain final fusion characteristics of the image to be retrieved, taking the final fusion characteristics as the characteristics to be retrieved, calculating the similarity between the characteristics to be retrieved and the retrieval characteristics of the plurality of images in the retrieval database, and returning the images which meet the retrieval requirements in the plurality of images.
Of course, the computer-readable program stored on the storage medium provided by the embodiments of the present invention is not limited to the method operations described above and may also perform related operations in the image retrieval method provided by any embodiment of the present invention.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing program code, such as a USB disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. An image retrieval method, comprising:
s10: acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
s20: extracting image features of the image to be retrieved by using a VGGNet network model;
s30: extracting Word vector (Word2vec) features and term frequency-inverse document frequency (TF-IDF) features of the text, and performing deep concatenation on the Word2vec features and the TF-IDF features to obtain the text features of the image to be retrieved;
s40: fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual error characteristics and the gate characteristics according to weight to obtain fusion characteristics of the image to be retrieved;
s50: acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; learning the weights of the residual error features and the gate features in the fusion features by using the fusion features of the training images and the respective retrieval target features through a metric learning method to obtain final weights;
s60: and linearly combining the residual error characteristics and the gate characteristics of the image to be retrieved according to the final weight to obtain final fusion characteristics of the image to be retrieved, taking the final fusion characteristics as the characteristics to be retrieved, calculating the similarity between the characteristics to be retrieved and the retrieval characteristics of the plurality of images in the retrieval database, and returning the images which meet the retrieval requirements in the plurality of images.
2. The image retrieval method of claim 1, wherein the parameter configuration of the VGGNet network model comprises the steps of:
s11: pre-training the VGGNet network model by using an ImageNet data set to obtain pre-training network parameters;
s12: adjusting the size of all images in a target data set of the VGGNet network model to 256 × 256, and randomly cropping a 227 × 227 region of the image content, together with its mirror image, as the input of the VGGNet network model;
s13: modifying the number of neurons of the last fully-connected layer of the VGGNet network model from the number of image categories in the ImageNet data set to the number c of image categories in the target data set;
s14: and performing Softmax operation with the dimensionality of c on the output of the last full connecting layer to obtain probability distribution results of the image to be retrieved in the c image categories.
3. The image retrieval method of claim 1, wherein in S30, the depth concatenation of the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved includes:
s31: denoting the Word2vec feature as w = (w_1, w_2, …, w_N), where w_1, …, w_N are all real numbers and N represents the dimension of the Word2vec feature; denoting the TF-IDF feature as d = (d_1, d_2, …, d_T), where d_1, …, d_T are all real numbers and T represents the dimension of the TF-IDF feature;
s32: concatenating w and d in series to obtain the spliced vector x = (w_1, …, w_N, d_1, …, d_T) of dimension N + T;
s33: inputting x into a deep neural network and learning, through this network, the high-order fusion feature of w and d, obtaining the text feature f_t of the image to be retrieved, where the dimension of f_t is less than N + T.
4. The image retrieval method according to claim 1, wherein S40 includes:
s41: transforming the text feature f_t by a convolution filter W_t according to equation (1), so that the transformed text feature f'_t and the image feature f_v have the same dimensions:

f'_t = W_t ∗ f_t    (1)

where ∗ denotes the standard normalized convolution calculation;

s42: constructing the residual feature f_res according to equation (2):

f_res = W_r2 ∗ ReLU(W_r1 ∗ (f'_t ⊙ f_v))    (2)

where W_r1 and W_r2 denote two convolution filters;

s43: constructing the gate feature f_gate according to equation (3):

f_gate = σ(W_g2 ∗ ReLU(W_g1 ∗ (f'_t ⊙ f_v))) ⊙ f_v    (3)

where σ is the sigmoid function, W_g1 and W_g2 denote two convolution filters, and ⊙ denotes the element-wise multiplication of same-position elements;

s44: linearly combining f_res and f_gate with their respective weights according to equation (4), obtaining the fusion feature f of the image to be retrieved:

f = w_g · f_gate + w_r · f_res    (4)

where w_g and w_r represent learnable weight values used to balance the proportions of f_gate and f_res in f.
5. The image retrieval method of claim 1, wherein in S50, the learning, by the metric learning method, weights of the residual features and the gate features in the fused features by using the fused features of the plurality of training images and the respective retrieval target features to obtain final weights comprises:
s51: setting the size of the minibatch used when searching for the minimum loss value with the gradient descent algorithm during training to B, where the minibatch includes the initial fusion feature φ_i of each training image x_i and the corresponding retrieval target feature ψ_i; denoting φ_i as φ_i = f_fuse(x_i, t_i), where t_i represents the text corresponding to x_i and f_fuse represents the function that obtains the initial fusion feature, i = 1, 2, …, B; denoting ψ_i as ψ_i = f_v(y_i), where y_i represents the retrieval target image corresponding to x_i and f_v represents the function that extracts the image feature of any image;

s52: for each training image x_i, repeatedly constructing M sets of size K to obtain the M sets N_1, N_2, …, N_M, where each set N_m includes K samples selected from the minibatch; the K samples include one positive example and (K − 1) negative examples, the one positive example being the retrieval target feature ψ_i corresponding to x_i and the (K − 1) negative examples being retrieval target features ψ_j with j ≠ i; M is less than or equal to B, and M is less than or equal to K;

s53: constructing the loss function L according to equation (5):

L = (1 / (M·B)) · Σ_{i=1..B} Σ_{m=1..M} −log[ exp(κ(φ_i, ψ_i)) / Σ_{ψ_j ∈ N_m} exp(κ(φ_i, ψ_j)) ]    (5)

where κ represents a similarity kernel function, with κ(a, b) representing the distance between the two data-point vectors a and b; φ_i and ψ_j respectively represent the initial fusion feature of the i-th training image and the retrieval target feature of a sample in N_m, and κ(φ_i, ψ_i) is calculated under the condition of the set N_m; the fraction inside the logarithm is a softmax function, used to characterize the proportion of one converted result in the sum of all converted results;

s54: using L to learn the weights of the residual feature and the gate feature in the fusion feature, obtaining the final weights.
6. The image retrieval method according to claim 5, wherein, in S60, the step of taking the final fused feature as a feature to be retrieved, calculating similarity between the feature to be retrieved and retrieval features of a plurality of images in a retrieval database, and returning an image meeting retrieval requirements in the plurality of images comprises:
s61: denoting the final fusion feature of the image to be retrieved x as φ = g(x, t), where t represents the text corresponding to x and g represents the function that obtains the final fusion feature; taking φ as the feature to be retrieved, and calculating, according to equation (6), the distance D_r between φ and the retrieval feature ψ_r of each image in the retrieval database:

D_r = κ(φ, ψ_r)    (6)

where the number of images in the retrieval database is denoted R, r = 1, 2, …, R;

s62: taking the k distances with the lowest values among D_1, D_2, …, D_R, where the retrieval features corresponding to these k distances are ψ_{r_1}, ψ_{r_2}, …, ψ_{r_k}; returning the images corresponding to ψ_{r_1}, ψ_{r_2}, …, ψ_{r_k} in the retrieval database as the images meeting the retrieval requirement.
7. The image retrieval method according to claim 1,
the VGGNet network model is a VGGNet-16 network model; or
The Word2vec feature is obtained through a Skip-Gram model; or
The TF-IDF features are obtained through the sklearn library in Python.
8. An image retrieval apparatus, comprising:
the data acquisition module is used for acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
the image feature extraction module is used for extracting the image features of the image to be retrieved by utilizing a VGGNet network model;
the text feature extraction module is used for extracting Word vector Word2vec features and Word frequency-inverse text frequency TF-IDF features of the text, and performing deep series connection on the Word2vec features and the TF-IDF features to obtain text features of the image to be retrieved;
the feature fusion module is used for fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual error characteristics and the gate characteristics according to weight to obtain fusion characteristics of the image to be retrieved;
the weight learning module is used for acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; learning the weights of the residual error features and the gate features in the fusion features by using the fusion features of the training images and the respective retrieval target features through a metric learning method to obtain final weights;
and the image retrieval module is configured to linearly combine the residual features and the gate features of the image to be retrieved according to the final weight to obtain final fusion features of the image to be retrieved, use the final fusion features as the features to be retrieved, calculate similarity between the features to be retrieved and retrieval features of a plurality of images in a retrieval database, and return the images meeting retrieval requirements in the plurality of images.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the image retrieval method according to any one of claims 1 to 7 when executing the program.
10. A storage medium on which a computer-readable program is stored, characterized in that the program, when executed, implements an image retrieval method as recited in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110841488.0A | 2021-07-26 | 2021-07-26 | Image retrieval method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113297410A (en) | 2021-08-24 |
Family
ID=77330973
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110841488.0A (Pending) | 2021-07-26 | 2021-07-26 | Image retrieval method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113297410A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105095964A (en) * | 2015-08-17 | 2015-11-25 | 杭州朗和科技有限公司 | Data processing method and device |
US20200285811A1 (en) * | 2018-02-05 | 2020-09-10 | Alibaba Group Holding Limited | Methods, apparatuses, and devices for generating word vectors |
CN109766465A (en) * | 2018-12-26 | 2019-05-17 | 中国矿业大学 | A kind of picture and text fusion book recommendation method based on machine learning |
US20210012199A1 (en) * | 2019-07-04 | 2021-01-14 | Zhejiang University | Address information feature extraction method based on deep neural network model |
CN112231442A (en) * | 2020-10-15 | 2021-01-15 | 北京临近空间飞行器系统工程研究所 | Sensitive word filtering method and device |
CN112364204A (en) * | 2020-11-12 | 2021-02-12 | 北京达佳互联信息技术有限公司 | Video searching method and device, computer equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
YUAN, Xu et al.: "Judgment document recommendation method based on multimodal feature fusion", Microelectronics & Computer *
LI, Chaoyue: "Research and application of cross-modal retrieval methods based on feature fusion", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113901177A (en) * | 2021-10-27 | 2022-01-07 | 电子科技大学 | Code searching method based on multi-mode attribute decision |
CN113901177B (en) * | 2021-10-27 | 2023-08-08 | 电子科技大学 | Code searching method based on multi-mode attribute decision |
CN113989792A (en) * | 2021-10-29 | 2022-01-28 | 天津大学 | Cultural relic recommendation algorithm based on fusion features |
CN114880517A (en) * | 2022-05-27 | 2022-08-09 | 支付宝(杭州)信息技术有限公司 | Method and device for video retrieval |
CN115269882A (en) * | 2022-09-28 | 2022-11-01 | 山东鼹鼠人才知果数据科技有限公司 | Intellectual property retrieval system and method based on semantic understanding |
CN115269882B (en) * | 2022-09-28 | 2022-12-30 | 山东鼹鼠人才知果数据科技有限公司 | Intellectual property retrieval system and method based on semantic understanding |
CN115905608A (en) * | 2022-11-15 | 2023-04-04 | 腾讯科技(深圳)有限公司 | Image feature acquisition method and device, computer equipment and storage medium |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210824 |