CN113297410A - Image retrieval method and device, computer equipment and storage medium
- Publication number: CN113297410A
- Application number: CN202110841488.0A
- Authority: CN (China)
- Prior art keywords: image, features, retrieved, retrieval, feature
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses an image retrieval method, an image retrieval device, computer equipment and a storage medium. The method comprises the following steps: acquiring an image to be retrieved and its corresponding text; extracting image features with a VGGNet network model; extracting Word2vec features and TF-IDF features of the text and deeply concatenating them to obtain text features; fusing the image features and the text features to construct residual features and gate features, and linearly combining these by weight to obtain fusion features; learning the weights by a metric learning method to obtain the final weights; and taking the final fusion features of the image to be retrieved as the features to be retrieved, calculating the similarity between them and the retrieval features of the images in the retrieval database, and returning the images that meet the retrieval requirements. Based on data in the two modalities of image and text, the method fuses information across modalities and completes the retrieval task with the fused information, thereby improving retrieval performance.
Description
Technical Field
The embodiment of the invention relates to the technical field of image retrieval, in particular to an image retrieval method, an image retrieval device, computer equipment and a storage medium.
Background
In the network era, with the rise of various social networks, different types of information such as text, pictures, audio and video have grown on a large scale. Data in different modalities can describe the same object or event from different angles, allowing people to understand it more completely. How to use data of different modalities to accomplish specific tasks in particular scenarios has therefore become a research hotspot. As multi-modal data grows, it becomes increasingly difficult for ordinary users to retrieve the information they need accurately and efficiently. The multi-modal data in image retrieval comprises the textual descriptions and the image representations of the images.
Image retrieval technology falls mainly into two types: Text-Based Image Retrieval (TBIR) and Content-Based Image Retrieval (CBIR). TBIR relies mainly on the annotation information of images, but in the face of image data sets numbering in the tens of thousands, manual annotation is too expensive, so this retrieval scheme cannot meet the requirements of practical applications. CBIR mainly uses feature extraction and high-dimensional indexing techniques, but because the visual information a computer acquires from an image may not be consistent with the semantic information a user understands from it, a gap arises between the low-level features and the high-level retrieval requirements, i.e., the "semantic gap". Because of this gap, images with similar features in CBIR may well be semantically irrelevant, so content-based retrieval results often fail to meet users' information needs.
Disclosure of Invention
The invention provides an image retrieval method, an image retrieval device, computer equipment and a storage medium, which are used for solving the problems in the prior art.
In a first aspect, an embodiment of the present invention provides an image retrieval method. The method comprises the following steps:
S10: acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
S20: extracting image features of the image to be retrieved by using a VGGNet network model;
S30: extracting Word2vec (Word to Vector) features and TF-IDF (Term Frequency-Inverse Document Frequency) features of the text, and deeply concatenating the Word2vec features and the TF-IDF features to obtain the text features of the image to be retrieved;
S40: fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual features and the gate features by weight to obtain the fusion features of the image to be retrieved;
S50: acquiring a training data set, wherein the training data set comprises a plurality of training images and the texts corresponding to the training images; learning the weights of the residual features and the gate features in the fusion features by a metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights;
S60: linearly combining the residual features and the gate features of the image to be retrieved according to the final weights to obtain the final fusion features of the image to be retrieved, taking the final fusion features as the features to be retrieved, calculating the similarity between the features to be retrieved and the retrieval features of the plurality of images in the retrieval database, and returning the images among them that meet the retrieval requirements.
In an embodiment, the parameter configuration of the VGGNet network model comprises the following steps:
S11: pre-training the VGGNet network model on the ImageNet data set to obtain pre-trained network parameters;
S12: adjusting all images in the target data set of the VGGNet network model to a size of 256 × 256, and randomly selecting a 227 × 227 image content mirror image as the input of the VGGNet network model;
S13: modifying the number of neurons in the last fully-connected layer of the VGGNet network model from the number of image categories in the ImageNet data set to the number c of image categories in the target data set;
S14: performing a Softmax operation of dimension c on the output of the last fully-connected layer to obtain the probability distribution of the image to be retrieved over the c image categories.
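For illustration only, the following PyTorch-style sketch shows one way steps S11-S14 could be configured; the class count c, the torchvision calls, and all identifiers are assumptions made for the example rather than part of the claimed method.

```python
import torch
import torch.nn as nn
from torchvision import models

c = 20  # assumed number of image categories in the target data set

# S11: start from parameters pre-trained on ImageNet.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# S13: replace the last fully-connected layer (1000 ImageNet classes -> c).
model.classifier[6] = nn.Linear(model.classifier[6].in_features, c)

# S14: c-dimensional Softmax over the output of the last fully-connected layer.
def class_distribution(images: torch.Tensor) -> torch.Tensor:
    logits = model(images)               # shape (batch, c)
    return torch.softmax(logits, dim=1)  # probability over the c categories
```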
In an embodiment, in S30, deeply concatenating the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved includes:
S31: denote the Word2vec feature as $V_w = (w_1, w_2, \ldots, w_N)$, where each $w_i$ is a real number and $N$ is the dimension of the Word2vec feature; denote the TF-IDF feature as $V_t = (t_1, t_2, \ldots, t_T)$, where each $t_j$ is a real number and $T$ is the dimension of the TF-IDF feature;
S32: concatenate the two features into $V_c = [V_w; V_t]$, a vector of dimension $N + T$;
S33: input $V_c$ into a deep neural network, through which the high-order fusion feature of $V_w$ and $V_t$ is learned, obtaining the text feature $F_{text}$ of the image to be retrieved, where the dimension of $F_{text}$ is less than $N + T$.
In one embodiment, S40 includes:
S41: transform the text feature $F_{text}$ with a convolution filter $W_c$ according to equation (1), so that the transformed text feature $\tilde{F}_{text}$ has the same dimensions as the image feature $F_{img}$:

$\tilde{F}_{text} = W_c * F_{text} \qquad (1)$

where $*$ denotes the standard normalized convolution calculation;
S42: combine $F_{img}$ and $\tilde{F}_{text}$ by same-position element-wise multiplication and construct the residual feature $F_{res}$ according to equation (2):

$F_{res} = \mathrm{ReLU}(W_r * (F_{img} \odot \tilde{F}_{text})) \qquad (2)$

S43: construct the gate feature $F_{gate}$ according to equation (3):

$F_{gate} = \sigma(W_{g1} * (F_{img} \odot \tilde{F}_{text})) \odot (W_{g2} * (F_{img} \odot \tilde{F}_{text})) \qquad (3)$

where $\sigma$ is the sigmoid function, $W_{g1}$ and $W_{g2}$ denote two convolution filters, and $\odot$ denotes the same-position element-wise multiplication;
S44: linearly combine $F_{res}$ and $F_{gate}$ by their respective weights according to equation (4) to obtain the fusion feature $\phi$ of the image to be retrieved:

$\phi = w_g F_{gate} + w_r F_{res} \qquad (4)$

where $w_g$ and $w_r$ denote learnable weight values used to balance the proportions of $F_{gate}$ and $F_{res}$ in $\phi$.
In one embodiment, in S50, learning the weights of the residual feature and the gate feature in the fusion feature by the metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights includes:
S51: set the minibatch size used when the gradient descent algorithm searches for the minimum loss value during training to $B$, where the minibatch contains, for each training image $x_i$, its initial fusion feature $\phi_i$ and the corresponding retrieval target feature $\psi_i$; denote $\phi_i = f_{fuse}(x_i, t_i)$, where $t_i$ is the text corresponding to $x_i$ and $f_{fuse}(\cdot)$ is the function that obtains the initial fusion feature, $i = 1, 2, \ldots, B$; denote $\psi_i = f_{img}(y_i)$, where $y_i$ is the retrieval target image corresponding to $x_i$ and $f_{img}(\cdot)$ is the function that obtains the image feature of any image;
S52: for each training image $x_i$, repeatedly construct $M$ sets $\mathcal{N}_i^m$ ($m = 1, 2, \ldots, M$), each of size $K$, where each $\mathcal{N}_i^m$ contains $K$ samples selected from the minibatch: one positive example, namely the retrieval target feature $\psi_i$, and $K - 1$ negative examples $\psi_j$ ($j \neq i$); $M$ is less than or equal to $B$, and $M$ is less than or equal to $K$;
S53: compute the loss $L$ according to equation (5):

$L = -\dfrac{1}{MB}\sum_{i=1}^{B}\sum_{m=1}^{M}\log\dfrac{\exp\{\kappa(\phi_i,\psi_i)\}}{\sum_{\psi_j\in\mathcal{N}_i^m}\exp\{\kappa(\phi_i,\psi_j)\}} \qquad (5)$

where $\kappa$ denotes a similarity kernel function and $D(a, b)$ denotes the distance between two data-point vectors $a$ and $b$; $\phi_i$ and $\psi_i$ are the initial fusion feature and the retrieval target feature of the $i$-th sample, computed under the condition of the set $\mathcal{N}_i^m$; the softmax form characterizes the percentage of each converted result in the sum of all converted results;
S54: use $L$ to learn the weights of the residual feature and the gate feature in the fusion feature, obtaining the final weights.
In an embodiment, in S60, taking the final fusion feature as the feature to be retrieved, calculating the similarity between the feature to be retrieved and the retrieval features of the plurality of images in the retrieval database, and returning the images that meet the retrieval requirements includes:
S61: denote the final fusion feature of the image to be retrieved $q$ as $\phi_q = f_{final}(q, t)$, where $t$ is the text corresponding to $q$ and $f_{final}(\cdot)$ is the function that obtains the final fusion feature; take $\phi_q$ as the feature to be retrieved and calculate, according to equation (6), the distance between it and the retrieval feature $\psi_r$ of each image in the retrieval database:

$D_r = D(\phi_q, \psi_r) \qquad (6)$

where the number of images in the retrieval database is $R$, $r = 1, 2, \ldots, R$;
S62: take the $k$ smallest of the distances $D_1, D_2, \ldots, D_R$, whose corresponding image retrieval features are $\psi_{r_1}, \psi_{r_2}, \ldots, \psi_{r_k}$; return the images in the retrieval database corresponding to these features as the images meeting the retrieval requirements.
In one embodiment, the VGGNet network model is a VGGNet-16 network model; or the Word2vec features are obtained through a Skip-Gram model; or the TF-IDF features are obtained through the sklearn library in Python.
In a second aspect, an embodiment of the present invention further provides an image retrieval apparatus. The device includes:
the data acquisition module is used for acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
the image feature extraction module is used for extracting the image features of the image to be retrieved by utilizing a VGGNet network model;
the text feature extraction module is used for extracting Word2vec features and TF-IDF features of the text, and performing deep concatenation on the Word2vec features and the TF-IDF features to obtain text features of the image to be retrieved;
the feature fusion module is used for fusing the image features and the text features to construct the residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures, and for linearly combining the residual features and the gate features by weight to obtain the fusion features of the image to be retrieved;
the weight learning module is used for acquiring a training data set, wherein the training data set comprises a plurality of training images and the texts corresponding to the training images, and for learning the weights of the residual features and the gate features in the fusion features by a metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights;
and the image retrieval module is configured to linearly combine the residual features and the gate features of the image to be retrieved according to the final weights to obtain the final fusion features of the image to be retrieved, take the final fusion features as the features to be retrieved, calculate the similarity between the features to be retrieved and the retrieval features of a plurality of images in a retrieval database, and return the images among them that meet the retrieval requirements.
In a third aspect, an embodiment of the present invention further provides a computer device. The device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor executes the program, the image retrieval method provided by any embodiment of the invention is realized.
In a fourth aspect, the embodiment of the present invention further provides a storage medium, on which a computer-readable program is stored, where the program, when executed, implements the image retrieval method provided by any embodiment of the present invention.
The invention can realize the following beneficial effects:
1. the embodiment of the invention takes two different types of modal data, text and image, as entry points, constructs a multi-modal fusion feature carrying more comprehensive information through the residual feature and the gate feature, and fully explores the correlation between the low-level features and the high-level semantics of the text and image data; the retrieval task is completed through the multi-modal fusion features, improving the recall ratio and precision ratio of image retrieval as well as the retrieval efficiency;
2. the embodiment of the invention uses metric learning techniques to complete the construction of data samples, the training of the loss function, and the learning of the weight values of image features and text features, perfecting the optimization of the fusion features, so that after the text features and the image features are fused, the features constructed from the data to be retrieved are consistent with the features of the retrieval target image in spatial structure and similar in semantic expression;
3. the embodiment of the invention provides a deep fusion model for obtaining text features, taking the serial concatenation of the Word2vec feature vector and the TF-IDF feature vector as input to learn a high-order fusion feature that serves as the final feature of the text data, thereby avoiding the dimensionality disaster that would result from directly using the concatenated features as the text features;
4. the embodiment of the invention adopts the VGGNet-16 network model as the processing unit for the image data and fine-tunes the pre-trained parameters according to the characteristics of the target data set, so that the VGGNet-16 network model matches the target data set more closely, improving the accuracy and efficiency of image feature extraction;
5. the embodiment of the invention combines the image features and the text features by same-position element-wise multiplication, incorporating the element information contained in the image features to the maximum extent while completing the structural matching of the text features and the image features.
Drawings
Fig. 1 is a flowchart of an image retrieval method according to an embodiment of the present invention.
Fig. 2 is a flowchart of another image retrieval method according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an image retrieval apparatus according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples. It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
Fig. 1 is a flowchart of an image retrieval method according to an embodiment of the present invention. The method realizes information fusion of data in different modes based on data in two modes of images and texts, and completes retrieval tasks by using the fused information, thereby improving the retrieval performance. The method includes steps S10-S60.
S10: acquire an image to be retrieved and the text corresponding to the image to be retrieved.
S20: extract the image features of the image to be retrieved using a VGGNet network model.
S30: extract the Word2vec feature and the TF-IDF feature of the text, and deeply concatenate the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved.
S40: fuse the image features and the text features to construct the residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; and linearly combine the residual features and the gate features by weight to obtain the fusion features of the image to be retrieved.
S50: acquire a training data set, wherein the training data set comprises a plurality of training images and the texts corresponding to the training images; and learn the weights of the residual features and the gate features in the fusion features by a metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights.
S60: linearly combine the residual features and the gate features of the image to be retrieved according to the final weights to obtain the final fusion features of the image to be retrieved, take the final fusion features as the features to be retrieved, calculate the similarity between the features to be retrieved and the retrieval features of the plurality of images in the retrieval database, and return the images among them that meet the retrieval requirements.
Fig. 2 is a flowchart of another image retrieval method according to an embodiment of the present invention, which shows the basic framework of the image retrieval method more concisely. First, the image to be retrieved and its text are acquired, and the features of the data in different modalities are extracted with separate networks: a deep convolutional neural network extracts the image features of the image-modality data, and a pre-trained language network model extracts the text features of the text-modality data. The text features and the image features are then fused, and metric learning is used in training to balance the proportions of the text features and the image features in the fused features, realizing an organic combination of the different modal data; the final fusion features are obtained through this learning and training. Finally, the final fusion features are used as the features to be retrieved, similarity measurements are computed against the features stored in the database, the set of similar features meeting the retrieval requirements is returned, and the corresponding retrieval results complete the retrieval task.
In one embodiment, in S20, the extraction of the image features may be implemented by a deep convolutional neural network. As an end-to-end feature extraction method, the deep convolutional neural network remains outstanding at image feature extraction even though its training requires large-scale labeled data, and in image-vision research its universality and extensibility are very strong. Traditional feature extraction methods suffer from manual design, a large amount of computation, low speed, poor real-time performance, and unfriendliness to small-scale data; a deep convolutional neural network extracts features automatically through deep learning and efficiently overcomes these shortcomings. In this embodiment, the VGGNet network model is selected as the processing unit for the image data to obtain the corresponding image features.
In one embodiment, the parameter configuration of the VGGNet network model includes steps S11-S14.
S11: The VGGNet network model is pre-trained on the ImageNet data set to obtain pre-trained network parameters.
S12: All images in the target data set of the VGGNet network model are resized to 256 × 256, and a 227 × 227 image content mirror image is randomly selected as the input to the VGGNet network model.
S13: The number of neurons in the last fully-connected layer of the VGGNet network model is modified from the number of image categories in the ImageNet data set to the number c of image categories in the target data set.
S14: A Softmax operation of dimension c is performed on the output of the last fully-connected layer to obtain the probability distribution of the image to be retrieved over the c image categories.
In one embodiment, the VGGNet network model is a VGGNet-16 network model. In the parameter configuration of the VGGNet-16 network model, pre-training is first carried out to obtain relatively stable network parameters, and the network parameters are then fine-tuned to better meet the requirements of the target data set. The fine-tuning process includes three steps.
1) All images in the target data set are resized to 256 × 256, and in the fine-tuning operation an image content mirror image of size 227 × 227 is randomly selected as the network input.
Step 1) brings two benefits: a smaller input reduces the amount of computation, and it also keeps the model complexity from becoming too high, reducing the risk of overfitting. Alternatively, both the 256 × 256 and 227 × 227 parameters may be modified according to the model; 256, i.e., 2 to the 8th power, is typically used. Here, "image content mirroring" is a data expansion means that enlarges the data set by mirroring the original images. Moreover, if the trained model is meant to be independent of left-right orientation, the mirrored data belong to the same category as the originals, which can increase the robustness of the network.
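For illustration, a minimal torchvision sketch of step 1) follows, assuming the resize-then-random-crop-and-mirror pipeline described above; the embodiment does not fix an exact augmentation pipeline, so this is only one plausible realization.

```python
from torchvision import transforms

fine_tune_transform = transforms.Compose([
    transforms.Resize((256, 256)),      # all images adjusted to 256 x 256
    transforms.RandomCrop(227),         # randomly selected 227 x 227 content
    transforms.RandomHorizontalFlip(),  # mirror expansion of the data set
    transforms.ToTensor(),
])
```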
2) The number of neurons in the last fully-connected layer of the network model is modified from the original 1000, which refers to the 1000 categories of the ImageNet data set, to c, the number of image categories in the target data set. Through step 2), the last fully-connected layer is made to fit the target data set.
3) A Softmax operation of dimension c is performed on the output of the last layer to obtain the probability distribution of the picture content over the c categories. A Euclidean loss function is employed.
The specific settings in the fine-tuning are further described below.
In the whole fine-tuning process, the VGGNet-16 is first pre-trained on the ImageNet data set, and the pre-trained parameters are used to assign the parameters of the first 7 layers of VGGNet-16. For the final fully-connected layer, parameter assignment is completed by fine-tuning. In this embodiment, the final fully-connected layer is randomly initialized with a Gaussian distribution $N(\mu, \sigma^2)$, where $\mu$ is the location parameter of the Gaussian distribution, describing its central tendency, and $\sigma$ describes the dispersion of the data distribution: the larger $\sigma$, the more dispersed the data; the smaller $\sigma$, the more concentrated. Both $\mu$ and $\sigma$ can be set flexibly according to requirements.
In the fine-tuning process, different learning rates are set for the front and rear levels of VGGNet-16. The main function of the earlier convolutional layers is to extract the low-level feature representation of the image data, which stays close to the parameter settings of the pre-trained model obtained on the ImageNet data set, so a lower learning rate of 0.001 is set for them. For the last three fully-connected layers of VGGNet-16, to ensure that the network model converges on the target data set as soon as possible and reaches the corresponding optimal solution, the learning rates of the first two fully-connected layers are preset to 0.002 and the learning rate of the last fully-connected layer to 0.01, which are relatively high. Because of these different learning rates, the update rates of the front and rear layers also differ accordingly. With this fine-tuning operation, the network model can fit the target data set as soon as possible, improving optimization efficiency and effect without destroying the relatively stable parameters obtained in the pre-training stage. After the fine-tuning operation is finished, the network model with its complete parameters is used to extract the image features corresponding to the image data.
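The following sketch illustrates the layered learning rates described above, assuming a torchvision VGG-16 whose classifier holds the three fully-connected layers; the optimizer choice (SGD with momentum) is an assumption for the example.

```python
import torch
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

optimizer = torch.optim.SGD([
    {"params": model.features.parameters(),      "lr": 0.001},  # conv layers
    {"params": model.classifier[0].parameters(), "lr": 0.002},  # FC-1
    {"params": model.classifier[3].parameters(), "lr": 0.002},  # FC-2
    {"params": model.classifier[6].parameters(), "lr": 0.01},   # final FC
], momentum=0.9)
```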
In an embodiment, in S30, deeply concatenating the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved includes steps S31-S33.
S31: denote the Word2vec feature as $V_w = (w_1, w_2, \ldots, w_N)$, where each $w_i$ is a real number and $N$ is the dimension of the Word2vec feature; denote the TF-IDF feature as $V_t = (t_1, t_2, \ldots, t_T)$, where each $t_j$ is a real number and $T$ is the dimension of the TF-IDF feature.
S32: concatenate the two features into $V_c = [V_w; V_t]$, a vector of dimension $N + T$.
S33: input $V_c$ into a deep neural network, through which the high-order fusion feature of $V_w$ and $V_t$ is learned, obtaining the text feature $F_{text}$ of the image to be retrieved, where the dimension of $F_{text}$ is less than $N + T$.
In one embodiment, the Word2vec feature may be obtained by the Skip-Gram model, and the TF-IDF feature may be obtained by the sklearn library in Python.
The extraction of the text features performs corresponding processing on the original text data to obtain vector representations that can be used subsequently, namely the text features. In this embodiment, the text features are obtained by extracting the Word2vec feature and the TF-IDF feature of the text and deeply concatenating them. It should be noted that this embodiment does not simply connect the Word2vec feature and the TF-IDF feature in series to obtain the text feature; rather, the two features are fused by a deep neural network trained with a categorical cross-entropy loss, so as to learn a high-order fusion feature, making the semantics learned in the text feature more accurate.
The Word2vec feature and the TF-IDF feature each have advantages. The Word2vec feature represents the semantic information of words as vectors (i.e., word vectors) through learning on a large corpus, and words with similar semantics are close to each other in the embedding space. The Word2vec feature takes context information into account and performs better than earlier embedding methods; at the same time, its dimensionality is lower, so processing is faster. The TF-IDF feature comes from a feature-weighting algorithm based on word frequency that is widely used in text mining; its main idea is that if a word or phrase appears frequently in one article and rarely in other articles, it is considered to have good category-distinguishing ability and to be suitable for classification. The TF-IDF feature is simple and fast to compute.
Alternatively, the Skip-Gram model is used to obtain Word2vec. First, vector representations of all words in the pre-processed text data are obtained using the Skip-Gram model. Then, all word vectors belonging to the same text are averaged, and the average is taken as the Word2vec feature vector of the text. "Same text" is understood here as "same sentence", for example a sentence describing an image. The Word2vec feature vector of a text can be expressed as $V_w = (w_1, w_2, \ldots, w_{150})$, where each $w_i$ is a real number and 150 indicates that the word vector is 150 dimensions long. The dimensionality of the word vector can be set according to requirements: a larger value for a large corpus, a smaller value for a specific field.
Alternatively, the TF-IDF feature vectors of the text can be extracted using the sklearn library in Python. In the sklearn library, the CountVectorizer class only considers the frequency of occurrence of each vocabulary term, while the TfidfVectorizer class additionally accounts for the inverse of the number of texts that contain the term. Therefore, in this embodiment, the TfidfVectorizer class in the sklearn library is used to convert each text into a TF-IDF feature vector. A text can then be represented as a 500-dimensional TF-IDF feature vector $V_t = (t_1, t_2, \ldots, t_{500})$. The dimensionality of the TF-IDF feature vector can also be set according to requirements: larger for a large corpus, smaller for a specific field.
The feature vectors are then spliced, with the TF-IDF feature concatenated behind the Word2vec feature. The spliced feature vector can be represented as $V_c = [V_w; V_t]$, a 650-dimensional vector. The spliced feature vector is input into a deep neural network to learn the high-order fusion feature, finally obtaining a 256-dimensional text feature $F_{text}$.
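A minimal sketch of this text branch follows, assuming gensim for Skip-Gram Word2vec and scikit-learn for TF-IDF; the two example sentences and the two-layer fusion network are illustrative assumptions, not the embodiment's exact architecture.

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["a dog runs on the grass", "two people ride horses on a beach"]
tokenized = [t.split() for t in texts]

# Skip-Gram Word2vec (sg=1), 150-dimensional word vectors, averaged per text.
w2v = Word2Vec(tokenized, vector_size=150, sg=1, min_count=1)
w2v_feat = np.stack([np.mean([w2v.wv[w] for w in s], axis=0) for s in tokenized])

# TF-IDF features via TfidfVectorizer, capped at 500 dimensions.
tfidf = TfidfVectorizer(max_features=500)
tfidf_feat = tfidf.fit_transform(texts).toarray()

# Splice TF-IDF behind Word2vec, then learn the 256-d high-order fusion feature.
spliced = np.concatenate([w2v_feat, tfidf_feat], axis=1)
fusion_net = nn.Sequential(
    nn.Linear(spliced.shape[1], 512), nn.ReLU(),
    nn.Linear(512, 256),
)
text_features = fusion_net(torch.from_numpy(spliced).float())  # (n_texts, 256)
```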
In one embodiment, S40 includes steps S41-S44, through which the obtained text features and image features are spatially unified.
S41: transform the text feature $F_{text}$ with a convolution filter $W_c$ according to equation (1), so that the transformed text feature $\tilde{F}_{text}$ has the same dimensions as the image feature $F_{img}$:

$\tilde{F}_{text} = W_c * F_{text} \qquad (1)$

where $*$ denotes the standard normalized convolution calculation.
Optionally, a convolution filter of size 3 × 3 is provided to extend the text feature along the height and width dimensions of the underlying image feature to a size that matches the underlying image feature. Through the structural transformation in equation (1), the extension process is completed, and $\tilde{F}_{text}$ is the extended text feature, whose size matches the underlying image features.
S42: construct the residual feature $F_{res}$ according to equation (2), in which the ReLU activation function yields the final text characteristics; more feature elements of the text data are thereby combined, achieving an effective conversion of the underlying text features:

$F_{res} = \mathrm{ReLU}(W_r * (F_{img} \odot \tilde{F}_{text})) \qquad (2)$

S43: construct the gate feature $F_{gate}$ according to equation (3):

$F_{gate} = \sigma(W_{g1} * (F_{img} \odot \tilde{F}_{text})) \odot (W_{g2} * (F_{img} \odot \tilde{F}_{text})) \qquad (3)$

where $\sigma$ is the sigmoid function, $W_{g1}$ and $W_{g2}$ denote two convolution filters, and $\odot$ denotes the same-position element-wise multiplication.
Optionally, in equation (3), $W_{g1}$ and $W_{g2}$ are two convolution filters of size 3 × 3, and the combination of the underlying image features and the underlying text features is realized by the same-position element-wise multiplication. On the basis of completing the structural matching of the underlying text features and the underlying image features, the obtained $F_{gate}$ incorporates the element information contained in the image features to the maximum extent.
S44: linearly combine $F_{res}$ and $F_{gate}$ by their respective weights according to equation (4) to obtain the fusion feature $\phi$ of the image to be retrieved:

$\phi = w_g F_{gate} + w_r F_{res} \qquad (4)$

where $w_g$ and $w_r$ denote learnable weight values used to balance the proportions of $F_{gate}$ and $F_{res}$ in $\phi$.
After the construction of the residual feature and the gate feature is completed, in order to better balance the influence of any single-modality feature on the embedding, the residual feature and the gate feature are each given a corresponding weight proportion and linearly combined to complete the feature combination.
In addition, the learnable weight values $w_g$ and $w_r$ are each a single numerical value. Optionally, $w_g$ can be determined jointly with the convolution filters $W_{g1}$ and $W_{g2}$ through their weight correlation at multiple convolution positions, and $w_r$ likewise with $W_r$; learning the combination of filters and weights can therefore also be understood as learning $w_g$ and $w_r$.
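A hedged PyTorch sketch of equations (1)-(4) follows; it assumes convolutional feature maps with matching channel counts, 3 × 3 filters as described above, and scalar learnable weights initialized to 0.5. The exact parameterization of the embodiment may differ.

```python
import torch
import torch.nn as nn

class ResidualGateFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.w_c = nn.Conv2d(channels, channels, 3, padding=1)   # eq. (1)
        self.w_r = nn.Conv2d(channels, channels, 3, padding=1)   # eq. (2)
        self.w_g1 = nn.Conv2d(channels, channels, 3, padding=1)  # eq. (3)
        self.w_g2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.weight_g = nn.Parameter(torch.tensor(0.5))  # initial weights 0.5
        self.weight_r = nn.Parameter(torch.tensor(0.5))

    def forward(self, f_img: torch.Tensor, f_text: torch.Tensor) -> torch.Tensor:
        f_text = self.w_c(f_text)            # match the image feature dims
        joint = f_img * f_text               # same-position multiplication
        f_res = torch.relu(self.w_r(joint))                          # eq. (2)
        f_gate = torch.sigmoid(self.w_g1(joint)) * self.w_g2(joint)  # eq. (3)
        return self.weight_g * f_gate + self.weight_r * f_res        # eq. (4)
```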
In one embodiment, in S50, learning the weights of the residual feature and the gate feature in the fusion feature by the metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights includes steps S51-S54.
S51: set the minibatch size used when the gradient descent algorithm searches for the minimum loss value during training to $B$, where the minibatch contains, for each training image $x_i$, its initial fusion feature $\phi_i = f_{fuse}(x_i, t_i)$ and the corresponding retrieval target feature $\psi_i = f_{img}(y_i)$; here $t_i$ denotes the text corresponding to $x_i$, $f_{fuse}(\cdot)$ the function obtaining the initial fusion feature, $y_i$ the corresponding retrieval target image, $f_{img}(\cdot)$ the function obtaining the image feature of any image, and $i = 1, 2, \ldots, B$.
The goal of the learning training is to bring the fused feature closer to the target feature and push it farther from irrelevant features, so a supervised classification-loss training method is adopted. During training, each forward propagation yields a loss value between the output value and the true value; the smaller the loss value, the better the model. Optionally, a gradient descent algorithm is used to find the minimum loss value, so that the corresponding learnable parameters can be derived in reverse, optimizing the model. In the gradient descent process the minibatch size is set to $B$, and the initial fusion feature of a piece of data to be retrieved is $\phi_i = f_{fuse}(x_i, t_i)$, where $x_i$ represents the image to be retrieved and $t_i$ its text. The initial weights $w_g$ and $w_r$ are both 0.5, and the corresponding retrieval target image is characterized by $\psi_i = f_{img}(y_i)$, where $y_i$ represents the retrieval target image and $f_{img}(\cdot)$ the feature extraction function of VGGNet-16.
S52: for each training image $x_i$, repeatedly construct $M$ sets $\mathcal{N}_i^m$, each of size $K$, to obtain the sets $\mathcal{N}_i^1, \ldots, \mathcal{N}_i^M$, where each $\mathcal{N}_i^m$ contains $K$ samples selected from the minibatch: one positive example, namely the retrieval target feature $\psi_i$, and $K - 1$ negative examples $\psi_j$ ($j \neq i$); $M$ is less than or equal to $B$, and $M$ is less than or equal to $K$.
Optionally, for each training image, samples are selected from the set minibatch to construct a set of size $K$ with one positive example and $K - 1$ negative examples, the positive example being $\psi_i$ and the negative examples the other target features in the minibatch. This construction is repeated $M$ times, where $M$ is not greater than the minibatch size $B$ and not greater than the constructed set size $K$.
S53: the loss $L$ is computed according to equation (5):

$L = -\dfrac{1}{MB}\sum_{i=1}^{B}\sum_{m=1}^{M}\log\dfrac{\exp\{\kappa(\phi_i,\psi_i)\}}{\sum_{\psi_j\in\mathcal{N}_i^m}\exp\{\kappa(\phi_i,\psi_j)\}} \qquad (5)$

where $\kappa$ denotes a similarity kernel function and $D(a, b)$ denotes the distance between the two data-point vectors $a$ and $b$; $\phi_i$ and $\psi_i$ respectively denote the initial fusion feature and the retrieval target feature of the $i$-th sample, computed under the condition of the set $\mathcal{N}_i^m$; the softmax function characterizes the percentage of each converted result in the sum of all converted results.
Optionally, in the calculation, $\kappa(a, b)$ is set to the negative distance $-D(a, b)$, where $D$ is the Euclidean distance; when $a$ and $b$ are two data-point vectors and $n$ denotes the dimension of the vectors, it is defined as:

$D(a, b) = \sqrt{\sum_{j=1}^{n}(a_j - b_j)^2}$
S54: use $L$ to learn the weights of the residual feature and the gate feature in the fusion feature, obtaining the final weights.
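A minimal sketch of the loss in equation (5) follows, with kappa set to the negative Euclidean distance as described above; the random construction of the negative sets is an assumption about details the text leaves open.

```python
import torch

def kappa(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return -torch.norm(a - b, dim=-1)  # negative Euclidean distance

def metric_loss(phi: torch.Tensor, psi: torch.Tensor, K: int, M: int) -> torch.Tensor:
    """phi: (B, d) initial fusion features; psi: (B, d) retrieval target features."""
    B = phi.size(0)
    terms = []
    for i in range(B):
        for _ in range(M):
            perm = torch.randperm(B)
            neg = perm[perm != i][: K - 1]              # K-1 negative examples
            cand = torch.cat([psi[i:i + 1], psi[neg]])  # positive example first
            logits = kappa(phi[i], cand)                # similarity kernel values
            terms.append(-torch.log_softmax(logits, dim=0)[0])
    return torch.stack(terms).mean()                    # equation (5)
```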
By utilizing the related technology of metric learning, the construction of data samples, the training of loss functions and the learning of the weight values of image features and text features are completed, and the specific optimization of fusion features is perfected. After the text features and the image features are fused, the features constructed by the data to be retrieved are consistent with the features of the retrieval target image in the spatial structure and are similar to the features of the retrieval target image in semantic expression.
It should be noted that, in this embodiment, spatial-structure consistency may be understood as consistency of spatial dimensions, and semantic-expression similarity as similarity of high-level semantic expression. The "high-level semantics" of an image is a concept opposed to its "low-level features". The low-level features of an image are contour, edge, color, texture, and shape features; they carry little semantic information but locate targets accurately. The high-level semantic features build on the low-level ones: for example, extracting the low-level features of a face yields continuous contours, a nose, eyes, and the like, while the high-level feature presents them as a face. High-level features are rich in semantic information but locate targets only roughly. We refer to the visual features of an image as the visual space and to the semantic information of a category as the semantic space.
In an embodiment, in S60, taking the final fusion feature as the feature to be retrieved, calculating the similarity between the feature to be retrieved and the retrieval features of the plurality of images in the retrieval database, and returning the images that meet the retrieval requirements includes steps S61 and S62.
S61: the image to be retrieved is processedIs finally fused and characterized asWherein, in the step (A),tto representThe corresponding text is then displayed on the display screen,representation acquisitionA function of the final fused features of (a); will be provided withAs the feature to be retrieved, each image in the retrieval database of the feature to be retrieved is calculated according to formula (6)Search feature ofThe distance between:
Wherein the number of images in the search database is recorded asR,r=1,2,…,R。
In the image retrieval process, the sorting output of the similarity result is the last very important step. Fusion features after optimization of trainingThen, the feature vector is used as the basis of similarity check and the search feature of the existing picture in the databaseAnd (3) calculating the distance:
wherein, the selection of the distance function is consistent with the selection of the similar kernel function in the metric learning, and the distance can be expressed as:
the smaller the interval is, the higher the probability that the two feature vectors belong to the same class is, i.e. the similarity between the two vectors is higher.
S62: getD 1,D 2,…,D RFront of lowest numerical valuekA distance, wherein, the frontkThe picture retrieval characteristics corresponding to the distances are(ii) a Will be provided withAnd returning the corresponding image in the retrieval database as the image meeting the retrieval requirement.
Sorting the output results of S61, and taking the front with the minimum valuekEach vector is used for obtaining a similarity list of the examination results:
the set Sim represents the set of similar features of the fusion feature to be retrieved after similarity query, and each feature vector in the setThe corresponding original image is the final retrieval result.
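A short sketch of S61-S62 follows; the array names, shapes, and NumPy-based implementation are illustrative assumptions.

```python
import numpy as np

def retrieve(phi_q: np.ndarray, db_feats: np.ndarray, k: int = 10) -> np.ndarray:
    """phi_q: (d,) fused query feature; db_feats: (R, d) retrieval features.
    Returns the indices of the k most similar database images."""
    dists = np.linalg.norm(db_feats - phi_q, axis=1)  # D_1 ... D_R, Euclidean
    return np.argsort(dists)[:k]                      # smallest distances first
```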
The image retrieval method provided by the embodiment of the invention can realize the following beneficial effects.
1. The embodiment of the invention takes two different types of modal data, text and image, as entry points, constructs a multi-modal fusion feature carrying more comprehensive information through the residual feature and the gate feature, and fully explores the correlation between the low-level features and the high-level semantics of the text and image data; the retrieval task is completed through the multi-modal fusion features, improving the recall ratio and precision ratio of image retrieval as well as the retrieval efficiency.
2. The embodiment of the invention uses metric learning techniques to complete the construction of data samples, the training of the loss function, and the learning of the weight values of image features and text features, perfecting the optimization of the fusion features, so that after the text features and the image features are fused, the features constructed from the data to be retrieved are consistent with the features of the retrieval target image in spatial structure and similar in semantic expression.
3. The embodiment of the invention provides a deep fusion model for obtaining text features, taking the serial concatenation of the Word2vec feature vector and the TF-IDF feature vector as input to learn a high-order fusion feature that serves as the final feature of the text data, thereby avoiding the dimensionality disaster that would result from directly using the concatenated features as the text features.
4. According to the embodiment of the invention, the VGGNet-16 network model is used as the processing unit for the image data, and the pre-trained parameters are fine-tuned according to the characteristics of the target data set, so that the VGGNet-16 network model matches the target data set more closely, improving the accuracy and efficiency of image feature extraction.
5. The embodiment of the invention combines the image features and the text features by same-position element-wise multiplication, incorporating the element information contained in the image features to the maximum extent while completing the structural matching of the text features and the image features.
Example two
Fig. 3 is a schematic structural diagram of an image retrieval apparatus according to an embodiment of the present invention. The device is used for implementing the image retrieval method provided by the first embodiment, and includes a data acquisition module 310, an image feature extraction module 320, a text feature extraction module 330, a feature fusion module 340, a weight learning module 350, and an image retrieval module 360.
The data obtaining module 310 is configured to obtain an image to be retrieved and a text corresponding to the image to be retrieved.
The image feature extraction module 320 is configured to extract image features of the image to be retrieved using the VGGNet network model.
The text feature extraction module 330 is configured to extract Word2vec features and TF-IDF features of the text, and perform deep concatenation on the Word2vec features and the TF-IDF features to obtain text features of the image to be retrieved.
The feature fusion module 340 is configured to fuse the image features and the text features to construct the residual features and gate features of the image to be retrieved, where the residual features and the gate features have consistent spatial structures, and to linearly combine the residual features and the gate features by weight to obtain the fusion features of the image to be retrieved.
The weight learning module 350 is configured to acquire a training data set, where the training data set includes a plurality of training images and their corresponding texts, and to learn the weights of the residual features and the gate features in the fusion features by a metric learning method, using the fusion features of the training images and their respective retrieval target features, to obtain the final weights.
The image retrieval module 360 is configured to linearly combine the residual features and the gate features of the image to be retrieved according to the final weight to obtain final fusion features of the image to be retrieved, use the final fusion features as the features to be retrieved, calculate similarity between the features to be retrieved and retrieval features of a plurality of images in a retrieval database, and return the images meeting the retrieval requirements in the plurality of images.
In an embodiment, the parameter configuration of the VGGNet network model comprises the following steps:
S11: pre-training the VGGNet network model on the ImageNet data set to obtain pre-trained network parameters;
S12: adjusting all images in the target data set of the VGGNet network model to a size of 256 × 256, and randomly selecting a 227 × 227 image content mirror image as the input of the VGGNet network model;
S13: modifying the number of neurons in the last fully-connected layer of the VGGNet network model from the number of image categories in the ImageNet data set to the number c of image categories in the target data set;
S14: performing a Softmax operation of dimension c on the output of the last fully-connected layer to obtain the probability distribution of the image to be retrieved over the c image categories.
In an embodiment, the text feature extraction module 330 is configured to deeply concatenate the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved in the following manner:
S31: denote the Word2vec feature as $V_w = (w_1, w_2, \ldots, w_N)$, where each $w_i$ is a real number and $N$ is the dimension of the Word2vec feature; denote the TF-IDF feature as $V_t = (t_1, t_2, \ldots, t_T)$, where each $t_j$ is a real number and $T$ is the dimension of the TF-IDF feature;
S32: concatenate the two features into $V_c = [V_w; V_t]$, a vector of dimension $N + T$;
S33: input $V_c$ into a deep neural network, through which the high-order fusion feature of $V_w$ and $V_t$ is learned, obtaining the text feature $F_{text}$ of the image to be retrieved, where the dimension of $F_{text}$ is less than $N + T$.
In one embodiment, the feature fusion module 340 includes: a size transformation unit 341, a residual feature construction unit 342, a gate feature construction unit 343, and a feature fusion unit 344.
The size transformation unit 341 is arranged to transform the text feature $F_{text}$ with a convolution filter $W_c$ according to equation (1), so that the transformed text feature $\tilde{F}_{text}$ has the same dimensions as the image feature $F_{img}$:

$\tilde{F}_{text} = W_c * F_{text} \qquad (1)$

where $*$ denotes the standard normalized convolution calculation.
The residual feature construction unit 342 is arranged to construct the residual feature according to equation (2):

$F_{res} = \mathrm{ReLU}(W_r * (F_{img} \odot \tilde{F}_{text})) \qquad (2)$

The gate feature construction unit 343 is arranged to construct the gate feature according to equation (3):

$F_{gate} = \sigma(W_{g1} * (F_{img} \odot \tilde{F}_{text})) \odot (W_{g2} * (F_{img} \odot \tilde{F}_{text})) \qquad (3)$

where $\sigma$ is the sigmoid function, $W_{g1}$ and $W_{g2}$ denote two convolution filters, and $\odot$ denotes the same-position element-wise multiplication.
The feature fusion unit 344 is arranged to linearly combine $F_{res}$ and $F_{gate}$ by their respective weights according to equation (4), obtaining the fusion feature $\phi$ of the image to be retrieved:

$\phi = w_g F_{gate} + w_r F_{res} \qquad (4)$

where $w_g$ and $w_r$ denote learnable weight values used to balance the proportions of $F_{gate}$ and $F_{res}$ in $\phi$.
In one embodiment, the weight learning module 350 is configured to learn the weights of the residual features and the gate features in the fused features by metric learning method according to the following manner, and using the fused features of the training images and the respective search target features to obtain final weights:
s51: setting the size of minipatch when the gradient descent algorithm is adopted to search the minimum loss value in the training process asBWherein the minimatch includes each training imageInitial fusion feature ofAndcorresponding search target feature(ii) a Will be provided withIs marked asWherein, in the step (A),to representThe corresponding text is then displayed on the display screen,representation acquisitionAs a function of the initial fused features of (a),i=1,2,…,B(ii) a Will be provided withIs marked asWherein, in the step (A),to representA corresponding one of the retrieval-target images,a function representing image characteristics of any one of the acquired images;
s52: for each training imageRepeating the constructionMEach size isKSet of (2)To obtain theMAnSet of (2)Wherein each oneIncluding one selected from said minimatchKA sample, theKOne sample includes a positive exampleAnd (a)K-1) negative examples, said one positive example being said retrieval target featureThe above-mentioned (A) toK-1) negative examples,MIs less than or equal toBAnd is andMis less than or equal toK;
S53: construct the loss function L according to equation (5):

L = (1 / (M·B)) · Σ_{i=1..B} Σ_{m=1..M} −log[ exp(κ(φ_i, ψ_i)) / Σ_{ψ_j ∈ N_m} exp(κ(φ_i, ψ_j)) ]    (5)

where κ represents a similarity kernel function, with κ(a, b) representing the distance between the two data-point vectors a and b; φ_i and ψ_j respectively represent the initial fusion feature of the i-th training image and the retrieval target feature of a sample in N_m, and κ(φ_i, ψ_i) is calculated under the condition of the set N_m; the fraction inside the logarithm is a softmax function, used to characterize the proportion of one converted result in the sum of all converted results;

S54: use L to learn the weights of the residual feature and the gate feature in the fusion feature, obtaining the final weights.
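An illustrative sketch of the S51 to S54 training loss follows, assuming PyTorch, a dot-product similarity kernel (the text only requires some kernel κ, so this choice is an assumption), and K ≤ B so that K − 1 negatives can be drawn from the minibatch:

```python
import torch
import torch.nn.functional as F

def metric_loss(phi: torch.Tensor, psi: torch.Tensor, M: int, K: int) -> torch.Tensor:
    """phi: (B, D) initial fusion features; psi: (B, D) retrieval target features."""
    B = phi.size(0)
    losses = []
    for i in range(B):           # each training image in the minibatch
        for _ in range(M):       # S52: construct M candidate sets of size K
            perm = torch.randperm(B)
            negatives = perm[perm != i][:K - 1]          # K-1 negatives, assumes K <= B
            candidates = torch.cat([psi[i:i + 1], psi[negatives]])  # positive at index 0
            sims = candidates @ phi[i]                   # similarity kernel (dot product)
            target = torch.zeros(1, dtype=torch.long)    # index of the positive example
            # equation (5): softmax cross-entropy over the K candidates
            losses.append(F.cross_entropy(sims.unsqueeze(0), target))
    return torch.stack(losses).mean()
```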
In an embodiment, the image retrieval module 360 is configured to take the final fusion feature as the feature to be retrieved, calculate the similarity between the feature to be retrieved and the retrieval features of a plurality of images in the retrieval database, and return the images among them that meet the retrieval requirement, in the following manner:
S61: denote the final fusion feature of the image to be retrieved x as φ = g(x, t), where t represents the text corresponding to x and g represents the function that obtains the final fusion feature; take φ as the feature to be retrieved, and calculate, according to equation (6), the distance D_r between φ and the retrieval feature ψ_r of each image in the retrieval database:

D_r = κ(φ, ψ_r)    (6)

where the number of images in the retrieval database is denoted R, r = 1, 2, …, R;

S62: take the k distances with the lowest values among D_1, D_2, …, D_R, where the retrieval features corresponding to these k distances are ψ_{r_1}, ψ_{r_2}, …, ψ_{r_k}; return the images corresponding to ψ_{r_1}, ψ_{r_2}, …, ψ_{r_k} in the retrieval database as the images meeting the retrieval requirement.
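By way of illustration, S61 and S62 reduce to a nearest-neighbor search; the sketch below assumes NumPy and a Euclidean distance, since the exact form of equation (6) is not recoverable from the text:

```python
import numpy as np

def retrieve_top_k(query_feature: np.ndarray, db_features: np.ndarray, k: int):
    """query_feature: (D,); db_features: (R, D). Returns indices of the k nearest images."""
    distances = np.linalg.norm(db_features - query_feature, axis=1)  # D_1 .. D_R
    return np.argsort(distances)[:k]  # S62: the k lowest-valued distances
```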
In one embodiment, the VGGNet network model is a VGGNet-16 network model; or the Word2vec features are obtained through a Skip-Gram model; or the TF-IDF features are obtained through the sklearn library in Python.
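For illustration, both named text extractors are available off the shelf; the sketch below assumes gensim for the Skip-Gram Word2vec model (sg=1 selects Skip-Gram) and scikit-learn ("sklearn") for TF-IDF, with a toy corpus and parameters as stand-ins:

```python
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [["a", "dog", "on", "grass"], ["a", "cat", "on", "a", "sofa"]]

# Word2vec features from a Skip-Gram model (sg=1)
w2v = Word2Vec(sentences=corpus, vector_size=100, sg=1, min_count=1)
dog_vector = w2v.wv["dog"]  # an N-dimensional Word2vec feature for one word

# TF-IDF features via scikit-learn
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform([" ".join(doc) for doc in corpus])
```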
The image retrieval device provided by the embodiment of the invention can realize the following beneficial effects.
1. The embodiment of the invention takes two different modalities, text data and image data, as entry points, constructs a multi-modal fusion feature carrying more comprehensive information through the residual feature and the gate feature, and fully explores the relevance between the low-level features and the high-level semantics of the text data and the image data. Completing the retrieval task with the multi-modal fusion feature improves the recall ratio, the precision ratio, and the efficiency of image retrieval.
2. The embodiment of the invention uses metric learning techniques to construct the data samples, train the loss function, and learn the weight values of the image features and the text features, refining the optimization of the fusion feature, so that after the text features and the image features are fused, the feature constructed from the data to be retrieved is consistent with the feature of the retrieval target image in spatial structure and similar to it in semantic expression.
3. The embodiment of the invention provides a deep fusion model for obtaining the text features: the series concatenation of the Word2vec feature vector and the TF-IDF feature vector is used as the input from which a high-order fusion feature is learned and taken as the final feature of the text data, avoiding the curse of dimensionality that would arise if the concatenated vector, with its greatly increased dimension, were used directly as the text feature.
4. The embodiment of the invention uses the VGGNet-16 network model as the processing unit for the image data and fine-tunes the pre-trained parameters according to the characteristics of the target data set, so that the VGGNet-16 network model matches the target data set more closely, improving the accuracy and efficiency of image feature extraction.
5. The embodiment of the invention combines the image features and the text features by element-wise multiplication of same-position elements, preserving to the maximum extent the element information contained in the image features while matching the structures of the text features and the image features.
The image retrieval device of this embodiment has the same technical principle and beneficial effects as the image retrieval method of the first embodiment. For technical details not described in this embodiment, please refer to the image retrieval method of the first embodiment.
It should be noted that, in the embodiment of the apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Example Three
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes a processor 410 and a memory 420. The number of the processors 410 may be one or more, and one processor 410 is taken as an example in fig. 4.
The memory 420, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules of the image retrieval method in embodiments of the present invention. The processor 410 implements the image retrieval method described above by running software programs, instructions, and modules stored in the memory 420.
The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, which may be connected to the device/terminal/server via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Example Four
The embodiment of the invention further provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store a program for executing the following steps:
s10: acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
s20: extracting image features of the image to be retrieved by using a VGGNet network model;
s30: extracting Word2vec characteristics and TF-IDF characteristics of the text, and performing depth series connection on the Word2vec characteristics and the TF-IDF characteristics to obtain text characteristics of the image to be retrieved;
s40: fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual error characteristics and the gate characteristics according to weight to obtain fusion characteristics of the image to be retrieved;
s50: acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; learning the weights of the residual error features and the gate features in the fusion features by using the fusion features of the training images and the respective retrieval target features through a metric learning method to obtain final weights;
s60: and linearly combining the residual error characteristics and the gate characteristics of the image to be retrieved according to the final weight to obtain final fusion characteristics of the image to be retrieved, taking the final fusion characteristics as the characteristics to be retrieved, calculating the similarity between the characteristics to be retrieved and the retrieval characteristics of the plurality of images in the retrieval database, and returning the images which meet the retrieval requirements in the plurality of images.
Of course, the computer-readable program stored on the storage medium provided by the embodiments of the present invention is not limited to the method operations described above and may also perform related operations in the image retrieval method provided by any embodiment of the present invention.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing program code, such as a USB disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. An image retrieval method, comprising:
s10: acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
s20: extracting image features of the image to be retrieved by using a VGGNet network model;
s30: extracting Word vector (Word2vec) features and term frequency-inverse document frequency (TF-IDF) features of the text, and performing deep concatenation on the Word2vec features and the TF-IDF features to obtain the text features of the image to be retrieved;
s40: fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual error characteristics and the gate characteristics according to weight to obtain fusion characteristics of the image to be retrieved;
s50: acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; learning the weights of the residual error features and the gate features in the fusion features by using the fusion features of the training images and the respective retrieval target features through a metric learning method to obtain final weights;
s60: and linearly combining the residual error characteristics and the gate characteristics of the image to be retrieved according to the final weight to obtain final fusion characteristics of the image to be retrieved, taking the final fusion characteristics as the characteristics to be retrieved, calculating the similarity between the characteristics to be retrieved and the retrieval characteristics of the plurality of images in the retrieval database, and returning the images which meet the retrieval requirements in the plurality of images.
2. The image retrieval method of claim 1, wherein the parameter configuration of the VGGNet network model comprises the steps of:
s11: pre-training the VGGNet network model by using an ImageNet data set to obtain pre-training network parameters;
s12: adjusting the size of all images in a target data set of the VGGNet network model to 256 × 256, and randomly cropping a 227 × 227 region of the image content, together with its mirror image, as the input of the VGGNet network model;
s13: modifying the number of neurons of the last fully-connected layer of the VGGNet network model from the number of image categories in the ImageNet data set to the number c of image categories in the target data set;
s14: and performing Softmax operation with the dimensionality of c on the output of the last full connecting layer to obtain probability distribution results of the image to be retrieved in the c image categories.
3. The image retrieval method of claim 1, wherein in S30, the depth concatenation of the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved includes:
s31: denoting the Word2vec feature as w = (w_1, w_2, …, w_N), where w_1, …, w_N are all real numbers and N represents the dimension of the Word2vec feature; denoting the TF-IDF feature as d = (d_1, d_2, …, d_T), where d_1, …, d_T are all real numbers and T represents the dimension of the TF-IDF feature;
s32: concatenating w and d in series to obtain the spliced vector x = (w_1, …, w_N, d_1, …, d_T) of dimension N + T;
s33: inputting x into a deep neural network and learning, through this network, the high-order fusion feature of w and d, obtaining the text feature f_t of the image to be retrieved, where the dimension of f_t is less than N + T.
4. The image retrieval method according to claim 1, wherein S40 includes:
s41: transforming the text feature f_t by a convolution filter W_t according to equation (1), so that the transformed text feature f'_t and the image feature f_v have the same dimensions:

f'_t = W_t ∗ f_t    (1)

where ∗ denotes the standard normalized convolution calculation;

s42: constructing the residual feature f_res according to equation (2):

f_res = W_r2 ∗ ReLU(W_r1 ∗ (f'_t ⊙ f_v))    (2)

where W_r1 and W_r2 denote two convolution filters;

s43: constructing the gate feature f_gate according to equation (3):

f_gate = σ(W_g2 ∗ ReLU(W_g1 ∗ (f'_t ⊙ f_v))) ⊙ f_v    (3)

where σ is the sigmoid function, W_g1 and W_g2 denote two convolution filters, and ⊙ denotes the element-wise multiplication of same-position elements;

s44: linearly combining f_res and f_gate with their respective weights according to equation (4), obtaining the fusion feature f of the image to be retrieved:

f = w_g · f_gate + w_r · f_res    (4)

where w_g and w_r represent learnable weight values used to balance the proportions of f_gate and f_res in f.
5. The image retrieval method of claim 1, wherein in S50, the learning, by the metric learning method, weights of the residual features and the gate features in the fused features by using the fused features of the plurality of training images and the respective retrieval target features to obtain final weights comprises:
s51: setting the size of the minibatch used when searching for the minimum loss value with the gradient descent algorithm during training to B, where the minibatch includes the initial fusion feature φ_i of each training image x_i and the corresponding retrieval target feature ψ_i; denoting φ_i as φ_i = f_fuse(x_i, t_i), where t_i represents the text corresponding to x_i and f_fuse represents the function that obtains the initial fusion feature, i = 1, 2, …, B; denoting ψ_i as ψ_i = f_v(y_i), where y_i represents the retrieval target image corresponding to x_i and f_v represents the function that extracts the image feature of any image;

s52: for each training image x_i, repeatedly constructing M sets of size K to obtain the M sets N_1, N_2, …, N_M, where each set N_m includes K samples selected from the minibatch; the K samples include one positive example and (K − 1) negative examples, the one positive example being the retrieval target feature ψ_i corresponding to x_i and the (K − 1) negative examples being retrieval target features ψ_j with j ≠ i; M is less than or equal to B, and M is less than or equal to K;

s53: constructing the loss function L according to equation (5):

L = (1 / (M·B)) · Σ_{i=1..B} Σ_{m=1..M} −log[ exp(κ(φ_i, ψ_i)) / Σ_{ψ_j ∈ N_m} exp(κ(φ_i, ψ_j)) ]    (5)

where κ represents a similarity kernel function, with κ(a, b) representing the distance between the two data-point vectors a and b; φ_i and ψ_j respectively represent the initial fusion feature of the i-th training image and the retrieval target feature of a sample in N_m, and κ(φ_i, ψ_i) is calculated under the condition of the set N_m; the fraction inside the logarithm is a softmax function, used to characterize the proportion of one converted result in the sum of all converted results;

s54: using L to learn the weights of the residual feature and the gate feature in the fusion feature, obtaining the final weights.
6. The image retrieval method according to claim 5, wherein, in S60, the step of taking the final fused feature as a feature to be retrieved, calculating similarity between the feature to be retrieved and retrieval features of a plurality of images in a retrieval database, and returning an image meeting retrieval requirements in the plurality of images comprises:
s61: denoting the final fusion feature of the image to be retrieved x as φ = g(x, t), where t represents the text corresponding to x and g represents the function that obtains the final fusion feature; taking φ as the feature to be retrieved, and calculating, according to equation (6), the distance D_r between φ and the retrieval feature ψ_r of each image in the retrieval database:

D_r = κ(φ, ψ_r)    (6)

where the number of images in the retrieval database is denoted R, r = 1, 2, …, R;

s62: taking the k distances with the lowest values among D_1, D_2, …, D_R, where the retrieval features corresponding to these k distances are ψ_{r_1}, ψ_{r_2}, …, ψ_{r_k}; returning the images corresponding to ψ_{r_1}, ψ_{r_2}, …, ψ_{r_k} in the retrieval database as the images meeting the retrieval requirement.
7. The image retrieval method according to claim 1,
the VGGNet network model is a VGGNet-16 network model; or
The Word2vec feature is obtained through a Skip-Gram model; or
The TF-IDF features are obtained through the sklearn library in Python.
8. An image retrieval apparatus, comprising:
the data acquisition module is used for acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
the image feature extraction module is used for extracting the image features of the image to be retrieved by utilizing a VGGNet network model;
the text feature extraction module is used for extracting Word vector Word2vec features and Word frequency-inverse text frequency TF-IDF features of the text, and performing deep series connection on the Word2vec features and the TF-IDF features to obtain text features of the image to be retrieved;
the feature fusion module is used for fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual error characteristics and the gate characteristics according to weight to obtain fusion characteristics of the image to be retrieved;
the weight learning module is used for acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; learning the weights of the residual error features and the gate features in the fusion features by using the fusion features of the training images and the respective retrieval target features through a metric learning method to obtain final weights;
and the image retrieval module is configured to linearly combine the residual features and the gate features of the image to be retrieved according to the final weight to obtain final fusion features of the image to be retrieved, use the final fusion features as the features to be retrieved, calculate similarity between the features to be retrieved and retrieval features of a plurality of images in a retrieval database, and return the images meeting retrieval requirements in the plurality of images.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the image retrieval method according to any one of claims 1 to 7 when executing the program.
10. A storage medium on which a computer-readable program is stored, characterized in that the program, when executed, implements an image retrieval method as recited in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110841488.0A | 2021-07-26 | 2021-07-26 | Image retrieval method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113297410A (en) | 2021-08-24 |
Family
ID=77330973
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110841488.0A (Pending) | 2021-07-26 | 2021-07-26 | Image retrieval method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113297410A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105095964A (en) * | 2015-08-17 | 2015-11-25 | 杭州朗和科技有限公司 | Data processing method and device |
US20200285811A1 (en) * | 2018-02-05 | 2020-09-10 | Alibaba Group Holding Limited | Methods, apparatuses, and devices for generating word vectors |
CN109766465A (en) * | 2018-12-26 | 2019-05-17 | 中国矿业大学 | A kind of picture and text fusion book recommendation method based on machine learning |
US20210012199A1 (en) * | 2019-07-04 | 2021-01-14 | Zhejiang University | Address information feature extraction method based on deep neural network model |
CN112231442A (en) * | 2020-10-15 | 2021-01-15 | 北京临近空间飞行器系统工程研究所 | Sensitive word filtering method and device |
CN112364204A (en) * | 2020-11-12 | 2021-02-12 | 北京达佳互联信息技术有限公司 | Video searching method and device, computer equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
YUAN, Xu et al.: "Judgment document recommendation method based on multimodal feature fusion", Microelectronics & Computer *
LI, Chaoyue: "Research and application of cross-modal retrieval methods based on feature fusion", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113901177A (en) * | 2021-10-27 | 2022-01-07 | 电子科技大学 | Code searching method based on multi-mode attribute decision |
CN113901177B (en) * | 2021-10-27 | 2023-08-08 | 电子科技大学 | Code searching method based on multi-mode attribute decision |
CN113989792A (en) * | 2021-10-29 | 2022-01-28 | 天津大学 | Cultural relic recommendation algorithm based on fusion features |
CN114880517A (en) * | 2022-05-27 | 2022-08-09 | 支付宝(杭州)信息技术有限公司 | Method and device for video retrieval |
CN115269882A (en) * | 2022-09-28 | 2022-11-01 | 山东鼹鼠人才知果数据科技有限公司 | Intellectual property retrieval system and method based on semantic understanding |
CN115269882B (en) * | 2022-09-28 | 2022-12-30 | 山东鼹鼠人才知果数据科技有限公司 | Intellectual property retrieval system and method based on semantic understanding |
CN115905608A (en) * | 2022-11-15 | 2023-04-04 | 腾讯科技(深圳)有限公司 | Image feature acquisition method and device, computer equipment and storage medium |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210824 |