CN113297410A - Image retrieval method and device, computer equipment and storage medium - Google Patents

Image retrieval method and device, computer equipment and storage medium

Info

Publication number
CN113297410A
Authority
CN
China
Prior art keywords
image
features
retrieved
retrieval
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110841488.0A
Other languages
Chinese (zh)
Inventor
丁冬睿
姚丽
杨光远
逯天斌
房体品
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Zhongju Artificial Intelligence Technology Co ltd
Original Assignee
Guangdong Zhongju Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Zhongju Artificial Intelligence Technology Co ltd filed Critical Guangdong Zhongju Artificial Intelligence Technology Co ltd
Priority to CN202110841488.0A
Publication of CN113297410A
Legal status: Pending

Classifications

    • G06F 16/583 — Information retrieval of still image data; retrieval characterised by metadata automatically derived from the content
    • G06F 16/5866 — Information retrieval of still image data; retrieval characterised by manually generated metadata, e.g. tags, keywords, comments, manually generated location and time information
    • G06F 18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N 3/045 — Neural networks; combinations of networks


Abstract

The invention discloses an image retrieval method, an image retrieval device, computer equipment and a storage medium. The method comprises the following steps: acquiring an image to be retrieved and its corresponding text; extracting image features with a VGGNet network model; extracting the Word2vec feature and the TF-IDF feature of the text and performing deep concatenation to obtain the text feature; fusing the image features and the text features to construct residual features and gate features, and linearly combining them by weight to obtain the fused feature; learning the weights by a metric learning method to obtain the final weights; and taking the final fused feature of the image to be retrieved as the feature to be retrieved, calculating the similarity between it and the retrieval features of the images in the retrieval database, and returning the images that meet the retrieval requirements. Based on data in the two modalities of image and text, the method fuses information from data of different modalities and completes the retrieval task with the fused information, thereby improving retrieval performance.

Description

Image retrieval method and device, computer equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of image retrieval, in particular to an image retrieval method, an image retrieval device, computer equipment and a storage medium.
Background
In the Internet era, with the rise of various social networks, different types of information such as text, pictures, audio and video have grown on a large scale. Data in different modalities can describe the same object or event from different angles, so that people understand it more completely. How to use data of different modalities to accomplish specific tasks in particular scenarios has therefore become a research hotspot. As multi-modal data grow, it becomes increasingly difficult for an ordinary user to retrieve the required information accurately and efficiently. The multi-modal data in image retrieval include the textual descriptions and the image representations of the images.
Image retrieval technology mainly falls into two types: Text-Based Image Retrieval (TBIR) and Content-Based Image Retrieval (CBIR). TBIR relies mainly on the annotation information of images, but faced with image data sets of tens of thousands of images, manual annotation is too expensive, so this retrieval scheme cannot meet the needs of practical applications. CBIR mainly uses feature extraction and high-dimensional indexing techniques, but because the visual information a computer obtains from an image may not be consistent with the semantic information the user understands from it, a gap arises between low-level features and high-level retrieval requirements, i.e., the "semantic gap". In CBIR, because of the semantic gap, images with similar features may well be semantically irrelevant, which in many cases makes it difficult for content-based retrieval results to meet the user's information needs.
Disclosure of Invention
The invention provides an image retrieval method, an image retrieval device, computer equipment and a storage medium, which are used for solving the problems in the prior art.
In a first aspect, an embodiment of the present invention provides an image retrieval method. The method comprises the following steps:
S10: acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
S20: extracting image features of the image to be retrieved by using a VGGNet network model;
S30: extracting the Word Vector (Word to Vector, abbreviated as Word2vec) feature and the Term Frequency-Inverse Document Frequency (TF-IDF) feature of the text, and performing deep concatenation on the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved;
S40: fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual features and the gate features by weight to obtain the fused feature of the image to be retrieved;
S50: acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; learning the weights of the residual features and the gate features in the fused feature by a metric learning method, using the fused features of the training images and their respective retrieval target features, to obtain the final weights;
S60: linearly combining the residual features and the gate features of the image to be retrieved according to the final weights to obtain the final fused feature of the image to be retrieved, taking the final fused feature as the feature to be retrieved, calculating the similarity between the feature to be retrieved and the retrieval features of the plurality of images in the retrieval database, and returning the images among them that meet the retrieval requirements.
In an embodiment, the parameter configuration of the VGGNet network model comprises the following steps:
S11: pre-training the VGGNet network model with the ImageNet data set to obtain pre-trained network parameters;
S12: resizing all images in the target data set of the VGGNet network model to 256 × 256, and randomly selecting a 227 × 227 crop of the image content (or its mirror) as the input of the VGGNet network model;
S13: modifying the number of neurons of the last fully-connected layer of the VGGNet network model from the number of image categories in the ImageNet data set to the number c of image categories in the target data set;
S14: performing a Softmax operation of dimensionality c on the output of the last fully-connected layer to obtain the probability distribution of the image to be retrieved over the c image categories, as illustrated in the sketch below.
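A rough sketch of steps S11-S14, assuming PyTorch/torchvision (the patent does not name a framework, so the library calls, the normalization values and all variable names are assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

def build_vgg_classifier(num_classes: int) -> nn.Module:
    """S11/S13: load an ImageNet-pretrained VGG-16 and replace the last
    fully-connected layer (1000 ImageNet classes -> c target classes)."""
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    in_features = model.classifier[6].in_features            # 4096
    model.classifier[6] = nn.Linear(in_features, num_classes)
    return model

# S12: resize every image to 256 x 256, then take a random 227 x 227 crop
# with random horizontal mirroring as the network input.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(227),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# S14: a c-dimensional Softmax over the last layer's output gives the
# probability distribution of the image over the c categories.
model = build_vgg_classifier(num_classes=20)   # c = 20 is an arbitrary example
probs = torch.softmax(model(torch.randn(1, 3, 227, 227)), dim=1)
```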
In an embodiment, in S30, the deep concatenation of the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved includes:
S31: denoting the Word2vec feature as v_w = (w_1, w_2, …, w_N), wherein each w_i is a real number and N denotes the dimension of the Word2vec feature; and denoting the TF-IDF feature as v_f = (f_1, f_2, …, f_T), wherein each f_j is a real number and T denotes the dimension of the TF-IDF feature;
S32: splicing v_w and v_f to obtain the spliced feature v_c = (w_1, …, w_N, f_1, …, f_T);
S33: inputting v_c into a deep neural network, learning the high-order fusion of v_w and v_f through the deep neural network, and obtaining the text feature t_s of the image to be retrieved, wherein the dimension of t_s is smaller than the dimension of v_c, as illustrated in the sketch below.
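A minimal sketch of the deep concatenation in S31-S33, assuming PyTorch; the hidden width and the 150/500/256 dimensions (the example values given later in the description) are illustrative, and the exact layer structure of the deep network is not specified by the patent:

```python
import torch
import torch.nn as nn

class TextFeatureNet(nn.Module):
    """S32: concatenate the Word2vec vector (dimension N) and the TF-IDF vector
    (dimension T); S33: a small deep network maps the (N+T)-dimensional
    concatenation to a lower-dimensional high-order text feature."""
    def __init__(self, n_w2v: int = 150, n_tfidf: int = 500, out_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_w2v + n_tfidf, 512),
            nn.ReLU(),
            nn.Linear(512, out_dim),      # out_dim < N + T, as required by S33
        )

    def forward(self, w2v: torch.Tensor, tfidf: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([w2v, tfidf], dim=-1))

# Usage: a 150-d Word2vec vector and a 500-d TF-IDF vector give a 256-d text feature.
text_feat = TextFeatureNet()(torch.randn(1, 150), torch.randn(1, 500))
```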
In one embodiment, S40 includes:
S41: transforming the text feature t_s with a convolution filter W_t according to equation (1), so that the transformed text feature t_s' and the image feature φ_x have the same dimensions:

t_s' = W_t * t_s    (1)

wherein * denotes the standard normalized convolution;
S42: constructing the residual feature f_res according to equation (2):

f_res = ReLU(W_t * t_s)    (2)

wherein ReLU(·) denotes the ReLU activation function;
S43: constructing the gate feature f_gate according to equation (3):

f_gate = σ(W_g1 * t_s') ⊙ (W_g2 * φ_x)    (3)

wherein σ(·) is the sigmoid function, W_g1 and W_g2 denote two convolution filters, and ⊙ denotes element-wise multiplication of same-position elements;
S44: linearly combining f_res and f_gate according to their respective weights according to equation (4) to obtain the fused feature φ_xt of the image to be retrieved:

φ_xt = w_r · f_res + w_g · f_gate    (4)

wherein w_r and w_g denote learnable weight values used to balance the proportions of f_res and f_gate in φ_xt (see the sketch below).
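One possible reading of S41-S44 as a PyTorch module; the 3 × 3 filter sizes, the exact form of the gate feature and every identifier are assumptions reconstructed from the description rather than the patent's own code:

```python
import torch
import torch.nn as nn

class ResidualGateFusion(nn.Module):
    """Sketch of S41-S44: build a residual feature and a gate feature from the
    image feature map and the text feature, then combine them with two
    learnable scalar weights."""
    def __init__(self, text_dim: int, img_channels: int):
        super().__init__()
        # S41 / eq. (1): convolution that maps the tiled text feature to the
        # same shape as the image feature map.
        self.w_t  = nn.Conv2d(text_dim, img_channels, kernel_size=3, padding=1)
        # S43 / eq. (3): two further filters used to build the gate feature.
        self.w_g1 = nn.Conv2d(img_channels, img_channels, kernel_size=3, padding=1)
        self.w_g2 = nn.Conv2d(img_channels, img_channels, kernel_size=3, padding=1)
        # S44 / eq. (4): learnable weights balancing residual and gate features.
        self.w_r = nn.Parameter(torch.tensor(0.5))
        self.w_g = nn.Parameter(torch.tensor(0.5))

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = img_feat.shape
        t = text_feat[:, :, None, None].expand(-1, -1, h, w)  # tile text over H, W
        t_prime = self.w_t(t)                                   # eq. (1)
        f_res   = torch.relu(t_prime)                           # eq. (2)
        f_gate  = torch.sigmoid(self.w_g1(t_prime)) * self.w_g2(img_feat)  # eq. (3), assumed form
        return self.w_r * f_res + self.w_g * f_gate             # eq. (4)

# Usage: fuse a 512-channel VGG feature map with a 256-d text feature.
fused = ResidualGateFusion(text_dim=256, img_channels=512)(
    torch.randn(2, 512, 7, 7), torch.randn(2, 256))
```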
In one embodiment, in S50, the learning, by the metric learning method, of the weights of the residual feature and the gate feature in the fused feature, using the fused features of the training images and their respective retrieval target features, to obtain the final weights includes:
S51: setting the size of the minibatch used when searching for the minimum loss value with the gradient descent algorithm during training to B, wherein the minibatch contains, for each training image x_i, its initial fused feature f_i and the corresponding retrieval target feature ψ_i; denoting f_i = f_fuse(x_i, t_i), wherein t_i denotes the text corresponding to x_i and f_fuse(·) denotes the function that obtains the initial fused feature of x_i, i = 1, 2, …, B; and denoting ψ_i = f_img(y_i), wherein y_i denotes the retrieval target image corresponding to x_i and f_img(·) denotes the function that obtains the image feature of any image;
S52: for each training image x_i, repeatedly constructing M sets N_m of size K to obtain the M sets {N_1, N_2, …, N_M}, wherein each N_m contains K samples selected from the minibatch, the K samples including one positive example and (K−1) negative examples, the positive example being the retrieval target feature ψ_i and the (K−1) negative examples being retrieval target features ψ_j with j ≠ i; M is less than or equal to B and M is less than or equal to K;
S53: constructing the Softmax cross-entropy loss function L according to equation (5):

L = −(1/(M·B)) Σ_{i=1}^{B} Σ_{m=1}^{M} log [ exp(κ(f_i, ψ_i)) / Σ_{ψ_j ∈ N_m} exp(κ(f_i, ψ_j)) ]    (5)

wherein κ(·, ·) denotes the similarity kernel function, and κ(p, q) expresses the distance between two data point vectors p and q; f_i and ψ_i respectively denote the initial fused feature and the retrieval target feature corresponding to sample x_i in N_m; the term inside the logarithm is computed over the members of N_m and is the softmax function, which characterizes the percentage of each converted result in the sum of all converted results;
S54: learning the weights of the residual feature and the gate feature in the fused feature by using L, to obtain the final weights (see the sketch below).
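A rough sketch of the S51-S54 loss, assuming the similarity kernel is the negative Euclidean distance mentioned later in the description; the in-batch negative sampling and all names are illustrative simplifications rather than the patent's exact procedure:

```python
import torch
import torch.nn.functional as F

def softmax_metric_loss(fused: torch.Tensor, targets: torch.Tensor,
                        num_sets: int = 3, set_size: int = 4) -> torch.Tensor:
    """Eq. (5), sketched: for each query i, build num_sets (M) sets of
    set_size (K) candidates -- the true target plus K-1 negatives drawn from
    the minibatch -- and apply softmax cross-entropy over the kernel values."""
    batch = fused.size(0)                     # B
    sims = -torch.cdist(fused, targets)       # kappa(f_i, psi_j) = -||f_i - psi_j||
    losses = []
    for i in range(batch):
        negatives = [j for j in range(batch) if j != i]
        for _ in range(num_sets):
            perm = torch.randperm(len(negatives))[: set_size - 1].tolist()
            idx = torch.tensor([i] + [negatives[p] for p in perm])
            logits = sims[i, idx].unsqueeze(0)          # positive sits at index 0
            losses.append(F.cross_entropy(logits, torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

# Usage: fused query features and their retrieval-target features (toy data).
fused = torch.randn(8, 512, requires_grad=True)
loss = softmax_metric_loss(fused, torch.randn(8, 512))
loss.backward()
```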
In an embodiment, in S60, taking the final fused feature as the feature to be retrieved, calculating the similarity between the feature to be retrieved and the retrieval features of a plurality of images in the retrieval database, and returning the images among them that meet the retrieval requirement includes:
S61: denoting the final fused feature of the image to be retrieved q as f_q = f_final(q, t), wherein t denotes the text corresponding to q and f_final(·) denotes the function that obtains the final fused feature of q; taking f_q as the feature to be retrieved, and calculating, according to equation (6), the distance D_r between f_q and the retrieval feature ψ_r of each image r in the retrieval database:

D_r = d(f_q, ψ_r)    (6)

wherein the number of images in the retrieval database is denoted R, r = 1, 2, …, R;
S62: taking the k smallest of the distances D_1, D_2, …, D_R, wherein the picture retrieval features corresponding to these k distances are {s_1, s_2, …, s_k}; and returning the images in the retrieval database corresponding to {s_1, s_2, …, s_k} as the images meeting the retrieval requirement.
In one embodiment, the VGGNet network model is a VGGNet-16 network model; or the Word2vec features are obtained through a Skip-Gram model; or the TF-IDF features are obtained through the sklearn library in Python.
In a second aspect, an embodiment of the present invention further provides an image retrieval apparatus. The device includes:
the data acquisition module is used for acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
the image feature extraction module is used for extracting the image features of the image to be retrieved by utilizing a VGGNet network model;
the text feature extraction module is used for extracting Word2vec features and TF-IDF features of the text, and performing deep concatenation on the Word2vec features and the TF-IDF features to obtain text features of the image to be retrieved;
the feature fusion module is used for fusing the image features and the text features to construct the residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; and for linearly combining the residual features and the gate features by weight to obtain the fused feature of the image to be retrieved;
the weight learning module is used for acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; and for learning the weights of the residual features and the gate features in the fused feature by a metric learning method, using the fused features of the training images and their respective retrieval target features, to obtain the final weights;
and the image retrieval module is configured to linearly combine the residual features and the gate features of the image to be retrieved according to the final weight to obtain final fusion features of the image to be retrieved, use the final fusion features as the features to be retrieved, calculate similarity between the features to be retrieved and retrieval features of a plurality of images in a retrieval database, and return the images meeting retrieval requirements in the plurality of images.
In a third aspect, an embodiment of the present invention further provides a computer device. The device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor executes the program, the image retrieval method provided by any embodiment of the invention is realized.
In a fourth aspect, the embodiment of the present invention further provides a storage medium, on which a computer-readable program is stored, where the program, when executed, implements the image retrieval method provided by any embodiment of the present invention.
The invention can realize the following beneficial effects:
1. the embodiment of the invention takes the two different kinds of modal data, text and image, as its entry point, constructs a multi-modal fused feature carrying more comprehensive information through the residual feature and the gate feature, and fully explores the relevance between the low-level features and the high-level semantics of the text data and the image data; the retrieval task is completed through the multi-modal fused feature, which improves the recall ratio and precision ratio of image retrieval as well as the retrieval efficiency;
2. the embodiment of the invention uses metric learning techniques to complete the construction of the data samples, the training of the loss function and the learning of the weight values of the image features and the text features, refining the optimization of the fused feature, so that after the text features and the image features are fused, the features constructed from the data to be retrieved are consistent with the features of the retrieval target image in spatial structure and similar to them in semantic expression;
3. the embodiment of the invention provides a deep fusion model for obtaining the text features: the concatenation of the Word2vec feature vector and the TF-IDF feature vector is used as the input from which a high-order fusion feature is learned and taken as the final feature of the text data, which avoids the dimensionality-disaster problem that arises when the concatenated features are used directly as the text features and the feature dimensionality increases sharply;
4. the embodiment of the invention adopts the VGGNet-16 network model as the processing unit for the image data and fine-tunes the pre-trained parameters according to the characteristics of the target data set, so that the VGGNet-16 network model matches the target data set better, improving the accuracy and efficiency of image feature extraction;
5. the embodiment of the invention combines the image feature and the text feature through element-wise multiplication of same-position elements, and, on the basis of completing the structural matching between the text feature and the image feature, incorporates the element information contained in the image feature to the greatest extent.
Drawings
Fig. 1 is a flowchart of an image retrieval method according to an embodiment of the present invention.
Fig. 2 is a flowchart of another image retrieval method according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an image retrieval apparatus according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples. It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
Fig. 1 is a flowchart of an image retrieval method according to an embodiment of the present invention. The method realizes information fusion of data in different modes based on data in two modes of images and texts, and completes retrieval tasks by using the fused information, thereby improving the retrieval performance. The method includes steps S10-S60.
S10: acquiring an image to be retrieved and a text corresponding to the image to be retrieved.
S20: extracting the image features of the image to be retrieved by using a VGGNet network model.
S30: extracting the Word2vec feature and the TF-IDF feature of the text, and performing deep concatenation on the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved.
S40: fusing the image features and the text features to construct the residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; and linearly combining the residual features and the gate features by weight to obtain the fused feature of the image to be retrieved.
S50: acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; and learning the weights of the residual features and the gate features in the fused feature by a metric learning method, using the fused features of the training images and their respective retrieval target features, to obtain the final weights.
S60: linearly combining the residual features and the gate features of the image to be retrieved according to the final weights to obtain the final fused feature of the image to be retrieved, taking the final fused feature as the feature to be retrieved, calculating the similarity between the feature to be retrieved and the retrieval features of the plurality of images in the retrieval database, and returning the images among them that meet the retrieval requirements.
Fig. 2 is a flowchart of another image retrieval method according to an embodiment of the present invention, which shows the basic framework of the image retrieval method of the invention more concisely. First, the image to be retrieved and its text are acquired, and the features of the data in different modalities are extracted with separate, independent networks: for example, the image features of the image-modality data are extracted with a deep convolutional neural network, and the text features of the text-modality data are extracted with a pre-trained language network model. Then the text features and the image features are fused, and metric learning is used during training to balance the proportions of the text features and the image features in the fused feature, realizing an organic combination of the different modal data; the final fused feature is obtained through learning and training. Finally, the final fused feature is used as the feature to be retrieved, the similarity measure against the features already in the database is computed, the set of similar features in the database that meets the retrieval requirement is returned, the retrieval result is then returned, and the retrieval task is completed.
In one embodiment, in S20, the extraction of the image features may be implemented by a deep convolutional neural network. As an end-to-end feature extraction method, the deep convolutional neural network has outstanding advantages for image feature extraction, even though its training requires large-scale labeled data. In research directions related to image vision, the universality and extensibility of deep convolutional neural networks are very strong. Traditional feature extraction methods suffer from manual design, a large amount of computation, low speed, poor real-time performance and unfriendliness to small-scale data, among other shortcomings; a deep convolutional neural network extracts features automatically through deep learning and efficiently overcomes these shortcomings. In this embodiment, the VGGNet network model is selected as the processing unit for the image data to obtain the corresponding image features.
In one embodiment, the parameter configuration of the VGGNet network model includes steps S11-S14.
S11: and pre-training the VGGNet network model by using an ImageNet data set to obtain pre-training network parameters.
S12: all images in the target data set of the VGGNet network model are resized to a size of 256 × 256, and an image content image with a size of 227 × 227 is randomly selected as an input to the VGGNet network model.
S13: and modifying the number of the neurons of the last fully-connected layer of the VGGNet network model from the number of the image categories in the ImageNet data set to the number c of the image categories in the numbered data set.
S14: and performing Softmax operation with the dimensionality of c on the output of the last full connecting layer to obtain probability distribution results of the image to be retrieved in the c image categories.
In one embodiment, the VGGNet network model is a VGGNet-16 network model. In the parameter configuration of the VGGNet-16 network model, pre-training is first performed to obtain relatively stable network parameters, and these parameters are then fine-tuned to better meet the requirements of the target data set. The fine-tuning process includes three steps.
1) All images in the target data set are resized to 256 × 256, and in the fine-tuning operation a 227 × 227 crop of the image content (or its mirror) is randomly selected as the network input.
Through step 1), on the one hand the amount of computation can be reduced by using a smaller convolution kernel; on the other hand, the smaller input keeps the model complexity from becoming too high, which reduces the risk of overfitting. Optionally, both the 256 × 256 and the 227 × 227 parameters may be changed to suit the model; 256, i.e. 2 to the 8th power, is typically used and is simple to work with. "Image content mirroring" is understood here as a data augmentation means that expands the data set by mirroring the original images. On the one hand, because the convolution kernels (e.g. 3 × 3) sweep the image in left-to-right order, mirroring the original image is useful; on the other hand, if the trained model should be invariant to left-right orientation, the new mirrored data belong to the same category as the originals, which increases the robustness of the network.
2) The number of neurons of the last fully-connected layer in the network model is modified from the original 1000 to c, where c is the number of image categories in the target data set and 1000 refers to the 1000 categories of the ImageNet data set.
Through step 2), the number of neurons of the last fully-connected layer is adapted so that it better suits the target data set.
3) A Softmax operation of dimension c is performed on the output of the last layer to obtain the probability distribution of the picture content over the c categories. A Euclidean loss function is employed.
The specific settings in the fine-tuning are further described below.
Throughout the fine-tuning process, VGGNet-16 is first pre-trained on the ImageNet data set, and the pre-trained parameters are used to assign the parameters of the first 7 layers of VGGNet-16. The final fully-connected layer is assigned its parameters through fine-tuning. In this embodiment, the parameters of the final fully-connected layer are randomly initialized with a Gaussian distribution N(μ, σ²). μ is the location parameter of the Gaussian distribution and describes its central tendency; σ describes the dispersion of the distribution: the larger σ is, the more dispersed the data are, and the smaller σ is, the more concentrated. Both μ and σ can be set flexibly according to requirements.
During fine-tuning, different learning rates are set for the earlier and later levels of VGGNet-16. The earlier convolutional layers mainly extract the low-level feature representation of the image data, which is largely consistent with the parameter settings of the model pre-trained on the ImageNet data set, so their learning rate is set to a lower value of 0.001. For the last three fully-connected layers of VGGNet-16, in order to make the network model converge on the target data set as quickly as possible and reach the corresponding optimal solution, the learning rates of the first two fully-connected layers are preset to 0.002 and that of the last fully-connected layer to 0.01, which are relatively high learning rates. Because of the different learning rates, the update rates of the earlier and later layers also differ accordingly. With this fine-tuning scheme, the network model can fit the target data set as quickly as possible, improving optimization efficiency and effect, without destroying the relatively stable parameters obtained in the pre-training stage. After fine-tuning is finished, the network model with its complete parameters is used to extract the image features corresponding to the image data; a configuration sketch follows.
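The fine-tuning settings described above might be written as follows, again assuming PyTorch/torchvision; the Gaussian parameters, the class count and the choice of SGD are illustrative assumptions, while the 0.001/0.002/0.01 learning rates follow the text:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 20)        # c = 20 target classes (example)

# Random Gaussian initialization N(mu, sigma^2) of the new last FC layer.
nn.init.normal_(model.classifier[6].weight, mean=0.0, std=0.01)
nn.init.zeros_(model.classifier[6].bias)

# Layer-wise learning rates: low for the pretrained convolutional stack,
# higher for the fully-connected layers so they converge quickly on the target set.
optimizer = torch.optim.SGD([
    {"params": model.features.parameters(),       "lr": 0.001},
    {"params": model.classifier[0].parameters(),  "lr": 0.002},
    {"params": model.classifier[3].parameters(),  "lr": 0.002},
    {"params": model.classifier[6].parameters(),  "lr": 0.01},
], momentum=0.9)
```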
In an embodiment, in S30, the deep concatenation of the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved includes steps S31-S33.
S31: denote the Word2vec feature as v_w = (w_1, w_2, …, w_N), where each w_i is a real number and N is the dimension of the Word2vec feature; denote the TF-IDF feature as v_f = (f_1, f_2, …, f_T), where each f_j is a real number and T is the dimension of the TF-IDF feature.
S32: splice v_w and v_f to obtain the spliced feature v_c = (w_1, …, w_N, f_1, …, f_T).
S33: input v_c into a deep neural network, which learns the high-order fusion of v_w and v_f and outputs the text feature t_s of the image to be retrieved, where the dimension of t_s is smaller than the dimension of v_c.
In one embodiment, the Word2vec feature may be obtained with the Skip-Gram model, and the TF-IDF feature may be obtained with the sklearn library in Python.
The extraction of text features processes the raw text data to obtain vector representations that can be used subsequently, i.e., the text features. In this embodiment, the text features are obtained by extracting the Word2vec feature and the TF-IDF feature of the text and performing deep concatenation on them. It should be noted that this embodiment does not simply concatenate the Word2vec feature and the TF-IDF feature to obtain the text feature; instead, the two features are fused by a deep neural network trained with a categorical cross-entropy loss, so as to learn a high-order fusion feature in which the semantics of the text are captured more accurately.
The Word2vec feature and the TF-IDF feature each have advantages. The Word2vec feature represents the semantic information of words as vectors (word vectors) learned from a large corpus, so that semantically similar words lie close to each other in the embedding space. The Word2vec feature takes context information into account and therefore performs better than earlier embedding methods; at the same time, it has fewer dimensions, so processing is faster. The TF-IDF feature is a term-frequency-based feature weighting scheme widely used in text mining; its main idea is that if a word or phrase appears frequently in one article but rarely in others, it has good discriminative power between categories and is suitable for classification. TF-IDF features are simple and fast to compute.
Optionally, the Skip-Gram model is used to obtain the Word2vec feature. First, vector representations of all words in the pre-processed text data are obtained with the Skip-Gram model. Then all word vectors belonging to the same text are averaged, and the average is taken as the Word2vec feature vector of that text. "Same text" is understood here as "same sentence", for example a sentence describing an image. The Word2vec feature vector of a text can be expressed as v_w = (w_1, w_2, …, w_150), where each w_i is a real number and 150 indicates that the word vector is 150-dimensional. The dimensionality of the word vector can be set as required: a larger corpus can use a higher dimensionality, and a specific-domain corpus can use a lower one.
Optionally, the sklearn library in Python is used to extract the TF-IDF feature vector of the text. In sklearn, the CountVectorizer class only considers the frequency of occurrence of each term, while the TfidfVectorizer class additionally computes the inverse of the number of texts that contain the term. Therefore, in this embodiment, the TfidfVectorizer class of the sklearn library is used to convert each text into a TF-IDF feature vector. A text can then be represented as a 500-dimensional TF-IDF feature vector v_f = (f_1, f_2, …, f_500). The dimensionality of the TF-IDF feature vector can likewise be set as required: a larger corpus can use a higher dimensionality, and a specific-domain corpus can use a lower one.
The two feature vectors are then spliced, with the TF-IDF feature appended after the Word2vec feature, giving the spliced feature vector v_c = (w_1, …, w_150, f_1, …, f_500). The spliced feature vector is input into a deep neural network that learns the high-order fusion feature, finally yielding a 256-dimensional text feature vector t_s (see the sketch below).
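A small sketch of the text-side extraction; gensim is an assumed choice for the Skip-Gram Word2vec model (the description names only the Skip-Gram model and the sklearn TfidfVectorizer class), the toy corpus is invented, and the 150/500 dimensions follow the example values above:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["a dog runs on the grass", "a red car parked on the street"]  # toy texts
tokenized = [s.split() for s in corpus]

# Skip-Gram Word2vec (sg=1) with 150-dimensional word vectors; a text's feature
# is the mean of the vectors of its words.
w2v = Word2Vec(tokenized, vector_size=150, sg=1, min_count=1, window=5)
def w2v_feature(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(150)

# TF-IDF features via sklearn's TfidfVectorizer, capped at 500 dimensions.
tfidf = TfidfVectorizer(max_features=500)
tfidf_matrix = tfidf.fit_transform(corpus).toarray()

# Splice the TF-IDF vector after the Word2vec vector; the result feeds the
# deep fusion network of S33.
text_vec = np.concatenate([w2v_feature(tokenized[0]), tfidf_matrix[0]])
```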
In one embodiment, S40 includes steps S41-S44. Step S40 spatially unifies the obtained text features and image features.
S41: transform the text feature t_s with a convolution filter W_t according to equation (1), so that the transformed text feature t_s' has the same dimensions as the image feature φ_x:

t_s' = W_t * t_s    (1)

where * denotes the standard normalized convolution.
Optionally, a convolution filter of size 3 × 3 is used to extend the text feature along the height and width dimensions of the underlying image feature, to a size that matches the underlying image feature.
In equation (1), φ_x denotes the extracted image feature, t_s denotes the extracted text feature, W_t denotes the 3 × 3 convolution filter, and * denotes the standard normalized convolution. The extension is completed through the structural transformation in equation (1), and t_s' is the extended text feature.
S42: constructing the residual features according to equation (2)
Figure 322704DEST_PATH_IMAGE015
Figure 515788DEST_PATH_IMAGE081
(2)
Wherein the content of the first and second substances,
Figure 654645DEST_PATH_IMAGE082
indicating the ReLU activation function.
In the formula (1), the first and second groups,
Figure 238073DEST_PATH_IMAGE083
a size matching the underlying image features is achieved. In the formula (2), the first and second groups,
Figure 550106DEST_PATH_IMAGE084
obtaining final text characteristics through ReLU activation function
Figure 535379DEST_PATH_IMAGE085
Figure 845138DEST_PATH_IMAGE086
More characteristic elements in the text data are combined, and effective conversion of the bottom text characteristics is achieved.
S43: constructing the door signature according to equation (3)
Figure 978179DEST_PATH_IMAGE018
Figure 31585DEST_PATH_IMAGE087
(3)
Wherein the content of the first and second substances,
Figure 871365DEST_PATH_IMAGE088
in order to be a sigmoid function,
Figure 414342DEST_PATH_IMAGE089
and
Figure 706783DEST_PATH_IMAGE022
two convolution filters are shown as being present in the convolution filter,
Figure 563881DEST_PATH_IMAGE090
indicating the calculation method of the corresponding multiplication of the parity elements.
Alternatively, in the formula (3),
Figure 320484DEST_PATH_IMAGE021
and
Figure 972046DEST_PATH_IMAGE022
indicating two sizes are largeA convolution filter as small as 3 x 3; and the combination of the bottom layer image characteristic and the bottom layer text characteristic is realized by utilizing a corresponding multiplication mode of the same-position elements. Obtained
Figure 17362DEST_PATH_IMAGE018
On the basis of completing the structure matching of the bottom text features and the bottom image features, the element information contained in the image features is combined to the maximum extent.
S44: according to the formula (4), to
Figure 474888DEST_PATH_IMAGE091
And
Figure 23681DEST_PATH_IMAGE092
carrying out linear combination according to respective weight to obtain the fusion characteristics of the image to be retrieved
Figure 111723DEST_PATH_IMAGE024
Figure 441073DEST_PATH_IMAGE093
(4)
Wherein the content of the first and second substances,
Figure 374394DEST_PATH_IMAGE094
and
Figure 43273DEST_PATH_IMAGE095
representing learnable weight values for balancing
Figure 364533DEST_PATH_IMAGE015
And
Figure 853283DEST_PATH_IMAGE018
in that
Figure 855874DEST_PATH_IMAGE096
Specific gravity of (1).
After the construction of the residual error feature and the gate feature is completed, in order to better balance the influence of a certain single mode feature on attribute embedding, the residual error feature and the gate feature are respectively matched with corresponding weight proportions and are linearly combined to complete feature combination.
In addition, the learnable weight value
Figure 441576DEST_PATH_IMAGE026
And
Figure 605841DEST_PATH_IMAGE027
are all one numerical value. Optionally, a learnable weight value
Figure 847467DEST_PATH_IMAGE026
And convolution filter
Figure 450486DEST_PATH_IMAGE021
Can learn the weight values by weight correlation at a plurality of convolution positions
Figure 828378DEST_PATH_IMAGE027
And convolution filter
Figure 429124DEST_PATH_IMAGE022
Is determined by the weight correlation at a plurality of convolution locations. Therefore, it can also be understood as a combination of
Figure 892466DEST_PATH_IMAGE021
And
Figure 299176DEST_PATH_IMAGE022
is learned to realize the pair
Figure 797154DEST_PATH_IMAGE026
And
Figure 303222DEST_PATH_IMAGE027
and (4) learning.
In one embodiment, in S50, learning the weights of the residual feature and the gate feature in the fused feature by the metric learning method, using the fused features of the training images and their respective retrieval target features, to obtain the final weights includes steps S51-S54.
S51: set the size of the minibatch used when searching for the minimum loss value with the gradient descent algorithm during training to B, where the minibatch contains, for each training image x_i, its initial fused feature f_i and the corresponding retrieval target feature ψ_i. Denote f_i = f_fuse(x_i, t_i), where t_i is the text corresponding to x_i and f_fuse(·) is the function that obtains the initial fused feature of x_i, i = 1, 2, …, B. Denote ψ_i = f_img(y_i), where y_i is the retrieval target image corresponding to x_i and f_img(·) is the function that obtains the image feature of any image.
The goal of training is to bring the fused feature closer to its target feature and push it farther from irrelevant features, so a supervised classification loss is adopted. During training, each forward pass produces a loss value between the output and the ground truth; the smaller the loss, the better the model. Optionally, a gradient descent algorithm is used to find the minimum loss value, so that the corresponding learnable parameters can be inferred through back-propagation and the model is optimized. Assume the minibatch size in gradient descent is B, and the initial fused feature of a piece of data to be retrieved is f_i = f_fuse(x_i, t_i), where x_i denotes the image to be retrieved, t_i denotes its corresponding text, f_fuse(·) is the function that obtains the initial fused feature, and i = 1, 2, …, B. The initial weights w_r and w_g are both 0.5, and the corresponding retrieval target image is characterized by ψ_i = f_img(y_i), where y_i denotes the retrieval target image and f_img(·) is the function that obtains image features with VGGNet-16.
S52: for each training image
Figure 440417DEST_PATH_IMAGE115
Repeating the constructionMEach size isKSet of (2)
Figure 348330DEST_PATH_IMAGE116
To obtain theMAn
Figure 483646DEST_PATH_IMAGE117
Set of (2)
Figure 513919DEST_PATH_IMAGE118
Wherein each one
Figure 610050DEST_PATH_IMAGE119
Including one selected from said minimatchKA sample, theKOne sample includes a positive example
Figure 321655DEST_PATH_IMAGE031
And (a)K-1) negative examples, said one positive example being said retrieval target feature
Figure 514739DEST_PATH_IMAGE031
The above-mentioned (A) toK-1) negative examples
Figure 653596DEST_PATH_IMAGE120
MIs less than or equal toBAnd is andMis less thanOr equal toK
Optionally, for each training image, a certain sample is selected from the set minimatch, and a set with the size of K is constructed
Figure 971445DEST_PATH_IMAGE121
The set has a positive example andK-1) negative examples, wherein a positive example is
Figure 549057DEST_PATH_IMAGE031
Negative example is the negative example in the minimatch
Figure 534330DEST_PATH_IMAGE122
. To evaluate the
Figure 844089DEST_PATH_IMAGE042
Repeating the construction of the set M times to obtain a set
Figure 711551DEST_PATH_IMAGE042
Set of (2)
Figure 951908DEST_PATH_IMAGE123
WhereinMIs not greater than the size B of minipatch and the size of the constructed collectionK
S53: constructing a Softmax cross-entropy loss function by using the formula (5)
Figure 916322DEST_PATH_IMAGE047
Figure 396982DEST_PATH_IMAGE124
(5)
Wherein the content of the first and second substances,
Figure 955002DEST_PATH_IMAGE125
a similar kernel function is represented as a function of the kernel,
Figure 632275DEST_PATH_IMAGE126
represents the direction of two data pointsMeasurement of
Figure 592140DEST_PATH_IMAGE051
And
Figure 243702DEST_PATH_IMAGE052
the distance between them;
Figure 85756DEST_PATH_IMAGE053
and
Figure 480965DEST_PATH_IMAGE031
respectively represent
Figure 295337DEST_PATH_IMAGE127
Sample of (1)
Figure 180117DEST_PATH_IMAGE055
The corresponding initial fusion features and the retrieval target features,
Figure 447150DEST_PATH_IMAGE128
is shown in
Figure 646050DEST_PATH_IMAGE057
Under the condition of (1) calculating
Figure 314929DEST_PATH_IMAGE129
Figure 370609DEST_PATH_IMAGE130
The softmax function is expressed to characterize the percentage of post-conversion results over the sum of all post-conversion results.
Optionally, during the calculation process,
Figure 311889DEST_PATH_IMAGE131
is set to mean negative
Figure 314481DEST_PATH_IMAGE132
Distance.
Figure 900183DEST_PATH_IMAGE133
Distance is Euclidean distance when
Figure 64448DEST_PATH_IMAGE051
And
Figure 306073DEST_PATH_IMAGE052
when a certain two data point vector is represented, and n represents the dimension of the vector, it is defined as:
Figure 909093DEST_PATH_IMAGE134
(5-1)
s54: by using
Figure 286985DEST_PATH_IMAGE047
And learning the weights of the residual error feature and the gate feature in the fusion feature to obtain the final weight.
By utilizing the related technology of metric learning, the construction of data samples, the training of loss functions and the learning of the weight values of image features and text features are completed, and the specific optimization of fusion features is perfected. After the text features and the image features are fused, the features constructed by the data to be retrieved are consistent with the features of the retrieval target image in the spatial structure and are similar to the features of the retrieval target image in semantic expression.
It should be noted that, in this embodiment, spatial-structure consistency can be understood as consistency of spatial dimensions, and semantic similarity can be understood as similarity of high-level semantic expression. The "high-level semantics" of an image is a concept opposed to its "low-level features". Low-level features refer to contour, edge, color, texture and shape features; they carry little semantic information but locate targets accurately. High-level semantic features are what the low-level features add up to: for example, extracting the low-level features of a face yields continuous contours, a nose, eyes and so on, while the high-level feature presents itself as "a face". High-level features are rich in semantic information but coarse in target position; deeper features carry higher-level semantic meaning at coarser resolution. The visual features of an image form the visual space, and the semantic information of a category forms the semantic space.
In an embodiment, in S60, taking the final fused feature as the feature to be retrieved, calculating the similarity between the feature to be retrieved and the retrieval features of a plurality of images in the retrieval database, and returning the images among them that meet the retrieval requirement includes steps S61 and S62.
S61: denote the final fused feature of the image to be retrieved q as f_q = f_final(q, t), where t is the text corresponding to q and f_final(·) is the function that obtains the final fused feature of q. Take f_q as the feature to be retrieved, and calculate, according to equation (6), the distance D_r between f_q and the retrieval feature ψ_r of each image r in the retrieval database:

D_r = d(f_q, ψ_r)    (6)

where the number of images in the retrieval database is R and r = 1, 2, …, R.
In the image retrieval process, sorting and outputting the similarity results is the final, very important step. After training and optimization, the fused feature f_q is used as the basis of the similarity check, and its distance to the retrieval feature ψ_r of each existing picture in the database is calculated:

D_r = d(f_q, ψ_r)    (6-1)

The choice of distance function is consistent with the choice of the similarity kernel function in metric learning, and the distance can be expressed as:

d(f_q, ψ_r) = sqrt( Σ_{i=1}^{n} (f_{q,i} − ψ_{r,i})² )    (6-2)

The smaller the distance, the higher the probability that the two feature vectors belong to the same class, i.e., the higher the similarity between the two vectors.
S62: getD 1D 2,…,D RFront of lowest numerical valuekA distance, wherein, the frontkThe picture retrieval characteristics corresponding to the distances are
Figure 795698DEST_PATH_IMAGE069
(ii) a Will be provided with
Figure 122774DEST_PATH_IMAGE069
And returning the corresponding image in the retrieval database as the image meeting the retrieval requirement.
Sorting the output results of S61, and taking the front with the minimum valuekEach vector is used for obtaining a similarity list of the examination results:
Figure 141546DEST_PATH_IMAGE142
(7)
the set Sim represents the set of similar features of the fusion feature to be retrieved after similarity query, and each feature vector in the set
Figure 616390DEST_PATH_IMAGE143
The corresponding original image is the final retrieval result.
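A minimal sketch of S61–S62, assuming the distance d(·,·) is the Euclidean distance and that the fused query feature and the database retrieval features are already available as vectors; the function and variable names (retrieve_top_k, db_features, and so on) are illustrative, not from the original text.

```python
import numpy as np

def retrieve_top_k(query_feature, db_features, k=10):
    """Return the indices and distances of the k database images closest to the query.

    query_feature: (d,) final fused feature of the image/text to be retrieved.
    db_features:   (R, d) retrieval features of the R images in the database.
    """
    # Formula (6): D_r = d(query, db_r); the Euclidean distance is assumed here.
    distances = np.linalg.norm(db_features - query_feature[None, :], axis=1)
    # S62 / formula (7): keep the k smallest distances.
    top_k_idx = np.argsort(distances)[:k]
    return top_k_idx, distances[top_k_idx]

# Usage with random stand-in features.
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 512))
q = rng.normal(size=512)
idx, dist = retrieve_top_k(q, db, k=5)
print(idx, dist)
```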
The image retrieval method provided by the embodiment of the invention can realize the following beneficial effects.
1. The embodiment of the invention takes two different modalities, text modal data and image modal data, as entry points, constructs a multi-modal fused feature with more comprehensive information through the residual feature and the gate feature, and fully explores the relevance between the low-level features and the high-level semantics of the text data and the image data; the retrieval task is completed through the multi-modal fused feature, which improves the recall ratio, the precision ratio, and the efficiency of image retrieval.
2. The embodiment of the invention uses metric learning techniques to complete the construction of data samples, the training of the loss function, and the learning of the weights of the image features and the text features, refining the optimization of the fused feature, so that after the text features and the image features are fused, the feature constructed from the data to be retrieved is consistent with the feature of the retrieval target image in spatial structure and similar to it in semantic expression.
3. The embodiment of the invention provides a deep fusion model for obtaining text features: the concatenation of the Word2vec feature vector and the TF-IDF feature vector is used as input to learn a high-order fused feature, which serves as the final feature of the text data, thereby avoiding the dimensionality disaster that would arise if the concatenated feature were used directly as the text feature.
4. The embodiment of the invention uses the VGGNet-16 network model as the processing unit for the image data and fine-tunes the pre-trained parameters according to the characteristics of the target data set, so that the VGGNet-16 network model matches the target data set more closely, improving the accuracy and efficiency of image feature extraction.
5. The embodiment of the invention combines the image feature and the text feature by element-wise multiplication of co-located elements, and, on the basis of matching the structures of the text feature and the image feature, retains the element information contained in the image feature to the maximum extent.
Example two
Fig. 3 is a schematic structural diagram of an image retrieval apparatus according to an embodiment of the present invention. The device is used for implementing the image retrieval method provided by the first embodiment, and includes a data acquisition module 310, an image feature extraction module 320, a text feature extraction module 330, a feature fusion module 340, a weight learning module 350, and an image retrieval module 360.
The data obtaining module 310 is configured to obtain an image to be retrieved and a text corresponding to the image to be retrieved.
The image feature extraction module 320 is configured to extract image features of the image to be retrieved using the VGGNet network model.
The text feature extraction module 330 is configured to extract Word2vec features and TF-IDF features of the text, and perform deep concatenation on the Word2vec features and the TF-IDF features to obtain text features of the image to be retrieved.
The feature fusion module 340 is configured to fuse the image feature and the text feature, and construct a residual feature and a gate feature of the image to be retrieved, where the residual feature and the gate feature have a consistent spatial structure; and linearly combining the residual error characteristics and the gate characteristics according to the weight to obtain the fusion characteristics of the image to be retrieved.
The weight learning module 350 is configured to obtain a training data set, wherein the training data set includes a plurality of training images and respective corresponding texts; and learning the weights of the residual error feature and the gate feature in the fusion feature by using the fusion feature of the training images and the respective retrieval target feature through a metric learning method to obtain a final weight.
The image retrieval module 360 is configured to linearly combine the residual features and the gate features of the image to be retrieved according to the final weight to obtain final fusion features of the image to be retrieved, use the final fusion features as the features to be retrieved, calculate similarity between the features to be retrieved and retrieval features of a plurality of images in a retrieval database, and return the images meeting the retrieval requirements in the plurality of images.
In an embodiment, the parameter configuration of the VGGNet network model comprises the following steps:
s11: pre-training the VGGNet network model by using the ImageNet data set to obtain pre-trained network parameters;
s12: adjusting the sizes of all images in the target data set of the VGGNet network model to 256 × 256, and randomly selecting 227 × 227 crops of the image content and their mirror images as the input of the VGGNet network model;
s13: modifying the number of neurons of the last fully-connected layer of the VGGNet network model from the number of image categories in the ImageNet data set to the number c of image categories in the target data set;
s14: performing a Softmax operation of dimensionality c on the output of the last fully-connected layer to obtain the probability distribution of the image to be retrieved over the c image categories.
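A minimal PyTorch sketch of s11–s14, assuming torchvision's VGG-16 implementation; the number of target categories and the transform pipeline are illustrative assumptions, not taken from the original text.

```python
import torch.nn as nn
from torchvision import models, transforms

NUM_TARGET_CLASSES = 20  # c: number of image categories in the target data set (illustrative)

# s11: start from ImageNet pre-trained parameters.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# s12: resize to 256 x 256, then feed random 227 x 227 crops and their mirror images.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(227),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# s13: replace the last fully-connected layer (1000 ImageNet classes -> c target classes).
in_features = model.classifier[-1].in_features
model.classifier[-1] = nn.Linear(in_features, NUM_TARGET_CLASSES)

# s14: a Softmax of dimensionality c turns the outputs into a class-probability distribution.
softmax = nn.Softmax(dim=1)
```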
In an embodiment, the text feature extraction module 330 is configured to perform deep concatenation on the Word2vec feature and the TF-IDF feature to obtain a text feature of the image to be retrieved, in the following manner:
S31: denote the Word2vec feature as W = (w_1, w_2, …, w_N), where w_1, w_2, …, w_N are all real numbers and N represents the dimensionality of the Word2vec feature; denote the TF-IDF feature as V = (v_1, v_2, …, v_T), where v_1, v_2, …, v_T are all real numbers and T represents the dimensionality of the TF-IDF feature;
S32: concatenate W and V to obtain the concatenated feature Z;
S33: input Z into a deep neural network, learn the high-order fused feature of W and V through the deep neural network, and obtain the text feature f_t of the image to be retrieved, where the dimensionality of f_t is smaller than the dimensionality of Z.
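A minimal sketch of the deep concatenation in S31–S33, assuming the Word2vec and TF-IDF vectors have already been computed; the network depth and layer sizes are assumptions, since the original text only requires that the output dimensionality be smaller than that of the concatenated feature Z.

```python
import torch
import torch.nn as nn

class TextFusionNet(nn.Module):
    """Learns a high-order fused text feature from concatenated Word2vec and TF-IDF
    vectors (S31-S33). Layer sizes are assumptions, not taken from the original text."""

    def __init__(self, word2vec_dim, tfidf_dim, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(word2vec_dim + tfidf_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, out_dim),  # out_dim < word2vec_dim + tfidf_dim
        )

    def forward(self, w2v_feat, tfidf_feat):
        z = torch.cat([w2v_feat, tfidf_feat], dim=-1)  # S32: concatenated feature Z
        return self.mlp(z)                             # S33: high-order fused text feature

net = TextFusionNet(word2vec_dim=300, tfidf_dim=5000)
text_feature = net(torch.randn(1, 300), torch.randn(1, 5000))
print(text_feature.shape)  # torch.Size([1, 256])
```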
In one embodiment, the feature fusion module 340 includes: a size transformation unit 341, a residual feature construction unit 342, a gate feature construction unit 343, and a feature fusion unit 344.
The size transformation unit 341 is configured to transform the text feature f_t through the convolution filter W_0 according to formula (1), so that the transformed text feature f_t' has the same dimensions as the image feature f_x:
f_t' = W_0 * f_t    (1)
where * denotes the standard normalized convolution calculation.
The residual feature construction unit 342 is configured to construct the residual feature f_res according to formula (2), where ReLU(·) denotes the ReLU activation function.
The gate feature construction unit 343 is configured to construct the gate feature f_gate according to formula (3), where σ(·) is the sigmoid function, W_1 and W_2 denote two convolution filters, and ⊙ denotes element-wise multiplication of co-located elements.
The feature fusion unit 344 is configured to linearly combine f_res and f_gate according to their respective weights according to formula (4), to obtain the fused feature φ of the image to be retrieved:
φ = w_r · f_res + w_g · f_gate    (4)
where w_r and w_g represent learnable weights used to balance the proportions of f_res and f_gate in φ.
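Because the bodies of formulas (2) and (3) are reproduced only as images in the original publication, the following PyTorch sketch is an assumption modelled on a common gated-residual composition that matches the stated ingredients (a dimension-matching convolution, a ReLU residual branch, a sigmoid gate built from two convolution filters with element-wise multiplication, and learnable combination weights); it is not the patented formula itself, and all layer shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class ResidualGateFusion(nn.Module):
    """Assumed gated-residual fusion of an image feature map f_x and a text feature
    map f_t (already broadcast to the same spatial shape). Formulas (2) and (3) are
    images in the original filing, so this composition is illustrative."""

    def __init__(self, channels):
        super().__init__()
        self.w0 = nn.Conv2d(channels, channels, kernel_size=1)              # formula (1): dimension match
        self.res_conv = nn.Conv2d(2 * channels, channels, 3, padding=1)     # residual branch
        self.gate_conv1 = nn.Conv2d(2 * channels, channels, 3, padding=1)   # W_1
        self.gate_conv2 = nn.Conv2d(channels, channels, 3, padding=1)       # W_2
        self.w_r = nn.Parameter(torch.tensor(1.0))  # learnable weight of f_res
        self.w_g = nn.Parameter(torch.tensor(1.0))  # learnable weight of f_gate

    def forward(self, f_x, f_t):
        f_t = self.w0(f_t)                                    # transformed text feature f_t'
        joint = torch.cat([f_x, f_t], dim=1)
        f_res = torch.relu(self.res_conv(joint))              # ReLU residual feature (cf. formula (2))
        gate = torch.sigmoid(self.gate_conv2(torch.relu(self.gate_conv1(joint))))
        f_gate = gate * f_x                                    # element-wise gating (cf. formula (3))
        return self.w_r * f_res + self.w_g * f_gate            # linear combination, formula (4)

fusion = ResidualGateFusion(channels=512)
phi = fusion(torch.randn(1, 512, 7, 7), torch.randn(1, 512, 7, 7))
print(phi.shape)  # torch.Size([1, 512, 7, 7])
```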
In one embodiment, the weight learning module 350 is configured to learn the weights of the residual feature and the gate feature in the fused feature by the metric learning method, using the fused features of the plurality of training images and the respective retrieval target features, to obtain the final weights in the following manner:
S51: set the size of the mini-batch used when the gradient descent algorithm searches for the minimum loss value during training to B, where the mini-batch includes the initial fused feature φ_i of each training image x_i and the corresponding retrieval target feature ψ_i; denote φ_i as f_fuse(x_i, t_i), where t_i represents the text corresponding to x_i, f_fuse(·) represents the function for obtaining the initial fused feature of x_i, and i = 1, 2, …, B; denote ψ_i as f_img(y_i), where y_i represents the retrieval target image corresponding to x_i and f_img(·) represents the function for obtaining the image feature of any image;
S52: for each training image x_i, repeatedly construct M sets N_1^i, N_2^i, …, N_M^i of size K, where each set N_m^i includes K samples selected from the mini-batch, the K samples include one positive example and (K − 1) negative examples, the one positive example is the retrieval target feature ψ_i, and the (K − 1) negative examples are retrieval target features other than ψ_i selected from the mini-batch; M is less than or equal to B, and M is less than or equal to K;
S53: construct the Softmax cross-entropy loss function L using formula (5), where κ(·,·) represents the similarity kernel function and κ(a, b) represents the distance between the two data-point vectors a and b; φ_i and ψ_i respectively represent the initial fused feature and the retrieval target feature corresponding to sample i in the set N_m^i; the softmax function in formula (5) characterizes the percentage of one converted result in the sum of all converted results;
S54: learn the weights of the residual feature and the gate feature in the fused feature by using L, to obtain the final weights.
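Formula (5) likewise appears only as an image in the original publication, so the sketch below implements a batch-based softmax cross-entropy metric loss consistent with the description in S51–S53 (each set holds one positive retrieval target feature and K − 1 negatives drawn from the mini-batch); the dot-product similarity kernel and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def softmax_metric_loss(fused, targets, M=3, K=4):
    """Batch-based softmax cross-entropy metric loss in the spirit of S51-S53.

    fused:   (B, d) initial fused features of the mini-batch (query side).
    targets: (B, d) corresponding retrieval target features.
    For each sample i, M sets of size K are drawn: the positive is targets[i] and
    the K-1 negatives are target features of other samples in the mini-batch.
    A dot-product similarity kernel is assumed.
    """
    B = fused.size(0)
    losses = []
    for i in range(B):
        for _ in range(M):
            perm = torch.randperm(B)
            neg_idx = perm[perm != i][: K - 1]
            candidates = torch.cat([targets[i:i + 1], targets[neg_idx]], dim=0)  # positive first
            sims = candidates @ fused[i]                       # similarity kernel kappa(., .)
            # Cross-entropy with the positive example at index 0 (softmax over the set).
            losses.append(F.cross_entropy(sims[None, :], torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

loss = softmax_metric_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```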
In an embodiment, the image retrieval module 360 is configured to use the final fusion feature as a feature to be retrieved, calculate similarity between the feature to be retrieved and retrieval features of a plurality of images in the retrieval database, and return an image in the plurality of images that meets the retrieval requirement by:
S61: denote the final fused feature of the image to be retrieved q as φ_q = f_fuse(q, t), where t represents the text corresponding to q and f_fuse(·) represents the function for obtaining the final fused feature of q; taking φ_q as the feature to be retrieved, calculate according to formula (6) the distance D_r between the feature to be retrieved and the retrieval feature f_r of each image x_r in the retrieval database:
D_r = d(φ_q, f_r)    (6)
where the number of images in the retrieval database is denoted as R and r = 1, 2, …, R;
S62: take the k smallest distances among D_1, D_2, …, D_R, where the picture retrieval features corresponding to these k distances are f_{r_1}, f_{r_2}, …, f_{r_k}; return the images in the retrieval database corresponding to f_{r_1}, f_{r_2}, …, f_{r_k} as the images meeting the retrieval requirement.
In one embodiment, the VGGNet network model is a VGGNet-16 network model; or the Word2vec features are obtained through a Skip-Gram model; or the TF-IDF features are obtained through the sklearn library in Python.
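A brief sketch of obtaining the two text representations mentioned here, assuming the gensim library for the Skip-Gram Word2vec model and scikit-learn for TF-IDF; the toy corpus, the averaging of word vectors into a document vector, and all parameters are illustrative.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "a brown dog running on the beach",
    "a red car parked near the beach",
]
tokenized = [doc.split() for doc in corpus]

# Word2vec features via the Skip-Gram model (sg=1).
w2v = Word2Vec(sentences=tokenized, vector_size=100, sg=1, window=5, min_count=1)
doc_w2v = np.mean([w2v.wv[tok] for tok in tokenized[0]], axis=0)  # average word vectors (assumption)

# TF-IDF features via scikit-learn.
tfidf = TfidfVectorizer()
doc_tfidf = tfidf.fit_transform(corpus).toarray()[0]

print(doc_w2v.shape, doc_tfidf.shape)
```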
The image retrieval device of the embodiment of the invention has the same technical principle and beneficial effects as the image retrieval method of the first embodiment. Please refer to the image retrieval method in the first embodiment without detailed technical details in this embodiment.
It should be noted that, in the embodiment of the apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
EXAMPLE III
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes a processor 410 and a memory 420. The number of the processors 410 may be one or more, and one processor 410 is taken as an example in fig. 4.
The memory 420, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules of the image retrieval method in embodiments of the present invention. The processor 410 implements the image retrieval method described above by running software programs, instructions, and modules stored in the memory 420.
The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, which may be connected to the device/terminal/server via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Example four
The embodiment of the invention also provides a storage medium. Alternatively, in the present embodiment, the storage medium may be configured to store a program for executing the steps of:
s10: acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
s20: extracting image features of the image to be retrieved by using a VGGNet network model;
s30: extracting Word2vec characteristics and TF-IDF characteristics of the text, and performing depth series connection on the Word2vec characteristics and the TF-IDF characteristics to obtain text characteristics of the image to be retrieved;
s40: fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual error characteristics and the gate characteristics according to weight to obtain fusion characteristics of the image to be retrieved;
s50: acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; learning the weights of the residual error features and the gate features in the fusion features by using the fusion features of the training images and the respective retrieval target features through a metric learning method to obtain final weights;
s60: and linearly combining the residual error characteristics and the gate characteristics of the image to be retrieved according to the final weight to obtain final fusion characteristics of the image to be retrieved, taking the final fusion characteristics as the characteristics to be retrieved, calculating the similarity between the characteristics to be retrieved and the retrieval characteristics of the plurality of images in the retrieval database, and returning the images which meet the retrieval requirements in the plurality of images.
Of course, the storage medium provided in the embodiments of the present invention stores the computer-readable program, which is not limited to the method operations described above, and may also perform related operations in the image retrieval method provided in any embodiment of the present invention.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An image retrieval method, comprising:
s10: acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
s20: extracting image features of the image to be retrieved by using a VGGNet network model;
s30: extracting Word vector Word2vec characteristics and Word frequency-inverse text frequency TF-IDF characteristics of the text, and performing deep concatenation on the Word2vec characteristics and the TF-IDF characteristics to obtain text characteristics of the image to be retrieved;
s40: fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual error characteristics and the gate characteristics according to weight to obtain fusion characteristics of the image to be retrieved;
s50: acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; learning the weights of the residual error features and the gate features in the fusion features by using the fusion features of the training images and the respective retrieval target features through a metric learning method to obtain final weights;
s60: and linearly combining the residual error characteristics and the gate characteristics of the image to be retrieved according to the final weight to obtain final fusion characteristics of the image to be retrieved, taking the final fusion characteristics as the characteristics to be retrieved, calculating the similarity between the characteristics to be retrieved and the retrieval characteristics of the plurality of images in the retrieval database, and returning the images which meet the retrieval requirements in the plurality of images.
2. The image retrieval method of claim 1, wherein the parameter configuration of the VGGNet network model comprises the steps of:
s11: pre-training the VGGNet network model by using an ImageNet data set to obtain pre-training network parameters;
s12: adjusting the sizes of all images in the target data set of the VGGNet network model to 256 × 256, and randomly selecting 227 × 227 crops of the image content and their mirror images as the input of the VGGNet network model;
s13: modifying the number of neurons of the last fully-connected layer of the VGGNet network model from the number of image categories in the ImageNet data set to the number c of image categories in the target data set;
s14: and performing Softmax operation with the dimensionality of c on the output of the last full connecting layer to obtain probability distribution results of the image to be retrieved in the c image categories.
3. The image retrieval method of claim 1, wherein in S30, the depth concatenation of the Word2vec feature and the TF-IDF feature to obtain the text feature of the image to be retrieved includes:
s31: characterize the Word2vec as
Figure 686092DEST_PATH_IMAGE001
Wherein, in the step (A),
Figure 819133DEST_PATH_IMAGE002
are all real numbers, and are all real numbers,Na dimension representing the Word2vec feature; characterize the TF-IDF as
Figure 872539DEST_PATH_IMAGE003
Wherein, in the step (A),
Figure 836953DEST_PATH_IMAGE004
are all real numbers, and are all real numbers,Ta dimension representing the TF-IDF feature;
s32: will be provided with
Figure 379930DEST_PATH_IMAGE005
And
Figure 734688DEST_PATH_IMAGE006
splicing is carried out to obtain spliced characteristics
Figure 591785DEST_PATH_IMAGE007
S33: will be provided with
Figure 286072DEST_PATH_IMAGE008
Inputting a deep neural network through which to learn
Figure 999950DEST_PATH_IMAGE005
And
Figure 107583DEST_PATH_IMAGE006
obtaining the text characteristic of the image to be retrieved by the high-order fusion characteristic
Figure 502793DEST_PATH_IMAGE009
Wherein, in the step (A),
Figure 51586DEST_PATH_IMAGE009
is less than
Figure 201944DEST_PATH_IMAGE008
Of (c) is calculated.
4. The image retrieval method according to claim 1, wherein S40 includes:
s41: by a convolution filter according to equation (1)
Figure 468978DEST_PATH_IMAGE010
Characterizing the text
Figure 667878DEST_PATH_IMAGE011
Transforming so that the transformed text features
Figure 133494DEST_PATH_IMAGE012
And the image characteristics
Figure 454754DEST_PATH_IMAGE013
The dimensions of (a) are the same:
Figure 209083DEST_PATH_IMAGE014
(1)
wherein, denotes a standard normalized convolution calculation mode;
s42: constructing the residual features according to equation (2)
Figure 8412DEST_PATH_IMAGE015
Figure 531797DEST_PATH_IMAGE016
(2)
Wherein the content of the first and second substances,
Figure 23959DEST_PATH_IMAGE017
representing a ReLU activation function;
s43: is constructed according to the formula (3)Characteristic of the door
Figure 62322DEST_PATH_IMAGE018
Figure 603024DEST_PATH_IMAGE019
(3)
Wherein the content of the first and second substances,
Figure 308812DEST_PATH_IMAGE020
in order to be a sigmoid function,
Figure 706296DEST_PATH_IMAGE021
and
Figure 435217DEST_PATH_IMAGE022
two convolution filters are shown as being present in the convolution filter,
Figure 514032DEST_PATH_IMAGE023
representing a calculation method of corresponding multiplication of the parity elements;
s44: according to the formula (4), to
Figure 74326DEST_PATH_IMAGE024
And
Figure 580394DEST_PATH_IMAGE025
carrying out linear combination according to respective weight to obtain the fusion characteristics of the image to be retrieved
Figure 858928DEST_PATH_IMAGE026
Figure 741434DEST_PATH_IMAGE027
(4)
Wherein the content of the first and second substances,
Figure 156234DEST_PATH_IMAGE028
and
Figure 161100DEST_PATH_IMAGE029
representing learnable weight values for balancing
Figure 599034DEST_PATH_IMAGE015
And
Figure 81968DEST_PATH_IMAGE018
in that
Figure 554538DEST_PATH_IMAGE030
Specific gravity of (1).
5. The image retrieval method of claim 1, wherein in S50, the learning, by the metric learning method, weights of the residual features and the gate features in the fused features by using the fused features of the plurality of training images and the respective retrieval target features to obtain final weights comprises:
s51: setting the size of minipatch when the gradient descent algorithm is adopted to search the minimum loss value in the training process asBWherein the minimatch includes each training image
Figure 464725DEST_PATH_IMAGE031
Initial fusion feature of
Figure 389956DEST_PATH_IMAGE032
And
Figure 614264DEST_PATH_IMAGE033
corresponding search target feature
Figure 3657DEST_PATH_IMAGE034
(ii) a Will be provided with
Figure 350324DEST_PATH_IMAGE035
Is marked as
Figure 762851DEST_PATH_IMAGE036
Wherein, in the step (A),
Figure 587588DEST_PATH_IMAGE037
to represent
Figure 769170DEST_PATH_IMAGE038
The corresponding text is then displayed on the display screen,
Figure 286739DEST_PATH_IMAGE039
representation acquisition
Figure 186562DEST_PATH_IMAGE040
As a function of the initial fused features of (a),i=1,2,…,B(ii) a Will be provided with
Figure 814990DEST_PATH_IMAGE034
Is marked as
Figure 851079DEST_PATH_IMAGE041
Wherein, in the step (A),
Figure 539549DEST_PATH_IMAGE042
to represent
Figure 926668DEST_PATH_IMAGE043
A corresponding one of the retrieval-target images,
Figure 93207DEST_PATH_IMAGE044
a function representing image characteristics of any one of the acquired images;
s52: for each training image
Figure 249382DEST_PATH_IMAGE043
Repeating the constructionMEach size isKSet of (2)
Figure 108754DEST_PATH_IMAGE045
To obtain theMAn
Figure 779906DEST_PATH_IMAGE046
Set of (2)
Figure 953399DEST_PATH_IMAGE047
Wherein each one
Figure 26397DEST_PATH_IMAGE046
Including one selected from said minimatchKA sample, theKOne sample includes a positive example
Figure 994353DEST_PATH_IMAGE034
And (a)K-1) negative examples, said one positive example being said retrieval target feature
Figure 90485DEST_PATH_IMAGE034
The above-mentioned (A) toK-1) negative examples
Figure 598827DEST_PATH_IMAGE048
MIs less than or equal toBAnd is andMis less than or equal toK
S53: constructing a Softmax cross-entropy loss function by using the formula (5)
Figure 729594DEST_PATH_IMAGE049
Figure 930768DEST_PATH_IMAGE050
(5)
Wherein the content of the first and second substances,
Figure 514196DEST_PATH_IMAGE051
a similar kernel function is represented as a function of the kernel,
Figure 826229DEST_PATH_IMAGE052
representing two vectors of data points
Figure 811502DEST_PATH_IMAGE053
And
Figure 183578DEST_PATH_IMAGE054
the distance between them;
Figure 254302DEST_PATH_IMAGE055
and
Figure 370025DEST_PATH_IMAGE034
respectively represent
Figure 209805DEST_PATH_IMAGE056
Sample of (1)
Figure 752782DEST_PATH_IMAGE057
The corresponding initial fusion features and the retrieval target features,
Figure 45223DEST_PATH_IMAGE058
is shown in
Figure 902321DEST_PATH_IMAGE059
Under the condition of (1) calculating
Figure 658924DEST_PATH_IMAGE060
Figure 372802DEST_PATH_IMAGE061
Representing a softmax function for characterizing the percentage of the converted results to the sum of all converted results;
s54: by using
Figure 418119DEST_PATH_IMAGE062
And learning the weights of the residual error feature and the gate feature in the fusion feature to obtain the final weight.
6. The image retrieval method according to claim 5, wherein, in S60, the step of taking the final fused feature as a feature to be retrieved, calculating similarity between the feature to be retrieved and retrieval features of a plurality of images in a retrieval database, and returning an image meeting retrieval requirements in the plurality of images comprises:
s61: the image to be retrieved is processed
Figure 813328DEST_PATH_IMAGE063
Is finally fused and characterized as
Figure 447875DEST_PATH_IMAGE064
Wherein, in the step (A),tto represent
Figure 535917DEST_PATH_IMAGE063
The corresponding text is then displayed on the display screen,
Figure 865267DEST_PATH_IMAGE065
representation acquisition
Figure 798588DEST_PATH_IMAGE063
A function of the final fused features of (a); will be provided with
Figure 467467DEST_PATH_IMAGE066
As the feature to be retrieved, each image in the retrieval database of the feature to be retrieved is calculated according to formula (6)
Figure 788727DEST_PATH_IMAGE067
Search feature of
Figure 277477DEST_PATH_IMAGE068
The distance between
Figure 342385DEST_PATH_IMAGE069
Figure 865770DEST_PATH_IMAGE070
(6)
Wherein the number of images in the search database is recorded asRr=1,2,…,R
S62: getD 1D 2,…,D RFront of lowest numerical valuekA distance, wherein, the frontkThe picture retrieval characteristics corresponding to the distances are
Figure 30035DEST_PATH_IMAGE071
(ii) a Will be provided with
Figure 333978DEST_PATH_IMAGE071
And returning the corresponding image in the retrieval database as the image meeting the retrieval requirement.
7. The image retrieval method according to claim 1,
the VGGNet network model is a VGGNe-16 network model; or
The Word2vec feature is obtained through a Skip-Gram model; or
The TF-IDF features are obtained through the sklearn library in Python.
8. An image retrieval apparatus, comprising:
the data acquisition module is used for acquiring an image to be retrieved and a text corresponding to the image to be retrieved;
the image feature extraction module is used for extracting the image features of the image to be retrieved by utilizing a VGGNet network model;
the text feature extraction module is used for extracting Word vector Word2vec features and Word frequency-inverse text frequency TF-IDF features of the text, and performing deep series connection on the Word2vec features and the TF-IDF features to obtain text features of the image to be retrieved;
the feature fusion module is used for fusing the image features and the text features to construct residual features and gate features of the image to be retrieved, wherein the residual features and the gate features have consistent spatial structures; linearly combining the residual error characteristics and the gate characteristics according to weight to obtain fusion characteristics of the image to be retrieved;
the weight learning module is used for acquiring a training data set, wherein the training data set comprises a plurality of training images and texts corresponding to the training images; learning the weights of the residual error features and the gate features in the fusion features by using the fusion features of the training images and the respective retrieval target features through a metric learning method to obtain final weights;
and the image retrieval module is configured to linearly combine the residual features and the gate features of the image to be retrieved according to the final weight to obtain final fusion features of the image to be retrieved, use the final fusion features as the features to be retrieved, calculate similarity between the features to be retrieved and retrieval features of a plurality of images in a retrieval database, and return the images meeting retrieval requirements in the plurality of images.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the image retrieval method according to any one of claims 1 to 7 when executing the program.
10. A storage medium on which a computer-readable program is stored, characterized in that the program, when executed, implements an image retrieval method as recited in any one of claims 1 to 7.
CN202110841488.0A 2021-07-26 2021-07-26 Image retrieval method and device, computer equipment and storage medium Pending CN113297410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110841488.0A CN113297410A (en) 2021-07-26 2021-07-26 Image retrieval method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110841488.0A CN113297410A (en) 2021-07-26 2021-07-26 Image retrieval method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113297410A true CN113297410A (en) 2021-08-24

Family

ID=77330973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110841488.0A Pending CN113297410A (en) 2021-07-26 2021-07-26 Image retrieval method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113297410A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095964A (en) * 2015-08-17 2015-11-25 杭州朗和科技有限公司 Data processing method and device
US20200285811A1 (en) * 2018-02-05 2020-09-10 Alibaba Group Holding Limited Methods, apparatuses, and devices for generating word vectors
CN109766465A (en) * 2018-12-26 2019-05-17 中国矿业大学 A kind of picture and text fusion book recommendation method based on machine learning
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
CN112231442A (en) * 2020-10-15 2021-01-15 北京临近空间飞行器系统工程研究所 Sensitive word filtering method and device
CN112364204A (en) * 2020-11-12 2021-02-12 北京达佳互联信息技术有限公司 Video searching method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUAN XU ET AL.: "Judgment Document Recommendation Method Based on Multi-modal Feature Fusion", MICROELECTRONICS & COMPUTER *
LI CHAOYUE: "Research and Application of Cross-modal Retrieval Methods Based on Feature Fusion", CHINA MASTER'S THESES FULL-TEXT DATABASE (INFORMATION SCIENCE AND TECHNOLOGY) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113901177A (en) * 2021-10-27 2022-01-07 电子科技大学 Code searching method based on multi-mode attribute decision
CN113901177B (en) * 2021-10-27 2023-08-08 电子科技大学 Code searching method based on multi-mode attribute decision
CN113989792A (en) * 2021-10-29 2022-01-28 天津大学 Cultural relic recommendation algorithm based on fusion features
CN114880517A (en) * 2022-05-27 2022-08-09 支付宝(杭州)信息技术有限公司 Method and device for video retrieval
CN115269882A (en) * 2022-09-28 2022-11-01 山东鼹鼠人才知果数据科技有限公司 Intellectual property retrieval system and method based on semantic understanding
CN115269882B (en) * 2022-09-28 2022-12-30 山东鼹鼠人才知果数据科技有限公司 Intellectual property retrieval system and method based on semantic understanding
CN115905608A (en) * 2022-11-15 2023-04-04 腾讯科技(深圳)有限公司 Image feature acquisition method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US11562039B2 (en) System and method for performing cross-modal information retrieval using a neural network using learned rank images
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
JP7282940B2 (en) System and method for contextual retrieval of electronic records
US11295090B2 (en) Multi-scale model for semantic matching
GB2547068B (en) Semantic natural language vector space
CN106649561B (en) Intelligent question-answering system for tax consultation service
AU2016256753B2 (en) Image captioning using weak supervision and semantic natural language vector space
CN112163165B (en) Information recommendation method, device, equipment and computer readable storage medium
US9792534B2 (en) Semantic natural language vector space
KR102354716B1 (en) Context-sensitive search using a deep learning model
CN113297410A (en) Image retrieval method and device, computer equipment and storage medium
US11550871B1 (en) Processing structured documents using convolutional neural networks
CN108288067A (en) Training method, bidirectional research method and the relevant apparatus of image text Matching Model
EP3180742A1 (en) Generating and using a knowledge-enhanced model
US10482146B2 (en) Systems and methods for automatic customization of content filtering
Wu et al. Learning of multimodal representations with random walks on the click graph
CN111159367B (en) Information processing method and related equipment
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN108475256B (en) Generating feature embedding from co-occurrence matrices
WO2022140900A1 (en) Method and apparatus for constructing personal knowledge graph, and related device
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN113094534B (en) Multi-mode image-text recommendation method and device based on deep learning
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
CN117332112A (en) Multimodal retrieval model training, multimodal retrieval method, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210824