CN108509521B - Image retrieval method for automatically generating text index - Google Patents

Image retrieval method for automatically generating text index Download PDF

Info

Publication number
CN108509521B
Authority
CN
China
Prior art keywords
word
image
images
text
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810198490.9A
Other languages
Chinese (zh)
Other versions
CN108509521A (en
Inventor
吴良超
苏锦钿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201810198490.9A priority Critical patent/CN108509521B/en
Publication of CN108509521A publication Critical patent/CN108509521A/en
Application granted granted Critical
Publication of CN108509521B publication Critical patent/CN108509521B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an image retrieval method that automatically generates text indexes, comprising the following steps: (1) train an automatic labeling model: extract image features through the CNN part of the model, feed the features and the image's descriptors to the RNN part of the model, and back-propagate with a cross-entropy loss as the objective function; (2) generate text indexes for images: using the trained automatic labeling model and its dictionary, generate for each unlabeled image a sequence of descriptors and a confidence for each word, normalize the confidences, and use the words together with their confidences as the image's text index to build an image retrieval index; (3) when a query keyword is not in the dictionary, look it up in a synonym lexicon to find a word in the dictionary with a similar meaning; (4) find the corresponding images in the image retrieval index by the keywords or their synonyms and return them in order of confidence, from high to low.

Description

Image retrieval method for automatically generating text index
Technical Field
The invention belongs to the technical field of information retrieval, in particular to text-based image retrieval, and relates to an image retrieval method that automatically generates text indexes for images.
Background
With the explosive growth of image data on the Internet, screening the required data out of this mass of data has become an urgent problem, so image retrieval is attracting the attention of more and more researchers.
Mainstream image retrieval can be divided into two categories according to how image content is described: Content-Based Image Retrieval (CBIR) and Text-Based Image Retrieval (TBIR). Text-based image retrieval describes the content of an image, such as the objects and scenes it contains, through text labels, forming keywords that describe each image; at query time the user provides query keywords according to his or her interest, the retrieval system finds the pictures labeled with the corresponding keywords, and finally the retrieved results are returned to the user.
Text-based image retrieval is intuitive, its results are highly interpretable, and its precision is relatively high. Its drawbacks, however, are also significant. First, the labeling process requires manual intervention; as the number of images on the Internet grows rapidly, completing text labels for them obviously consumes a great deal of manpower and money. Second, manual labeling usually records only the objects appearing in an image, i.e. some nouns, ignoring information such as their number, actions, and states; nor does it distinguish between the words in the result, i.e. it cannot tell which word covers more of the image's information. Finally, this method supports only exact retrieval: the user's query keyword must appear in the labels for a result to be returned, yet the same meaning can usually be expressed by many different terms, and the label data cannot cover them all, so content in the database that meets the need may not be retrieved.
Disclosure of Invention
The invention aims to solve the problems of current text-based image retrieval, namely that manual labeling is inefficient, that the labeling result cannot cover all the content of an image, and that words not appearing in the labels cannot be retrieved, and provides an image retrieval method that automatically generates text indexes.
The purpose of the invention can be achieved by adopting the following technical scheme:
an image retrieval method for automatically generating text indexes comprises the following steps:
s1, learning the automatic labeling model M, and the process is as follows:
s101, acquiring a labeled training data set and an unlabeled image data set, wherein the training data set comprises training images and text descriptions corresponding to the training images, and the image data set only comprises images and does not have text descriptions corresponding to the images;
s102, segmenting all text descriptions of the training data set to construct a dictionary D;
s103, extracting the characteristics of each image in the training data set through CNN, wherein the characteristics are one-dimensional vectors;
s104, for a certain image i in the training data set, segment the corresponding text description into words w_i1, w_i2, …, w_iL (L words in total); at the same time, the feature f_i of image i extracted by the CNN serves as the initial input of the hidden unit of the RNN, and the words w_i1, w_i2, …, w_iL are input in turn at each step of the recurrent network's cycle. The output of each step passes through a softmax layer to give a probability value for every word in the dictionary; denoting the word input at step t as w_it and the output probability distribution as P_it, the probability that this step outputs w_it is P_it(w_it). By maximum likelihood estimation, it is desirable to maximize the probability of equation (1),

P_i = \prod_{t=1}^{L} P_{it}(w_{it})   (1)

s105, for all images in the training data set, the probability of equation (2) must be maximized; taking this expression as the objective function, back-propagation updates the model parameters to obtain the automatic labeling model M, which consists of the CNN and the RNN;

P = \prod_{i=1}^{N} \prod_{t=1}^{L_i} P_{it}(w_{it})   (2)

where N is the number of training images and L_i is the length of the description of image i.
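In practice, maximizing the product in equations (1) and (2) is done by minimizing its negative logarithm, the cross-entropy loss mentioned in the abstract. A minimal sketch, assuming the per-step probabilities P_it(w_it) of the ground-truth words are available (the function name `sequence_nll` is illustrative, not part of the claimed method):

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood (cross-entropy) of one caption, given the
    probability the model assigned to the ground-truth word at each step.
    Minimizing this sum is equivalent to maximizing the product in (1)."""
    return -sum(math.log(p) for p in step_probs)

# Hypothetical per-step probabilities for a 3-word caption:
loss = sequence_nll([0.5, 0.25, 0.8])  # -ln(0.5 * 0.25 * 0.8) = ln(10)
```

Summing this loss over every caption in the training set corresponds to equation (2), and back-propagating it updates the parameters of both the CNN and the RNN.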
s2, generating text indexes for all images through the automatic annotation model M;
for any image i in the image data set, the image feature f_i is first extracted by the CNN part of the automatic labeling model M and used as the initial input of the RNN part of M; the words are then generated in turn, where generating the word w'_it depends on the already generated w'_i1, w'_i2, …, w'_i(t-1). At each step the word with the largest output probability value is selected as the generated word, and that probability value is taken as the confidence of the generated word for the image, denoted z;

word generation stops when the end word is generated in the above steps or the length of the generated sequence reaches a preset threshold; for any image i described above, a descriptor sequence w'_i1, w'_i2, …, w'_il can thus be generated, together with the confidences z_i1, z_i2, …, z_il of the descriptors in the image, which are normalized by formula (3)

z'_{it} = z_{it} \Big/ \sum_{j=1}^{l} z_{ij}   (3)

the w'_i1, w'_i2, …, w'_il and z'_i1, z'_i2, …, z'_il above together constitute the text index of image i;
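One plausible reading of formula (3) is to divide each raw confidence by the sum over the image's descriptors, so that the normalized confidences of each image sum to 1 (a sketch under that assumption; it also assumes the raw confidences are positive):

```python
def normalize_confidences(z):
    """Normalize raw descriptor confidences z_i1..z_il so they sum to 1,
    one plausible reading of formula (3); assumes positive inputs."""
    total = sum(z)
    return [zi / total for zi in z]

# e.g. raw confidences 2, 1, 1 become 0.5, 0.25, 0.25
```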
s3, construct an image retrieval index from the text index of each image: for any word w_u in the dictionary D described above, find all images i_1, i_2, …, i_o described by that word and the confidences z'_u1, z'_u2, …, z'_uo of the word in those images, and sort the images by confidence from high to low; in this way a candidate image set ordered by confidence is generated for any word in dictionary D;
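Step S3 is a plain inverted index from words to confidence-ranked image lists. A sketch under an assumed data layout (`{image_id: [(word, confidence), ...]}`; the names are illustrative):

```python
from collections import defaultdict

def build_retrieval_index(text_indexes):
    """Invert per-image text indexes into {word: [(image_id, conf), ...]},
    with each posting list sorted by confidence, high to low."""
    index = defaultdict(list)
    for image_id, pairs in text_indexes.items():
        for word, conf in pairs:
            index[word].append((image_id, conf))
    for word in index:
        index[word].sort(key=lambda p: p[1], reverse=True)
    return dict(index)
```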
s4, build a synonym-lookup lexicon: obtain text data that needs no labeling from a network text data set and train on it with the word2vec algorithm to construct a lexicon DB, in which every word w ∈ DB has a corresponding word vector, so that the semantic similarity of any two words in DB can be computed; when a query keyword w_u is not present in the dictionary D described above, find via the lexicon DB the word w_v that is closest in meaning to w_u and appears in dictionary D, and retrieve the relevant images through w_v;
s5, receive query keywords and perform image retrieval: according to steps S3 and S4, a candidate image set sorted by confidence can be generated for any word in the lexicon DB; when there are several query keywords, merge the candidate image sets generated for each keyword, and for any candidate image i occurring more than once, sum all its confidences under the different keywords as its final confidence, removing the redundant occurrences so that it appears only once; then sort the candidate images by the summed confidence from high to low and select the top candidates as the returned result.
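The multi-keyword merge of step S5 (summing the confidences of an image that appears under several keywords, then re-ranking) can be sketched as follows, with illustrative names:

```python
def merge_candidates(candidate_sets):
    """candidate_sets: one [(image_id, confidence), ...] list per keyword.
    Sum the confidences of images appearing under several keywords, keep
    each image once, and rank by the summed confidence, high to low."""
    totals = {}
    for candidates in candidate_sets:
        for image_id, conf in candidates:
            totals[image_id] = totals.get(image_id, 0.0) + conf
    return sorted(totals.items(), key=lambda p: p[1], reverse=True)

# image "b" appears under both keywords, so its confidences are summed
ranked = merge_candidates([[("a", 0.5), ("b", 0.25)], [("b", 0.5)]])
```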
Further, in step S2, when the automatic labeling model M generates descriptors for an image, it simultaneously produces a confidence for each descriptor, indicating how accurately the descriptor describes the image; ranking by this confidence allows images more relevant to the keywords to be retrieved accurately.
Further, in steps S4 and S5, when the query keyword does not appear in dictionary D, a word vector is constructed for each word by building the lexicon DB; for any two word vectors v_e, v_u, the similarity of the two words is computed by formula (4), the word appearing in dictionary D whose meaning is closest to the query keyword is found, and the corresponding images are then retrieved,

\mathrm{sim}(v_e, v_u) = \frac{v_e \cdot v_u}{|v_e| \times |v_u|}   (4)

where v_e · v_u is the inner product of the two vectors and |v_e| × |v_u| the product of their lengths; the larger the value, the closer the two meanings.
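Formula (4) is the standard cosine similarity. A sketch of the lookup in steps S4/S5, with illustrative function names and toy two-dimensional vectors (real word2vec vectors have hundreds of dimensions):

```python
import math

def cosine_similarity(ve, vu):
    """Formula (4): inner product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(ve, vu))
    lengths = math.sqrt(sum(a * a for a in ve)) * math.sqrt(sum(b * b for b in vu))
    return dot / lengths

def nearest_in_dictionary(query_vec, dictionary_vecs):
    """Among words that DO appear in dictionary D (given with their vectors),
    return the one most similar to the out-of-dictionary query keyword."""
    return max(dictionary_vecs,
               key=lambda w: cosine_similarity(query_vec, dictionary_vecs[w]))
```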
Further, the CNN adopts ResNet, and the RNN adopts LSTM.
Compared with the prior art, the invention has the following advantages and effects:
1. according to the method, the automatic labeling model is learned, a plurality of description words can be automatically generated for the image, and when massive image data are faced, manual intervention can be effectively reduced, and required manpower and financial resources are greatly reduced.
2. The description words generated for the image in the invention can contain quantifier for describing the number of the objects, adjective for describing the state of the objects, verb for describing the action of the objects and the like besides the nouns for representing the objects, thereby more comprehensively covering the content of the image, and more accurately retrieving the image compared with the traditional processing mode of marking out nouns only by manual marking.
3. The method can process the query keywords which do not appear in the text of the image training set, train the similar meaning word query word bank by an unsupervised method, find the words with the similar meaning to the query keywords, and avoid the problem of low matching success rate of an accurate matching mode.
Drawings
FIG. 1 is a diagram of an automatic tagging model of the present invention;
FIG. 2 is a schematic diagram of text indexing of images generated by an automatic annotation model;
FIG. 3 is a schematic diagram of a process for finding a synonym for a query keyword from a synonym library;
FIG. 4 is a flowchart of image retrieval by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
The image retrieval method that automatically generates text indexes is mainly applied to the retrieval of Internet images, for example by search engines such as Google, Bing, and Baidu. The following are the implementation steps of an application of the invention:
step S1, learning the automatic labeling model M, which comprises the following steps:
step S101, acquire a labeled training data set and an unlabeled image data set from network image data. First a specific language is chosen, here Chinese as an example; the labeled training data set acquired is the AI Challenger image Chinese-description data set, which contains 300,000 labeled images, each with 5 descriptive sentences. The unlabeled image data set consists of pictures provided by ImageNet, 14,197,122 images in total; more pictures can also be captured by crawlers and similar means.
Step S102, segment the 1,500,000 description sentences in the training data set with a Chinese word-segmentation tool, count the distinct words that appear in them, and construct the dictionary D; the dictionary built here has size 8233.
Step S103, extract features of the images in the training data set through a convolutional neural network (hereinafter CNN); the specific CNN used here is ResNet, and a feature vector of dimension 2048 is extracted for each image.
Step S104, for a certain image i in the training data set, segment the corresponding text description into words w_i1, w_i2, …, w_iL; the feature vector of image i extracted by the CNN serves as the initial input of the hidden unit of a recurrent neural network (hereinafter RNN), the specific RNN used here being LSTM, and the number of input steps is determined by L above.
Before input, marker words must also be added at the head and tail of the L words obtained by segmentation; the head marker word is denoted w_s and the tail marker word w_e, and these two tokens are the same for all samples in the training data set. Therefore the words w_s, w_i1, w_i2, …, w_iL, w_e are input in turn at each step of the RNN cycle. The result output at each step passes through the softmax layer to give a probability distribution, in this example a vector of length 8233 corresponding to the 8233 words in the dictionary, where the value in each dimension is the probability of that dimension's word. Denote the word input at step t as w_it and the output probability distribution as P_it, so the probability that this step outputs w_it is P_it(w_it). To make the words output for image i fit the words input to it as closely as possible, the probability of equation (1) must be maximized according to maximum likelihood estimation.
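The softmax layer described above turns the RNN's raw scores over the 8233-word dictionary into a probability distribution; a minimal, numerically stable pure-Python sketch (real implementations use a deep-learning framework):

```python
import math

def softmax(logits):
    """Map raw scores (one per dictionary word) to probabilities that sum
    to 1; subtracting the maximum first avoids overflow in exp()."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```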
Step S105: step S104 fits the output of one image to its input; to make the output words of every image in the whole training set fit the input words as closely as possible, the probability of equation (2) must be maximized. In training practice, a minus sign is usually placed before equation (2) so that it serves as the model's loss function, turning the target into loss minimization; the model parameters are then updated by back-propagation. During training, the data must be divided into a training set and a validation set, and convergence is judged by observing the model's performance on the validation set; once the model converges, the automatic labeling model M is obtained, whose specific structure is shown in FIG. 1.
P = \prod_{i=1}^{N} \prod_{t=1}^{L_i} P_{it}(w_{it})   (2)
Step S2, generating text indexes for all images through the automatic annotation model M described in step S1.
For any image i, firstly, the image feature f is extracted from the CNN part of the automatic labeling model MiImage feature fiAnd a head marker word w in step S104sTogether as an initial input to the RNN portion of the automatic annotation model M, the first word w 'describing the image is generated'i1Then the first word w'i1Generating a second descriptive term w 'as input to RNN'i2. Analogize in turn to generate the t < th > word w'itDependent on w 'generated ahead of it'i1,w′i2…w′i(t-1)
And selecting the word with the maximum RNN output part probability value when generating each word, and taking the probability value as the confidence coefficient of the word in the image, and marking the confidence coefficient as z. When the tail markup word w in step S104 is generatedeOr stopping continuously generating the words when the length of the generated words reaches a preset threshold value, and recording that the words generated at last are w'il. Then for any image i described above, a sequence can be generatedDescriptor w'i1,w′i2…w′ilAnd the confidence z of these words in the imagei1,zi2…zilNormalizing the confidence level by the formula (3) to obtain a normalized confidence level z'i1,z′i2…z′il
Figure BDA0001593769830000081
W 'produced above'i1,w′i2…w′ilAnd z'i1,z′i2…z′ilTogether forming a text index for the image, fig. 2 shows an example of generating a text index for an image by an automatic annotation model.
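The greedy generation loop of step S2 can be sketched as follows; `step_fn` is a hypothetical stand-in for one RNN step and must return a mapping from candidate words to probabilities:

```python
def greedy_decode(step_fn, start_token, end_token, max_len):
    """Greedy captioning: at each step pick the highest-probability word,
    record that probability as the word's confidence z, and stop at the
    tail marker word or at the length threshold."""
    words, confidences = [], []
    prev = start_token
    while len(words) < max_len:
        probs = step_fn(prev)
        word = max(probs, key=probs.get)  # word with the largest probability
        if word == end_token:
            break
        words.append(word)
        confidences.append(probs[word])
        prev = word
    return words, confidences
```

In the real model each step would also condition on the image feature f_i and the RNN's hidden state; the sketch keeps only the control flow of the decoding loop.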
Step S3, build the index over all images from the text indexes generated in step S2. Specifically, for any one of the 8233 dictionary words, denoted w_u, find all images i_1, i_2, …, i_o in which the word appears and the confidences z'_u1, z'_u2, …, z'_uo of the word in those images, and order the images by confidence from high to low. In this way a candidate set ordered by confidence can be generated for any of the 8233 words.
Step S4, build a synonym lexicon to handle query keywords that do not appear among the 8233 words above. Specifically, a large amount of text data needing no labeling is obtained from a network text data set; here the Chinese Wikipedia corpus is used. It comprises the title and body of each entry in the Chinese Wikipedia, 984,451 entries in total; after preprocessing steps such as punctuation removal, traditional-to-simplified conversion, and word segmentation, 984,451 texts are obtained. These 984,451 Wikipedia texts, plus the 150,000 texts of the training set, are trained with the word2vec algorithm; all the words in them form a lexicon DB of size 408,787, and a word vector can then be generated for every word in DB. For any two word vectors v_e, v_u, the degree of similarity can be computed by formula (4), where v_e · v_u is the inner product of the two vectors and |v_e| × |v_u| the product of their lengths; the larger the value, the closer the two meanings.

\mathrm{sim}(v_e, v_u) = \frac{v_e \cdot v_u}{|v_e| \times |v_u|}   (4)

When a query keyword w_u is not present in dictionary D, the lexicon DB is used to find the word w_v that is closest in meaning to w_u and appears in dictionary D, and the related images are retrieved via w_v; FIG. 3 shows an example of finding a synonym for a query keyword from the synonym lexicon.
Step S5, receive the query keywords and perform image retrieval. Through steps S3 and S4, a candidate image set ordered by confidence can be generated for any word in the lexicon DB. When there are several query keywords w_1, w_2, …, w_n, each keyword retrieves one candidate set <i_1, z_1>, <i_2, z_2>, …, <i_o, z_o>; for a candidate image i occurring more than once, all of its confidences z are summed as its final confidence, and the redundant occurrences of i are removed so that it appears only once. The candidate images are sorted by the summed confidence from high to low, and the top candidates are selected as the returned result.
FIG. 4 shows a general flow chart of image retrieval according to the present invention, which integrates the contents of the above steps, first obtaining a text data set, training to obtain a query thesaurus of near-sense words; acquiring an image data set, training an automatic labeling model and generating a dictionary, generating a text index for an image through the automatic labeling model, and constructing an image retrieval index through the text index; when the query keyword of the user is not in the dictionary of the image data set, the word bank is queried through the similar meaning word to find the word which is closest to the meaning of the query keyword in the dictionary, and the image retrieval is carried out by replacing the query keyword.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (4)

1. An image retrieval method for automatically generating a text index is characterized by comprising the following steps:
s1, learning the automatic labeling model M, and the process is as follows:
s101, acquiring a labeled training data set and an unlabeled image data set, wherein the training data set comprises training images and text descriptions corresponding to the training images, and the image data set only comprises images and does not have text descriptions corresponding to the images;
s102, segmenting all text descriptions of the training data set to construct a dictionary D;
s103, extracting the characteristics of each image in the training data set through CNN, wherein the characteristics are one-dimensional vectors;
s104, for a certain image i in the training data set, segment the corresponding text description into words w_i1, w_i2, …, w_iL (L words in total); at the same time, the feature f_i of image i extracted by the CNN serves as the initial input of the hidden unit of the RNN, and the words w_i1, w_i2, …, w_iL are input in turn at each step of the recurrent network's cycle. The output of each step passes through a softmax layer to give a probability value for every word in the dictionary; denoting the word input at step t as w_it and the output probability distribution as P_it, the probability that this step outputs w_it is P_it(w_it). By maximum likelihood estimation, it is desirable to maximize the probability of equation (1),

P_i = \prod_{t=1}^{L} P_{it}(w_{it})   (1)

s105, for all images in the training data set, the probability of equation (2) must be maximized; taking this expression as the objective function, back-propagation updates the model parameters to obtain the automatic labeling model M, which consists of the CNN and the RNN;

P = \prod_{i=1}^{N} \prod_{t=1}^{L_i} P_{it}(w_{it})   (2)

where N is the number of training images and L_i is the length of the description of image i;
s2, generating text indexes for all images through the automatic annotation model M;
for any image i in the image data set, the image feature f_i is first extracted by the CNN part of the automatic labeling model M and used as the initial input of the RNN part of M; the words are then generated in turn, where generating the word w'_it depends on the already generated w'_i1, w'_i2, …, w'_i(t-1). At each step the word with the largest output probability value is selected as the generated word, and that probability value is taken as the confidence of the generated word for the image, denoted z;

word generation stops when the end word is generated in the above steps or the length of the generated sequence reaches a preset threshold; for any image i described above, a descriptor sequence w'_i1, w'_i2, …, w'_il can thus be generated, together with the confidences z_i1, z_i2, …, z_il of the descriptors in the image, which are normalized by formula (3)

z'_{it} = z_{it} \Big/ \sum_{j=1}^{l} z_{ij}   (3)

the w'_i1, w'_i2, …, w'_il and z'_i1, z'_i2, …, z'_il above together constitute the text index of image i;
s3, construct an image retrieval index from the text index of each image: for any word w_u in the dictionary D described above, find all images i_1, i_2, …, i_o described by that word and the confidences z'_u1, z'_u2, …, z'_uo of the word in those images, and sort the images by confidence from high to low; in this way a candidate image set ordered by confidence is generated for any word in dictionary D;

s4, build a synonym-lookup lexicon: obtain text data that needs no labeling from a network text data set and train on it with the word2vec algorithm to construct a lexicon DB, in which every word w ∈ DB has a corresponding word vector, so that the semantic similarity of any two words in DB can be computed; when a query keyword w_u is not present in the dictionary D described above, find via the lexicon DB the word w_v that is closest in meaning to w_u and appears in dictionary D, and retrieve the relevant images through w_v;

s5, receive query keywords and perform image retrieval: according to steps S3 and S4, a candidate image set sorted by confidence can be generated for any word in the lexicon DB; when there are several query keywords, merge the candidate image sets generated for each keyword, and for any candidate image i occurring more than once, sum all its confidences under the different keywords as its final confidence, removing the redundant occurrences so that it appears only once; then sort the candidate images by the summed confidence from high to low and select the top candidates as the returned result.
2. The image retrieval method of claim 1, wherein in step S2, when the automatic annotation model M generates descriptors for an image, a confidence level is generated for each descriptor at the same time, which indicates the accuracy of the descriptor in describing the image; and by sequencing the confidence level, images with higher relevance to the keywords are accurately retrieved.
3. The image retrieval method of claim 1, wherein in steps S4 and S5, when the query keyword does not appear in dictionary D, a word vector is constructed for each word by building the lexicon DB; for any two word vectors v_e, v_u, the similarity of the two words is computed by formula (4), the word appearing in dictionary D whose meaning is closest to the query keyword is found, and the corresponding images are then retrieved,

\mathrm{sim}(v_e, v_u) = \frac{v_e \cdot v_u}{|v_e| \times |v_u|}   (4)

where v_e · v_u is the inner product of the two vectors and |v_e| × |v_u| the product of their lengths; the larger the value, the closer the two meanings.
4. The image retrieval method of claim 1, wherein the CNN is ResNet and the RNN is LSTM.
CN201810198490.9A 2018-03-12 2018-03-12 Image retrieval method for automatically generating text index Expired - Fee Related CN108509521B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810198490.9A CN108509521B (en) 2018-03-12 2018-03-12 Image retrieval method for automatically generating text index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810198490.9A CN108509521B (en) 2018-03-12 2018-03-12 Image retrieval method for automatically generating text index

Publications (2)

Publication Number Publication Date
CN108509521A CN108509521A (en) 2018-09-07
CN108509521B true CN108509521B (en) 2020-02-18

Family

ID=63376458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810198490.9A Expired - Fee Related CN108509521B (en) 2018-03-12 2018-03-12 Image retrieval method for automatically generating text index

Country Status (1)

Country Link
CN (1) CN108509521B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635135A (en) * 2018-11-30 2019-04-16 Oppo广东移动通信有限公司 Image index generation method, device, terminal and storage medium
CN110188775B (en) * 2019-05-28 2020-06-26 创意信息技术股份有限公司 Image content description automatic generation method based on joint neural network model
CN111243729B (en) * 2020-01-07 2022-03-08 同济大学 Automatic generation method of lung X-ray chest radiography examination report
CN112349150B (en) * 2020-11-19 2022-05-20 飞友科技有限公司 Video acquisition method and system for airport flight guarantee time node
CN112381038B (en) * 2020-11-26 2024-04-19 中国船舶工业系统工程研究院 Text recognition method, system and medium based on image
CN112148831B (en) * 2020-11-26 2021-03-19 广州华多网络科技有限公司 Image-text mixed retrieval method and device, storage medium and computer equipment
CN113204666B (en) * 2021-05-26 2022-04-05 杭州联汇科技股份有限公司 Method for searching matched pictures based on characters

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901249A (en) * 2009-05-26 2010-12-01 复旦大学 Text-based query expansion and sort method in image retrieval
CN101582080B (en) * 2009-06-22 2011-05-04 浙江大学 Web image clustering method based on image and text relevant mining
CN102360431A (en) * 2011-10-08 2012-02-22 大连海事大学 Method for automatically describing image
US8687886B2 (en) * 2011-12-29 2014-04-01 Konica Minolta Laboratory U.S.A., Inc. Method and apparatus for document image indexing and retrieval using multi-level document image structure and local features

Also Published As

Publication number Publication date
CN108509521A (en) 2018-09-07

Similar Documents

Publication Publication Date Title
CN108509521B (en) Image retrieval method for automatically generating text index
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN113268995B (en) Chinese academy keyword extraction method, device and storage medium
CN107480200B (en) Word labeling method, device, server and storage medium based on word labels
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
Li et al. Question answering over community-contributed web videos
US20190095525A1 (en) Extraction of expression for natural language processing
Gong et al. A semantic similarity language model to improve automatic image annotation
CN103136221B (en) A kind of method for generating requirement templet, demand know method for distinguishing and its device
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
Perdana et al. Instance-based deep transfer learning on cross-domain image captioning
Zhang et al. Semantic image retrieval using region based inverted file
CN115687960B (en) Text clustering method for open source security information
Tian et al. Automatic image annotation with real-world community contributed data set
CN113516202A (en) Webpage accurate classification method for CBL feature extraction and denoising
Ramachandran et al. Document Clustering Using Keyword Extraction
Anusha et al. Multi-classification and automatic text summarization of Kannada news articles
Aref Mining publication papers via text mining Evaluation and Results
Tian et al. Textual ontology and visual features based search for a paleontology digital library
Wiesen et al. Overview of uni-modal and multi-modal representations for classification tasks
CN113742520B (en) Video query and search method of dense video description algorithm based on semi-supervised learning
Joga et al. Semantic text analysis using machine learning
Wu et al. A new passage ranking algorithm for video question answering
Alkhatib, Wael, Saba Sabrin, Svenja Neitzel, and Christoph Rensing (Communication Multimedia Lab, TU Darmstadt). Training-Less …

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200218
