CN110851629A - Image retrieval method - Google Patents
- Publication number
- CN110851629A CN110851629A CN201910971299.8A CN201910971299A CN110851629A CN 110851629 A CN110851629 A CN 110851629A CN 201910971299 A CN201910971299 A CN 201910971299A CN 110851629 A CN110851629 A CN 110851629A
- Authority
- CN
- China
- Prior art keywords
- features
- feature
- information
- sample points
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/51—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Abstract
The invention discloses an image retrieval method, which comprises the steps of: selecting a metadata set to construct a database, then classifying the metadata set and dividing it into visual information and text description information; extracting text features from the text description information and category features of the visual features from the visual information, then constructing a source domain based on the visual features and an auxiliary domain based on the text features; abstracting the text features and visual features into sample points, first reducing the dimensionality of the high-dimensional sample points, then enhancing all sample points, and finally fusing the features of the enhanced sample points; and performing similarity matching with a cosine similarity method to retrieve similar images. The invention learns both the visual information of an image and its text information, thereby improving the accuracy of image retrieval.
Description
Technical Field
The invention relates to the field of retrieval, in particular to an image retrieval method.
Background
At present, the number of pictures stored on the network is growing explosively, and the number of users of different types of social networks and media is also continuously increasing. Under these conditions, the types of multimedia data uploaded by users have changed: the content users share is no longer plain visual information, but images accompanied by additional data of other types such as descriptive sentences, user-defined tags, shooting time, and scene location. Images in social networks therefore carry not only their own visual information but also text, time, and other modal information. In such a multimodal data environment, if image retrieval generates features from the visual information of an image alone, the effective clues provided by the large amount of other modal information are discarded, which directly degrades retrieval performance. How to construct representative image features for fast and effective image retrieval is therefore an urgent problem to be solved.
Conventional image retrieval methods, however, generally process only single-modal information of an image and omit other modal information such as text, time, and place. The resulting features therefore exhibit a large bias during retrieval, and it is difficult to meet users' needs for fast and effective retrieval in a multimodal data mashup environment.
Disclosure of Invention
The present invention is directed to an image retrieval method that solves the above problems of the prior art by learning both the visual information of an image and its text information, thereby improving the accuracy of image retrieval.
To achieve this object, the invention provides the following scheme: an image retrieval method comprising the following steps:
step one, selecting a lightweight multimodal image data set as a metadata set to construct a database, then classifying the metadata set and dividing it into two types of information: visual information and text description information;
step two, extracting text features from the text description information and category features of the visual features from the visual information of step one, constructing a source domain based on the visual features and an auxiliary domain based on the text features, and capturing the commonalities among the source domain, the auxiliary domain, and a target domain;
step three, abstracting the text features and visual features into sample points, first extracting the high-dimensional sample points of each feature, next reducing the dimensionality of the high-dimensional sample points, then enhancing all sample points, and finally fusing the features of the enhanced sample points;
and step four, performing similarity matching across the multiple modalities with a cosine similarity method, thereby retrieving similar images.
Preferably, the preprocessing in step one is as follows: the metadata set first extracts image features using fused convolutional layers, with fully connected layers added between the fused convolutional layers to reduce the loss of feature information.
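As a rough illustration of convolutional feature extraction with a fully connected layer inserted between convolutional layers: the patent does not give layer sizes or kernel shapes, so everything below is an assumed toy configuration in plain numpy, not the patent's architecture.

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution of a single-channel image x with kernel k."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def fully_connected(x, w, b):
    """Fully connected layer with ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 16))                 # toy single-channel image

f1 = conv2d(img, rng.standard_normal((3, 3)))       # first convolutional layer -> 14x14
# Fully connected layer inserted between the convolutional layers,
# intended to reduce the loss of feature information between stages
h = fully_connected(f1.reshape(-1),
                    rng.standard_normal((196, 64)), np.zeros(64))
f2 = conv2d(h.reshape(8, 8), rng.standard_normal((3, 3)))  # second conv -> 6x6
feature = f2.reshape(-1)                            # flattened image feature
```

The toy sizes (16x16 input, 3x3 kernels, 64-unit fully connected layer) are arbitrary; the point is only the conv → fully connected → conv arrangement the text describes.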
Preferably, the sample-point enhancement in step three proceeds as follows: each sample point undergoes one linear transformation that maps its feature expression onto a feature plane to form feature nodes, and the resulting feature nodes are passed through the nonlinear transformation of an activation function to generate enhancement nodes.
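The feature-node/enhancement-node construction above can be sketched in plain numpy. This is only an illustration of the two transformations the text names; the dimensions, random weights, and the choice of tanh as the activation function are assumptions, not the patent's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))     # 5 sample points with 20-dim features

# One linear transformation maps each sample point onto the feature plane,
# producing the feature nodes
W_f = rng.standard_normal((20, 10))
b_f = rng.standard_normal(10)
feature_nodes = X @ W_f + b_f

# The feature nodes pass through the nonlinear transformation of an
# activation function (tanh here) to generate the enhancement nodes
W_e = rng.standard_normal((10, 8))
b_e = rng.standard_normal(8)
enhancement_nodes = np.tanh(feature_nodes @ W_e + b_e)

# A simple fusion: concatenate feature nodes and enhancement nodes
fused = np.concatenate([feature_nodes, enhancement_nodes], axis=1)
```

The node arrangement mirrors the broad learning construction cited in the non-patent literature, but the concatenation step is an assumed fusion, not the patent's stated one.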
Preferably, weights are set according to the commonality among the cross-domain features, the feature vectors of the source domain, target domain, and auxiliary domain are computed from the Laplacian matrix, canonical correlation analysis (CCA) is used to efficiently reduce the dimensionality of the high-dimensional feature data and fuse it, and the fused feature vectors serve as the image features.
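The CCA-based reduction and fusion can be sketched with the classical whitening-plus-SVD formulation of canonical correlation analysis. The random data, the dimensions, the regularization term, and the concatenation at the end are illustrative assumptions; the patent does not specify them.

```python
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Project paired views X (n, dx) and Y (n, dy) onto their top-k
    canonical directions (whitening + SVD formulation of CCA)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])   # view-1 covariance
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])   # view-2 covariance
    Cxy = Xc.T @ Yc / n                              # cross-covariance
    # Whiten each view, then take the SVD of the whitened cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    A, B = Wx @ U[:, :k], Wy @ Vt.T[:, :k]           # canonical directions
    return Xc @ A, Yc @ B

rng = np.random.default_rng(0)
visual = rng.standard_normal((50, 32))   # hypothetical visual features
text = rng.standard_normal((50, 16))     # hypothetical text features

vp, tp = cca(visual, text, k=4)          # dimensionality reduction to k=4
fused = np.concatenate([vp, tp], axis=1) # fused feature vector per image
```

After whitening, the singular values are exactly the canonical correlations, and the paired projections are maximally correlated in order.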
The invention discloses the following technical effects. Unlike traditional image retrieval algorithms that extract features from single-modal information, the method integrates transfer learning into image feature construction: cross-modal transfer learning is realized using the text features and visual features of images, and the final image features are the result of adjusting the visual features according to the text features obtained by transfer learning. Because other modal information such as text, time, and place is not omitted, the generated features avoid the large retrieval bias of single-modal features and can meet users' needs for fast and effective retrieval in a multimodal data mashup environment. The method learns the visual information of an image and its text information simultaneously, thereby improving the accuracy and stability of image retrieval.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 shows the preprocessing process of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The Wiki database is used herein as the database. It comprises 2866 pictures in ten categories such as art, biology, and geography, together with each picture's corresponding text description from Wikipedia; the 2866 pictures and their text descriptions form the metadata set. Image features are first extracted using fused convolutional layers, with fully connected layers added between the fused convolutional layers to reduce the loss of feature information, thereby realizing the classified extraction of the metadata set, i.e., dividing it into two types of information: visual information and text description information.
Text features are then extracted from the text description information and category features of the visual features from the visual information; a source domain is constructed based on the visual features and an auxiliary domain based on the text features, and the commonalities among the source domain, the auxiliary domain, and a target domain are captured. Because the text features and visual features have different dimensionalities, they must be processed with a dimensionality-reduction algorithm before information fusion across modalities is possible. The high-dimensional features are first reduced in dimensionality, the reduced features are then enhanced, and the features of the enhanced sample points are finally fused.
The enhancement proceeds as follows: the features are abstracted into sample points; each sample point undergoes one linear transformation that maps its feature expression onto a feature plane to form feature nodes, and the resulting feature nodes are passed through the nonlinear transformation of an activation function to generate enhancement nodes.
Finally, similarity matching is performed with a cosine similarity method: the feature vector of the image to be retrieved is measured against a feature library, the feature indices with the highest similarity are returned, the corresponding pictures are then found in the image set and sorted in decreasing order of similarity, and the first k pictures are displayed as the retrieval result.
The similarity matching process is as follows: the fused feature vectors are extracted, and similarity is judged by comparing the cosine of the angle between feature vectors.
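The cosine-angle comparison and top-k return described above can be sketched in a few lines of numpy. The tiny 2-D "feature library" is a made-up example, not data from the patent.

```python
import numpy as np

def retrieve_top_k(query, library, k=3):
    """Rank library feature vectors by cosine similarity to the query
    and return the indices and scores of the k most similar images."""
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q                      # cosine of the angle to each vector
    order = np.argsort(-sims)[:k]       # decreasing similarity, first k
    return order, sims[order]

# Toy feature library of four fused image features (2-D for illustration)
library = np.array([[1.0, 0.0],
                    [0.6, 0.8],
                    [0.0, 1.0],
                    [0.9, 0.1]])
idx, scores = retrieve_top_k(np.array([1.0, 0.1]), library, k=2)
```

The returned indices would then be used to look up the corresponding pictures in the image set and display the first k results.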
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, are merely for convenience of description of the present invention, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.
The above-described embodiments merely illustrate preferred modes of the present invention and do not limit its scope. Various modifications and improvements made by those skilled in the art to the technical solution of the present invention without departing from its spirit shall fall within the protection scope defined by the claims.
Claims (4)
1. A method of image retrieval, comprising the steps of:
step one, selecting a lightweight multimodal image data set as a metadata set to construct a database, then classifying the metadata set and dividing it into two types of information: visual information and text description information;
step two, extracting text features from the text description information and category features of the visual features from the visual information of step one, then constructing a source domain based on the visual features and an auxiliary domain based on the text features, and capturing the commonalities among the source domain, the auxiliary domain, and a target domain;
step three, abstracting the text features and visual features into sample points, first extracting the high-dimensional sample points of each feature, next reducing the dimensionality of the high-dimensional sample points, then enhancing all sample points, and finally fusing the features of the enhanced sample points;
and step four, performing similarity matching across the multiple modalities with a cosine similarity method, thereby retrieving similar images.
2. The image retrieval method according to claim 1, wherein the preprocessing in step one is as follows: the metadata set first extracts image features using fused convolutional layers, with fully connected layers added between the fused convolutional layers to reduce the loss of feature information.
3. The image retrieval method according to claim 1, wherein the sample-point enhancement in step three proceeds as follows: each sample point undergoes one linear transformation that maps its feature expression onto a feature plane to form feature nodes, and the resulting feature nodes are passed through the nonlinear transformation of an activation function to generate enhancement nodes.
4. The image retrieval method according to claim 1, wherein step four specifically comprises: setting weights according to the commonality among the cross-domain features, computing the feature vectors of the source domain, target domain, and auxiliary domain from the Laplacian matrix, using canonical correlation analysis (CCA) to efficiently reduce the dimensionality of the high-dimensional feature data and fuse it, and taking the fused feature vectors as the image features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910971299.8A CN110851629A (en) | 2019-10-14 | 2019-10-14 | Image retrieval method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910971299.8A CN110851629A (en) | 2019-10-14 | 2019-10-14 | Image retrieval method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110851629A true CN110851629A (en) | 2020-02-28 |
Family
ID=69596312
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910971299.8A Pending CN110851629A (en) | 2019-10-14 | 2019-10-14 | Image retrieval method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110851629A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111428063A (en) * | 2020-03-31 | 2020-07-17 | 杭州博雅鸿图视频技术有限公司 | Image feature association processing method and system based on geographic spatial position division |
WO2021180109A1 (en) * | 2020-03-10 | 2021-09-16 | 华为技术有限公司 | Electronic device and search method thereof, and medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9177225B1 (en) * | 2014-07-03 | 2015-11-03 | Oim Squared Inc. | Interactive content generation |
CN108595636A (en) * | 2018-04-25 | 2018-09-28 | 复旦大学 | The image search method of cartographical sketching based on depth cross-module state correlation study |
CN108829847A (en) * | 2018-06-20 | 2018-11-16 | 山东大学 | Commodity search method and system based on multi-modal shopping preferences |
CN110298395A (en) * | 2019-06-18 | 2019-10-01 | 天津大学 | A kind of picture and text matching process based on three mode confrontation network |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9177225B1 (en) * | 2014-07-03 | 2015-11-03 | Oim Squared Inc. | Interactive content generation |
CN108595636A (en) * | 2018-04-25 | 2018-09-28 | 复旦大学 | The image search method of cartographical sketching based on depth cross-module state correlation study |
CN108829847A (en) * | 2018-06-20 | 2018-11-16 | 山东大学 | Commodity search method and system based on multi-modal shopping preferences |
CN110298395A (en) * | 2019-06-18 | 2019-10-01 | 天津大学 | A kind of picture and text matching process based on three mode confrontation network |
Non-Patent Citations (4)
Title |
---|
Li Xiaoyu et al.: "Image Retrieval Algorithm Based on Transfer Learning", Computer Science *
Wang Yiding et al.: "Digital Image Processing", 31 August 2015 *
Jia Chen et al.: "Multimodal Information Fusion Based on the Broad Learning Method", CAAI Transactions on Intelligent Systems *
Guo Baolong et al.: "Introduction to Digital Image Processing Systems Engineering", 31 July 2012 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021180109A1 (en) * | 2020-03-10 | 2021-09-16 | 华为技术有限公司 | Electronic device and search method thereof, and medium |
CN111428063A (en) * | 2020-03-31 | 2020-07-17 | 杭州博雅鸿图视频技术有限公司 | Image feature association processing method and system based on geographic spatial position division |
CN111428063B (en) * | 2020-03-31 | 2023-06-30 | 杭州博雅鸿图视频技术有限公司 | Image feature association processing method and system based on geographic space position division |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111159409B (en) | Text classification method, device, equipment and medium based on artificial intelligence | |
CN114332680A (en) | Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium | |
CN113627447A (en) | Label identification method, label identification device, computer equipment, storage medium and program product | |
CN116580257A (en) | Feature fusion model training and sample retrieval method and device and computer equipment | |
EP4310695A1 (en) | Data processing method and apparatus, computer device, and storage medium | |
CN115062134B (en) | Knowledge question-answering model training and knowledge question-answering method, device and computer equipment | |
CN115269913A (en) | Video retrieval method based on attention fragment prompt | |
CN114332679A (en) | Video processing method, device, equipment, storage medium and computer program product | |
JP7181999B2 (en) | SEARCH METHOD AND SEARCH DEVICE, STORAGE MEDIUM | |
CN112085120A (en) | Multimedia data processing method and device, electronic equipment and storage medium | |
CN110851629A (en) | Image retrieval method | |
Lu et al. | Web multimedia object classification using cross-domain correlation knowledge | |
CN114330704A (en) | Statement generation model updating method and device, computer equipment and storage medium | |
CN116578738B (en) | Graph-text retrieval method and device based on graph attention and generating countermeasure network | |
CN110580294B (en) | Entity fusion method, device, equipment and storage medium | |
JP2012194691A (en) | Re-learning method and program of discriminator, image recognition device | |
CN116977701A (en) | Video classification model training method, video classification method and device | |
CN114398973B (en) | Media content tag identification method, device, equipment and storage medium | |
CN116955707A (en) | Content tag determination method, device, equipment, medium and program product | |
CN115204301A (en) | Video text matching model training method and device and video text matching method and device | |
CN114443916A (en) | Supply and demand matching method and system for test data | |
CN113297485A (en) | Method for generating cross-modal representation vector and cross-modal recommendation method | |
CN112287159A (en) | Retrieval method, electronic device and computer readable medium | |
CN111581335A (en) | Text representation method and device | |
CN115168599B (en) | Multi-triplet extraction method, device, equipment, medium and product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20200228 |