CN111737507A

CN111737507A - A Single-modal Image Hash Retrieval Method

Info

Publication number: CN111737507A
Application number: CN202010577850.3A
Authority: CN
Inventors: 凌泽乐; 高岩; 高明; 金长新
Original assignee: Inspur Group Co Ltd
Current assignee: Inspur Group Co Ltd
Priority date: 2020-06-23
Filing date: 2020-06-23
Publication date: 2020-10-02

Abstract

The invention relates to a single-modal image hash retrieval method comprising four parts: image preprocessing, image feature extraction, attention image output, and hash retrieval model generation. The method extracts the semantic information in the picture modality through an attention mechanism, which improves the quality of the hash function generated by the hash retrieval model. It also uses multi-level semantic supervision to improve retrieval precision among multi-label data, so that the best-matching items appear at the front of the final retrieval results, thereby greatly improving retrieval efficiency.

Description

A Single-modal Image Hash Retrieval Method

Technical Field

The invention relates to the technical field of image retrieval, and in particular to a single-modal image hash retrieval method.

Background

With the advancement of science and technology, Internet technology has developed rapidly, and image and video data have grown explosively. Conventional image retrieval comprises two approaches: text-based image retrieval (TBIR) and content-based image retrieval (CBIR). Text-based image retrieval describes the features of an image with text, such as the author, period, school, and size of a painting. Content-based image retrieval analyzes and retrieves images by their content semantics, such as color, texture, and layout. At present, content-based image retrieval has become the mainstream image retrieval method.

Image hash retrieval aims to search an existing data set for the image data that meets the requirements. Because hash codes require little storage and support fast retrieval, hash retrieval is widely used in retrieval tasks. Existing image hash retrieval techniques fall into two categories: deep-model retrieval and non-deep-model retrieval. The conventional approach uses a deep network to extract image features and then, based on the extracted features, uses a fully connected network trained with a cross-entropy loss to convert the samples into hash codes, which are stored in a database.

In real environments, an image contains a great deal of rich information and often carries information about several classes, so conventional approaches that target a single class are often not accurate enough. In addition, the redundant information in the image background and the information in regions that deserve attention have equal status during hash learning. Most existing hash retrieval models tend to attend only to the information in the regions that deserve attention and cannot make full use of all the image information.

Based on the above problems, the present invention proposes a single-modal image hash retrieval method.

Summary of the Invention

To remedy the defects of the prior art, the present invention provides a simple and efficient single-modal image hash retrieval method.

The present invention is achieved through the following technical solution:

A single-modal image hash retrieval method, characterized in that it comprises four parts: image preprocessing, image feature extraction, attention image output, and hash retrieval model generation.

First, a multi-level semantic similarity matrix is defined to preserve the rich semantic information in multi-label data. At the same time, an Attention mechanism is used to automatically find the key regions of the image: a mask of the same size as the image representation is generated by learning, which extracts the semantic information in the picture modality and helps the hash retrieval model obtain a higher-quality hash function.

The specific implementation steps of the single-modal image hash retrieval method of the present invention are as follows:

The first step is to obtain the original images of the training set and feed the images into their corresponding residual networks;

The second step is to input the training samples into the hash retrieval model and optimize the parameters of the hash retrieval model by minimizing the loss function;

The third step is to fix the model, pass all samples through the hash retrieval model to obtain their hash codes, and store the codes in the database for later use;

The fourth step is retrieval: to use the hash retrieval model for a retrieval task, simply input a sample of either modality into the model to generate the hash code for that modality, then search the hash-code database of the other modality for the N hash codes with the smallest Hamming distance (N is chosen as needed), and return the corresponding samples.
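The third and fourth steps can be illustrated with a minimal sketch. The text does not say how real-valued model outputs become bits, so the sign-based binarization below is an assumption, and the function names (`to_hash_code`, `hamming_distance`, `retrieve_top_n`) and sample data are hypothetical.

```python
def to_hash_code(features):
    """Pack real-valued features into an int, one bit per feature.

    Assumption: a positive feature maps to bit 1, anything else to bit 0.
    """
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(a, b):
    """Count the bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")


def retrieve_top_n(query_code, database, n):
    """Return the ids of the n stored codes closest to the query.

    Ties are broken by sample id so the ordering is deterministic.
    """
    ranked = sorted(
        database.items(),
        key=lambda item: (hamming_distance(query_code, item[1]), item[0]),
    )
    return [sample_id for sample_id, _ in ranked[:n]]


# Hypothetical hash-code database built once after training (third step).
database = {
    "img_a": to_hash_code([0.9, -0.2, 0.4, -0.7]),  # bits 1010
    "img_b": to_hash_code([0.8, 0.3, 0.5, -0.1]),   # bits 1110
    "img_c": to_hash_code([-0.6, 0.2, -0.3, 0.9]),  # bits 0101
}

# Fourth step: hash the query and look up its nearest neighbours.
query = to_hash_code([0.7, -0.1, 0.6, -0.4])        # bits 1010
top_two = retrieve_top_n(query, database, 2)
print(top_two)
```

A linear scan like this is enough to show the idea; a real system with a large database would likely use a bucketed or multi-index lookup instead.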

In the second step, the model parameters are optimized iteratively, that is, one parameter is fixed while the other parameters are optimized.

In the second step, optimizing the hash retrieval model includes the following steps:

(1) Generate a similarity matrix S with multi-level semantics;

(2) Extract the features of the picture modality to obtain the image modality feature P_i, perform a classification task on the image, and output the attention image;

(3) Dot-multiply the obtained feature image with the attention image to obtain the feature representation F_i of the picture modality and the feature representation F_j of the text modality;

(4) Use the loss function to iteratively optimize the hash retrieval model, finally obtaining the optimized hash retrieval model.

In step (1), the similarity matrix S with multi-level semantics is expressed as:

where |C_i| and |C_j| denote the number of categories of sample i and sample j, respectively, and D(i,j) denotes the number of categories shared by the two samples. Each entry S_ij of the similarity matrix for sample i and sample j lies in [0,1], which makes the generated matrix S more discriminative.
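The formula for S does not survive in this text, so the following is only a sketch of one graded similarity that uses the same three quantities and stays in [0,1]. The cosine-style normalization D(i,j) / sqrt(|C_i| · |C_j|) is an assumption chosen for illustration and may differ from the patent's definition; the function name and label sets are hypothetical.

```python
import math


def multilevel_similarity(labels):
    """Build a graded similarity matrix from per-sample label sets.

    Assumed form (not taken from the source):
        S[i][j] = D(i, j) / sqrt(|C_i| * |C_j|)
    where D(i, j) is the number of shared categories.
    """
    n = len(labels)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = len(labels[i] & labels[j])                   # D(i, j)
            denom = math.sqrt(len(labels[i]) * len(labels[j]))    # sqrt(|C_i||C_j|)
            S[i][j] = shared / denom if denom else 0.0
    return S


# Hypothetical multi-label samples.
labels = [
    {"sky", "sea"},
    {"sky", "sea", "boat", "person"},
    {"car"},
]
S = multilevel_similarity(labels)
for row in S:
    print([round(value, 3) for value in row])
```

Unlike a binary similar/dissimilar label, a graded value lets samples that share two of four categories rank between identical and unrelated samples, which is the kind of multi-level distinction the text describes.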

In step (2), a Resnet101 network is used for extraction, yielding the image modality feature P_i. At the same time, a Resnet01 network is used with its fully connected layer removed and an average pooling layer added; its output is the sample category data, and it performs a classification task on the image. An Attention mechanism is added at the last layer, whose output is the attention image that activates the regions deserving attention.

In step (3), the obtained feature image is dot-multiplied with the attention image, and the result is fed into a fully connected layer to obtain the feature representation F_i of the picture modality; the BOW (bag of words) representation of the text modality is fed into a fully connected layer to obtain the feature representation F_j of the text modality.
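The masking in step (3) can be sketched in a few lines. This assumes "dot multiplication" means an element-wise product, which fits the statement that the mask has the same size as the image representation. The tiny 2×2 maps, the weights, and the function names are hypothetical stand-ins for the network outputs.

```python
def apply_attention(feature_map, attention_map):
    """Element-wise product of a feature map and a same-sized attention mask."""
    return [
        [f * a for f, a in zip(feature_row, attention_row)]
        for feature_row, attention_row in zip(feature_map, attention_map)
    ]


def fully_connected(vector, weights, bias):
    """Plain affine layer: one output per weight row."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical feature map and attention mask (values near 0 suppress background).
feature_map = [[1.0, 2.0], [3.0, 4.0]]
attention_map = [[1.0, 0.0], [0.5, 0.0]]

masked = apply_attention(feature_map, attention_map)
flat = [value for row in masked for value in row]

# Hypothetical weights projecting the masked features to a 2-bit representation.
weights = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, -1.0, 0.0]]
bias = [0.0, 0.0]
F_i = fully_connected(flat, weights, bias)
print(masked, F_i)
```

Positions where the mask is zero contribute nothing to F_i, which is how the attended regions come to dominate the resulting hash code.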

In step (4), the loss function is expressed as:

where S_ij is the similarity matrix entry for sample i and sample j, σ is a hyperparameter that balances the penalty terms against the data loss term, F_i^T is the transpose of the feature representation of the picture modality, F_j is the feature representation of the text modality, L₂ is the common quantization loss, and L₃ is the bit balance loss.
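The loss formula itself is missing from this text, so the sketch below only shows how terms of the kinds named above are often combined. Every concrete form here is an assumption, not the patent's definition: a pairwise negative log-likelihood on Θ_ij = ½ · F_i^T F_j as the data term, a squared distance to the sign vector as the quantization loss L₂, and squared per-bit sums as the bit balance loss L₃, with the total taken as data term + σ · (L₂ + L₃).

```python
import math


def pairwise_loss(S, F_img, F_txt):
    """Assumed data term: sum of log(1 + exp(theta)) - S_ij * theta,
    with theta = 0.5 * <F_i, F_j>."""
    loss = 0.0
    for i, f_i in enumerate(F_img):
        for j, f_j in enumerate(F_txt):
            theta = 0.5 * sum(a * b for a, b in zip(f_i, f_j))
            loss += math.log1p(math.exp(theta)) - S[i][j] * theta
    return loss


def quantization_loss(F):
    """Assumed L2: squared gap between each feature and its sign (+1 or -1)."""
    return sum(((1.0 if v >= 0 else -1.0) - v) ** 2 for row in F for v in row)


def bit_balance_loss(F):
    """Assumed L3: squared sum of each bit over all samples (zero when balanced)."""
    return sum(sum(column) ** 2 for column in zip(*F))


def total_loss(S, F_img, F_txt, sigma):
    """Data term plus sigma-weighted penalty terms (assumed combination)."""
    penalty = (
        quantization_loss(F_img) + quantization_loss(F_txt)
        + bit_balance_loss(F_img) + bit_balance_loss(F_txt)
    )
    return pairwise_loss(S, F_img, F_txt) + sigma * penalty


# Hypothetical check: a similar pair (S = 1) is cheaper when features agree.
aligned = pairwise_loss([[1.0]], [[1.0, 1.0]], [[1.0, 1.0]])
opposed = pairwise_loss([[1.0]], [[1.0, 1.0]], [[-1.0, -1.0]])
print(aligned < opposed)
```

Under these assumptions, the data term pulls similar pairs toward matching codes, L₂ pushes features toward ±1 so that binarization loses little, and L₃ encourages each bit to be 1 for about half of the samples.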

The beneficial effects of the present invention are as follows: the single-modal image hash retrieval method extracts the semantic information in the picture modality through an attention mechanism, which improves the quality of the hash function generated by the hash retrieval model. It also uses multi-level semantic supervision to improve retrieval precision among multi-label data, so that the best-matching items appear at the front of the final retrieval results, thereby greatly improving retrieval efficiency.

Description of the Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of the single-modal image hash retrieval method of the present invention.

Detailed Description

To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the embodiments of the present invention. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

The single-modal image hash retrieval method comprises four parts: image preprocessing, image feature extraction, attention image output, and hash retrieval model generation.

First, a multi-level semantic similarity matrix is defined to preserve the rich semantic information in multi-label data. At the same time, an Attention mechanism is used to automatically find the key regions of the image: a mask of the same size as the image representation is generated by learning, which extracts the semantic information in the picture modality and helps the hash retrieval model obtain a higher-quality hash function.

The specific implementation steps of the single-modal image hash retrieval method are as follows:

Compared with the prior art, the single-modal image hash retrieval method has the following characteristics:

First, the regions of interest are separated from the redundant data and the key regions are highlighted, which improves retrieval efficiency.

In earlier image retrieval, most existing hash retrieval models could not make full use of the image information: the background and redundant information in an image and the information in regions that deserve attention had equal status during hash learning. The method therefore separates the background and redundant information from the regions that deserve attention, which can greatly improve retrieval efficiency.

Second, through the Attention mechanism, the hash retrieval model can attend to more discriminative regions, which improves the quality of the hash function generated by the model.

In recent years, the Attention mechanism has been widely used in computer vision with good results. Applied to image recognition, it can automatically find the places in an image that need attention: it learns a mask of the same size as the image representation, and the mask positions corresponding to regions of interest receive higher attention. Integrating the Attention mechanism into the hash retrieval method makes the method more interpretable and improves retrieval efficiency.

Third, multi-level semantic supervision improves retrieval precision among multi-label data, so that the best-matching items appear at the front of the final retrieval results.

The embodiment described above is only one specific embodiment of the present invention. Ordinary changes and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims

1. A single-modal image hash retrieval method, characterized in that it comprises four parts: image preprocessing, image feature extraction, attention image output, and hash retrieval model generation;

first, a multi-level semantic similarity matrix is defined to preserve the rich semantic information in multi-label data; at the same time, an Attention mechanism is used to automatically find the key regions of the image, and a mask of the same size as the image representation is generated by learning, which extracts the semantic information in the picture modality and helps the hash retrieval model obtain a higher-quality hash function.

2. The single-modal image hash retrieval method according to claim 1, wherein the specific implementation steps are as follows:

The first step is to obtain the original images of the training set and feed the images into their corresponding residual networks;

The second step is to input the training samples into the hash retrieval model and optimize the parameters of the hash retrieval model by minimizing the loss function;

The third step is to fix the model, pass all samples through the hash retrieval model to obtain their hash codes, and store the codes in the database for later use;

The fourth step is retrieval: to use the hash retrieval model for a retrieval task, simply input a sample of either modality into the model to generate the hash code for that modality, then search the hash-code database of the other modality for the N hash codes with the smallest Hamming distance, and return the corresponding samples.

3. The single-modal image hash retrieval method according to claim 2, wherein in the second step, the model parameters are optimized iteratively, that is, one parameter is fixed while the other parameters are optimized.

4. The single-modal image hash retrieval method according to claim 3, wherein in the second step, optimizing the hash retrieval model comprises the following steps:

(1) Generate a similarity matrix S with multi-level semantics;

(2) Extract the features of the picture modality to obtain the image modality feature P_i, perform a classification task on the image, and output the attention image;

(3) Dot-multiply the obtained feature image with the attention image to obtain the feature representation F_i of the picture modality and the feature representation F_j of the text modality;

(4) Use the loss function to iteratively optimize the hash retrieval model, finally obtaining the optimized hash retrieval model.

5. The single-modal image hash retrieval method according to claim 4, wherein in step (1), the similarity matrix S with multi-level semantics is expressed as:

where |C_i| and |C_j| denote the number of categories of sample i and sample j, respectively, and D(i,j) denotes the number of categories shared by the two samples; each entry S_ij of the similarity matrix for sample i and sample j lies in [0,1], which makes the generated matrix S more discriminative.

6. The single-modal image hash retrieval method according to claim 4, wherein in step (2), a Resnet101 network is used for extraction, yielding the image modality feature P_i; at the same time, a Resnet01 network is used with its fully connected layer removed and an average pooling layer added, its output is the sample category data, and it performs a classification task on the image.

7. The single-modal image hash retrieval method according to claim 6, wherein in step (3), the obtained feature image is dot-multiplied with the attention image, and the result is fed into a fully connected layer to obtain the feature representation F_i of the picture modality; the BOW representation of the text modality is fed into a fully connected layer to obtain the feature representation F_j of the text modality.

8. The single-modal image hash retrieval method according to claim 4, 5, 6 or 7, wherein in step (4), the loss function is expressed as:

where S_ij is the similarity matrix entry for sample i and sample j, σ is a hyperparameter that balances the penalty terms against the data loss term, F_i^T is the transpose of the feature representation of the picture modality, F_j is the feature representation of the text modality, L₂ is the common quantization loss, and L₃ is the bit balance loss.