CN109871454B - Robust discrete supervision cross-media hash retrieval method - Google Patents

Robust discrete supervision cross-media hash retrieval method

Info

Publication number: CN109871454B
Application number: CN201910096204.2A
Authority: CN (China)
Prior art keywords: samples, sample, similarity matrix, text, hash
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN109871454A (en)
Inventors: 姚涛, 闫连山, 吕高焕, 崔光海, 岳峻
Current assignee: Ludong University
Original assignee: Ludong University
Application filed by Ludong University
Priority to CN201910096204.2A
Publication of CN109871454A
Application granted; publication of CN109871454B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a robust discrete supervised cross-media hash retrieval method that realizes content-based cross-media retrieval by learning a robust pairwise similarity matrix to mine the semantic associations among heterogeneous samples. The method comprises the following steps: establishing an image-text data set and extracting visual and textual features from the image and text samples in the data set; constructing pairwise similarity matrices from the class labels, image features, and text features of the samples, and learning a robust pairwise similarity matrix by exploiting the low-rank property of the pairwise similarity matrices and the sparsity of sample noise; learning more discriminative hash codes from the robust pairwise similarity matrix; applying an l2,1-norm regularization term to the hash functions to learn more robust hash functions; and providing a discrete iterative optimization algorithm that directly obtains a discrete solution for the hash codes. By learning a robust pairwise similarity matrix, the method effectively resists noise that may exist in the samples, thereby greatly improving the performance of multimedia retrieval.

Description

Robust discrete supervision cross-media hash retrieval method
Technical field:
the invention relates to a robust discrete supervised cross-modal hash retrieval method, belonging to the fields of multimedia retrieval and machine learning.
Background technology:
in recent years, a huge amount of data has been generated on the Internet every day, which poses a great challenge to the task of multimedia retrieval; finding approximate nearest samples efficiently and effectively is therefore an urgent need. Hash methods map samples from the original feature space to Hamming space by learning a set of hash functions, and have attracted great interest from researchers because of their fast computation and low memory footprint in large-scale applications. Hash codes are far cheaper to store than the original features, and the similarity between samples can be computed quickly with XOR operations in Hamming space. Hash methods have been widely studied, but most focus on only one modality. However, samples with the same semantics on the Internet are often represented in multiple modalities, which leads to a heterogeneous semantic gap between modalities; for example, an image may be represented by visual features and corresponding text features. In addition, when a user submits a query sample to a search engine, the user prefers that the engine return similar samples from multiple modalities. Cross-media retrieval is therefore attracting more and more attention. The goal of cross-media hashing is to map heterogeneous samples into a shared Hamming space in which the similarity structure of the samples is maintained; in particular, similar heterogeneous samples have a small Hamming distance in the shared Hamming space, and vice versa. Depending on whether class labels are used during training, cross-media hashing methods can generally be divided into two categories: unsupervised and supervised. The former typically learns hash codes by preserving the intra- and inter-modal similarity of samples, while the latter can learn more discriminative hash codes by further incorporating class labels. Recent work has shown that incorporating the class labels of samples can improve retrieval performance.
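As a minimal illustration of the XOR-based distance computation mentioned above (an illustrative sketch, not part of the patent), the Hamming distance between two hash codes packed into integers can be computed as:

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two hash codes packed into integers.

    XOR leaves a 1 exactly where the codes disagree; counting the 1 bits
    gives the Hamming distance.
    """
    return bin(code_a ^ code_b).count("1")

# Two hypothetical 8-bit hash codes differing in two bit positions:
print(hamming_distance(0b10110010, 0b10011010))  # -> 2
```

This constant-time bit counting is what makes nearest-neighbor search in Hamming space so much faster than distance computation in the original feature space.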
While many supervised cross-modal hash methods have been proposed and have achieved satisfactory results, some problems remain to be solved. First, in the real world, samples may contain noise. Most supervised cross-modal hashing methods construct the pairwise similarity matrix using only the class labels of the training data, without considering noise in the samples, such as outliers. These noise samples seriously damage the structure of the pairwise similarity matrix, mislead the learning of the hash codes, and reduce retrieval performance. Second, the mixed-integer optimization problem caused by the discrete constraints on the hash codes is generally difficult to solve; most methods first relax the discrete constraints to obtain a continuous solution and then quantize that solution to generate the hash codes. However, quantization causes information loss, which degrades the discriminative power of the hash codes.
Summary of the invention:
the invention aims to overcome the defects of the prior art and provide a robust discrete supervised cross-modal hash retrieval method that has better learning performance, improved algorithm performance, stronger noise resistance, and more discriminative hash codes, and that is suitable for cross-media retrieval of real network data.
The object of the invention can be achieved by providing the following measures: a robust discrete supervision cross-modal hash retrieval method is characterized by comprising the following steps:
the first step: collecting image and text sample pairs containing class labels to form a cross-modal retrieval image-text data set corresponding to the images and the texts one by one;
and a second step of: extracting features from the image and text modality samples respectively, and subtracting the mean from each modality's features so that the feature data of both modalities have zero mean;
and a third step of: randomly dividing all sample pairs in a data set into a training set and a testing set;
fourth step: constructing a pairwise similarity matrix from each of the class labels, the image features, and the text features of the training sample pairs, and learning a robust pairwise similarity matrix by using the low-rank property of the pairwise similarity matrices and the sparsity of sample noise. Let the features of the training sample pairs be X = {X^(1), X^(2)}, where X^(1) ∈ R^(d1×N) denotes the image-modality features in the training set and X^(2) ∈ R^(d2×N) the text-modality features; d1 and d2 denote the feature dimensions of the image and text modalities, and N denotes the number of image or text samples in the training set. The class labels of the sample pairs are denoted by L ∈ {0,1}^(N×c), where c denotes the number of classes and l_i ∈ {0,1}^c; l_ij = 1 indicates that the i-th sample belongs to the j-th class, and conversely l_ij = 0 indicates that the i-th sample does not belong to the j-th class. Learning the robust pairwise similarity matrix comprises the following steps:
(1) Computing the pairwise similarity matrix of the image modality from the image features, defined as:
S^(1)_ij = exp(−‖x^(1)_i − x^(1)_j‖²_F / σ1)
where ‖·‖_F denotes the Frobenius norm, S^(1) denotes the pairwise similarity matrix of the image modality, S^(1)_ij the similarity between the i-th and j-th image samples, and σ1 a scale parameter;
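The Gaussian-kernel pairwise similarity above can be sketched as follows. The vectorized distance computation and the division by σ (rather than, say, 2σ²) are illustrative assumptions consistent with the text, not the patent's exact implementation:

```python
import numpy as np

def gaussian_similarity(X: np.ndarray, sigma: float) -> np.ndarray:
    """Pairwise similarity S_ij = exp(-||x_i - x_j||^2 / sigma)
    for samples stored as the columns of X (features x samples)."""
    sq = np.sum(X ** 2, axis=0)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.exp(-np.maximum(d2, 0.0) / sigma)  # clamp tiny negative round-off

# Hypothetical image features: 4 dimensions, 6 samples
S1 = gaussian_similarity(np.random.randn(4, 6), sigma=0.8)
```

The same routine would serve for the text modality in step (2), with scale parameter σ2.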
(2) Computing the pairwise similarity matrix of the text modality from the text features, defined as:
S^(2)_ij = exp(−‖x^(2)_i − x^(2)_j‖²_F / σ2)
where S^(2) denotes the pairwise similarity matrix of the text modality, S^(2)_ij the similarity between the i-th and j-th text samples, and σ2 a scale parameter;
(3) Computing the label-based pairwise similarity matrix from the class labels of the sample pairs, defined as:
S^(3)_ij = (l_i l_jᵀ) / (‖l_i‖2 ‖l_j‖2)
where S^(3) denotes the pairwise similarity matrix of the labels and S^(3)_ij the similarity between the labels of the i-th and j-th sample pairs;
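The label-based similarity can be sketched as cosine similarity between the binary label vectors (an assumption consistent with the normalized inner-product form above):

```python
import numpy as np

def label_similarity(L: np.ndarray) -> np.ndarray:
    """Cosine similarity between the rows of the N x c binary label matrix L."""
    norms = np.linalg.norm(L, axis=1, keepdims=True)
    Ln = L / np.maximum(norms, 1e-12)  # guard against all-zero label rows
    return Ln @ Ln.T

# Three samples, two classes: samples 0 and 2 share no class
L = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
S3 = label_similarity(L)  # S3[0,1] = 1/sqrt(2), S3[0,2] = 0
```

Samples with identical label sets get similarity 1, disjoint label sets get 0, and partial overlaps fall in between.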
(4) The objective function for learning the robust pairwise similarity matrix is defined as:
min_{S, E^(i)} rank(S) + λ Σ_{i=1..3} ‖E^(i)‖0   s.t. S^(i) = S + E^(i), i = 1, 2, 3
where S denotes the learned robust pairwise similarity matrix, E^(i) the noise in the i-th pairwise similarity matrix, rank(·) the rank of a matrix, ‖·‖0 the l0 norm, and λ a weight parameter;
(5) Because of the discrete rank and l0-norm terms, the objective in (4) is difficult to solve directly; relaxing these two terms yields an approximate problem, so the above formula can be rewritten as:
min_{S, E^(i)} ‖S‖* + λ Σ_{i=1..3} ‖E^(i)‖1   s.t. S^(i) = S + E^(i), i = 1, 2, 3
where ‖·‖* denotes the nuclear norm and ‖·‖1 the l1 norm;
(6) Solving the above problem with the augmented Lagrange multiplier (ALM) method to obtain the robust pairwise similarity matrix;
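Step (6) can be sketched with the standard inexact-ALM (ADMM-style) updates for nuclear-norm-plus-l1 problems: singular value thresholding for S and entrywise soft thresholding for each E^(i). This is a hedged illustration; the weight `lam`, penalty `rho`, and iteration count are assumptions, not values from the patent.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft thresholding: proximal operator of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def robust_similarity(S_views, lam=0.5, rho=1.0, iters=200):
    """Sketch of: min ||S||_* + lam * sum_i ||E_i||_1  s.t.  S_i = S + E_i."""
    m = len(S_views)
    S = np.mean(S_views, axis=0)
    E = [np.zeros_like(S) for _ in range(m)]
    Y = [np.zeros_like(S) for _ in range(m)]  # Lagrange multipliers
    for _ in range(iters):
        # S-step: nuclear-norm prox at the average of the corrected views
        A = np.mean([Sv - Ei + Yi / rho
                     for Sv, Ei, Yi in zip(S_views, E, Y)], axis=0)
        S = svt(A, 1.0 / (m * rho))
        for i in range(m):
            # E-step: l1 prox, then dual ascent on the constraint residual
            E[i] = soft(S_views[i] - S + Y[i] / rho, lam / rho)
            Y[i] += rho * (S_views[i] - S - E[i])
    return S, E
```

The multiplier updates drive the constraints S^(i) = S + E^(i) toward satisfaction while the two proximal steps keep S low-rank and each E^(i) sparse.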
fifth step: constructing an objective function, specifically comprising the following steps:
(1) The similarity encoded in the robust pairwise similarity matrix is preserved in Hamming space, and since the image and text samples of a pair share the same class labels, the distance between their hash codes should be as small as possible. The objective function of hash code learning is therefore defined as:
min_{B1,B2} ‖kS − B1ᵀB2‖²_F + λ‖B1 − B2‖²_F   s.t. B1, B2 ∈ {−1, 1}^(k×N)
where k denotes the length of the hash code, B1 the hash codes of the image-modality samples, B2 the hash codes of the text-modality samples, and λ a weight parameter;
(2) Linear mappings are used as hash functions, and the l2,1 norm is used as a regularization term to constrain the learning of the image and text modality hash functions and enhance their robustness to noise. The objective for learning each modality's hash function is therefore defined as:
min_{Wi} βi ‖Bi − WiᵀX^(i)‖²_F + μ Reg(Wi), i = 1, 2
where W1 and W2 denote the hash functions of the image and text modalities respectively, Reg(·) denotes the regularization term preventing overfitting, here Reg(Wi) = ‖Wi‖_{2,1}, and βi and μ are weight parameters;
(3) Adding the objectives of hash code learning and hash function learning yields the overall objective function of the method, defined as:
min_{B1,B2,W1,W2} ‖kS − B1ᵀB2‖²_F + λ‖B1 − B2‖²_F + Σ_{i=1,2} βi ‖Bi − WiᵀX^(i)‖²_F + μ Σ_{i=1,2} ‖Wi‖_{2,1}   s.t. B1, B2 ∈ {−1, 1}^(k×N)
where βi are weight parameters;
sixth step: because the objective function contains several unknown variables as well as discrete constraints on the hash codes, it is difficult to solve directly; however, when all but one variable are fixed, the subproblem in the remaining variable is a convex optimization problem, so the objective can be solved with an iterative optimization algorithm. The solving process comprises the following steps:
(1) Fix W1, W2, and B2, and solve for B1.
Removing constant terms, the objective function can be written as:
min_{B1} ‖B1ᵀB2‖²_F − 2 tr(B1ᵀQ1)   s.t. B1 ∈ {−1, 1}^(k×N), with Q1 = kB2Sᵀ + λB2 + β1W1ᵀX^(1)
Because B1 is discrete, this problem is difficult to solve directly; it can be solved sample by sample. Let b_{1i} denote the i-th column of B1 and b_{2j} the j-th column of B2; removing constant terms, the per-sample objective can be written as:
min_{b_{1i}} b_{1i}ᵀ(B2B2ᵀ)b_{1i} − 2 q_{1i}ᵀ b_{1i}   s.t. b_{1i} ∈ {−1, 1}^k
where q_{1i} is the i-th column of Q1. This problem is still difficult to solve directly, so it is solved bit by bit with the cyclic coordinate descent method. Let b_{1im} denote the m-th bit of b_{1i} and b̂_{1i} the vector of the bits of b_{1i} other than the m-th; b_{1im} is obtained by:
b_{1im} = sgn(q_{1im} − M̂_m b̂_{1i})
where M = B2B2ᵀ and M̂_m denotes the m-th row of M with its m-th entry removed;
repeating the above steps until the hash codes of all the image mode samples are solved;
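The sample-by-sample, bit-by-bit sweep above can be sketched as follows. The composition of the linear-term matrix Q1 (k·B2·Sᵀ + λ·B2 + β1·W1ᵀ·X^(1)) is an assumption reconstructed from the objective; here it is simply taken as a precomputed input.

```python
import numpy as np

def update_B1(Q1: np.ndarray, B2: np.ndarray, B1: np.ndarray) -> np.ndarray:
    """Cyclic coordinate descent for
    min_{B1 in {-1,+1}^{k x N}}  sum_i b_i^T (B2 B2^T) b_i - 2 tr(B1^T Q1)."""
    k, N = B1.shape
    M = B2 @ B2.T  # k x k coupling matrix between bit positions
    for i in range(N):          # sample by sample
        for m in range(k):      # bit by bit
            # influence of sample i's other bits, coupled through row m of M
            rest = M[m, :] @ B1[:, i] - M[m, m] * B1[m, i]
            B1[m, i] = 1.0 if Q1[m, i] - rest >= 0 else -1.0
    return B1
```

Each bit update minimizes the objective with every other bit held fixed, so repeated sweeps monotonically decrease this fixed-variable subproblem.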
(2) Fix W1, W2, and B1, and solve for B2.
In the same way as for B1, one obtains:
b_{2jm} = sgn(q_{2jm} − M̂′_m b̂_{2j})
where Q2 = kB1S + λB1 + β2W2ᵀX^(2), M′ = B1B1ᵀ, and q_{2j} is the j-th column of Q2;
Repeating the above steps until the hash codes of all text mode samples are solved;
(3) Fix W2, B1, and B2, and solve for W1.
Removing constant terms, the objective function can be written as:
min_{W1} β1 ‖B1 − W1ᵀX^(1)‖²_F + μ ‖W1‖_{2,1}
This problem has a closed-form solution:
W1 = (X^(1)X^(1)ᵀ + (μ/β1) D1)^(−1) X^(1) B1ᵀ
where D1 is a diagonal matrix whose j-th diagonal entry is 1/(2‖w¹_j‖2), w¹_j denoting the j-th row of W1;
(4) Fix W1, B1, and B2, and solve for W2.
In the same way as for W1, W2 has the closed-form solution:
W2 = (X^(2)X^(2)ᵀ + (μ/β2) D2)^(−1) X^(2) B2ᵀ
where D2 is a diagonal matrix whose j-th diagonal entry is 1/(2‖w²_j‖2), w²_j denoting the j-th row of W2;
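Because the diagonal matrix D depends on the row norms of W, steps (3) and (4) amount to an iteratively reweighted closed-form solve: recompute D from the current W, then re-solve. The sketch below assumes this standard treatment of the l2,1 regularizer; the parameter values and the ridge initialization are illustrative assumptions.

```python
import numpy as np

def solve_W(X, B, mu_over_beta=1e-6, iters=10, eps=1e-8):
    """Iteratively reweighted solve of
    min_W ||B - W^T X||_F^2 + (mu/beta) * ||W||_{2,1},
    via W = (X X^T + (mu/beta) D)^{-1} X B^T with D_jj = 1/(2 ||w_j||_2)."""
    d = X.shape[0]
    XXt = X @ X.T
    XBt = X @ B.T
    W = np.linalg.solve(XXt + mu_over_beta * np.eye(d), XBt)  # ridge init
    for _ in range(iters):
        row_norms = np.maximum(np.linalg.norm(W, axis=1), eps)  # guard zeros
        D = np.diag(1.0 / (2.0 * row_norms))
        W = np.linalg.solve(XXt + mu_over_beta * D, XBt)
    return W
```

The l2,1 penalty shrinks whole rows of W toward zero, which is what gives the learned hash functions their robustness to noisy feature dimensions.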
(5) Repeatedly executing the steps (1) - (4) until the algorithm converges or the maximum iteration number is reached;
seventh step: the user inputs a query sample, extracts the characteristics of the query sample, and removes the average value of the extracted characteristics;
eighth step: generating the hash code of the query sample with the learned hash function: b = sgn(Wiᵀ x), where x is the mean-removed query feature of modality i;
ninth step: calculating the Hamming distances between the query sample and the heterogeneous samples in the target (training) set and sorting them in ascending order; the samples corresponding to the r smallest Hamming distances are the retrieval result.
Compared with the prior art, the invention has the following positive effects: the method integrates class labels and the features of the image and text modalities into one framework to learn a robust pairwise similarity matrix, so that hash codes with better performance are learned and the performance of the algorithm is improved; the l2,1 norm is applied as a regularization term to constrain the learning of the hash functions, so that noise is better resisted; and a discrete optimization algorithm is provided that directly obtains discrete hash codes, improving their discriminative power. The method is suitable for cross-media retrieval of real network data.
Description of the drawings:
FIG. 1 is a flow chart of a robust discrete supervised cross-modal hash retrieval method of the present invention.
The specific embodiment is as follows:
in order to make the technical scheme of the present invention clearer, the invention is further described in detail below with reference to a specific embodiment, which does not limit its scope of protection.
Examples: a robust discrete supervised cross-modal hash retrieval method, comprising the steps of:
the first step: collecting image and text sample pairs containing class labels to form a cross-modal retrieval image-text data set corresponding to the images and the texts one by one;
and a second step of: extracting image and text features, wherein each image-modality sample is represented by a 150-dimensional texture feature and each text-modality sample by a 500-dimensional BOW (Bag of Words) feature; the means are subtracted so that the feature data of both modalities have zero mean;
and a third step of: randomly dividing all sample pairs in a data set into a training set and a testing set;
fourth step: constructing a pairwise similarity matrix from each of the class labels, the image features, and the text features of the training sample pairs, and learning a robust pairwise similarity matrix by using the low-rank property of the pairwise similarity matrices and the sparsity of sample noise. Let the features of the training sample pairs be X = {X^(1), X^(2)}, where X^(1) ∈ R^(d1×N) denotes the image-modality features in the training set and X^(2) ∈ R^(d2×N) the text-modality features; d1 and d2 denote the feature dimensions of the image and text modalities, and N denotes the number of image or text samples in the training set. The class labels of the sample pairs are denoted by L ∈ {0,1}^(N×c), where c denotes the number of classes and l_i ∈ {0,1}^c; l_ij = 1 indicates that the i-th sample belongs to the j-th class, and conversely l_ij = 0 indicates that the i-th sample does not belong to the j-th class; here d1 = 150 and d2 = 500.
Learning the robust pairwise similarity matrix comprises the following steps:
(1) Computing the pairwise similarity matrix of the image modality from the image features, defined as:
S^(1)_ij = exp(−‖x^(1)_i − x^(1)_j‖²_F / σ1)
where ‖·‖_F denotes the Frobenius norm, S^(1) denotes the pairwise similarity matrix of the image modality, S^(1)_ij the similarity between the i-th and j-th image samples, and σ1 a scale parameter; here σ1 = 0.8;
(2) Computing the pairwise similarity matrix of the text modality from the text features, defined as:
S^(2)_ij = exp(−‖x^(2)_i − x^(2)_j‖²_F / σ2)
where S^(2) denotes the pairwise similarity matrix of the text modality, S^(2)_ij the similarity between the i-th and j-th text samples, and σ2 a scale parameter; here σ2 = 0.3;
(3) Computing the label-based pairwise similarity matrix from the class labels of the sample pairs, defined as:
S^(3)_ij = (l_i l_jᵀ) / (‖l_i‖2 ‖l_j‖2)
where S^(3) denotes the pairwise similarity matrix of the labels and S^(3)_ij the similarity between the labels of the i-th and j-th sample pairs;
(4) The objective function for learning the robust pairwise similarity matrix is defined as:
min_{S, E^(i)} rank(S) + λ Σ_{i=1..3} ‖E^(i)‖0   s.t. S^(i) = S + E^(i), i = 1, 2, 3
where S denotes the learned robust pairwise similarity matrix, E^(i) the noise in the i-th pairwise similarity matrix, rank(·) the rank of a matrix, ‖·‖0 the l0 norm, and λ a weight parameter;
(5) Because of the discrete rank and l0-norm terms, the objective in (4) is difficult to solve directly; relaxing these two terms yields an approximate problem, so the above formula can be rewritten as:
min_{S, E^(i)} ‖S‖* + λ Σ_{i=1..3} ‖E^(i)‖1   s.t. S^(i) = S + E^(i), i = 1, 2, 3
where ‖·‖* denotes the nuclear norm and ‖·‖1 the l1 norm;
(6) Solving the above problem with the augmented Lagrange multiplier (ALM) method to obtain the robust pairwise similarity matrix;
fifth step: constructing an objective function, specifically comprising the following steps:
(1) The similarity encoded in the robust pairwise similarity matrix is preserved in Hamming space, and since the image and text samples of a pair share the same class labels, the distance between their hash codes should be as small as possible. The objective function of hash code learning is therefore defined as:
min_{B1,B2} ‖kS − B1ᵀB2‖²_F + λ‖B1 − B2‖²_F   s.t. B1, B2 ∈ {−1, 1}^(k×N)
where k denotes the length of the hash code, B1 the hash codes of the image-modality samples, B2 the hash codes of the text-modality samples, and λ a weight parameter; here λ = 1;
(2) Linear mappings are used as hash functions, and the l2,1 norm is used as a regularization term to constrain the learning of the image and text modality hash functions and enhance their robustness to noise. The objective for learning each modality's hash function is therefore defined as:
min_{Wi} βi ‖Bi − WiᵀX^(i)‖²_F + μ Reg(Wi), i = 1, 2
where W1 and W2 denote the hash functions of the image and text modalities respectively, Reg(·) denotes the regularization term preventing overfitting, here Reg(Wi) = ‖Wi‖_{2,1}, and βi and μ are weight parameters; here β1 = 10, β2 = 10, μ = 0.1;
(3) Adding the objectives of hash code learning and hash function learning yields the overall objective function of the method, defined as:
min_{B1,B2,W1,W2} ‖kS − B1ᵀB2‖²_F + λ‖B1 − B2‖²_F + Σ_{i=1,2} βi ‖Bi − WiᵀX^(i)‖²_F + μ Σ_{i=1,2} ‖Wi‖_{2,1}   s.t. B1, B2 ∈ {−1, 1}^(k×N)
sixth step: because the objective function contains several unknown variables as well as discrete constraints on the hash codes, it is difficult to solve directly; however, when all but one variable are fixed, the subproblem in the remaining variable is a convex optimization problem, so the objective can be solved with an iterative optimization algorithm. The solving process comprises the following steps:
(1) Fix W1, W2, and B2, and solve for B1.
Removing constant terms, the objective function can be written as:
min_{B1} ‖B1ᵀB2‖²_F − 2 tr(B1ᵀQ1)   s.t. B1 ∈ {−1, 1}^(k×N), with Q1 = kB2Sᵀ + λB2 + β1W1ᵀX^(1)
Because B1 is discrete, this problem is difficult to solve directly; it can be solved sample by sample. Let b_{1i} denote the i-th column of B1 and b_{2j} the j-th column of B2; removing constant terms, the per-sample objective can be written as:
min_{b_{1i}} b_{1i}ᵀ(B2B2ᵀ)b_{1i} − 2 q_{1i}ᵀ b_{1i}   s.t. b_{1i} ∈ {−1, 1}^k
where q_{1i} is the i-th column of Q1. This problem is still difficult to solve directly, so it is solved bit by bit with the cyclic coordinate descent method. Let b_{1im} denote the m-th bit of b_{1i} and b̂_{1i} the vector of the bits of b_{1i} other than the m-th; b_{1im} is obtained by:
b_{1im} = sgn(q_{1im} − M̂_m b̂_{1i})
where M = B2B2ᵀ and M̂_m denotes the m-th row of M with its m-th entry removed;
repeating the above steps until the hash codes of all the image mode samples are solved;
(2) Fix W1, W2, and B1, and solve for B2.
In the same way as for B1, one obtains:
b_{2jm} = sgn(q_{2jm} − M̂′_m b̂_{2j})
where Q2 = kB1S + λB1 + β2W2ᵀX^(2), M′ = B1B1ᵀ, and q_{2j} is the j-th column of Q2;
Repeating the above steps until the hash codes of all text mode samples are solved;
(3) Fix W2, B1, and B2, and solve for W1.
Removing constant terms, the objective function can be written as:
min_{W1} β1 ‖B1 − W1ᵀX^(1)‖²_F + μ ‖W1‖_{2,1}
This problem has a closed-form solution:
W1 = (X^(1)X^(1)ᵀ + (μ/β1) D1)^(−1) X^(1) B1ᵀ
where D1 is a diagonal matrix whose j-th diagonal entry is 1/(2‖w¹_j‖2), w¹_j denoting the j-th row of W1;
(4) Fix W1, B1, and B2, and solve for W2.
In the same way as for W1, W2 has the closed-form solution:
W2 = (X^(2)X^(2)ᵀ + (μ/β2) D2)^(−1) X^(2) B2ᵀ
where D2 is a diagonal matrix whose j-th diagonal entry is 1/(2‖w²_j‖2), w²_j denoting the j-th row of W2;
(5) Repeating steps (1)-(4); the iteration ends when the absolute error between the last two iterations is less than 0.01 or the number of iterations exceeds 20;
seventh step: the user inputs a query sample, extracts the characteristics of the query sample, and removes the average value of the extracted characteristics;
eighth step: generating the hash code of the query sample with the learned hash function: b = sgn(Wiᵀ x), where x is the mean-removed query feature of modality i;
ninth step: calculating the Hamming distances between the query sample and the heterogeneous samples in the target (training) set and sorting them in ascending order; the samples corresponding to the r smallest Hamming distances are the retrieval result, where r = 100.
To verify the validity of the invention, this embodiment uses the public dataset Mirflickr25K as an example. The dataset contains 20015 image-text pairs, and all sample pairs are divided into 24 categories; 15011 (75%) sample pairs are randomly selected to form the training set, and the remaining 5004 (25%) form the test set. Each image-modality sample is represented by a 150-dimensional texture feature and each text-modality sample by a 500-dimensional BOW (Bag of Words) feature; the means are subtracted so that the feature data of both modalities have zero mean. To objectively evaluate the retrieval performance of the method, mean average precision (MAP) is used as the evaluation criterion; the MAP results for different hash code lengths p on the Mirflickr25K dataset are shown in Table 1.
TABLE 1 MAP results on the Mirflickr25K dataset

Task                    p=16    p=32    p=64    p=96
Image retrieves text    0.6718  0.6785  0.6843  0.6918
Text retrieves image    0.6813  0.6953  0.6977  0.7045
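For reference, mean average precision as reported in Table 1 can be sketched as follows. This follows the common definition (AP per query normalized by the number of relevant items, then averaged over queries); the exact evaluation protocol used for the table is an assumption.

```python
import numpy as np

def average_precision(relevant: set, ranked: list) -> float:
    """AP for one query: `ranked` is the retrieval order, `relevant` the
    set of ground-truth relevant indices."""
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(ranked, start=1):
        if idx in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant hit
    return precision_sum / max(len(relevant), 1)

def mean_average_precision(relevant_sets, rankings) -> float:
    """MAP: mean of per-query average precisions."""
    return float(np.mean([average_precision(s, r)
                          for s, r in zip(relevant_sets, rankings)]))

# Toy example: relevant items {0, 2}, retrieved order [0, 1, 2]
# AP = (1/1 + 2/3) / 2 = 5/6
print(average_precision({0, 2}, [0, 1, 2]))
```

For the cross-modal setting, a retrieved sample is typically counted as relevant when it shares at least one class label with the query.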
It should be understood that the parts of this specification not described in detail belong to the prior art; the foregoing is a detailed description of a preferred embodiment and is not intended to limit the scope of the invention.

Claims (1)

1. A robust discrete supervised cross-media hash retrieval method, the method comprising the steps of:
the first step: collecting image and text sample pairs containing class labels to form a cross-modal retrieval image-text data set corresponding to the images and the texts one by one;
and a second step of: extracting features from the image and text modality samples respectively, and subtracting the mean from each modality's features so that the feature data of both modalities have zero mean;
and a third step of: randomly dividing all sample pairs in a data set into a training set and a testing set;
fourth step: constructing a pairwise similarity matrix from each of the class labels, the image features, and the text features of the training sample pairs, and learning a robust pairwise similarity matrix by using the low-rank property of the pairwise similarity matrices and the sparsity of sample noise; the features of the training sample pairs are denoted X = {X^(1), X^(2)}, where X^(1) ∈ R^(d1×N) denotes the image-modality features in the training set and X^(2) ∈ R^(d2×N) the text-modality features, d1 and d2 denote the feature dimensions of the image and text modalities, and N denotes the number of image or text samples in the training set; the class labels of the sample pairs are denoted by L ∈ {0,1}^(N×c), where c denotes the number of classes and l_i ∈ {0,1}^c; l_ij = 1 indicates that the i-th sample belongs to the j-th class, and conversely l_ij = 0 indicates that the i-th sample does not belong to the j-th class; learning the robust pairwise similarity matrix comprises the following steps:
(1) computing the pairwise similarity matrix of the image modality from the image features, defined as:
S^(1)_ij = exp(−‖x^(1)_i − x^(1)_j‖²_F / σ1)
where ‖·‖_F denotes the Frobenius norm, S^(1) denotes the pairwise similarity matrix of the image modality, S^(1)_ij the similarity between the i-th and j-th image samples, and σ1 a scale parameter;
(2) computing the pairwise similarity matrix of the text modality from the text features, defined as:
S^(2)_ij = exp(−‖x^(2)_i − x^(2)_j‖²_F / σ2)
where S^(2) denotes the pairwise similarity matrix of the text modality, S^(2)_ij the similarity between the i-th and j-th text samples, and σ2 a scale parameter;
(3) computing the label-based pairwise similarity matrix from the class labels of the sample pairs, defined as:
S^(3)_ij = (l_i l_jᵀ) / (‖l_i‖2 ‖l_j‖2)
where S^(3) denotes the pairwise similarity matrix of the labels and S^(3)_ij the similarity between the labels of the i-th and j-th sample pairs;
(4) the objective function for learning the robust pairwise similarity matrix is defined as:
min_{S, E^(i)} rank(S) + λ Σ_{i=1..3} ‖E^(i)‖0   s.t. S^(i) = S + E^(i), i = 1, 2, 3
where S denotes the learned robust pairwise similarity matrix, E^(i) the noise in the i-th pairwise similarity matrix, rank(·) the rank of a matrix, ‖·‖0 the l0 norm, and λ a weight parameter;
(5) because the objective in (4) contains the discrete rank and l0-norm terms, relaxing these two terms allows the above formula to be rewritten as:
min_{S, E^(i)} ‖S‖* + λ Σ_{i=1..3} ‖E^(i)‖1   s.t. S^(i) = S + E^(i), i = 1, 2, 3
where ‖·‖* denotes the nuclear norm and ‖·‖1 the l1 norm;
(6) solving the above problem with the augmented Lagrange multiplier method to obtain the robust pairwise similarity matrix;
fifth step: constructing an objective function, specifically comprising the following steps:
(1) the similarity encoded in the robust pairwise similarity matrix is preserved in Hamming space, and the objective function of hash code learning is defined as:
min_{B1,B2} ‖kS − B1ᵀB2‖²_F + λ‖B1 − B2‖²_F   s.t. B1, B2 ∈ {−1, 1}^(k×N)
where k denotes the length of the hash code, B1 the hash codes of the image-modality samples, B2 the hash codes of the text-modality samples, and λ a weight parameter;
(2) linear mappings are used as hash functions, and the l2,1 norm is used as a regularization term to constrain the learning of the image and text modality hash functions; the objective function for learning each modality's hash function is defined as:
min_{Wi} βi ‖Bi − WiᵀX^(i)‖²_F + μ Reg(Wi), i = 1, 2
where W1 and W2 denote the hash functions of the image and text modalities respectively, Reg(·) denotes the regularization term preventing overfitting, here Reg(Wi) = ‖Wi‖_{2,1}, and βi and μ are weight parameters;
(3) adding the objectives of hash code learning and hash function learning yields the objective function of the method, defined as:
min_{B1,B2,W1,W2} ‖kS − B1ᵀB2‖²_F + λ‖B1 − B2‖²_F + Σ_{i=1,2} βi ‖Bi − WiᵀX^(i)‖²_F + μ Σ_{i=1,2} ‖Wi‖_{2,1}   s.t. B1, B2 ∈ {−1, 1}^(k×N)
where βi are weight parameters;
sixth step: the objective function is solved by using an iterative optimization algorithm, and the solving process comprises the following steps:
(1) fix W1, W2, and B2, and solve for B1;
removing constant terms, the objective function can be written as:
min_{B1} ‖B1ᵀB2‖²_F − 2 tr(B1ᵀQ1)   s.t. B1 ∈ {−1, 1}^(k×N), with Q1 = kB2Sᵀ + λB2 + β1W1ᵀX^(1)
the problem can be solved sample by sample; let b_{1i} denote the i-th column of B1 and b_{2j} the j-th column of B2; removing constant terms, the per-sample objective can be written as:
min_{b_{1i}} b_{1i}ᵀ(B2B2ᵀ)b_{1i} − 2 q_{1i}ᵀ b_{1i}   s.t. b_{1i} ∈ {−1, 1}^k
where q_{1i} denotes the i-th column of Q1; the problem is solved bit by bit with the cyclic coordinate gradient descent method; let b_{1im} denote the m-th bit of b_{1i} and b̂_{1i} the vector of the bits of b_{1i} other than the m-th; b_{1im} is obtained by:
b_{1im} = sgn(q_{1im} − M̂_m b̂_{1i})
where M = B2B2ᵀ and M̂_m denotes the m-th row of M with its m-th entry removed;
repeating the above steps until the hash codes of all the image mode samples are solved;
(2) fix W1, W2, and B1, and solve for B2;
in the same way as for B1, one obtains:
b_{2jm} = sgn(q_{2jm} − M̂′_m b̂_{2j})
where Q2 = kB1S + λB1 + β2W2ᵀX^(2), M′ = B1B1ᵀ, and q_{2j} denotes the j-th column of Q2;
Repeating the above steps until the hash codes of all text mode samples are solved;
(3) fix W2, B1, and B2, and solve for W1;
removing constant terms, the objective function can be written as:
min_{W1} β1 ‖B1 − W1ᵀX^(1)‖²_F + μ ‖W1‖_{2,1}
this problem has a closed-form solution:
W1 = (X^(1)X^(1)ᵀ + (μ/β1) D1)^(−1) X^(1) B1ᵀ
where D1 is a diagonal matrix whose j-th diagonal entry is 1/(2‖w¹_j‖2), w¹_j denoting the j-th row of W1;
(4) fix W1, B1, and B2, and solve for W2;
in the same way as for W1, W2 has the closed-form solution:
W2 = (X^(2)X^(2)ᵀ + (μ/β2) D2)^(−1) X^(2) B2ᵀ
where D2 is a diagonal matrix whose j-th diagonal entry is 1/(2‖w²_j‖2), w²_j denoting the j-th row of W2;
(5) Repeatedly executing the steps (1) - (4) until the algorithm converges or the maximum iteration number is reached;
seventh step: the user inputs a query sample; the features of the query sample are extracted, and the mean value is subtracted from the extracted features;
eighth step: generating a hash code of the query sample using the learned hash function:
ninth step: and calculating the hamming distances between the query sample and the heterogeneous samples in the training set, and arranging the hamming distances in an ascending order, wherein the samples corresponding to the first r hamming distances are retrieval results.
CN201910096204.2A 2019-01-31 2019-01-31 Robust discrete supervision cross-media hash retrieval method Active CN109871454B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910096204.2A CN109871454B (en) 2019-01-31 2019-01-31 Robust discrete supervision cross-media hash retrieval method

Publications (2)

Publication Number Publication Date
CN109871454A CN109871454A (en) 2019-06-11
CN109871454B true CN109871454B (en) 2023-08-29

Family

ID=66918414

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107256271A (en) * 2017-06-27 2017-10-17 鲁东大学 Cross-module state Hash search method based on mapping dictionary learning
WO2017210949A1 (en) * 2016-06-06 2017-12-14 北京大学深圳研究生院 Cross-media retrieval method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Cross-media Retrieval Based on Hashing Methods; Yao Tao; Wanfang China dissertation database; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant