CN109271486B - Similarity-preserving cross-modal Hash retrieval method - Google Patents


Info

Publication number
CN109271486B
Authority
CN
China
Prior art keywords: sample, hash, text, retrieval, similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811097048.3A
Other languages
Chinese (zh)
Other versions
CN109271486A (en)
Inventor
董西伟
杨茂保
孙丽
董小刚
尧时茂
王玉伟
邓安远
邓长寿
Current Assignee
Jiujiang University
Original Assignee
Jiujiang University
Priority date
Filing date
Publication date
Application filed by Jiujiang University filed Critical Jiujiang University
Priority to CN201811097048.3A priority Critical patent/CN109271486B/en
Publication of CN109271486A publication Critical patent/CN109271486A/en
Application granted granted Critical
Publication of CN109271486B publication Critical patent/CN109271486B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

A similarity-preserving cross-modal hash retrieval method, comprising the steps of: (1) constructing an objective function based on a similarity-preservation strategy; (2) solving the objective function; (3) generating binary hash codes for the query sample and the samples in the retrieval sample set; (4) calculating the Hamming distance from the query sample to each sample in the retrieval sample set; (5) completing the retrieval of the query sample using a cross-modal retriever. During hash learning, the method fully preserves the similarity of samples between modalities as well as the similarity of samples within each modality, so that the learned Hamming space has stronger discriminative power, which benefits cross-modal retrieval.

Description

Similarity-preserving cross-modal Hash retrieval method
Technical Field
The invention relates to a similarity-preserving cross-modal Hash retrieval method.
Background
In every industry of modern society, vast amounts of user data have accumulated (for example, the data volume held by the search engine Chrome exceeds 100 PB), and the volume is still growing exponentially: the big-data era has arrived. Big data plays a very important role in industries such as internet finance, medicine, education, the military and transportation; for example, combining big data with machine learning can provide a reliable basis for financial investment and market decisions. Today's big data has the following characteristics: (1) large volume: data volumes reach the petabyte (PB) scale; (2) high dimensionality: data features can have thousands of dimensions; (3) many modalities: data is diverse in type and form, including images, text, audio and video. These characteristics pose serious challenges to machine learning. In the face of this situation, how to use big data reasonably, extract valuable information from it, and provide a basis for practical work is an urgent problem to be solved.
Information retrieval technology can retrieve valuable information for users. Similarity search is a research hotspot in information retrieval, and approximate nearest neighbor (ANN) search has attracted attention for its high search speed. ANN search methods fall mainly into tree-based methods and hash learning methods, each with its own characteristics. Tree-based methods: (1) recursively partition the data in divide-and-conquer fashion; (2) have O(log n) query time complexity; (3) degrade gradually in search performance as data dimensionality increases; (4) must store the tree structure, incurring large storage overhead; (5) must keep the original data in memory at run time, increasing memory cost. Hash learning methods have attractive properties: (1) each item in the database is represented by a short binary string, greatly reducing data storage and memory usage; (2) query time complexity is constant O(1) or sub-linear. Hash learning methods are therefore widely used in practice.
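The storage and query-cost advantages of binary codes can be made concrete with a small illustration (not part of the patent; all names and data are made up): 64-bit codes packed into 8 bytes each and compared by XOR plus popcount.

```python
import numpy as np

def pack_codes(bits):
    """Pack a (n, c) array of +/-1 bits into uint8 words: c=64 bits -> 8 bytes per item."""
    return np.packbits((bits > 0).astype(np.uint8), axis=1)

def hamming_distances(query, database):
    """Hamming distance from one packed query code to all packed database codes."""
    xor = np.bitwise_xor(query[None, :], database)   # bit pattern of disagreements
    return np.unpackbits(xor, axis=1).sum(axis=1)    # popcount per row

rng = np.random.default_rng(0)
db_bits = rng.choice([-1, 1], size=(1000, 64))       # 1000 codes of c = 64 bits
q_bits = db_bits[42]                                 # query identical to item 42

db = pack_codes(db_bits)
q = pack_codes(q_bits[None, :])[0]
d = hamming_distances(q, db)                         # item 42 gets distance 0
```

Compared with storing 64 floating-point features per item, each code occupies only 8 bytes, and ranking reduces to integer bit operations.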
Cross-modal hashing mainly addresses mutual retrieval between multi-modal data, such as searching text with an image or searching images with text. A cross-modal hashing algorithm must hash-encode data of different modalities into compact binary strings and then retrieve data of one modality with queries from another; it must consider not only the correlation between data within the same modality but also the correlation between data of different modalities. In recent years, quite a few cross-modal retrieval hashing methods have been proposed. For example, Bronstein et al. proposed the Cross-Modality Similarity-Sensitive Hashing (CMSSH) method, which regards the hash function corresponding to each code bit as a weak classifier and learns the hash functions with the AdaBoost boosting algorithm. Kumar et al. proposed the Cross-View Hashing (CVH) method, which learns a hash function for each modality by minimizing the difference between semantic similarity and Hamming distance. Song et al. proposed the Inter-Media Hashing (IMH) method, which finds a common Hamming space by maintaining inter-media and intra-media consistency and then learns a hash function for each modality using linear regression. Ding et al. proposed the Collective Matrix Factorization Hashing (CMFH) method, which learns a common semantic representation for the different modalities through collective matrix factorization and then generates a unified binary hash code by quantization.
Zhu et al. proposed the Linear Cross-Modal Hashing (LCMH) method, which applies k-means clustering to the data of each modality to generate k cluster centers, reconstructs the feature space of the data from the distances between in-modality data points and the k centers, and obtains the hash function for each modality by solving for eigenvectors through eigenvalue decomposition. Zhou et al. proposed the Latent Semantic Sparse Hashing (LSSH) method, which combines sparse coding and matrix factorization to learn a common latent semantic representation for the features of the different modalities and then solves the objective function with an iterative optimization algorithm. Zhang et al., with the Semantic Correlation Maximization (SCM) method, learn the hash functions by maximizing semantic correlation and propose an eigendecomposition variant SCM_orth and a sequential learning variant SCM_seq. Lin et al. proposed the Semantics-Preserving Hashing (SePH) method, which transforms the similarity matrix into a probability distribution by minimizing the K-L divergence, performs probability estimation of the binary hash code string of each sample, and then learns the hash function for each modality by kernel-function regression.
When data of the image and text modalities is mapped from the original feature space to another feature space, some characteristics of the original data are inevitably lost. For cross-modal retrieval based on hash learning, effectively retaining and mining the discriminative information of the original data when mapping image-modality and text-modality data from the original feature space to the Hamming space is vital to completing the cross-modal retrieval task. For samples of the image and text modalities, the similarity relations between samples of different modalities and between samples of the same modality are the key factors influencing cross-modal retrieval. Many existing cross-modal hash learning methods do not handle the inter-modal and intra-modal similarity relations well: some attend only to preserving inter-modal similarity, others only to preserving intra-modal similarity. This harms the discriminative power of the learned Hamming space. In addition, many methods do not fully consider the redundancy of the information carried by the individual bits of the hash code, so the learned hash codes are both redundant and insufficiently discriminative. Therefore, simultaneously preserving the inter-modal and intra-modal similarity relations during cross-modal hash learning, while making the information redundancy across the bits of the hash code as small as possible, is very important for improving cross-modal retrieval performance.
Disclosure of Invention
The aim of the invention is to provide a similarity-preserving cross-modal hash retrieval method that addresses two shortcomings of existing methods: the similarity of intra-modal and inter-modal samples is not fully preserved, and the redundant information across the bits of the hash codes is not sufficiently reduced. As a result, the hash codes learned by the proposed method have good discriminative power.
The technical scheme adopted to achieve this aim is a similarity-preserving cross-modal hash retrieval method. Suppose there are n objects whose features in the image modality and the text modality are, respectively,

$$X^{(1)}=\left[x_1^{(1)},x_2^{(1)},\ldots,x_n^{(1)}\right]\in\mathbb{R}^{d_1\times n}\quad\text{and}\quad X^{(2)}=\left[x_1^{(2)},x_2^{(2)},\ldots,x_n^{(2)}\right]\in\mathbb{R}^{d_2\times n},$$

where $d_1$ and $d_2$ denote the dimensions of the image-modality and text-modality feature vectors, and $x_i^{(1)}$ and $x_i^{(2)}$ denote the features of the i-th object in the image modality and the text modality respectively. The feature vectors of both modalities are assumed to have been zero-centered, i.e.

$$\sum_{i=1}^{n}x_i^{(1)}=0,\qquad \sum_{i=1}^{n}x_i^{(2)}=0.$$

Let $L=[l_1,l_2,\ldots,l_n]\in\{0,1\}^{m\times n}$ be the label matrix formed by the class labels of the n objects, where $l_i$ ($i=1,2,\ldots,n$) denotes the category label information of the i-th object and m is the number of categories. Let S be the cross-modal similarity matrix, whose element $S_{ij}$ represents the similarity between the i-th sample in the image modality and the j-th sample in the text modality: $S_{ij}=1$ if the i-th sample in the image modality and the j-th sample in the text modality are similar (i.e. they share at least one category), and $S_{ij}=0$ otherwise. The method comprises the following steps:
(1) Construct an objective function based on a similarity-preservation strategy: using an objective function designed around an inter-modal similarity-preservation strategy and an intra-modal similarity-preservation strategy, obtain the binary hash codes U and V of the image-modality and text-modality feature data of the n objects in the Hamming space, the hash projection matrices P1 and P2 corresponding to the image and text modalities, and two coefficient matrices W1 and W2;
(2) Solve the objective function: in view of the non-convexity of the objective function, obtain the solution U, V, P1, P2, W1 and W2 by alternating updates, i.e. solve the following four subproblems alternately: fix U, V, W1 and W2 and solve for P1 and P2; fix U, V, P1 and P2 and solve for W1 and W2; fix V, P1, P2, W1 and W2 and solve for U; fix U, P1, P2, W1 and W2 and solve for V;
(3) Generate binary hash codes for the query sample and the samples in the retrieval sample set, based on the solved hash projection matrices P1 and P2 of the image and text modalities;
(4) Calculate the Hamming distance from the query sample to each sample in the retrieval sample set, based on the generated binary hash codes;
(5) Complete the retrieval of the query sample using a cross-modal retriever based on approximate nearest neighbor search.
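The alternating structure of steps (1) and (2) can be sketched as a loop. The following is a heavily simplified illustration on synthetic data, not the patent's actual algorithm: it keeps only the closed-form ridge-style updates for the projection and coefficient matrices and replaces the discrete code updates with a naive sign-quantization step, omitting the likelihood, balance and orthogonality terms.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d1, d2, m, c = 200, 20, 15, 5, 16        # illustrative sizes
gamma, alpha = 1e-2, 1.0                    # balance factors (made-up values)

X1 = rng.standard_normal((d1, n)); X1 -= X1.mean(axis=1, keepdims=True)  # zero-centered
X2 = rng.standard_normal((d2, n)); X2 -= X2.mean(axis=1, keepdims=True)
L = np.eye(m)[:, rng.integers(0, m, n)]     # one-hot label matrix, m x n

U = np.sign(rng.standard_normal((c, n)))    # image-modality codes, random init
V = np.sign(rng.standard_normal((c, n)))    # text-modality codes, random init

for _ in range(10):
    # fix U, V, W -> closed-form solves for the projection matrices
    P1 = np.linalg.solve(X1 @ X1.T + gamma * np.eye(d1), X1 @ U.T)
    P2 = np.linalg.solve(X2 @ X2.T + gamma * np.eye(d2), X2 @ V.T)
    # fix U, V, P -> closed-form solves for the coefficient matrices
    W1 = np.linalg.solve(U @ U.T + (gamma / alpha) * np.eye(c), U @ L.T)
    W2 = np.linalg.solve(V @ V.T + (gamma / alpha) * np.eye(c), V @ L.T)
    # fix everything else -> naive re-quantization of the codes (a stand-in
    # for the patent's gradient-based discrete update)
    U = np.sign(P1.T @ X1 + alpha * W1 @ L)
    V = np.sign(P2.T @ X2 + alpha * W2 @ L)
```

The point is only the control flow: each pass solves four easy subproblems in turn while holding the other variables fixed.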
The objective function designed from the inter-modal and intra-modal similarity-preservation strategies in step (1) has the form

$$\min_{U,V,P_1,P_2,W_1,W_2}\;\beta\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\log\left(1+e^{\Theta_{ij}}\right)-S_{ij}\Theta_{ij}\right)+\alpha\left(\|L-W_1^{T}U\|_F^2+\|L-W_2^{T}V\|_F^2\right)+\|U-P_1^{T}X^{(1)}\|_F^2+\|V-P_2^{T}X^{(2)}\|_F^2+\gamma\left(\|P_1\|_F^2+\|P_2\|_F^2+\|W_1\|_F^2+\|W_2\|_F^2\right)+\eta\left(\|U1_{n\times 1}\|_F^2+\|V1_{n\times 1}\|_F^2\right)$$
$$\text{s.t. } UU^{T}=nI,\;VV^{T}=nI,\;U\in\{-1,+1\}^{c\times n},\;V\in\{-1,+1\}^{c\times n},\tag{1}$$

where α, β, γ and η are non-negative balance factors, c is the length of the binary hash codes, I is the identity matrix, $1_{n\times 1}$ denotes a column vector whose elements are all 1, $\Theta_{ij}=\frac{\lambda}{c}\langle u_i,v_j\rangle$, λ > 0 is an adjustable scale factor, $u_i$ is the binary hash code of the i-th sample in the image modality, $v_j$ is the binary hash code of the j-th sample in the text modality, $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, and $(\cdot)^{T}$ denotes matrix transposition.
The solution U, V, P1, P2, W1 and W2 of the objective function is obtained in step (2) by alternately solving the following four subproblems:
(1) Fix U, V, W1 and W2 and solve for P1 and P2. With the binary hash codes U and V and the coefficient matrices W1 and W2 fixed, the objective function in equation (1) reduces to a subproblem in the hash projection matrices P1 and P2:

$$\min_{P_1,P_2}\;\|U-P_1^{T}X^{(1)}\|_F^2+\|V-P_2^{T}X^{(2)}\|_F^2+\gamma\left(\|P_1\|_F^2+\|P_2\|_F^2\right).$$

(2) Fix U, V, P1 and P2 and solve for W1 and W2. With the binary hash codes U and V and the hash projection matrices P1 and P2 fixed, the objective function in equation (1) reduces to a subproblem in the coefficient matrices W1 and W2:

$$\min_{W_1,W_2}\;\alpha\left(\|L-W_1^{T}U\|_F^2+\|L-W_2^{T}V\|_F^2\right)+\gamma\left(\|W_1\|_F^2+\|W_2\|_F^2\right).$$

(3) Fix V, P1, P2, W1 and W2 and solve for U. With the text-modality binary hash code V, the hash projection matrices and the coefficient matrices fixed, the objective function in equation (1) reduces to a subproblem in the image-modality binary hash code U:

$$\min_{U}\;\beta\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\log\left(1+e^{\Theta_{ij}}\right)-S_{ij}\Theta_{ij}\right)+\alpha\|L-W_1^{T}U\|_F^2+\|U-P_1^{T}X^{(1)}\|_F^2+\eta\|U1_{n\times 1}\|_F^2,\quad U\in\{-1,+1\}^{c\times n}.$$

(4) Fix U, P1, P2, W1 and W2 and solve for V. Analogously, the objective function in equation (1) reduces to a subproblem in the text-modality binary hash code V:

$$\min_{V}\;\beta\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\log\left(1+e^{\Theta_{ij}}\right)-S_{ij}\Theta_{ij}\right)+\alpha\|L-W_2^{T}V\|_F^2+\|V-P_2^{T}X^{(2)}\|_F^2+\eta\|V1_{n\times 1}\|_F^2,\quad V\in\{-1,+1\}^{c\times n}.$$
the Hash projection matrix P based on the image mode and the text mode obtained by solving in the step (3)1And P2Generating a binary hash code for the query sample and the samples in the search sample set, in particular, assuming that a feature vector of a query sample of the image modality is
Figure GDA0003292667660000073
The feature vector of a query sample of the text modality is
Figure GDA0003292667660000074
The image mode searches the characteristics of the samples in the sample set as
Figure GDA0003292667660000075
The text modal search sample set is characterized by
Figure GDA0003292667660000076
Wherein the content of the first and second substances,
Figure GDA0003292667660000077
representing the number of samples in the search sample set; the binary hash codes of the query samples in the image mode and the text mode and the binary hash codes of the samples in the retrieval sample set in the image mode and the text mode are respectively as follows:
Figure GDA0003292667660000081
and
Figure GDA0003292667660000082
wherein the content of the first and second substances,
Figure GDA0003292667660000083
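Encoding by projecting and taking signs can be sketched as follows (an illustration with made-up dimensions; the function and variable names are not from the patent):

```python
import numpy as np

def encode(P, X):
    """Binarize projected features: B = sign(P^T X), with sign(0) mapped to +1."""
    B = P.T @ X
    return np.where(B >= 0, 1, -1)

rng = np.random.default_rng(2)
d1, c, nq = 10, 8, 3
P1 = rng.standard_normal((d1, c))          # a learned image-modality projection
Xq = rng.standard_normal((d1, nq))         # three image-modality query features
codes = encode(P1, Xq)                     # c x nq matrix of +/-1 codes
```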
in the step (4), the hamming distance from the query sample to each sample in the retrieval sample set is calculated based on the generated binary hash code, specifically, a formula is used
Figure GDA0003292667660000084
Calculating the Hamming distance from the query sample of the image mode to each sample in the text mode retrieval sample set, using a formula
Figure GDA0003292667660000085
The hamming distance from a query sample of a text modality to each sample in a set of image modality retrieval samples is calculated.
In step (5), a cross-modal retriever based on approximate nearest neighbor search completes the retrieval of the query sample. Specifically, when retrieving text with an image, the Hamming distances from the image-modality query sample to the samples in the text-modality retrieval sample set are sorted in ascending order; when retrieving images with text, the Hamming distances from the text-modality query sample to the samples in the image-modality retrieval sample set are sorted in ascending order. The samples corresponding to the first K smallest distances in the retrieval sample set are then taken as the retrieval result.
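Steps (4) and (5) together amount to ranking retrieval-set codes by Hamming distance and keeping the top K. A sketch (illustrative names and data, not from the patent) using the identity d_H(u, v) = (c − uᵀv)/2 for codes with entries ±1:

```python
import numpy as np

def topk_cross_modal(query_code, db_codes, k):
    """Rank retrieval-set codes by Hamming distance d_H = (c - u^T v) / 2."""
    c = query_code.shape[0]
    d = (c - db_codes.T @ query_code) / 2      # one distance per retrieval sample
    order = np.argsort(d, kind="stable")       # ascending: most similar first
    return order[:k], d

rng = np.random.default_rng(3)
c, nr = 16, 50
V = np.where(rng.standard_normal((c, nr)) >= 0, 1, -1)  # text-modality retrieval codes
u = V[:, 7].copy()                                       # image query matching sample 7
idx, d = topk_cross_modal(u, V, k=5)                     # sample 7 gets distance 0
```

Because the codes are ±1, the inner product counts agreements minus disagreements, so the conversion to Hamming distance needs no bit unpacking.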
Advantageous effects
Compared with the prior art, the invention has the following advantages.
1. During hash learning, the method fully preserves the similarity of samples between modalities as well as the similarity of samples within each modality, so the learned Hamming space has stronger discriminative power, which benefits cross-modal retrieval.
2. The method fully accounts for the redundancy of the hash code and minimizes the redundancy of each bit by imposing an orthogonality constraint, so the learned hash codes contain more discriminative information, effectively improving cross-modal retrieval performance.
Drawings
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a similarity preserving cross-modal hash retrieval method according to the present invention.
Detailed Description
A similarity-preserving cross-modal hash retrieval method follows, as shown in fig. 1, the problem setting and the five steps (1)-(5) set out above in the Disclosure of Invention.
The specific implementation process mainly comprises the following steps:
(1) Objective function construction based on the similarity-preservation strategy
In this method, the purpose of cross-modal hash learning is to use the image-modality and text-modality feature data $X^{(1)}$ and $X^{(2)}$ and the class label information of the objects to learn hash functions $f^{(1)}(x^{(1)})\in\{-1,+1\}^{c\times 1}$ and $f^{(2)}(x^{(2)})\in\{-1,+1\}^{c\times 1}$ for the image and text modalities, where c is the adjustable length of the binary hash codes. Suppose $U=[u_1,u_2,\ldots,u_n]\in\{-1,+1\}^{c\times n}$ and $V=[v_1,v_2,\ldots,v_n]\in\{-1,+1\}^{c\times n}$ are the binary hash codes in the Hamming space generated from the image-modality and text-modality feature data of the n objects by the corresponding hash functions, where $u_i$ and $v_i$ ($i=1,2,\ldots,n$) denote the hash codes of the i-th object in the image and text modalities respectively. For the binary hash codes U and V to have good discriminative power in the Hamming space, they should preserve the similarity information in S: if $S_{ij}=1$, the Hamming distances between $u_i$ and $v_j$ and between $u_j$ and $v_i$ should be as small as possible; otherwise, they should be as large as possible.
For ease of presentation, only the relation between $u_i$ and $v_j$ is formulated; the relation between $u_j$ and $v_i$ can be formulated similarly. For a pair of binary hash codes $u_i, v_j$, define their similarity by their inner product, as shown in equation (1):

$$\Theta_{ij}=\frac{\lambda}{c}\langle u_i,v_j\rangle,\tag{1}$$

where λ > 0 is an adjustable scale factor, c is the predetermined hash code length, and ⟨·,·⟩ denotes the vector inner product. Using the sigmoid function to project $\Theta_{ij}$ from its original interval into the range (0,1) gives

$$A_{ij}=\frac{1}{1+e^{-\Theta_{ij}}}.\tag{2}$$

Based on $A_{ij}$, the posterior probability of the cross-modal similarity matrix S is defined as

$$p\left(S_{ij}\mid u_i,v_j\right)=A_{ij}^{S_{ij}}\left(1-A_{ij}\right)^{1-S_{ij}}.\tag{3}$$

Following the likelihood estimation method of probability theory, the negative logarithm of equation (3), summed over all pairs, is expressed as

$$\mathcal{L}=\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\log\left(1+e^{\Theta_{ij}}\right)-S_{ij}\Theta_{ij}\right)+\mathrm{const},\tag{4}$$

where const denotes a constant.
Minimizing equation (4) preserves the cross-modal similarity in the hash codes U and V of the image and text modalities. Specifically, equation (4) shows that if $S_{ij}=1$, then $\Theta_{ij}$ needs to be as large as possible, i.e. the inner product of $u_i$ and $v_j$ needs to be as large as possible, i.e. the Hamming distance between the binary hash codes $u_i$ and $v_j$ needs to be as small as possible; conversely, if $S_{ij}=0$, then $\Theta_{ij}$ needs to be as small as possible, i.e. the Hamming distance between $u_i$ and $v_j$ needs to be as large as possible.
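The likelihood loss of equation (4) can be evaluated directly. A short sketch (illustrative names and synthetic data, not from the patent), using `np.logaddexp` so that log(1 + exp(Θ)) does not overflow for large Θ:

```python
import numpy as np

def neg_log_likelihood(U, V, S, lam=1.0):
    """Loss of equation (4): sum_ij log(1 + exp(Theta_ij)) - S_ij * Theta_ij,
    with Theta_ij = (lam / c) <u_i, v_j>."""
    c = U.shape[0]
    Theta = (lam / c) * (U.T @ V)              # n x n matrix of scaled inner products
    return float(np.sum(np.logaddexp(0.0, Theta) - S * Theta))

rng = np.random.default_rng(4)
c, n = 8, 30
labels = rng.integers(0, 3, n)
S = (labels[:, None] == labels[None, :]).astype(float)   # S_ij = 1 for same-class pairs
U = np.where(rng.standard_normal((c, n)) >= 0, 1, -1)
V = np.where(rng.standard_normal((c, n)) >= 0, 1, -1)
loss = neg_log_likelihood(U, V, S)
```

Each term is log(1 + e^{−Θ}) when S_ij = 1 and log(1 + e^{Θ}) when S_ij = 0, so the loss is always positive and is driven down exactly when similar pairs have large inner products and dissimilar pairs have small ones.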
An effective cross-modal retrieval method must consider not only inter-modal similarity but also the intra-modal neighbor structure, so intra-modal similarity must be preserved as well. For a single modality, the hash code of that modality is a transformation of the modality's original feature vectors. The problem of preserving intra-modal similarity in the hash codes can be treated as a classification problem: an optimal hash code should also be well suited to classification. Suppose the projection matrices mapping the image-modality features $X^{(1)}$ and the text-modality features $X^{(2)}$ to the hash codes U and V are $P_1\in\mathbb{R}^{d_1\times c}$ and $P_2\in\mathbb{R}^{d_2\times c}$ respectively, and the coefficient matrices for classifying the hash codes U and V are $W_1\in\mathbb{R}^{c\times m}$ and $W_2\in\mathbb{R}^{c\times m}$ respectively. Based on the $\ell_2$ loss, preserving intra-modal similarity can be achieved by minimizing

$$\|L-W_1^{T}U\|_F^2+\|L-W_2^{T}V\|_F^2+\|U-P_1^{T}X^{(1)}\|_F^2+\|V-P_2^{T}X^{(2)}\|_F^2.\tag{5}$$
the hash code applied to the cross-modal retrieval task is expected to have the following characteristics while keeping similarity between and within the modalities:
(1) independence. If each bit of the hash code is considered as an attribute, it is desirable that the redundancy between the attributes is as small as possible, that is, it is desirable that the bits are independent of each other. The formulation of this characteristic is described by equation (6):
UUT=nI,VVT=nI, (6)
wherein I is an identity matrix.
(2) And (4) balance. That is, it is desirable that the probability of each bit hash code being +1 and-1 be equal, each 50%. This constraint may maximize the information provided by each bit. The formulation of this characteristic is described by equation (7):
U1n×1=0,V1n×1=0, (7)
wherein 1 isn×1Representing a column vector with elements all being 1.
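The independence and balance ideals of equations (6) and (7) are easy to check numerically. In this illustrative sketch (not from the patent), rows 2 to 4 of a 4x4 Hadamard matrix give a small c = 3, n = 4 code that satisfies both exactly:

```python
import numpy as np

def constraint_violations(B):
    """Deviation of codes B (c x n, entries +/-1) from independence (B B^T = n I)
    and balance (B 1 = 0)."""
    c, n = B.shape
    indep = np.abs(B @ B.T - n * np.eye(c)).max()   # cross-bit correlations
    balance = np.abs(B.sum(axis=1)).max()           # per-bit +1 / -1 imbalance
    return indep, balance

# Hadamard rows (excluding the all-ones first row) are mutually orthogonal
# and each contains as many +1s as -1s.
H2 = np.array([[1, 1], [1, -1]])
H4 = np.kron(H2, H2)          # 4 x 4 Hadamard matrix
B = H4[1:, :]                 # c = 3 bits, n = 4 samples
indep, balance = constraint_violations(B)
```

Real learned codes will not satisfy the constraints exactly; a check like this measures how far a solution drifts from the two ideals.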
Combining the above analysis, the overall objective function of the similarity-preserving cross-modal hash retrieval method is designed as

$$\min_{U,V,P_1,P_2,W_1,W_2}\;J=\beta\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\log\left(1+e^{\Theta_{ij}}\right)-S_{ij}\Theta_{ij}\right)+\alpha\left(\|L-W_1^{T}U\|_F^2+\|L-W_2^{T}V\|_F^2\right)+\|U-P_1^{T}X^{(1)}\|_F^2+\|V-P_2^{T}X^{(2)}\|_F^2+\gamma\left(\|P_1\|_F^2+\|P_2\|_F^2+\|W_1\|_F^2+\|W_2\|_F^2\right)+\eta\left(\|U1_{n\times 1}\|_F^2+\|V1_{n\times 1}\|_F^2\right)$$
$$\text{s.t. } UU^{T}=nI,\;VV^{T}=nI,\;U\in\{-1,+1\}^{c\times n},\;V\in\{-1,+1\}^{c\times n},\tag{8}$$

where α, β, γ and η are non-negative balance factors.
(2) Solving of an objective function
The objective function equation (8) contains six variables to be solved, namely: hash codes U and V of image modality and text modality, and Hash projection matrix P of image modality and text modality1And P2Coefficient matrix W1And W2. The objective function in equation (8) is non-convex for the six variables to be solved, and therefore, the analytical solutions of the six variables to be solved cannot be obtained simultaneously. The unknown variable to be solved in equation (8) can be solved by solving the following four subproblems alternately, namely: fixed U, V, W1And W2Solving for P1And P2(ii) a Fixed U, V, P1And P2Solving for W1And W2(ii) a Fixed V, P1、P2、W1And W2Solving U; fixed U, P1、P2、W1And W2And solving for V.
(a) Fix U, V, W1 and W2, solve for P1 and P2

When the binary hash codes U and V and the coefficient matrices W1 and W2 are fixed, the objective function in equation (8) reduces to the following sub-problem in the hash projection matrices P1 and P2:

min_{P1,P2} J = ||U − P1^T X^(1)||_F^2 + ||V − P2^T X^(2)||_F^2 + γ(||P1||_F^2 + ||P2||_F^2). (9)

The problem in equation (9) is a standard regularized least-squares (ridge) regression problem. Taking the partial derivatives of J with respect to P1 and P2 and setting them equal to 0 gives:

∂J/∂P1 = −2X^(1)(U − P1^T X^(1))^T + 2γP1 = 0, (10)

∂J/∂P2 = −2X^(2)(V − P2^T X^(2))^T + 2γP2 = 0. (11)

A simple derivation from equations (10) and (11) yields:

P1 = (X^(1)X^(1)^T + γI)^{−1} X^(1)U^T, (12)

P2 = (X^(2)X^(2)^T + γI)^{−1} X^(2)V^T, (13)

wherein (·)^{−1} represents the inverse of a matrix.
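A hedged numerical sketch of step (a): the closed-form update of equation (12) is the ridge-regression solution of the sub-problem, which can be verified by checking that the gradient of equation (10) vanishes at it. The variable shapes and random data below are illustrative assumptions, not the patent's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, n, c, gamma = 5, 20, 4, 0.1

X1 = rng.standard_normal((d1, n))         # image-modality features X^(1)
U = np.sign(rng.standard_normal((c, n)))  # fixed binary hash codes

# Equation (12): P1 = (X1 X1^T + gamma I)^{-1} X1 U^T
# (solve the linear system instead of forming the explicit inverse)
P1 = np.linalg.solve(X1 @ X1.T + gamma * np.eye(d1), X1 @ U.T)

# Gradient of ||U - P1^T X1||_F^2 + gamma ||P1||_F^2 w.r.t. P1 (equation (10))
grad = -2 * X1 @ (U - P1.T @ X1).T + 2 * gamma * P1
print(np.abs(grad).max())  # numerically ~0
```

The gradient being numerically zero confirms that the closed form is a stationary point of the convex sub-problem.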
(b) Fix U, V, P1 and P2, solve for W1 and W2

When the binary hash codes U and V and the hash projection matrices P1 and P2 are fixed, the objective function in equation (8) reduces to the following sub-problem in the coefficient matrices W1 and W2:

min_{W1,W2} J = α(||L − W1^T U||_F^2 + ||L − W2^T V||_F^2) + γ(||W1||_F^2 + ||W2||_F^2). (14)

The problem in equation (14) is likewise a regularized least-squares regression problem. Taking the partial derivatives of J with respect to W1 and W2 and setting them equal to 0 gives:

∂J/∂W1 = −2αU(L − W1^T U)^T + 2γW1 = 0, (15)

∂J/∂W2 = −2αV(L − W2^T V)^T + 2γW2 = 0. (16)

A simple derivation from equations (15) and (16) yields:

W1 = (UU^T + (γ/α)I)^{−1} UL^T, (17)

W2 = (VV^T + (γ/α)I)^{−1} VL^T. (18)
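Step (b) admits the same kind of check: equation (17) is the ridge solution of the label-regression sub-problem, so the gradient of equation (15) vanishes at it. Shapes and data below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, c = 3, 15, 4          # number of classes, samples, code length
alpha, gamma = 2.0, 0.1

U = np.sign(rng.standard_normal((c, n)))            # fixed image-modality codes
L = rng.integers(0, 2, size=(m, n)).astype(float)   # 0/1 label matrix

# Equation (17): W1 = (U U^T + (gamma/alpha) I)^{-1} U L^T
W1 = np.linalg.solve(U @ U.T + (gamma / alpha) * np.eye(c), U @ L.T)

# Gradient of alpha ||L - W1^T U||_F^2 + gamma ||W1||_F^2 w.r.t. W1 (equation (15))
grad = -2 * alpha * U @ (L - W1.T @ U).T + 2 * gamma * W1
print(np.abs(grad).max())  # numerically ~0
```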
(c) Fix V, P1, P2, W1 and W2, solve for U

When the binary hash code V of the text modality, the hash projection matrices P1 and P2, and the coefficient matrices W1 and W2 are fixed, the objective function in equation (8) reduces to a sub-problem in the image-modality binary hash code U:

[Sub-problem in U, equation (19); rendered as an image in the original document.]

For convenience of solution, the method relaxes the discrete hash variable U into a continuous variable, so that the objective function in equation (19) becomes:

[Relaxed objective, equation (20); rendered as an image in the original document.]

Taking the derivative of the relaxed objective with respect to U gives:

[∂J/∂U, equation (21); rendered as an image in the original document.]

To find the optimal solution for U, U is iteratively updated by gradient descent, i.e., U(t+1) = U(t) + ΔU(t). Specifically, according to the first-order Taylor expansion:

J(U + ΔU) ≈ J(U) + tr((∂J/∂U)^T ΔU), (22)

so to satisfy J(U + ΔU) < J(U), one may select:

ΔU(t) = −ω1 ∂J/∂U(t), (23)

wherein the step size ω1 is a predefined relatively small constant. After the continuous variable U is found, the discrete hash variable U is obtained using the formula U = sign(U), where sign(·) is the element-wise sign function, that is: sign(x) = 1 when x ≥ 0, and sign(x) = −1 when x < 0.
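Step (c) can be sketched generically. The exact gradient of equation (21) appears only as an image in the original, so a toy quadratic objective stands in for it here; only the update rule U(t+1) = U(t) − ω1·∂J/∂U and the final sign(·) binarization follow the description above:

```python
import numpy as np

def relax_and_binarize(grad_fn, U0, omega1=0.01, iters=200):
    """Continuous relaxation solved by gradient descent,
    Delta U(t) = -omega1 * dJ/dU, followed by sign(.) binarization
    with the convention sign(0) = 1 (step (c) of the solver)."""
    U = U0.astype(float)
    for _ in range(iters):
        U = U - omega1 * grad_fn(U)
    return np.where(U >= 0, 1, -1)

# Hypothetical stand-in objective J(U) = ||U - T||_F^2, so dJ/dU = 2(U - T);
# gradient descent pulls U toward T, and binarization returns sign(T).
T = np.array([[0.9, -0.4], [-0.2, 0.7]])
U_bin = relax_and_binarize(lambda U: 2 * (U - T), np.zeros_like(T))
print(U_bin)
```

With the true gradient of equation (21) substituted for the toy one, the same loop implements the patent's update.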
(d) Fix U, P1, P2, W1 and W2, solve for V

When the binary hash code U of the image modality, the hash projection matrices P1 and P2, and the coefficient matrices W1 and W2 are fixed, the objective function in equation (8) reduces to a sub-problem in the text-modality binary hash code V:

[Sub-problem in V, equation (24); rendered as an image in the original document.]

Similar to the solution of the discrete variable U, the discrete hash variable V is also relaxed into a continuous variable for solution. After the continuous variable V is found, the discrete hash variable V is obtained using the formula V = sign(V).
(3) Generating binary hash codes for the query sample and the samples in the retrieval sample set

Assume the feature vector of a query sample of the image modality is x_q^(1) ∈ R^(d1×1), the feature vector of a query sample of the text modality is x_q^(2) ∈ R^(d2×1), the features of the samples in the image-modality retrieval sample set are X_r^(1) ∈ R^(d1×n_r), and the features of the samples in the text-modality retrieval sample set are X_r^(2) ∈ R^(d2×n_r), wherein n_r represents the number of samples in the retrieval sample set. Based on the learned hash projection matrices P1 and P2, the binary hash codes of the query samples in the image and text modalities and of the samples in the retrieval sample sets in the image and text modalities are, respectively:

u_q = sign(P1^T x_q^(1)), v_q = sign(P2^T x_q^(2)), U_r = sign(P1^T X_r^(1)), V_r = sign(P2^T X_r^(2)).
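A minimal sketch of step (3), assuming the codes are obtained by binarizing the linear projections with sign(·) (the exact generation formulas are shown only as images in the original); all shapes and random data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, c, n_r = 6, 8, 4, 10

P1 = rng.standard_normal((d1, c))     # learned image-modality projection
P2 = rng.standard_normal((d2, c))     # learned text-modality projection

xq1 = rng.standard_normal((d1, 1))    # image-modality query feature vector
Xr2 = rng.standard_normal((d2, n_r))  # text-modality retrieval-set features

# Assumed encoding rule: binarize the linear projection with sign(.)
uq = np.sign(P1.T @ xq1)   # (c, 1) query code
Vr = np.sign(P2.T @ Xr2)   # (c, n_r) retrieval-set codes
print(uq.shape, Vr.shape)
```

The same two matrices encode both the query and the retrieval set, so no retraining is needed at query time.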
(4) Calculating the Hamming distance from the query sample to each sample in the retrieval sample set

For a query sample of the image modality with hash code u_q, the Hamming distance from u_q to the hash code v_i of the i-th sample in the text-modality retrieval sample set is computed as d_H(u_q, v_i) = (c − u_q^T v_i)/2, where c is the code length. For a query sample of the text modality with hash code v_q, the Hamming distance from v_q to the hash code u_i of the i-th sample in the image-modality retrieval sample set is computed as d_H(v_q, u_i) = (c − v_q^T u_i)/2.
(5) Completing retrieval of the query sample using a cross-modal retriever

When retrieving text with an image, the Hamming distances from the query sample's hash code to the hash codes of the samples in the text-modality retrieval sample set are sorted in ascending order; when retrieving images with text, the Hamming distances from the query sample's hash code to the hash codes of the samples in the image-modality retrieval sample set are sorted in ascending order. The samples corresponding to the first K smallest distances in the retrieval sample set are then taken as the retrieval results.
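Steps (4) and (5) combine into a short ranking routine. For ±1 codes of length c, the Hamming distance can be computed from the inner product via the standard identity d_H(u, v) = (c − uᵀv)/2, assumed here to match the patent's image-only formulas; the example codes are hypothetical:

```python
import numpy as np

def hamming_topk(uq, Vr, K):
    """Rank retrieval-set codes Vr (c x n_r, entries +/-1) by Hamming
    distance to the query code uq (c x 1) and return the indices of
    the K nearest, using d_H(u, v) = (c - u^T v) / 2 for +/-1 codes."""
    c = uq.shape[0]
    dists = (c - (uq.T @ Vr).ravel()) / 2
    order = np.argsort(dists, kind="stable")  # ascending: nearest first
    return order[:K], dists

uq = np.array([[1], [-1], [1]])
Vr = np.array([[1, -1, 1],
               [-1, 1, -1],
               [1, 1, -1]])
topk, dists = hamming_topk(uq, Vr, K=2)
print(topk, dists)  # sample 0 matches exactly (distance 0), sample 2 differs in 1 bit
```

A single matrix product ranks the whole retrieval set, which is why Hamming-space retrieval scales well.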
The following describes the advantageous effects of the present invention with reference to specific experiments.
The MIRFLICKR-25K data set contains 25,000 images collected from the Flickr website, each annotated with several of 24 text labels, so MIRFLICKR-25K can be regarded as a multi-label data set. Only samples with at least 20 text labels were used in the experiments, yielding 20,015 image-text pairs. For each pair, the image is represented by a 512-dimensional GIST feature vector and the text by a 1386-dimensional bag-of-words vector. In the experiments, 1,000 image-text pairs were randomly selected to construct the query sample set, and 5,000 image-text pairs were randomly selected to train the cross-modal retrieval model.
In the experiments, the performance of cross-modal retrieval was measured by the mean average precision (MAP). Computing the MAP first requires computing the average precision (AP). Assuming a query sample returns R retrieved samples in a cross-modal retrieval, the average precision AP of this query sample is defined as:

AP = (1/N) Σ_{r=1}^{R} P(r) δ(r), (25)

wherein N is the number of retrieved samples truly relevant to the query, P(r) represents the precision of the first r retrieved samples, i.e., the fraction of the first r retrieved samples that are truly relevant to the query sample, and δ(r) = 1 when the r-th retrieved sample is truly relevant to the query sample, δ(r) = 0 otherwise. After the average precision AP of every query sample is obtained, the mean of these AP values is computed to obtain the mean average precision MAP.
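The AP/MAP computation described above can be sketched as follows. Equation (25) is reconstructed here in its standard form, an assumption since the original shows it only as an image; the relevance lists are made-up examples:

```python
import numpy as np

def average_precision(relevant, R=None):
    """AP over the top-R retrieved samples:
    AP = (1/N) * sum_r P(r) * delta(r), where delta(r) flags whether
    the r-th retrieved sample is truly relevant, P(r) is precision at
    rank r, and N is the number of relevant samples among the R."""
    delta = np.asarray(relevant[:R], dtype=float)
    if delta.sum() == 0:
        return 0.0
    ranks = np.arange(1, len(delta) + 1)
    precision_at_r = np.cumsum(delta) / ranks   # P(r) for every rank r
    return float((precision_at_r * delta).sum() / delta.sum())

# One query: relevance flags of the returned list, nearest first
ap = average_precision([1, 0, 1, 1, 0])
print(ap)  # (1/3) * (1/1 + 2/3 + 3/4)

# MAP: mean of the per-query AP values
mAP = np.mean([ap, average_precision([0, 1, 0, 0, 0])])
print(mAP)
```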
In the experiments, the optimal values of the parameters α, β, γ and η in the method of the invention were determined by 5-fold cross-validation. The parameters of the other methods were set according to the parameter-setting procedures described in the corresponding papers. All reported results are averages over 10 random trials.
The methods compared with the method of the invention are: the Canonical Correlation Analysis (CCA) method, the Cross-View Hashing (CVH) method, the Inter-Media Hashing (IMH) method, and the Latent Semantic Sparse Hashing (LSSH) method. Table 1 summarizes the mean average precision (MAP) of the proposed method and the compared methods for cross-modal retrieval on the MIRFLICKR-25K data set. In Table 1, Img2Txt and Txt2Img denote the task of retrieving text with an image and the task of retrieving images with text, respectively. As Table 1 shows, for both retrieval tasks the method of the invention outperforms the compared methods at all four hash code lengths. Specifically, for the Img2Txt task, the method of the invention improves the MAP at 16 bits, 32 bits, 64 bits and 128 bits by at least 0.0152 (= 0.3121 − 0.2969), 0.0220 (= 0.3285 − 0.3065), 0.0253 (= 0.3371 − 0.3118) and 0.0196 (= 0.3442 − 0.3246), respectively, over the other compared methods; for the Txt2Img task, the MAP at 16 bits, 32 bits, 64 bits and 128 bits is improved by at least 0.0242 (= 0.3925 − 0.3683), 0.0278 (= 0.4257 − 0.3979), 0.0273 (= 0.4618 − 0.4345) and 0.0351 (= 0.4969 − 0.4618). This shows that the similarity-preserving cross-modal hash retrieval method proposed by the present invention is effective.
TABLE 1 MAP of each method on the MIRFLICKR-25K data set

[Table 1; rendered as an image in the original document.]
The invention also includes: an inter-modality sample similarity preservation strategy, an intra-modality sample similarity preservation strategy, and a hash coding redundancy minimization scheme.
The inter-modality sample similarity preservation strategy is as follows: a cross-modal retrieval task must confront heterogeneous data whose modal properties differ greatly. Effectively eliminating the heterogeneity of the different modalities and fully mining the essential relations among them from their complex relationships promotes improved cross-modal retrieval performance. To fully mine discriminative information from data of different modalities, the method defines the similarity relation between the hash codes of samples of different modalities based on the inner product, models this relation as a probability with the sigmoid function, and then completes inter-modality similarity preservation based on the posterior probability of the cross-modal similarity matrix, thereby effectively mining discriminative information from cross-modal heterogeneous data.
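A hedged sketch of this probability model: the inner product of two codes is scaled by a factor λ and passed through the sigmoid to give a posterior probability of similarity. The parameterization below (function name, λ value, example codes) is illustrative, not the patent's exact formula:

```python
import numpy as np

def similarity_probability(u_i, v_j, lam=0.5):
    """Model P(S_ij = 1 | u_i, v_j) as sigmoid(lam * u_i^T v_j):
    codes that agree on more bits get a larger inner product and
    hence a posterior probability closer to 1."""
    theta = lam * float(u_i @ v_j)
    return 1.0 / (1.0 + np.exp(-theta))

u = np.array([1, -1, 1, 1])
v_similar = np.array([1, -1, 1, -1])  # agrees on 3 of 4 bits
v_dissimilar = -u                     # disagrees on every bit
p_hi = similarity_probability(u, v_similar)
p_lo = similarity_probability(u, v_dissimilar)
print(p_hi, p_lo)  # p_hi > 0.5 > p_lo
```

Maximizing the likelihood of the observed similarity matrix S under such a model pushes similar cross-modal pairs toward large inner products and dissimilar pairs toward small ones.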
The intra-modality sample similarity preservation strategy is as follows: for the samples inside a modality, their label information effectively reflects the neighborhood structure and similarity relations between samples. For a single modality, the hash code of that modality is a transformation of that modality's original feature data from the original feature space to the Hamming space. In order for the hash codes to preserve the similarity between samples within a modality, the method uses the samples' label information together with a linear regression model, of the kind used for classification tasks, to complete intra-modality similarity preservation.
The hash coding redundancy minimization scheme is as follows: if each bit of the hash code is regarded as an attribute, the redundancy between different attributes should be as small as possible, that is, the bits should be independent of each other. The method of the present invention achieves this by imposing orthogonality constraints on the different bits of the hash code.

Claims (6)

1. A similarity-preserving cross-modal hash retrieval method, wherein it is assumed that the features of n objects in the image modality and the text modality are X^(1) = [x_1^(1), x_2^(1), …, x_n^(1)] ∈ R^(d1×n) and X^(2) = [x_1^(2), x_2^(2), …, x_n^(2)] ∈ R^(d2×n), respectively, wherein d1 and d2 represent the dimensions of the image-modality and text-modality feature vectors, respectively, and x_i^(1) and x_i^(2) represent the features of the i-th object in the image modality and the text modality, respectively; meanwhile, the feature vectors of the image modality and the text modality are assumed to have been zero-centered, i.e., Σ_{i=1}^{n} x_i^(1) = 0 and Σ_{i=1}^{n} x_i^(2) = 0; it is assumed that the label matrix formed by the category labels of the n objects is L = [l_1, l_2, …, l_n] ∈ {0,1}^(m×n), wherein l_i (i = 1, 2, …, n) denotes the category label information of the i-th object and m is the number of categories; it is assumed that the cross-modal similarity matrix is S, whose element S_ij represents the similarity of the i-th sample in the image modality and the j-th sample in the text modality: if the i-th sample in the image modality is similar to the j-th sample in the text modality, S_ij = 1, otherwise S_ij = 0; the method is characterized by comprising the following steps:
(1) Constructing an objective function based on similarity preservation strategies: using an objective function designed based on an inter-modality similarity preservation strategy and an intra-modality similarity preservation strategy, obtain the binary hash codes U and V of the feature data of the n objects in the image and text modalities in the Hamming space, the hash projection matrices P1 and P2 corresponding to the image and text modalities, and two coefficient matrices W1 and W2;

(2) Solving the objective function: in view of the non-convexity of the objective function, obtain the solutions U, V, P1, P2, W1 and W2 by alternate solving, namely by alternately solving the following four sub-problems: fix U, V, W1 and W2, solve for P1 and P2; fix U, V, P1 and P2, solve for W1 and W2; fix V, P1, P2, W1 and W2, solve for U; fix U, P1, P2, W1 and W2, solve for V;

(3) Generating binary hash codes for the query sample and the samples in the retrieval sample set: based on the solved hash projection matrices P1 and P2 of the image and text modalities, generate binary hash codes for the query sample and the samples in the retrieval sample set;

(4) Calculating the Hamming distance from the query sample to each sample in the retrieval sample set: based on the generated binary hash codes, calculate the Hamming distance from the query sample to each sample in the retrieval sample set;

(5) Completing retrieval of the query sample using a cross-modal retriever: complete the retrieval of the query sample using a cross-modal retriever based on approximate nearest neighbor search.
2. The similarity-preserving cross-modal hash retrieval method according to claim 1, wherein the objective function designed based on the inter-modality similarity preservation strategy and the intra-modality similarity preservation strategy in step (1) is of the form:

[Objective function, equation (1); rendered as an image in the original document.]

wherein α, β, γ and η are non-negative balance factors, c is the length of the binary hash code, I is an identity matrix, 1_{n×1} represents a column vector whose elements are all 1, [a further quantity in equation (1) is defined in an image in the original document], λ > 0 is an adjustable scaling factor, u_i is the binary hash code of the i-th sample in the image modality, v_j is the binary hash code of the j-th sample in the text modality, ||·||_F represents the Frobenius norm of a matrix, and (·)^T represents the transpose operation of a matrix.
3. The similarity-preserving cross-modal hash retrieval method according to claim 2, wherein in step (2) the solutions U, V, P1, P2, W1 and W2 of the objective function are obtained by alternate solving, specifically by alternately solving the following four sub-problems:

(1) Fix U, V, W1 and W2, solve for P1 and P2: when the binary hash codes U and V and the coefficient matrices W1 and W2 are fixed, the objective function in equation (1) reduces to a sub-problem in the hash projection matrices P1 and P2:

[Equation (2); rendered as an image in the original document.]

(2) Fix U, V, P1 and P2, solve for W1 and W2: when the binary hash codes U and V and the hash projection matrices P1 and P2 are fixed, the objective function in equation (1) reduces to a sub-problem in the coefficient matrices W1 and W2:

[Equation (3); rendered as an image in the original document.]

(3) Fix V, P1, P2, W1 and W2, solve for U: when the binary hash code V of the text modality, the hash projection matrices P1 and P2, and the coefficient matrices W1 and W2 are fixed, the objective function in equation (1) reduces to a sub-problem in the image-modality binary hash code U:

[Equation (4); rendered as an image in the original document.]

(4) Fix U, P1, P2, W1 and W2, solve for V: when the binary hash code U of the image modality, the hash projection matrices P1 and P2, and the coefficient matrices W1 and W2 are fixed, the objective function in equation (1) reduces to a sub-problem in the text-modality binary hash code V:

[Equation (5); rendered as an image in the original document.]
4. The similarity-preserving cross-modal hash retrieval method according to claim 1, wherein in step (3) binary hash codes are generated for the query sample and the samples in the retrieval sample set based on the solved hash projection matrices P1 and P2 of the image and text modalities; specifically, assuming that the feature vector of a query sample of the image modality is x_q^(1) ∈ R^(d1×1), the feature vector of a query sample of the text modality is x_q^(2) ∈ R^(d2×1), the features of the samples in the image-modality retrieval sample set are X_r^(1) ∈ R^(d1×n_r), and the features of the samples in the text-modality retrieval sample set are X_r^(2) ∈ R^(d2×n_r), wherein n_r represents the number of samples in the retrieval sample set, the binary hash codes of the query samples in the image and text modalities and of the samples in the retrieval sample sets in the image and text modalities are, respectively, u_q = sign(P1^T x_q^(1)), v_q = sign(P2^T x_q^(2)), U_r = sign(P1^T X_r^(1)) and V_r = sign(P2^T X_r^(2)).
5. The similarity-preserving cross-modal hash retrieval method according to claim 4, wherein in step (4) the Hamming distance from the query sample to each sample in the retrieval sample set is calculated based on the generated binary hash codes; specifically, the formula d_H(u_q, v_i) = (c − u_q^T v_i)/2 is used to calculate the Hamming distance from the query sample of the image modality to each sample in the text-modality retrieval sample set, and the formula d_H(v_q, u_i) = (c − v_q^T u_i)/2 is used to calculate the Hamming distance from the query sample of the text modality to each sample in the image-modality retrieval sample set.
6. The similarity-preserving cross-modal hash retrieval method according to claim 5, wherein in step (5) the retrieval of the query sample is completed using a cross-modal retriever based on approximate nearest neighbor search; specifically, when retrieving text with an image, the Hamming distances from the query sample's hash code to the hash codes of the samples in the text-modality retrieval sample set are sorted in ascending order; when retrieving images with text, the Hamming distances from the query sample's hash code to the hash codes of the samples in the image-modality retrieval sample set are sorted in ascending order; the samples corresponding to the first K smallest distances in the retrieval sample set are then taken as the retrieval results.
CN201811097048.3A 2018-09-19 2018-09-19 Similarity-preserving cross-modal Hash retrieval method Active CN109271486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811097048.3A CN109271486B (en) 2018-09-19 2018-09-19 Similarity-preserving cross-modal Hash retrieval method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811097048.3A CN109271486B (en) 2018-09-19 2018-09-19 Similarity-preserving cross-modal Hash retrieval method

Publications (2)

Publication Number Publication Date
CN109271486A CN109271486A (en) 2019-01-25
CN109271486B true CN109271486B (en) 2021-11-26

Family

ID=65197120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811097048.3A Active CN109271486B (en) 2018-09-19 2018-09-19 Similarity-preserving cross-modal Hash retrieval method

Country Status (1)

Country Link
CN (1) CN109271486B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019652B (en) * 2019-03-14 2022-06-03 九江学院 Cross-modal Hash retrieval method based on deep learning
CN109960732B (en) * 2019-03-29 2023-04-18 广东石油化工学院 Deep discrete hash cross-modal retrieval method and system based on robust supervision
CN110059154B (en) * 2019-04-10 2022-04-15 山东师范大学 Cross-modal migration hash retrieval method based on inheritance mapping
CN111914108A (en) * 2019-05-07 2020-11-10 鲁东大学 Discrete supervision cross-modal Hash retrieval method based on semantic preservation
CN111914950B (en) * 2020-08-20 2021-04-16 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Unsupervised cross-modal retrieval model training method based on depth dual variational hash
CN112559810B (en) * 2020-12-23 2022-04-08 上海大学 Method and device for generating hash code by utilizing multi-layer feature fusion
CN113326287B (en) * 2021-08-04 2021-11-02 山东大学 Online cross-modal retrieval method and system using three-step strategy
CN114218259B (en) * 2022-02-21 2022-05-24 深圳市云初信息科技有限公司 Multi-dimensional scientific information search method and system based on big data SaaS

Citations (5)

Publication number Priority date Publication date Assignee Title
CN106777318A (en) * 2017-01-05 2017-05-31 西安电子科技大学 Matrix decomposition cross-module state Hash search method based on coorinated training
CN106844518A (en) * 2016-12-29 2017-06-13 天津中科智能识别产业技术研究院有限公司 A kind of imperfect cross-module state search method based on sub-space learning
CN107256271A (en) * 2017-06-27 2017-10-17 鲁东大学 Cross-module state Hash search method based on mapping dictionary learning
CN107346328A (en) * 2017-05-25 2017-11-14 北京大学 A kind of cross-module state association learning method based on more granularity hierarchical networks
CN108334574A (en) * 2018-01-23 2018-07-27 南京邮电大学 A kind of cross-module state search method decomposed based on Harmonious Matrix

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9734436B2 (en) * 2015-06-05 2017-08-15 At&T Intellectual Property I, L.P. Hash codes for images


Non-Patent Citations (2)

Title
Separation of temperature effects in bridge dynamic strain monitoring data based on the analytical mode decomposition method; Li Miao et al.; Journal of Vibration and Shock; 2013-02-01; Vol. 31, No. 21; pp. 6-10, 29 *
Research on the application of semantic enhancement and matrix factorization in cross-modal hash retrieval; Wang Ke; China Master's Theses Full-text Database, Information Science and Technology; 2016-10-15; No. 10; I138-439 *

Also Published As

Publication number Publication date
CN109271486A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN109271486B (en) Similarity-preserving cross-modal Hash retrieval method
CN108334574B (en) Cross-modal retrieval method based on collaborative matrix decomposition
Kulis et al. Fast similarity search for learned metrics
Lin et al. Spec hashing: Similarity preserving algorithm for entropy-based coding
US11651037B2 (en) Efficient cross-modal retrieval via deep binary hashing and quantization
Jain et al. Fast image search for learned metrics
CN106202256B (en) Web image retrieval method based on semantic propagation and mixed multi-instance learning
US8891908B2 (en) Semantic-aware co-indexing for near-duplicate image retrieval
US8428397B1 (en) Systems and methods for large scale, high-dimensional searches
CN109657112B (en) Cross-modal Hash learning method based on anchor point diagram
WO2013129580A1 (en) Approximate nearest neighbor search device, approximate nearest neighbor search method, and program
CN111460077A (en) Cross-modal Hash retrieval method based on class semantic guidance
Cheng et al. Semi-supervised multi-graph hashing for scalable similarity search
Yu et al. Binary set embedding for cross-modal retrieval
CN109766455B (en) Identified full-similarity preserved Hash cross-modal retrieval method
Lin et al. Optimizing ranking measures for compact binary code learning
CN112163114B (en) Image retrieval method based on feature fusion
Zhang et al. Autoencoder-based unsupervised clustering and hashing
CN115795065A (en) Multimedia data cross-modal retrieval method and system based on weighted hash code
JP5833499B2 (en) Retrieval device and program for retrieving content expressed by high-dimensional feature vector set with high accuracy
Wang et al. A multi-label least-squares hashing for scalable image search
Liu et al. Multiview Cross-Media Hashing with Semantic Consistency
CN112307248B (en) Image retrieval method and device
CN111984800B (en) Hash cross-modal information retrieval method based on dictionary pair learning
Zhong et al. Deep multi-label hashing for large-scale visual search based on semantic graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant