CN112199531A - Cross-modal retrieval method and device based on Hash algorithm and neighborhood map

Info

Publication number: CN112199531A
Application number: CN202011224930.7A
Authority: CN
Inventors: 杜翠凤; 蒋仕宝; 孙广波; 朱春荣
Original assignee: Guangzhou Jiesai Communication Planning And Design Institute Co ltd; GCI Science and Technology Co Ltd
Current assignee: Guangzhou Jiesai Communication Planning And Design Institute Co ltd; GCI Science and Technology Co Ltd
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2021-01-08
Anticipated expiration: 2040-11-05
Also published as: CN112199531B

Abstract

The invention discloses a cross-modal retrieval method and device based on a hash algorithm and a neighborhood graph. The retrieval method comprises the following steps: obtaining multi-modal original samples, and minimizing the residual between the multi-modal original samples before and after feature transformation to obtain a minimized residual value; learning latent correlations among the multi-modal original samples by collaborative matrix factorization, and calculating the inter-modal semantic consistency of the multi-modal original samples from the latent correlations; calculating the intra-modal semantic consistency of the multi-modal original samples by manifold learning on a neighborhood graph; and combining the minimized residual value, the inter-modal semantic consistency and the intra-modal semantic consistency with a regularization term that avoids overfitting to obtain an objective function. By jointly considering the global characteristics of the multiple modalities and the local characteristics between the modalities when calculating the objective function for cross-modal retrieval, embodiments of the invention improve the comprehensiveness and accuracy of cross-modal retrieval.

Description

Cross-modal retrieval method and device based on Hash algorithm and neighborhood map

Technical Field

The invention relates to the technical field of retrieval, and in particular to a cross-modal retrieval method and device based on a hash algorithm and a neighborhood graph.

Background

The rapid development of information technology has brought explosive growth in multi-modal data, including multi-source heterogeneous data such as images, audio, text and video. Because the semantic representations of different modalities are heterogeneous, efficient multi-modal retrieval is one of the key problems in current multi-modal fusion. In the prior art, multi-modal retrieval is mostly implemented with a hash algorithm: the hash algorithm maps the multi-modal data into a unified latent space, and the multi-modal spaces are aligned by means of hash codes obtained by quantizing feature vectors with a hash function. However, the applicant has found in research that existing cross-modal retrieval methods consider neither the similarity between samples within the same modality nor the similarity between modalities, so the cross-modal retrieval effect is poor.
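The hash-based retrieval described above can be illustrated with a minimal sketch. It assumes that real-valued feature vectors have already been mapped into the shared space, that codes are obtained by sign quantization, and that candidates are ranked by Hamming distance; the function names `binarize` and `hamming_rank` are hypothetical and do not come from the patent.

```python
import numpy as np

def binarize(V):
    """Quantize real-valued embeddings (k x n) to {-1, +1} codes by sign (assumed scheme)."""
    return np.where(V >= 0, 1, -1).astype(np.int8)

def hamming_rank(query_code, db_codes):
    """Rank database items (columns of db_codes, k x n) by Hamming distance to a query code."""
    k = db_codes.shape[0]
    # For +/-1 codes, Hamming distance = (k - inner product) / 2.
    inner = db_codes.T.astype(np.int32) @ query_code.astype(np.int32)
    dist = (k - inner) // 2
    order = np.argsort(dist, kind="stable")  # closest items first
    return order, dist[order]
```

Because both modalities share one code space in this sketch, an image code can be used directly as the query against a database of text codes, and vice versa.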

Disclosure of Invention

The invention provides a cross-modal retrieval method and device based on a hash algorithm and a neighborhood graph, and aims to solve the technical problem that existing cross-modal retrieval methods perform poorly because they consider neither the similarity between samples within the same modality nor the similarity between modalities.

The first embodiment of the present invention provides a cross-modal retrieval method based on a hash algorithm and a neighborhood graph, including:

obtaining multi-modal original samples, and minimizing the residual between the multi-modal original samples before and after feature transformation to obtain a minimized residual value;

learning latent correlations among the multi-modal original samples by collaborative matrix factorization, and calculating the inter-modal semantic consistency of the multi-modal original samples from the latent correlations;

calculating the intra-modal semantic consistency of the multi-modal original samples by manifold learning on a neighborhood graph;

and combining the minimized residual value, the inter-modal semantic consistency and the intra-modal semantic consistency with a regularization term that avoids overfitting to obtain an objective function.

Further, obtaining the multi-modal original samples and minimizing the residual between the multi-modal original samples before and after feature transformation to obtain the minimized residual value specifically includes:

obtaining a hash code set of the training set by setting a hash code for each of the multi-modal samples, and obtaining the minimized residual value of the multi-modal original samples from the hash code set and a preset semantic label matrix according to the principle of error minimization.

Further, learning the latent correlations among the multi-modal original samples by collaborative matrix factorization and calculating the inter-modal semantic consistency of the multi-modal original samples from the latent correlations specifically includes:

performing feature extraction on the multi-modal original samples to obtain an image basic feature matrix and a text basic feature matrix; and calculating the inter-modal semantic consistency from the image basic feature matrix and the text basic feature matrix.

Further, calculating the intra-modal semantic consistency of the multi-modal original samples by manifold learning on a neighborhood graph specifically includes:

constructing a neighborhood graph over the data of each modality to represent the local relationships among samples, and calculating the intra-modal semantic consistency of the multi-modal original samples from the neighborhood graphs, the image basic feature matrix and the text basic feature matrix.

Further, the regularization term includes a regression coefficient matrix, a sample noise matrix, the image basic feature matrix and the text basic feature matrix.

A second embodiment of the present invention provides a cross-modal retrieval device based on a hash algorithm and a neighborhood graph, including:

a minimization processing module, configured to obtain multi-modal original samples and to minimize the residual between the multi-modal original samples before and after feature transformation to obtain a minimized residual value;

a first calculation module, configured to learn latent correlations among the multi-modal original samples by collaborative matrix factorization and to calculate the inter-modal semantic consistency of the multi-modal original samples from the latent correlations;

a second calculation module, configured to calculate the intra-modal semantic consistency of the multi-modal original samples by manifold learning on a neighborhood graph;

and a third calculation module, configured to combine the minimized residual value, the inter-modal semantic consistency and the intra-modal semantic consistency with a regularization term that avoids overfitting to obtain an objective function.

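Further, the minimization processing module is specifically configured to:

obtain a hash code set of the training set by setting a hash code for each of the multi-modal samples, and obtain the minimized residual value of the multi-modal original samples from the hash code set and a preset semantic label matrix according to the principle of error minimization.

Further, the first calculation module is specifically configured to:

perform feature extraction on the multi-modal original samples to obtain an image basic feature matrix and a text basic feature matrix; and calculate the inter-modal semantic consistency from the image basic feature matrix and the text basic feature matrix.

Further, the second calculation module is specifically configured to:

construct a neighborhood graph over the data of each modality to represent the local relationships among samples, and calculate the intra-modal semantic consistency of the multi-modal original samples from the neighborhood graphs, the image basic feature matrix and the text basic feature matrix.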

By combining the minimized residual between the original samples before and after transformation, the inter-modal semantic consistency and the intra-modal semantic consistency, the embodiment of the invention takes into account both the global characteristics of the multiple modalities and the local characteristics between the modalities when calculating the objective function for cross-modal retrieval, thereby improving the comprehensiveness and accuracy of cross-modal retrieval.

Drawings

Fig. 1 is a schematic flowchart of a cross-modal retrieval method based on a hash algorithm and a neighborhood graph according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a cross-modal retrieval device based on a hash algorithm and a neighborhood graph according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the description of the present application, it is to be understood that the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.

In the description of the present application, it is to be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be fixed, removable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific case.

Referring to Fig. 1, a first embodiment of the present invention provides a cross-modal retrieval method based on a hash algorithm and a neighborhood graph, including:

S1, obtaining multi-modal original samples, and minimizing the residual between the multi-modal original samples before and after feature transformation to obtain a minimized residual value;

S2, learning latent correlations among the multi-modal original samples by collaborative matrix factorization, and calculating the inter-modal semantic consistency of the multi-modal original samples from the latent correlations;

S3, calculating the intra-modal semantic consistency of the multi-modal original samples by manifold learning on a neighborhood graph;

and S4, combining the minimized residual value, the inter-modal semantic consistency and the intra-modal semantic consistency with a regularization term that avoids overfitting to obtain an objective function.
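Steps S1 to S4 can be sketched as a single function that evaluates the combined objective. This is only an illustration under stated assumptions, not the formula of the embodiment: the squared Frobenius form of each term, the Laplacian form of the intra-modal term, the weights `alpha`, `beta`, `gamma`, and all function and argument names are hypothetical. Columns of `X` and `Y` are paired image and text samples, `B` holds one hash code per column, and `S_x`, `S_y` are symmetric neighborhood graphs.

```python
import numpy as np

def fro2(M):
    """Squared Frobenius norm."""
    return float(np.sum(M * M))

def laplacian(S):
    """Graph Laplacian D - S of a symmetric similarity matrix S."""
    return np.diag(S.sum(axis=1)) - S

def objective(X, Y, L, B, W, U_x, U_y, S_x, S_y, E,
              alpha=1.0, beta=1.0, gamma=1e-2):
    # S1: residual between the label matrix L and its regression from the codes B (assumed form).
    residual = fro2(L - W @ B)
    # S2: inter-modal consistency; both modalities are factorized with the same codes B.
    inter = fro2(X - U_x @ B) + fro2(Y - U_y @ B)
    # S3: intra-modal consistency; neighbours in each graph should stay close after projection.
    P_x, P_y = U_x.T @ X, U_y.T @ Y
    intra = float(np.trace(P_x @ laplacian(S_x) @ P_x.T)
                  + np.trace(P_y @ laplacian(S_y) @ P_y.T))
    # S4: regularization over W, U_x, U_y and the sample noise matrix E.
    reg = fro2(W) + fro2(U_x) + fro2(U_y) + fro2(E)
    return residual + alpha * inter + beta * intra + gamma * reg
```

An optimizer would alternate updates of `W`, `U_x`, `U_y` and `B` to decrease this value; that procedure is outside the scope of this sketch.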

When cross-modal retrieval is carried out, the residual between the original samples before and after transformation is minimized, and the inter-modal semantic consistency and the intra-modal semantic consistency are considered together. The factors that influence the transformation of the original data are thus taken into account when the objective function is calculated, which improves the comprehensiveness and accuracy of cross-modal retrieval.

Specifically, in the embodiment of the invention, the minimized residual between the original samples before and after transformation reflects the global characteristics of the multiple modalities, the inter-modal consistency reflects the local characteristics between modalities, and the intra-modal consistency reflects the local characteristics within each modality. The overall characteristics of the multiple modalities are thereby extracted efficiently, which effectively improves the accuracy and reliability of cross-modal retrieval.

As a specific implementation of the embodiment of the present invention, obtaining the multi-modal original samples and minimizing the residual between the multi-modal original samples before and after feature transformation to obtain the minimized residual value specifically is:

obtaining a hash code set of the training set by setting a hash code for each of the multi-modal samples, and obtaining the minimized residual value of the multi-modal original samples from the hash code set and a preset semantic label matrix according to the principle of error minimization.

Specifically, given a semantic label matrix L, let the hash code of each sample be b and the hash code set of the training set be B. According to the principle of error minimization, the minimized residual between the original samples before and after transformation is expressed as:

wherein W is a regression coefficient matrix obtained by hash learning, and L can be understood as a unified latent semantic space shared by the modalities. Assuming that a linear relationship exists among the original samples, the hash code b used for retrieval is learned by linear regression, and the data of the different modalities are then decomposed to obtain the unified latent semantic space.
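As a hedged illustration of learning W by linear regression, the sketch below assumes the residual takes the form ||L - W B||_F^2 with a small ridge penalty on W, which admits a closed-form solution when B is held fixed. The assumed form, the penalty weight `lam` and the function names are hypothetical; the embodiment's formula may differ.

```python
import numpy as np

def fit_regression(B, L, lam=1e-2):
    """Closed-form minimizer of ||L - W B||_F^2 + lam * ||W||_F^2 over W.

    B: k x n hash codes, L: c x n semantic label matrix. Returns W with shape c x k.
    """
    k = B.shape[0]
    G = B @ B.T + lam * np.eye(k)          # k x k, symmetric positive definite
    # W = L B^T G^{-1}; solve the linear system instead of forming the inverse.
    return np.linalg.solve(G, B @ L.T).T

def label_residual(B, L, W):
    """Squared residual between the labels and their regression from the hash codes."""
    return float(np.linalg.norm(L - W @ B, "fro") ** 2)
```

In an alternating scheme, this update for W would be interleaved with updates of the hash codes B and the basic feature matrices.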

As a specific implementation of the embodiment of the present invention, learning the latent correlations among the multi-modal original samples by collaborative matrix factorization and calculating the inter-modal semantic consistency of the multi-modal original samples from the latent correlations specifically is:

performing feature extraction on the multi-modal original samples to obtain an image basic feature matrix and a text basic feature matrix; and calculating the inter-modal semantic consistency from the image basic feature matrix and the text basic feature matrix.

Specifically, the inter-modal semantic consistency is expressed through the correlation between modalities. This correlation is obtained by learning the latent correlations among the multi-modal samples through collaborative matrix factorization, that is, through the process of hash function learning and feature extraction. The inter-modal semantic consistency is specifically expressed as:

wherein X is the multi-modal original sample, U_X is the image basic feature matrix, U_Y is the text basic feature matrix, and B is the hash code set.
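For orientation, a common way to write a collaborative matrix factorization term with these symbols is sketched below. It is an assumption rather than the formula of the embodiment: Y (the text-modality sample matrix), the weights \lambda_X and \lambda_Y, the code length k and the sample count n are names introduced here for illustration.

```latex
\min_{U_X,\,U_Y,\,B}\;
\lambda_X \,\lVert X - U_X B \rVert_F^2
\;+\;
\lambda_Y \,\lVert Y - U_Y B \rVert_F^2,
\qquad B \in \{-1,+1\}^{k \times n}
```

Because both modalities are reconstructed from the same code matrix B, a paired image and text are pushed toward the same hash code, which is one way to express consistency between modalities.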

As a specific implementation of the embodiment of the present invention, calculating the intra-modal semantic consistency of the multi-modal original samples by manifold learning on a neighborhood graph specifically is:

constructing a neighborhood graph over the data of each modality to represent the local relationships among samples, and calculating the intra-modal semantic consistency of the multi-modal original samples from the neighborhood graphs, the image basic feature matrix and the text basic feature matrix.

It should be noted that intra-modal semantic consistency rests on the assumption that the data of a modality are approximately drawn from the same underlying space. The method constructs a neighborhood graph S over the data of each modality to represent the local relationships among the original samples, and obtains the intra-modal semantic consistency of the multi-modal original samples by calculation and optimization from the neighborhood graphs, the image basic feature matrix and the text basic feature matrix. The intra-modal semantic consistency is specifically expressed as:

wherein the two neighborhood graphs respectively represent the similarity matrix within the X modality and the similarity matrix within the Y modality.
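A neighborhood graph and the smoothness term it induces can be sketched as follows. The sketch assumes a symmetric 0/1 k-nearest-neighbour graph under Euclidean distance and a graph Laplacian regularizer applied to the projected features U^T X; the choice of graph, the projection and the function names are hypothetical, and the embodiment may define the similarity matrices differently.

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric 0/1 k-nearest-neighbour graph over the columns of X (d x n), with k < n."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    D = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # squared Euclidean distances
    np.fill_diagonal(D, np.inf)                        # a sample is not its own neighbour
    idx = np.argsort(D, axis=1)[:, :k]                 # k nearest neighbours of each sample
    S = np.zeros((n, n))
    S[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(S, S.T)                          # keep an edge if either end selects it

def laplacian(S):
    """Graph Laplacian D - S of a symmetric similarity matrix S."""
    return np.diag(S.sum(axis=1)) - S

def intra_modal_term(U, X, S):
    """tr(P Lap P^T) with P = U^T X, i.e. 0.5 * sum_ij S_ij * ||p_i - p_j||^2."""
    P = U.T @ X
    return float(np.trace(P @ laplacian(S) @ P.T))
```

A small value of `intra_modal_term` means that samples which are neighbours in the original modality remain close after projection, which is the local structure the neighborhood graph is meant to preserve.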

As a specific implementation of the embodiment of the present invention, the regularization term includes a regression coefficient matrix, a sample noise matrix, the image basic feature matrix and the text basic feature matrix.

In the embodiment of the invention, the objective function is established by introducing the minimized residual between the original samples before and after transformation, the inter-modal semantic consistency and the intra-modal semantic consistency, and combining them with a regularization term:

The first term of the objective function is the minimized residual value obtained by semantic label learning, which helps to obtain a highly discriminative model. The second and third terms are the inter-modal consistency, and the fourth and fifth terms are the intra-modal consistency. The sixth term is a constraint term, namely the regularization term that avoids overfitting, which comprises the regression coefficient matrix W, the image basic feature matrix U_X, the text basic feature matrix U_Y and the sample noise matrix E.
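A six-term objective matching this description can be written, as a sketch only, in the form below. The specific form of each term is an assumption: Y (the text-modality sample matrix), the Laplacians \mathcal{L}_X and \mathcal{L}_Y of the two neighborhood graphs, and the weights \alpha, \beta, \gamma are names introduced here for illustration. E appears only in the regularizer because the description does not state where the noise matrix enters the other terms.

```latex
\min_{B,\,W,\,U_X,\,U_Y,\,E}\;
\underbrace{\lVert L - W B \rVert_F^2}_{\text{(1) label residual}}
+ \underbrace{\alpha \lVert X - U_X B \rVert_F^2
            + \alpha \lVert Y - U_Y B \rVert_F^2}_{\text{(2), (3) inter-modal consistency}}
+ \underbrace{\beta\, \mathrm{tr}\!\left(U_X^{\top} X \mathcal{L}_X X^{\top} U_X\right)
            + \beta\, \mathrm{tr}\!\left(U_Y^{\top} Y \mathcal{L}_Y Y^{\top} U_Y\right)}_{\text{(4), (5) intra-modal consistency}}
+ \underbrace{\gamma \left( \lVert W \rVert_F^2 + \lVert U_X \rVert_F^2
            + \lVert U_Y \rVert_F^2 + \lVert E \rVert_F^2 \right)}_{\text{(6) regularization}}
```

Here each Laplacian is the degree matrix of the corresponding neighborhood graph minus the graph itself, so terms (4) and (5) penalize projected features that separate neighbouring samples.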

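The embodiment of the invention has the following beneficial effects:

By combining the minimized residual between the original samples before and after transformation, the inter-modal semantic consistency and the intra-modal semantic consistency, the embodiment of the invention takes into account both the global characteristics of the multiple modalities and the local characteristics between the modalities when calculating the objective function for cross-modal retrieval, thereby improving the comprehensiveness and accuracy of cross-modal retrieval.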

Referring to Fig. 2, a second embodiment of the present invention provides a cross-modal retrieval device based on a hash algorithm and a neighborhood graph, including:

a minimization processing module 10, configured to obtain multi-modal original samples and to minimize the residual between the multi-modal original samples before and after feature transformation to obtain a minimized residual value;

a first calculation module 20, configured to learn latent correlations among the multi-modal original samples by collaborative matrix factorization and to calculate the inter-modal semantic consistency of the multi-modal original samples from the latent correlations;

a second calculation module 30, configured to calculate the intra-modal semantic consistency of the multi-modal original samples by manifold learning on a neighborhood graph;

and a third calculation module 40, configured to combine the minimized residual value, the inter-modal semantic consistency and the intra-modal semantic consistency with a regularization term that avoids overfitting to obtain an objective function.

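As a specific implementation of the embodiment of the present invention, the minimization processing module 10 is specifically configured to:

obtain a hash code set of the training set by setting a hash code for each of the multi-modal samples, and obtain the minimized residual value of the multi-modal original samples from the hash code set and a preset semantic label matrix according to the principle of error minimization.

As a specific implementation of the embodiment of the present invention, the first calculation module 20 is specifically configured to:

perform feature extraction on the multi-modal original samples to obtain an image basic feature matrix and a text basic feature matrix; and calculate the inter-modal semantic consistency from the image basic feature matrix and the text basic feature matrix.

As a specific implementation of the embodiment of the present invention, the second calculation module 30 is specifically configured to:

construct a neighborhood graph over the data of each modality to represent the local relationships among samples, and calculate the intra-modal semantic consistency of the multi-modal original samples from the neighborhood graphs, the image basic feature matrix and the text basic feature matrix.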

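The embodiment of the invention has the following beneficial effects:

By combining the minimized residual between the original samples before and after transformation, the inter-modal semantic consistency and the intra-modal semantic consistency, the device takes into account both the global characteristics of the multiple modalities and the local characteristics between the modalities when calculating the objective function for cross-modal retrieval, thereby improving the comprehensiveness and accuracy of cross-modal retrieval.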

The foregoing is a preferred embodiment of the present invention, and it should be noted that it would be apparent to those skilled in the art that various modifications and enhancements can be made without departing from the principles of the invention, and such modifications and enhancements are also considered to be within the scope of the invention.

Claims

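1. A cross-modal retrieval method based on a hash algorithm and a neighborhood graph, characterized by comprising the following steps:

obtaining multi-modal original samples, and minimizing the residual between the multi-modal original samples before and after feature transformation to obtain a minimized residual value;

learning latent correlations among the multi-modal original samples by collaborative matrix factorization, and calculating the inter-modal semantic consistency of the multi-modal original samples from the latent correlations;

calculating the intra-modal semantic consistency of the multi-modal original samples by manifold learning on a neighborhood graph;

and combining the minimized residual value, the inter-modal semantic consistency and the intra-modal semantic consistency with a regularization term that avoids overfitting to calculate an objective function.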

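2. The cross-modal retrieval method based on a hash algorithm and a neighborhood graph according to claim 1, wherein obtaining the multi-modal original samples and minimizing the residual between the multi-modal original samples before and after feature transformation to obtain the minimized residual value specifically comprises:

obtaining a hash code set of the training set by setting a hash code for each of the multi-modal samples, and obtaining the minimized residual value of the multi-modal original samples from the hash code set and a preset semantic label matrix according to the principle of error minimization.

3. The cross-modal retrieval method based on a hash algorithm and a neighborhood graph according to claim 1, wherein learning the latent correlations among the multi-modal original samples by collaborative matrix factorization and calculating the inter-modal semantic consistency of the multi-modal original samples from the latent correlations specifically comprises:

performing feature extraction on the multi-modal original samples to obtain an image basic feature matrix and a text basic feature matrix; and calculating the inter-modal semantic consistency from the image basic feature matrix and the text basic feature matrix.

4. The cross-modal retrieval method based on a hash algorithm and a neighborhood graph according to claim 3, wherein calculating the intra-modal semantic consistency of the multi-modal original samples by manifold learning on a neighborhood graph specifically comprises:

constructing a neighborhood graph over the data of each modality to represent the local relationships among samples, and calculating the intra-modal semantic consistency of the multi-modal original samples from the neighborhood graphs, the image basic feature matrix and the text basic feature matrix.

5. The cross-modal retrieval method based on a hash algorithm and a neighborhood graph according to claim 1, wherein the regularization term comprises a regression coefficient matrix, a sample noise matrix, the image basic feature matrix and the text basic feature matrix.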

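6. A cross-modal retrieval device based on a hash algorithm and a neighborhood graph, characterized by comprising:

a minimization processing module, configured to obtain multi-modal original samples and to minimize the residual between the multi-modal original samples before and after feature transformation to obtain a minimized residual value;

a first calculation module, configured to learn latent correlations among the multi-modal original samples by collaborative matrix factorization and to calculate the inter-modal semantic consistency of the multi-modal original samples from the latent correlations;

a second calculation module, configured to calculate the intra-modal semantic consistency of the multi-modal original samples by manifold learning on a neighborhood graph;

and a third calculation module, configured to combine the minimized residual value, the inter-modal semantic consistency and the intra-modal semantic consistency with a regularization term that avoids overfitting to calculate an objective function.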

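7. The cross-modal retrieval device based on a hash algorithm and a neighborhood graph according to claim 6, wherein the minimization processing module is specifically configured to:

obtain a hash code set of the training set by setting a hash code for each of the multi-modal samples, and obtain the minimized residual value of the multi-modal original samples from the hash code set and a preset semantic label matrix according to the principle of error minimization.

8. The cross-modal retrieval device based on a hash algorithm and a neighborhood graph according to claim 6, wherein the first calculation module is specifically configured to:

perform feature extraction on the multi-modal original samples to obtain an image basic feature matrix and a text basic feature matrix; and calculate the inter-modal semantic consistency from the image basic feature matrix and the text basic feature matrix.

9. The cross-modal retrieval device based on a hash algorithm and a neighborhood graph according to claim 8, wherein the second calculation module is specifically configured to:

construct a neighborhood graph over the data of each modality to represent the local relationships among samples, and calculate the intra-modal semantic consistency of the multi-modal original samples from the neighborhood graphs, the image basic feature matrix and the text basic feature matrix.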

10. The cross-modal retrieval device based on a hash algorithm and a neighborhood graph according to claim 6, wherein the regularization term comprises a regression coefficient matrix, a sample noise matrix, the image basic feature matrix and the text basic feature matrix.