CN113076465A - Universal cross-modal retrieval model based on deep hash - Google Patents

Universal cross-modal retrieval model based on deep hash

Info

Publication number
CN113076465A
CN113076465A (application CN202110526554.5A)
Authority
CN
China
Prior art keywords
model
data
text
image
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110526554.5A
Other languages
Chinese (zh)
Inventor
段友祥
陈宁
孙歧峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China
Priority to CN202110526554.5A
Publication of CN113076465A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Abstract

The invention discloses a universal cross-modal retrieval model based on deep hashing, which comprises an image model, a text model, a binary code conversion model and a Hamming space. The image model extracts the features and semantics of image data; the text model extracts the features and semantics of text data; the binary code conversion model converts the original features into binary codes; the Hamming space is a common subspace of the image and text data in which the similarity of cross-modal data can be computed directly. The universal model proposed by the invention, which combines deep learning and hash learning to solve cross-modal retrieval, maps data points from the original feature space to binary codes in a common Hamming space and ranks results by computing the Hamming distance between the code of the query data and the codes of the stored data, thereby obtaining the retrieval result and greatly improving retrieval efficiency. In addition, the original data are replaced by binary codes for storage, which greatly reduces the storage requirements of the retrieval task.

Description

Universal cross-modal retrieval model based on deep hash
Technical Field
The invention relates to the field of cross-modal retrieval, in particular to cross-modal retrieval of images and texts.
Background
In recent years, with the rapid growth of the internet and the popularization of smart devices and social networks, multimedia data on the internet has increased explosively. These massive data come in various modalities such as text, images, video and audio, and the same object may be described by data of different modalities. Such data are heterogeneous and multi-source in form but semantically related to each other. Single-modality retrieval no longer satisfies people's need for information, and realizing cross-modal retrieval through the knowledge collaboration of different modalities has become a research hotspot in recent years.
Deep learning has made breakthrough progress in single-modality fields such as natural language processing, image understanding and speech recognition, and the strong abstraction capability of neural networks has shown great potential in multimedia applications such as object recognition and text generation, laying a theoretical and practical foundation for research on cross-modal retrieval.
Most prior-art techniques model directly on the extracted real-valued features to perform cross-modal retrieval, which is very time consuming for large-scale datasets and requires a large amount of storage space. They also pursue retrieval accuracy alone while neglecting retrieval efficiency, so the trained models suffer from large retrieval latency and low efficiency and are difficult to apply in practice. Hash learning, by contrast, works well on large-scale data thanks to its low storage requirements and high retrieval speed.
Disclosure of Invention
To address the problems and shortcomings of the prior art, the invention provides a universal cross-modal retrieval model based on deep hashing that combines the strong representation-learning performance of deep learning algorithms with the high efficiency and low storage cost of hashing methods, which helps reduce the heterogeneity gap and semantic gap between data of different modalities while lowering the computational complexity of the algorithm. Properly combining deep learning and hash learning to model different types of data for cross-modal retrieval is a promising direction: it can achieve excellent retrieval accuracy while striking a good balance between computational efficiency and retrieval performance.
Specifically, the present application provides a universal cross-modal retrieval model based on deep hashing, including:
the image model is used for extracting features and semantics of input image data;
the text model is used for extracting features and semantics of input text data;
the binary code conversion model is responsible for mapping data points from the original feature spaces to binary codes in a common Hamming space;
and the Hamming space is a common subspace of the feature spaces of the image model and the text model, in which similarity ranking can be performed by computing the Hamming distance between the hash code of the query data and the codes of the stored data, thereby obtaining the cross-modal retrieval result.
The invention is based on deep learning and hash learning techniques.
1. The image model includes:
1.1 Image data preprocessing: the picture is preprocessed into a feature form and fed into the image convolutional neural network;
1.2 Image feature and semantic extraction model: a CNN model pre-trained on the ImageNet dataset that has shown excellent performance in image feature extraction and image classification, such as ResNet, SENet, DenseNet or GCN, can be adopted.
2. The text model includes:
2.1 Text vectorization: the text data is generally converted into vector form with a BoW model or a Word2Vec model;
2.2 Text feature and semantic extraction model: a recurrent neural network, the most successful architecture for tasks with temporal dependencies, is usually used; LSTM and Transformer models, which have shown excellent performance in natural language processing in recent years, are the preferred choices.
3. The binary code conversion model includes:
3.1 Several fully connected layers that map the image features and text features extracted by the image model and the text model into binary hash codes of a specified length;
3.2 The number of bits of the binary code depends on the number of nodes of the last fully connected layer, generally set to 16, 32, 64 or 128, so that retrieval results for codes of different lengths can be compared.
4. The Hamming space is a common subspace of the image and text features from their different feature spaces, and the binary codes produced by the binary code conversion model are stored in it. It provides a unified representation for data of different modalities, so that similarity can be measured directly, as illustrated by the sketch below.
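To illustrate why a common Hamming space makes cross-modal similarity cheap to measure, here is a minimal sketch; the 64-bit code length and the function names pack_code and hamming_distance are illustrative assumptions, not part of the invention:

```python
import numpy as np

def pack_code(code_pm1):
    """Pack a {-1, +1} binary code into a compact uint8 array (8 bits per byte)."""
    bits = (np.asarray(code_pm1) > 0).astype(np.uint8)
    return np.packbits(bits)

def hamming_distance(packed_a, packed_b):
    """Number of differing bits between two packed codes."""
    return int(np.unpackbits(packed_a ^ packed_b).sum())

# Toy example: a 64-bit image code and a 64-bit text code stored in the same space.
rng = np.random.default_rng(0)
image_code = rng.choice([-1, 1], size=64)
text_code = rng.choice([-1, 1], size=64)
print(hamming_distance(pack_code(image_code), pack_code(text_code)))
```

Because the distance reduces to an XOR plus a bit count over a few bytes, both the storage cost and the comparison cost are far lower than for real-valued features.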
The beneficial effects brought by the framework provided by this application are:
a universal deep hashing model is provided, so that cross-modal retrieval models based on deep hashing can be built more quickly;
it combines the strong representation-learning ability of deep learning with the high efficiency and low storage cost of hash learning, which helps reduce the heterogeneity gap and semantic gap between data of different modalities while lowering algorithmic complexity;
excellent retrieval accuracy can be obtained, together with a good balance between computational efficiency and retrieval performance.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a diagram of a model proposed by the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Example 1
As shown in FIG. 1, a universal cross-modal retrieval model based on deep hashing includes an image model 1, a text model 2, a binary code conversion model 3, and a Hamming space 4, where:
the image model 1 extracts image features, abstracting the original features and semantics of the image;
the text model 2 converts text data into vector form and extracts the features and semantics of the text;
the binary code conversion model 3 converts the features and semantics extracted by the image and text models into binary codes, thereby mapping data points from the original feature spaces of the different modalities into a common Hamming space;
the Hamming space 4 is a common subspace of the image and text feature spaces, in which the Hamming distance between the binary hash code of the query data and those of the stored data is computed for similarity ranking.
For the image model 1, a CNN model that has shown excellent performance in image feature extraction and image classification, such as ResNet, DenseNet, SENet or GCN, is recommended.
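As a concrete illustration of such an image model, the following is a minimal sketch assuming a recent torchvision release with an ImageNet-pre-trained ResNet-50; the class name and the 2048-dimensional output follow from that assumption, not from the patent:

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageModel(nn.Module):
    """Extracts image features with a CNN pre-trained on ImageNet (ResNet-50 here)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Drop the final classification layer; keep the 2048-d pooled feature.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, images):                 # images: (batch, 3, 224, 224)
        f = self.features(images)              # (batch, 2048, 1, 1)
        return torch.flatten(f, start_dim=1)   # (batch, 2048)
```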
The text model 2 includes a text vector conversion model 21 and a text feature extraction model 22, wherein:
the text vector conversion model 21 converts input text data into vector form; a BoW or Word2Vec model is recommended;
the text feature extraction model 22 extracts features from the converted text vectors; LSTM and Transformer models, which achieve excellent performance in natural language processing, are recommended.
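A minimal sketch of this text branch, assuming the text has already been tokenized to integer ids (the BoW/Word2Vec preprocessing is outside the snippet) and using an LSTM whose final hidden state serves as the text feature; all layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class TextModel(nn.Module):
    """Embeds token ids and extracts a text feature with an LSTM."""
    def __init__(self, vocab_size=30000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # could be initialised from Word2Vec
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                 # (batch, hidden_dim)
```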
The binary code conversion model 3 is composed of several fully connected layers and maps the extracted image and text features and semantics into binary codes of a specified length.
The last layer of the binary code conversion model 3 controls the number of bits of the generated binary code.
The binary code conversion model 3 usually adopts a contrastive loss or a triplet loss to supervise the code generation process, so that the neighborhood similarity of the original feature space is preserved as much as possible.
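A minimal sketch of such a binary code conversion model, assuming two fully connected layers; the node count of the last layer (64 here) fixes the code length, tanh gives a relaxed code for training under the contrastive or triplet loss, and sign gives the final binary code used at retrieval time. Layer widths are illustrative:

```python
import torch
import torch.nn as nn

class HashLayer(nn.Module):
    """Maps a real-valued feature to a k-bit code; k = nodes of the last FC layer."""
    def __init__(self, in_dim=2048, code_bits=64):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, code_bits),
        )

    def forward(self, features):
        return torch.tanh(self.fc(features))       # relaxed code in (-1, 1) for training

    def binary_code(self, features):
        return torch.sign(self.forward(features))  # {-1, +1} code used for retrieval
```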
In the Hamming space 4, hash codes from different modalities can be compared directly by distance.
Similarity in the Hamming space 4 is computed as follows: for example, for data x_a from the X modality, it is mapped into the Hamming space by the conversion model described above to obtain its code u_a; the similarity d_j = sim(u_a, v_j) with every Y-modality code v_j in the Hamming space is then computed, and sorting the similarities yields the retrieval result, i.e. the Y-modality data associated with x_a.
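A minimal sketch of this retrieval step, assuming the Y-modality codes are precomputed and stored as a {-1, +1} matrix; the function name retrieve and the sizes are illustrative:

```python
import torch

def retrieve(query_code, database_codes):
    """Rank database items by Hamming distance to the query code.

    query_code:     (k,) tensor with entries in {-1, +1}
    database_codes: (n, k) tensor with entries in {-1, +1}
    Returns database indices, nearest first.
    """
    k = query_code.numel()
    # For {-1, +1} codes, Hamming distance = (k - inner product) / 2.
    distances = (k - database_codes @ query_code) / 2
    return torch.argsort(distances)

# Example: one image query code against 1000 stored text codes.
q = torch.sign(torch.randn(64))
db = torch.sign(torch.randn(1000, 64))
ranking = retrieve(q, db)
print(ranking[:10])  # indices of the 10 most similar text items
```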
The above embodiments only express specific implementations of the present invention; although described in detail, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention.

Claims (6)

1. A universal cross-modal retrieval model based on deep hashing, wherein the framework comprises an image model, a text model, a binary code conversion model, and a Hamming space, wherein:
1) the image model is used for extracting features and semantics of input image data;
2) the text model is used for extracting features and semantics of input text data;
3) the binary code conversion model is responsible for mapping data points from the original feature spaces to binary codes in a common Hamming space;
4) the Hamming space is a common subspace of the feature spaces of the image model and the text model, in which similarity ranking can be performed by computing the Hamming distance between the hash code of the query data and the codes of the stored data, thereby obtaining the cross-modal retrieval result.
2. The universal cross-modal retrieval model based on deep hashing of claim 1, wherein: a convolutional neural network (CNN) preserves the local connectivity and spatial structure of a neighborhood and has strong abstract representation capability through local operations; by exploiting the high correlation between the two-dimensional structure of an image and its adjacent pixels, the pooling operation gives the model a degree of translation invariance, so that it is not affected by positional changes; pooling also enlarges the receptive field of the network, allowing deeper layers to learn more abstract feature representations. The image model therefore usually adopts a convolutional neural network to extract features and semantics. Since the ability to abstract and extract image features is an important index of performance, the image model can use a CNN model pre-trained on the ImageNet dataset that has shown excellent performance in image feature extraction and image classification, such as the residual network ResNet, or SENet, DenseNet, GCN and the like.
3. The universal cross-modal retrieval model based on deep hashing of claim 1, wherein: the text model first converts the text data into vector form with a BoW model or a Word2Vec model. For extracting the features and semantics of the text vectors, a recurrent neural network (RNN) is generally adopted; RNNs are the most successful multilayer neural network models for tasks with temporal dependencies, the order in which samples appear is very important for natural language processing, and RNNs provide a good solution to modelling changes over a sequence, which other networks cannot. Many existing models use only a fully connected layer to extract features for the text modality and ignore the contextual and rich semantic information of the text, so the text model here uses an RNN for feature extraction and representation. In particular, LSTM and Transformer models, which have shown superior performance in natural language processing in recent years, are the preferred choices.
4. The universal cross-modal retrieval model based on deep hashing of claim 1, wherein: the binary code conversion model typically employs several fully connected layers to map the image and text features into a binary hash code of a specified length, where the number of bits of the code depends on the number of nodes of the last fully connected layer. If the last fully connected layer has 16, 32 or 64 nodes, the resulting binary codes are 16, 32 or 64 bits long, respectively.
5. The universal cross-modal retrieval model based on deep hashing of claim 1, wherein the mathematical definition of the framework is as follows:
for clarity, image and text modalities are denoted by X and Y. The training data is defined as D ═ X, Y, where
Figure FDA0003066141240000021
Where n denotes the amount of data, x, of an instance of the training sampleiRepresenting the feature vector from the ith sample instance of the X modality. As such, define
Figure FDA0003066141240000022
Wherein y isjRepresenting the feature vector from the jth sample instance of the Y modality. Feature vector and x due to data of different modalitiesiAnd yjAre located in different feature representation spaces and usually have different statistical properties, so they cannot be directly compared. Thus one conversion function is learned for each modality: for the X-mode,
Figure FDA0003066141240000023
for the Y mode of the optical system,
Figure FDA0003066141240000024
where d is the dimension of the Hamming space, γXAnd gammaYAre parameters of the training of the two modality data. The transfer function will be from data x of different feature spacesiAnd yjMapping into a feature vector u in Hamming spaceiAnd vj. So that data from different modalities can be directly compared and the similarity of samples of the same class is greater than the similarity of samples of different classes in hamming space.
The framework aims to compute the similarity of data across modalities so as to perform cross-modal retrieval. For example, for data x_a from the X modality, it is mapped into the Hamming space with the above conversion function, u_a = f_X(x_a; γ_X); the similarity d_j = sim(u_a, v_j) with every Y-modality code v_j in the Hamming space is then computed, and sorting the similarities yields the retrieval result, i.e. the Y-modality data associated with x_a.
6. The universal cross-modal retrieval model based on deep hashing of claim 4, wherein: the binary code conversion model needs to ensure that the binary codes preserve the neighborhood similarity of the original feature space as much as possible, i.e. two points that are close in the original space should remain similar after being mapped into the Hamming space. The model is therefore trained under this similarity-preservation principle. Commonly used loss functions are the contrastive loss (Contrastive Loss) and the triplet loss (Triplet Loss), given below, which supervise the generation of the binary codes:
1) Contrastive loss
L_contrastive = y · d² + (1 − y) · [max(margin − d, 0)]²
where d = ||u_i − v_j||_2 is the Euclidean distance between the two sample features; y is a label indicating whether the two samples match: if the two modality data x_i and y_j are semantically associated then y = 1, otherwise y = 0; margin is a set threshold.
2) Triplet loss
L_triplet = [ ||f(Anchor) − f(Positive)||² − ||f(Anchor) − f(Negative)||² + α ]_+
where ||·|| is the Euclidean distance, ||f(Anchor) − f(Positive)||² is the squared distance between Positive and Anchor, ||f(Anchor) − f(Negative)||² is the squared distance between Negative and Anchor, α is the minimum margin between the Positive-Anchor distance and the Negative-Anchor distance, and [·]_+ means the bracketed value is taken as the loss when it is greater than 0 and the loss is 0 otherwise.
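For reference, a minimal PyTorch sketch of the two losses as written above; the squared-distance form, the margin and alpha values, and the function names are illustrative assumptions:

```python
import torch

def contrastive_loss(u, v, y, margin=2.0):
    """y = 1 if the image/text pair is semantically related, else 0."""
    d = torch.norm(u - v, p=2, dim=1)  # Euclidean distance per pair
    loss = y * d.pow(2) + (1 - y) * torch.clamp(margin - d, min=0).pow(2)
    return loss.mean()

def triplet_loss(anchor, positive, negative, alpha=0.5):
    d_ap = torch.norm(anchor - positive, p=2, dim=1)  # Positive-Anchor distance
    d_an = torch.norm(anchor - negative, p=2, dim=1)  # Negative-Anchor distance
    return torch.clamp(d_ap.pow(2) - d_an.pow(2) + alpha, min=0).mean()
```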
CN202110526554.5A 2021-05-14 2021-05-14 Universal cross-modal retrieval model based on deep hash Pending CN113076465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110526554.5A CN113076465A (en) 2021-05-14 2021-05-14 Universal cross-modal retrieval model based on deep hash

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110526554.5A CN113076465A (en) 2021-05-14 2021-05-14 Universal cross-modal retrieval model based on deep hash

Publications (1)

Publication Number Publication Date
CN113076465A true CN113076465A (en) 2021-07-06

Family

ID=76616918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110526554.5A Pending CN113076465A (en) 2021-05-14 2021-05-14 Universal cross-modal retrieval model based on deep hash

Country Status (1)

Country Link
CN (1) CN113076465A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658683A (en) * 2021-08-05 2021-11-16 重庆金山医疗技术研究院有限公司 Disease diagnosis system and data recommendation method
CN113971209A (en) * 2021-12-22 2022-01-25 松立控股集团股份有限公司 Non-supervision cross-modal retrieval method based on attention mechanism enhancement
CN113971209B (en) * 2021-12-22 2022-04-19 松立控股集团股份有限公司 Non-supervision cross-modal retrieval method based on attention mechanism enhancement
CN115081627A (en) * 2022-07-27 2022-09-20 中南大学 Cross-modal data hash retrieval attack method based on generative network
CN116128846A (en) * 2023-02-01 2023-05-16 南通大学 Visual transducer hash method for lung X-ray image retrieval
CN116128846B (en) * 2023-02-01 2023-08-22 南通大学 Visual transducer hash method for lung X-ray image retrieval
CN117633263A (en) * 2024-01-26 2024-03-01 中国标准化研究院 Encoding method of digital asset based on big data
CN117633263B (en) * 2024-01-26 2024-03-22 中国标准化研究院 Encoding method of digital asset based on big data

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN108427738B (en) Rapid image retrieval method based on deep learning
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
CN112347268A (en) Text-enhanced knowledge graph joint representation learning method and device
CN110222218B (en) Image retrieval method based on multi-scale NetVLAD and depth hash
CN108959522B (en) Migration retrieval method based on semi-supervised countermeasure generation network
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN109783691B (en) Video retrieval method for deep learning and Hash coding
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
CN112115253B (en) Depth text ordering method based on multi-view attention mechanism
CN113157886B (en) Automatic question and answer generation method, system, terminal and readable storage medium
CN116204706A (en) Multi-mode content retrieval method and system for text content and image analysis
CN114020906A (en) Chinese medical text information matching method and system based on twin neural network
CN111858984A (en) Image matching method based on attention mechanism Hash retrieval
CN115879473B (en) Chinese medical named entity recognition method based on improved graph attention network
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN111461175A (en) Label recommendation model construction method and device of self-attention and cooperative attention mechanism
CN112883199A (en) Collaborative disambiguation method based on deep semantic neighbor and multi-entity association
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN114357148A (en) Image text retrieval method based on multi-level network
CN110598022A (en) Image retrieval system and method based on robust deep hash network
CN116628192A (en) Text theme representation method based on Seq2Seq-Attention

Legal Events

Date Code Title Description
PB01 Publication