CN113076465A - Universal cross-modal retrieval model based on deep hash - Google Patents

Universal cross-modal retrieval model based on deep hash

Info

Publication number
CN113076465A
CN113076465A (application CN202110526554.5A)
Authority
CN
China
Prior art keywords
model
data
text
image
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110526554.5A
Other languages
Chinese (zh)
Inventor
段友祥
陈宁
孙歧峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China
Priority to CN202110526554.5A
Publication of CN113076465A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Abstract

The invention discloses a universal cross-modal retrieval model based on deep hashing, which comprises an image model, a text model, a binary code conversion model and a Hamming space. The image model extracts the features and semantics of image data; the text model extracts the features and semantics of text data; the binary code conversion model converts the original features into binary codes; the Hamming space is a common subspace of the image and text data in which the similarity of cross-modal data can be computed directly. The universal model proposed by the invention, which combines deep learning and hash learning to solve cross-modal retrieval, maps data points from the original feature space to binary codes in a common Hamming space and ranks results by computing the Hamming distance between the code of the query data and the codes of the stored data, thereby obtaining the retrieval result and greatly improving retrieval efficiency. In addition, the original data are replaced by binary codes for storage, which greatly reduces the storage requirements of the retrieval task.

Description

Universal cross-modal retrieval model based on deep hash
Technical Field
The invention relates to the field of cross-modal retrieval, in particular to cross-modal retrieval of images and texts.
Background
In recent years, with the rapid growth of the internet and the popularization of smart devices and social networks, multimedia data on the internet has increased explosively. These massive data come in various modalities such as text, images, video and audio, and the same object may be described by data of different modalities. Such data are heterogeneous and multi-source in form but semantically related to each other. Single-modality retrieval no longer satisfies people's need for information, and realizing cross-modal retrieval through the knowledge collaboration of different modalities has become a research hotspot in recent years.
Deep learning has made breakthrough progress in single-modality fields such as natural language processing, image understanding and speech recognition, and the strong abstraction capability of neural networks has shown great potential in multimedia applications such as object recognition and text generation, laying a theoretical and practical foundation for research on cross-modal retrieval.
Most prior-art techniques model directly on the extracted real-valued features to perform cross-modal retrieval, which is very time consuming for large-scale datasets and requires a large amount of storage space. They also pursue retrieval accuracy alone while neglecting retrieval efficiency, so the trained models suffer from large retrieval latency and low efficiency and are difficult to apply in practice. Hash learning, by contrast, works well on large-scale data thanks to its low storage requirements and high retrieval speed.
Disclosure of Invention
To address the problems and shortcomings of the prior art, the invention provides a universal cross-modal retrieval model based on deep hashing that combines the strong representation-learning performance of deep learning algorithms with the high efficiency and low storage cost of hashing methods, which helps reduce the heterogeneity gap and semantic gap between data of different modalities while lowering the computational complexity of the algorithm. Properly combining deep learning and hash learning to model different types of data for cross-modal retrieval is a promising direction: it can achieve excellent retrieval accuracy while striking a good balance between computational efficiency and retrieval performance.
Specifically, the present application provides a universal cross-modal retrieval model based on deep hashing, including:
the image model is used for extracting features and semantics of input image data;
the text model is used for extracting features and semantics of input text data;
the binary code conversion model is responsible for mapping data points from the original feature spaces to binary codes in a common Hamming space;
and the Hamming space is a common subspace of the feature spaces of the image model and the text model, in which similarity ranking can be performed by computing the Hamming distance between the hash code of the query data and the codes of the stored data, thereby obtaining the cross-modal retrieval result.
The invention is based on deep learning and hash learning techniques.
1. The image model includes:
1.1 Image data preprocessing: the picture is preprocessed into a feature form and fed into the image convolutional neural network;
1.2 Image feature and semantic extraction model: a CNN model pre-trained on the ImageNet dataset that has shown excellent performance in image feature extraction and image classification, such as ResNet, SENet, DenseNet or GCN, can be adopted.
2. The text model includes:
2.1 Text vectorization: the text data is generally converted into vector form with a BoW model or a Word2Vec model;
2.2 Text feature and semantic extraction model: a recurrent neural network, the most successful architecture for tasks with temporal dependencies, is usually used; LSTM and Transformer models, which have shown excellent performance in natural language processing in recent years, are the preferred choices.
3. The binary code conversion model includes:
3.1 Several fully connected layers that map the image features and text features extracted by the image model and the text model into binary hash codes of a specified length;
3.2 The number of bits of the binary code depends on the number of nodes of the last fully connected layer, generally set to 16, 32, 64 or 128, so that retrieval results for codes of different lengths can be compared.
4. The Hamming space is a common subspace of the image and text features from their different feature spaces, and the binary codes produced by the binary code conversion model are stored in it. It provides a unified representation for data of different modalities, so that similarity can be measured directly, as illustrated by the sketch below.
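To illustrate why a common Hamming space makes cross-modal similarity cheap to measure, here is a minimal sketch; the 64-bit code length and the function names pack_code and hamming_distance are illustrative assumptions, not part of the invention:

```python
import numpy as np

def pack_code(code_pm1):
    """Pack a {-1, +1} binary code into a compact uint8 array (8 bits per byte)."""
    bits = (np.asarray(code_pm1) > 0).astype(np.uint8)
    return np.packbits(bits)

def hamming_distance(packed_a, packed_b):
    """Number of differing bits between two packed codes."""
    return int(np.unpackbits(packed_a ^ packed_b).sum())

# Toy example: a 64-bit image code and a 64-bit text code stored in the same space.
rng = np.random.default_rng(0)
image_code = rng.choice([-1, 1], size=64)
text_code = rng.choice([-1, 1], size=64)
print(hamming_distance(pack_code(image_code), pack_code(text_code)))
```

Because the distance reduces to an XOR plus a bit count over a few bytes, both the storage cost and the comparison cost are far lower than for real-valued features.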
The beneficial effects brought by the framework provided by this application are:
a universal deep hashing model is provided, so that cross-modal retrieval models based on deep hashing can be built more quickly;
it combines the strong representation-learning ability of deep learning with the high efficiency and low storage cost of hash learning, which helps reduce the heterogeneity gap and semantic gap between data of different modalities while lowering algorithmic complexity;
excellent retrieval accuracy can be obtained, together with a good balance between computational efficiency and retrieval performance.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a diagram of a model proposed by the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Example 1
As shown in FIG. 1, a universal cross-modal retrieval model based on deep hashing includes an image model 1, a text model 2, a binary code conversion model 3, and a Hamming space 4, where:
the image model 1 extracts image features, abstracting the original features and semantics of the image;
the text model 2 converts text data into vector form and extracts the features and semantics of the text;
the binary code conversion model 3 converts the features and semantics extracted by the image and text models into binary codes, thereby mapping data points from the original feature spaces of the different modalities into a common Hamming space;
the Hamming space 4 is a common subspace of the image and text feature spaces, in which the Hamming distance between the binary hash code of the query data and those of the stored data is computed for similarity ranking.
For the image model 1, a CNN model that has shown excellent performance in image feature extraction and image classification, such as ResNet, DenseNet, SENet or GCN, is recommended.
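As a concrete illustration of such an image model, the following is a minimal sketch assuming a recent torchvision release with an ImageNet-pre-trained ResNet-50; the class name and the 2048-dimensional output follow from that assumption, not from the patent:

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageModel(nn.Module):
    """Extracts image features with a CNN pre-trained on ImageNet (ResNet-50 here)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Drop the final classification layer; keep the 2048-d pooled feature.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, images):                 # images: (batch, 3, 224, 224)
        f = self.features(images)              # (batch, 2048, 1, 1)
        return torch.flatten(f, start_dim=1)   # (batch, 2048)
```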
The text model 2 includes a text vector conversion model 21 and a text feature extraction model 22, wherein:
the text vector conversion model 21 converts input text data into vector form; a BoW or Word2Vec model is recommended;
the text feature extraction model 22 extracts features from the converted text vectors; LSTM and Transformer models, which achieve excellent performance in natural language processing, are recommended.
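A minimal sketch of this text branch, assuming the text has already been tokenized to integer ids (the BoW/Word2Vec preprocessing is outside the snippet) and using an LSTM whose final hidden state serves as the text feature; all layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class TextModel(nn.Module):
    """Embeds token ids and extracts a text feature with an LSTM."""
    def __init__(self, vocab_size=30000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # could be initialised from Word2Vec
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                 # (batch, hidden_dim)
```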
The binary code conversion model 3 is composed of several fully connected layers and maps the extracted image and text features and semantics into binary codes of a specified length.
The last layer of the binary code conversion model 3 controls the number of bits of the generated binary code.
The binary code conversion model 3 usually adopts a contrastive loss or a triplet loss to supervise the code generation process, so that the neighborhood similarity of the original feature space is preserved as much as possible.
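A minimal sketch of such a binary code conversion model, assuming two fully connected layers; the node count of the last layer (64 here) fixes the code length, tanh gives a relaxed code for training under the contrastive or triplet loss, and sign gives the final binary code used at retrieval time. Layer widths are illustrative:

```python
import torch
import torch.nn as nn

class HashLayer(nn.Module):
    """Maps a real-valued feature to a k-bit code; k = nodes of the last FC layer."""
    def __init__(self, in_dim=2048, code_bits=64):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, code_bits),
        )

    def forward(self, features):
        return torch.tanh(self.fc(features))       # relaxed code in (-1, 1) for training

    def binary_code(self, features):
        return torch.sign(self.forward(features))  # {-1, +1} code used for retrieval
```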
In the Hamming space 4, hash codes from different modalities can be compared directly by distance.
Similarity in the Hamming space 4 is computed as follows: for example, for data x_a from the X modality, it is mapped into the Hamming space by the conversion model described above to obtain its code u_a; the similarity d_j = sim(u_a, v_j) with every Y-modality code v_j in the Hamming space is then computed, and sorting the similarities yields the retrieval result, i.e. the Y-modality data associated with x_a.
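A minimal sketch of this retrieval step, assuming the Y-modality codes are precomputed and stored as a {-1, +1} matrix; the function name retrieve and the sizes are illustrative:

```python
import torch

def retrieve(query_code, database_codes):
    """Rank database items by Hamming distance to the query code.

    query_code:     (k,) tensor with entries in {-1, +1}
    database_codes: (n, k) tensor with entries in {-1, +1}
    Returns database indices, nearest first.
    """
    k = query_code.numel()
    # For {-1, +1} codes, Hamming distance = (k - inner product) / 2.
    distances = (k - database_codes @ query_code) / 2
    return torch.argsort(distances)

# Example: one image query code against 1000 stored text codes.
q = torch.sign(torch.randn(64))
db = torch.sign(torch.randn(1000, 64))
ranking = retrieve(q, db)
print(ranking[:10])  # indices of the 10 most similar text items
```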
The above embodiments only express specific implementations of the present invention; although described in detail, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention.

Claims (6)

1. A universal cross-modal retrieval model based on deep hashing, wherein the framework comprises an image model, a text model, a binary code conversion model, and a Hamming space, wherein:
1) the image model is used for extracting features and semantics of input image data;
2) the text model is used for extracting features and semantics of input text data;
3) the binary code conversion model is responsible for mapping data points from the original feature spaces to binary codes in a common Hamming space;
4) the Hamming space is a common subspace of the feature spaces of the image model and the text model, in which similarity ranking can be performed by computing the Hamming distance between the hash code of the query data and the codes of the stored data, thereby obtaining the cross-modal retrieval result.
2. The universal cross-modal retrieval model based on deep hashing of claim 1, wherein: a convolutional neural network (CNN) preserves the local connectivity and spatial structure of a neighborhood and has strong abstract representation capability through local operations; by exploiting the high correlation between the two-dimensional structure of an image and its adjacent pixels, the pooling operation gives the model a degree of translation invariance, so that it is not affected by positional changes; pooling also enlarges the receptive field of the network, allowing deeper layers to learn more abstract feature representations. The image model therefore usually adopts a convolutional neural network to extract features and semantics. Since the ability to abstract and extract image features is an important index of performance, the image model can use a CNN model pre-trained on the ImageNet dataset that has shown excellent performance in image feature extraction and image classification, such as the residual network ResNet, or SENet, DenseNet, GCN and the like.
3. The universal cross-modal retrieval model based on deep hashing of claim 1, wherein: the text model first converts the text data into vector form with a BoW model or a Word2Vec model. For extracting the features and semantics of the text vectors, a recurrent neural network (RNN) is generally adopted; RNNs are the most successful multilayer neural network models for tasks with temporal dependencies, the order in which samples appear is very important for natural language processing, and RNNs provide a good solution to modelling changes over a sequence, which other networks cannot. Many existing models use only a fully connected layer to extract features for the text modality and ignore the contextual and rich semantic information of the text, so the text model here uses an RNN for feature extraction and representation. In particular, LSTM and Transformer models, which have shown superior performance in natural language processing in recent years, are the preferred choices.
4. The universal cross-modal retrieval model based on deep hashing of claim 1, wherein: the binary code conversion model typically employs several fully connected layers to map the image and text features into a binary hash code of a specified length, where the number of bits of the code depends on the number of nodes of the last fully connected layer. If the last fully connected layer has 16, 32 or 64 nodes, the resulting binary codes are 16, 32 or 64 bits long, respectively.
5. The universal cross-modal retrieval model based on deep hashing of claim 1, wherein the mathematical definition of the framework is as follows:
for clarity, image and text modalities are denoted by X and Y. The training data is defined as D ═ X, Y, where
Figure FDA0003066141240000021
Where n denotes the amount of data, x, of an instance of the training sampleiRepresenting the feature vector from the ith sample instance of the X modality. As such, define
Figure FDA0003066141240000022
Wherein y isjRepresenting the feature vector from the jth sample instance of the Y modality. Feature vector and x due to data of different modalitiesiAnd yjAre located in different feature representation spaces and usually have different statistical properties, so they cannot be directly compared. Thus one conversion function is learned for each modality: for the X-mode,
Figure FDA0003066141240000023
for the Y mode of the optical system,
Figure FDA0003066141240000024
where d is the dimension of the Hamming space, γXAnd gammaYAre parameters of the training of the two modality data. The transfer function will be from data x of different feature spacesiAnd yjMapping into a feature vector u in Hamming spaceiAnd vj. So that data from different modalities can be directly compared and the similarity of samples of the same class is greater than the similarity of samples of different classes in hamming space.
The framework aims to compute the similarity of data across modalities so as to perform cross-modal retrieval. For example, for data x_a from the X modality, it is mapped into the Hamming space with the above conversion function, u_a = f_X(x_a; γ_X); the similarity d_j = sim(u_a, v_j) with every Y-modality code v_j in the Hamming space is then computed, and sorting the similarities yields the retrieval result, i.e. the Y-modality data associated with x_a.
6. The universal cross-modal retrieval model based on deep hashing of claim 4, wherein: the binary code conversion model needs to ensure that the binary codes preserve the neighborhood similarity of the original feature space as much as possible, i.e. two points that are close in the original space should remain similar after being mapped into the Hamming space. The model is therefore trained under this similarity-preservation principle. Commonly used loss functions are the contrastive loss (Contrastive Loss) and the triplet loss (Triplet Loss), given below, which supervise the generation of the binary codes:
1) Contrastive loss
L_contrastive = y · d² + (1 − y) · [max(margin − d, 0)]²
where d = ||u_i − v_j||_2 is the Euclidean distance between the two sample features; y is a label indicating whether the two samples match: if the two modality data x_i and y_j are semantically associated then y = 1, otherwise y = 0; margin is a set threshold.
2) Triplet loss
L_triplet = [ ||f(Anchor) − f(Positive)||² − ||f(Anchor) − f(Negative)||² + α ]_+
where ||·|| is the Euclidean distance, ||f(Anchor) − f(Positive)||² is the squared distance between Positive and Anchor, ||f(Anchor) − f(Negative)||² is the squared distance between Negative and Anchor, α is the minimum margin between the Positive-Anchor distance and the Negative-Anchor distance, and [·]_+ means the bracketed value is taken as the loss when it is greater than 0 and the loss is 0 otherwise.
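For reference, a minimal PyTorch sketch of the two losses as written above; the squared-distance form, the margin and alpha values, and the function names are illustrative assumptions:

```python
import torch

def contrastive_loss(u, v, y, margin=2.0):
    """y = 1 if the image/text pair is semantically related, else 0."""
    d = torch.norm(u - v, p=2, dim=1)  # Euclidean distance per pair
    loss = y * d.pow(2) + (1 - y) * torch.clamp(margin - d, min=0).pow(2)
    return loss.mean()

def triplet_loss(anchor, positive, negative, alpha=0.5):
    d_ap = torch.norm(anchor - positive, p=2, dim=1)  # Positive-Anchor distance
    d_an = torch.norm(anchor - negative, p=2, dim=1)  # Negative-Anchor distance
    return torch.clamp(d_ap.pow(2) - d_an.pow(2) + alpha, min=0).mean()
```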
CN202110526554.5A 2021-05-14 2021-05-14 Universal cross-modal retrieval model based on deep hash Pending CN113076465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110526554.5A CN113076465A (en) 2021-05-14 2021-05-14 Universal cross-modal retrieval model based on deep hash

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110526554.5A CN113076465A (en) 2021-05-14 2021-05-14 Universal cross-modal retrieval model based on deep hash

Publications (1)

Publication Number Publication Date
CN113076465A true CN113076465A (en) 2021-07-06

Family

ID=76616918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110526554.5A Pending CN113076465A (en) 2021-05-14 2021-05-14 Universal cross-modal retrieval model based on deep hash

Country Status (1)

Country Link
CN (1) CN113076465A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658683A (en) * 2021-08-05 2021-11-16 重庆金山医疗技术研究院有限公司 Disease diagnosis system and data recommendation method
CN113971209A (en) * 2021-12-22 2022-01-25 松立控股集团股份有限公司 Non-supervision cross-modal retrieval method based on attention mechanism enhancement
CN113971209B (en) * 2021-12-22 2022-04-19 松立控股集团股份有限公司 Non-supervision cross-modal retrieval method based on attention mechanism enhancement
CN115081627A (en) * 2022-07-27 2022-09-20 中南大学 Cross-modal data hash retrieval attack method based on generative network
CN116128846A (en) * 2023-02-01 2023-05-16 南通大学 Visual transducer hash method for lung X-ray image retrieval
CN116128846B (en) * 2023-02-01 2023-08-22 南通大学 Visual transducer hash method for lung X-ray image retrieval
CN117633263A (en) * 2024-01-26 2024-03-01 中国标准化研究院 Encoding method of digital asset based on big data
CN117633263B (en) * 2024-01-26 2024-03-22 中国标准化研究院 Encoding method of digital asset based on big data

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN108427738B (en) Rapid image retrieval method based on deep learning
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
CN112347268A (en) Text-enhanced knowledge graph joint representation learning method and device
CN110222218B (en) Image retrieval method based on multi-scale NetVLAD and depth hash
CN108959522B (en) Migration retrieval method based on semi-supervised countermeasure generation network
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN109783691B (en) Video retrieval method for deep learning and Hash coding
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
CN112115253B (en) Depth text ordering method based on multi-view attention mechanism
CN113157886B (en) Automatic question and answer generation method, system, terminal and readable storage medium
CN116204706A (en) Multi-mode content retrieval method and system for text content and image analysis
CN114020906A (en) Chinese medical text information matching method and system based on twin neural network
CN111858984A (en) Image matching method based on attention mechanism Hash retrieval
CN115879473B (en) Chinese medical named entity recognition method based on improved graph attention network
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN111461175A (en) Label recommendation model construction method and device of self-attention and cooperative attention mechanism
CN112883199A (en) Collaborative disambiguation method based on deep semantic neighbor and multi-entity association
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN114357148A (en) Image text retrieval method based on multi-level network
CN110598022A (en) Image retrieval system and method based on robust deep hash network
CN116628192A (en) Text theme representation method based on Seq2Seq-Attention

Legal Events

Date Code Title Description
PB01 Publication