CN116204706A - Multi-mode content retrieval method and system for text content and image analysis - Google Patents

Multi-mode content retrieval method and system for text content and image analysis

Info

Publication number
CN116204706A
CN116204706A (application CN202211723519.3A)
Authority
CN
China
Prior art keywords
text
features
modal
image
hash code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211723519.3A
Other languages
Chinese (zh)
Inventor
周凡
张富为
林谋广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202211723519.3A priority Critical patent/CN116204706A/en
Publication of CN116204706A publication Critical patent/CN116204706A/en
Pending legal-status Critical Current

Classifications

    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/535 Filtering based on additional data, e.g. user or group profiles
    • G06F16/9014 Indexing; Data structures therefor; Storage structures; hash tables
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G06N3/04 Neural network architecture, e.g. interconnection topology
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/40 Extraction of image or video features
    • G06V10/761 Proximity, similarity or dissimilarity measures in feature spaces
    • G06V10/82 Image or video recognition or understanding using neural networks
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-modal content retrieval method for text content and image analysis, comprising the following steps: preprocessing a data set to obtain text-image information pairs; extracting image and text features, and performing multi-modal attention computation on the image features and the text features to obtain multi-modal features; encoding the image features, the text features and the multi-modal features into corresponding hash codes; constructing a target loss function and training to obtain a multi-modal hash code generation model; constructing a multi-modal hash code database from the database to be retrieved by using the multi-modal hash code generation model; and generating a multi-modal hash code from text information input by a user and matching it against the multi-modal hash code database to obtain a retrieval result. The invention also discloses a multi-modal content retrieval system for text content and image analysis. The method uses multi-modal hash codes to capture the commonality among modalities at a fundamental level, bridge the heterogeneous gap between modalities and significantly improve the efficiency of extracting effective features.

Description

Multi-mode content retrieval method and system for text content and image analysis
Technical Field
The invention relates to retrieval technologies, and in particular to a multi-modal content retrieval method and system for text content and image analysis.
Background
In the last decade, different types of multimedia data, such as textual descriptions, images and videos, have spread explosively over the Internet. When a web page conveys information to its audience, the same event or topic is typically described through a combination of textual expression, auxiliary images and, in many cases, video. The different types of data in such an expression are referred to as multi-modal data. How to perform multi-modal retrieval quickly and accurately in the Internet era has therefore attracted considerable attention from researchers.
Today, mobile communication devices and social networking sites (e.g., Facebook, Flickr, YouTube and Twitter) are changing the way people interact with the world and search for information of interest. It would be convenient if a user could use any media content as the input of a query for related information. For example, when visiting the Great Wall, a tourist may wish to retrieve related textual material from the photos taken, so as to learn about the history and points of interest of the place being visited. Multi-modal retrieval is therefore becoming an increasingly important search paradigm. Multi-modal retrieval aims at using one type of data as a query to retrieve related data of another type. Moreover, when users submit data of any media type to search for information, they can obtain search results of various media types; the results obtained in this way are more comprehensive because the different manifestations of the data complement each other.

In the field of multi-modal research, the earliest research direction was multi-modal retrieval. Multi-modal retrieval technology involves natural language processing, image processing, speech recognition, computer vision, machine learning and other fields, and is closely related to computer science, statistics, mathematics and other disciplines. On the one hand, research on multi-modal retrieval methods motivates the further development of many machine learning theories (e.g., multi-view learning, hash learning, subspace learning, metric learning and deep learning). On the other hand, such research is an indispensable stage in the emergence and development of new retrieval technologies, which in turn promote the social and economic development of information technology; research on multi-modal retrieval methods is therefore of great significance.

As mutually complementary descriptions between multi-modal data become more common, search engines and multi-modal data on social media exhibit explosive growth. Researchers have focused on how to quickly and accurately search large-scale multi-modal data for data of different modalities that describe the same event or topic. Since data from different modalities typically have incomparable feature representations and distributions, they must be mapped into a common feature space. To meet the requirements of low storage cost and high query speed in practical applications, researchers have proposed hashing-based multi-modal retrieval methods. Such methods map high-dimensional multi-modal data into a common Hamming space; once the hash codes are obtained, the similarity between multi-modal data can be computed with simple exclusive-OR operations. Most existing hashing-based multi-modal retrieval methods use hand-crafted features for hash learning; their hash codes can be learned quickly and they can achieve good retrieval results. However, a common shortcoming of these algorithms is that the hand-crafted feature extraction process and the hash learning process are completely independent, so the hand-crafted features may not be fully compatible with the hash learning process, which degrades the retrieval performance of hashing-based multi-modal retrieval methods that rely on hand-crafted features.
Experiments have shown that, on several commonly used multi-modal data sets, continuing to perform hash learning on hand-crafted features makes it difficult to further improve multi-modal retrieval performance. To solve the problem that hand-crafted features are not fully compatible with the hash learning process, feature learning that matches the hash learning must be studied. The present method explores hashing-based multi-modal retrieval from the aspects of reducing encoding errors, mining the semantic information of multi-modal data and reducing the differences between multi-modal data when deep learning is used to perform feature learning matched with hash learning.
The basic idea of hashing-based multi-modal retrieval is to map data from the original feature space into a binary encoding space for similarity retrieval. Hashing-based multi-modal retrieval methods map multi-modal data into binary codes, which occupy little space and can be compared quickly. A typical hashing-based multi-modal retrieval method has two steps: first, data of different modalities, or hand-crafted features, are mapped into a common feature space through a linear or nonlinear transformation; second, the features in the common feature space are encoded, most commonly with a binarization function. The key difficulty of multi-modal retrieval is how to model the semantic relatedness between multi-modal data; this is usually addressed by learning unified hash codes for different modalities of the same sample or by narrowing the Hamming distance between semantically related multi-modal data. Hashing-based multi-modal retrieval methods can be classified into supervised and unsupervised methods according to whether label information is used, and into linear and nonlinear methods according to the projection mode.
One existing technique is Cross-Modal Similarity Sensitive Hashing (CMSSH), described in the paper "Large-scale supervised multimodal hashing with semantic correlation maximization". The method first maps the original data into hash codes and obtains a similarity matrix using the hash codes and a defined multi-modal data fusion scheme; the similarity matrix is then used to obtain the weight of the next hash function. The method treats the learning of each hash function as a binary classification process, i.e., a weak classifier, and finally integrates the weak classifiers into a strong classifier using a standard boosting algorithm. The disadvantage of this approach is that CMSSH maintains consistency between multi-modal data using a point-pair-based method, but does not take intra-modality data similarity into account.
The second existing technique is Supervised Matrix Factorization Hashing (SMFH), proposed in the paper "Kernel-based supervised hashing for cross-view similarity search". The method uses matrix factorization for hashing-based multi-modal retrieval and makes effective use of label information and local geometric structure: it considers both the consistency of labels across modalities and the local geometric consistency within each modality, formulates these two elements as a graph Laplacian term in the objective function, and thereby greatly improves the discriminative power of the latent semantic features obtained by joint matrix factorization. The discrete multi-modal hashing (DCH) method retains the discrete constraints, constructs a linear classifier using labels as supervision information and directly learns discriminative hash codes. The disadvantage of these methods is that the semantic information learned by a linear structure is limited compared with a nonlinear structure; linear methods can be further divided into point-pair-based methods and label-based methods.
The third existing technique is the Multimodal Latent Binary Embedding (MLBE) model proposed in the paper "Deep cross-mode mapping". MLBE uses a generative model to encode the intra-modality and inter-modality similarities of multi-modal data. Based on maximum a posteriori estimation, binary latent factors that preserve both intra- and inter-modality similarities are obtained and then used as the learned hash codes. The disadvantage of this method is that, during learning, especially when the code length is large, there are many parameters and the computational complexity is high, so the optimization easily falls into a local minimum.
Disclosure of Invention
The invention aims to overcome the defects of existing methods and provides a multi-modal content retrieval method and system for text content combined with image analysis. The main problems solved by the invention are: first, CMSSH adopts a point-pair-based method to maintain consistency between multi-modal data but neglects the similarity of data within each modality; second, the feature expression capability of the hash codes in current multi-modal hashing methods is insufficient and the extraction efficiency of effective features is low; third, when the code length is large, the MLBE model has many parameters and high computational complexity, and its optimization easily falls into a local minimum.
In order to solve the above problems, the present invention proposes a multi-modal content retrieval method for text content combined with image analysis, the method comprising:
labeling the image set in the ImageNet dataset to obtain text-image information pairs;
inputting the text-image information pairs, constructing a feature extraction network to extract image and text features, and outputting the image features and the text features;
inputting the image features and the text features, and performing multi-modal attention computation to obtain attention-weighted multi-modal features;
performing hash generation on the image features, the text features and the multi-modal features respectively, and outputting an image hash code, a text hash code and a multi-modal hash code;
inputting the image hash code, the text hash code and the multi-modal hash code, constructing a target loss function, training the model with the loss function, and finally obtaining a multi-modal hash code generation model;
constructing a multi-modal hash code database from the database to be retrieved by using the trained multi-modal hash code generation model;
and generating a multi-modal hash code from text information input by a user using the multi-modal hash code generation model, and then matching it against the constructed multi-modal hash code database to obtain a retrieval result.
Preferably, the labeling of the image set in the ImageNet dataset to obtain text-image information pairs is specifically as follows:
images and the information of their corresponding text descriptions are collected to form the ImageNet dataset, and the image set in the dataset is labeled to obtain text-image information pairs.
Preferably, the text image information pair is input, a feature extraction network is constructed to extract image and text features, and the image features and the text features are output, specifically:
for image features, a convolutional neural network is combined with a fully connected layer of 512 nodes, followed by a fully connected layer of K nodes with softmax as the activation function, to build the image feature extraction network, and its output is used as the learned image feature; this is combined with a fully connected layer whose number of nodes equals the number of label categories and whose activation function is sigmoid, so that the last layer of the image network outputs predicted labels, which are used to preserve the label features of each instance. For text features, each text is represented by a bag-of-words vector; to alleviate the feature sparsity easily caused by bag-of-words vectors, a multi-scale fusion model (MS) is adopted to extract the text data features. The MS contains five levels of pooling layers (1x1, 2x2, 3x3, 5x5, 10x10), followed by a fully connected layer of 4096 nodes, a fully connected layer of 512 nodes, a fully connected layer of K nodes with softmax as the activation function and, finally, a fully connected layer whose number of nodes equals the number of label categories and whose activation function is sigmoid; the output of the second-to-last layer of the text network is used as the learned text feature, and the last layer outputs the predicted labels.
Preferably, the image features and the text features are input and multi-modal attention computation is performed to obtain attention-weighted multi-modal features, specifically:
a multi-modal cross-attention mechanism based on the self-attention mechanism is adopted to capture the correlation between the text and the image: the text is used as the query Q and the image as the key K and value V, multi-modal attention is then computed, the computations of the different heads are fused, and normalization is applied to obtain the final multi-modal feature.
Preferably, hash generation is performed on the image features, the text features and the multi-modal features respectively, and an image hash code, a text hash code and a multi-modal hash code are output, specifically:
the hash codes of the text features, the image features and the multi-modal features are obtained respectively using a sign function, as follows:

B_F = sign(F),  B_G = sign(G),  B_L = sign(L)

where sign(x) = 1 if x > 0 and sign(x) = −1 otherwise, F, G and L denote the real-valued image, text and multi-modal features, and B_F, B_G and B_L denote the corresponding image, text and multi-modal hash codes;
the multi-modal features combine the text features and the image features, and the multi-modal features serve as an intermediate bridge linking the hash code learning processes of the different modalities; in the image hash code learning process, the image features are first encoded with the sign function to form the image hash code; the multi-modal features are encoded with the sign function to form the multi-modal hash code; and in the text hash code learning process, the text features are encoded with the sign function to form the text hash code; after the hash codes of the different modalities have been modeled, the target loss function is constructed.
Preferably, the image hash code, the text hash code and the multi-modal hash code are input, a target loss function is constructed and the model is trained with the loss function to finally obtain the multi-modal hash code generation model, specifically:
The similarity relationship between the image features and the multi-modal features is modeled with a negative log-likelihood loss, and the similarity between the image features and the multi-modal features is measured with a cosine similarity function Θ_ij, as follows:

Θ_ij = cos(F_i, L_j)

L1 = −Σ_{i,j=1..n} log p(S_ij | Θ_ij),  with  p(S_ij = 1 | Θ_ij) = σ(Θ_ij)  and  p(S_ij = −1 | Θ_ij) = 1 − σ(Θ_ij)

where σ(·) denotes the sigmoid function; S is an n×n matrix whose element S_ij represents the similarity relationship between the i-th sample of the current modality and the j-th sample of the other modality, with S_ij = 1 denoting similar and S_ij = −1 denoting dissimilar; Θ_ij represents the similarity of the samples of the two different modalities in real-valued space; F_i = f(x_i; θ_x) ∈ R^c and L_i ∈ R^c denote the image feature and the multi-modal feature of the i-th sample respectively, both being real-valued feature representations; and c denotes the length of the feature vector, which is also the length of the subsequent binary hash representation, i.e., the number of bits. Optimizing the above negative log-likelihood loss keeps the similarity relationship in the feature space.

The similarity relationship between the text features and the multi-modal features is then modeled in the same way with a negative log-likelihood loss, with the similarity measured by a cosine similarity function Φ_ij, as follows:

Φ_ij = cos(G_i, L_j)

L2 = −Σ_{i,j=1..n} log p(S_ij | Φ_ij),  with  p(S_ij = 1 | Φ_ij) = σ(Φ_ij)  and  p(S_ij = −1 | Φ_ij) = 1 − σ(Φ_ij)

where S is again the n×n similarity matrix defined above; Φ_ij represents the similarity of the samples of the two different modalities in real-valued space; G_i = g(y_i; θ_y) ∈ R^c and L_i ∈ R^c denote the text feature and the multi-modal feature of the i-th sample respectively, both being real-valued feature representations; and c again denotes the length of the feature vector and of the subsequent binary hash representation. Optimizing this negative log-likelihood loss likewise keeps the similarity relationship in the feature space.

Meanwhile, in order to compensate for the information lost in the quantization process of the different modalities, quantization losses are constructed as follows:

L3 = ‖B − F‖²_F + ‖B − L‖²_F,  with  B = sign(F + L)

L4 = ‖B − G‖²_F + ‖B − L‖²_F,  with  B = sign(G + L)

where B is an n×c matrix used to continuously store and update the binary hash representation of each sample during training; B is in effect obtained as a post-processing step through sign(F + L) and sign(G + L), and sign(·) is the function that converts a number greater than 0 to 1 and a number less than 0 to −1.

Subsequently, in order to balance the relationship between the different modalities, a balance loss L5 is introduced.

Finally, the total target loss is:

L_all = L1 + L2 + L3 + L4 + L5
By learning with the above loss function, similar data of different modalities are mapped into nearby regions of the Hamming space, while the hash codes of dissimilar data of different modalities are given a larger Hamming distance, and the multi-modal hash code generation model is obtained by training.
When a user performs a search, only the query data needs to be uploaded; the data is passed through the trained multi-modal hash code generation model to obtain a hash code, which is compared with the hash codes in the database to realize the retrieval function, thereby improving retrieval speed and accuracy.
Correspondingly, the invention also provides a multi-modal content retrieval system for text content combined with image analysis, which comprises:
the data preprocessing unit, used for labeling the image set in the ImageNet dataset to obtain text-image information pairs;
the feature extraction unit, used for inputting the text-image information pairs, constructing a feature extraction network to extract image and text features, and outputting the image features and the text features;
the multi-modal attention unit, used for inputting the image features and the text features and performing multi-modal attention computation to obtain attention-weighted multi-modal features;
the hash code generation unit, used for performing hash generation on the image features, the text features and the multi-modal features respectively and outputting an image hash code, a text hash code and a multi-modal hash code;
the model training unit, used for inputting the image hash code, the text hash code and the multi-modal hash code, constructing a target loss function, training the model with the loss function and finally obtaining a multi-modal hash code generation model;
the database construction unit, used for constructing a multi-modal hash code database from the database to be retrieved by using the trained multi-modal hash code generation model;
and the matching unit, used for generating a multi-modal hash code from text information input by a user using the multi-modal hash code generation model, and then matching it against the constructed multi-modal hash code database to obtain a retrieval result.
The implementation of the invention has the following beneficial effects:
According to the invention, a convolutional neural network is used to extract image features and a bag-of-words model is used to extract text features; an existing, advanced multi-modal attention method is adopted to fuse the two modalities instead of a complex fusion network, which reduces the computational complexity of the network. Meanwhile, a multi-modal hash code is added. Because of the correlation between different modalities, the purpose of multi-modal attention is to capture the commonality between modalities; these common features are distributed across the different single modalities and form a bridge for similarity calculation between modalities. Compared with the traditional approach of directly computing similarity between the hash codes of different modalities, this makes it easier to compensate for the heterogeneous gap between modalities and mines the commonality between modalities at a fundamental level. The scheme creatively constructs the multi-modal hash code, thereby improving the feature expression capability of the hash code, and the extraction of effective features is significantly improved through the back-propagation of the hash code into the multi-modal feature extraction process.
Drawings
FIG. 1 is a general flow diagram of a multi-modal content retrieval method for text content in combination with image analysis in accordance with an embodiment of the present invention;
FIG. 2 is a training flow diagram of a multi-modal hash generation model in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of a multi-modal attention computation in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart of a multi-modal hash code database generation in accordance with an embodiment of the present invention;
FIG. 5 is a search matching flow chart of an embodiment of the present invention;
FIG. 6 is a block diagram of a multi-modal content retrieval system for text content in combination with image analysis in accordance with an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 is a general flow chart of the multi-modal content retrieval method of text content in combination with image analysis according to an embodiment of the invention. As shown in FIG. 1, the method comprises:
S1, labeling the image set in the ImageNet dataset to obtain text-image information pairs;
S2, inputting the text-image information pairs, constructing a feature extraction network to extract image and text features, and outputting the image features and the text features;
S3, inputting the image features and the text features, and performing multi-modal attention computation to obtain attention-weighted multi-modal features;
S4, performing hash generation on the image features, the text features and the multi-modal features respectively, and outputting an image hash code, a text hash code and a multi-modal hash code;
S5, inputting the image hash code, the text hash code and the multi-modal hash code, constructing a target loss function, training the model with the loss function, and finally obtaining a multi-modal hash code generation model;
S6, as shown in FIG. 4, constructing a multi-modal hash code database from the database to be retrieved by using the trained multi-modal hash code generation model;
and S7, as shown in FIG. 5, generating a multi-modal hash code from text information input by a user using the multi-modal hash code generation model, and then matching it against the constructed multi-modal hash code database to obtain a retrieval result, as illustrated in the sketch below.
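For illustration, the matching performed in S6 and S7 can be sketched as follows. This is a minimal example, not part of the original disclosure: the bit-packing scheme and function names are assumptions, and it only shows how a query hash code could be compared against a database of hash codes using XOR and popcount to obtain Hamming distances.

import numpy as np

def pack_bits(codes_pm1):
    """Pack {-1, +1} hash codes of shape (n, c) into uint8 rows for compact storage."""
    bits = (codes_pm1 > 0).astype(np.uint8)           # map +1 -> 1, -1 -> 0
    return np.packbits(bits, axis=1)                   # shape (n, c/8)

def hamming_distances(query_packed, db_packed):
    """Hamming distance between one packed query code and every packed database code."""
    xor = np.bitwise_xor(db_packed, query_packed)      # differing bits
    return np.unpackbits(xor, axis=1).sum(axis=1)      # popcount per database row

# toy usage: 3 database items and 1 query, 16-bit codes
db_codes = np.sign(np.random.randn(3, 16))
query_code = np.sign(np.random.randn(1, 16))
dists = hamming_distances(pack_bits(query_code), pack_bits(db_codes))
ranking = np.argsort(dists)                            # most similar database items first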
Step S1, specifically, the following steps are performed:
S1-1, images and the information of their corresponding text descriptions are collected to form the ImageNet dataset, and the image set in the dataset is labeled to obtain text-image information pairs.
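As a simple illustration of the pairing produced by S1-1 (the file names and field names are hypothetical, not part of the original disclosure), each record pairs one image with its labeled text description:

text_image_pairs = [
    {"image": "images/great_wall_001.jpg", "text": "the Great Wall at Badaling in autumn"},
    {"image": "images/husky_017.jpg",      "text": "a Siberian husky running on snow"},
]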
Step S2, as shown in FIG. 2, is specifically as follows:
S2-1, for image features, a convolutional neural network is combined with a fully connected layer of 512 nodes, followed by a fully connected layer of K nodes with softmax as the activation function, to build the image feature extraction network, and its output is used as the learned image feature; this is combined with a fully connected layer whose number of nodes equals the number of label categories and whose activation function is sigmoid, so that the last layer of the image network outputs predicted labels, which are used to preserve the label features of each instance. For text features, each text is represented by a bag-of-words vector; to alleviate the feature sparsity easily caused by bag-of-words vectors, a multi-scale fusion model (MS) is adopted to extract the text data features. The MS contains five levels of pooling layers (1x1, 2x2, 3x3, 5x5, 10x10), followed by a fully connected layer of 4096 nodes, a fully connected layer of 512 nodes, a fully connected layer of K nodes with softmax as the activation function and, finally, a fully connected layer whose number of nodes equals the number of label categories and whose activation function is sigmoid; the output of the second-to-last layer of the text network is used as the learned text feature, and the last layer outputs the predicted labels.
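A minimal PyTorch-style sketch of the two feature extraction branches in S2-1 is given below. It only illustrates the layer arrangement described above; the ResNet-18 backbone, the values of K, the number of label categories and the vocabulary size are assumptions, and interpreting the five pooling levels of the MS as 1-D adaptive pooling over the bag-of-words vector is likewise an assumption rather than the original embodiment.

import torch
import torch.nn as nn
from torchvision import models

K = 128              # length of the learned feature (assumed)
NUM_LABELS = 24      # number of label categories (assumed)
VOCAB = 1386         # bag-of-words vocabulary size (assumed)

class ImageBranch(nn.Module):
    """CNN backbone -> 512-node FC -> K-node softmax feature -> sigmoid label layer."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)                    # backbone choice is an assumption
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # globally pooled 512-d features
        self.fc512 = nn.Linear(512, 512)
        self.feat = nn.Linear(512, K)
        self.label = nn.Linear(K, NUM_LABELS)

    def forward(self, x):
        h = self.cnn(x).flatten(1)
        h = torch.relu(self.fc512(h))
        f = torch.softmax(self.feat(h), dim=1)      # K-node softmax layer: learned image feature
        y = torch.sigmoid(self.label(f))            # last layer: predicted labels
        return f, y

class TextBranch(nn.Module):
    """Multi-scale pooling over the bag-of-words vector -> 4096 -> 512 -> K softmax -> sigmoid labels."""
    def __init__(self):
        super().__init__()
        self.scales = (1, 2, 3, 5, 10)
        self.fc4096 = nn.Linear(sum(self.scales), 4096)
        self.fc512 = nn.Linear(4096, 512)
        self.feat = nn.Linear(512, K)
        self.label = nn.Linear(K, NUM_LABELS)

    def forward(self, bow):                          # bow: (batch, VOCAB)
        x = bow.unsqueeze(1)                         # (batch, 1, VOCAB)
        pooled = [nn.functional.adaptive_avg_pool1d(x, s).squeeze(1) for s in self.scales]
        h = torch.cat(pooled, dim=1)
        h = torch.relu(self.fc4096(h))
        h = torch.relu(self.fc512(h))
        g = torch.softmax(self.feat(h), dim=1)       # second-to-last layer output: learned text feature
        y = torch.sigmoid(self.label(g))             # last layer: predicted labels
        return g, y

# toy usage
img_feat, img_labels = ImageBranch()(torch.randn(2, 3, 224, 224))
txt_feat, txt_labels = TextBranch()(torch.randn(2, VOCAB))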
Step S3, as shown in FIG. 3, is specifically as follows:
S3-1, a multi-modal cross-attention mechanism based on the self-attention mechanism is adopted to capture the correlation between the text and the image: the text is used as the query Q and the image as the key K and value V, multi-modal attention is then computed, the computations of the different heads are fused, and normalization is applied to obtain the final multi-modal feature.
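The cross-attention of S3-1 can be illustrated with the following sketch, in which the text feature acts as the query Q and the image feature supplies the key K and value V. nn.MultiheadAttention and LayerNorm are used here only as stand-ins for the attention computation, head fusion and normalization described above; the dimensions and head count are assumptions.

import torch
import torch.nn as nn

class MultiModalCrossAttention(nn.Module):
    """Text as query, image as key/value; heads are fused and the result is normalized."""
    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feat, image_feat):
        # text_feat: (batch, 1, dim) query; image_feat: (batch, regions, dim) keys and values
        fused, _ = self.attn(query=text_feat, key=image_feat, value=image_feat)
        return self.norm(fused.squeeze(1))           # attention-weighted multi-modal feature

# toy usage
text_feat = torch.randn(8, 1, 128)
image_feat = torch.randn(8, 49, 128)
multimodal_feat = MultiModalCrossAttention()(text_feat, image_feat)   # shape (8, 128)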
Step S4, specifically, the following steps are performed:
S4-1, the hash codes of the text features, the image features and the multi-modal features are obtained respectively using a sign function, as follows:

B_F = sign(F),  B_G = sign(G),  B_L = sign(L)

where sign(x) = 1 if x > 0 and sign(x) = −1 otherwise, F, G and L denote the real-valued image, text and multi-modal features, and B_F, B_G and B_L denote the corresponding image, text and multi-modal hash codes.
S4-2, the multi-modal features combine the text features and the image features, and the multi-modal features serve as an intermediate bridge linking the hash code learning processes of the different modalities; in the image hash code learning process, the image features are first encoded with the sign function to form the image hash code; the multi-modal features are encoded with the sign function to form the multi-modal hash code; and in the text hash code learning process, the text features are encoded with the sign function to form the text hash code; after the hash codes of the different modalities have been modeled, the target loss function is constructed.
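A minimal sketch of the binarization in S4-1/S4-2, assuming the features are real-valued tensors: each modality's feature is passed through the sign function to obtain a code in {-1, +1} (zeros are mapped to +1 here, which is an assumption).

import torch

def to_hash_code(features):
    """Binarize real-valued features with the sign function (zeros mapped to +1)."""
    return torch.where(features >= 0, torch.ones_like(features), -torch.ones_like(features))

# F: image features, G: text features, L: multi-modal features, each of shape (n, c)
F, G, L = torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 16)
B_F, B_G, B_L = to_hash_code(F), to_hash_code(G), to_hash_code(L)   # image / text / multi-modal hash codes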
Step S5, specifically, the following steps are performed:
The similarity relationship between the image features and the multi-modal features is modeled with a negative log-likelihood loss, and the similarity between the image features and the multi-modal features is measured with a cosine similarity function Θ_ij, as follows:

Θ_ij = cos(F_i, L_j)

L1 = −Σ_{i,j=1..n} log p(S_ij | Θ_ij),  with  p(S_ij = 1 | Θ_ij) = σ(Θ_ij)  and  p(S_ij = −1 | Θ_ij) = 1 − σ(Θ_ij)

where σ(·) denotes the sigmoid function; S is an n×n matrix whose element S_ij represents the similarity relationship between the i-th sample of the current modality and the j-th sample of the other modality, with S_ij = 1 denoting similar and S_ij = −1 denoting dissimilar; Θ_ij represents the similarity of the samples of the two different modalities in real-valued space; F_i = f(x_i; θ_x) ∈ R^c and L_i ∈ R^c denote the image feature and the multi-modal feature of the i-th sample respectively, both being real-valued feature representations; and c denotes the length of the feature vector, which is also the length of the subsequent binary hash representation, i.e., the number of bits. Optimizing the above negative log-likelihood loss keeps the similarity relationship in the feature space.

The similarity relationship between the text features and the multi-modal features is then modeled in the same way with a negative log-likelihood loss, with the similarity measured by a cosine similarity function Φ_ij, as follows:

Φ_ij = cos(G_i, L_j)

L2 = −Σ_{i,j=1..n} log p(S_ij | Φ_ij),  with  p(S_ij = 1 | Φ_ij) = σ(Φ_ij)  and  p(S_ij = −1 | Φ_ij) = 1 − σ(Φ_ij)

where S is again the n×n similarity matrix defined above; Φ_ij represents the similarity of the samples of the two different modalities in real-valued space; G_i = g(y_i; θ_y) ∈ R^c and L_i ∈ R^c denote the text feature and the multi-modal feature of the i-th sample respectively, both being real-valued feature representations; and c again denotes the length of the feature vector and of the subsequent binary hash representation. Optimizing this negative log-likelihood loss likewise keeps the similarity relationship in the feature space.

Meanwhile, in order to compensate for the information lost in the quantization process of the different modalities, quantization losses are constructed as follows:

L3 = ‖B − F‖²_F + ‖B − L‖²_F,  with  B = sign(F + L)

L4 = ‖B − G‖²_F + ‖B − L‖²_F,  with  B = sign(G + L)

where B is an n×c matrix used to continuously store and update the binary hash representation of each sample during training; B is in effect obtained as a post-processing step through sign(F + L) and sign(G + L), and sign(·) is the function that converts a number greater than 0 to 1 and a number less than 0 to −1.

Subsequently, in order to balance the relationship between the different modalities, a balance loss L5 is introduced.

Finally, the total target loss is:

L_all = L1 + L2 + L3 + L4 + L5
By learning with the above loss function, similar data of different modalities are mapped into nearby regions of the Hamming space, while the hash codes of dissimilar data of different modalities are given a larger Hamming distance, and the multi-modal hash code generation model is obtained by training.
When a user performs a search, only the query data needs to be uploaded; the data is passed through the trained multi-modal hash code generation model to obtain a hash code, which is compared with the hash codes in the database to realize the retrieval function, thereby improving retrieval speed and accuracy.
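For concreteness, the loss terms of step S5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the cosine similarity and negative log-likelihood follow the description above, while the pairing used to form B in the quantization terms and the specific form of the balance loss L5 are assumptions.

import torch
import torch.nn.functional as F

def nll_similarity_loss(feat_a, feat_b, S):
    """Negative log-likelihood of pairwise similarity labels; S holds +1 (similar) / -1 (dissimilar)."""
    theta = F.cosine_similarity(feat_a.unsqueeze(1), feat_b.unsqueeze(0), dim=2)   # (n, n)
    p_similar = torch.sigmoid(theta)
    target = ((S + 1) / 2).float()                    # map {-1, +1} to {0, 1}
    return F.binary_cross_entropy(p_similar, target, reduction="sum")

def quantization_loss(feat_a, feat_b):
    """Keep both real-valued features close to a shared binary code B = sign(feat_a + feat_b)."""
    B = torch.sign(feat_a + feat_b)
    return ((B - feat_a) ** 2).sum() + ((B - feat_b) ** 2).sum()

def balance_loss(img_feat, txt_feat, mm_feat):
    """Assumed form of L5: keep the single-modality features close to the multi-modal bridge feature."""
    return ((img_feat - mm_feat) ** 2).sum() + ((txt_feat - mm_feat) ** 2).sum()

def total_loss(img_feat, txt_feat, mm_feat, S):
    # img_feat, txt_feat, mm_feat: (n, c) real-valued branch outputs; S: (n, n) matrix in {-1, +1}
    L1 = nll_similarity_loss(img_feat, mm_feat, S)
    L2 = nll_similarity_loss(txt_feat, mm_feat, S)
    L3 = quantization_loss(img_feat, mm_feat)
    L4 = quantization_loss(txt_feat, mm_feat)
    L5 = balance_loss(img_feat, txt_feat, mm_feat)
    return L1 + L2 + L3 + L4 + L5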
Correspondingly, the invention also provides a multi-modal content retrieval system for text content combined with image analysis, as shown in FIG. 6, comprising:
The data preprocessing unit 1 is used for labeling the image set in the ImageNet dataset to obtain text-image information pairs.
Specifically, images and the information of their corresponding text descriptions are collected to form the ImageNet dataset, and the image set in the dataset is labeled to obtain text-image information pairs;
and the feature extraction unit 2 is used for inputting the text image information pair, constructing a feature extraction network to extract images and text features, and outputting the image features and the text features.
Specifically, for image features, a convolutional neural network is combined with a fully connected layer of 512 nodes, followed by a fully connected layer of K nodes with softmax as the activation function, to build the image feature extraction network, and its output is used as the learned image feature; this is combined with a fully connected layer whose number of nodes equals the number of label categories and whose activation function is sigmoid, so that the last layer of the image network outputs predicted labels, which are used to preserve the label features of each instance. For text features, each text is represented by a bag-of-words vector; to alleviate the feature sparsity easily caused by bag-of-words vectors, a multi-scale fusion model (MS) is adopted to extract the text data features. The MS contains five levels of pooling layers (1x1, 2x2, 3x3, 5x5, 10x10), followed by a fully connected layer of 4096 nodes, a fully connected layer of 512 nodes, a fully connected layer of K nodes with softmax as the activation function and, finally, a fully connected layer whose number of nodes equals the number of label categories and whose activation function is sigmoid; the output of the second-to-last layer of the text network is used as the learned text feature, and the last layer outputs the predicted labels.
The multi-modal attention unit 3 is used for inputting the image features and the text features and performing multi-modal attention computation to obtain attention-weighted multi-modal features.
Specifically, a multi-modal cross-attention mechanism based on the self-attention mechanism is adopted to capture the correlation between the text and the image: the text is used as the query Q and the image as the key K and value V, multi-modal attention is then computed, the computations of the different heads are fused, and normalization is applied to obtain the final multi-modal feature.
The hash code generation unit 4 is used for performing hash generation on the image features, the text features and the multi-modal features respectively and outputting an image hash code, a text hash code and a multi-modal hash code.
Specifically, the hash codes of the text features, the image features and the multi-modal features are obtained respectively using a sign function, as follows:

B_F = sign(F),  B_G = sign(G),  B_L = sign(L)

where sign(x) = 1 if x > 0 and sign(x) = −1 otherwise, F, G and L denote the real-valued image, text and multi-modal features, and B_F, B_G and B_L denote the corresponding image, text and multi-modal hash codes.
The multi-modal features combine the text features and the image features, and the multi-modal features serve as an intermediate bridge linking the hash code learning processes of the different modalities; in the image hash code learning process, the image features are first encoded with the sign function to form the image hash code; the multi-modal features are encoded with the sign function to form the multi-modal hash code; and in the text hash code learning process, the text features are encoded with the sign function to form the text hash code; after the hash codes of the different modalities have been modeled, the target loss function is constructed.
The model training unit 5 is configured to input the image hash code, the text hash code, and the multi-modal hash code, construct a target loss function, train a model using the loss function, and finally obtain a multi-modal hash code generation model.
Specifically, the similarity relationship between the image features and the multi-modal features is modeled with a negative log-likelihood loss, and the similarity between the image features and the multi-modal features is measured with a cosine similarity function Θ_ij, as follows:

Θ_ij = cos(F_i, L_j)

L1 = −Σ_{i,j=1..n} log p(S_ij | Θ_ij),  with  p(S_ij = 1 | Θ_ij) = σ(Θ_ij)  and  p(S_ij = −1 | Θ_ij) = 1 − σ(Θ_ij)

where σ(·) denotes the sigmoid function; S is an n×n matrix whose element S_ij represents the similarity relationship between the i-th sample of the current modality and the j-th sample of the other modality, with S_ij = 1 denoting similar and S_ij = −1 denoting dissimilar; Θ_ij represents the similarity of the samples of the two different modalities in real-valued space; F_i = f(x_i; θ_x) ∈ R^c and L_i ∈ R^c denote the image feature and the multi-modal feature of the i-th sample respectively, both being real-valued feature representations; and c denotes the length of the feature vector, which is also the length of the subsequent binary hash representation, i.e., the number of bits. Optimizing the above negative log-likelihood loss keeps the similarity relationship in the feature space.

The similarity relationship between the text features and the multi-modal features is then modeled in the same way with a negative log-likelihood loss, with the similarity measured by a cosine similarity function Φ_ij, as follows:

Φ_ij = cos(G_i, L_j)

L2 = −Σ_{i,j=1..n} log p(S_ij | Φ_ij),  with  p(S_ij = 1 | Φ_ij) = σ(Φ_ij)  and  p(S_ij = −1 | Φ_ij) = 1 − σ(Φ_ij)

where S is again the n×n similarity matrix defined above; Φ_ij represents the similarity of the samples of the two different modalities in real-valued space; G_i = g(y_i; θ_y) ∈ R^c and L_i ∈ R^c denote the text feature and the multi-modal feature of the i-th sample respectively, both being real-valued feature representations; and c again denotes the length of the feature vector and of the subsequent binary hash representation. Optimizing this negative log-likelihood loss likewise keeps the similarity relationship in the feature space.

Meanwhile, in order to compensate for the information lost in the quantization process of the different modalities, quantization losses are constructed as follows:

L3 = ‖B − F‖²_F + ‖B − L‖²_F,  with  B = sign(F + L)

L4 = ‖B − G‖²_F + ‖B − L‖²_F,  with  B = sign(G + L)

where B is an n×c matrix used to continuously store and update the binary hash representation of each sample during training; B is in effect obtained as a post-processing step through sign(F + L) and sign(G + L), and sign(·) is the function that converts a number greater than 0 to 1 and a number less than 0 to −1.

Subsequently, in order to balance the relationship between the different modalities, a balance loss L5 is introduced.

Finally, the total target loss is:

L_all = L1 + L2 + L3 + L4 + L5
By learning with the above loss function, similar data of different modalities are mapped into nearby regions of the Hamming space, while the hash codes of dissimilar data of different modalities are given a larger Hamming distance, and the multi-modal hash code generation model is obtained by training.
The database construction unit 6 is used for constructing a multi-modal hash code database from the database to be retrieved by using the trained multi-modal hash code generation model.
The matching unit 7 is used for generating a multi-modal hash code from text information input by a user using the multi-modal hash code generation model, and then matching it against the constructed multi-modal hash code database to obtain a retrieval result.
Therefore, the invention uses a convolutional neural network to extract image features and a bag-of-words model to extract text features; an existing, advanced multi-modal attention method is adopted to fuse the two modalities instead of a complex fusion network, which reduces the computational complexity of the network. Meanwhile, a multi-modal hash code is added. Because of the correlation between different modalities, the purpose of multi-modal attention is to capture the commonality between modalities; these common features are distributed across the different single modalities and form a bridge for similarity calculation between modalities. Compared with the traditional approach of directly computing similarity between the hash codes of different modalities, this makes it easier to compensate for the heterogeneous gap between modalities and mines the commonality between modalities at a fundamental level. The scheme creatively constructs the multi-modal hash code, thereby improving the feature expression capability of the hash code, and the extraction of effective features is significantly improved through the back-propagation of the hash code into the multi-modal feature extraction process.
The multi-modal content retrieval method and system combining text content with image analysis provided by the invention have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the invention, and the above description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the idea of the invention. In summary, the contents of this description should not be construed as limiting the invention.

Claims (12)

1. A multi-modal content retrieval method of text content in combination with image analysis, the method comprising:
labeling the image set in the ImageNet dataset to obtain text-image information pairs;
inputting the text-image information pairs, constructing a feature extraction network to extract image and text features, and outputting the image features and the text features;
inputting the image features and the text features, and performing multi-modal attention computation to obtain attention-weighted multi-modal features;
performing hash generation on the image features, the text features and the multi-modal features respectively, and outputting an image hash code, a text hash code and a multi-modal hash code;
inputting the image hash code, the text hash code and the multi-modal hash code, constructing a target loss function, training the model with the loss function, and finally obtaining a multi-modal hash code generation model;
constructing a multi-modal hash code database from the database to be retrieved by using the trained multi-modal hash code generation model;
and generating a multi-modal hash code from text information input by a user using the multi-modal hash code generation model, and then matching it against the constructed multi-modal hash code database to obtain a retrieval result.
2. The multi-modal content retrieval method for text content combined with image analysis according to claim 1, wherein the labeling of the image set in the ImageNet dataset to obtain text-image information pairs is specifically as follows:
images and the information of their corresponding text descriptions are collected to form the ImageNet dataset, and the image set in the dataset is labeled to obtain text-image information pairs.
3. The multi-modal content retrieval method of claim 1, wherein the inputting of the text-image information pairs, constructing a feature extraction network to extract image and text features, and outputting the image features and the text features is specifically as follows:
for image features, a convolutional neural network is combined with a fully connected layer of 512 nodes, followed by a fully connected layer of K nodes with softmax as the activation function, to build the image feature extraction network, and its output is used as the learned image feature; this is combined with a fully connected layer whose number of nodes equals the number of label categories and whose activation function is sigmoid, so that the last layer of the image network outputs predicted labels, which are used to preserve the label features of each instance. For text features, each text is represented by a bag-of-words vector; to alleviate the feature sparsity easily caused by bag-of-words vectors, a multi-scale fusion model (MS) is adopted to extract the text data features. The MS contains five levels of pooling layers (1x1, 2x2, 3x3, 5x5, 10x10), followed by a fully connected layer of 4096 nodes, a fully connected layer of 512 nodes, a fully connected layer of K nodes with softmax as the activation function and, finally, a fully connected layer whose number of nodes equals the number of label categories and whose activation function is sigmoid; the output of the second-to-last layer of the text network is used as the learned text feature, and the last layer outputs the predicted labels.
4. The multi-modal content retrieval method for text content combined with image analysis according to claim 1, wherein the inputting of the image features and the text features and performing multi-modal attention computation to obtain attention-weighted multi-modal features is specifically as follows:
a multi-modal cross-attention mechanism based on the self-attention mechanism is adopted to capture the correlation between the text and the image: the text is used as the query Q and the image as the key K and value V, multi-modal attention is then computed, the computations of the different heads are fused, and normalization is applied to obtain the final multi-modal feature.
5. The method for retrieving multi-modal content by combining text content with image analysis according to claim 1, wherein the hash generation is performed on the image feature, the text feature and the multi-modal feature, respectively, and an image hash code, a text hash code and a multi-modal hash code are output, specifically:
the hash codes of the text features, the image features and the multi-modal features are obtained respectively using a sign function, as follows:

B_F = sign(F),  B_G = sign(G),  B_L = sign(L)

where sign(x) = 1 if x > 0 and sign(x) = −1 otherwise, F, G and L denote the real-valued image, text and multi-modal features, and B_F, B_G and B_L denote the corresponding image, text and multi-modal hash codes;
the multi-modal features combine the text features and the image features, and the multi-modal features serve as an intermediate bridge linking the hash code learning processes of the different modalities; in the image hash code learning process, the image features are first encoded with the sign function to form the image hash code; the multi-modal features are encoded with the sign function to form the multi-modal hash code; and in the text hash code learning process, the text features are encoded with the sign function to form the text hash code; after the hash codes of the different modalities have been modeled, the target loss function is constructed.
6. The method for multi-modal content retrieval by combining text content with image analysis according to claim 1, wherein the inputting the image hash code, the text hash code and the multi-modal hash code constructs a target loss function, and a loss function training model is utilized to finally obtain a multi-modal hash code generation model, which is specifically as follows:
modeling similarity between image features and multi-modal features using negative log likelihood loss, modeling similarity between image and multi-modal features using cosine similarity function
Figure FDA0004028960700000032
The following is shown:
Figure FDA0004028960700000033
wherein S is a matrix of n x n,
Figure FDA0004028960700000034
representing a similarity relationship between the ith sample of the current modality and the jth sample of the other modality,/>
Figure FDA0004028960700000035
Representing similarity, -1 representing dissimilarity, +.>
Figure FDA0004028960700000036
Figure FDA0004028960700000037
Representing the similarity of two different modal samples in real value space, f=f (x i ;θ x )∈R C And l=g (y i ;θ y )∈R C Respectively representing image features and multi-modal features, wherein the image features and the multi-modal features are real-valued feature representations, c represents the length of a feature vector, and is the length of a subsequent binary hash representation, which is also called bit, and the negative log likelihood loss above optimization can enable the similarity relationship to be kept in a feature space;
then modeling the similarity between the text and the multi-modal feature using a cosine similarity function using the similarity relationship between the negative log-likelihood loss modeling text and the multi-modal feature
Figure FDA0004028960700000041
The following is shown:
Figure FDA0004028960700000042
wherein S is a matrix of n x n,
Figure FDA0004028960700000043
representing a similarity relationship between the ith sample of the current modality and the jth sample of the other modality,/>
Figure FDA0004028960700000044
Representing similarity, -1 representing dissimilarity, +.>
Figure FDA0004028960700000045
Figure FDA0004028960700000046
Representing two modes of failureSimilarity to the sample in real space, l=f (x i ;θ x )∈R C And g=g (y i ;θ y )∈R C Respectively representing text features and multi-modal features, wherein the text features and the multi-modal features are real-valued feature representations, c represents the length of a feature vector, and also represents the length of a subsequent binary hash representation, which is also called bit, and the negative log likelihood loss above optimization can enable the similarity relationship to be kept in a feature space;
meanwhile, to compensate for the information lost when the different modalities are quantized, quantization losses are constructed as follows:

L3 = ||B - f||_F^2 + ||B - g||_F^2

L4 = ||B - l||_F^2 + ||B - g||_F^2

wherein B is an n×c matrix that continuously stores and updates the binary hash representation of each sample during training; obtaining B is in effect a post-processing step, computed as sign(f + g) and sign(l + g) respectively, and sign(·) is the function that converts numbers greater than 0 to 1 and numbers less than 0 to -1;
subsequently, to balance the relationship between the different modalities, a balance loss is introduced, as follows:
L5 = [balance term; the formula is given as image FDA0004028960700000049 in the original claims]
finally, the total target loss is:
L_all = L1 + L2 + L3 + L4 + L5
by learning this loss function, similar data from different modalities are mapped to nearby points in the Hamming space, while the hash codes of dissimilar data from different modalities are pushed to a larger Hamming distance, and the multi-modal hash code generation model is obtained through training;
when a user performs a retrieval, the user only needs to upload the data; the data is passed through the trained multi-modal hash code generation model to obtain a hash code, and this hash code is compared with the hash codes in the database to realize the retrieval function, thereby improving retrieval speed and accuracy.
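Because the loss formulas in claim 6 are filed as images, the following PyTorch sketch is only one plausible reading of the described terms: it assumes the standard pairwise negative log-likelihood form for L1/L2, an ℓ2 quantization term for L3/L4, and uses a common bit-balance term as a stand-in for L5, whose exact formula is not recoverable from the text; all tensor names and sizes are hypothetical.

```python
import torch
import torch.nn.functional as F

def nll_similarity_loss(a, b, S):
    """Assumed form of L1/L2: pairwise negative log-likelihood over cosine similarities.
    a: (n, c) features of one modality; b: (n, c) multi-modal features; S: (n, n) in {-1, +1}."""
    theta = F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()  # cosine similarity matrix
    s01 = (S + 1) / 2                                          # map {-1, +1} to {0, 1}
    return -(s01 * theta - torch.log1p(torch.exp(theta))).mean()

def quantization_loss(real, codes):
    """Assumed form of L3/L4: keep real-valued features close to their binary codes."""
    return F.mse_loss(real, codes)

def balance_loss(*feats):
    """Stand-in for L5: a common bit-balance term pushing each bit toward half +1 / half -1."""
    return sum(feat.sum(dim=0).pow(2).mean() for feat in feats)

# Hypothetical batch: f = image, l = text, g = multi-modal features; S = label similarity.
n, c = 8, 64
f, l, g = torch.randn(n, c), torch.randn(n, c), torch.randn(n, c)
S = (torch.rand(n, n) > 0.5).float() * 2 - 1

B_img = torch.sign(f + g).detach()  # binary codes, refreshed as a post-processing step
B_txt = torch.sign(l + g).detach()

L1 = nll_similarity_loss(f, g, S)
L2 = nll_similarity_loss(l, g, S)
L3 = quantization_loss(f, B_img) + quantization_loss(g, B_img)
L4 = quantization_loss(l, B_txt) + quantization_loss(g, B_txt)
L5 = balance_loss(f, l, g)
L_all = L1 + L2 + L3 + L4 + L5
```

In an actual training loop, f, l and g would come from the image, text and attention-fusion networks, L_all would be backpropagated through them, and B_img and B_txt would be refreshed between iterations.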
7. A multi-modal content retrieval system for text content in combination with image analysis, the system comprising:
the data preprocessing unit is used for labeling the image set in the data set Imagenet to obtain a text image information pair;
the feature extraction unit is used for inputting the text image information pair, constructing a feature extraction network to extract images and text features, and outputting the image features and the text features;
the multi-modal attention unit is used for inputting the image features and the text features, carrying out multi-modal attention calculation and obtaining weighted multi-modal features;
the hash code generation unit is used for respectively carrying out hash generation on the image features, the text features and the multi-modal features and outputting an image hash code, a text hash code and a multi-modal hash code;
the model training unit is used for inputting the image hash code, the text hash code and the multi-modal hash code, constructing a target loss function, training a model by using the loss function, and finally obtaining a multi-modal hash code generation model;
the database construction unit is used for constructing a multi-modal hash code database from the database to be retrieved by utilizing the multi-modal hash code generation model obtained through training;
and the matching unit is used for generating a multi-modal hash code from the text information input by the user by utilizing the multi-modal hash code generation model, and then matching this hash code against the constructed multi-modal hash code database to obtain the retrieval result.
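To make the roles of the database construction unit and the matching unit concrete, here is a minimal sketch (under assumed shapes and an assumed 64-bit code length, not the patented implementation) of storing {-1, +1} hash codes and ranking database entries by Hamming distance for a query code:

```python
import numpy as np

def hamming_distance(query, database):
    """Hamming distances between one {-1, +1} query code and an array of database codes.
    For codes in {-1, +1}, the distance is (c - query . row) / 2, with c the code length."""
    c = query.shape[0]
    return (c - database @ query) / 2

def retrieve(query_code, database_codes, top_k=5):
    """Indices of the top_k database entries closest to the query in Hamming distance."""
    return np.argsort(hamming_distance(query_code, database_codes))[:top_k]

# Hypothetical 64-bit codes: 1000 items hashed offline by the trained model,
# plus one code generated from the user's query text.
database_codes = np.sign(np.random.randn(1000, 64))
query_code = np.sign(np.random.randn(64))
print(retrieve(query_code, database_codes, top_k=5))
```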
8. The multi-modal content retrieval system with text content combined with image analysis as claimed in claim 7, wherein the data preprocessing unit collects images and their corresponding text descriptions to form the dataset Imagenet, and labels the image set in the dataset Imagenet to obtain text image information pairs.
9. The multi-modal content retrieval system of claim 7, wherein, for image features, the feature extraction unit constructs an image feature extraction network by connecting a 512-node fully connected layer to a convolutional neural network, followed by a K-node layer with softmax activation whose output is taken as the learned image feature, together with a fully connected layer with sigmoid activation and as many nodes as there are label categories, which outputs predicted labels to preserve the label characteristics of each instance; for text features, each text is represented by a bag-of-words vector, and to alleviate the feature sparsity that bag-of-words vectors easily cause, a multi-scale fusion model MS is adopted to extract the text data features, the MS comprising five pooling levels (1x1, 2x2, 3x3, 5x5, 10x10); the MS is followed by a 4096-node fully connected layer, then a 512-node fully connected layer, then a K-node fully connected layer with softmax activation, and finally a fully connected layer with sigmoid activation whose number of nodes equals the number of label categories; the output of the penultimate layer of the text network is taken as the learned text feature, and the last layer outputs the predicted labels.
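As one possible reading of the text branch in claim 9 (a sketch, not the filed network): a bag-of-words vector is pooled at the five MS scales, passed through the 4096- and 512-node fully connected layers, a K-node softmax layer whose output serves as the text feature, and a sigmoid label head. Treating the five pooling levels as 1-D adaptive average pooling, and the vocabulary size, K and label count below, are assumptions.

```python
import torch
import torch.nn as nn

class TextNet(nn.Module):
    """Sketch of the described text branch: multi-scale pooling (MS) over a
    bag-of-words vector, FC 4096 -> FC 512 -> K-node softmax (text feature),
    plus a sigmoid label head."""

    def __init__(self, vocab_size: int, k: int, num_labels: int):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool1d(s) for s in (1, 2, 3, 5, 10)])
        pooled_dim = 1 + 2 + 3 + 5 + 10              # 21 pooled values after concatenation
        self.fc1 = nn.Linear(pooled_dim, 4096)
        self.fc2 = nn.Linear(4096, 512)
        self.fc3 = nn.Linear(512, k)                  # K-node layer, softmax-activated
        self.label_head = nn.Linear(k, num_labels)    # sigmoid label predictions

    def forward(self, bow: torch.Tensor):
        x = bow.unsqueeze(1)                          # (batch, 1, vocab_size)
        x = torch.cat([p(x).flatten(1) for p in self.pools], dim=1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        text_feat = torch.softmax(self.fc3(x), dim=1)            # penultimate output = text feature
        pred_labels = torch.sigmoid(self.label_head(text_feat))  # last layer = predicted labels
        return text_feat, pred_labels

# Hypothetical usage: vocabulary of 1386 words, K = 64, 24 label categories.
net = TextNet(vocab_size=1386, k=64, num_labels=24)
feat, labels = net(torch.rand(4, 1386))
```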
10. The system of claim 7, wherein the multi-modal attention unit captures the correlation between text and image using a multi-modal cross attention mechanism built on the self-attention mechanism: the text is used as the query Q, the image is used as the key K and the value V, multi-modal attention is computed, the outputs of the different heads are fused, and normalization is applied to obtain the final multi-modal features.
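A minimal PyTorch sketch of such text-to-image cross attention (the feature dimension, head count and residual connection are assumptions, not taken from the filing):

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Cross attention with text features as the query Q and image features as
    the key K and value V; multi-head outputs are fused and layer-normalized."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # text_feat: (batch, n_text_tokens, dim); image_feat: (batch, n_regions, dim)
        fused, _ = self.attn(query=text_feat, key=image_feat, value=image_feat)
        return self.norm(fused + text_feat)  # residual connection + normalization

# Hypothetical usage: 4 samples, 16 text tokens, 49 image regions, 512-d features.
attn = CrossModalAttention(dim=512, num_heads=8)
multi_modal_feat = attn(torch.randn(4, 16, 512), torch.randn(4, 49, 512))
```

The multi-head projection inside nn.MultiheadAttention performs the per-head computation and fusion; the layer normalization afterwards corresponds to the normalization step described in the claim.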
11. The multi-modal content retrieval system of claim 7, wherein the hash code generation unit is configured to obtain hash codes of text, image and multi-modal features using sign functions, respectively, according to the following formula:
B_text = sign(F_text), B_image = sign(F_image), B_multi = sign(F_multi),

where sign(·) maps values greater than 0 to 1 and values less than 0 to -1;
the multi-modal features combine text features and image features, and serve as an intermediate bridge linking the hash code learning processes of the different modalities: in the image hash code learning process, the image features are encoded with the sign function to form the image hash code; the multi-modal features are encoded with the sign function to form the multi-modal hash code; and in the text hash code learning process, the text features are encoded with the sign function to form the text hash code; after the hash codes of the different modalities are modeled, a target loss function is constructed.
12. The system for multi-modal content retrieval combining text content with image analysis as claimed in claim 7, wherein the model training unit models the similarity between the image features and the multi-modal features with a negative log-likelihood loss, in which the real-valued similarity of the two modalities is measured by the cosine similarity Θ_ij:

L1 = -Σ_{s_ij ∈ S} ( s_ij Θ_ij - log(1 + e^{Θ_ij}) ),  with Θ_ij = cos(f_i, g_j),

wherein S is an n×n matrix, s_ij denotes the similarity relationship between the i-th sample of the current modality and the j-th sample of the other modality (1 denotes similar, -1 denotes dissimilar), Θ_ij represents the similarity of two samples from different modalities in real-valued space, f = f(x_i; θ_x) ∈ R^c and g = g(y_i; θ_y) ∈ R^c respectively denote the real-valued image features and multi-modal features, and c denotes the length of the feature vector, which is also the length of the subsequent binary hash representation (i.e. the number of bits); optimizing this negative log-likelihood loss keeps the similarity relationship in the feature space;
then the similarity between the text features and the multi-modal features is modeled with the same negative log-likelihood loss, again measuring the real-valued similarity by the cosine similarity Θ_ij:

L2 = -Σ_{s_ij ∈ S} ( s_ij Θ_ij - log(1 + e^{Θ_ij}) ),  with Θ_ij = cos(l_i, g_j),

wherein S is an n×n matrix, s_ij denotes the similarity relationship between the i-th sample of the current modality and the j-th sample of the other modality (1 denotes similar, -1 denotes dissimilar), Θ_ij represents the similarity of two samples from different modalities in real-valued space, l = l(t_i; θ_t) ∈ R^c and g = g(y_i; θ_y) ∈ R^c respectively denote the real-valued text features and multi-modal features, and c denotes the length of the feature vector, which is also the length of the subsequent binary hash representation (i.e. the number of bits); optimizing this negative log-likelihood loss keeps the similarity relationship in the feature space;
meanwhile, to compensate for the information lost when the different modalities are quantized, quantization losses are constructed as follows:

L3 = ||B - f||_F^2 + ||B - g||_F^2

L4 = ||B - l||_F^2 + ||B - g||_F^2

wherein B is an n×c matrix that continuously stores and updates the binary hash representation of each sample during training; obtaining B is in effect a post-processing step, computed as sign(f + g) and sign(l + g) respectively, and sign(·) is the function that converts numbers greater than 0 to 1 and numbers less than 0 to -1;
subsequently, to balance the relationship between the different modalities, a balance loss is introduced, as follows:
L5 = [balance term; the formula is given as image FDA0004028960700000086 in the original claims]
finally, the total target loss is:
L_all = L1 + L2 + L3 + L4 + L5
by learning this loss function, similar data from different modalities are mapped to nearby points in the Hamming space, while the hash codes of dissimilar data from different modalities are pushed to a larger Hamming distance, and the multi-modal hash code generation model is obtained through training;
when a user performs a retrieval, the user only needs to upload the data; the data is passed through the trained multi-modal hash code generation model to obtain a hash code, and this hash code is compared with the hash codes in the database to realize the retrieval function, thereby improving retrieval speed and accuracy.
CN202211723519.3A 2022-12-30 2022-12-30 Multi-mode content retrieval method and system for text content and image analysis Pending CN116204706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211723519.3A CN116204706A (en) 2022-12-30 2022-12-30 Multi-mode content retrieval method and system for text content and image analysis


Publications (1)

Publication Number Publication Date
CN116204706A true CN116204706A (en) 2023-06-02

Family

ID=86513886


Country Status (1)

Country Link
CN (1) CN116204706A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116431847A (en) * 2023-06-14 2023-07-14 北京邮电大学 Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure
CN116431847B (en) * 2023-06-14 2023-11-14 北京邮电大学 Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure
CN116994069A (en) * 2023-09-22 2023-11-03 武汉纺织大学 Image analysis method and system based on multi-mode information
CN116994069B (en) * 2023-09-22 2023-12-22 武汉纺织大学 Image analysis method and system based on multi-mode information
CN117094367A (en) * 2023-10-19 2023-11-21 腾讯科技(深圳)有限公司 Content generation method, model training method, device, electronic equipment and medium
CN117094367B (en) * 2023-10-19 2024-03-29 腾讯科技(深圳)有限公司 Content generation method, model training method, device, electronic equipment and medium
CN117194605A (en) * 2023-11-08 2023-12-08 中南大学 Hash encoding method, terminal and medium for multi-mode medical data deletion
CN117194605B (en) * 2023-11-08 2024-01-19 中南大学 Hash encoding method, terminal and medium for multi-mode medical data deletion
CN117521017A (en) * 2024-01-03 2024-02-06 支付宝(杭州)信息技术有限公司 Method and device for acquiring multi-mode characteristics
CN117521017B (en) * 2024-01-03 2024-04-05 支付宝(杭州)信息技术有限公司 Method and device for acquiring multi-mode characteristics

Similar Documents

Publication Publication Date Title
Zhu et al. Exploring auxiliary context: discrete semantic transfer hashing for scalable image retrieval
Ma et al. Multi-level correlation adversarial hashing for cross-modal retrieval
CN116204706A (en) Multi-mode content retrieval method and system for text content and image analysis
Xiao et al. Convolutional hierarchical attention network for query-focused video summarization
CN110765281A (en) Multi-semantic depth supervision cross-modal Hash retrieval method
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
Zou et al. Multi-label semantics preserving based deep cross-modal hashing
CN112347268A (en) Text-enhanced knowledge graph joint representation learning method and device
CN111291188B (en) Intelligent information extraction method and system
CN112417097B (en) Multi-modal data feature extraction and association method for public opinion analysis
CN111581401A (en) Local citation recommendation system and method based on depth correlation matching
Lin et al. Mask cross-modal hashing networks
WO2020042597A1 (en) Cross-modal retrieval method and system
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN113157886B (en) Automatic question and answer generation method, system, terminal and readable storage medium
CN112115253B (en) Depth text ordering method based on multi-view attention mechanism
CN112580362A (en) Visual behavior recognition method and system based on text semantic supervision and computer readable medium
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
CN113094534B (en) Multi-mode image-text recommendation method and device based on deep learning
Wang et al. Fusion-supervised deep cross-modal hashing
CN110598022A (en) Image retrieval system and method based on robust deep hash network
Song et al. A weighted topic model learned from local semantic space for automatic image annotation
CN113806554A (en) Knowledge graph construction method for massive conference texts
Yu et al. Text-image matching for cross-modal remote sensing image retrieval via graph neural network
Wang et al. Cross-modal image–text search via efficient discrete class alignment hashing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination