CN116204706A

CN116204706A - Multi-mode content retrieval method and system for text content and image analysis

Info

Publication number: CN116204706A
Application number: CN202211723519.3A
Authority: CN
Inventors: 周凡; 张富为; 林谋广
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2023-06-02

Abstract

The invention discloses a multi-mode content retrieval method for text content and image analysis. Comprising the following steps: preprocessing the data set to obtain a text image information pair; extracting image and text features, and performing multi-modal attention calculation on the image features and the text features to obtain multi-modal features; encoding the image, the text and the multi-modal feature to form a corresponding hash code; constructing a target loss function, and training to obtain a multi-mode hash code generation model; constructing a multi-mode hash code database of the database from the database to be retrieved by utilizing a multi-mode hash code generation model; and generating a multi-mode hash code according to text information input by a user, and matching the multi-mode hash code with a multi-mode hash code database to obtain a retrieval result. The invention also discloses a multi-mode content retrieval system for text content and image analysis. The method uses the multi-mode hash codes to fundamentally capture the commonalities among the modes, make up the heterogeneous gaps among the modes and remarkably improve the extraction efficiency of the effective features.

Description

Multi-mode content retrieval method and system for text content and image analysis

Technical Field

The invention relates to a retrieval technology, in particular to a multi-mode content retrieval method and system for text content and image analysis.

Background

In the last decade, different types of multimedia data have been explosively propagated over the internet by textual descriptions, presentation of images and video, and the like. When the web page transmits information to the audience, language text expression and image auxiliary description are adopted to describe the same event or theme in a mode of simultaneously presenting video. The different types of data in this expression are referred to as multi-modal data. Therefore, how to quickly and accurately implement multi-modal retrieval in the internet era has attracted considerable attention from researchers.

Today, mobile communication devices and emerging social networking sites (e.g., facebook, flickr, youTube and Twitter) are changing the way people interact with the world and search for information of interest. It would be convenient if the user could use any media content as input for query-related information. Assuming we are playing the great wall, we may wish to retrieve the relevant literal material from the photos taken, and thus learn about the history of the place being played and the interest. Thus, multimodal retrieval is becoming increasingly important as a common search modality. Multimodal retrieval aims at retrieving one type of data as a query against another type of related data. In addition, when users submit data of any media type to search information, they can obtain search results of various media types, and the search results obtained in this way are more comprehensive because different manifestations of the data can complement each other. In the field of multi-modal research, the initial direction of research is multi-modal search. The multi-modal search technology relates to the fields of natural language processing, image processing, voice recognition, computer vision, machine learning and the like, and is closely related to various subjects of computer science, statistics, mathematics and the like. In one aspect, research into multimodal search methods will motivate the further development of many machine learning theories (e.g., multi-perspective learning, hash learning, subspace learning, metric learning, deep learning, etc.). On the other hand, the research of the multi-mode searching method is an indispensable stage of the generation and development of a new searching technology, and the new searching technology can promote the social and economic development of the information technology, so that the research of the multi-mode searching method has important significance. As the phenomenon of mutual complementary descriptions between multimodal data becomes more common, various search engines and multimodal data on social media all exhibit explosive growth. Researchers have focused on how to quickly and accurately search for data representing the same event topic in different modalities in large-scale multi-modality data. Since data from different modalities typically have incomparable feature representations and distributions, it is necessary to map them to a common feature space. In order to meet the requirements of low storage cost and high query speed in practical application, scientific researchers propose a hash multi-mode retrieval method. The method maps the high-dimensional multi-modal data to a public Hamming space, and can calculate the similarity between the multi-modal data only through exclusive OR operation after the hash codes are obtained. Most of the existing hash multi-mode retrieval methods use hand-made features for hash learning, and the hash code learning speed of the methods is high, so that good retrieval effect can be achieved. However, a common disadvantage of the hash multi-mode search algorithm is that the manual feature making process and the hash learning process are completely independent, and further the manual feature may not be completely compatible with the hash learning process, so that the hash multi-mode search method using the manual feature is poor in search effect. Proved by experiments, on a plurality of commonly used multi-modal data sets used at present, if hash learning is carried out by continuously using hand-made features, the multi-modal retrieval effect is difficult to be improved. In order to solve the problem that the hand-made features are not fully compatible with the hash learning process, feature learning matched with the hash learning needs to be researched. The method aims at exploring a Hash multi-modal retrieval technology from the aspects of reducing coding errors, mining semantic information of multi-modal data, reducing the difference among the multi-modal data and the like when performing feature learning matched with the Hash learning by using deep learning.

The basic idea of hash multi-modal retrieval is to map data from the original feature space to the binary encoding space for similarity retrieval. The Hash multi-mode retrieval method maps multi-mode data into binary codes, and the binary codes have the advantages of small occupied space and high calculation speed. The general hash multi-mode searching method is divided into two steps, firstly, data of different modes or hand-made features are mapped into a public feature space through linear transformation or non-linear transformation, and secondly, the features of the public feature space are encoded, and most of the hash multi-mode searching methods use binary partition functions for encoding. The key difficulty of multi-modal retrieval is how to correlate semantic relatedness between multi-modal data, usually solved by learning uniform hash codes from different modalities of the same sample or narrowing the hamming distance between semantically related multi-modal data. The hash multi-mode search method can be classified into a supervised method and an unsupervised method according to whether tag information is used or not, and can be classified into a linear method and a nonlinear method according to a projection mode.

One of the existing technologies is multi-modal similarity sensitive hash (Cross-Modal Similarity Sensitive Hashing, CMSSH) in the paper "Large-scale supervised multimodal hashing with semantic correlation maximization", the method firstly maps the original data into hash codes, the similarity matrix is obtained by using the hash codes and a defined multi-modal data fusion mode, the weight of the next hash function is obtained by using the similarity matrix, the method regards the learning of each hash function as a two-classification process, namely a weak classifier, and finally the weak classifier is integrated into a strong classifier by using a standard lifting algorithm. The disadvantage of this approach is that CMSSH maintains consistency between multimodal data using point pair based methods, but does not take into account intra-modality data similarity.

The second prior art is the supervised matrix decomposition hash (Supervised Matrix Factorization Hashing, SMFH) proposed in paper Kenel-based supervised hashing for cross-view similarity search, the method uses matrix decomposition for retrieval in hash multi-mode, tag information and local geometry are effectively utilized, the consistency of tags in different modes is considered, the local geometric consistency in each mode is considered, the two elements are formulated into a drawing Laplace's subitem in an objective function, and the discrimination capability of potential semantic features obtained by joint matrix decomposition is greatly improved. The Discrete multi-mode hash (DCH) proposed in paper "the DCH retains Discrete constraints, and a linear classifier is constructed by using a label as supervision information, and the discriminative hash code is directly learned. The method has the disadvantage that compared with a nonlinear structure, semantic information learned by the linear structure is limited, and the linear method cannot be classified into a point pair-based method and a label-based method.

The third prior art is a multi-mode hidden binary embedding (Multimodal Latent Binary Embedding, MLBE) model proposed in the paper Deep cross-mode mapping, and the MLBE uses a generation model to encode intra-mode and inter-mode similarities of multi-mode data. Based on the maximum a posteriori estimates, binary latent factors that maintain both intra-and inter-modality similarities are effectively obtained and then used as learned hash codes. The method has the defects that during learning, especially when the code length is large, parameters are relatively large, the calculation complexity is high, and the optimization is easy to sink into a local minimum value.

Disclosure of Invention

The invention aims to overcome the defects of the existing method and provides a multi-mode content retrieval method and system for text content and image analysis. The invention solves the main problems that the CMSSH adopts a point pair method to maintain consistency among multi-mode data, but neglects similarity of data in modes; secondly, the current multi-mode hash method has insufficient hash code characteristic capacity expression and low effective characteristic extraction efficiency; thirdly, when the code length is larger, the parameters are more, and the calculation complexity is high, the MLBE model optimization is easy to sink into a local minimum.

In order to solve the above problems, the present invention proposes a multi-modal content retrieval method for text content combined with image analysis, the method comprising:

labeling the image set in the dataset Imagenet to obtain a text image information pair;

inputting the text image information pair, constructing a feature extraction network to extract images and text features, and outputting the image features and the text features;

inputting the image features and the text features, and performing multi-modal attention calculation to obtain weighted attention multi-modal features;

respectively carrying out hash generation on the image features, the text features and the multi-modal features, and outputting an image hash code, a text hash code and the multi-modal hash code;

Inputting the image hash code, the text hash code and the multi-modal hash code, constructing a target loss function, training a model by using the loss function, and finally obtaining a multi-modal hash code generation model;

constructing a multi-mode hash code database of the database from the database to be retrieved by utilizing the multi-mode hash code generation model obtained through training;

and generating a multi-mode hash code by utilizing the multi-mode hash code generation model according to text information input by a user, and then matching with the constructed multi-mode hash code database to obtain a retrieval result.

Preferably, labeling is performed on the image set in the dataset Imagenet to obtain a text image information pair, specifically:

and collecting the images and the information corresponding to the text description, forming a data set image, and labeling the image set in the data set image to obtain a text image information pair.

Preferably, the text image information pair is input, a feature extraction network is constructed to extract image and text features, and the image features and the text features are output, specifically:

for image features, a convolutional neural network is combined with a full-connection layer of 512 nodes, K nodes are connected at the same time, an image feature extraction network is constructed by using softmax as an activation function, the output is used as a learned image feature, a full-connection layer of sigmoid of the activation function with the label category number is combined, and a predicted label is output by the last layer of the image network to be used for maintaining the label feature of an example; for text features, each text is represented by a word bag vector, in order to solve the problem that the word bag vector easily causes feature sparseness, a multi-scale fusion model MS is adopted to extract text data features, the MS comprises five-level pooling layers (1 x1,2x2,3x3,5x5,10x 10), a 4096-node full-connection layer is connected after the MS, a 512-node full-connection layer is connected, a K-node full-connection layer with an activation function of softmax is connected, a full-connection layer with a label category number and a node activation function of sigmoid is finally adopted, the output of the last second layer of the text network is used as the learned text features, and the last layer outputs predicted labels.

Preferably, the inputting the image feature and the text feature performs multi-modal attention calculation to obtain weighted multi-modal features, which specifically are:

the method comprises the steps of capturing the correlation between a text and an image by adopting a multi-mode cross attention mechanism based on a self-attention mechanism, taking the text as a query sub Q, taking the image as key values K and V, then carrying out multi-mode attention calculation, merging the calculation of different heads, and carrying out normalization processing to obtain the final multi-mode feature.

Preferably, the hash generation is performed on the image feature, the text feature and the multi-mode feature, and an image hash code, a text hash code and a multi-mode hash code are output, specifically:

and respectively acquiring the text, the image and the hash codes of the multi-modal features by adopting a sign function, wherein the formula is as follows:

the multi-modal feature comprises text features and image features, the multi-modal feature is used as an intermediate bridge to link a hash code learning process between different modalities, the image hash code learning process firstly encodes the image features by a sign function, and an image hash code is formed; encoding the multi-modal features by using a sign function to form a multi-modal hash code; in the text hash code learning process, text features are encoded by adopting sign functions to form text hash codes, and after modeling the hash codes of different modes, a target loss function is constructed.

Preferably, the inputting the image hash code, the text hash code and the multi-modal hash code constructs a target loss function, trains a model by using the loss function, and finally obtains a multi-modal hash code generation model, which is specifically as follows:

modeling similarity between image features and multi-modal features using negative log likelihood loss, modeling similarity between image and multi-modal features using cosine similarity function

The following is shown:

wherein S is a matrix of n x n,

representing a similarity relationship between the ith sample of the current modality and the jth sample of the other modality,/>

Representing similarity, -1 representing dissimilarity, +.>

Representing the similarity of two different modal samples in real value space, f=f (x _i ；θ _x )∈R ^C And l=g (y _i ；θ _y )∈R ^C Respectively representing image features and multi-modal features, wherein the image features and the multi-modal features are real-valued feature representations, c represents the length of a feature vector, and is the length of a subsequent binary hash representation, which is also called bit, and the negative log likelihood loss above optimization can enable the similarity relationship to be kept in a feature space;

then modeling the similarity between the text and the multi-modal feature using a cosine similarity function using the similarity relationship between the negative log-likelihood loss modeling text and the multi-modal feature

The following is shown:

wherein S is a matrix of n x n,

Representing similarity, -1 representing dissimilarity, +.>

Representing two different modes of a sample in realSimilarity of value space, l=f (x _i ；θ _x )∈R ^C And g=g (y _i ；θ _y )∈R ^C Respectively representing text features and multi-modal features, wherein the text features and the multi-modal features are real-valued feature representations, c represents the length of a feature vector, and also represents the length of a subsequent binary hash representation, which is also called bit, and the negative log likelihood loss above optimization can enable the similarity relationship to be kept in a feature space;

meanwhile, in order to compensate the information lost in the quantization process of different modes, the quantization loss is constructed as follows:

wherein B is a matrix of n×c for continuously storing and updating the binary hash representation of each sample during training, and B is actually a post-processing process, and sign (·) obtained by sign (f+l) and sign (f+g) is a function of converting a number greater than 0 to 1 and a number less than 0 to-1;

subsequently, to balance the relationship between the different modalities, a balance loss is introduced, as follows:

/>

finally, the total target loss is:

L _all ＝L1+L2+L3+L4+L5

Mapping similar data of different modes into a similar Hamming space through learning the loss function, enabling Hamming codes of dissimilar data among different modes to have a further Hamming distance in the Hamming space, and training to obtain a multi-mode Hamming code generation model;

when a user searches, the multi-mode hash code generation model only needs to upload data, the data is subjected to the trained model operation to obtain a hash code, and the hash code is compared with the hash codes in the database so as to realize a search function, so that the search speed and accuracy are improved.

Correspondingly, the invention also provides a multi-mode content retrieval system for text content combined with image analysis, which comprises:

the data preprocessing unit is used for labeling the image set in the data set Imagenet to obtain a text image information pair;

the feature extraction unit is used for inputting the text image information pair, constructing a feature extraction network to extract images and text features, and outputting the image features and the text features;

the multi-modal attention unit is used for inputting the image features and the text features, carrying out multi-modal attention calculation and obtaining weighted multi-modal features;

The hash code generation unit is used for respectively carrying out hash generation on the image features, the text features and the multi-mode features and outputting an image hash code, a text hash code and the multi-mode hash code;

the model training unit is used for inputting the image hash code, the text hash code and the multi-modal hash code, constructing a target loss function, training a model by using the loss function, and finally obtaining a multi-modal hash code generation model;

the database construction unit is used for constructing a multi-mode hash code database of the database from the database to be retrieved by utilizing the multi-mode hash code generation model obtained through training;

and the matching unit is used for generating a multi-mode hash code by utilizing the multi-mode hash code generation model according to text information input by a user, and then matching the multi-mode hash code with the constructed multi-mode hash code database to obtain a search result.

The implementation of the invention has the following beneficial effects:

according to the invention, the convolutional neural network is utilized to extract image features, the word bag model is utilized to extract text features, the existing advanced multi-modal attention method is adopted to fuse two modalities, the complex network is not adopted to fuse multi-modal features, and the computational complexity of the network is reduced. Meanwhile, the multi-mode hash codes are added, due to the correlation among different modes, the purpose of multi-mode attention is to capture the commonality among the modes, the commonality characteristics are distributed in different single modes, a bridge for similarity calculation among the modes is constructed, the problem of heterogeneous gaps among the modes is more easily compensated compared with the traditional method of directly adopting the hash codes of the different modes for similarity calculation, and the commonality among the modes is basically mined. The scheme creatively builds the multi-mode hash code, thereby improving the feature expression capability of the current hash code, and remarkably improving the extraction efficiency of the effective features through the extraction process of the multi-mode features of the reverse propagation link of the hash code.

Drawings

FIG. 1 is a general flow diagram of a multi-modal content retrieval method for text content in combination with image analysis in accordance with an embodiment of the present invention;

FIG. 2 is a training flow diagram of a multi-modal hash generation model in accordance with an embodiment of the present invention;

FIG. 3 is a flow chart of a multi-modal attention computation in accordance with an embodiment of the present invention;

FIG. 4 is a flow chart of a multi-modal hash code database generation in accordance with an embodiment of the present invention;

FIG. 5 is a search matching flow chart of an embodiment of the present invention;

fig. 6 is a block diagram of a multi-modal content retrieval system for text content in combination with image analysis in accordance with an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

FIG. 1 is a general flow chart of a multi-modal content retrieval method of text content in combination with image analysis according to an embodiment of the invention, as shown in FIG. 1, the method comprising:

S1, labeling an image set in a dataset Imagenet to obtain a text image information pair;

s2, inputting the text image information pair, constructing a feature extraction network to extract images and text features, and outputting the image features and the text features;

s3, inputting the image features and the text features, and performing multi-modal attention calculation to obtain weighted attention multi-modal features;

s4, respectively carrying out hash generation on the image features, the text features and the multi-mode features, and outputting an image hash code, a text hash code and the multi-mode hash code;

s5, inputting the image hash code, the text hash code and the multi-modal hash code, constructing a target loss function, training a model by using the loss function, and finally obtaining a multi-modal hash code generation model;

s6, as shown in FIG. 4, constructing a multi-mode hash code database of the database from the database to be retrieved by utilizing the multi-mode hash code generation model obtained through training;

and S7, as shown in FIG. 5, generating a multi-mode hash code by using the multi-mode hash code generation model according to text information input by a user, and then matching with the constructed multi-mode hash code database to obtain a retrieval result.

Step S1, specifically, the following steps are performed:

s1-1, collecting the images and the information of the corresponding text descriptions thereof, forming a data set image, and labeling the image set in the data set image to obtain a text image information pair.

Step S2, as shown in fig. 2, is specifically as follows:

s2-1, for image features, a convolutional neural network is adopted to combine a full-connection layer of 512 nodes, K nodes are connected at the same time, an activation function is softmax to construct an image feature extraction network, the image feature is output as learned image features, a full-connection layer of an activation function sigmoid with a label category number is combined, and a predicted label is output at the last layer of the image network to be used for keeping the label features of an example; for text features, each text is represented by a word bag vector, in order to solve the problem that the word bag vector easily causes feature sparseness, a multi-scale fusion model MS is adopted to extract text data features, the MS comprises five-level pooling layers (1 x1,2x2,3x3,5x5,10x 10), a 4096-node full-connection layer is connected after the MS, a 512-node full-connection layer is connected, a K-node full-connection layer with an activation function of softmax is connected, a full-connection layer with a label category number and a node activation function of sigmoid is finally adopted, the output of the last second layer of the text network is used as the learned text features, and the last layer outputs predicted labels.

Step S3, as shown in fig. 3, is specifically as follows:

s3-1, capturing the correlation between the text and the image by adopting a multi-mode cross attention mechanism based on a self-attention mechanism, taking the text as a query sub Q, taking the image as key values K and V, then carrying out multi-mode attention calculation, merging the calculation of different heads, and carrying out normalization processing to obtain the final multi-mode feature.

Step S4, specifically, the following steps are performed:

s4-1, respectively acquiring the text, the image and the hash codes of the multi-modal features by adopting a sign function, wherein the formula is as follows:

s4-2, the multi-mode features comprise text features and image features, the multi-mode features are used as an intermediate bridge to link hash code learning processes among different modes, the image hash code learning processes adopt sign functions to encode the image features first, and image hash codes are formed; encoding the multi-modal features by using a sign function to form a multi-modal hash code; in the text hash code learning process, text features are encoded by adopting sign functions to form text hash codes, and after modeling the hash codes of different modes, a target loss function is constructed.

Step S5, specifically, the following steps are performed:

The following is shown:

wherein S is a matrix of n x n,

Representing similarity, -1 representing dissimilarity, +.>

The following is shown:

wherein S is a matrix of n x n,

Representing similarity, -1 representing dissimilarity, +.>

Representing the similarity of two different modal samples in real value space, l=f (x _i ；θ _x )∈R ^C And g=g (y _i ；θ _y )∈R ^C Respectively representing text features and multi-modal features, wherein the text features and the multi-modal features are real-valued feature representations, c represents the length of a feature vector, and also represents the length of a subsequent binary hash representation, which is also called bit, and the negative log likelihood loss above optimization can enable the similarity relationship to be kept in a feature space;

finally, the total target loss is:

L _all ＝L1+L2+L3+L4+L5

Correspondingly, the invention also provides a multi-mode content retrieval system for text content combined with image analysis, as shown in fig. 6, comprising:

The data preprocessing unit 1 is used for labeling the image set in the dataset Imagenet to obtain a text image information pair.

Specifically, collecting images and information corresponding to text descriptions thereof, forming a data set image, and labeling the image set in the data set image to obtain a text image information pair;

and the feature extraction unit 2 is used for inputting the text image information pair, constructing a feature extraction network to extract images and text features, and outputting the image features and the text features.

Specifically, for image features, a convolutional neural network is combined with a full-connection layer of 512 nodes, a K nodes are connected at the same time, an activation function is softmax to construct an image feature extraction network, the image feature is output as a learned image feature, a full-connection layer of an activation function sigmoid with a label category number is combined, and a predicted label is output by the last layer of the image network to be used for keeping the label feature of an example; for text features, each text is represented by a word bag vector, in order to solve the problem that the word bag vector easily causes feature sparseness, a multi-scale fusion model MS is adopted to extract text data features, the MS comprises five-level pooling layers (1 x1,2x2,3x3,5x5,10x 10), a 4096-node full-connection layer is connected after the MS, a 512-node full-connection layer is connected, a K-node full-connection layer with an activation function of softmax is connected, a full-connection layer with a label category number and a node activation function of sigmoid is finally adopted, the output of the last second layer of the text network is used as the learned text features, and the last layer outputs predicted labels.

And the multi-modal attention unit 3 is used for inputting the image features and the text features, performing multi-modal attention calculation and obtaining weighted multi-modal features.

Specifically, a multi-mode cross attention mechanism based on a self-attention mechanism is adopted to capture the correlation between a text and an image, the text is used as a query sub Q, the image is used as key values K and V, then multi-mode attention calculation is carried out, calculation of different heads is fused, and normalization processing is carried out, so that the final multi-mode feature is obtained.

And the hash code generation unit 4 is used for respectively carrying out hash generation on the image feature, the text feature and the multi-mode feature and outputting an image hash code, a text hash code and the multi-mode hash code.

Specifically, a sign function is adopted to respectively obtain the text, the image and the hash codes of the multi-modal characteristics, and the formula is as follows:

/>

The model training unit 5 is configured to input the image hash code, the text hash code, and the multi-modal hash code, construct a target loss function, train a model using the loss function, and finally obtain a multi-modal hash code generation model.

Specifically, a negative log-likelihood loss is used to model the similarity relationship between image features and multi-modal features, and a cosine similarity function is used to model the similarity between image and multi-modal features

The following is shown:

wherein S is a matrix of n x n,

Representing similarity, -1 representing dissimilarity, +.>

Representing two modalitiesSimilarity of different samples in real value space, f=f (x _i ；θ _x )∈R ^C And l=g (y _i ；θ _y )∈R ^C Respectively representing image features and multi-modal features, wherein the image features and the multi-modal features are real-valued feature representations, c represents the length of a feature vector, and is the length of a subsequent binary hash representation, which is also called bit, and the negative log likelihood loss above optimization can enable the similarity relationship to be kept in a feature space;

The following is shown:

wherein S is a matrix of n x n,

Representing similarity, -1 representing dissimilarity, +.>

Representing the similarity of two different modal samples in real value space, l=f (x _i ；θ _x )∈R ^C And g=g (y _i ；θ _y )∈R ^C Respectively representing text features and multi-modal features, wherein the text features and the multi-modal features are real-valued feature representations, c represents the length of a feature vector, and the length of a subsequent binary hash representation, also called bit, and optimizes the negative pair aboveThe number likelihood loss may be such that the similarity relationship is maintained in the feature space;

finally, the total target loss is:

L _all ＝L1+L2+L3+L4+L5

And mapping similar data of different modes into a similar Hamming space through learning the loss function, so that Hamming codes of dissimilar data among different modes have a further Hamming distance in the Hamming space, and training to obtain a multi-mode Hamming code generation model.

And the database construction unit 6 is used for constructing a multi-mode hash code database of the database from the database to be retrieved by utilizing the multi-mode hash code generation model obtained through training.

And the matching unit 7 is used for generating a multi-mode hash code by utilizing the multi-mode hash code generation model according to the text information input by the user, and then matching the multi-mode hash code with the constructed multi-mode hash code database to obtain a retrieval result.

Therefore, the invention utilizes the convolutional neural network to extract image features, utilizes the word bag model to extract text features, adopts the existing advanced multi-modal attention method to merge two modalities, does not adopt a complex network to merge multi-modal features, and reduces the computational complexity of the network. Meanwhile, the multi-mode hash codes are added, due to the correlation among different modes, the purpose of multi-mode attention is to capture the commonality among the modes, the commonality characteristics are distributed in different single modes, a bridge for similarity calculation among the modes is constructed, the problem of heterogeneous gaps among the modes is more easily compensated compared with the traditional method of directly adopting the hash codes of the different modes for similarity calculation, and the commonality among the modes is basically mined. The scheme creatively builds the multi-mode hash code, thereby improving the feature expression capability of the current hash code, and remarkably improving the extraction efficiency of the effective features through the extraction process of the multi-mode features of the reverse propagation link of the hash code.

The above description is provided in detail of a method and a system for retrieving multi-modal content by combining text content with image analysis, and specific examples are applied to illustrate the principles and embodiments of the present invention, and the above description of the embodiments is only used to help understand the method and core idea of the present invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims

1. A multi-modal content retrieval method of text content in combination with image analysis, the method comprising:

2. The method for multi-modal content retrieval by combining text content with image analysis according to claim 1, wherein the labeling of the image collection in the dataset Imagenet to obtain the text image information pair is specifically as follows:

3. The multi-modal content retrieval method of claim 1, wherein the inputting the text image information pair, constructing a feature extraction network to extract image and text features, and outputting image features and text features, specifically:

4. The method for retrieving multi-modal content by combining text content with image analysis according to claim 1, wherein the inputting the image feature and the text feature performs multi-modal attention calculation to obtain weighted attention multi-modal features, specifically comprises:

5. The method for retrieving multi-modal content by combining text content with image analysis according to claim 1, wherein the hash generation is performed on the image feature, the text feature and the multi-modal feature, respectively, and an image hash code, a text hash code and a multi-modal hash code are output, specifically:

6. The method for multi-modal content retrieval by combining text content with image analysis according to claim 1, wherein the inputting the image hash code, the text hash code and the multi-modal hash code constructs a target loss function, and a loss function training model is utilized to finally obtain a multi-modal hash code generation model, which is specifically as follows:

The following is shown:

wherein S is a matrix of n x n,

Representing similarity, -1 representing dissimilarity, +.>

The following is shown:

wherein S is a matrix of n x n,

Representing similarity, -1 representing dissimilarity, +.>

Representing two modes of failureSimilarity to the sample in real space, l=f (x _i ；θ _x )∈R ^C And g=g (y _i ；θ _y )∈R ^C Respectively representing text features and multi-modal features, wherein the text features and the multi-modal features are real-valued feature representations, c represents the length of a feature vector, and also represents the length of a subsequent binary hash representation, which is also called bit, and the negative log likelihood loss above optimization can enable the similarity relationship to be kept in a feature space;

finally, the total target loss is:

L _all ＝L1+L2+L3+L4+L5

when a user searches, the multi-mode hash code generation model only needs to upload data, the data is subjected to the trained model operation to obtain a hash code, and the hash code is compared with the hash code in the database to realize a search function, so that the search speed and accuracy are improved.

7. A multi-modal content retrieval system for text content in combination with image analysis, the system comprising:

8. The multi-modal content retrieval system with text content combined with image analysis as claimed in claim 7, wherein the data preprocessing unit is required to collect information of images and corresponding text descriptions thereof, form a dataset Imagenet, and label image sets in the dataset Imagenet to obtain text image information pairs.

9. The multi-modal content retrieval system of claim 7, wherein the feature extraction unit is configured to construct an image feature extraction network by combining a 512 node fully connected layer with a convolutional neural network for image features, and simultaneously connecting K nodes with an activation function being softmax, output the image feature as a learned image feature, and output a predicted tag for maintaining the tag feature of the instance in combination with a fully connected layer with an activation function sigmoid of tag class number; for text features, each text is represented by a word bag vector, in order to solve the problem that the word bag vector easily causes feature sparseness, a multi-scale fusion model MS is adopted to extract text data features, the MS comprises five-level pooling layers (1 x1,2x2,3x3,5x5,10x 10), a 4096-node full-connection layer is connected after the MS, a 512-node full-connection layer is connected, a K-node full-connection layer with an activation function of softmax is connected, a full-connection layer with a label category number and a node activation function of sigmoid is finally adopted, the output of the last second layer of the text network is used as the learned text features, and the last layer outputs predicted labels.

10. The system of claim 7, wherein the multi-modal attention module captures the correlation between text and image using a multi-modal cross attention mechanism based on a self-attention mechanism, uses the text as a query Q, uses the image as key values K and V, performs multi-modal attention calculations, fuses the calculations of different heads, and performs normalization processing to obtain final multi-modal features.

11. The multi-modal content retrieval system of claim 7, wherein the hash code generation unit is configured to obtain hash codes of text, image and multi-modal features using sign functions, respectively, according to the following formula:

12. The system for multi-modal content retrieval combined with image analysis as claimed in claim 7, wherein said model training unit uses a negative log likelihood loss to model similarity between image features and multi-modal features, and uses a cosine similarity function to model similarity between image features and multi-modal features